feat: v0.4a deterministic retrieval core #91

Merged: chocks merged 1 commit into main from feat/v04a-deterministic-retrieval, Jun 20, 2026

@chocks chocks commented Jun 20, 2026

Summary

Implements v0.4 phase A — deterministic retrieval core from the v0.4 codebase intelligence design.

Adds a persistent codebase index (file tree + symbols) and a deterministic context retriever that runs before the LLM-driven ANALYZE phase, reducing token usage and latency for file/symbol lookups.

What's new

src/index/ — codebase indexing

  • FileIndex — scans the repo respecting .gitignore + config ignore patterns; records path, language, size, SHA-256 hash. Supports incremental updates (add/remove/change via hash comparison) and save/load to .locode/index/.
  • SymbolIndex + RegexSymbolExtractor — extracts functions, classes, methods, interfaces, types, and enums from TypeScript, JavaScript, and Python. Uses a swappable SymbolExtractor interface so web-tree-sitter can replace the regex extractor later without API changes.
  • CodebaseIndexer — orchestrates file + symbol indexes, handles full builds and incremental updates, persists/loads from disk.
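
The incremental update described for FileIndex (add/remove/change detected via hash comparison) can be sketched roughly as below. This is an illustration only, not the code in src/index/: the `FileEntry`, `IndexDiff`, and `diffIndex` names are hypothetical, and the hashes are assumed to be precomputed content digests such as SHA-256 hex strings.

```typescript
// Hypothetical entry shape; the real FileIndex also records language and size.
interface FileEntry {
  path: string;
  hash: string; // assumed to be a content digest, e.g. SHA-256 hex
}

interface IndexDiff {
  added: string[];
  removed: string[];
  changed: string[];
}

// Compare a previously saved index with a fresh scan, keyed by path.
function diffIndex(previous: FileEntry[], current: FileEntry[]): IndexDiff {
  const previousHashes = new Map<string, string>();
  for (const entry of previous) previousHashes.set(entry.path, entry.hash);

  const currentPaths = new Set<string>();
  const diff: IndexDiff = { added: [], removed: [], changed: [] };

  for (const entry of current) {
    currentPaths.add(entry.path);
    const oldHash = previousHashes.get(entry.path);
    if (oldHash === undefined) {
      diff.added.push(entry.path); // path not seen before
    } else if (oldHash !== entry.hash) {
      diff.changed.push(entry.path); // same path, different content
    }
  }

  for (const entry of previous) {
    if (!currentPaths.has(entry.path)) diff.removed.push(entry.path);
  }
  return diff;
}
```

Under this sketch, only the paths listed in `added` and `changed` would need their symbols re-extracted, and entries for `removed` paths would be dropped.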

src/context/ — smart context retrieval

  • BudgetManager — priority-weighted token allocation across files (direct_match > symbol_match > semantic_match > dependency > git_context). Respects max_total_tokens, max_tokens_per_file, and max_files.
  • ContextRetriever — deterministic pipeline: mentioned-file resolution → symbol search → sibling-test discovery → recent-file injection → rank → truncate to budget. Returns RetrievedContext with confidence score and strategies used.
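
To make the budget limits concrete, here is a minimal sketch of one way to allocate tokens by priority. It assumes a simple greedy policy (highest-priority match first, each file capped, stop when the total or file count runs out); the actual BudgetManager weighting may differ. `Candidate`, `Allocation`, `allocate`, and the numeric ranks are hypothetical names chosen for illustration; only the match kinds and the three limit names come from this PR.

```typescript
type MatchKind =
  | "direct_match"
  | "symbol_match"
  | "semantic_match"
  | "dependency"
  | "git_context";

// Assumed numeric ranks; only the relative order is taken from the PR text.
const PRIORITY: Record<MatchKind, number> = {
  direct_match: 5,
  symbol_match: 4,
  semantic_match: 3,
  dependency: 2,
  git_context: 1,
};

interface Candidate { path: string; kind: MatchKind; tokens: number; }
interface Budget { max_files: number; max_tokens_per_file: number; max_total_tokens: number; }
interface Allocation { path: string; tokens: number; }

// Greedy allocation: rank by priority, cap each file, stop at the limits.
function allocate(candidates: Candidate[], budget: Budget): Allocation[] {
  const ranked = candidates.slice().sort((a, b) => PRIORITY[b.kind] - PRIORITY[a.kind]);
  const result: Allocation[] = [];
  let remaining = budget.max_total_tokens;

  for (const candidate of ranked) {
    if (result.length >= budget.max_files || remaining <= 0) break;
    const granted = Math.min(candidate.tokens, budget.max_tokens_per_file, remaining);
    if (granted <= 0) continue; // skip empty files rather than spend a slot
    result.push({ path: candidate.path, tokens: granted });
    remaining -= granted;
  }
  return result;
}
```

A file granted fewer tokens than it contains would then be truncated to its grant, which corresponds to the "truncate to budget" step of the pipeline.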

src/tools/definitions/symbol-lookup.ts

  • New symbol_lookup tool (factory function, registered when index is available). Searches the symbol index by name + optional type filter.
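
As a rough illustration of searching by name with an optional type filter, the sketch below filters an in-memory list of symbols. The `SymbolEntry` shape and the `lookupSymbols` function are hypothetical, and case-insensitive substring matching is an assumption; the tool in src/tools/definitions/symbol-lookup.ts may match differently.

```typescript
type SymbolKind = "function" | "class" | "method" | "interface" | "type" | "enum";

// Hypothetical record shape for an indexed symbol.
interface SymbolEntry {
  name: string;
  kind: SymbolKind;
  file: string;
  line: number;
}

// Return symbols whose name contains the query, optionally restricted to one kind.
function lookupSymbols(index: SymbolEntry[], name: string, kind?: SymbolKind): SymbolEntry[] {
  const needle = name.toLowerCase();
  return index.filter(
    (symbol) =>
      symbol.name.toLowerCase().includes(needle) &&
      (kind === undefined || symbol.kind === kind)
  );
}
```

Omitting `kind` returns every symbol whose name matches, while passing one narrows the result to that symbol type.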

Config additions (src/config/schema.ts + locode.yaml)

index:
  enabled: true
  ignore: [node_modules, dist, .git, coverage, "*.min.js", "*.lock"]
  languages: [typescript, javascript, python, go, rust]
  chunk_size: 50
  storage_dir: .locode/index
  auto_update: true

context_retrieval:
  max_files: 5
  max_tokens_per_file: 2000
  max_total_tokens: 8000
  strategy: deterministic-first
  confidence_threshold: 0.7

Orchestrator + CodingAgent integration

  • Orchestrator loads a saved index from .locode/index/ on startup (non-blocking, falls back gracefully if absent).
  • symbol_lookup tool registered when index is available.
  • ContextRetriever optionally injected into CodingAgent: when confidence ≥ 0.7 (the confidence_threshold default), ANALYZE skips the LLM entirely (zero tokens for context gathering). Low confidence falls through to the existing LLM-driven ANALYZE.
  • New orchestrator.buildCodebaseIndex() method for a future locode index CLI command.
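
The confidence gate in the ANALYZE phase can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions: the `analyze` function, the `AnalyzeResult` type, and the fields on `RetrievedContext` shown here are hypothetical, and the real CodingAgent presumably runs asynchronously.

```typescript
// Hypothetical subset of the RetrievedContext fields named in this PR.
interface RetrievedContext {
  files: string[];
  confidence: number;
  strategies: string[];
}

interface AnalyzeResult {
  source: "retriever" | "llm";
  files: string[];
}

// Use the deterministic result when it is confident enough;
// otherwise fall through to the LLM-driven path.
function analyze(
  retrieved: RetrievedContext | undefined, // undefined when no index/retriever is available
  threshold: number,
  llmAnalyze: () => string[]
): AnalyzeResult {
  if (retrieved !== undefined && retrieved.confidence >= threshold) {
    return { source: "retriever", files: retrieved.files }; // the LLM callback is never invoked
  }
  return { source: "llm", files: llmAnalyze() };
}
```

With no retriever available, `retrieved` is `undefined` and the sketch always takes the LLM path, mirroring the backwards-compatible fallback.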

Design decisions

  • Regex extraction instead of tree-sitter (for now): Keeps the bundle small (AGENTS.md). The SymbolExtractor interface means tree-sitter is a drop-in replacement later.
  • find_references deferred to v0.4b: Depends on the dependency graph (import tracking), which is v0.4b scope.
  • Index loading is lazy: No slow full-build on startup. A saved index loads instantly; without one, everything works as before.
  • Backwards-compatible: All new behavior is opt-in via config defaults. No index → no retriever → existing ANALYZE behavior unchanged.
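
The swappable-extractor decision can be illustrated with a small interface and one regex-based implementation. This is a sketch under assumptions, not the RegexSymbolExtractor from this PR: the `extract` signature, the `ExtractedSymbol` shape, the `SimpleRegexExtractor` class, and the patterns themselves are hypothetical, and only functions and classes are covered here.

```typescript
interface ExtractedSymbol {
  name: string;
  kind: "function" | "class";
  line: number; // 1-based line number
}

// Assumed interface shape; a tree-sitter implementation could satisfy it later.
interface SymbolExtractor {
  extract(source: string, language: string): ExtractedSymbol[];
}

type Pattern = [RegExp, "function" | "class"];

const PYTHON_PATTERNS: Pattern[] = [
  [/^\s*(?:async\s+)?def\s+([A-Za-z_]\w*)/, "function"],
  [/^\s*class\s+([A-Za-z_]\w*)/, "class"],
];

const JS_TS_PATTERNS: Pattern[] = [
  [/^\s*(?:export\s+)?(?:async\s+)?function\s+([A-Za-z_$][\w$]*)/, "function"],
  [/^\s*(?:export\s+)?class\s+([A-Za-z_$][\w$]*)/, "class"],
];

class SimpleRegexExtractor implements SymbolExtractor {
  extract(source: string, language: string): ExtractedSymbol[] {
    const patterns = language === "python" ? PYTHON_PATTERNS : JS_TS_PATTERNS;
    const symbols: ExtractedSymbol[] = [];
    source.split("\n").forEach((text, index) => {
      for (const [regex, kind] of patterns) {
        const match = regex.exec(text);
        if (match) {
          symbols.push({ name: match[1], kind, line: index + 1 });
          break; // at most one symbol per line in this sketch
        }
      }
    });
    return symbols;
  }
}
```

Because callers would depend only on `SymbolExtractor`, a parser-backed class could replace `SimpleRegexExtractor` without changing them. Line-based regexes have known limits: this sketch reports Python methods as functions and misses arrow functions and multi-line declarations.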

Findings documented

See docs/plans/2026-06-20-v04a-impl-progress.md for implementation decisions and findings:

  1. Zod .default({}) does not recursively apply inner defaults — must use pre-parsed constants
  2. CONFIG_TEMPLATE duplication risk (known, noted in misc-todos.md)
  3. CodingAgent.analyze fast-path is ripe for delegating to ContextRetriever (deferred to avoid changing working behavior)

Test plan

  • 60 new tests (per-suite counts: file-index 11, symbol-index 20, indexer 9, budget-manager 8, context-retriever 12, symbol-lookup 5; these sum to 65, and the net figure of 60 excludes overlap)
  • All 398 tests pass (npm test)
  • npm run build succeeds
  • npm run lint passes

Add file index, symbol index, context retriever, and budget manager
for repo-aware code intelligence. Includes symbol_lookup tool and
optional ContextRetriever integration in CodingAgent ANALYZE phase.

New modules:
- src/index/ (file-index, symbol-index, indexer)
- src/context/ (budget-manager, context-retriever)
- src/tools/definitions/symbol-lookup.ts

Config additions (schema.ts + locode.yaml):
- index: enabled, ignore, languages, chunk_size, storage_dir, auto_update
- context_retrieval: max_files, max_tokens_per_file, max_total_tokens,
  strategy, confidence_threshold

60 new tests, all 398 pass. Build and lint clean.
@chocks chocks merged commit aba25d7 into main Jun 20, 2026
4 checks passed
@chocks chocks mentioned this pull request Jun 20, 2026