fix(#224): ingest CLI infers content_type so code is code-chunked, not prose#225

Open
mbachaud wants to merge 1 commit into master from fix-224-content-type

What

The ingest CLI never set content_type, so helix ingest chunked code as prose paragraphs and the AST/regex code chunker (CodonChunker._chunk_code, gated on content_type=="code") was unreachable. This PR infers content_type from the file extension and records the source path.

Why (#224)

  • cmd_ingest.py called sess.ingest(content) with no content_type, so it defaulted to "text".
  • encoding/fragments.py routes to _chunk_code only when content_type=="code"; otherwise it uses the prose chunker, _chunk_text.
  • Net: every code corpus (incl. the RepoBench-R BM25-parity run) was chunked as prose.
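
To make the gate concrete, here is a minimal sketch of the dispatch described above. It is not the code in encoding/fragments.py: the function bodies are simplified, hypothetical stand-ins (a blank-line splitter for prose, a regex split at top-level def/class for code), and only the names _chunk_code, _chunk_text, and the content_type values come from this PR.

```python
import re


def _chunk_text(content: str) -> list[str]:
    # Prose stand-in: one chunk per blank-line-separated paragraph.
    return [p.strip() for p in re.split(r"\n\s*\n", content) if p.strip()]


def _chunk_code(content: str) -> list[str]:
    # Code stand-in (assumption): start a new chunk at each top-level
    # def/class. The real chunker is AST/regex based and more involved.
    parts = re.split(r"^(?=(?:def|class)\s)", content, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]


def chunk(content: str, content_type: str = "text") -> list[str]:
    # The code path is reachable only when the caller passes "code".
    # A caller that omits content_type always gets prose chunking.
    if content_type == "code":
        return _chunk_code(content)
    return _chunk_text(content)
```

With the default, a blank line inside a function body splits that function across two chunks; passing content_type="code" keeps each function together. That is the behavior difference the CLI was missing.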

Change

  • _content_type_for(path) maps code extensions (.py/.ts/.js/.rs/.go/.java/.c/.cpp/.rb/.lua/.sh/.sql/...) -> "code", else "text".
  • sess.ingest(..., content_type=_content_type_for(f), metadata={"path": ..., "source_id": ...}), which also fixes missing source attribution for CLI-ingested docs.
  • tests/test_ingest_content_type.py (5 tests, passing).

Verified

Ingesting helix_context/*.py: 552 chunks (text) -> 574 (code); source_id 100% populated.

Out of scope (tracked separately)

  • tree-sitter as a core dep (currently only in the ast/all extras, so default installs silently fall back to regex).
  • cAST recursive split-then-merge (code-structure PRD). Note: on this corpus AST ~= regex today; the cAST work is what unlocks AST's edge.

…ode-chunked, not prose

`helix ingest` called `sess.ingest(content)` with no content_type, so it
defaulted to "text" and code files were chunked as prose paragraphs -- the
AST/regex code chunker (CodonChunker._chunk_code, gated on content_type=="code")
was never reached. This infers content_type from the file extension
(.py/.ts/.js/.rs/... -> "code") and records the source path so retrieval can
attribute + match by file.

Verified: ingesting helix_context/*.py goes from 552 chunks (prose/text) to 574
(code chunks). Adds tests/test_ingest_content_type.py (5 tests).

Addresses #224 (the CLI path). tree-sitter-as-core-dep and the cAST recursive
split-then-merge are tracked separately (code-structure PRD).