fix(#224): ingest CLI infers content_type so code is code-chunked, not prose #225
Open
mbachaud wants to merge 1 commit into
…ode-chunked, not prose

`helix ingest` called `sess.ingest(content)` with no content_type, so it defaulted to "text" and code files were chunked as prose paragraphs -- the AST/regex code chunker (CodonChunker._chunk_code, gated on content_type=="code") was never reached.

This infers content_type from the file extension (.py/.ts/.js/.rs/... -> "code") and records the source path so retrieval can attribute + match by file.

Verified: ingesting helix_context/*.py goes from 552 chunks (prose/text) to 574 (code chunks). Adds tests/test_ingest_content_type.py (5 tests).

Addresses #224 (the CLI path). tree-sitter-as-core-dep and the cAST recursive split-then-merge are tracked separately (code-structure PRD).
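
To make the failure mode concrete, here is a minimal sketch of a chunker dispatch gated on `content_type`. It is an illustration only, not helix's actual code: `chunk`, `SAMPLE`, and both helper bodies are hypothetical stand-ins (the real `_chunk_code` is described as AST/regex based, and its implementation is not shown in this PR).

```python
import re


def _chunk_text(content: str) -> list[str]:
    # Hypothetical prose chunker: split on blank lines (paragraphs).
    return [p.strip() for p in re.split(r"\n\s*\n", content) if p.strip()]


def _chunk_code(content: str) -> list[str]:
    # Hypothetical code chunker: start a new chunk at each top-level def/class.
    parts = re.split(r"(?m)^(?=(?:def|class)\s)", content)
    return [p.strip() for p in parts if p.strip()]


def chunk(content: str, content_type: str = "text") -> list[str]:
    # The code path is reachable only when the caller passes content_type="code".
    if content_type == "code":
        return _chunk_code(content)
    return _chunk_text(content)


SAMPLE = "def a():\n    x = 1\n\n    return x\n\ndef b():\n    return 2\n"

# Default call: the blank line inside a() splits the function in two.
prose_chunks = chunk(SAMPLE)
# Explicit content_type: one chunk per function.
code_chunks = chunk(SAMPLE, content_type="code")
```

With the default argument, the body of `a()` is cut at its internal blank line, which is the kind of prose-style split the fix is meant to avoid for code files.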
What
The ingest CLI never set `content_type`, so `helix ingest` chunked code as prose paragraphs -- the AST/regex code chunker (`CodonChunker._chunk_code`, gated on `content_type=="code"`) was unreachable. This infers `content_type` from the file extension and records the source path.
Why (#224)
`cmd_ingest.py` called `sess.ingest(content)` (no content_type -> default `"text"`). `encoding/fragments.py` routes to `_chunk_code` only when `content_type=="code"`; otherwise prose `_chunk_text`.
Change
- `_content_type_for(path)` maps code extensions (.py/.ts/.js/.rs/.go/.java/.c/.cpp/.rb/.lua/.sh/.sql/...) -> `"code"`, else `"text"`.
- `sess.ingest(..., content_type=_content_type_for(f), metadata={"path", "source_id"})` -- also fixes missing source attribution for CLI-ingested docs.
- `tests/test_ingest_content_type.py` (5 tests, passing).
Verified
Ingesting `helix_context/*.py`: 552 chunks (text) -> 574 (code); `source_id` 100% populated.
Out of scope (tracked separately)
`ast`/`all` extras -> silent regex fallback on default installs).