Fix IndexError in replace_blank on boundary whitespace#320
Open
KennyUMN wants to merge 1 commit into
Open
Conversation
replace_blank() indexed text[i + 1] and text[i - 1] unconditionally when it hit a space. A trailing space (i == len(text) - 1) therefore raised IndexError, and a leading space (i == 0) let text[i - 1] wrap around to text[-1] (the last character), which could spuriously preserve the blank. Guard both neighbour lookups so boundary spaces are dropped instead of crashing, mirroring the bounds check already present in split_paragraph(). Adds tests/test_text_normalize.py covering leading/trailing/interior cases.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes replace_blank() edge cases for leading/trailing spaces and adds regression tests to prevent index/wraparound issues.
Changes:
- Add bounds-checked neighbor logic in
replace_blank()to avoidIndexErrorandtext[-1]wraparound. - Introduce a focused test suite covering leading/trailing spaces, ASCII spacing rules, CJK adjacency, and empty input.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_text_normalize.py | Adds regression tests for replace_blank() and loads the module with stubbed third‑party deps. |
| src/voxcpm/utils/text_normalize.py | Updates replace_blank() to safely check neighbors at string boundaries. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+11
to
+18
| # Stub heavy/third-party imports so the module loads without them. We only | ||
| # exercise ``replace_blank``, which depends on nothing beyond the stdlib. | ||
| for _name in ("regex", "inflect"): | ||
| sys.modules.setdefault(_name, types.ModuleType(_name)) | ||
|
|
||
| _wetext_stub = types.ModuleType("wetext") | ||
| _wetext_stub.Normalizer = object | ||
| sys.modules.setdefault("wetext", _wetext_stub) |
Comment on lines
+20
to
+23
| spec = importlib.util.spec_from_file_location("voxcpm.utils.text_normalize", TEXT_NORMALIZE_PATH) | ||
| text_normalize = importlib.util.module_from_spec(spec) | ||
| assert spec.loader is not None | ||
| spec.loader.exec_module(text_normalize) |
Comment on lines
+120
to
+122
| prev_ok = i > 0 and text[i - 1].isascii() and text[i - 1] != " " | ||
| next_ok = i + 1 < len(text) and text[i + 1].isascii() and text[i + 1] != " " | ||
| if prev_ok and next_ok: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
replace_blank()insrc/voxcpm/utils/text_normalize.pyraisedIndexErroron text ending in a space, and silently mishandled text starting with a space.Why
When the loop hits a space it inspects both neighbours:
i == len(text) - 1) makestext[i + 1]index out of range ->IndexError: string index out of range.i == 0) makestext[i - 1]evaluate totext[-1](the last character) via Python's negative indexing, so the blank can be spuriously kept instead of dropped.The sibling function
split_paragraph()in the same file already guards this exact access withif i + 1 < len(text), so the missing bound here is an oversight.Change
Boundary spaces are now dropped (they have no neighbour to sit between), matching the function's documented intent ("remove blank between chinese character"). Behaviour on all interior inputs is unchanged.
Testing
Adds
tests/test_text_normalize.py(mirrors the existing importlib + stubbed-deps style intests/test_model_utils.py) covering trailing space, leading space, ASCII-interior spaces, CJK boundaries, and empty string. All pass; verified the pre-fix code raisedIndexErroron the trailing-space cases.Note
replace_blankis currently called right afterwetext's normalizer, which strips boundary whitespace, so this is a latent defect rather than a user-visible crash today. The fix makes the helper correct and crash-safe in isolation (and robust to any future caller / normalizer change).