Skip to content

Fix IndexError in replace_blank on boundary whitespace#320

Open
KennyUMN wants to merge 1 commit into
OpenBMB:mainfrom
KennyUMN:fix/replace-blank-boundary-index
Open

Fix IndexError in replace_blank on boundary whitespace#320
KennyUMN wants to merge 1 commit into
OpenBMB:mainfrom
KennyUMN:fix/replace-blank-boundary-index

Conversation

@KennyUMN
Copy link
Copy Markdown

What

replace_blank() in src/voxcpm/utils/text_normalize.py raised IndexError on text ending in a space, and silently mishandled text starting with a space.

Why

When the loop hits a space it inspects both neighbours:

if (text[i + 1].isascii() and text[i + 1] != " ") and (text[i - 1].isascii() and text[i - 1] != " "):
  • A trailing space (i == len(text) - 1) makes text[i + 1] index out of range -> IndexError: string index out of range.
  • A leading space (i == 0) makes text[i - 1] evaluate to text[-1] (the last character) via Python's negative indexing, so the blank can be spuriously kept instead of dropped.

The sibling function split_paragraph() in the same file already guards this exact access with if i + 1 < len(text), so the missing bound here is an oversight.

Change

prev_ok = i > 0 and text[i - 1].isascii() and text[i - 1] != " "
next_ok = i + 1 < len(text) and text[i + 1].isascii() and text[i + 1] != " "
if prev_ok and next_ok:
    out_str.append(c)

Boundary spaces are now dropped (they have no neighbour to sit between), matching the function's documented intent ("remove blank between chinese character"). Behaviour on all interior inputs is unchanged.

Testing

Adds tests/test_text_normalize.py (mirrors the existing importlib + stubbed-deps style in tests/test_model_utils.py) covering trailing space, leading space, ASCII-interior spaces, CJK boundaries, and empty string. All pass; verified the pre-fix code raised IndexError on the trailing-space cases.

Note

replace_blank is currently called right after wetext's normalizer, which strips boundary whitespace, so this is a latent defect rather than a user-visible crash today. The fix makes the helper correct and crash-safe in isolation (and robust to any future caller / normalizer change).

replace_blank() indexed text[i + 1] and text[i - 1] unconditionally when
it hit a space. A trailing space (i == len(text) - 1) therefore raised
IndexError, and a leading space (i == 0) let text[i - 1] wrap around to
text[-1] (the last character), which could spuriously preserve the blank.

Guard both neighbour lookups so boundary spaces are dropped instead of
crashing, mirroring the bounds check already present in split_paragraph().
Adds tests/test_text_normalize.py covering leading/trailing/interior cases.
Copilot AI review requested due to automatic review settings May 31, 2026 04:26
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes replace_blank() edge cases for leading/trailing spaces and adds regression tests to prevent index/wraparound issues.

Changes:

  • Add bounds-checked neighbor logic in replace_blank() to avoid IndexError and text[-1] wraparound.
  • Introduce a focused test suite covering leading/trailing spaces, ASCII spacing rules, CJK adjacency, and empty input.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
tests/test_text_normalize.py Adds regression tests for replace_blank() and loads the module with stubbed third‑party deps.
src/voxcpm/utils/text_normalize.py Updates replace_blank() to safely check neighbors at string boundaries.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +11 to +18
# Stub heavy/third-party imports so the module loads without them. We only
# exercise ``replace_blank``, which depends on nothing beyond the stdlib.
for _name in ("regex", "inflect"):
sys.modules.setdefault(_name, types.ModuleType(_name))

_wetext_stub = types.ModuleType("wetext")
_wetext_stub.Normalizer = object
sys.modules.setdefault("wetext", _wetext_stub)
Comment on lines +20 to +23
spec = importlib.util.spec_from_file_location("voxcpm.utils.text_normalize", TEXT_NORMALIZE_PATH)
text_normalize = importlib.util.module_from_spec(spec)
assert spec.loader is not None
spec.loader.exec_module(text_normalize)
Comment on lines +120 to +122
prev_ok = i > 0 and text[i - 1].isascii() and text[i - 1] != " "
next_ok = i + 1 < len(text) and text[i + 1].isascii() and text[i + 1] != " "
if prev_ok and next_ok:
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants