fix(codex-executor): handle response.incomplete + raise output-token budget#38

Merged: sfreudenthaler merged 1 commit into main from fix/codex-incomplete-token-budget on Jun 13, 2026

@sfreudenthaler

Member

Problem

~43% of recent GPT-5.5 automatic reviews in dotCMS/core posted "❌ Codex Review failed — job failed before producing output" (e.g. core PRs #36150, #36149, #36144, #36130, #36124, #36112). Those PRs got no review, not a false "clean".

Root cause

On the OpenAI Responses API, max_output_tokens caps the combined reasoning + visible-answer tokens — not just the answer. At reasoning_effort: medium, GPT-5.5 sometimes spends the entire 2048 budget reasoning, and the stream ends with status=incomplete (incomplete_details.reason=max_output_tokens) and zero output_text.delta events.

mantle_review.py captured usage only on response.completed and only logged errors on response.failed/error, so a response.incomplete terminal event matched no handler → empty review → sys.exit(1) → the generic failure sticky. The failure signature in the logs is Tokens: in: ? · out: ? (usage None) with no ::error:: line. The failure is non-deterministic: it tracks reasoning load, not diff size (a 137-line diff failed while a 208-line diff passed).
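
To make the fall-through concrete, here is a minimal sketch of a stream consumer that records status and usage from every terminal event rather than only response.completed. It is not the code in mantle_review.py: `consume_stream`, `StreamResult`, and the plain-dict event shape are hypothetical stand-ins for whatever the executor actually iterates over, and only the event type names and the status / usage / incomplete_details.reason fields are taken from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Terminal event types named in the root-cause analysis above.
TERMINAL_EVENTS = {"response.completed", "response.incomplete", "response.failed"}


@dataclass
class StreamResult:
    """Hypothetical container for what the executor needs after the stream ends."""
    text: str = ""
    status: Optional[str] = None
    incomplete_reason: Optional[str] = None
    usage: Optional[dict] = None


def consume_stream(events: Iterable[dict]) -> StreamResult:
    """Collect answer text plus status/usage from ANY terminal event.

    Assumption: events are plain dicts with a "type" key; a real SDK
    may yield typed objects instead.
    """
    result = StreamResult()
    chunks = []
    for event in events:
        etype = event.get("type")
        if etype == "response.output_text.delta":
            chunks.append(event.get("delta", ""))
        elif etype in TERMINAL_EVENTS:
            response = event.get("response") or {}
            result.status = response.get("status")
            result.usage = response.get("usage")
            details = response.get("incomplete_details") or {}
            result.incomplete_reason = details.get("reason")
    result.text = "".join(chunks)
    return result
```

With a handler like this, the failing case shows up as an empty `text` with `status == "incomplete"` and `incomplete_reason == "max_output_tokens"`, instead of usage None and no diagnostic.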

Fix

  • Raise max_output_tokens default 2048 → 8000 so medium-effort reasoning + the answer both fit.
  • Capture usage/status from response.incomplete (and response.failed), not just response.completed.
  • Retry once when the answer is empty because of max_output_tokens: bump the budget (≥16000) and drop reasoning_effort to low so the visible answer fits.
  • Graceful diagnostic: if still empty, post a clear truncation message and exit 0 (sticky shows the reason) instead of a generic failure. Partial answers are kept and flagged.
  • Correct the stale "max_output_tokens does NOT cap reasoning tokens" comments.
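
The retry and graceful-diagnostic bullets above can be sketched as follows. This is an illustration under stated assumptions, not the shipped implementation: `review_with_retry`, `hit_token_cap`, the `call_model` callable, the doubling rule, and the message wording are all hypothetical. Only the 8000 default, the 16000 floor, the drop to low effort, and the exit-0 truncation path come from this PR.

```python
from typing import Callable, Optional, Tuple

# (answer text, terminal status, incomplete reason) for one model call.
Attempt = Tuple[str, Optional[str], Optional[str]]

DEFAULT_MAX_OUTPUT_TOKENS = 8000   # new default from this PR
RETRY_MIN_OUTPUT_TOKENS = 16000    # retry floor from this PR


def hit_token_cap(status: Optional[str], reason: Optional[str]) -> bool:
    return status == "incomplete" and reason == "max_output_tokens"


def review_with_retry(
    call_model: Callable[[int, str], Attempt],
    budget: int = DEFAULT_MAX_OUTPUT_TOKENS,
    effort: str = "medium",
) -> Tuple[str, int]:
    """Return (message to post, process exit code).

    call_model(max_output_tokens, reasoning_effort) is a hypothetical
    wrapper around the actual API call.
    """
    text, status, reason = call_model(budget, effort)

    # Retry once, only when the answer is empty because of the token cap.
    if not text.strip() and hit_token_cap(status, reason):
        # Doubling is an assumption; the PR only says the budget is >= 16000.
        retry_budget = max(budget * 2, RETRY_MIN_OUTPUT_TOKENS)
        text, status, reason = call_model(retry_budget, "low")

    if text.strip():
        if hit_token_cap(status, reason):
            # Partial answers are kept and flagged.
            text += "\n\nNote: this review was truncated by max_output_tokens."
        return text, 0

    if hit_token_cap(status, reason):
        # Still empty: explain the truncation and exit 0 so the sticky shows why.
        return "Review truncated: the model used its whole output budget reasoning.", 0

    # Any other empty result keeps the failing path (assumption).
    return f"Review failed: no output (status={status})", 1
```

Passing the model call in as a callable keeps the policy testable without network access, for example with a fake that returns an empty incomplete attempt first and a real answer second.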

No consumer interface change. → release as v3.1.3.

Validation

  • YAML parses; embedded mantle_review.py compiles.
  • E2E on dotCMS/steve-quarterly-planning (linked after the tag is cut): reproduce an incomplete/empty review, confirm v3.1.3 produces a real review.

…budget

~43% of recent GPT-5.5 reviews in dotCMS/core were posting "❌ Codex Review
failed — job failed before producing output." Root cause: on the Responses
API, max_output_tokens caps the COMBINED reasoning + visible-answer tokens, not
just the answer. At reasoning_effort=medium, GPT-5.5 sometimes spends the whole
2048-token budget thinking and the stream ends status=incomplete
(incomplete_details.reason=max_output_tokens) with ZERO output_text.delta. The
executor captured usage only on response.completed and only logged errors on
response.failed/error, so an incomplete response fell through to an empty review
-> sys.exit(1) -> generic failure sticky. Non-deterministic by reasoning load,
not diff size (a 137-line diff failed while a 208-line diff passed).

Fix:
- Raise max_output_tokens default 2048 -> 8000 so medium-effort reasoning plus
  the answer both fit.
- Capture usage/status from the response.incomplete (and response.failed)
  terminal events, not just response.completed.
- Retry once when the answer is empty *because* of max_output_tokens: bump the
  budget (>=16000) and drop reasoning_effort to low so the visible answer fits.
- If still empty, post a clear truncation diagnostic and exit 0 (sticky shows the
  reason) instead of a generic "review failed". Keep + flag partial answers.
- Correct the stale "max_output_tokens does NOT cap reasoning tokens" comments.

No consumer interface change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@sfreudenthaler requested review from a team as code owners on June 13, 2026 14:29
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@sfreudenthaler merged commit 8f58de2 into main on Jun 13, 2026 (3 checks passed)

@sfreudenthaler deleted the fix/codex-incomplete-token-budget branch on June 13, 2026 14:32

riccardoruocco pushed a commit to riccardoruocco/core that referenced this pull request on Jun 16, 2026:

## Summary

Bumps the `dotCMS/ai-workflows` pin in the AI review workflows from
`@v3.1.2` to **`@v3.1.4`** (orchestrator + backend reviewer). Supersedes
the intermediate v3.1.3 bump — v3.1.4 includes everything in v3.1.3 plus
the outcome-signaling improvements, so we go straight to it.

## What's in v3.1.3 + v3.1.4

**v3.1.3: silent-failure fix** (dotCMS/ai-workflows#38)
~43% of GPT-5.5 reviews were posting "❌ Codex Review failed — job failed
before producing output." Root cause: `max_output_tokens` caps
reasoning+answer combined, so medium-effort GPT-5.5 sometimes spent the
whole budget reasoning and returned `incomplete` with no text. Fix:
budget 2048→8000, handle `response.incomplete`, retry once with a bigger
budget + lighter reasoning.

**v3.1.4: clear outcome signaling** (dotCMS/ai-workflows#39)
- Sticky header reflects the outcome: `🤖 Codex Review` / `⚠️ truncated`
/ `❌ no output` / `⏱️ canceled`
- The job **fails (red ✗ in checks)** when no review is produced —
surfaces the outcome without gating merges (advisory review)
- Canceled / timed-out runs rewrite the sticky to `⏱️ Codex Review
canceled` instead of leaving it stuck on `🔄 in progress`
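
A rough sketch of how an outcome could map to a sticky header and job result, as described in the list above. The function names, the outcome keys, and the exact header strings are assumptions (the list only gives the short labels), and the real v3.1.4 logic may differ.

```python
from typing import Optional

# Header labels taken from the list above; full wording is assumed.
HEADERS = {
    "ok": "🤖 Codex Review",
    "truncated": "⚠️ truncated",
    "no_output": "❌ no output",
    "canceled": "⏱️ Codex Review canceled",
}


def classify(text: str, status: Optional[str], canceled: bool = False) -> str:
    """Map a finished (or canceled) run to one of the four outcomes."""
    if canceled:
        return "canceled"
    if not text.strip():
        return "no_output"
    if status == "incomplete":
        return "truncated"
    return "ok"


def job_exit_code(outcome: str) -> int:
    # The job fails only when no review was produced. Canceled runs are
    # ended by the runner itself, so their code here is a placeholder.
    return 1 if outcome == "no_output" else 0
```

Keeping classification separate from the exit code lets the sticky always state the outcome while only the no-output case turns the check red.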

## Validation

- v3.1.3 before/after e2e: steve-quarterly-planning dotCMS#105 (@v3.1.2
failed, @v3.1.3 recovered)
- v3.1.4 signaling e2e: steve-quarterly-planning dotCMS#106 (🤖+green
confirmed; ⏱️ cancellation confirmed; retry makes the ❌ path a hardened
safety net)

Closes: dotCMS#36158

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>