Skip to content

feat(codex-executor): surface response.failed error reason + retry once#40

Merged
sfreudenthaler merged 1 commit into
mainfrom
fix/codex-surface-error-retry
Jun 16, 2026
Merged

feat(codex-executor): surface response.failed error reason + retry once#40
sfreudenthaler merged 1 commit into
mainfrom
fix/codex-surface-error-retry

Conversation

@sfreudenthaler

Copy link
Copy Markdown
Member

Context

While investigating a reproducible "❌ Codex Review failed — job failed before producing output" on dotCMS/core PR #35987, a direct probe showed the real cause: openai.gpt-5.5 on bedrock-mantle is currently 500-ing on every request (even a trivial "hello world"), in both us-east-1 and us-east-2, with response.error = server_error. gpt-5.4 and gpt-oss-120b are healthy in the same account/region/auth — so it's an AWS-side outage of the gpt-5.5 model, not the PR, the diff, or our executor.

The executor did surface it as a red ✗ + failure sticky (thanks to v3.1.4), but the sticky said nothing about why — it discarded the response.error. That's the gap this fixes.

Changes

  • Surface response.error (code + message) from the response.failed event — printed as a ::error job annotation and shown in the ❌ sticky, with a note that a server_error is typically an AWS/mantle-side issue, not the PR.
  • Retry once on a transient response.failed (same params). Won't rescue a full model outage (every call fails then), but recovers one-off 5xx blips.
  • New failed outcome → ## ❌ Codex Review — model service error header + reason; job fails (red ✗). The fail-job step now fires for any non-(ok|truncated) outcome.
  • Exit 0 whenever there's a terminal response to diagnose; exit 1 only when no terminal event arrived (resp is None).

No consumer interface change. → release as v3.1.5.

Out of scope (follow-up)

Auto-fallback to a secondary model when the primary is down — tracked separately.

Validation

  • YAML parses; mantle_review.py compiles.
  • E2E against the live gpt-5.5 outage (linked after tag): expect the ❌ sticky to now show server_error: ... and the job to go red.

When the mantle Responses API returns a response.failed terminal event, the
executor discarded response.error and posted a generic "failed before producing
output" sticky — giving no clue whether it was the PR, our code, or AWS. (Seen
live: gpt-5.5 on mantle 500ing on every request with server_error in both
regions, while gpt-5.4 / gpt-oss-120b were healthy — an AWS-side model outage.)

Changes:
- Capture response.error (code + message) from the failed event; print it as a
  ::error job annotation AND show it in the ❌ sticky, with a note that this is
  usually an AWS/mantle service-side issue, not the PR.
- Retry once on a transient response.failed (same params). Won't rescue a full
  model outage, but recovers one-off 5xx blips.
- New outcome "failed" → ❌ header + reason; job fails (red ✗). The fail-job step
  now fires for any non-(ok|truncated) outcome, not just incomplete-empty.
- Exit 0 whenever there's a terminal response to diagnose; exit 1 only when no
  terminal event arrived at all (resp is None → generic failure path).

No consumer interface change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sfreudenthaler sfreudenthaler requested review from a team as code owners June 16, 2026 00:47
@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@sfreudenthaler sfreudenthaler merged commit 294c25a into main Jun 16, 2026
3 checks passed
@sfreudenthaler sfreudenthaler deleted the fix/codex-surface-error-retry branch June 16, 2026 00:50
riccardoruocco pushed a commit to riccardoruocco/core that referenced this pull request Jun 16, 2026
… DeepSeek V3.2 (dotCMS#36182)

Two changes to get automatic PR reviews working again and keep them
legible:

## 1. Switch automatic reviewer GPT-5.5 → DeepSeek V3.2
`openai.gpt-5.5` is failing on Bedrock Mantle in **both** us-east-1 and
us-east-2 as of 2026-06-15 (`invalid_prompt: 404 Not Found: Engine not
found` — AWS-side; the alias *and* its `-2026-04-23` snapshot both 404
while still listed in `/v1/models`). Raised with our AWS TAM.

`deepseek.v3.2` is healthy on **bedrock-runtime** (Converse) and is
still a non-Claude reviewer, so it preserves the model-diversity
rationale (Claude writes the code, a different family reviews it).
`deepseek.*` routes to the existing `bedrock-generic` (Converse)
executor — **no IAM or ai-workflows change** needed
(`BedrockInvokeReviewModels` already allows `bedrock:Converse` on
`foundation-model/deepseek.*`).

- Renamed `gpt-automatic-review` → `ai-automatic-review`
(model-agnostic); dropped `reasoning_effort` (mantle-only).
- Interactive `@claude` and the backend reviewer stay on Anthropic
Claude.
- Validated e2e on steve-quarterly-planning dotCMS#108 (routed to
bedrock-generic, posted a real review catching all planted bugs).

## 2. Bump pin v3.1.4 → v3.1.5
[ai-workflows#40](dotCMS/ai-workflows#40):
surfaces the actual failure reason on `response.failed` (e.g.
`server_error` / `Engine not found`) in the sticky + job log instead of
a generic "failed before producing output", and retries once on
transient failures. This is exactly what made the gpt-5.5 outage
diagnosable.

Switch back to a mantle model by setting `model_id` to an `openai.*` id
once AWS restores gpt-5.5.

Closes: dotCMS#36181

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant