FE-884: Recover from cook epic-verification failure instead of halting by kostandinang · Pull Request #232 · hashintel/brunch

kostandinang · 2026-06-18T13:19:00Z

Stack Context

Top of the cook/brunch-serve stack, stacked on the operational-hardening PR (#224). The pipeline below it can already detect, cook, verify, and promote a brownfield feature; #224 made that durable. This PR removes the last brittle stop in a cook run: a single epic's verification failure no longer halts the whole run.

What?

Slice A — recoverable epic verification. When an epic's verification fails, the run recovers and keeps going instead of halting the entire cook. The orchestration net gains a recovery route at the epic-verdict transition (net-compiler, net-blueprint, petri-net, topology), and the run artifact records the recovered/failed epic outcome rather than aborting (run-artifact). Covered end-to-end by a new epic-recovery.integration.test.ts.

Slice B — classify infra/timeout failures at the epic verdict. Lift the infra-vs-test failure split (established for slice runs in FE-872) up to the epic-verdict level: an epic that fails on infra or timeout is classified distinctly from a genuine test failure (test-runner, net-compiler), so recovery and reporting can treat "the harness fell over" differently from "the code is wrong."

Plus SPEC/PLAN/CARDS reconciliation for both slices.

Why?

A cook run composes many epics. Halting the entire run because one epic's verification failed — especially when the failure is infra or a timeout, not a real defect — throws away all the work the other epics completed and makes long brownfield runs fragile. Making epic verification recoverable, and classifying why an epic verdict failed, lets a run finish and report partial success honestly instead of collapsing on the first stumble.

Builds directly on the operational-hardening work in #224 (epic verify deps, idle deadlines, git-merge fold composition).

cursor · 2026-06-18T13:19:07Z

PR Summary

High Risk
Changes core cook Petri-net epic-verify topology and budget routing in net-compiler.ts; mistakes could mis-route halts, remediation, or promotion. Brownfield-only gates limit blast radius; extensive tests mitigate but real-agent fix quality remains un-oracled.

Overview
Epic verification is no longer a dead end in codebase mode. A failing verify-epic routes to a new epic-remediate chain (dispatch/complete + epic-retry-budget) instead of straight to halt. A code agent runs on the folded __epic__/<id>/ tree; detect-and-reject discards attempts that touch epic integration tests; dual re-verify requires both the epic integration test and the slice suites to pass on the folded tree. Accepted fixes are diff-transferred onto the representative slice branch (transferFoldedFixToSlice) so harvestCookRun can fold them. Greenfield still halts on first epic failure.
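The acceptance rule described above (detect-and-reject, then dual re-verify) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `judgeRemediation`, `touchesOracle`, `ReverifyResult`, and the idea of identifying epic integration tests by path prefix are all assumptions made for the example.

```typescript
// Hypothetical names throughout; the PR's real types and helpers may differ.
interface ReverifyResult {
  epicIntegrationPassed: boolean; // epic integration test on the folded tree
  sliceSuitesPassed: boolean; // slice suites on the folded tree
}

type AttemptVerdict =
  | { accepted: true }
  | { accepted: false; reason: "touched-oracle" | "reverify-failed" };

// Assumption: epic integration tests can be recognised by a path prefix.
function touchesOracle(changedFiles: string[], oraclePrefixes: string[]): boolean {
  return changedFiles.some((file) =>
    oraclePrefixes.some((prefix) => file.startsWith(prefix)),
  );
}

function judgeRemediation(
  changedFiles: string[],
  oraclePrefixes: string[],
  reverify: () => ReverifyResult,
): AttemptVerdict {
  // Detect-and-reject: an attempt that edits the oracle is discarded
  // without running re-verification at all.
  if (touchesOracle(changedFiles, oraclePrefixes)) {
    return { accepted: false, reason: "touched-oracle" };
  }
  // Dual re-verify: both suites must pass for the fix to be accepted.
  const result = reverify();
  if (result.epicIntegrationPassed && result.sliceSuitesPassed) {
    return { accepted: true };
  }
  return { accepted: false, reason: "reverify-failed" };
}
```

In this sketch a rejected attempt never reaches `reverify`, which keeps an agent from "fixing" the epic by weakening the test that judges it.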

Infra vs logic at the epic verdict (Slice B): failures split on failureKind — ETIMEDOUT (and ENOENT) are infra, verify subprocess timeout raised 60s → 180s, and infra blips re-verify under infraRetryCount / maxInfraRetries without consuming remediation budget or invoking the agent.
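A rough sketch of the routing rule in the paragraph above, assuming codebase mode. `infraRetryCount` and `maxInfraRetries` are named in this PR; `routeEpicFailure`, `remediationAttempts`, and `maxRemediationAttempts` are hypothetical stand-ins (the last one for the epic-retry-budget), and the real logic lives in the Petri-net topology rather than in a single function.

```typescript
type FailureKind = "infra" | "test";
type Route = "re-verify" | "remediate" | "halt";

interface EpicBudgets {
  infraRetryCount: number; // named in the PR
  maxInfraRetries: number; // named in the PR
  remediationAttempts: number; // hypothetical counter
  maxRemediationAttempts: number; // hypothetical stand-in for epic-retry-budget
}

function routeEpicFailure(
  kind: FailureKind,
  budgets: EpicBudgets,
): { route: Route; budgets: EpicBudgets } {
  if (kind === "infra") {
    // Infra blips re-verify on their own budget; the agent is never invoked
    // and the remediation budget is left untouched.
    if (budgets.infraRetryCount < budgets.maxInfraRetries) {
      return {
        route: "re-verify",
        budgets: { ...budgets, infraRetryCount: budgets.infraRetryCount + 1 },
      };
    }
    return { route: "halt", budgets };
  }
  // Test/logic failures drive the remediation loop until its budget runs out.
  if (budgets.remediationAttempts < budgets.maxRemediationAttempts) {
    return {
      route: "remediate",
      budgets: { ...budgets, remediationAttempts: budgets.remediationAttempts + 1 },
    };
  }
  return { route: "halt", budgets };
}
```

Keeping the two counters separate is the point of the design: a flaky harness cannot burn the attempts reserved for genuine fixes.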

Docs & tests: memory/CARDS.md, PLAN.md, and SPEC.md (D170-K, I138-K, verification oracle row) updated; epic-recovery.integration.test.ts plus topology, run-artifact, and test-runner coverage.

Reviewed by Cursor Bugbot for commit 193b0d5. Bugbot is set up for automated code reviews on this repo.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 4457d97.

A failed epic verification now dispatches a remediation code agent against the folded __epic__ tree and re-verifies, reaching the halt sink only after an epic-retry-budget is exhausted, mirroring the slice-level run-tests loop. The fail-sibling routes to a new epic-remediate dispatch/complete chain instead of straight to halt. Oracle integrity: a remediation that edits the epic integration test is rejected (detect-and-reject) and counts against budget. Dual re-verify: acceptance requires the epic integration test AND the slice suites to pass on the folded tree, with the combined verdict carried on the routed token. Round-trip: harvestCookRun folds only slice worktrees, so the folded-tree fix is diff-transferred and committed to the representative slice branch (transferFoldedFixToSlice) to reach the promoted artifact. Verified by topology goldens, run-artifact unit tests, and a seeded scripted-agent e2e (fixable / reject / exhaustion). Slices B (epic infra/timeout classification) and C (partial promotion) remain.

SPEC: add D170-K (a failed epic verification is recoverable, not terminal — remediation loop + detect-and-reject + dual re-verify + diff-transfer round-trip) and I138-K; the validated round-trip assumption retires into the decision rather than a standalone row. Fold the oracle strategy into §Verification Design (topology + scripted-agent recovery e2e) and add the LLM-remediation-competence blind spot. PLAN: register epic-verify-recovery (FE-884) as a frontier under Sequencing + Frontier Definitions — Slice A done, B (epic infra/timeout classification) and C (partial promotion) remaining.

The verify-epic fail path now routes on failureKind: an infra/timeout failure re-verifies (bounded by a separate infraRetryCount / RunPolicy.maxInfraRetries), reaching the halt sink with an honest infra reason on exhaustion — never the remediation code agent. A test/logic failure still drives the Slice-A remediation loop. Correctness fix: spawnSync surfaces a verify timeout as ETIMEDOUT, but only ENOENT was classified infra, so a timeout was misclassified as `test` and (with slice A) would have fed the remediation agent a non-bug. isInfraSpawnError now treats ENOENT and ETIMEDOUT as infra, and the verify ceiling is raised from 60s to VERIFY_TIMEOUT_MS=180s (npx + code-split warmup). Distinct from FE-864's pi session deadline.
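The classification fix can be illustrated with Node's `spawnSync`, which reports both a missing binary (`ENOENT`) and a timeout (`ETIMEDOUT`) through `result.error`. This is a sketch under assumptions: `isInfraSpawnError` and `VERIFY_TIMEOUT_MS` are named in this PR, but their bodies here are guesses, and `runVerify` and its return shape are hypothetical.

```typescript
import { spawnSync } from "node:child_process";

// 180s ceiling, as stated in the PR.
const VERIFY_TIMEOUT_MS = 180_000;

type FailureKind = "infra" | "test";

// Treat a missing binary and a timed-out subprocess as infra failures.
function isInfraSpawnError(error: unknown): boolean {
  const code = (error as { code?: unknown } | null | undefined)?.code;
  return code === "ENOENT" || code === "ETIMEDOUT";
}

// Hypothetical wrapper: runs a verify command and classifies the outcome.
function runVerify(
  command: string,
  args: string[],
  timeoutMs: number = VERIFY_TIMEOUT_MS,
): { passed: boolean; failureKind?: FailureKind } {
  const result = spawnSync(command, args, { timeout: timeoutMs, encoding: "utf8" });
  if (result.error) {
    // Only ENOENT and ETIMEDOUT count as infra; other spawn errors fall
    // through to "test" in this sketch.
    return {
      passed: false,
      failureKind: isInfraSpawnError(result.error) ? "infra" : "test",
    };
  }
  // A non-zero exit with no spawn error is a genuine test/logic failure.
  return result.status === 0 ? { passed: true } : { passed: false, failureKind: "test" };
}
```

With this split, only a verdict whose `failureKind` is `"test"` would ever be handed to the remediation agent.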

Update I138-K + D170-K in place (same verify-epic seam): the fail path routes on failureKind, so infra/timeout re-verifies under a separate infraRetryCount / maxInfraRetries budget, ETIMEDOUT is classified infra under a 180s ceiling, while test/logic still remediates. Mark Slice B done in the epic-verify-recovery frontier; only Slice C (partial promotion) remains.

kostandinang · 2026-06-18T16:48:30Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

FE-884: Recover from cook epic-verification failure instead of halting #232 👈 (View in Graphite)
FE-883: Cook operational hardening — runtime stability, agent skills, git-merge composition, run GC #224: 1 other dependent PR (#227)
FE-864: Brownfield feature delivery from spec — detect, classify, probe, oracle, promote, serve #212: 2 other dependent PRs (#213, #223)
FE-841: Pi agent foundations — in-process SDK, toolchain profiles #194: 2 other dependent PRs (#198, #211)
main

This stack of pull requests is managed by Graphite.

cursor Bot reviewed Jun 18, 2026

Comment thread src/orchestrator/src/net-compiler.ts

Comment thread src/orchestrator/src/net-compiler.ts

kostandinang force-pushed the ka/fe-883-worktree-gc branch from 918bd08 to 593032a Compare June 18, 2026 16:14

kostandinang force-pushed the ka/fe-884-epic-verify-recovery branch from e7541a6 to 4457d97 Compare June 18, 2026 16:15

cursor Bot reviewed Jun 18, 2026

Comment thread src/orchestrator/src/net-compiler.ts

kostandinang added 4 commits June 18, 2026 17:34

kostandinang changed the base branch from ka/fe-883-worktree-gc to graphite-base/232 June 18, 2026 16:47

kostandinang force-pushed the graphite-base/232 branch from 593032a to 89d1d72 Compare June 18, 2026 16:47

kostandinang force-pushed the ka/fe-884-epic-verify-recovery branch from 4457d97 to 193b0d5 Compare June 18, 2026 16:47

kostandinang changed the base branch from graphite-base/232 to ka/fe-864-pi-timeout-600s June 18, 2026 16:48

kostandinang mentioned this pull request Jun 18, 2026

FE-883: Cook operational hardening — runtime stability, agent skills, git-merge composition, run GC #224

Open

This was referenced Jun 18, 2026

FE-881: Cook agent loads the target repo's sandbox-scoped skills #227

Closed

FE-883: Cook artifact lifecycle — git-merge slice composition #230

Closed

kostandinang wants to merge 4 commits into ka/fe-864-pi-timeout-600s from ka/fe-884-epic-verify-recovery
