FE-884: Recover from cook epic-verification failure instead of halting #232
kostandinang wants to merge 4 commits
PR Summary
High Risk
Overview
Infra vs logic at the epic verdict (Slice B): failures split on
Docs & tests:
Reviewed by Cursor Bugbot for commit 193b0d5.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 4457d97.
A failed epic verification now dispatches a remediation code agent against the folded __epic__ tree and re-verifies, reaching the halt sink only after an epic-retry-budget is exhausted, mirroring the slice-level run-tests loop. The fail-sibling routes to a new epic-remediate dispatch/complete chain instead of straight to halt.
Oracle integrity: a remediation that edits the epic integration test is rejected (detect-and-reject) and counts against budget.
Dual re-verify: acceptance requires the epic integration test AND the slice suites to pass on the folded tree, with the combined verdict carried on the routed token.
Round-trip: harvestCookRun folds only slice worktrees, so the folded-tree fix is diff-transferred and committed to the representative slice branch (transferFoldedFixToSlice) to reach the promoted artifact.
Verified by topology goldens, run-artifact unit tests, and a seeded scripted-agent e2e (fixable / reject / exhaustion). Slices B (epic infra/timeout classification) and C (partial promotion) remain.
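The loop described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `recoverEpic`, `EpicRecoveryDeps`, `Verdict`, `Remediation`, and `Outcome` are hypothetical names, and the real implementation is a set of net transitions rather than a single function. It only shows how the retry budget, detect-and-reject, and dual re-verify interact.

```typescript
// Hypothetical shapes for illustration; none of these names come from the PR's code.
type Verdict = { epicTestPassed: boolean; sliceSuitesPassed: boolean };
type Remediation = { changedFiles: string[] };

interface EpicRecoveryDeps {
  verify: () => Verdict;        // runs the epic integration test and the slice suites on the folded tree
  remediate: () => Remediation; // dispatches the remediation agent and reports the files it touched
  epicTestPath: string;         // the oracle that a remediation must not edit
  epicRetryBudget: number;      // maximum number of remediation attempts
}

type Outcome =
  | { kind: "accepted"; attempts: number }
  | { kind: "halted"; attempts: number; reason: string };

// Dual re-verify: both the epic test and the slice suites must pass.
function accepted(v: Verdict): boolean {
  return v.epicTestPassed && v.sliceSuitesPassed;
}

function recoverEpic(deps: EpicRecoveryDeps): Outcome {
  if (accepted(deps.verify())) return { kind: "accepted", attempts: 0 };

  for (let attempt = 1; attempt <= deps.epicRetryBudget; attempt++) {
    const fix = deps.remediate();

    // Detect-and-reject: a fix that edits the oracle is discarded,
    // but the attempt still counts against the budget.
    if (fix.changedFiles.includes(deps.epicTestPath)) continue;

    if (accepted(deps.verify())) return { kind: "accepted", attempts: attempt };
  }

  return {
    kind: "halted",
    attempts: deps.epicRetryBudget,
    reason: "epic retry budget exhausted",
  };
}
```

The three seeded e2e scenarios named above (fixable, reject, exhaustion) map onto the three ways out of this function: acceptance after a clean fix, a rejected attempt that only spends budget, and the halted outcome once the budget runs out.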
SPEC: add D170-K (a failed epic verification is recoverable, not terminal: remediation loop + detect-and-reject + dual re-verify + diff-transfer round-trip) and I138-K; the validated round-trip assumption retires into the decision rather than a standalone row. Fold the oracle strategy into §Verification Design (topology + scripted-agent recovery e2e) and add the LLM-remediation-competence blind spot.
PLAN: register epic-verify-recovery (FE-884) as a frontier under Sequencing + Frontier Definitions. Slice A is done; B (epic infra/timeout classification) and C (partial promotion) remain.
The verify-epic fail path now routes on failureKind: an infra/timeout failure re-verifies (bounded by a separate infraRetryCount / RunPolicy.maxInfraRetries), reaching the halt sink with an honest infra reason on exhaustion, never the remediation code agent. A test/logic failure still drives the Slice-A remediation loop.
Correctness fix: spawnSync surfaces a verify timeout as ETIMEDOUT, but only ENOENT was classified infra, so a timeout was misclassified as `test` and (with Slice A) would have fed the remediation agent a non-bug. isInfraSpawnError now treats ENOENT and ETIMEDOUT as infra, and the verify ceiling is raised from 60s to VERIFY_TIMEOUT_MS=180s (npx + code-split warmup). Distinct from FE-864's pi session deadline.
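A minimal sketch of the classification and routing described here, under stated assumptions: `isInfraSpawnError`, `VERIFY_TIMEOUT_MS`, `infraRetryCount`, and `maxInfraRetries` are named in the commit message, but their signatures and surrounding types are guessed; `classifyFailure`, `routeEpicFailure`, `epicRetryCount`, `maxEpicRetries`, and `verifySpawnOptions` are hypothetical names introduced only for illustration.

```typescript
// Ceiling taken from the commit message (180 s); how it is wired in is an assumption.
const VERIFY_TIMEOUT_MS = 180_000;

// Node's spawnSync accepts a `timeout` option in milliseconds.
const verifySpawnOptions = { timeout: VERIFY_TIMEOUT_MS };

type FailureKind = "infra" | "test";

// Minimal view of the `error` field spawnSync returns when the child
// could not be started or exceeded its timeout.
type SpawnErrorLike = { code?: string } | undefined;

// Spawn-level error codes meaning the harness never produced a real test verdict.
const INFRA_SPAWN_CODES = new Set(["ENOENT", "ETIMEDOUT"]);

function isInfraSpawnError(error: SpawnErrorLike): boolean {
  const code = error?.code;
  return code !== undefined && INFRA_SPAWN_CODES.has(code);
}

function classifyFailure(error: SpawnErrorLike): FailureKind {
  return isInfraSpawnError(error) ? "infra" : "test";
}

type Route = "re-verify" | "remediate" | "halt";

interface RetryState { infraRetryCount: number; epicRetryCount: number }
interface RunPolicy { maxInfraRetries: number; maxEpicRetries: number }

// Infra failures and test failures draw on separate budgets, so an infra
// failure is never routed to the remediation agent.
function routeEpicFailure(kind: FailureKind, state: RetryState, policy: RunPolicy): Route {
  if (kind === "infra") {
    return state.infraRetryCount < policy.maxInfraRetries ? "re-verify" : "halt";
  }
  return state.epicRetryCount < policy.maxEpicRetries ? "remediate" : "halt";
}
```

Keeping the two counters separate means a flaky harness cannot consume the remediation budget, and a run that exhausts its infra retries halts with an infra reason rather than a misleading test failure.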
Update I138-K + D170-K in place (same verify-epic seam): the fail path routes on failureKind. Infra/timeout re-verifies under a separate infraRetryCount/maxInfraRetries budget, ETIMEDOUT is classified infra under a 180s ceiling, and test/logic still remediates. Mark Slice B done in the epic-verify-recovery frontier; only Slice C (partial promotion) remains.


Stack Context
Top of the cook/brunch-serve stack, on the operational-hardening PR (#224). The pipeline below it can already detect, cook, verify, and promote a brownfield feature; #224 made that durable. This PR removes the last brittle stop in a cook run: a single epic's verification failure no longer halts the whole run.
What?
Slice A: recoverable epic verification. When an epic's verification fails, the run recovers and keeps going instead of halting the entire cook. The orchestration net gains a recovery route at the epic-verdict transition (`net-compiler`, `net-blueprint`, `petri-net`, `topology`), and the run artifact records the recovered/failed epic outcome rather than aborting (`run-artifact`). Covered end-to-end by a new `epic-recovery.integration.test.ts`.
Slice B: classify infra/timeout failures at the epic verdict. Lift the infra-vs-test failure split (established for slice runs in FE-872) up to the epic-verdict level: an epic that fails on infra or timeout is classified distinctly from a genuine test failure (`test-runner`, `net-compiler`), so recovery and reporting can treat "the harness fell over" differently from "the code is wrong."
Plus SPEC/PLAN/CARDS reconciliation for both slices.
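To make the reporting side concrete, here is a small sketch of how a run artifact might record per-epic outcomes and keep infra failures distinguishable from test failures. The `EpicOutcome` and `RunSummary` shapes and the `summarizeEpics` helper are hypothetical; the actual `run-artifact` schema is not shown in this PR description.

```typescript
// Hypothetical outcome record; field names are illustrative, not the run-artifact schema.
type EpicOutcome =
  | { epic: string; status: "passed" }
  | { epic: string; status: "recovered"; attempts: number }
  | { epic: string; status: "failed"; failureKind: "infra" | "test"; reason: string };

interface RunSummary {
  passed: number;
  recovered: number;
  failed: number;
  infraFailures: number; // subset of `failed` where the harness, not the code, fell over
}

function summarizeEpics(outcomes: EpicOutcome[]): RunSummary {
  const summary: RunSummary = { passed: 0, recovered: 0, failed: 0, infraFailures: 0 };
  for (const outcome of outcomes) {
    summary[outcome.status] += 1;
    if (outcome.status === "failed" && outcome.failureKind === "infra") {
      summary.infraFailures += 1;
    }
  }
  return summary;
}
```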
Why?
A cook run composes many epics. Halting the entire run because one epic's verification failed — especially when the failure is infra or a timeout, not a real defect — throws away all the work the other epics completed and makes long brownfield runs fragile. Making epic verification recoverable, and classifying why an epic verdict failed, lets a run finish and report partial success honestly instead of collapsing on the first stumble.
Builds directly on the operational-hardening work in #224 (epic verify deps, idle deadlines, git-merge fold composition).