FE-884: Recover from cook epic-verification failure instead of halting #232

Open
kostandinang wants to merge 4 commits into ka/fe-864-pi-timeout-600s from ka/fe-884-epic-verify-recovery

Conversation

@kostandinang commented Jun 18, 2026
Contributor

Stack Context

Top of the cook/brunch-serve stack, on the operational-hardening PR (#224). The pipeline below it can already detect, cook, verify, and promote a brownfield feature; #224 made that durable. This PR removes the last brittle stop in a cook run: a single epic's verification failure no longer halts the whole run.

What?

Slice A — recoverable epic verification. When an epic's verification fails, the run recovers and keeps going instead of halting the entire cook. The orchestration net gains a recovery route at the epic-verdict transition (net-compiler, net-blueprint, petri-net, topology), and the run artifact records the recovered/failed epic outcome rather than aborting (run-artifact). Covered end-to-end by a new epic-recovery.integration.test.ts.

Slice B — classify infra/timeout failures at the epic verdict. Lift the infra-vs-test failure split (established for slice runs in FE-872) up to the epic-verdict level: an epic that fails on infra or timeout is classified distinctly from a genuine test failure (test-runner, net-compiler), so recovery and reporting can treat "the harness fell over" differently from "the code is wrong."
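A minimal sketch of the infra-vs-test split described above, assuming the verify step returns a result shaped like the return value of Node's `spawnSync` (an optional `error` carrying a `code`, plus an exit `status`). The commit notes on this PR state that `ENOENT` and `ETIMEDOUT` are treated as infra by `isInfraSpawnError`; the `FailureKind` type, the `SpawnLikeResult` shape, and `classifyVerify` are hypothetical names used only for illustration and are not taken from test-runner or net-compiler.

```typescript
// Hypothetical types for illustration; the real test-runner may differ.
type FailureKind = "infra" | "test";

// Structural subset of what spawnSync returns (assumed shape).
interface SpawnLikeResult {
  error?: { code?: string };
  status: number | null;
}

// Spawn-level error codes that mean "the harness fell over".
const INFRA_SPAWN_CODES: ReadonlySet<string> = new Set(["ENOENT", "ETIMEDOUT"]);

function isInfraSpawnError(error: { code?: string } | undefined): boolean {
  return error?.code !== undefined && INFRA_SPAWN_CODES.has(error.code);
}

// Classify one verify run: pass, infra failure, or genuine test failure.
function classifyVerify(
  result: SpawnLikeResult,
): { passed: boolean; failureKind?: FailureKind } {
  if (isInfraSpawnError(result.error)) {
    return { passed: false, failureKind: "infra" };
  }
  if (result.error !== undefined || result.status !== 0) {
    return { passed: false, failureKind: "test" };
  }
  return { passed: true };
}
```

In this sketch a timed-out verify (spawn error code `ETIMEDOUT`) lands in the infra bucket, so it would not be handed to a remediation agent as if it were a code defect.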

Plus SPEC/PLAN/CARDS reconciliation for both slices.

Why?

A cook run composes many epics. Halting the entire run because one epic's verification failed — especially when the failure is infra or a timeout, not a real defect — throws away all the work the other epics completed and makes long brownfield runs fragile. Making epic verification recoverable, and classifying why an epic verdict failed, lets a run finish and report partial success honestly instead of collapsing on the first stumble.

Builds directly on the operational-hardening work in #224 (epic verify deps, idle deadlines, git-merge fold composition).

@cursor

cursor Bot commented Jun 18, 2026


PR Summary

High Risk
Changes core cook Petri-net epic-verify topology and budget routing in net-compiler.ts; mistakes could mis-route halts, remediation, or promotion. Brownfield-only gates limit blast radius; extensive tests mitigate but real-agent fix quality remains un-oracled.

Overview
Epic verification is no longer a dead end in codebase mode. A failing verify-epic routes to a new epic-remediate chain (dispatch/complete + epic-retry-budget) instead of straight to halt. A code agent runs on the folded __epic__/<id>/ tree; detect-and-reject discards attempts that touch epic integration tests; dual re-verify requires the epic integration and slice suites to pass on the folded tree. Accepted fixes are transferred onto the representative slice branch (transferFoldedFixToSlice) so harvestCookRun can fold them. Greenfield still halts on first epic failure.

Infra vs logic at the epic verdict (Slice B): failures split on failureKind; ETIMEDOUT (and ENOENT) are infra, the verify subprocess timeout is raised 60s → 180s, and infra blips re-verify under infraRetryCount / maxInfraRetries without consuming remediation budget or invoking the agent.

Docs & tests: memory/CARDS.md, PLAN.md, and SPEC.md (D170-K, I138-K, verification oracle row) updated; epic-recovery.integration.test.ts plus topology, run-artifact, and test-runner coverage.

Reviewed by Cursor Bugbot for commit 193b0d5. Bugbot is set up for automated code reviews on this repo.

Comment thread src/orchestrator/src/net-compiler.ts
Comment thread src/orchestrator/src/net-compiler.ts
@kostandinang force-pushed the ka/fe-883-worktree-gc branch from 918bd08 to 593032a on June 18, 2026 16:14
@kostandinang force-pushed the ka/fe-884-epic-verify-recovery branch from e7541a6 to 4457d97 on June 18, 2026 16:15

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 4457d97.

Comment thread src/orchestrator/src/net-compiler.ts
A failed epic verification now dispatches a remediation code agent against the
folded __epic__ tree and re-verifies, reaching the halt sink only after an
epic-retry-budget is exhausted — mirroring the slice-level run-tests loop. The
fail-sibling routes to a new epic-remediate dispatch/complete chain instead of
straight to halt.

Oracle integrity: a remediation that edits the epic integration test is rejected
(detect-and-reject) and counts against budget. Dual re-verify: acceptance
requires the epic integration test AND the slice suites to pass on the folded
tree, with the combined verdict carried on the routed token. Round-trip:
harvestCookRun folds only slice worktrees, so the folded-tree fix is
diff-transferred and committed to the representative slice branch
(transferFoldedFixToSlice) to reach the promoted artifact.
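The remediation loop described above can be sketched as a plain function, assuming the code agent is modelled as a callback that reports which files it touched and whether each suite passes on the folded tree. Every name here (`RemediationAttempt`, `remediateEpic`, `isEpicIntegrationTest`, and the `.integration.test.ts` suffix check) is hypothetical; the real logic lives in the Petri-net topology, not in a loop like this.

```typescript
// Hypothetical model of one remediation attempt on the folded __epic__ tree.
interface RemediationAttempt {
  touchedFiles: string[];
  epicTestPasses: boolean;
  sliceSuitesPass: boolean;
}

interface RemediationOutcome {
  kind: "accepted" | "halt";
  attemptsUsed: number;
}

// Assumption: epic integration tests are identified by file suffix.
function isEpicIntegrationTest(path: string): boolean {
  return path.endsWith(".integration.test.ts");
}

function remediateEpic(
  runAgent: (attempt: number) => RemediationAttempt,
  retryBudget: number,
): RemediationOutcome {
  for (let attempt = 1; attempt <= retryBudget; attempt++) {
    const result = runAgent(attempt);
    // Detect-and-reject: editing the oracle discards the attempt,
    // but the attempt still counts against the budget.
    if (result.touchedFiles.some(isEpicIntegrationTest)) continue;
    // Dual re-verify: both the epic test and the slice suites must pass.
    if (result.epicTestPasses && result.sliceSuitesPass) {
      return { kind: "accepted", attemptsUsed: attempt };
    }
  }
  // Budget exhausted: only now does the run reach the halt sink.
  return { kind: "halt", attemptsUsed: retryBudget };
}
```

An accepted outcome is the point at which the folded-tree fix would be diff-transferred to the representative slice branch; that step is omitted here.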

Verified by topology goldens, run-artifact unit tests, and a seeded scripted-
agent e2e (fixable / reject / exhaustion). Slices B (epic infra/timeout
classification) and C (partial promotion) remain.
SPEC: add D170-K (a failed epic verification is recoverable, not terminal —
remediation loop + detect-and-reject + dual re-verify + diff-transfer
round-trip) and I138-K; the validated round-trip assumption retires into the
decision rather than a standalone row. Fold the oracle strategy into
§Verification Design (topology + scripted-agent recovery e2e) and add the
LLM-remediation-competence blind spot.

PLAN: register epic-verify-recovery (FE-884) as a frontier under Sequencing +
Frontier Definitions — Slice A done, B (epic infra/timeout classification) and
C (partial promotion) remaining.

The verify-epic fail path now routes on failureKind: an infra/timeout failure
re-verifies (bounded by a separate infraRetryCount / RunPolicy.maxInfraRetries),
reaching the halt sink with an honest infra reason on exhaustion — never the
remediation code agent. A test/logic failure still drives the Slice-A
remediation loop.
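A rough sketch of the routing rule above, assuming the token carries two independent counters. `infraRetryCount` and `maxInfraRetries` are named in this PR; `remediationCount`, `maxRemediations`, `EpicFailureKind`, `EpicToken`, and `routeEpicFailure` are hypothetical stand-ins for whatever the net compiler actually uses.

```typescript
type EpicFailureKind = "infra" | "test";

// Hypothetical token and policy shapes.
interface EpicToken {
  infraRetryCount: number;
  remediationCount: number;
}

interface RunPolicy {
  maxInfraRetries: number;
  maxRemediations: number; // assumed name for the epic-retry-budget
}

type Route =
  | { next: "re-verify"; token: EpicToken }
  | { next: "remediate"; token: EpicToken }
  | { next: "halt"; reason: string };

function routeEpicFailure(
  kind: EpicFailureKind,
  token: EpicToken,
  policy: RunPolicy,
): Route {
  if (kind === "infra") {
    // Infra blips re-verify under their own budget and never reach the agent.
    if (token.infraRetryCount >= policy.maxInfraRetries) {
      return { next: "halt", reason: "infra retries exhausted" };
    }
    return {
      next: "re-verify",
      token: { ...token, infraRetryCount: token.infraRetryCount + 1 },
    };
  }
  // Test/logic failures drive the remediation loop under a separate budget.
  if (token.remediationCount >= policy.maxRemediations) {
    return { next: "halt", reason: "epic-retry-budget exhausted" };
  }
  return {
    next: "remediate",
    token: { ...token, remediationCount: token.remediationCount + 1 },
  };
}
```

Keeping the two counters separate is what lets an infra retry leave the remediation budget untouched.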

Correctness fix: spawnSync surfaces a verify timeout as ETIMEDOUT, but only
ENOENT was classified infra, so a timeout was misclassified as `test` and (with
slice A) would have fed the remediation agent a non-bug. isInfraSpawnError now
treats ENOENT and ETIMEDOUT as infra, and the verify ceiling is raised from 60s
to VERIFY_TIMEOUT_MS=180s (npx + code-split warmup). Distinct from FE-864's pi
session deadline.

Update I138-K + D170-K in place (same verify-epic seam): the fail path routes
on failureKind — infra/timeout re-verifies under a separate infraRetryCount/
maxInfraRetries budget, ETIMEDOUT classified infra under a 180s ceiling, while
test/logic still remediates. Mark Slice B done in the epic-verify-recovery
frontier; only Slice C (partial promotion) remains.
@kostandinang changed the base branch from ka/fe-883-worktree-gc to graphite-base/232 on June 18, 2026 16:47
@kostandinang force-pushed the ka/fe-884-epic-verify-recovery branch from 4457d97 to 193b0d5 on June 18, 2026 16:47
@kostandinang changed the base branch from graphite-base/232 to ka/fe-864-pi-timeout-600s on June 18, 2026 16:48

kostandinang commented Jun 18, 2026

Contributor Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.
