FE-883: Orchestrator operational improvements (serve reliability + Opus 4.8)#224
FE-883: Orchestrator operational improvements (serve reliability + Opus 4.8)#224kostandinang wants to merge 50 commits into
Conversation
PR SummaryMedium Risk Overview Git & promotion: Run/slice branches move from Run artifact (new): Sandboxes: Slice git worktrees are lazy ( Verification & grid: Test runner adds CLI robustness: Cook rejects unknown flags; Petrinaut/plan/sandbox errors throw instead of Reviewed by Cursor Bugbot for commit 7a7e1e5. Bugbot is set up for automated code reviews on this repo. Configure here. |
90bb5ef to
40a9d88
Compare
5b145b2 to
7d39fe7
Compare
a5bfc10 to
ac4e47c
Compare
40a9d88 to
05b471a
Compare
05b471a to
89e7850
Compare
ac4e47c to
3e02bfb
Compare
Each cook agent action (write-tests, write-code, verify-epic) runs under a per-action wall-clock budget enforced in pi-actions.ts. Raise the default from 300s to 600s so Sonnet agents have headroom on larger slices and on brownfield repos where setup/discovery eats into the turn. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
…eam, clean failures Iterating on the live TUI from real-terminal feedback: - One global run timer in the footer instead of a per-item clock on every pending row (and whole-second, no jittery decimals). - "brunch" wordmark is now a big lowercase figlet (Slant) in a warm orange gradient, replacing the egg. - Activity log + wordmark stream through Ink <Static> so the full run lands in scrollback instead of collapsing in a redrawn bounded box; line cap removed. - Brigade tracker no longer lights "taste" mid-cook — per-slice verify actions fire during cooking, so taste stays unlit until a real end-of-cook signal. - Failures throw instead of process.exit, so withCookBus disposes (unmounts Ink) before the error prints — no more frozen "prep ◐" hang. cook validates args before mounting the TUI and rejects unknown flags (e.g. --spec-id). check + presenter/cook/pi-actions tests green; full build deferred (active graphite stack navigation). Co-Authored-By: Claude <noreply@anthropic.com>
Wires the two remaining kitchen-brigade phases faithfully to the orchestrator-arcs mapping (verify→taste, ship→serve): - taste lights on the epic-verification verdict (action `epic <id> → …`), not on per-slice `verify <target>` lines — those fire mid-cook and previously lit taste while still cooking. - serve lights on a new `cook-done` event emitted at the end of runCook (after promotion); a halted run never ships, so it never lights serve. phase.test covers both signals + the full prep→cook→taste→plate→serve walk; check + presenter/cook tests green. Co-Authored-By: Claude <noreply@anthropic.com>
Each cook pi session was a black box in the pending panel — just a KB count.
runPi already subscribes to the session's text stream; instead of bytes,
surface the agent's latest non-empty line (tail-truncated, throttled every
2 KB) as the activity-progress detail, so a wait reads as live work ("agent
writing tests · …adds the RefreshToken guard") rather than "still going".
Kept headless createAgentSession — no pi InteractiveMode, no new pi API: pi's
tool-call events come via an extension hook (on('tool_call')), not the
subscribe stream, so a richer "editing <file> / running <tool>" heartbeat is a
separate follow-up that needs the extension-registration path verified.
check + pi-actions/presenter tests green.
Co-Authored-By: Claude <noreply@anthropic.com>
Richer "what the agent is doing" in the pending panel (the spike's Option A, full tier): instead of only the agent's latest line, show the tool calls — "edit src/auth/token.ts", "bash bun test", "grep RefreshToken". pi exposes no tool-call hook on session.subscribe (text/lifecycle only), so buildSessionOptions now supplies the built-in tools itself via customTools + noTools:'builtin': each createXToolDefinition(cwd) is wrapped to emit a label from its params, then delegates unchanged. The builders bake in the real config (withFileMutationQueue, truncation defaults), so behavior is preserved — confirmed in pi's edit.js. Observation is fail-safe (emit in try/catch). toolLabel + instrumentToolDefinition are pure/unit-tested (label mapping; wrap delegates same args + result; observer error can't break a tool call). Caveat: the customTools/noTools runtime wiring isn't covered by tests (they stub createSession, bypassing buildSessionOptions) — needs a real cook run to confirm the agent receives the instrumented tools and they emit live. check + pi-actions tests green. Co-Authored-By: Claude <noreply@anthropic.com>
Brownfield cook provisioned every slice's git worktree eagerly in wireHandlers — N `git worktree add` + N recursive node_modules CoW copies paid synchronously at startup before any slice fired. - Move slice-worktree creation into resolveSliceCwd via idempotent ensureSliceWorktree, so a slice's worktree is materialized on first fire. A run touching 2 of 8 slices pays for 2 worktrees, not 8. Synchronous provisioning serializes concurrent fires on the JS thread, so parallel-policy worktree adds never overlap. - Symlink each slice's node_modules to the parent worktree's single copy instead of CoW-copying per slice (SHAREABLE_TOP_LEVEL_ENTRIES). walkFiles already skips symlinks, so the shared tree is never re-walked during dep seeding, merge, or promotion. Other gitignored dirs still copy per slice. Correctness-neutral: same worktrees/branches, just lazy; deps resolve through the symlink. npm run verify green; adds symlink + idempotency unit tests. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
363cba6 to
91a281d
Compare
Opus 4.8 rejects thinking.type=enabled ("Use thinking.type.adaptive and
output_config.effort"). Swap the interviewer's enabled/budgetTokens block
for { type: 'adaptive' } + effort: 'medium'. Only interview.ts sets explicit
thinking provider options; secondary-chat and plan-architect don't.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
The grid made the happy path legible; this does the unhappy path.
- events: slice + cook-done gain `reason?`.
- pi-actions: failed-slice emits carry a reason — evaluate-done maps failureKind
('infra error' / 'tests failed'); write-tests/write-code catches use
'test authoring failed' / 'code authoring failed'.
- cook-cli: cook-done carries result.reason on halt.
- run-store: slice rows store `reason`; RunState.haltReason set from
cook-done(ok:false), unset on completion.
- ink: a failed slice row shows its reason (red); a HaltSummary footer pins
"✗ halted · <reason>" (truncated) when the run halts.
Retry-attempt counts ("3/3") DEFERRED — retryCount lives on the petri-net
tokens, not the action context; that's a separate slice.
Verified in an isolated worktree off the TUI branch: check clean; presenter +
pi-actions + cook tests green (141). The one full-suite failure is
build-boundary.test (client production-chunk manifest) — unrelated to this
server-only change; a worktree/symlinked-node_modules artifact, re-verify on a
normal checkout.
Co-Authored-By: Claude <noreply@anthropic.com>
Completes the retry half of failure legibility — without net-token plumbing. A slice loops failed→running on retry, so the attempt count is derivable from the slice-event stream the store already folds: queued→running is attempt 1, a step change mid-run keeps it, failed→running bumps it. The grid shows "attempt N" only for N≥2 (no clutter on the first run). No engine/net changes: retryCount on the petri tokens isn't needed — the observable slice transitions carry the same information. check + presenter tests green (46). Co-Authored-By: Claude <noreply@anthropic.com>
The attempt count now reads "attempt 2/4" — n over the total attempts allowed (retry budget + 1). cook-cli emits maxRetries on run-shape; the store derives maxAttempts = maxRetries + 1; the grid shows it only once a slice has retried (n ≥ 2), and falls back to a bare "attempt n" when the budget is unknown. Denominator confirmed against the net: retryCount halts at `>= maxRetries` (starts 0), so a slice gets maxRetries + 1 total attempts. check + presenter tests green (49). Co-Authored-By: Claude <noreply@anthropic.com>
Opus 4.8 otherwise splits a single `ask_question` into several parallel partial calls (one option each, or leaked tool-call XML), which land as `output-error` parts. - interview.ts: set `disableParallelToolUse` so the model emits one tool call per step. - app.ts: `stripFailedToolParts` drops echoed `output-error` tool parts before `validateUIMessages`. They carry `rawInput` but no `input`, so the output-error schema rejects them and bricks the whole follow-up turn; the retried call is what matters, the dead attempt carries no history value. - app.test.ts: covers a follow-up turn whose history contains a failed attempt followed by a successful retry. verify green (check + test + build). 🍳 Built with brunch Co-Authored-By: Opus 4.8 <noreply@anthropic.com>
Evaluate gate (phantom failure fix):
- Add `absent` TestFailureKind for vitest "No test files found" (nothing built
yet), distinct from a genuine `test` red ("Cannot find module" stays `test`).
- runVerification aggregate precedence infra > test > absent.
- evaluate-done no longer emits `failed` for the unbuilt greenfield gate — the
slice stays running and flows to write-tests as the SAME attempt, killing the
phantom n/max + ✗ NEEDS WORK on every clean slice.
- Strip `[agent-tail]` harness noise from captured runner output.
Promotion ref namespace + composer:
- run-refs.ts: centralize the run git-ref namespace (brunch/run/<id>,
brunch/slice/<id>/<sid>), replacing duplicated cook/ + cook-slice/ strings;
prune stale worktree registrations before `git worktree add`. On-disk
.brunch/cook/runs/ path unchanged (FE-743 compat).
- run-artifact.ts: fold per-slice branches into the run branch via
`git merge-tree` plumbing in dependency order — real conflicts surface
fail-closed, one merge node per slice, `squash` granularity retained. Pure
plumbing: no working tree / index / active branch touched. Not yet wired into
cook-cli (needs a live-run check of the dependency-seed interaction).
🍳 Built with brunch
Co-Authored-By: claude-opus-4-8 <noreply@anthropic.com>
Cook runs write per-slice worktrees under .brunch/cook/runs/<id>/worktree/, which Vite's default file watcher picks up — every slice edit triggered a page reload and tsconfig-change full reload in the host dev server. - vite-client.ts: server.watch.ignored = ['**/.brunch/**']. Merged with Vite's built-in ignores (.git, node_modules, cacheDir), so run artifacts no longer drive HMR. 🍳 Built with brunch Co-Authored-By: claude-opus-4-8 <noreply@anthropic.com>
project-detect.test.ts had the "resolves the runner from workspace packages in a monorepo" describe block verbatim twice — remove the copy. 🍳 Built with brunch Co-Authored-By: claude-opus-4-8 <noreply@anthropic.com>
Greenfield serves accepted --land but never landed: landing only runs in the brownfield (codebase) promotion path. Greenfield has no repo branch to land onto, so surface the no-op instead of silently swallowing the flag. 🍳 Built with brunch Co-Authored-By: Opus 4.8 <noreply@anthropic.com>
write-tests and write-code emit a failed slice event before rethrowing on runPi failure; verify-epic did not, so epic-verification failures left the slice grid without a failed row or reason. Mirror the wrapper on the epic's representative slice. 🍳 Built with brunch Co-Authored-By: Opus 4.8 <noreply@anthropic.com>
mergeSlicesIntoEpicSandbox builds __epic__ trees via walkFiles, which skips symlinks, so the parent's symlinked node_modules never reached the merged cwd. verify-epic and probes then ran without resolvable deps even though per-slice verification passed. Re-link the shareable entries after the merge. 🍳 Built with brunch Co-Authored-By: Opus 4.8 <noreply@anthropic.com>
The `withTestDir relocates test targets…` describe block was copied verbatim a second time. Remove the duplicate. 🍳 Built with brunch Co-Authored-By: Opus 4.8 <noreply@anthropic.com>
Cook actions (write-tests/write-code/verify-epic) and the interview already run on claude-opus-4-8; the architect default lagged at claude-opus-4-6. Lift it to 4-8 so plan, cook, and interview paths share the latest Opus. 🍳 Built with brunch Co-Authored-By: Opus 4.8 <noreply@anthropic.com>
… detail Three serve/plan presentation fixes the dogfood run surfaced: - Recipe banner: the header said "brunch plan" / "✓ plan" while the phase tracker right below it reads "recipe". Rename both to "recipe" so the planning surface speaks one brigade term (the CLI command stays `plan`; this is the phase label). - out folder: plan-start showed the launch root (e.g. the repo dir). Emit the spec plan dir (.brunch/cook/specs/<id>) — the exact place plan.yaml lands. - Verbose warnings: --verbose now appends a one-sentence plain-English account after each terse warning code (explainEmitterWarning / explainReconciliationWarning), so a reviewer needn't know the code vocabulary. Default verbosity keeps the terse line. Also fix a pre-existing typecheck break in run-artifact.test.ts (mode 'codebase' → 'greenfield'; PlanMode is greenfield|brownfield) that was reddening the gate. 🍳 Built with brunch Co-Authored-By: claude-opus-4-8 <noreply@anthropic.com>
The cook dogfood run halted after the graph-canvas test-author hit the 600s PI_TIMEOUT_MS. It wasn't hung — it was continuously writing files, reasoning, and re-running vitest right up to the wall (the heaviest slice in the plan, the React Flow canvas composing every upstream graph slice). A single fatal pi timeout there halted the whole run; nothing was promoted. Reframe PI_TIMEOUT_MS as an idle deadline rather than a wall-clock cap. The timer is re-armed (armIdleTimer) on every session event — text, tool call, tool-execution progress, thinking — so the budget bounds dead air, not total work. A genuinely hung session (no events for the budget) still aborts and rejects exactly as before; a heavy-but-progressing slice no longer trips it. Covered by a new test that emits non-text activity every 15ms past a 40ms budget and asserts the session survives (a wall-clock cap would abort it); the existing hang/setup-timeout tests still confirm true dead air trips it. 🍳 Built with brunch Co-Authored-By: Opus 4.8 <noreply@anthropic.com>
Formatter-only churn surfaced by npm run fix (oxfmt) — long call/import lines wrapped to print width. No logic changes. 🍳 Built with brunch Co-Authored-By: Opus 4.8 <noreply@anthropic.com>
| }, | ||
| ]; | ||
| } | ||
| input.emit?.({ kind: 'slice', id: sliceId, epicId, status: 'failed' }); |
There was a problem hiding this comment.
Verify failures omit slice reason
Low Severity
Net-level verification emits slice failed events without a reason, while evaluate-done and writer abort paths set reasons such as tests failed or infra error. Failed rows from the test-runner transition therefore show no failure reason in the slice grid despite the presenter supporting reason.
Reviewed by Cursor Bugbot for commit 3f6ca8d. Configure here.
…acle The cook dogfood showed every test-writer agent running the same dance: write tests → confirm red → scaffold a reference implementation → confirm green → delete it. That duplicates the code-writer that runs right after, inside the most expensive phase (~120-250s/slice authoring dominates the critical path), yet the prompt never asked for it — it's emergent from the heavy 'your tests are the sole oracle' pressure, and the agents satisfied 'write tests only' by deleting the throwaway impl afterward. Name the pattern and ban it: green-confidence comes from the documented contract and reading the surrounding code, not from implementing the slice to watch it pass. Red-for-the-right-reason (run the tests, see absent behavior) stays. verify-epic shares this prompt, so the epic-integration authoring gate benefits too. Prompt-level nudge against emergent behavior, not a hard guarantee — the proof is the next run's authoring times. 🍳 Built with brunch Co-Authored-By: Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 7a7e1e5. Configure here.
| this.clock = createElapsedClock(now); | ||
| this.state = { command, phase: 'prep', lines: [], pending: [] }; | ||
| this.state = { command, phase: 'prep', lines: [], pending: [], epics: [], slices: [], runStart: now() }; | ||
| } |
There was a problem hiding this comment.
Ink footer timer ignores cook-start
Medium Severity
The Ink footer elapsed clock uses RunStore’s constructor runStart, while action log lines use ElapsedClock seeded from cook-start. On brunch serve, recipe/plan time is included in the footer but not in per-action timestamps, so two timers diverge in one screen.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 7a7e1e5. Configure here.



Stack Context
Stacks on FE-878 (#222). FE-883 — operational improvements under the FE-864 brownfield-orchestration umbrella (planning PR #212). No longer the single timeout tweak it started as — this branch collects the operational improvements that make
brunch serveusable on real brownfield runs.What?
DEFAULT_ARCHITECT_MODEL_ID) all runclaude-opus-4-8; the interviewer uses adaptive thinking + effort, and a fix stops 4.8 fragmentingask_questioninto failed tool parts.verify-epic), keep the grid current during verification, surface the failure reason + a pinned halt summary, show per-slice attempt count, and a "not-started" evaluate gate so a greenfield slice isn't painted ✗ before its tests exist.serve --land— merge the promoted cook branch into the active branch as the final serve step; greenfield runs (no repo branch to land onto) warn instead of silently ignoring the flag.run-artifact.ts/run-refs.ts) composes the completed run.__epic__tree re-links the parent's symlinkednode_modulessoverify-epicand probes have resolvable deps..brunchcook worktrees.Why?
The spec-23 run exposed serve reliability as a broad issue, not a single timeout: actions timed out on real repos, the architect's model default produced weaker fallback plans that amplified slice counts, and slice failures — especially writer and epic-verify aborts — vanished from the grid with no reason. This branch is the FE-864 umbrella's implementation seam for those concrete operational fixes.
Tests
Per-seam unit coverage for the changes above:
serve --landarg mapping + greenfield warning,verify-epicfailed-row emission, the epic-mergenode_modulesre-link, the architect default model id, the run-artifact composer, and the slice-grid presenter.