fix(harness): change strategy on repeated unproductive tool results (#4089)#4389
fix(harness): change strategy on repeated unproductive tool results (#4089)#4389M3gA-Mind wants to merge 3 commits into
Conversation
…inyhumansai#4089) Agents that hit an unproductive tool result tended to retry the same strategy (identical tool + args + error) instead of adapting. The turn loop's only no-progress handling was RepeatedToolFailureMiddleware, which went straight from repeated failure to a hard halt — it never fed a signal back so the model could change approach. Add a reusable no-progress primitive (NoProgressTracker) that tracks recent (tool, args) -> outcome and returns a Continue / Nudge / Halt verdict via an escalation ladder that caps same-strategy retries before forcing an alternative: - identical repeat == 2 -> Nudge: inject a structured "no progress since step X" corrective as a user turn so the model uses a different tool or arguments (enumerate before reading, correct the query) or reports back - identical repeat >= 3 -> Halt (retries exhausted) - varied failures == 4 -> Nudge; >= 6 -> Halt backstop - hard policy-reject >= 2 -> Halt; any success resets RepeatedToolFailureMiddleware becomes a thin driver: it feeds each tool outcome into the tracker and translates the verdict onto the shared steering handle (Nudge -> InjectMessage fed back before the next model call; Halt -> halt summary + Pause). The primitive is harness-agnostic and unit-tested so issues tinyhumansai#4090/tinyhumansai#4092 can layer on it. Add unit tests for the tracker (nudge-then-halt, success reset, changed-args clears the streak, hard-reject, varied backstop, unknown-tool exclusion, threshold clamp) and update the middleware tests to assert the InjectMessage(nudge) -> Pause(halt) escalation.
|
Warning Review limit reachedYou’ve reached a temporary PR review limit under our Fair Usage Limits Policy. Next review available in: 24 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7508a3c840
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…umansai#4089) The nudge fed a User-role message into the loop. Turn-outcome extraction slices the persisted suffix after the last user message (convert::messages_since_last_user), so on a run that nudges and then recovers instead of halting, every assistant/tool cycle before the nudge was dropped from the persisted conversation and the post-turn tool records. Inject the corrective as a system (scaffold) message instead: it still reaches the model at the next steering checkpoint, but the original user turn stays the persistence boundary so the full turn suffix is preserved. Add a regression test asserting the nudge is a system message.
|
Addressed the review feedback: Codex P2 — nudge dropped pre-nudge tool cycles from the persisted turn ( Also pushed a rustfmt-only commit (3f3cdf4) to satisfy the Rust Quality lane. @coderabbitai review |
🧩 Analysis chain🏁 Script executed: #!/bin/bash
cd /tmp || exit
rm -rf openhuman_repo
git clone --depth 50 https://github.com/tinyhumansai/openhuman.git openhuman_repo 2>&1 | tail -20
cd openhuman_repo
gh pr checkout 4389 2>&1 | tail -20 || git fetch origin pull/4389/head:pr-4389 && git checkout pr-4389
git log --oneline -10
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
fd no_progress.rs
fd middleware.rs
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>Length of output: 2031 🏁 Script executed: #!/bin/bash
cd /tmp/openhuman_repo || exit
fd no_progress.rs
fd middleware.rs
git show --stat 0acdc8f | head -30
</br>Length of output: 1595 🏁 Script executed: #!/bin/bash
cd /tmp/openhuman_repo || exit
git show 0acdc8f -- src/openhuman/tinyagents/middleware.rsLength of output: 4581 🏁 Script executed: #!/bin/bash
cd /tmp/openhuman_repo || exit
rg -n "fn system" src/openhuman/tinyagents/*.rs
rg -n "messages_since_last_user" -A 15 src/openhuman/tinyagents/convert.rs | head -60
rg -n "enum TaMessage" -A 15 src/openhuman/tinyagents/*.rsLength of output: 2118 🏁 Script executed: #!/bin/bash
cd /tmp/openhuman_repo || exit
rg -n "fn system\b|fn user\b|enum Message\b" src/openhuman/tinyagents/*.rs -C2 | head -60Length of output: 271
The fix is sound. I confirmed in pub(super) fn messages_since_last_user(messages: &[Message]) -> &[Message] {
let start = messages
.iter()
.rposition(|m| matches!(m, Message::User(_)))
.map(|i| i + 1)
.unwrap_or(0);
&messages[start..]
}Since this only matches on the Nothing else stands out from this delta — the change is minimal and well-targeted. I'll go ahead and do a full pass over the PR now. (⌐■_■) nice catch-and-fix loop ✅ Action performedReview finished.
|
sanil-23
left a comment
There was a problem hiding this comment.
Review: fix(harness): change strategy on repeated unproductive tool results (#4089)
Walkthrough
This extracts the repeated-tool-failure logic out of RepeatedToolFailureMiddleware into a new, harness-free no_progress::NoProgressTracker and adds an escalation ladder: instead of only halting, a repeated-identical (or piling-up-varied) failure now first injects a structured "no progress since step X" nudge as a system scaffold message so the model can change strategy, and only halts once the same-strategy retries are exhausted. Halt timings are preserved at legacy parity (identical → 3, varied backstop → 6, hard-reject → 2), with nudges landing one step earlier (identical → 2, varied → 4). The refactor is clean, the primitive is nicely isolated + unit-tested, and the Codex feedback (system vs. user message so the persistence boundary doesn't move) was correctly addressed. Overall this is close to mergeable — the findings below are mostly design questions, not defects.
I verified the Codex fix is correct: convert::messages_since_last_user slices on rposition(Message::User), so a System nudge genuinely preserves the original user turn as the boundary. Good regression test.
Changed files
| File | Change |
|---|---|
src/openhuman/tinyagents/no_progress.rs |
New. NoProgressTracker + NoProgress verdict ladder, thresholds, nudge/halt copy, unit tests. |
src/openhuman/tinyagents/middleware.rs |
RepeatedToolFailureMiddleware now drives the tracker; translates Nudge → InjectMessage(system) / Halt → Pause. Tests rewritten. |
src/openhuman/tinyagents/mod.rs |
Registers no_progress module; REPEATED_TOOL_FAILURE_THRESHOLD now aliases DEFAULT_IDENTICAL_HALT_THRESHOLD; wiring comment updated. |
Findings
1. [major / question · medium confidence] Batched tool calls in one model response can halt before the nudge is ever seen
middleware.rs:1167 no_progress.rs:128
In tinyagents' agent loop, all tool calls from a single model response are executed serially in one for loop, and after_tool fires per call while model_calls (your step) stays constant for the whole batch. Steering commands are only drained at the next checkpoint (top of the next loop iteration).
Consequence: if a model emits ≥3 identical-failing calls (or ≥6 varied-failing calls) in a single response, same_count/consecutive climb across that one batch and reach NoProgress::Halt in the same model step — and any nudge queued mid-batch via InjectMessage isn't applied until after the whole batch has run. So the "change strategy on your next step" corrective can't influence the batch that tripped it, and [no progress since step X] repeats the same X.
The halt-in-one-batch behaviour predates this PR (the legacy counters worked the same way), so this isn't a regression — but the new nudge is premised on sequential retries, and that premise doesn't hold for parallel/batched tool calls. Worth either:
- documenting the "one tool result per step" assumption on
record(), or - gating the nudge/halt escalation on distinct
stepvalues (only advance the ladder once per model step) so a single fanned-out batch can't skip the nudge and self-halt.
Not blocking, but please confirm the intended behaviour when a turn issues several failing calls at once.
2. [minor / question] Hard-reject path never nudges
no_progress.rs:172
HARD_REJECT_HALT_THRESHOLD == 2 == IDENTICAL_NUDGE_THRESHOLD, and the halt checks run before the nudge checks — so a repeated [policy-blocked] call halts at the 2nd identical repeat with no intervening nudge. That looks intentional (a policy block re-issued unchanged can't succeed) and matches legacy, but it's a silent asymmetry with the rest of the ladder. A one-line comment on the ordering ("hard rejects intentionally skip the nudge") would prevent a future reader from "fixing" it.
3. [minor] hard_reject only covers SecurityPolicy markers, not tool-policy / channel denials
middleware.rs:1150
hard_reject keys off POLICY_BLOCKED_MARKER ([policy-blocked]), which is emitted by SecurityPolicy (command/path/read-only checks). The ToolPolicyMiddleware channel-permission / tool-policy denials do not carry that marker, so a model looping on a channel-permission block takes the slower generic identical-repeat halt (3) rather than the fast hard-reject halt (2). Pre-existing and arguably fine — flagging only because sibling #4390 restructures exactly those denial messages (PolicyDenial::render() → "Blocked: … Reason: …"), so it's a good moment to decide whether tool-policy denials should also be treated as hard rejects.
4. [minor] Sibling-PR overlap — likely merge conflict with #4390 in the same file
middleware.rs
#4390 also edits tinyagents/middleware.rs, but in the ToolPolicyMiddleware region (~L662–870) and adds a use at L38 — no conflict with #4389's RepeatedToolFailureMiddleware region (~L1052–1200). However, both PRs add new tests into the same mod tests block: #4390 at @@ -1577,+82 and #4389's rewrite at @@ -1596,…. Whichever merges second will almost certainly hit a conflict there. Worth coordinating merge order (this and #4390 are otherwise logically independent). #4386/#4387 touch agent/harness/** only — no overlap.
5. [nitpick] NoProgressTracker::reset() is pub but currently unused
no_progress.rs:120
record() self-resets on halt, so nothing calls reset(). The module doc says it's a forward API for #4090/#4092, which is reasonable — just calling it out so it isn't mistaken for dead code (it won't warn since it's pub).
Questions for the author
- Batched tool calls (finding #1): is the ladder meant to advance per-tool-result or per-model-step? Today it's per-result, which can bypass the nudge.
- A transiently-flaky tool that returns the identical error twice then would succeed on the 3rd try is nudged at 2 and halted at 3 — the halt pre-empts the successful retry. This matches legacy halt timing (also 3), so no change here, but is the identical-error-retry case common enough in practice to warrant an escape hatch (e.g. not counting known-retriable transient errors)?
Looks good
- Extraction into a harness-type-free, independently unit-tested primitive is the right call and reads cleanly; the threshold constants are well-documented.
- Halt timings preserve legacy
ProgressGuardparity; nudges are purely additive one step earlier. - Success-resets and changed-argument-resets are covered by tests;
nudged_sig/nudged_streakcorrectly bound each corrective to once-per-signature / once-per-streak. - The Codex fix (system vs. user message) is correct and has a targeted regression test; I confirmed
messages_since_last_userbehaviour backs it up. - Clamping
identical_halt_thresholdto> IDENTICAL_NUDGE_THRESHOLDguarantees a nudge always precedes a halt, with a test for the degeneratenew(1)case.
Read-only review — no code changed.
sanil-23
left a comment
There was a problem hiding this comment.
Approving with comments. The escalation ladder is well-factored, halt timings preserve legacy parity, and the Codex system-vs-user-message fix is correct (verified against messages_since_last_user). Nothing here blocks merge.
One design question worth a look before/after merge (non-blocking): batched tool calls in a single model response advance the ladder per-tool-result while model_calls stays constant, so a fanned-out batch of failing calls can reach Halt without the queued nudge ever being applied — see finding #1 in my detailed review comment. The remaining items are minor/nitpick.
Also flagging the likely mod tests merge conflict with sibling PR #4390 (both add tests to the same block) — coordinate merge order.
Closes #4089
Problem
When a tool call fails or returns nothing useful, agents retry the same strategy (same tool, same arguments) rather than adapting. The turn loop (now on the
tinyagentsharness, #4249) had onlyRepeatedToolFailureMiddleware, which jumped straight from repeated failure to a hard halt — it never fed a "no progress" signal back so the model could change approach first.Enforcement approach
A new reusable no-progress primitive —
NoProgressTracker(src/openhuman/tinyagents/no_progress.rs) — tracks recent(tool, args) → outcomeand returns aContinue/Nudge(signal)/Halt(summary)verdict. It's harness-agnostic (no tinyagents types) so it's unit-testable and reusable by the follow-up reliability work (#4090 self-suspend poll loops, #4092 escalate-after-N) as the foundational primitive.Escalation ladder — caps same-strategy retries before forcing an alternative:
"[no progress since step X] … Do NOT repeat it. Change strategy: use a different tool or arguments (enumerate the directory first / correct the query), or report back"RepeatedToolFailureMiddlewarebecomes a thin driver: it builds aToolAttemptfrom each tool result (+ the arg fingerprint captured inbefore_tool), feeds it to the tracker at loop stepctx.limits.model_calls(), and translates the verdict onto the shared steering handle —Nudge → SteeringCommand::InjectMessage(applied at the next steering checkpoint, before the next model call, same path asforward_collects);Halt → halt summary + Pause(unchanged behavior). Constructor signature is unchanged.This satisfies the issue's acceptance criteria:
InjectMessage).Tests
RepeatedToolFailureMiddlewaretests to assert theInjectMessage(nudge) →Pause(halt) escalation.cargo check --libpasses locally (pre-existing warnings only). Relying on CI for the full test/coverage run.Caveats
upstream/mainif one of those merges first.