Skip to content

fix(harness): change strategy on repeated unproductive tool results (#4089)#4389

Open
M3gA-Mind wants to merge 3 commits into
tinyhumansai:mainfrom
M3gA-Mind:fix/GH-4089-no-progress-strategy
Open

fix(harness): change strategy on repeated unproductive tool results (#4089)#4389
M3gA-Mind wants to merge 3 commits into
tinyhumansai:mainfrom
M3gA-Mind:fix/GH-4089-no-progress-strategy

Conversation

@M3gA-Mind

Copy link
Copy Markdown
Collaborator

Closes #4089

Problem

When a tool call fails or returns nothing useful, agents retry the same strategy (same tool, same arguments) rather than adapting. The turn loop (now on the tinyagents harness, #4249) had only RepeatedToolFailureMiddleware, which jumped straight from repeated failure to a hard halt — it never fed a "no progress" signal back so the model could change approach first.

Enforcement approach

A new reusable no-progress primitiveNoProgressTracker (src/openhuman/tinyagents/no_progress.rs) — tracks recent (tool, args) → outcome and returns a Continue / Nudge(signal) / Halt(summary) verdict. It's harness-agnostic (no tinyagents types) so it's unit-testable and reusable by the follow-up reliability work (#4090 self-suspend poll loops, #4092 escalate-after-N) as the foundational primitive.

Escalation ladder — caps same-strategy retries before forcing an alternative:

Condition Verdict
identical (tool+args+error) repeat == 2 Nudge — inject "[no progress since step X] … Do NOT repeat it. Change strategy: use a different tool or arguments (enumerate the directory first / correct the query), or report back"
identical repeat ≥ 3 Halt (retries exhausted)
varied consecutive failures == 4 Nudge (step back)
varied consecutive failures ≥ 6 Halt (backstop)
hard policy-reject ≥ 2 Halt
any success reset all counters

RepeatedToolFailureMiddleware becomes a thin driver: it builds a ToolAttempt from each tool result (+ the arg fingerprint captured in before_tool), feeds it to the tracker at loop step ctx.limits.model_calls(), and translates the verdict onto the shared steering handle — Nudge → SteeringCommand::InjectMessage (applied at the next steering checkpoint, before the next model call, same path as forward_collects); Halt → halt summary + Pause (unchanged behavior). Constructor signature is unchanged.

This satisfies the issue's acceptance criteria:

  • ✅ Feed a structured "no progress since step X" signal back into the loop (the nudge InjectMessage).
  • ✅ On repeated failure, require a different next action than the last (the nudge text mandates a change of tool/args).
  • ✅ Cap same-strategy retries before forcing an alternative or handoff (nudge at N-1, halt at N).

Tests

  • 7 unit tests on the tracker: nudge-then-halt, success reset, changed-args clears the identical streak, hard-reject halt, varied-failure nudge→backstop, unknown-tool exclusion from the backstop, halt-threshold clamp.
  • Updated the 3 RepeatedToolFailureMiddleware tests to assert the InjectMessage(nudge) → Pause(halt) escalation.

cargo check --lib passes locally (pre-existing warnings only). Relying on CI for the full test/coverage run.

Caveats

…inyhumansai#4089)

Agents that hit an unproductive tool result tended to retry the same
strategy (identical tool + args + error) instead of adapting. The turn
loop's only no-progress handling was RepeatedToolFailureMiddleware, which
went straight from repeated failure to a hard halt — it never fed a signal
back so the model could change approach.

Add a reusable no-progress primitive (NoProgressTracker) that tracks recent
(tool, args) -> outcome and returns a Continue / Nudge / Halt verdict via an
escalation ladder that caps same-strategy retries before forcing an
alternative:

  - identical repeat == 2 -> Nudge: inject a structured "no progress since
    step X" corrective as a user turn so the model uses a different tool or
    arguments (enumerate before reading, correct the query) or reports back
  - identical repeat >= 3 -> Halt (retries exhausted)
  - varied failures == 4 -> Nudge; >= 6 -> Halt backstop
  - hard policy-reject >= 2 -> Halt; any success resets

RepeatedToolFailureMiddleware becomes a thin driver: it feeds each tool
outcome into the tracker and translates the verdict onto the shared steering
handle (Nudge -> InjectMessage fed back before the next model call; Halt ->
halt summary + Pause). The primitive is harness-agnostic and unit-tested so
issues tinyhumansai#4090/tinyhumansai#4092 can layer on it.

Add unit tests for the tracker (nudge-then-halt, success reset, changed-args
clears the streak, hard-reject, varied backstop, unknown-tool exclusion,
threshold clamp) and update the middleware tests to assert the
InjectMessage(nudge) -> Pause(halt) escalation.
@M3gA-Mind M3gA-Mind requested a review from a team July 1, 2026 18:10
@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 24 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a2dc0209-bbe5-40a6-aced-f31db2552147

📥 Commits

Reviewing files that changed from the base of the PR and between 4c98a31 and 0acdc8f.

📒 Files selected for processing (3)
  • src/openhuman/tinyagents/middleware.rs
  • src/openhuman/tinyagents/mod.rs
  • src/openhuman/tinyagents/no_progress.rs

Comment @coderabbitai help to get the list of available commands.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7508a3c840

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/openhuman/tinyagents/middleware.rs Outdated
M3gA-Mind added 2 commits July 1, 2026 23:49
…umansai#4089)

The nudge fed a User-role message into the loop. Turn-outcome extraction
slices the persisted suffix after the last user message
(convert::messages_since_last_user), so on a run that nudges and then
recovers instead of halting, every assistant/tool cycle before the nudge
was dropped from the persisted conversation and the post-turn tool records.

Inject the corrective as a system (scaffold) message instead: it still
reaches the model at the next steering checkpoint, but the original user
turn stays the persistence boundary so the full turn suffix is preserved.
Add a regression test asserting the nudge is a system message.
@M3gA-Mind

Copy link
Copy Markdown
Collaborator Author

Addressed the review feedback:

Codex P2 — nudge dropped pre-nudge tool cycles from the persisted turn (middleware.rs): the no-progress nudge was injected as a User message, and convert::messages_since_last_user slices the persisted turn suffix after the last user message — so on a run that nudged and then recovered (rather than halting), every assistant/tool cycle before the nudge was omitted from outcome.conversation and the post-turn tool records. Fixed by injecting the corrective as a system (scaffold) message instead of a user turn (0acdc8f). It still reaches the model at the next steering checkpoint, but the original user message stays the turn's persistence boundary, so the full suffix is preserved. Added a regression test (the_nudge_is_a_system_message_so_it_does_not_move_the_user_boundary).

Also pushed a rustfmt-only commit (3f3cdf4) to satisfy the Rust Quality lane.

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor
🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
cd /tmp || exit
rm -rf openhuman_repo
git clone --depth 50 https://github.com/tinyhumansai/openhuman.git openhuman_repo 2>&1 | tail -20
cd openhuman_repo
gh pr checkout 4389 2>&1 | tail -20 || git fetch origin pull/4389/head:pr-4389 && git checkout pr-4389
git log --oneline -10
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>

fd no_progress.rs
fd middleware.rs
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>

Length of output: 2031


🏁 Script executed:

#!/bin/bash
cd /tmp/openhuman_repo || exit
fd no_progress.rs
fd middleware.rs
git show --stat 0acdc8f | head -30
</br>

Length of output: 1595


🏁 Script executed:

#!/bin/bash
cd /tmp/openhuman_repo || exit
git show 0acdc8f -- src/openhuman/tinyagents/middleware.rs

Length of output: 4581


🏁 Script executed:

#!/bin/bash
cd /tmp/openhuman_repo || exit
rg -n "fn system" src/openhuman/tinyagents/*.rs
rg -n "messages_since_last_user" -A 15 src/openhuman/tinyagents/convert.rs | head -60
rg -n "enum TaMessage" -A 15 src/openhuman/tinyagents/*.rs

Length of output: 2118


🏁 Script executed:

#!/bin/bash
cd /tmp/openhuman_repo || exit
rg -n "fn system\b|fn user\b|enum Message\b" src/openhuman/tinyagents/*.rs -C2 | head -60

Length of output: 271


@M3gA-Mind Thanks for the detailed writeup — I reviewed commit 0acdc8f.

The fix is sound. I confirmed in convert.rs that messages_since_last_user (Line 311) does an rposition search specifically for Message::User(_) and slices the suffix after it:

pub(super) fn messages_since_last_user(messages: &[Message]) -> &[Message] {
    let start = messages
        .iter()
        .rposition(|m| matches!(m, Message::User(_)))
        .map(|i| i + 1)
        .unwrap_or(0);
    &messages[start..]
}

Since this only matches on the User variant, switching the nudge injection from TaMessage::user(signal) to TaMessage::system(signal) correctly avoids shifting the turn's persistence boundary — the original user message stays the anchor, and the previously-dropped assistant/tool cycles before the nudge will now be retained in outcome.conversation on recover-then-continue runs. The new regression test (the_nudge_is_a_system_message_so_it_does_not_move_the_user_boundary) directly asserts the injected message is TaMessage::System(_) and carries the "no progress since step" text, which is the right guard against regressing this again.

Nothing else stands out from this delta — the change is minimal and well-targeted. I'll go ahead and do a full pass over the PR now.

(⌐■_■) nice catch-and-fix loop

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@sanil-23 sanil-23 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: fix(harness): change strategy on repeated unproductive tool results (#4089)

Walkthrough

This extracts the repeated-tool-failure logic out of RepeatedToolFailureMiddleware into a new, harness-free no_progress::NoProgressTracker and adds an escalation ladder: instead of only halting, a repeated-identical (or piling-up-varied) failure now first injects a structured "no progress since step X" nudge as a system scaffold message so the model can change strategy, and only halts once the same-strategy retries are exhausted. Halt timings are preserved at legacy parity (identical → 3, varied backstop → 6, hard-reject → 2), with nudges landing one step earlier (identical → 2, varied → 4). The refactor is clean, the primitive is nicely isolated + unit-tested, and the Codex feedback (system vs. user message so the persistence boundary doesn't move) was correctly addressed. Overall this is close to mergeable — the findings below are mostly design questions, not defects.

I verified the Codex fix is correct: convert::messages_since_last_user slices on rposition(Message::User), so a System nudge genuinely preserves the original user turn as the boundary. Good regression test.

Changed files

File Change
src/openhuman/tinyagents/no_progress.rs New. NoProgressTracker + NoProgress verdict ladder, thresholds, nudge/halt copy, unit tests.
src/openhuman/tinyagents/middleware.rs RepeatedToolFailureMiddleware now drives the tracker; translates Nudge → InjectMessage(system) / Halt → Pause. Tests rewritten.
src/openhuman/tinyagents/mod.rs Registers no_progress module; REPEATED_TOOL_FAILURE_THRESHOLD now aliases DEFAULT_IDENTICAL_HALT_THRESHOLD; wiring comment updated.

Findings

1. [major / question · medium confidence] Batched tool calls in one model response can halt before the nudge is ever seen

middleware.rs:1167 no_progress.rs:128

In tinyagents' agent loop, all tool calls from a single model response are executed serially in one for loop, and after_tool fires per call while model_calls (your step) stays constant for the whole batch. Steering commands are only drained at the next checkpoint (top of the next loop iteration).

Consequence: if a model emits ≥3 identical-failing calls (or ≥6 varied-failing calls) in a single response, same_count/consecutive climb across that one batch and reach NoProgress::Halt in the same model step — and any nudge queued mid-batch via InjectMessage isn't applied until after the whole batch has run. So the "change strategy on your next step" corrective can't influence the batch that tripped it, and [no progress since step X] repeats the same X.

The halt-in-one-batch behaviour predates this PR (the legacy counters worked the same way), so this isn't a regression — but the new nudge is premised on sequential retries, and that premise doesn't hold for parallel/batched tool calls. Worth either:

  • documenting the "one tool result per step" assumption on record(), or
  • gating the nudge/halt escalation on distinct step values (only advance the ladder once per model step) so a single fanned-out batch can't skip the nudge and self-halt.

Not blocking, but please confirm the intended behaviour when a turn issues several failing calls at once.

2. [minor / question] Hard-reject path never nudges

no_progress.rs:172

HARD_REJECT_HALT_THRESHOLD == 2 == IDENTICAL_NUDGE_THRESHOLD, and the halt checks run before the nudge checks — so a repeated [policy-blocked] call halts at the 2nd identical repeat with no intervening nudge. That looks intentional (a policy block re-issued unchanged can't succeed) and matches legacy, but it's a silent asymmetry with the rest of the ladder. A one-line comment on the ordering ("hard rejects intentionally skip the nudge") would prevent a future reader from "fixing" it.

3. [minor] hard_reject only covers SecurityPolicy markers, not tool-policy / channel denials

middleware.rs:1150

hard_reject keys off POLICY_BLOCKED_MARKER ([policy-blocked]), which is emitted by SecurityPolicy (command/path/read-only checks). The ToolPolicyMiddleware channel-permission / tool-policy denials do not carry that marker, so a model looping on a channel-permission block takes the slower generic identical-repeat halt (3) rather than the fast hard-reject halt (2). Pre-existing and arguably fine — flagging only because sibling #4390 restructures exactly those denial messages (PolicyDenial::render()"Blocked: … Reason: …"), so it's a good moment to decide whether tool-policy denials should also be treated as hard rejects.

4. [minor] Sibling-PR overlap — likely merge conflict with #4390 in the same file

middleware.rs

#4390 also edits tinyagents/middleware.rs, but in the ToolPolicyMiddleware region (~L662–870) and adds a use at L38 — no conflict with #4389's RepeatedToolFailureMiddleware region (~L1052–1200). However, both PRs add new tests into the same mod tests block: #4390 at @@ -1577,+82 and #4389's rewrite at @@ -1596,…. Whichever merges second will almost certainly hit a conflict there. Worth coordinating merge order (this and #4390 are otherwise logically independent). #4386/#4387 touch agent/harness/** only — no overlap.

5. [nitpick] NoProgressTracker::reset() is pub but currently unused

no_progress.rs:120

record() self-resets on halt, so nothing calls reset(). The module doc says it's a forward API for #4090/#4092, which is reasonable — just calling it out so it isn't mistaken for dead code (it won't warn since it's pub).


Questions for the author

  • Batched tool calls (finding #1): is the ladder meant to advance per-tool-result or per-model-step? Today it's per-result, which can bypass the nudge.
  • A transiently-flaky tool that returns the identical error twice then would succeed on the 3rd try is nudged at 2 and halted at 3 — the halt pre-empts the successful retry. This matches legacy halt timing (also 3), so no change here, but is the identical-error-retry case common enough in practice to warrant an escape hatch (e.g. not counting known-retriable transient errors)?

Looks good

  • Extraction into a harness-type-free, independently unit-tested primitive is the right call and reads cleanly; the threshold constants are well-documented.
  • Halt timings preserve legacy ProgressGuard parity; nudges are purely additive one step earlier.
  • Success-resets and changed-argument-resets are covered by tests; nudged_sig/nudged_streak correctly bound each corrective to once-per-signature / once-per-streak.
  • The Codex fix (system vs. user message) is correct and has a targeted regression test; I confirmed messages_since_last_user behaviour backs it up.
  • Clamping identical_halt_threshold to > IDENTICAL_NUDGE_THRESHOLD guarantees a nudge always precedes a halt, with a test for the degenerate new(1) case.

Read-only review — no code changed.

@sanil-23 sanil-23 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with comments. The escalation ladder is well-factored, halt timings preserve legacy parity, and the Codex system-vs-user-message fix is correct (verified against messages_since_last_user). Nothing here blocks merge.

One design question worth a look before/after merge (non-blocking): batched tool calls in a single model response advance the ladder per-tool-result while model_calls stays constant, so a fanned-out batch of failing calls can reach Halt without the queued nudge ever being applied — see finding #1 in my detailed review comment. The remaining items are minor/nitpick.

Also flagging the likely mod tests merge conflict with sibling PR #4390 (both add tests to the same block) — coordinate merge order.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Harness] Change strategy after repeated unproductive results instead of retrying

2 participants