fix(harness): change strategy on repeated unproductive tool results (#4089) by M3gA-Mind · Pull Request #4389 · tinyhumansai/openhuman

M3gA-Mind · 2026-07-01T18:10:04Z

Problem

When a tool call fails or returns nothing useful, agents retry the same strategy (same tool, same arguments) rather than adapting. The turn loop (now on the tinyagents harness, #4249) had only RepeatedToolFailureMiddleware, which jumped straight from repeated failure to a hard halt — it never fed a "no progress" signal back so the model could change approach first.

Enforcement approach

A new reusable no-progress primitive — NoProgressTracker (src/openhuman/tinyagents/no_progress.rs) — tracks recent (tool, args) → outcome and returns a Continue / Nudge(signal) / Halt(summary) verdict. It's harness-agnostic (no tinyagents types) so it's unit-testable and reusable by the follow-up reliability work (#4090 self-suspend poll loops, #4092 escalate-after-N) as the foundational primitive.

Escalation ladder — caps same-strategy retries before forcing an alternative:

Condition	Verdict
identical (tool+args+error) repeat == 2	Nudge — inject `"[no progress since step X] … Do NOT repeat it. Change strategy: use a different tool or arguments (enumerate the directory first / correct the query), or report back"`
identical repeat ≥ 3	Halt (retries exhausted)
varied consecutive failures == 4	Nudge (step back)
varied consecutive failures ≥ 6	Halt (backstop)
hard policy-reject ≥ 2	Halt
any success	reset all counters

RepeatedToolFailureMiddleware becomes a thin driver: it builds a ToolAttempt from each tool result (+ the arg fingerprint captured in before_tool), feeds it to the tracker at loop step ctx.limits.model_calls(), and translates the verdict onto the shared steering handle — Nudge → SteeringCommand::InjectMessage (applied at the next steering checkpoint, before the next model call, same path as forward_collects); Halt → halt summary + Pause (unchanged behavior). Constructor signature is unchanged.

This satisfies the issue's acceptance criteria:

✅ Feed a structured "no progress since step X" signal back into the loop (the nudge InjectMessage).
✅ On repeated failure, require a different next action than the last (the nudge text mandates a change of tool/args).
✅ Cap same-strategy retries before forcing an alternative or handoff (nudge at N-1, halt at N).

Tests

7 unit tests on the tracker: nudge-then-halt, success reset, changed-args clears the identical streak, hard-reject halt, varied-failure nudge→backstop, unknown-tool exclusion from the backstop, halt-threshold clamp.
Updated the 3 RepeatedToolFailureMiddleware tests to assert the InjectMessage(nudge) → Pause(halt) escalation.

cargo check --lib passes locally (pre-existing warnings only). Relying on CI for the full test/coverage run.

Caveats

The tracker resets on any tool success, so a genuinely varied, progressing multi-tool turn is never nudged/halted.
Collision note: [Harness] Always emit a final answer/summary before a turn ends #4093 and [Harness] Required structured output (JSON 'thoughts' block) frequently omitted #4117 also touch the harness turn engine — will rebase onto upstream/main if one of those merges first.

…inyhumansai#4089) Agents that hit an unproductive tool result tended to retry the same strategy (identical tool + args + error) instead of adapting. The turn loop's only no-progress handling was RepeatedToolFailureMiddleware, which went straight from repeated failure to a hard halt — it never fed a signal back so the model could change approach. Add a reusable no-progress primitive (NoProgressTracker) that tracks recent (tool, args) -> outcome and returns a Continue / Nudge / Halt verdict via an escalation ladder that caps same-strategy retries before forcing an alternative: - identical repeat == 2 -> Nudge: inject a structured "no progress since step X" corrective as a user turn so the model uses a different tool or arguments (enumerate before reading, correct the query) or reports back - identical repeat >= 3 -> Halt (retries exhausted) - varied failures == 4 -> Nudge; >= 6 -> Halt backstop - hard policy-reject >= 2 -> Halt; any success resets RepeatedToolFailureMiddleware becomes a thin driver: it feeds each tool outcome into the tracker and translates the verdict onto the shared steering handle (Nudge -> InjectMessage fed back before the next model call; Halt -> halt summary + Pause). The primitive is harness-agnostic and unit-tested so issues tinyhumansai#4090/tinyhumansai#4092 can layer on it. Add unit tests for the tracker (nudge-then-halt, success reset, changed-args clears the streak, hard-reject, varied backstop, unknown-tool exclusion, threshold clamp) and update the middleware tests to assert the InjectMessage(nudge) -> Pause(halt) escalation.

coderabbitai · 2026-07-01T18:10:14Z

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 24 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a2dc0209-bbe5-40a6-aced-f31db2552147

📥 Commits

Reviewing files that changed from the base of the PR and between 4c98a31 and 0acdc8f.

📒 Files selected for processing (3)

src/openhuman/tinyagents/middleware.rs
src/openhuman/tinyagents/mod.rs
src/openhuman/tinyagents/no_progress.rs

_{Comment @coderabbitai help to get the list of available commands.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7508a3c840

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

…umansai#4089) The nudge fed a User-role message into the loop. Turn-outcome extraction slices the persisted suffix after the last user message (convert::messages_since_last_user), so on a run that nudges and then recovers instead of halting, every assistant/tool cycle before the nudge was dropped from the persisted conversation and the post-turn tool records. Inject the corrective as a system (scaffold) message instead: it still reaches the model at the next steering checkpoint, but the original user turn stays the persistence boundary so the full turn suffix is preserved. Add a regression test asserting the nudge is a system message.

M3gA-Mind · 2026-07-01T18:44:17Z

Addressed the review feedback:

Codex P2 — nudge dropped pre-nudge tool cycles from the persisted turn (middleware.rs): the no-progress nudge was injected as a User message, and convert::messages_since_last_user slices the persisted turn suffix after the last user message — so on a run that nudged and then recovered (rather than halting), every assistant/tool cycle before the nudge was omitted from outcome.conversation and the post-turn tool records. Fixed by injecting the corrective as a system (scaffold) message instead of a user turn (0acdc8f). It still reaches the model at the next steering checkpoint, but the original user message stays the turn's persistence boundary, so the full suffix is preserved. Added a regression test (the_nudge_is_a_system_message_so_it_does_not_move_the_user_boundary).

Also pushed a rustfmt-only commit (3f3cdf4) to satisfy the Rust Quality lane.

@coderabbitai review

coderabbitai · 2026-07-01T18:45:27Z

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
cd /tmp || exit
rm -rf openhuman_repo
git clone --depth 50 https://github.com/tinyhumansai/openhuman.git openhuman_repo 2>&1 | tail -20
cd openhuman_repo
gh pr checkout 4389 2>&1 | tail -20 || git fetch origin pull/4389/head:pr-4389 && git checkout pr-4389
git log --oneline -10
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>

fd no_progress.rs
fd middleware.rs
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>

Length of output: 2031

🏁 Script executed:

#!/bin/bash
cd /tmp/openhuman_repo || exit
fd no_progress.rs
fd middleware.rs
git show --stat 0acdc8f | head -30
</br>

Length of output: 1595

🏁 Script executed:

#!/bin/bash
cd /tmp/openhuman_repo || exit
git show 0acdc8f -- src/openhuman/tinyagents/middleware.rs

Length of output: 4581

🏁 Script executed:

#!/bin/bash
cd /tmp/openhuman_repo || exit
rg -n "fn system" src/openhuman/tinyagents/*.rs
rg -n "messages_since_last_user" -A 15 src/openhuman/tinyagents/convert.rs | head -60
rg -n "enum TaMessage" -A 15 src/openhuman/tinyagents/*.rs

Length of output: 2118

🏁 Script executed:

#!/bin/bash
cd /tmp/openhuman_repo || exit
rg -n "fn system\b|fn user\b|enum Message\b" src/openhuman/tinyagents/*.rs -C2 | head -60

Length of output: 271

@M3gA-Mind Thanks for the detailed writeup — I reviewed commit 0acdc8f.

The fix is sound. I confirmed in convert.rs that messages_since_last_user (Line 311) does an rposition search specifically for Message::User(_) and slices the suffix after it:

pub(super) fn messages_since_last_user(messages: &[Message]) -> &[Message] {
    let start = messages
        .iter()
        .rposition(|m| matches!(m, Message::User(_)))
        .map(|i| i + 1)
        .unwrap_or(0);
    &messages[start..]
}

Since this only matches on the User variant, switching the nudge injection from TaMessage::user(signal) to TaMessage::system(signal) correctly avoids shifting the turn's persistence boundary — the original user message stays the anchor, and the previously-dropped assistant/tool cycles before the nudge will now be retained in outcome.conversation on recover-then-continue runs. The new regression test (the_nudge_is_a_system_message_so_it_does_not_move_the_user_boundary) directly asserts the injected message is TaMessage::System(_) and carries the "no progress since step" text, which is the right guard against regressing this again.

Nothing else stands out from this delta — the change is minimal and well-targeted. I'll go ahead and do a full pass over the PR now.

(⌐■_■) nice catch-and-fix loop

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

sanil-23

Review: `fix(harness): change strategy on repeated unproductive tool results (#4089)`

Walkthrough

This extracts the repeated-tool-failure logic out of RepeatedToolFailureMiddleware into a new, harness-free no_progress::NoProgressTracker and adds an escalation ladder: instead of only halting, a repeated-identical (or piling-up-varied) failure now first injects a structured "no progress since step X" nudge as a system scaffold message so the model can change strategy, and only halts once the same-strategy retries are exhausted. Halt timings are preserved at legacy parity (identical → 3, varied backstop → 6, hard-reject → 2), with nudges landing one step earlier (identical → 2, varied → 4). The refactor is clean, the primitive is nicely isolated + unit-tested, and the Codex feedback (system vs. user message so the persistence boundary doesn't move) was correctly addressed. Overall this is close to mergeable — the findings below are mostly design questions, not defects.

I verified the Codex fix is correct: convert::messages_since_last_user slices on rposition(Message::User), so a System nudge genuinely preserves the original user turn as the boundary. Good regression test.

Changed files

File	Change
`src/openhuman/tinyagents/no_progress.rs`	New. `NoProgressTracker` + `NoProgress` verdict ladder, thresholds, nudge/halt copy, unit tests.
`src/openhuman/tinyagents/middleware.rs`	`RepeatedToolFailureMiddleware` now drives the tracker; translates `Nudge → InjectMessage(system)` / `Halt → Pause`. Tests rewritten.
`src/openhuman/tinyagents/mod.rs`	Registers `no_progress` module; `REPEATED_TOOL_FAILURE_THRESHOLD` now aliases `DEFAULT_IDENTICAL_HALT_THRESHOLD`; wiring comment updated.

Findings

1. [major / question · medium confidence] Batched tool calls in one model response can halt before the nudge is ever seen

middleware.rs:1167 no_progress.rs:128

In tinyagents' agent loop, all tool calls from a single model response are executed serially in one for loop, and after_tool fires per call while model_calls (your step) stays constant for the whole batch. Steering commands are only drained at the next checkpoint (top of the next loop iteration).

Consequence: if a model emits ≥3 identical-failing calls (or ≥6 varied-failing calls) in a single response, same_count/consecutive climb across that one batch and reach NoProgress::Halt in the same model step — and any nudge queued mid-batch via InjectMessage isn't applied until after the whole batch has run. So the "change strategy on your next step" corrective can't influence the batch that tripped it, and [no progress since step X] repeats the same X.

The halt-in-one-batch behaviour predates this PR (the legacy counters worked the same way), so this isn't a regression — but the new nudge is premised on sequential retries, and that premise doesn't hold for parallel/batched tool calls. Worth either:

documenting the "one tool result per step" assumption on record(), or
gating the nudge/halt escalation on distinct step values (only advance the ladder once per model step) so a single fanned-out batch can't skip the nudge and self-halt.

Not blocking, but please confirm the intended behaviour when a turn issues several failing calls at once.

2. [minor / question] Hard-reject path never nudges

no_progress.rs:172

HARD_REJECT_HALT_THRESHOLD == 2 == IDENTICAL_NUDGE_THRESHOLD, and the halt checks run before the nudge checks — so a repeated [policy-blocked] call halts at the 2nd identical repeat with no intervening nudge. That looks intentional (a policy block re-issued unchanged can't succeed) and matches legacy, but it's a silent asymmetry with the rest of the ladder. A one-line comment on the ordering ("hard rejects intentionally skip the nudge") would prevent a future reader from "fixing" it.

3. [minor] `hard_reject` only covers `SecurityPolicy` markers, not tool-policy / channel denials

middleware.rs:1150

hard_reject keys off POLICY_BLOCKED_MARKER ([policy-blocked]), which is emitted by SecurityPolicy (command/path/read-only checks). The ToolPolicyMiddleware channel-permission / tool-policy denials do not carry that marker, so a model looping on a channel-permission block takes the slower generic identical-repeat halt (3) rather than the fast hard-reject halt (2). Pre-existing and arguably fine — flagging only because sibling #4390 restructures exactly those denial messages (PolicyDenial::render() → "Blocked: … Reason: …"), so it's a good moment to decide whether tool-policy denials should also be treated as hard rejects.

4. [minor] Sibling-PR overlap — likely merge conflict with #4390 in the same file

middleware.rs

#4390 also edits tinyagents/middleware.rs, but in the ToolPolicyMiddleware region (~L662–870) and adds a use at L38 — no conflict with #4389's RepeatedToolFailureMiddleware region (~L1052–1200). However, both PRs add new tests into the same mod tests block: #4390 at @@ -1577,+82 and #4389's rewrite at @@ -1596,…. Whichever merges second will almost certainly hit a conflict there. Worth coordinating merge order (this and #4390 are otherwise logically independent). #4386/#4387 touch agent/harness/** only — no overlap.

5. [nitpick] `NoProgressTracker::reset()` is `pub` but currently unused

no_progress.rs:120

record() self-resets on halt, so nothing calls reset(). The module doc says it's a forward API for #4090/#4092, which is reasonable — just calling it out so it isn't mistaken for dead code (it won't warn since it's pub).

Questions for the author

Batched tool calls (finding #1): is the ladder meant to advance per-tool-result or per-model-step? Today it's per-result, which can bypass the nudge.
A transiently-flaky tool that returns the identical error twice then would succeed on the 3rd try is nudged at 2 and halted at 3 — the halt pre-empts the successful retry. This matches legacy halt timing (also 3), so no change here, but is the identical-error-retry case common enough in practice to warrant an escape hatch (e.g. not counting known-retriable transient errors)?

Looks good

Extraction into a harness-type-free, independently unit-tested primitive is the right call and reads cleanly; the threshold constants are well-documented.
Halt timings preserve legacy ProgressGuard parity; nudges are purely additive one step earlier.
Success-resets and changed-argument-resets are covered by tests; nudged_sig/nudged_streak correctly bound each corrective to once-per-signature / once-per-streak.
The Codex fix (system vs. user message) is correct and has a targeted regression test; I confirmed messages_since_last_user behaviour backs it up.
Clamping identical_halt_threshold to > IDENTICAL_NUDGE_THRESHOLD guarantees a nudge always precedes a halt, with a test for the degenerate new(1) case.

Read-only review — no code changed.

sanil-23

Approving with comments. The escalation ladder is well-factored, halt timings preserve legacy parity, and the Codex system-vs-user-message fix is correct (verified against messages_since_last_user). Nothing here blocks merge.

One design question worth a look before/after merge (non-blocking): batched tool calls in a single model response advance the ladder per-tool-result while model_calls stays constant, so a fanned-out batch of failing calls can reach Halt without the queued nudge ever being applied — see finding #1 in my detailed review comment. The remaining items are minor/nitpick.

Also flagging the likely mod tests merge conflict with sibling PR #4390 (both add tests to the same block) — coordinate merge order.

M3gA-Mind requested a review from a team July 1, 2026 18:10

chatgpt-codex-connector Bot reviewed Jul 1, 2026

View reviewed changes

Comment thread src/openhuman/tinyagents/middleware.rs Outdated

M3gA-Mind added 2 commits July 1, 2026 23:49

style: rustfmt the tinyhumansai#4089 no-progress tests

3f3cdf4

coderabbitai Bot approved these changes Jul 2, 2026

View reviewed changes

This was referenced Jul 2, 2026

fix(tinyagents): keep attempted tool name in the timeline for unavailable tools #4419

Merged

fix(harness): always emit a final answer before a turn ends #4386

Open

fix(agent-harness): structured policy-block messages with workaround (#4094) #4390

Open

sanil-23 reviewed Jul 2, 2026

View reviewed changes

sanil-23 approved these changes Jul 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(harness): change strategy on repeated unproductive tool results (#4089)#4389

fix(harness): change strategy on repeated unproductive tool results (#4089)#4389
M3gA-Mind wants to merge 3 commits into
tinyhumansai:mainfrom
M3gA-Mind:fix/GH-4089-no-progress-strategy

M3gA-Mind commented Jul 1, 2026

Uh oh!

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading

Review limit reached

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

M3gA-Mind commented Jul 1, 2026

Uh oh!

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

sanil-23 left a comment

Uh oh!

sanil-23 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

M3gA-Mind commented Jul 1, 2026

Problem

Enforcement approach

Tests

Caveats

Uh oh!

coderabbitai Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

M3gA-Mind commented Jul 1, 2026

Uh oh!

coderabbitai Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sanil-23 left a comment

Choose a reason for hiding this comment

Review: fix(harness): change strategy on repeated unproductive tool results (#4089)

Walkthrough

Changed files

Findings

1. [major / question · medium confidence] Batched tool calls in one model response can halt before the nudge is ever seen

2. [minor / question] Hard-reject path never nudges

3. [minor] hard_reject only covers SecurityPolicy markers, not tool-policy / channel denials

4. [minor] Sibling-PR overlap — likely merge conflict with #4390 in the same file

5. [nitpick] NoProgressTracker::reset() is pub but currently unused

Questions for the author

Looks good

Uh oh!

sanil-23 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading

Review: `fix(harness): change strategy on repeated unproductive tool results (#4089)`

3. [minor] `hard_reject` only covers `SecurityPolicy` markers, not tool-policy / channel denials

5. [nitpick] `NoProgressTracker::reset()` is `pub` but currently unused