Skip to content

feat(agent): classify tool-call failures + surface on web event — rework of #4413 (backend)#4445

Merged
senamakel merged 3 commits into
tinyhumansai:mainfrom
M3gA-Mind:fix/GH-4254-tool-failures-backend-rework
Jul 3, 2026
Merged

feat(agent): classify tool-call failures + surface on web event — rework of #4413 (backend)#4445
senamakel merged 3 commits into
tinyhumansai:mainfrom
M3gA-Mind:fix/GH-4254-tool-failures-backend-rework

Conversation

@M3gA-Mind

@M3gA-Mind M3gA-Mind commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

Rework of #4413's backend onto post-#4399 main. Classifies a failed tool call into a user-facing explanation (plain-language cause + what to do next) and surfaces it on the tool_result socket event, so the chat "View processing" timeline can render "why / next" instead of a raw error.

Why a rework

#4399 (the TinyAgents harness migration) deleted the original classification hook (session/agent_tool_exec.rs) and the ProgressReporter trait, and moved the tool loop onto the tinyagents crate. So the hook is re-homed on the adapter seam.

Changes

  • New src/openhuman/tool_status/classify(error_text, timed_out) + describe(class)ClassifiedFailure (class, category, cause, next action). Pure, 16 unit tests.
  • failure field on AgentProgress::ToolCallCompleted and WebChannelEvent.
  • Core hook + latent-bug fix: the crate's AgentEvent::ToolCompleted carries only call_id + tool_name, so the bridge previously hardcoded success: true — a failed tool never surfaced. ToolOutcomeCaptureMiddleware::after_tool now classifies each result and writes a call_id → (success, failure) side-channel (mirroring the existing ToolNameMap); the bridge reads it to emit real success + failure (fallback (true, None) if absent).
  • progress_bridge serializes failure onto the socket event + run ledger.

Scope note — backend only

The frontend consumer (timeline render) + i18n keys are a companion. They rebase cleanly onto current main from #4413 (only the Rust hook conflicted with #4399); taking the whole i18n files here would revert unrelated locale drift on main, so they're intentionally excluded. See #4413.

Validation

Related

Summary by CodeRabbit

  • New Features

    • Tool failures now include structured, human-readable classification details when available.
    • Tool result and chat events can now carry optional failure information for downstream display.
    • Improved handling of tool completion events to preserve failure context across the app.
  • Bug Fixes

    • Better detection of common tool error cases such as timeouts, blocked actions, missing permissions, and unavailable services.
    • Event processing is now more tolerant of additional tool completion data.

@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 22 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3bfda4af-b218-4dfb-86c2-2c59da138c51

📥 Commits

Reviewing files that changed from the base of the PR and between 96b4414 and 72cf9d2.

📒 Files selected for processing (17)
  • src/bin/harness_subagent_audit.rs
  • src/core/socketio.rs
  • src/openhuman/agent/harness/session/tool_progress.rs
  • src/openhuman/agent/progress.rs
  • src/openhuman/agent/progress_tracing/tests.rs
  • src/openhuman/channels/proactive.rs
  • src/openhuman/channels/providers/presentation.rs
  • src/openhuman/channels/providers/web/progress_bridge.rs
  • src/openhuman/mod.rs
  • src/openhuman/threads/turn_state/mirror_tests.rs
  • src/openhuman/tinyagents/middleware.rs
  • src/openhuman/tinyagents/mod.rs
  • src/openhuman/tinyagents/observability.rs
  • src/openhuman/tool_status/mod.rs
  • src/openhuman/tool_status/ops.rs
  • src/openhuman/tool_status/types.rs
  • tests/memory_threads_raw_coverage_e2e.rs
📝 Walkthrough

Walkthrough

Introduces a new tool_status module with failure classification types and logic, adds an optional failure field to AgentProgress::ToolCallCompleted and WebChannelEvent, threads a ToolFailureMap side-channel through TinyAgents middleware and the observability bridge, and forwards classified failures into web-channel and ledger events.

Changes

Tool failure classification pipeline

Layer / File(s) Summary
Tool status classification types and logic
src/openhuman/tool_status/*, src/openhuman/mod.rs
New ToolLifecycleState, ToolFailureClass, FailureCategory, ClassifiedFailure types and classify/describe functions are added and exposed via a new public tool_status module with unit tests.
Progress event and WebChannelEvent failure field
src/openhuman/agent/progress.rs, src/core/socketio.rs
AgentProgress::ToolCallCompleted and WebChannelEvent gain an optional failure field, populated only on failure and omitted when absent.
Middleware captures failure classification
src/openhuman/tinyagents/middleware.rs
ToolOutcomeCaptureMiddleware now classifies tool errors (policy-blocked, timeout, or generic) and records success/failure per call_id into a new failure_map.
Observability bridge and harness wiring of failure_map
src/openhuman/tinyagents/observability.rs, src/openhuman/tinyagents/mod.rs
A shared ToolFailureMap is threaded through OpenhumanEventBridge and AssembledTurnHarness, and used to derive success/failure when projecting ToolCallCompleted events instead of always reporting success.
Web progress bridge forwards failure to socket events
src/openhuman/channels/providers/web/progress_bridge.rs
Serializes the classified failure and forwards it into ledger tool_call_completed payloads and emitted tool_result web events.
Downstream event constructors updated for new failure field
src/bin/harness_subagent_audit.rs, src/openhuman/agent/harness/session/tool_progress.rs, src/openhuman/agent/progress_tracing/tests.rs, src/openhuman/channels/proactive.rs, src/openhuman/channels/providers/presentation.rs, src/openhuman/threads/turn_state/mirror_tests.rs
Existing event construction sites and tests are updated to set failure: None or ignore the new field for the updated struct shapes.

Estimated code review effort: 3 (Moderate) | ~30 minutes

Sequence Diagram(s)

sequenceDiagram
    participant Middleware as ToolOutcomeCaptureMiddleware
    participant FailureMap as ToolFailureMap
    participant Bridge as OpenhumanEventBridge
    participant ProgressBridge as spawn_progress_bridge
    participant Socket as WebChannelEvent

    Middleware->>Middleware: after_tool detects result.error
    Middleware->>Middleware: classify error via tool_status::classify/describe
    Middleware->>FailureMap: insert(call_id, (success, failure))
    Bridge->>FailureMap: remove(call_id) on ToolCompleted
    FailureMap-->>Bridge: (success, failure) or absent
    Bridge->>Bridge: build AgentProgress::ToolCallCompleted{success, failure}
    Bridge->>ProgressBridge: emit AgentProgress event
    ProgressBridge->>ProgressBridge: serde_json::to_value(failure) -> failure_json
    ProgressBridge->>Socket: emit tool_result{failure: failure_json}
Loading

Suggested labels: rust-core, bug

Poem

A rabbit hopped through failed tool calls,
Classifying why each error falls. 🐰
Timeout, policy, or missing key,
Now wrapped up nice for all to see.
Through map and bridge the failure flows,
Straight to the socket, and off it goes! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning The PR does not address #4442’s build-fix items; it adds tool-failure classification instead of restoring SqlRunLedgerCheckpointer, graph_fn, or the doc link. Update the PR to fix the TinyAgents migration build break by repointing to SqliteCheckpointer, wrapping graph_fn in Some(...), and correcting the stale doc link.
Out of Scope Changes check ⚠️ Warning Most changes are unrelated to #4442, adding new tool-status types, middleware, and web-event failure payloads instead of the requested mechanical build fix. Trim the diff to the orchestration compile fix or split the failure-classification work into a separate PR.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title is concise and accurately summarizes the main change: classifying tool-call failures and surfacing them on web events.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot added bug rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. labels Jul 3, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 96b4414dcb

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/openhuman/tool_status/ops.rs Outdated
Comment on lines +68 to +69
"403",
"forbidden",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don't map external 403s to policy blocks

When an integration or API-backed tool returns a normal authorization failure such as 403 insufficient authentication scopes (this exact Gmail/Composio case is exercised in tests/composio_raw_coverage_e2e.rs), these broad needles classify it as BlockedByPolicy, so the UI tells the user to change Agent access settings and marks it non-recoverable instead of prompting them to reconnect/grant scopes. Please reserve BlockedByPolicy for OpenHuman policy-specific markers/text and classify bare HTTP 403/Forbidden as credentials or user-permission related.

Useful? React with 👍 / 👎.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/openhuman/tinyagents/middleware.rs (1)

1399-1409: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Timeout detection via string match on "timed out" is fragile.

Coupling classification to a literal substring in content will silently stop matching if the message that produces it is ever reworded elsewhere in the codebase. A structured timed_out: bool carried on the tool result (or a dedicated error variant) would be more robust than text sniffing, though the current heuristic is explicitly called out as intentional in the comment.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/tinyagents/middleware.rs` around lines 1399 - 1409, The timeout
classification in the middleware is currently inferred by matching the literal
substring "timed out" in result.content, which is brittle. Update the logic in
tinyagents::middleware::classify to use a structured timeout indicator from the
tool result (for example a timed_out flag or dedicated error variant) instead of
text sniffing, while keeping the existing POLICY_BLOCKED_MARKER handling intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/tool_status/ops.rs`:
- Around line 60-75: The numeric status-code needles in the ToolFailureClass
detection logic can false-positive because contains_any does plain substring
matching; update the matching in the relevant classification branches (including
the BlockedByPolicy, BadCredentials, and ServiceUnavailable checks in the ops.rs
failure classifier) so codes like 403, 401, 502, 503, and 504 only match when
not surrounded by other digits. Use a boundary-aware check or equivalent guard
before classifying, while keeping the existing keyword-based needles as-is.

---

Nitpick comments:
In `@src/openhuman/tinyagents/middleware.rs`:
- Around line 1399-1409: The timeout classification in the middleware is
currently inferred by matching the literal substring "timed out" in
result.content, which is brittle. Update the logic in
tinyagents::middleware::classify to use a structured timeout indicator from the
tool result (for example a timed_out flag or dedicated error variant) instead of
text sniffing, while keeping the existing POLICY_BLOCKED_MARKER handling intact.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bf1c9617-ea91-4bb0-8793-d408a4150acc

📥 Commits

Reviewing files that changed from the base of the PR and between 8aab529 and 96b4414.

📒 Files selected for processing (16)
  • src/bin/harness_subagent_audit.rs
  • src/core/socketio.rs
  • src/openhuman/agent/harness/session/tool_progress.rs
  • src/openhuman/agent/progress.rs
  • src/openhuman/agent/progress_tracing/tests.rs
  • src/openhuman/channels/proactive.rs
  • src/openhuman/channels/providers/presentation.rs
  • src/openhuman/channels/providers/web/progress_bridge.rs
  • src/openhuman/mod.rs
  • src/openhuman/threads/turn_state/mirror_tests.rs
  • src/openhuman/tinyagents/middleware.rs
  • src/openhuman/tinyagents/mod.rs
  • src/openhuman/tinyagents/observability.rs
  • src/openhuman/tool_status/mod.rs
  • src/openhuman/tool_status/ops.rs
  • src/openhuman/tool_status/types.rs

Comment thread src/openhuman/tool_status/ops.rs Outdated
…ent (tinyhumansai#4254, backend)

Rework of tinyhumansai#4413's backend onto post-tinyhumansai#4399 code. tinyhumansai#4399 deleted the original
classification hook (session/agent_tool_exec.rs) and the ProgressReporter
trait; the tool loop now runs on the tinyagents crate, so the hook is
re-homed on the adapter seam.

- New `src/openhuman/tool_status/` module: `classify(error_text, timed_out)`
  and `describe(class)` map a tool outcome to a `ClassifiedFailure`
  (class + category + plain-language cause + next action). Pure, unit-tested
  (16 tests: all classes, timeout/unknown/precedence, category mapping,
  recoverability).
- `AgentProgress::ToolCallCompleted` + `WebChannelEvent` gain a `failure`
  field carrying that classification to the chat "View processing" timeline.
- Core hook + latent-bug fix: the crate's `AgentEvent::ToolCompleted` carries
  only call_id + tool_name, so the bridge previously hardcoded `success: true`
  (a failed tool never surfaced). `ToolOutcomeCaptureMiddleware::after_tool`
  now classifies each result and writes a `call_id → (success, failure)`
  side-channel (mirroring the existing `ToolNameMap`); the bridge reads it to
  emit real `success` + `failure`. Absent entry falls back to `(true, None)`.
- `progress_bridge` serializes `failure` onto the `tool_result` socket event
  and the run ledger.

Backend only. The frontend consumer (chat timeline "why / next" render) and
i18n keys are a companion — they rebase cleanly onto current `main` from
tinyhumansai#4413 (only the Rust hook conflicted with tinyhumansai#4399), so they are not re-included
here to avoid reverting unrelated locale drift.

NOTE: the middleware-before-event ordering that populates the side-channel is
crate-internal; the fallback keeps behavior no worse than before if the event
is ever projected first. Needs runtime validation once tinyhumansai#4442 lands (drive a
deliberately-failing tool and confirm `success:false` + `failure` on the wire).
@M3gA-Mind M3gA-Mind force-pushed the fix/GH-4254-tool-failures-backend-rework branch from 96b4414 to 487be7e Compare July 3, 2026 15:15
…tatus needles

Addresses CodeRabbit review on tinyhumansai#4445:
- Bare HTTP 403/Forbidden (e.g. Gmail/Composio "403 insufficient authentication
  scopes") was classified as BlockedByPolicy, telling the user to change Agent-access
  settings and marking it non-recoverable. It's an external authz failure → now
  BadCredentials (reconnect / grant scopes). OpenHuman's own policy blocks are tagged
  upstream with POLICY_BLOCKED_MARKER (never reach this heuristic); the OpenHuman-specific
  'forbidden path' marker stays policy, checked before credentials so it still wins.
- contains_any did unbounded substring matching, so '403'/'401'/'50x' false-matched
  inside longer digit runs (a '403' in '14033', a port, a byte count, a timestamp).
  New contains_code() requires non-digit boundaries on each side.
- Regression tests: external 403/401 → credentials, 'forbidden path' → policy,
  embedded 403/503 in longer numbers → Unknown, standalone 503 → service.
coderabbitai[bot]
coderabbitai Bot previously approved these changes Jul 3, 2026
M3gA-Mind added a commit to M3gA-Mind/openhuman that referenced this pull request Jul 3, 2026
…ed in tinyhumansai#4445

tinyhumansai#4413 originally shipped both the tool-failure classification backend and the
chat "View processing" render. The backend conflicted with tinyhumansai#4399 (harness
migration) and has been reworked separately in tinyhumansai#4445 (tool_status module +
ToolOutcomeCaptureMiddleware + WebChannelEvent.failure). This reduces tinyhumansai#4413 to
just the frontend that consumes tinyhumansai#4445's `failure` field off the tool_result
socket event, rebased cleanly onto current main.

Kept (21 files):
- chat timeline render: ProcessingTranscriptView (+test), ChatRuntimeProvider,
  chatService, chatRuntimeSlice (+toolFailure test), turnState.
- i18n: conversations.toolFailure.* keys across all 14 locales (3-way unioned
  onto current main so no existing translations are reverted).

Dropped (now owned by tinyhumansai#4445): tool_status/, agent progress/observability,
middleware, event_bus, socketio failure plumbing, and the deleted-file conflict
on agent_tool_exec.rs.

Verified: pnpm typecheck clean, i18n:check parity 0 missing/0 extra, 13 frontend
tests pass (chatRuntimeSlice.toolFailure + ProcessingTranscriptView).
Stacked on tinyhumansai#4445 (merge that first — it emits the field this renders).
The tool-failure feature added AgentProgress::ToolCallCompleted.failure; an
integration test (tests/, a separate compilation unit not covered by
`cargo check --lib`) constructed the event without it, failing the Rust Core
Coverage job (E0063). `cargo check --tests` is now clean.
@senamakel senamakel merged commit b5837ed into tinyhumansai:main Jul 3, 2026
15 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in Team Openhuman Jul 3, 2026
senamakel added a commit to senamakel/openhuman that referenced this pull request Jul 4, 2026
…om, terminal-inference halt, autonomous iter-cap lift (tinyhumansai#4463)

Restores loop-guard behaviors dropped with the legacy `tool_loop.rs` in the
TinyAgents migration; the crate `no_progress` ladder tracks *failures* only and
does not replace them. All fixes are seam-side (`src/openhuman/**`); vendored
crate untouched.

- RepeatProgressMiddleware (new, `after_model` + batch-aware `after_tool`):
  halts on 3× identical *successful* `(tool, args)` batches (tinyhumansai#4088) and 4×
  identical outputs (tinyhumansai#4095); `wait_subagent` polling exemption ported. Shares
  the halt-summary slot + steering handle with the failure breaker.
- RepeatedToolFailureMiddleware: differentiated headroom restored — recoverable
  classes (timeout/transient markers + tinyhumansai#4445 classifier: Timeout/
  ServiceUnavailable/ModelConnection) get 8 identical / 12 varied instead of the
  crate's fixed 3/6; POLICY_DENIED_MARKER rejoins POLICY_BLOCKED_MARKER for the
  2-repeat hard-reject fast-trip (tinyhumansai#4463 part 6).
- Terminal delegated-inference fast-halt (tinyhumansai#3104): budget-exhausted /
  provider-config-rejection delegation failures halt on first occurrence, gated
  on the delegated-inference envelope so recoverable tool stderr isn't caught.
- Autonomous iter-cap lift (part 7): `autonomous_iter_cap()` was a dead knob —
  wired back in via `subagent_iter_cap_with_autonomous_lift` at the subagent
  `effective_max_iterations()` sites, so task-dispatcher / skill subagents run
  up to TASK_RUN_MAX_ITERATIONS again instead of the normal per-agent cap.
- Updated the adapter-inventory middleware-count assertions (15→16, 11→12).

Tests deferred to the final parity test pass.

Claude-Session: https://claude.ai/code/session_019j5TLsRLHsM3kqAYFyH4hR
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants