fix(inference): SSE inactivity watchdog to stop research-agent RESPONSE hang (#4269) by sanil-23 · Pull Request #4393 · tinyhumansai/openhuman

sanil-23 · 2026-07-01T19:25:59Z

Summary

- Adds a per-chunk SSE inactivity watchdog to the native streaming path (stream_native_chat) that aborts a stalled stream after a configurable idle window (default 90s) instead of parking indefinitely on bytes_stream.next().await.
- The window resets on every received chunk, so a legitimately long response that keeps emitting tokens is never cut.
- Applies the same idle bound to the downstream delta send, so a wedged UI/progress consumer can't hang the turn on a full delta channel either.
- The abort classifies as retryable, so ReliableProvider replays the turn (degrading to non-streaming on retry), matching existing recovery behaviour.
- New env knob OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS (default 90, range 1–3600).
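A minimal sketch of how such a knob could be resolved and cached, using only the standard library. The env var name, default, range, and the `stream_idle_timeout` name come from this description; `resolve_idle_secs` is a hypothetical helper, and falling back to the default (rather than clamping) on out-of-range input is an assumption of this sketch, not something the PR states.

```rust
use std::sync::OnceLock;
use std::time::Duration;

/// Env var name and bounds as stated in the PR description.
const ENV_KEY: &str = "OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS";
const DEFAULT_SECS: u64 = 90;
const MIN_SECS: u64 = 1;
const MAX_SECS: u64 = 3600;

/// Hypothetical pure resolver: parse a raw value and fall back to the
/// default when it is missing, malformed, or outside 1..=3600.
/// ASSUMPTION: fallback instead of clamping; the real resolver may differ.
fn resolve_idle_secs(raw: Option<&str>) -> u64 {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|secs| (MIN_SECS..=MAX_SECS).contains(secs))
        .unwrap_or(DEFAULT_SECS)
}

/// Reads the environment once and caches the result for the process lifetime.
fn stream_idle_timeout() -> Duration {
    static CACHE: OnceLock<Duration> = OnceLock::new();
    *CACHE.get_or_init(|| {
        let raw = std::env::var(ENV_KEY).ok();
        Duration::from_secs(resolve_idle_secs(raw.as_deref()))
    })
}
```

Keeping the parsing in a pure function makes the boundary cases (0, 1, 3600, 3601, non-numeric) easy to test without touching the process environment.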

Problem

#4269 — the research agent intermittently hangs in the RESPONSE phase after tool calls complete.

Reproduced end-to-end against a staging build by driving the research agent via CDP and looping deep-research queries (arxiv paper + web search — the issue's own repro shape). The failure is an upstream SSE stall on the read side:

21:40:40  [stream] OpenHuman POST .../chat/completions (stream=true, tools=14)
21:42:40  [stream] streaming chat failed, falling back to non-streaming: error decoding response body

The two log lines are exactly 120s apart: the SSE body goes silent mid-response and the reader parks on bytes_stream.next().await (compatible_stream_native.rs) until the whole-request timeout cuts it. On the default config that is a ~2-minute freeze ("cursor blinks, no output"). However, #3856 advises operators to raise OPENHUMAN_INFERENCE_TIMEOUT_SECS up to 3600s for long research turns; with that raised, this exact stall hangs for up to an hour, which is the reported indefinite hang. The stall class appeared on ~2 of every 3 research runs; two runs blew past a 10-minute cap.

The whole-request timeout is the wrong instrument: it cannot distinguish "stalled, zero tokens" from "valid long response still streaming", and operators are told to raise it. A per-chunk inactivity watchdog bounds the stall independently of that knob.
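The difference between the two bounds can be illustrated with a synchronous, standard-library analogue. This is only a sketch: the PR itself uses tokio::time::timeout around bytes_stream.next().await, whereas here a `std::sync::mpsc` channel stands in for the SSE body, and `StreamEnd` and `read_with_idle_watchdog` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// How a watched read loop ended.
#[derive(Debug, PartialEq)]
enum StreamEnd {
    /// The producer closed the stream; all chunks were collected.
    Completed(Vec<String>),
    /// No chunk arrived within one idle window; holds the chunks seen so far.
    Stalled(Vec<String>),
}

/// Reads chunks until the sender hangs up, allowing at most `idle` of
/// silence between consecutive chunks. The window restarts on every call
/// to `recv_timeout`, so the total stream duration is not bounded.
fn read_with_idle_watchdog(rx: &Receiver<String>, idle: Duration) -> StreamEnd {
    let mut chunks = Vec::new();
    loop {
        match rx.recv_timeout(idle) {
            Ok(chunk) => chunks.push(chunk),
            Err(RecvTimeoutError::Disconnected) => return StreamEnd::Completed(chunks),
            Err(RecvTimeoutError::Timeout) => return StreamEnd::Stalled(chunks),
        }
    }
}
```

A stream that delivers a chunk every 300ms for longer than a 1s idle window still completes, while a stream that delivers one chunk and then goes quiet is reported as stalled after a single window.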

Solution

- compatible_timeout.rs: new stream_idle_timeout() (env OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS, default 90s, range 1..=3600), reusing the existing resolver + OnceLock cache pattern.
- compatible.rs: OpenAiCompatibleProvider carries the idle window (defaulted from config); a #[cfg(test)] with_stream_idle_timeout injects a small value in tests.
- compatible_stream_native.rs: wrap the SSE read in a per-chunk tokio::time::timeout that resets each iteration; route every delta through a new forward_delta helper that applies the same idle bound to the send (dropped receiver = benign Ok; idle timeout = retryable bail). Both bail messages are crafted so reliable::is_non_retryable classifies them retryable.
- .env.example: document the knob.
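The send-side rule (dropped receiver is benign, a full channel for a whole idle window is an error) can be sketched with the standard library as follows. This is an illustrative analogue, not the PR's forward_delta: it polls `try_send` on a bounded `std::sync::mpsc` channel instead of awaiting an async send under a timeout, and `forward_delta_bounded` and `DeltaSendStalled` are hypothetical names.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};
use std::thread;
use std::time::{Duration, Instant};

/// Error returned when the consumer stays wedged for a full idle window.
#[derive(Debug, PartialEq)]
struct DeltaSendStalled;

/// Tries to hand `delta` to a bounded channel for at most `idle`.
/// - receiver dropped: benign, returns Ok(()) because nobody is listening
/// - channel full for the whole window: returns Err so the caller can abort
fn forward_delta_bounded(
    tx: &SyncSender<String>,
    delta: String,
    idle: Duration,
) -> Result<(), DeltaSendStalled> {
    let deadline = Instant::now() + idle;
    let mut pending = delta;
    loop {
        match tx.try_send(pending) {
            Ok(()) => return Ok(()),
            Err(TrySendError::Disconnected(_)) => return Ok(()),
            Err(TrySendError::Full(returned)) => {
                if Instant::now() >= deadline {
                    return Err(DeltaSendStalled);
                }
                pending = returned; // take the value back and retry
                thread::sleep(Duration::from_millis(5));
            }
        }
    }
}
```

With a capacity-1 channel that is never drained, the first send succeeds and the second returns `Err(DeltaSendStalled)` after one idle window; once the receiver is dropped, further sends return `Ok(())`.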

Scope note: the observed failure is the silent-stall case (no finish_reason / [DONE] arrives — the body just dies). Honoring the in-band terminal marker to end instantly when an upstream does send [DONE] but lingers the socket is an orthogonal correctness improvement, deferred as a follow-up.

Tests

- stream_watchdog_trips_on_stalled_read: raw-TCP server flushes 200 SSE headers then goes silent; asserts a retryable watchdog abort (well before the whole-request timeout).
- stream_watchdog_resets_on_each_chunk: chunks arriving under the idle window stream to completion (no false cut; this is the no-regression guarantee).
- stream_watchdog_trips_on_wedged_delta_consumer: a capacity-1, never-drained delta channel trips the send-side watchdog.
- compatible_timeout resolver tests for the new bound + boundaries.
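The shape of the stalled-read scenario (headers flushed, then silence with the socket still open) can be reproduced with a loopback server built on the standard library. This is a sketch under stated assumptions, not the PR's test helper: it uses a blocking `TcpStream` with a per-read timeout in place of the async provider path, and `spawn_silent_sse_server`, `read_until_stall`, and `ReadOutcome` are hypothetical names.

```rust
use std::io::{ErrorKind, Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum ReadOutcome {
    /// The peer closed the connection (EOF).
    Closed,
    /// No bytes arrived within one idle window.
    Stalled,
}

/// Spawns a loopback server that sends SSE response headers and then stays
/// silent for `hold` before closing. Returns the address to connect to.
fn spawn_silent_sse_server(hold: Duration) -> SocketAddr {
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind loopback");
    let addr = listener.local_addr().expect("local addr");
    thread::spawn(move || {
        if let Ok((mut sock, _)) = listener.accept() {
            let _ = sock.write_all(
                b"HTTP/1.1 200 OK\r\nContent-Type: text/event-stream\r\n\r\n",
            );
            let _ = sock.flush();
            thread::sleep(hold); // go silent with the socket still open
        }
    });
    addr
}

/// Reads until EOF or until a single read waits longer than `idle`.
fn read_until_stall(addr: SocketAddr, idle: Duration) -> std::io::Result<(Vec<u8>, ReadOutcome)> {
    let mut stream = TcpStream::connect(addr)?;
    stream.set_read_timeout(Some(idle))?; // per-read bound, so it resets on every read
    let mut seen = Vec::new();
    let mut buf = [0u8; 1024];
    loop {
        match stream.read(&mut buf) {
            Ok(0) => return Ok((seen, ReadOutcome::Closed)),
            Ok(n) => seen.extend_from_slice(&buf[..n]),
            // Read timeouts surface as WouldBlock on Unix and TimedOut on Windows.
            Err(e) if matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) => {
                return Ok((seen, ReadOutcome::Stalled));
            }
            Err(e) => return Err(e),
        }
    }
}
```

Connecting with a 200ms idle window to a server that holds the socket for 2s should yield the response headers followed by `ReadOutcome::Stalled`, long before the server closes the connection.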

Submission Checklist

If a section does not apply, mark the item N/A with a one-line reason. Do not delete items.

- Tests added or updated (happy path + failure/edge case): stalled-read, reset-on-chunk (happy), wedged-consumer (edge).
- Diff coverage ≥ 80%: changed lines covered by the new Rust tests (verified locally via cargo-llvm-cov + diff-cover).
- Coverage matrix updated: N/A, behaviour-only inference-runtime fix, no feature row added/removed/renamed.
- All affected feature IDs listed under ## Related: N/A, no coverage-matrix feature ID applies.
- No new external network dependencies introduced: the new tests use an in-process raw-TCP server + wiremock (already a dev-dep).
- Manual smoke checklist updated if this touches release-cut surfaces: N/A, no release-manual-smoke surface changed.
- Linked issue closed via Closes #NNN in ## Related.

Impact

- Runtime impact: inference streaming path (all providers via the OpenAI-compatible provider), especially long research turns.
- Performance impact: removes the indefinite RESPONSE-phase hang; adds one tokio::time::timeout per SSE chunk (negligible).
- Tradeoff: a genuinely silent stream is now aborted and retried after the idle window instead of after the whole-request timeout. Retry degrades to non-streaming (existing behaviour). No token loss on the happy path (deltas still stream; the watchdog only fires after a full idle window).
- No persistence, migration, security, or network-dependency changes.

Related

- Closes #4269 (Research agent intermittently hangs mid-run).
- Context: #3856, "Feature Request: Add configuration options for all timeout values." (whole-request timeout raised for long research turns, which is why the stall presents as indefinite).
- Alternative approach: #4356, "fix(agent): avoid RESPONSE stalls on progress backpressure" (non-blocking delta forwarders), does not address the read-side stall reproduced here; see my review on that PR.

Summary by CodeRabbit

New Features
- Added a configurable “stream idle” watchdog timeout for streamed responses, with documented defaults and bounds.
Bug Fixes
- Prevented watchdog false-failures by immediately handling the terminal [DONE] sentinel.
- Improved stability under slow downstream consumption: dropped receivers are tolerated; sustained backpressure triggers a retryable watchdog error.
Tests
- Extended automated coverage for stream idle timeout, including reset-on-chunk behavior and mid-body upstream stalls.

coderabbitai · 2026-07-01T19:26:27Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d7ed9c0d-fd72-4e02-a12a-0998581c84e3

📥 Commits

Reviewing files that changed from the base of the PR and between e0e5652 and 35be03b.

📒 Files selected for processing (1)

.env.example

✅ Files skipped from review due to trivial changes (1)

.env.example

📝 Walkthrough

Adds a configurable per-chunk streaming idle timeout, wires it into OpenAI-compatible streaming, applies watchdogs to SSE reads and delta forwarding, and adds coverage for stalled reads, backpressure, dropped receivers, and terminal completion.

Changes

Stream idle watchdog

| Layer / File(s) | Summary |
|---|---|
| Stream idle timeout configuration (`.env.example`, `src/openhuman/inference/provider/compatible_timeout.rs`) | Adds `OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS` docs, bounds, cached resolution, and timeout tests. |
| Provider wiring for stream_idle_timeout (`src/openhuman/inference/provider/compatible.rs`) | Adds `stream_idle_timeout` to `OpenAiCompatibleProvider`, initializes it from config, and adds a test-only override. |
| SSE read and delta forwarding watchdog (`src/openhuman/inference/provider/compatible_stream_native.rs`) | Wraps SSE reads and delta sends in timeouts, adds `forward_delta`, and stops immediately on `[DONE]`. |
| Watchdog test coverage (`src/openhuman/inference/provider/compatible_tests.rs`) | Adds helpers and tests for stalled reads, chunk resets, wedged consumers, dropped receivers, and lingering sockets after `[DONE]`. |

Estimated code review effort: 4 (Complex) | ~50 minutes

Sequence Diagram(s)

sequenceDiagram
  participant stream_native_chat
  participant bytes_stream
  participant forward_delta
  participant delta_tx

  loop per chunk
    stream_native_chat->>bytes_stream: next()
    alt chunk arrives before idle timeout
      bytes_stream-->>stream_native_chat: SSE chunk
      stream_native_chat->>forward_delta: send(delta)
      forward_delta->>delta_tx: send(delta)
      alt receiver closed
        delta_tx-->>forward_delta: dropped
        forward_delta-->>stream_native_chat: Ok(())
      else send succeeds
        delta_tx-->>forward_delta: sent
        forward_delta-->>stream_native_chat: Ok(())
      else backpressure timeout
        forward_delta-->>stream_native_chat: retryable error
      end
    else no chunk before idle timeout
      bytes_stream-->>stream_native_chat: timeout
      stream_native_chat-->>stream_native_chat: abort retryable watchdog error
    end
  end
  stream_native_chat->>stream_native_chat: stop on [DONE]

Poem

A rabbit watched the stream go still,
No token hopped across the hill.
Tick-tock went the idle clock,
And stalled streams found a safer dock. 🐇⏱️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Linked Issues check | ❓ Inconclusive | The PR appears to address `#4269` with a streaming watchdog, retryable aborts, reset-on-chunk behavior, and coverage-focused tests, but coverage ≥80% is not verifiable here. | Provide the CI or coverage report confirming the changed-lines coverage gate is at least 80%. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly matches the main change: adding an SSE inactivity watchdog to address the RESPONSE hang. |
| Out of Scope Changes check | ✅ Passed | The changes stay focused on the streaming watchdog, timeout config, and related tests/documentation with no obvious unrelated scope. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/inference/provider/compatible_stream_native.rs`:
- Around line 263-293: The SSE loop in compatible_stream_native.rs is still
treating the terminal `[DONE]` event as non-final, so it re-arms the idle
watchdog and can incorrectly fail a completed response if the socket stays open.
Update the stream-reading logic around the `bytes_stream.next()` loop and
`[DONE]` handling so the loop exits immediately on the terminal sentinel instead
of continuing to another read. Add a regression test in the same provider path
that emits `[DONE]` and then keeps the connection open to verify the stream
completes without tripping `stream_idle_timeout`.
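A minimal sketch of the requested behaviour, assuming an OpenAI-style stream in which each event is a `data:` line and the terminal event is `data: [DONE]`. The names `SseStep`, `classify_sse_line`, and `drain_until_done` are hypothetical and are not taken from compatible_stream_native.rs; the point is only that the loop returns on the sentinel instead of going back for another read.

```rust
/// What the read loop should do after processing one SSE line.
#[derive(Debug, PartialEq)]
enum SseStep {
    /// A `data:` payload to hand to the delta parser.
    Data(String),
    /// The terminal sentinel: stop reading, do not re-arm the watchdog.
    Done,
    /// Blank lines, comments, and other fields that carry no payload.
    Skip,
}

/// Classifies a single SSE line (with or without its line terminator).
fn classify_sse_line(line: &str) -> SseStep {
    let line = line.trim_end_matches(|c: char| c == '\r' || c == '\n');
    match line.strip_prefix("data:") {
        Some(rest) => {
            let payload = rest.trim_start();
            if payload == "[DONE]" {
                SseStep::Done
            } else {
                SseStep::Data(payload.to_string())
            }
        }
        None => SseStep::Skip,
    }
}

/// Collects payloads up to the sentinel; returns (payloads, saw_done).
fn drain_until_done<'a>(lines: impl IntoIterator<Item = &'a str>) -> (Vec<String>, bool) {
    let mut payloads = Vec::new();
    for line in lines {
        match classify_sse_line(line) {
            SseStep::Data(p) => payloads.push(p),
            SseStep::Done => return (payloads, true), // exit before another read
            SseStep::Skip => {}
        }
    }
    (payloads, false)
}
```

Because the function returns as soon as it sees the sentinel, anything the upstream does afterwards (including holding the socket open indefinitely) is never waited on.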

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 82583067-bccd-4429-9c9a-93da830c0165

📥 Commits

Reviewing files that changed from the base of the PR and between 4c98a31 and 5da7bfd.

📒 Files selected for processing (5)

.env.example
src/openhuman/inference/provider/compatible.rs
src/openhuman/inference/provider/compatible_stream_native.rs
src/openhuman/inference/provider/compatible_tests.rs
src/openhuman/inference/provider/compatible_timeout.rs

…e watchdog CodeRabbit review on tinyhumansai#4393: after the terminal [DONE] sentinel the loop re-armed the idle watchdog, so a provider that sends [DONE] but holds the socket open would fail an already-complete response as a retryable stall. Break on [DONE]; add a regression that sends [DONE] then keeps the connection open. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…SE hang (tinyhumansai#4269) The native streaming path parked on `bytes_stream.next().await` when an upstream flushed 200 then went silent mid-response — cut only by the blunt whole-request timeout, which tinyhumansai#3856 tells operators to raise up to 1h, turning the stall into an indefinite hang. Add a per-chunk inactivity watchdog that resets on every token and aborts a stalled stream with a retryable error (ReliableProvider replays it); also bound the delta send so a wedged consumer can't hang the turn on a full channel. New env knob OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS (default 90s, range 1-3600). Reproduced against staging via CDP: deep-research streams stalled the full 120s request-timeout window, surfacing as "error decoding response body". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

sanil-23 requested a review from a team July 1, 2026 19:26

coderabbitai Bot requested changes Jul 1, 2026

Comment thread src/openhuman/inference/provider/compatible_stream_native.rs

coderabbitai Bot approved these changes Jul 1, 2026

sanil-23 and others added 2 commits July 2, 2026 12:14

sanil-23 force-pushed the fix/4269-research-response-watchdog branch from e0e5652 to 35be03b Compare July 2, 2026 06:53

sanil-23 mentioned this pull request Jul 2, 2026

fix(tinyagents): keep attempted tool name in the timeline for unavailable tools #4419

Merged

7 tasks

sanil-23 wants to merge 2 commits into tinyhumansai:main from sanil-23:fix/4269-research-response-watchdog
