Skip to content

fix(inference): SSE inactivity watchdog to stop research-agent RESPONSE hang (#4269)#4393

Open
sanil-23 wants to merge 2 commits into
tinyhumansai:mainfrom
sanil-23:fix/4269-research-response-watchdog
Open

fix(inference): SSE inactivity watchdog to stop research-agent RESPONSE hang (#4269)#4393
sanil-23 wants to merge 2 commits into
tinyhumansai:mainfrom
sanil-23:fix/4269-research-response-watchdog

Conversation

@sanil-23

@sanil-23 sanil-23 commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Adds a per-chunk SSE inactivity watchdog to the native streaming path (stream_native_chat) that aborts a stalled stream after a configurable idle window (default 90s) instead of parking indefinitely on bytes_stream.next().await.
  • The window resets on every received chunk, so a legitimately long response that keeps emitting tokens is never cut.
  • Applies the same idle bound to the downstream delta send, so a wedged UI/progress consumer can't hang the turn on a full delta channel either.
  • The abort classifies as retryable, so ReliableProvider replays the turn (degrading to non-streaming on retry) — matching existing recovery behaviour.
  • New env knob OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS (default 90, range 1–3600).

Problem

#4269 — the research agent intermittently hangs in the RESPONSE phase after tool calls complete.

Reproduced end-to-end against a staging build by driving the research agent via CDP and looping deep-research queries (arxiv paper + web search — the issue's own repro shape). The failure is an upstream SSE stall on the read side:

21:40:40  [stream] OpenHuman POST .../chat/completions (stream=true, tools=14)
21:42:40  [stream] streaming chat failed, falling back to non-streaming: error decoding response body

Exactly 120s apart: the SSE body goes silent mid-response and the reader parks on bytes_stream.next().await (compatible_stream_native.rs) until the whole-request timeout cuts it. On default config that is a ~2-minute freeze ("cursor blinks, no output"); but #3856 advises operators to raise OPENHUMAN_INFERENCE_TIMEOUT_SECS up to 3600s for long research turns — with that raised, this exact stall hangs for up to an hour = the reported indefinite hang. The stall class appeared on ~2 of every 3 research runs; two runs blew past a 10-minute cap.

The whole-request timeout is the wrong instrument: it cannot distinguish "stalled, zero tokens" from "valid long response still streaming", and operators are told to raise it. A per-token inactivity watchdog bounds the stall independent of that knob.

Solution

  • compatible_timeout.rs — new stream_idle_timeout() (env OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS, default 90s, range 1..=3600), reusing the existing resolver + OnceLock cache pattern.
  • compatible.rsOpenAiCompatibleProvider carries the idle window (defaulted from config); a #[cfg(test)] with_stream_idle_timeout injects a small value in tests.
  • compatible_stream_native.rs — wrap the SSE read in a per-chunk tokio::time::timeout that resets each iteration; route every delta through a new forward_delta helper that applies the same idle bound to the send (dropped receiver = benign Ok; idle timeout = retryable bail). Both bail messages are crafted so reliable::is_non_retryable classifies them retryable.
  • .env.example — document the knob.

Scope note: the observed failure is the silent-stall case (no finish_reason / [DONE] arrives — the body just dies). Honoring the in-band terminal marker to end instantly when an upstream does send [DONE] but lingers the socket is an orthogonal correctness improvement, deferred as a follow-up.

Tests

  • stream_watchdog_trips_on_stalled_read — raw-TCP server flushes 200 SSE headers then goes silent; asserts a retryable watchdog abort (well before the whole-request timeout).
  • stream_watchdog_resets_on_each_chunk — chunks arriving under the idle window stream to completion (no false cut — the no-regression guarantee).
  • stream_watchdog_trips_on_wedged_delta_consumer — a capacity-1, never-drained delta channel trips the send-side watchdog.
  • compatible_timeout resolver tests for the new bound + boundaries.

Submission Checklist

If a section does not apply, mark the item N/A with a one-line reason. Do not delete items.

  • Tests added or updated (happy path + failure/edge case) — stalled-read, reset-on-chunk (happy), wedged-consumer (edge).
  • Diff coverage ≥ 80% — changed lines covered by the new Rust tests (verified locally via cargo-llvm-cov + diff-cover).
  • Coverage matrix updated — N/A: behaviour-only inference-runtime fix, no feature row added/removed/renamed.
  • All affected feature IDs listed under ## Related — N/A: no coverage-matrix feature ID applies.
  • No new external network dependencies introduced — the new tests use an in-process raw-TCP server + wiremock (already a dev-dep).
  • Manual smoke checklist updated if this touches release-cut surfaces — N/A: no release-manual-smoke surface changed.
  • Linked issue closed via Closes #NNN in ## Related.

Impact

  • Runtime impact: inference streaming path (all providers via the OpenAI-compatible provider), especially long research turns.
  • Performance impact: removes the indefinite RESPONSE-phase hang; adds one tokio::time::timeout per SSE chunk (negligible).
  • Tradeoff: a genuinely silent stream is now aborted+retried after the idle window instead of after the whole-request timeout. Retry degrades to non-streaming (existing behaviour). No token loss on the happy path (deltas still stream; the watchdog only fires on a full idle window).
  • No persistence, migration, security, or network-dependency changes.

Related

Summary by CodeRabbit

  • New Features

    • Added a configurable “stream idle” watchdog timeout for streamed responses, with documented defaults and bounds.
  • Bug Fixes

    • Prevented watchdog false-failures by immediately handling the terminal [DONE] sentinel.
    • Improved stability under slow downstream consumption: dropped receivers are tolerated; sustained backpressure triggers a retryable watchdog error.
  • Tests

    • Extended automated coverage for stream idle timeout, including reset-on-chunk behavior and mid-body upstream stalls.

@sanil-23 sanil-23 requested a review from a team July 1, 2026 19:26
@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d7ed9c0d-fd72-4e02-a12a-0998581c84e3

📥 Commits

Reviewing files that changed from the base of the PR and between e0e5652 and 35be03b.

📒 Files selected for processing (1)
  • .env.example
✅ Files skipped from review due to trivial changes (1)
  • .env.example

📝 Walkthrough

Walkthrough

Adds a configurable per-chunk streaming idle timeout, wires it into OpenAI-compatible streaming, applies watchdogs to SSE reads and delta forwarding, and adds coverage for stalled reads, backpressure, dropped receivers, and terminal completion.

Changes

Stream idle watchdog

Layer / File(s) Summary
Stream idle timeout configuration
.env.example, src/openhuman/inference/provider/compatible_timeout.rs
Adds OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS docs, bounds, cached resolution, and timeout tests.
Provider wiring for stream_idle_timeout
src/openhuman/inference/provider/compatible.rs
Adds stream_idle_timeout to OpenAiCompatibleProvider, initializes it from config, and adds a test-only override.
SSE read and delta forwarding watchdog
src/openhuman/inference/provider/compatible_stream_native.rs
Wraps SSE reads and delta sends in timeouts, adds forward_delta, and stops immediately on [DONE].
Watchdog test coverage
src/openhuman/inference/provider/compatible_tests.rs
Adds helpers and tests for stalled reads, chunk resets, wedged consumers, dropped receivers, and lingering sockets after [DONE].

Estimated code review effort: 4 (Complex) | ~50 minutes

Sequence Diagram(s)

sequenceDiagram
  participant stream_native_chat
  participant bytes_stream
  participant forward_delta
  participant delta_tx

  loop per chunk
    stream_native_chat->>bytes_stream: next()
    alt chunk arrives before idle timeout
      bytes_stream-->>stream_native_chat: SSE chunk
      stream_native_chat->>forward_delta: send(delta)
      forward_delta->>delta_tx: send(delta)
      alt receiver closed
        delta_tx-->>forward_delta: dropped
        forward_delta-->>stream_native_chat: Ok(())
      else send succeeds
        delta_tx-->>forward_delta: sent
        forward_delta-->>stream_native_chat: Ok(())
      else backpressure timeout
        forward_delta-->>stream_native_chat: retryable error
      end
    else no chunk before idle timeout
      bytes_stream-->>stream_native_chat: timeout
      stream_native_chat-->>stream_native_chat: abort retryable watchdog error
    end
  end
  stream_native_chat->>stream_native_chat: stop on [DONE]
Loading

Poem

A rabbit watched the stream go still,
No token hopped across the hill.
Tick-tock went the idle clock,
And stalled streams found a safer dock. 🐇⏱️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Linked Issues check ❓ Inconclusive The PR appears to address #4269 with a streaming watchdog, retryable aborts, reset-on-chunk behavior, and coverage-focused tests, but coverage ≥80% is not verifiable here. Provide the CI or coverage report confirming the changed-lines coverage gate is at least 80%.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main change: adding an SSE inactivity watchdog to address the RESPONSE hang.
Out of Scope Changes check ✅ Passed The changes stay focused on the streaming watchdog, timeout config, and related tests/documentation with no obvious unrelated scope.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/inference/provider/compatible_stream_native.rs`:
- Around line 263-293: The SSE loop in compatible_stream_native.rs is still
treating the terminal `[DONE]` event as non-final, so it re-arms the idle
watchdog and can incorrectly fail a completed response if the socket stays open.
Update the stream-reading logic around the `bytes_stream.next()` loop and
`[DONE]` handling so the loop exits immediately on the terminal sentinel instead
of continuing to another read. Add a regression test in the same provider path
that emits `[DONE]` and then keeps the connection open to verify the stream
completes without tripping `stream_idle_timeout`.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 82583067-bccd-4429-9c9a-93da830c0165

📥 Commits

Reviewing files that changed from the base of the PR and between 4c98a31 and 5da7bfd.

📒 Files selected for processing (5)
  • .env.example
  • src/openhuman/inference/provider/compatible.rs
  • src/openhuman/inference/provider/compatible_stream_native.rs
  • src/openhuman/inference/provider/compatible_tests.rs
  • src/openhuman/inference/provider/compatible_timeout.rs

Comment thread src/openhuman/inference/provider/compatible_stream_native.rs
sanil-23 added a commit to sanil-23/openhuman that referenced this pull request Jul 1, 2026
…e watchdog

CodeRabbit review on tinyhumansai#4393: after the terminal [DONE] sentinel the loop re-armed
the idle watchdog, so a provider that sends [DONE] but holds the socket open
would fail an already-complete response as a retryable stall. Break on [DONE];
add a regression that sends [DONE] then keeps the connection open.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sanil-23 and others added 2 commits July 2, 2026 12:14
…SE hang (tinyhumansai#4269)

The native streaming path parked on `bytes_stream.next().await` when an
upstream flushed 200 then went silent mid-response — cut only by the blunt
whole-request timeout, which tinyhumansai#3856 tells operators to raise up to 1h,
turning the stall into an indefinite hang. Add a per-chunk inactivity
watchdog that resets on every token and aborts a stalled stream with a
retryable error (ReliableProvider replays it); also bound the delta send so
a wedged consumer can't hang the turn on a full channel. New env knob
OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS (default 90s, range 1-3600).

Reproduced against staging via CDP: deep-research streams stalled the full
120s request-timeout window, surfacing as "error decoding response body".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e watchdog

CodeRabbit review on tinyhumansai#4393: after the terminal [DONE] sentinel the loop re-armed
the idle watchdog, so a provider that sends [DONE] but holds the socket open
would fail an already-complete response as a retryable stall. Break on [DONE];
add a regression that sends [DONE] then keeps the connection open.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Research agent intermittently hangs mid-run

1 participant