### Bug Description
The pipeline (STT→LLM→TTS) path and the Realtime path handle a recoverable model failure very differently: the Realtime path has no automatic retry, so the turn is silently dropped.
Pipeline path (has retry):
LLMStream._main_task wraps inference in a retry loop:

    # livekit/agents/llm/llm.py ~line 214
    for i in range(self._conn_options.max_retry + 1):
        try:
            ...  # run inference
        except APIError as e:
            if self._conn_options.max_retry == 0 or not e.retryable:
                raise
            elif i == self._conn_options.max_retry:
                raise APIConnectionError("failed to generate LLM completion after N attempts")
            # else: sleep _interval_for_retry(i) and retry

So a transient LLM/API error is retried automatically (default max_retry=3 → 4 attempts) and the user still gets a reply.
Realtime path (no retry):
When a realtime response fails recoverably, the OpenAI plugin emits an error event:
- RealtimeSession._handle_response_done_but_not_complete: on status == "failed", calls _emit_error(APIError(..., retryable=True), recoverable=True)
- RealtimeSession._handle_error: on a server error event, calls _emit_error(..., recoverable=True)

...but the code that drives the turn just swallows it and returns, with no re-generation:

    # livekit/agents/voice/agent_activity.py
    # _realtime_reply_task (~2927) and _realtime_generation_task (~3493)
    try:
        generation_ev = await self._rt_session.generate_reply(...)
    except llm.RealtimeError as e:
        logger.error("failed to generate a reply: %s", str(e))
        self._session._update_agent_state("listening")
        return  # <-- turn dropped, never retried

Net effect: on a transient, explicitly-recoverable=True realtime failure (e.g. inference_rate_limit_exceeded, a transient response.failed), the SpeechHandle completes with no output, the agent goes silent for that turn, and the developer has to build their own retry/recovery on top of session.on("error"). There is no conn_options-style response-level retry equivalent to the pipeline LLM.
(The Realtime model does have a connection-level reconnect loop in RealtimeSession._main_task, but that only covers (re)establishing the websocket; it does not re-issue a failed response.create. See "Additional Context" for a related issue in that loop.)
### Expected Behavior
A recoverable Realtime response failure should be retried automatically, i.e. the response.create should be re-issued up to conn_options.max_retry times (with the conn_options retry interval/backoff), mirroring the pipeline LLM's behavior, instead of emitting a single "error" event and silently dropping the turn.
Concretely, parity with the pipeline:
- a transient/recoverable=True response failure -> re-generate the response (bounded by max_retry)
- only after retries are exhausted should the turn be considered failed and surfaced as a non-recoverable error
- developers should get the same "it just retries" guarantee regardless of whether they use a pipeline or a RealtimeModel
### Reproduction Steps
```bash
1. Create an AgentSession backed by openai.realtime.RealtimeModel (server VAD / audio modality).
2. Cause a recoverable response failure during a turn, e.g. hit inference_rate_limit_exceeded, or otherwise make the server return response.done with status="failed" (retryable), or send a server error event.
3. Observe: the SDK emits an "error" event with recoverable=True, the SpeechHandle finishes with NO output, and the reply is NEVER regenerated; the agent stays silent for that turn.

Contrast (control): run the SAME flow with a pipeline (STT-LLM-TTS) agent and force a retryable APIConnectionError from the LLM. The pipeline retries automatically per conn_options and the user still receives a reply.

Minimal sketch:
  realtime: session.generate_reply() -> on recoverable failure -> no output, no retry
  pipeline: session.generate_reply() -> on retryable failure -> auto-retried, reply produced
```
### Operating System
macOS
### Models Used
RealtimeModel: openai gpt-realtime (OpenAI Realtime API)
Pipeline control: Deepgram STT + OpenAI LLM + Cartesia TTS
### Package Versions
- livekit==1.1.5
- livekit-agents==1.5.4
- livekit-plugins-openai==1.5.4
### Session/Room/Call IDs
No response
### Proposed Solution
Add response-level retry to the Realtime generate path, controlled by conn_options (max_retry / retry_interval), so a recoverable response failure re-issues response.create automatically, symmetric with LLMStream._main_task. The retry should:
- interrupt/cancel the prior failed response and wait until has_active_generation is False before re-issuing (otherwise the next response.create collides with the still-active one: conversation_already_has_active_response),
- clean up any partial assistant item committed by the failed response before regenerating,
- stop and surface a terminal error once max_retry is exhausted.
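
The three requirements above could be sketched roughly as follows. This is a minimal, self-contained illustration against a toy session, not the plugin's implementation: `RecoverableResponseError`, `RetryOptions`, `FakeRealtimeSession`, `cancel_active`, `drop_partial_items` and `generate_with_retry` are all hypothetical names introduced here; only `max_retry`, `retry_interval` and `has_active_generation` come from the text above.

```python
import asyncio
from dataclasses import dataclass


class RecoverableResponseError(Exception):
    """Hypothetical stand-in for a realtime failure emitted with recoverable=True."""


@dataclass
class RetryOptions:
    # Hypothetical mirror of conn_options; field names follow the issue text.
    max_retry: int = 3
    retry_interval: float = 0.0


class FakeRealtimeSession:
    """Toy session that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self._failures = failures
        self.has_active_generation = False
        self.partial_items = []
        self.attempts = 0

    async def generate_reply(self):
        self.attempts += 1
        self.has_active_generation = True
        if self._failures > 0:
            self._failures -= 1
            # Simulate a partial assistant item left behind by the failed response.
            self.partial_items.append(f"partial-{self.attempts}")
            raise RecoverableResponseError("response.done status=failed")
        self.has_active_generation = False
        return "reply"

    async def cancel_active(self):
        self.has_active_generation = False

    def drop_partial_items(self):
        self.partial_items.clear()


async def generate_with_retry(session, opts):
    for attempt in range(opts.max_retry + 1):
        try:
            return await session.generate_reply()
        except RecoverableResponseError as e:
            # 1. Cancel the failed response and wait until nothing is active,
            #    so the next attempt cannot collide with it.
            await session.cancel_active()
            while session.has_active_generation:
                await asyncio.sleep(0.01)
            # 2. Remove any partial item the failed response committed.
            session.drop_partial_items()
            # 3. Give up with a terminal error once retries are exhausted.
            if attempt == opts.max_retry:
                raise RuntimeError(f"failed after {attempt + 1} attempts") from e
            await asyncio.sleep(opts.retry_interval)
    raise AssertionError("unreachable")
```

With `failures=2` and `max_retry=3`, the toy session succeeds on its third attempt; with more failures than `max_retry` allows, the loop raises a terminal `RuntimeError` instead of dropping the turn silently.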
Classify permanently-fatal errors as non-recoverable. Today _emit_error hardcodes recoverable=True for essentially all realtime failures (the comment even says "we assume optimistically all retryable/recoverable"). Errors like insufficient_quota / invalid_api_key are not recoverable and should not be retried (by the SDK or by user code).
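
One possible shape for that classification, as a sketch only: the set of fatal codes and the `is_recoverable` helper are hypothetical, and just the two codes named above are taken from this report. A real list would need to be checked against the provider's documented error codes.

```python
from typing import Optional

# Hypothetical deny-list of codes that retrying cannot fix. Only these two
# codes are named in the issue; the set itself is an assumption.
FATAL_ERROR_CODES = frozenset({"insufficient_quota", "invalid_api_key"})


def is_recoverable(error_code: Optional[str]) -> bool:
    """Return False for permanently fatal codes, True for everything else."""
    return error_code not in FATAL_ERROR_CODES
```

The result could then be passed as the `recoverable` flag when emitting the error, rather than a hardcoded `True`.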
Fix the reconnect loop's retry accounting in RealtimeSession._main_task: num_retries = 0 is reset after every successful reconnect, but for failures where connect/reconnect succeeds and only the generation fails (e.g. quota), the counter never reaches max_retry, so the give-up branch is unreachable and it reconnects unboundedly, emitting a recoverable error each cycle.
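
The accounting problem can be illustrated with a simplified simulation. This is a model of the behaviour described above, not the actual `_main_task` code; `run_reconnect_loop` and its parameters are hypothetical. Every cycle connects successfully and then fails at generation time.

```python
def run_reconnect_loop(max_retry, reset_on_reconnect, max_cycles=50):
    """Simulate cycles where connecting succeeds and generation always fails.

    Returns the cycle on which the loop gives up, or max_cycles if it never
    gives up within the simulation budget.
    """
    num_retries = 0
    for cycle in range(1, max_cycles + 1):
        # The (re)connect succeeds.
        if reset_on_reconnect:
            num_retries = 0  # the reset described in the issue
        # The generation then fails (e.g. quota) and the connection drops.
        if num_retries >= max_retry:
            return cycle  # give-up branch
        num_retries += 1
    return max_cycles
```

In this model, resetting the counter on every successful reconnect means the give-up branch is never taken for `max_retry >= 1`, while counting failures across reconnects gives up on cycle `max_retry + 1`. One option is to reset the counter only after a response actually completes.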
### Additional Context
Related observations found while working around this (all in the OpenAI realtime plugin):
- All realtime errors are emitted with recoverable=True (optimistic), so quota/auth failures look retryable.
- A single root cause often produces TWO RealtimeModelError emissions in quick succession: the server error event (e.g. insufficient_quota) AND, when the server then closes the websocket, an APIConnectionError "S2S connection closed unexpectedly" from the recv loop, followed by the SDK reconnecting and repeating. Any per-turn error handler that keys off the first error can miss the second.
- Because there's no built-in response-level retry, every team using RealtimeModel has to reimplement retry/recovery on top of session.on("error"), and doing so correctly is subtle (avoiding overlapping response.create, partial-item cleanup, fatal-vs-transient classification).
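
As an illustration of the double-emission point, a user-side handler might coalesce errors that arrive close together into a single incident before deciding whether to retry. The `ErrorCoalescer` class and its one-second window are hypothetical choices for this sketch, not part of the SDK.

```python
class ErrorCoalescer:
    """Groups errors that arrive within `window` seconds of the previous one."""

    def __init__(self, window=1.0):
        self._window = window
        self._last_ts = None
        self.incidents = []  # each incident is a list of error labels

    def on_error(self, label, ts):
        # An error close enough to the previous one joins the same incident;
        # otherwise it starts a new incident.
        if self._last_ts is not None and ts - self._last_ts <= self._window:
            self.incidents[-1].append(label)
        else:
            self.incidents.append([label])
        self._last_ts = ts
```

In a real handler the timestamp would typically come from `time.monotonic()`; it is passed in explicitly here so the behaviour is easy to check.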
A built-in, conn_options-driven retry for the realtime response path would remove that burden and bring it to parity with the pipeline LLM.
### Screenshots and Recordings
No response