Realtime models have no response-level retry for recoverable generate_reply failures (parity gap with the pipeline LLM)

### Bug Description

The pipeline (STT–LLM–TTS) path and the Realtime path handle a *recoverable* model failure very differently: the Realtime path has no automatic retry, so the turn is silently dropped.

**Pipeline path (has retry):**

`LLMStream._main_task` wraps inference in a retry loop (`livekit/agents/llm/llm.py`, ~line 214):

```py
for i in range(self._conn_options.max_retry + 1):
    try:
        ... # run inference
    except APIError as e:
        if self._conn_options.max_retry == 0 or not e.retryable:
            raise
        elif i == self._conn_options.max_retry:
            raise APIConnectionError("failed to generate LLM completion after N attempts")
        # else: sleep _interval_for_retry(i) and retry
```

So a transient LLM/API error is retried automatically (default `max_retry=3` → 4 attempts) and the user still gets a reply.

**Realtime path (no retry):**

When a realtime response fails recoverably, the OpenAI plugin emits an error event:

- `RealtimeSession._handle_response_done_but_not_complete`: on `status == "failed"`, calls `_emit_error(APIError(..., retryable=True), recoverable=True)`
- `RealtimeSession._handle_error`: on a server error event, calls `_emit_error(..., recoverable=True)`

…but the code that drives the turn just swallows it and returns, with no re-generation (`livekit/agents/voice/agent_activity.py`):

```py
# _realtime_reply_task (~2927) and _realtime_generation_task (~3493)
try:
    generation_ev = await self._rt_session.generate_reply(...)
except llm.RealtimeError as e:
    logger.error("failed to generate a reply: %s", str(e))
    self._session._update_agent_state("listening")
    return   # <-- turn dropped, never retried
```

Net effect: on a transient realtime failure that is explicitly marked `recoverable=True` (e.g. `inference_rate_limit_exceeded`, a transient `response.failed`), the `SpeechHandle` completes with no output, the agent goes silent for that turn, and the developer has to build their own retry/recovery on top of `session.on("error")`. There is no `conn_options`-style response-level retry equivalent to the pipeline LLM.

(The Realtime model does have a connection-level reconnect loop in RealtimeSession._main_task, but that only covers (re)establishing the websocket — it does not re-issue a failed response.create. See "Additional Context" for a related issue in that loop.)

### Expected Behavior

A recoverable Realtime response failure should be retried automatically — i.e. re-issue the response.create up to conn_options.max_retry (with conn_options retry interval/backoff), mirroring the pipeline LLM's behavior — instead of emitting a single "error" event and silently dropping the turn.

Concretely, parity with the pipeline:

- transient/`recoverable=True` response failure -> re-generate the response (bounded by `max_retry`)
- only after retries are exhausted should the turn be considered failed and surfaced as a non-recoverable error
- developers should get the same "it just retries" guarantee regardless of whether they use a pipeline or a `RealtimeModel`
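
The behavior asked for above could look roughly like the following. This is a minimal sketch under stated assumptions, not LiveKit code: `RecoverableResponseError`, `RetryOptions`, and `generate_with_retry` are hypothetical names, and `generate` stands in for whatever would re-issue `response.create`.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RecoverableResponseError(Exception):
    """Hypothetical stand-in for a realtime error emitted with recoverable=True."""


@dataclass
class RetryOptions:
    # Hypothetical mirror of conn_options; field names follow the issue text.
    max_retry: int = 3
    retry_interval: float = 2.0


async def generate_with_retry(
    generate: Callable[[], Awaitable[T]], opts: RetryOptions
) -> T:
    """Call `generate()` up to max_retry + 1 times, sleeping between attempts."""
    for attempt in range(opts.max_retry + 1):
        try:
            return await generate()
        except RecoverableResponseError as e:
            if attempt == opts.max_retry:
                # Retries exhausted: surface a terminal failure to the caller.
                raise RuntimeError(
                    f"failed to generate a reply after {attempt + 1} attempts"
                ) from e
            await asyncio.sleep(opts.retry_interval)
    raise AssertionError("unreachable")  # the loop always returns or raises
```

With `max_retry=3`, a `generate` callable that fails twice and then succeeds returns its reply on the third call, while one that always fails raises only after the fourth attempt. Non-recoverable exceptions are not caught and propagate immediately.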

### Reproduction Steps

1. Create an `AgentSession` backed by `openai.realtime.RealtimeModel` (server VAD / audio modality).
2. Cause a recoverable response failure during a turn, e.g. hit `inference_rate_limit_exceeded`, otherwise make the server return `response.done` with `status="failed"` (retryable), or have the server send an error event.
3. Observe: the SDK emits an "error" event with `recoverable=True`, the `SpeechHandle` finishes with NO output, and the reply is NEVER regenerated; the agent stays silent for that turn.

Contrast (control): run the SAME flow with a pipeline (STT-LLM-TTS) agent and force a retryable `APIConnectionError` from the LLM. The pipeline retries automatically per `conn_options` and the user still receives a reply.

Minimal sketch:

```text
realtime: session.generate_reply() -> on recoverable failure -> no output, no retry
pipeline: session.generate_reply() -> on retryable failure -> auto-retried, reply produced
```

### Operating System

macOS

### Models Used

- RealtimeModel: OpenAI gpt-realtime (OpenAI Realtime API)
- Pipeline control: Deepgram STT + OpenAI LLM + Cartesia TTS

### Package Versions

```bash
livekit==1.1.5
livekit-agents==1.5.4
livekit-plugins-openai==1.5.4
```

### Session/Room/Call IDs

_No response_

### Proposed Solution

1. Add response-level retry to the Realtime generate path, controlled by `conn_options` (`max_retry` / `retry_interval`), so that a recoverable response failure re-issues `response.create` automatically, symmetric with `LLMStream._main_task`. The retry should:
   - interrupt/cancel the prior failed response and wait until `has_active_generation` is `False` before re-issuing (otherwise the next `response.create` collides with the still-active one: `conversation_already_has_active_response`),
   - clean up any partial assistant item committed by the failed response before regenerating,
   - stop and surface a terminal error once `max_retry` is exhausted.
2. Classify permanently-fatal errors as non-recoverable. Today `_emit_error` hardcodes `recoverable=True` for essentially all realtime failures (the comment even says "we assume optimistically all retryable/recoverable"). Errors like `insufficient_quota` / `invalid_api_key` are not recoverable and should not be retried (by the SDK or by user code).
3. Fix the reconnect loop's retry accounting in `RealtimeSession._main_task`: `num_retries = 0` is reset after every successful reconnect, but for failures where connect/reconnect succeeds and only the generation fails (e.g. quota), the counter never reaches `max_retry`, so the give-up branch is unreachable and it reconnects unboundedly, emitting a recoverable error each cycle.
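
To make the retry-accounting problem concrete, here is a small simulation. It is purely illustrative and models only the counter behavior described above, not the plugin's actual loop; `simulate_reconnects`, `reset_on_connect`, and `cycle_cap` are hypothetical names. The scenario is a session where every connect succeeds and every generation then fails.

```python
def simulate_reconnects(
    max_retry: int, reset_on_connect: bool, cycle_cap: int = 50
) -> int:
    """Return how many reconnect cycles run before the loop gives up.

    `cycle_cap` exists only so the simulation itself terminates.
    """
    num_retries = 0
    cycles = 0
    while cycles < cycle_cap:
        cycles += 1
        # The connect step succeeds in this scenario.
        if reset_on_connect:
            num_retries = 0  # accounting as described in the issue
        # Generation fails after connecting and the session drops.
        if num_retries == max_retry:
            break  # give-up branch
        num_retries += 1
    return cycles


# Resetting on every successful connect never reaches the give-up branch,
# so only the artificial cap stops the loop.
unbounded = simulate_reconnects(max_retry=3, reset_on_connect=True)

# Without that reset the loop gives up after max_retry + 1 cycles.
bounded = simulate_reconnects(max_retry=3, reset_on_connect=False)
```

One possible fix, assuming the loop can observe response outcomes, is to reset the counter only after a response completes successfully rather than after every connect. In this always-failing scenario that is equivalent to never resetting it.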

### Additional Context

Related observations found while working around this (all in the OpenAI realtime plugin):

- All realtime errors are emitted with `recoverable=True` (optimistic), so quota/auth failures look retryable.
- A single root cause often produces TWO `RealtimeModelError` emissions in quick succession: the server error event (e.g. `insufficient_quota`) AND, when the server then closes the websocket, an `APIConnectionError` ("S2S connection closed unexpectedly") from the recv loop. The SDK then reconnects and the cycle repeats. Any per-turn error handler that keys off the first error can miss the second.
- Because there's no built-in response-level retry, every team using `RealtimeModel` has to reimplement retry/recovery on top of `session.on("error")`, and doing so correctly is subtle (avoiding overlapping `response.create`, partial-item cleanup, fatal-vs-transient classification).

A built-in, `conn_options`-driven retry for the realtime response path would remove that burden and bring it to parity with the pipeline LLM.
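
For illustration, a user-side handler that copes with the first two observations might be shaped like this. It is a sketch under assumptions: `FATAL_CODES`, `is_fatal`, and `TurnErrorCoalescer` are hypothetical, the fatal codes are only the two named in this issue, and the one-second window is an arbitrary choice rather than a measured value.

```python
import time
from typing import Callable, Optional

# Assumption: only the codes named in this issue; not an exhaustive list.
FATAL_CODES = frozenset({"insufficient_quota", "invalid_api_key"})


def is_fatal(code: Optional[str]) -> bool:
    """Treat known quota/auth codes as permanently fatal."""
    return code in FATAL_CODES


class TurnErrorCoalescer:
    """Merge errors that arrive within `window` seconds into one incident."""

    def __init__(
        self, window: float = 1.0, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._window = window
        self._clock = clock
        self._last_at: Optional[float] = None
        self._fatal = False

    def observe(self, code: Optional[str]) -> bool:
        """Record an error; return True if it starts a new incident."""
        now = self._clock()
        is_new = self._last_at is None or now - self._last_at > self._window
        if is_new:
            self._fatal = False
        self._last_at = now
        # An incident is fatal if any error within it carried a fatal code.
        self._fatal = self._fatal or is_fatal(code)
        return is_new

    @property
    def fatal(self) -> bool:
        return self._fatal
```

A handler would call `observe(code)` for each error event (passing `None` for the connection-closed error, which has no server code) and retry only when the call returns `True` and `fatal` is `False`. That way a quota error followed by the websocket close counts as one fatal incident instead of two retryable ones.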

### Screenshots and Recordings

_No response_
