fix(room_io): replay unplayed audio tail on false interruptions #1884
Open
toubatbrian wants to merge 2 commits into
pause() cleared the entire native AudioSource queue, permanently dropping up to queueSizeMs of generated-but-unplayed audio. On a false interruption (pause then resume) those frames were never replayed, so up to ~1s of agent speech was lost mid-sentence from both the live call and the recording. Keep a rolling window of recently pushed frames, capture the unplayed tail on pause(), and replay it on resume(), while discarding it on a real interruption (clearBuffer()). Also cap the default room output queue to 200ms to match Python. Co-authored-by: Cursor <cursoragent@cursor.com>
🦋 Changeset detected. Latest commit: 53e6381. The changes in this PR will be included in the next version bump. This PR includes changesets to release 35 packages.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8c89a2ae04
…izeMs Clear replayFrames unconditionally when a playback segment finishes. It was only cleared on interruption, so an end-of-utterance false interruption (which completes the segment non-interrupted) left the captured tail behind and prepended it to the next utterance. Mid-utterance false interruptions still recover their tail because the next captureFrame consumes it before flush. Also note the queueSizeMs default change (1000ms -> 200ms) in the docstring and changeset, and add a regression test for the end-of-utterance leak. Co-authored-by: Cursor <cursoragent@cursor.com>
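The end-of-utterance leak described in this commit can be illustrated with a toy model. This is a hypothetical sketch, not the code in _output.ts: ToyOutput, its string frames, finishSegment, and the clearOnlyOnInterrupt flag are invented for illustration, and only the names replayFrames, captureFrame, and pause mirror the PR.

```typescript
// Toy model of the replay-tail lifecycle (hypothetical, not the real ParticipantAudioOutput).
class ToyOutput {
  queue: string[] = [];
  private replayFrames: string[] = [];
  private clearOnlyOnInterrupt: boolean;

  // clearOnlyOnInterrupt = true models the behaviour before this commit.
  constructor(clearOnlyOnInterrupt: boolean) {
    this.clearOnlyOnInterrupt = clearOnlyOnInterrupt;
  }

  captureFrame(frame: string): void {
    // Any captured tail is replayed ahead of the new frame, then consumed.
    this.queue.push(...this.replayFrames, frame);
    this.replayFrames = [];
  }

  pause(): void {
    // Capture the unplayed tail, then clear the queue.
    this.replayFrames = this.queue;
    this.queue = [];
  }

  finishSegment(interrupted: boolean): void {
    this.queue = []; // model the segment ending with nothing left queued
    if (interrupted || !this.clearOnlyOnInterrupt) {
      this.replayFrames = [];
    }
  }
}

// Returns what the next utterance's queue looks like after an
// end-of-utterance false interruption.
function nextUtteranceQueue(clearOnlyOnInterrupt: boolean): string[] {
  const out = new ToyOutput(clearOnlyOnInterrupt);
  out.captureFrame('a1');
  out.captureFrame('a2');
  out.pause(); // false interruption at the very end of the utterance
  out.finishSegment(false); // segment completes non-interrupted
  out.captureFrame('b1'); // first frame of the next utterance
  return out.queue;
}
```

In this model, clearing only on interruption leaves the stale a1, a2 tail in front of b1, while clearing unconditionally at segment end yields just b1.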
1. Problem
On a false interruption, the agent loses unplayed audio.
ParticipantAudioOutput.pause() calls this.audioSource.clearQueue(), which permanently discards every frame already pushed to the native rtc-node AudioSource queue but not yet played. The agent pauses playback on any VAD-detected overlapping speech (interruptByAudioActivity). When that speech turns out not to be a real turn, the framework resume()s, but the cleared frames were never re-pushed, so that tail is gone (audible in both the live track and the recording, since both are fed from the same AudioSource).
The JS room output also never set queueSizeMs, inheriting rtc-node's 1000ms default (Python uses 200ms), so the worst-case discardable tail was up to 1s.
2. Analysis & fixes
pause() → clearQueue() is correct for a real interruption (the user cut the agent off, so the unplayed tail should die). The bug is that the same code path runs for a false interruption, where the audio should be preserved and resumed.
Fix in agents/src/voice/room_io/_output.ts:
- Keep a rolling window of recently pushed frames (recentFrames, capped at queueSizeMs + headroom).
- On pause(), before clearQueue(), capture the unplayed tail (the last audioSource.queuedDuration worth of frames) into replayFrames.
- On the next captureFrame() after resume() (false interruption), replay replayFrames before pushing new audio, so nothing is lost. Replayed frames are not re-counted in pushedDuration.
- On a real interruption (clearBuffer()) and at segment end, discard replayFrames.
Secondary fix in agents/src/voice/room_io/room_io.ts:
- Set DEFAULT_ROOM_OUTPUT_OPTIONS.queueSizeMs = 200 to match Python and bound the worst-case discardable tail.
3. Validations
Unit tests (_output.test.ts, +3): false interruption replays the exact unplayed tail (zero loss); real interruption (clearBuffer) discards it; no-op when nothing was queued.
Live runtime validation (cue-cli voice mode, real ParticipantAudioOutput + rtc-node AudioSource), false vs real interruption mid-utterance:
- queuedMs:245.5 → captured 296ms (3 frames)
- replayCount:3, replayMs:296
- clearBuffer discardedReplayFrames:3
- interrupted:false, replayFramesAtEnd:0 (full audio)
Scope / what this does NOT fix
This PR fixes the false-interruption audio loss (affects live + recording). It does not address the customer report in RM_oPQpspNxqjtb, where the cut is observability-only ("not actual calls") and occurs on every turn regardless of interruption. Runtime A/B (1.4.7 vs this branch) rules out the queue cap as that cause (only ~130–200ms ever queued at interruption, ~no difference between 1000ms and 200ms) and rules out the handoff-drain teardown (audio drains fully, clearBuffer never fires). That per-turn recording clip traces to recorder_io's wall-clock clamp of playbackPosition (onPlaybackFinished clamps to wall-clock elapsed, which runs slightly short of the actual audio duration) and is being investigated separately.
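
For reference, the capture-and-replay flow from section 2 can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation in _output.ts: the Frame type, the ReplayTailBuffer class, and its method names are hypothetical, and rounding the captured tail up to whole frames is inferred from the logged "queuedMs:245.5 → captured 296ms (3 frames)" rather than confirmed by the source. The 100ms frame duration is chosen only to keep the arithmetic readable.

```typescript
// Hypothetical frame type; the real code works with rtc-node audio frames.
interface Frame {
  id: number;
  durationMs: number;
}

class ReplayTailBuffer {
  private recentFrames: Frame[] = [];
  private recentMs = 0;
  private replayFrames: Frame[] = [];
  private windowMs: number;

  constructor(queueSizeMs: number, headroomMs: number) {
    this.windowMs = queueSizeMs + headroomMs;
  }

  // Record a frame as it is pushed; drop the oldest frames while the
  // remaining window still covers queueSizeMs + headroom.
  recordPushed(frame: Frame): void {
    this.recentFrames.push(frame);
    this.recentMs += frame.durationMs;
    while (this.recentFrames.length > 1) {
      const oldest = this.recentFrames[0]!;
      if (this.recentMs - oldest.durationMs < this.windowMs) break;
      this.recentFrames.shift();
      this.recentMs -= oldest.durationMs;
    }
  }

  // On pause(): walk back from the newest frame until the captured frames
  // cover queuedMs (assumption: round up to whole frames).
  captureTail(queuedMs: number): void {
    const tail: Frame[] = [];
    let covered = 0;
    for (let i = this.recentFrames.length - 1; i >= 0 && covered < queuedMs; i--) {
      const frame = this.recentFrames[i]!;
      tail.unshift(frame);
      covered += frame.durationMs;
    }
    this.replayFrames = tail;
  }

  // On the next captureFrame() after resume(): hand the tail back exactly once.
  takeReplay(): Frame[] {
    const frames = this.replayFrames;
    this.replayFrames = [];
    return frames;
  }

  // On clearBuffer() or at segment end: drop the tail, returning how many frames were discarded.
  discard(): number {
    const dropped = this.replayFrames.length;
    this.replayFrames = [];
    return dropped;
  }
}

// False interruption: ten 100ms frames pushed, 245.5ms still queued at pause().
const buf = new ReplayTailBuffer(200, 100);
for (let id = 0; id < 10; id++) buf.recordPushed({ id, durationMs: 100 });
buf.captureTail(245.5);
const replayed = buf.takeReplay().map((f) => f.id);
```

Here the last three frames are captured and handed back once; a second takeReplay() returns nothing, captureTail(0) captures nothing, and calling discard() instead of takeReplay() models a real interruption.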