Summary
On outbound SIP calls with carrier early media, AMD (answering machine detection) emits uncertain / detection_timeout with speech_duration=0 before the callee answers. The agent then falls through and treats a voicemail (or a late-answering human) as a live human. This is the detection-timeout analog of the no-speech-timer bug already fixed in #5848.
Root cause
#5848 correctly deferred the no-speech timer to the SIP active state, and its own description notes the remainder: "Detection timeout is still armed when track is subscribed."
In voice/amd/detector.py _setup, start_detection_timer() is still called immediately after wait_for_track_publication, with the comment:
# outer budget runs from track-up so AMD bails out even if the
# call never reaches the active state
self._classifier.start_detection_timer()
With carrier early media (e.g. Twilio), the audio track is subscribed during ringback, not at answer, so the detection_timeout budget (default 20s) runs down during the ringing phase. US cell voicemail commonly answers at ~25-30s, which is after the 20s budget has already expired. The classifier then emits uncertain / detection_timeout with zero speech, before the SIP leg ever reaches active.
start_listening() (the no-speech timer + transcript processing) is correctly gated on sip.callStatus == "active" via _wait_for_sip_answer; only the outer detection timer is not. (Verified against main @ b2eefbe.)
Reproduction (real call)
| t (after dial) | event |
| --- | --- |
| +6s | detection budget armed (audio track up, early media) |
| +7s | SIP ringing |
| +26s | detection budget expires -> uncertain / detection_timeout, zero audio |
| +32s | SIP active (mailbox answers, 6s too late) |
| +36s | greeting transcribed: "...automated voice messaging system" |
The same early-media reasoning that justified deferring the no-speech timer in #5848 applies identically to the detection timer: a clock armed at track subscription is poisoned by the pre-answer ringing phase.
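The arithmetic behind the timeline can be checked with a few lines of Python. This is only an illustrative sketch: the timestamps come from the reproduction table, and `budget_left_at_answer` is a hypothetical helper, not part of livekit-agents.

```python
# Timestamps in seconds after dial, taken from the reproduction table.
TRACK_UP = 6.0            # audio track subscribed (early media)
SIP_ACTIVE = 32.0         # mailbox answers
DETECTION_TIMEOUT = 20.0  # default budget cited in the root-cause section


def budget_left_at_answer(anchor: float, answer: float, timeout: float) -> float:
    """Seconds of detection budget remaining when the call is answered.

    Hypothetical helper. A negative value means the budget expired
    before the answer.
    """
    return (anchor + timeout) - answer


# Current behavior: the budget is anchored at track-up.
track_anchored = budget_left_at_answer(TRACK_UP, SIP_ACTIVE, DETECTION_TIMEOUT)
# Proposed behavior: the budget is anchored at the answer itself.
answer_anchored = budget_left_at_answer(SIP_ACTIVE, SIP_ACTIVE, DETECTION_TIMEOUT)

print(track_anchored)   # -6.0: expired 6s before the mailbox answered
print(answer_anchored)  # 20.0: the full budget is available after answer
```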
Proposed fix
Extend #5848's answer-anchoring to the detection timeout, symmetric to the no-speech timer. For SIP participants, start (or reset) the detection timer from _start_listening() (i.e. at active) rather than at track-up, so the budget measures post-answer detection time. The never-answered hang-guard that the track-up start currently provides can be preserved with a separate, longer pre-answer ring-wait bound, so a call that never reaches active still bails without consuming the detection budget during ringback.
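One possible shape for the two-clock variant, as a minimal model with explicit timestamps. Everything here is an assumption for illustration: `AnswerAnchoredBudget`, `Outcome`, and the 60s ring-wait value are hypothetical and do not mirror the actual detector or classifier API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    PENDING = "pending"
    RING_TIMEOUT = "ring_timeout"            # call never reached active
    DETECTION_TIMEOUT = "detection_timeout"  # answered, not classified in time


@dataclass
class AnswerAnchoredBudget:
    """Hypothetical model: a pre-answer bound plus a post-answer budget."""

    ring_wait_timeout: float  # pre-answer hang-guard (assumed value)
    detection_timeout: float  # post-answer detection budget
    track_up_at: Optional[float] = None
    answered_at: Optional[float] = None

    def on_track_up(self, now: float) -> None:
        # Early media may subscribe the track during ringback, so only
        # the ring-wait bound starts here.
        if self.track_up_at is None:
            self.track_up_at = now

    def on_answer(self, now: float) -> None:
        # SIP leg reached active: the detection budget starts here.
        if self.answered_at is None:
            self.answered_at = now

    def poll(self, now: float) -> Outcome:
        if self.answered_at is not None:
            if now - self.answered_at >= self.detection_timeout:
                return Outcome.DETECTION_TIMEOUT
            return Outcome.PENDING
        if (
            self.track_up_at is not None
            and now - self.track_up_at >= self.ring_wait_timeout
        ):
            return Outcome.RING_TIMEOUT
        return Outcome.PENDING


# Replay of the reproduced call: track up at +6s, answer at +32s.
budget = AnswerAnchoredBudget(ring_wait_timeout=60.0, detection_timeout=20.0)
budget.on_track_up(6.0)
state_at_26 = budget.poll(26.0)  # ringback no longer consumes the budget
budget.on_answer(32.0)
state_at_36 = budget.poll(36.0)  # greeting can still be classified
state_at_52 = budget.poll(52.0)  # 20s after answer

# A call that never reaches active still bails via the ring-wait bound.
never_answered = AnswerAnchoredBudget(ring_wait_timeout=60.0, detection_timeout=20.0)
never_answered.on_track_up(6.0)
state_at_66 = never_answered.poll(66.0)
```

The model is deliberately stateless after a timeout. A real implementation would also cancel the ring-wait timer on answer and stop polling once either bound fires.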
Happy to open a PR for whichever shape you prefer (the pre-answer-bound-vs-reset choice is yours to call).
Workaround
Callers can pass detection_options={"timeout": 45.0} so the budget outlasts a realistic ring, but that is a ceiling-guess, not a fix: a longer ring re-breaks it.
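To make the limit of the workaround concrete, here is a small check under the same track-anchored model. `workaround_survives` is a hypothetical helper, and the +55s answer time is an invented example, not a measured call.

```python
def workaround_survives(answer_at: float, track_up_at: float, timeout: float) -> bool:
    """True if a track-anchored budget is still running when the call is answered."""
    return answer_at < track_up_at + timeout


TRACK_UP = 6.0  # seconds after dial, from the reproduction above

# The reproduced call (answer at +32s) is covered by timeout=45.0 ...
covers_repro = workaround_survives(32.0, TRACK_UP, 45.0)
# ... but a hypothetical ring that lasts until +55s is not.
covers_long_ring = workaround_survives(55.0, TRACK_UP, 45.0)

print(covers_repro, covers_long_ring)  # True False
```

Even when the workaround holds, the post-answer budget shrinks with ring length: in the reproduced call only 19s of the 45s would remain at answer.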
Version
livekit-agents 1.5.17 (behavior confirmed unchanged on main @ b2eefbe, 2026-06-22).