Streaming VAD and latency

In a live voice product, VAD is part of the user interface. It decides when the system listens, when it answers, when it interrupts itself, and how much silence the user has to wait through.

The one idea

Streaming VAD must make decisions on incomplete audio. Faster responses usually mean less certainty, so endpointing is as much a product design tradeoff as a latency one.

Chunk size sets the floor

Streaming systems process audio in chunks. A 20 ms chunk can produce decisions sooner than a 200 ms chunk, but it increases scheduling overhead and its per-chunk decisions can be noisier. Larger chunks are easier to batch and classify, but they add delay before the system can react.

The right chunk size depends on the path around VAD: capture buffering, network transport, server queues, ASR streaming, LLM response time, and TTS playback. VAD latency is only one piece, but it sits early in the chain, so every later stage inherits its delay. Frame duration follows directly from sample rate: at 16 kHz, a 320-sample frame is 20 ms (Audio Foundations, lesson 02). Buffer sizes and callback cadence on mobile and WebRTC stacks should match those frame boundaries; otherwise VAD decisions drift out of alignment with the audio you think you are classifying.
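
The frame arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and not taken from any particular audio library.

```python
def frame_duration_ms(num_samples: int, sample_rate_hz: int) -> float:
    """Duration of one frame in milliseconds."""
    return 1000.0 * num_samples / sample_rate_hz


def samples_per_frame(duration_ms: float, sample_rate_hz: int) -> int:
    """Samples in a frame of the given duration; rejects fractional frames."""
    samples = sample_rate_hz * duration_ms / 1000.0
    if samples != int(samples):
        raise ValueError("frame duration is not a whole number of samples")
    return int(samples)


def is_frame_aligned(buffer_samples: int, frame_samples: int) -> bool:
    """True when a capture buffer holds a whole number of VAD frames."""
    return buffer_samples % frame_samples == 0


if __name__ == "__main__":
    print(frame_duration_ms(320, 16000))   # 320 samples at 16 kHz
    print(is_frame_aligned(1024, 320))     # a 1024-sample callback vs 320-sample frames
```

A capture callback that delivers 1024 samples does not divide evenly into 320-sample frames, so the leftover samples have to be carried into the next callback rather than dropped or zero-padded.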

Endpointing delay is not only model inference. It includes chunking, silence duration, hangover, buffering, and downstream finalization.

Worked latency budget

Stack the delays instead of guessing. A typical streaming voice agent might budget:

20 ms capture frame (one 16 kHz chunk at 320 samples)
10 ms neural VAD inference on CPU (Silero-class ONNX on a server core)
30 ms endpoint lookahead (hangover plus silence confirmation before declaring turn end)

That is 60 ms of acoustic policy delay before downstream ASR finalization, network, and LLM time even start. The budget grows quickly: with the other stages unchanged, a denoiser with 20 ms of lookahead raises the total to 80 ms, and a 30 ms chunk raises it to 70 ms. Write the sum explicitly in design docs so teams do not optimize only model inference.

Example budget for one endpoint decision. Measure your own path; do not treat 60 ms as a universal target.
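
One way to keep the sum explicit is to store the budget as data and compute the total. The sketch below uses the three example figures from this lesson; the `Stage` type and helper names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    delay_ms: float


def total_delay_ms(stages) -> float:
    """Sum the per-stage delays of one endpoint decision."""
    return sum(stage.delay_ms for stage in stages)


def format_budget(stages) -> str:
    """Render the budget as plain text for a design doc."""
    lines = [f"{stage.name}: {stage.delay_ms:g} ms" for stage in stages]
    lines.append(f"total: {total_delay_ms(stages):g} ms")
    return "\n".join(lines)


# Example figures from this lesson, not measurements.
BUDGET = [
    Stage("capture frame", 20),
    Stage("VAD inference", 10),
    Stage("endpoint lookahead", 30),
]

if __name__ == "__main__":
    print(format_budget(BUDGET))
```

Adding a stage, such as a denoiser with its own lookahead, then changes the total in one visible place.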

Endpointing is not silence detection

Ending a user turn is harder than detecting that the current frame is quiet. People pause mid-sentence. They hesitate. They say "uh" and continue. A voice agent that responds on every short pause feels interruptive. A voice agent that waits too long feels sluggish.

Good endpointing uses silence duration, confidence, transcript state, punctuation hints from streaming ASR, and sometimes dialogue context. VAD supplies the acoustic signal, but the turn-taking decision can combine multiple signals.
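
A rule that combines these signals might look like the sketch below. Every threshold and field name here is a hypothetical placeholder, not a recommended value; real systems tune them against recorded conversations.

```python
from dataclasses import dataclass


@dataclass
class TurnState:
    silence_ms: float               # trailing silence reported by VAD
    vad_speech_prob: float          # speech probability of the latest frame
    transcript_ends_sentence: bool  # punctuation hint from streaming ASR


def should_end_turn(
    state: TurnState,
    short_silence_ms: float = 300.0,  # assumed: enough if ASR sees a sentence end
    long_silence_ms: float = 900.0,   # assumed: enough on its own
    quiet_prob: float = 0.3,          # assumed: below this the frame counts as quiet
) -> bool:
    """Decide whether the user's turn has ended."""
    if state.vad_speech_prob >= quiet_prob:
        return False  # still sounds like speech
    if state.silence_ms >= long_silence_ms:
        return True   # long silence ends the turn regardless of transcript
    # A short pause only ends the turn when the transcript looks complete.
    return state.transcript_ends_sentence and state.silence_ms >= short_silence_ms
```

The two silence thresholds express the tradeoff directly: a mid-sentence pause has to last much longer than a pause after a complete sentence before the agent responds.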

Barge-in needs fast speech start

Barge-in means the user can interrupt the system while TTS is playing. The VAD must detect user speech quickly and distinguish it from the system's own audio leaking into the microphone. Echo cancellation helps, but it is not perfect.

For barge-in, missing the first word is more damaging than an occasional false alarm. Many products make the speech-start threshold more sensitive during assistant playback and rely on other guards to avoid stopping TTS for every echo artifact.

Barge-in pattern

During assistant playback, run VAD on echo-cancelled mic audio, keep a short pre-roll buffer, require a few consecutive speech frames, then cancel TTS only if the speech-start event is recent enough to be trusted.
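
The pattern can be sketched as a small state machine. The class below assumes the caller supplies echo-cancelled frames along with a per-frame speech probability and timestamp; the class name, parameters, and default values are all hypothetical.

```python
from collections import deque


class BargeInDetector:
    """Confirms speech start after a run of consecutive speech frames."""

    def __init__(self, threshold=0.5, min_speech_frames=3,
                 preroll_frames=10, max_event_age_ms=200.0):
        self.threshold = threshold
        self.min_speech_frames = min_speech_frames
        self.max_event_age_ms = max_event_age_ms
        self.preroll = deque(maxlen=preroll_frames)  # recent audio for ASR
        self.run = 0                  # current count of consecutive speech frames
        self._candidate_start = None  # timestamp of first frame in the run
        self.speech_start_ms = None   # set once a run is confirmed

    def push(self, frame, speech_prob, frame_time_ms) -> bool:
        """Feed one frame; returns True on the frame that confirms speech start."""
        self.preroll.append(frame)
        if speech_prob < self.threshold:
            self.run = 0  # a quiet frame breaks the run
            return False
        self.run += 1
        if self.run == 1:
            self._candidate_start = frame_time_ms
        if self.run == self.min_speech_frames:
            self.speech_start_ms = self._candidate_start
            return True
        return False

    def should_cancel_tts(self, now_ms) -> bool:
        """Cancel only if a confirmed speech start is recent enough to trust."""
        if self.speech_start_ms is None:
            return False
        return now_ms - self.speech_start_ms <= self.max_event_age_ms
```

The pre-roll buffer keeps the frames that preceded confirmation, so the first word can still be sent to ASR even though the decision arrived a few frames after the user started speaking.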

Backpressure and buffering

Streaming audio is a real pipeline. If network or downstream processing slows, buffers grow. If buffers grow, VAD decisions arrive late. If you drop audio to catch up, you may cut speech. Production systems need explicit policies for queue limits, cancellation, and stale chunks.

One useful rule: do not let old VAD decisions control current UI state without checking their timestamp. A speech-end event that arrives 700 ms late may already be wrong from the user's perspective.
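
A timestamp check of this kind can be very small. The sketch below assumes each VAD event carries the time of the audio it describes and that the UI compares it against the same clock; the 300 ms limit and the state names are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VadEvent:
    kind: str             # "speech_start" or "speech_end"
    audio_time_ms: float  # time of the audio the event refers to


def is_fresh(event: VadEvent, now_ms: float, max_age_ms: float = 300.0) -> bool:
    """True when the event is recent enough to act on."""
    return now_ms - event.audio_time_ms <= max_age_ms


def apply_event(ui_state: str, event: VadEvent, now_ms: float,
                max_age_ms: float = 300.0) -> str:
    """Return the next UI state, ignoring stale events."""
    if not is_fresh(event, now_ms, max_age_ms):
        return ui_state  # stale: keep the current state
    return "listening" if event.kind == "speech_start" else "responding"
```

With these assumed values, a speech-end event that arrives 700 ms after the audio it describes leaves the UI state unchanged instead of triggering a response.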

Engineering reality

Measure endpoint delay from the user's last spoken sound to the system's decision to respond. Measuring only model inference time hides buffering, hangover, network, and ASR finalization delays.
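
As a sketch of that measurement, assume you have labeled test recordings with the time of the last spoken sound and logs with the time of the system's turn-end decision, both on the same clock. The helper names below are illustrative.

```python
from statistics import median


def endpoint_delay_ms(last_speech_ms: float, decision_ms: float) -> float:
    """End-to-end endpoint delay for one turn."""
    return decision_ms - last_speech_ms


def summarize(delays_ms) -> dict:
    """Median and worst-case delay across a set of turns."""
    return {"median_ms": median(delays_ms), "max_ms": max(delays_ms)}
```

Reporting the maximum alongside the median matters because buffering and queueing problems tend to show up in the slowest turns rather than the typical one.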

Checkpoint

You're ready for the next lesson if you can answer these from memory:

How does chunk size relate to sample rate and frame duration?
How would you budget frame, inference, and lookahead delay for a voice agent?
Why is endpointing more than detecting silence?
How can backpressure make correct VAD decisions arrive too late?

Quick check

Only the VAD model's inference time
Time from the user's last spoken sound to the system's turn-end decision
Only the TTS playback duration