What is voice activity detection?

Voice activity detection, usually called VAD, answers a narrow question: does this slice of audio contain speech? In production systems, that narrow answer decides when to record, transcribe, wake up, stop listening, or send a turn to an agent.

The one idea

VAD is not transcription. It is the gate before transcription. A good VAD lets speech through with low delay and blocks silence, noise, and non-speech audio without cutting real words.

The job of VAD

A microphone produces a continuous stream of samples. Most downstream speech systems do not want to treat every sample as equally important. Speech recognition, diarization, streaming agents, recording systems, and analytics pipelines need to know which parts likely contain human speech.

VAD converts raw audio into decisions over time. At the frame level it may output a score such as "speech probability is 0.82." At the system level it usually produces spans: speech started at 4.12 seconds, speech ended at 6.74 seconds. Those spans become chunks for ASR, turns for a voice agent, or regions to store and review.
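As a rough illustration of the step from frame scores to spans, the sketch below groups consecutive frames above a threshold into start/end times. The 20 ms frame hop, the 0.5 threshold, and the function name `scores_to_spans` are assumptions made for the example, not values taken from any particular VAD.

```python
from typing import List, Tuple

FRAME_MS = 20  # assumed frame hop: one score every 20 ms


def scores_to_spans(scores: List[float], threshold: float = 0.5,
                    frame_ms: int = FRAME_MS) -> List[Tuple[int, int]]:
    """Group consecutive frames scoring >= threshold into (start_ms, end_ms) spans."""
    spans = []
    start = None  # index of the first frame of the currently open span
    for i, score in enumerate(scores):
        is_speech = score >= threshold
        if is_speech and start is None:
            start = i                                   # span opens here
        elif not is_speech and start is not None:
            spans.append((start * frame_ms, i * frame_ms))  # span closes here
            start = None
    if start is not None:                               # audio ended mid-speech
        spans.append((start * frame_ms, len(scores) * frame_ms))
    return spans


scores = [0.05, 0.10, 0.82, 0.91, 0.88, 0.30, 0.07, 0.76, 0.81, 0.12]
print(scores_to_spans(scores))  # [(40, 100), (140, 180)]
```

This is the naive version: every dip below the threshold closes a span, so real systems add smoothing on top of it.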

The result sounds simple, but it carries product weight. If VAD misses speech, the user goes unheard. If VAD fires on noise, the system wastes compute and the recognizer may hallucinate text from background audio. If VAD waits too long to decide, a realtime product feels slow.

Where it sits in the pipeline

A typical voice pipeline starts with capture, resampling, channel handling, echo cancellation, gain control, and sometimes denoising. VAD usually runs early, before expensive speech recognition. Its decision may control whether audio is buffered, discarded, streamed to ASR, or used to mark the end of a user turn.

In batch transcription, VAD can remove long silence and split recordings into manageable segments. In realtime assistants, it becomes part of turn-taking: start listening when the user speaks, keep listening through short pauses, and decide when the user is done. In call analytics, it may help measure talk time and avoid running heavier models on hold music.
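A minimal sketch of the batch case, assuming speech spans are already available in milliseconds: it slices a recording into one chunk per span and reports how much audio would still reach the recognizer. The helper names, the 16 kHz sample rate, and the placeholder audio are all hypothetical.

```python
from typing import List, Sequence, Tuple


def speech_chunks(samples: Sequence[int], spans_ms: List[Tuple[int, int]],
                  sample_rate: int = 16000) -> List[Sequence[int]]:
    """Return one slice of `samples` per speech span; audio outside spans is dropped."""
    chunks = []
    for start_ms, end_ms in spans_ms:
        start = start_ms * sample_rate // 1000
        end = min(end_ms * sample_rate // 1000, len(samples))
        chunks.append(samples[start:end])
    return chunks


def talk_time_ms(spans_ms: List[Tuple[int, int]]) -> int:
    """Total duration covered by speech spans, in milliseconds."""
    return sum(end - start for start, end in spans_ms)


samples = [0] * (16000 * 10)             # 10 s of placeholder audio at 16 kHz
spans_ms = [(4120, 6740), (8000, 9500)]  # spans a VAD might have produced
chunks = speech_chunks(samples, spans_ms)
kept = sum(len(c) for c in chunks)

print(talk_time_ms(spans_ms))            # 4120
print(kept / len(samples))               # fraction of audio sent on to ASR
```

In this made-up recording, under half of the audio is forwarded, which is where the cost saving in batch pipelines comes from.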

VAD typically sits after denoising, when the pipeline has it, and before speech recognition. When it is wrong, the rest of the voice stack receives the wrong audio at the wrong time.

Engineering reality

VAD bugs often get misdiagnosed as ASR bugs. A transcript that drops the first syllable, merges two speakers, or responds late may be caused by endpointing and segmentation, not the recognizer itself.

The core tradeoff

Every VAD system trades missed speech against false alarms. A strict detector avoids background noise but may miss soft speech, far-field talkers, children's voices, accented speech, or speech under music. A sensitive detector catches more speech but may trigger on keyboard clicks, fans, breath, TV audio, or room echo.
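The tradeoff can be made concrete by sweeping a threshold over labeled frames and counting both kinds of error. The sketch below uses ten invented frame scores with hand-assigned labels; `error_rates` is a hypothetical helper, not part of any library.

```python
from typing import List, Tuple


def error_rates(scores: List[float], labels: List[bool],
                threshold: float) -> Tuple[float, float]:
    """Return (miss_rate, false_alarm_rate) at one threshold.

    labels[i] is True where frame i really contains speech.
    """
    speech = [s for s, is_speech in zip(scores, labels) if is_speech]
    nonspeech = [s for s, is_speech in zip(scores, labels) if not is_speech]
    misses = sum(1 for s in speech if s < threshold)           # speech rejected
    false_alarms = sum(1 for s in nonspeech if s >= threshold)  # noise accepted
    return misses / len(speech), false_alarms / len(nonspeech)


# Toy data: five speech frames followed by five non-speech frames.
scores = [0.9, 0.7, 0.4, 0.8, 0.6, 0.1, 0.3, 0.2, 0.55, 0.05]
labels = [True] * 5 + [False] * 5

for threshold in (0.3, 0.5, 0.75):
    miss, fa = error_rates(scores, labels, threshold)
    print(f"threshold={threshold}: miss={miss:.0%} false_alarm={fa:.0%}")
```

On this toy data, raising the threshold from 0.3 to 0.75 moves the detector from no misses and 40% false alarms to 60% misses and no false alarms.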

Latency adds a third constraint. If the system waits for a lot of future audio before deciding, it can be more accurate but feels sluggish. If it decides immediately, it is responsive but jumpy. Streaming VAD is mostly the art of choosing where to sit in that triangle of missed speech, false alarms, and latency for the product you are building.
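A back-of-the-envelope model of the latency side, under simplifying assumptions: the detector needs a fixed number of consecutive speech frames to open, a fixed number of silent frames to close, and optionally some lookahead. Model compute time and buffering are ignored, and all parameter names and values are illustrative.

```python
def onset_delay_ms(frame_ms: int, onset_frames: int, lookahead_frames: int = 0) -> int:
    """Time from the start of speech until the detector can report it."""
    return (onset_frames + lookahead_frames) * frame_ms


def endpoint_delay_ms(frame_ms: int, hangover_frames: int, lookahead_frames: int = 0) -> int:
    """Time from the end of speech until the detector declares the turn over."""
    return (hangover_frames + lookahead_frames) * frame_ms


# Assumed settings: 20 ms frames, 3 frames to open, 25 silent frames to close.
print(onset_delay_ms(20, 3))      # 60 ms before speech start is reported
print(endpoint_delay_ms(20, 25))  # 500 ms of silence before the turn ends

# Shortening the silence wait makes the product respond sooner,
# but any pause longer than 200 ms now ends the turn.
print(endpoint_delay_ms(20, 10))  # 200
```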

The right VAD setting is not universally "most accurate." It is the operating point your product can tolerate.

Frame decisions are not enough

Real VAD is rarely a single threshold on one frame. Speech has pauses between words. Plosives and fricatives have different energy. Background noise changes over time. A raw frame-level detector will flicker: speech, non-speech, speech, non-speech. The product does not want flicker. It wants stable segments.

That is why production VAD combines scoring with state. It may require several speech frames before opening a segment, hold the segment open through short silences, add padding around boundaries, and enforce minimum segment lengths. These rules turn uncertain frame scores into usable speech spans.
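The rules in this paragraph can be sketched as a small state machine over frame scores. This is one possible arrangement, not a reference implementation: the parameter names (`onset_frames`, `hangover_frames`, `pad_frames`, `min_frames`) and their defaults are assumptions chosen for the example.

```python
from typing import List, Tuple


def segment_frames(scores: List[float], threshold: float = 0.5,
                   onset_frames: int = 2, hangover_frames: int = 3,
                   pad_frames: int = 1, min_frames: int = 3) -> List[Tuple[int, int]]:
    """Turn frame scores into stable segments as (start_frame, end_frame_exclusive)."""
    n = len(scores)
    segments: List[Tuple[int, int]] = []

    def close(start: int, end: int) -> None:
        if end - start < min_frames:          # enforce minimum segment length
            return
        prev_end = segments[-1][1] if segments else 0
        # Pad both boundaries without overlapping the previous segment or the audio end.
        segments.append((max(start - pad_frames, prev_end), min(end + pad_frames, n)))

    in_speech = False
    run = 0       # consecutive speech frames seen while no segment is open
    silence = 0   # consecutive non-speech frames seen while a segment is open
    start = 0
    for i, score in enumerate(scores):
        is_speech = score >= threshold
        if not in_speech:
            run = run + 1 if is_speech else 0
            if run >= onset_frames:           # enough evidence: open a segment
                in_speech, start, silence = True, i - run + 1, 0
        else:
            silence = 0 if is_speech else silence + 1
            if silence > hangover_frames:     # pause too long: close the segment
                close(start, i - silence + 1)
                in_speech, run = False, 0
    if in_speech:                             # audio ended with a segment open
        close(start, n - silence)
    return segments


scores = [0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.2, 0.85, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1]
print(segment_frames(scores))  # [(3, 10)]
```

A bare threshold on the same scores would yield three fragments: the lone blip at frame 1 and two pieces split by the one-frame dip at frame 6. With state, the blip is ignored and the dip is bridged, leaving a single padded segment.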

Checkpoint

You're ready for the next lesson if you can answer these from memory:

Why is VAD different from speech recognition?
How can VAD affect cost, latency, and transcript quality?
What is the tradeoff between missed speech and false alarms?
Why does a production system need segments, not only frame scores?
