Lesson 07

Evaluate and ship VAD

VAD quality is not one accuracy number. You need to know what speech it misses, what noise it lets through, how long it waits, and how its segments affect the rest of the voice stack.

The one idea

Evaluate VAD by downstream damage: missed words, false triggers, late endpoints, broken segments, wasted compute, and product slices that drift after launch.

Measure frame and segment quality

Frame metrics compare speech and non-speech labels at small time intervals. They are useful for model development, but they do not fully describe user impact. A few bad frames at the start of an utterance may drop a word. A few bad frames in a long silence may not matter.
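A minimal sketch of why frame numbers can look fine while a word is lost. It assumes 0/1 labels per frame; the function name and the toy data are illustrative and not part of any particular VAD toolkit.

```python
from typing import Sequence


def frame_precision_recall(ref: Sequence[int], hyp: Sequence[int]) -> tuple[float, float]:
    """Precision and recall on speech frames (1 = speech, 0 = non-speech)."""
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must cover the same frames")
    tp = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 1)
    fp = sum(1 for r, h in zip(ref, hyp) if r == 0 and h == 1)
    fn = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy clip: 80 frames of silence, then 20 frames of speech.
# The detector fires 3 frames late, so the utterance onset is lost.
ref = [0] * 80 + [1] * 20
hyp = [0] * 83 + [1] * 17

precision, recall = frame_precision_recall(ref, hyp)
frame_accuracy = sum(r == h for r, h in zip(ref, hyp)) / len(ref)
print(f"accuracy={frame_accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Only three frames are wrong, so accuracy stays high, but all three sit at the start of the utterance, which is exactly where a dropped syllable hurts.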

Segment metrics ask whether starts and ends are usable. Did the VAD preserve the first syllable? Did it end near the real turn end? Did it split one utterance into pieces? Did it merge two separate turns? These questions are closer to production quality.
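The segment questions above can be sketched as a small report per labeled utterance. This is an illustration under simple assumptions: segments are `(start_s, end_s)` tuples, and `segment_report` is a hypothetical helper, not a standard API.

```python
Segment = tuple[float, float]  # (start_s, end_s); assumed representation


def overlaps(a: Segment, b: Segment) -> bool:
    """True when the two time ranges share any duration."""
    return a[0] < b[1] and b[0] < a[1]


def segment_report(ref: Segment, hyps: list[Segment]) -> dict:
    """Compare one labeled utterance with the detected segments that overlap it."""
    hits = sorted(h for h in hyps if overlaps(ref, h))
    if not hits:
        return {"missed": True, "start_offset_s": None, "end_offset_s": None, "pieces": 0}
    return {
        "missed": False,
        "start_offset_s": hits[0][0] - ref[0],  # positive = onset was clipped
        "end_offset_s": hits[-1][1] - ref[1],   # positive = endpoint came late
        "pieces": len(hits),                    # more than 1 = utterance was split
    }


labeled = (1.0, 3.0)
detected = [(1.12, 1.9), (2.1, 3.4), (5.0, 5.5)]
report = segment_report(labeled, detected)
print(report)
```

Swapping the roles (one detected segment against the list of labeled utterances) gives a rough merge check: more than one overlapping label suggests two turns were merged.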

Telephony metrics: FA/hr and FR/hr

In phone and meeting products, frame accuracy is hard to interpret. Teams often report false accepts per hour (FA/hr) and false rejects per hour (FR/hr) on labeled continuous audio.

  • False accept (FA): VAD declares speech when the label says non-speech. FA/hr counts those events per hour of audio. High FA/hr wastes ASR compute and can trigger agents on hold music or background TV.
  • False reject (FR): VAD misses labeled speech. FR/hr counts missed speech time or missed speech events per hour. High FR/hr means users are not heard.

Convert frame errors into these rates on a representative eval set: run VAD over N hours of labeled audio, count FA and FR events using your product's definition (some teams count events, others count seconds of missed speech), and report both at the operating point you ship.
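One way to compute these rates, assuming the event-counting definition: a false accept is a detected segment that overlaps no labeled speech, and a false reject is a labeled segment that overlaps no detection. Your product may define events differently (for example by seconds of missed speech), so treat this as a sketch.

```python
Segment = tuple[float, float]  # (start_s, end_s); assumed representation


def _overlaps(a: Segment, b: Segment) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def fa_fr_per_hour(refs: list[Segment], hyps: list[Segment],
                   audio_seconds: float) -> tuple[float, float]:
    """Event-based false accepts and false rejects per hour of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    hours = audio_seconds / 3600.0
    false_accepts = sum(1 for h in hyps if not any(_overlaps(h, r) for r in refs))
    false_rejects = sum(1 for r in refs if not any(_overlaps(r, h) for h in hyps))
    return false_accepts / hours, false_rejects / hours


# Toy example: 30 minutes of labeled audio.
refs = [(10.0, 12.0), (60.0, 65.0), (300.0, 304.0)]
hyps = [(10.1, 12.2), (100.0, 101.0), (300.0, 303.5), (900.0, 902.0)]
fa_hr, fr_hr = fa_fr_per_hour(refs, hyps, audio_seconds=1800.0)
print(f"FA/hr={fa_hr:.1f} FR/hr={fr_hr:.1f}")
```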

For diarization pipelines that care about who spoke when, also track diarization error rate (DER): the sum of false-alarm speech time, missed speech time, and speaker-confusion time (speech attributed to the wrong speaker), divided by the total reference speech time. The first two terms come directly from speech detection, so VAD quality sets the floor for DER, and bad speech boundaries also break downstream clustering.

Evaluation protocol on labeled audio

  1. Build a golden set of product-representative clips with frame or segment labels for speech vs non-speech. Follow the golden-set discipline in Evaluation & Observability, lesson 02: real failures, versioned audio, and slice tags (device, room, language).
  2. Run the full production pipeline on each clip: same resampling, denoise, VAD config, and postprocessing you ship.
  3. Score frame metrics (precision/recall on speech frames) for model tuning, then segment metrics (start offset, end offset, missed speech seconds, FA/hr, FR/hr) for product decisions.
  4. Add downstream checks: ASR word error rate on VAD segments, agent repeat rate, or endpoint delay p95 from timestamps.
  5. Compare configs in paired A/B on the same clips before rollout. Keep public benchmarks (NIST SRE-style detection tasks, AVA-AVSP speech activity sets) as sanity checks, but do not replace your golden set with them.
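As an illustration of step 4, endpoint delay p95 can be derived from two timestamps per utterance: when labeled speech ended and when the endpoint fired. The sketch below uses a nearest-rank percentile and made-up timestamps; both helper names are hypothetical.

```python
import math


def nearest_rank_percentile(values: list[int], pct: float) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def endpoint_delays_ms(speech_end_ms: list[int], endpoint_fired_ms: list[int]) -> list[int]:
    """Delay between the labeled end of speech and the VAD endpoint, per utterance."""
    if len(speech_end_ms) != len(endpoint_fired_ms):
        raise ValueError("need one endpoint timestamp per utterance")
    return [fired - end for end, fired in zip(speech_end_ms, endpoint_fired_ms)]


# Made-up timestamps for 20 utterances, delays from 100 ms to 480 ms.
speech_end_ms = [1000 * i for i in range(1, 21)]
endpoint_fired_ms = [end + 100 + 20 * i for i, end in enumerate(speech_end_ms)]

delays = endpoint_delays_ms(speech_end_ms, endpoint_fired_ms)
p50 = nearest_rank_percentile(delays, 50)
p95 = nearest_rank_percentile(delays, 95)
print(f"endpoint delay p50={p50} ms p95={p95} ms")
```

Reporting p95 alongside the median matters because the slow tail, not the typical case, is what makes an agent feel sluggish.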

Track the important errors

Missed speech is usually the most visible error because users notice when they are ignored. False alarms are expensive because they run ASR and may trigger unwanted agent behavior. Late endpoints make the system feel slow. Early endpoints cut the user off.

Report these separately. A single aggregate score can hide a detector that looks good overall but fails quiet speakers, mobile microphones, or noisy rooms. Slice metrics by device, environment, language, speaker distance, network path, and product mode.
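A small sketch of slicing, assuming each scored clip is a dict carrying slice tags plus `missed_s` and `speech_s` totals. The field names and numbers are hypothetical.

```python
from collections import defaultdict


def miss_rate_by_slice(clips: list[dict], key: str) -> dict[str, float]:
    """Missed-speech fraction per slice value (for example per device)."""
    totals: dict[str, list[float]] = defaultdict(lambda: [0.0, 0.0])  # [missed_s, speech_s]
    for clip in clips:
        bucket = totals[clip[key]]
        bucket[0] += clip["missed_s"]
        bucket[1] += clip["speech_s"]
    return {name: missed / speech for name, (missed, speech) in totals.items() if speech > 0}


clips = [
    {"device": "headset", "missed_s": 1.0, "speech_s": 100.0},
    {"device": "headset", "missed_s": 1.0, "speech_s": 100.0},
    {"device": "far_field", "missed_s": 6.0, "speech_s": 50.0},
]

overall = sum(c["missed_s"] for c in clips) / sum(c["speech_s"] for c in clips)
by_device = miss_rate_by_slice(clips, "device")
print(f"overall={overall:.3f}", by_device)
```

In this toy data the overall miss rate looks acceptable while the far-field slice is far worse, which is the pattern an aggregate score hides.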

Figure: example VAD dashboard. Tiles: missed speech 1.8%, false accepts 4.6/hr, false rejects 1.2/hr, endpoint p95 620 ms. Timeline: kept speech, missed region, kept speech.
A useful VAD dashboard separates different failure modes. One aggregate accuracy number hides the errors users actually feel.

Evaluate with downstream systems

A VAD can look strong on acoustic labels and still hurt ASR. For transcription, measure word error rate with the VAD segmentation in place. For voice agents, measure time to first response, interruption rate, missed utterances, and user repeats. For analytics, inspect talk-time accuracy and segment counts.

This matters because VAD is rarely the final product. Its output is an input to ASR, diarization, LLM orchestration, or storage. Optimize for the next system, not for a leaderboard number detached from the product.
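To make the downstream effect concrete, here is a minimal word error rate function based on word-level edit distance. It skips text normalization (casing, punctuation), which a real evaluation would need, and the example transcripts are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


# A clipped onset drops the first word before ASR ever sees it.
spoken = "turn off the kitchen lights"
asr_on_vad_segment = "off the kitchen lights"
wer = word_error_rate(spoken, asr_on_vad_segment)
print(f"WER={wer:.2f}")
```

Here the ASR model made no mistake on the audio it received; the error comes entirely from segmentation, and it also flips the meaning of the command.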

Eval slices: audio you should deliberately test

  • Must cover: quiet speech, far-field microphones, short commands, long pauses, cross-talk, browser echo cancellation, telephony audio, and barge-in over TTS.
  • Easy to forget: music, laughter, keyboard noise, hold music, TV in the room, low battery Bluetooth microphones, clipped audio, and automatic gain control changes.
  • Release habit: when changing a VAD model or config, run a paired comparison: same audio, old VAD, new VAD, downstream metrics side by side. It makes regressions visible before launch.
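A paired comparison can be as simple as joining per-clip scores from both configs by clip id. The sketch assumes a lower-is-better metric such as missed-speech fraction; the clip ids and values are invented.

```python
def paired_regressions(old: dict[str, float], new: dict[str, float],
                       tolerance: float = 0.0) -> list[tuple[str, float, float]]:
    """Clips where the new config is worse than the old one (lower is better)."""
    shared = sorted(old.keys() & new.keys())
    return [(cid, old[cid], new[cid]) for cid in shared if new[cid] - old[cid] > tolerance]


def mean_delta(old: dict[str, float], new: dict[str, float]) -> float:
    """Average change across clips scored by both configs (negative = improvement)."""
    shared = old.keys() & new.keys()
    if not shared:
        raise ValueError("no clips in common")
    return sum(new[cid] - old[cid] for cid in shared) / len(shared)


old_scores = {"clip_a": 0.10, "clip_b": 0.20, "clip_c": 0.05}
new_scores = {"clip_a": 0.08, "clip_b": 0.25, "clip_c": 0.05}

regressions = paired_regressions(old_scores, new_scores)
delta = mean_delta(old_scores, new_scores)
print(regressions, f"mean delta={delta:+.3f}")
```

Listing the regressed clips, not just the mean change, tells you which audio to listen to before deciding whether the new config ships.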

Monitor after launch

Production audio changes. New devices appear. Users move to different rooms. A mobile OS update changes capture processing. Traffic shifts from quiet demos to real noise. Monitor VAD event rates, average segment length, endpoint delay, ASR empty transcripts, user repeat phrases, and cancellation patterns.
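A minimal drift check, assuming you keep a baseline of aggregate metrics from a known-good period and compare a recent window against it. The metric names, values, and the 25% threshold are placeholders to tune for your product.

```python
def drifted_metrics(baseline: dict[str, float], current: dict[str, float],
                    max_relative_change: float = 0.25) -> dict[str, float]:
    """Metrics whose relative change from baseline exceeds the threshold."""
    alerts = {}
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue  # cannot compute a relative change
        change = (current[name] - base) / base
        if abs(change) > max_relative_change:
            alerts[name] = change
    return alerts


baseline = {"segments_per_hour": 120.0, "mean_segment_s": 4.0, "empty_transcript_rate": 0.02}
this_week = {"segments_per_hour": 126.0, "mean_segment_s": 2.0, "empty_transcript_rate": 0.05}

alerts = drifted_metrics(baseline, this_week)
for name, change in alerts.items():
    print(f"{name}: {change:+.0%} vs baseline")
```

Shorter segments together with more empty transcripts would be consistent with the VAD firing on noise or splitting utterances, which is worth a sampled review.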

Privacy matters here. You may not be able to store raw audio freely. In that case, log derived metrics, timestamps, device metadata, and sampled review data under explicit retention and consent rules. VAD observability should be designed before the incident.

Ship with rollback

VAD config changes can reshape the whole voice product. Use versioned configs, staged rollout, dashboard slices, and quick rollback. Keep the old model or threshold available until the new path has survived real traffic.
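One possible shape for versioned configs with a staged rollout, sketched under assumptions: `VadConfig` and its fields are hypothetical, and sessions are assigned to a bucket by hashing the session id so the same session always gets the same config.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class VadConfig:
    version: str
    threshold: float
    min_silence_ms: int


def pick_config(session_id: str, stable: VadConfig, candidate: VadConfig,
                rollout_percent: int) -> VadConfig:
    """Deterministically route rollout_percent% of sessions to the candidate config."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return candidate if bucket < rollout_percent else stable


stable = VadConfig(version="v12", threshold=0.50, min_silence_ms=500)
candidate = VadConfig(version="v13", threshold=0.45, min_silence_ms=400)

chosen = pick_config("session-1234", stable, candidate, rollout_percent=10)
print(chosen.version)
```

With this layout, rollback is a single change: set `rollout_percent` to 0 and every session returns to the stable config. Logging `version` with each metric keeps dashboards comparable across configs.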

For high-stakes workflows, build a conservative fallback. If the neural model fails to load, use an energy gate. If endpointing is uncertain, prefer asking for confirmation over pretending the user was done. The right fallback depends on the harm of missing speech versus acting on noise.
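The load-failure fallback could look like the sketch below. It assumes frames are float samples scaled to [-1.0, 1.0] and that the neural detector is exposed through a loader callable; the loader, the -40 dBFS threshold, and the function names are all illustrative.

```python
import math
from typing import Callable, Sequence

Frame = Sequence[float]  # PCM samples scaled to [-1.0, 1.0]; assumed format
Detector = Callable[[Frame], bool]


def energy_gate(frame: Frame, threshold_dbfs: float = -40.0) -> bool:
    """Crude fallback: call it speech when frame RMS level exceeds the threshold."""
    if not frame:
        return False
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    if rms == 0.0:
        return False  # digital silence; avoids log10(0)
    return 20 * math.log10(rms) > threshold_dbfs


def make_detector(load_neural: Callable[[], Detector]) -> Detector:
    """Use the neural detector when it loads, otherwise fall back to the energy gate."""
    try:
        return load_neural()
    except Exception:
        # Broad on purpose for this sketch; production code should log the failure.
        return energy_gate


def broken_loader() -> Detector:
    raise RuntimeError("model file missing")  # simulated load failure


detect = make_detector(broken_loader)
print(detect([0.0] * 160), detect([0.1] * 160))
```

An energy gate passes any loud sound, so it trades false accepts for fewer missed utterances. Whether that trade is acceptable depends on the harm balance described above.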

Checkpoint

You understand this lesson if you can answer these from memory:

  • Why can frame accuracy hide bad segment quality?
  • What are FA/hr and FR/hr, and when do telephony teams use them?
  • Why should VAD evaluation include ASR or voice-agent metrics?
  • What should be monitored after launch?

Quick check

  • Only frame-level precision
  • Endpoint delay and time from user stop to agent response
  • The number of audio files stored per day