Lesson 05

WebRTC and neural VAD models

Energy tells you how loud audio is. WebRTC VAD adds a hardened classical detector. Neural VAD learns richer acoustic patterns. All three still need segmentation logic before they become a product.

The one idea

Detector choice changes the frame score, not the whole system. Energy, WebRTC, and neural VAD all feed the same product layer: smoothing, hangover, endpointing, evaluation, and rollback.
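A minimal sketch of that idea, assuming a hypothetical interface in which every detector is a function from one frame of samples to a score in [0, 1]. The names `energy_scorer` and `detect_segments` are illustrative and do not come from any library. Swapping the scorer leaves the thresholding and hangover code untouched.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a detector maps one frame to a score in [0, 1].
# A binary detector such as WebRTC VAD would return 0.0 or 1.0.
FrameScorer = Callable[[Sequence[int]], float]


def energy_scorer(frame: Sequence[int]) -> float:
    """Toy loudness score: mean absolute 16-bit sample value scaled to [0, 1]."""
    if not frame:
        return 0.0
    return min(1.0, sum(abs(s) for s in frame) / (len(frame) * 32768.0))


def detect_segments(
    frames: Sequence[Sequence[int]],
    scorer: FrameScorer,
    threshold: float,
    frame_ms: int,
    hangover_frames: int,
) -> List[Tuple[int, int]]:
    """Shared product layer: threshold each score, keep a segment open for
    `hangover_frames` silent frames, and return (start_ms, end_ms) pairs."""
    segments: List[Tuple[int, int]] = []
    start = None
    silence_run = 0
    for i, frame in enumerate(frames):
        if scorer(frame) >= threshold:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run > hangover_frames:
                # Frame i is the first one past the hangover, so the segment ends here.
                segments.append((start * frame_ms, i * frame_ms))
                start = None
                silence_run = 0
    if start is not None:
        # Audio ended while a segment was still open.
        segments.append((start * frame_ms, len(frames) * frame_ms))
    return segments
```

Passing a different `scorer`, for example a wrapper around a neural model, changes only the frame score; the segment logic stays the same.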

Three detector families

In practice you will see three families. Energy VAD is a simple loudness baseline. WebRTC VAD is a classic production detector from the WebRTC audio stack. Neural VAD is a learned model that predicts speech probability from waveform or acoustic features.

They are not mutually exclusive. Many systems keep energy VAD for debugging, use WebRTC VAD for a small reliable baseline, and graduate to neural VAD only when product audio is too hard for classical detectors.

[Figure: pipeline diagram with boxes for PCM frames (10 to 30 ms); Energy (loudness rule); WebRTC (classical VAD); Neural (learned score); Config (thresholds); Postprocess (segments, endpoints).]
Detector families differ, but production still needs a postprocessing layer that turns frame decisions into useful segments.

WebRTC VAD

WebRTC VAD is a widely used classical VAD from the WebRTC project. It expects 16-bit mono linear PCM audio in strict 10, 20, or 30 ms frames at a supported sample rate: 8, 16, 32, or 48 kHz. At 16 kHz, for example, a 20 ms frame is 320 samples, or 640 bytes. If the frame size or format is wrong, wrappers commonly reject the input rather than guessing.
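The frame rules above reduce to a little arithmetic. The sketch below uses only the standard library and assumes 16-bit mono PCM in a `bytes` buffer; `frame_size_bytes` and `split_frames` are illustrative helper names, not part of WebRTC.

```python
from typing import List

SUPPORTED_RATES_HZ = (8000, 16000, 32000, 48000)
SUPPORTED_FRAME_MS = (10, 20, 30)
BYTES_PER_SAMPLE = 2  # 16-bit linear PCM, one channel


def frame_size_bytes(sample_rate_hz: int, frame_ms: int) -> int:
    """Return the exact byte length of one valid frame, or raise on bad input."""
    if sample_rate_hz not in SUPPORTED_RATES_HZ:
        raise ValueError(f"unsupported sample rate: {sample_rate_hz}")
    if frame_ms not in SUPPORTED_FRAME_MS:
        raise ValueError(f"unsupported frame length: {frame_ms} ms")
    return sample_rate_hz * frame_ms // 1000 * BYTES_PER_SAMPLE


def split_frames(pcm: bytes, sample_rate_hz: int, frame_ms: int) -> List[bytes]:
    """Cut a PCM buffer into full frames and drop any trailing partial frame."""
    size = frame_size_bytes(sample_rate_hz, frame_ms)
    usable = len(pcm) - len(pcm) % size
    return [pcm[i:i + size] for i in range(0, usable, size)]
```

Dropping the trailing partial frame is one possible policy; a streaming system would more likely keep the remainder and prepend it to the next buffer.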

Its API shape is intentionally simple: pass one valid frame and get a speech or non-speech decision. You typically choose an aggressiveness mode from 0 to 3. Lower modes are less aggressive and let more audio through. Higher modes reject more non-speech but increase the chance of missing quieter or degraded speech.
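As a hedged illustration of that API shape, the sketch below assumes the third-party py-webrtcvad package, one common Python wrapper, whose `Vad(mode)` constructor and `is_speech(frame, sample_rate)` method are used here. The helper names `make_webrtc_detector` and `speech_fraction` are hypothetical.

```python
from typing import Callable, Iterable


def make_webrtc_detector(mode: int, sample_rate_hz: int) -> Callable[[bytes], bool]:
    """Return a frame -> bool function backed by WebRTC VAD at the given mode."""
    if mode not in (0, 1, 2, 3):
        raise ValueError("aggressiveness mode must be 0, 1, 2, or 3")
    # Assumption: py-webrtcvad is installed (pip install webrtcvad).
    # Imported lazily so the rest of this module loads without it.
    import webrtcvad

    vad = webrtcvad.Vad(mode)
    return lambda frame: vad.is_speech(frame, sample_rate_hz)


def speech_fraction(frames: Iterable[bytes], is_speech: Callable[[bytes], bool]) -> float:
    """Fraction of frames the detector marks as speech."""
    frame_list = list(frames)
    if not frame_list:
        return 0.0
    return sum(1 for f in frame_list if is_speech(f)) / len(frame_list)
```

Running `speech_fraction` on the same recording with detectors built at modes 0 through 3 gives a quick view of how much audio each mode lets through before any postprocessing.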

That makes WebRTC VAD a strong baseline for telephony, browser, embedded, and server systems that need low CPU cost and predictable behavior. On mobile, it is often the default when you cannot ship a neural model on-device: fixed frame sizes map cleanly to audio callbacks, aggressiveness modes 2 and 3 can suppress pocket rustle, and modes 0 and 1 preserve soft speech when the mic is close. It is not magic. It does not output a calibrated probability, it still flickers on hard audio, and it still needs hangover, padding, and endpointing logic around it.

WebRTC checklist

Before tuning quality, verify the basics: 16-bit PCM, supported sample rate, exact 10/20/30 ms frame length, one consistent channel policy, chosen aggressiveness mode, and postprocessing around the binary decisions. The upstream implementation lives in the WebRTC common_audio VAD tree if you need to read the frame rules.

Silero VAD: the default neural OSS choice

Silero VAD is the most common open-source neural VAD engineers reach for. It outputs a speech probability per chunk, ships PyTorch and ONNX checkpoints, and runs on CPU with single-digit millisecond inference on typical server hardware.

Common integration details: mono PCM at 8 kHz or 16 kHz, chunks of 512 samples at 16 kHz (32 ms) or 256 samples at 8 kHz, and a probability threshold you still pair with hangover and min-duration rules from lesson 03. For browser or embedded deployment, use the ONNX model and run it through ONNX Runtime or a WASM build instead of shipping PyTorch.
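The chunking side of that integration can be sketched without the model itself. In the example below, `score_fn` is a hypothetical stand-in for whatever call returns a speech probability for one chunk; it is not the Silero API. The 256-sample size at 8 kHz is assumed to be the chunk that matches the same 32 ms duration.

```python
from typing import Callable, List, Sequence, Tuple

# Chunk sizes in samples per sample rate. 512 at 16 kHz comes from the text;
# 256 at 8 kHz is assumed as the matching 32 ms size.
CHUNK_SAMPLES = {16000: 512, 8000: 256}


def score_chunks(
    samples: Sequence[int],
    sample_rate_hz: int,
    score_fn: Callable[[List[float]], float],
) -> List[Tuple[float, float]]:
    """Split 16-bit mono samples into fixed chunks and return
    (chunk_start_seconds, speech_probability) pairs."""
    if sample_rate_hz not in CHUNK_SAMPLES:
        raise ValueError(f"unsupported sample rate: {sample_rate_hz}")
    size = CHUNK_SAMPLES[sample_rate_hz]
    results = []
    # Only full chunks are scored; a trailing partial chunk is skipped.
    for start in range(0, len(samples) - size + 1, size):
        chunk = [s / 32768.0 for s in samples[start:start + size]]  # int16 -> [-1, 1)
        results.append((start / sample_rate_hz, score_fn(chunk)))
    return results
```

The returned probabilities are still raw frame scores; the threshold, hangover, and min-duration rules are applied afterwards.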

Silero learns richer patterns than WebRTC VAD, which helps in cafes, far-field rooms, and mixed noise. It still needs product-specific eval. Treat the default threshold as a starting point, not a universal setting.
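One way to treat the threshold as a starting point is to sweep it over labeled product audio. The sketch below assumes you already have one probability and one ground-truth label per chunk; `sweep_thresholds` is an illustrative helper, and the frame-level miss and false-alarm rates are simplified stand-ins for the metrics in lesson 07.

```python
from typing import List, Sequence, Tuple


def sweep_thresholds(
    probs: Sequence[float],
    labels: Sequence[int],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, miss_rate, false_alarm_rate) for each threshold.

    labels: 1 for speech chunks, 0 for non-speech chunks.
    """
    speech_total = sum(1 for y in labels if y)
    nonspeech_total = len(labels) - speech_total
    rows = []
    for t in thresholds:
        # A miss is a speech chunk scored below the threshold.
        misses = sum(1 for p, y in zip(probs, labels) if y and p < t)
        # A false alarm is a non-speech chunk scored at or above it.
        false_alarms = sum(1 for p, y in zip(probs, labels) if not y and p >= t)
        miss_rate = misses / speech_total if speech_total else 0.0
        fa_rate = false_alarms / nonspeech_total if nonspeech_total else 0.0
        rows.append((t, miss_rate, fa_rate))
    return rows
```

Raising the threshold trades false alarms for misses; the right operating point depends on which error costs the product more.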

OSS

Silero VAD is the landmark neural VAD for this course. Start with the README examples for streaming inference, then the ONNX path if you deploy outside Python. Pair it with lesson 03 postprocessing and lesson 07 metrics before you call it production-ready.

Take from it
API: probability per chunk, 8/16 kHz mono. Ship: ONNX + ONNX Runtime for low-latency CPU. Tune: threshold plus min speech/silence durations.
It skips
Endpointing policy, echo cancellation, and telephony false-accepts-per-hour (FA/hr) targets. Use lessons 04, 06, and 07 for those; use Silero for the frame score.

Telephony vs conferencing vs open-world noise

Telephony and narrowband calls: WebRTC VAD at aggressiveness 2–3 is often enough when audio is already band-limited and close-mic. Optimize for low false accepts per hour because hold music and line noise are costly.
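False accepts per hour is simple arithmetic, though teams normalize it differently. The sketch below assumes the denominator is the duration of non-speech audio (hold music, line noise) in the test set; `false_accepts_per_hour` is an illustrative helper.

```python
def false_accepts_per_hour(false_accept_count: int, nonspeech_seconds: float) -> float:
    """False accepts divided by hours of non-speech audio (assumed normalization)."""
    if nonspeech_seconds <= 0:
        raise ValueError("nonspeech_seconds must be positive")
    return false_accept_count / (nonspeech_seconds / 3600.0)


# Example: 6 false accepts over 4 hours of hold music and line noise.
rate = false_accepts_per_hour(6, 4 * 3600)  # 1.5 per hour
```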

Conferencing and browser capture: Mixed microphones, room reverb, and soft far-field speech push you toward neural VAD (Silero) plus conservative hangover. WebRTC can remain a lightweight fallback.

Noisy cafes and open environments: Denoise first (lesson 04), then neural VAD. Energy-only or aggressive WebRTC modes will miss whispers or fire on background chatter.

What neural VAD learns

A neural VAD model receives audio features or waveform chunks and predicts speech probability. Unlike a pure energy detector or a compact classical detector, it can learn spectral shape, temporal rhythm, phonetic clues, and context across neighboring frames.

This matters because speech is not only loudness. A quiet vowel can be speech. A loud keyboard hit is not. A fan may be steady noise, while a voiced syllable has structure. A model can learn these differences if its training data covers them. Many neural VAD models consume log-mel or STFT features; if that vocabulary is new, see Fourier and spectrograms and mel features in Audio Foundations.

Domain coverage decides quality

A VAD trained mostly on clean close-talk speech may fail on far-field rooms. A model trained on English call audio may behave differently on tonal languages, singing, code-switching, children, elderly speakers, or accented speech. VAD is not language understanding, but speech acoustics and recording conditions still vary.

The most useful evaluation set is not a generic benchmark alone. It is a set of product-representative audio: target devices, target rooms, target languages, target background noise, and target user behavior. If the product includes barge-in over TTS, include barge-in examples.
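A coverage check over the eval manifest makes gaps visible before any metric is computed. The sketch below assumes each clip is described by a small dictionary of tags; the field names (`device`, `barge_in`) and the `coverage_gaps` helper are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def coverage_gaps(
    clips: Iterable[Dict[str, str]],
    required: Dict[str, List[str]],
    min_clips: int,
) -> List[str]:
    """List every required (field, value) condition with fewer than min_clips clips."""
    counts: Counter = Counter()
    for clip in clips:
        for field in required:
            counts[(field, clip.get(field))] += 1
    gaps = []
    for field, values in required.items():
        for value in values:
            n = counts[(field, value)]
            if n < min_clips:
                gaps.append(f"{field}={value}: {n} clips, need {min_clips}")
    return gaps
```

An empty result only says the conditions are present in the set; it says nothing about how well the detector performs on them.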

Model trap

Do not assume a neural VAD is immune to music, laughter, TV audio, echo, or cross-talk. It learned from a dataset. Outside that distribution, it can be confidently wrong.

Deployment constraints

VAD often runs continuously. That makes CPU cost, memory, startup time, and power use important. A large model may be accurate offline but too expensive for browsers, phones, embedded devices, or high-scale servers. A smaller model with good postprocessing may win in production.
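A rough way to compare CPU cost is the real-time factor: processing time divided by audio duration. The sketch below uses only the standard library; `detector` is any per-frame callable, and `real_time_factor` is an illustrative helper rather than a benchmark harness.

```python
import time
from typing import Callable, Sequence


def real_time_factor(
    frames: Sequence[bytes],
    detector: Callable[[bytes], object],
    frame_ms: int,
) -> float:
    """Processing seconds per second of audio. Values well below 1.0 leave
    headroom for continuous operation; values near 1.0 do not."""
    if not frames:
        raise ValueError("need at least one frame")
    start = time.perf_counter()
    for frame in frames:
        detector(frame)
    elapsed = time.perf_counter() - start
    audio_seconds = len(frames) * frame_ms / 1000.0
    return elapsed / audio_seconds
```

Measure on the target device, with the model already loaded, since startup time and power use are separate costs that this number does not capture.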

Streaming models also need bounded lookahead. If the model requires too much future audio, it cannot make fast endpoint decisions. Some architectures are causal, some use limited context, and some are more suitable for offline segmentation than live agents.
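The effect of lookahead on endpointing can be estimated with a simple additive budget. This is a rough worst-case model under stated assumptions, not a measurement: the detector must buffer one chunk, wait for its lookahead, observe the hangover silence, and run inference. The numbers below are illustrative.

```python
def endpoint_latency_ms(
    chunk_ms: float,
    lookahead_ms: float,
    hangover_ms: float,
    inference_ms: float,
) -> float:
    """Approximate delay between the end of speech and the endpoint decision."""
    return chunk_ms + lookahead_ms + hangover_ms + inference_ms


# Causal model: 32 ms chunks, no lookahead, 300 ms hangover, 2 ms inference.
causal = endpoint_latency_ms(32, 0, 300, 2)        # 334 ms
# Same settings with a model that needs 500 ms of future audio.
with_lookahead = endpoint_latency_ms(32, 500, 300, 2)  # 834 ms
```

Under this model, lookahead adds directly to every endpoint decision, which is why a model suited to offline segmentation can feel slow in a live agent.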

Production choice
When each detector is a reasonable default
Use WebRTC VAD when
You need tiny CPU cost, deterministic deployment, binary decisions, and audio conditions close to calls, meetings, or browser voice input.
Use neural VAD when
You have noisy, far-field, multilingual, music-heavy, or domain-specific audio where classical decisions miss too much speech or trigger too often.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

  • What input constraints does WebRTC VAD usually impose?
  • How do WebRTC aggressiveness modes trade false alarms against missed speech?
  • Why is Silero VAD a common default for neural OSS deployment?
  • Which product-specific audio should appear in your VAD eval set?

Quick check

  • WebRTC VAD aggressiveness 3 with no denoising
  • RNNoise or similar suppression, then Silero VAD with tuned hangover
  • Fixed RMS threshold with no noise floor tracking
  • WebRTC VAD at 8 kHz, 20 ms frames, aggressiveness 2
  • A GPU-hosted transformer VAD with 500 ms lookahead
  • pyannote segmentation with no postprocessing