Lesson 02

Energy-based VAD and thresholds

The simplest voice activity detector (VAD) asks whether the current audio frame is loud enough to be speech. That baseline is easy to build, easy to debug, and exactly where many production mistakes begin.

The one idea

Energy-based VAD compares short-frame loudness against a noise-aware threshold. It works when speech is clearly louder than the background and fails when loudness stops being a reliable proxy for speech.

Start with frames

VAD does not usually classify an entire file at once. It chops audio into short windows, often 10 to 30 milliseconds. Each window is small enough to react quickly but large enough to measure signal properties.
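As a minimal sketch of the framing step, assuming mono audio already loaded as a list of float samples, the function below cuts a signal into non-overlapping frames. The name `split_into_frames`, the 20 ms default, and the choice to drop the trailing partial frame are illustrative assumptions; real pipelines often use overlapping frames or pad the tail.

```python
def split_into_frames(samples, sample_rate, frame_ms=20):
    """Split samples into non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame_ms is too small for this sample rate")
    # Drop the trailing partial frame so every frame has the same length.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of placeholder audio at 16 kHz, cut into 20 ms frames.
frames = split_into_frames([0.0] * 16000, sample_rate=16000, frame_ms=20)
print(len(frames), "frames of", len(frames[0]), "samples")
```

At 16 kHz, a 20 ms frame holds 320 samples, which is why the frame length has to be derived from the sample rate rather than hard-coded.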

For each frame, you can compute root mean square (RMS) energy, peak amplitude, or log energy. RMS is common because it reflects the average power across the whole frame rather than a single spike. You then compare that value with a threshold. Above the threshold means speech-like activity. Below it means likely silence or noise.
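The three frame measures can be sketched as follows, assuming each frame is a non-empty list of floats in the range -1.0 to 1.0. The function names, the `eps` guard, and the example threshold of 0.2 are assumptions for illustration, not values from any particular library.

```python
import math


def rms(frame):
    """Root mean square: square root of the average squared sample."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def peak(frame):
    """Largest absolute sample value in the frame."""
    return max(abs(x) for x in frame)


def log_energy_db(frame, eps=1e-12):
    """Mean power in decibels; eps avoids log10(0) on digital silence."""
    mean_power = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_power + eps)


def is_speech_like(frame, threshold):
    """Fixed-threshold decision on RMS energy."""
    return rms(frame) > threshold


spiky = [0.0] * 99 + [1.0]   # one click in an otherwise silent frame
steady = [0.3] * 100         # sustained moderate signal

# The click has the higher peak, but the steady frame has the higher RMS.
print(peak(spiky), rms(spiky))
print(peak(steady), rms(steady))
print(is_speech_like(spiky, 0.2), is_speech_like(steady, 0.2))
```

The single click dominates the peak measure but barely moves RMS, which is the practical reason RMS is the usual choice for this baseline.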

[Figure: a waveform split into 10 to 30 ms frames, with RMS energy per frame plotted against a threshold that separates speech frames from noise and silence. Energy VAD is a chart problem: frame, measure, compare, then postprocess.]
Energy VAD turns a waveform into one number per frame, then asks whether that number is high enough relative to the background.

The threshold is the product

A fixed threshold is tempting: if RMS energy is above a chosen value, call it speech. It works in a quiet room with one microphone level. It breaks when the user changes distance from the mic, switches devices, whispers, or speaks in traffic, or when automatic gain control in the capture stack rescales the signal.

A better baseline estimates a noise floor. During known or likely non-speech regions, track background energy and set the speech threshold relative to that floor. This turns the question from "is this loud?" into "is this meaningfully louder than the current background?"
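One way to sketch a noise-floor-relative threshold is an exponential moving average that updates only on frames judged non-speech. This is a simplified illustration, not a standard algorithm: the function name, the `ratio` and `alpha` defaults, and seeding the floor from the first frame (which assumes the recording starts with background noise) are all assumptions.

```python
def adaptive_energy_vad(frame_energies, ratio=3.0, alpha=0.1, initial_floor=None):
    """Return one boolean per frame: True if energy exceeds ratio * noise floor."""
    if not frame_energies:
        return []
    # Assumption: the first frame is background noise unless a floor is given.
    floor = frame_energies[0] if initial_floor is None else initial_floor
    decisions = []
    for energy in frame_energies:
        speech = energy > ratio * floor
        decisions.append(speech)
        if not speech:
            # Track the background only on frames judged non-speech.
            floor = (1 - alpha) * floor + alpha * energy
    return decisions


quiet = [0.01] * 5 + [0.1] * 3 + [0.01] * 2   # RMS per frame, quiet capture
loud = [10 * e for e in quiet]                # same audio with 10x gain

print(adaptive_energy_vad(quiet))
print(adaptive_energy_vad(loud))
print([e > 0.05 for e in loud])               # fixed threshold, for contrast
```

The relative threshold gives the same decisions at both gain levels, while the fixed threshold of 0.05 flags every frame of the louder capture. The sketch still has known weaknesses: if the first frame is speech, or the background gets abruptly louder, the floor estimate starts or stays wrong.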

Useful baseline

Keep the energy VAD around even if you later use a neural model. It is a cheap sanity check, a fallback for embedded systems, and a useful debugging signal when model scores look strange.

SNR explains the failure mode

Signal-to-noise ratio, or SNR, is the ratio of speech power to background noise power, usually expressed in decibels. Energy VAD is strongest when SNR is high: speech is much louder than the room. It is weakest when SNR is low: speech and noise live near the same energy level.
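To put numbers on that gap, SNR in decibels can be computed from two RMS values. This sketch assumes you already have separate RMS estimates for speech and for background noise, which in practice is itself an estimation problem; the function name and the sample values are illustrative.

```python
import math


def snr_db(speech_rms, noise_rms):
    """SNR in decibels from RMS amplitudes (20 * log10 of the amplitude ratio)."""
    return 20.0 * math.log10(speech_rms / noise_rms)


# Speech ten times the noise amplitude: a comfortable margin for a threshold.
print(round(snr_db(0.1, 0.01), 1))

# Speech only twice the noise amplitude: little room to place a threshold.
print(round(snr_db(0.02, 0.01), 1))
```

The factor of 20 applies because the inputs are amplitudes; with power values the factor would be 10.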

Low SNR happens in cars, cafes, far-field microphones, call center recordings, open offices, and devices with poor gain control. In those cases, non-speech audio can exceed the threshold and soft speech can fall below it. Energy alone cannot know whether a loud sound was a spoken syllable or a dropped object.

Mental model

Energy VAD is a good first ruler, not a speech understanding model. When background noise and speech overlap in loudness, the ruler has no feature left to separate them.

Preprocessing changes the score

Resampling, channel mixing, automatic gain control, noise suppression, clipping, and echo cancellation can all change frame energy. That means a threshold tuned on raw microphone audio may fail after a capture SDK update. It also means offline evaluation must use the same preprocessing path as production.
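A small sketch of why an upstream gain change matters: the same background-noise frame flips from below to above a fixed threshold once gain is applied. The threshold value, the gain of 4.0, and the hard clipping in `apply_gain` are hypothetical stand-ins for whatever the capture stack actually does.

```python
import math


def rms(frame):
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def apply_gain(frame, gain):
    """Scale samples and hard-clip to [-1.0, 1.0], as a crude gain stage."""
    return [max(-1.0, min(1.0, x * gain)) for x in frame]


FIXED_THRESHOLD = 0.05                 # hypothetical value tuned on raw audio

noise_frame = [0.02, -0.02] * 80       # low-level background noise
boosted = apply_gain(noise_frame, 4.0)

raw_decision = rms(noise_frame) > FIXED_THRESHOLD
boosted_decision = rms(boosted) > FIXED_THRESHOLD
print(raw_decision, boosted_decision)
```

Nothing about the room changed between the two decisions; only the preprocessing did.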

For stereo audio, decide whether you classify each channel separately or downmix first. For telephony, check whether the signal is narrowband. For browser audio, understand whether the browser is applying echo cancellation and automatic gain control. VAD is small, but it sits on a stack that moves.
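The stereo choice changes frame energy directly. The sketch below uses two constructed cases, assuming equal-length channels and a simple average as the downmix; the signals are synthetic and chosen to make the effect obvious.

```python
import math


def rms(frame):
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def downmix(left, right):
    """Average two equal-length channels into one mono frame."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]


left = [0.2, -0.2] * 80

# Case 1: the speaker is on one channel only, so downmixing halves the amplitude.
one_sided = downmix(left, [0.0] * len(left))
print(rms(left), rms(one_sided))

# Case 2: the right channel is a phase-inverted copy, so the downmix cancels.
inverted = downmix(left, [-x for x in left])
print(rms(inverted))
```

A threshold tuned on per-channel energy would therefore need retuning after a downmix, and the reverse is also true.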

The hardened classical alternative to a hand-rolled energy gate is WebRTC VAD, which wraps energy and spectral rules behind a simple binary API. We cover its frame sizes and aggressiveness modes in lesson 05.

Try the threshold

Move the threshold up and down on a synthetic waveform and two error types trade off: a higher line misses more speech frames (real speech below the line), while a lower line lets in more false alarms (noise above the line).
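The same trade-off can be sketched offline by sweeping a threshold over labeled frames. The energies and labels below are made-up illustration data, and `count_errors` is a hypothetical helper, not part of any library.

```python
def count_errors(frame_energies, is_speech, threshold):
    """Return (missed_speech_frames, false_alarm_frames) for one threshold."""
    pairs = list(zip(frame_energies, is_speech))
    missed = sum(1 for e, s in pairs if s and e <= threshold)
    false_alarms = sum(1 for e, s in pairs if not s and e > threshold)
    return missed, false_alarms


# Synthetic per-frame RMS values with hand-assigned labels (True = speech).
energies = [0.01, 0.02, 0.06, 0.03, 0.20, 0.04, 0.15, 0.02]
labels = [False, False, False, False, True, True, True, False]

for threshold in (0.025, 0.05, 0.10, 0.25):
    missed, false_alarms = count_errors(energies, labels, threshold)
    print(f"threshold={threshold}: missed={missed}, false_alarms={false_alarms}")
```

In this data, the soft speech frame at 0.04 sits below the noise burst at 0.06, so no single threshold removes both errors. That is the low-SNR failure in miniature.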

Common trap

Do not tune an energy threshold only on clean demo audio. The demo microphone, room, and speaker volume become hidden assumptions in the system.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

  • Why does VAD use short frames instead of classifying a whole file at once?
  • Why is a fixed threshold brittle?
  • How does estimating a noise floor improve energy VAD?
  • Why does low SNR make energy-based VAD unreliable?

Quick check

  • It removes all tuning from the system
  • It compares speech against the local background noise level
  • It detects the exact words in the signal