Denoising and noise suppression for VAD

Denoising can make speech easier to detect, but it is not a free win just because it runs upstream. It changes the signal VAD sees, so it can reduce false alarms while also erasing quiet speech and blurring speech boundaries.

The one idea

Denoising belongs between capture cleanup and detection, but it must be evaluated with VAD and ASR together. A cleaner-sounding signal is not always a better detection signal.

Where denoising sits

A practical voice pipeline usually starts with capture, resampling, channel handling, echo cancellation, and automatic gain control. Denoising or noise suppression often runs after those capture-level steps and before VAD, ASR, or both.

That placement is useful because VAD is sensitive to the noise floor. If a fan, road, keyboard, or room hum is reduced before detection, the VAD has an easier job. The speech frames stand out more clearly from the background.

Denoising usually sits before VAD, but raw or lightly processed audio is still useful for evaluation and incident debugging.

Denoising is not the same job as VAD

Denoising tries to improve the audio signal by reducing unwanted sound. VAD tries to decide whether speech is present. They are adjacent but different jobs. A denoiser can make audio sound cleaner to a human while making VAD boundaries worse.

For example, aggressive suppression can remove low-energy consonants, smear speech starts, create musical artifacts, or make background speech sound like foreground speech. Those changes matter because VAD is often looking for exactly the subtle edges that denoising modifies.

RNNoise and the DNS challenge

RNNoise is a widely used recurrent-network noise suppressor from Xiph. It targets real-time speech enhancement with modest CPU cost, which makes it a common denoiser in voice stacks before VAD or ASR. It is primarily a denoiser rather than a dedicated VAD, although its frame-processing call also returns a voice-activity probability as a by-product. Lowering the noise floor can make energy-based and other classical detectors more stable.

The Microsoft Deep Noise Suppression (DNS) Challenge datasets and leaderboards pushed modern speech-enhancement models toward realistic room, device, and codec noise. When you compare denoisers for VAD, borrow that mindset: test on noisy product audio, not only on clean studio clips.

Denoise before VAD or after?

Most production voice pipelines run denoising before VAD. The detector sees a cleaner noise floor, false alarms from HVAC and road hum drop, and segmentation stays closer to what the user hears after capture cleanup.

Running denoising after VAD is rare for realtime turn-taking because VAD would still react to raw noise. The exception is batch transcription: you might segment on lightly processed audio, then denoise each speech chunk independently for ASR. Even then, measure both paths. If denoising smears word boundaries, ASR word error rate (WER) can rise even when VAD frame scores look better.

Recommendation: denoise before VAD for streaming agents and calls; keep a raw copy for debugging; re-run VAD metrics whenever you change denoiser aggressiveness.

What helps VAD

Noise suppression helps most when the noise is steady or predictable: fans, HVAC, hum, road noise, mild room tone, or device hiss. Lowering that floor increases the gap between speech and non-speech frames, which helps both energy-based VAD and model-based VAD.
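
The gap between speech and non-speech levels can be made concrete with a small calculation. The sketch below is an idealized illustration, not a denoiser: the `tone` helper, the amplitudes, and the 0.25 attenuation factor are all assumptions chosen for the example, and it assumes suppression leaves the speech frame untouched, which real denoisers do not guarantee.

```python
import math

def rms_db(frame):
    """Frame level in dB relative to full scale (1.0)."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_square + 1e-12)  # epsilon avoids log(0)

def tone(amplitude, n=160, cycles=4):
    """Hypothetical stand-in signal: a sine with whole cycles in the frame."""
    return [amplitude * math.sin(2.0 * math.pi * cycles * i / n) for i in range(n)]

speech = tone(0.1)    # stand-in for a speech frame
noise = tone(0.05)    # stand-in for a steady-noise frame (fan, hum)

# Assumed effect of suppression: noise attenuated to a quarter of its amplitude
# (about 12 dB), speech unchanged. This is the best case, not a guarantee.
suppressed = [0.25 * x for x in noise]

gap_before = rms_db(speech) - rms_db(noise)
gap_after = rms_db(speech) - rms_db(suppressed)
print(f"gap before: {gap_before:.1f} dB, after: {gap_after:.1f} dB")
# prints "gap before: 6.0 dB, after: 18.1 dB"
```

A wider gap gives an energy threshold more room to sit between the two levels. If the denoiser also attenuates quiet speech, the gap shrinks again, which is why the measurement has to be done on real audio.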

Echo cancellation is especially important for voice agents. Without it, the assistant's own TTS can leak back into the microphone and look like user speech. Denoising alone is not enough for that problem. Echo cancellation needs the playback reference signal so it can remove the system audio from the mic stream.

Practical default

For realtime voice agents, think in this order: echo cancellation for playback leakage, gain/channel normalization for stable levels, noise suppression for background floor, then VAD and endpointing.
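
That ordering can be sketched as a chain of stages. Every function here is a hypothetical placeholder written for this example: the fixed-gain subtraction stands in for a real adaptive echo canceller, the gate stands in for a real noise suppressor, and the thresholds are arbitrary. The point is only the order and the fact that echo cancellation takes the playback reference as an input.

```python
def cancel_echo(mic, playback_ref, leak=0.5):
    # Placeholder: assumes the TTS leaks into the mic at a known fixed gain.
    # A real echo canceller estimates this path adaptively.
    return [m - leak * p for m, p in zip(mic, playback_ref)]

def apply_gain(frame, gain=2.0):
    # Placeholder for gain/channel normalization.
    return [gain * x for x in frame]

def suppress_noise(frame, floor=0.02):
    # Placeholder noise gate: zero out samples below an assumed floor.
    return [0.0 if abs(x) < floor else x for x in frame]

def is_speech(frame, threshold=0.01):
    # Placeholder energy VAD.
    energy = sum(x * x for x in frame) / len(frame)
    return energy > threshold

def process(mic, playback_ref, echo_cancel=True):
    frame = cancel_echo(mic, playback_ref) if echo_cancel else mic
    frame = apply_gain(frame)
    frame = suppress_noise(frame)
    return is_speech(frame)

playback = [0.4, -0.4] * 80                   # assistant TTS being played
leak_only = [0.5 * p for p in playback]       # mic hears only the leaked TTS
user = [0.3, 0.3, -0.3, -0.3] * 40            # stand-in for user speech
user_plus_leak = [u + l for u, l in zip(user, leak_only)]

print(process(leak_only, playback))                     # False: leakage removed
print(process(leak_only, playback, echo_cancel=False))  # True: TTS looks like speech
print(process(user_plus_leak, playback))                # True: user speech survives
```

With echo cancellation skipped, the gate and the energy detector both pass the leaked TTS, which matches the point that noise suppression alone does not solve playback leakage.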

What can hurt VAD

Over-denoising can hurt detection in three ways. If quiet speech is removed, missed speech goes up. If artifacts appear during silence, false alarms go up. If the denoiser adds lookahead, endpointing gets slower even if the detector itself is fast.

This is why denoising cannot be evaluated only by listening to a few samples. You need paired VAD metrics: raw or lightly processed audio versus denoised audio, measured on missed speech, false alarms, endpoint delay, and ASR quality.
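
A paired comparison can start with frame-level miss and false-alarm rates computed against the same reference labels. The sketch below uses made-up toy labels purely to show the calculation; the function name and the label sequences are assumptions for illustration, not measurements from any real system.

```python
def vad_frame_metrics(reference, predicted):
    """Frame-level miss rate and false-alarm rate (1 = speech, 0 = non-speech)."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    speech = sum(1 for r in reference if r)
    nonspeech = len(reference) - speech
    missed = sum(1 for r, p in zip(reference, predicted) if r and not p)
    false_alarms = sum(1 for r, p in zip(reference, predicted) if not r and p)
    return {
        "miss_rate": missed / speech if speech else 0.0,
        "false_alarm_rate": false_alarms / nonspeech if nonspeech else 0.0,
    }

# Toy labels, invented for this example.
reference     = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
vad_on_raw    = [1, 0, 1, 1, 1, 1, 0, 1, 0, 0]  # two false alarms on noise
vad_on_denoised = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]  # clean silence, clipped onset

raw_metrics = vad_frame_metrics(reference, vad_on_raw)
denoised_metrics = vad_frame_metrics(reference, vad_on_denoised)
print("raw:     ", raw_metrics)
print("denoised:", denoised_metrics)
```

In this toy case the denoised path removes the false alarms but misses the first speech frame, the kind of trade-off that listening alone tends to hide. Endpoint delay and ASR quality need their own measurements on the same paired audio.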

The best denoiser for VAD is not the one that sounds most silent. It is the one that preserves speech evidence while reducing confusing noise.

Streaming constraints

Offline denoisers can use future audio. Realtime denoisers cannot use much lookahead without adding latency. A denoiser with 200 ms of lookahead may be acceptable for recorded transcription and unacceptable for a live agent.

Streaming denoising also changes buffering. If the denoiser emits chunks late, every downstream VAD event is late. Timestamp audio at capture time and carry those timestamps through denoising, VAD, ASR, and endpointing so you can measure real user-perceived delay.
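
One way to keep that delay measurable is to attach the capture time to each chunk and never rewrite it. The sketch below is a simplified model under stated assumptions: the `Chunk` type, the 20 ms chunk size, and the `denoise` function (which only models lookahead delay and does no audio processing) are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, replace

CHUNK_MS = 20  # assumed chunk duration

@dataclass(frozen=True)
class Chunk:
    capture_ms: int    # when the first sample reached the mic; never rewritten
    available_ms: int  # when the chunk becomes available to the next stage

def capture(n_chunks):
    # A chunk is available once its last sample has been captured.
    return [Chunk(i * CHUNK_MS, (i + 1) * CHUNK_MS) for i in range(n_chunks)]

def denoise(chunk, lookahead_ms):
    # Models only the timing cost: the denoiser must wait for future audio
    # before it can emit this chunk. The capture timestamp is carried through.
    return replace(chunk, available_ms=chunk.available_ms + lookahead_ms)

def delay_ms(chunk):
    # Delay as seen from the moment of capture, not from the last stage.
    return chunk.available_ms - chunk.capture_ms

live = [denoise(c, lookahead_ms=0) for c in capture(3)]
batch = [denoise(c, lookahead_ms=200) for c in capture(3)]

print([delay_ms(c) for c in live])   # [20, 20, 20]
print([delay_ms(c) for c in batch])  # [220, 220, 220]
```

Because `capture_ms` survives every stage, a VAD or endpointing event raised on a chunk can be compared against when the audio actually happened, rather than against when the detector received it.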

Checkpoint

You're ready for the next lesson if you can answer these from memory:

Where does denoising usually sit relative to capture cleanup, VAD, and ASR?
When should denoise run before VAD versus only on speech chunks?
What is RNNoise used for in a voice stack?
What should you measure before deciding a denoiser helped VAD?

Quick check

The denoiser removed speech evidence that VAD needed
The ASR vocabulary got smaller
The microphone sample rate is too high