Hysteresis, smoothing, and segmentation

Frame scores are noisy. Users and downstream models need clean speech spans. What bridges the two is a small state machine with carefully chosen timing rules.

The one idea

A VAD score becomes useful only after temporal logic turns flickering frame decisions into stable starts, stable ends, and sensible segment boundaries.

Why raw decisions flicker

Speech is not continuously loud. Words have gaps. Consonants can be weak. Breaths and mouth sounds can look speech-like. Background noise can spike for one frame. If you cut directly on every frame decision, a single sentence can become a dozen tiny fragments.

That fragmentation hurts ASR because recognizers need context. It hurts voice agents because the user appears to stop and start repeatedly. It hurts analytics because talk time and turn counts become artifacts of the detector.

Hysteresis uses two thresholds

Hysteresis means the system uses different rules to enter and exit speech. To start a speech segment, the score must cross a higher threshold. To stay in speech, it only needs to remain above a lower threshold. The gap between the two thresholds prevents the state from flipping rapidly when the score hovers near a single boundary.

This pattern is common in signal processing (the Schmitt trigger is the classic circuit example) because it tolerates uncertainty near a decision boundary. Once the system believes speech has started, a short dip should not immediately end the segment. Once it believes the user is silent, a tiny spike should not immediately open a segment.
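The two-threshold rule can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `hysteresis_states` and the threshold values 0.6 and 0.4 are assumptions chosen for the example, and the scores are made up.

```python
from typing import Iterable, List


def hysteresis_states(
    scores: Iterable[float],
    open_threshold: float = 0.6,   # illustrative value, not a recommendation
    close_threshold: float = 0.4,  # illustrative value, not a recommendation
) -> List[bool]:
    """Return one in-speech flag per frame using two thresholds."""
    in_speech = False
    states = []
    for score in scores:
        if not in_speech and score >= open_threshold:
            in_speech = True    # enter speech only on a confident score
        elif in_speech and score < close_threshold:
            in_speech = False   # leave speech only on a clear drop
        states.append(in_speech)
    return states


# Made-up scores that hover around 0.5.
scores = [0.2, 0.5, 0.7, 0.5, 0.45, 0.7, 0.3, 0.5]

print(hysteresis_states(scores))
# [False, False, True, True, True, True, False, False]

# A single threshold at 0.5 flips on almost every frame instead.
print([s >= 0.5 for s in scores])
# [False, True, True, True, False, True, False, True]
```

With one threshold the example produces three separate speech runs. With two thresholds it produces one.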

Hysteresis and hangover turn a nervous score line into one usable segment.

Hangover protects short pauses

Hangover keeps the speech state open for a short time after scores drop. If speech resumes within that window, the pause stays inside the same segment. This protects natural pauses between words and clauses.

Too little hangover cuts words and sentences. Too much hangover delays endpointing and makes voice agents feel slow. The right value depends on the product: command-and-control systems want quick endpoints, while dictation can tolerate longer pauses.
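Hangover can be sketched as a countdown that is refilled on every speech frame. The function name `apply_hangover` and the frame counts are assumptions for illustration. Real systems usually express hangover in milliseconds and convert to frames.

```python
from typing import List


def apply_hangover(raw: List[bool], hangover_frames: int) -> List[bool]:
    """Keep the speech state open for hangover_frames after the last speech frame."""
    out = []
    remaining = 0
    for is_speech in raw:
        if is_speech:
            remaining = hangover_frames  # refill the countdown
            out.append(True)
        elif remaining > 0:
            remaining -= 1               # spend one frame of hangover
            out.append(True)
        else:
            out.append(False)
    return out


# Two bursts of speech separated by a two-frame pause.
raw = [True, True, False, False, True, False, False, False, False]

print(apply_hangover(raw, hangover_frames=2))
# [True, True, True, True, True, True, True, False, False]
```

The pause is bridged, so both bursts land in one segment. The cost is visible at the end: the segment closes two frames later than the raw decision would have closed it.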

Padding protects boundaries

Many systems add a small amount of audio before the detected start and after the detected end. Start padding protects quiet first syllables. End padding protects trailing phonemes and gives ASR a little acoustic context.

Padding is cheap in batch processing. In streaming systems, start padding requires a rolling buffer because the system must keep recent audio before it knows speech started. That buffer is one reason VAD is usually stateful.
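One way to picture the rolling buffer is a fixed-length deque that holds the most recent non-speech frames and is flushed when speech opens. This is a sketch under simplifying assumptions: `emit_with_preroll` is a hypothetical name, frames are stand-in strings rather than audio, and the in-speech flag is assumed to come from an upstream detector.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def emit_with_preroll(
    frames: Iterable[Tuple[str, bool]], preroll_frames: int
) -> Iterator[str]:
    """Yield speech frames, preceded by up to preroll_frames of buffered audio."""
    recent: Deque[str] = deque(maxlen=preroll_frames)  # oldest frames fall off
    was_speech = False
    for frame, in_speech in frames:
        if in_speech:
            if not was_speech:
                yield from recent   # speech just opened: flush the pre-roll
                recent.clear()
            yield frame
        else:
            recent.append(frame)    # not speech yet, but keep it just in case
        was_speech = in_speech


stream = [("a", False), ("b", False), ("c", False),
          ("d", True), ("e", True), ("f", False), ("g", True)]

print(list(emit_with_preroll(stream, preroll_frames=2)))
# ['b', 'c', 'd', 'e', 'f', 'g']
```

Frame "a" is dropped because the buffer holds only two frames. The buffer is state that persists between calls in a real streaming system, which is why the detector cannot be a pure per-frame function.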

Engineering reality

Most VAD tuning involves more than the model threshold. The behavior comes from threshold plus smoothing window plus hangover plus padding plus minimum segment length. Track them together as a config set, not as one magic number.

Minimum durations remove junk

A minimum speech duration can reject clicks, pops, and isolated noise bursts. A minimum silence duration can stop the system from splitting a sentence on a tiny pause. These rules encode what counts as a usable segment for the next stage.

Be careful with aggressive filtering. Very short user commands can be real: "yes," "no," "stop," "send." If the product expects short utterances, minimum duration rules need a path that preserves them.
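Both minimum-duration rules can be sketched as a post-processing pass over candidate segments. The name `clean_segments`, the millisecond tuples, and the choice to merge before filtering are assumptions for this example. Other orderings are possible and give different results.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start_ms, end_ms)


def clean_segments(
    segments: List[Segment], min_speech_ms: int, min_silence_ms: int
) -> List[Segment]:
    """Merge segments split by short pauses, then drop segments that are too short."""
    merged: List[Segment] = []
    for start, end in segments:
        if merged and start - merged[-1][1] < min_silence_ms:
            merged[-1] = (merged[-1][0], end)  # pause too short: join segments
        else:
            merged.append((start, end))
    # Filter after merging so short pieces of a longer utterance survive.
    return [(s, e) for s, e in merged if e - s >= min_speech_ms]


candidates = [(0, 400), (450, 900), (2000, 2040), (3000, 3300)]

print(clean_segments(candidates, min_speech_ms=250, min_silence_ms=100))
# [(0, 900), (3000, 3300)]
```

The 50 ms pause is absorbed and the 40 ms burst is rejected. The same filter would also reject a real 200 ms "yes", which is the risk described above: a product that expects short commands needs a lower minimum or a separate path for them.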

Segmentation knobs

Common production configs include open threshold, close threshold, smoothing window, pre-roll padding, post-roll padding, minimum speech duration, minimum silence duration, and maximum segment length.
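Keeping the knobs in one typed object makes it easier to log, compare, and roll back a tuning change as a unit. The sketch below is one possible shape: the class name `SegmenterConfig`, the field names, and every default value are placeholders for illustration, not recommendations.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SegmenterConfig:
    """All segmentation knobs in one place. Defaults are placeholders."""
    open_threshold: float = 0.6
    close_threshold: float = 0.4
    smoothing_window_frames: int = 5
    pre_roll_ms: int = 200
    post_roll_ms: int = 200
    min_speech_ms: int = 250
    min_silence_ms: int = 100
    max_segment_ms: int = 30_000

    def __post_init__(self) -> None:
        # Hysteresis only makes sense if closing is easier than opening.
        if not 0.0 <= self.close_threshold <= self.open_threshold <= 1.0:
            raise ValueError("need 0 <= close_threshold <= open_threshold <= 1")
        durations = (self.pre_roll_ms, self.post_roll_ms, self.min_speech_ms,
                     self.min_silence_ms, self.max_segment_ms)
        if min(durations) < 0:
            raise ValueError("durations must be non-negative")


config = SegmenterConfig(min_speech_ms=150)  # e.g. a product with short commands
print(asdict(config))
```

Because the dataclass is frozen, a tuning experiment creates a new config object instead of mutating a shared one.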

Starting numbers

A reasonable first pass for a voice agent at 20 ms frames: min speech duration 250 ms (about 12 frames), min silence 100 ms (5 frames) before closing a segment, hangover 300–500 ms for natural pauses, and pre-roll 150–250 ms so the first syllable is not clipped. Tune from product audio, not from these defaults alone.
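Converting these millisecond values into frame counts is simple arithmetic, but the rounding direction matters when a value is not a multiple of the frame size. At 20 ms frames, 250 ms is 12.5 frames, so rounding down gives 12 and rounding up gives 13. The sketch below rounds up, which is an assumption made here so that the frame count always covers the requested duration. The helper name `ms_to_frames` is hypothetical.

```python
import math

FRAME_MS = 20  # frame size assumed by the starting numbers above


def ms_to_frames(duration_ms: float, frame_ms: float = FRAME_MS) -> int:
    """Smallest whole number of frames that covers duration_ms."""
    return math.ceil(duration_ms / frame_ms)


starting_points_ms = {
    "min_speech": 250,
    "min_silence": 100,
    "hangover_low": 300,
    "hangover_high": 500,
    "pre_roll_low": 150,
    "pre_roll_high": 250,
}

frames = {name: ms_to_frames(ms) for name, ms in starting_points_ms.items()}
print(frames)
# {'min_speech': 13, 'min_silence': 5, 'hangover_low': 15,
#  'hangover_high': 25, 'pre_roll_low': 8, 'pre_roll_high': 13}
```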

Checkpoint

You're ready for the next lesson if you can answer these from memory:

Why do frame-level VAD scores flicker?
How does hysteresis prevent rapid state changes?
What does hangover protect, and what does it cost?
Why do start padding and a rolling buffer belong together?

Quick check

Hangover or endpoint silence duration
The output sample rate of TTS
The number of ASR model parameters