Channels, mono, stereo, and resampling
A channel is one waveform stream. Voice systems often collapse everything to mono, but the conversion has to be deliberate.
Most speech models want one clean waveform at one expected sample rate. Channels and resampling are format contracts, not cosmetic file details.
What channels store
Mono audio has one channel: one sequence of samples. Stereo has two channels, usually left and right. A conference recording might have more: separate participants, system audio, or beamformed microphone arrays.
For music, stereo image matters. For ASR and VAD, the model usually wants intelligible speech, not spatial placement. That is why many pipelines downmix stereo to mono before feature extraction.
Downmixing is not always harmless
The simple downmix is an average: mono = (left + right) / 2. That works when both channels contain similar speech. It fails when one channel has a different speaker, delayed echo, noise, or inverted phase. In the worst case, averaging can cancel part of the speech you wanted to keep.
When channels have distinct meaning, preserve that meaning. A call-center recording with agent on the left and customer on the right should not always be averaged. You may need diarization, per-channel ASR, or a clear choice of which channel drives the model.
Log channel count and downmix policy. Many "ASR got worse" incidents are really input changes: a vendor switched from mono to stereo, a browser changed capture format, or one channel became silent.
Resampling changes the time grid
Resampling converts audio from one sample rate to another. Going from 48 kHz to 16 kHz means throwing away samples after filtering out frequencies that 16 kHz cannot represent. Going from 8 kHz to 16 kHz adds samples, but it does not recover high-frequency detail that was never recorded.
This is why "upsampling" phone audio does not make it wideband. It only changes the format so a model can accept it. The missing detail stays missing.
In Python, librosa.resample and the soxr library (also exposed as soxr.resample) are common choices. soxr is often faster and higher quality for speech downsampling. FFmpeg's aresample filter is the usual choice in media pipelines. Avoid naive "take every third sample" downsampling: it skips the anti-aliasing filter and creates the aliasing bugs from Lesson 02.
Low-quality resamplers add ringing and phase smear that humans barely notice but ASR can. When WER jumps after an infrastructure change, compare spectrograms before and after resampling, not just waveforms.
Normalize at the boundary
A robust voice pipeline usually decodes whatever the user provides, validates it, converts to the target format, and only then runs feature extraction or model inference. For many speech models that target format is 16 kHz mono PCM.
Make that boundary explicit. Store the original when useful for auditing, but feed the model the normalized representation it was trained to understand.
NumPy arrays from audio loaders are shaped (n_samples,) for mono or (n_samples, n_channels) when channels are still separate. PyTorch and ONNX runtimes usually want (batch, n_samples) or (batch, n_mels, time) after feature extraction. Document the layout in your API so downstream services do not transpose silently.
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- What is an audio channel?
- Why do speech models often prefer mono?
- How can stereo downmixing damage speech?
- Why does upsampling not restore missing detail?
Quick check
- Normalize it deliberately to the expected mono sample rate
- Pass it through because higher sample rate is always better
- Rename the file so it says 16 kHz mono