Lesson 02

Sample rate and Nyquist

Digital audio cannot store every instant of a continuous wave. It takes snapshots. The sample rate is how often those snapshots happen, and it sets the highest frequency the recording can represent.

The one idea

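To capture a frequency without ambiguity, you need to sample at more than twice that frequency. Turned around, a given sample rate can only represent frequencies below half that rate. That boundary, half the sample rate, is the Nyquist limit.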

Sampling turns a wave into numbers

A microphone produces a continuous signal. A computer needs discrete values. Sampling is the act of measuring the signal at fixed intervals: 16,000 times per second for 16 kHz audio, 44,100 times per second for CD audio, 48,000 times per second for common video and conferencing pipelines.

Each measurement becomes one sample. Put the samples in order and you have a digital approximation of the original wave. More samples per second can represent faster changes in the waveform. Fewer samples per second throw away high-frequency detail.

Sampling does not store the continuous curve. It stores measurements along the curve at fixed time intervals.
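
A minimal sketch of that measuring step, using only the Python standard library. The 440 Hz test tone, the 10 ms duration, and the helper name sample_sine are illustrative assumptions, not part of the lesson.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, duration_s):
    """Measure a sine wave at fixed intervals of 1 / sample_rate_hz seconds."""
    n_samples = int(round(sample_rate_hz * duration_s))
    return [
        math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]

# The same 10 ms of a 440 Hz tone, measured at two different rates.
samples_16k = sample_sine(440.0, 16_000, 0.01)
samples_48k = sample_sine(440.0, 48_000, 0.01)

print(len(samples_16k), len(samples_48k))  # 160 480
```

Both lists describe the same stretch of time. The 48 kHz version simply takes three measurements for every one in the 16 kHz version.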

The Nyquist limit

The highest frequency a sample rate can represent is half the sample rate. At 16 kHz, the theoretical ceiling is 8 kHz. At 48 kHz, it is 24 kHz. Content below the limit is not guaranteed to be captured perfectly, but the limit does tell you where representation becomes impossible.

If a sound contains frequencies above the Nyquist limit and you do not filter them out before sampling, they fold back into lower frequencies. That false low-frequency content is called aliasing. Once aliasing is recorded, you cannot reliably undo it because the digital samples no longer say which frequency was real.

A 16 kHz sample rate does not mean 16 kHz of useful frequency content. The usable ceiling is half that.
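
Aliasing is easy to see in numbers. The sketch below samples a 10 kHz cosine and a 6 kHz cosine at 16 kHz and compares the results. The tone frequencies and the helper names sample_cosine and alias_frequency are illustrative assumptions.

```python
import math

SAMPLE_RATE = 16_000
NYQUIST = SAMPLE_RATE / 2  # 8000.0 Hz

def sample_cosine(freq_hz, n_samples, sample_rate_hz=SAMPLE_RATE):
    """Sample a unit-amplitude cosine at fixed intervals."""
    return [
        math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]

def alias_frequency(freq_hz, sample_rate_hz):
    """Frequency that a pure tone appears at after sampling."""
    folded = freq_hz % sample_rate_hz
    return sample_rate_hz - folded if folded > sample_rate_hz / 2 else folded

above_limit = sample_cosine(10_000, 64)  # 10 kHz is above the 8 kHz limit
folded_tone = sample_cosine(6_000, 64)   # 6 kHz is below it

# The two sample sequences agree to within floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(above_limit, folded_tone))
print(alias_frequency(10_000, SAMPLE_RATE))  # 6000
print(max_diff < 1e-9)                       # True
```

Given only the samples, nothing distinguishes the 10 kHz tone from the 6 kHz one, which is why aliasing cannot be repaired after the fact.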

Pipeline bug

Bad downsampling is a silent model-quality bug. If you convert 48 kHz microphone audio to 16 kHz without proper low-pass filtering, high-frequency energy can alias into the speech band and confuse downstream features.
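
To make the bug concrete, the sketch below downsamples a 20 kHz tone from 48 kHz to 16 kHz in two ways: by keeping every third sample, and by low-pass filtering first. The windowed-sinc filter, the 7 kHz cutoff, the 101-tap length, and all function names are assumptions chosen for illustration. A production pipeline would normally use a tested resampling library instead.

```python
import math

def lowpass_taps(cutoff_hz, sample_rate_hz, num_taps=101):
    """Hamming-windowed sinc low-pass FIR taps, normalized to unity gain at DC."""
    fc = cutoff_hz / sample_rate_hz  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for k in range(num_taps):
        t = k - mid
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [h / total for h in taps]

def decimate_filtered(samples, factor, taps):
    """Apply the filter, computing only every `factor`-th output sample."""
    last_start = len(samples) - len(taps)
    return [
        sum(h * samples[start + k] for k, h in enumerate(taps))
        for start in range(0, last_start + 1, factor)
    ]

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

SRC_RATE, DST_RATE = 48_000, 16_000
FACTOR = SRC_RATE // DST_RATE  # 3

# A 20 kHz tone: representable at 48 kHz, far above the 8 kHz limit at 16 kHz.
tone = [math.sin(2 * math.pi * 20_000 * n / SRC_RATE) for n in range(4_800)]

naive = tone[::FACTOR]  # no filtering: the tone aliases to 4 kHz at full strength
filtered = decimate_filtered(tone, FACTOR, lowpass_taps(7_000, SRC_RATE))

print(f"naive RMS:    {rms(naive):.3f}")
print(f"filtered RMS: {rms(filtered):.3f}")
```

The naive output keeps the tone's full energy, now sitting at 4 kHz inside the speech band. The filtered output is close to silence, which is the correct result because a 16 kHz recording cannot represent a 20 kHz tone.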

Why speech often uses 16 kHz

Human hearing extends much higher than 8 kHz, but speech recognition does not need every high-frequency detail. Much of the information needed to recognize words lives below 8 kHz, and many ASR systems are trained around 16 kHz mono audio. That makes 16 kHz a practical speech default: small enough to save bandwidth and compute, wide enough for intelligible speech.
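
The bandwidth saving is simple arithmetic. The sketch below assumes uncompressed 16-bit PCM, which the lesson does not specify, and the helper name pcm_bytes_per_second is hypothetical.

```python
def pcm_bytes_per_second(sample_rate_hz, channels, bytes_per_sample=2):
    """Raw data rate of uncompressed PCM audio (16-bit samples by default)."""
    return sample_rate_hz * channels * bytes_per_sample

speech = pcm_bytes_per_second(16_000, channels=1)  # 16 kHz mono
studio = pcm_bytes_per_second(48_000, channels=2)  # 48 kHz stereo

print(speech)           # 32000
print(studio)           # 192000
print(studio / speech)  # 6.0
```

Under these assumptions, 16 kHz mono carries one sixth of the data of 48 kHz stereo, before any compression is applied.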

Telephony is often narrower. Classic phone audio is commonly sampled at 8 kHz, which preserves frequencies only up to about 4 kHz. Speech stays intelligible to a human listener, but consonants lose detail and ASR gets a harder input. Wideband telephony improves this by sampling at 16 kHz.

Music and podcast pipelines often stay at 44.1 kHz or 48 kHz because they need high-frequency content and because those rates are baked into consumer hardware. Speech models trained on telephony or conferencing audio usually do not need that extra bandwidth. Matching the training rate matters more than chasing the highest number on the spec sheet.

Whisper input contract

OpenAI Whisper expects 16 kHz mono float32 samples in the range [-1, 1]. If you pass raw 48 kHz stereo int16 samples without conversion, the mel frontend still treats them as 16 kHz audio, so every analysis frame covers the wrong span of time and accuracy drops even when the audio sounds fine to you. Normalize at the boundary: decode, downmix, resample with a quality converter, then scale to float32.
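
The normalization steps can be sketched with NumPy as follows. This is an illustration under stated assumptions, not Whisper's own loading code: the function names are hypothetical, the input is assumed to be already-decoded interleaved int16 PCM, and the resampler is a simple windowed-sinc filter that only handles integer multiples of 16 kHz. A real pipeline would normally hand resampling to a dedicated library.

```python
import numpy as np

TARGET_RATE = 16_000  # the rate the model expects

def lowpass_decimate(mono, factor, num_taps=101):
    """Low-pass below the new Nyquist limit, then keep every `factor`-th sample."""
    if factor == 1:
        return mono
    cutoff = 0.45 / factor  # cycles per input sample; the new limit is 0.5 / factor
    k = np.arange(num_taps) - (num_taps - 1) / 2
    taps = 2 * cutoff * np.sinc(2 * cutoff * k) * np.hamming(num_taps)
    taps /= taps.sum()  # unity gain at DC
    return np.convolve(mono, taps, mode="same")[::factor]

def to_model_input(pcm_int16, sample_rate, channels):
    """Interleaved int16 PCM -> 16 kHz mono float32 in [-1, 1]."""
    if sample_rate % TARGET_RATE != 0:
        raise ValueError(
            f"this sketch only handles integer multiples of {TARGET_RATE} Hz, "
            f"got {sample_rate}"
        )
    frames = np.asarray(pcm_int16, dtype=np.int16).reshape(-1, channels)
    mono = frames.astype(np.float64).mean(axis=1)              # downmix
    mono = lowpass_decimate(mono, sample_rate // TARGET_RATE)  # resample
    scaled = mono / 32768.0                                    # scale to [-1, 1]
    return np.clip(scaled, -1.0, 1.0).astype(np.float32)

# One second of a 440 Hz tone as 48 kHz stereo int16, interleaved L/R.
t = np.arange(48_000) / 48_000
tone = (0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
stereo = np.column_stack([tone, tone]).ravel()

audio = to_model_input(stereo, sample_rate=48_000, channels=2)
print(audio.dtype, audio.shape)
```

Rejecting unsupported rates with an error, rather than passing the samples through unchanged, keeps a mismatch from reaching the model silently.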

The model contract

Sample rate is part of a model's input contract. If a model was trained on 16 kHz audio, giving it 44.1 kHz samples without conversion changes the meaning of time. A 25 ms analysis window no longer contains the expected number of samples. A pitch pattern appears stretched or compressed relative to training. The model is not hearing the same representation.
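
The arithmetic behind that mismatch, as a small sketch. The helper names and the 440 Hz example tone are illustrative assumptions.

```python
def samples_in_window(window_ms, sample_rate_hz):
    """How many samples a window of the given duration contains."""
    return window_ms * sample_rate_hz / 1000

def window_duration_ms(n_samples, sample_rate_hz):
    """How much real time a fixed number of samples covers."""
    return n_samples * 1000 / sample_rate_hz

def apparent_frequency(freq_hz, actual_rate_hz, assumed_rate_hz):
    """Frequency a tone seems to have when its samples are read at the wrong rate."""
    return freq_hz * assumed_rate_hz / actual_rate_hz

print(samples_in_window(25, 16_000))  # 400.0
print(samples_in_window(25, 44_100))  # 1102.5

# A frontend built for 16 kHz takes 400 samples per window regardless of input.
print(f"{window_duration_ms(400, 44_100):.2f} ms")  # 9.07 ms

# A 440 Hz tone recorded at 44.1 kHz but interpreted as 16 kHz audio.
print(f"{apparent_frequency(440, 44_100, 16_000):.1f} Hz")  # 159.6 Hz
```

Fed 44.1 kHz samples, the same 400-sample window covers about 9 ms instead of 25 ms, and every pitch appears lower by the ratio of the two rates.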

Good audio pipelines normalize sample rate explicitly at the boundary. They do not hope that every browser, phone, SDK, and storage service picked the same rate.

When you log ingestion metrics, track the distribution of incoming sample rates. A sudden spike in 48 kHz uploads after a mobile app release is a clue that a client SDK changed defaults, not that users suddenly care about hi-fi speech.
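
One way to collect that distribution, sketched with the standard library's wave module. It assumes uploads arrive as WAV files. EXPECTED_RATE, the helper names, and the in-memory test files are hypothetical stand-ins for a real ingestion path.

```python
import io
import wave
from collections import Counter

EXPECTED_RATE = 16_000  # hypothetical rate this pipeline's model was trained on

def wav_sample_rate(fileobj):
    """Read the sample rate from a WAV header without decoding the audio."""
    with wave.open(fileobj, "rb") as wav:
        return wav.getframerate()

def make_silent_wav(sample_rate, n_frames=160):
    """Build a tiny in-memory mono 16-bit WAV, standing in for an upload."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * n_frames)
    buf.seek(0)
    return buf

uploads = [make_silent_wav(r) for r in (16_000, 16_000, 48_000, 44_100, 48_000)]

rate_counts = Counter(wav_sample_rate(upload) for upload in uploads)
unexpected = {rate: n for rate, n in rate_counts.items() if rate != EXPECTED_RATE}

print(dict(rate_counts))
print(unexpected)
```

In a real service the counts would go to a metrics system instead of being printed, so a shift in the distribution shows up on a dashboard.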

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What does 16 kHz audio mean?
What is the Nyquist limit for 16 kHz audio?
What is aliasing?
Why should sample rate conversion be explicit in a voice pipeline?
