Lesson 06

Fourier transform and spectrograms

A waveform tells you how pressure changed over time. A spectrogram tells you which frequencies were active as time moved.

The one idea

The Fourier transform rewrites a short slice of audio as a mixture of frequencies. A spectrogram repeats that over many slices, creating a time-frequency image.

From time to frequency

The waveform is useful, but many speech patterns are easier to see by frequency. Vowels have formants: bands where energy concentrates. Fricatives like "s" often show high-frequency noise. Plosives like "p" and "t" create short bursts.

The Fourier transform asks: if this short piece of audio were made from sine waves, how much of each frequency would we need? Software computes the answer with the fast Fourier transform (FFT), an efficient algorithm for the discrete Fourier transform.

The FFT takes one short time window and estimates how much energy is present at each frequency.
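To make that question concrete, here is a minimal sketch that measures one 25 ms window of a pure 1 kHz tone. It uses a deliberately slow, naive discrete Fourier transform so every step is visible; an FFT routine such as `numpy.fft.rfft` would give the same magnitudes much faster. The 16 kHz sample rate, the test tone, and the helper name `dft_magnitudes` are assumptions made for illustration.

```python
import cmath
import math

SAMPLE_RATE = 16_000   # Hz, assumed for this example
WINDOW = 400           # samples, i.e. 25 ms at 16 kHz


def dft_magnitudes(frame):
    """Return the magnitude of each non-negative frequency bin of a real frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        # Correlate the frame with a complex sinusoid that completes
        # k cycles over the window.
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


# A pure 1 kHz tone, exactly one window long.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(WINDOW)]
mags = dft_magnitudes(tone)

# The strongest bin, converted back to Hz.
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / WINDOW
print(peak_bin, peak_hz)
```

Bin k corresponds to k × sample rate ÷ window length, so the strongest bin lands on the tone's frequency whenever the tone fits a whole number of cycles into the window.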

Windows and hops

Speech changes over time, so we do not transform the whole recording at once. We take a short window, often around 20 to 30 ms, run an FFT, move forward by a hop, and repeat. Stack those frequency snapshots left to right and you get a spectrogram.

Window length creates a tradeoff. Longer windows distinguish frequencies better but smear fast events in time. Shorter windows locate sudden events better but give rougher frequency resolution. Audio feature extraction is full of these practical compromises.
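A quick way to see the tradeoff in numbers: without zero padding, FFT bins are spaced sample rate ÷ window length apart, which is the same as 1 ÷ window duration. The sketch below assumes a 16 kHz sample rate; `window_tradeoff` is a hypothetical helper name, and the true resolving power also depends on the window shape, which this ignores.

```python
SAMPLE_RATE = 16_000  # Hz, assumed


def window_tradeoff(window_ms):
    """Return (samples, bin spacing in Hz) for a window of the given length."""
    samples = round(SAMPLE_RATE * window_ms / 1000)
    bin_spacing_hz = SAMPLE_RATE / samples
    return samples, bin_spacing_hz


for ms in (10, 25, 50):
    samples, spacing = window_tradeoff(ms)
    print(f"{ms:>3} ms window = {samples:>4} samples, bins every {spacing:.0f} Hz")
```

Doubling the window halves the bin spacing, but every frame then averages over twice as much time.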

A spectrogram is built by sliding a window over the waveform and stacking the FFT result for each window.
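The sliding-window procedure can be sketched in plain Python. This is an illustration under stated assumptions, not a reference implementation: the Hann taper is an assumption (the lesson does not name a window function), the naive DFT stands in for an FFT, no padding is applied, and the function names are hypothetical.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window: tapers frame edges to reduce spectral leakage."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def power_spectrum(frame):
    """Naive DFT power for the non-negative frequency bins of one frame."""
    n = len(frame)
    out = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        out.append(abs(acc) ** 2)
    return out


def spectrogram(samples, window=400, hop=160):
    """Slide a window over the samples; collect one power spectrum per frame."""
    taper = hann(window)
    columns = []
    for start in range(0, len(samples) - window + 1, hop):
        frame = [s * w for s, w in zip(samples[start:start + window], taper)]
        columns.append(power_spectrum(frame))
    return columns  # indexed as columns[frame_index][frequency_bin]


# 100 ms test signal at 16 kHz: 400 Hz for the first half, 2 kHz for the second.
sr = 16_000
signal = [math.sin(2 * math.pi * (400 if t < 800 else 2000) * t / sr)
          for t in range(1600)]
spec = spectrogram(signal)

# Strongest frequency in each frame, in Hz.
loudest = [max(range(len(col)), key=col.__getitem__) * sr / 400 for col in spec]
print(loudest)
```

The early frames peak at 400 Hz and the late frames at 2 kHz. A single transform of the whole recording would report both frequencies but not when each one occurred.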

Numeric example at 16 kHz

At a 16 kHz sample rate, each millisecond is 16 samples, so a duration converts to a sample count by multiplying by the sample rate:

Parameter	Duration	Value at 16 kHz
FFT / window length	25 ms	400 samples (16000 × 0.025)
Hop / frame shift	10 ms	160 samples (16000 × 0.010)
Frames per second	10 ms hop	100 frames/s (16000 ÷ 160)

These are the STFT parameters Whisper uses. If your hop is 10 ms, a streaming system emits a new feature column every 10 ms. The hop size therefore sets the finest time step at which spectral features can react to new audio.
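The same arithmetic as a small sketch. The helper names are hypothetical, and `frame_count` assumes no padding; implementations that pad the signal before framing will report a slightly different number of frames.

```python
SAMPLE_RATE = 16_000  # Hz


def ms_to_samples(ms, sample_rate=SAMPLE_RATE):
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(sample_rate * ms / 1000)


def frame_count(total_samples, window, hop):
    """Number of complete windows that fit, assuming no padding."""
    if total_samples < window:
        return 0
    return 1 + (total_samples - window) // hop


window = ms_to_samples(25)               # window length in samples
hop = ms_to_samples(10)                  # hop in samples
frames_per_second = SAMPLE_RATE // hop   # steady-state frame rate
first_second = frame_count(SAMPLE_RATE, window, hop)

print(window, hop, frames_per_second, first_second)
```

The steady-state rate is 100 frames per second, but the first second of audio holds only 98 complete unpadded frames, because the last two windows would run past the end of the buffer.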

Magnitude is what most models start with

Each FFT output bin is a complex number that encodes both magnitude and phase. Many speech systems keep only the magnitude, often after converting it to power or a log scale, because it captures where energy lives. Phase can matter for reconstruction and enhancement, but recognition systems often learn from magnitude-like features.
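As a small illustration, this is how a single complex bin splits into magnitude, phase, and power. The value `3 + 4j` is a made-up bin value, not real FFT output.

```python
import cmath

bin_value = complex(3.0, 4.0)  # hypothetical FFT output for one frequency bin

# polar() returns (magnitude, phase angle in radians).
magnitude, phase = cmath.polar(bin_value)
power = magnitude ** 2

# Magnitude and phase together recover the original complex value;
# magnitude alone does not.
rebuilt = cmath.rect(magnitude, phase)

print(magnitude, power, phase)
```

Dropping the phase is why a magnitude spectrogram cannot be inverted back to the exact waveform without extra work.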

A log scale is common because raw energy has a huge range. Log compression makes quiet and loud details coexist in a range models can learn from more easily.

Power spectrograms square the magnitude; log-mel applies log after mel weighting. The order matters. Whisper computes mel power, takes log10, then clamps and normalizes. Mixing up power vs magnitude or natural log vs log10 is a common reason homemade features do not match the reference implementation.
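The ordering can be sketched on a handful of numbers. This is a Whisper-style illustration, not the reference code: the mel filterbank step is skipped, the input values are invented, and the constants (a 1e-10 floor, an 8.0 dynamic range, and the shift and scale by 4) are taken from memory of the open-source implementation, so check them against the source before relying on them.

```python
import math


def log_mel_normalize(mel_power, floor=1e-10, dynamic_range=8.0):
    """Log-compress mel power values: log10, then clamp, then normalize."""
    # 1. log10 of power, with a floor so silence does not produce -inf.
    log_spec = [math.log10(max(p, floor)) for p in mel_power]
    # 2. Clamp: nothing may sit more than `dynamic_range` below the peak.
    lowest = max(log_spec) - dynamic_range
    log_spec = [max(v, lowest) for v in log_spec]
    # 3. Normalize by shifting and scaling (assumed constants).
    return [(v + 4.0) / 4.0 for v in log_spec]


mel_power = [1.0, 1e-4, 1e-12, 0.0]  # hypothetical mel-band powers
features = log_mel_normalize(mel_power)
print(features)
```

Feeding magnitude instead of power into step 1 halves every log value, and using the natural log scales them all by about 2.3. Either mistake shifts the whole feature range away from what the model was trained on.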

Mental picture

Think of a spectrogram as a heatmap: time on the horizontal axis, frequency on the vertical axis, brightness for energy.

Why spectrograms became the bridge to ML

Before end-to-end audio models became common, speech systems relied heavily on engineered features derived from spectrograms. Even modern models often start from log-mel spectrograms because they are compact, stable, and aligned with speech perception.

The important shift is that a spectrogram exposes structure that the raw waveform hides. It turns "pressure over time" into "which frequencies changed when." For speech, that is a much friendlier starting point.

Go deeper (optional)

3Blue1Brown's video on the Fourier transform builds intuition for why any signal can be decomposed into rotating frequencies. Watch it if the jump from waveform to spectrum still feels like a magic trick.

Engineering reality

Realtime voice stacks buffer incoming PCM until one STFT window is full. With a 400-sample window at 16 kHz, each frame covers 25 ms of audio. A 160-sample hop means each later frame needs only 10 ms of new audio, but the first frame still has to wait for a full 25 ms. Size capture buffers as multiples of the hop (160 samples = 10 ms at 16 kHz) to avoid partial frames and extra memory copies. A VAD that runs every 10 ms and an ASR front end that consumes 25 ms of context often share the same underlying hop contract.
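A minimal sketch of that buffering logic, assuming capture delivers 10 ms chunks. `StreamingFramer` is a hypothetical helper written for this lesson, not part of any library, and a production version would likely use a ring buffer rather than a Python list.

```python
class StreamingFramer:
    """Buffers PCM samples and emits overlapping frames (hypothetical helper)."""

    def __init__(self, window=400, hop=160):
        self.window = window
        self.hop = hop
        self._buffer = []

    def push(self, samples):
        """Add new samples; return every frame that is now complete."""
        self._buffer.extend(samples)
        frames = []
        while len(self._buffer) >= self.window:
            frames.append(self._buffer[:self.window])
            # Drop only one hop so the next frame overlaps this one.
            del self._buffer[:self.hop]
        return frames


framer = StreamingFramer()
# Feed five 10 ms chunks and record how many frames each push produces.
emitted = [len(framer.push([0.0] * 160)) for _ in range(5)]
print(emitted)
```

The first two pushes produce nothing because the buffer holds less than one window. With hop-sized chunks, the first frame appears on the third push, and every push after that yields exactly one frame.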

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What question does the Fourier transform answer?
How does a spectrogram use repeated FFT windows?
What tradeoff does window length create?
Why is log energy useful?

Quick check

Time, frequency, and energy
Sample rate, bit depth, and channel count
Amplitude, phase, and file size