Lesson 07

Mel spectrograms and audio features

The mel scale is a pragmatic bridge between raw frequency bins and how humans perceive pitch. It is why many speech models start with log-mel features instead of raw FFT bins.

The one idea

A mel spectrogram groups frequency energy into perceptual bands, then usually log-compresses it. The result is a compact, speech-friendly input representation.

Why not use raw FFT bins?

An FFT gives evenly spaced frequency bins. Human hearing is not evenly spaced. We are more sensitive to small differences at low frequencies than at high frequencies. Speech also puts a lot of useful structure in low and mid frequencies.
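To make the uneven spacing concrete, here is a minimal sketch that converts hertz to mel. It assumes the HTK-style formula, which is one of several mel variants in use; the function names are illustrative and not taken from any particular library.

```python
import math

def hz_to_mel(hz: float) -> float:
    """HTK-style mel formula (an assumption; other variants exist)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The same 100 Hz step covers far more of the mel axis at low frequencies.
low_step = hz_to_mel(300.0) - hz_to_mel(200.0)
high_step = hz_to_mel(7100.0) - hz_to_mel(7000.0)
print(f"200 -> 300 Hz spans {low_step:.1f} mel")
print(f"7000 -> 7100 Hz spans {high_step:.1f} mel")
```

Under this formula the low-frequency step is several times larger in mel units than the high-frequency step, which is the perceptual asymmetry the paragraph describes.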

Mel filters compress the high end and keep more detail where perception needs it. Instead of giving the model every FFT bin, you pass energy through a bank of overlapping mel filters. The output is a smaller set of mel bands.

Figure: linear FFT bins (201 at 16 kHz with n_fft=400, equal Hz spacing) next to a mel filterbank (80 filters, Whisper, wider triangles above ~1 kHz). Each mel band sums weighted FFT energy, then log-compresses.
Mel filters keep more detail at lower frequencies and summarize wider ranges at higher frequencies. Whisper uses 80 such bands.
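A bank of overlapping triangular filters can be sketched in a few lines of NumPy. This is a simplified illustration that assumes the HTK-style mel formula and unnormalized triangles. Libraries such as librosa default to a different mel variant and apply area normalization, so this matrix will not match theirs numerically.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=np.float64) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=np.float64) / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=400, n_mels=80):
    """Return an (n_mels, n_fft // 2 + 1) matrix of triangular filters."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels triangles need n_mels + 2 edge points, equally spaced in mel.
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        left, center, right = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        # Each filter is the positive part of the lower of the two ramps.
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

bank = mel_filterbank()
print(bank.shape)  # (80, 201)
print("FFT bins covered by first / last filter:",
      int((bank[0] > 0).sum()), int((bank[-1] > 0).sum()))
```

Multiplying this matrix by a 201-bin power spectrum yields 80 mel-band energies per frame. The last filter covers many more FFT bins than the first, which is the "wider triangles at higher frequencies" behavior in the figure.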

The Whisper log-mel contract

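Whisper is a de facto reference for modern open-source ASR. Its frontend is not "generic mel" with arbitrary knobs: the paper and reference implementation fix the parameters below. Match them exactly when reproducing Whisper features or fine-tuning from a Whisper checkpoint. The values shown apply to the original checkpoints; large-v3 uses 128 mel bands instead of 80, so check the model card for the checkpoint you use.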

Parameter          | Generic speech default | Whisper
Sample rate        | 16 kHz (common)        | 16 kHz
FFT / window       | 20–32 ms typical       | 400 samples (25 ms)
Hop / frame shift  | 10–12.5 ms typical     | 160 samples (10 ms)
Mel bands          | 40–128                 | 80 bins
Frequency scale    | linear → mel           | mel, then log10 (clamped)
Input tensor       | varies                 | (80, 3000), max frames ≈ 30 s
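The numbers in the Whisper column are related by simple arithmetic, which is worth checking once by hand. The sketch below uses only the values from the table; the constant names are illustrative.

```python
SAMPLE_RATE = 16_000   # Hz
N_FFT = 400            # samples per analysis window
HOP_LENGTH = 160       # samples between successive frames
N_MELS = 80
CHUNK_SECONDS = 30

window_ms = 1000 * N_FFT / SAMPLE_RATE                 # window duration
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE               # frame shift
n_freq_bins = N_FFT // 2 + 1                           # one-sided FFT bins
n_frames = CHUNK_SECONDS * SAMPLE_RATE // HOP_LENGTH   # frames per chunk

print(window_ms, hop_ms)    # 25.0 10.0
print(n_freq_bins)          # 201
print((N_MELS, n_frames))   # (80, 3000)
```

This is where the 201 linear bins and the (80, 3000) input shape come from: a 400-sample window gives 201 one-sided FFT bins, and 30 seconds at a 10 ms hop gives 3000 frames.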

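In librosa terms, the closest one-liner is librosa.feature.melspectrogram(y=y, sr=16000, n_fft=400, hop_length=160, n_mels=80), followed by log scaling per the Whisper source. Recent librosa versions require y to be passed as a keyword argument. Even small deviations in n_fft or hop_length change the features the model sees and can break a fine-tuned checkpoint.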
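The "log10 (clamped)" step can be sketched as follows. The constants (a floor of 1e-10, an 8.0 dynamic-range clamp, and the final shift and scale by 4.0) reflect my understanding of the reference implementation and should be treated as assumptions; verify them against the Whisper source before relying on them. The input here is random stand-in data, not real audio.

```python
import numpy as np

def whisper_style_log_mel(mel_power: np.ndarray) -> np.ndarray:
    """Log-compress a mel power spectrogram of shape (n_mels, n_frames).

    Constants are assumptions based on the Whisper reference code.
    """
    # Floor the energy so log10 never sees zero.
    log_spec = np.log10(np.maximum(mel_power, 1e-10))
    # Limit dynamic range to 8 log10 units below the loudest value.
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    # Shift and scale into a small, roughly centered range.
    return (log_spec + 4.0) / 4.0

rng = np.random.default_rng(0)
mel_power = rng.random((80, 3000)) ** 4   # stand-in for real mel energies
features = whisper_style_log_mel(mel_power)
print(features.shape, float(features.min()), float(features.max()))
```

Because of the range clamp, the output never spans more than 2.0 units, regardless of how quiet the quietest frame is.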

Reference

See the Whisper paper and OpenAI Whisper GitHub for the authoritative mel implementation.

Log-mel features

After mel filtering, systems commonly take the logarithm of the energy. This makes huge energy differences easier to model and roughly matches how loudness feels to humans. A log-mel spectrogram is therefore a time-by-mel-band matrix.

Many modern ASR systems use this representation directly. The exact details vary: sample rate, window length, hop length, number of mel bins, normalization, padding, and whether the model expects fixed-length chunks.
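The whole pipeline, from waveform to a time-by-mel-band matrix, fits in a short NumPy sketch. It assumes an HTK-style mel formula, unnormalized triangular filters, a Hann window, and no padding, so it illustrates the steps rather than reproducing any specific model's frontend. Every parameter that "varies" appears as an explicit argument.

```python
import numpy as np

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters, shape (n_mels, n_fft // 2 + 1). HTK-style mel (assumption)."""
    to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    to_hz = lambda mel: 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = to_hz(np.linspace(to_mel(0.0), to_mel(sr / 2.0), n_mels + 2))
    left, center, right = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (fft_freqs - left) / (center - left)
    falling = (right - fft_freqs) / (right - center)
    return np.clip(np.minimum(rising, falling), 0.0, None)

def log_mel(waveform, sr=16000, n_fft=400, hop_length=160, n_mels=80):
    """Return a (n_frames, n_mels) log-mel matrix for a mono float waveform."""
    if len(waveform) < n_fft:
        raise ValueError("waveform is shorter than one analysis window")
    # 1. Slice the waveform into overlapping frames (no padding).
    n_frames = 1 + (len(waveform) - n_fft) // hop_length
    idx = np.arange(n_fft)[None, :] + hop_length * np.arange(n_frames)[:, None]
    frames = waveform[idx] * np.hanning(n_fft)
    # 2. Windowed FFT -> power spectrum, shape (n_frames, n_fft // 2 + 1).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. Mel filters -> band energies, shape (n_frames, n_mels).
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    # 4. Log compression with a small floor to avoid log(0).
    return np.log10(np.maximum(mel, 1e-10))

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)  # 1 s test tone
features = log_mel(tone)
print(features.shape)  # (98, 80)
```

One second of audio yields 98 frames here because no padding is applied. Frontends that pad or center their frames produce a different count, which is one of the details that varies between systems.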

Figure: common speech feature pipeline. Waveform → STFT (windowed FFT) → mel filters (perceptual bands) → log scale (compress energy) → time x mel-band matrix.
Log-mel features are not raw audio. They are a carefully chosen representation with model-specific parameters.

Model contract

Feature extraction parameters are part of the trained model. Changing mel bins, hop length, normalization, or sample rate at inference can hurt accuracy even if the audio sounds fine to you.
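One way to enforce this contract is to store the frontend settings next to the checkpoint and compare them at load time. The sketch below is a hypothetical helper: FrontendConfig and config_mismatches are illustrative names, not part of any library.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class FrontendConfig:
    """Hypothetical record of the feature settings a model was trained with."""
    sample_rate: int
    n_fft: int
    hop_length: int
    n_mels: int
    log_base: str

def config_mismatches(trained: FrontendConfig, serving: FrontendConfig) -> dict:
    """Return {field: (trained_value, serving_value)} for every differing field."""
    t, s = asdict(trained), asdict(serving)
    return {key: (t[key], s[key]) for key in t if t[key] != s[key]}

trained = FrontendConfig(16000, 400, 160, 80, "log10")
serving = FrontendConfig(16000, 512, 160, 80, "log10")
print(config_mismatches(trained, serving))  # {'n_fft': (400, 512)}
```

Failing fast on a non-empty result is usually cheaper than chasing an accuracy regression that only shows up in transcripts.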

MFCCs compress one step further

MFCCs, or mel-frequency cepstral coefficients, take log-mel features, apply a discrete cosine transform (DCT) across the mel bands of each frame, and keep only the lowest few coefficients (13 is a common choice). The result is a handful of coefficients per frame, which suited GMM-HMM recognizers and classical pipelines where compute and bandwidth were tight.
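The DCT step can be sketched directly in NumPy. This version assumes an orthonormal DCT-II and keeps 13 coefficients, both common conventions rather than fixed rules. The log-mel input is random stand-in data with shape (frames, mel bands).

```python
import numpy as np

def dct_ii_matrix(n_out: int, n_in: int) -> np.ndarray:
    """First n_out rows of the orthonormal DCT-II basis, shape (n_out, n_in)."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))
    basis *= np.sqrt(2.0 / n_in)
    basis[0] /= np.sqrt(2.0)  # the constant row gets the smaller scale
    return basis

def mfcc_from_log_mel(log_mel: np.ndarray, n_mfcc: int = 13) -> np.ndarray:
    """Map (n_frames, n_mels) log-mel features to (n_frames, n_mfcc) MFCCs."""
    return log_mel @ dct_ii_matrix(n_mfcc, log_mel.shape[1]).T

rng = np.random.default_rng(0)
log_mel = rng.normal(size=(98, 80))   # stand-in for real log-mel frames
mfcc = mfcc_from_log_mel(log_mel)
print(mfcc.shape)  # (98, 13)
```

Going from 80 values to 13 per frame discards fine spectral detail on purpose, which is why the representation suits small classical models better than large networks.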

MFCCs are still useful in lightweight systems, keyword spotting on microcontrollers, and some classical baselines. Modern transformer ASR models such as Whisper and Conformer usually take log-mel spectrograms directly so the network learns its own frequency summaries. wav2vec 2.0 goes a step further and consumes the raw waveform. Using MFCCs with a Whisper checkpoint is a category error: the model never saw them during training.

Rule of thumb: if the model card says log-mel or references Whisper features, use log-mel. If you are hand-building a tiny on-device classifier with scikit-learn, MFCCs may still be the right tradeoff.

Raw waveform models still have front ends

Some models learn directly from waveform samples. That does not remove the need to understand features. It means the model learns part of the feature extraction internally, often with convolutional layers or learned filterbanks.
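A stack of strided convolutions still has an effective window and hop, just like an STFT. The sketch below computes both for a list of (kernel size, stride) layers. The layer list is an illustrative assumption that resembles the encoder shapes reported for wav2vec 2.0; check the configuration of the model you actually use.

```python
def conv_stack_geometry(layers):
    """Return (receptive_field, total_stride) in samples for 1-D conv layers.

    layers is a list of (kernel_size, stride) tuples, applied in order.
    """
    receptive_field, total_stride = 1, 1
    for kernel_size, stride in layers:
        # Each layer widens the field by (kernel - 1) steps of the current stride.
        receptive_field += (kernel_size - 1) * total_stride
        total_stride *= stride
    return receptive_field, total_stride

# Assumed layer shapes for illustration only.
layers = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
receptive_field, total_stride = conv_stack_geometry(layers)

sample_rate = 16000
print(f"window: {1000 * receptive_field / sample_rate:.1f} ms")  # window: 25.0 ms
print(f"hop: {1000 * total_stride / sample_rate:.1f} ms")        # hop: 20.0 ms
```

With these assumed layers, each output vector sees 25 ms of audio and advances by 20 ms, so the learned front end lands close to the hand-designed window and hop sizes used for log-mel features.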

The practical pipeline questions remain: what sample rate, what channels, how much context, what normalization, and how robust the representation is to noise, compression, and silence.

Where this leads next

Now you can read the rest of the audio track with a useful map. VAD asks whether a time window contains speech. ASR maps audio features to text. TTS creates acoustic features from text and then turns them into waveform audio. Realtime voice agents wire all of this into a latency-sensitive loop.

The same core contract keeps appearing: the model is only as good as the signal and representation it receives. When you debug a voice product, walk the chain in order: codec and transport, sample rate and channels, float scaling, STFT parameters, mel bins, then model weights.
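The early links in that chain (sample rate, channels, float scaling) can be checked in code before any features are computed. The helper below is a hypothetical sketch: it assumes signed integer PCM or float input, a (n_samples, n_channels) layout for multichannel audio, and a 16 kHz target rate.

```python
import numpy as np

def prepare_waveform(samples, sample_rate, expected_rate=16000):
    """Validate and normalize audio to mono float32 in [-1, 1]."""
    if sample_rate != expected_rate:
        raise ValueError(
            f"expected {expected_rate} Hz, got {sample_rate} Hz; resample first"
        )
    audio = np.asarray(samples)
    if np.issubdtype(audio.dtype, np.signedinteger):
        # Signed PCM (e.g. int16) -> float by dividing by the type's magnitude.
        scale = float(-np.iinfo(audio.dtype).min)
        audio = audio.astype(np.float32) / scale
    else:
        audio = audio.astype(np.float32)
    if audio.ndim == 2:
        # Assumed layout: (n_samples, n_channels). Average channels to mono.
        audio = audio.mean(axis=1)
    if audio.size and np.abs(audio).max() > 1.0:
        raise ValueError("samples exceed [-1, 1]; check float scaling")
    return audio

stereo_pcm = np.array([[0, 0], [32767, -32768], [-32768, -32768]], dtype=np.int16)
mono = prepare_waveform(stereo_pcm, sample_rate=16000)
print(mono.shape, mono.dtype)  # (3,) float32
```

A check like this catches the common silent failures, such as int16 samples passed as if they were floats or 8 kHz telephony audio fed to a 16 kHz model, before they reach the STFT.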

Checkpoint

You have finished the course if you can answer these from memory:

  • Why does the mel scale group frequencies unevenly?
  • What is a log-mel spectrogram?
  • How are MFCCs different from log-mel features?
  • Why are feature extraction parameters part of the model contract?

Quick check

  • They compress frequency energy into speech-friendly bands and log-scale the range
  • They preserve every original sample exactly
  • They automatically remove all background noise