Mel spectrograms and audio features
The mel scale is a pragmatic bridge between raw frequency bins and how humans perceive pitch. It is why many speech models start with log-mel features instead of raw FFT bins.
A mel spectrogram groups frequency energy into perceptual bands, then usually log-compresses it. The result is a compact, speech-friendly input representation.
Why not use raw FFT bins?
An FFT gives evenly spaced frequency bins. Human hearing is not evenly spaced. We are more sensitive to small differences at low frequencies than at high frequencies. Speech also puts a lot of useful structure in low and mid frequencies.
Mel filters compress the high end and keep more detail where perception needs it. Instead of giving the model every FFT bin, you pass energy through a bank of overlapping mel filters. The output is a smaller set of mel bands.
The Whisper log-mel contract
Whisper is the de facto reference for modern open-source ASR. Its frontend is not "generic mel" with arbitrary knobs. The paper and reference implementation fix the parameters below. Match them exactly when reproducing Whisper features or fine-tuning on its checkpoint.
| Parameter | Generic speech default | Whisper |
|---|---|---|
| Sample rate | 16 kHz (common) | 16 kHz |
| FFT / window | 20–32 ms typical | 400 samples (25 ms) |
| Hop / frame shift | 10–12.5 ms typical | 160 samples (10 ms) |
| Mel bands | 40–128 | 80 bins |
| Frequency scale | linear → mel | mel, then log10 (clamped) |
| Input tensor | varies | (80, 3000) max frames ≈ 30 s |
In librosa terms, the closest one-liner is librosa.feature.melspectrogram(y, sr=16000, n_fft=400, hop_length=160, n_mels=80) followed by log scaling per the Whisper source. Small deviations in n_fft or hop break fine-tuned checkpoints.
See the Whisper paper and OpenAI Whisper GitHub for the authoritative mel implementation.
Log-mel features
After mel filtering, systems commonly take the logarithm of the energy. This makes huge energy differences easier to model and roughly matches how loudness feels to humans. A log-mel spectrogram is therefore a time-by-mel-band matrix.
Many modern ASR systems use this representation directly. The exact details vary: sample rate, window length, hop length, number of mel bins, normalization, padding, and whether the model expects fixed-length chunks.
Feature extraction parameters are part of the trained model. Changing mel bins, hop length, normalization, or sample rate at inference can hurt accuracy even if the audio sounds fine to you.
MFCCs compress one step further
MFCCs, or mel-frequency cepstral coefficients, take log-mel features and apply a discrete cosine transform (DCT). The result is a handful of coefficients per frame that were useful for GMM-HMM recognizers and classical pipelines where compute and bandwidth were tight.
MFCCs are still useful in lightweight systems, keyword spotting on microcontrollers, and some classical baselines. Modern transformer ASR (Whisper, Conformer, wav2vec 2.0 fine-tunes) usually feeds log-mel spectrograms directly so the network learns its own frequency summaries. Using MFCCs with a Whisper checkpoint is a category error: the model never saw them during training.
Rule of thumb: if the model card says log-mel or references Whisper features, use log-mel. If you are hand-building a tiny on-device classifier with scikit-learn, MFCCs may still be the right tradeoff.
Raw waveform models still have front ends
Some models learn directly from waveform samples. That does not remove the need to understand features. It means the model learns part of the feature extraction internally, often with convolutional layers or learned filterbanks.
The practical pipeline questions remain: what sample rate, what channels, how much context, what normalization, and how robust the representation is to noise, compression, and silence.
Where this leads next
Now you can read the rest of the audio track with a useful map. VAD asks whether a time window contains speech. ASR maps audio features to text. TTS creates acoustic features from text and then turns them into waveform audio. Realtime voice agents wire all of this into a latency-sensitive loop.
The same core contract keeps appearing: the model is only as good as the signal and representation it receives. When you debug a voice product, walk the chain in order: codec and transport, sample rate and channels, float scaling, STFT parameters, mel bins, then model weights.
Checkpoint
You have finished the course if you can answer these from memory:
- Why does the mel scale group frequencies unevenly?
- What is a log-mel spectrogram?
- How are MFCCs different from log-mel features?
- Why are feature extraction parameters part of the model contract?
Quick check
- They compress frequency energy into speech-friendly bands and log-scale the range
- They preserve every original sample exactly
- They automatically remove all background noise