Lesson 03

Bit depth, dynamic range, and noise

Sample rate controls time resolution. Bit depth controls value resolution: how finely each sample can store loudness.

The one idea

Bit depth decides how many possible amplitude levels each sample can use. More levels mean more dynamic range and less quantization noise, but they do not rescue clipped or badly recorded speech.

Each sample needs a number

After sampling chooses moments in time, bit depth chooses how precise each measurement can be. Sixteen-bit PCM can store 65,536 possible values per sample. Twenty-four-bit PCM can store 16,777,216. Floating-point audio stores values differently, but the practical idea is the same: the signal must fit inside a numeric range.
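
The level counts above follow directly from the number of bits. A minimal sketch, assuming signed two's-complement PCM (the function names are illustrative, not from any library):

```python
def pcm_levels(bits: int) -> int:
    """Number of distinct values one sample of this bit depth can hold."""
    return 2 ** bits


def signed_range(bits: int) -> tuple[int, int]:
    """Smallest and largest value for signed two's-complement PCM."""
    half = 2 ** (bits - 1)
    return -half, half - 1


for bits in (16, 24):
    low, high = signed_range(bits)
    print(f"{bits}-bit: {pcm_levels(bits):,} levels, range {low}..{high}")
```

Note that the range is lopsided: there is one more negative value than positive, which matters later when scaling to float.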

If the range is too coarse, quiet details get rounded away and the rounding error becomes quantization noise. If the signal is too loud for the range, the top and bottom of the waveform are cut flat. That is clipping, and it is much more damaging than simply recording a little too quietly.

Figure: two panels. Coarse amplitude levels: more rounding error. Finer amplitude levels: closer to the original amplitude.
Bit depth is vertical precision. The sample times are the same, but the stored amplitude has more or fewer possible levels.
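
To see the rounding error shrink as levels get finer, the sketch below quantizes the same sine wave at several bit depths and measures the RMS error. It is an illustration under simple assumptions: a half-scale test tone, plain rounding with no dither, and hypothetical helper names.

```python
import math


def quantize(x: float, bits: int) -> float:
    """Round x to the nearest level of a signed grid with the given bit depth."""
    scale = 2 ** (bits - 1)
    q = round(x * scale)
    q = max(-scale, min(scale - 1, q))  # stay inside the representable range
    return q / scale


def rms_error(bits: int, n: int = 1000) -> float:
    """RMS rounding error for five cycles of a half-scale sine."""
    total = 0.0
    for i in range(n):
        x = 0.5 * math.sin(2 * math.pi * 5 * i / n)
        total += (quantize(x, bits) - x) ** 2
    return math.sqrt(total / n)


for bits in (4, 8, 16):
    print(f"{bits}-bit RMS error: {rms_error(bits):.6f}")
```

The sample times are identical in every run. Only the vertical grid changes, and the error falls with each added bit.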

Dynamic range and headroom

Dynamic range is the span between the quietest useful signal and the loudest signal the system can store. In speech systems, you want enough range for whispers, normal speech, emphasis, and unexpected spikes without crushing everything into the same loudness.

Headroom is the space you leave below the maximum so sudden peaks do not clip. A recording that peaks around -12 dBFS is often easier to work with than one slammed against 0 dBFS. You can amplify a clean quiet signal later. You cannot reconstruct the rounded top of a clipped consonant.
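
dBFS is a logarithmic scale where 0 is the ceiling and everything else is negative. A small sketch of the conversion, assuming samples are already floats with 1.0 as full scale (function names are illustrative):

```python
import math


def to_dbfs(amplitude: float) -> float:
    """Convert a linear amplitude (1.0 = full scale) to dBFS."""
    if amplitude <= 0.0:
        return float("-inf")
    return 20.0 * math.log10(amplitude)


def from_dbfs(dbfs: float) -> float:
    """Convert dBFS back to a linear amplitude."""
    return 10.0 ** (dbfs / 20.0)


def headroom_db(samples) -> float:
    """Distance in dB between the loudest sample and the 0 dBFS ceiling."""
    peak = max(abs(s) for s in samples)
    return -to_dbfs(peak)


print(round(from_dbfs(-12.0), 3))  # 0.251
```

So a recording that peaks around -12 dBFS uses only about a quarter of the available amplitude range, which is exactly the spare room that absorbs a sudden loud peak.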

Figure: two panels under a 0 dBFS ceiling. Clean headroom: peaks stay inside the range. Clipped peaks: the ceiling is hit and the shape is permanently flattened.
Headroom leaves room for peaks. Clipping changes the waveform shape, which is why it is hard to repair later.

Clipping smell

If an ASR model misses words when users speak loudly, inspect the waveform. Flat tops, harsh consonants, and high-confidence nonsense often point to clipping or aggressive noise suppression before the model.
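
Flat tops can be spotted without opening a waveform viewer. The sketch below looks for runs of consecutive samples stuck at the ceiling. The threshold of 0.999 and the minimum run of 3 samples are assumptions chosen for illustration, not established cutoffs, and the check does not distinguish positive from negative runs.

```python
import math


def longest_flat_run(samples, threshold=0.999):
    """Length of the longest run of consecutive samples at or beyond the threshold."""
    longest = current = 0
    for s in samples:
        if abs(s) >= threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def looks_clipped(samples, threshold=0.999, min_run=3):
    """Heuristic: several ceiling samples in a row suggest a flattened peak."""
    return longest_flat_run(samples, threshold) >= min_run


# One cycle of a sine at a safe level, and the same shape overdriven and cut at ±1.0.
clean = [0.5 * math.sin(2 * math.pi * i / 100) for i in range(100)]
overdriven = [max(-1.0, min(1.0, 2.0 * math.sin(2 * math.pi * i / 100))) for i in range(100)]

print(looks_clipped(clean))       # False
print(looks_clipped(overdriven))  # True
```

A single sample at full scale can be a legitimate peak. A run of them is a much stronger hint that the waveform was flattened.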

Noise floor matters more than bit depth marketing

In theory, higher bit depth gives more dynamic range. In practice, the microphone, room, preamp, browser processing, and compression often dominate. A noisy laptop mic in a cafe will not become studio audio because you store it as 24-bit WAV.
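
The "in theory" part has a standard formula: an ideal N-bit quantizer driven by a full-scale sine has a signal-to-quantization-noise ratio of roughly 6.02 N + 1.76 dB. The sketch below compares that with a capture chain whose analog noise floor is much higher. The -55 dBFS figure is an assumed value for a noisy microphone, not a measurement, and `usable_range_db` is a hypothetical helper.

```python
import math


def ideal_sqnr_db(bits: int) -> float:
    """Theoretical SQNR of an ideal quantizer for a full-scale sine (about 6.02*bits + 1.76)."""
    return 20.0 * math.log10(2 ** bits) + 10.0 * math.log10(1.5)


def usable_range_db(bits: int, capture_noise_floor_dbfs: float) -> float:
    """Range that survives when the capture chain's own noise is the limit."""
    return min(ideal_sqnr_db(bits), -capture_noise_floor_dbfs)


for bits in (16, 24):
    print(
        f"{bits}-bit: ideal {ideal_sqnr_db(bits):.1f} dB, "
        f"with a -55 dBFS noise floor {usable_range_db(bits, -55.0):.1f} dB"
    )
```

Under that assumption both formats end up with the same usable range, because the microphone and room set the limit long before the storage format does.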

For voice AI, the useful question is not "what is the highest bit depth?" It is "is speech clearly above the noise floor without clipping?" A clean 16-bit, 16 kHz mono recording can beat a noisy 48 kHz stereo file with bad gain staging.
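
One rough way to ask "is speech clearly above the noise floor?" is to compare the RMS level of a stretch containing speech with a stretch containing only background. This is a sketch: it assumes something upstream (for example a voice activity detector) has already split the two segments, and because the speech segment still contains noise, the result is an approximation.

```python
import math


def rms(samples) -> float:
    """Root-mean-square level of a non-empty buffer."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech_segment, noise_segment) -> float:
    """Approximate signal-to-noise ratio in dB from two pre-split segments."""
    noise_level = rms(noise_segment)
    if noise_level == 0.0:
        return float("inf")
    speech_level = rms(speech_segment)
    if speech_level == 0.0:
        return float("-inf")
    return 20.0 * math.log10(speech_level / noise_level)


speech = [0.1 if i % 2 == 0 else -0.1 for i in range(100)]
noise = [0.01 if i % 2 == 0 else -0.01 for i in range(100)]
print(f"{snr_db(speech, noise):.1f} dB")
```

Neither sample rate nor bit depth appears anywhere in this calculation, which is the point of the paragraph above.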

The float32 pipeline most models see

Files on disk are often 16-bit or 24-bit PCM integers. After decode, most Python and JavaScript audio stacks convert to float32 samples in [-1, 1]. A value of 0.0 is silence. Values near ±1.0 are loud. Values beyond ±1.0 usually mean gain was applied incorrectly, and they will clip as soon as the audio is written back to integer PCM. Long runs pinned at exactly ±1.0 usually mean clipping already happened.

The conversion is simple in concept: divide integer samples by the full-scale magnitude (32768, or 2^15, for 16-bit signed PCM), so that -32768 maps to exactly -1.0 and 32767 lands just under 1.0. In practice, keep the order straight: decode to PCM, optionally dither on down-conversion, scale to float, then resample. Resampling integer audio before scaling can introduce subtle rounding bias.
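
The scaling step can be sketched with the standard library alone. This assumes raw little-endian 16-bit signed PCM with the container header already stripped, and the function name is illustrative. Python floats are 64-bit, so a real pipeline would cast to float32 afterwards.

```python
import struct

INT16_FULL_SCALE = 32768.0  # 2**15, the magnitude of the most negative int16


def int16_le_to_float(pcm: bytes) -> list[float]:
    """Decode little-endian int16 PCM bytes into floats in [-1.0, 1.0)."""
    count = len(pcm) // 2  # two bytes per sample; ignore a trailing odd byte
    ints = struct.unpack(f"<{count}h", pcm[: count * 2])
    return [s / INT16_FULL_SCALE for s in ints]


raw = struct.pack("<3h", -32768, 0, 32767)
print(int16_le_to_float(raw))
```

With NumPy available, the same step is commonly written as `np.frombuffer(pcm, dtype="<i2").astype(np.float32) / 32768.0`. Either way, resampling happens after this point, on the float values.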

Figure: int16 PCM (-32768..32767) → ÷ 32768 (scale to float) → float32 in [-1.0, 1.0] → resample (e.g. 16 kHz) → model. Whisper, librosa, torchaudio, and Web Audio all assume this normalized float path internally.
Bit depth describes storage on disk. Model front ends usually work in float32 regardless of how the file was recorded.

Engineering reality

Log peak amplitude and clipping rate on float32 buffers after decode. A file that peaks at 0.3 float may be fine; one that hard-limits at 1.0 on every utterance is a capture problem, not a model problem.
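
A minimal sketch of that logging, assuming buffers arrive as sequences of floats in [-1, 1]. The logger name, the `utterance_id` parameter, and the 0.999 threshold are all illustrative choices.

```python
import logging

logger = logging.getLogger("audio_metrics")  # hypothetical logger name


def level_metrics(samples, clip_threshold: float = 0.999) -> dict:
    """Peak amplitude and fraction of samples at or beyond the clip threshold."""
    n = len(samples)
    if n == 0:
        return {"peak": 0.0, "clipping_rate": 0.0}
    peak = max(abs(s) for s in samples)
    clipped = sum(1 for s in samples if abs(s) >= clip_threshold)
    return {"peak": peak, "clipping_rate": clipped / n}


def log_level_metrics(utterance_id: str, samples) -> dict:
    """Compute the metrics and emit one log line per utterance."""
    metrics = level_metrics(samples)
    logger.info(
        "utt=%s peak=%.3f clipping_rate=%.4f",
        utterance_id,
        metrics["peak"],
        metrics["clipping_rate"],
    )
    return metrics
```

Tracked over time, a rising clipping rate points at the capture path (gain settings, a new device, a browser update) rather than at the model.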

Normalize carefully

Normalization changes gain so the audio reaches a target level. It helps when recordings arrive at wildly different volumes, but it can also lift background noise or hide clipping. Loudness normalization is usually better for user-facing playback. Peak normalization is simple, but one cough can set the level for the whole file.
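
The cough problem is easy to reproduce. The sketch below applies peak normalization to quiet speech, then to the same speech with a single loud spike appended. The 0.9 target and the synthetic signals are assumptions for illustration.

```python
import math


def peak_normalize(samples, target_peak: float = 0.9) -> list[float]:
    """Scale the whole buffer so its loudest sample reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


# Quiet "speech": a sine that peaks at 0.05.
speech = [0.05 * math.sin(2 * math.pi * i / 40) for i in range(1000)]
# The same speech followed by one loud spike standing in for a cough.
with_cough = speech + [0.9]

boosted = peak_normalize(speech)
stuck = peak_normalize(with_cough)

print(f"speech peak after normalizing alone: {max(abs(s) for s in boosted):.2f}")
print(f"speech peak when a cough is present: {max(abs(s) for s in stuck[:1000]):.2f}")
```

In the second case the spike already sits at the target, so the gain is 1.0 and the speech stays as quiet as it arrived.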

For model inputs, choose a predictable policy: reject clipped audio when you can, normalize after decoding, and keep metrics for signal level and clipping rate. Those checks catch many issues before they become mysterious model regressions.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

  • What does bit depth control?
  • Why is clipping usually worse than low volume?
  • What is the noise floor?
  • Why does 24-bit storage not automatically mean better ASR input?

Quick check

  • The speaker was recorded a little quiet
  • The waveform clipped at the top and bottom
  • The file had unused headroom