Lesson 05

PCM, WAV, and audio codecs

Every audio file raises two separate questions: how are the samples encoded, and what container holds them?

The one idea

WAV is a container that usually holds PCM. PCM is a raw, uncompressed sample encoding. Codecs like MP3, AAC, Opus, G.711, and AMR compress audio by deciding what information to keep, approximate, or discard.

Containers are not codecs

A container is the box a file uses to store streams and metadata. WAV, MP4, WebM, and Ogg are containers. A codec is the method used to encode the audio inside the box. PCM, Opus, AAC, MP3, FLAC, and G.711 are codecs or codec families.

People often say "send a WAV" when they really mean "send uncompressed PCM in a WAV container." That distinction matters because a container name alone does not guarantee the model receives the format you expect.

Concrete examples: a .wav file usually holds PCM but can also hold compressed audio such as G.711 or ADPCM. An .mp4 from a screen recorder usually contains AAC, and a .webm usually contains Opus or Vorbis. A .ogg is a container that frequently wraps Vorbis or Opus. Always inspect with ffprobe or your decoder's metadata API before assuming a sample rate or codec.

A container is packaging. The codec is the way the audio samples are represented inside that package.
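As a rough illustration of why the extension alone is not enough, the sketch below reads the format tag from a WAV header using only the Python standard library. The helper names (wav_format, build_wav) are hypothetical and exist only for this example, the tag table covers just four common values, and a real pipeline would more likely rely on ffprobe or a decoder library.

```python
import struct

# A few common WAVE format tags; many more exist.
FORMAT_TAGS = {1: "PCM", 3: "IEEE float", 6: "G.711 A-law", 7: "G.711 mu-law"}


def wav_format(data: bytes) -> dict:
    """Return codec, channels, and sample rate from a RIFF/WAVE 'fmt ' chunk."""
    if data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        if chunk_id == b"fmt ":
            tag, channels, rate, _byte_rate, _align, bits = struct.unpack_from(
                "<HHIIHH", data, pos + 8
            )
            return {
                "codec": FORMAT_TAGS.get(tag, f"unknown (tag {tag})"),
                "channels": channels,
                "sample_rate": rate,
                "bits_per_sample": bits,
            }
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
    raise ValueError("no fmt chunk found")


def build_wav(tag: int, channels: int, rate: int, bits: int, payload: bytes) -> bytes:
    """Hypothetical helper: build a minimal WAV byte string for the demo."""
    block_align = channels * bits // 8
    fmt = struct.pack("<HHIIHH", tag, channels, rate, rate * block_align, block_align, bits)
    body = (
        b"WAVE"
        + b"fmt " + struct.pack("<I", len(fmt)) + fmt
        + b"data" + struct.pack("<I", len(payload)) + payload
    )
    return b"RIFF" + struct.pack("<I", len(body)) + body


# Two files that would both be named "something.wav".
pcm_file = build_wav(1, 1, 16000, 16, b"\x00\x00" * 4)
mulaw_file = build_wav(7, 1, 8000, 8, b"\xff" * 8)
print(wav_format(pcm_file)["codec"])    # PCM
print(wav_format(mulaw_file)["codec"])  # G.711 mu-law
```

Both byte strings are valid WAV containers, but only the first holds PCM that a 16 kHz speech model could consume without further decoding.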

PCM is the simple baseline

PCM stores samples directly. For each channel, at each sample time, it stores a numeric amplitude. There is no perceptual trick, no learned compression, and no reconstruction of discarded high frequencies. That simplicity makes PCM useful as the normalized internal format for speech pipelines.
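To make "stores samples directly" concrete, here is a minimal sketch that turns a sine wave into 16-bit little-endian PCM bytes and reads them back. The function name sine_to_pcm16 and the 440 Hz test tone are assumptions chosen for illustration.

```python
import math
import struct


def sine_to_pcm16(freq_hz: float, sample_rate: int, seconds: float) -> bytes:
    """Encode a half-amplitude sine wave as signed 16-bit little-endian PCM."""
    n = round(sample_rate * seconds)
    samples = [
        int(round(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(n)
    ]
    # Each sample becomes exactly two bytes; nothing else is stored.
    return struct.pack(f"<{n}h", *samples)


pcm = sine_to_pcm16(440.0, 16000, 0.01)
decoded = struct.unpack(f"<{len(pcm) // 2}h", pcm)
print(len(pcm))     # 320 bytes for 160 samples
print(decoded[:4])  # the first few amplitudes, recovered exactly
```

Decoding is just reinterpreting the bytes as integers, which is why PCM needs no codec state and is easy to slice, concatenate, and feed to feature extraction.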

The tradeoff is size. One minute of 16 kHz, 16-bit mono PCM is about 1.9 MB. One minute of 48 kHz, 16-bit stereo is about 11.5 MB, six times as much. Compressed codecs exist because storage and network bandwidth matter.
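The size arithmetic is simple enough to check by hand: bytes equal sample rate times bytes per sample times channels times seconds. The small helper below (a hypothetical name, assuming integer bit depths that are multiples of 8) works through the two cases.

```python
def pcm_size_bytes(sample_rate: int, bits_per_sample: int, channels: int, seconds: int) -> int:
    """Size of uncompressed PCM audio, ignoring any container header."""
    return sample_rate * (bits_per_sample // 8) * channels * seconds


mono_16k = pcm_size_bytes(16000, 16, 1, 60)
stereo_48k = pcm_size_bytes(48000, 16, 2, 60)
print(mono_16k)                # 1920000 bytes, about 1.9 MB
print(stereo_48k)              # 11520000 bytes, about 11.5 MB
print(stereo_48k // mono_16k)  # 6
```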

Lossy codecs change the signal

Lossy codecs reduce size by removing or approximating information. MP3 and AAC were designed for general listening. Opus is strong for speech and realtime communication. Telephony codecs like G.711, G.729, AMR, and EVS make different quality and bandwidth tradeoffs.

Compression artifacts can matter for AI even when humans understand the audio. A model may be sensitive to metallic noise, bandwidth limits, dropped frames, or packet loss concealment. ASR can still work, but eval results should be measured on the same codec path users actually take.
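One way to see that a lossy codec changes the signal is to round-trip samples through a simplified mu-law compander, the idea behind one variant of G.711. The sketch below uses the continuous mu-law formula with uniform 8-bit quantization. It is an assumption-laden approximation for illustration, not the bit-exact G.711 segment encoding, and the function names are hypothetical.

```python
import math

MU = 255.0  # companding constant used by mu-law


def mulaw_encode(x: float) -> int:
    """Compress a sample in [-1, 1] to an 8-bit code in 0..255."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1.0) / 2.0 * 255))


def mulaw_decode(code: int) -> float:
    """Expand an 8-bit code back to an approximate sample in [-1, 1]."""
    y = code / 255 * 2.0 - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


# Round-trip a 440 Hz tone sampled at 8 kHz and measure the damage.
original = [0.8 * math.sin(2 * math.pi * 440 * i / 8000) for i in range(400)]
restored = [mulaw_decode(mulaw_encode(s)) for s in original]
max_error = max(abs(a - b) for a, b in zip(original, restored))
print(f"max round-trip error: {max_error:.5f}")
```

The error is small but never zero, and it grows with amplitude because the quantization steps are wider for loud samples. A human listener barely notices; a model trained only on clean PCM may or may not, which is why it has to be measured.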

Opus and WebRTC in realtime voice

Browser and mobile voice apps rarely ship raw PCM over the network. WebRTC calls typically encode with Opus, a lossy codec that handles both speech and music and holds up well at low bitrates. Opus can change its audio bandwidth, from narrowband through wideband to fullband, as the encoder adapts to network conditions. That means your VAD and ASR backends often receive decoded Opus audio, not the original microphone PCM.

Opus is not "bad for AI," but it is a different distribution from studio WAV. Packet loss, jitter buffers, and adaptive bitrate change the noise floor and the transient shapes that VAD thresholds see. When you debug voice agents, log the codec, bitrate, and packet loss alongside model scores.
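A per-turn log record along these lines might look like the sketch below. Every field name here is an assumption made for illustration. Real values would come from your transport's statistics and your model outputs, and nothing in this example talks to an actual WebRTC stack.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class VoiceTurnLog:
    """Hypothetical record tying transport conditions to model scores."""
    session_id: str
    codec: str
    bitrate_bps: int
    packets_expected: int
    packets_lost: int
    vad_speech_prob: float
    asr_confidence: float

    @property
    def packet_loss_rate(self) -> float:
        if self.packets_expected == 0:
            return 0.0
        return self.packets_lost / self.packets_expected

    def to_json(self) -> str:
        record = asdict(self)
        record["packet_loss_rate"] = round(self.packet_loss_rate, 4)
        return json.dumps(record, sort_keys=True)


turn = VoiceTurnLog(
    session_id="demo-session",  # placeholder identifier
    codec="opus",
    bitrate_bps=24000,
    packets_expected=500,
    packets_lost=10,
    vad_speech_prob=0.91,
    asr_confidence=0.78,
)
print(turn.to_json())
```

With codec and loss in the same record as the scores, a drop in ASR confidence can be grouped by network condition instead of being blamed on the model by default.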

The WebRTC and neural VAD lesson picks up here: how realtime transports, Opus decode, and frame-based VAD interact in production.

Engineering reality

Do not benchmark on clean WAV files if production traffic arrives over a compressed realtime channel. The codec is part of the product distribution.

Decode first, then normalize

For model inference, the usual flow is: accept a file or stream, decode it to samples, convert sample rate and channels, then extract features or run the model. The model should not have to understand every file format users upload.
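The flow above can be sketched with the standard library alone, assuming the input is 16-bit PCM in a WAV container. The function names are hypothetical, and the linear-interpolation resampler is a deliberately crude stand-in: it applies no anti-aliasing filter, so production code should use a proper band-limited resampler from an audio library instead.

```python
import io
import struct
import wave


def decode_wav_pcm16(data: bytes):
    """Decode 16-bit PCM WAV bytes into per-frame float lists plus the rate."""
    with wave.open(io.BytesIO(data), "rb") as w:
        if w.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels, rate = w.getnchannels(), w.getframerate()
        raw = w.readframes(w.getnframes())
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)
    frames = [
        [v / 32768.0 for v in ints[i:i + channels]]
        for i in range(0, len(ints), channels)
    ]
    return frames, rate


def to_mono(frames):
    """Downmix by averaging the channels of each frame."""
    return [sum(f) / len(f) for f in frames]


def resample_linear(samples, src_rate: int, dst_rate: int):
    """Naive linear-interpolation resampling (no low-pass filter)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


def normalize(data: bytes, target_rate: int = 16000):
    """Decode, downmix, and resample to a predictable model input."""
    frames, rate = decode_wav_pcm16(data)
    return resample_linear(to_mono(frames), rate, target_rate)


def make_demo_wav(rate: int = 48000, channels: int = 2, n_frames: int = 480) -> bytes:
    """Build a short constant-valued WAV in memory for the demo."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(struct.pack("<h", 1000) * (channels * n_frames))
    return buf.getvalue()


model_input = normalize(make_demo_wav())
print(len(model_input))  # 160 samples: 10 ms at 16 kHz
```

Everything after normalize sees one shape of input, whatever the user uploaded.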

Keep metadata around for debugging: original codec, container, bitrate, sample rate, channel count, duration, clipping rate, and decode errors. Those fields make bad-audio incidents observable instead of anecdotal.
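Some of these fields come from the container header, while others, such as duration and clipping rate, have to be computed from the decoded samples. A minimal sketch of the computed ones follows, assuming float samples in [-1, 1]. The 0.999 threshold and the function names are illustrative choices, not a standard.

```python
def clipping_rate(samples, threshold: float = 0.999) -> float:
    """Fraction of samples at or beyond the threshold in absolute value."""
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= threshold)
    return clipped / len(samples)


def debug_fields(samples, sample_rate: int, codec: str, container: str) -> dict:
    """Bundle computed and declared metadata into one loggable dict."""
    return {
        "codec": codec,
        "container": container,
        "sample_rate": sample_rate,
        "duration_s": len(samples) / sample_rate,
        "clipping_rate": clipping_rate(samples),
    }


samples = [0.0, 1.0, -1.0, 0.5] * 4000  # 16000 samples, half of them clipped
print(debug_fields(samples, 16000, "pcm_s16le", "wav"))
```

A clipping rate this high would point at a gain problem upstream of the model, which is hard to spot from transcripts alone.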

Decode and normalize at the boundary. Keep original metadata for debugging, but keep model input predictable.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

How is a container different from a codec?
Why is PCM a common internal format?
How can lossy compression affect ASR quality?
Why should eval audio match production codec paths?

Quick check

What does a .wav file extension tell you about the audio inside?

That the audio is always 16 kHz mono PCM
A container structure, not necessarily the exact model-ready encoding
That the audio is compressed with Opus