Lesson 01

Sound as a wave

Before audio is a file, a tensor, or a spectrogram, it is pressure changing through time. This lesson gives you the physical picture that every later audio representation is a compressed view of.

The one idea

Audio is a time series. A microphone measures tiny pressure changes in the air and turns them into a changing electrical value. Digital audio stores that changing value as numbers.

Pressure over time

Sound happens when something moves air. A speaker cone pushes and pulls. A vocal cord vibrates. A guitar string shakes the body of the instrument, which shakes the air around it. Those movements create regions of slightly higher and lower pressure that travel outward.

A microphone does not store "words" or "notes." It measures pressure at its diaphragm. If pressure is above the resting level, the signal moves one way. If pressure is below it, the signal moves the other way. Plot that value over time and you get a waveform.

That is the first useful simplification: an audio clip is one long list of values ordered by time. Everything downstream, from voice activity detection to Whisper-style ASR, starts with that list or a transformed version of it.

In code you rarely see pressure in pascals. Libraries normalize each sample to a small numeric range, usually float32 between -1.0 and 1.0. That convention shows up again in Lesson 03 and in every model preprocessor.
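As a minimal sketch of that normalization, assume the source audio is 16-bit signed PCM (an assumption made for illustration; other formats use other ranges). Dividing each integer by 32768 puts it in the -1.0 to 1.0 range. The name `pcm16_to_float` is hypothetical, though many audio loaders do something similar when they hand you floats.

```python
def pcm16_to_float(samples):
    """Scale 16-bit signed integers (-32768..32767) to floats in [-1.0, 1.0)."""
    scale = 32768.0  # 2**15, the magnitude of the most negative 16-bit value
    return [s / scale for s in samples]


# A few raw integer samples: silence, half scale, and the two extremes.
raw = [0, 16384, -32768, 32767]
normalized = pcm16_to_float(raw)
print(normalized)
```

Note that the largest positive value lands just below 1.0, because the 16-bit range is not symmetric around zero.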

Amplitude, frequency, and phase

Three words carry most of the mental model.

Amplitude is how far the wave moves from the center line. Larger amplitude usually means louder sound.
Frequency is how often the wave repeats per second, measured in hertz. Higher frequency usually means higher pitch.
Phase is where a repeating wave is in its cycle. Phase matters when waves combine, because two waves can reinforce or cancel each other.
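The three terms map directly onto the parameters of a sine wave. The sketch below is illustrative only: the helper `sine_wave` and the chosen sample rate are assumptions, not part of any library. It builds two 440 Hz tones and adds them, once in phase and once half a cycle apart, to show reinforcement and cancellation.

```python
import math


def sine_wave(amplitude, frequency_hz, phase_rad, sample_rate, num_samples):
    """Return num_samples of amplitude * sin(2*pi*f*t + phase)."""
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate + phase_rad)
        for n in range(num_samples)
    ]


sample_rate = 8000   # samples per second (illustrative value)
num_samples = 800    # 0.1 seconds of audio

a = sine_wave(0.5, 440.0, 0.0, sample_rate, num_samples)
in_phase = sine_wave(0.5, 440.0, 0.0, sample_rate, num_samples)
opposite = sine_wave(0.5, 440.0, math.pi, sample_rate, num_samples)

# Waves combine by adding sample by sample.
reinforced = [x + y for x, y in zip(a, in_phase)]
cancelled = [x + y for x, y in zip(a, opposite)]

peak_reinforced = max(abs(v) for v in reinforced)
peak_cancelled = max(abs(v) for v in cancelled)
print(peak_reinforced)  # close to 1.0: the two waves reinforce
print(peak_cancelled)   # close to 0.0: the two waves cancel
```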

A waveform is not the sound itself. It is a graph of how pressure changed at the microphone over time.

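Consider sampling a 440 Hz tone at different sample rates. Fewer samples per cycle give a rougher stored shape. Once the tone's frequency is above the Nyquist limit (half the sample rate), the reconstruction lies: it shows a lower frequency that was never there.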
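To make the Nyquist limit concrete, here is a small sketch for a pure tone. The helpers `alias_frequency` and `sample_cosine` are hypothetical names, and the folding formula applies to a single ideal tone sampled with no anti-aliasing filter. It shows that a 440 Hz tone sampled at 800 Hz produces the same numbers as a 360 Hz tone.

```python
import math


def alias_frequency(tone_hz, sample_rate):
    """Frequency a sampled pure tone appears to have, folded into 0..sample_rate/2."""
    nearest_multiple = round(tone_hz / sample_rate) * sample_rate
    return abs(tone_hz - nearest_multiple)


def sample_cosine(frequency_hz, sample_rate, num_samples):
    """Sample cos(2*pi*f*t) at the given rate."""
    return [
        math.cos(2 * math.pi * frequency_hz * n / sample_rate)
        for n in range(num_samples)
    ]


tone_hz = 440
for sample_rate in (16000, 1000, 800):
    nyquist = sample_rate / 2
    print(sample_rate, nyquist, alias_frequency(tone_hz, sample_rate))

# At 800 Hz the Nyquist limit is 400 Hz, so 440 Hz is too high to capture.
# Its samples are indistinguishable from those of a 360 Hz tone.
too_high = sample_cosine(440, 800, 50)
impostor = sample_cosine(360, 800, 50)
largest_gap = max(abs(x - y) for x, y in zip(too_high, impostor))
print(largest_gap)  # tiny: only floating-point rounding separates them
```

Once the samples are stored, nothing in the numbers tells you which of the two tones was really in the room.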

Real audio is many waves at once

A clean sine wave is useful for learning, but speech is not one sine wave. It is a messy stack of frequencies changing quickly. Vowels have strong bands of energy. Consonants often contain short bursts, friction, or silence. Background noise adds its own shape.

When waves combine, the microphone receives the sum. It does not know which part came from speech, fan noise, room echo, or keyboard clicks. Separating useful signal from everything else is why audio preprocessing matters.
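The sketch below illustrates both points with two stand-in tones, which are an assumption made for clarity since real speech and noise are far messier. It adds the tones sample by sample, the way air pressure adds at the diaphragm, then uses a naive discrete Fourier transform to show that the sum still contains both frequencies. The helper names are hypothetical, and a real pipeline would use an FFT library rather than this slow loop.

```python
import cmath
import math


def tone(amplitude, frequency_hz, sample_rate, num_samples):
    """Return a sine tone as a list of float samples."""
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate)
        for n in range(num_samples)
    ]


def dft_magnitudes(samples):
    """Naive DFT up to the Nyquist bin, scaled so a sine of amplitude A reads about A."""
    n_total = len(samples)
    magnitudes = []
    for k in range(n_total // 2):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * n / n_total)
            for n, x in enumerate(samples)
        )
        magnitudes.append(2 * abs(acc) / n_total)
    return magnitudes


sample_rate = 400   # one second of audio, so each bin is 1 Hz wide
voice_like = tone(1.0, 50, sample_rate, sample_rate)   # stand-in for speech energy
hum = tone(0.5, 120, sample_rate, sample_rate)         # stand-in for fan noise

# The microphone only ever sees the sum.
mixed = [v + h for v, h in zip(voice_like, hum)]

magnitudes = dft_magnitudes(mixed)
strongest = sorted(range(len(magnitudes)), key=lambda k: magnitudes[k], reverse=True)[:2]
print(sorted(strongest))  # [50, 120]
```

The transform can tell you which frequencies are present, but not which source produced them. That attribution is the hard part.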

Engineering reality

Most voice bugs start before the model. Low microphone gain, clipping, echo cancellation artifacts, stereo channels with phase cancellation, and background noise all change the waveform the model receives. A stronger ASR model helps only if the signal is still usable.
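Two of these problems, low gain and clipping, are easy to spot with simple statistics on the samples. The sketch below assumes float samples in the -1.0 to 1.0 range. The function `signal_report` and the 0.999 clipping threshold are illustrative choices, not a standard.

```python
import math


def signal_report(samples, clip_threshold=0.999):
    """Summarize a float waveform in [-1.0, 1.0] with a few basic health metrics."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    clipped = sum(1 for s in samples if abs(s) >= clip_threshold)
    return {
        "peak": peak,
        "rms": rms,
        "clipped_fraction": clipped / len(samples),
    }


# A sine wave driven too hard, then limited to the legal range (a clipping stand-in).
overdriven = [
    max(-1.0, min(1.0, 1.5 * math.sin(2 * math.pi * 440 * n / 16000)))
    for n in range(16000)
]
# The same tone recorded with very low gain.
quiet = [0.01 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]

print(signal_report(overdriven))  # peak 1.0 and a large clipped fraction
print(signal_report(quiet))       # peak near 0.01: gain is probably too low
```

Checks like these will not catch echo or phase problems, but they are cheap enough to run on every clip before blaming the model.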

Why this matters for ML

Machine learning models cannot consume air pressure directly. They consume numbers. The audio pipeline turns pressure into samples, samples into windows, windows into frequency features, and features into model inputs.
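The "samples into windows" step can be sketched in a few lines. The function `frame_signal` and its parameter names are hypothetical. Real preprocessors typically also apply a window function and padding, which this sketch leaves out.

```python
def frame_signal(samples, frame_length, hop_length):
    """Split a list of samples into overlapping windows of frame_length."""
    frames = []
    # Stop once a full frame no longer fits; trailing samples are dropped.
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frames.append(samples[start:start + frame_length])
    return frames


samples = list(range(10))  # stand-in for ten audio samples
frames = frame_signal(samples, frame_length=4, hop_length=2)
print(frames)
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Because the hop is shorter than the frame, neighboring windows overlap, so each sample contributes to more than one frame.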

If you keep the waveform picture in your head, the rest of the course becomes less arbitrary. Sample rate decides how often you measure the wave. Bit depth decides how finely you store each measurement. Channels decide how many waveforms you store at once. Spectrograms show how frequency content changes as the waveform moves through time.
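Sample rate, bit depth, and channel count together determine how much raw data a clip holds. As a rough sketch, assuming uncompressed PCM and ignoring file headers, the arithmetic looks like this. The function names are hypothetical, and the two formats shown are common examples rather than requirements.

```python
def pcm_size_bytes(sample_rate, bit_depth, channels, duration_s):
    """Bytes needed for uncompressed PCM audio, ignoring any file header."""
    bytes_per_sample = bit_depth // 8
    return int(sample_rate * duration_s) * bytes_per_sample * channels


def quantization_levels(bit_depth):
    """Number of distinct values one sample can take at this bit depth."""
    return 2 ** bit_depth


print(pcm_size_bytes(16000, 16, 1, 10))   # 320000  (10 s, 16 kHz, mono)
print(pcm_size_bytes(44100, 16, 2, 10))   # 1764000 (10 s, 44.1 kHz, stereo)
print(quantization_levels(16))            # 65536
```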

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What does a microphone measure?
How are amplitude and frequency different?
Why is speech more complicated than a single sine wave?
Why do audio models care about preprocessing before inference?

Quick check

A time-ordered signal showing pressure changes captured by a microphone
A list of recognized words and timestamps
A chart that only stores frequencies, not time