Lesson 05

How models learn: loss and gradient descent

Lesson 01 said training nudges the model's knobs until its output matches the answers. This is the nudge, spelled out. It turns out to be one small loop, repeated until your patience or your budget runs out.

The one idea

Training is rolling downhill. The loss is a single number that says how wrong the model is right now. The gradient says which way is uphill, so you step the other way, a little bit, then measure again and repeat. That loop is the whole of training.

One number for "how wrong"

Back in lesson 01 the third ingredient was the objective: a number that measures how badly the model is doing. That number has a name, the loss, and it's worth getting concrete about it because everything else hangs off it.

The model makes a prediction. You know the right answer. The loss is a formula that compares the two and boils the gap down to one number. Predict 0.9 when the answer was 1.0? Small loss. Predict 0.1 when the answer was 1.0? Big loss. You compute the loss across a pile of examples and average it, so a single value tells you how the model is doing right now across the whole task. Lower is better. Zero would mean perfect, which never happens.

The exact formula depends on the task. Predicting a number uses something like squared error (how far off, squared). Picking a category uses cross-entropy (how little probability you put on the right answer). The details don't matter yet. What matters is the shape of the idea: loss is one number, smaller is better, and learning is the act of making it smaller.
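If you want to see those two formulas as numbers, here is a minimal sketch in plain Python. The function names and the three-category probabilities are made up for illustration; only the 0.9 and 0.1 predictions come from the text above.

```python
import math

def squared_error(prediction, target):
    """Squared error for one numeric prediction: the gap, squared."""
    return (prediction - target) ** 2

def cross_entropy(probabilities, correct_index):
    """Cross-entropy for one example: minus the log of the probability
    the model put on the correct category."""
    return -math.log(probabilities[correct_index])

def mean_loss(losses):
    """Average per-example losses into the single number training tracks."""
    return sum(losses) / len(losses)

# The two predictions from the text, both with a right answer of 1.0.
close_miss = squared_error(0.9, 1.0)   # small gap, small loss
bad_miss = squared_error(0.1, 1.0)     # big gap, big loss

# A hypothetical three-way classifier: confident and right, then confident and wrong.
confident_right = cross_entropy([0.8, 0.1, 0.1], correct_index=0)
confident_wrong = cross_entropy([0.1, 0.8, 0.1], correct_index=0)

average = mean_loss([close_miss, bad_miss])
print("squared error:", close_miss, bad_miss, "average:", average)
print("cross-entropy:", confident_right, confident_wrong)
```

Whichever formula you use, the wrong prediction scores higher than the nearly right one, and the average is the single number you'd watch during training.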

Plain version

The loss is the model's report card collapsed into a single grade. Training is the model studying to raise that grade, with no idea what the subject means.

The landscape of wrongness

Here's the picture that makes the rest click. A model from lesson 04 is just a big bag of weights, the adjustable knobs. Imagine you could turn those knobs and, for every setting, read off the loss. Plot loss against the knob settings and you get a surface, a landscape with hills and valleys. High ground is settings that make the model very wrong. Low ground is settings that make it good.

You can't draw a million-knob surface, so we cheat and draw one knob on the floor and loss going up. It looks like a bowl. Training is dropping a ball onto that surface and letting it roll to the bottom. The bottom is the knob setting with the lowest loss, which is the best version of the model you can find.

Loss plotted against one weight. The ball starts wherever the random initial weights put it, and each training step rolls it a little further downhill toward the lowest loss.
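You can build a one-knob landscape yourself. This sketch assumes a toy model, prediction = weight * x, and invented data where the right weight is 2.0. It turns the knob through a range of settings and reads off the loss at each one.

```python
# Hypothetical toy data: the right answer is always 2 * x.
inputs = [1.0, 2.0, 3.0]
targets = [2.0, 4.0, 6.0]

def loss_at(weight):
    """Mean squared error of the one-knob model `prediction = weight * x`."""
    errors = [(weight * x - y) ** 2 for x, y in zip(inputs, targets)]
    return sum(errors) / len(errors)

# Turn the knob through a range of settings and read off the loss at each.
knob_settings = [i * 0.5 for i in range(9)]   # 0.0, 0.5, ..., 4.0
landscape = [(w, loss_at(w)) for w in knob_settings]

# Crude text plot: one '#' per unit of loss, so the bowl shape is visible.
for w, loss in landscape:
    print(f"weight={w:.1f}  loss={loss:7.3f}  " + "#" * round(loss))

best_weight, lowest_loss = min(landscape, key=lambda pair: pair[1])
print("bottom of the bowl:", best_weight)
```

Sweeping every setting only works because there's one knob. With millions of knobs you can't afford to look at the whole surface, which is why the next idea matters.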

So how does the ball know which way is down? It can't see the whole bowl. It can only feel the slope right under it. That slope is the gradient.

The gradient points uphill, so go the other way

The gradient is the slope of the loss at the model's current settings. Formally it's the direction of steepest increase: the way you'd turn the knobs to make the loss go up the fastest. That sounds backwards for something that wants loss to go down, and it is, on purpose. You compute the direction that makes things worse, then step the exact opposite way. Steepest uphill, flipped, is your best guess at downhill.

The gradient also tells you more than a direction. It tells you how steep the slope is for every single knob at once. A knob on a steep part of the surface gets a big push; a knob on a near-flat part barely moves. That's the magic of it: one calculation hands you a personalized nudge for every weight in the model, all pointing toward less loss.
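To make "a nudge for every knob" concrete, here is a rough sketch with a made-up two-knob loss that is steep along one knob and nearly flat along the other. It estimates the gradient by wiggling each knob slightly, which is an illustration only: real training gets the gradient from backpropagation, not from wiggling.

```python
def loss(weights):
    """Hypothetical two-knob loss: steep along knob 0, nearly flat along knob 1."""
    w0, w1 = weights
    return 10.0 * (w0 - 1.0) ** 2 + 0.1 * (w1 + 2.0) ** 2

def numerical_gradient(fn, weights, eps=1e-6):
    """Estimate the slope for each knob by wiggling it slightly (central difference)."""
    gradient = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps
        down[i] -= eps
        gradient.append((fn(up) - fn(down)) / (2 * eps))
    return gradient

weights = [3.0, 3.0]
gradient = numerical_gradient(loss, weights)

# Take a small step against the gradient (downhill) and with it (uphill).
step = 0.01
downhill = [w - step * g for w, g in zip(weights, gradient)]
uphill = [w + step * g for w, g in zip(weights, gradient)]

print("gradient per knob:", gradient)
print("loss now:", loss(weights))
print("loss after stepping opposite the gradient:", loss(downhill))
print("loss after stepping along the gradient:", loss(uphill))
```

The steep knob gets a gradient of about 40 and the flat one about 1, so the same small step moves them by very different amounts. Stepping opposite the gradient lowers the loss; stepping along it raises the loss.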

When the ground goes flat, the gradient shrinks to near zero. No slope means no push, which usually means you've settled into the bottom of a valley and there's nothing left to gain by stepping. That's roughly what "the model converged" means.

The learning rate: how big a step

The gradient gives you a direction. The learning rate decides how far you actually move in that direction on each step. It's a single small number you pick, and it's deceptively important.

Too small and the ball creeps down the bowl one timid millimeter per step. It'll get there eventually, but eventually might mean days of compute you didn't need to spend. Too big and you overshoot the bottom, landing on the far wall higher than you started. Do that repeatedly and the ball bounces up the sides instead of settling, and the loss climbs instead of falling. The sweet spot is the largest step that still reliably heads downhill.

Same bowl, three step sizes. The learning rate is the knob that turns a careful descent into either a crawl or a disaster.
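The three step sizes are easy to try on the simplest possible bowl, loss(w) = w * w, whose slope is 2 * w. The starting point, step count, and the three learning rates below are arbitrary picks for illustration, not recommended values.

```python
def descend(learning_rate, start=5.0, steps=20):
    """Run gradient descent on the bowl loss(w) = w * w and return the final loss."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w                  # slope of w * w at the current w
        w = w - learning_rate * gradient    # step opposite the slope
    return w * w

start_loss = 5.0 * 5.0
too_small = descend(0.001)    # crawls: barely moves in 20 steps
about_right = descend(0.3)    # settles near the bottom
too_big = descend(1.1)        # overshoots further on every step

for label, final_loss in [
    ("too small (0.001)", too_small),
    ("about right (0.3)", about_right),
    ("too big (1.1)", too_big),
]:
    print(f"{label:>18}: loss {start_loss:.1f} -> {final_loss:.6g}")
```

After the same twenty steps, the small rate has barely dented the loss, the middle one has essentially reached the bottom, and the big one has ended up far higher than where it started.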

One training step, start to finish

Put loss, gradient, and learning rate together and you get a single step. It's four moves, always in the same order:

1. Forward pass. Run some examples through the model with its current weights and collect the predictions. This is just the model doing its normal job.
2. Compute the loss. Compare the predictions to the right answers and boil the gap down to one number. Now you know how wrong you are.
3. Backpropagate the gradient. Work backward through the model to figure out how each weight contributed to that loss, which gives you the gradient: a nudge direction for every weight. This backward pass is where the name backprop comes from.
4. Update the weights. Move every weight a small step (the learning rate) in the direction that lowers loss. The model is now slightly less wrong than it was.
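Here are the four moves written out by hand for the smallest model there is, a single weight with prediction = weight * x. The data, the learning rate of 0.05, and the function name are invented for the sketch. With one weight, "backpropagation" collapses to a single derivative you can write directly.

```python
inputs = [1.0, 2.0, 3.0]
targets = [2.0, 4.0, 6.0]   # hypothetical data where the right weight is 2.0

def training_step(weight, learning_rate):
    """One full training step for the one-weight model. Returns (new_weight, loss)."""
    n = len(targets)
    # 1. Forward pass: run the examples through the model.
    predictions = [weight * x for x in inputs]
    # 2. Compute the loss: mean squared error against the right answers.
    loss = sum((p - y) ** 2 for p, y in zip(predictions, targets)) / n
    # 3. Gradient: the derivative of that loss with respect to the weight.
    gradient = sum(2.0 * (p - y) * x for p, y, x in zip(predictions, targets, inputs)) / n
    # 4. Update: move the weight a small step opposite the gradient.
    new_weight = weight - learning_rate * gradient
    return new_weight, loss

weight = 0.0    # a deliberately wrong starting point
losses = []
for step in range(30):
    weight, loss = training_step(weight, learning_rate=0.05)
    losses.append(loss)

print("first loss:", losses[0])
print("last loss:", losses[-1])
print("learned weight:", weight)
```

Thirty repetitions are enough to pull the weight from 0.0 to very nearly 2.0, with the loss shrinking on every step.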

Then you do it again. And again. The loop below is, no exaggeration, what training a model is. A frontier model runs this same four-step cycle on the order of millions of times.

The training loop. Predict, measure the error, work out which way each weight should move, take the step, repeat. Millions of times.

In code the inner loop is almost embarrassingly short. The libraries hide the calculus, so what you actually write looks like this:

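for batch in data:
    optimizer.zero_grad()                  # clear gradients left over from the last step
    preds = model(batch.inputs)            # 1. forward pass
    loss = loss_fn(preds, batch.labels)    # 2. how wrong
    loss.backward()                        # 3. backprop the gradients
    optimizer.step()                       # 4. nudge every weight

The zero_grad() call is housekeeping: in PyTorch-style libraries, gradients pile up across steps unless you clear them. Everything else in machine learning, all the architectures and data work, exists to make those four numbered steps produce something useful.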

Batches and epochs

One detail from that loop: you don't feed the whole dataset through at once. There's usually too much of it to fit in memory, and you'd rather not wait for the entire set before taking a single step. So you chop the data into batches, a few dozen or a few hundred examples each, and run one full training step per batch. Each batch gives a slightly noisy estimate of the slope, but cheaply and often, which works out fine.

When you've gone through every batch once, you've completed one epoch, a single full pass over the data. One pass is rarely enough, so you loop over the whole dataset for many epochs, each time refining the weights a little more. Batches are how you take frequent small steps; epochs are how many times you walk the entire dataset.
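The bookkeeping is simple enough to sketch in a few lines. The dataset here is ten placeholder numbers and the batch size and epoch count are arbitrary. Real loops usually also shuffle the data each epoch, which this sketch skips.

```python
def make_batches(examples, batch_size):
    """Chop a list of examples into consecutive batches; the last one may be smaller."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]

dataset = list(range(10))   # ten hypothetical examples, stood in for by numbers
batch_size = 4
num_epochs = 3

steps_taken = 0
for epoch in range(num_epochs):
    batches = make_batches(dataset, batch_size)
    for batch in batches:
        # One full training step (forward, loss, backward, update) would go here.
        steps_taken += 1
    print(f"epoch {epoch + 1}: {len(batches)} batches, {steps_taken} steps so far")
```

Ten examples in batches of four gives three batches per epoch, so three epochs add up to nine training steps.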

Where the picture gets messy

The bowl was a friendly lie. A real loss landscape over millions of weights isn't a clean bowl. It's a wild, bumpy terrain with dips, ridges, and plateaus everywhere. The ball can roll into a local minimum, a valley that's low but not the lowest, and because the gradient there is zero, it just stops. In practice this matters far less than you'd fear: in very high dimensions there are almost always directions left to roll, and the dips you land in tend to be good enough. You're not hunting for the single perfect setting, just a low spot that makes the model work.
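A one-knob curve with two valleys shows the local-minimum effect. The curve below is an arbitrary example picked because it has one shallow valley and one deeper one; the starting points and learning rate are also just illustrative.

```python
def loss(w):
    """Hypothetical bumpy one-knob loss with two valleys of different depth."""
    return w ** 4 - 3.0 * w ** 2 + w

def slope(w):
    """Derivative of the loss above, worked out by hand."""
    return 4.0 * w ** 3 - 6.0 * w + 1.0

def roll_downhill(start, learning_rate=0.01, steps=500):
    """Plain gradient descent from a given starting weight."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * slope(w)
    return w

from_right = roll_downhill(2.0)    # starts on the right-hand slope
from_left = roll_downhill(-2.0)    # starts on the left-hand slope

print(f"start  2.0 -> settles at w={from_right:.4f}, loss={loss(from_right):.4f}")
print(f"start -2.0 -> settles at w={from_left:.4f}, loss={loss(from_left):.4f}")
print("slope where the right-hand ball stopped:", slope(from_right))
```

Both balls stop where the slope is essentially zero, but the one that started on the right settles in the shallower valley. With a single knob it has nowhere else to go, which is exactly the limitation that fades in very high dimensions.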

Engineering reality

The training loop is simple. Getting it to actually converge is where the hours go.

The learning rate wrecks more runs than anything else. Set it too high and the loss doesn't just stall, it explodes: it shoots up, overflows the number format, and prints as NaN (not a number). Once loss goes NaN the run is dead, every weight is garbage, and you start over. The first thing anyone checks on a blown-up training run is the learning rate.
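You can watch a run blow up on the toy bowl loss(w) = w * w. This sketch adds a guard that stops as soon as the loss is no longer a finite number. The guard, the function name, and both learning rates are assumptions for illustration, and the check treats infinity and NaN the same way, since either one means the run is gone.

```python
import math

def run(learning_rate, max_steps=2000):
    """Gradient descent on loss(w) = w * w with a divergence guard.
    Returns (steps_completed, last_loss, diverged)."""
    w = 1.0
    loss = w * w
    for step in range(max_steps):
        # Multiply rather than use w ** 2: float ** raises OverflowError in
        # Python, while * overflows quietly to infinity like most training code.
        loss = w * w
        if not math.isfinite(loss):     # True for both inf and NaN
            return step, loss, True
        gradient = 2.0 * w
        w = w - learning_rate * gradient
    return max_steps, loss, False

for learning_rate in (0.1, 1.5):
    steps, last_loss, diverged = run(learning_rate)
    status = "blew up" if diverged else "finished"
    print(f"learning rate {learning_rate}: {status} after {steps} steps, loss={last_loss}")
```

With the smaller rate the run finishes with a loss near zero. With the larger one every step doubles the distance from the bottom, so the loss overflows long before the step budget runs out.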

Training is the expensive phase, by a wide margin. Inference (lesson 02) runs the forward pass once. Training runs forward and backward, on huge batches, for millions of steps. The backward pass costs roughly twice as much as the forward pass, so each training step is about three times the work of pushing the same batch through inference, and you do it over and over. That's why training frontier models costs millions of dollars in GPU time while a single answer costs a fraction of a cent.

Know when to stop. More steps stop helping once the model starts memorizing the training data instead of learning the pattern (overfitting, from lesson 01). You watch the loss on data the model isn't training on, and when that number stops improving you halt, even if the training loss is still creeping down. That's early stopping, and it saves both money and a worse model.
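One common way to turn "stops improving" into a rule is a patience counter: halt once the held-out loss has failed to beat its best value for a set number of epochs in a row. The sketch below assumes that rule, and the loss numbers are invented to show the typical pattern.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the epoch to stop at: the first epoch where the
    validation loss has failed to improve for `patience` epochs in a row.
    Returns the last index if that never happens."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(validation_losses) - 1

# Hypothetical numbers: training loss keeps falling, validation loss turns around.
training_losses = [1.00, 0.70, 0.50, 0.40, 0.33, 0.28, 0.24]
validation_losses = [1.05, 0.80, 0.65, 0.60, 0.62, 0.66, 0.71]

stop_epoch = early_stopping_epoch(validation_losses, patience=2)
best_epoch = validation_losses.index(min(validation_losses))

print("stop at epoch index:", stop_epoch)
print("best validation loss was at epoch index:", best_epoch)
print("training loss was still falling:", training_losses[stop_epoch] < training_losses[best_epoch])
```

The rule halts two epochs after the best validation loss, even though the training loss is still going down. In practice you'd keep the weights saved at the best epoch rather than the ones from the moment you stopped.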

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What is the loss, in one sentence, and which direction is better?
The gradient points uphill. Why do you step the opposite way?
What goes wrong if the learning rate is too big? Too small?
List the four moves of a single training step, in order.
What's the difference between a batch and an epoch?

Quick check

A single number for how wrong the model's predictions are right now
How big a step the model takes on each update
How fast the model is training

Step along the gradient, to raise the loss
Step opposite the gradient, to lower the loss
Ignore the gradient and move the weights randomly

The learning rate is too high, so the steps overshoot and the loss blows up
The dataset is too small
The model has already finished learning