Lesson 04

Neural networks, visually

People talk about neural networks like they're brains. They're not. A network is a long chain of tiny, boring arithmetic steps. The surprise is that stacking enough of those steps lets you bend a straight line into almost any shape you want.

The one idea

A neural network is a stack of simple transforms. One neuron multiplies each input by a weight, adds the products up, adds a bias, then bends the result with a nonlinearity. A layer runs many neurons in parallel. Stack a few layers and you can fit shapes a straight line never could.

Start with one neuron

Forget the brain picture. A single neuron is a tiny formula. It takes a few numbers in, and it spits one number out. The whole thing is three steps: weigh, sum, bend.

First it multiplies each input by its own weight. A weight is just a number that says how much this input matters, and whether it pushes the answer up or down. Then it adds all those products together and tacks on one more number called the bias, which shifts the whole result up or down regardless of the inputs. So far this is one weighted sum, a single line's worth of math. The last step is the one that matters: it pushes that sum through a nonlinearity, a simple function that bends the output.

[Figure: inputs x₁, x₂, x₃ are multiplied by weights w₁, w₂, w₃, summed (Σ) together with a bias b, and passed through an activation (the bend) to produce one output.]
One neuron, end to end. Each input gets a weight, the products and a bias are summed, and the result is bent by an activation function. The output is a single number.

In one line of math, a neuron does out = activation(w₁·x₁ + w₂·x₂ + w₃·x₃ + b). The weights and the bias are the knobs. Everything a network "knows" is stored in numbers exactly like these.
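Here is that formula as a minimal Python sketch. The function names and the example numbers are ours, picked for illustration, and we assume ReLU (the activation this lesson describes in the next section) as the bend.

```python
def relu(z):
    """Assumed activation: keep positive values, output zero otherwise."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation=relu):
    # Weigh: multiply each input by its own weight.
    products = [w * x for w, x in zip(weights, inputs)]
    # Sum: add the products together, then add the bias.
    total = sum(products) + bias
    # Bend: push the sum through the nonlinearity.
    return activation(total)

# Hypothetical inputs, weights, and bias.
out = neuron([1.0, 2.0, 3.0], [0.5, -1.0, 0.25], bias=1.0)
print(out)  # 0.5 - 2.0 + 0.75 + 1.0 = 0.25, which is positive, so ReLU keeps it
```

Lower the bias to 0.5 and the sum goes negative, so the same neuron outputs 0.0. That is the bend at work.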

Why the bend matters

The weighted-sum-plus-bias part is a straight line. If you skip the activation and just stack these linear steps, something quietly disastrous happens: the whole network collapses. A line of a line is still a line. Ten linear layers in a row do exactly what one linear layer could do, no matter how many neurons you pile in. You'd have spent a fortune on compute to build a glorified ruler.
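You can check the collapse with a few lines of Python. This is a sketch in one dimension, with made-up weights and biases; the same algebra carries over to full layers, where the weights become matrices.

```python
def linear(w, b):
    """One linear step with no activation: x -> w*x + b."""
    return lambda x: w * x + b

def compose(w1, b1, w2, b2):
    # w2*(w1*x + b1) + b2 = (w2*w1)*x + (w2*b1 + b2): still one line.
    return w2 * w1, w2 * b1 + b2

# Hypothetical weights and biases for two stacked linear steps.
layer1 = linear(2.0, 1.0)
layer2 = linear(-3.0, 0.5)
collapsed = linear(*compose(2.0, 1.0, -3.0, 0.5))

for x in [-2.0, 0.0, 1.5]:
    # Two stacked steps and the single collapsed step always agree.
    assert layer2(layer1(x)) == collapsed(x)
```

However many linear steps you chain, you can keep applying compose until one step is left.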

The nonlinearity is what breaks that. After each weighted sum you bend the result, so the next layer is working on something that's no longer straight. Now stacking actually buys you something. The most common bend is ReLU, which is almost insultingly simple: keep the number if it's positive, otherwise output zero. That single kink, repeated across thousands of neurons, is enough to approximate wildly complicated shapes.
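ReLU fits in one line of Python. The sketch below also adds two ReLU neurons together to draw a V, a shape no single straight line can make. The v_shape helper is our own illustration, not a standard building block.

```python
def relu(z):
    """Keep the number if it's positive, otherwise output zero."""
    return z if z > 0 else 0.0

def v_shape(x):
    # Two neurons, weights +1 and -1, no bias. Their sum is |x|.
    return relu(x) + relu(-x)

print([v_shape(x) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]])
# [2.0, 1.0, 0.0, 1.0, 2.0]
```

One kink per neuron, and already the output has a corner in it.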

Plain version

Without a nonlinearity, depth costs compute and buys nothing: the network can only ever draw straight lines. The bend is the thing that lets layers compound into curves.

A layer is just many neurons at once

One neuron outputs one number, which isn't much. So we run a whole row of them side by side, all looking at the same inputs but each with its own weights and bias. That row is a layer, and its output is a list of numbers, one per neuron. Feed that list into the next layer, and the next, and you have a network.
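As a rough sketch, a layer is a loop over neurons that all share the same inputs. The names and numbers below are hypothetical, and ReLU is assumed as the activation.

```python
def relu(z):
    return max(0.0, z)

def layer(inputs, weight_rows, biases):
    """Run one neuron per (weight row, bias) pair on the same inputs."""
    outputs = []
    for weights, bias in zip(weight_rows, biases):
        total = sum(w * x for w, x in zip(weights, inputs)) + bias
        outputs.append(relu(total))
    return outputs

# Hypothetical layer: 2 inputs feeding 3 neurons.
inputs = [1.0, 2.0]
weight_rows = [[1.0, 1.0], [-1.0, 0.5], [0.0, -2.0]]
biases = [0.0, 0.5, 1.0]

print(layer(inputs, weight_rows, biases))  # [3.0, 0.5, 0.0]
```

To build a network, pass that output list in as the inputs of another call to layer, with its own weights and biases.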

The layers have names. The input layer is just your raw numbers going in. The output layer is the answer coming out, like a score or a probability. Everything between them is a hidden layer, called hidden only because you never look at those numbers directly; they're intermediate work. When people say a network is deep, they mean it has many hidden layers stacked up. That's the entire origin of the term "deep learning". It's not deep as in profound. It's deep as in many layers.

[Figure: a small network drawn as four columns of neurons, left to right: input, hidden, hidden, output.]
A small network. Numbers enter on the left, flow right through two hidden layers, and leave as two output numbers. Every line is a weight; every circle is a neuron doing weigh-sum-bend.

Notice that every neuron in one layer connects to every neuron in the next. Each of those lines is a separate weight to store and tune. That's where the parameter counts come from, and it's worth pausing on the cost.

Stacking is what bends the shape

Here's the intuition that makes the whole thing click. The first layer can only carve the input space with straight cuts, because each neuron is one weighted sum plus a bend. But the second layer doesn't see your raw inputs anymore. It sees the bent outputs of the first layer. So it's drawing straight cuts on an already-warped surface, which from the original input's point of view looks like a curve.

Keep stacking and the warps compose. Layer by layer, the network folds and bends a flat space until two classes that were hopelessly tangled become cleanly separable. No single step is clever. The cleverness is entirely in the composition, the same way a complicated origami shape comes from many simple folds.
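A classic small case of this is XOR: output 1 when exactly one of two inputs is on. No single straight cut separates those cases, but one hidden layer of two ReLU neurons does. The weights below are hand-picked for illustration, not learned, and the output step is left as a plain weighted sum.

```python
def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    # Hidden layer: two straight cuts, each followed by a bend.
    h1 = relu(x1 + x2)        # grows once at least one input is on
    h2 = relu(x1 + x2 - 1.0)  # wakes up only when both inputs are on
    # Output: a straight cut drawn on the bent hidden values.
    return h1 - 2.0 * h2

for x1, x2 in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]:
    print(x1, x2, xor_net(x1, x2))
# 0.0 0.0 0.0
# 0.0 1.0 1.0
# 1.0 0.0 1.0
# 1.0 1.0 0.0
```

The last step is linear in h1 and h2, yet from the point of view of x1 and x2 it draws a shape no line could.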

This is also the honest limit of the picture. A trained network is a specific stack of folds that happened to separate your training data. It is not reasoning, and it holds no symbols or concepts you could point to. It found weights that bend the input into a shape where the right answers land on the right side. That's a function approximator, full stop.

The engineering cost of depth and width

"Just add more layers" sounds free. It is not. Every weight is a number you have to store, load into memory, and multiply during every single forward pass.

Engineering reality

Connecting a layer of m neurons to a fully connected layer of n neurons takes m × n weights plus n biases. That product is the trap. A layer going from 4,096 to 4,096 neurons is already about 16.8 million weights, in one layer.
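The count is easy to script. This sketch uses our own helper names, and the small network sizes at the end are hypothetical, chosen only to show the bookkeeping.

```python
def dense_params(m, n):
    """Parameters for m neurons feeding n neurons: m*n weights plus n biases."""
    return m * n + n

def network_params(sizes):
    """Total parameters for a fully connected stack with the given layer sizes."""
    return sum(dense_params(m, n) for m, n in zip(sizes, sizes[1:]))

print(4096 * 4096)                   # 16777216 weights, about 16.8 million
print(dense_params(4096, 4096))      # 16781312 once the biases are included
print(network_params([3, 4, 4, 2]))  # 46 for a tiny hypothetical network
```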

Memory. Parameters have to live in RAM or GPU memory. At common 16-bit precision, a billion parameters is roughly 2 GB just to hold the weights, before you add the activations and optimizer state needed during training. This is why model size is quoted in parameter counts, and why "7B" or "70B" tells you whether it even fits on your hardware.
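A back-of-envelope version of that arithmetic, as a sketch. It assumes 2 bytes per parameter (16-bit) and decimal gigabytes, and it counts the weights only, not activations or optimizer state.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Rough memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(1_000_000_000))   # 2.0
print(weight_memory_gb(7_000_000_000))   # 14.0, a "7B" model
print(weight_memory_gb(70_000_000_000))  # 140.0, a "70B" model
```

Compare those numbers to the memory on your GPU and you know right away which sizes are off the table.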

Compute. Each forward pass redoes all those multiplications. Wider layers and more of them both multiply your cost, in latency at serving time and in dollars at training time. Doubling the width of the layers on both sides of a connection roughly quadruples the work there, since the weight count grows with the product of the two layer sizes.
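To see where the factor of four comes from, count the multiplications in one fully connected step. This is a sketch that ignores bias additions and the activation, and the layer widths are hypothetical.

```python
def layer_multiplies(m, n):
    """Multiplications in one forward pass from m neurons to n neurons."""
    return m * n

base = layer_multiplies(1024, 1024)
one_side = layer_multiplies(1024, 2048)   # only the receiving layer doubles
both_sides = layer_multiplies(2048, 2048)  # both layers double

print(one_side // base)    # 2
print(both_sides // base)  # 4
```

Doubling just one of the two layers doubles the work. The quadrupling shows up when the layers on both sides get wider together, which is what happens when you widen a whole network.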

So depth and width are real budgets, not knobs you spin for free. A bigger network can fit more complex shapes, but you pay for every neuron in memory and in time, on every request forever.

Go deeper (optional)

This lesson is the still picture: what a network is. The natural next question is how it gets its weights in the first place, and that's lesson 05. If you want the visual, geometric version of that answer before we write it up, one short film does it better than text can. We point you there instead of redrawing it.

Lesson 01 already sent you to chapter 1 of this series, "But what is a neural network?", for the shape-bending intuition. This is chapter 2, about training. It's the bridge into lesson 05, so it's a deliberate next step, not a repeat. As always, we add what it leaves out: the engineering cost above, and the careful walk through backprop in the next lesson.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

  • What are the three steps a single neuron performs on its inputs?
  • What is a weight, and what is a bias?
  • Why does a network with no nonlinearity collapse, no matter how many layers it has?
  • What's the difference between an input, hidden, and output layer, and what makes a network "deep"?
  • Roughly how many weights connect a layer of m neurons to a layer of n neurons, and why does that matter for memory and compute?

Quick check

What does a single neuron compute?

  • A weighted sum of its inputs plus a bias, passed through a nonlinearity
  • It stores past inputs and averages them over time
  • It measures how wrong the network's answer is

Why does a network need a nonlinearity?

  • To make the network run faster on a GPU
  • Without it, stacked layers collapse into one linear transform and depth adds nothing
  • It's where the network stores its learned weights

What makes a network "deep"?

  • It has a very large number of neurons in a single layer
  • It has many hidden layers stacked between input and output
  • It understands the meaning of its inputs more thoroughly