Lesson 07

Why GPUs

AI didn't end up on graphics cards by accident or marketing. The math a neural net does turns out to be exactly the math a GPU was built to do, and once you see the match, the whole hardware story falls into place. So does the catch nobody warns you about.

The one idea

A neural net is mostly one operation, matrix multiplication, repeated billions of times. That operation is thousands of identical small multiply-and-adds that don't depend on each other, so they can all run at once. A CPU does a few things fast and in order. A GPU does thousands of things at the same time. That single fit is why AI runs on graphics chips.

It's almost all matrix multiplication

Back in lesson 04 we opened up a layer and found it was just a matrix multiply: take the inputs, multiply by a grid of weights, add a bias, run the result through a simple function, pass it on. Lesson 05 then showed that training runs that same forward pass plus a backward pass, over and over, across millions of examples. Stack those two facts and you get the punchline: running or training a model is overwhelmingly one kind of arithmetic, done at enormous scale.

And that arithmetic is humble. Every output number in a matrix multiply is just a pile of "multiply two numbers, add them to a running total." Nothing fancy, no branching, no decisions. The same tiny operation, a multiply-and-add, repeated until you've filled the whole output grid. A modern model does trillions of these per second when it's generating text.

The important part is what these multiply-adds do not need from each other. The number in the top-left of the output doesn't depend on the number in the bottom-right. They read different rows and columns, they never wait on one another, and they can be computed in any order or all together. That independence is the door the whole hardware story walks through.

[Figure: Inputs (one row) × Weights (one column) = Outputs. Each output cell is a row · column product, a stack of multiply-adds. No cell waits on another.]
One output cell is a row times a column: a stack of multiply-and-adds. The dots are other output cells, each its own independent calculation. That's thousands of small jobs with no order between them.
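To make the independence concrete, here is a minimal sketch in plain Python. The function names and the tiny matrices are made up for illustration. It fills the output grid twice, once in the usual row-by-row order and once in a shuffled order, and the grids come out identical because no cell ever reads another cell's result.

```python
import random

def output_cell(inputs, weights, i, j):
    """One output cell: row i of inputs times column j of weights."""
    total = 0.0
    for k in range(len(weights)):
        total += inputs[i][k] * weights[k][j]  # one multiply-and-add
    return total

def matmul_in_order(inputs, weights, cell_order):
    """Fill the output grid, visiting cells in whatever order we are given."""
    rows, cols = len(inputs), len(weights[0])
    out = [[None] * cols for _ in range(rows)]
    for i, j in cell_order:
        out[i][j] = output_cell(inputs, weights, i, j)
    return out

# Illustrative data: 2 examples with 3 features, 3 x 2 weight grid.
inputs = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
weights = [[1.0, 0.0],
           [0.0, 1.0],
           [2.0, 2.0]]

cells = [(i, j) for i in range(2) for j in range(2)]
shuffled = cells[:]
random.Random(0).shuffle(shuffled)  # any order at all would do

row_major = matmul_in_order(inputs, weights, cells)
scrambled = matmul_in_order(inputs, weights, shuffled)

print(row_major)               # [[7.0, 8.0], [16.0, 17.0]]
print(row_major == scrambled)  # True
```

If the order doesn't matter, nothing stops you from doing every cell at the same moment, which is the opening a GPU exploits.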

A CPU and a GPU are different shapes of fast

A CPU is built to be brilliant at one thing at a time. It has a handful of big, powerful cores, deep caches, and a lot of clever machinery for guessing what comes next and keeping a single chain of instructions racing forward. If your work is a long sequence of steps where each one depends on the last, like most ordinary software, that's exactly what you want. But hand a CPU a billion identical multiply-adds and it grinds through them mostly a few at a time. It's a sports car on a road with one lane.

A GPU made the opposite trade. Instead of a few strong cores, it packs thousands of small, simple ones. Each one is weaker than a CPU core and not nearly as clever, but there are so many of them that they finish enormous batches of the same operation together. The trick that makes this work is that the GPU issues a single instruction and a huge crowd of cores all run it at once, each on its own slice of the data. People call this SIMT (single instruction, multiple threads) or SIMD (single instruction, multiple data). It only helps when the work is the same operation repeated across lots of independent numbers, which is precisely what a matrix multiply is.

[Figure: CPU, a few powerful cores, great at one long chain of steps. GPU, thousands of small cores, great at one operation × a huge crowd.]
The CPU spends its silicon on a few smart cores. The GPU spends it on a swarm of simple ones. Neither is "better." They're tuned for different shapes of work, and a neural net's work is the GPU's shape.
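The "one instruction, huge crowd" idea can be imitated in plain Python. This is only a sketch of the pattern and not how a real GPU is programmed: the "lanes" here are a made-up stand-in for GPU threads, and Python still visits them one after another where hardware would run them together. What it does show is the counting: the same multiply-add is issued once per step, and every lane applies it to its own pair of numbers.

```python
def multiply_add(acc, a, b):
    """The single instruction every lane runs."""
    return acc + a * b

def lockstep_matmul(inputs, weights):
    rows, inner, cols = len(inputs), len(weights), len(weights[0])
    # One hypothetical "lane" per output cell, each with its own running total.
    lanes = [(i, j) for i in range(rows) for j in range(cols)]
    acc = {lane: 0.0 for lane in lanes}
    instructions_issued = 0
    for k in range(inner):
        # One instruction is issued for this step...
        instructions_issued += 1
        # ...and every lane applies it to its own slice of the data.
        # On real hardware these lanes would all run at once.
        for (i, j) in lanes:
            acc[(i, j)] = multiply_add(acc[(i, j)], inputs[i][k], weights[k][j])
    out = [[acc[(i, j)] for j in range(cols)] for i in range(rows)]
    return out, instructions_issued, len(lanes)

inputs = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
weights = [[1.0, 0.0],
           [0.0, 1.0],
           [2.0, 2.0]]

result, issued, lane_count = lockstep_matmul(inputs, weights)
print(result)      # [[7.0, 8.0], [16.0, 17.0]]
print(issued)      # 3 instruction steps
print(lane_count)  # 4 lanes, so 12 multiply-adds in total
```

Done one at a time, that's 12 steps. Done in lockstep, it's 3. Scale the grid up to thousands of cells and the gap is the whole reason the GPU exists.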

This is also why "GPU" stopped being only about graphics. Drawing a frame means computing the color of millions of pixels with the same little program, all independent of each other. That is the same parallel pattern as a matrix multiply. The chip that was built to shade pixels turned out to be the chip that could train neural nets, so the name stuck even though the job changed.

Imagine a worksheet of a million simple multiplications. A CPU is four expert accountants who are very fast and never make mistakes, but there are only four of them, so they chew through the page in chunks. A GPU is a stadium of ten thousand schoolkids, each slower and only able to do one multiply at a time, but they finish the whole worksheet in one go because every problem is independent. For a worksheet this is no contest. For a problem where each answer feeds the next, the four accountants win easily. Neural nets are mostly worksheets.

The twist: the math isn't the slow part

Here's where most explanations stop, and where the real engineering begins. Once you have thousands of cores, raw multiply-add speed stops being the thing you run out of. The new bottleneck is feeding those cores. Every weight and every activation lives in the GPU's memory, and before a core can multiply two numbers it has to fetch them from that memory. Moving a number costs far more time and energy than multiplying it. A lot more.

So the question that decides whether your GPU is actually busy is the ratio between math and memory traffic: how many multiply-adds do you do for each byte you pull from memory? That ratio has a name, arithmetic intensity. If you do plenty of math per byte, the cores stay fed and the chip flies. If you do very little math per byte, the cores sit idle waiting on memory, and all those thousands of lanes are stuck in traffic. This trade-off, compute on one side and memory bandwidth on the other, is what the roofline picture draws: you're limited either by how fast you can calculate or by how fast you can move data, and most of the time it's the second one.
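The roofline idea fits in a few lines of Python. This is a sketch under stated assumptions: the chip numbers below are invented round figures, not the specs of any real card, and the function names are ours. The rule itself is just a minimum: the speed you actually get is the smaller of what the cores can compute and what the memory can feed them.

```python
def arithmetic_intensity(operations, bytes_moved):
    """Math done per byte pulled from memory."""
    return operations / bytes_moved

def attainable_ops_per_second(intensity, peak_ops_per_second, bytes_per_second):
    """Roofline: capped by compute or by memory, whichever is lower."""
    return min(peak_ops_per_second, intensity * bytes_per_second)

def bound_by(intensity, peak_ops_per_second, bytes_per_second):
    """Which side of the roofline a workload sits on."""
    ridge = peak_ops_per_second / bytes_per_second  # ops per byte needed to stay fed
    return "compute" if intensity >= ridge else "memory"

# Hypothetical chip, chosen only to make the arithmetic easy:
PEAK = 200e12      # 200 trillion operations per second
BANDWIDTH = 1e12   # 1 trillion bytes per second

starved = attainable_ops_per_second(2, PEAK, BANDWIDTH)
well_fed = attainable_ops_per_second(400, PEAK, BANDWIDTH)

print(starved / PEAK)                 # 0.01, so 1% of the chip's peak
print(well_fed / PEAK)                # 1.0
print(bound_by(2, PEAK, BANDWIDTH))   # memory
print(bound_by(400, PEAK, BANDWIDTH)) # compute
```

On this made-up chip a workload needs about 200 operations per byte before the cores, rather than the memory, become the limit. Anything below that leaves compute on the table.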

[Figure: Memory (weights + activations) feeds thousands of cores through a narrow pipe, the memory bandwidth. The cores sit mostly idle, waiting on the pipe.]
The compute is enormous. The pipe feeding it is not. When only a trickle of bytes arrives per cycle, most cores sit idle. That's a memory-bound workload, and it's the normal case for generating text one token at a time.

Why generating text is the hard case

This is not a hypothetical. It's the day-to-day reality of running an LLM. A model produces text one token at a time, and to produce each single token it has to read every weight in the model from memory. A large model is tens or hundreds of gigabytes of weights. So for one token, you stream the whole model through the chip, do a relatively small amount of math with it, and spend most of your time moving numbers rather than crunching them. The arithmetic intensity is tiny, often just a handful of operations per byte, while the chip could happily do a couple hundred. The cores are starving.

That's why an LLM feels slow even on a chip that, on paper, does quadrillions of operations a second. You're not waiting on the math. You're waiting on the weights to arrive. Knowing this flips a lot of intuition: a faster-computing GPU barely helps the token-by-token part, but a GPU with fatter memory bandwidth helps a lot.
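A back-of-the-envelope version of that claim, as a sketch. The model size and bandwidth figures are hypothetical round numbers, and the estimate ignores everything except streaming the weights once per token (no activations, no other working memory), so treat it as an upper bound rather than a prediction.

```python
def weight_bytes(num_parameters, bytes_per_parameter):
    """Size of the weights in memory."""
    return num_parameters * bytes_per_parameter

def max_tokens_per_second(model_bytes, memory_bytes_per_second):
    """Upper bound for one-at-a-time generation:
    every token requires reading all the weights once."""
    return memory_bytes_per_second / model_bytes

# Hypothetical model: 70 billion parameters stored as 16-bit (2-byte) numbers.
model = weight_bytes(70e9, 2)  # 140e9 bytes, i.e. 140 GB

baseline = max_tokens_per_second(model, 2e12)      # 2 TB/s of memory bandwidth
fatter_pipe = max_tokens_per_second(model, 4e12)   # same chip, double the bandwidth

print(round(baseline, 1))     # 14.3 tokens per second at best
print(round(fatter_pipe, 1))  # 28.6 tokens per second at best
```

Notice what is missing from the formula: the chip's compute speed never appears. Under this simplification, doubling the math throughput changes nothing, while doubling the bandwidth doubles the ceiling.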

Engineering reality

The whole serving stack is shaped by the fact that moving data, not doing math, is what costs you. A few consequences worth carrying forward:

Memory-bound vs compute-bound. Generating tokens one at a time is memory-bound: the cores wait on weights. The first big pass over your prompt, before any token comes out, is compute-bound and uses the chip well. Same model, two completely different bottlenecks, which is why "tokens per second" hides a lot.

Batching is the main trick. If you've already paid to drag the weights across the chip, run many users' requests through them at the same time. The bytes moved barely change, but you get many tokens out instead of one. That's why servers wait a beat to bundle requests: it raises arithmetic intensity and turns a memory-bound job into a well-fed one. It helps total throughput, not the latency of any single request.

VRAM is the hard ceiling. The whole model plus its working memory has to fit in the GPU's memory, or it either won't run at all or will run far below full speed. This is the real reason a model's size, in gigabytes, decides which card you need, and why people obsess over quantization to shrink weights. The wall is capacity first, then bandwidth.

Idle cores are the silent cost. A GPU can report high utilization while doing very little useful math, because it's busy waiting on memory. Buying more compute you can't feed is money lit on fire. Real optimization is almost always about moving fewer bytes: keep data on-chip, reuse it, and don't read what you don't have to.
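The batching point above can be checked with a little counting. This sketch looks at a single layer and makes simplifying assumptions: the layer sizes are made up, every number is 2 bytes, the weights are read once per batch, and each request's inputs and outputs are moved once. It counts multiply-adds per byte, the same ratio the lesson calls arithmetic intensity.

```python
def layer_intensity(batch_size, d_in, d_out, bytes_per_number=2):
    """Multiply-adds per byte moved for one layer at a given batch size."""
    multiply_adds = batch_size * d_in * d_out
    # Weights are dragged across once, no matter how many requests share them.
    weight_traffic = d_in * d_out * bytes_per_number
    # Each request brings its own inputs and writes its own outputs.
    activation_traffic = batch_size * (d_in + d_out) * bytes_per_number
    return multiply_adds / (weight_traffic + activation_traffic)

# Hypothetical layer: 4096 inputs, 4096 outputs.
one_user = layer_intensity(1, 4096, 4096)
eight_users = layer_intensity(8, 4096, 4096)
sixty_four_users = layer_intensity(64, 4096, 4096)

for batch, value in [(1, one_user), (8, eight_users), (64, sixty_four_users)]:
    print(batch, round(value, 2))
```

With one request the layer does less than one multiply-add per byte. Bundle more requests and the ratio climbs almost in step with the batch size, because the weight traffic stays fixed while the useful math grows. That is the whole mechanism behind batching: same bytes, more work.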

Pulling the thread back together

So the arc is short. Lesson 04 said a layer is a matrix multiply. Lesson 05 said training runs that multiply millions of times. This lesson says that multiply is thousands of identical, independent multiply-adds, which is exactly the shape a GPU's swarm of cores eats for breakfast, and that the real limit, once you have that swarm, is how fast you can shovel the weights to it. Compute is cheap and plentiful. Moving data is the expensive part. Hold onto that last sentence. It quietly explains most of what makes AI systems fast, slow, or expensive, and it's where the inference and serving tracks pick up.

Checkpoint

You've finished the Foundations course if you can answer these from memory:

  • What single operation makes up most of a neural net's work, and why does that connect back to lessons 04 and 05?
  • Why does the independence of the multiply-adds in a matrix multiply matter for hardware?
  • In one sentence each, how does a CPU's design differ from a GPU's, and which kind of work suits each?
  • What is arithmetic intensity, and what does it mean for a workload to be memory-bound rather than compute-bound?
  • Why is generating text one token at a time usually memory-bound, and why does batching help throughput?

Quick check

  • Each GPU core is faster than each CPU core
  • The core operation is thousands of identical, independent multiply-adds that can all run at once, which matches the GPU's many-core design
  • GPUs use higher-precision math and far less memory than CPUs

  • The GPU doesn't have enough raw multiply-add throughput
  • It's memory-bound: each token requires streaming all the weights from memory, so the cores wait on data movement
  • The model is retraining itself on each new token

  • The weights are moved from memory once and reused across all the batched requests, doing more math per byte loaded
  • It makes each individual request finish faster
  • It reduces the number of weights the model has