Lesson 04

Why LLM inference is memory-bound

Modern GPUs are absurdly good at math. LLM inference can still be slow because the hard part is often feeding the math units quickly enough with weights, activations, and KV cache data.

The one idea

Decode often becomes memory-bound: the GPU spends more time moving model and cache data than doing arithmetic. Smaller weights, better batching, and faster memory can matter as much as raw compute.

Compute-bound vs memory-bound

A workload is compute-bound when arithmetic is the slow part. Give it more math throughput and it gets faster. A workload is memory-bound when moving data is the slow part. Give it more math units and they sit around waiting for bytes.
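
One way to make the distinction concrete is a roofline-style check: compare how many FLOPs a workload does per byte it moves against the ratio of the hardware's math throughput to its memory bandwidth. The sketch below assumes hypothetical hardware numbers (`PEAK_FLOPS`, `MEM_BANDWIDTH`) chosen only for illustration; the function names are ours, not from any library.

```python
# Roofline-style check: is a workload limited by math or by memory traffic?
# The hardware numbers below are hypothetical, chosen only for illustration.
PEAK_FLOPS = 1.0e15      # assumed peak math throughput, FLOP/s
MEM_BANDWIDTH = 2.0e12   # assumed memory bandwidth, bytes/s


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, bandwidth=MEM_BANDWIDTH):
    """Best-case FLOP/s for a workload doing `intensity` FLOPs per byte moved."""
    return min(peak_flops, intensity * bandwidth)


def classify(intensity, peak_flops=PEAK_FLOPS, bandwidth=MEM_BANDWIDTH):
    """Compare the workload's FLOPs per byte with the hardware's ridge point."""
    ridge = peak_flops / bandwidth  # FLOPs per byte needed to saturate the math units
    return "compute-bound" if intensity >= ridge else "memory-bound"


for intensity in (1, 50, 500, 5000):
    tflops = attainable_flops(intensity) / 1e12
    print(f"{intensity:>5} FLOPs/byte: {classify(intensity):<13} ~{tflops:.0f} TFLOP/s attainable")
```

Below the ridge point, adding math units changes nothing in this model: attainable throughput is set entirely by bandwidth.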

LLM inference has both modes, but decode is often memory-bound. For each generated token, the server has to read a large amount of model weight data and the current KV cache. If the batch is small, there may not be enough arithmetic per byte loaded (low arithmetic intensity) to keep the GPU fully busy.

Raw GPU math is not enough. Decode speed depends on how fast the system can feed data into that math.
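
A rough back-of-the-envelope bound shows why. If every weight must be streamed from memory once per decode step, the step cannot finish faster than weight bytes divided by memory bandwidth. The sketch below assumes a dense model and made-up numbers (7 billion parameters, 16-bit weights, 2 TB/s of bandwidth); it ignores KV cache reads, activations, and kernel overhead, so real latency will be higher.

```python
def min_decode_step_seconds(num_params, bytes_per_param, bandwidth):
    """Lower bound on one decode step: time to stream every weight once."""
    return num_params * bytes_per_param / bandwidth


# All three values are illustrative assumptions, not measurements.
NUM_PARAMS = 7e9         # hypothetical dense model size
BYTES_PER_PARAM = 2      # 16-bit weights
MEM_BANDWIDTH = 2.0e12   # assumed memory bandwidth, bytes/s

step = min_decode_step_seconds(NUM_PARAMS, BYTES_PER_PARAM, MEM_BANDWIDTH)
print(f"at least {step * 1e3:.1f} ms per token")
print(f"at most {1 / step:.0f} tokens/s for a single unbatched request")
```

Note that the bound does not mention FLOP/s at all. For a single request, faster math units do not move it.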

Why decode has low reuse

In training and large prefill batches, the same model weights can be used across many tokens at once. That creates a lot of arithmetic for each weight load. The hardware has a better chance of staying busy.

In single-request decode, you generate one token for one user. The model still has to touch a huge set of weights, but it only gets one token's worth of work from that read. That is poor reuse. Batching decode steps from different users lets a single weight read serve many tokens at once, which is why serving systems care so much about batching.
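
The effect of batching on reuse can be estimated for a single weight matrix. The sketch below counts FLOPs and bytes moved for one matrix multiply as the number of tokens in the batch grows. The layer size (4096 by 4096) and 16-bit elements are assumptions for illustration, and the byte count is a simplification that reads the weights once and reads or writes each activation once.

```python
def matmul_intensity(batch_tokens, d_in, d_out, bytes_per_element=2):
    """FLOPs per byte for multiplying `batch_tokens` activation rows by one weight matrix."""
    flops = 2 * batch_tokens * d_in * d_out              # one multiply and one add per weight per token
    weight_bytes = d_in * d_out * bytes_per_element      # weights are read once per step
    activation_bytes = batch_tokens * (d_in + d_out) * bytes_per_element  # inputs read, outputs written
    return flops / (weight_bytes + activation_bytes)


# Hypothetical layer shape; only the trend matters.
for batch in (1, 8, 64, 512):
    print(f"batch {batch:>3}: {matmul_intensity(batch, 4096, 4096):6.1f} FLOPs per byte")
```

With one token in flight the layer does about one FLOP per byte moved. Each extra request in the batch adds arithmetic while the weight traffic stays the same.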

Model size is speed pressure

A bigger model usually means more weights to read per token. More weights can improve quality, but they also raise memory traffic and memory footprint. If your task does not need the larger model, you pay for unused capacity on every request.

Quantization helps because lower-precision weights take fewer bytes. A 4-bit model can be much easier to fit and move than a 16-bit model. But quantization is not magic. It can affect quality, and speed depends on whether the hardware and kernels are optimized for that format.
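
The byte savings are simple arithmetic. The sketch below computes the weight footprint of a hypothetical 7-billion-parameter model at several bit widths. It is an idealized estimate: real quantized formats also store scales and other metadata, so actual sizes are somewhat larger.

```python
def weight_gigabytes(num_params, bits_per_weight):
    """Idealized weight footprint in GB, ignoring quantization scales and metadata."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7e9  # hypothetical model size, chosen for illustration

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_gigabytes(NUM_PARAMS, bits):5.1f} GB")
```

Fewer bytes per weight means less data to read per token, but whether that turns into faster decode depends on the kernels, as noted above.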

The KV cache joins the fight

Weights are not the only memory traffic. Decode also reads from the KV cache so the current token can attend to previous tokens. As the context grows, the cache gets larger. As concurrency grows, there are more caches alive at the same time.

This creates a practical serving tradeoff: long contexts can improve capability, but they reduce how many requests fit comfortably on the same hardware. A model might have a 128k context window on paper and still be too expensive to serve at that length for normal traffic.
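
To see how quickly the cache grows, here is a minimal size estimate. It uses the standard accounting for a transformer KV cache: one key and one value vector per token, per layer, per KV head. The model shape (32 layers, 32 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration, not a specific model.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_element=2, batch=1):
    """Bytes of K and V stored for `batch` sequences of `seq_len` tokens each."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element  # 2 = key + value
    return per_token * seq_len * batch


# Hypothetical model shape, chosen for illustration.
CONFIG = dict(num_layers=32, num_kv_heads=32, head_dim=128)

for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(seq_len, **CONFIG) / 2**30
    print(f"{seq_len:>7} tokens: {gib:5.1f} GiB of KV cache per request")
```

Under these assumptions a single request at the full 128k window needs more cache memory than the 16-bit weights of many models, which is the "on paper" problem in practice.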

Engineering reality

When someone asks for a bigger context window, translate it into capacity: more prefill work, more KV cache memory, lower concurrency, and higher tail latency. The product benefit has to be worth that bill.
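
That translation can be sketched as a capacity calculation: memory left after loading weights, divided by the KV cache each request may need. Every number below is an assumption for illustration (an 80 GB accelerator, 14 GB of weights, 512 KiB of cache per token), and the estimate is a worst case where each request reserves its full context. It ignores activations, fragmentation, and runtime overhead.

```python
def max_concurrent_requests(gpu_bytes, weight_bytes, kv_bytes_per_token, context_len):
    """How many full-length requests fit in the memory left over after the weights."""
    free_bytes = gpu_bytes - weight_bytes
    per_request = kv_bytes_per_token * context_len
    return max(0, free_bytes // per_request)


# Illustrative assumptions, not measurements of any real system.
GPU_BYTES = 80 * 10**9
WEIGHT_BYTES = 14 * 10**9
KV_BYTES_PER_TOKEN = 512 * 1024

for context_len in (4_096, 32_768, 131_072):
    n = max_concurrent_requests(GPU_BYTES, WEIGHT_BYTES, KV_BYTES_PER_TOKEN, context_len)
    print(f"context {context_len:>7}: up to {n} concurrent requests")
```

The same hardware that serves dozens of short requests may serve only a handful of long ones, and fewer requests per device also means smaller decode batches.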

What improves a memory-bound path

The usual fixes reduce bytes moved, reuse bytes better, or use hardware with faster memory. Smaller models, quantization, grouped-query attention, cache paging, better kernels, and batching all target some part of this.
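
As one example of "reduce bytes moved", grouped-query attention lets several query heads share one key/value head, so fewer K and V vectors are stored and read per token. The sketch below compares per-token cache size for a hypothetical model with 32 query heads, first with one KV head per query head and then with 8 shared KV heads. The shapes are assumptions for illustration.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Bytes of K and V stored per token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element  # 2 = key + value


# Hypothetical shapes: 32 layers, head dimension 128, 16-bit cache values.
full_heads = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
grouped = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

print(f"one KV head per query head: {full_heads // 1024} KiB per token")
print(f"8 shared KV heads:          {grouped // 1024} KiB per token")
print(f"cache shrinks by {full_heads // grouped}x")
```

The saving applies to cache memory and cache reads only. Weight traffic is mostly unchanged, which is why the fixes in this list tend to be combined.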

There is no single universal fix because the bottleneck changes with model size, prompt length, output length, batch size, hardware, and traffic shape. That is why benchmark numbers without workload shape are hard to trust.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What does memory-bound mean?
Why can decode have poor weight reuse?
How does batching help memory-bound decode?
Why can a larger context window reduce serving capacity?
