Lesson 02

Batching LLM requests

Batching is the single biggest lever for throughput, and during decode it is close to free. This lesson explains why that is true, how batch size trades against latency, and why the obvious way to batch leaves most of the win on the table.

The one idea

Because decode is memory-bound, one decode step loads the weights once and can serve many requests in roughly the same time it would serve one. Batching turns math capacity that would otherwise sit idle into throughput. The trap is static batching: forcing requests to start together and wait for the slowest one to finish.

What a batch actually is

A batch is a group of requests the GPU processes in the same forward pass. Instead of one request's activations flowing through the model, several flow through side by side, stacked into a bigger matrix. The weights are read once and applied to the whole stack.
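To make the stacking concrete, here is a minimal sketch in plain Python. The `forward` function, the tiny weight matrix, and the two request vectors are all made up for illustration; a real model would run this as a single matrix multiply on the GPU. The point is only that stacking rows gives each request the same result it would get alone.

```python
def forward(weights, activations):
    """Apply one weight matrix to a stack of activation rows.

    weights:     shape [d_in][d_out]
    activations: shape [batch][d_in], one row per request
    """
    d_out = len(weights[0])
    return [
        [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(d_out)]
        for x in activations
    ]


# Toy values, not taken from any real model.
weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # d_in = 3, d_out = 2
req_a = [1.0, 0.0, 0.0]
req_b = [0.0, 1.0, 1.0]

# Two separate passes: the weights are used once per request.
one_at_a_time = [forward(weights, [req_a])[0], forward(weights, [req_b])[0]]

# One batched pass: the same weights are applied to the whole stack.
batched = forward(weights, [req_a, req_b])

print(batched == one_at_a_time)
```

Each row of the output belongs to one request, so batching changes how the work is scheduled, not what each request computes.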

That stacking is why batching exists. In the previous course you saw that decode is memory-bound: the GPU spends its time loading weights, and the math units sit mostly idle. A batch fills those idle math units with other requests' work without paying for another weight load. You were going to read the weights anyway, so you may as well put them to use for more than one request.

Why decode batching is nearly free

Think of the weight load as a fixed toll you pay every decode step, no matter what. With a batch of one, you pay the toll and do one request's worth of math. With a batch of 32, you pay the same toll and do 32 requests' worth of math, and that extra math runs on units that were idle before. Up to a point, the step takes almost the same wall-clock time whether the batch holds 1 request or 32.

So throughput climbs almost linearly with batch size at first, while per-step latency barely moves. This is the regime serving teams live for: more users served at almost no latency cost. It does not last forever. Eventually the batch gets big enough that the math itself becomes the bottleneck, the step starts taking longer, and you cross from memory-bound into compute-bound. Past that knee, bigger batches buy throughput by spending latency.
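The toll picture can be sketched as a toy model in which a decode step takes as long as the slower of the weight load and the batch's math. The two timing constants below are hypothetical and chosen only to make the arithmetic easy to follow. The model is also idealized: throughput goes completely flat at the knee, where real hardware has a softer transition.

```python
# Hypothetical timings in microseconds, for illustration only.
WEIGHT_LOAD_US = 20_000      # fixed toll: stream the weights once per step
MATH_US_PER_REQUEST = 100    # math time each request adds to a step


def decode_step_us(batch_size):
    """Toy model: the step is as slow as the slower of load and math."""
    return max(WEIGHT_LOAD_US, batch_size * MATH_US_PER_REQUEST)


def tokens_per_second(batch_size):
    """Every request in the batch emits one token per decode step."""
    return batch_size * 1_000_000 / decode_step_us(batch_size)


# Batch size at which the math takes as long as the weight load.
KNEE = WEIGHT_LOAD_US // MATH_US_PER_REQUEST

for batch_size in (1, 32, KNEE, 2 * KNEE):
    print(batch_size, decode_step_us(batch_size), tokens_per_second(batch_size))
```

With these made-up numbers, batch sizes 1 and 32 both take 20,000 microseconds per step, so throughput scales with the batch. At twice the knee the step takes twice as long, and in this idealized model the extra requests add latency without adding throughput.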

Before the knee, bigger batches add throughput for almost no latency. After it, every extra request slows the whole batch.

Prefill batches differently from decode

One nuance from the previous course matters here. Prefill is already compute-heavy, because it processes every prompt token in parallel. So prefill does not have the same pool of idle math to give away. Batching helps decode enormously and helps prefill much less. A server that mixes both phases has to think about them separately, which is exactly the scheduling problem the final lesson of this course tackles.
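One rough way to see the difference is to count how many tokens' worth of math each forward pass does per weight load. The function name and the 512-token prompt below are assumptions for illustration, and the count ignores attention cost entirely.

```python
def tokens_per_weight_load(phase, batch_size, prompt_len):
    """Count tokens processed in one forward pass (one read of the weights).

    Prefill pushes every prompt token of every request through at once.
    Decode pushes one new token per request.
    """
    if phase == "prefill":
        return batch_size * prompt_len
    if phase == "decode":
        return batch_size
    raise ValueError(f"unknown phase: {phase}")


PROMPT_LEN = 512  # hypothetical prompt length

print(tokens_per_weight_load("prefill", 1, PROMPT_LEN))
print(tokens_per_weight_load("decode", 1, PROMPT_LEN))
print(tokens_per_weight_load("decode", 32, PROMPT_LEN))
```

A single prefill request already brings hundreds of tokens of math per weight load, while a single decode request brings one. That is why decode has spare math to give to a batch and prefill mostly does not.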

Where static batching wastes the win

The naive way to batch is to collect a group of requests, run them all to completion together, then accept the next group. This is static, or request-level, batching. It is simple and it is wasteful for two reasons.

Requests finish at different times. One user asks for a three-word answer, another asks for a thousand-token essay. In a static batch, the short request is done in a few steps but its slot stays occupied until the long one finishes. Those finished slots are never handed to another request; they sit there decoding padding and doing no useful work. The batch runs at the speed of its slowest member.

New requests wait at the door. A request that arrives one step after the batch starts cannot join. It waits for the entire current batch to drain before the next batch forms. Under load, that queueing time lands directly on TTFT (time to first token), and it is the single worst part of the static-batching experience.

Figure: a static batch over time. Solid is useful work, faded is wasted slot time. The batch cannot take new requests or release the GPU until req D, the longest, finishes.
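Both kinds of waste can be put into numbers with a small sketch. The four output lengths below are invented for illustration (only the fact that req D is the longest comes from the lesson), and one decode step is assumed to produce one token per active request.

```python
def static_batch_utilization(output_lengths):
    """Fraction of slot-steps that do useful work in a static batch.

    The batch runs for as many steps as its longest request, and every
    slot stays occupied for the whole run.
    """
    steps = max(output_lengths)
    total_slot_steps = steps * len(output_lengths)
    useful_slot_steps = sum(output_lengths)
    return useful_slot_steps / total_slot_steps


def steps_waiting_at_the_door(output_lengths, arrival_step):
    """Decode steps a late arrival waits before the next batch can form."""
    return max(max(output_lengths) - arrival_step, 0)


# Hypothetical output lengths in tokens for requests A, B, C and D.
lengths = {"A": 3, "B": 40, "C": 120, "D": 1000}

utilization = static_batch_utilization(list(lengths.values()))
wait = steps_waiting_at_the_door(list(lengths.values()), arrival_step=1)

print(f"useful slot time: {utilization:.1%}")
print(f"late arrival waits {wait} steps")
```

With these lengths, fewer than a third of the slot-steps do useful work, and a request that arrives one step late waits 999 steps before it can even start.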

This is the gap the next lesson closes. If the server could let a finished request leave and a waiting one take its place mid-flight, none of that slot time would be wasted and no one would queue behind the slowest essay in the batch. That idea is continuous batching, and it is the reason a modern serving stack can do far better than this picture.

Engineering reality

Batch size is not free when it comes to memory. Every request in the batch carries its own KV cache, and the cache grows with each token. A big batch of long-context requests can run the GPU out of memory even when there is plenty of compute headroom. In practice the real ceiling on batch size is usually KV cache memory, not math, which is why memory-efficient cache management is part of the same conversation as throughput.
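A back-of-the-envelope calculation shows how quickly the cache caps the batch. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) and the 16 GiB cache budget are hypothetical; substitute the numbers for your own model and GPU.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Each token stores one key and one value vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_batch_size(cache_budget_bytes, tokens_per_request, bytes_per_token):
    """Largest batch whose KV caches fit in the budget at a given context length."""
    return cache_budget_bytes // (tokens_per_request * bytes_per_token)


# Hypothetical model shape and memory budget, for illustration only.
PER_TOKEN = kv_bytes_per_token(
    num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2
)
CACHE_BUDGET = 16 * 2**30  # 16 GiB left over for the KV cache

for context_len in (4096, 32768):
    print(context_len, max_batch_size(CACHE_BUDGET, context_len, PER_TOKEN))
```

Under these assumptions each token costs 128 KiB of cache, so the budget holds 32 requests at 4,096 tokens each but only 4 at 32,768 tokens. Whether the memory ceiling arrives before the compute knee depends on the model, the context lengths, and the hardware.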

Checkpoint

You're ready for the next lesson if you can answer these from memory:

Why is adding requests to a decode batch nearly free up to a point?
What is the "knee" where bigger batches start costing latency?
Name the two ways static batching wastes the GPU.
Why does batch size often bottleneck on memory rather than compute?

Quick check

The weight load dominates the step, and the extra requests fill otherwise-idle math units
The GPU does less math for a bigger batch
A bigger batch uses a smaller version of the model

It is immediately given to a newly arrived request
It stays occupied and idle until the longest request in the batch finishes
The long request is forced to finish faster