Latency vs throughput
Almost every serving decision is a trade between two goals that pull in opposite directions: making one request finish fast, and making the whole fleet serve more requests per dollar. Get clear on the two before you touch a single knob.
Latency is how long one request takes. Throughput is how many requests the system finishes per second. On a GPU running an LLM, the cheapest way to raise throughput is to make each request wait a little, so the two are usually in tension. The whole course is about loosening that tension: getting more throughput at a given latency, or lower latency at a given throughput.
Two numbers, two owners
Latency is the user's number. It is the time from pressing send to getting an answer. If you are the person staring at a chat box, this is the only thing you feel.
Throughput is the operator's number. It is how many requests, or how many tokens, the deployment can push through per second. Divide your GPU bill by that number and you get cost per token, which is the metric that decides whether the product can exist at the price you want to charge.
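To make the "divide your GPU bill by throughput" step concrete, here is a minimal sketch. The hourly price and the two throughput figures are hypothetical placeholders, not measurements; substitute your own.

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    """Dollars spent to generate one million tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


# Hypothetical numbers: a $2.00/hour GPU at two sustained throughputs.
serving_alone = cost_per_million_tokens(2.00, 50)      # one request at a time
serving_batched = cost_per_million_tokens(2.00, 2000)  # many requests sharing the GPU

print(f"alone:   ${serving_alone:.2f} per million tokens")
print(f"batched: ${serving_batched:.2f} per million tokens")
```

The hardware and the bill are the same in both cases; only the denominator changes, which is why throughput decides the price you can charge.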
These two are not the same goal wearing different clothes. You can have great throughput and terrible latency: a system that processes a thousand requests per second but takes nine seconds to answer any one of them. You can have great latency and terrible throughput: a system that answers in 200 milliseconds but only if exactly one person is using it. Production needs both to be acceptable at once, and that is the hard part.
Why one request wastes a GPU
To see why the two fight, you have to remember what the decode phase actually does. For every new token, the GPU loads the full set of model weights out of memory, multiplies them against a tiny amount of activation data for the in-flight request, and produces one token. Loading the weights is the slow part. The math is almost an afterthought.
That is the punchline of memory-bound inference: during decode, the GPU spends most of its time moving weights, not computing. When a single request runs alone, those weights get loaded, used for one request's worth of math, and thrown away. The expensive part of the trip happened, and you carried one passenger.
The fix that runs through this whole course: load the weights once and let many requests share that load. That is batching, and it is nearly free during decode precisely because the GPU was idle on math anyway. It stays nearly free only until the batch is large enough that the math itself becomes the bottleneck. The cost is that those requests now travel together, which can make any one of them wait.
The tradeoff, stated plainly
Picture a single GPU. If you serve requests one at a time, each user gets the machine to themselves and latency is as low as it goes, but you are paying for a GPU that is mostly waiting on memory. If you pack 64 requests together, you load the weights once for all 64 and your cost per token plummets, but the batch now does more work per step, so each user's tokens come a little slower, and a request that arrives mid-batch may wait for a slot.
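The comparison between one request and 64 can be sketched with a back-of-envelope model of a single decode step. This is an illustration under stated assumptions, not a benchmark: step time is taken as the larger of memory time and math time, and every hardware and model figure below (14 GB of weights, 2 TB/s of bandwidth, 300 TFLOP/s, 0.1 GB of per-request cache traffic) is a hypothetical round number. Real servers add scheduling and kernel overheads that this ignores.

```python
def decode_step_seconds(
    batch_size: int,
    weight_bytes: float = 14e9,         # hypothetical: ~7B parameters at 2 bytes each
    kv_bytes_per_request: float = 0.1e9,  # hypothetical per-request cache read per step
    mem_bandwidth: float = 2e12,        # hypothetical: 2 TB/s
    flops_per_token: float = 14e9,      # rough rule: ~2 FLOPs per parameter per token
    peak_flops: float = 300e12,         # hypothetical: 300 TFLOP/s
) -> float:
    """Estimated time for one decode step that emits one token per request."""
    # Weights are read once per step no matter how many requests share them.
    memory_time = (weight_bytes + batch_size * kv_bytes_per_request) / mem_bandwidth
    # Math grows linearly with the number of requests in the batch.
    math_time = batch_size * flops_per_token / peak_flops
    return max(memory_time, math_time)


def tokens_per_second(batch_size: int) -> float:
    """Server-wide decode throughput at a given batch size."""
    return batch_size / decode_step_seconds(batch_size)


for batch in (1, 64):
    step_ms = decode_step_seconds(batch) * 1000
    print(f"batch={batch:3d}  step={step_ms:5.2f} ms  throughput={tokens_per_second(batch):7.0f} tok/s")
```

Under these assumptions the step time, which is what each user feels as the gap between tokens, grows by less than half going from 1 to 64 requests, while server throughput grows by more than 40x. The exact ratios depend entirely on the assumed numbers; the shape of the trade is the point.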
So the dial reads: low latency on the left, high throughput on the right. Naive serving makes you pick a point and live with it. Every technique in this course is a way to get more of both than the dial alone would allow, by being smarter about how requests share the machine.
The numbers worth naming
Two families of latency, one of throughput. Keep them separate.
- TTFT (time to first token). How long the user waits in silence before anything appears. Dominated by prefill and by queueing before the request even starts.
- ITL / TPOT (inter-token latency, or time per output token). The gap between streamed tokens once they start. This sets how fast the answer "types" itself.
- Throughput. Tokens per second across all requests on the server, or requests per second. This is the number that divides into your bill.
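The three metrics above can be computed from nothing more than timestamps recorded at the client. The sketch below assumes you have logged when each request was sent and when each streamed token arrived; the function names and the sample timestamps are illustrative, not taken from any particular serving framework.

```python
def request_metrics(sent_at: float, token_times: list[float]) -> dict[str, float]:
    """Per-request latency metrics from client-side timestamps (seconds, ascending)."""
    if not token_times:
        raise ValueError("request produced no tokens")
    # TTFT: silence between sending the request and seeing the first token.
    ttft = token_times[0] - sent_at
    # ITL: gaps between consecutive streamed tokens.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    # E2E: send to last token.
    e2e = token_times[-1] - sent_at
    return {"ttft": ttft, "mean_itl": mean_itl, "e2e": e2e}


def server_throughput(tokens_per_request: list[int], window_seconds: float) -> float:
    """Tokens per second across all requests finished in a wall-clock window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return sum(tokens_per_request) / window_seconds


# Illustrative timestamps: first token after 250 ms, then one token every 40 ms.
metrics = request_metrics(sent_at=0.0, token_times=[0.25, 0.29, 0.33, 0.37])
print(metrics)
print(server_throughput([4, 6], window_seconds=0.5), "tokens/s")
```

Note that TTFT and ITL are properties of one request, while throughput only exists across a window of many requests. That is one reason they have to be tracked separately.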
Notice that batching helps throughput and TTFT under load (because requests spend less time waiting in the queue), but can hurt ITL (because each decode step now does more work). That split is why you cannot manage serving with a single "it feels fast" gut check. You need the numbers separately, and you need them at the tail, not the average.
Pick your target before tuning. An interactive chat product lives or dies on TTFT and ITL. A nightly batch job that summarizes a million documents does not care about latency at all and should be tuned purely for throughput. The same server, same model, gets configured very differently for those two.
Write tail targets, not averages. A reasonable starting point for an interactive chat assistant at moderate concurrency:
- p50 TTFT under 300 ms, p99 TTFT under 1.2 s (includes queueing under load)
- p50 ITL under 40 ms, p99 ITL under 120 ms (smooth streaming, no visible stutter)
- p99 E2E under 25 s for a 512-token answer (at 40 ms per token, 512 tokens take about 20 s on their own, so the product cap on max output length sets this number)
These numbers come from the product, not the GPU. A coding assistant with 8k-token RAG context might accept higher TTFT but need tighter ITL. A voice agent often needs p99 TTFT under 500 ms total including network. Measure at the client during a load test shaped like real traffic; an empty-server benchmark will lie about p99.
Average latency lies. A p50 of 400 ms can hide a p99 of 6 seconds when a big batch or a long prompt blocks the queue. Users remember the worst answers, and SLOs are written on tail percentiles for that reason. Always report p95 and p99 next to the median, and measure them under realistic concurrency, not with a single request on an empty server.
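A small sketch shows how the median and the mean can both look healthy while the tail does not. It uses the nearest-rank definition of a percentile (other definitions interpolate and give slightly different values), and the latency samples are made up for illustration.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up load-test result: 97 requests answer in 0.4 s, 3 get stuck behind a long prompt.
latencies = [0.4] * 97 + [6.0] * 3

mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)

print(f"mean={mean:.3f}s  p50={p50}s  p95={p95}s  p99={p99}s")
print("meets a 1.2 s p99 target:", p99 <= 1.2)
```

In this sample even p95 looks fine; only p99 exposes the three stuck requests. With few samples the tail percentiles are noisy, so a load test needs enough requests for p99 to mean something.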
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- What does latency measure, and what does throughput measure?
- Why does serving a single request waste most of a GPU during decode?
- Why does batching usually raise throughput while it can hurt inter-token latency?
- Why is tail latency (p99) more honest than average latency?