Lesson 06

Scheduling and tuning throughput

Every technique in this course meets in the scheduler: the loop that decides, each step, which prefills and which decodes share the GPU. This final lesson shows how that decision shapes latency and throughput, and turns the course into a short list of knobs you actually tune.

The one idea

Prefill is compute-heavy and decode is memory-bound, and they compete for the same GPU every step. The scheduler's job is to mix them so new requests start promptly (good TTFT) without freezing the requests already streaming (good ITL). Tuning a deployment is mostly tuning that balance against your SLOs.

The conflict at the heart of serving

Pull the threads together. Continuous batching keeps a churning roster of sequences in flight. Most of those are decoding, one token per step, memory-bound. But new requests keep arriving, and each needs its prompt prefilled first, which is a burst of heavy compute. So on any given step the scheduler faces a choice: spend this step prefilling a newcomer, or spend it decoding for the requests already waiting on their next token?

Favor prefill and new requests start fast, but the in-flight streams stutter because their decode steps got bumped. Favor decode and streams stay smooth, but newcomers wait longer for their first token. This is the same latency-versus-throughput tension from lesson 01, now playing out step by step inside one server. There is no setting that wins both; there is only a balance that fits your traffic.
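To make the trade concrete, here is a toy simulation, not a model of any real engine. It assumes every step is either one whole-prompt prefill or one decode step for the running batch, and the step costs (200 ms and 20 ms), the policy names, and the "admit after 10 decode steps" rule are all made up for illustration.

```python
from collections import deque

PREFILL_MS = 200   # assumed cost of one whole-prompt prefill step
DECODE_MS = 20     # assumed cost of one decode step for the running batch

def simulate(policy, arrivals_ms, horizon_ms=2000):
    """Toy single-GPU loop: each step is either one prefill or one decode.

    policy: "prefill_first" or "decode_first" (hypothetical names).
    Returns (mean TTFT of admitted requests, worst gap between decode steps).
    """
    waiting = deque(sorted(arrivals_ms))
    now, last_decode, worst_gap = 0, 0, 0
    decodes_since_prefill = 0
    ttfts = []
    while now < horizon_ms:
        ready = bool(waiting) and waiting[0] <= now
        # decode_first admits a newcomer only after 10 uninterrupted decode steps
        admit = ready and (policy == "prefill_first" or decodes_since_prefill >= 10)
        if admit:
            arrived = waiting.popleft()
            now += PREFILL_MS
            ttfts.append(now - arrived)
            decodes_since_prefill = 0
        else:
            now += DECODE_MS
            worst_gap = max(worst_gap, now - last_decode)
            last_decode = now
            decodes_since_prefill += 1
    mean_ttft = sum(ttfts) / len(ttfts) if ttfts else 0.0
    return mean_ttft, worst_gap

arrivals = [100, 110, 120]   # three requests landing almost together
for policy in ("prefill_first", "decode_first"):
    mean_ttft, worst_gap = simulate(policy, arrivals)
    print(f"{policy}: mean TTFT {mean_ttft:.0f} ms, worst decode gap {worst_gap} ms")
```

With these assumed numbers, prefill_first gives a mean TTFT of 390 ms but stalls the streams for 620 ms, while decode_first keeps the worst stall at 220 ms and pushes mean TTFT to 690 ms. Neither policy wins both columns.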

Chunked prefill: stop letting big prompts freeze everyone

The worst case is a single huge prompt. Prefilling 30,000 tokens in one step monopolizes the GPU, and every streaming user sees their tokens pause until it finishes. Lesson 03 named the fix: chunked prefill. Instead of prefilling the whole prompt at once, the scheduler slices it into chunks and processes one chunk per step, interleaving those chunks with ongoing decode.

Slicing the prefill lets in-flight decodes keep getting steps, so a giant prompt no longer freezes everyone else's stream.

Chunked prefill is the single setting that most often rescues a server whose tail latency falls apart under mixed traffic. It slightly raises the newcomer's own TTFT, because their prefill is spread over more steps, in exchange for protecting everyone else's ITL. That is usually the right trade.

The idea is formalized in Sarathi-Serve, which names chunked prefill as the fix for prefill-decode interference: long prompts no longer monopolize a step, and decode keeps getting regular turns. vLLM, SGLang, and TensorRT-LLM expose chunk-size knobs under different names; the underlying trade is the same.
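A minimal sketch of the slicing, assuming the simplest possible policy: every step carries one prefill chunk plus one decode token for each in-flight sequence. The function name and the dictionary shape are hypothetical; real engines pack steps against a token budget and handle many prompts at once.

```python
def plan_steps(prompt_tokens, chunk_size, running_decodes):
    """Return the per-step work when one prompt is prefilled in chunks.

    Each step carries one prefill chunk plus one decode token for every
    in-flight sequence, so decode never skips a step.
    """
    steps = []
    done = 0
    while done < prompt_tokens:
        chunk = min(chunk_size, prompt_tokens - done)
        steps.append({"prefill_tokens": chunk, "decode_tokens": running_decodes})
        done += chunk
    return steps

steps = plan_steps(prompt_tokens=30_000, chunk_size=2048, running_decodes=16)
print(len(steps), steps[-1])
```

The 30,000-token prompt from the example above becomes 15 steps (14 full chunks and a final 1,328-token chunk), and the 16 in-flight streams get a token on every one of them. The newcomer pays by waiting 15 steps instead of one for its first token.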

Priorities and fairness

Not all requests deserve equal treatment. An interactive chat user waiting on a first token is more latency-sensitive than a background batch job summarizing documents. Schedulers can run priority classes so interactive traffic jumps the queue while batch work fills whatever capacity is left. You can also cap each user's share of in-flight requests so one user firing a hundred requests cannot starve everyone else. These are policy choices layered on top of the prefill-versus-decode mechanics, and they matter most when one deployment serves mixed workloads.
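One way such a policy could look, as a sketch: a priority queue with first-in-first-out order inside each class, plus a per-user cap on in-flight requests. The two class names, the cap, and the admit function are assumptions for illustration, not any engine's API.

```python
import heapq
from collections import Counter
from itertools import count

INTERACTIVE, BATCH = 0, 1   # lower number = higher priority (assumed classes)

def admit(queue, running_per_user, slots, per_user_cap):
    """Pick up to `slots` requests: highest priority first, FIFO within a
    class, skipping any user already at `per_user_cap` in-flight requests."""
    admitted, skipped = [], []
    while queue and len(admitted) < slots:
        item = heapq.heappop(queue)
        _prio, _seq, user, req_id = item
        if running_per_user[user] >= per_user_cap:
            skipped.append(item)          # stays queued for a later step
            continue
        running_per_user[user] += 1
        admitted.append(req_id)
    for item in skipped:
        heapq.heappush(queue, item)
    return admitted

seq = count()                # arrival counter keeps FIFO order within a class
queue = []
for user, prio, req_id in [("bulk", BATCH, "b1"), ("bulk", BATCH, "b2"),
                           ("bulk", BATCH, "b3"), ("alice", INTERACTIVE, "a1")]:
    heapq.heappush(queue, (prio, next(seq), user, req_id))

running = Counter()
admitted = admit(queue, running, slots=4, per_user_cap=2)
print(admitted, len(queue))
```

Here the interactive request a1 arrived last but is admitted first, and the bulk user gets two of its three requests in; b3 stays queued even though a slot is free, because that user is at the cap.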

The knobs, and what each one costs

This is the course distilled into the handful of settings you will actually touch on a vLLM- or SGLang-class engine. Each is a point on the latency-throughput dial.

Max batch size / max concurrent sequences. Higher means more throughput until the knee, then rising ITL and memory pressure. The real ceiling is usually KV cache memory, not compute.
KV cache memory fraction. How much GPU memory the engine reserves for cache. More cache means more sequences in flight, but it leaves less headroom for activations and other allocations, and once the cache fills the engine has to preempt sequences and later swap or recompute them.
Chunked prefill chunk size. Smaller chunks protect ITL for in-flight streams at a small cost to new-request TTFT. Larger chunks do the reverse.
Max model / context length. Allowing very long contexts raises worst-case per-sequence memory, which forces a smaller batch. Capping it buys back throughput.
Speculative decoding policy. On (or load-aware) for latency at low concurrency; off or backed off at peak so it does not steal compute from the batch.
Prefix / prompt caching. Reuse KV blocks for shared system prompts and RAG prefixes. Dramatically cuts TTFT on cache hits; tune cache size and eviction policy alongside the max-concurrent-sequences limit (max_num_seqs in vLLM).
Number of replicas. When one GPU's dial cannot satisfy both SLOs at your traffic, the honest answer is often more replicas behind a load balancer, not more tuning.
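The link between context length, cache fraction, and batch size is plain arithmetic, sketched below. The numbers are assumptions for illustration: roughly the shape of an 8B grouped-query-attention model (32 layers, 8 KV heads, head dimension 128, fp16 cache), an 80 GiB GPU, 16 GB of weights, and a 0.90 memory fraction. It also takes the worst case where every sequence grows to the full context length; paged caches do better in practice because most sequences stay shorter.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # Keys and values (the factor 2) for every layer and every KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_sequences(gpu_bytes, mem_fraction, weight_bytes, max_len, per_token):
    """Worst-case number of sequences that fit if each grows to max_len."""
    cache_bytes = gpu_bytes * mem_fraction - weight_bytes
    return int(cache_bytes // (max_len * per_token))

GiB = 1024 ** 3
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(per_token)   # bytes of KV cache per token

for max_len in (8192, 32768):
    n = max_sequences(80 * GiB, 0.90, 16 * 10**9, max_len, per_token)
    print(max_len, n)
```

Under these assumptions each token costs 131,072 bytes of cache, so an 8,192-token cap fits 57 worst-case sequences while a 32,768-token cap fits 14. That is the sense in which capping context length buys back throughput.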

Sidebar: disaggregated prefill and decode (2024+)

At very large scale, some deployments split prefill and decode onto different GPU pools. Prefill workers handle compute-heavy prompt processing; decode workers handle memory-bound token generation. A scheduler routes each request phase to the right pool, often over fast interconnects such as NVLink or RDMA. This avoids long prefills stalling decode on the same chip and lets you size each pool independently (more compute for prefill-heavy RAG, more memory for long outputs).

Disaggregation adds network hop latency and operational complexity. It pays off when single-node scheduling cannot meet TTFT and ITL SLOs at your traffic mix, not as a day-one choice. vLLM, TensorRT-LLM, and research systems such as DistServe explore this pattern; most teams still start with colocated continuous batching and chunked prefill.
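The routing step can be sketched in a few lines. This is a hypothetical router, not how vLLM, TensorRT-LLM, or DistServe implement it: the Pool class, the load accounting (queued prompt tokens for prefill, active sequences for decode), and the returned dictionary are all assumptions, and the KV transfer itself is only named, not performed.

```python
from dataclasses import dataclass, field

@dataclass
class Pool:
    name: str
    workers: list
    load: dict = field(default_factory=dict)   # worker -> current load

    def least_loaded(self):
        # Ties go to the first worker in the list.
        return min(self.workers, key=lambda w: self.load.get(w, 0))

def route(request_id, prompt_tokens, prefill_pool, decode_pool):
    """Pick a prefill worker by queued prompt tokens and a decode worker
    by active sequence count, then record the hand-off between them."""
    p = prefill_pool.least_loaded()
    prefill_pool.load[p] = prefill_pool.load.get(p, 0) + prompt_tokens
    d = decode_pool.least_loaded()
    decode_pool.load[d] = decode_pool.load.get(d, 0) + 1
    return {"request": request_id, "prefill_on": p, "kv_transfer_to": d}

prefill = Pool("prefill", ["p0", "p1"])
decode = Pool("decode", ["d0", "d1"])
plans = [route("r1", 4000, prefill, decode),
         route("r2", 100, prefill, decode),
         route("r3", 200, prefill, decode)]
for plan in plans:
    print(plan)
```

Because the two pools track different kinds of load, the long prompt r1 ties up p0 while the short prompts r2 and r3 both land on p1, and decode placement is balanced separately. Every request also gains a KV hand-off between workers, which is the network hop the paragraph above warns about.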

How to actually tune: SLOs first

Do not tune for a vague "fast." Write the targets down first, as tail numbers: for example, p95 TTFT under 500 ms and p95 ITL under 80 ms at the concurrency you expect at peak. Those numbers come from the product, not the GPU. Then load-test against a realistic replay of prompt and output lengths at a realistic arrival rate, push concurrency up, and watch where a target breaks. Raise batch size, then cache fraction, one at a time for throughput until TTFT or ITL crosses the line, then back off one notch. If you cannot meet both targets on one replica, add replicas. Tuning is this loop: change one knob, measure the tail under real traffic, keep what holds the SLO.
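The loop can be written down as a check you run after each load test. This sketch uses the example targets above (p95 TTFT under 500 ms, p95 ITL under 80 ms) and a nearest-rank percentile; the shape of the results list is a hypothetical load-test output, and the samples are invented.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

SLO = {"ttft_ms": 500, "itl_ms": 80}   # p95 targets, taken from the product

def meets_slo(ttft_ms, itl_ms, pct=95):
    observed = {"ttft_ms": percentile(ttft_ms, pct),
                "itl_ms": percentile(itl_ms, pct)}
    return all(observed[k] < SLO[k] for k in SLO), observed

def largest_passing(results):
    """results: (batch_size, ttft_samples, itl_samples) tuples ordered by
    rising batch size. Returns the last size before the first SLO miss."""
    best = None
    for batch_size, ttft, itl in results:
        ok, _ = meets_slo(ttft, itl)
        if not ok:
            break
        best = batch_size
    return best

results = [
    (8,  [100] * 20, [30] * 20),
    (16, [200] * 20, [60] * 20),
    (32, [450] * 20, [30] * 18 + [120] * 2),   # ITL tail breaks here
]
print(largest_passing(results))
```

At batch size 32 most ITL samples are still a comfortable 30 ms, but the p95 is 120 ms, so the check backs off to 16. That is "back off one notch" expressed as code.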

Practical read

Change one knob at a time and always measure the tail, not the average. Many "optimizations" improve p50 while quietly wrecking p99, which is the number your users and your SLO actually live on. If a change helps the median but the 99th percentile gets worse, you usually made the system feel worse, not better.
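A small made-up example of how that happens. The two sample sets below are invented ITL measurements in milliseconds, and the tail helper is a simple nearest-rank percentile.

```python
import math
import statistics

def tail(samples, pct):
    """Nearest-rank percentile of the samples."""
    ordered = sorted(samples)
    return ordered[math.ceil(pct * len(ordered) / 100) - 1]

before = [50] * 98 + [120, 130]   # hypothetical ITL samples, in ms
after = [40] * 95 + [300] * 5     # same workload after an "optimization"

for name, samples in (("before", before), ("after", after)):
    print(name, "p50:", statistics.median(samples), "p99:", tail(samples, 99))
```

The median improves from 50 ms to 40 ms while p99 goes from 120 ms to 300 ms. A dashboard that only plots the median would call this a win.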

Engineering reality

The cheapest throughput win is almost never a scheduler knob. It is sending fewer tokens. Trimming a bloated system prompt, cutting retrieved context that does not earn its place, and capping max output length reduce prefill and decode work for every single request at once, and they cost nothing in hardware. Exhaust the prompt-shape savings before you go shopping for GPUs. This is also where this course hands off to Serving & Economics, where those same numbers turn into dollars.
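A back-of-the-envelope version of that claim, assuming prefill work scales with prompt tokens and decode work with output tokens. The request rate and every token count are invented, and max_output is treated as the worst case in which every response runs to the cap.

```python
def tokens_per_second(req_per_s, system, retrieved, user, max_output):
    """Return (prefill tokens/s, worst-case decode tokens/s) at a request rate."""
    prefill = req_per_s * (system + retrieved + user)
    decode = req_per_s * max_output
    return prefill, decode

before = tokens_per_second(20, system=1500, retrieved=6000, user=200, max_output=1024)
after = tokens_per_second(20, system=600, retrieved=2500, user=200, max_output=512)
print("before:", before)
print("after: ", after)
```

In this example trimming the prompt cuts prefill from 154,000 to 66,000 tokens per second and halving the output cap cuts worst-case decode from 20,480 to 10,240, all on the same hardware.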

Checkpoint

You're ready to finish the course if you can answer these from memory:

Why do prefill and decode compete, and how does favoring each affect TTFT and ITL?
What does chunked prefill trade away, and what does it protect?
Name three serving knobs and the latency-throughput cost of turning each up.
Why should tuning start from written SLOs measured at the tail under realistic traffic?
What problem does Sarathi-style chunked prefill solve?

Quick check

Chunked prefill, so the big prompt is processed in slices interleaved with decode
Increase the maximum batch size
Reduce the KV cache memory fraction

A win, since the average is what matters
Likely a regression, because users and SLOs live on the tail, not the average
Irrelevant, because only throughput matters

Where this course leaves you

You can now read a serving stack the way an operator does. Batching turns idle math into throughput. Continuous batching keeps that batch full without making people queue. Speculative decoding spends spare compute to cut single-stream latency. Streaming reshapes the wait into something users tolerate, if you handle backpressure and cancellation. And the scheduler arbitrates all of it against your SLOs. The next course, Serving & Economics, takes these mechanics and turns them into cost per token, self-host versus API, and the dollar decisions that follow.