Lesson 03

Continuous batching

Static batching wastes slots and makes requests queue behind the slowest answer. Continuous batching fixes both by making a scheduling decision at every decode step instead of once per batch. It is arguably the most important throughput technique in modern serving.

The one idea

Stop thinking of a batch as a fixed group that starts and finishes together. Treat the batch as a roster the scheduler rebuilds at every decode step: any sequence that has finished leaves, any waiting request joins, and the GPU neither decodes empty slots nor makes new requests wait for the batch to drain.

From request-level to iteration-level

Static batching schedules at the level of a request: pick a group, run it to completion, repeat. Continuous batching, sometimes called iteration-level scheduling or in-flight batching, schedules at the level of a single forward pass. The term iteration-level scheduling comes from Orca (OSDI 2022), which showed that treating each model iteration as its own scheduling unit beats waiting for whole requests to finish. vLLM and most modern engines implement this Orca-style loop. After every decode step the server asks two questions. Did any sequence just emit its end token or hit its length limit? If so, evict it and free its slot. Is there a request waiting in the queue and a free slot for it? If so, admit it.

The result is a batch whose membership changes constantly. There is no moment where everyone starts together and no moment where the GPU sits decoding padding while it waits for one long answer to finish. The roster is always as full of useful work as the queue and memory allow.
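The evict-and-admit loop can be sketched in a few lines. This is a toy simulation, not any engine's actual scheduler: the `Request` class, the `run_continuous` function, and the fixed `tokens_left` counts are all made up for illustration (a real engine only learns a sequence is done when it emits its end token), and prefill cost is ignored.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    tokens_left: int  # decode steps remaining (known here only because this is a toy)


def run_continuous(requests, num_slots):
    """Return the roster (request names) that ran at each decode step."""
    waiting = deque(requests)
    running = []
    history = []
    while waiting or running:
        # Admit: fill every free slot from the queue before the step runs.
        while waiting and len(running) < num_slots:
            running.append(waiting.popleft())
        # One decode step: every running sequence emits one token.
        history.append([r.name for r in running])
        for r in running:
            r.tokens_left -= 1
        # Evict: finished sequences free their slots immediately.
        running = [r for r in running if r.tokens_left > 0]
    return history


reqs = [Request("A", 3), Request("B", 1), Request("C", 2), Request("D", 1)]
for step, roster in enumerate(run_continuous(reqs, num_slots=2)):
    print(step, roster)
```

With two slots, B finishes after the first step and C takes its place on the very next one, while A keeps decoding throughout. Nothing waits for A.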

Figure: four slots (slot 1 to slot 4) stay busy as requests come and go. Each colored block is a different request; a slot refills the step after one finishes.
No faded idle blocks. The instant a request finishes, the next one slides into its slot, so the GPU is always doing useful work.

Why this wins on both axes

Recall the two goals from lesson 01. Continuous batching improves both at once, which is rare.

Throughput goes up because no slot decodes padding. In static batching, slots freed by short requests were wasted until the batch drained. Here they are immediately reused, so the average number of useful sequences in flight is much higher for the same GPU.

Latency goes down under load because a new request no longer waits for an entire batch to finish. It waits only for a slot to open, which happens often because short requests keep completing. Time to first token (TTFT) under concurrency, the thing static batching was worst at, gets dramatically better.
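A small back-of-the-envelope simulation makes the throughput gap concrete. The output lengths below are hypothetical, both functions are illustrative helpers rather than anything from a real engine, and prefill is ignored so that one step means one token per running sequence.

```python
def static_steps(lengths, num_slots):
    """Decode steps when each group of num_slots runs until its longest member ends."""
    return sum(
        max(lengths[i:i + num_slots]) for i in range(0, len(lengths), num_slots)
    )


def continuous_steps(lengths, num_slots):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = list(lengths)[::-1]  # reversed so pop() yields arrival order
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < num_slots:
            running.append(waiting.pop())
        steps += 1
        # Each running sequence emitted one token; drop the ones that are done.
        running = [n - 1 for n in running if n > 1]
    return steps


lengths = [8, 1, 1, 1, 8, 1, 1, 1]  # output tokens per request (hypothetical)
useful = sum(lengths)
for name, fn in [("static", static_steps), ("continuous", continuous_steps)]:
    steps = fn(lengths, num_slots=4)
    print(name, steps, f"utilization={useful / (steps * 4):.0%}")
```

On this workload static batching needs 16 steps and continuous batching needs 9, for the same 22 useful tokens. The latency effect shows up too: the fifth request starts decoding on step 2 under continuous batching, but has to wait until step 9 under static batching.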

Prefill and decode have to share the step

There is a wrinkle. A newly admitted request needs its prompt prefilled before it can decode, and prefill is the compute-heavy phase. If the scheduler stops everything to prefill a long prompt, the requests already streaming will stall for that step, and their users will see a hitch in the token stream.

Early continuous-batching systems handled this by pausing decode whenever a new prompt arrived. Modern systems are smarter: they break a long prefill into pieces and interleave those pieces with ongoing decode, so admitting a new request does not freeze everyone else. That technique, chunked prefill, is part of the scheduling story in the final lesson. For now the point is that continuous batching is not just "evict and admit." The scheduler is constantly balancing new prefills against in-flight decodes.
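One plausible way to picture the interleaving is a per-step token budget: in-flight decodes get their one token each first, and whatever budget is left goes to the next slice of the waiting prompt. The sketch below assumes exactly that policy. The `plan_steps` function and its parameters are hypothetical, and it holds the number of decoding sequences constant for simplicity, which real traffic would not.

```python
def plan_steps(num_decoding, prompt_len, token_budget):
    """Split one prompt's prefill across steps that also serve in-flight decodes.

    Each step processes at most token_budget tokens: one per decoding
    sequence first, then the remainder goes to the next prefill chunk.
    Returns a list of (decode_tokens, prefill_tokens), one entry per step.
    """
    if token_budget <= num_decoding:
        raise ValueError("budget leaves no room for prefill")
    plan = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(remaining, token_budget - num_decoding)
        plan.append((num_decoding, chunk))
        remaining -= chunk
    return plan


print(plan_steps(num_decoding=3, prompt_len=1000, token_budget=259))
# [(3, 256), (3, 256), (3, 256), (3, 232)]
```

The 1000-token prompt takes four steps to prefill instead of one, but the three streaming requests emit a token on every one of those steps. That is the trade: slightly slower admission for the newcomer, no hitch for everyone else.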

Practical read

You usually do not implement continuous batching yourself. You get it by running an engine that has it: vLLM, TGI, TensorRT-LLM, SGLang, and similar. Your job is to understand it well enough to tune it, read its metrics, and not fight it. If you ever see throughput collapse when one user sends a giant prompt, you are watching prefill starve decode, and the fix is usually a scheduler setting, not more GPUs.

The memory side: paged KV cache

Continuous batching only pays off if the server can actually fit a shifting set of sequences in memory. Each sequence holds a KV cache that grows token by token, and you do not know in advance how long any answer will be. Reserving worst-case memory per slot would waste most of the GPU.
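To get a feel for the waste, here is the arithmetic with made-up numbers. The model shape, the 4096-token worst case, and the 512-token typical length are all assumptions chosen to keep the numbers round, not measurements from any particular model or workload.

```python
# Hypothetical model shape, chosen only to make the arithmetic concrete.
num_layers = 32
num_kv_heads = 32
head_dim = 128
bytes_per_value = 2   # fp16
max_len = 4096        # worst-case tokens reserved per slot (assumed)
typical_len = 512     # assumed typical prompt + answer length

# Each token stores one key and one value vector per layer and per KV head.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
reserved = kv_bytes_per_token * max_len
used = kv_bytes_per_token * typical_len

print(kv_bytes_per_token // 1024, "KiB per token")              # 512
print(reserved // 2**20, "MiB reserved per slot")               # 2048
print(f"{1 - used / reserved:.1%} of the reservation unused")   # 87.5%
```

Under these assumptions every slot pins 2 GiB to hold a sequence that typically needs a quarter of a GiB.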

The companion idea is PagedAttention: store each sequence's KV cache in small fixed-size blocks, allocated on demand, the way an operating system pages memory. A sequence grabs another block only when it needs one, and frees its blocks the moment it finishes. This is what lets the roster churn freely without fragmenting memory, and it is why vLLM in particular became the reference implementation. Efficient attention kernels such as FlashAttention matter here too: they cut memory traffic during prefill and decode, which raises how many sequences can share a step before you hit the knee.

Continuous batching and paged KV cache are two halves of the same design: one keeps the compute busy, the other keeps the memory packed.
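The block bookkeeping can be sketched as a tiny allocator. This is a toy model of the idea only: the `BlockAllocator` class and its methods are invented for illustration and track block ids, not actual tensors, and it does not reflect vLLM's real data structures.

```python
class BlockAllocator:
    """Toy paged KV cache: each sequence owns a list of fixed-size block ids."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # seq_id -> list of block ids (the block table)
        self.lengths = {}  # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Record one more token; take a new block only when the last one is full."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("A")  # 17 tokens need 2 blocks
for _ in range(16):
    alloc.append_token("B")  # 16 tokens fit in 1 block
print(len(alloc.tables["A"]), len(alloc.tables["B"]), len(alloc.free))  # 2 1 1
alloc.release("A")
print(len(alloc.free))  # 3
```

Because any free block can serve any sequence, a finished request's memory is immediately usable by whoever is admitted next, and the most a sequence ever wastes is the unused tail of its last block.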

Figure: Three decode iterations (Orca-style scheduling).
  step t:    req A, req B, req C, req D (finishes this step). D leaves, E joins.
  step t+1:  req A, req B (finishes this step), req C, req E (newly admitted). B leaves, F joins.
  step t+2:  req A, req F (newly admitted), req C, req E.
Each iteration rebuilds the batch. Finished work leaves immediately; waiting work joins on the next step instead of blocking behind a static group.

Landmark paper

Orca (OSDI 2022) is the paper that named iteration-level scheduling and showed why request-level batching leaves GPUs idle. vLLM's continuous batching loop is the practical descendant of this idea, paired with PagedAttention for memory.

Take from it: Schedule per forward pass, not per request; evict finished sequences immediately; admit waiting work on the next iteration; measure under mixed prompt/output lengths.

It skips: Chunked prefill, speculative decoding, prefix caching, and modern disaggregated prefill/decode farms. Those are covered in lessons 04–06 and the L06 sidebar.

Engineering reality

Continuous batching makes throughput depend on the live mix of traffic, which makes benchmarks tricky. A test that sends 100 identical short prompts will report a very different number from real traffic with a heavy tail of long generations. When you benchmark a continuously batched server, replay a realistic distribution of prompt and output lengths at a realistic arrival rate. A single fixed prompt length will flatter the system and mislead your capacity plan.
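A benchmark workload along these lines could be generated as follows. The sketch assumes Poisson arrivals and log-normal prompt and output lengths. Those distributions and their parameters are placeholders, and `make_workload` is a hypothetical helper. In practice you would fit the shapes to your own request logs.

```python
import random


def make_workload(n, rate_per_s, seed=0):
    """Return n (arrival_time_s, prompt_tokens, output_tokens) tuples."""
    rng = random.Random(seed)  # seeded so a benchmark run is repeatable
    t = 0.0
    workload = []
    for _ in range(n):
        # Exponential gaps between requests give Poisson arrivals.
        t += rng.expovariate(rate_per_s)
        # Log-normal lengths: most requests are short, a few are very long.
        prompt = max(1, int(rng.lognormvariate(5.5, 1.0)))
        output = max(1, int(rng.lognormvariate(4.5, 1.2)))
        workload.append((t, prompt, output))
    return workload


wl = make_workload(1000, rate_per_s=5.0)
outputs = sorted(o for _, _, o in wl)
print("median output tokens:", outputs[len(outputs) // 2])
print("p99 output tokens:", outputs[int(len(outputs) * 0.99)])
```

The gap between the median and the 99th percentile is the heavy tail a fixed-length test hides. Replaying a trace like this, at the arrival rate you expect in production, exercises the evict-and-admit churn that determines real throughput.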

Checkpoint

You're ready for the next lesson if you can answer these from memory:

  • What does "iteration-level" scheduling mean compared to "request-level"?
  • How does continuous batching improve throughput and TTFT at the same time?
  • Why can admitting a new request's prefill stall the requests already decoding?
  • What problem does paged KV cache solve for continuous batching?
