Lesson 04

Speculative decoding

Decode is sequential and memory-bound, so a single stream wastes the GPU's math. Speculative decoding spends that idle math on a clever bet: guess several tokens cheaply, then check them all at once. When the bet pays off, the model emits multiple tokens for the price of one slow step.

The one idea

A small, fast draft model proposes the next few tokens. The big model then verifies all of them in a single forward pass and keeps every guess that matches what it would have produced anyway. The output distribution is provably identical to that of normal decoding (and with greedy decoding the tokens themselves are identical), but several tokens can be accepted per expensive step instead of one.

The waste it exploits

From lesson 01: during decode the big model is memory-bound. Each step loads the full weights to produce one token, and the math units are mostly idle. Verifying one token or verifying five in the same pass costs almost the same wall-clock time, because the bottleneck is loading the weights, not the math.

That is the opening. If you could somehow have five candidate tokens ready, you could check all five in one big-model pass for nearly the cost of checking one. The hard part is getting good candidates cheaply. That is the draft model's job.
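A rough way to see why five positions cost about the same as one is a back-of-envelope roofline estimate. The sketch below is illustrative only: the model size, memory bandwidth, and FLOP rate are hypothetical placeholders, and it ignores attention, the KV cache, and kernel overheads.

```python
# Back-of-envelope estimate for one forward pass of the target model.
# Every number here is a hypothetical placeholder, not a measurement.

def step_time_ms(n_tokens, params=7e9, bytes_per_param=2,
                 mem_bw=2e12, flops=3e14):
    """Estimate the time of one pass that scores n_tokens positions."""
    load_s = params * bytes_per_param / mem_bw   # read every weight once
    math_s = 2 * params * n_tokens / flops       # ~2 FLOPs per weight per token
    return max(load_s, math_s) * 1e3             # the slower resource sets the pace

for k in (1, 5, 64, 512):
    print(k, round(step_time_ms(k), 2))
```

With these assumed numbers the pass stays pinned at the weight-loading time for 1, 5, and 64 positions, and only becomes math-limited somewhere past a hundred positions.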

Draft, then verify

The loop has two players. A draft model, small and cheap, generates a short run of candidate tokens by ordinary fast decoding, say four or five of them. Then the target model, the real model you want output from, runs one forward pass over the prompt plus all the drafted tokens at once and produces its own probabilities at every position.

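Now compare. Walk the draft left to right. As long as the target agrees with a drafted token, accept it. At the first place the target disagrees, throw away the rest of the draft, take the target's own token there, and start the next round from that point. If every drafted token is accepted, the same pass also yields one extra token from the target. With greedy decoding, "agrees" means the drafted token is the target's own top choice. With sampling, a drafted token is accepted with probability min(1, p/q), where p and q are the target's and the draft's probabilities for that token, and a rejection is resolved by resampling from the target's distribution with the draft's share subtracted out. Because the acceptance rule is built from the target's probabilities, the tokens you emit follow exactly the target model's distribution. This is not an approximation of the big model. It is the big model's output, produced in fewer slow steps.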

For example, with a five-token draft: three guesses match, the fourth is corrected by the target, and the fifth is thrown away. Four real tokens come out of one expensive verify pass.
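The round described above can be sketched in a few lines for the greedy case. This is a toy under stated assumptions: the "models" are hypothetical stand-ins that read tokens from fixed sentences, and the target is called once per position for clarity, whereas a real target would score all drafted positions in a single batched forward pass.

```python
from typing import Callable, List

# A toy "model": maps the tokens so far to its greedy next token.
NextToken = Callable[[List[str]], str]

def speculative_step(prefix: List[str], draft: NextToken, target: NextToken,
                     gamma: int = 5) -> List[str]:
    """Run one draft-then-verify round and return the tokens to emit."""
    # 1. Draft gamma tokens autoregressively with the cheap model.
    drafted: List[str] = []
    for _ in range(gamma):
        drafted.append(draft(prefix + drafted))

    # 2. Verify left to right against the target's own choices.
    emitted: List[str] = []
    for tok in drafted:
        expected = target(prefix + emitted)
        emitted.append(expected)      # what we emit is always the target's token
        if expected != tok:           # first disagreement: drop the rest of the draft
            return emitted
    # 3. Every drafted token matched, so the pass also yields one bonus token.
    emitted.append(target(prefix + emitted))
    return emitted

def make_model(text: List[str]) -> NextToken:
    """Hypothetical stand-in model that recites a fixed sentence."""
    return lambda seq: text[len(seq)] if len(seq) < len(text) else "<eos>"

TARGET_TEXT = "the cat sat on the mat today".split()
DRAFT_TEXT = "the cat sat in a mat today".split()   # diverges at the 4th token

out = speculative_step([], make_model(DRAFT_TEXT), make_model(TARGET_TEXT))
print(out)  # ['the', 'cat', 'sat', 'on']
```

Because every emitted token is the target's own greedy choice, the result matches what plain greedy decoding of the target would produce; the draft only decides how many of those tokens arrive per round.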

Acceptance rate is everything

The whole payoff rides on how often the draft is right. If the draft agrees with the target most of the time, you accept long runs and the speedup is large, often roughly two to three times faster end to end. If the draft is usually wrong, you emit only about one token per round (the target's own) and you have done the draft work for nothing, which makes things slower than plain decoding.

Let α be the per-step acceptance probability (how often the draft's next token matches what the target would have sampled) and γ the number of draft tokens proposed per round. Under the standard rejection-sampling rule from Leviathan et al., the expected number of output tokens per target-model forward pass is:

E[tokens per pass] ≈ (1 − α^(γ+1)) / (1 − α)

Example: with α = 0.75 and γ = 4, that is about 3.05 tokens per expensive step instead of 1, a roughly 3× decode speedup if draft cost is negligible. Drop α to 0.4 and the same γ yields about 1.65 tokens per step, barely worth the overhead. This is why draft pairing matters more than draft length.
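The expected-tokens formula is easy to evaluate and to sanity-check by simulation. The sketch below assumes, as the formula does, that each drafted token is accepted independently with the same probability α; real acceptance varies by position and content, so treat it as a model, not a prediction.

```python
import random

def expected_tokens(alpha: float, gamma: int) -> float:
    """Closed form: expected tokens emitted per target forward pass."""
    if alpha == 1.0:
        return gamma + 1.0
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def simulate(alpha: float, gamma: int, rounds: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo check under the same independence assumption."""
    rng = random.Random(seed)
    total = 0
    for _ in range(rounds):
        accepted = 0
        while accepted < gamma and rng.random() < alpha:
            accepted += 1
        total += accepted + 1   # plus the target's own token (correction or bonus)
    return total / rounds

for alpha in (0.4, 0.75, 0.9):
    print(alpha, round(expected_tokens(alpha, 4), 2), round(simulate(alpha, 4), 2))
```

The "+ 1" in the simulation is the token the target always contributes in a round, either the correction at the first miss or the bonus token after a fully accepted draft.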

Acceptance depends on the pairing. A draft model that is a smaller sibling of the target, trained on similar data, agrees often. A random tiny model does not. Acceptance is also higher on predictable text (boilerplate, code with rigid structure, formatting) and lower on genuinely surprising content. The draft length is a tuning knob too: guess too few and you leave speedup on the table, guess too many and you waste draft effort on tokens that get discarded after the first miss.
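The draft-length trade-off can be made concrete with a simple cost model. As an assumption for illustration, let c be the cost of one draft step relative to one target pass, so a round costs γ·c + 1 target-pass units and emits the expected number of tokens given earlier. The names `speedup` and `best_gamma` are hypothetical helpers, and the values of c are made up.

```python
def speedup(alpha: float, gamma: int, c: float) -> float:
    """Estimated decode speedup over plain decoding (assumes alpha < 1)."""
    tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)   # expected tokens per round
    cost = gamma * c + 1                                # gamma draft steps + 1 verify pass
    return tokens / cost

def best_gamma(alpha: float, c: float, max_gamma: int = 16) -> int:
    """Draft length that maximizes the estimated speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: speedup(alpha, g, c))

for alpha in (0.5, 0.75, 0.9):
    g = best_gamma(alpha, c=0.05)
    print(alpha, g, round(speedup(alpha, g, 0.05), 2))

# A weak, relatively expensive draft can make decoding slower than no speculation:
print(round(speedup(0.4, 4, c=0.2), 2))
```

Under this model a higher acceptance rate justifies a longer draft, and a value below 1 means speculation costs more than it saves.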

Variants worth knowing by name

You do not always need a separate model. Several designs fold the drafting into the target itself.

Draft model (the classic). A separate small model. Simple, but you have to host and load a second model.
Self-speculation / Medusa-style heads. Extra lightweight prediction heads bolted onto the target predict several future tokens at once, so no second model is needed.
Prompt lookup / n-gram drafting. For tasks with lots of repetition, the draft is just copied from text already in the context. Nearly free, and great for summarization or editing where the output echoes the input.
EAGLE and successors. Draft at the feature level rather than the token level for higher acceptance. Common in current high-performance stacks.
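Of these variants, prompt lookup is simple enough to sketch directly. The function below is a minimal illustration, not any particular library's implementation: it looks for the most recent earlier occurrence of the last few context tokens and proposes whatever followed it. The name `prompt_lookup_draft` and its parameters are hypothetical.

```python
from typing import List, Sequence

def prompt_lookup_draft(context: Sequence[str], ngram: int = 2,
                        max_draft: int = 4) -> List[str]:
    """Propose draft tokens by copying what followed the most recent earlier
    occurrence of the context's last `ngram` tokens."""
    if len(context) <= ngram:
        return []
    tail = list(context[-ngram:])
    # Scan right to left, starting just before the tail's own position.
    for start in range(len(context) - ngram - 1, -1, -1):
        if list(context[start:start + ngram]) == tail:
            follow = start + ngram
            return list(context[follow:follow + max_draft])
    return []   # no repetition found: nothing to speculate on this round

tokens = "def add ( a , b ) : return a + b def add (".split()
print(prompt_lookup_draft(tokens))  # ['a', ',', 'b', ')']
```

The proposed tokens still go through the target's verify pass like any other draft, so a bad copy costs a round but never changes the output.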

When it helps and when it backfires

Speculative decoding trades extra compute for fewer sequential steps. That trade is great when the GPU has spare compute, which is exactly the low-load, low-batch situation: a single user, a latency-sensitive request, math units idle. It cuts that user's latency without hurting anyone.

The trade turns sour under heavy load. When the server is already running a big continuous batch, the GPU is no longer memory-bound; the math units are busy with all those batched requests. Now the draft and the wider verify passes compete for compute that is already scarce, and the speculative overhead can lower total throughput. This is the key tension: speculative decoding optimizes single-stream latency, batching optimizes many-stream throughput, and they pull on the same resource.

Practical read

A common production pattern is to make speculation load-aware: turn it on when batch size is small and back off as the batch fills. That gives low latency to the lone late-night user and full throughput at peak, instead of forcing one global choice.
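One way such a load-aware policy might look is a function from current batch size to draft length. This is a sketch under assumptions: the thresholds and the linear back-off are invented for illustration, and a real policy would be tuned from measurements on the serving stack in question.

```python
def draft_length_for_load(batch_size: int, max_draft: int = 5,
                          full_until: int = 4, off_at: int = 32) -> int:
    """Pick how many tokens to draft given the current batch size.

    Hypothetical thresholds: full speculation up to `full_until` concurrent
    requests, none at `off_at` or above, linear back-off in between.
    """
    if batch_size <= full_until:
        return max_draft
    if batch_size >= off_at:
        return 0                      # 0 means: skip speculation entirely
    frac = (off_at - batch_size) / (off_at - full_until)
    return max(1, round(max_draft * frac))

for b in (1, 8, 16, 31, 64):
    print(b, draft_length_for_load(b))
```

Returning a draft length rather than a plain on/off flag lets the server taper speculation gradually instead of switching it abruptly at one batch size.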

Engineering reality

Speculative decoding does not change output quality, so do not expect it to. Its only job is speed, and its benefit is entirely empirical: measure acceptance rate and end-to-end latency on your real traffic before and after. A draft pairing that looks great on a benchmark of clean prose can have mediocre acceptance on your messy production prompts. Treat the draft model, draft length, and the on/off-by-load policy as tunables, and verify the win with numbers, not faith.
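Measuring the win can start from a very small log: the number of drafted tokens accepted in each round. The helper below is a hypothetical sketch, and the sample counts are made up for illustration; it reports the fraction of drafted tokens that were accepted and the tokens emitted per target pass.

```python
from typing import Dict, Sequence

def summarize_rounds(accepted_per_round: Sequence[int], gamma: int) -> Dict[str, float]:
    """Summarize speculation logs, given accepted-token counts per round."""
    rounds = len(accepted_per_round)
    accepted = sum(accepted_per_round)
    return {
        # Share of all drafted tokens that survived verification.
        "accepted_fraction": accepted / (rounds * gamma),
        # Each round also emits one token from the target (correction or bonus).
        "tokens_per_pass": (accepted + rounds) / rounds,
    }

# Made-up log: accepted counts for five rounds with a draft length of 4.
stats = summarize_rounds([4, 3, 0, 2, 1], gamma=4)
print(stats)  # {'accepted_fraction': 0.5, 'tokens_per_pass': 3.0}
```

Tracking these two numbers alongside end-to-end latency, split by traffic type, shows whether a draft pairing that looked good on a benchmark holds up on production prompts.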

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What does the draft model do, and what does the target model do?
Why is the output identical to ordinary decoding rather than an approximation?
What is acceptance rate, and what makes it high or low?
Why does speculative decoding help at low load but can hurt under heavy batching?
