Lesson 02

Prefill vs decode

The most useful split in LLM inference is not "prompt" and "answer." It is prefill and decode. One phase reads the input. The other produces the output. They stress the system in different ways.

The one idea

Prefill processes all prompt tokens and builds the state the model needs, in practice the attention key-value cache. Decode uses that state to generate one new token at a time. Long prompts mostly hurt prefill. Long answers mostly hurt decode.

Prefill reads the prompt

Before the model can generate a useful first token, it has to read the prompt: system message, user message, chat history, tool results, retrieval chunks, and any hidden scaffolding the product adds. This is the prefill phase.

In prefill, the model can process many input positions in parallel. A transformer layer can compute representations for the prompt tokens together, because all prompt tokens are already known. This makes prefill chunky: a lot of work happens before the first visible token appears.

The delay before that first visible token is called time to first token, or TTFT. A large prompt increases TTFT because the server has to process more context before it can start decoding.

Decode writes the answer

Decode starts after prefill. The model produces a probability distribution over the next token, the sampler picks one token from it, and that token is appended to the context. Then the model predicts the next token. The answer grows one token at a time.

This phase is sequential. You cannot generate token 200 until token 199 exists. The server can still batch multiple users together, but each individual response has a dependency chain through its own generated tokens.
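The two phases can be sketched with a toy model that counts forward passes instead of doing any real math. Everything here is hypothetical (ToyModel, prefill, decode_step, and generate are illustrative names, not a real library's API); the point is only the shape of the loop: one pass covers the whole prompt, then each further output token needs its own pass.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Hypothetical stand-in for an LLM: it counts passes, nothing more."""
    forward_passes: int = 0
    cache: list = field(default_factory=list)  # stands in for per-token state

    def prefill(self, prompt_tokens):
        # All prompt tokens are already known, so one pass covers every position.
        self.forward_passes += 1
        self.cache.extend(prompt_tokens)
        return self._next_token()

    def decode_step(self, last_token):
        # One pass handles exactly one new position: the token just produced.
        self.forward_passes += 1
        self.cache.append(last_token)
        return self._next_token()

    def _next_token(self):
        # Toy rule: the "next token" is simply the current context length.
        return len(self.cache)


def generate(model, prompt_tokens, max_new_tokens):
    output = [model.prefill(prompt_tokens)]  # the first token comes out of prefill
    while len(output) < max_new_tokens:
        # Each step needs the previous token, so steps cannot run in parallel.
        output.append(model.decode_step(output[-1]))
    return output


model = ToyModel()
tokens = generate(model, prompt_tokens=[10, 11, 12, 13], max_new_tokens=3)
print(tokens, model.forward_passes)  # prints: [4, 5, 6] 3
```

The four prompt tokens cost one pass in this sketch, while every output token after the first adds another pass. A real server does far more work per pass, but the dependency chain has the same shape.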

[Figure: one request timeline. Prefill (read prompt tokens) runs first and ends at the first token; decode (one output token per step) runs from there until done.]
Prefill delays the first token. Decode determines how long the stream keeps going.

Why users feel the phases differently

A slow prefill feels like silence. The user has clicked send and nothing is visible yet. A slow decode feels like a sluggish stream: text appears, but slowly. Both can produce the same total request time, but they feel different.

This is why streaming helps perceived latency. If prefill is short and decode is long, the user sees progress quickly. If prefill is long, streaming cannot show anything until the first token exists.
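A small arithmetic sketch makes the "same total, different feel" point concrete. The model below is a simplifying assumption (a fixed TTFT followed by a constant decode rate), and the numbers are made up for illustration rather than measured.

```python
def request_timeline(ttft_s, output_tokens, decode_tokens_per_s):
    """Return (time to first token, total time) in seconds for one request."""
    # The first token arrives at TTFT; the remaining tokens arrive at the decode rate.
    decode_s = (output_tokens - 1) / decode_tokens_per_s
    return ttft_s, ttft_s + decode_s


# Hypothetical numbers: both requests produce 101 output tokens.
slow_prefill = request_timeline(ttft_s=4.0, output_tokens=101, decode_tokens_per_s=50.0)
slow_decode = request_timeline(ttft_s=1.0, output_tokens=101, decode_tokens_per_s=20.0)

print(slow_prefill)  # prints: (4.0, 6.0)
print(slow_decode)   # prints: (1.0, 6.0)
```

Both requests finish after 6 seconds. With streaming, the second one shows text after 1 second instead of 4, which is the difference the user actually notices.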

Practical read

If TTFT is high, look at prompt length, queueing, prefill batching, and cold starts. If tokens per second is low, look at model size, hardware, decode batching, and cache memory pressure.

Prompt shape matters

Two prompts with the same visible user message can have very different prefill cost. One might include a short system prompt. Another might include 20 turns of chat history, five retrieved documents, tool traces, and a JSON schema. The model sees all of it as input tokens.

That hidden context is often where latency comes from. RAG systems are a common example. Retrieval can improve quality, but every chunk inserted into the prompt has to be read during prefill. Bigger context is not free.
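One way to see where prefill cost comes from is to break a prompt into its parts and count each one. The sketch below uses a deliberately crude assumption, one token per whitespace-separated word; a real tokenizer will give different counts. The part names and texts are placeholders.

```python
def rough_token_count(text):
    # Crude assumption: one token per whitespace-separated word.
    # A real tokenizer will produce different numbers.
    return len(text.split())


def prefill_budget(parts):
    """Map each prompt part to its rough token count, plus a total."""
    counts = {name: rough_token_count(text) for name, text in parts.items()}
    counts["total"] = sum(counts.values())  # summed before "total" is added
    return counts


# Placeholder prompt parts: the visible user message is the smallest one.
parts = {
    "system": "You are a helpful assistant.",
    "history": " ".join(["user: hi", "assistant: hello"] * 10),
    "retrieved": "chunk text " * 150,
    "user": "What is prefill?",
}

budget = prefill_budget(parts)
for name, count in budget.items():
    print(f"{name:>10}: {count}")
```

In this made-up example the user message is 3 of 348 rough tokens. The rest is context the user never typed, and all of it is read during prefill.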

Engineering reality

Measure prefill and decode separately in benchmarks. A change that improves one can hurt the other. For example, adding more retrieved context might improve answer quality while making TTFT worse. A product decision needs both numbers.
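A minimal sketch of measuring the two numbers separately, assuming the client receives tokens through a Python iterator. The measure_stream helper and fake_stream generator are hypothetical; fake_stream only sleeps to imitate a server and would be replaced by a real streaming client. Note that TTFT measured this way includes queueing and network time, not just prefill.

```python
import time


def measure_stream(token_stream, clock=time.perf_counter):
    """Consume a token iterator and report TTFT and decode rate."""
    start = clock()
    # Timestamp each token at the moment the iterator hands it over.
    arrivals = [clock() for _ in token_stream]
    if not arrivals:
        return None
    ttft_s = arrivals[0] - start
    decode_s = arrivals[-1] - arrivals[0]
    # The decode rate covers tokens after the first; undefined for one token.
    rate = (len(arrivals) - 1) / decode_s if decode_s > 0 else None
    return {"ttft_s": ttft_s, "decode_tokens_per_s": rate, "tokens": len(arrivals)}


def fake_stream(prefill_s, per_token_s, n_tokens):
    """Hypothetical stand-in for a streaming response."""
    time.sleep(prefill_s)  # silence before the first token
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)  # gap between later tokens
        yield f"tok{i}"


stats = measure_stream(fake_stream(prefill_s=0.05, per_token_s=0.01, n_tokens=5))
print(stats)
```

Reporting both values for every benchmark run makes the trade-off visible: a change that adds retrieved context should show up as a higher ttft_s even when decode_tokens_per_s stays flat.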

Checkpoint

You're ready for the next lesson if you can answer these from memory:

  • What work happens during prefill?
  • Why is decode sequential for a single response?
  • Which phase mostly affects time to first token?
  • Why can retrieved context make a system slower?

Quick check

What mostly determines time to first token?

  • Prefill
  • Decode token rate
  • Temperature sampling

Why is decode sequential for a single response?

  • The tokenizer needs to learn new tokens
  • Each generated token depends on the tokens before it
  • The browser has to render them in order