Prefill vs decode
The most useful split in LLM inference is not "prompt" and "answer." It is prefill and decode. One phase reads the input. The other produces the output. They stress the system in different ways.
Prefill processes all prompt tokens and builds the state the model needs for generation (in most transformer servers, the key-value cache of attention keys and values). Decode uses that state to generate one new token at a time. Long prompts mostly hurt prefill. Long answers mostly hurt decode.
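A rough back-of-envelope model can make the split concrete. The sketch below assumes constant throughput in each phase, which real servers do not guarantee (batching, hardware, and context length all shift it). The function name and both throughput values are illustrative placeholders, not measurements.

```python
def estimate_latency(prompt_tokens, output_tokens,
                     prefill_tok_per_s=5000.0, decode_tok_per_s=50.0):
    """Toy latency model. Both throughput defaults are made-up placeholders."""
    prefill_s = prompt_tokens / prefill_tok_per_s   # scales with prompt length
    decode_s = output_tokens / decode_tok_per_s     # scales with answer length
    return {
        "prefill_s": prefill_s,
        "decode_s": decode_s,
        "total_s": prefill_s + decode_s,
    }


# Long prompt, short answer: most of the time is spent in prefill.
long_prompt = estimate_latency(prompt_tokens=20000, output_tokens=100)

# Short prompt, long answer: decode dominates.
long_answer = estimate_latency(prompt_tokens=500, output_tokens=2000)

print(long_prompt)
print(long_answer)
```

Under these assumed rates, the first request spends more time in prefill than in decode, and the second spends almost all of its time in decode.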
Prefill reads the prompt
Before the model can generate a useful first token, it has to read the prompt: system message, user message, chat history, tool results, retrieval chunks, and any hidden scaffolding the product adds. This is the prefill phase.
In prefill, the model can process many input positions in parallel. A transformer layer can compute representations for the prompt tokens together, because all prompt tokens are already known. This makes prefill front-loaded: a lot of work happens in one burst before the first visible token appears.
That first visible delay is called time to first token, or TTFT. A large prompt increases TTFT because the server has to read more context before it can start decoding.
Decode writes the answer
Decode starts after prefill. The model predicts one next token, the sampler picks it, and that token is appended to the context. Then the model predicts the next token. The answer grows one token at a time.
This phase is sequential. You cannot generate token 200 until token 199 exists. The server can still batch multiple users together, but each individual response has a dependency chain through its own generated tokens.
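The dependency chain is easier to see in a toy loop. The sketch below is not a real model: `toy_next_token` is a hypothetical stand-in for a forward pass plus sampling, and the "state" is just a list of token ids rather than a real attention cache. It only illustrates the control flow: prefill consumes the whole prompt at once, while decode must finish each step before starting the next.

```python
def toy_next_token(state):
    """Hypothetical stand-in for model forward pass + sampler."""
    return (sum(state) + len(state)) % 100


def prefill(prompt_tokens):
    # Every prompt token is known up front, so the state is built in one pass.
    return list(prompt_tokens)


def decode(state, max_new_tokens):
    generated = []
    for _ in range(max_new_tokens):
        token = toy_next_token(state)  # depends on every token so far
        state.append(token)            # the new token joins the context
        generated.append(token)
    return generated


state = prefill([3, 1, 4])
answer = decode(state, max_new_tokens=4)
print(answer)
```

Each iteration of the loop reads the state that the previous iteration extended, so the steps for one response cannot be reordered or run side by side.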
Why users feel the phases differently
A slow prefill feels like silence. The user has clicked send and nothing is visible yet. A slow decode feels like a sluggish stream: text appears, but slowly. Both can add up to the same total request time, but they feel different.
This is why streaming helps perceived latency. If prefill is short and decode is long, the user sees progress quickly. If prefill is long, streaming cannot show anything until the first token exists.
The split also tells you where to look. If TTFT is bad, look at prompt length, queueing, prefill batching, and cold starts. If tokens per second is bad, look at model size, hardware, decode batching, and key-value (KV) cache memory pressure.
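One way to get both numbers from the client side is to timestamp a token stream. The sketch below assumes the stream is any Python iterable that yields tokens as they arrive. `measure_stream` and `fake_stream` are hypothetical names, and the delays in `fake_stream` are invented for illustration. Measured this way, TTFT includes queueing and network time as well as prefill.

```python
import time


def measure_stream(token_iter, clock=time.perf_counter):
    """Return TTFT and decode tokens/sec for an iterable that yields tokens."""
    start = clock()
    first = None
    last = None
    count = 0
    for _ in token_iter:
        now = clock()
        if first is None:
            first = now  # arrival time of the first token
        last = now
        count += 1
    if first is None:
        return {"ttft_s": None, "decode_tok_per_s": None, "tokens": 0}
    decode_window = last - first
    # The decode rate covers the tokens that arrived after the first one.
    rate = (count - 1) / decode_window if count > 1 and decode_window > 0 else None
    return {"ttft_s": first - start, "decode_tok_per_s": rate, "tokens": count}


def fake_stream(prefill_delay_s=0.05, per_token_delay_s=0.01, n_tokens=5):
    """Simulated server: one long pause, then evenly spaced tokens."""
    time.sleep(prefill_delay_s)
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_delay_s)
        yield f"tok{i}"


metrics = measure_stream(fake_stream())
print(metrics)
```

Passing the clock as a parameter keeps the function easy to test with a fake clock, without waiting on real delays.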
Prompt shape matters
Two prompts with the same visible user message can have very different prefill cost. One might include a short system prompt. Another might include 20 turns of chat history, five retrieved documents, tool traces, and a JSON schema. The model sees all of it as input tokens.
That hidden context is often where latency comes from. RAG systems are a common example. Retrieval can improve quality, but every chunk inserted into the prompt has to be read during prefill. Bigger context is not free.
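A per-component token count shows where the prefill budget goes. The sketch below uses a deliberately crude assumption, one token per whitespace-separated word, because real tokenizers split text differently. The component names and sample texts are hypothetical.

```python
def rough_token_count(text):
    # Crude assumption: one token per whitespace-separated word.
    # A real tokenizer would usually give a different (often higher) count.
    return len(text.split())


def prefill_breakdown(parts):
    """Map each prompt component to its rough token count, plus a total."""
    counts = {name: rough_token_count(text) for name, text in parts.items()}
    counts["total"] = sum(counts.values())
    return counts


parts = {
    "system": "You are a helpful assistant.",
    "history": "hello there " * 200,        # stand-in for earlier chat turns
    "retrieved": "chunk text here " * 500,  # stand-in for RAG chunks
    "user": "What is prefill?",
}

breakdown = prefill_breakdown(parts)
for name, count in breakdown.items():
    print(f"{name:>10}: {count}")
```

In this made-up example the visible user message is a tiny fraction of what the model reads during prefill.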
Measure prefill and decode separately in benchmarks. A change that improves one can hurt the other. For example, adding more retrieved context might improve answer quality while making TTFT worse. A product decision needs both numbers.
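A benchmark report can keep the two numbers side by side so that a regression in one is not hidden by the other. In the sketch below, the field names `ttft_s` and `decode_tok_per_s` are hypothetical, and the two runs are invented values for illustration, not results.

```python
def compare_runs(baseline, candidate):
    """Compare two runs, each a dict with 'ttft_s' and 'decode_tok_per_s'."""
    return {
        "ttft_change_s": candidate["ttft_s"] - baseline["ttft_s"],
        "decode_change_tok_per_s": (
            candidate["decode_tok_per_s"] - baseline["decode_tok_per_s"]
        ),
        # Higher TTFT is worse; lower decode rate is worse.
        "ttft_regressed": candidate["ttft_s"] > baseline["ttft_s"],
        "decode_regressed": (
            candidate["decode_tok_per_s"] < baseline["decode_tok_per_s"]
        ),
    }


# Invented numbers: a run with more retrieved context versus a baseline.
baseline = {"ttft_s": 0.4, "decode_tok_per_s": 50.0}
more_context = {"ttft_s": 1.5, "decode_tok_per_s": 48.0}

report = compare_runs(baseline, more_context)
print(report)
```

Whether a quality gain justifies a given TTFT change is a product decision; the report only makes sure both phases are visible.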
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- What work happens during prefill?
- Why is decode sequential for a single response?
- Which phase mostly affects time to first token?
- Why can retrieved context make a system slower?
Quick check
Which phase mostly affects time to first token?
- Prefill
- Decode token rate
- Temperature sampling
Why is decode sequential for a single response?
- The tokenizer needs to learn new tokens
- Each generated token depends on the tokens before it
- The browser has to render them in order