Lesson 03

KV cache explained

The KV cache is the reason decode is practical. It saves the attention state for previous tokens so the model does not have to recompute the whole prompt and every generated token from scratch at every step.

The one idea

The KV cache stores the keys and values produced by attention layers for tokens the model has already seen. Decode appends one new entry per layer per generated token, trading memory for speed.

What gets cached

In each transformer attention layer, every token produces three vectors: a query, a key, and a value. The current token's query is compared against the keys of previous tokens to decide how much weight each of their values gets. That is how the model uses context.

During prefill, the model computes keys and values for all prompt tokens. During decode, those earlier keys and values do not change, because each token's key and value depend only on that token and the tokens before it. Instead of recomputing them, the server keeps them in memory. That stored state is the KV cache.

When a new token is generated, the model computes that token's new key and value, appends them to the cache, and uses the whole cache for the next step.

The cache starts with prompt tokens and grows with each generated token. Longer conversations use more cache memory.
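The prefill-then-append flow described above can be sketched at toy scale. This is a minimal illustration for a single attention head in a single layer, not how any real server implements it. `ToyKVCache`, `project`, and the fixed token vectors are hypothetical stand-ins: a real model uses learned projection matrices and picks each new token from its own output.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all cached keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

class ToyKVCache:
    """Keys and values for one head in one layer (hypothetical toy class)."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)

def project(token_vec):
    """Hypothetical stand-in for the learned query/key/value projections."""
    q = list(token_vec)
    k = [2.0 * x for x in token_vec]
    v = [x + 1.0 for x in token_vec]
    return q, k, v

cache = ToyKVCache()

# Prefill: compute and store a key and a value for every prompt token.
prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for tok in prompt:
    _, k, v = project(tok)
    cache.append(k, v)

# Decode: each step appends exactly one key/value pair, then attends
# over the whole cache. Old entries are read, never recomputed.
new_tokens = [[0.5, 0.5], [0.2, 0.8]]  # fixed here; a real model samples these
outputs = []
for tok in new_tokens:
    q, k, v = project(tok)
    cache.append(k, v)
    outputs.append(attend(q, cache.keys, cache.values))

cache_len = len(cache)  # 3 prompt tokens + 2 generated tokens
```

A real model keeps one such cache per layer and per key/value head, so the memory cost multiplies accordingly.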

Why this speeds up decode

Without the KV cache, every decode step would rerun the model over the entire prefix: the original prompt plus all tokens generated so far. Generating a 500-token answer would mean reprocessing the prompt about 500 times, each time with a slightly longer prefix.

With the cache, the model only computes the new token's fresh state and reads the cached states for previous tokens. That does not make decode free. It still has to run model layers and read cache memory. But it avoids a huge amount of repeated computation.
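To put rough numbers on the repeated work, the sketch below counts how many token positions have to be pushed through the model layers in each case. The 1,000-token prompt is an assumed figure for illustration, and the function names are hypothetical. The count deliberately ignores the cost of reading the cache, which still grows with context length.

```python
def token_passes_without_cache(prompt_len, new_tokens):
    """Step i reprocesses the prompt plus the i tokens generated so far."""
    return sum(prompt_len + i for i in range(new_tokens))

def token_passes_with_cache(prompt_len, new_tokens):
    """Prefill processes the prompt once and yields the first token;
    each later step processes only the single newest token."""
    return prompt_len + max(new_tokens - 1, 0)

# Assumed workload: 1,000-token prompt, 500-token answer.
without_cache = token_passes_without_cache(1000, 500)
with_cache = token_passes_with_cache(1000, 500)
print(without_cache, with_cache)
```

Under these assumptions the uncached path processes several hundred times more token positions than the cached one, which is the "repeated computation" the cache removes.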

The cache is not tiny

KV cache size grows with context length, number of layers, number of key/value heads, per-head dimension, numeric precision, and the number of active requests. This is why long-context serving is a memory problem, not just a model-quality feature.

Every active request needs its own cache because each user has a different prompt and generated output. A server handling many long requests at once can run out of cache memory before it runs out of raw compute.
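The factors above multiply together, which a back-of-the-envelope calculation makes concrete. The shape used here (32 layers, 32 key/value heads, head dimension 128, 2 bytes per value) is an assumed example typical of a 7B-class model without grouped-query attention. Substitute the values from your own model's config.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache added by one token.

    The factor 2 covers one key vector plus one value vector
    per key/value head in every layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def kv_cache_bytes(tokens, **shape):
    """Bytes of KV cache for one request holding `tokens` tokens."""
    return tokens * kv_cache_bytes_per_token(**shape)

# Assumed model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_value=2)

per_token = kv_cache_bytes_per_token(**shape)
per_request = kv_cache_bytes(4096, **shape)

print(per_token / 1024**2, "MiB per token")          # 0.5
print(per_request / 1024**3, "GiB per 4096 tokens")  # 2.0
```

With this shape, a single 4,096-token request holds 2 GiB of cache, and eight such requests hold 16 GiB before any weights are counted.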

Common trap

A model that fits in GPU memory at load time may still fail under real traffic. The weights fit, but the active KV caches for concurrent requests may not.
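A rough capacity check shows how this trap plays out. All figures below (a 24 GiB GPU, 13 GiB of weights, 2 GiB of runtime overhead, 2 GiB of cache per request) are assumptions chosen for illustration, and `max_concurrent_requests` is a hypothetical helper rather than any server's API.

```python
GIB = 1024**3

def max_concurrent_requests(gpu_bytes, weight_bytes, overhead_bytes,
                            cache_bytes_per_request):
    """How many full-length requests fit in the memory left after weights."""
    free = gpu_bytes - weight_bytes - overhead_bytes
    if free <= 0:
        return 0
    return free // cache_bytes_per_request

# Assumed deployment: the weights load fine, with 9 GiB left over.
slots = max_concurrent_requests(
    gpu_bytes=24 * GIB,
    weight_bytes=13 * GIB,
    overhead_bytes=2 * GIB,
    cache_bytes_per_request=2 * GIB,
)
print(slots)  # 4
```

In this scenario the model loads comfortably, yet a fifth simultaneous long request has no cache memory left to use.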

Why cache management is serving logic

Inference servers spend a lot of effort managing KV cache blocks. They allocate cache memory when a request starts, append blocks as decode continues, reuse freed blocks when requests end, and sometimes evict or swap when memory pressure gets bad.
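The allocate, append, and reuse cycle can be sketched as a toy block allocator. This is a simplified illustration of block-based cache bookkeeping, not the implementation used by any particular server. `BlockAllocator` and its methods are hypothetical names, and eviction and swapping are left out.

```python
class BlockAllocator:
    """Toy bookkeeping for fixed-size KV cache blocks (hypothetical class)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size               # tokens per block
        self.free_blocks = list(range(num_blocks))
        self.tables = {}    # request_id -> list of block ids
        self.lengths = {}   # request_id -> tokens stored so far

    def start(self, request_id, prompt_tokens):
        """Allocate enough blocks to hold the prompt's keys and values."""
        needed = -(-prompt_tokens // self.block_size)  # ceiling division
        if needed > len(self.free_blocks):
            raise MemoryError("not enough free KV blocks for this prompt")
        self.tables[request_id] = [self.free_blocks.pop() for _ in range(needed)]
        self.lengths[request_id] = prompt_tokens

    def append_token(self, request_id):
        """Record one generated token, taking a new block only when full."""
        table = self.tables[request_id]
        if self.lengths[request_id] == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks left")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] += 1

    def finish(self, request_id):
        """Return a finished request's blocks to the free pool for reuse."""
        self.free_blocks.extend(self.tables.pop(request_id))
        del self.lengths[request_id]

alloc = BlockAllocator(num_blocks=4, block_size=16)
alloc.start("a", prompt_tokens=20)   # 20 tokens need 2 blocks
for _ in range(13):                  # the 33rd token spills into a 3rd block
    alloc.append_token("a")
blocks_used_by_a = len(alloc.tables["a"])
alloc.finish("a")                    # all 4 blocks are free again
```

Allocating in fixed-size blocks means a request only holds memory for tokens it has actually produced, rounded up to one block, instead of reserving its maximum length up front.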

This is one reason inference engines like vLLM, TGI, TensorRT-LLM, and llama.cpp matter. The model architecture is only part of serving. The cache scheduler, memory layout, batching behavior, and paging strategy can decide whether the same model feels fast or unusable.

Engineering reality

The KV cache is a production capacity limit. If you allow every user to send giant prompts and request giant answers, you are reserving memory for those choices. Put real limits on input tokens, output tokens, and concurrent long-context requests.
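One way to express such limits is an admission check that runs before a request is allowed to reserve any cache. The sketch below is a hypothetical policy, not a recommended configuration. `Limits`, `admit`, and every threshold value are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Limits:
    max_input_tokens: int
    max_output_tokens: int
    long_context_threshold: int      # total tokens above which a request is "long"
    max_long_context_requests: int   # how many long requests may run at once

def admit(input_tokens, requested_output_tokens, active_long_requests, limits):
    """Return the granted output-token budget, or None if the request is rejected."""
    if input_tokens > limits.max_input_tokens:
        return None  # prompt alone is over the input limit
    # Clamp oversized output requests instead of rejecting them.
    granted = min(requested_output_tokens, limits.max_output_tokens)
    is_long = input_tokens + granted > limits.long_context_threshold
    if is_long and active_long_requests >= limits.max_long_context_requests:
        return None  # too many long-context requests already hold cache
    return granted

# Assumed limits, for illustration only.
limits = Limits(
    max_input_tokens=8192,
    max_output_tokens=1024,
    long_context_threshold=4096,
    max_long_context_requests=2,
)

print(admit(1000, 5000, 0, limits))   # 1024: output clamped to the limit
print(admit(10000, 100, 0, limits))   # None: prompt too large
print(admit(4000, 500, 2, limits))    # None: long-context slots are full
```

Clamping the output budget instead of rejecting is a design choice. A stricter service might refuse the request outright so that callers notice the limit.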

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What do the K and V in KV cache refer to?
Why does the cache help decode?
Why does cache memory grow during generation?
Why can concurrency make KV cache memory the bottleneck?

Quick check

Reuse previous tokens' attention state
Store the tokenizer vocabulary
Store a second copy of all model weights

The weights grow for each user
Each active request needs its own KV cache
Temperature uses extra memory