Lesson 05

Context length and output tokens

Tokens are the unit of work in LLM inference. Input tokens and output tokens both cost money, but they affect latency and memory at different phases of a request.

The one idea

Input tokens drive prefill and KV cache size. Output tokens drive decode duration and keep the request alive. A cheap-looking prompt can become expensive if it asks for a long answer.

Context is everything the model sees

The context is not just the user's latest sentence. It includes the system prompt, developer instructions, chat history, retrieved passages, tool outputs, schemas, examples, and the generated tokens so far.

All of those become tokens. The model does not care which part was visible in your UI. If it is sent to the model, it is context, and the server has to process it.

The generated answer joins the context as it is produced, so a request consumes more context while it runs.
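A minimal sketch of that accounting, assuming a hypothetical 8,192-token window and made-up token counts. The class and field names are illustrative, not taken from any particular serving stack.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    context_window: int     # hypothetical model limit, in tokens
    input_tokens: int       # everything sent: system, history, retrieval, user, tools
    max_output_tokens: int  # cap on generated tokens

    def tokens_in_context(self, generated_so_far: int) -> int:
        """Context occupied once `generated_so_far` output tokens exist."""
        return self.input_tokens + generated_so_far

    def fits(self) -> bool:
        """True if the worst case (a full-length answer) stays inside the window."""
        return self.input_tokens + self.max_output_tokens <= self.context_window


budget = ContextBudget(context_window=8192, input_tokens=6000, max_output_tokens=1000)
print(budget.tokens_in_context(0))    # 6000 at the start of decode
print(budget.tokens_in_context(500))  # 6500 halfway through a 1,000-token answer
print(budget.fits())                  # True: 6000 + 1000 <= 8192
```

Checking the worst case up front is a design choice: it rejects or trims a request before any prefill work is spent on it.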

Long input hurts before the first token

Input tokens must be processed during prefill. More input usually means higher time to first token, more attention work, and more initial KV cache memory.
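To make the memory cost concrete, here is a rough estimate of KV cache size for a standard transformer that stores one key and one value vector per token, per layer, per KV head. The model shape below is a hypothetical example, and real servers add overhead (paging, padding, quantization) that this sketch ignores.

```python
def kv_cache_bytes(
    tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes assumes fp16/bf16 cache entries
) -> int:
    """Approximate KV cache size for one request."""
    # The leading 2 covers the key vector and the value vector.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * tokens


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
size = kv_cache_bytes(tokens=4000, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"{size / 2**20:.0f} MiB")  # prints "500 MiB"
```

Under these assumptions each token costs 128 KiB of cache, so a 4,000-token prompt reserves about 500 MiB before the first output token appears.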

This is why prompt stuffing has a cost. Adding examples, retrieved passages, or tool logs can improve quality, but each addition has to justify its token budget. "Maybe useful" context is not free context.

Long output keeps the machine busy

Output tokens are produced during decode, one step at a time. If the model writes 1,000 tokens, that is roughly 1,000 sequential decode steps for that request. The server keeps the KV cache alive the whole time, streams tokens if enabled, and cannot release the request's memory until the answer is done.
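A back-of-the-envelope latency model shows why output length dominates. The timing numbers are assumptions chosen for illustration, not measurements, and the model ignores batching and queueing effects.

```python
def estimate_latency_s(
    output_tokens: int,
    ttft_s: float,
    seconds_per_output_token: float,
) -> float:
    """Rough end-to-end latency: prefill time plus one decode step per output token."""
    return ttft_s + output_tokens * seconds_per_output_token


# Assumed numbers: 0.4 s to first token, 20 ms per decode step.
short = estimate_latency_s(50, ttft_s=0.4, seconds_per_output_token=0.02)
long = estimate_latency_s(1000, ttft_s=0.4, seconds_per_output_token=0.02)
print(f"short answer: {short:.1f} s, long answer: {long:.1f} s")
```

With the same prompt and the same prefill cost, the 1,000-token answer holds the request open for about 20 seconds while the 50-token answer finishes in under two.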

Long output is especially painful in interactive products. A user might tolerate a detailed answer for a research task. They will not tolerate it in autocomplete, voice, support triage, or a tool call that should return a small JSON object.

Practical default

Set explicit output limits by task. Extraction, routing, classification, and tool calls should usually have small max-token budgets and schema validation. Do not let them write essays.
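One way this default might look in code, as a sketch. The task names, the limits, and the routing schema are all hypothetical. The point is that the cap and the validation are separate checks: a capped answer can still be malformed.

```python
import json

# Hypothetical per-task output caps, in tokens.
MAX_OUTPUT_TOKENS = {
    "classification": 16,
    "routing": 32,
    "tool_call": 128,
    "extraction": 256,
    "research_answer": 2000,
}
DEFAULT_MAX_OUTPUT_TOKENS = 512


def output_limit(task: str) -> int:
    """Look up the cap to send with the request for this task."""
    return MAX_OUTPUT_TOKENS.get(task, DEFAULT_MAX_OUTPUT_TOKENS)


def validate_routing_output(raw: str, allowed_routes: set) -> dict:
    """Parse a routing result; raise ValueError if it is not the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"route"}:
        raise ValueError("expected an object with exactly one key: 'route'")
    if not isinstance(data["route"], str) or data["route"] not in allowed_routes:
        raise ValueError(f"unknown route: {data['route']!r}")
    return data


routes = {"billing", "support", "sales"}
print(output_limit("routing"))                               # 32
print(validate_routing_output('{"route": "billing"}', routes))
```

An answer cut off by the token cap usually fails to parse, so validation also catches limits that were set too low.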

History needs a policy

Chat history is useful until it becomes baggage. Keeping every prior turn makes the prompt grow without anyone asking whether old messages still matter. The result is a longer time to first token, more KV cache memory, and sometimes worse answers because stale context distracts the model.

Good systems decide what to keep. They can trim old turns, summarize older context, keep only task state, or retrieve relevant past facts instead of blindly replaying the whole conversation.
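The simplest of these policies, trimming old turns to a token budget, can be sketched as follows. The `count_tokens` helper is a crude whitespace stand-in and an assumption of this example; a real system would use the model's own tokenizer.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace words as a rough proxy for tokens.
    return len(text.split())


def trim_history(turns: list, budget: int) -> list:
    """Keep the most recent turns whose combined size fits within `budget` tokens."""
    kept = []
    used = 0
    # Walk backwards from the newest turn and stop at the first one that does not fit.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["my order number is 1234", "thanks, checking", "any update on that"]
print(trim_history(history, budget=6))
# ['thanks, checking', 'any update on that']
```

Note what the example loses: the order number lived in the oldest turn. That is why trimming alone is rarely enough, and why summarizing or keeping explicit task state is often paired with it.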

Retrieval needs a budget too

RAG systems often fail by over-retrieving. Ten chunks feel safer than three until you count the tokens. More chunks can dilute the answer, add contradictions, and slow prefill.

A retrieval budget should be designed like an API budget: how many chunks, how long each chunk can be, what scores qualify, and how the system behaves when there is not enough relevant evidence.

Engineering reality

Token budgets should live in product configuration, not be scattered across prompt strings. Track input tokens by source: system, history, retrieval, user, tool output. That breakdown makes latency and cost review possible.
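A per-source breakdown can be as small as the sketch below. The source names follow the list above; the function name and the whitespace token counter are assumptions for illustration, and a real system would count with the model's tokenizer.

```python
SOURCES = ("system", "history", "retrieval", "user", "tool_output")


def count_tokens(text: str) -> int:
    # Assumption: whitespace words as a rough proxy for tokens.
    return len(text.split())


def token_breakdown(parts: dict) -> dict:
    """Count input tokens per source and add a total, ready to log with the request."""
    unknown = set(parts) - set(SOURCES)
    if unknown:
        raise ValueError(f"unknown prompt sources: {sorted(unknown)}")
    breakdown = {source: count_tokens(parts.get(source, "")) for source in SOURCES}
    breakdown["total"] = sum(breakdown.values())
    return breakdown


report = token_breakdown({
    "system": "You are helpful",
    "retrieval": "refunds take five business days",
    "user": "where is my refund",
})
print(report)
# {'system': 3, 'history': 0, 'retrieval': 5, 'user': 4, 'tool_output': 0, 'total': 12}
```

Rejecting unknown sources keeps the breakdown honest: every token sent to the model has to be attributed to a named bucket.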

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What counts as context?
Why do input tokens and output tokens affect different phases?
Why can unlimited chat history hurt quality and latency?
What should a retrieval token budget specify?
