Context windows: prompt, memory, and limits
The context window is the model's working memory for a single request. If something is not in the weights, tools, or current context, the model cannot use it. Long context helps, but it is not free memory.
A context window is the maximum number of tokens the model can consider at once: system instructions, user input, retrieved documents, conversation history, tool results, and the output being generated.
Everything competes for the same budget
When you call an LLM, the server builds one token sequence. That sequence might include system instructions, developer instructions, previous messages, the user's latest request, retrieved chunks, tool outputs, formatting examples, and room for the model's answer. They all draw from the same context limit.
This is why prompt design is not just wording. It is budgeting. The more history and documents you stuff in, the less room remains for the answer and the more work the model must do.
The model does not remember your last request
An LLM call is stateless unless the product around it stores state and sends it back. Chat apps feel persistent because the app includes previous messages in the next request, or it summarizes them, or it retrieves saved memories. The model itself is not quietly updating its weights after each message.
This distinction matters when debugging. If the model "forgot" something, ask whether that thing was actually included in the current context, whether it was summarized away, or whether it was buried under too much other text.
Long context is useful, not magical
A larger window lets you include bigger files, more conversation, and more retrieved evidence. That is genuinely useful. But long context creates new problems: higher cost, slower prompt processing, more distraction, and weaker attention to details in the middle of huge inputs.
The model still has to decide what matters. A 100-page dump can contain the right answer and still produce a bad response if the important line is hard to find, conflicts with other text, or is surrounded by noise. Retrieval and summarization still matter.
Long context has two costs. First, the prefill step has to process the whole prompt before generation starts. Second, the KV cache must store attention keys and values for the tokens kept around during decoding. That memory footprint becomes a serving constraint when you run many requests in parallel.
Context engineering is selection
Good AI systems rarely send everything. They select. For a support assistant, that might mean the current ticket, the user's plan, the top few relevant docs, and the last few turns. For a code assistant, it might mean the file being edited, nearby symbols, failing test output, and a compact repo map. The art is choosing the smallest context that contains the right evidence.
This is the bridge into RAG and agents. Retrieval decides what external knowledge enters the context. Tools decide what fresh facts can be fetched. The model still only answers from what it can represent in the current window plus what it already learned during training.
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- What kinds of text compete for context window space?
- Why is a chat app not proof that the model itself has memory?
- What costs grow when you use a long context window?
- Why is selecting relevant context often better than sending everything?
Quick check
- All tokens the model can consider for the request, including prompt parts and output
- Only the model's trained weights
- Only the user's latest message
- Because it deletes weights after every answer
- Because those details may not be present in the current request context
- Because tokenizers remove old messages automatically
- The model cannot read any text longer than one paragraph
- It can increase cost and latency while burying the important evidence
- It permanently retrains the model on that text