Context budgeting and window management
A 128k window sounds infinite until you stack system text, eight RAG chunks, tool logs, and forty chat turns. Budgeting is deciding what ships, what gets compressed, and what gets dropped, before the API silently truncates the wrong end.
Context is a zero-sum resource. Every token of history or evidence reduces room for the answer and costs money on prefill. Manage the window explicitly; do not let the client library clip blindly.
The budget equation
For each request, sketch:
window_max = model context limit (input + output, provider-specific)
reserved_output = max_tokens you allow for the completion
input_budget = window_max − reserved_output − safety_margin
Everything static and dynamic must fit in input_budget: system prompt, tools, shots, retrieval, tool results, history, user message. If the sum overshoots, something must give. APIs differ on whether they error, drop the middle, or truncate the start. Assume bad defaults until you read the docs.
What to protect first
Not all tokens are equal. A common priority stack:
- System / developer instructions (safety, output format)
- Current user request (the task right now)
- Fresh evidence (RAG chunks, latest tool results relevant to this turn)
- Recent turns (immediate conversational continuity)
- Older history (compress or drop)
- Optional flourishes (long personas, extra shots)
The "lost in the middle" effect: models sometimes under-use information buried in the center of very long contexts. Liu et al. (Lost in the Middle, 2023) showed U-shaped recall: content at the start and end of long prompts is used more reliably than material in the middle. Put critical evidence and citation rules near the start or end, not in the middle of a 80k paste. Repeating the highest-priority constraint right before the user question is a common production pattern.
History strategies
Sliding window: keep the last N turns or last K tokens. Simple, predictable. Loses long-range thread context.
Summarization: periodically compress older turns into a paragraph stored alongside or injected as a system note. Cheaper than full history but summary errors compound. Re-summarize from source messages when possible.
Structured memory: extract facts ("user prefers metric units") into a small key-value block instead of replaying chat. Good for preferences; bad if you need exact quotes.
Thread forking: new sub-session without baggage for tangents. Useful in support when the user pivots topics.
Using the same LLM to summarize history saves tokens but can drop nuance the main turn needed. Keep summaries short, factual, and labeled as unreliable for citations. Some teams use a smaller cheaper model for summarization with a strict bullet template.
RAG and tool output in the budget
Retrieval should be budget-aware: retrieve more candidates than you inject, rerank, then pack greedily by score until you hit a token cap. Hard-cap chunk size at index time so one document cannot eat the window.
Tool outputs (JSON blobs, stack traces) blow up fast. Truncate with ellipsis, store full payload server-side, pass an ID the model can request again. Log what was trimmed so debugging is possible.
Deduplicate: eight chunks that say the same policy waste space and confuse attention.
Long-context marketing vs your workload
128k or 1M context windows do not mean you should fill them. Attention quality, prefill cost, and latency often degrade before you hit the hard limit. Benchmark your own tasks: sometimes retrieval plus 8k context beats dumping whole PDFs.
Legal and compliance reviews may require retaining full transcripts server-side while sending only summaries to the model. The window is an inference convenience, not your system of record.
When vendors raise limits, revisit architecture. Problems you solved with aggressive summarization might accept more evidence, but re-measure before deleting compression logic.
Testing overflow behavior
Write tests that deliberately exceed budget. Does your assembler drop oldest turns or fail the request? Does the API error or silently truncate? Document the behavior and make it explicit in UX ("conversation too long, start new thread").
Simulate fat tool payloads and huge RAG injections in staging. Production will eventually hit them via one verbose API partner.
Alert when p95 input tokens cross thresholds. Gradual creep is how features become slow and expensive without anyone filing a ticket.
Multi-turn features and session design
Chat products need a session policy documented alongside the budget: max turns, when to suggest "new conversation," what state lives server-side vs in context. Users should not have to know about tokens; the product should gracefully reset when continuity is no longer affordable.
Carry forward structured state outside the model when possible: current ticket ID, cart contents, user tier. Re-inject as compact facts instead of replaying the entire negotiation.
Cross-session memory products are just compression strategies with branding. Evaluate them with the same skepticism as summarization: what was lost, and does it matter for your task?
Worked example: support thread overflow
Imagine a 32k window, 3k system+tools, 4k reserved output, 25k input budget. After twenty turns the history alone is 18k. Retrieval wants 10k of policy chunks. You are 3k over.
Reasonable pack order: keep system, keep latest user message, inject top 6k of RAG by score, summarize turns 1–15 into 1k bullets, keep turns 16–20 verbatim. Drop optional few-shot. Log that summarization ran so support knows quotes from early turns may be lossy.
This is not theoretical. Without an assembler policy, the client SDK may drop the oldest messages including the only copy of the refund policy you injected six turns ago.
Output headroom and truncation
Reserving too little output space truncates JSON and code mid-stream. If structured answers need 800 tokens, do not reserve 256 because "answers are usually short." Truncated structured output fails validation more often than verbose prose.
Dynamic reservation helps: classify tasks get 100 tokens, draft tasks get 2k. The assembler picks max_tokens from task metadata, not a global constant.
When truncation happens, detect it (finish_reason, incomplete brackets) and retry with higher ceiling or a decomposed task. Do not feed half-JSON to business logic.
Budget reviews should include output reservation, not only input. Teams obsessed with trimming RAG while starving max_tokens see mysterious JSON failures that look like model stupidity but are simple math.
Prompt caching and prefix economics
Long static prefixes (system prompt, tool schemas, few-shot library) repeat on every request in a session. Providers let you cache identical prefix tokens so later calls skip full prefill cost on that block. Anthropic calls this prompt caching; OpenAI uses prefix caching on matching input prefixes. Self-hosted stacks (vLLM, SGLang) cache KV blocks for shared prefixes across requests.
What gets cached: the token sequence from the start of the prompt up to a cache breakpoint you control. Everything after the breakpoint is fresh each turn (user message, new RAG chunks, tool results). Structure assembly as stable_prefix || volatile_suffix. Put timestamps, session IDs, and live retrieval after the breakpoint, not inside the cached block.
Cost model (API, illustrative): uncached input tokens bill at the listed input rate. Cached input tokens bill at a reduced rate (often roughly 10% of uncached, provider-specific). You still pay for cache writes on the first hit or when the prefix changes. Example: 8k-token system+tools cached, 4k fresh user+RAG per request at 1M requests/month:
- Without caching: 12k input tokens × 1M = 12B input tokens at full rate.
- With caching: 8k written once per session (or TTL), then 8k cached reads at discount + 4k uncached each request.
On chat products with many turns, caching the static prefix often cuts input cost more than trimming a few RAG chunks. Accidentally invalidating cache (dynamic date in system text, unordered tool list) is a common silent regression. Instrument cache hit rate if your provider exposes it. Deeper serving math lives in the Serving & Economics course.
Worked example: one RAG + tools request
32k window, support bot with search tool. Budget sketch:
| Layer | Tokens | Notes |
|---|---|---|
| System + output contract | 1,200 | Cacheable prefix |
| Tool definitions (search, ticket_update) | 2,400 | Cacheable if stable |
| Few-shot (3 examples) | 900 | Cacheable if static |
| RAG (6 chunks, capped) | 5,500 | Volatile; after cache breakpoint |
| Chat history (last 4 turns) | 3,800 | Volatile; summarize older |
| Current user message | 600 | Volatile |
| Input subtotal | 14,400 | |
| Reserved output | 2,048 | JSON answer + citations |
| Safety margin | 500 | Tokenizer mismatch buffer |
| Window used | 16,948 | ~53% of 32k; room to grow history |
If history grows to 12k, you are over budget. Drop order: summarize turns older than four, drop lowest-score RAG chunks, then trim few-shot before touching system or tools. Re-count after every assembler change.
Counting tokens
You cannot budget without measurement. Use the provider tokenizer (tiktoken, etc.) in your assembly code. Count before send. Track p50/p95 input sizes in metrics. Spikes often mean a tool started returning huge JSON or someone appended a novel to the system prompt.
Prompt caching (where supported) changes economics: static prefixes are cheaper on repeat. Structure prompts as stable prefix + volatile suffix to maximize cache hits.
Instrument cache hit rate if your provider exposes it. A refactor that accidentally moves dynamic timestamps into the static prefix can zero out savings overnight. Treat cache-friendly layout as a performance requirement, not an optimization pass.
Assembler ownership
One module should own context assembly: inputs in, token counts out, drop policy applied. Scatter-shot trimming in route handlers guarantees inconsistent behavior across features.
Document the assembler contract for feature teams: what they can pass, what gets summarized, hard caps per layer. Without that API, every squad reinvents truncation badly.
Run tabletop exercises: "RAG doubles chunk size next release; what breaks?" Budgeting is capacity planning for tokens the same way SREs plan CPU. Assume growth before the pager fires.
Publish internal docs with example budgets per feature template: support bot, codegen assist, summarizer. New features clone a template instead of guessing token headroom from a blog post about 128k windows.
Reconcile token metrics with finance monthly. Input growth without traffic growth is almost always a context assembly bug or a feature silently stuffing more history.
When in doubt, measure twice in staging with production-sized transcripts before trusting a new window limit in marketing copy.
Token debt compounds quietly; review budgets when retrieval or history features ship.
Prefill scales with input. Doubling context often hurts time-to-first-token more than doubling output tokens hurts total latency. Long contexts are not free speed-wise.
Silent truncation bugs. If your SDK drops the start of history, the model may forget system constraints that were only in early turns you thought were preserved. Test overflow behavior explicitly.
Cost alerts. Set per-feature token ceilings. A runaway agent loop that appends full web pages to history can burn budget in minutes.
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- How do you compute input_budget from window_max and max_tokens?
- What is typically dropped first when context overflows?
- What is the "lost in the middle" problem?
- How does prompt caching change the cost of a static system prefix?
- Compare sliding window vs summarization for chat history.
- Why should RAG packing be score-aware and capped?
Quick check
- About 28k tokens for input
- 32k; output is separate and unlimited
- 28k input only if the system prompt is empty
- Bury it in the middle of 60k tokens
- Near the start or end of the context, close to the user question
- Only in the system prompt, never near the user message
- It eventually exceeds the window, increases cost, and crowds out evidence
- Models remember everything automatically once it was said once
- APIs stop charging after 10 turns
- Models read tokens alphabetically
- To maximize prompt cache hits on providers that cache identical prefixes
- To prevent the API from logging history
- After the system prompt but before every user message and RAG chunk
- After system, tools, and static shots; before live user content and retrieval
- Only on the first line of the system message