
Lesson 05

Context budgeting and window management

A 128k window sounds infinite until you stack system text, eight RAG chunks, tool logs, and forty chat turns. Budgeting is deciding what ships, what gets compressed, and what gets dropped, before the API silently truncates the wrong end.

The one idea

Context is a zero-sum resource. Every token of history or evidence reduces room for the answer and costs money on prefill. Manage the window explicitly; do not let the client library clip blindly.

The budget equation

For each request, sketch:

window_max = model context limit (input + output, provider-specific)

reserved_output = max_tokens you allow for the completion

input_budget = window_max − reserved_output − safety_margin

Everything static and dynamic must fit in input_budget: system prompt, tools, shots, retrieval, tool results, history, user message. If the sum overshoots, something must give. APIs differ on whether they error, drop the middle, or truncate the start. Assume bad defaults until you read the docs.

Allocate explicitly. If you do not choose what to drop, the platform or a naive trimmer will choose for you.
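The budget equation above can be sketched in a few lines. This is a minimal illustration, not a provider API: the function names, layer names, and token figures are all hypothetical.

```python
def input_budget(window_max: int, reserved_output: int, safety_margin: int = 0) -> int:
    """Tokens left for everything sent to the model (system, tools, history, ...)."""
    budget = window_max - reserved_output - safety_margin
    if budget <= 0:
        raise ValueError("reserved output and margin leave no room for input")
    return budget


def overshoot(layer_tokens: dict[str, int], budget: int) -> int:
    """How many tokens the layers exceed the budget by (0 if they fit)."""
    return max(0, sum(layer_tokens.values()) - budget)


# Hypothetical request: every number here is made up for illustration.
budget = input_budget(window_max=32_000, reserved_output=4_000, safety_margin=500)
layers = {"system": 1_200, "tools": 2_400, "history": 18_000, "rag": 10_000, "user": 600}
print(budget, overshoot(layers, budget))
```

A non-zero overshoot is the signal to apply an explicit drop policy before the request is sent.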

What to protect first

Not all tokens are equal. A common priority stack:

System / developer instructions (safety, output format)
Current user request (the task right now)
Fresh evidence (RAG chunks, latest tool results relevant to this turn)
Recent turns (immediate conversational continuity)
Older history (compress or drop)
Optional flourishes (long personas, extra shots)

The "lost in the middle" effect: models sometimes under-use information buried in the center of very long contexts. Liu et al. (Lost in the Middle, 2023) showed U-shaped recall: content at the start and end of long prompts is used more reliably than material in the middle. Put critical evidence and citation rules near the start or end, not in the middle of an 80k paste. Repeating the highest-priority constraint right before the user question is a common production pattern.
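One way to act on this placement advice is to fix the section order in code. The sketch below is an assumption about layout, not a documented best practice of any provider; `lay_out_prompt` and its parameters are hypothetical names.

```python
def lay_out_prompt(system_rules: str, older_history: list[str], evidence: list[str],
                   key_constraint: str, user_question: str) -> str:
    """Keep critical material at the edges of the context, filler in the middle."""
    sections = [
        system_rules,                    # start of the context
        *older_history,                  # lower-priority material sits in the middle
        *evidence,                       # fresh evidence near the end
        f"Reminder: {key_constraint}",   # top constraint repeated before the question
        user_question,                   # end of the context
    ]
    return "\n\n".join(s for s in sections if s)


prompt = lay_out_prompt(
    system_rules="Cite a source for every claim.",
    older_history=["(summary of earlier turns)"],
    evidence=["Policy chunk A", "Policy chunk B"],
    key_constraint="cite a source for every claim.",
    user_question="Can I get a refund after 30 days?",
)
```

Whether evidence works better near the start or near the end is worth measuring on your own tasks; this sketch picks the end only because it keeps evidence next to the question.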

History strategies

Sliding window: keep the last N turns or last K tokens. Simple, predictable. Loses long-range thread context.
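A token-based sliding window can be sketched as follows. The whitespace word count is a stand-in for a real tokenizer and is only an assumption for the example; pass your provider's counter as `count_tokens` in practice.

```python
def sliding_window(turns: list[str], max_tokens: int,
                   count_tokens=lambda text: len(text.split())) -> list[str]:
    """Keep the most recent turns whose combined size fits in max_tokens."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):          # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > max_tokens:
            break                         # stop at the first turn that does not fit
        kept.append(turn)
        used += cost
    kept.reverse()                        # restore chronological order
    return kept


recent = sliding_window(["a b c", "d e", "f g h i", "j"], max_tokens=5)
```

Stopping at the first turn that does not fit keeps the window contiguous, so the model never sees a conversation with a hole in it.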

Summarization: periodically compress older turns into a paragraph stored alongside or injected as a system note. Cheaper than full history but summary errors compound. Re-summarize from source messages when possible.

Structured memory: extract facts ("user prefers metric units") into a small key-value block instead of replaying chat. Good for preferences; bad if you need exact quotes.

Thread forking: start a new sub-session for a tangent so it does not inherit the main thread's baggage. Useful in support when the user pivots topics.

Using the same LLM to summarize history saves tokens but can drop nuance the main turn needed. Keep summaries short, factual, and labeled as unreliable for citations. Some teams use a smaller cheaper model for summarization with a strict bullet template.

RAG and tool output in the budget

Retrieval should be budget-aware: retrieve more candidates than you inject, rerank, then pack greedily by score until you hit a token cap. Hard-cap chunk size at index time so one document cannot eat the window.
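The rerank-then-pack step might look like the sketch below. `Chunk` and `pack_chunks` are hypothetical names, and the scores and token counts are assumed to come from your own reranker and index.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float   # reranker score, higher is better (assumed precomputed)
    tokens: int    # token count, assumed measured at index time


def pack_chunks(candidates: list[Chunk], token_cap: int) -> list[Chunk]:
    """Greedy packing: take chunks in score order, skipping any that would overflow."""
    packed: list[Chunk] = []
    used = 0
    for chunk in sorted(candidates, key=lambda c: c.score, reverse=True):
        if used + chunk.tokens <= token_cap:
            packed.append(chunk)
            used += chunk.tokens
    return packed


candidates = [
    Chunk("A", 0.9, 400), Chunk("B", 0.8, 700),
    Chunk("C", 0.7, 300), Chunk("D", 0.6, 200),
]
packed = pack_chunks(candidates, token_cap=1_000)
```

This variant skips a chunk that does not fit and keeps trying smaller ones; stopping at the first miss is an equally defensible policy if you want strict score order.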

Tool outputs (JSON blobs, stack traces) blow up fast. Truncate them with an explicit marker, store the full payload server-side, and pass an ID the model can use to request it again. Log what was trimmed so debugging is possible.
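A minimal sketch of this trim-and-store pattern follows. It measures characters rather than tokens for simplicity, and the in-memory `PAYLOAD_STORE` is a hypothetical stand-in for real server-side storage.

```python
import hashlib
import logging

logger = logging.getLogger("context.tools")
PAYLOAD_STORE: dict[str, str] = {}   # hypothetical stand-in for server-side storage


def trim_tool_output(raw: str, max_chars: int) -> str:
    """Return raw unchanged if small, else a truncated view plus a retrieval ID."""
    if len(raw) <= max_chars:
        return raw
    payload_id = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]
    PAYLOAD_STORE[payload_id] = raw            # keep the full payload outside the prompt
    dropped = len(raw) - max_chars
    logger.info("trimmed tool output %s: dropped %d chars", payload_id, dropped)
    return f"{raw[:max_chars]}... [truncated {dropped} chars; full payload id={payload_id}]"


short_view = trim_tool_output("x" * 50, max_chars=10)
```

The ID in the marker only helps if the model also has a tool for fetching a payload by ID, which is left out of this sketch.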

Deduplicate: eight chunks that say the same policy waste space and confuse attention.
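The simplest form of deduplication is exact matching after normalizing whitespace and case, sketched below. It will not catch paraphrased duplicates; those would need a similarity threshold, which this example does not attempt.

```python
import re


def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Drop chunks that are identical after whitespace and case normalization."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        key = re.sub(r"\s+", " ", chunk).strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)       # keep the first occurrence, original text
    return unique


unique = dedupe_chunks([
    "Refunds are available within 30 days.",
    "refunds are  available within 30 days. ",
    "Shipping takes 5 business days.",
])
```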

Long-context marketing vs your workload

128k or 1M context windows do not mean you should fill them. Attention quality, prefill cost, and latency often degrade before you hit the hard limit. Benchmark your own tasks: sometimes retrieval plus 8k context beats dumping whole PDFs.

Legal and compliance reviews may require retaining full transcripts server-side while sending only summaries to the model. The window is an inference convenience, not your system of record.

When vendors raise limits, revisit architecture. Problems you solved with aggressive summarization might accept more evidence, but re-measure before deleting compression logic.

Testing overflow behavior

Write tests that deliberately exceed budget. Does your assembler drop oldest turns or fail the request? Does the API error or silently truncate? Document the behavior and make it explicit in UX ("conversation too long, start new thread").
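Overflow tests can be very small. The sketch below pairs a toy assembler with two tests that pin down its behavior; `assemble`, `ContextOverflow`, and the word-count tokenizer are all hypothetical and stand in for your own assembly code.

```python
class ContextOverflow(Exception):
    """Raised when even the protected content cannot fit."""


def assemble(system: str, turns: list[str], budget: int,
             count=lambda text: len(text.split())) -> list[str]:
    """Drop oldest turns until the prompt fits; fail loudly if system alone is too big."""
    if count(system) > budget:
        raise ContextOverflow("system prompt alone exceeds the input budget")
    kept = list(turns)
    while kept and count(system) + sum(count(t) for t in kept) > budget:
        kept.pop(0)                     # oldest turn goes first
    return [system, *kept]


def test_drops_oldest_turns_first():
    out = assemble("be brief", ["old old old", "mid mid", "new"], budget=5)
    assert out == ["be brief", "mid mid", "new"]


def test_fails_loudly_when_system_does_not_fit():
    try:
        assemble("a b c d", [], budget=3)
    except ContextOverflow:
        return
    raise AssertionError("expected ContextOverflow")
```

The point of the tests is documentation as much as safety: they record which end gets dropped and when the request fails instead of shrinking.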

Simulate fat tool payloads and huge RAG injections in staging. Production will eventually hit them via one verbose API partner.

Alert when p95 input tokens cross thresholds. Gradual creep is how features become slow and expensive without anyone filing a ticket.

Multi-turn features and session design

Chat products need a session policy documented alongside the budget: max turns, when to suggest "new conversation," what state lives server-side vs in context. Users should not have to know about tokens; the product should gracefully reset when continuity is no longer affordable.

Carry forward structured state outside the model when possible: current ticket ID, cart contents, user tier. Re-inject as compact facts instead of replaying the entire negotiation.

Cross-session memory products are just compression strategies with branding. Evaluate them with the same skepticism as summarization: what was lost, and does it matter for your task?

Worked example: support thread overflow

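Imagine a 32k window with 3k of system prompt and tools and 4k of reserved output, which leaves 25k for history, retrieval, and the user message (ignoring the safety margin for simplicity). After twenty turns the history alone is 18k. Retrieval wants 10k of policy chunks. At 28k, you are 3k over.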

Reasonable pack order: keep system, keep latest user message, inject top 6k of RAG by score, summarize turns 1–15 into 1k bullets, keep turns 16–20 verbatim. Drop optional few-shot. Log that summarization ran so support knows quotes from early turns may be lossy.
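The arithmetic of this example, written out as a sketch. The figures are the ones from the scenario; the size of turns 16–20 is not given, so the code only computes how much room they have.

```python
window = 32_000
system_and_tools = 3_000
reserved_output = 4_000

# Room left for history, retrieval, and the user message.
available = window - system_and_tools - reserved_output

history = 18_000
rag_wanted = 10_000
over_by = history + rag_wanted - available

# After the pack order: 6k of RAG kept, turns 1-15 summarized into 1k.
rag_kept = 6_000
summary = 1_000
room_for_recent = available - rag_kept - summary   # turns 16-20 plus the latest user message

print(available, over_by, room_for_recent)
```

As long as the verbatim recent turns and the latest user message fit in `room_for_recent`, the request goes out without touching the system prompt.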

This is not theoretical. Without an assembler policy, the client SDK may drop the oldest messages including the only copy of the refund policy you injected six turns ago.

Output headroom and truncation

Reserving too little output space truncates JSON and code mid-stream. If structured answers need 800 tokens, do not reserve 256 because "answers are usually short." A truncated JSON object fails parsing outright, while truncated prose is often still usable.

Dynamic reservation helps: classification tasks get 100 tokens, draft tasks get 2k. The assembler picks max_tokens from task metadata, not a global constant.
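A lookup table is often enough for this. In the sketch below the task names and the 800-token default are assumptions chosen to match the figures in this lesson, not recommended values.

```python
# Hypothetical per-task output reservations, in tokens.
OUTPUT_RESERVATION = {
    "classify": 100,
    "draft": 2_000,
}
DEFAULT_RESERVATION = 800   # assumed fallback for structured answers


def max_tokens_for(task_type: str) -> int:
    """Pick the output reservation from task metadata instead of a global constant."""
    return OUTPUT_RESERVATION.get(task_type, DEFAULT_RESERVATION)
```

The returned value feeds both the API call's `max_tokens` and the `reserved_output` term of the budget equation, so the two can never drift apart.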

When truncation happens, detect it (finish_reason, incomplete brackets) and retry with higher ceiling or a decomposed task. Do not feed half-JSON to business logic.
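Detection and retry can be sketched as below. `call_model` is a hypothetical callable returning `(text, finish_reason)`; the truncation values checked are `"length"` (the OpenAI Chat Completions `finish_reason`) and `"max_tokens"` (the Anthropic `stop_reason`). Other providers may use different strings.

```python
import json

TRUNCATION_REASONS = {"length", "max_tokens"}


def complete_json(call_model, max_tokens: int, ceiling: int) -> dict:
    """Retry with a doubled output ceiling until the answer is not truncated."""
    while max_tokens <= ceiling:
        text, finish_reason = call_model(max_tokens)
        if finish_reason not in TRUNCATION_REASONS:
            # json.loads raises on malformed output, so half-JSON never reaches callers.
            return json.loads(text)
        max_tokens *= 2
    raise RuntimeError("output still truncated at the ceiling; decompose the task")
```

Doubling is one arbitrary escalation policy. The hard ceiling matters more: without it, a prompt that always produces oversized output would retry indefinitely.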

Budget reviews should include output reservation, not only input. Teams obsessed with trimming RAG while starving max_tokens see mysterious JSON failures that look like model stupidity but are simple math.

Prompt caching and prefix economics

Long static prefixes (system prompt, tool schemas, few-shot library) repeat on every request in a session. Many providers let you cache identical prefix tokens so later calls skip full prefill cost on that block. Anthropic calls this prompt caching and uses explicit cache breakpoints; OpenAI applies caching automatically to matching input prefixes. Self-hosted stacks (vLLM, SGLang) cache KV blocks for shared prefixes across requests.

What gets cached: the token sequence from the start of the prompt up to a cache breakpoint, which you set explicitly on some providers and which is inferred from the longest matching prefix on others. Everything after the breakpoint is fresh each turn (user message, new RAG chunks, tool results). Structure assembly as stable_prefix || volatile_suffix. Put timestamps, session IDs, and live retrieval after the breakpoint, not inside the cached block.
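The stable_prefix || volatile_suffix split can be sketched as a single function. `build_prompt` and its parameters are hypothetical, and how the two halves map onto a provider's request format (and where the breakpoint is declared) is left out.

```python
import json


def build_prompt(system: str, tools: list[dict], shots: list[str],
                 retrieved: list[str], user_message: str, now_iso: str) -> tuple[str, str]:
    """Return (stable_prefix, volatile_suffix); only the suffix may change per request."""
    # Sort tools by name and serialize with sorted keys so the prefix is byte-stable
    # even if callers pass the tool list in a different order.
    tool_block = json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True)
    stable_prefix = "\n\n".join([system, tool_block, *shots])
    # Timestamps, retrieval, and the user message all live after the breakpoint.
    volatile_suffix = "\n\n".join([f"Current time: {now_iso}", *retrieved, user_message])
    return stable_prefix, volatile_suffix
```

A cheap regression guard is to hash `stable_prefix` in tests and fail the build when the hash changes without an intentional prompt edit.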

Cost model (API, illustrative): uncached input tokens bill at the listed input rate. Cached input tokens bill at a reduced rate (often roughly 10% of uncached, provider-specific). You still pay for the cache write on the first request and whenever the prefix changes, and some providers bill writes above the normal input rate. Example: 8k-token system+tools cached, 4k fresh user+RAG per request at 1M requests/month:

Without caching: 12k input tokens × 1M = 12B input tokens at full rate.
With caching: 8k written once per session (or TTL), then 8k cached reads at discount + 4k uncached each request.

On chat products with many turns, caching the static prefix often cuts input cost more than trimming a few RAG chunks. Accidentally invalidating cache (dynamic date in system text, unordered tool list) is a common silent regression. Instrument cache hit rate if your provider exposes it. Deeper serving math lives in the Serving & Economics course.

Worked example: one RAG + tools request

32k window, support bot with search tool. Budget sketch:

Layer	Tokens	Notes
System + output contract	1,200	Cacheable prefix
Tool definitions (search, ticket_update)	2,400	Cacheable if stable
Few-shot (3 examples)	900	Cacheable if static
RAG (6 chunks, capped)	5,500	Volatile; after cache breakpoint
Chat history (last 4 turns)	3,800	Volatile; summarize older
Current user message	600	Volatile
Input subtotal	14,400
Reserved output	2,048	JSON answer + citations
Safety margin	500	Tokenizer mismatch buffer
Window used	16,948	~53% of 32k; room to grow history

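With these numbers the request has about 15k tokens of headroom, so history can grow from 3,800 to roughly 18.8k before the window overflows. When it does, the drop order is: summarize turns older than the last four, drop the lowest-scoring RAG chunks, then trim few-shot examples before touching system or tools. Re-count after every assembler change.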
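The table's totals can be re-derived in code, which is also a convenient way to keep a budget sketch honest as layers change. The layer names below are shorthand for the table rows, and 32k is taken as 32,000 tokens.

```python
WINDOW = 32_000

input_layers = {
    "system_and_output_contract": 1_200,
    "tool_definitions": 2_400,
    "few_shot": 900,
    "rag": 5_500,
    "chat_history": 3_800,
    "user_message": 600,
}
reserved_output = 2_048
safety_margin = 500

input_subtotal = sum(input_layers.values())
window_used = input_subtotal + reserved_output + safety_margin
headroom = WINDOW - window_used
percent_used = round(100 * window_used / WINDOW)

print(input_subtotal, window_used, headroom, percent_used)
```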

Counting tokens

You cannot budget without measurement. Use the provider tokenizer (tiktoken, etc.) in your assembly code. Count before send. Track p50/p95 input sizes in metrics. Spikes often mean a tool started returning huge JSON or someone appended a novel to the system prompt.
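A minimal sketch of counting and tracking follows. The `encode` parameter is where a provider tokenizer would plug in (for example the `encode` method of a tiktoken encoding); the four-characters-per-token fallback is a rough assumption and should not be trusted for budgeting.

```python
import statistics


def count_tokens(text: str, encode=None) -> int:
    """Count tokens with the given encoder, or fall back to a crude estimate."""
    if encode is not None:
        return len(encode(text))
    return max(1, len(text) // 4)     # rough heuristic, not a real tokenizer


def p50_p95(samples: list[int]) -> tuple[float, float]:
    """Median and 95th percentile of observed input sizes."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[49], cuts[94]         # 99 cut points: index 49 is p50, 94 is p95


sizes = list(range(1, 101))           # stand-in for per-request input token counts
p50, p95 = p50_p95(sizes)
```

In production these numbers would normally come from your metrics system; the function only shows which two percentiles to watch.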

Prompt caching (where supported) changes economics: static prefixes are cheaper on repeat. Structure prompts as stable prefix + volatile suffix to maximize cache hits.

A refactor that accidentally moves dynamic timestamps into the static prefix can zero out savings overnight. Treat cache-friendly layout as a performance requirement, not an optimization pass.

Assembler ownership

One module should own context assembly: inputs in, token counts out, drop policy applied. Scattershot trimming in route handlers guarantees inconsistent behavior across features.
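The shape of such a module might look like the sketch below. `Layer`, `Assembly`, and `assemble` are hypothetical names, and the policy shown (drop whole layers, lowest priority first) is deliberately coarse; a real assembler would summarize or trim inside a layer before discarding it.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    tokens: int
    priority: int            # lower number = protected first
    droppable: bool = True


@dataclass
class Assembly:
    kept: list[Layer]
    dropped: list[str]
    total_tokens: int


def assemble(layers: list[Layer], input_budget: int) -> Assembly:
    """Apply one drop policy in one place and report what happened."""
    kept = sorted(layers, key=lambda layer: layer.priority)
    dropped: list[str] = []
    total = sum(layer.tokens for layer in kept)
    for layer in reversed(list(kept)):       # start from the lowest-priority layer
        if total <= input_budget:
            break
        if layer.droppable:
            kept.remove(layer)
            dropped.append(layer.name)
            total -= layer.tokens
    if total > input_budget:
        raise ValueError("protected layers alone exceed the input budget")
    return Assembly(kept, dropped, total)
```

Returning the dropped layer names alongside the token total gives callers something to log, which is what makes trimming debuggable later.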

Document the assembler contract for feature teams: what they can pass, what gets summarized, hard caps per layer. Without that API, every squad reinvents truncation badly.

Run tabletop exercises: "RAG doubles chunk size next release; what breaks?" Budgeting is capacity planning for tokens the same way SREs plan CPU. Assume growth before the pager fires.

Publish internal docs with example budgets per feature template: support bot, codegen assist, summarizer. New features clone a template instead of guessing token headroom from a blog post about 128k windows.

Reconcile token metrics with finance monthly. Input growth without traffic growth is almost always a context assembly bug or a feature silently stuffing more history.

When in doubt, measure twice in staging with production-sized transcripts before trusting a new window limit in marketing copy.

Token debt compounds quietly; review budgets when retrieval or history features ship.

Engineering reality

Prefill scales with input. Every input token must be processed before the first output token appears, so time-to-first-token grows with context length even when the answer is short. Long contexts are not free speed-wise.

Silent truncation bugs. If your SDK drops the start of history, the model may forget system constraints that were only in early turns you thought were preserved. Test overflow behavior explicitly.

Cost alerts. Set per-feature token ceilings. A runaway agent loop that appends full web pages to history can burn budget in minutes.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

How do you compute input_budget from window_max and max_tokens?
What is typically dropped first when context overflows?
What is the "lost in the middle" problem?
How does prompt caching change the cost of a static system prefix?
Compare sliding window vs summarization for chat history.
Why should RAG packing be score-aware and capped?
