Lesson 01

What is context engineering?

A model does not "remember" your product. It reads whatever tokens you send this request, predicts the next ones, and stops. Context engineering is the discipline of choosing those tokens on purpose: instructions, examples, retrieved docs, and chat history, all competing for the same finite window.

The one idea

Every LLM call is conditioned on a single string of context. System prompt, developer rules, few-shot examples, RAG chunks, tool results, and the user's latest message are not separate channels. They are one budget. Context engineering is how you allocate it.

From "prompt" to "context"

People still say "prompt engineering" because the first wave of LLM apps was mostly one-shot: you typed a paragraph into a chat box and hoped. That mental model breaks the moment you ship anything real.

A production call usually sends several layers at once:

  • System message: durable rules, persona, output format, safety boundaries.
  • Developer / tool preamble: API-specific instructions, JSON schemas, tool definitions.
  • In-context examples: input/output pairs that teach a task without weight updates.
  • Retrieved or injected facts: RAG chunks, database rows, API responses pasted in.
  • Conversation history: prior user and assistant turns, sometimes summarized.
  • The current user message: the thing the person actually asked right now.

The model sees all of this as one continuous prefix before it starts generating. There is no hidden slot where your system prompt "weighs more" unless the training and API treat role tags differently. What matters is order, repetition, salience, and how much room each piece gets.

That is why the better term is context engineering. You are not polishing a magic sentence. You are designing the full input the model conditions on, under a hard size limit, for a task that has to work thousands of times a day.

CONTEXT WINDOW (one budget) System rules Few-shot examples Retrieved evidence Chat history older turns User now latest ask room for output Model generates next tokens here Everything to the left is conditioning. Generation eats what is left.
All inputs share one window. Growing history or stuffing more docs shrinks the space left for the answer.

Conditioning, not programming

An LLM is not executing your instructions like a script interpreter. It is a conditional text model: given this prefix, what tokens are likely next? Instructions work because models were trained on text where roles, rules, and Q&A patterns appear often. Examples work because the model pattern-matches the task from similar pairs it has seen. Retrieved paragraphs work because the answer tokens become more likely when the evidence is sitting right there in the prefix.

That distinction matters when things fail. "I told it not to hallucinate" is not a guarantee. You increased the odds of careful behavior. "I put the answer in the context" is stronger, but only if retrieval actually found the right chunk and it survived truncation.

Teams often treat the system message as a junk drawer: every new rule from every stakeholder gets appended. Six months later you have 3,000 tokens of contradictory instructions and nobody knows which line the model actually follows.

System text is high leverage but not magic. Very long system prompts get partially ignored (lost in the middle), compete with user text, and cost money on every call. Good context engineering keeps system text short, stable, and tested, and moves volatile facts to retrieval or tool results instead.

The pieces you control

Think of context engineering as assembly, not authorship. You are building the prefix from parts your application owns:

Static instructions change rarely: tone, refusal policy, output shape, domain vocabulary. These belong in the system prompt and in version control. When they change, you run evals.

Dynamic evidence changes every request: search results, user profile fields, ticket contents, live prices. This is where RAG, SQL, and tool calls feed the model. The engineering question is not "can the model know this?" but "did we fetch the right fact and place it where the model will use it?"

Ephemeral state is the conversation itself. Multi-turn chat is just prior turns copied back into the window. The model has no memory outside what you resend. Summaries, sliding windows, and "memory" products are all different strategies for compressing that state back into tokens.

Control tokens and formats are the scaffolding: role tags, markdown headings, XML blocks, JSON schema descriptions, "think step by step" cues. Format is part of the task. The model learned from forums, repos, and papers that use specific shapes; matching those shapes helps.

How this differs from fine-tuning

Context engineering adapts behavior at inference time. Fine-tuning (covered in a later track) bakes behavior into weights. The tradeoff is familiar:

  • Context: instant to change, transparent to audit, pays per token, limited by window size, behavior can drift if the model ignores instructions.
  • Fine-tuning: upfront cost, opaque failures, cheaper per call once deployed, can internalize style and format, harder to update when product rules change.

Most products use both layers: fine-tune or pick a base model for general capability, then use context for the facts and rules that change weekly. Context engineering is the part you will touch every sprint.

Engineering reality

Cost and latency scale with context. Providers bill input tokens. A 2,000-token system prompt on 10 million requests is not free. Long prefixes also slow prefill: the model must process every input token before the first output token ships. Shipping a fat prompt because it was easy to write is a common way to double both bill and time-to-first-token.

Evals are the real artifact. The deliverable is not a beautiful prompt doc. It is a test set of real user tasks with expected properties (correct JSON, grounded answer, safe refusal) and scores over time. When you change context assembly, you rerun evals. Prompt tweaks without measurement are folklore.

Prompt changes need regression CI. Treat system prompt edits like API contract changes. Wire a frozen eval suite into CI and block merges when scores drop. The full workflow (fixtures, thresholds, flaky handling) is in Evaluation L05: Regression testing and CI for prompts and harnesses.

Version everything. Store system prompt hash, retrieval config, example set ID, and model version with each logged request. "It worked yesterday" debugging is impossible without that tuple.

Attach a small metadata block to every LLM call in logs and traces. Minimum useful fields:

  • prompt_git_sha or prompt_version: commit or semver of the prompts repo
  • eval_suite_hash: hash of the frozen regression cases that last passed CI
  • model_id: provider + snapshot (e.g. gpt-4.1-2025-04-14)
  • shot_set_id / rag_index_version: which few-shot library and retrieval index assembled this prefix
  • assembler_policy: drop order and budget caps in effect

When support reports a regression, you replay with the same tuple. If behavior differs on a replay, the provider changed. If only new traffic fails, a prompt or index drifted without a matching eval run.

Roles in the API vs roles in your head

Providers expose named roles, but the model still sees a token sequence. The role tag is a hint about how that block was meant to be used: durable policy vs one-off user input vs prior assistant speech. Different models weight roles differently. Some treat system and developer messages as higher trust; others flatten everything into chat turns during fine-tuning.

Practically, you should still separate concerns in your code even if the model blurs them. Build the prefix from functions: render_system(), render_shots(), render_evidence(), render_history(), render_user(). That makes budgets measurable and tests reproducible. When a bug appears, you know which layer grew or changed.

Do not stuff user-controlled strings into the system role to "make the model listen." That merges trust boundaries and opens injection paths. User content stays in user messages or clearly fenced blocks with delimiters the model was told to treat as data, not commands.

Salience, repetition, and position

Instructions at the very beginning and very end of the prefix tend to stick better than rules buried mid-prompt. That is why critical constraints appear twice in some production prompts: once in system, once right before the user ask ("Remember: JSON only."). Repetition costs tokens but buys compliance on brittle formats.

Salience also comes from formatting. Headings, bullets, and ALL CAPS labels change how strongly a constraint reads. You are competing with thousands of tokens of retrieved wiki pages. A single vague line about citations will lose.

None of this is a substitute for evals. Position tricks that work on GPT-4.1 may weaken on the next snapshot. Measure compliance rate on your task set after every model upgrade.

A minimal context assembly checklist

Before shipping a new feature, walk this list:

  1. What must be true on every request (system layer)?
  2. What changes per request (evidence, user text)?
  3. What is the token count of each layer at p95?
  4. What gets dropped first when over budget?
  5. What properties does output must satisfy (parseable, grounded, safe)?
  6. What do you log to replay a failure (prompt versions, retrieval IDs, model)?

If you cannot answer six quickly, you do not have context engineering yet. You have a demo prompt.

Where this course goes

The next lessons zoom into each lever: system prompts and roles, few-shot examples, structured output, budgeting the window, and tool calling. RAG (another course) is one way to fill the evidence slot. Agents (later) wrap tool results and history in a loop. Here we focus on the primitives those systems are built from.

If you take one habit from this lesson: before you write another sentence of instructions, sketch the budget. How many tokens for rules? For examples? For history? For docs? What is left for the answer? Context engineering starts with that arithmetic.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

  • What is the difference between "prompt engineering" and "context engineering"?
  • Name four kinds of content that might share one context window on a single API call.
  • Why does the model not reliably "follow instructions" the way a program follows code?
  • What happens to output room when you add more chat history without changing the window size?
  • When would you change context instead of fine-tuning?

Quick check

  • The entire assembled prefix: system text, examples, retrieved content, history, and the latest user message
  • Only the system prompt; user messages are secondary
  • The model's weights, which update after each user message
  • Because multi-turn chat is illegal on modern APIs
  • Because shipped features combine instructions, examples, retrieval, tools, and history in one budget
  • Because vendors renamed the API field from prompt to context
  • Cheaper requests because the model needs fewer output tokens
  • Higher input cost, slower prefill, and less space left for history, evidence, and output
  • The model automatically gets a larger output allowance
  • Running LoRA on support tickets to teach tone
  • Inserting retrieved policy paragraphs into the prompt before each answer
  • Deploying a larger base model with more parameters
  • Only in the middle of the retrieved chunks block
  • In the system prompt at the start, with a short reminder near the user message at the end
  • Only in the final user message so they feel freshest