Learn/Prompting & Context Engineering/Lesson 03

Lesson 03

Few-shot prompting and in-context learning

You can teach a task without gradient updates by showing input/output pairs in the prompt. That is in-context learning. It looks like magic in demos and like a brittle dataset in production. This lesson is about making it closer to the latter on purpose.

The one idea

Few-shot examples are temporary training data. The model infers a pattern from the pairs you show and continues it. Quality, format consistency, and label balance matter as much as they would in a real dataset.

Zero-shot, one-shot, few-shot

Zero-shot: instructions only. "Classify sentiment as positive, negative, or neutral." Works when the task is common in pretraining or the instruction is very clear.

One-shot: one solved example before the real input. Helps disambiguate output format.

Few-shot: several examples, usually 3–8 for structured tasks. Teaches edge cases, label boundaries, and formatting by demonstration.

None of this updates weights. On the next API call, if you omit the examples, the lesson is gone. That is the tradeoff: flexible, reversible, but paid in tokens every time.

Consistent delimiters and labels across examples teach the pattern. The model continues the sequence on the last line.

What in-context learning is doing

During pretraining and instruction tuning, models see billions of patterns: FAQs, exams, code repos with repeated function shapes, forums where people quote and answer. At inference, a few examples steer the model toward a local pattern: "this chunk of text behaves like a classification template."

Attention lets later tokens look back at earlier ones. The examples are not stored as rules; they are evidence in the prefix. That is why a wrong label in one shot can poison the batch, and why near-duplicate examples waste tokens without adding information.

Strong models need fewer shots for common tasks. Niche enterprise tasks (your internal ticket taxonomy) often need more shots or fine-tuning despite impressive zero-shot marketing.

Picking and ordering examples

Treat shot selection like curating a validation set:

Cover the label space. If one class never appears, the model under-predicts it.
Include hard negatives. Pairs that look similar but differ on the decision boundary.
Match production format. Same JSON keys, same language, same noise (typos if users typo).
Keep length representative. If real inputs are 400 tokens, a 10-token demo misleads.

Order effects are real. Models can favor the last example (recency) or the majority label in the shots. Shuffle or rotate shots across requests when bias matters. For classification, balance classes in the prompt even if production is imbalanced.

Dynamic few-shot via retrieval

Static shots are simple but generic. Production systems often embed the live user input, search a labeled example store, and inject the top-k nearest neighbors into the prompt. That is few-shot plus retrieval: same in-context learning mechanism, but the teaching set changes per request.

A minimal pipeline:

Embed the current task text (ticket body, query, document snippet).
Search a vector index of labeled (input, output) pairs you curated offline.
Rerank by similarity and label diversity (avoid five near-duplicates).
Render shots with the same delimiter template as static few-shot.
Cap total shot tokens before injecting (see below).

It helps on heterogeneous tasks where no five static examples cover the tail. Failure modes: bad neighbors teach the wrong pattern, stale labels in the store propagate, embedding drift after you change models, and added retrieval latency. Log which example IDs were used per request so you can debug mispredictions.

Before adding shots, count them. Rough arithmetic for a classifier:

System prompt: 800 tokens
Per shot (input + output + labels): ~180 tokens
5 shots: 900 tokens
Live user input: 400 tokens
Shot subtotal: 2,100 input tokens before RAG or history

At 2M requests/month and $2/M input tokens, 900 shot tokens alone is about $3,600/month. Dynamic retrieval that always injects 8 long examples can cost more than a small fine-tune. Run the ablation: accuracy gain per 1k shot tokens, and stop when the curve flattens.

Format is part of the lesson

Use explicit delimiters: Input: / Output:, XML tags, or markdown headings. Stick to one pattern. If one example uses JSON and the next uses prose, the model learns ambiguity.

Put few-shot blocks after system instructions and before the live user task. End with the incomplete pair (input present, output blank) or ask clearly in the final user message. Some APIs simulate shots as fake user/assistant turns; that can work well because it matches chat fine-tuning data.

Chat-style few-shot example structure:

User: Classify: "battery dies fast"
Assistant: hardware

User: Classify: "cannot reset password"
Assistant: account

User: Classify: "{live_ticket}"
Assistant:

The trailing assistant turn invites completion. Your harness must strip or hide prior assistant labels from user-visible history if you replay this pattern.

Calibration and confidence patterns

Examples teach not only labels but how cautious to be. If every shot is a crisp classification with no ambiguity, the model may overcommit on muddy real inputs. Include at least one borderline example with the correct cautious label or an explicit UNKNOWN in the output format.

For extraction tasks, show what to do when a field is missing: null, empty string, or omit key. Mixed teaching produces mixed outputs. Align shots with your JSON schema null policy.

Calibration shots are especially important when mistakes are costly: fraud review, medical triage routing, permission checks. The model will imitate the confidence level demonstrated in the prefix.

Anti-patterns in shot libraries

Stale product shots after UI or policy changes teach outdated labels.
Near-duplicate examples that waste tokens without new decision boundaries.
Leaky PII in frozen examples copied into logs and third-party APIs.
Mismatched difficulty where shots are toy sentences but production inputs are long threads.
Hidden chain-of-thought in assistant labels that the model copies into user-visible output.

Review the shot library when metrics drift without a model change. Someone usually edited examples by hand.

One-shot as format anchor

Sometimes one high-quality example is enough to lock output shape when the task is familiar (sentiment, simple extraction). The example should be representative, not trivial: real length, real punctuation, real edge labels.

Pair one-shot with strict system instructions. Redundant format cues (schema in system + matching example) beat either alone for JSON and enum tasks.

Do not confuse one-shot with caching. The example still costs tokens every call unless your provider caches a static prefix that includes it.

Measuring few-shot value

Run ablations: zero-shot vs 3-shot vs 6-shot on the same eval set. Plot accuracy and parse success against input tokens. Stop when the curve flattens. Share results in the team wiki so nobody adds shots by superstition.

Track per-class metrics, not just accuracy. Few-shot often fixes one confused label while harming another. Confusion matrices reveal whether your examples teach the right boundaries.

When dynamic retrieval supplies shots, log which example IDs were used per request. Debugging "why did it classify this wrong?" requires tracing the teaching set.

Few-shot vs fine-tuning decision

Use few-shot when the task changes often, labels are small in count, or you need zero training pipeline. Move toward fine-tuning when you pay the same 2k example tokens on millions of calls, when style must be deeply internalized, or when proprietary format is hard to demonstrate in a handful of pairs.

Hybrid is common: fine-tune for tone and format, few-shot for rare class reminders or tenant-specific vocabulary injected dynamically. The cost math should include engineer time, not only token dollars.

Revisit the decision quarterly. Token prices, context windows, and fine-tuning tooling change fast. A task that was obviously few-shot in 2024 may be cheaper to fine-tune once calls hit eight figures.

How many shots is enough?

There is no universal number. Start with zero-shot plus a strict output schema. If evals miss format or confuse classes, add 3–5 diverse shots. Measure accuracy vs token cost. Doubling shots rarely doubles quality; it always doubles example tokens.

Watch diminishing returns: going from 0→3 shots often helps a lot; 8→12 may not move metrics but eats window. For long outputs per example, even three shots can cost thousands of tokens.

When few-shot stops scaling, consider fine-tuning or a smaller specialist model. Examples in context are expensive forever; weights amortize.

Temperature interacts with few-shot: lower temperatures often help format adherence when examples define a rigid pattern; slightly higher may help classification when labels are ambiguous and shots show nuanced boundaries. Measure both; defaults from chat UIs are not tuned for your task.

Quick patterns that work

For classification, show balanced labels and identical input wrappers. For extraction, show one complete JSON object with nulls where fields are missing. For rewriting, show before/after with the same tone constraints your system prompt demands.

Label your examples consistently in code-generated prompts so translators and PMs cannot accidentally break delimiter structure. Treat the shot template as code, not copy doc.

Engineering reality

Token tax. Eight 200-token examples are 1,600 input tokens on every call. At scale that dominates cost more than a longer system prompt. Cache static shots where providers allow prompt caching; rotate only the dynamic tail.

Label leakage. Never pull few-shot examples from the eval set. You will overestimate accuracy. Hold out a shot library separate from regression tests.

Consistency drift. Human-written examples updated by different people diverge in style. Centralize the shot library, review it like training data, and version it.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What is the difference between zero-shot and few-shot prompting?
Why are few-shot examples "temporary training data"?
How can example order and class balance bias predictions?
When does dynamic few-shot retrieval help, and what can go wrong?
How do you estimate the token cost of N few-shot examples?
What signal tells you to stop adding shots and consider fine-tuning?

Quick check

The model picks up a task pattern from examples in the prompt at inference time
The model fine-tunes its weights after each few-shot prompt
The model permanently stores examples in long-term memory

Under-prediction of 'shipping' because shots are imbalanced
Perfect balance does not matter if instructions are clear
The API automatically reweights classes

Always; static shots are obsolete
When tasks are heterogeneous and similar labeled cases help disambiguate tail queries
When you need lower latency than zero-shot

Format inconsistency confuses the pattern; outputs become unreliable
The model always picks JSON because it is stricter
Only the system prompt matters for format, not the shots