
Lesson 02

Task evals and golden sets

A task eval turns a vague quality bar into a repeatable test. A golden set is the version-controlled collection of those tests you trust enough to block a release. This lesson covers how to build both without fooling yourself.

The one idea

A task eval is fixed input, explicit success criteria, and a grader. A golden set is the curated, maintained bundle of those tasks that represents what "good" means for your product right now.

Anatomy of a task eval

Strip away tooling and a task eval has four parts:

Input. The user message, document set, tool state, or scenario setup.
Success criterion. What must be true for a pass. Be concrete.
Grader. Code, human, or model that applies the criterion.
Metadata. Tags for domain, difficulty, source (synthetic, production, requirement), and owner.

Example for a support bot: input is a ticket thread; success is "resolution status in the summary matches CRM record AND cites ticket ID"; grader is a script that checks CRM plus regex on citation; metadata tags billing, multi-turn, from-incident-2026-04-12.

Vague criteria produce vague evals. "Answer helpfully" is not a test. "Return JSON with fields status and next_step, where status is one of open|pending|closed" is a test.
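The JSON criterion above is concrete enough to grade with a few lines of code. The following is a minimal sketch, not a prescribed implementation: the function name grade_status_json is hypothetical, and requiring next_step to be a non-empty string is an assumption added for illustration.

```python
import json

# Allowed values come from the criterion in the text: open|pending|closed.
ALLOWED_STATUS = {"open", "pending", "closed"}


def grade_status_json(raw_output: str) -> bool:
    """Pass only if the output is a JSON object with a valid status and a next_step."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    status = data.get("status")
    next_step = data.get("next_step")
    # Type checks come first so unhashable values never reach the set lookup.
    # Assumption: next_step must be a non-empty string.
    return (
        isinstance(status, str)
        and status in ALLOWED_STATUS
        and isinstance(next_step, str)
        and next_step.strip() != ""
    )
```

A grader like this returns the same verdict on every run, which is what makes the criterion a test rather than an opinion.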

Same shape for RAG, classifiers, and agents. Only the grader and success definition change.

Start from real failures, not brainstorms

The fastest way to build a useful eval set is to mine what already broke. Production tickets, thumbs-down sessions, escalations, and internal dogfood notes are higher signal than a whiteboard of "representative tasks."

Anthropic's engineering guidance suggests twenty to fifty tasks from actual failures as a strong starting point. That size is enough to catch gross regressions without drowning in maintenance. Each failure you encode becomes both a regression test and documentation of a failure mode you now admit exists.

Requirements written in plain language often become evals for free. "Summarize the last thirty days of tickets without inventing resolution statuses" is a product requirement and a test case. If you only ship the feature, you will discover the violation in production. If you also ship the eval, you discover it in CI.

Binary vs graded scoring

Binary (pass/fail) is the default. Easier to aggregate, harder to game casually, forces clarity. Use it when success is objective: unit tests pass, record updated, JSON schema valid.

Graded (0 to 1 or 1 to 5) fits long tasks where partial credit matters. The risk is subjectivity and noisy trends. A practical compromise is binary decomposition: split "complete the migration" into five yes/no checks (schema preserved, no rows dropped, types cast, finished within iteration budget, user-verifiable summary). You keep clean graders and still see partial progress.
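Binary decomposition can be sketched as a list of named yes/no checks plus a small aggregator. The check names below paraphrase the five migration checks in the paragraph above; the function name and the shape of the result are assumptions for illustration.

```python
from typing import Dict

# Hypothetical names for the five yes/no checks described in the text.
MIGRATION_CHECKS = [
    "schema_preserved",
    "no_rows_dropped",
    "types_cast",
    "within_iteration_budget",
    "summary_verifiable",
]


def score_decomposed(results: Dict[str, bool]) -> dict:
    """Combine binary checks into an overall pass plus a partial-progress fraction."""
    missing = [name for name in MIGRATION_CHECKS if name not in results]
    if missing:
        raise ValueError(f"missing check results: {missing}")
    failed = [name for name in MIGRATION_CHECKS if not results[name]]
    passed_count = len(MIGRATION_CHECKS) - len(failed)
    return {
        "passed": not failed,  # the task passes only if every check passes
        "progress": passed_count / len(MIGRATION_CHECKS),
        "failed_checks": failed,
    }
```

Each check stays a clean code grader, and the progress fraction gives the partial-credit view without a subjective 1 to 5 scale.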

Three grader types

Code graders are first resort. String match, regex, JSON schema, SQL row checks, running unit tests the agent was supposed to write. Fast, cheap, reproducible. Use them whenever correctness is machine-checkable.

LLM-as-judge graders handle open-ended outputs. Covered in depth in lesson 03. Treat them as calibrated instruments, not oracles.

Human graders set ground truth and handle expert domains. Slow, expensive, necessary for calibration and for cases where mistakes are costly. The usual split: humans label a seed set and spot-check edge cases; automation runs volume.

For agents, add trajectory graders: did the system call the right tools in a sensible order with valid arguments, not only whether the final string looked right? Trajectory checks need logs or traces. Lesson 04 covers the instrumentation.

Building a golden set

A golden set is your regression backbone: typically two hundred to five hundred tasks covering your operational envelope, version-controlled, with change control. "Golden" means trusted and stable enough to compare runs over time.

Sources, in order of usefulness:

Production failures (anonymized and redacted).
Near-misses flagged by monitors (high cost, circuit breaker trips, max tokens).
Product requirements and acceptance tests from launch.
Synthetic cases for coverage gaps (permissions, empty retrieval, tool errors).
Safety probes: jailbreak and injection attempts from your threat model, encoded as pass/fail cases (must refuse, must not call write tools).

Synthetic data risks. LLM-generated test cases fill gaps fast but inherit the generator's blind spots. Synthetic sets over-represent fluent, well-formed inputs and under-represent typos, adversarial phrasing, and domain edge cases. Use synthetic cases to cover known holes (empty retrieval, permission denied), not as a substitute for production failures. Label synthetic rows in metadata so you know which slice is imagination vs reality.

Golden traces go one step further: record full successful trajectories (prompts, tool calls, results). Future runs compare against the trace. Deviation flags a regression without invoking a judge. Useful for agents where output text can vary but the path should not.
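One way to compare a run against a golden trace is to reduce both to a sequence of (tool, argument names) steps and report the first point where they diverge. This is a sketch under assumptions: the trace format (a list of dicts with "tool" and "args" keys) is hypothetical, and comparing argument names rather than values is a deliberate loosening so that variable content does not trigger false alarms.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, Tuple[str, ...]]  # (tool name, sorted argument names)


def to_steps(trace: List[dict]) -> List[Step]:
    """Reduce a trace to tool names and argument shapes, ignoring argument values."""
    return [(call["tool"], tuple(sorted(call["args"]))) for call in trace]


def first_divergence(golden: List[dict], run: List[dict]) -> Optional[int]:
    """Return the index of the first differing step, or None if the paths match."""
    g, r = to_steps(golden), to_steps(run)
    for i, (expected, actual) in enumerate(zip(g, r)):
        if expected != actual:
            return i
    if len(g) != len(r):
        # One trace is a strict prefix of the other: missing or extra steps.
        return min(len(g), len(r))
    return None
```

A non-None result flags the regression and tells you where to look in the trace, with no judge call involved.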

Engineering reality

Golden sets rot if nobody owns them. Assign an owner, review diffs like code, and require a sentence in the commit message when adding or removing cases. Deleting a failing test to green the dashboard is the same failure mode as deleting a flaky unit test: you did not fix quality, you hid it. Also version external dependencies: if evals hit live APIs or a moving document corpus, note the snapshot date or pin fixtures. Otherwise you are measuring drift in the world, not regressions in your system.

How big should a golden set be?

Size depends on what you need to detect, not on round numbers alone.

Starting heuristics. Twenty to fifty cases from real failures catch gross regressions during early development (Anthropic's guidance). Two hundred to five hundred cases cover an operational envelope for CI gates once multiple engineers edit prompts. Stratify by failure mode, not just product area: retrieval miss, faithfulness violation, tool hallucination, loop/stuck, injection probe, latency breach. Aim for at least ten cases per high-risk mode so a localized regression does not hide inside a global pass rate.

Statistical significance on small sets. A golden set with forty cases and a drop from 90% to 80% pass rate (four fewer passes) might be signal or noise. On small n, treat point estimates skeptically. Use Wilson score intervals or a simple binomial test before blocking a release on a 2% wobble. Rule of thumb: if the change affects fewer than three cases on a set under one hundred, investigate case-by-case before declaring regression. If it affects ten or more, treat it seriously. Slice by tag so a global 94% does not mask a billing slice at 71%.
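The Wilson score interval mentioned above is short enough to compute without a stats library. The sketch below uses the standard formula with z = 1.96 for an approximate 95% interval; the function name is an assumption.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate; z = 1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= passes <= n:
        raise ValueError("passes must be between 0 and n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


before = wilson_interval(36, 40)  # 36 of 40 cases passing
after = wilson_interval(32, 40)   # four fewer passes
intervals_overlap = after[1] > before[0]
```

For 36 of 40 versus 32 of 40 passes, the intervals come out to roughly 0.77 to 0.96 and 0.65 to 0.90. They overlap heavily, so on a forty-case set a four-case drop is grounds for case-by-case investigation rather than proof of a regression. Overlap is a coarse screen, not a formal test.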

Power analysis is rarely formal in product teams, but the intuition matters: small sets detect large breaks, not subtle ones. Subtle quality drift needs either more cases or online monitoring (lesson 06).

RAG evals: link to the retrieval course

RAG systems need two kinds of metrics: general ones this lesson shares with the broader eval stack, and retrieval-specific ones the RAG course owns. Use both.

From RAG L07 — Evaluating RAG systems: recall@k at each funnel stage, faithfulness vs correctness split, citation quality, abstention when evidence is missing. Store expected source chunks in each case, not just a golden answer string.

From this course: golden set maintenance, LLM-as-judge calibration (lesson 03), CI regression gates (lesson 05), tracing retrieval spans (lesson 04). When RAG L07 says "the judge needs its own eval," that calibration workflow lives here in lesson 03.

Agent trajectory evals

Output-only pass/fail is not enough for agents. A lucky final answer can hide a broken path. Agent evals add trajectory checks on top of outcome checks. See also Agents L06 — Harness failure modes for the failure taxonomy trajectory evals should cover.

Trajectory match. Compare tool-call sequence, argument shapes, and stop reason against a golden trace or allowed variants. Same task, different wording in the final message can still pass if the path is correct.

Tool-call correctness. ToolCallF1: precision and recall on which tools should fire. Argument validity against JSON schema. Wrong tool with a plausible summary is a fail.
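The lesson does not pin down an exact formula for ToolCallF1, so the sketch below assumes the simplest reading: precision and recall over the set of tool names that fired versus the set that should have fired, combined as a harmonic mean. The function name and the empty-set conventions are assumptions.

```python
from typing import Dict, Iterable


def tool_call_f1(expected: Iterable[str], actual: Iterable[str]) -> Dict[str, float]:
    """Set-based precision, recall, and F1 over tool names (order and repeats ignored)."""
    exp, act = set(expected), set(actual)
    true_positives = len(exp & act)
    # Assumed convention: calling nothing when nothing was expected is a perfect score.
    precision = true_positives / len(act) if act else (1.0 if not exp else 0.0)
    recall = true_positives / len(exp) if exp else 1.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Because this version ignores order and repeated calls, it complements rather than replaces trajectory match and step-budget checks.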

Step budget. Assert max iterations, max tool calls, and max tokens per successful task. An agent that passes after twelve redundant searches fails a step-budget grader even if the answer is right.

Trajectory graders need traces (lesson 04). Encode expected_trace or allowed tool sequences in the golden file. Run trajectory checks in CI before expensive LLM judges.

Example for a refund agent. Input: "Refund order #8821 per policy." Pass if: (1) calls get_order then issue_refund in that order, (2) refund amount matches policy field in tool result, (3) completes within six iterations, (4) final message cites order id. Fail if: skips get_order, calls delete_user, or exceeds iteration cap.
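The refund criteria above translate into a code grader roughly like the one below. The run record format is hypothetical: a dict with "calls" (each holding "tool", "args", and "result"), "iterations", and "final_message". The policy_refund_amount field name is also an assumption standing in for the policy field in the tool result.

```python
from typing import List

FORBIDDEN_TOOLS = {"delete_user"}
MAX_ITERATIONS = 6


def grade_refund_run(run: dict, order_id: str = "8821") -> dict:
    """Apply the four pass conditions and the forbidden-tool rule to one agent run."""
    calls: List[dict] = run["calls"]
    names = [c["tool"] for c in calls]

    # (1) get_order must fire before issue_refund (first occurrence of each).
    in_order = (
        "get_order" in names
        and "issue_refund" in names
        and names.index("get_order") < names.index("issue_refund")
    )

    # (2) refund amount must equal the policy amount returned by get_order.
    amount_ok = False
    if in_order:
        order = calls[names.index("get_order")]
        refund = calls[names.index("issue_refund")]
        expected = order["result"].get("policy_refund_amount")  # hypothetical field
        amount_ok = expected is not None and refund["args"].get("amount") == expected

    checks = {
        "order_then_refund": in_order,
        "amount_matches_policy": amount_ok,
        "within_iteration_cap": run["iterations"] <= MAX_ITERATIONS,  # (3)
        "cites_order_id": order_id in run["final_message"],           # (4)
        "no_forbidden_tools": not (FORBIDDEN_TOOLS & set(names)),
    }
    return {"passed": all(checks.values()), "checks": checks}
```

Returning the per-check dictionary alongside the verdict makes a failing case self-explanatory in CI output.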

Maintenance and drift

Two kinds of drift hit evals. Product drift is when user behavior or policies change and old tests no longer represent the job. World drift is when underlying data changes (new SKU names, updated policies) and correct answers move.

Handle product drift with a monthly review: retire obsolete tasks, add new slices, rebalance difficulty. Handle world drift by pinning fixtures for CI and refreshing them on a schedule, or by grading on structure and citations rather than exact strings when content must stay live.

Keep a holdout set nobody tunes against: same format as the golden set, not used for day-to-day development, run weekly or before major releases. If internal pass rate climbs but holdout is flat, you are optimizing to the visible tests.

Rotate in five to ten new production cases per week. Static sets freeze your failures at launch time. Production never stops generating new ones.

Separate what you optimize on from what you use to sanity-check honesty.

Overfitting to your own evals

Goodhart's Law applies internally too. When pass rate becomes the goal, teams cherry-pick prompt tweaks that lift scores on known cases without improving real usage. Signs: climbing internal pass rate, flat production metrics, engineers who know which case IDs fail before running the suite.

Mitigations: holdout set, rotating production imports, trajectory checks (harder to game with a lucky final answer), and occasional blind human audits. Public benchmarks can be gamed even more aggressively; treat them as external context, not proof your product works.

Tagging, slicing, and ownership

A flat pass rate hides localized disasters. Tag every case with product area, language, risk tier, and data source. Report pass rate per tag in CI and in weekly dashboards. A global ninety-four percent can mean billing is perfect and compliance is seventy-one percent.
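Per-tag reporting needs little more than a counter keyed by tag. A minimal sketch, assuming each result is a dict with a "passed" boolean and a "tags" list (a hypothetical record shape):

```python
from collections import defaultdict
from typing import Dict, List


def pass_rate_by_tag(results: List[dict]) -> Dict[str, float]:
    """Return the pass rate for every tag; a case counts once under each of its tags."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [passes, total]
    for result in results:
        for tag in result["tags"]:
            totals[tag][0] += int(result["passed"])
            totals[tag][1] += 1
    return {tag: passes / total for tag, (passes, total) in totals.items()}


def failing_slices(results: List[dict], threshold: float) -> List[str]:
    """Tags whose pass rate falls below the threshold, sorted for stable CI output."""
    rates = pass_rate_by_tag(results)
    return sorted(tag for tag, rate in rates.items() if rate < threshold)
```

Gating CI on failing_slices rather than on the global rate is one way to keep a weak slice from hiding behind a strong average.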

Assign owners per tag. The person who owns "refunds" approves changes to refund cases. Distributed ownership stops the golden set from becoming a junk drawer nobody maintains.

For agents, track auxiliary metrics on the same cases: tool-call count, argument error rate, recovery after injected tool failure. Output-only pass rate misses expensive paths that will not scale in production.

Injected failures and stress cases

Production is not only happy paths. Add eval cases that inject tool timeouts, empty retrieval, permission denials, and malformed user uploads. These belong in the golden set even if they are rare today. They test harness recovery, not model trivia.
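Failure injection can be as simple as wrapping a tool so that it fails on chosen calls. The wrapper below is a sketch, not part of any particular harness: the class name FlakyTool and the idea of keying failures to call numbers are assumptions for illustration.

```python
from typing import Callable, Iterable


class FlakyTool:
    """Wrap a tool function and raise TimeoutError on the chosen call numbers."""

    def __init__(self, fn: Callable, fail_on_calls: Iterable[int]):
        self.fn = fn
        self.fail_on_calls = set(fail_on_calls)  # 1-based call numbers
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise TimeoutError(f"injected timeout on call {self.calls}")
        return self.fn(*args, **kwargs)


def search_docs(query: str) -> list:
    """Stand-in tool for the example; a real case would wrap the real tool."""
    return [f"doc about {query}"]


# The first call times out, later calls succeed: the eval then asserts the
# agent retried or degraded gracefully instead of giving up or looping.
flaky_search = FlakyTool(search_docs, fail_on_calls=[1])
```

The same pattern extends to empty results or permission errors by swapping what the wrapper raises or returns.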

For multi-turn agents, include mid-session redirects: user changes goal, tool returns unexpected schema, or prior context contradicts new data. Single-turn cases will not catch plan abandonment on turn fifteen.

A golden-set case record needs only a handful of fields: id, input, success_criteria, grader_type, tags, source, and an optional expected_trace or expected_sources. Keep inputs as fixtures on disk when they are large. Reference by path so git diffs stay readable.
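The field list above maps naturally onto a small typed record. This is one possible shape, not a required schema: the allowed grader_type values are assumed from the grader types this lesson describes, and the source values from the metadata tags listed earlier.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed vocabularies, drawn from the grader types and sources named in this lesson.
GRADER_TYPES = {"code", "llm_judge", "human", "trajectory"}
SOURCES = {"synthetic", "production", "requirement"}


@dataclass
class EvalCase:
    id: str
    input: str  # inline text, or a path to a fixture file when the input is large
    success_criteria: str
    grader_type: str
    tags: List[str]
    source: str
    expected_trace: Optional[List[str]] = None    # allowed tool sequence, for agents
    expected_sources: Optional[List[str]] = None  # expected source chunks, for RAG

    def __post_init__(self) -> None:
        # Fail fast on typos so a malformed case never silently skips grading.
        if self.grader_type not in GRADER_TYPES:
            raise ValueError(f"unknown grader_type: {self.grader_type}")
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source}")


case = EvalCase(
    id="refund-001",
    input="fixtures/refund-001/thread.json",  # hypothetical fixture path
    success_criteria="calls get_order then issue_refund; cites order id",
    grader_type="trajectory",
    tags=["billing", "multi-turn"],
    source="production",
    expected_trace=["get_order", "issue_refund"],
)
```

Validating at load time means a mistyped grader_type shows up as a broken build rather than as a case that quietly never runs.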

Tool metrics as first-class evals

For agent features, add metrics beside task pass rate: ToolCallF1 (precision and recall on which tools should fire), argument validity rate against JSON schema, redundant call rate, and multi-turn sequence accuracy. A model can pass the final answer while wasting budget on six duplicate searches.

Track average tool calls per successful task across harness versions. A forty percent increase in calls with flat pass rate is a cost regression you should catch in CI alongside quality.
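A cost check of this kind compares average tool calls per successful task between two harness versions. The sketch below assumes each result is a dict with "passed" and "tool_calls" keys, and the 25% threshold is an arbitrary placeholder to tune per product.

```python
from statistics import mean
from typing import List


def avg_calls_per_success(results: List[dict]) -> float:
    """Average tool-call count over passing tasks only."""
    counts = [r["tool_calls"] for r in results if r["passed"]]
    if not counts:
        raise ValueError("no successful tasks to average over")
    return mean(counts)


def call_regression(baseline: List[dict], candidate: List[dict],
                    max_increase: float = 0.25) -> dict:
    """Flag a cost regression when calls per success grow beyond max_increase."""
    base = avg_calls_per_success(baseline)
    cand = avg_calls_per_success(candidate)
    if base == 0:
        # Avoid dividing by zero when the baseline needed no tool calls at all.
        increase = float("inf") if cand > 0 else 0.0
    else:
        increase = (cand - base) / base
    return {
        "baseline": base,
        "candidate": cand,
        "increase": increase,
        "regressed": increase > max_increase,
    }
```

Running this next to the pass-rate gate surfaces the case the text describes: the same quality at a noticeably higher cost.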

Checkpoint

You are ready for the next lesson if you can answer these from memory:

What four parts does every task eval need?
How big should a golden set be at CI-gate maturity, and how do you stratify it?
What three checks define an agent trajectory eval?
What is the difference between a golden set and a holdout set?
When would you use trajectory grading instead of output-only grading?
