Lesson 01

Why evals matter for AI features

You can ship an AI feature without evals. You cannot ship one you can change with confidence, defend in a postmortem, or improve on purpose. This lesson is the case for treating evaluation as part of the product, not a nice-to-have lab exercise.

The one idea

A model benchmark tells you how capable a model is in general. A product eval tells you whether your system does the job users pay for. Those are different questions, and conflating them is how teams get surprised in production.

The demo trap

Every AI feature has a golden path. You pick the example that makes retrieval look sharp, the prompt that produces a crisp summary, the agent run that closes the ticket in four tool calls. Stakeholders clap. You ship.

Then real users show up with messy inputs, stale documents, ambiguous requests, and edge cases you never imagined. The feature still "works" in the sense that it returns text. It just stops doing the job reliably. Without evals, you find out through support tickets, not through a dashboard you already had.

I have seen teams treat a single impressive demo as proof of readiness. The pattern is predictable: launch, a quiet week, then a spike of "the bot made this up" reports. The model did not suddenly get worse. The gap between demo conditions and production conditions was always there. Evals are how you measure that gap before users do.

[Figure: What you tested vs what users send. The demo set (three clean cases) is a small box inside the cloud of production traffic. Evals need coverage across the cloud, not only the small box.]
Demos live in a small corner of the input space. Evals exist to represent the rest.

Model benchmarks are not product evals

MMLU, HumanEval, SWE-bench: useful signals, wrong layer for most product decisions. They measure model capability under controlled conditions. Your feature is almost never "the model alone."

Take a RAG assistant. MMLU does not tell you whether retrieval finds the right policy paragraph. Take a fine-tuned classifier. Leaderboard accuracy on a public dataset does not tell you whether your label schema still matches this quarter's product categories. Take an agent. HumanEval does not tell you whether your harness picks the right tool on turn twelve after a failed API call.

Benchmarks are diagnostic, not sufficient. If you swap models and SWE-bench scores move but your internal task pass rate does not, the bottleneck is your harness, prompts, or tools, not raw model IQ. If both move together, you may have a real upgrade. The point is to run your eval alongside the public number so you know which story you are in.

SWE-bench is a good example of the confusion. It scores the full agent stack: model, scaffold, tools, environment. Two teams can cite the same benchmark name and mean different systems. A harness eval asks a narrower question: given our prompts, our tools, and our retry policy, do our tasks pass?

What you are actually shipping

An AI feature in production is a system, not a weight file. Typical ingredients:

  • Model choice and inference settings (temperature, max tokens).
  • Prompts and templates that frame the task.
  • Retrieval, tools, or APIs that supply facts or take actions.
  • Harness logic: iteration limits, retries, context compression, guardrails.
  • Post-processing: JSON parsing, citation formatting, moderation filters.

Change any layer and behavior changes, even with frozen model weights. I once watched a team drop pass rate by eight percent by rephrasing a single system-prompt sentence. No model update. No tool change. Just prose. That is a harness regression, and only a harness-level eval catches it.

Engineering reality

Production AI quality is multiplicative, not additive. Strong model + weak retrieval + sloppy prompt still fails. Evals should cover the layers that actually touch users: did we retrieve the right evidence, call the right tool, stay within budget, and produce an answer the user can trust? If you only score final text fluency, you will miss failures where the answer sounds fine but cites the wrong source or took forty tool calls to get there.
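A rough way to see the multiplicative effect is to multiply per-layer success rates. The sketch below assumes the layers fail independently, which is a simplification (real failures often correlate), and the rates are invented for illustration, not measured.

```python
from math import prod

# Hypothetical per-layer success rates, for illustration only.
layer_success = {
    "retrieval": 0.80,  # right evidence was retrieved
    "prompt": 0.90,     # prompt framed the task correctly
    "model": 0.95,      # model answered correctly given good inputs
}


def end_to_end_success(rates: dict[str, float]) -> float:
    """Estimate end-to-end success, assuming layers fail independently."""
    return prod(rates.values())


estimate = end_to_end_success(layer_success)
print(f"{estimate:.3f}")  # prints 0.684
```

Under these assumed numbers, a model that is right 95% of the time sits inside a system that succeeds roughly 68% of the time, so the weakest layer is the one to measure first.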

Evals across RAG, fine-tuning, and agents

This course sits in Track 2 because evaluation cuts across how you build with models. The metrics differ, but the habit is the same: define success on real tasks, measure repeatedly, and treat regressions as release blockers.

RAG. Split retrieval and generation. A fluent wrong answer often means retrieval missed the chunk, not that the model "hallucinated" randomly. Measure recall at the candidate stage, faithfulness to sources, citation correctness, and latency. The RAG course (L07) covers retrieval-specific metrics; this course gives you the shared eval discipline. Read both: RAG L07 for faithfulness and recall@k; this track for judges, CI, and production loops.
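As a minimal sketch of measuring recall at the candidate stage, the function below computes recall@k for one query. The chunk ids and the labeled relevant set are hypothetical; in practice they would come from your own golden set.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the labeled relevant chunks that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("need at least one labeled relevant id")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


# Hypothetical retrieval result for one query, best match first.
retrieved = ["doc7", "doc2", "doc9", "doc4"]
relevant = {"doc2", "doc4"}

print(recall_at_k(retrieved, relevant, k=3))  # 0.5
print(recall_at_k(retrieved, relevant, k=4))  # 1.0
```

If recall@k is low for a failing case, the generation step never saw the right evidence, so prompt changes alone are unlikely to fix it.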

Fine-tuning. Training loss going down is not a product metric. You need held-out task evals on the behaviors you tuned for, plus regression checks on behaviors you did not intend to change. Catastrophic forgetting shows up as quiet pass-rate drops on old task types while the new slice looks great.
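One way to catch quiet drops on old task types is to compare pass rates per slice rather than in aggregate. The sketch below is illustrative: the slice names, pass rates, and the 0.02 tolerance are all assumptions, not recommended values.

```python
def regressed_slices(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """Return slices whose pass rate fell by more than `tolerance`.

    A slice missing from `candidate` is treated as a pass rate of 0.0.
    """
    return sorted(
        name
        for name, base_rate in baseline.items()
        if candidate.get(name, 0.0) < base_rate - tolerance
    )


# Hypothetical per-slice pass rates before and after fine-tuning.
before = {"refund_intent": 0.92, "shipping_intent": 0.90, "new_category": 0.40}
after = {"refund_intent": 0.91, "shipping_intent": 0.81, "new_category": 0.88}

print(regressed_slices(before, after))  # ['shipping_intent']
```

Here the slice that was tuned for improves sharply while an older slice drops, which an aggregate pass rate could easily hide.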

Agents and harnesses. Measure task completion, tool selection, argument validity, error recovery, multi-turn coherence, and cost per successful task. An agent that succeeds after thirty redundant searches is not the same as one that succeeds in five, even if the final answer matches.
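Cost per successful task can be sketched as total spend divided by the number of successes, so failed runs still count toward cost. The `AgentRun` record and the sample values are hypothetical; real fields would come from your traces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentRun:
    task_id: str
    succeeded: bool
    tool_calls: int
    cost_usd: float


def cost_per_success(runs: list[AgentRun]) -> Optional[float]:
    """Total spend, including failed runs, divided by successful tasks."""
    successes = sum(1 for run in runs if run.succeeded)
    if successes == 0:
        return None
    return sum(run.cost_usd for run in runs) / successes


def mean_tool_calls_on_success(runs: list[AgentRun]) -> Optional[float]:
    """Average number of tool calls across successful runs only."""
    calls = [run.tool_calls for run in runs if run.succeeded]
    return sum(calls) / len(calls) if calls else None


# Hypothetical runs: two successes (one efficient, one wasteful) and one failure.
runs = [
    AgentRun("ticket-1", True, 5, 0.25),
    AgentRun("ticket-2", True, 30, 0.50),
    AgentRun("ticket-3", False, 12, 0.25),
]

print(cost_per_success(runs))            # 0.5
print(mean_tool_calls_on_success(runs))  # 17.5
```

Tracking tool calls alongside success makes the five-call run and the thirty-call run distinguishable even when both final answers match.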

[Figure: Different product shapes, one eval discipline. RAG (retrieval + answer), fine-tuning (held-out behavior), and agents (tools + trajectory) feed shared task evals (success criteria, traces, release gates), which lead to a decision: ship or fix.]
Different architectures, same obligation: measure the user-visible task, not just the model abstractly.

Offline vs online evaluation

Teams mix these terms constantly. Keep them separate.

Offline evals run against a fixed, labeled set you control: golden tasks in CI, holdout suites before release, calibration sets for judges. Inputs are frozen or versioned. You can rerun the same benchmark after a prompt change and compare pass rate. Offline evals answer: "Did this change break known cases?"
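A minimal sketch of that rerun-and-compare loop follows. Everything named here is hypothetical: `GOLDEN_SET`, the substring check, and the two stub systems stand in for your real cases, graders, and feature.

```python
from typing import Callable

# Hypothetical frozen golden set: each case has an explicit pass condition.
GOLDEN_SET = [
    {"id": "refund-01", "input": "Can I return shoes after 40 days?", "must_contain": "30 days"},
    {"id": "ship-01", "input": "How long does shipping take?", "must_contain": "5 business days"},
]


def evaluate(system: Callable[[str], str], cases: list[dict]) -> tuple[float, list[str]]:
    """Run every case through `system`; return (pass rate, failed case ids)."""
    failed = [
        case["id"]
        for case in cases
        if case["must_contain"].lower() not in system(case["input"]).lower()
    ]
    return (len(cases) - len(failed)) / len(cases), failed


def baseline_system(text: str) -> str:
    # Stand-in for the feature before the prompt change.
    return "Returns are accepted within 30 days. Shipping takes 5 business days."


def candidate_system(text: str) -> str:
    # Stand-in for the feature after the prompt change.
    return "Returns are accepted within a month. Shipping takes 5 business days."


print(evaluate(baseline_system, GOLDEN_SET))   # (1.0, [])
print(evaluate(candidate_system, GOLDEN_SET))  # (0.5, ['refund-01'])
```

Returning the failed case ids, not just the rate, is what makes a regression debuggable: the candidate here still sounds fine but no longer states the policy the case requires.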

Online evals measure live traffic: sampled LLM judges on production sessions, A/B tests between prompt variants, shadow traffic that runs a candidate path without showing users the output, thumbs-up/down rates, rephrase-and-retry frequency, downstream outcomes (ticket reopened, refund issued). Online evals answer: "What is happening with real users that our frozen set never imagined?"

Neither replaces the other. Offline without online freezes your failures at launch time. Online without offline gives you anecdotes without a reproducible gate. Healthy teams run offline suites on every risky change and feed online signals back into the offline set weekly (lesson 06).

Once traffic is large enough, you can route a small percentage to prompt or model variants and compare outcome metrics. Multi-armed bandits adapt traffic toward the winning variant. This is online evaluation at scale. It does not remove golden sets: you still need offline regression tests so a bandit does not optimize for a metric you forgot to measure (for example, fluency over faithfulness).
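Routing a small percentage of traffic to a variant can be sketched with deterministic hash bucketing, so the same user always sees the same variant. This is a fixed-split A/B assignment, not a bandit; the function name, variant labels, and the 5% share are assumptions for illustration.

```python
import hashlib


def assign_variant(user_id: str, candidate_share: float = 0.05) -> str:
    """Map a user id to 'candidate' or 'control' with a stable hash bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Turn the first 8 bytes into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "candidate" if bucket < candidate_share else "control"


# The same user id always lands in the same arm.
print(assign_variant("user-123") == assign_variant("user-123"))  # True

# Roughly candidate_share of a large population ends up in the candidate arm.
sample = [assign_variant(f"user-{i}") for i in range(10_000)]
print(sample.count("candidate") / len(sample))
```

Stable assignment keeps a user's sessions comparable over time; outcome metrics for each arm are then compared, with the offline suite still acting as the regression gate.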

When evals become non-optional

Internal prototypes can survive on vibes for a while. You cross the line when:

  • Users depend on the output to make decisions (support, legal review, billing).
  • Failures have a price tag: wrong refunds, bad medical triage, leaked secrets.
  • More than one person can change prompts, tools, or models.
  • You ship more than once a month and need to know what broke.

Before that line, informal spot checks are fine. After it, flying blind is negligence. Not because AI is special, but because AI systems are brittle in boring ways: silent regressions, nondeterministic outputs, and failures that look like success until someone reads the details.

Without evals, every change is a bet. Prompt edits, model upgrades, chunk size tweaks, and tool description rewrites all look low-risk because nothing crashes. Quality drifts until a customer or executive finds a bad example. Then you are debugging under pressure without a baseline. The fix is usually rushed, unmeasured, and followed by the same drift three weeks later.

The minimum viable eval mindset

You do not need a perfect framework on day one. You need a repeatable question: for this input, what does success look like, and how do we check it?

Start with twenty to fifty real tasks drawn from requirements and failures. Make pass/fail explicit. Run them when you change anything that touches behavior. Add tracing so failures are debuggable. Promote production incidents into the set weekly. The rest of this course fills in golden sets, judges, traces, CI gates, and monitoring. The habit starts here: measurement beats intuition once the feature matters.

What to measure at each maturity stage

Eval maturity ramps in layers. Trying to jump to full CI on day one burns people out. Skipping layers leaves holes.

Stage 0: prototype. Ten to twenty manual spot checks when you change prompts. Acceptable only while the audience is you.

Stage 1: private beta. Fifty labeled tasks in a spreadsheet or YAML file. Run by hand before each deploy. Track pass rate in a single number.

Stage 2: shared ownership. Golden set in git, smoke suite in CI, traces on every session. Regressions get case ids in PR comments.

Stage 3: production scale. Sampled online judges, weekly harvest from logs, holdout set, incident playbooks tied to eval additions.

Most teams stall between stage 1 and 2 because eval tooling feels like a side project. Treat the golden file as part of the product repo from the first user who is not you.

Write down the top five ways the feature could embarrass you in public. Turn each into one pass/fail test. Run them once before sharing the demo link. Log session ids when any fail in manual testing. That log becomes the seed of your failure archive.

Evals are how you negotiate with stakeholders

Product wants "smarter." Legal wants "grounded." Finance wants "cheaper." Without shared metrics, every meeting is opinions. A pass rate on a named suite gives you a neutral referee: "We can ship the prompt change when these forty cases still pass, or we document which cases we are accepting regressions on."
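The "still pass, or documented" rule can be sketched as a small gate over case ids. The function name and case ids below are hypothetical; the idea is only that any regression not explicitly accepted blocks the release.

```python
def release_gate(
    baseline_passed: set[str],
    candidate_passed: set[str],
    accepted_regressions: set[str],
) -> tuple[bool, list[str]]:
    """Return (can_ship, undocumented regressions).

    A regression is a case that passed on baseline but fails on the candidate.
    """
    regressions = baseline_passed - candidate_passed
    undocumented = sorted(regressions - accepted_regressions)
    return (not undocumented, undocumented)


# Hypothetical suite results for a prompt change.
baseline = {"case-01", "case-02", "case-03", "case-04"}
candidate = {"case-01", "case-02", "case-04"}

print(release_gate(baseline, candidate, accepted_regressions=set()))
# (False, ['case-03'])
print(release_gate(baseline, candidate, accepted_regressions={"case-03"}))
# (True, [])
```

The accepted set is the written record of the trade-off, which gives the stakeholder conversation something concrete to approve or reject.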

That sounds bureaucratic. It is less bureaucratic than reversing a launch because nobody agreed what "good" meant. Evals also make model vendor conversations honest. If a vendor demo looks great but your suite flatlines, you have data for the renewal call, not just a gut feeling.

Landmark guides

The essay that convinced many production teams to treat evals as product infrastructure, not a research afterthought. Read it after this lesson for the business case and iteration loop.

Take from it: Error analysis from real failures, building eval sets from production pain, and why fast iteration beats model shopping.
It skips: OTel tracing, CI merge gates, Cohen's kappa for judges, and agent trajectory grading. Those are what lessons 02–06 cover.

Official reference

API patterns for running evals at scale: dataset format, grader types, and comparing runs. Useful when you wire automation, not as a substitute for product-specific golden sets.

Take from it: Structured eval datasets, run comparison, and how hosted eval APIs fit a CI pipeline.
It skips: Harness regression, tracing, incident loops, and domain rubric design. Pair with lesson 05 for CI gates.

Checkpoint

You are ready for the next lesson if you can answer these from memory:

  • What is the difference between offline and online evaluation?
  • What is the difference between a model benchmark and a product task eval?
  • Why can harness or prompt changes break a feature without any model update?
  • Name two layers besides the model that evals should cover for a RAG or agent system.
  • At what point do informal demos stop being enough evidence to ship?

Quick check

  • MMLU measures general knowledge, not your product task under your stack
  • The bottleneck is elsewhere in the system, or the new model behaves differently on your tasks
  • Users are wrong; MMLU is authoritative for products
  • You should stop measuring and trust the leaderboard
  • A model capability failure
  • A harness regression
  • Normal noise you should ignore
  • Proof that evals are unnecessary because demos looked fine
  • Whether the final answer sounds fluent
  • Whether retrieval surfaced the expected source chunks
  • Whether the embedding model ranks high on a public leaderboard
  • Only whether latency is under two seconds
  • Early internal experiments with no users depending on correctness
  • A feature that automates refunds
  • A shared prompt multiple engineers edit weekly
  • An assistant used for compliance review