Lesson 05

Regression testing and CI for prompts and harnesses

Prompts are code. Tool descriptions are code. Iteration caps and retry policies are code. CI should treat changes to them like changes to business logic, because they change behavior just as surely.

The one idea

Regression testing compares a candidate harness against a frozen golden set and blocks release when pass rate drops beyond a threshold. Eval results become merge decisions, not slide decks you read after launch.

What counts as a harness change

Teams watch model version bumps and ignore prose edits. That is backwards. Common sources of regression even when model weights are frozen:

System prompt wording or ordering.
Tool names, descriptions, and JSON schemas exposed to the model.
MAX_ITERATIONS, timeouts, token ceilings.
Context compression thresholds and memory injection rules.
Retry logic after tool failures.
Retrieval top-k, filters, or reranker swaps in RAG stacks.

Research on harness ablations found system-prompt swaps among the highest-risk edits: small rephrasings move pass rate, while tool swaps sometimes help. Your CI path should run evals when any of these files change, not only when requirements.txt changes.

The goal is actionable diffs: which case IDs broke, not just a red build.

Golden sets in version control

Store tasks alongside application code: inputs, expected outcomes or trace fixtures, grader config, tags. Pull requests that add or remove cases get reviewed like test changes. Commit messages should say why a case was added ("incident #441: wrong refund status").
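A golden case stored in the repo can be as simple as one JSON object per line. The sketch below is one possible shape, not a standard: the field names (case_id, expected, grader, tags, reason) and the "exact_fields" grader name are assumptions chosen for illustration.

```python
from dataclasses import dataclass
import json


@dataclass(frozen=True)
class GoldenCase:
    case_id: str                  # namespaced id, e.g. "billing-012"
    input: str                    # user message or task prompt
    expected: dict                # structured expectation checked by a grader
    grader: str = "exact_fields"  # hypothetical grader config name
    tags: tuple = ()              # e.g. ("P0", "smoke")
    reason: str = ""              # why the case exists


def load_cases(jsonl_text: str) -> list[GoldenCase]:
    """Parse one JSON object per line and reject duplicate case ids."""
    cases = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        raw = json.loads(line)
        raw["tags"] = tuple(raw.get("tags", ()))
        cases.append(GoldenCase(**raw))
    ids = [c.case_id for c in cases]
    duplicates = {i for i in ids if ids.count(i) > 1}
    if duplicates:
        raise ValueError(f"duplicate case ids: {sorted(duplicates)}")
    return cases


# Illustrative data only; in a repo this would live in a versioned file.
SAMPLE = """
{"case_id": "billing-012", "input": "Where is my refund?", "expected": {"refund_status": "pending"}, "tags": ["P0", "smoke"], "reason": "incident #441: wrong refund status"}
{"case_id": "search-012", "input": "Find invoices from March", "expected": {"tool": "search_invoices"}, "tags": ["smoke"]}
"""
cases = load_cases(SAMPLE)
```

Keeping the reason next to the case means the justification survives even when the commit message is forgotten.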

Golden traces complement golden outputs. Record a known-good tool sequence for agent tasks. CI compares trajectory shape; mismatches fail even if final text looks plausible. Deterministic layers (code graders, trace diff) should run before expensive judges in the pipeline.
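A trajectory-shape comparison can be a plain sequence diff over tool names. The sketch below uses Python's standard difflib; the tool names are made up, and it assumes a trace has already been reduced to an ordered list of tool names.

```python
from difflib import SequenceMatcher


def diff_tool_sequence(golden: list[str], actual: list[str]) -> list[str]:
    """Return readable differences; an empty list means the shapes match."""
    diffs = []
    matcher = SequenceMatcher(a=golden, b=actual, autojunk=False)
    for op, g0, g1, a0, a1 in matcher.get_opcodes():
        if op == "equal":
            continue
        diffs.append(f"{op}: expected {golden[g0:g1]} got {actual[a0:a1]}")
    return diffs


# Hypothetical tool names for a refund task.
golden = ["lookup_order", "check_refund_status", "send_reply"]
actual = ["lookup_order", "send_reply"]  # the agent skipped the status check
trace_diffs = diff_tool_sequence(golden, actual)
```

Here the final reply might still read plausibly, but the diff reports the skipped check_refund_status call, which is the kind of mismatch that should fail the case.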

A workable CI policy:

Smoke suite (10 to 20 cases) on every PR touching prompts, tools, or harness config.
Full golden set on merge to main or nightly.
Block merge if pass rate drops more than two percentage points vs main baseline, or if any P0 tagged case fails.
Post a table of regressed case ids and links to traces.
Require human override with ticket id to bypass.
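The blocking rule can be written as a small pure function over per-case results. This is a minimal sketch: the function and parameter names are assumptions, and results are taken to be a mapping from case id to pass/fail.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Pass rate in percentage points."""
    return 100.0 * sum(results.values()) / len(results)


def merge_gate(baseline, candidate, p0_ids, max_drop_points=2.0, override_ticket=None):
    """Decide whether to block a merge, and report which cases regressed."""
    drop = pass_rate(baseline) - pass_rate(candidate)
    regressed = sorted(
        cid for cid, ok in baseline.items() if ok and not candidate.get(cid, False)
    )
    p0_failed = sorted(cid for cid in p0_ids if not candidate.get(cid, False))
    blocked = drop > max_drop_points or bool(p0_failed)
    overridden = blocked and bool(override_ticket)
    return {
        "blocked": blocked and not overridden,
        "overridden": overridden,
        "drop_points": drop,
        "regressed": regressed,
        "p0_failed": p0_failed,
    }


# Illustrative run: 20 smoke cases, one non-P0 case newly failing.
baseline = {f"case-{i:03d}": True for i in range(20)}
candidate = dict(baseline, **{"case-007": False})
decision = merge_gate(baseline, candidate, p0_ids={"case-001"})
```

On a 20-case suite a single failure is a five-point drop, so this example blocks even though no P0 case failed. Passing an override_ticket records the bypass instead of silently skipping the gate.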

Tiered suites and cost control

Agent evals burn real API dollars. Running five hundred multi-turn tasks per commit does not scale. Tier instead:

Smoke: fast, cheap, representative slice on every PR.
Regression: full golden set on main merges or nightly.
Holdout: weekly or pre-release, not on every PR.

Track dollars per CI run next to pass rate. If eval spend rivals engineering salary, shrink suites, use batch APIs where latency allows, cache deterministic retrieval fixtures, and prefer code graders over judges in the inner loop.
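The tiers map naturally onto CI triggers. The sketch below assumes cases carry "smoke" and "holdout" tags and that the trigger names are whatever your CI system exposes; both are illustrative choices.

```python
# Hypothetical trigger names mapped to the three tiers described above.
SUITE_FOR_TRIGGER = {
    "pull_request": "smoke",
    "merge_to_main": "regression",
    "nightly": "regression",
    "weekly": "holdout",
    "pre_release": "holdout",
}


def select_cases(trigger: str, cases: list[dict]) -> list[dict]:
    """Pick the cases to run for a CI trigger."""
    suite = SUITE_FOR_TRIGGER[trigger]
    if suite == "regression":
        # Full golden set: everything except the holdout slice.
        return [c for c in cases if "holdout" not in c["tags"]]
    return [c for c in cases if suite in c["tags"]]


all_cases = [
    {"case_id": "billing-001", "tags": ["smoke", "P0"]},
    {"case_id": "billing-002", "tags": []},
    {"case_id": "search-001", "tags": ["holdout"]},
]
smoke_ids = [c["case_id"] for c in select_cases("pull_request", all_cases)]
nightly_ids = [c["case_id"] for c in select_cases("nightly", all_cases)]
holdout_ids = [c["case_id"] for c in select_cases("weekly", all_cases)]
```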

Engineering reality

Flaky evals destroy trust faster than no evals. Retry only transient infra failures, not model nondeterminism. For stochastic tasks, use majority vote across three runs or compare distributions against a window, not single-shot equality. Pin model version and temperature in CI. Document expected variance in the README so on-call knows a 1% wobble from sampling is normal but an 8% drop from a prompt edit is not.

Promptfoo in CI

Promptfoo is a practical exemplar for prompt and harness regression in CI. You define eval configs in YAML, point at your prompts or API endpoints, attach graders (code, model-graded, or custom), and run promptfoo eval locally or in GitHub Actions. It diffs pass rates across prompt versions and posts results to PRs.

Promptfoo is not the only option—Braintrust, LangSmith, and homegrown pytest wrappers work too—but its model is a good template: versioned config in git, deterministic smoke on PR, full suite on merge, explicit thresholds. Start with ten to twenty cases before importing five hundred.

Agent evals in CI

Agent pull requests should run trajectory graders (lesson 02) in the smoke suite: tool sequence, argument validity, step budget. Link failures to Agents L06 failure tags so regressions are classifiable. Output-only judges are too slow and too noisy for every commit; trajectory diffs are often deterministic and cheap.
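The three checks named above (tool sequence, argument validity, step budget) are all deterministic. A minimal sketch follows, assuming each step is a dict with "tool" and "args" keys and that argument schemas are simple name-to-type maps; the failure tag strings are hypothetical placeholders, not the Agents L06 taxonomy.

```python
def grade_trajectory(steps, expected_tools, schemas, max_steps):
    """Return a list of failure tags; an empty list means the trajectory passes."""
    failures = []

    # 1. Tool sequence: the ordered tool names must match the golden trace.
    if [s["tool"] for s in steps] != expected_tools:
        failures.append("tool_sequence")

    # 2. Argument validity: every required argument is present with the right type.
    for step in steps:
        schema = schemas.get(step["tool"])
        if schema is None:
            failures.append(f"unknown_tool:{step['tool']}")
            continue
        for name, expected_type in schema.items():
            if not isinstance(step["args"].get(name), expected_type):
                failures.append(f"argument_validity:{step['tool']}.{name}")

    # 3. Step budget: the agent must finish within the allowed number of steps.
    if len(steps) > max_steps:
        failures.append("step_budget")
    return failures


schemas = {"lookup_order": {"order_id": str}, "check_refund_status": {"order_id": str}}
steps = [
    {"tool": "lookup_order", "args": {"order_id": "A1"}},
    {"tool": "check_refund_status", "args": {"order_id": 17}},  # wrong type
]
trajectory_failures = grade_trajectory(
    steps, ["lookup_order", "check_refund_status"], schemas, max_steps=5
)
```

Because nothing here calls a model, the grader costs nothing per run and gives the same answer every time.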

What to diff in the PR report

A red build without context gets ignored. Useful PR comments include:

Aggregate pass rate delta vs baseline.
List of regressed and improved case ids with one-line descriptions.
Median and P95 latency and cost per task delta when agents are in scope.
Links to trace URLs for top three regressions.
Model and prompt version hashes so you know what was compared.
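Most of that comment can be generated from two result maps. The sketch below renders the pass rate delta, regressed and improved ids, trace links for the top three regressions, and version hashes; latency and cost deltas are left out for brevity. The function name, the meta keys, and the example.invalid trace URLs are assumptions.

```python
def render_pr_comment(baseline, candidate, descriptions, trace_urls, meta):
    """Build a plain-text PR comment from per-case pass/fail maps."""
    def rate(results):
        return 100.0 * sum(results.values()) / len(results)

    regressed = sorted(c for c in baseline if baseline[c] and not candidate.get(c, False))
    improved = sorted(c for c in baseline if not baseline[c] and candidate.get(c, False))
    delta = rate(candidate) - rate(baseline)

    lines = [
        f"Pass rate: {rate(baseline):.1f}% -> {rate(candidate):.1f}% ({delta:+.1f} points)",
        f"Compared: model={meta['model_id']} prompt={meta['prompt_hash']} "
        f"baseline={meta['baseline_sha']}",
        f"Regressed ({len(regressed)}):",
    ]
    for i, cid in enumerate(regressed):
        link = f" {trace_urls[cid]}" if i < 3 and cid in trace_urls else ""
        lines.append(f"  {cid}: {descriptions.get(cid, '')}{link}")
    lines.append(f"Improved ({len(improved)}):")
    lines += [f"  {cid}: {descriptions.get(cid, '')}" for cid in improved]
    return "\n".join(lines)


baseline = {"billing-001": True, "billing-002": True, "search-001": False, "search-002": True}
candidate = {"billing-001": True, "billing-002": False, "search-001": True, "search-002": True}
comment = render_pr_comment(
    baseline,
    candidate,
    descriptions={"billing-002": "refund status lookup", "search-001": "invoice search"},
    trace_urls={"billing-002": "https://example.invalid/traces/billing-002"},
    meta={"model_id": "model-x", "prompt_hash": "abc123", "baseline_sha": "deadbee"},
)
```

Note that the aggregate here is flat (one regression, one improvement), which is exactly why the per-case lists matter more than the headline number.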

Tools like Promptfoo, Braintrust, LangSmith, and DeepEval (pytest-style) integrate with GitHub Actions for this workflow. Pick one that fits your stack; the habit matters more than the logo.

Ablations and parameter tuning

When tuning harness parameters, change one knob at a time against the same golden set. System prompt first (high impact), then tool descriptions, then iteration caps, then compression, then retries, then memory strategy. Record pass rate, cost per success, and tool-call counts per variant. A harness that spends forty percent more tool calls per task is a cost regression even if pass rate looks flat or slightly better.
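Recording the same three numbers for every variant makes cost regressions visible. A minimal sketch with made-up figures follows; the 25 percent tolerance on tool calls is an assumed threshold, not a recommendation from this lesson.

```python
from dataclasses import dataclass


@dataclass
class VariantResult:
    name: str
    passed: int
    total: int
    cost_usd: float
    tool_calls: int

    @property
    def pass_rate(self) -> float:
        return 100.0 * self.passed / self.total

    @property
    def cost_per_success(self) -> float:
        return self.cost_usd / self.passed if self.passed else float("inf")

    @property
    def tool_calls_per_task(self) -> float:
        return self.tool_calls / self.total


def is_cost_regression(baseline, variant, max_tool_call_increase=0.25):
    """Flag a variant whose tool calls per task grew beyond the tolerance."""
    increase = variant.tool_calls_per_task / baseline.tool_calls_per_task - 1.0
    return increase > max_tool_call_increase


# Illustrative numbers: same pass rate, forty percent more tool calls.
base = VariantResult("baseline", passed=45, total=50, cost_usd=5.00, tool_calls=200)
variant = VariantResult("new-system-prompt", passed=45, total=50, cost_usd=7.00, tool_calls=280)
flagged = is_cost_regression(base, variant)
```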

When to update the golden set vs fix the product

Not every failure means revert. Sometimes the product intentionally changes behavior and tests must update. The rule: if behavior change was deliberate and documented, update golden expectations in the same PR with reviewer sign-off. If behavior change was accidental, fix the harness. Never delete failing cases to green the build without replacing them with a corrected expectation and reason.

Handling flaky and nondeterministic evals

Models sample. Retrieval can tie-break. CI that expects bit-identical prose will flake. Mitigations: pin temperature to 0 where appropriate, compare structured fields not free text, use majority vote across three runs for borderline cases, or widen tolerances with explicit rubrics.
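Two of these mitigations fit in a few lines. The sketch below compares only the structured fields a case cares about and takes a majority vote over an odd number of runs (three here) so the vote cannot tie; the helper names are assumptions.

```python
def fields_match(expected: dict, actual: dict) -> bool:
    """Pass when every expected field matches; extra fields and prose are ignored."""
    return all(actual.get(key) == value for key, value in expected.items())


def majority_pass(outcomes: list[bool]) -> bool:
    """Majority vote over an odd number of runs."""
    if len(outcomes) % 2 == 0:
        raise ValueError("use an odd number of runs so the vote cannot tie")
    return sum(outcomes) > len(outcomes) / 2


expected = {"refund_status": "pending"}
runs = [
    {"refund_status": "pending", "message": "Your refund is on its way."},
    {"refund_status": "pending", "message": "The refund is still processing."},
    {"refund_status": "unknown", "message": "I could not find that order."},
]
outcomes = [fields_match(expected, run) for run in runs]
case_passed = majority_pass(outcomes)
```

The two passing runs word their replies differently, which would break a free-text equality check but does not affect the field comparison.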

Retry policy. Retry once on timeout or HTTP 5xx from the provider. Do not auto-retry on assertion failure—that hides real regressions. If two runs disagree on a stochastic case, mark it flaky and quarantine after three inconclusive weeks.
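The distinction is easy to encode: catch only transient errors and let assertion failures propagate. In this sketch TransientProviderError is a hypothetical stand-in for whatever exception your client raises on HTTP 5xx.

```python
class TransientProviderError(Exception):
    """Hypothetical stand-in for a provider HTTP 5xx error."""


def run_with_retry(run_case, max_retries=1):
    """Retry on timeouts and provider errors only; assertion failures propagate."""
    attempts = 0
    while True:
        try:
            return run_case()
        except (TimeoutError, TransientProviderError):
            if attempts >= max_retries:
                raise
            attempts += 1


# A case that times out once and then succeeds.
attempt_log = []


def flaky_case():
    attempt_log.append("call")
    if len(attempt_log) == 1:
        raise TimeoutError("provider timed out")
    return True


result = run_with_retry(flaky_case)
```

An AssertionError raised by a grader is not in the except clause, so a real regression fails on the first attempt instead of being retried until it passes.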

Thresholds. Document expected variance: a ±1% swing in pass rate on a five-hundred-case suite at temperature 0 is unusual, but on fifty cases a single flipped case already moves pass rate by two points, so small deltas there may be noise. Always block merges on P0 tag failures; block on aggregate delta only when it exceeds a pre-agreed threshold (for example 2 points on smoke, 1 point on full golden).
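Suite size determines how coarse a percentage-point threshold really is. The arithmetic below shows how many newly failing cases it takes to exceed a threshold; the suite sizes and thresholds are the examples used in this lesson, and the function names are assumptions.

```python
import math


def points_per_case(n_cases: int) -> float:
    """Pass-rate change, in percentage points, caused by one flipped case."""
    return 100.0 / n_cases


def min_flips_to_block(n_cases: int, threshold_points: float) -> int:
    """Smallest number of new failures whose drop exceeds the threshold."""
    # k flips drop the rate by k * 100 / n points; block when that exceeds the threshold.
    return math.floor(threshold_points * n_cases / 100) + 1


table = {
    (n, threshold): (points_per_case(n), min_flips_to_block(n, threshold))
    for n, threshold in [(20, 2.0), (50, 2.0), (500, 1.0)]
}
# (20, 2.0)  -> (5.0, 1)  one failure on a 20-case smoke suite already blocks
# (50, 2.0)  -> (2.0, 2)
# (500, 1.0) -> (0.2, 6)
```

On a 10 to 20 case smoke suite, a two-point threshold therefore behaves like "no new failures allowed", which is worth stating explicitly when you agree on thresholds.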

Separate infra flakes from behavior flakes. If the second run passes after timeout, file a ticket for provider reliability. If it fails twice, it is a real regression. Publish flake rate per case id monthly and quarantine cases that waste more than five engineer hours without catching a real bug.

Release trains and eval gates

Map eval gates to your release cadence. Continuous deploy teams need fast smoke on every PR and full regression before prod promotion. Weekly release trains can run full suites nightly and gate only the promotion branch. Document which gate blocked which release so product learns the cost of skipping tests.

Keep a "break glass" override with ticket id and postmortem requirement. Overrides should be rare enough that managers notice when the counter increments.

Path filters: what triggers CI

Configure CI path filters so harness-adjacent files always run evals: prompts/, tools/, agents/, retrieval config, guardrail rules, eval datasets themselves. Include shared libraries that assemble context. Missing a path filter teaches teams to rename files to skip tests, which is worse than slow CI.

When monorepos mix AI and non-AI services, scope eval workflows to the AI package so unrelated frontend changes do not burn API budget, but any change to shared prompts still triggers the smoke suite.
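Most CI systems have built-in path filters, but the logic is worth seeing on its own. The sketch below uses Python's standard fnmatch; the directory layout (packages/ai/...) and patterns are hypothetical and should be replaced with your repo's real paths.

```python
from fnmatch import fnmatchcase

# Hypothetical monorepo layout: AI code lives under packages/ai/.
EVAL_PATTERNS = [
    "packages/ai/prompts/*",
    "packages/ai/tools/*",
    "packages/ai/agents/*",
    "packages/ai/config/retrieval*",
    "packages/ai/guardrails/*",
    "packages/ai/evals/datasets/*",
    "packages/ai/lib/context/*",  # shared libraries that assemble context
]


def should_run_evals(changed_files: list[str]) -> bool:
    """True when any changed file matches a harness-adjacent pattern."""
    # Note: fnmatch's "*" also matches "/" so nested files are covered.
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in EVAL_PATTERNS
    )


frontend_only = should_run_evals(["web/src/App.tsx", "README.md"])
prompt_edit = should_run_evals(["web/src/App.tsx", "packages/ai/prompts/system.md"])
```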

Comparing runs over time

Store eval run summaries: git sha, model id, prompt hash, pass rate, cost, latency percentiles, timestamp. Plot trends. A slow two-point drift over six releases is invisible in any single PR but obvious on a chart. This is how you catch death by a thousand prompt tweaks.
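A run summary is a flat record, and drift detection is a comparison across a window of them. In this sketch the record fields follow the list above; the six-run window, the two-point limit, and all sample values are illustrative assumptions.

```python
import json


def summarize_run(git_sha, model_id, prompt_hash, pass_rate, cost_usd, p95_latency_s, timestamp):
    """One flat record per eval run, suitable for appending to a JSONL file."""
    return {
        "git_sha": git_sha,
        "model_id": model_id,
        "prompt_hash": prompt_hash,
        "pass_rate": pass_rate,
        "cost_usd": cost_usd,
        "p95_latency_s": p95_latency_s,
        "timestamp": timestamp,
    }


def detect_drift(history, window=6, max_drift_points=2.0):
    """True when pass rate fell by at least max_drift_points across the window."""
    recent = history[-window:]
    if len(recent) < window:
        return False
    return recent[0]["pass_rate"] - recent[-1]["pass_rate"] >= max_drift_points


# Placeholder data: each release loses 0.4 points, well under any per-PR threshold.
rates = [92.0, 91.6, 91.2, 90.8, 90.4, 90.0]
history = [
    summarize_run(f"sha{i}", "model-x", "hash-a", rate, 1.5, 4.2, f"run-{i}")
    for i, rate in enumerate(rates)
]
jsonl = "\n".join(json.dumps(record) for record in history)
drifting = detect_drift(history)
```

No single step in this history would trip a two-point gate, but the window comparison does.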

A typical eval workflow runs these steps in order:

Trigger on paths.
Check out the code.
Load API keys from secrets.
Run the smoke eval CLI against the pinned baseline branch.
Upload trace artifacts.
Comment pass rate delta and regressed ids on the PR.
Fail the job if the delta exceeds the threshold or any P0 tag fails.
Nightly cron runs the full suite on main.

Set a per-run dollar cap in the workflow. If smoke exceeds it, fail loudly so someone shrinks the suite or switches fixtures to cached retrieval. Track monthly eval spend in the same dashboard as production LLM spend.
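A dollar cap can be enforced inside the eval runner itself. This sketch assumes each case run reports its own cost; BudgetExceeded, run_with_cap, and the flat 25-cent cost are hypothetical.

```python
class BudgetExceeded(RuntimeError):
    """Raised when cumulative eval spend passes the per-run cap."""


def run_with_cap(case_ids, run_case, cap_usd):
    """Run cases in order and fail loudly once spend exceeds the cap."""
    spent = 0.0
    results = {}
    for case_id in case_ids:
        passed, cost_usd = run_case(case_id)
        results[case_id] = passed
        spent += cost_usd
        if spent > cap_usd:
            raise BudgetExceeded(
                f"spent ${spent:.2f} of ${cap_usd:.2f} cap after {len(results)} cases"
            )
    return results, spent


def fake_run(case_id):
    # Stand-in for a real eval call: always passes and costs 25 cents.
    return True, 0.25


results, spent = run_with_cap(["billing-001", "billing-002"], fake_run, cap_usd=1.00)

try:
    run_with_cap([f"search-{i:03d}" for i in range(10)], fake_run, cap_usd=1.00)
    cap_hit = False
except BudgetExceeded:
    cap_hit = True
```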

Model upgrades in CI

Model swaps are harness changes. Run the full smoke suite when the provider string changes, not only when prompts change. Models with higher benchmarks can break JSON adherence, tool-call format, or refusal policies your harness assumes. Keep a pinned fallback model id in config for one-click rollback while you rebaseline evals.

Document baseline pass rate per model id in the repo README. Future you should not guess whether ninety-one percent is normal or a disaster for that model family.

Require eval results as an artifact attachment on release tickets the same way you attach migration scripts. Releases without a linked run id should not ship to high-risk cohorts.

When multiple teams share one golden set, namespace case ids (billing-012, search-012) so PR comments stay unambiguous and ownership stays obvious.

Parallelize eval runs carefully. Throwing five hundred agent tasks at an API without rate-limit awareness can throttle production traffic on shared org limits. Stagger CI workers or use a dedicated eval API key with its own quota.
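One simple way to stay under a rate limit is to bound concurrency with a semaphore. The sketch below uses asyncio from the standard library; the limit of three and the fake case runner are assumptions, and a real setup would also respect the provider's documented limits.

```python
import asyncio


async def run_all(case_ids, run_case, max_concurrency=4):
    """Run cases concurrently, never more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(case_id):
        async with semaphore:
            return case_id, await run_case(case_id)

    pairs = await asyncio.gather(*(guarded(cid) for cid in case_ids))
    return dict(pairs)


# Fake runner that records how many cases are in flight at the same time.
state = {"in_flight": 0, "peak": 0}


async def fake_case(case_id):
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.01)  # stand-in for an API call
    state["in_flight"] -= 1
    return True


eval_results = asyncio.run(
    run_all([f"billing-{i:03d}" for i in range(10)], fake_case, max_concurrency=3)
)
```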

Archive failing traces as CI artifacts with thirty-day retention so reviewers can open the exact run that caused a red build without re-running the whole suite.

Treat red CI from eval regression as a normal part of shipping, not an annoyance to bypass.

Checkpoint

You are ready for the next lesson if you can answer these from memory:

Which non-model changes should trigger eval runs in CI?
What are smoke, regression, and holdout suites for?
What should a useful eval PR comment include?
When is it legitimate to update a golden case instead of reverting code?

Quick check

Merge; descriptions are harmless
Block merge, inspect regressed cases, fix wording or update tests with intent
Delete the failing cases
Upgrade the model to compensate

Cost and latency: full agent evals are too expensive for every commit
Smoke tests have no predictive value
Because LLM judges are illegal in CI
Because golden sets should not exist

Outputs are identical JSON schemas every time
Final wording varies but tool sequence should remain stable
The system is strictly single-turn
You never want human review

Remove all failing tests without documentation
Update golden expectations in the same PR with review and rationale
Bypass CI permanently
Skip holdout runs forever