You change a single sentence in your system prompt. Pass rate on your eval suite drops 8%. You haven't touched the model. You haven't changed any tools. You changed one sentence in prose.
That's a harness regression. Not a model failure. The model didn't get dumber. You changed the environment it operates in, and the environment now produces worse behavior.
This is why model benchmarks don't tell you what you need to know about your harness. MMLU tells you whether the model can recall facts. HumanEval tells you whether it can write self-contained algorithms. Neither tells you whether your harness (your specific combination of system prompt, tools, iteration cap, context compression, and retry logic) is working correctly.
This post is about building evaluation infrastructure that measures the system you deployed, not the model in the abstract. If you haven't read the pillar post on what agent harnesses are, that's the context for everything that follows.
Why model evals miss the point
MMLU, HumanEval, SWE-bench: all measure model capability under controlled, single-turn or short-context conditions. That's not what production agents run in.
HumanEval tasks are self-contained algorithms. Sorting. String manipulation. A model scoring 95% on HumanEval may fall apart the moment the task requires debugging a live codebase with external dependencies and a project history spanning five years of commits. The eval never tested that surface.
SWE-bench is closer, but it has a specific problem: it measures the full agent (model plus harness plus scaffold), not the model alone. The same base model scores dramatically differently depending on which harness wraps it. When you cite a SWE-bench number, you're citing a model-harness pair. The benchmark tells you where that specific configuration sits. It doesn't tell you whether your harness is the variable.
The distinction that matters: a benchmark tells you how capable a model is in general. A harness eval tells you whether your specific application is doing the job you need it to do.
The harness is the variable. Changing your system prompt, tool descriptions, iteration cap, or retry logic changes agent behavior even when the model weights are completely unchanged. Model benchmarks cannot detect regressions caused by harness changes. Only harness-level task evals can.
What you actually need to measure
Before building any evaluation infrastructure, be precise about what you're measuring. For a harness, the relevant dimensions are:
- Task completion rate on your domain. Not a generic benchmark. Your tasks, your definition of success.
- Tool selection quality: did the model pick the right tool for the subtask?
- Argument validity: did it pass correctly typed, parseable parameters?
- Error recovery: when a tool call failed, did it retry intelligently?
- Multi-turn coherence: did it maintain its plan across 20+ turns?
- Cost per task completion: total tokens and API spend to reach a successful end state.
None of these show up cleanly on a model-level benchmark.
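To make them show up at all, record them per task. A minimal sketch of a per-task result record; the field names are illustrative and not tied to any framework:

```python
# Illustrative per-task result record covering the dimensions above.
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    completed: bool                  # task completion on your own definition of success
    correct_tool_calls: int = 0      # tool selection quality
    total_tool_calls: int = 0
    invalid_argument_calls: int = 0  # argument validity
    failed_calls_recovered: int = 0  # error recovery
    failed_calls_total: int = 0
    turns: int = 0                   # session length
    plan_consistent: bool = True     # multi-turn coherence (judge- or human-scored)
    total_tokens: int = 0            # cost per task completion
    api_cost_usd: float = 0.0


def pass_rate(results: list[TaskResult]) -> float:
    """Aggregate task completion rate across a suite run."""
    return sum(r.completed for r in results) / len(results) if results else 0.0
```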
Building task-based evals
A task eval is a single test case: fixed inputs, a defined success criterion, and a grader that applies the criterion to the agent's output or trajectory.
Anthropic's engineering team frames it this way: "An evaluation is a structured test where an AI agent is given a task and is graded on whether it succeeds. Evals turn vague notions of 'agent quality' into measurable, repeatable signals."
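In code, a task eval can be as small as a fixed input, a deterministic grader, and a call into your harness. A minimal sketch, assuming a hypothetical `run_agent` entry point that takes a prompt and returns the agent's final output:

```python
# A minimal task eval sketch: fixed input, explicit success criterion,
# deterministic grader. `run_agent` is a stand-in for your harness's entry point.
import json


def grade_ticket_summary(output: str) -> bool:
    """Success criterion: valid JSON, one entry per ticket, and every status
    drawn from the allowed set (no hallucinated resolution statuses)."""
    allowed = {"open", "resolved", "escalated"}
    try:
        summary = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(summary, list) or not summary:
        return False
    return all(isinstance(t, dict) and t.get("status") in allowed for t in summary)


def run_eval(run_agent) -> bool:
    task_input = "Summarize the last 30 days of support tickets as JSON."
    output = run_agent(task_input)
    return grade_ticket_summary(output)
```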
Start with real failures
The best place to start is not a spreadsheet of hypothetical task types. It's your failure log.
Anthropic's guidance: 20 to 50 tasks drawn from actual failures is a strong starting point. These evals aren't synthetic. They encode the real ways your harness breaks. You get double value: you're testing for regressions and documenting your failure modes at the same time.
The longer you wait to build evals, the harder it gets. Early on, product requirements naturally translate into test cases. "The agent should summarize the last 30 days of customer support tickets without hallucinating resolution statuses" is both a requirement and an eval task. If you write it as a requirement but skip writing it as an eval, you'll spend three months discovering it fails in production.
Binary vs. graded
Binary (pass/fail) is the default. It forces you to be explicit about what success means. Either the agent completed the task or it didn't. Simpler to aggregate, harder to game, more reproducible. Any task with a deterministic output ("write code that passes these unit tests," "update this database record to X") should be binary.
Graded (0.0 to 1.0 or 1 to 5 scale) captures partial progress. Long-horizon tasks where "got 60% of the way there" is meaningful signal genuinely need this. The risk: a 1-to-5 scale introduces subjectivity and requires larger sample sizes for statistical significance.
The practical fix for complex tasks is binary decomposition. Instead of one graded eval for "complete the user's data migration request," break it into five binary evals: schema was preserved, no rows were dropped, types were correctly cast, the operation completed within the iteration budget, the agent issued a summary the user could verify. Each grader is clean. The aggregate is informative.
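A sketch of that decomposition for the migration example, with hypothetical check names and simple pre/post state dictionaries standing in for your real database introspection:

```python
# Binary decomposition: five independent pass/fail graders instead of one
# fuzzy 1-to-5 grade. State dictionaries and field names are illustrative.
def check_schema_preserved(before: dict, after: dict) -> bool:
    return list(before["columns"]) == list(after["columns"])


def check_no_rows_dropped(before: dict, after: dict) -> bool:
    return after["row_count"] >= before["row_count"]


def check_types_cast(after: dict, expected_types: dict) -> bool:
    return all(after["types"].get(col) == t for col, t in expected_types.items())


def check_within_budget(turns_used: int, max_iterations: int) -> bool:
    return turns_used <= max_iterations


def check_summary_issued(final_message: str) -> bool:
    return len(final_message.strip()) > 0
```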
The three-layer grader model
Code-based graders are your first resort. String matching, regex, schema validation, checking whether a database record changed correctly, running the unit tests the agent was supposed to write. Fast, cheap, fully reproducible. Use these whenever correctness can be verified deterministically.
LLM-as-judge graders handle the open-ended cases. One model scores another model's output against a rubric. GPT-4 as judge matches human judgment roughly 80% of the time when well-prompted. Two reliability notes from practice: binary judgments ("did this response address the user's question?") are significantly more reliable than five-point scales, and rubrics with concrete examples consistently outperform rubrics without them. There's a known position bias here: LLM judges prefer whichever output they see first in pairwise comparisons. Calibrate against a small set of human-annotated examples before relying on a judge at scale.
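A minimal judge sketch along those lines: binary verdict, a rubric that includes a concrete example, and a `call_judge_model` stand-in for whichever model client you actually use:

```python
# Binary LLM-as-judge grader sketch. `call_judge_model` is a stand-in for your
# model client; the rubric forces a PASS/FAIL verdict rather than a 1-to-5 score.
RUBRIC = """You are grading an agent's response.
Question the user asked: {question}
Agent's response: {response}

PASS if the response directly addresses the user's question with no fabricated facts.
FAIL otherwise.

Example of a FAIL: the user asks for last month's refund count and the
response describes the refund policy instead.

Answer with exactly one word: PASS or FAIL."""


def judge(question: str, response: str, call_judge_model) -> bool:
    verdict = call_judge_model(RUBRIC.format(question=question, response=response))
    return verdict.strip().upper().startswith("PASS")
```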
Human graders establish ground truth. They're slow and expensive, but irreplaceable for domains where correctness is genuinely expert-dependent: legal analysis, medical information, financial interpretation. The practical split is LLM judges for volume, humans for calibration and edge cases.
Tool usage as its own evaluation surface
Agents don't just produce text. They invoke tools. Tool call quality is a distinct measurement domain that output-quality metrics miss entirely.
A model can produce excellent reasoning and still fail by selecting the wrong tool or passing a malformed argument. Excellent reasoning with a broken tool call produces a failed task.
The metrics that matter
ToolCallF1 is the starting point. Precision penalizes calling tools the task didn't need; recall penalizes missing the tools it did need; F1 balances both. This is more forgiving than exact-match accuracy, which is useful during early iteration when agents regularly over- or under-call.
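A sketch of the computation over a single task, treating expected and actual tool calls as multisets of tool names (stricter variants also compare arguments):

```python
# ToolCallF1 over one task: precision penalizes unnecessary calls, recall
# penalizes missing calls.
from collections import Counter


def tool_call_f1(expected: list[str], actual: list[str]) -> float:
    if not expected and not actual:
        return 1.0
    expected_counts, actual_counts = Counter(expected), Counter(actual)
    true_positives = sum((expected_counts & actual_counts).values())
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(actual)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Example: the agent calls search_docs twice when one search plus a summarize
# call was expected -> precision 0.5, recall 0.5, F1 0.5.
print(tool_call_f1(["search_docs", "summarize"], ["search_docs", "search_docs"]))
```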
Argument validity rate: did the model supply correctly typed, in-range, parseable arguments? Many tools have declared parameter schemas. You can validate arguments deterministically against those schemas without any LLM judge involvement. This metric is cheap to compute and catches a real class of failure: the model knows which tool to call but garbles the arguments.
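Validation against a declared schema is a few lines. A sketch using the jsonschema library, with a hypothetical read-file tool schema:

```python
# Deterministic argument validation against a tool's declared JSON Schema.
# No LLM judge involved; the schema shown is illustrative.
from jsonschema import ValidationError, validate

READ_FILE_SCHEMA = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "max_bytes": {"type": "integer", "minimum": 1},
    },
    "required": ["path"],
    "additionalProperties": False,
}


def argument_valid(tool_args: dict, schema: dict) -> bool:
    try:
        validate(instance=tool_args, schema=schema)
        return True
    except ValidationError:
        return False


# Argument validity rate for a run: valid calls divided by total calls.
```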
Redundant tool usage rate: how many tool calls didn't directly contribute to completing the task? High redundancy means wasted API budget, longer latency, and a bigger failure surface. If your agent calls search_docs four times to answer a question that required one good search, that's not just inefficiency. It's a signal that the model doesn't have a coherent plan.
Multi-turn function call accuracy measures coherent tool invocation sequences across turns. This is distinct from single-turn accuracy. The agent must reason about which tools it has already called, what state they've produced, and what's left. The failure mode here isn't "wrong tool on turn one." It's "redundant call on turn twelve because the agent forgot it already retrieved this."
The minimum-necessary-tool-calls principle
Efficient agents complete tasks with the fewest tool calls needed. Track average tool calls per successfully completed task. Compare across harness versions.
If version two of your harness takes 40% more tool calls to complete the same benchmark task set as version one, that's a cost and latency regression. It will appear in your bills before it appears in your pass rate. Track it explicitly.
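A sketch of that comparison, assuming per-task result dictionaries with `completed` and `tool_calls` fields:

```python
# Efficiency comparison across harness versions on the same task set.
def avg_calls_per_success(results: list[dict]) -> float:
    successes = [r for r in results if r["completed"]]
    if not successes:
        return float("inf")
    return sum(r["tool_calls"] for r in successes) / len(successes)


def efficiency_regression(v1_results: list[dict], v2_results: list[dict],
                          threshold: float = 1.4) -> bool:
    """Flag if v2 needs, say, 40% more tool calls than v1 for the same tasks."""
    return avg_calls_per_success(v2_results) > threshold * avg_calls_per_success(v1_results)
```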
Multi-turn coherence
Single-turn metrics don't transfer to multi-turn agents. Context drift, knowledge retention, and plan maintenance are failure modes that only appear across turns. A harness that looks fine on a 3-turn eval can fall apart on a 20-turn one.
What actually breaks at turn 20
Context drift: the model gradually loses track of the original goal. At turn 3, the goal is clear. At turn 20, the model may be optimizing for a subgoal it constructed mid-session that has drifted from what the user actually asked for.
Knowledge retention: facts established early in the session get contradicted later. The model at turn 18 claims X when it stated the opposite at turn 5. If you're not testing for this explicitly, you won't catch it until a user does.
Plan abandonment: the model drops its plan under pressure from unexpected tool results. A failed tool call at turn 12 causes it to try a different approach, and it never returns to the parts of the original plan it abandoned.
Backtracking failures: the model recognizes it's wrong but can't reconstruct its prior state correctly. It attempts to backtrack and corrupts something in the process.
Practical multi-turn testing
Run 20-turn and 50-turn versions of the same task. Measure pass rate separately for each length. If your 20-turn pass rate is 80% and your 50-turn rate drops to 40%, you have a coherence degradation problem that won't appear in short evals. You now also know where the harness breaks down, which tells you where to invest: context compression, memory injection, plan re-anchoring prompts.
For complex tasks, inject mid-session perturbations (an unexpected tool failure, a user redirect) and check whether the model recovers to its original goal. Log the agent's stated next step at each turn. Verify it's consistent with the prior plan. This kind of structured replay testing catches plan abandonment before production does.
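A sketch of the length-stratified comparison, assuming a hypothetical `run_task` entry point that returns True when the task passes under a given turn cap:

```python
# Length-stratified pass rates: run the same tasks at two turn caps and compare.
def pass_rate_by_length(tasks, run_task, turn_caps=(20, 50)) -> dict[int, float]:
    rates = {}
    for cap in turn_caps:
        passed = sum(run_task(task, max_turns=cap) for task in tasks)
        rates[cap] = passed / len(tasks)
    return rates


# A large gap between rates[20] and rates[50] points at coherence degradation:
# context compression, memory injection, or plan re-anchoring is where to look.
```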
Regression testing
A harness regression is when a change to any non-model component causes tasks that previously passed to fail. It doesn't require a model upgrade. The most common sources are things teams change without thinking of them as risky:
System prompt wording: rephrasing instructions changes model behavior in ways that are hard to predict. A recent ablation study (the AHE paper, arXiv:2604.25850) found that the system prompt was the only component whose swap caused a performance regression; swapping tools, middleware, and memory all improved performance. The system prompt is the highest-impact, highest-risk thing in your harness.
Tool descriptions: rewording a tool's description changes which tool the model selects when multiple tools are candidates. "Get a file from the filesystem" and "Read a file at a given path" will produce different selection behavior.
Iteration cap: lowering MAX_ITERATIONS can cause tasks that require more exploration to fail silently. The task doesn't error. The agent just stops and declares success before it's actually done.
Context compression thresholds: too aggressive, and task-critical context gets truncated. Not aggressive enough, and you hit context overflow failures.
Retry logic: changes to how failed tool calls are retried can either fix or introduce failure patterns.
None of these are model changes. All of them can tank your pass rate.
The golden dataset
The core infrastructure for regression testing is a version-controlled set of 200 to 500 test cases that represent your harness's full operational envelope.
The best test cases come from real production traffic. They capture the actual requests and edge cases your users produce, not the ones you imagined while writing the spec. Keep this dataset static. Don't update it without deliberate review. When you do update it, commit the change with an explanation.
Golden traces extend this. Record complete interaction sequences (prompts, actions, tool calls, and results) from successful executions. These are deterministic: they don't change unless you change them. When a future harness change causes a deviation from a golden trace, you've found a regression. No LLM judge required. EvalView's golden trace system does exactly this: captures snapshots of known-good behavior and flags deviations automatically.
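A generic sketch of trace comparison (not EvalView's actual API): the recorded trace is a sequence of (tool, arguments) steps from a known-good run, and any divergence gets flagged.

```python
# Golden-trace comparison sketch. Trace format is assumed: a JSON list of
# {"tool": ..., "args": ...} steps recorded from a known-good run.
import json
from pathlib import Path


def load_golden_trace(path: str) -> list[dict]:
    return json.loads(Path(path).read_text())


def first_deviation(golden: list[dict], candidate: list[dict]) -> int | None:
    """Return the index of the first step where the candidate diverges,
    or None if the traces match end to end."""
    for i, (g, c) in enumerate(zip(golden, candidate)):
        if g["tool"] != c["tool"] or g["args"] != c["args"]:
            return i
    if len(golden) != len(candidate):
        return min(len(golden), len(candidate))
    return None
```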
CI/CD gating
Every PR that touches harness code (system prompt, tool descriptions, iteration cap, retry logic, context compression) should trigger an eval run. The workflow:
- New version runs against the static golden dataset.
- Scores compared to current production baseline.
- PR comment posts which test cases improved, which regressed, by how much.
- A regression threshold (2% drop in pass rate is a common starting point) blocks the merge.
This turns eval results into release decisions, not post-hoc analysis.
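The gating step itself can be a plain script the CI job runs after both suites finish. A minimal sketch, assuming each run writes a JSON file with a `pass_rate` field; a nonzero exit code blocks the merge:

```python
# CI regression gate sketch: compare candidate pass rate against the
# production baseline and fail the job if the drop exceeds the threshold.
import json
import sys
from pathlib import Path

REGRESSION_THRESHOLD = 0.02  # 2% absolute drop in pass rate blocks the merge


def main(baseline_path: str, candidate_path: str) -> int:
    baseline = json.loads(Path(baseline_path).read_text())["pass_rate"]
    candidate = json.loads(Path(candidate_path).read_text())["pass_rate"]
    drop = baseline - candidate
    print(f"baseline={baseline:.3f} candidate={candidate:.3f} drop={drop:+.3f}")
    return 1 if drop > REGRESSION_THRESHOLD else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```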
Cost control in CI
Agent evals use real API calls. Running the full suite on every PR gets expensive fast.
Tiered suites are the practical answer. A smoke suite of 10 to 20 representative tasks runs on every PR and finishes in minutes. The full regression suite runs on merges to main or nightly. Batch API endpoints cut eval costs roughly in half for non-latency-sensitive runs. Track cost per eval run alongside quality: if your eval spend starts approaching your development budget, you need a different strategy.
The harness parameter space
When you're tuning a harness, you're dealing with multiple interacting parameters. Change one and you may shift the behavior of something three parameters away. The key parameters and their failure modes:
| Parameter | What it controls | Common failure mode |
|---|---|---|
| System prompt wording | Model behavior baseline | Rephrasing breaks previously passing tasks |
| Tool descriptions | Tool selection decisions | Ambiguous descriptions cause wrong tool selection |
| MAX_ITERATIONS | Exploration budget | Too low: early task abandonment. Too high: cost blowup |
| Context compression threshold | When to summarize history | Too aggressive: truncates task-critical context |
| Retry logic | Recovery from tool failures | Aggressive retry: infinite loops. No retry: permanent failures |
| Memory injection strategy | What prior context is available | Too much: context bloat. Too little: coherence failures |
| Tool set composition | Which tools are available | Overlapping tools confuse selection. Missing tools cause failures |
Running ablations
Change one parameter, hold everything else constant, re-run the benchmark suite, compare pass rate before and after.
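A sketch of the loop, with a hypothetical config dictionary and a `run_suite` stand-in that returns the pass rate for a given configuration:

```python
# One-parameter-at-a-time ablation: hold a baseline config fixed, vary exactly
# one parameter, re-run the suite. Config keys and values are illustrative.
import copy

BASELINE_CONFIG = {
    "system_prompt": "v3",
    "max_iterations": 25,
    "compression_threshold": 0.8,
    "retry_limit": 2,
}


def ablate(parameter: str, values: list, run_suite) -> dict:
    results = {"baseline": run_suite(BASELINE_CONFIG)}
    for value in values:
        config = copy.deepcopy(BASELINE_CONFIG)
        config[parameter] = value  # change exactly one thing
        results[value] = run_suite(config)
    return results


# Example: ablate("max_iterations", [15, 25, 40], run_suite) gives pass rates
# you can compare against the baseline before touching anything else.
```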
The AHE paper result is worth knowing here. On Terminal-Bench 2, every component swap except system prompts improved performance. Tools, middleware, long-term memory: all positive. System prompt alone caused regression. The practical implication: your system prompt is where harness engineering time buys the most, and also where careless edits cost the most.
Ten AHE iterations lifted pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed Codex-CLI harness at 71.9%. The gains came from systematic ablation, not intuition.
Tuning priority, in order:
- System prompt: highest impact, highest regression risk
- Tool descriptions: directly affects selection quality
- MAX_ITERATIONS: easy to measure, immediate cost impact
- Context compression threshold: significant for long-horizon tasks
- Retry logic: measurable through error recovery rate
- Memory injection strategy: hardest to ablate cleanly, save for later
Overfitting to evals
Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.
Once your eval set is known, and your team knows which tasks pass and which fail, you start optimizing for those tasks specifically. Pass rate goes up. Real-world performance stays flat or declines.
The benchmark contamination problem
UC Berkeley researchers found that every major public agent benchmark can be exploited to achieve near-perfect scores without solving any actual tasks. GAIA: roughly 98% through public answers and normalization collisions. OSWorld: 73% through VM state manipulation and public gold files. WebArena: roughly 100% through config leakage and DOM injection.
These are public benchmarks. The same dynamic applies to internal eval sets the moment they become familiar.
Anthropic's engineering team documented a cleaner version of this: Claude Opus 4.6 on BrowseComp identified it was being evaluated, located the answer key, and decrypted it. The model gamed the eval through general capability, not through overfitting. If a sufficiently capable model can find your answer key, a harness optimized against a fixed eval set will find the shortest path to high scores long before that.
Building eval sets that resist gaming
Private holdout sets: keep a set that is never published, never discussed in eval reviews, and never used as a development signal. Run it quarterly. If your internal pass rate keeps going up but your holdout scores stay flat, you're optimizing for the eval.
Dynamic rotation: add 5 to 10 new test cases from production logs every month. The eval set grows with real failures rather than staying frozen against a snapshot of what failed six months ago.
The three-tier structure that works:
- Public benchmarks: where you stand vs. industry
- Internal regression suite (static): catches regressions before deployment
- Holdout set (never seen during development): honest measurement
Three signs you're overfitting: pass rate on your internal suite keeps climbing while production metrics stay flat; harness changes that improve eval scores harm performance on unfamiliar task types; your team knows which test cases pass before running the suite.
Production logs as the best eval source
The most realistic eval data is not in your test suite. It's in your production logs.
Hand-crafted evals are designed by people who know your system. They encode the failure modes you thought of. Production logs capture the failure modes your users discovered, which are almost always different.
The improvement loop
1. Trace production sessions: all tool calls, arguments, results, latency, token counts
2. Flag failures automatically: iteration cap reached without completion, user thumbs-down, zero-output sessions, tool error rate above threshold (see the flagging sketch after this list)
3. Review flagged sessions weekly: a human triages each into "add to eval suite," "known issue," or "expected behavior"
4. Add 5 to 10 new test cases per week from production
5. Improve the harness
6. Deploy
7. Return to step 1
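A sketch of the automated flagging step (step 2 above), using those criteria. The session fields are hypothetical; map them to whatever your tracing layer actually records:

```python
# Automated failure flagging over a traced session. Field names and the 25%
# tool-error threshold are illustrative assumptions, not from any framework.
def flag_for_review(session: dict) -> list[str]:
    reasons = []
    if session["iterations_used"] >= session["max_iterations"] and not session["completed"]:
        reasons.append("iteration cap reached without completion")
    if session.get("user_feedback") == "thumbs_down":
        reasons.append("user thumbs-down")
    if not session.get("final_output"):
        reasons.append("zero-output session")
    tool_calls = session.get("tool_calls", [])
    if tool_calls:
        # each tool call record is assumed to carry a boolean "error" field
        error_rate = sum(c["error"] for c in tool_calls) / len(tool_calls)
        if error_rate > 0.25:
            reasons.append(f"tool error rate {error_rate:.0%}")
    return reasons  # a non-empty list routes the session into weekly review
```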
Research on LLM evaluation practices has found that production-derived test cases improved model performance on real-world tasks by 34% compared to synthetic-only datasets. The gap between "things I imagined would break" and "things that actually break in production" is that large.
LangSmith's Insights Agent runs automated clustering over production traces: it groups thousands of traces by intent, analyzes traces with negative feedback, and surfaces patterns the harness consistently handles poorly. That's the kind of analysis that takes a week manually and runs overnight automatically.
Infrastructure and tooling
Braintrust is framework-agnostic. It works across OpenAI, Anthropic, Google, and open-source models. It has a native GitHub Action for CI/CD gating, trajectory-level scoring for multi-step agents, and dataset management built in. Best fit: teams that need evaluation to act as release control, and teams working across multiple frameworks or providers.
LangSmith is the right choice for Python-first teams building with LangChain or LangGraph. Native tracing captures every component interaction automatically. The annotation queues route selected traces to human reviewers with full session context. Adding a problematic production trace to a dataset is a single click. The Insights Agent handles automated failure clustering. Tight pytest and GitHub workflow integration.
Inspect AI (UK AI Security Institute, open-source) has a clean architecture: Datasets, Solvers, and Scorers as distinct layers. It supports arbitrary external agents (Claude Code, Codex CLI, Gemini CLI) and comes with over 200 pre-built evaluations. Used for nearly all of UK AISI's automated evaluations, and adopted by Anthropic and DeepMind. If you want to run standardized safety and capability evaluations without building the framework yourself, start here.
HAL (Holistic Agent Leaderboard, from Princeton PLI) is worth knowing for benchmarking. It accepts any agent exposing a minimal Python API and tracks cost per benchmark run as a first-class metric alongside accuracy. Validated with 21,730 rollouts across 9 models and 9 benchmarks. The cost-tracking approach is the most practically useful idea in HAL's design: your eval infrastructure should track what each suite run costs, not just whether it passed.
DeepEval runs as pytest unit tests, making it easy to slot into existing CI workflows. It has agent-specific metrics including Tool Correctness (trajectory comparison against ground truth) and multi-turn evaluation support.
Ragas covers agentic metrics including Agent Goal Accuracy, Tool Call Accuracy, and Tool-Calling Efficiency. Good for RAG pipelines that have grown into multi-step agents.
EleutherAI's lm-evaluation-harness is the backend for the Hugging Face Open LLM Leaderboard. It's primarily a model-level eval tool, not a harness regression tool. Use it to isolate the model component as a diagnostic when you're trying to separate model effects from harness effects.
When to call in humans
Automated evals handle most of what you need. Four cases where they don't:
Subjective quality dimensions: naturalness, tone, domain appropriateness. These require judgment that LLM judges approximate but don't fully capture. If your product lives or dies on whether responses feel right to domain experts, periodic human review is not optional.
Novel failure modes: your golden dataset encodes failures you've already seen. Human reviewers spot failure patterns that haven't shown up yet.
LLM judge calibration: before trusting an LLM judge at scale, you need a ground-truth set of human-annotated examples. The judge is only as reliable as its calibration.
Specialized domains: legal, medical, financial. Correctness here requires expert judgment. LLM judges at best approximate what a domain expert would conclude. For high-stakes domains, treat automated evals as a filter that routes edge cases to human review, not as the final word.
The practical hybrid: LLM judges for volume, humans for calibration and the cases the automated system flags as uncertain. Anthropic uses crowdworker pairwise comparisons for Elo-based model scoring. LMArena runs this at scale to generate preference leaderboards. For internal eval purposes, you don't need that scale: a structured annotation queue routing the bottom 5% of automated scores to human review handles most of what you actually need.
The benchmark landscape
A few benchmarks worth knowing for external context on where harnesses stand.
SWE-bench Verified (OpenAI, August 2024) is the current standard for coding agent evaluation. 500 human-validated instances from real repository bug reports and pull requests, each run in an isolated Docker container. It measures the full agent (model plus harness), not the model alone. The same model scores wildly differently across different scaffolds. Use SWE-bench Verified to benchmark your harness as a system, not to evaluate your model in isolation.
SWE-bench Live adds 50 newly verified issues monthly. It addresses data contamination directly. If you're seeing suspiciously high scores on SWE-bench Verified, Live is the check.
AgentBench evaluates agents across eight distinct environments: OS interaction, database querying, knowledge graph navigation, web shopping, web browsing, and more. It tests generalization across fundamentally different agentic settings in a single framework.
GAIA (Meta AI, ICLR 2024) is the multi-step reasoning benchmark. Real-world questions requiring web browsing, tool use, and multi-modal reasoning. Humans score 92%. GPT-4 with plugins scored 15% in the original paper. H2O.ai reached 75% in 2025. The difficulty gap is real.
HAL (Princeton PLI) benchmarks agents with cost as a first-class dimension alongside accuracy. It's the only major leaderboard that tracks what it costs to run a benchmark, not just whether the agent passed. If evaluation cost is a real constraint (and eventually it is for everyone), HAL's methodology is worth adopting internally.
Where this connects to the series
The pillar post established that the harness is the variable: not just for performance, but for cost, reliability, and safety. Evaluation is how you measure that variable over time.
A harness without an eval pipeline is a system you can't change with confidence. Every prompt edit, every tool description update, every change to your iteration cap is a bet you're placing without knowing the odds. The eval pipeline makes those odds explicit.
The next posts in this series cover the AGENTS.md and CLAUDE.md pattern (how harnesses pick up persistent project context without re-prompting every session) and the A2A protocol for inter-harness agent communication.
References and sources
Anthropic Engineering
- Demystifying evals for AI agents - Anthropic Engineering
- Eval awareness in Claude Opus 4.6's BrowseComp - Anthropic Engineering
- Bloom: Auto-evals for alignment - Anthropic Alignment
Benchmark papers
- GAIA (Meta AI) - arXiv:2311.12983 (ICLR 2024)
- AgentBench - ICLR 2024
- WebArena - arXiv:2307.13854
- HAL: Holistic Agent Leaderboard - arXiv:2510.11977; hal.cs.princeton.edu
- SkillsBench - arXiv:2602.12670
Harness engineering research
- Agentic Harness Engineering (AHE) - arXiv:2604.25850
- Active Context Compression - arXiv:2601.07190
- Natural-Language Agent Harnesses - arXiv:2603.25723
Agent evaluation research
- Evaluating LLM-based Agents for Multi-Turn Conversations - arXiv:2503.22458 (ACM TIST)
- Beyond Task Completion - arXiv:2512.12791
- Goodhart's Law in Reinforcement Learning - arXiv:2310.09144
- TRAJECT-Bench - arXiv:2510.04550
Benchmarks
- SWE-bench | SWE-bench Verified | SWE-bench Multimodal
- SWE-bench Verified (OpenAI announcement)
- GAIA Leaderboard
- Agentic AI Benchmarks Leaderboard
Tooling and platforms
- Braintrust | Agent eval framework
- LangSmith Evaluation | Insights Agent
- Inspect AI (AISI) | Sandboxing Toolkit
- DeepEval / Confident AI
- Ragas agentic metrics
- EleutherAI lm-evaluation-harness
- HAL harness GitHub
Engineering blogs and guides
- Amazon AWS: Evaluating AI agents, real-world lessons
- Google Cloud: A methodical approach to agent evaluation
- The Pragmatic Engineer: A pragmatic guide to LLM evals for devs
- LangChain: The Agent Improvement Loop Starts with a Trace
- LangChain: Improving Deep Agents with Harness Engineering
- HuggingFace: AI evals are becoming the new compute bottleneck
- Evidently AI: 10 AI agent benchmarks
- UC Berkeley: How We Broke Top AI Agent Benchmarks
- Arize AI: What is an evaluation harness?
Rahul Kashyap is CTO & Co-founder at Designare Solutions and DeepStory, based in Bangalore.