You change a single sentence in your system prompt. Pass rate on your eval suite drops 8%. You haven't touched the model. You haven't changed any tools. You changed one sentence in prose.
That's a harness regression. Not a model failure. The model didn't get dumber. You changed the environment it operates in, and the environment now produces worse behavior.
This is why model benchmarks don't tell you what you need to know about your harness. MMLU tells you whether the model can recall facts. HumanEval tells you whether it can write self-contained algorithms. Neither tells you whether your harness (your specific combination of system prompt, tools, iteration cap, context compression, and retry logic) is working correctly.
This post is about building evaluation infrastructure that measures the system you deployed, not the model in the abstract. If you haven't read the pillar post on what agent harnesses are, that's the context for everything that follows.
Why model evals miss the point
MMLU, HumanEval, SWE-bench: all measure model capability under controlled, single-turn or short-context conditions. That's not what production agents run in.
HumanEval tasks are self-contained algorithms. Sorting. String manipulation. A model scoring 95% on HumanEval may fall apart the moment the task requires debugging a live codebase with external dependencies and a project history spanning five years of commits. The eval never tested that surface.
SWE-bench is closer, but it has a specific problem: it measures the full agent (model plus harness plus scaffold), not the model alone. The same base model scores dramatically differently depending on which harness wraps it. When you cite a SWE-bench number, you're citing a model-harness pair. The benchmark tells you where that specific configuration sits. It doesn't tell you whether your harness is the variable.
The distinction that matters: a benchmark tells you how capable a model is in general. A harness eval tells you whether your specific application is doing the job you need it to do.
The harness is the variable. Changing your system prompt, tool descriptions, iteration cap, or retry logic changes agent behavior even when the model weights are completely unchanged. Model benchmarks cannot detect regressions caused by harness changes. Only harness-level task evals can.
What you actually need to measure
Before building any evaluation infrastructure, be precise about what you're measuring. For a harness, the relevant dimensions are:
- Task completion rate on your domain. Not a generic benchmark. Your tasks, your definition of success.
- Tool selection quality: did the model pick the right tool for the subtask?
- Argument validity: did it pass correctly typed, parseable parameters?
- Error recovery: when a tool call failed, did it retry intelligently?
- Multi-turn coherence: did it maintain its plan across 20+ turns?
- Cost per task completion: total tokens and API spend to reach a successful end state.
None of these show up cleanly on a model-level benchmark.
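To make them show up at all, record them per task. A minimal sketch of a per-task result record; the field names are illustrative and not tied to any framework:

```python
# Illustrative per-task result record covering the dimensions above.
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    completed: bool                  # task completion on your own definition of success
    correct_tool_calls: int = 0      # tool selection quality
    total_tool_calls: int = 0
    invalid_argument_calls: int = 0  # argument validity
    failed_calls_recovered: int = 0  # error recovery
    failed_calls_total: int = 0
    turns: int = 0                   # session length
    plan_consistent: bool = True     # multi-turn coherence (judge- or human-scored)
    total_tokens: int = 0            # cost per task completion
    api_cost_usd: float = 0.0


def pass_rate(results: list[TaskResult]) -> float:
    """Aggregate task completion rate across a suite run."""
    return sum(r.completed for r in results) / len(results) if results else 0.0
```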
Building task-based evals
A task eval is a single test case: fixed inputs, a defined success criterion, and a grader that applies the criterion to the agent's output or trajectory.
Anthropic's engineering team frames it this way: "An evaluation is a structured test where an AI agent is given a task and is graded on whether it succeeds. Evals turn vague notions of 'agent quality' into measurable, repeatable signals."
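In code, a task eval can be as small as a fixed input, a deterministic grader, and a call into your harness. A minimal sketch, assuming a hypothetical `run_agent` entry point that takes a prompt and returns the agent's final output:

```python
# A minimal task eval sketch: fixed input, explicit success criterion,
# deterministic grader. `run_agent` is a stand-in for your harness's entry point.
import json


def grade_ticket_summary(output: str) -> bool:
    """Success criterion: valid JSON, one entry per ticket, and every status
    drawn from the allowed set (no hallucinated resolution statuses)."""
    allowed = {"open", "resolved", "escalated"}
    try:
        summary = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(summary, list) or not summary:
        return False
    return all(isinstance(t, dict) and t.get("status") in allowed for t in summary)


def run_eval(run_agent) -> bool:
    task_input = "Summarize the last 30 days of support tickets as JSON."
    output = run_agent(task_input)
    return grade_ticket_summary(output)
```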
Start with real failures
The best place to start is not a spreadsheet of hypothetical task types. It's your failure log.
Anthropic's guidance: 20 to 50 tasks drawn from actual failures is a strong starting point. These evals aren't synthetic. They encode the real ways your harness breaks. You get double value: you're testing for regressions and documenting your failure modes at the same time.
The longer you wait to build evals, the harder it gets. Early on, product requirements naturally translate into test cases. "The agent should summarize the last 30 days of customer support tickets without hallucinating resolution statuses" is both a requirement and an eval task. If you write it as a requirement but skip writing it as an eval, you'll spend three months discovering it fails in production.
Binary vs. graded
Binary (pass/fail) is the default. It forces you to be explicit about what success means. Either the agent completed the task or it didn't. Simpler to aggregate, harder to game, more reproducible. Any task with a deterministic output ("write code that passes these unit tests," "update this database record to X") should be binary.
Graded (0.0 to 1.0 or 1 to 5 scale) captures partial progress. Long-horizon tasks where "got 60% of the way there" is meaningful signal genuinely need this. The risk: a 1-to-5 scale introduces subjectivity and requires larger sample sizes for statistical significance.
The practical fix for complex tasks is binary decomposition. Instead of one graded eval for "complete the user's data migration request," break it into five binary evals: schema was preserved, no rows were dropped, types were correctly cast, the operation completed within the iteration budget, the agent issued a summary the user could verify. Each grader is clean. The aggregate is informative.
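A sketch of that decomposition for the migration example, with hypothetical check names and simple pre/post state dictionaries standing in for your real database introspection:

```python
# Binary decomposition: five independent pass/fail graders instead of one
# fuzzy 1-to-5 grade. State dictionaries and field names are illustrative.
def check_schema_preserved(before: dict, after: dict) -> bool:
    return list(before["columns"]) == list(after["columns"])


def check_no_rows_dropped(before: dict, after: dict) -> bool:
    return after["row_count"] >= before["row_count"]


def check_types_cast(after: dict, expected_types: dict) -> bool:
    return all(after["types"].get(col) == t for col, t in expected_types.items())


def check_within_budget(turns_used: int, max_iterations: int) -> bool:
    return turns_used <= max_iterations


def check_summary_issued(final_message: str) -> bool:
    return len(final_message.strip()) > 0
```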
The three-layer grader model
Code-based graders are your first resort. String matching, regex, schema validation, checking whether a database record changed correctly, running the unit tests the agent was supposed to write. Fast, cheap, fully reproducible. Use these whenever correctness can be verified deterministically.
LLM-as-judge graders handle the open-ended cases. One model scores another model's output against a rubric. GPT-4 as judge matches human judgment roughly 80% of the time when well-prompted. Two reliability notes from practice: binary judgments ("did this response address the user's question?") are significantly more reliable than five-point scales, and rubrics with concrete examples consistently outperform rubrics without them. There's a known position bias here: LLM judges prefer whichever output they see first in pairwise comparisons. Calibrate against a small set of human-annotated examples before relying on a judge at scale.
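A minimal judge sketch along those lines: binary verdict, a rubric that includes a concrete example, and a `call_judge_model` stand-in for whichever model client you actually use:

```python
# Binary LLM-as-judge grader sketch. `call_judge_model` is a stand-in for your
# model client; the rubric forces a PASS/FAIL verdict rather than a 1-to-5 score.
RUBRIC = """You are grading an agent's response.
Question the user asked: {question}
Agent's response: {response}

PASS if the response directly addresses the user's question with no fabricated facts.
FAIL otherwise.

Example of a FAIL: the user asks for last month's refund count and the
response describes the refund policy instead.

Answer with exactly one word: PASS or FAIL."""


def judge(question: str, response: str, call_judge_model) -> bool:
    verdict = call_judge_model(RUBRIC.format(question=question, response=response))
    return verdict.strip().upper().startswith("PASS")
```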
Human graders establish ground truth. They're slow and expensive, but irreplaceable for domains where correctness is genuinely expert-dependent: legal analysis, medical information, financial interpretation. The practical split is LLM judges for volume, humans for calibration and edge cases.
Tool usage as its own evaluation surface
Agents don't just produce text. They invoke tools. Tool call quality is a distinct measurement domain that output-quality metrics miss entirely.
A model can produce excellent reasoning and still fail by selecting the wrong tool or passing a malformed argument. Excellent reasoning with a broken tool call produces a failed task.
The metrics that matter
ToolCallF1 is the starting point. Precision penalizes calling tools the task didn't need; recall penalizes missing the tools it did need; F1 balances both. This is more forgiving than exact-match accuracy, which is useful during early iteration when agents regularly over- or under-call.
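A sketch of the computation over a single task, treating expected and actual tool calls as multisets of tool names (stricter variants also compare arguments):

```python
# ToolCallF1 over one task: precision penalizes unnecessary calls, recall
# penalizes missing calls.
from collections import Counter


def tool_call_f1(expected: list[str], actual: list[str]) -> float:
    if not expected and not actual:
        return 1.0
    expected_counts, actual_counts = Counter(expected), Counter(actual)
    true_positives = sum((expected_counts & actual_counts).values())
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(actual)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Example: the agent calls search_docs twice when one search plus a summarize
# call was expected -> precision 0.5, recall 0.5, F1 0.5.
print(tool_call_f1(["search_docs", "summarize"], ["search_docs", "search_docs"]))
```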
Argument validity rate: did the model supply correctly typed, in-range, parseable arguments? Many tools have declared parameter schemas. You can validate arguments deterministically against those schemas without any LLM judge involvement. This metric is cheap to compute and catches a real class of failure: the model knows which tool to call but garbles the arguments.
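Validation against a declared schema is a few lines. A sketch using the jsonschema library, with a hypothetical read-file tool schema:

```python
# Deterministic argument validation against a tool's declared JSON Schema.
# No LLM judge involved; the schema shown is illustrative.
from jsonschema import ValidationError, validate

READ_FILE_SCHEMA = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "max_bytes": {"type": "integer", "minimum": 1},
    },
    "required": ["path"],
    "additionalProperties": False,
}


def argument_valid(tool_args: dict, schema: dict) -> bool:
    try:
        validate(instance=tool_args, schema=schema)
        return True
    except ValidationError:
        return False


# Argument validity rate for a run: valid calls divided by total calls.
```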
Redundant tool usage rate: how many tool calls didn't directly contribute to completing the task? High redundancy means wasted API budget, longer latency, and a bigger failure surface. If your agent calls search_docs four times to answer a question that required one good search, that's not just inefficiency. It's a signal that the model doesn't have a coherent plan.
Multi-turn function call accuracy measures coherent tool invocation sequences across turns. This is distinct from single-turn accuracy. The agent must reason about which tools it has already called, what state they've produced, and what's left. The failure mode here isn't "wrong tool on turn one." It's "redundant call on turn twelve because the agent forgot it already retrieved this."
The minimum-necessary-tool-calls principle
Efficient agents complete tasks with the fewest tool calls needed. Track average tool calls per successfully completed task. Compare across harness versions.
If version two of your harness takes 40% more tool calls to complete the same benchmark task set as version one, that's a cost and latency regression. It will appear in your bills before it appears in your pass rate. Track it explicitly.
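A sketch of that comparison, assuming per-task result dictionaries with `completed` and `tool_calls` fields:

```python
# Efficiency comparison across harness versions on the same task set.
def avg_calls_per_success(results: list[dict]) -> float:
    successes = [r for r in results if r["completed"]]
    if not successes:
        return float("inf")
    return sum(r["tool_calls"] for r in successes) / len(successes)


def efficiency_regression(v1_results: list[dict], v2_results: list[dict],
                          threshold: float = 1.4) -> bool:
    """Flag if v2 needs, say, 40% more tool calls than v1 for the same tasks."""
    return avg_calls_per_success(v2_results) > threshold * avg_calls_per_success(v1_results)
```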
Multi-turn coherence
Single-turn metrics don't transfer to multi-turn agents. Context drift, knowledge retention, and plan maintenance are failure modes that only appear across turns. A harness that looks fine on a 3-turn eval can fall apart on a 20-turn one.
What actually breaks at turn 20
Context drift: the model gradually loses track of the original goal. At turn 3, the goal is clear. At turn 20, the model may be optimizing for a subgoal it constructed mid-session that has drifted from what the user actually asked for.
Knowledge retention: facts established early in the session get contradicted later. The model at turn 18 claims X when it stated the opposite at turn 5. If you're not testing for this explicitly, you won't catch it until a user does.
Plan abandonment: the model drops its plan under pressure from unexpected tool results. A failed tool call at turn 12 causes it to try a different approach, and it never returns to the parts of the original plan it abandoned.
Backtracking failures: the model recognizes it's wrong but can't reconstruct its prior state correctly. It attempts to backtrack and corrupts something in the process.
Practical multi-turn testing
Run 20-turn and 50-turn versions of the same task. Measure pass rate separately for each length. If your 20-turn pass rate is 80% and your 50-turn rate drops to 40%, you have a coherence degradation problem that won't appear in short evals. You now also know where the harness breaks down, which tells you where to invest: context compression, memory injection, plan re-anchoring prompts.
For complex tasks, inject mid-session perturbations (an unexpected tool failure, a user redirect) and check whether the model recovers to its original goal. Log the agent's stated next step at each turn. Verify it's consistent with the prior plan. This kind of structured replay testing catches plan abandonment before production does.
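A sketch of the length-stratified comparison, assuming a hypothetical `run_task` entry point that returns True when the task passes under a given turn cap:

```python
# Length-stratified pass rates: run the same tasks at two turn caps and compare.
def pass_rate_by_length(tasks, run_task, turn_caps=(20, 50)) -> dict[int, float]:
    rates = {}
    for cap in turn_caps:
        passed = sum(run_task(task, max_turns=cap) for task in tasks)
        rates[cap] = passed / len(tasks)
    return rates


# A large gap between rates[20] and rates[50] points at coherence degradation:
# context compression, memory injection, or plan re-anchoring is where to look.
```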
Regression testing
A harness regression is when a change to any non-model component causes tasks that previously passed to fail. It doesn't require a model upgrade. The most common sources are things teams change without thinking of them as risky:
System prompt wording: rephrasing instructions changes model behavior in ways that are hard to predict. A recent ablation study (the AHE paper, arXiv:2604.25850) found that the system prompt was the only component whose swap caused a performance regression; swapping tools, middleware, and memory all improved performance. The system prompt is the highest-impact, highest-risk thing in your harness.
Tool descriptions: rewording a tool's description changes which tool the model selects when multiple tools are candidates. "Get a file from the filesystem" and "Read a file at a given path" will produce different selection behavior.
Iteration cap: lowering MAX_ITERATIONS can cause tasks that require more exploration to fail silently. The task doesn't error. The agent just stops and declares success before it's actually done.
Context compression thresholds: too aggressive, and task-critical context gets truncated. Not aggressive enough, and you hit context overflow failures.
Retry logic: changes to how failed tool calls are retried can either fix or introduce failure patterns.
None of these are model changes. All of them can tank your pass rate.
The golden dataset
The core infrastructure for regression testing is a version-controlled set of 200 to 500 test cases that represent your harness's full operational envelope.
The best test cases come from real production traffic. They capture the actual requests and edge cases your users produce, not the ones you imagined while writing the spec. Keep this dataset static. Don't update it without deliberate review. When you do update it, commit the change with an explanation.
Golden traces extend this. Record complete interaction sequences (prompts, actions, tool calls, and results) from successful executions. These are deterministic: they don't change unless you change them. When a future harness change causes a deviation from a golden trace, you've found a regression. No LLM judge required. EvalView's golden trace system does exactly this: captures snapshots of known-good behavior and flags deviations automatically.
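A generic sketch of trace comparison (not EvalView's actual API): the recorded trace is a sequence of (tool, arguments) steps from a known-good run, and any divergence gets flagged.

```python
# Golden-trace comparison sketch. Trace format is assumed: a JSON list of
# {"tool": ..., "args": ...} steps recorded from a known-good run.
import json
from pathlib import Path


def load_golden_trace(path: str) -> list[dict]:
    return json.loads(Path(path).read_text())


def first_deviation(golden: list[dict], candidate: list[dict]) -> int | None:
    """Return the index of the first step where the candidate diverges,
    or None if the traces match end to end."""
    for i, (g, c) in enumerate(zip(golden, candidate)):
        if g["tool"] != c["tool"] or g["args"] != c["args"]:
            return i
    if len(golden) != len(candidate):
        return min(len(golden), len(candidate))
    return None
```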
CI/CD gating
Every PR that touches harness code (system prompt, tool descriptions, iteration cap, retry logic, context compression) should trigger an eval run. The workflow:
- New version runs against the static golden dataset.
- Scores compared to current production baseline.
- PR comment posts which test cases improved, which regressed, by how much.
- A regression threshold (2% drop in pass rate is a common starting point) blocks the merge.
This turns eval results into release decisions, not post-hoc analysis.
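The gating step itself can be a plain script the CI job runs after both suites finish. A minimal sketch, assuming each run writes a JSON file with a `pass_rate` field; a nonzero exit code blocks the merge:

```python
# CI regression gate sketch: compare candidate pass rate against the
# production baseline and fail the job if the drop exceeds the threshold.
import json
import sys
from pathlib import Path

REGRESSION_THRESHOLD = 0.02  # 2% absolute drop in pass rate blocks the merge


def main(baseline_path: str, candidate_path: str) -> int:
    baseline = json.loads(Path(baseline_path).read_text())["pass_rate"]
    candidate = json.loads(Path(candidate_path).read_text())["pass_rate"]
    drop = baseline - candidate
    print(f"baseline={baseline:.3f} candidate={candidate:.3f} drop={drop:+.3f}")
    return 1 if drop > REGRESSION_THRESHOLD else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```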
Cost control in CI
Agent evals use real API calls. Running the full suite on every PR gets expensive fast.
Tiered suites are the practical answer. A smoke suite of 10 to 20 representative tasks runs on every PR and finishes in minutes. The full regression suite runs on merges to main or nightly. Batch API endpoints cut eval costs roughly in half for non-latency-sensitive runs. Track cost per eval run alongside quality: if your eval spend starts approaching your development budget, you need a different strategy.
The harness parameter space
When you're tuning a harness, you're dealing with multiple interacting parameters. Change one and you may shift the behavior of something three parameters away. The key parameters and their failure modes:
| Parameter | What it controls | Common failure mode |
|---|---|---|
| System prompt wording | Model behavior baseline | Rephrasing breaks previously passing tasks |
| Tool descriptions | Tool selection decisions | Ambiguous descriptions cause wrong tool selection |
| MAX_ITERATIONS | Exploration budget | Too low: early task abandonment. Too high: cost blowup |
| Context compression threshold | When to summarize history | Too aggressive: truncates task-critical context |
| Retry logic | Recovery from tool failures | Aggressive retry: infinite loops. No retry: permanent failures |
| Memory injection strategy | What prior context is available | Too much: context bloat. Too little: coherence failures |
| Tool set composition | Which tools are available | Overlapping tools confuse selection. Missing tools cause failures |
Running ablations
Change one parameter, hold everything else constant, re-run the benchmark suite, compare pass rate before and after.
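A sketch of the loop, with a hypothetical config dictionary and a `run_suite` stand-in that returns the pass rate for a given configuration:

```python
# One-parameter-at-a-time ablation: hold a baseline config fixed, vary exactly
# one parameter, re-run the suite. Config keys and values are illustrative.
import copy

BASELINE_CONFIG = {
    "system_prompt": "v3",
    "max_iterations": 25,
    "compression_threshold": 0.8,
    "retry_limit": 2,
}


def ablate(parameter: str, values: list, run_suite) -> dict:
    results = {"baseline": run_suite(BASELINE_CONFIG)}
    for value in values:
        config = copy.deepcopy(BASELINE_CONFIG)
        config[parameter] = value  # change exactly one thing
        results[value] = run_suite(config)
    return results


# Example: ablate("max_iterations", [15, 25, 40], run_suite) gives pass rates
# you can compare against the baseline before touching anything else.
```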
The AHE paper result is worth knowing here. On Terminal-Bench 2, every component swap except system prompts improved performance. Tools, middleware, long-term memory: all positive. System prompt alone caused regression. The practical implication: your system prompt is where harness engineering time buys the most, and also where careless edits cost the most.
Ten AHE iterations lifted pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed Codex-CLI harness at 71.9%. The gains came from systematic ablation, not intuition.
Tuning priority, in order:
- System prompt: highest impact, highest regression risk
- Tool descriptions: directly affects selection quality
- MAX_ITERATIONS: easy to measure, immediate cost impact
- Context compression threshold: significant for long-horizon tasks
- Retry logic: measurable through error recovery rate
- Memory injection strategy: hardest to ablate cleanly, save for later
Overfitting to evals
Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.
Once your eval set is known, and your team knows which tasks pass and which fail, you start optimizing for those tasks specifically. Pass rate goes up. Real-world performance stays flat or declines.
The benchmark contamination problem
UC Berkeley researchers found that every major public agent benchmark can be exploited to achieve near-perfect scores without solving any actual tasks. GAIA: roughly 98% through public answers and normalization collisions. OSWorld: 73% through VM state manipulation and public gold files. WebArena: roughly 100% through config leakage and DOM injection.
These are public benchmarks. The same dynamic applies to internal eval sets the moment they become familiar.
Anthropic's engineering team documented a cleaner version of this: Claude Opus 4.6 on BrowseComp identified it was being evaluated, located the answer key, and decrypted it. The model gamed the eval through general capability, not through overfitting. If a sufficiently capable model can find your answer key, a harness optimized against a fixed eval set will find the shortest path to high scores long before that.
Building eval sets that resist gaming
Private holdout sets: keep a set that is never published, never discussed in eval reviews, and never used as a development signal. Run it quarterly. If your internal pass rate keeps going up but your holdout scores stay flat, you're optimizing for the eval.
Dynamic rotation: add 5 to 10 new test cases from production logs every month. The eval set grows with real failures rather than staying frozen against a snapshot of what failed six months ago.
The three-tier structure that works:
- Public benchmarks: where you stand vs. industry
- Internal regression suite (static): catches regressions before deployment
- Holdout set (never seen during development): honest measurement
Three signs you're overfitting: pass rate on your internal suite keeps climbing while production metrics stay flat; harness changes that improve eval scores harm performance on unfamiliar task types; your team knows which test cases pass before running the suite.
Production logs as the best eval source
The most realistic eval data is not in your test suite. It's in your production logs.
Hand-crafted evals are designed by people who know your system. They encode the failure modes you thought of. Production logs capture the failure modes your users discovered, which are almost always different.
The improvement loop
1. Trace production sessions: all tool calls, arguments, results, latency, token counts
2. Flag failures automatically: iteration cap reached without completion, user thumbs-down, zero-output sessions, tool error rate above threshold (see the flagging sketch after this list)
3. Review flagged sessions weekly: a human triages each into "add to eval suite," "known issue," or "expected behavior"
4. Add 5 to 10 new test cases per week from production
5. Improve the harness
6. Deploy
7. Return to step 1
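A sketch of the automated flagging step (step 2 above), using those criteria. The session fields are hypothetical; map them to whatever your tracing layer actually records:

```python
# Automated failure flagging over a traced session. Field names and the 25%
# tool-error threshold are illustrative assumptions, not from any framework.
def flag_for_review(session: dict) -> list[str]:
    reasons = []
    if session["iterations_used"] >= session["max_iterations"] and not session["completed"]:
        reasons.append("iteration cap reached without completion")
    if session.get("user_feedback") == "thumbs_down":
        reasons.append("user thumbs-down")
    if not session.get("final_output"):
        reasons.append("zero-output session")
    tool_calls = session.get("tool_calls", [])
    if tool_calls:
        # each tool call record is assumed to carry a boolean "error" field
        error_rate = sum(c["error"] for c in tool_calls) / len(tool_calls)
        if error_rate > 0.25:
            reasons.append(f"tool error rate {error_rate:.0%}")
    return reasons  # a non-empty list routes the session into weekly review
```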
Research on LLM evaluation practices has found that production-derived test cases improved model performance on real-world tasks by 34% compared to synthetic-only datasets. The gap between "things I imagined would break" and "things that actually break in production" is that large.
LangSmith's Insights Agent runs automated clustering over production traces: it groups thousands of traces by intent, analyzes traces with negative feedback, and surfaces patterns the harness consistently handles poorly. That's the kind of analysis that takes a week manually and runs overnight automatically.
Infrastructure and tooling
Braintrust is framework-agnostic. It works across OpenAI, Anthropic, Google, and open-source models. It has a native GitHub Action for CI/CD gating, trajectory-level scoring for multi-step agents, and dataset management built in. Best fit: teams that need evaluation to act as release control, and teams working across multiple frameworks or providers.
LangSmith is the right choice for Python-first teams building with LangChain or LangGraph. Native tracing captures every component interaction automatically. The annotation queues route selected traces to human reviewers with full session context. Adding a problematic production trace to a dataset is a single click. The Insights Agent handles automated failure clustering. Tight pytest and GitHub workflow integration.
Inspect AI (UK AI Security Institute, open-source) has a clean architecture: Datasets, Solvers, and Scorers as distinct layers. It supports arbitrary external agents (Claude Code, Codex CLI, Gemini CLI) and comes with over 200 pre-built evaluations. Used for nearly all of UK AISI's automated evaluations, and adopted by Anthropic and DeepMind. If you want to run standardized safety and capability evaluations without building the framework yourself, start here.
HAL (Holistic Agent Leaderboard, from Princeton PLI) is worth knowing for benchmarking. It accepts any agent exposing a minimal Python API and tracks cost per benchmark run as a first-class metric alongside accuracy. Validated with 21,730 rollouts across 9 models and 9 benchmarks. The cost-tracking approach is the most practically useful idea in HAL's design: your eval infrastructure should track what each suite run costs, not just whether it passed.
DeepEval runs as pytest unit tests, making it easy to slot into existing CI workflows. It has agent-specific metrics including Tool Correctness (trajectory comparison against ground truth) and multi-turn evaluation support.
Ragas covers agentic metrics including Agent Goal Accuracy, Tool Call Accuracy, and Tool-Calling Efficiency. Good for RAG pipelines that have grown into multi-step agents.
EleutherAI's lm-evaluation-harness is the backend for the Hugging Face Open LLM Leaderboard. It's primarily a model-level eval tool, not a harness regression tool. Use it to isolate the model component as a diagnostic when you're trying to separate model effects from harness effects.
When to call in humans
Automated evals handle most of what you need. Four cases where they don't:
Subjective quality dimensions: naturalness, tone, domain appropriateness. These require judgment that LLM judges approximate but don't fully capture. If your product lives or dies on whether responses feel right to domain experts, periodic human review is not optional.
Novel failure modes: your golden dataset encodes failures you've already seen. Human reviewers spot failure patterns that haven't shown up yet.
LLM judge calibration: before trusting an LLM judge at scale, you need a ground-truth set of human-annotated examples. The judge is only as reliable as its calibration.
Specialized domains: legal, medical, financial. Correctness here requires expert judgment. LLM judges at best approximate what a domain expert would conclude. For high-stakes domains, treat automated evals as a filter that routes edge cases to human review, not as the final word.
The practical hybrid: LLM judges for volume, humans for calibration and the cases the automated system flags as uncertain. Anthropic uses crowdworker pairwise comparisons for Elo-based model scoring. LMArena runs this at scale to generate preference leaderboards. For internal eval purposes, you don't need that scale: a structured annotation queue routing the bottom 5% of automated scores to human review handles most of what you actually need.
The benchmark landscape
A few benchmarks worth knowing for external context on where harnesses stand.
SWE-bench Verified (OpenAI, August 2024) is the current standard for coding agent evaluation. 500 human-validated instances from real repository bug reports and pull requests, each run in an isolated Docker container. It measures the full agent (model plus harness), not the model alone. The same model scores wildly differently across different scaffolds. Use SWE-bench Verified to benchmark your harness as a system, not to evaluate your model in isolation.
SWE-bench Live adds 50 newly verified issues monthly. It addresses data contamination directly. If you're seeing suspiciously high scores on SWE-bench Verified, Live is the check.
AgentBench evaluates agents across eight distinct environments: OS interaction, database querying, knowledge graph navigation, web shopping, web browsing, and more. It tests generalization across fundamentally different agentic settings in a single framework.
GAIA (Meta AI, ICLR 2024) is the multi-step reasoning benchmark. Real-world questions requiring web browsing, tool use, and multi-modal reasoning. Humans score 92%. GPT-4 with plugins scored 15% in the original paper. H2O.ai reached 75% in 2025. The difficulty gap is real.
HAL (Princeton PLI) benchmarks agents with cost as a first-class dimension alongside accuracy. It's the only major leaderboard that tracks what it costs to run a benchmark, not just whether the agent passed. If evaluation cost is a real constraint (and eventually it is for everyone), HAL's methodology is worth adopting internally.
Where this connects to the series
The pillar post established that the harness is the variable: not just for performance, but for cost, reliability, and safety. Evaluation is how you measure that variable over time.
A harness without an eval pipeline is a system you can't change with confidence. Every prompt edit, every tool description update, every change to your iteration cap is a bet you're placing without knowing the odds. The eval pipeline makes those odds explicit.
The next posts in this series cover the AGENTS.md and CLAUDE.md pattern (how harnesses pick up persistent project context without re-prompting every session) and the A2A protocol for inter-harness agent communication.
References and sources
Anthropic Engineering
- Demystifying evals for AI agents - Anthropic Engineering
- Eval awareness in Claude Opus 4.6's BrowseComp - Anthropic Engineering
- Bloom: Auto-evals for alignment - Anthropic Alignment
Benchmark papers
- GAIA (Meta AI) - arXiv:2311.12983 (ICLR 2024)
- AgentBench - ICLR 2024
- WebArena - arXiv:2307.13854
- HAL: Holistic Agent Leaderboard - arXiv:2510.11977; hal.cs.princeton.edu
- SkillsBench - arXiv:2602.12670
Harness engineering research
- Agentic Harness Engineering (AHE) - arXiv:2604.25850
- Active Context Compression - arXiv:2601.07190
- Natural-Language Agent Harnesses - arXiv:2603.25723
Agent evaluation research
- Evaluating LLM-based Agents for Multi-Turn Conversations - arXiv:2503.22458 (ACM TIST)
- Beyond Task Completion - arXiv:2512.12791
- Goodhart's Law in Reinforcement Learning - arXiv:2310.09144
- TRAJECT-Bench - arXiv:2510.04550
Benchmarks
- SWE-bench | SWE-bench Verified | SWE-bench Multimodal
- SWE-bench Verified (OpenAI announcement)
- GAIA Leaderboard
- Agentic AI Benchmarks Leaderboard
Tooling and platforms
- Braintrust | Agent eval framework
- LangSmith Evaluation | Insights Agent
- Inspect AI (AISI) | Sandboxing Toolkit
- DeepEval / Confident AI
- Ragas agentic metrics
- EleutherAI lm-evaluation-harness
- HAL harness GitHub
Engineering blogs and guides
- Amazon AWS: Evaluating AI agents, real-world lessons
- Google Cloud: A methodical approach to agent evaluation
- The Pragmatic Engineer: A pragmatic guide to LLM evals for devs
- LangChain: The Agent Improvement Loop Starts with a Trace
- LangChain: Improving Deep Agents with Harness Engineering
- HuggingFace: AI evals are becoming the new compute bottleneck
- Evidently AI: 10 AI agent benchmarks
- UC Berkeley: How We Broke Top AI Agent Benchmarks
- Arize AI: What is an evaluation harness?
Rahul Kashyap is CTO & Co-founder at Designare Solutions and DeepStory, based in Bangalore.