Harness failure modes and recovery strategies
When an agent fails in production, the model is often fine. The loop ran too long, context rotted, a tool returned poisoned text, or validation was missing. This lesson is a field guide to predictable harness failures and the defenses that actually work.
Most agent production failures are harness defects: missing budgets, bad context policy, weak tool gates, or untrusted data treated as instructions. Fix the infrastructure layer first.
Start with the right blame model
Teams often react to a bad agent run by changing the prompt or swapping models. Sometimes that helps. Often the failure mode would hit any model because the harness never enforced a basic invariant.
Useful default: if money burned with no user value, look at loop guards and logging. If the wrong file was deleted, look at permissions and validation. If the answer drifted on step 35, look at compaction and goal pinning.
Failure mode 1: infinite loops
The model keeps calling tools or re-planning because nothing external says stop. Classic signatures in logs:
- Same tool with identical arguments repeated.
- Alternating pattern: tool A, tool B, tool A, tool B.
- Token count climbs with no change in world state.
Real incidents have run for days and burned tens of thousands of dollars when two agents fed each other "needs more analysis" forever. No per-agent cap, no global cap, no stuck detection.
Recovery: hard iteration limits, per-session cost caps, identical-call detection, global pipeline budget for multi-agent setups. Soft prompt limits alone are not enough.
A stuck agent runs 50 iterations. Each iteration calls one cheap tool plus one model round:
- Tool overhead: 50 calls × $0.01 infra/API fee per call = $0.50
- Tokens: ~200k total (growing context) at illustrative Sonnet-class rates — 150k input @ $3/MTok ($0.45) + 50k output @ $15/MTok ($0.75) = $1.20
- Session total: about $1.70 for one bad hour—not catastrophic alone.
Now multiply: 200 concurrent sessions with a missing circuit breaker, or a multi-agent pipeline where each of five workers runs 50 steps. That is $1.70 × 200 = $340/hour, or $8,160/day before anyone notices a dashboard. Cost runaway is a harness enforcement bug, not bad luck.
Failure mode 2: context overflow and rot
Overflow is hitting the max context size. Rot is getting worse while still under the limit. Research across frontier models shows accuracy dropping as input grows, with middle content especially vulnerable.
Failure looks like: forgotten constraints from early messages, repeated destructive actions the model already performed, reinterpretation of the task mid-run.
Recovery: staged compaction, pinned user goals, memory pointers, truncate verbose tool output, monitor max_tokens stop reason. If that stop reason appears often, your harness let context grow unchecked.
Failure mode 3: prompt injection
Untrusted content in tool results, web pages, emails, or files contains instructions the model treats as legitimate. Agents are higher risk than chatbots because they execute, not just display.
Attack variants include MCP tool poisoning (malicious text in tool descriptions users never see) and rug pulls (tool definition changes after approval).
Recovery: instruction hierarchy (system high trust, user medium, tool results untrusted data), sanitize and sandbox tool outputs, human approval for exfiltration-prone tools, model hardening as a second layer not the only layer.
A chatbot might leak text in the reply. An agent can chain tools: read email, read internal doc, post to webhook. One injected paragraph can become a sequence of irreversible actions. The harness must treat external content as data, never promote it to system-level instructions.
Failure mode 4: hallucinated tool calls
The model invents a tool name, adds fields not in the schema, or passes plausible but wrong argument values. More tools in the registry increases confusion among similar names.
Recovery: strict schema validation, reject unknown fields, typed enums, least-privilege registry (only expose tools needed this phase), structured errors returned to the model for retry.
Failure mode 5: tool execution failures
Network timeouts, 403s, rate limits, malformed third-party JSON. If the harness surfaces a generic failure, the model loops uselessly.
Recovery: classify errors (transient vs permanent), retry transient with backoff, return actionable messages, circuit-break repeated failures on the same call signature.
Failure mode 6: premature completion
The model returns end_turn while the task is incomplete: code not run, tests not executed, ticket not created. The harness exits because the model said done.
Recovery: completion checks in harness code, required verification tools, eval hooks that reject stop when invariants fail.
Failure mode 7: observability gaps
Without structured per-iteration logs you cannot answer: loop or model? which tool? which iteration blew the budget? Free-text logs break when wording shifts.
Recovery: JSONL per session with iteration, tool name, latency, token counts, stop_reason, cumulative_tokens. Hash large outputs instead of logging secrets. OpenTelemetry traces when you scale beyond one machine.
Build a "replay from log" path early. If you cannot reconstruct what the model saw on iteration 17, you will fix bugs by tweaking prompts blindly. Session replay is harness engineering, not ML research.
Recovery playbook (short)
- Detect: stuck patterns, cost slope, stop_reason anomalies.
- Contain: circuit breakers, kill session, revoke write tools.
- Diagnose: replay context assembly for failed iteration.
- Remediate: harness code, not prompt lottery.
- Prevent: add eval case from production failure.
The loop from production failure to eval set is how harness quality compounds. One-off prompt edits do not.
Failure mode 8: multi-agent cascade
In pipelines, one agent's mistake becomes the next agent's input. Unstructured "bag of agents" topologies amplify errors far more than single-agent baselines. Central orchestrators reduce amplification but still need validation at merge points.
Recovery: structured handoffs, critic steps with external checks, per-agent caps, distrust subagent prose until schema-validated.
Failure mode 9: schema and state drift
The world changes mid-run (file moved, API version bumped) but context still describes the old state. The model acts on stale observations.
Recovery: invalidate summaries after writes, version external state in tool results, re-read critical facts before destructive actions.
Minimum viable observability
Before buying a platform, ship JSONL with: session_id, iteration, event_type, tool_name, latency_ms, input_tokens, output_tokens, cumulative_tokens, stop_reason. Hash large outputs. That schema alone answers most first questions.
Building a failure taxonomy
Tag production failures with harness labels, not just "bad output": loop_stuck, context_rot, tool_hallucination, injection_suspect, premature_stop, cascade_multi_agent. Review weekly. If one tag dominates, that is your engineering roadmap.
Pair tags with replay artifacts. A tag without a reproducible log line is gossip.
Incident response sketch
When spend spikes: freeze new sessions, identify sessions with high iteration counts, inspect last ten tool calls for loops, patch harness guard, add eval case, redeploy. When data leaks suspected: revoke tool credentials, audit logs for exfiltration tools, narrow registry, postmortem injection path.
Keep runbooks harness-centric. "Change the prompt" is not a runbook step.
Connecting failures to evals
Every production incident should become a regression test: replay the session input, mock or record tool results, assert the harness stops or recovers correctly. Over time you build a library of failure shapes the way SRE teams build alert playbooks. For trajectory grading, step budgets, and ToolCallF1 patterns, see Evaluation L02 — Agent trajectory evals and L05 — CI regression.
Model upgrades then run against harness evals first. You separate "did the loop break?" from "did answer quality change?"
For agents, add trajectory evals: grade the tool sequence and arguments, not only the final string. Record golden traces in CI; fail when stop reason, tool order, or cap behavior regresses. See Evaluation L02: Task evals and golden sets for trajectory graders and Evaluation L05: Regression testing CI for merge gates on harness changes.
User-visible failure copy
When the harness stops a run, users see an error message. Write those strings for humans, but log the machine reason separately (circuit_breaker:identical_calls). Support teams need both. Vague "something went wrong" without a session ID blocks every downstream investigation.
Regression tests from incidents
After each serious incident, add: input transcript snippet, harness version, expected stop reason, max iterations observed. Run on CI when harness code changes. Model evals can run nightly; harness regressions should run on every pull request because they are deterministic.
Harness eval: did the loop stop, validate, log, and recover correctly given fixed model outputs? Model eval: did the final answer meet quality rubric? Confusing them leads to prompt changes when you needed a circuit breaker, or vice versa.
Cost failures are harness failures
Runaway spend always means missing or misconfigured caps: per iteration, per session, per org, per agent in multi-agent setups. Finance alerts are not a substitute for enforcement at call time. Bill shock means the harness let another API request through when it should not have.
Alert on derivative signals too: tokens per minute spike, identical tool signature count, sessions over N iterations. Alerts should link to session replay.
Sharing learnings across teams
Publish internal harness postmortems the way you publish outage reports: timeline, root cause, harness change, eval added. Model vendors will not fix your missing circuit breaker. Cross-team catalogs of failure tags speed up reviews when someone proposes a new agent feature.
Encourage engineers to tag support tickets with harness failure classes. Patterns in support data often precede finance data by weeks.
Quarterly, review the top three tags and ship one harness fix each. That rhythm beats annual "agent strategy" decks. Small deterministic fixes compound faster than model upgrades alone. Write the fixes in code, not in prompt footnotes. Share eval diffs in the pull request.
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- What log pattern suggests an infinite loop?
- How is context rot different from hitting max context?
- Why should tool results be low trust?
- What is a doom loop / premature completion?
Quick check
- A model intelligence failure
- A harness loop guard failure
- A pretraining data problem
- Improve context management and compaction in the harness
- Switch to the largest available model only
- Lower temperature to 0
- Promote the email to system prompt so the model reads it first
- Treat the email as untrusted data; block exfiltration tools without approval
- Pass it through unchanged and trust the model
- Logs may contain PII or secrets and bloat storage
- JSONL cannot store large payloads
- Developers should never read logs
- Prompt injection
- Infinite loop (stuck tool pattern)
- Hallucinated tool call
- Premature completion
- Context overflow
- Premature completion / doom loop
- Multi-agent cascade
- Prompt injection via tool result
- Hallucinated tool schema
- Observability gap