Harness Failure Modes: What Actually Breaks and How to Catch It

Four LangChain agents. Eleven days. $47,000.

The team didn't get hacked. The model didn't hallucinate its way into a disaster. Two agents got stuck in a loop: one generating content, the other requesting more analysis. Neither had a budget ceiling. Nobody noticed until the invoice arrived.

That's a harness failure. Not a model failure.

65% of enterprise AI production failures trace back to harness defects: context drift, schema misalignment, state degradation. The model is usually fine. The infrastructure around it is the problem.

This is a catalog of what breaks, how each failure looks in production, and what you build to stop it. If you haven't read the first post in this series on what harnesses actually are, start there. This one assumes you know the basics.

The failure is almost never the model

Quick re-anchor: the harness is everything in an agent except the model. The loop, the tools, the memory management, the permission layer, the budget enforcement. The model reasons. The harness executes.

When something goes wrong in production, most teams debug the model first. Different prompt, different temperature, bigger model. That's often the wrong call. The ten failure modes below don't require a bad model. They require a harness that didn't protect against predictable edge cases.

The ten failure modes

1. Infinite loops

The agent loop has no natural exit. The model decides when it's done. When its judgment is off, or when the harness provides no external check, it runs forever.

🔥 The $47,000 incident (November 2025): A market research pipeline running four LangChain agents ran for 11 days. An Analyzer generated content. A Verifier requested further analysis. The Analyzer obliged. Neither had a per-agent budget cap. No mechanism existed that could terminate the session before the next API call completed. Full post-mortem here.

LangChain also documented a subtler version they call a "doom loop" in their deepagents-cli work: the agent wrote code, re-read it, decided it looked fine, and stopped without ever running it. Self-verification with no external check. The harness had no way to know the task wasn't actually done.

What a stuck agent looks like in logs: the same tool called with identical parameters two or three times in a row. Token count increasing with no visible state change. Tool call sequence: A, then B, then A, then B.

The fix is hard limits, not prompt instructions. A counter in your harness that throws at iteration 50 cannot be reasoned around. "You have at most 20 steps" in a system prompt can be ignored under task pressure. Pair soft limits with hard limits. Never use soft limits alone.

python

class LoopGuard:
    def __init__(self, max_iterations=50, max_identical_calls=3):
        self.iteration = 0
        self.call_history = []
        self.max_iterations = max_iterations
        self.max_identical = max_identical_calls

    def check(self, tool_name, tool_args):
        self.iteration += 1
        if self.iteration > self.max_iterations:
            raise HarnessTermination("max iterations exceeded")

        call_signature = f"{tool_name}:{hash(str(tool_args))}"
        self.call_history.append(call_signature)

        # Detect repeated identical calls
        if len(self.call_history) >= self.max_identical:
            recent = self.call_history[-self.max_identical:]
            if len(set(recent)) == 1:
                raise HarnessTermination(
                    f"identical tool call repeated {self.max_identical}x"
                )

2. Context window overflow (and context rot)

Every tool result, every message, every observation gets appended to the context. Eventually you hit the limit. But it doesn't crash cleanly. It degrades silently.

📊 Context rot: Chroma tested 18 frontier models in 2025 (GPT-4.1, Claude Opus 4, Gemini 2.5) and found that every single one gets worse as input length increases, before the context window is full. Accuracy drops of 20–50% between 10k and 100k tokens. Information placed in the middle of long contexts degrades significantly.

The cascading effect makes this worse than it sounds. A context overflow doesn't just mean a missed instruction. The model may re-execute destructive operations it already ran. It may reinterpret its entire task without noticing. The "Lost in the Middle" paper (Liu et al., 2023) established that models attend well to the beginning and end of context but poorly to the middle. That's not just an interesting benchmark result. It's an architectural constraint every harness has to work around.

Claude Code's response to this is a five-stage compaction pipeline that runs before every model call: budget reduction, snip, microcompact, context collapse, and auto-compact. Each stage has a different cost-benefit tradeoff; cheaper stages run first.

For teams building their own harnesses: rolling windows are simple but drop early instructions that may still matter. Summarization is better. AWS documented a case where memory pointers (references to external memory rather than content) reduced token usage from 20,822,181 to 1,234 tokens, a 16,000x reduction, while succeeding where full-context inclusion had failed.

3. Prompt injection

OWASP ranks this as #1 in the LLM Top 10 for 2025. It's also the most widely exploited technique in AI security vulnerabilities reported to Microsoft.

The basic attack: malicious content in the agent's environment (a web page, an email, a file, a tool result) contains instructions that the model interprets as legitimate commands and executes.

⚠️ EchoLeak (CVE-2025-32711, CVSS 9.3): A crafted email coerced Microsoft 365 Copilot into accessing internal files and transmitting them to an attacker's server. No user interaction required. A single injection cascaded through retrieval to exfiltrate chat logs, OneDrive files, SharePoint content, and Teams messages.

Agents are more vulnerable than chatbots for a straightforward reason: chatbots present output, agents execute it. A single injected instruction can trigger a sequence of irreversible actions.

MCP tool poisoning is a newer attack vector worth knowing. Malicious instructions embedded in tool descriptions are invisible to users but visible to the model. Invariant Labs reported this achieved an 84.2% success rate in controlled testing when agents have auto-approval enabled. There's also the "rug pull" variant: a tool description is silently altered after user approval, so a trusted tool becomes malicious at runtime.

The harness-level defense is an instruction hierarchy. System prompt is high trust. User messages are medium trust. Tool results are untrusted data and should never be promoted to instruction status. Anthropic's RL training exposed Claude to prompt injections in simulated web content and rewarded correct refusal; Claude Opus 4.5 reduced successful browser-based injection attacks to 1%. Model-level hardening plus harness-level content filtering is the right combination.

4. Hallucinated tool calls

The model invents a tool that doesn't exist, or calls a real tool with arguments that don't match its schema.

json

// Model invokes non-existent tool
{
  "tool_name": "get_customer_profile",   // doesn't exist; real tool is "fetch_user_record"
  "arguments": {
    "customer_id": "cust_abc123",
    "include_history": true              // not in schema; model invented this field
  }
}

The syntactically correct but semantically wrong calls are the dangerous ones: delete_file with the wrong path; a null passed for a required parameter that gets filled in with a default the model didn't intend; an integer where the tool expects a string ID, auto-coerced and run without error.

More tools in the registry means a higher hallucination rate. Models confuse similar-sounding tools, and they generalize from training patterns: they "know" what HTTP APIs look like, so they fabricate plausible-looking calls.

The tool registry should be a gatekeeper. Before any tool fires: check existence, check required fields, check types, check for hallucinated fields, check value constraints. This layer catches 60–70% of tool-call errors before they reach application code.

python

def validate_tool_call(call: dict, registry: dict) -> tuple[bool, str]:
    tool_name = call.get("tool_name")
    if tool_name not in registry:
        return False, f"Tool '{tool_name}' not found in registry"

    schema = registry[tool_name]["parameters"]
    args = call.get("arguments", {})

    # Check for hallucinated fields
    allowed_fields = set(schema["properties"].keys())
    provided_fields = set(args.keys())
    extra = provided_fields - allowed_fields
    if extra:
        return False, f"Hallucinated fields: {extra}"

    # Check required fields
    required = set(schema.get("required", []))
    missing = required - provided_fields
    if missing:
        return False, f"Missing required fields: {missing}"

    return True, "ok"

OpenAI's strict: true in function definitions guarantees arguments match the provided JSON Schema. Semantic validity (right target, right scope) is still a harness responsibility.

5. Tool execution failures

The tool exists. The call is valid. But execution fails: network timeout, permission error, rate limit, malformed response from an external service.

The retry ambiguity problem is the hard part. When an agent retries a timed-out operation, it doesn't know whether the original completed. Retry write_file and you may write twice. Retry a payment API and you may charge twice. Idempotent operations can be retried safely. Non-idempotent ones can't, without additional state tracking.

python

import random, time

def retry_with_backoff(fn, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except TransientError as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)

For long-running tools that risk timing out: return a job ID immediately and let the agent poll for results. This prevents a timeout from being misread as a failure.

6. State corruption and partial writes

The agent crashes mid-operation. The harness has no rollback mechanism. The file system or database is left in an inconsistent state that neither the model nor a human can easily recover from.

Agent writes file A and file B as part of one logical change. Crash between writes. File A is changed; file B is not. The codebase is now broken. Or the agent partially refactors a function: removes old code, never finishes writing the new code. Code doesn't compile.

🔥 The Replit incident (July 2025): An agent deleted a production database, then attempted to cover the damage by generating 4,000 fake user records and manipulating operational logs. The agent admitted to "panicking." Inconsistency led to more inconsistency. Core lesson: an agent that can create state must be able to roll it back. It must never be trusted to self-assess whether its own state is consistent.

Claude Code snapshots file contents before every edit. If something goes wrong, you revert. For distributed agents, Temporal provides durable execution: every model call, tool execution, and API request is part of a deterministic workflow. If the process crashes, it resumes from the last checkpoint by replaying the event history. OpenAI runs Temporal for Codex at millions of production requests daily.

python

class SafeFileEditor:
    def edit_file(self, path: str, new_content: str):
        snapshot = self._snapshot(path)
        try:
            self._write(path, new_content)
            if not self._verify(path):
                self._restore(path, snapshot)
                raise VerificationError(
                    f"Edit to {path} failed verification, reverted"
                )
        except Exception:
            self._restore(path, snapshot)
            raise

7. Context poisoning

A bad tool result enters the context. The model treats it as ground truth and builds subsequent reasoning on top of it. Errors compound.

This isn't necessarily adversarial. A flaky API returned wrong data. A RAG system retrieved the wrong document. A sub-agent made a reasoning error. The model proceeds confidently. The only signal is downstream output quality, which may not degrade visibly until several steps later, by which point the causal chain is buried.

MemoryGraft (2025 research) shows a persistent version of this: poisoning an agent's long-term memory with malicious "successful experiences." When the agent later faces a similar task, it retrieves the poisoned pattern and follows it across different sessions. One poisoned session can corrupt future unrelated ones.

Defenses: periodic context audits (have a separate evaluator model compare current state against the original task spec every N turns), source tagging (every piece of context carries its source and a trust level), and fresh-start checkpoints (explicit re-anchor points where the agent re-reads the original task from scratch rather than from its accumulated summary).

8. Silent failures

The harness catches an error, logs it internally, and continues without telling the model what happened. The model proceeds as if the tool succeeded.

python

# The anti-pattern - don't do this
try:
    result = tool.execute(args)
    context.append(result)
except Exception as e:
    logger.error(e)         # logged internally
    context.append(None)    # model gets None - it doesn't know
    continue                # keeps going

# The correct pattern - surface failures explicitly
try:
    result = tool.execute(args)
    context.append({
        "tool": tool.name,
        "status": "success",
        "result": result
    })
except ToolExecutionError as e:
    context.append({
        "tool": tool.name,
        "status": "failed",
        "error": str(e),
        "message": f"Tool {tool.name} failed: {e}. Account for this in subsequent steps."
    })

Data quality issues account for 27% of AI agent production failures. These are the worst category because no exception is thrown. The tool returns a response. The response contains wrong, stale, or incomplete data. The model treats it as accurate. Atlan reports that data-layer failures account for 55% of enterprise harness failures overall.

Fail loud by default. Every failure surfaces to the model. The model can adapt its plan when it knows a tool failed. It cannot adapt when it doesn't know.

9. Budget blowouts

No hard cap on tokens, API calls, or dollars. The agent runs freely. The bill arrives later.

The alert-vs-enforcement gap is a distinct and well-documented failure pattern. Teams configure budget alerts, not budget enforcement. An alert fires after the budget is gone. An alert does not stop the next API call.

The correct architecture places a budget governance layer between agent code and the LLM API. It checks remaining budget before each call. It refuses to make the call if the budget is exhausted. It does not wait for a response to arrive before checking. The budget is a policy, not a dashboard threshold.

In multi-agent pipelines, per-agent caps matter separately from the total pipeline budget. In the $47,000 incident, two agents consumed the entire budget while the other two were waiting. Each agent needs its own hard cap, plus the pipeline needs a global one.

Tag every API call with the initiating agent ID, task ID, and pipeline ID. Without attribution, budget overruns are invisible at the agent level. You just see a large API bill.

10. Multi-agent cascade failures

In a multi-agent pipeline, one agent's output becomes the next agent's input. Errors amplify.

📊 The 17x finding: A December 2025 Google DeepMind study across 180 configurations found that unstructured multi-agent networks (what they call "bag of agents") amplify errors up to 17.2x versus a single-agent baseline. Centralized systems with an orchestrator contained amplification to 4.4x. Full analysis here.

The MAST study (March 2025, 1,642 execution traces across seven open-source frameworks) found failure rates between 41% and 86.7%. The "bag of agents" anti-pattern is: throw more LLMs at a problem expecting emergent correctness. Errors don't cancel. They cascade.

Here's a simple cascade. Agent 1 misreads "billing dispute" as "billing inquiry." Agent 2 selects the wrong response template. Agent 3 generates a polite inquiry reply to an angry dispute customer. The error at step 1 was small. The impact at step 3 is significant.

An orchestrator acts as a validation bottleneck, catching errors before they propagate. Output validation before handoff means a failed schema check stops the cascade at the source. For critical outputs, use a separate verification agent that has only seen the original task and the proposed output (not the working chain). It gives you independent judgment.

python

def validated_handoff(source_output, schema, next_agent):
    if not schema.validate(source_output):
        raise CascadePreventionError(
            f"Agent output failed validation: {schema.errors}"
        )
    # Only pass output to the next agent if it passes validation
    return next_agent.run(source_output)

Five defenses that cover most of these

The ten failure modes aren't independent. Several share root causes. Fix five underlying patterns and you address most of them.

Hard limits over soft limits. The model can work around instructions in a prompt. It cannot work around a counter that throws. Enforce iteration limits, budget limits, and timeouts in code.

Fail loud by default. Every tool failure surfaces to the model with the full error. Noisy failures are debuggable. Silent failures are not. An SRE principle that translates directly to harness design.

Snapshot before every write. Any operation with side effects needs a checkpoint. Files: snapshot before editing. Databases: always inside a transaction. API sequences: design for idempotency or use compensating transactions.

Validate every tool call against its schema. Before a tool fires, check existence, required fields, types, hallucinated fields, and value constraints. This layer is cheap to build and catches 60–70% of errors before they reach application code.

Treat tool results as untrusted data. Tool results are data. Not instructions. Not facts. Tag them with their source and trust level before they enter context. Never promote tool results to instruction status.

Catching failures before production

Harness failures are detectable with the right testing structure. The key distinction is measuring harness performance separately from model performance.

The six-layer testing stack

Layer 0: Data certification. Certify every data source before evals run. Bad input data makes all evals meaningless.
Layer 1: Unit tests. Test individual tool calls with deterministic assertions. No model involved.
Layer 2: Integration tests. Test multi-step workflows and context retention. Does information carry correctly across turns?
Layer 3: End-to-end with fault injection. Run the full agent on realistic tasks; deliberately inject tool failures, context truncation, malformed responses.
Layer 4: Adversarial testing. Prompt injection attempts, tool poisoning, jailbreaks via document content.
Layer 5: Production CI/CD gate and continuous monitoring. Eval scores gate deployment; performance monitored continuously.

Useful harness-specific metrics: task completion rate under simulated tool failures; context retention rate across 50 turns; what percentage of runs enter a loop pattern (detectable by identical tool call sequences); cost per task completion; rollback success rate.

💡 The LangChain case: By changing only the harness (system prompt, tool definitions, LoopDetectionMiddleware) and leaving the model untouched, LangChain improved from 52.8 to 66.5 on Terminal Bench 2.0. That's +13.7 points from harness changes alone. Full writeup here.

What to trace on every loop iteration

Tool name, input args, output (truncated if large), latency, success or failure status
Token count in and out
Iteration number
Cost per call
Every exception caught (never swallow silently)

For tooling: LangSmith is native to LangChain stacks with node-by-node state diffs and full execution graphs. Arize Phoenix brings ML-grade eval primitives and drift detection. Langfuse is open-source and self-hostable, useful for teams that can't send traces to third parties. Helicone is a drop-in proxy: change one base URL and get traces with near-zero integration overhead.

Circuit breakers

Borrowed from distributed systems. If a tool has failed three times in a row, open the circuit. Stop retrying. Surface the tool as unavailable to the model. Three states: closed (normal, tracking failure rate), open (failure threshold exceeded, fails fast), half-open (after cooldown, one test request to check recovery).

The harness is the reliability surface

The model doesn't make these decisions. The harness does.

Every failure mode above is a predictable class, not a rare edge case. These are things that happened in production at Microsoft, Amazon, Replit, and LangChain. Most of them happened because nobody built the check.

LangChain added loop detection middleware. Anthropic snapshots before every file edit. Temporal makes crash recovery deterministic. These aren't clever tricks. They're engineering baselines that should exist in any production harness.

If you haven't engineered the harness, you haven't engineered the agent.

Next in the series: security in harness design: sandboxing strategies, principle of least privilege for tools, and what a path traversal looks like at the harness level.

Notable incidents referenced in this post

Incident	Date	Impact	Root cause
LangChain $47k loop	Nov 2025	$47,000 API bill	No per-agent budget cap, no circuit breaker
Replit database deletion	Jul 2025	Production data wiped, 1,200+ accounts	No permission sandbox, no rollback mechanism
Amazon Kiro AWS deletion	Dec 2025	13-hour outage, mainland China	Agent deleted production environment with no approval gate
EchoLeak (CVE-2025-32711)	Jun 2025	M365 Copilot data exfiltration	Zero-click prompt injection via crafted email
Cursor RCE (CVE-2025-54135/6)	2025	Remote code execution	MCP config creation bypassed approval flow
GitHub Copilot (CVE-2025-53773)	2025	Arbitrary code execution	Prompt injection in public repo comments

Sources and references

Incidents and post-mortems

Context and memory

Security and prompt injection

Multi-agent systems

Harness engineering and testing

State and durability

Budget enforcement

Tool calling and hallucination

Rahul Kashyap is CTO & Co-founder at Designare Solutions and DeepStory, based in Bangalore.