Your agent ran for 47 minutes. It consumed $12 of tokens. It returned nothing useful. Without logs, you have no idea why. Did it get stuck in a loop? Did a tool fail silently? Did the context overflow and the model start hallucinating steps? You genuinely cannot tell.

This is the production reality that most harness writeups skip. The model gets all the attention. The harness gets the blame when things go wrong, and zero credit when they don't. Observability is how you change that. It's what turns a black-box agent run into something you can actually debug, replay, and improve.

This is Blog 05 in the Agent Harnesses series. Blog 01 is useful background on what harnesses are, but you don't need it to follow this one. This post covers the minimum log schema, token accounting, stuck-agent detection, multi-agent tracing, privacy, and closing the loop from production failures to evals.


The minimum log schema

Start here. Before any tracing infrastructure, before any observability platform, log every harness event as a structured JSON object. One event per line. One file per session. That's it.

Here's what each log entry should contain at minimum:

json
{
  "timestamp": "2026-05-07T14:23:01.412Z",
  "session_id": "ses_a1b2c3d4",
  "trace_id": "4bf92f3577b34da6",
  "iteration": 7,
  "event_type": "tool_call",
  "tool_name": "read_file",
  "tool_inputs": { "path": "/src/main.py" },
  "tool_output_hash": "sha256:...",
  "tool_output_truncated": "File read (1,842 tokens, truncated)...",
  "input_tokens": 12400,
  "output_tokens": 341,
  "cache_creation_tokens": 4200,
  "cache_read_tokens": 8000,
  "cumulative_tokens": { "input": 58200, "output": 1820 },
  "stop_reason": "tool_use",
  "latency_ms": 1820,
  "model": "claude-sonnet-4-6",
  "agent_id": "agent_root",
  "parent_agent_id": null
}

Two fields get systematically underestimated: stop_reason and cumulative_tokens.

stop_reason tells you why the model stopped generating. The vocabulary matters:

  • tool_use: model issued tool calls, loop continues as expected
  • end_turn: model produced a final answer, loop exits cleanly
  • max_tokens: the model's output was cut off at the token limit before it finished. This needs immediate attention.
  • circuit_breaker: your harness killed the loop (you configured this deliberately)
  • error: a tool threw an exception
  • user_interrupt: something upstream stopped execution

If you see max_tokens in your logs more than occasionally, you have a context management problem, not a model problem. The harness let the context grow unchecked.

cumulative_tokens gives you a running total across the session. Plot input tokens against iteration number and you'll immediately see if context is ballooning linearly (expected) or exponentially (someone's tool output is enormous and getting re-injected every turn).

Structured JSON, not free text

Free text logs look like this:

text
[14:23:01] read_file called on /src/main.py, returned 1842 tokens

That's fine for reading in a terminal. It's useless at scale. You can't filter, aggregate, or query it without regex that breaks on any wording change.

Structured JSON means you can ask: "How often does read_file fail across sessions?" with a single filter query. "What is the P95 latency for tool calls this week?" Same filter. "Which sessions burned more than 200k tokens?" Trivial. The schema is the contract. Keep field names stable across deployments, because changing them breaks every downstream query silently.
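
As an illustration, here's a sketch of answering those questions over the JSONL logs with nothing but the standard library. The paths are illustrative, and it assumes failed tool calls are marked with stop_reason "error", which your own schema may handle differently:

python
import json
from collections import Counter
from pathlib import Path

events = []
for path in Path("logs").glob("*.jsonl"):              # illustrative: one file per session
    for line in path.read_text().splitlines():
        if line.strip():
            events.append(json.loads(line))

# How often does read_file fail across sessions?
# (assumes failed calls carry stop_reason == "error")
read_file_failures = Counter(
    e["session_id"] for e in events
    if e.get("tool_name") == "read_file" and e.get("stop_reason") == "error"
)

# Which sessions burned more than 200k input tokens?
expensive_sessions = {
    e["session_id"] for e in events
    if e.get("cumulative_tokens", {}).get("input", 0) > 200_000
}

print(read_file_failures.most_common(5))
print(expensive_sessions)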

The fastest path to baseline visibility

You don't need Datadog on day one. The fastest path is hook-based JSONL logging. Write a hook script that fires on every tool call event and appends a JSON object to a file. No infrastructure. Readable immediately from the filesystem. Greppable.
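
A minimal sketch of such a hook script, assuming the harness pipes the tool-call event to it as JSON on stdin. The invocation contract and field names here are illustrative, not any specific product's hook API:

python
#!/usr/bin/env python3
# Minimal hook: append one structured JSON event per line to the session log.
import hashlib, json, os, sys
from datetime import datetime, timezone

event = json.load(sys.stdin)                    # assumed: the harness pipes the event in as JSON

raw_output = event.pop("tool_output", "")       # never persist raw tool output
event["tool_output_hash"] = "sha256:" + hashlib.sha256(raw_output.encode()).hexdigest()
event["tool_output_bytes"] = len(raw_output.encode())
event["timestamp"] = datetime.now(timezone.utc).isoformat()

os.makedirs("sessions", exist_ok=True)
log_path = os.path.join("sessions", f"{event.get('session_id', 'unknown')}.jsonl")
with open(log_path, "a") as f:                  # append-only, one JSON object per line
    f.write(json.dumps(event) + "\n")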

Claude Code already does this. Sessions are stored under ~/.claude/projects/<encoded-cwd>/*.jsonl, one JSON object per line, capturing every user prompt, Claude response, tool call, and timestamp. Several community tools parse these files directly for dashboards and replay (claude-session-dashboard, agents-observe, agent-flow). That's a reference implementation you can borrow from.

⚠️ Never log raw tool outputs. Log the hash and byte count instead. Full file contents, database query results, and API responses will bloat your logs and almost certainly contain something sensitive.

OpenTelemetry for agent loops

JSONL on the filesystem works for one agent on one machine. Once you have multiple agents, multiple services, or a team that needs to share session data, you need a trace.

OpenTelemetry (OTel) is becoming the standard for agent telemetry. Every major agent framework (LangGraph, CrewAI, Autogen, SmolAgents, Agno) is converging on OTel exports. The benefit is vendor neutrality. You instrument once and route to Phoenix, Langfuse, Datadog, Honeycomb, or anything that accepts OTLP. Swap backends without re-instrumenting.

Mapping agent concepts onto OTel primitives

OTel concept | Agent loop equivalent
Trace | Full agent session (session_id = trace_id)
Span | Single model call, tool execution, or subagent delegation
Parent span | One iteration of the agent loop
Child span | An individual tool call or nested subagent
Span event | Mid-span events: token stream chunks, retry attempts
Baggage | Context propagated across agent, tool, and subprocess boundaries
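
A minimal sketch of that mapping using the OpenTelemetry Python API. Exporter setup is omitted, the model call and tool execution are left as comments, and the span and attribute names are illustrative (the gen_ai attributes follow the conventions discussed next):

python
from opentelemetry import trace

tracer = trace.get_tracer("agent-harness")

def run_session(session_id: str, max_iterations: int = 50):
    # Trace = the full agent session; the root span covers the whole run
    with tracer.start_as_current_span("agent_session") as session_span:
        session_span.set_attribute("session.id", session_id)
        for i in range(max_iterations):
            # Parent span = one iteration of the agent loop
            with tracer.start_as_current_span(f"iteration.{i}"):
                # Model call span, annotated with GenAI semantic-convention attributes
                with tracer.start_as_current_span("llm.call") as llm_span:
                    llm_span.set_attribute("gen_ai.system", "anthropic")
                    llm_span.set_attribute("gen_ai.operation.name", "chat")
                    # ... call the model here, then record usage from the response ...
                # Child span = an individual tool call issued during this iteration
                with tracer.start_as_current_span("tool.read_file") as tool_span:
                    tool_span.set_attribute("tool.name", "read_file")
                    # ... execute the tool here ...
            break  # placeholder: a real loop exits on end_turn or a circuit breaker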

The GenAI semantic conventions

The OTel GenAI semantic conventions define a standard schema for tracking model calls across providers. These are the span attributes that matter for Anthropic specifically:

text
gen_ai.system                        = "anthropic"
gen_ai.operation.name                = "chat"
gen_ai.request.model                 = "claude-sonnet-4-6"
gen_ai.response.model                = "claude-sonnet-4-6"
gen_ai.usage.input_tokens            = 12400
gen_ai.usage.output_tokens           = 341
gen_ai.usage.cache_creation.input_tokens = 4200
gen_ai.usage.cache_read.input_tokens = 8000

These attributes give you a consistent vocabulary across every provider you use. When you switch from one model to another, or add a second provider, your dashboards don't break because the field names are the same.

For spans covering agent-specific work, Arize Phoenix defines a useful taxonomy of span kinds: CHAIN, LLM, TOOL, RETRIEVER, EMBEDDING, AGENT, RERANKER, GUARDRAIL, EVALUATOR. Use these labels consistently and your waterfall views will actually tell you something.

Claude Agent SDK and W3C traceparent

The Claude Agent SDK reads TRACEPARENT and TRACESTATE from its environment when starting each interaction span. When tracing is active, bash and PowerShell subprocesses automatically inherit TRACEPARENT. That means shell commands your agent runs are automatically part of the trace, without any manual instrumentation.

To export Claude Code telemetry to your backend:

bash
export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-backend:4317
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <token>"

For Langfuse specifically, the Python path uses configure_claude_agent_sdk() with an OTel exporter pointed at the Langfuse endpoint. For JS/TS, the ClaudeAgentSDKInstrumentation library from OpenInference auto-instruments SDK calls. The Langfuse integration docs cover the exact setup.

The Claude Console (for managed agents) has built-in session tracing with a visual timeline. If you're self-hosting, OTLP export is the recommended path. See the Claude API session tracing docs for the full reference.


Token accounting

Every Anthropic API response includes a usage block. Read it every turn and log it. At production scale, there is no acceptable reason not to.

json
{
  "usage": {
    "input_tokens": 12400,
    "output_tokens": 341,
    "cache_creation_input_tokens": 4200,
    "cache_read_input_tokens": 8000
  }
}

input_tokens is the total input processed: system prompt, tool definitions, full conversation history. output_tokens is what the model generated. cache_creation_input_tokens are tokens written to the prompt cache this turn (billed at 1.25x base for a 5-minute TTL, 2x for 1-hour TTL). cache_read_input_tokens are tokens served from cache (0.1x base, ten times cheaper than fresh input).
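
A sketch of reading and accumulating the usage block every turn with the Anthropic Python SDK. The model id and the log_event sink are illustrative placeholders:

python
import anthropic

client = anthropic.Anthropic()
cumulative = {"input": 0, "output": 0, "cache_read": 0}

def call_and_account(messages, tools, log_event):
    # log_event is your structured logger from the schema section (placeholder)
    response = client.messages.create(
        model="claude-sonnet-4-6",            # illustrative model id
        max_tokens=4096,
        messages=messages,
        tools=tools,
    )
    u = response.usage
    cumulative["input"] += u.input_tokens
    cumulative["output"] += u.output_tokens
    cumulative["cache_read"] += u.cache_read_input_tokens or 0
    log_event({
        "input_tokens": u.input_tokens,
        "output_tokens": u.output_tokens,
        "cache_creation_tokens": u.cache_creation_input_tokens,
        "cache_read_tokens": u.cache_read_input_tokens,
        "cumulative_tokens": dict(cumulative),
        "stop_reason": response.stop_reason,
    })
    return response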

What cache_control looks like in your logs

When you mark a block with cache_control: {"type": "ephemeral"}, Anthropic caches the prompt prefix up to and including that block. On subsequent calls, if the prefix matches, those tokens become cache_read_input_tokens instead of input_tokens.

In your logs, the pattern looks like this:

  • Turn 1: high cache_creation_input_tokens, near-zero cache_read_input_tokens
  • Turn 2+: cache_read_input_tokens rises sharply, cost per call drops

ProjectDiscovery documented a 59% LLM cost reduction using prompt caching, with cache_read_input_tokens reaching over 80% of total input tokens after the first turn. Tool definitions are the best candidates for caching since they're large, stable, and prepended to every call. If you're not caching them, you're paying full rate for the same tokens on every iteration.
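
The request-side change is small. A sketch of marking tool definitions for caching, with an illustrative tool schema; placing the cache breakpoint on the last tool caches the prefix up to and including every tool definition:

python
tools = [
    {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    # ... more tool definitions ...
]
# Cache breakpoint on the final tool: the whole tool block becomes a cached prefix
tools[-1]["cache_control"] = {"type": "ephemeral"}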

The context burn chart

Build this one chart and keep it. Plot iteration number on the x-axis, cumulative input tokens on the y-axis, per session. A healthy session looks like a gradual slope. A runaway session looks like a hockey stick. If you're seeing hockey sticks, a tool is returning massive outputs and they're all staying in context.
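
A sketch of building that chart straight from the JSONL logs, assuming the schema above (matplotlib for plotting, file paths illustrative):

python
import json
from collections import defaultdict
import matplotlib.pyplot as plt

per_session = defaultdict(list)
for path in ["sessions/ses_a1b2c3d4.jsonl"]:        # illustrative session files
    with open(path) as f:
        for line in f:
            e = json.loads(line)
            per_session[e["session_id"]].append(
                (e["iteration"], e["cumulative_tokens"]["input"])
            )

for session_id, points in per_session.items():
    points.sort()
    plt.plot([p[0] for p in points], [p[1] for p in points], label=session_id)

plt.xlabel("iteration")
plt.ylabel("cumulative input tokens")
plt.legend()
plt.show()   # gradual slope = healthy; hockey stick = runaway context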

Set alerts when session_total_input exceeds your threshold. Set a second alert on spend velocity (e.g. $50/hour), not just per-session cap. A session cap catches long sessions. A velocity cap catches fast runaway loops before they do serious damage. You need both.


Detecting a stuck agent

An agent gets stuck when it calls the same tool with the same arguments and gets the same output, but keeps calling it anyway. The model doesn't realize it's repeating itself. There's no built-in "are we making progress?" signal in the agent loop. That signal has to come from your harness.

The important point is that loop detection belongs inside the harness. Third-party harness builders should implement it directly instead of waiting for a model to notice repetition on its own.

These patterns work in practice:

Pattern 1: Tool fingerprint hashing

Hash (tool_name + sorted(tool_inputs)). If the same fingerprint appears two or three consecutive times with the same output hash, the agent is stuck. Keep a sliding window of the last five tool call fingerprints and check for repeats. Two to three repetitions is the practical trigger in production.
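
A minimal sketch of the fingerprint check; the window size and repeat threshold are the knobs to tune:

python
import hashlib, json
from collections import deque

WINDOW = 5          # last N tool calls to remember
MAX_REPEATS = 3     # identical fingerprints before the agent counts as stuck

recent = deque(maxlen=WINDOW)

def fingerprint(tool_name: str, tool_inputs: dict, output: str) -> str:
    payload = tool_name + json.dumps(tool_inputs, sort_keys=True) + output
    return hashlib.sha256(payload.encode()).hexdigest()

def is_stuck(tool_name: str, tool_inputs: dict, output: str) -> bool:
    fp = fingerprint(tool_name, tool_inputs, output)
    recent.append(fp)
    return recent.count(fp) >= MAX_REPEATS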

Pattern 2: Output hash tracking

Track the hash of the model's text response across turns. If the semantic content is nearly identical (cosine similarity above 0.95, or exact hash match) for three or more consecutive turns, the model is looping its reasoning without making progress. Flag it.

Pattern 3: Thrashing detection

Keep a sliding window of the last N tool calls. If the sequence alternates ABABAB between two tools, the agent is bouncing between two states. Exit after K oscillations. This pattern shows up a lot when an agent is trying to confirm an action, failing, and retrying the setup step indefinitely.
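
A sketch of the oscillation check over a sliding window of recent tool names (k is the number of A-B round trips you tolerate):

python
from collections import deque

def is_thrashing(tool_history: deque, k: int = 3) -> bool:
    """True if the last 2*k tool names alternate strictly between two tools (ABAB...)."""
    tail = list(tool_history)[-2 * k:]
    if len(tail) < 2 * k:
        return False
    a, b = tail[0], tail[1]
    if a == b:
        return False
    return all(name == (a if i % 2 == 0 else b) for i, name in enumerate(tail))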

Pattern 4: Context drift

As context grows, models can lose track of the original goal. The symptoms are recognizable: increasingly generic responses, restating the task without acting on it, tool call arguments becoming vaguer. If you're not catching this with the above heuristics, look at the semantic drift in consecutive outputs.

Pattern 5: Hard stops (non-negotiable)

Every agent loop must have an absolute hard stop. Not a soft limit. An actual kill switch.

  • Max turn count (e.g. 50 iterations)
  • Max elapsed time (e.g. 10 minutes)
  • Max token consumption (e.g. 500k input tokens per session)

These live in your harness, not in a prompt. A prompt-level instruction like "stop after 50 steps" is something the model can forget, ignore, or reinterpret. A harness-level kill switch is not. The system running the agent is ultimately responsible for guaranteeing termination. The model is not.

💡 The root problem: the agent loop has no built-in progress signal. The harness is responsible for defining and enforcing what "stuck" means. Build this before you go to production.

Circuit breakers

A circuit breaker is a control in your harness that monitors agent behavior against predefined thresholds and terminates execution when those thresholds are exceeded. The key word is "harness." An agent stuck in a loop cannot talk its way past a budget ceiling that lives outside its own context.

Four dimensions your circuit breaker should enforce (sketched in code after the list):

  1. Iteration limit - maximum steps before forced termination
  2. Budget ceiling - hard cost limit in dollars or tokens per session
  3. Failure threshold - N consecutive errors on the same operation triggers a stop
  4. Scope enforcement - permission boundary violations trigger immediate termination and logging
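
A sketch of those four checks as a single harness-side guard; the thresholds are illustrative, and scope enforcement is reduced to a path-prefix allowlist:

python
class CircuitBreaker:
    def __init__(self, max_iterations=50, max_cost_usd=25.0,
                 max_consecutive_failures=3, allowed_path_prefixes=("/workspace",)):
        self.max_iterations = max_iterations
        self.max_cost_usd = max_cost_usd
        self.max_consecutive_failures = max_consecutive_failures
        self.allowed_path_prefixes = allowed_path_prefixes
        self.consecutive_failures = 0

    def check(self, iteration, cost_usd, last_tool_failed, touched_path=None):
        """Return a trip reason string, or None if the loop may continue."""
        self.consecutive_failures = self.consecutive_failures + 1 if last_tool_failed else 0
        if iteration >= self.max_iterations:
            return "iteration_limit"
        if cost_usd >= self.max_cost_usd:
            return "budget_ceiling"
        if self.consecutive_failures >= self.max_consecutive_failures:
            return "failure_threshold"
        if touched_path and not touched_path.startswith(self.allowed_path_prefixes):
            return "scope_violation"
        return None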

When a circuit breaker trips, the escalation sequence should be:

  1. Stop the agent immediately. Don't let it complete the current turn.
  2. Save the full session state to a crash dump file.
  3. Log the trip reason: which threshold, what the agent was doing.
  4. Emit an alert (webhook, Slack, PagerDuty) with session ID, iteration count, cost to date, last tool called, and last model output.
  5. If escalation is configured: page an engineer with a session replay link.

The session replay link is what makes a tripped circuit breaker actionable. Without it, you know something went wrong. With the link, you can see exactly what.


Multi-agent tracing

A single user action might trigger an orchestrator that spawns three subagents, each of which makes tool calls. Tracing this correctly means attributing every cost, every failure, and every latency spike back to the specific agent that caused it.

W3C traceparent propagation

The approach:

  1. Root agent creates a trace and assigns a trace ID.
  2. When spawning a subagent, inject traceparent (W3C format) into the subagent's environment or initialization params.
  3. Subagent starts a new child span under the parent trace, not a new trace.
  4. All tool calls from the subagent appear nested under the subagent's span, which nests under the orchestrator's span.

In the Claude Agent SDK, when tracing is active, subprocesses automatically inherit TRACEPARENT. You get this for free for anything the agent shells out to. For MCP servers, OTel context can be propagated between the MCP client and server, unifying the trace across that boundary too.
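
A sketch of doing that injection manually when your harness spawns a subagent process itself, using the OTel propagation API; the subagent command line is illustrative:

python
import os, subprocess
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("agent-harness")

with tracer.start_as_current_span("subagent.research"):
    carrier = {}
    inject(carrier)                      # writes the W3C traceparent for the current span
    env = {
        **os.environ,
        "TRACEPARENT": carrier.get("traceparent", ""),
        "TRACESTATE": carrier.get("tracestate", ""),
    }
    # Illustrative subagent launch; the child reads TRACEPARENT and continues the trace
    subprocess.run(["python", "subagent.py", "--task", "summarize"], env=env)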

What a correct multi-agent trace looks like

The waterfall view should show:

  • Root span: the orchestrator session
  • Child spans: each subagent invocation (with agent_id and parent_agent_id)
  • Grandchild spans: each tool call within each subagent
  • Cost attribution: which agent spent how much
  • Latency waterfall: which subagent or tool is the bottleneck

Langfuse handles this well. It supports multi-service LLM pipeline tracing with @observe decorators and contextvars for thread-safe context management across async boundaries and subagent threads. The automatic hierarchy is: parent agent trace, child span per tool call, nested span per subagent delegation.


Privacy and log security

Agents read files, hit APIs, query databases, and write code. Everything they touch passes through prompts and tool outputs, which may end up in your logs. This is not hypothetical. Snyk's research found 280+ OpenClaw skills actively exposing API keys and PII through tool call logs and session histories.

Categories of sensitive data that routinely appear in agent logs:

Category | Examples
API keys / secrets | AWS keys, Stripe keys, OAuth tokens in tool inputs
PII | Names, emails, SSNs in files the agent reads
Financial data | Database query results with account numbers
Health records | HIPAA-covered content in documents the agent processes
Internal credentials | DB connection strings, .env file contents
Source code | Proprietary code or code with embedded secrets

What to never log raw

  • API keys or tokens, even test keys
  • Full contents of files the agent reads (log path + byte count + hash instead)
  • Database query results that may contain user PII
  • Environment variable dumps
  • Multi-line private key material
  • OAuth or session tokens from API responses

Redact before writing, not after

The redaction pipeline order matters. The original unredacted content should never reach central storage:

  1. Tool output arrives
  2. Redaction layer runs (pattern matching for PII types and secrets patterns)
  3. Redacted version is written to the log
  4. If the raw version is needed for debugging, store it in a separate high-security store with access controls

Microsoft Presidio handles common PII pattern detection well. For secrets specifically, the patterns need to be broad. Secrets rarely show up as tidy environment variables. They show up buried in whatever text an API returned or a file contained. Test your pattern library against real log samples, not idealized ones.
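
A sketch of the redact-before-write step with a couple of illustrative regex patterns; a real pattern library needs far broader coverage, and Presidio can supply the PII detection piece:

python
import json, re

# Illustrative patterns only; a real library needs many more (JWTs, SSNs, private keys, ...)
PATTERNS = {
    "aws_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "email":   re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "bearer":  re.compile(r"Bearer\s+[A-Za-z0-9._~+/-]+=*"),
}

def redact(text: str) -> str:
    for name, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{name}]", text)
    return text

def write_event(event: dict, log_file) -> None:
    # Redaction runs on the serialized event BEFORE anything touches disk
    log_file.write(redact(json.dumps(event)) + "\n")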

Retention and access controls

  • Most agent logs should expire after 30 to 90 days unless they're compliance-required audit trails
  • Separate redacted logs (broad access) from full-fidelity logs (restricted access)
  • Encrypt logs at rest
  • For GDPR and CCPA: logging full user-provided content without redaction may be a violation, not just a risk

Session replay

A session replay is the ability to re-execute (or inspect) an agent run step by step, using the same inputs, the same tool responses, and the same intermediate state. It turns a failure into something you can step through and reproduce.

JSONL is the natural foundation. Events serialized as JSONL are append-safe, streamable, greppable, and loadable incrementally without special tooling. Every LLM call, tool use, decision point, and state change goes into the file. Replaying it means reading events in order and stepping through the same path.

One practical issue to handle: corrupted session JSONL files.

Incomplete tool_use blocks or stop_sequence interruptions can truncate a JSONL file mid-write. Your harness should validate the schema on load and handle truncated files gracefully instead of crashing. This is a real production issue, not a theoretical one.
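
A sketch of a loader that tolerates a truncated trailing line instead of crashing; "validate the schema" is reduced here to checking that a few required keys are present:

python
import json

REQUIRED = {"timestamp", "session_id", "event_type"}

def load_session(path: str) -> tuple[list[dict], list[int]]:
    """Return (valid events, line numbers that failed to parse or validate)."""
    events, bad_lines = [], []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(lineno)     # typically the truncated last line
                continue
            if REQUIRED <= event.keys():
                events.append(event)
            else:
                bad_lines.append(lineno)
    return events, bad_lines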

Several tools support replay at different levels of sophistication. AgentOps has a session replay dashboard with point-in-time precision. agent-flow watches ~/.claude/projects/ and supports JSONL file replay directly. LangGraph's Time Travel feature uses checkpoint-based state persistence that lets you branch at any checkpoint, not just replay from the start.


From logs to evals

Production logs are raw material for eval datasets. The agent runs, logs are captured. Sessions with anomalies get flagged and go into an annotation queue. Engineers label them: "this session failed because X." Those labeled sessions become regression test cases. Regression tests run on every harness change. Each iteration expands coverage.

Real-world failures become the tests that prevent the same failure from reaching users again. The dataset stays current because production never stops generating new edge cases.

Which sessions to flag for eval harvest

  • Sessions that tripped a circuit breaker
  • Sessions where stop_reason was max_tokens
  • Sessions with consecutive tool errors
  • Sessions with iteration count above the P90 for your workload
  • Sessions with cost above the P95
  • Sessions that required human escalation
  • Sessions flagged by users as incorrect or unhelpful

Trajectory evals, not just output evals

An output eval asks: "Did the final answer match what was expected?" Binary and easy to automate. A trajectory eval asks: "Did the agent take a sensible path to get there?" Richer signal, harder to automate, and impossible to run at all without logs.

You want both. An agent can arrive at the correct answer via a completely broken path that happens to recover. Knowing the output was right tells you nothing about whether the harness will hold up next time on a slightly different input.

Braintrust, Langfuse, and LangSmith all support adding flagged production traces to eval datasets directly from their UIs. Braintrust's CI/CD integration can block merges when quality degrades below a threshold. That's the loop closing properly.


Platform comparison

A quick comparison:

Platform | Best for | What sets it apart
Langfuse | Open-source first, self-hosted control | OTel-native, multi-turn tracing, self-hostable, free tier (50k obs/month)
Arize Phoenix | Open-source, framework-agnostic | 10 span kinds, broad auto-instrumentation, OTel-native, free OSS
Braintrust | Eval-driven teams, CI/CD integration | One-click log-to-eval, CI/CD blocking on quality regression
AgentOps | Framework-agnostic, fast setup | 400+ integrations, session replay, single decorator instrumentation
Helicone | API gateway-first, cost tracking | One-line proxy integration, unified gateway across providers
W&B Weave | ML teams already on W&B | MCP auto-logging, code/data/metadata lineage tracking
LangSmith (LangChain) | LangChain/LangGraph stacks | Zero-code setup for LangGraph, best debugging UX for LangGraph
Datadog LLM Observability | Teams already on Datadog | Integrated with existing APM, OTel GenAI semantic convention support

If you're starting fresh with no existing observability stack: Langfuse or Phoenix, both open-source, both OTel-native. If you need eval automation from day one: Braintrust. If you're already in LangGraph: LangSmith with near-zero setup overhead.


The point

The observability layer is not overhead. It's how the harness gets better over time. A blind agent doesn't improve. You don't know what it's doing, why it failed, or what path it took to get the wrong answer.

Log every event as structured JSON. Add the trace IDs. Read the usage block on every turn. Build the stuck-agent detection before you hit production. Set hard stops that live in your code, not in your prompts. Redact before writing. Replay sessions when things break. Feed failures into evals so the same failure can't happen twice.

This is what makes a harness worth running in production.

Next in the series: Blog 06 covers cost at scale. Why a 1,000-token system prompt is worth engineering for, and what the difference between Pi's lean prompt and Claude Code's 10,000-token one actually costs at 6.5M monthly developers.


Rahul Kashyap is CTO & Co-founder at Designare Solutions and DeepStory, based in Bangalore.