Tracing and observability
Evals tell you whether the system passed. Traces tell you why it failed. Without structured logs and spans, a forty-seven-minute agent run that burned twelve dollars is a mystery, not an incident you can fix.
Log every harness event as structured data, map the agent loop onto traces and spans, and read token usage every turn. Observability is what makes eval failures debuggable and production failures reproducible.
Logs first, platforms second
You do not need Datadog on day one. You need a contract: one JSON object per event, one session identifier, stable field names. Append to JSONL per session if that is all you have. You can grep it, replay it, and later pipe it to OpenTelemetry without rewriting your mental model.
Minimum fields worth capturing every model or tool step:
- timestamp, session_id, trace_id, iteration
- event_type (model_call, tool_call, retrieval, error)
- model, latency_ms, stop_reason
- input_tokens, output_tokens, cache read/create if applicable
- cumulative_tokens running totals per session
- For tools: tool_name, inputs (redacted), output hash and byte count, not raw secrets
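A minimal sketch of the one-JSON-object-per-event contract, assuming one JSONL file per session. The function names `log_event` and `read_events` and the file layout are illustrative choices, not part of any library.

```python
from __future__ import annotations

import json
import time
from pathlib import Path


def log_event(log_dir: Path, session_id: str, event: dict) -> dict:
    """Append one event as a single JSON line to the session's JSONL file."""
    record = {"timestamp": time.time(), "session_id": session_id, **event}
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"{session_id}.jsonl"
    with path.open("a", encoding="utf-8") as handle:
        # sort_keys keeps field order stable so the file greps and diffs cleanly.
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def read_events(log_dir: Path, session_id: str) -> list[dict]:
    """Load every event for a session, skipping blank lines."""
    path = log_dir / f"{session_id}.jsonl"
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Because every line is an independent JSON object, the same file can be grepped, replayed, or forwarded to a trace backend later without changing the event shape.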
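stop_reason is underrated. end_turn is a clean exit. tool_use means the loop continues. max_tokens means the response hit its token limit, usually because context blew up or the limit is misconfigured. circuit_breaker, a value your harness writes rather than one the provider returns, means the harness killed a runaway loop. If you see max_tokens often, you have a context management bug, not a model bug.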
Full file reads, SQL results, and API responses bloat storage and leak secrets. Log path, size, and a hash so you can confirm repetition without storing PII. Keep a restricted debug store if engineers need occasional raw access under audit controls.
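One way to log size and a hash instead of the raw payload, sketched with the standard library. The helper name `summarize_output` and the choice of SHA-256 are assumptions for illustration.

```python
from __future__ import annotations

import hashlib


def summarize_output(output: str | bytes) -> dict:
    """Return a log-safe digest of a tool output: hash and byte count only."""
    data = output.encode("utf-8") if isinstance(output, str) else output
    return {
        "output_hash": hashlib.sha256(data).hexdigest(),
        "output_bytes": len(data),
    }
```

Two calls that return the same `output_hash` produced identical output, which is enough to confirm repetition without storing the content.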
Traces and spans for agent loops
When multiple services, tools, or subagents participate, a flat log file gets painful. OpenTelemetry (OTel) models work as nested spans inside a trace.
Typical mapping:
- Trace = full user session (session_id aligns with trace_id).
- Parent span = one agent loop iteration.
- Child spans = model call, retrieval, tool execution, subagent delegation.
- Span events = retries, streamed chunks, guardrail triggers.
- Baggage = context propagated across processes (user id, experiment flag).
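The mapping above can be sketched without any tracing library. The `Tracer` and `Span` classes below are a hypothetical stand-in that only records parent/child structure; they are not the OpenTelemetry API, which a real harness would use instead.

```python
from __future__ import annotations

import secrets
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    kind: str
    trace_id: str
    span_id: str
    parent_id: str | None
    start: float
    end: float | None = None
    events: list[str] = field(default_factory=list)


class Tracer:
    """Records nested spans for one trace (one user session)."""

    def __init__(self, trace_id: str | None = None):
        self.trace_id = trace_id or secrets.token_hex(16)
        self.spans: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str, kind: str):
        # The innermost open span, if any, becomes the parent.
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(name, kind, self.trace_id, secrets.token_hex(8),
                       parent_id, time.monotonic())
        self.spans.append(current)
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.monotonic()
            self._stack.pop()


tracer = Tracer()
with tracer.span("iteration-1", "agent"):           # parent span: one loop iteration
    with tracer.span("model_call", "llm") as llm:   # child span: model call
        llm.events.append("retry")                  # span event
    with tracer.span("run_tool", "tool"):           # child span: tool execution
        pass
```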
OpenTelemetry GenAI conventions
OTel's GenAI semantic conventions standardize span attributes across providers so dashboards survive model swaps. Fields you will see often:
- gen_ai.system, gen_ai.request.model, gen_ai.operation.name
- gen_ai.usage.input_tokens, gen_ai.usage.output_tokens
- Cache fields: gen_ai.usage.cache_read.input_tokens, cache creation counterparts
Frameworks like LangGraph, CrewAI, and others are converging on OTLP export. Instrument once, send to Langfuse, Arize Phoenix, Datadog, or Honeycomb. Swap backends without re-instrumenting every call site.
For span kinds, Phoenix's taxonomy is a useful shared vocabulary: LLM, TOOL, RETRIEVER, AGENT, GUARDRAIL, EVALUATOR. Consistent labels make waterfalls readable across teams.
W3C traceparent propagation matters for subagents and shell commands. The Claude Agent SDK can inherit trace context into subprocesses, so Bash commands your agent runs appear under the same trace without manual glue.
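For harnesses that propagate context by hand, here is a sketch of the version-00 traceparent header format (`00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>`). Passing it to a subprocess through a `TRACEPARENT` environment variable is an assumption of this example, and the helper names are hypothetical.

```python
from __future__ import annotations

import re

# Version 00 of the W3C Trace Context header: version-traceid-parentid-flags.
_TRACEPARENT_V00 = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def format_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build the header a caller sends; span_id is the caller's current span."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> dict | None:
    """Parse a version-00 header; return None if it is malformed."""
    match = _TRACEPARENT_V00.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero ids are invalid
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def subprocess_env(base_env: dict, trace_id: str, current_span_id: str) -> dict:
    """Copy an environment and attach the caller's trace context (assumed carrier)."""
    env = dict(base_env)
    env["TRACEPARENT"] = format_traceparent(trace_id, current_span_id)
    return env
```

The subprocess parses the header, keeps the `trace_id`, and starts its own span with `parent_id` as the parent, which is what places its work under the same trace.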
Token accounting every turn
Every provider response should include a usage block. Log it every turn. Plot iteration on the x-axis and cumulative input tokens on the y-axis. Healthy sessions slope gradually. Runaway sessions hockey-stick when a tool dumps huge text into context every loop.
Track cache read vs creation separately. Prompt caching can cut repeated prefix cost sharply when tool definitions and system prompts stay stable. If cache read stays near zero across long sessions, you are probably paying full rate for the same prefix each turn.
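A sketch of per-turn token accounting. The usage keys (`input_tokens`, `output_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`) follow Anthropic-style usage blocks and are an assumption here; other providers name them differently. The hockey-stick heuristic and its thresholds are illustrative, and the cache share calculation assumes `input_tokens` counts only uncached input.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TokenLedger:
    cumulative_input: int = 0
    cumulative_output: int = 0
    cache_read: int = 0
    cache_creation: int = 0
    turn_inputs: list[int] = field(default_factory=list)

    def record(self, usage: dict) -> dict:
        """Fold one turn's usage block into the running totals."""
        turn_input = usage.get("input_tokens", 0)
        self.cumulative_input += turn_input
        self.cumulative_output += usage.get("output_tokens", 0)
        self.cache_read += usage.get("cache_read_input_tokens", 0)
        self.cache_creation += usage.get("cache_creation_input_tokens", 0)
        self.turn_inputs.append(turn_input)
        return {
            "iteration": len(self.turn_inputs),
            "cumulative_input_tokens": self.cumulative_input,
            "cumulative_output_tokens": self.cumulative_output,
        }

    def cache_read_share(self) -> float:
        """Fraction of all input-side tokens that were served from cache."""
        total = self.cumulative_input + self.cache_read + self.cache_creation
        return self.cache_read / total if total else 0.0

    def is_hockey_stick(self, factor: float = 3.0, min_turns: int = 4) -> bool:
        """True if the latest turn's input dwarfs the average of earlier turns."""
        if len(self.turn_inputs) < min_turns:
            return False
        *earlier, latest = self.turn_inputs
        baseline = sum(earlier) / len(earlier)
        return baseline > 0 and latest > factor * baseline
```

The dict returned by `record` gives the two values to plot: iteration on the x-axis and cumulative input tokens on the y-axis.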
Set two alerts: per-session token cap (catches long burns) and spend velocity (catches fast loops). A session cap alone misses an agent that spends fifty dollars in ten minutes while staying inside an "allowed" token budget meant to last an hour. Observability without alerts is archaeology. Alerts without trace links are noise. Wire PagerDuty or Slack with session id and a replay URL.
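The two alerts can be sketched as one small monitor: a running token total for the cap and a sliding time window for spend velocity. The class name, thresholds, and alert labels are hypothetical.

```python
from __future__ import annotations

from collections import deque


class SpendMonitor:
    """Tracks a per-session token cap and spend within a sliding time window."""

    def __init__(self, session_token_cap: int, max_usd_per_window: float,
                 window_seconds: float):
        self.session_token_cap = session_token_cap
        self.max_usd_per_window = max_usd_per_window
        self.window_seconds = window_seconds
        self.total_tokens = 0
        self._recent: deque[tuple[float, float]] = deque()  # (timestamp, usd)

    def record(self, now: float, tokens: int, usd: float) -> list[str]:
        """Record one turn and return the names of any alerts that fire."""
        self.total_tokens += tokens
        self._recent.append((now, usd))
        # Drop spend that has aged out of the window.
        while self._recent and self._recent[0][0] <= now - self.window_seconds:
            self._recent.popleft()

        alerts = []
        if self.total_tokens > self.session_token_cap:
            alerts.append("session_token_cap")
        if sum(cost for _, cost in self._recent) > self.max_usd_per_window:
            alerts.append("spend_velocity")
        return alerts
```

The caller supplies `now` explicitly, which keeps the monitor deterministic under replay; a live harness would pass a monotonic clock reading.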
Detecting stuck agents
Models do not reliably notice they are repeating themselves. The harness must. Practical patterns:
- Tool fingerprint repeats: hash (tool_name, sorted inputs). Same fingerprint and output hash two or three times in a row is a loop.
- Output hash repeats: near-identical model text across turns without progress.
- Thrashing: ABAB alternation between two tools. Exit after K oscillations.
- Hard stops: max iterations, max wall time, max tokens. These live in code, not in prompts. Prompts asking the model to stop are suggestions. Harness kill switches are guarantees.
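A sketch of the fingerprint and thrashing checks from the list above. The function names and default thresholds are illustrative, and the canonical JSON encoding is one assumed way to make input order irrelevant.

```python
from __future__ import annotations

import hashlib
import json


def tool_fingerprint(tool_name: str, inputs: dict) -> str:
    """Hash the tool name plus inputs, with keys sorted so order does not matter."""
    canonical = json.dumps({"tool": tool_name, "inputs": inputs},
                           sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_repeating(history: list[tuple[str, str]], threshold: int = 3) -> bool:
    """True if the last `threshold` (fingerprint, output_hash) pairs are identical."""
    if len(history) < threshold:
        return False
    tail = history[-threshold:]
    return all(item == tail[0] for item in tail)


def is_thrashing(tool_names: list[str], oscillations: int = 3) -> bool:
    """True if the most recent calls alternate A, B, A, B for `oscillations` rounds."""
    needed = 2 * oscillations
    if len(tool_names) < needed:
        return False
    tail = tool_names[-needed:]
    first, second = tail[0], tail[1]
    if first == second:
        return False
    return all(name == (first if i % 2 == 0 else second)
               for i, name in enumerate(tail))
```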
When a circuit breaker trips: stop immediately, persist session state, log trip reason, alert with cost and last tool. The replay link is what makes the alert actionable.
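Hard stops and the trip record can live in a few lines of harness code. This is a sketch under assumed names (`HardLimits`, `check_limits`, `trip_record`); the replay URL is a placeholder the caller supplies.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class HardLimits:
    max_iterations: int
    max_wall_seconds: float
    max_tokens: int


def check_limits(limits: HardLimits, iteration: int, elapsed_seconds: float,
                 total_tokens: int) -> str | None:
    """Return the trip reason if any hard limit is reached, else None."""
    if iteration >= limits.max_iterations:
        return "max_iterations"
    if elapsed_seconds >= limits.max_wall_seconds:
        return "max_wall_time"
    if total_tokens >= limits.max_tokens:
        return "max_tokens_budget"
    return None


def trip_record(session_id: str, reason: str, cost_usd: float,
                last_tool: str, replay_url: str) -> dict:
    """Build the event to log and alert on when the breaker trips."""
    return {
        "event_type": "error",
        "stop_reason": "circuit_breaker",
        "session_id": session_id,
        "trip_reason": reason,
        "cost_usd": cost_usd,
        "last_tool": last_tool,
        "replay_url": replay_url,
    }
```

The harness calls `check_limits` at the top of every iteration and exits the loop on the first non-None result, regardless of what the model says it intends to do next.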
Multi-agent and privacy
Multi-agent systems need one trace per user request with nested spans per subagent. Propagate traceparent at spawn time so costs and failures attribute to the right worker. MCP boundaries can carry OTel context too if client and server cooperate.
Agents touch secrets constantly. Redact before write: pattern-match API keys, tokens, and PII in tool outputs. Never ship raw env dumps to central logs. Retention defaults of thirty to ninety days are enough for most debugging; compliance trails are separate with tighter access.
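A sketch of redact-before-write using regular expressions. The three patterns are illustrative assumptions (an `sk-` prefixed key shape, bearer tokens, email addresses) and are nowhere near exhaustive; a real deployment needs patterns matched to the secrets it actually handles.

```python
from __future__ import annotations

import re

# Illustrative patterns only; extend for the key formats and PII you actually see.
REDACTION_PATTERNS = [
    ("api_key", re.compile(r"\bsk-[A-Za-z0-9_-]{16,}")),
    ("bearer_token", re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/-]+=*")),
    ("email", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
]


def redact(text: str) -> str:
    """Replace every match of a known pattern with a labeled placeholder."""
    for label, pattern in REDACTION_PATTERNS:
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text
```

Run `redact` on tool inputs and outputs before they reach the log writer, so the central store never sees the raw value.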
From traces to evals
Traces power trajectory evals from lesson 02. Flag sessions that tripped breakers, hit max_tokens, exceeded P95 cost, or got user thumbs-down. Export anonymized inputs and expected behaviors into the golden set. Observability and evals are one loop: trace explains failure, eval prevents recurrence.
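The flagging rules above can be sketched as a filter over session summaries. The field names (`breaker_tripped`, `stop_reason`, `cost_usd`, `user_feedback`) and the nearest-rank percentile are assumptions for illustration.

```python
from __future__ import annotations


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def flag_sessions(sessions: list[dict]) -> dict[str, list[str]]:
    """Map session_id to the reasons it should be reviewed for the golden set."""
    p95_cost = percentile([s["cost_usd"] for s in sessions], 95)
    flagged: dict[str, list[str]] = {}
    for session in sessions:
        reasons = []
        if session.get("breaker_tripped"):
            reasons.append("breaker_tripped")
        if session.get("stop_reason") == "max_tokens":
            reasons.append("max_tokens")
        if session["cost_usd"] > p95_cost:
            reasons.append("cost_above_p95")
        if session.get("user_feedback") == "thumbs_down":
            reasons.append("thumbs_down")
        if reasons:
            flagged[session["session_id"]] = reasons
    return flagged
```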
Session replay and debugging workflow
When on-call opens an incident, the workflow should be: find trace, replay session, diff against last known good trace or eval case, patch harness, add case, rerun golden set. JSONL event logs are enough for replay if each event stores inputs and tool results needed to reconstruct state.
Handle truncated JSONL from crashed sessions gracefully. Partial writes happen when circuits trip mid-event. Validators should load what exists and mark the tail as incomplete instead of failing the whole import.
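A sketch of a tolerant loader: it keeps every complete event and reports an incomplete tail, but still raises if a line in the middle of the file is corrupt. The function name and the shape of the status dict are hypothetical.

```python
from __future__ import annotations

import json


def load_jsonl_tolerant(text: str) -> tuple[list[dict], dict]:
    """Parse JSONL, treating an unparseable final line as a truncated write."""
    events: list[dict] = []
    lines = text.split("\n")
    for index, line in enumerate(lines):
        if not line.strip():
            continue
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            # Only the last non-blank line may be partial; anything else is corruption.
            is_tail = all(not rest.strip() for rest in lines[index + 1:])
            if not is_tail:
                raise
            return events, {"complete": False, "bad_line": index + 1}
    return events, {"complete": True, "bad_line": None}
```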
Correlate replay with eval case ids. If case billing-041 fails in CI, the trace from the incident that created billing-041 should be one click away in your observability UI. That link shortens postmortems from hours to minutes.
Choosing an observability backend
OTel-native open tools (Langfuse, Phoenix) fit teams that want self-hosting and eval export. Braintrust leans eval-first with CI integration and run comparison. Datadog and Honeycomb fit teams already paying for APM. LangSmith fits LangGraph stacks with minimal wiring.
The wrong choice is usually "none because we will add it later." Later is when you are debugging a forty-seven-minute run with printf. Start with JSONL plus one hosted or self-hosted trace UI before you need it, not after.
RAG and retrieval spans
In RAG pipelines, instrument retrieval as its own span kind: query text, candidate ids, scores, filter reasons, reranker latency. When answers go wrong, the waterfall should show whether embedding search, keyword merge, or rerank dropped the right doc. Without retrieval spans, every miss looks like a "hallucination."
Log chunk ids in the trace, not full chunk text. Attach byte counts and content hashes so you can verify the model received what retrieval selected.
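A sketch of logging chunk ids with byte counts and hashes, then checking that the text placed in the prompt matches what retrieval selected. The helper names and record fields are assumptions.

```python
from __future__ import annotations

import hashlib


def chunk_record(chunk_id: str, text: str, score: float) -> dict:
    """Log-safe description of a retrieved chunk: id, score, size, and hash."""
    data = text.encode("utf-8")
    return {
        "chunk_id": chunk_id,
        "score": score,
        "bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def mismatched_chunks(logged: list[dict], delivered: dict[str, str]) -> list[str]:
    """Return ids of logged chunks that are missing from, or differ in, the prompt."""
    bad = []
    for record in logged:
        text = delivered.get(record["chunk_id"])
        if text is None:
            bad.append(record["chunk_id"])
        elif hashlib.sha256(text.encode("utf-8")).hexdigest() != record["sha256"]:
            bad.append(record["chunk_id"])
    return bad
```

An empty result from `mismatched_chunks` means the model saw exactly the chunks the retrieval span recorded; anything else points at truncation or assembly bugs between retrieval and prompt.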
In each tool-call log record, include session_id, iteration, tool_name, redacted inputs, output_hash, output_bytes, latency_ms, error if any, and cumulative token totals. That is enough to debug eighty percent of agent incidents without storing secrets.
Fine-tuning and deployment correlation
When you ship a new adapter or fine-tune, log model_revision on every span. Slice traces and eval pass rate by revision from day one. Fine-tune regressions often appear as slow citation quality drift or tool-call format breakage, not as hard errors. Compare revision N vs N-1 on the same golden set before promoting traffic.
Keep baseline traces from before the fine-tune on representative sessions. Side-by-side waterfalls make "it feels worse" debates shorter.
Alert when median tokens per successful task moves more than twenty percent week over week without a planned change. That often precedes a retrieval or compression bug that eval pass rate has not caught yet.
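The week-over-week check can be sketched with the standard library. The function name is hypothetical; the twenty percent threshold comes from the text above.

```python
from __future__ import annotations

from statistics import median


def token_drift(last_week: list[int], this_week: list[int],
                threshold: float = 0.20) -> dict:
    """Compare median tokens per successful task across two weeks."""
    baseline = median(last_week)
    current = median(this_week)
    if baseline == 0:
        raise ValueError("baseline median is zero; cannot compute relative change")
    change = (current - baseline) / baseline
    return {
        "baseline_median": baseline,
        "current_median": current,
        "relative_change": change,
        "alert": abs(change) > threshold,
    }
```

Feed it only successful tasks, and suppress the alert when a planned change such as a new prompt or model revision shipped that week.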
Standardize span names across services so dashboards compose. Ad-hoc strings like llm_call_v2 and model_invoke for the same step will fragment metrics.
Checkpoint
You are ready for the next lesson if you can answer these from memory:
- What is the difference between a trace and a span in an agent loop?
- Which stop_reason values point to harness context problems?
- Why log output hashes instead of full tool responses?
- Name two stuck-agent heuristics and one hard stop every harness needs.
Quick check
- The model became less capable
- Context is growing unchecked or limits are misconfigured
- Users are typing longer messages only
- You should delete logs to save cost
- The entire trace
- A child span under the iteration span
- Baggage only
- A span event with no span
- Prompts are always ignored
- Code-level limits are enforced regardless of model behavior
- Models cannot count tokens
- OpenTelemetry replaces the need for limits
- Prompt caching is hitting for repeated prefix content
- You are being double billed for output tokens
- Output quality is degrading
- You should disable tracing