Lesson 04

Tracing and observability

Evals tell you whether the system passed. Traces tell you why it failed. Without structured logs and spans, a forty-seven-minute agent run that burned twelve dollars is a mystery, not an incident you can fix.

The one idea

Log every harness event as structured data, map the agent loop onto traces and spans, and read token usage every turn. Observability is what makes eval failures debuggable and production failures reproducible.

Logs first, platforms second

You do not need Datadog on day one. You need a contract: one JSON object per event, one session identifier, stable field names. Append to JSONL per session if that is all you have. You can grep it, replay it, and later pipe it to OpenTelemetry without rewriting your mental model.

Minimum fields worth capturing every model or tool step:

timestamp, session_id, trace_id, iteration
event_type (model_call, tool_call, retrieval, error)
model, latency_ms, stop_reason
input_tokens, output_tokens, plus cache read and cache creation tokens if applicable
cumulative_tokens (running totals per session)
For tools: tool_name, inputs (redacted), output hash and byte count, not raw secrets
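
The contract above can be sketched as a single append-only writer. This is a minimal illustration, not a prescribed implementation: the function name `log_event`, the directory layout, and the use of `time.time()` for the timestamp are all assumptions made for the example.

```python
import json
import time
from pathlib import Path


def log_event(log_dir, session_id, event_type, **fields):
    """Append one JSON object to <log_dir>/<session_id>.jsonl and return it."""
    record = {
        "timestamp": time.time(),
        "session_id": session_id,
        "event_type": event_type,
        **fields,  # e.g. iteration, model, latency_ms, stop_reason, token counts
    }
    path = Path(log_dir) / f"{session_id}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    # One line per event keeps the file greppable and replayable.
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

A call such as `log_event("logs", "sess-001", "model_call", iteration=1, stop_reason="tool_use", input_tokens=1200, output_tokens=85)` would append one line to `logs/sess-001.jsonl`. Stable keyword names matter more than the helper itself.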

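stop_reason is underrated. end_turn is a clean exit. tool_use means the loop continues. max_tokens means the response was cut off at the output token limit, which usually traces back to bloated context or a misconfigured limit. circuit_breaker is a value your harness writes when it kills a runaway loop. If you see max_tokens often, suspect a context management or limit configuration bug before you blame the model.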
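
A small report over logged events makes "often" measurable. The sketch below assumes events shaped like the fields listed earlier; the function name `stop_reason_report` and the 20 percent threshold are illustrative choices, not recommendations from any provider.

```python
from collections import Counter


def stop_reason_report(events, max_tokens_threshold=0.2):
    """Count stop reasons across model_call events and flag a high max_tokens rate."""
    reasons = Counter(
        e.get("stop_reason", "unknown")
        for e in events
        if e.get("event_type") == "model_call"
    )
    total = sum(reasons.values())
    rate = reasons["max_tokens"] / total if total else 0.0
    return {
        "counts": dict(reasons),
        "max_tokens_rate": rate,
        # Threshold is an assumption; tune it against your own sessions.
        "suspect_context_bug": rate > max_tokens_threshold,
    }
```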

Full file reads, SQL results, and API responses bloat storage and leak secrets. Log path, size, and a hash so you can confirm repetition without storing PII. Keep a restricted debug store if engineers need occasional raw access under audit controls.
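
Logging a hash and size instead of the payload takes only a few lines. This is a sketch under assumptions: `summarize_output` is a hypothetical helper name, and SHA-256 is one reasonable choice of hash, not a requirement.

```python
import hashlib


def summarize_output(raw, path=None):
    """Return loggable metadata for a tool output without storing the output itself."""
    data = raw if isinstance(raw, bytes) else raw.encode("utf-8")
    return {
        "path": path,
        "output_bytes": len(data),
        # Identical outputs produce identical hashes, so repetition stays visible.
        "output_hash": hashlib.sha256(data).hexdigest(),
    }
```

Two reads of the same unchanged file yield the same `output_hash`, which is enough to spot a loop without keeping the file contents in central logs.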

Traces and spans for agent loops

When multiple services, tools, or subagents participate, a flat log file gets painful. OpenTelemetry (OTel) models work as nested spans inside a trace.

Typical mapping:

Trace = full user session (session_id aligns with trace_id).
Parent span = one agent loop iteration.
Child spans = model call, retrieval, tool execution, subagent delegation.
Span events = retries, streamed chunks, guardrail triggers.
Baggage = context propagated across processes (user id, experiment flag).

A waterfall view answers "where did the time and money go?" faster than reading forty pages of text logs.
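
To make the nesting concrete, here is a toy in-memory tracer that mirrors the mapping above. It is deliberately not the OpenTelemetry API: `Span`, `Tracer`, and `waterfall` are names invented for this sketch, and a real setup would use an OTel SDK instead.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    kind: str
    start: float = 0.0
    end: float = 0.0
    children: list = field(default_factory=list)

    @property
    def duration_ms(self):
        return (self.end - self.start) * 1000.0


class Tracer:
    """Toy tracer: one instance per trace, spans nest via a stack."""

    def __init__(self, trace_id):
        self.trace_id = trace_id
        self.roots = []
        self._stack = []

    @contextmanager
    def span(self, name, kind):
        s = Span(name=name, kind=kind, start=time.perf_counter())
        parent = self._stack[-1] if self._stack else None
        (parent.children if parent else self.roots).append(s)
        self._stack.append(s)
        try:
            yield s
        finally:
            s.end = time.perf_counter()
            self._stack.pop()


def waterfall(spans, depth=0):
    """Render spans as indented lines, parents before children."""
    lines = []
    for s in spans:
        lines.append(f"{'  ' * depth}{s.kind:<10} {s.name} ({s.duration_ms:.1f} ms)")
        lines.extend(waterfall(s.children, depth + 1))
    return lines


tracer = Tracer(trace_id="sess-001")
with tracer.span("iteration-1", "iteration"):   # parent span: one loop iteration
    with tracer.span("model_call", "model"):    # child span
        pass
    with tracer.span("read_file", "tool"):      # child span
        pass
```

Printing `"\n".join(waterfall(tracer.roots))` gives one line per span, indented under its parent, which is the text version of a waterfall view.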

OpenTelemetry GenAI conventions

OTel's GenAI semantic conventions standardize span attributes across providers so dashboards survive model swaps. Fields you will see often:

gen_ai.system, gen_ai.request.model, gen_ai.operation.name
gen_ai.usage.input_tokens, gen_ai.usage.output_tokens
Cache fields: gen_ai.usage.cache_read.input_tokens, cache creation counterparts
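
A sketch of mapping a provider usage block onto these attribute names follows. The attribute keys are the ones listed above; the function name and the input key `cache_read_input_tokens` are assumptions about how your provider reports usage, so adjust them to the actual response shape.

```python
def genai_span_attributes(system, model, operation, usage):
    """Build a flat attribute dict for one model-call span."""
    attrs = {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.operation.name": operation,
        "gen_ai.usage.input_tokens": usage.get("input_tokens", 0),
        "gen_ai.usage.output_tokens": usage.get("output_tokens", 0),
    }
    # Only set the cache attribute when the provider actually reported it.
    if "cache_read_input_tokens" in usage:
        attrs["gen_ai.usage.cache_read.input_tokens"] = usage["cache_read_input_tokens"]
    return attrs
```

Keeping this mapping in one function means a model or provider swap changes one place rather than every call site.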

Frameworks like LangGraph, CrewAI, and others are converging on OTLP export. Instrument once, send to Langfuse, Arize Phoenix, Datadog, or Honeycomb. Swap backends without re-instrumenting every call site.

For span kinds, Phoenix's taxonomy is a useful shared vocabulary: LLM, TOOL, RETRIEVER, AGENT, GUARDRAIL, EVALUATOR. Consistent labels make waterfalls readable across teams.

W3C traceparent propagation matters for subagents and shell commands. The Claude Agent SDK can inherit trace context into subprocesses, so Bash commands your agent runs appear under the same trace without manual glue.
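
When you do need manual glue, the traceparent header has a fixed shape: version, 32 hex characters of trace id, 16 hex characters of parent span id, and 2 hex characters of flags. The sketch below builds and parses that value and passes it to a child process through an environment variable. The helper names are hypothetical, and using a `TRACEPARENT` environment variable is an assumption of this example; the child process has to read it for propagation to work.

```python
import os
import re
import secrets

_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def make_traceparent(trace_id=None, sampled=True):
    """Create a version-00 traceparent with a fresh span id."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)                # 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return the header fields as a dict, or None if the value is invalid."""
    m = _TRACEPARENT_RE.match(header.strip())
    if not m:
        return None
    version, trace_id, span_id, flags = m.groups()
    # Version ff and all-zero ids are invalid per the Trace Context format.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "flags": flags}


def child_env(traceparent, base_env=None):
    """Copy the environment and attach the trace context for a subprocess."""
    env = dict(os.environ if base_env is None else base_env)
    env["TRACEPARENT"] = traceparent  # assumed variable name
    return env
```

A spawned worker would reuse the parent's `trace_id` and mint its own span id, for example `make_traceparent(trace_id=parsed["trace_id"])`, then pass `env=child_env(...)` to `subprocess.run`.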

Token accounting every turn

Every provider response should include a usage block. Log it every turn. Plot iteration on the x-axis and cumulative input tokens on the y-axis. Healthy sessions slope gradually. Runaway sessions hockey-stick when a tool dumps huge text into context every loop.
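
The plot can be backed by a simple check. This sketch assumes you have a list of per-turn input token counts from the usage blocks; the function names and the 5,000-token growth threshold are illustrative assumptions.

```python
from itertools import accumulate


def cumulative_input_tokens(per_turn):
    """Running total of input tokens, one value per iteration (the y-axis)."""
    return list(accumulate(per_turn))


def runaway_turns(per_turn, max_growth=5000):
    """Return zero-based turn indexes where context grew by more than max_growth tokens."""
    flagged = []
    for i in range(1, len(per_turn)):
        if per_turn[i] - per_turn[i - 1] > max_growth:
            flagged.append(i)
    return flagged


per_turn = [1200, 1500, 1800, 9800, 17900]
print(cumulative_input_tokens(per_turn))  # [1200, 2700, 4500, 14300, 32200]
print(runaway_turns(per_turn))            # [3, 4]
```

In this made-up session the first three turns grow by 300 tokens each, then two turns each add roughly 8,000, which is the shape you would expect when a tool starts dumping large text into context.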

Track cache read vs creation separately. Prompt caching can cut repeated prefix cost sharply when tool definitions and system prompts stay stable. If cache read stays near zero across long sessions, you are probably paying full rate for the same prefix each turn.
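
A rough cache hit ratio per session is easy to compute from logged usage. The key names below (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`) and the assumption that `input_tokens` excludes cached tokens are provider-specific guesses for this sketch; check how your provider counts them before trusting the number.

```python
def cache_read_ratio(turns):
    """Share of all prompt tokens in a session that were served from cache."""
    read = sum(t.get("cache_read_input_tokens", 0) for t in turns)
    created = sum(t.get("cache_creation_input_tokens", 0) for t in turns)
    uncached = sum(t.get("input_tokens", 0) for t in turns)
    total = read + created + uncached
    return read / total if total else 0.0


turns = [
    # First turn writes the prefix to cache, second turn reads it back.
    {"input_tokens": 100, "cache_creation_input_tokens": 900, "cache_read_input_tokens": 0},
    {"input_tokens": 100, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 900},
]
print(cache_read_ratio(turns))  # 0.45
```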

Engineering reality

Set two alerts: a per-session token cap (catches long burns) and a spend velocity limit (catches fast loops). A session cap alone misses an agent that spends fifty dollars in ten minutes while staying inside an "allowed" token budget that was meant to last an hour. Observability without alerts is archaeology. Alerts without trace links are noise. Wire PagerDuty or Slack with the session id and a replay URL.
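
Both alerts fit in one small monitor. This is a sketch under assumptions: the class name `SpendMonitor`, the ten-minute window, and the idea that you can attach a dollar figure to each step are choices made for the example, and the caller is expected to pass a monotonic timestamp in seconds.

```python
from collections import deque


class SpendMonitor:
    """Tracks a per-session token cap and dollars spent in a sliding window."""

    def __init__(self, session_token_cap, max_usd_per_window, window_seconds=600):
        self.session_token_cap = session_token_cap
        self.max_usd_per_window = max_usd_per_window
        self.window_seconds = window_seconds
        self.total_tokens = 0
        self._spend = deque()  # (timestamp, usd) pairs inside the window

    def record(self, now, tokens, usd):
        """Record one step and return the list of alerts it triggers."""
        self.total_tokens += tokens
        self._spend.append((now, usd))
        # Drop spend that has aged out of the window.
        while self._spend and self._spend[0][0] <= now - self.window_seconds:
            self._spend.popleft()
        alerts = []
        if self.total_tokens > self.session_token_cap:
            alerts.append("session_token_cap")
        if sum(u for _, u in self._spend) > self.max_usd_per_window:
            alerts.append("spend_velocity")
        return alerts
```

With `SpendMonitor(1_000_000, 20.0)`, two steps costing 15 and 10 dollars a minute apart trip `spend_velocity` long before the token cap is reached.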

Detecting stuck agents

Models do not reliably notice they are repeating themselves. The harness must. Practical patterns:

Tool fingerprint repeats: hash (tool_name, sorted inputs). Same fingerprint and output hash two or three times in a row is a loop.
Output hash repeats: near-identical model text across turns without progress.
Thrashing: ABAB alternation between two tools. Exit after K oscillations.
Hard stops: max iterations, max wall time, max tokens. These live in code, not in prompts. Prompts asking the model to stop are suggestions. Harness kill switches are guarantees.
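
The patterns above can be sketched as harness-side checks. Function names, default limits, and the choice of SHA-256 are assumptions for illustration; the structure (fingerprint, repeat check, alternation check, hard limits in code) follows the list.

```python
import hashlib
import json


def tool_fingerprint(tool_name, inputs):
    """Stable hash of a tool call; sort_keys makes input key order irrelevant."""
    payload = json.dumps([tool_name, inputs], sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def repeated_call(history, n=3):
    """history holds (fingerprint, output_hash) pairs. True if the last n are identical."""
    if len(history) < n:
        return False
    tail = history[-n:]
    return all(item == tail[0] for item in tail)


def thrashing(tool_names, k=3):
    """True if the most recent calls alternate A, B, A, B for k full oscillations."""
    need = 2 * k
    if len(tool_names) < need:
        return False
    tail = tool_names[-need:]
    a, b = tail[0], tail[1]
    if a == b:
        return False
    return all(name == (a if i % 2 == 0 else b) for i, name in enumerate(tail))


def hard_stop(iteration, elapsed_s, total_tokens,
              max_iterations=50, max_wall_s=1800, max_tokens=2_000_000):
    """Return the reason to stop, or None. Limits here are placeholder values."""
    if iteration >= max_iterations:
        return "max_iterations"
    if elapsed_s >= max_wall_s:
        return "max_wall_time"
    if total_tokens >= max_tokens:
        return "max_tokens"
    return None
```

The harness would call these after every tool step and trip its circuit breaker on the first positive result, regardless of what the model says it intends to do next.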

When a circuit breaker trips: stop immediately, persist session state, log trip reason, alert with cost and last tool. The replay link is what makes the alert actionable.

Multi-agent and privacy

Multi-agent systems need one trace per user request with nested spans per subagent. Propagate traceparent at spawn time so costs and failures attribute to the right worker. MCP boundaries can carry OTel context too if client and server cooperate.

Agents touch secrets constantly. Redact before write: pattern-match API keys, tokens, and PII in tool outputs. Never ship raw env dumps to central logs. Retention defaults of thirty to ninety days are enough for most debugging; compliance trails are separate with tighter access.
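
A redaction pass before write can start as a handful of regular expressions. The patterns below are illustrative assumptions, not a complete secret scanner: they cover a few common key prefixes, bearer tokens, and email addresses, and a real deployment would need patterns for its own credential formats.

```python
import re

# (label, pattern) pairs; extend with the formats your systems actually use.
REDACTION_PATTERNS = [
    ("api_key", re.compile(r"\b(?:sk|pk|ghp)[-_][A-Za-z0-9_-]{16,}\b")),
    ("bearer", re.compile(r"\bBearer\s+[A-Za-z0-9._~+/=-]{16,}", re.IGNORECASE)),
    ("email", re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")),
]


def redact(text):
    """Replace every match with a labeled placeholder before the text is logged."""
    for label, pattern in REDACTION_PATTERNS:
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text
```

Labeled placeholders keep logs debuggable: you can still see that a tool output contained a key without being able to read it.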

From traces to evals

Traces power the trajectory evals from Lesson 02. Flag sessions that tripped breakers, hit max_tokens, exceeded P95 cost, or got a user thumbs-down. Export anonymized inputs and expected behaviors into the golden set. Observability and evals are one loop: the trace explains the failure, the eval prevents recurrence.
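
The flagging step can be sketched as a filter over session summaries. The summary keys (`breaker_tripped`, `stop_reasons`, `cost_usd`, `user_feedback`) are hypothetical field names for this example, and the P95 cost is passed in rather than computed here.

```python
def flag_reasons(session, p95_cost_usd):
    """Return the reasons a session should be reviewed for the golden set."""
    reasons = []
    if session.get("breaker_tripped"):
        reasons.append("circuit_breaker")
    if "max_tokens" in session.get("stop_reasons", []):
        reasons.append("max_tokens")
    if session.get("cost_usd", 0.0) > p95_cost_usd:
        reasons.append("over_p95_cost")
    if session.get("user_feedback") == "thumbs_down":
        reasons.append("thumbs_down")
    return reasons


def candidates_for_golden_set(sessions, p95_cost_usd):
    """Map session_id to its flag reasons, keeping only flagged sessions."""
    flagged = {}
    for s in sessions:
        reasons = flag_reasons(s, p95_cost_usd)
        if reasons:
            flagged[s["session_id"]] = reasons
    return flagged
```

Anonymization and writing expected behaviors still need a human pass; the filter only decides which traces are worth that effort.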

Session replay and debugging workflow

When on-call opens an incident, the workflow should be: find trace, replay session, diff against last known good trace or eval case, patch harness, add case, rerun golden set. JSONL event logs are enough for replay if each event stores inputs and tool results needed to reconstruct state.

Handle truncated JSONL from crashed sessions gracefully. Partial writes happen when a circuit breaker trips or the process dies mid-event. Validators should load what exists and mark the tail as incomplete instead of failing the whole import.
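
A tolerant loader might look like the sketch below. The function name and return shape are assumptions; the behavior it illustrates is the one described above: keep every complete event, report a broken final line separately, and still fail loudly on corruption in the middle of the file.

```python
import json


def load_jsonl_tolerant(text):
    """Parse JSONL text. Returns (events, incomplete_tail_or_None)."""
    lines = [line for line in text.splitlines() if line.strip()]
    events = []
    incomplete_tail = None
    for i, line in enumerate(lines):
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError as exc:
            if i == len(lines) - 1:
                # A broken last line is an interrupted write, not a bad file.
                incomplete_tail = line
            else:
                raise ValueError(f"corrupt record {i + 1} before end of log") from exc
    return events, incomplete_tail
```

Replay tooling can then show the session up to the last complete event and surface the raw tail for inspection.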

Correlate replay with eval case ids. If case billing-041 fails in CI, the trace from the incident that created billing-041 should be one click away in your observability UI. That link shortens postmortems from hours to minutes.

Choosing an observability backend

OTel-native open tools (Langfuse, Phoenix) fit teams that want self-hosting and eval export. Braintrust leans eval-first with CI integration and run comparison. Datadog and Honeycomb fit teams already paying for APM. LangSmith fits LangGraph stacks with minimal wiring.

The wrong choice is usually "none because we will add it later." Later is when you are debugging a forty-seven-minute run with printf. Start with JSONL plus one hosted or self-hosted trace UI before you need it, not after.

RAG and retrieval spans

In RAG pipelines, instrument retrieval as its own span kind: query text, candidate ids, scores, filter reasons, reranker latency. When answers go wrong, the waterfall should show whether embedding search, keyword merge, or rerank dropped the right doc. Without retrieval spans, every miss looks like a "hallucination."

Log chunk ids in the trace, not full chunk text. Attach byte counts and content hashes so you can verify the model received what retrieval selected.
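
A retrieval span record along these lines could be built as follows. The function name, the candidate dict shape (`id`, `score`, `text`, optional `filter_reason`), and the output keys are assumptions for this sketch.

```python
import hashlib


def retrieval_span_record(query, candidates, selected_ids, reranker_latency_ms):
    """Describe one retrieval step without storing chunk text."""
    selected = set(selected_ids)
    chunks = []
    for c in candidates:
        data = c["text"].encode("utf-8")
        chunks.append({
            "chunk_id": c["id"],
            "score": c["score"],
            "bytes": len(data),
            "content_hash": hashlib.sha256(data).hexdigest(),
            "selected": c["id"] in selected,
            "filter_reason": c.get("filter_reason"),  # None if the chunk was not filtered
        })
    return {
        "span_kind": "retrieval",
        "query": query,
        "reranker_latency_ms": reranker_latency_ms,
        "chunks": chunks,
    }
```

Comparing `content_hash` here against the hash of what was placed in the prompt confirms the model saw the chunk that retrieval selected.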

For each tool-call event, log session_id, iteration, tool_name, redacted inputs, output_hash, output_bytes, latency_ms, error if any, and cumulative token totals. That is enough to debug eighty percent of agent incidents without storing secrets.

Fine-tuning and deployment correlation

When you ship a new adapter or fine-tune, log model_revision on every span. Slice traces and eval pass rate by revision from day one. Fine-tune regressions often appear as slow citation quality drift or tool-call format breakage, not as hard errors. Compare revision N vs N-1 on the same golden set before promoting traffic.
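
Slicing by revision is a simple group-by once model_revision is on every record. This sketch assumes eval results are dicts with `model_revision` and a boolean `passed`; both the shape and the function name are illustrative.

```python
from collections import defaultdict


def pass_rate_by_revision(results):
    """Return {model_revision: fraction of eval cases passed}."""
    totals = defaultdict(lambda: [0, 0])  # revision -> [passed, total]
    for r in results:
        bucket = totals[r["model_revision"]]
        bucket[0] += 1 if r["passed"] else 0
        bucket[1] += 1
    return {rev: passed / total for rev, (passed, total) in totals.items()}
```

Run it on the same golden set for revision N and N-1 and compare the two numbers before promoting traffic.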

Keep baseline traces from before the fine-tune on representative sessions. Side-by-side waterfalls make "it feels worse" debates shorter.

Alert when median tokens per successful task moves more than twenty percent week over week without a planned change. That often precedes a retrieval or compression bug that eval pass rate has not caught yet.
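
The drift check reduces to comparing two medians. In this sketch the function name is hypothetical and the inputs are assumed to be lists of total tokens per successful task for each week; the twenty percent threshold comes from the rule above.

```python
from statistics import median


def median_token_drift(last_week, this_week, threshold=0.20):
    """Compare weekly medians. Returns None when there is not enough data."""
    if not last_week or not this_week:
        return None
    base = median(last_week)
    if base == 0:
        return None
    change = (median(this_week) - base) / base
    # Alert on moves in either direction; a sharp drop can also signal a bug.
    return {"change": change, "alert": abs(change) > threshold}
```

For example, medians of 1,000 and 1,300 tokens give a change of 0.3 and trigger the alert; suppress it manually when the move follows a planned change.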

Standardize span names across services so dashboards compose. Ad-hoc strings like llm_call_v2 and model_invoke for the same step will fragment metrics.

Checkpoint

You are ready for the next lesson if you can answer these from memory:

What is the difference between a trace and a span in an agent loop?
Which stop_reason values point to harness context problems?
Why log output hashes instead of full tool responses?
Name two stuck-agent heuristics and one hard stop every harness needs.
