Lesson 06

Production monitoring and incident loops

CI catches regressions before deploy. Production catches everything else: new user behaviors, data drift, provider outages, and the failure modes you never thought to test. This lesson is the operating rhythm that connects live traffic back to evals.

The one idea

Production monitoring without a log-to-eval loop is dashboards you stare at. The loop is: trace, flag, triage, add case, fix harness, deploy, repeat.

What to monitor beyond pass rate

Offline pass rate is necessary. Online health needs operational metrics tied to user pain:

Task success rate: from explicit feedback, downstream outcomes, or sampled judges.
Latency: P50/P95 for end-to-end and per-span (retrieval, model, tools).
Cost: tokens and dollars per session, per successful task, per user cohort.
Error rates: tool failures, parse errors, guardrail blocks, empty responses.
Safety signals: injection attempts, policy violations, escalation rate.
Circuit breaker trips and stop_reason=max_tokens frequency.

Slice metrics by feature flag, model version, prompt hash, and customer tier. Aggregates hide regressions that only hit one locale or one document collection.
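A minimal sketch of why slicing matters, assuming each session is a dict with a `success` flag plus the dimensions you slice on. The field names and the sample data are hypothetical.

```python
from collections import defaultdict

def success_rate_by_slice(sessions, key):
    """Return {slice_value: (success_rate, session_count)} for one dimension."""
    totals = defaultdict(lambda: [0, 0])  # slice value -> [successes, count]
    for s in sessions:
        bucket = totals[s[key]]
        bucket[0] += 1 if s["success"] else 0
        bucket[1] += 1
    return {k: (ok / n, n) for k, (ok, n) in totals.items()}

# Hypothetical session records: a healthy majority and a small failing locale.
sessions = (
    [{"locale": "en-US", "prompt_hash": "a1c0", "success": True}] * 90
    + [{"locale": "en-US", "prompt_hash": "a1c0", "success": False}] * 5
    + [{"locale": "de-DE", "prompt_hash": "a9f2", "success": True}] * 2
    + [{"locale": "de-DE", "prompt_hash": "a9f2", "success": False}] * 3
)

overall = sum(s["success"] for s in sessions) / len(sessions)
by_locale = success_rate_by_slice(sessions, "locale")
```

Here the overall rate is 92 percent while the `de-DE` slice sits at 40 percent across five sessions. Report the session count next to each rate so nobody pages on a slice of three sessions.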

Each lap should leave you with more test coverage than the last lap.

Alerts that wake the right person

Alert on symptoms users feel, with links to traces:

Success rate drops below SLO for fifteen minutes.
Spend velocity exceeds budget (dollars per hour).
Circuit breaker rate spikes above baseline.
Tool error rate doubles week over week.
Queue depth for human review backs up.

Avoid alerting on every failed session. Sample and threshold. Page when the system is on fire, ticket when a cluster is forming. Every page should include session id, trace URL, prompt version, and model id.

Impact: refund assistant mis-states policy for EU users since 14:00 UTC.
Scope: 3.2% of sessions, prompt hash a9f2, retrieval index version 2026-06-28.
Trace exemplar: link.
Mitigation: roll back prompt to a1c0, disable EU banner experiment.
Follow-up: add cases EU-12 through EU-18 to golden set before re-enabling experiment.
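The first alert in the list above, success rate below SLO for fifteen minutes, can be sketched as a sustained-breach check. This version assumes per-minute buckets and requires every one of the last fifteen minutes to be below the SLO; a rolling aggregate over the window is an equally reasonable reading. `MinuteBucket` and its fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MinuteBucket:
    minute: int      # minutes since an arbitrary epoch (hypothetical field)
    successes: int
    total: int

def sustained_breach(buckets, slo=0.95, window=15):
    """True if each of the last `window` minutes has traffic and is below the SLO.

    Minutes with no traffic count as non-breaching, so a quiet period
    cannot page anyone on its own.
    """
    recent = sorted(buckets, key=lambda b: b.minute)[-window:]
    if len(recent) < window:
        return False
    return all(b.total > 0 and b.successes / b.total < slo for b in recent)

healthy = [MinuteBucket(m, 97, 100) for m in range(15)]
degraded = [MinuteBucket(m, 90, 100) for m in range(15)]
blip = degraded[:14] + [MinuteBucket(14, 99, 100)]
```

`sustained_breach(degraded)` is true and would page. `sustained_breach(blip)` is false because the most recent minute recovered, which fits the "sample and threshold" advice rather than paging on every dip.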

Harvesting sessions into evals

Auto-flag sessions worth human review:

Circuit breaker or max tokens stop.
Consecutive tool errors.
Cost or iteration count above P95.
Thumbs-down or implicit negative signals (rephrase request, immediate abandon).
Judge or guardrail low confidence.
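The flag rules above can be sketched as a single function that returns the reasons a session deserves review. The session fields (`stop_reason`, `tool_results`, `cost_usd`, and so on), the two-error streak, and the 0.5 confidence cutoff are all assumptions for illustration, not a fixed schema.

```python
def flag_reasons(session, p95_cost, p95_iterations, min_confidence=0.5):
    """Return a list of reasons this session should enter human review."""
    reasons = []

    if session.get("stop_reason") in {"circuit_breaker", "max_tokens"}:
        reasons.append("breaker_or_max_tokens")

    # Longest run of back-to-back tool failures (tool_results: True = ok).
    streak = longest = 0
    for ok in session.get("tool_results", []):
        streak = 0 if ok else streak + 1
        longest = max(longest, streak)
    if longest >= 2:
        reasons.append("consecutive_tool_errors")

    if (session.get("cost_usd", 0) > p95_cost
            or session.get("iterations", 0) > p95_iterations):
        reasons.append("cost_or_iterations_above_p95")

    if (session.get("thumbs_down") or session.get("rephrased")
            or session.get("abandoned")):
        reasons.append("negative_signal")

    confidence = session.get("judge_confidence")
    if confidence is not None and confidence < min_confidence:
        reasons.append("low_confidence")

    return reasons
```

Returning reasons instead of a boolean lets the triage meeting sort the queue by failure type.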

Weekly triage meeting: label each flagged session as add-to-eval, known issue, expected behavior, or needs product decision. Target five to ten new golden cases per week from production. Production-derived cases tend to catch failures that synthetic-only sets miss, because users break systems in ways spec writers do not imagine.

Platforms with clustering (LangSmith Insights-style workflows) group thousands of traces by intent and surface recurring failure themes. That is weekly manual work compressed into a ranked queue.

Engineering reality

Anonymize before export to eval repos. Strip PII, rotate secrets, and keep production identifiers in a separate mapping table if you need to reconnect during debugging. GDPR and CCPA treat careless log retention as compliance risk, not just security risk. The eval artifact should be safe to share with every engineer and CI runner.
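A rough sketch of the export step, assuming a salted hash for the case id and a mapping table kept outside the eval repo. The two regexes are illustrative only: pattern matching catches obvious emails and phone numbers and is not a complete PII scrubber. The field names and the `case-` prefix are hypothetical.

```python
import hashlib
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def pseudonym(value, salt):
    """Stable, non-reversible id derived from a production identifier."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "case-" + digest[:12]

def anonymize(session, salt, mapping):
    """Return an export-safe record; record the id mapping separately."""
    case_id = pseudonym(session["session_id"], salt)
    mapping[case_id] = session["session_id"]  # store with restricted access
    text = EMAIL.sub("[EMAIL]", session["text"])
    text = PHONE.sub("[PHONE]", text)
    return {"case_id": case_id, "text": text}

mapping = {}
record = anonymize(
    {"session_id": "sess-123",
     "text": "Contact me at jane.doe@example.com or +1 (415) 555-0100 about order."},
    salt="rotate-me",
    mapping=mapping,
)
```

Only `record` goes to the eval repo. `mapping` stays in a separate store so debugging can reconnect a case to its production session without the eval artifact carrying the identifier.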

Thumbs-down to eval case pipeline

Explicit negative feedback is high-signal if you wire it into the golden set instead of letting it die in a dashboard.

Capture. Thumbs-down stores session_id, message_id, optional category (wrong, unsafe, slow, unhelpful), and prompt/model hashes.
Queue. Nightly job dedupes and ranks by frequency and revenue tier. Top items enter the annotation queue.
Label. Reviewer watches trace replay, writes pass/fail criteria, tags failure mode (retrieval miss, faithfulness, tool error, injection).
Promote. Anonymized case merges to golden set with source thumbs-down-2026-06-29. CI runs on next PR.
Close loop. When fix ships, notify user if policy allows; verify case passes in holdout before closing incident.

Target time-to-eval under seventy-two hours for P0 categories (unsafe, billing wrong). Slow promotion trains users that feedback is ignored.
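The Queue step can be sketched as a dedupe followed by a weighted count. The tier weights and event fields below are hypothetical; any scoring that combines frequency with revenue tier fits the same shape.

```python
from collections import Counter

TIER_WEIGHT = {"enterprise": 3, "pro": 2, "free": 1}  # hypothetical weights

def rank_feedback(events, top_n=10):
    """Dedupe thumbs-down events, then rank (category, prompt_hash) groups."""
    # A user tapping thumbs-down repeatedly on one message counts once.
    unique = {(e["session_id"], e["message_id"]): e for e in events}.values()
    score = Counter()
    for e in unique:
        score[(e["category"], e["prompt_hash"])] += TIER_WEIGHT.get(e["tier"], 1)
    return score.most_common(top_n)

events = [
    {"session_id": "s1", "message_id": "m1", "category": "wrong",
     "prompt_hash": "a9f2", "tier": "enterprise"},
    {"session_id": "s1", "message_id": "m1", "category": "wrong",
     "prompt_hash": "a9f2", "tier": "enterprise"},  # duplicate tap
    {"session_id": "s2", "message_id": "m1", "category": "wrong",
     "prompt_hash": "a9f2", "tier": "free"},
    {"session_id": "s3", "message_id": "m1", "category": "slow",
     "prompt_hash": "a1c0", "tier": "pro"},
]
queue = rank_feedback(events)
```

With this data the `wrong` reports on prompt hash `a9f2` score 4 and lead the queue, ahead of the single `slow` report at 2.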

Input distribution drift

Offline pass rate can stay flat while production degrades because users changed, not because your harness broke. Monitor input distribution drift:

New intent clusters (users ask about a product line you never eval'd).
Language or locale mix shifts.
Longer prompts, more attachments, new file types.
Seasonal queries (tax season, holidays) absent from golden set.

Compare weekly histograms of embedding clusters or intent labels on production queries against golden set tags. When a cluster exceeds five percent of traffic but accounts for under two percent of eval cases, sample fifty live queries from it into the annotation queue. Drift detection is not optional once the product leaves demo stage. It is what explains "evals green, users angry."
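The five-percent and two-percent rule can be sketched as a comparison of label shares. This assumes each production query and each golden case already carries an intent or cluster label; how you produce those labels is out of scope here.

```python
from collections import Counter

def underrepresented_clusters(prod_labels, eval_labels,
                              min_traffic=0.05, max_eval=0.02):
    """Clusters with a large traffic share but a small share of eval cases."""
    prod = Counter(prod_labels)
    evals = Counter(eval_labels)
    n_prod, n_eval = len(prod_labels), len(eval_labels)
    flagged = []
    for cluster, count in prod.items():
        traffic_share = count / n_prod
        eval_share = evals[cluster] / n_eval if n_eval else 0.0
        if traffic_share > min_traffic and eval_share < max_eval:
            flagged.append((cluster, traffic_share, eval_share))
    # Biggest coverage gaps by traffic first.
    return sorted(flagged, key=lambda row: row[1], reverse=True)

# Hypothetical weekly labels.
prod_labels = ["refund"] * 80 + ["tax"] * 12 + ["gift_card"] * 8
eval_labels = ["refund"] * 95 + ["tax"] * 4 + ["gift_card"] * 1
gaps = underrepresented_clusters(prod_labels, eval_labels)
```

In this sample `gift_card` is flagged (8 percent of traffic, 1 percent of eval cases) while `tax` is not, because it already holds 4 percent of the golden set.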

Pair drift monitors with safety: if injection attempt rate rises, import probes from Safety L02 into the golden set before the attacker distribution spreads.

Incident response for AI features

AI incidents are often soft failures: wrong answer, not HTTP 500. Still run a disciplined loop:

Detect: via alert or customer report.
Scope: with traces (which prompt, model, cohort, time window).
Mitigate: rollback, feature flag off, fallback model, or read-only mode.
Communicate: status with known impact bounds.
Fix: harness, data, or policy, with eval proof.
Postmortem: blameless, with new eval cases as action items.

Rollback artifacts should be ready before you need them: previous prompt files, model pins, index snapshots. AI deploys are configuration deploys as much as binary deploys.

Trajectory vs output in production

Sample production traffic for both output quality and path quality. An agent that succeeds after twelve redundant searches is a latency and cost incident waiting to happen. Monitor tool-call counts per success, redundant call rate, and argument validity. Regression in efficiency often precedes regression in pass rate.
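A sketch of the path-quality metrics for one session, assuming each tool call is logged with its name, arguments, and a validity flag from schema checking. A call counts as redundant here when it repeats an earlier call with identical arguments; that definition is an assumption, and stricter or looser ones are possible.

```python
import json

def trajectory_stats(tool_calls):
    """tool_calls: list of {"name": str, "args": dict, "valid": bool}."""
    seen = set()
    redundant = 0
    for call in tool_calls:
        # Canonical signature so key order in args does not matter.
        signature = (call["name"], json.dumps(call["args"], sort_keys=True))
        if signature in seen:
            redundant += 1
        seen.add(signature)
    n = len(tool_calls)
    invalid = sum(not c["valid"] for c in tool_calls)
    return {
        "calls": n,
        "redundant_rate": redundant / n if n else 0.0,
        "invalid_arg_rate": invalid / n if n else 0.0,
    }

calls = [
    {"name": "search", "args": {"q": "refund policy"}, "valid": True},
    {"name": "search", "args": {"q": "refund policy"}, "valid": True},  # repeat
    {"name": "search", "args": {"q": "refund policy EU"}, "valid": True},
    {"name": "lookup", "args": {"id": None}, "valid": False},
]
stats = trajectory_stats(calls)
```

Aggregate these per successful session and trend them by prompt hash, so an efficiency regression shows up before the pass rate moves.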

Closing the course

Evaluation and observability are one system. Lesson 01 argued you cannot ship serious AI features without measuring them. Lessons 02 and 03 built the task and judge layer. Lesson 04 made failures legible. Lesson 05 blocked bad changes at the door. This lesson keeps the set honest after deploy.

If you only build one habit from this course, build the weekly loop: review flagged traces, promote failures to evals, rerun golden set before the next risky change. Everything else is tooling around that discipline.

SLOs and error budgets for AI features

Define SLOs users feel: "ninety-five percent of refund questions receive a grounded answer with citation" or "P95 end-to-end latency under eight seconds for tier-one customers." Pair each SLO with an error budget. When budget burns fast, freeze prompt experiments and spend budget on eval additions and fixes, not new features.

AI SLOs fail when they measure only uptime. The service can return HTTP 200 with a harmful answer. Blend outcome metrics (sampled correctness) with operational metrics (latency, cost, breaker rate).
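One common way to quantify "budget burns fast" is a burn rate: the observed bad-event rate divided by the rate the SLO allows. The sketch below assumes bad events come from sampled correctness checks, and the freeze threshold of 2.0 is a hypothetical policy choice.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Observed error rate divided by the allowed error rate (1 - SLO).

    1.0 means the budget is being spent exactly on schedule;
    above 1.0 means it will run out before the SLO period ends.
    """
    if total_events == 0:
        return 0.0
    allowed = 1.0 - slo_target
    return (bad_events / total_events) / allowed

def should_freeze_experiments(bad_events, total_events, slo_target,
                              threshold=2.0):
    """Freeze prompt experiments when the budget burns too fast."""
    return burn_rate(bad_events, total_events, slo_target) >= threshold
```

For the refund SLO of ninety-five percent grounded answers, 15 ungrounded answers in 100 sampled sessions is a burn rate of about 3, so experiments freeze. Three in 100 is about 0.6, comfortably inside budget.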

Postmortem template for AI incidents

Capture: timeline, customer impact, prompt/model/index versions, exemplar trace ids, root cause layer (retrieval, prompt, tool, model, data), mitigation, eval cases added (with ids), and owners for follow-up. Blameless culture still demands concrete artifacts. "We will be more careful" is not an action item; "case EU-12 through EU-18 in golden set" is.

Schedule a thirty-day check: did the new cases catch a repeat attempt? If not, the postmortem is still open.

Online evaluation and shadow traffic

Before promoting a new prompt or model, shadow a slice of traffic: run the candidate path, log outcomes, and do not show results to users. Compare pass rate, cost, and latency against the production baseline on the same inputs. Shadowing catches live interactions that fixtures miss, without risking user-facing regressions.

Cap shadow percentage and duration. Five percent for forty-eight hours often beats a hundred percent shadow that doubles spend indefinitely.
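A sketch of a capped shadow slice, assuming assignment by hashing a request id so the same request always lands in the same bucket. `compare` assumes baseline and candidate were scored on the same inputs; the record fields are hypothetical.

```python
import hashlib

def in_shadow_slice(request_id, percent=5.0):
    """Deterministically assign roughly `percent` of requests to shadow."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100

def summarize(rows):
    """rows: list of {"passed": bool, "cost_usd": float, "latency_s": float}."""
    n = len(rows)
    return {
        "pass_rate": sum(r["passed"] for r in rows) / n,
        "avg_cost_usd": sum(r["cost_usd"] for r in rows) / n,
        "avg_latency_s": sum(r["latency_s"] for r in rows) / n,
    }

def compare(baseline, candidate):
    """Candidate minus baseline for each metric, on the same inputs."""
    b, c = summarize(baseline), summarize(candidate)
    return {metric: c[metric] - b[metric] for metric in b}
```

Enforce the duration cap outside this function, for example with a feature flag that expires after forty-eight hours, so the shadow path cannot keep doubling spend unnoticed.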

Cost and quality dashboards for leadership

Executives ask two questions: "Is it working?" and "What does it cost?" Give them one page: success rate trend, cost per successful task, incident count, eval coverage growth (case count and production-sourced percentage). Tie incidents to eval cases closed so quality work is visible, not invisible hygiene.

Review top ten flagged traces by impact. Label each: add eval, fix now, known, wontfix. Assign owners for new cases. Check open incidents for missing eval follow-up. Note pass rate and cost trends. One action item must land in the golden set before next meeting.

Provider outages spike latency and errors across all cases at once. Harness bugs cluster on specific tags or tool paths. Compare failure distribution before rollback. Roll back prompts for localized clusters; wait or failover models for provider-wide spikes.
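The comparison of failure distributions can be sketched as a concentration heuristic over case tags. The 0.6 and 0.8 thresholds are hypothetical starting points, and the output is a hint for the on-call engineer, not a verdict.

```python
from collections import Counter

def diagnose(failure_tags, all_tags, concentration=0.6, breadth=0.8):
    """failure_tags: one tag per failed case. all_tags: every tag in the suite."""
    if not failure_tags:
        return "no_failures"
    counts = Counter(failure_tags)
    top_tag, top_count = counts.most_common(1)[0]
    top_share = top_count / len(failure_tags)
    tags_hit = len(counts) / len(all_tags)
    if top_share >= concentration:
        return f"localized:{top_tag}"   # likely harness bug: consider prompt rollback
    if tags_hit >= breadth:
        return "broad"                  # likely provider issue: wait or fail over
    return "unclear"

all_tags = ["billing", "shipping", "returns", "account", "search"]
clustered = ["billing"] * 9 + ["shipping"]
spread = all_tags * 2
```

`diagnose(clustered, all_tags)` returns `localized:billing`, while `diagnose(spread, all_tags)` returns `broad`. Confirm against provider status and latency graphs before acting on either.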

Feedback interfaces that feed the loop

Thumbs down on a response should capture session id, message id, and optional category (wrong, unsafe, slow). Feed structured feedback directly into the annotation queue. Free-text alone is harder to cluster. Even a three-bucket taxonomy beats a blank comment box for turning pain into labeled evals.

Close the loop with users when you can: "We fixed the issue you reported" builds trust and confirms the case was wired correctly. If the same report category spikes after a fix, your eval did not capture the real failure mode.

Run a monthly "eval debt" review: open incidents without matching golden cases, flaky cases quarantined but unfixed, and holdout regressions ignored. Pay down one item per sprint the same way you pay down flaky unit tests.

Pair on-call runbooks with eval case ids. When the playbook says "roll back prompt hash a9f2," it should also say "verify golden cases billing-041 through billing-048 pass before closing."

Track time-to-eval: hours from incident open to merged golden case. Shrinking that interval is one of the few metrics that predicts fewer repeat incidents better than raw model upgrades.
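Time-to-eval is simple to compute once incident open times and golden-case merge times are recorded. A minimal sketch with hypothetical timestamps, checked against the seventy-two hour target for P0 categories:

```python
from datetime import datetime, timezone
from statistics import median

def time_to_eval_hours(incident_opened, case_merged):
    """Hours between incident open and the golden case merging."""
    return (case_merged - incident_opened).total_seconds() / 3600

# Hypothetical incidents: (opened, golden case merged), all in UTC.
incidents = [
    (datetime(2026, 6, 29, 14, 0, tzinfo=timezone.utc),
     datetime(2026, 7, 1, 10, 0, tzinfo=timezone.utc)),
    (datetime(2026, 7, 3, 9, 0, tzinfo=timezone.utc),
     datetime(2026, 7, 4, 9, 0, tzinfo=timezone.utc)),
    (datetime(2026, 7, 8, 8, 0, tzinfo=timezone.utc),
     datetime(2026, 7, 12, 8, 0, tzinfo=timezone.utc)),
]

hours = [time_to_eval_hours(opened, merged) for opened, merged in incidents]
median_hours = median(hours)
over_target = [h for h in hours if h > 72]
```

Trend the median week over week and review every entry in `over_target` at the eval debt review.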

Run game days: inject a known-bad prompt in staging and verify that alerts, trace links, and eval promotion workflows fire. A monitoring stack that has never been exercised in rehearsal tends to fail during the first live incident.

Publish a single internal URL for "current production prompt hashes and model ids" so support, on-call, and PMs reference the same versions during incidents. Consistency beats scattered wiki pages.

Checkpoint

You are ready to apply this course if you can answer these from memory:

Name four production metrics besides offline pass rate.
Which sessions should auto-enter the eval harvest queue?
What is the thumbs-down to golden-case pipeline?
How do you detect input distribution drift?
What belongs in an AI incident on-call packet?
What is the weekly log-to-eval loop in order?
