Lesson 06

Operate LLM serving in production

Once users depend on the model, serving becomes operations. You need capacity plans, rollout controls, cost alerts, traces, and a response plan for quality and latency incidents.

The one idea

Production LLM serving is a control loop: measure traffic, protect latency, control cost, release carefully, and feed incidents back into prompts, routing, evals, and capacity.

Define the SLO in user terms

Do not start with GPU utilization. Start with the user experience: time to first token, total response time, error rate, stream interruption rate, fallback rate, and output validity. Then map those numbers to server metrics.

A chat product might care about p95 time to first token. A batch summarizer might care about job completion time. A voice agent might care about turn latency. Different products need different SLOs.
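As a minimal sketch of what a user-facing SLO can look like in code, the example below checks a p95 time-to-first-token target against a handful of samples. The `Slo` class, the `ttft_seconds` metric name, the 1.0 second threshold, and the sample values are all assumptions made for illustration, not figures from this lesson.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A user-facing objective: the given percentile of a metric must stay at or below a threshold."""
    metric: str        # hypothetical metric name, e.g. "ttft_seconds"
    percentile: float  # e.g. 95.0
    threshold: float   # same unit as the metric


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def slo_met(slo, samples):
    """True when the observed percentile is within the SLO threshold."""
    return percentile(samples, slo.percentile) <= slo.threshold


# Illustrative values only: a chat route with a p95 TTFT target of 1.0 s.
chat_slo = Slo(metric="ttft_seconds", percentile=95.0, threshold=1.0)
ttft_samples = [0.21, 0.25, 0.30, 0.32, 0.35, 0.40, 0.41, 0.45, 0.50, 2.40]

print(percentile(ttft_samples, 50))     # the median looks healthy
print(slo_met(chat_slo, ttft_samples))  # the tail does not
```

The median here looks fine while the p95 misses the target, which is why the SLO is stated as a percentile of what the user experiences rather than as an average or a server-side utilization number.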

Capacity plan from token shape

Capacity planning starts with request count, input tokens, output tokens, concurrency, and burst shape. A request is not a unit of work by itself. Ten short classification requests are not the same as ten long RAG answers.

Keep a simple forecast per route: expected requests, p50 and p95 input tokens, p50 and p95 output tokens, target latency, and expected growth. Use that to decide how much serving capacity and headroom you need.
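The per-route forecast can be sketched as a small record plus a sizing function. This is a deliberately rough model under stated assumptions: it sizes only for decode throughput, treats every request as if it produced the p95 output length (conservative), and takes a per-replica decode rate of 2,000 tokens per second as a made-up number you would replace with your own measurement. The names `RouteForecast` and `replicas_needed` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteForecast:
    name: str
    requests_per_second: float
    p95_input_tokens: int
    p95_output_tokens: int

    def input_tokens_per_second(self):
        return self.requests_per_second * self.p95_input_tokens

    def output_tokens_per_second(self):
        return self.requests_per_second * self.p95_output_tokens


def replicas_needed(forecast, decode_tokens_per_second_per_replica, headroom=0.25):
    """Replicas needed to sustain decode demand plus a headroom fraction (at least one)."""
    demand = forecast.output_tokens_per_second() * (1.0 + headroom)
    return max(1, math.ceil(demand / decode_tokens_per_second_per_replica))


# Same request rate, very different token shape (illustrative numbers).
classify = RouteForecast("classify", requests_per_second=10, p95_input_tokens=200, p95_output_tokens=5)
rag = RouteForecast("rag_answer", requests_per_second=10, p95_input_tokens=6000, p95_output_tokens=600)

ASSUMED_DECODE_RATE = 2000  # tokens/s per replica; an assumption, measure your own

for route in (classify, rag):
    print(route.name, route.output_tokens_per_second(), replicas_needed(route, ASSUMED_DECODE_RATE))
```

Both routes receive ten requests per second, yet the RAG route needs several replicas where the classifier needs one. A fuller plan would also account for prefill cost from input tokens, concurrency limits, and burst shape.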

Figure: Good serving operations turn production signals into safer releases and better workload control.

Autoscale carefully

LLM autoscaling is harder than normal web autoscaling because models are large, startup can be slow, and the useful signal is token pressure rather than request count alone. Scaling on CPU usage or raw requests can miss the real bottleneck.

Better signals include queue time, active sequences, pending tokens, GPU memory pressure, KV cache pressure, time to first token, and route-level traffic forecasts. Keep warm capacity for known spikes when cold starts would violate the SLO.
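One way to combine these signals is a small decision function that scales up on any sign of token pressure and scales down only when everything is quiet, while never dropping below a warm floor. This is a sketch, not a real autoscaler API: the `ServingSignals` fields, the thresholds, and the one-replica step size are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingSignals:
    queue_time_p95_s: float      # time requests wait before being scheduled
    kv_cache_utilization: float  # fraction of KV cache in use, 0.0 to 1.0
    pending_tokens: int          # tokens queued but not yet processed
    p95_ttft_s: float            # user-facing time to first token


def desired_replicas(current, signals, *, min_warm=2, max_replicas=16,
                     queue_limit_s=0.5, kv_limit=0.85, ttft_slo_s=1.0):
    """Return the replica target for the next scaling step."""
    # Any single pressure signal is enough to scale up.
    pressured = (signals.queue_time_p95_s > queue_limit_s
                 or signals.kv_cache_utilization > kv_limit
                 or signals.p95_ttft_s > ttft_slo_s)
    # Scale down only when every signal is well below its limit.
    idle = (signals.queue_time_p95_s < queue_limit_s / 5
            and signals.kv_cache_utilization < kv_limit / 2
            and signals.pending_tokens == 0)
    if pressured:
        target = current + 1
    elif idle:
        target = current - 1
    else:
        target = current
    # The warm floor protects the SLO from slow cold starts.
    return max(min_warm, min(max_replicas, target))


pressured = ServingSignals(queue_time_p95_s=1.2, kv_cache_utilization=0.90, pending_tokens=40000, p95_ttft_s=1.8)
idle = ServingSignals(queue_time_p95_s=0.05, kv_cache_utilization=0.20, pending_tokens=0, p95_ttft_s=0.3)
steady = ServingSignals(queue_time_p95_s=0.30, kv_cache_utilization=0.60, pending_tokens=500, p95_ttft_s=0.7)

print(desired_replicas(3, pressured), desired_replicas(2, idle), desired_replicas(4, steady))
```

The asymmetry is intentional: scaling up is cheap relative to an SLO violation, while scaling down too eagerly forces a slow model load on the next burst.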

Put budgets in the serving path

Cost controls should not live only in a monthly finance dashboard. Add budget alerts by route, customer, model, and environment. Track input tokens, output tokens, retries, fallback calls, and cache hit rate.

When a prompt change doubles token use, you want to know during rollout, not after the invoice. Budget alerts should be part of release safety.
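A budget check that runs during rollout can be as simple as comparing tokens per request against a recorded baseline. The sketch below assumes a hypothetical `TokenBudget` helper, a baseline of 1,000 tokens per request for a made-up `support_chat` route, and an alert threshold of 1.5 times baseline; none of these come from a real library.

```python
from collections import defaultdict


class TokenBudget:
    """Tracks tokens per request by route and flags routes that exceed a baseline ratio."""

    def __init__(self, baseline_tokens_per_request, max_ratio=1.5, min_requests=20):
        self.baseline = baseline_tokens_per_request  # dict: route -> expected tokens per request
        self.max_ratio = max_ratio
        self.min_requests = min_requests             # avoid alerting on a handful of requests
        self.tokens = defaultdict(int)
        self.requests = defaultdict(int)

    def record(self, route, input_tokens, output_tokens):
        self.tokens[route] += input_tokens + output_tokens
        self.requests[route] += 1

    def alerts(self):
        """Return (route, ratio) pairs for routes whose tokens per request exceed the allowed ratio."""
        fired = []
        for route, count in self.requests.items():
            if count < self.min_requests or route not in self.baseline:
                continue
            ratio = (self.tokens[route] / count) / self.baseline[route]
            if ratio > self.max_ratio:
                fired.append((route, round(ratio, 2)))
        return fired


# Illustrative rollout: a prompt change doubles token use on one route.
budget = TokenBudget({"support_chat": 1000})
for _ in range(20):
    budget.record("support_chat", input_tokens=1500, output_tokens=500)

print(budget.alerts())
```

In practice the same counters would also be keyed by customer, model, and environment, and would include retries and fallback calls, since those consume tokens without adding user-visible requests.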

Release model changes like code

A serving release can change the model, runtime, tokenizer, prompt, quantization, scheduler, or routing policy. Treat each as a versioned change with eval gates, canary traffic, rollback, and a clear owner.

Canaries should watch more than error rate. Watch token counts, schema validity, refusal rate, latency percentiles, fallback rate, sampled quality, and user-visible complaints. A model can be "up" and still be wrong for the product.
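A canary gate along these lines can be sketched as a comparison of canary metrics against the baseline, with a per-metric tolerance. The metric names, tolerances, and values below are assumptions for illustration, and every metric is treated as "higher is worse"; sampled quality scores, where higher is better, would need the comparison inverted.

```python
def canary_violations(baseline, canary, limits):
    """Return the metrics where the canary exceeds baseline * (1 + relative) + absolute slack."""
    violations = []
    for metric, (max_relative_increase, absolute_slack) in limits.items():
        allowed = baseline[metric] * (1.0 + max_relative_increase) + absolute_slack
        if canary[metric] > allowed:
            violations.append(metric)
    return violations


# Hypothetical limits: (allowed relative increase, absolute slack for near-zero baselines).
limits = {
    "error_rate": (0.10, 0.001),
    "p95_latency_s": (0.10, 0.0),
    "mean_output_tokens": (0.20, 0.0),
    "schema_invalid_rate": (0.10, 0.001),
    "refusal_rate": (0.10, 0.005),
}
baseline = {"error_rate": 0.010, "p95_latency_s": 2.0, "mean_output_tokens": 300.0,
            "schema_invalid_rate": 0.002, "refusal_rate": 0.010}
canary = {"error_rate": 0.010, "p95_latency_s": 2.1, "mean_output_tokens": 620.0,
          "schema_invalid_rate": 0.002, "refusal_rate": 0.010}

print(canary_violations(baseline, canary, limits))
```

Here the error rate is unchanged and latency is within tolerance, but output length has roughly doubled, so the gate fails. That is the "up but wrong for the product" case: an error-rate-only canary would have passed it.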

Prepare for incidents

LLM serving incidents are not only outages. They include runaway cost, latency spikes, degraded quality, malformed outputs, provider errors, context overflows, bad routing, and safety regressions. Write playbooks for each class.

Good first actions are simple: cap output length, reduce concurrency, disable an expensive route, fall back to a hosted model, roll back a prompt, or shift traffic to a smaller safe model. Practice these actions before the incident.
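To make these first actions practicable, it helps if each one is a small, reversible change to a route's serving configuration. The sketch below assumes a hypothetical `RouteConfig` record and placeholder model names; the playbook mapping is one possible arrangement, not a prescribed one.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RouteConfig:
    model: str
    max_output_tokens: int
    max_concurrency: int
    enabled: bool = True


# Each mitigation returns a new config, so the original is kept for rollback.
def cap_output_length(cfg, limit=256):
    return replace(cfg, max_output_tokens=min(cfg.max_output_tokens, limit))


def reduce_concurrency(cfg, factor=0.5):
    return replace(cfg, max_concurrency=max(1, int(cfg.max_concurrency * factor)))


def shift_to_smaller_model(cfg, fallback="small-safe-model"):  # placeholder model name
    return replace(cfg, model=fallback)


def disable_route(cfg):
    return replace(cfg, enabled=False)


# Hypothetical playbooks: incident class -> ordered first actions.
PLAYBOOKS = {
    "runaway_cost": [cap_output_length],
    "latency_spike": [cap_output_length, reduce_concurrency],
    "degraded_quality": [shift_to_smaller_model],
    "safety_regression": [disable_route],
}


def mitigate(incident_class, cfg):
    """Apply the playbook's first actions in order and return the mitigated config."""
    for action in PLAYBOOKS[incident_class]:
        cfg = action(cfg)
    return cfg


normal = RouteConfig(model="large-model", max_output_tokens=2048, max_concurrency=64)
print(mitigate("latency_spike", normal))
```

Because the config is immutable and each action returns a copy, `normal` still holds the pre-incident settings, which makes undoing the mitigation a matter of reapplying it. Running `mitigate` in a drill is one way to practice the actions before an incident.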

Engineering reality

The most useful serving dashboard ties user pain to root cause: route, model version, token counts, queue time, TTFT, decode rate, fallback rate, and cost. GPU graphs alone are not enough.
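The data behind such a dashboard can be sketched as per-request records grouped by route and model version, sorted so the worst user experience surfaces first. The record fields and values below are assumptions for illustration; a real system would read them from traces or logs and would add decode rate and cost.

```python
from collections import defaultdict
from statistics import mean


def summarize(records):
    """Group request records by (route, model_version); worst mean TTFT first."""
    groups = defaultdict(list)
    for record in records:
        groups[(record["route"], record["model_version"])].append(record)

    rows = []
    for (route, version), items in groups.items():
        rows.append({
            "route": route,
            "model_version": version,
            "requests": len(items),
            "mean_queue_s": mean(r["queue_s"] for r in items),
            "mean_ttft_s": mean(r["ttft_s"] for r in items),
            "output_tokens": sum(r["output_tokens"] for r in items),
            "fallback_rate": sum(1 for r in items if r["fallback"]) / len(items),
        })
    return sorted(rows, key=lambda row: row["mean_ttft_s"], reverse=True)


# Hypothetical request records.
records = [
    {"route": "rag_answer", "model_version": "v2", "queue_s": 0.75, "ttft_s": 1.5, "output_tokens": 600, "fallback": True},
    {"route": "rag_answer", "model_version": "v2", "queue_s": 0.25, "ttft_s": 0.5, "output_tokens": 400, "fallback": False},
    {"route": "classify", "model_version": "v1", "queue_s": 0.0, "ttft_s": 0.25, "output_tokens": 5, "fallback": False},
]

for row in summarize(records):
    print(row)
```

Putting queue time next to TTFT on the same row is the point: if TTFT is high and queue time accounts for most of it, the cause is capacity, not the model.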

Checkpoint

You have understood this lesson if you can answer these from memory:

Why should SLOs be defined in user terms first?
Why is request count a weak capacity-planning unit for LLMs?
Which signals are better for autoscaling an LLM server?
What should a serving canary watch besides error rate?
What are examples of LLM incidents that are not full outages?
