Operate LLM serving in production
Once users depend on the model, serving becomes operations. You need capacity plans, rollout controls, cost alerts, traces, and a response plan for quality and latency incidents.
Production LLM serving is a control loop: measure traffic, protect latency, control cost, release carefully, and feed incidents back into prompts, routing, evals, and capacity.
Define the SLO in user terms
Do not start with GPU utilization. Start with the user experience: time to first token, total response time, error rate, stream interruption rate, fallback rate, and output validity. Then map those numbers to server metrics.
A chat product might care about p95 time to first token. A batch summarizer might care about job completion time. A voice agent might care about turn latency. Different products need different SLOs.
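As a minimal sketch of checking a user-facing SLO, the snippet below computes a nearest-rank percentile over time-to-first-token samples and compares it with a target. The `Slo` class, the 1.0 second threshold, and the sample values are illustrative assumptions, not numbers from this lesson.

```python
import math
from dataclasses import dataclass


@dataclass
class Slo:
    name: str
    percentile: float   # e.g. 95 for p95
    threshold_s: float  # hypothetical target, in seconds


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_slo(samples, slo):
    """True when the chosen percentile is at or under the threshold."""
    return percentile(samples, slo.percentile) <= slo.threshold_s


# Illustrative values only: a chat route with a p95 TTFT target of 1 second.
chat_slo = Slo(name="chat_ttft", percentile=95, threshold_s=1.0)
ttft_samples = [0.21, 0.25, 0.30, 0.28, 0.35, 0.40, 0.33, 0.90, 0.27, 1.80]

print(percentile(ttft_samples, 50), meets_slo(ttft_samples, chat_slo))
```

The median here looks healthy while the p95 does not, which is why the SLO is stated on a tail percentile rather than an average.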
Capacity plan from token shape
Capacity planning starts with request count, input tokens, output tokens, concurrency, and burst shape. A request is not a unit of work by itself. Ten short classification requests are not the same as ten long RAG answers.
Keep a simple forecast per route: expected requests, p50 and p95 input tokens, p50 and p95 output tokens, target latency, and expected growth. Use that to decide how much serving capacity and headroom you need.
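The forecast can be turned into a rough replica count. This is a back-of-the-envelope sketch under stated assumptions: it sizes for p95 token counts at peak traffic, treats prefill and decode throughput per replica as fixed numbers you have measured yourself, and ignores batching effects. All names and figures are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class RouteForecast:
    route: str
    peak_requests_per_s: float
    p95_input_tokens: int
    p95_output_tokens: int
    growth_factor: float  # e.g. 1.5 for 50% expected growth


def required_tokens_per_s(f):
    """Return (prefill, decode) token throughput needed at peak."""
    prefill = f.peak_requests_per_s * f.p95_input_tokens * f.growth_factor
    decode = f.peak_requests_per_s * f.p95_output_tokens * f.growth_factor
    return prefill, decode


def replicas_needed(f, replica_prefill_tps, replica_decode_tps, headroom=0.2):
    """Size for whichever phase is the bottleneck, leaving some headroom."""
    prefill, decode = required_tokens_per_s(f)
    usable = 1 - headroom
    need = max(prefill / (replica_prefill_tps * usable),
               decode / (replica_decode_tps * usable))
    return max(1, math.ceil(need))


# Same request rate, very different work (assumed token shapes).
classify = RouteForecast("classify", 10, 200, 5, 1.0)
rag = RouteForecast("rag_answer", 10, 6000, 500, 1.0)

for f in (classify, rag):
    # 20_000 and 2_000 tokens/s per replica are placeholder measurements.
    print(f.route, replicas_needed(f, 20_000, 2_000))
```

Both routes receive ten requests per second, yet the long RAG route needs several times the capacity, which is the point about requests not being a unit of work.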
Autoscale carefully
LLM autoscaling is harder than normal web autoscaling because models are large, startup can be slow, and the useful signal is token pressure rather than request count alone. Scaling on CPU usage or raw requests can miss the real bottleneck.
Better signals include queue time, active sequences, pending tokens, GPU memory pressure, KV cache pressure, time to first token, and route-level traffic forecasts. Keep warm capacity for known spikes when cold starts would violate the SLO.
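One way to combine these signals is sketched below. It is not a real autoscaler API: `ServingSignals`, `ScalePolicy`, and every threshold are assumptions chosen for illustration. The idea is to size on token pressure, add a replica when a latency or memory signal is breached, and never drop below a warm floor.

```python
import math
from dataclasses import dataclass


@dataclass
class ServingSignals:
    queue_time_p95_s: float
    pending_tokens: int
    kv_cache_utilization: float  # 0.0 to 1.0
    ttft_p95_s: float


@dataclass
class ScalePolicy:
    # All thresholds are placeholders; tune them against your own SLO.
    max_queue_time_s: float = 0.5
    max_pending_tokens_per_replica: int = 50_000
    max_kv_cache_utilization: float = 0.85
    max_ttft_s: float = 1.0
    min_replicas: int = 2   # warm floor so cold starts do not hit users
    max_replicas: int = 16


def desired_replicas(current, s, p):
    """Pick a replica count from token pressure and latency signals."""
    by_tokens = math.ceil(s.pending_tokens / p.max_pending_tokens_per_replica)
    breached = (
        s.queue_time_p95_s > p.max_queue_time_s
        or s.kv_cache_utilization > p.max_kv_cache_utilization
        or s.ttft_p95_s > p.max_ttft_s
    )
    target = by_tokens
    if breached:
        # A breached signal always adds capacity, even if tokens look fine.
        target = max(target, current + 1)
    return min(p.max_replicas, max(p.min_replicas, target))


policy = ScalePolicy()
hot = ServingSignals(queue_time_p95_s=0.9, pending_tokens=120_000,
                     kv_cache_utilization=0.7, ttft_p95_s=0.8)
print(desired_replicas(4, hot, policy))
```

A production policy would also add cooldowns and gradual scale-down; this sketch leaves those out to keep the signal logic visible.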
Put budgets in the serving path
Cost controls should not live only in a monthly finance dashboard. Add budget alerts by route, customer, model, and environment. Track input tokens, output tokens, retries, fallback calls, and cache hit rate.
When a prompt change doubles token use, you want to know during rollout, not after the invoice. Budget alerts should be part of release safety.
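A rollout-time budget check might look like the sketch below, which compares cost per request between a baseline and a candidate release. The prices, the 20% limit, and the usage numbers are placeholders, not real rates for any provider.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    requests: int
    input_tokens: int
    output_tokens: int


def cost_per_request(u, input_price_per_1k, output_price_per_1k):
    """Average cost of one request, given per-1k-token prices."""
    if u.requests == 0:
        raise ValueError("no requests recorded")
    total = (u.input_tokens / 1000 * input_price_per_1k
             + u.output_tokens / 1000 * output_price_per_1k)
    return total / u.requests


def budget_alert(baseline, candidate, input_price_per_1k,
                 output_price_per_1k, max_increase=0.2):
    """Return (alert, relative_increase) for the candidate release."""
    base = cost_per_request(baseline, input_price_per_1k, output_price_per_1k)
    cand = cost_per_request(candidate, input_price_per_1k, output_price_per_1k)
    increase = cand / base - 1
    return increase > max_increase, increase


# Hypothetical usage: the new prompt doubles input tokens per request.
baseline = TokenUsage(requests=1000, input_tokens=800_000, output_tokens=200_000)
candidate = TokenUsage(requests=100, input_tokens=160_000, output_tokens=20_000)

alert, increase = budget_alert(baseline, candidate, 0.001, 0.002)
print(alert, round(increase, 3))
```

Comparing per-request cost rather than totals lets a small canary slice be judged against full production traffic.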
Release model changes like code
A serving release can change the model, runtime, tokenizer, prompt, quantization, scheduler, or routing policy. Treat each as a versioned change with eval gates, canary traffic, rollback, and a clear owner.
Canaries should watch more than error rate. Watch token counts, schema validity, refusal rate, latency percentiles, fallback rate, sampled quality, and user-visible complaints. A model can be "up" and still be wrong for the product.
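A canary gate over several of these signals can be sketched as a comparison against the baseline. The metric names and limits below are assumptions for illustration; the point is that the gate fails on token growth or schema validity even when error rate is unchanged.

```python
# Hypothetical limits: maximum allowed relative increase over baseline.
MAX_RELATIVE_INCREASE = {
    "p95_latency_s": 0.10,
    "mean_output_tokens": 0.15,
    "refusal_rate": 0.20,
    "fallback_rate": 0.20,
}

# Hypothetical floors: the canary value itself must stay at or above these.
MIN_ABSOLUTE = {
    "schema_valid_rate": 0.99,
}


def canary_violations(baseline, canary):
    """Return the names of metrics on which the canary fails its gate."""
    violations = []
    for name, limit in MAX_RELATIVE_INCREASE.items():
        base, cand = baseline[name], canary[name]
        if base > 0:
            if (cand - base) / base > limit:
                violations.append(name)
        elif cand > 0:
            # Baseline was zero, so any occurrence counts as a regression.
            violations.append(name)
    for name, floor in MIN_ABSOLUTE.items():
        if canary[name] < floor:
            violations.append(name)
    return violations


baseline_metrics = {"p95_latency_s": 2.0, "mean_output_tokens": 300,
                    "refusal_rate": 0.02, "fallback_rate": 0.01,
                    "schema_valid_rate": 0.995}
canary_metrics = {"p95_latency_s": 2.1, "mean_output_tokens": 420,
                  "refusal_rate": 0.02, "fallback_rate": 0.01,
                  "schema_valid_rate": 0.97}

print(canary_violations(baseline_metrics, canary_metrics))
```

Sampled quality and user complaints do not reduce to a threshold this easily, so a gate like this complements human review rather than replacing it.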
Prepare for incidents
LLM serving incidents are not only outages. They include runaway cost, latency spikes, degraded quality, malformed outputs, provider errors, context overflows, bad routing, and safety regressions. Write playbooks for each class.
Good first actions are simple: cap output length, reduce concurrency, disable an expensive route, fall back to a hosted model, roll back a prompt, or shift traffic to a smaller safe model. Practice these actions before the incident.
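A playbook can start as a plain lookup from incident class to first actions. The pairing of actions with classes below is an assumption made for illustration, as is the default of paging the route owner; your own playbooks should reflect what your system can actually do.

```python
from enum import Enum


class Incident(Enum):
    RUNAWAY_COST = "runaway_cost"
    LATENCY_SPIKE = "latency_spike"
    DEGRADED_QUALITY = "degraded_quality"
    MALFORMED_OUTPUT = "malformed_output"
    PROVIDER_ERRORS = "provider_errors"


# Hypothetical mapping; each action should be rehearsed before it is needed.
PLAYBOOK = {
    Incident.RUNAWAY_COST: ["cap output length", "disable an expensive route"],
    Incident.LATENCY_SPIKE: ["reduce concurrency", "shift traffic to a smaller safe model"],
    Incident.DEGRADED_QUALITY: ["roll back a prompt"],
    Incident.MALFORMED_OUTPUT: ["roll back a prompt", "fall back to a hosted model"],
    Incident.PROVIDER_ERRORS: ["fall back to a hosted model"],
}


def first_actions(incident):
    """Return the rehearsed first actions, or escalate if none are written."""
    return PLAYBOOK.get(incident, ["page the route owner"])


print(first_actions(Incident.RUNAWAY_COST))
```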
The most useful serving dashboard ties user pain to root cause: route, model version, token counts, queue time, TTFT, decode rate, fallback rate, and cost. GPU graphs alone are not enough.
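A dashboard like this depends on each request emitting one record with those fields. The sketch below shows a possible shape; the field names are assumptions, and decode rate is derived here as output tokens after the first divided by the time after the first token.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RequestTrace:
    route: str
    model_version: str
    input_tokens: int
    output_tokens: int
    queue_time_s: float
    ttft_s: float
    total_time_s: float
    used_fallback: bool
    cost_usd: float

    def decode_tokens_per_s(self):
        """Tokens per second after the first token; 0.0 if not measurable."""
        decode_time = self.total_time_s - self.ttft_s
        if decode_time <= 0 or self.output_tokens <= 1:
            return 0.0
        return (self.output_tokens - 1) / decode_time


# Illustrative values only.
trace = RequestTrace(
    route="rag_answer", model_version="v12", input_tokens=5200,
    output_tokens=201, queue_time_s=0.12, ttft_s=0.5, total_time_s=4.5,
    used_fallback=False, cost_usd=0.0061,
)

# One JSON line per request is easy to aggregate by route or model version.
record = asdict(trace)
record["decode_tokens_per_s"] = trace.decode_tokens_per_s()
line = json.dumps(record)
print(line)
```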
Checkpoint
You have understood this lesson if you can answer these from memory:
- Why should SLOs be defined in user terms first?
- Why is request count a weak capacity-planning unit for LLMs?
- Which signals are better for autoscaling an LLM server?
- What should a serving canary watch besides error rate?
- What are examples of LLM incidents that are not full outages?