Lesson 06

Measure LLM inference latency

"The model is slow" is not a diagnosis. A useful trace separates queueing, prefill, first token, decode rate, total latency, token counts, and tail behavior.

The one idea

Measure LLM inference as phases and rates, not one total duration. The right fix depends on whether the request is waiting in a queue, reading too much context, decoding too slowly, or generating too much text.

The core latency metrics

Start with a small set of measurements that map to the mechanics from this course.

Queue time: how long the request waits before the model server starts work.
Time to first token: how long until the first output token is available.
Output tokens per second: how quickly decode produces visible tokens after the first one.
Total latency: how long the full request takes from send to final token.
Input and output token counts: the context and answer sizes that explain the work.

Those five numbers already beat most vague dashboards. They tell you whether to inspect the scheduler, prompt, model, cache, or product output policy.
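The measurements above can be derived from a handful of timestamps. The following is a minimal sketch, not a reference implementation: the class, its field names, and the example values are all hypothetical, and it assumes every timestamp comes from the same monotonic clock and that TTFT is measured from the moment the request is sent.

```python
from dataclasses import dataclass


@dataclass
class RequestTimestamps:
    """Timestamps in seconds from one monotonic clock (hypothetical field names)."""
    sent: float          # client sends the request
    started: float       # model server begins work on it
    first_token: float   # first output token is available
    last_token: float    # final output token is available
    input_tokens: int
    output_tokens: int


def core_metrics(ts: RequestTimestamps) -> dict:
    """Derive the core measurements from raw timestamps."""
    decode_seconds = ts.last_token - ts.first_token
    # Decode rate counts only the tokens produced after the first one.
    decode_tokens = max(ts.output_tokens - 1, 0)
    return {
        "queue_time_s": ts.started - ts.sent,
        "ttft_s": ts.first_token - ts.sent,
        "output_tokens_per_s": decode_tokens / decode_seconds if decode_seconds > 0 else 0.0,
        "total_latency_s": ts.last_token - ts.sent,
        "input_tokens": ts.input_tokens,
        "output_tokens": ts.output_tokens,
    }


# Made-up example request.
example = RequestTimestamps(sent=0.0, started=0.25, first_token=1.0, last_token=5.0,
                            input_tokens=6000, output_tokens=201)
metrics = core_metrics(example)
print(metrics)
```

Keeping the raw timestamps and deriving the metrics afterward means you can change a definition later, for example measuring TTFT from server start instead of client send, without re-running traffic.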

Use percentiles, not averages

Average latency hides the requests users complain about. In serving, the painful behavior lives in the tail: p95, p99, and the worst cases around traffic spikes or giant prompts.

A voice agent might feel good at p50 and broken at p95. A coding assistant might handle short prompts well and collapse on long files. Tail latency is where capacity problems show up first.

Averages can look fine while the slowest users wait long enough to abandon the task.
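A small illustration of how the mean hides the tail. The latencies below are invented for the example, and the nearest-rank percentile is just one common definition; other tools may interpolate and report slightly different values.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up latencies in seconds: 90 fast requests, 8 slow ones, 2 very slow ones.
latencies = [1.0] * 90 + [4.0] * 8 + [20.0] * 2

mean = statistics.mean(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
print(f"mean={mean:.2f}s p50={p50}s p95={p95}s p99={p99}s")
```

In this invented sample the mean stays under two seconds while the p99 request waits twenty, which is the gap a dashboard built on averages would never show.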

Benchmark with realistic shapes

A benchmark with 128 input tokens and 32 output tokens tells you almost nothing about a RAG support assistant that sends 6,000 input tokens and asks for 500 output tokens. The workload shape has to match the product.

Useful benchmark cases include short prompt and short output, long prompt and short output, short prompt and long output, and long prompt and long output. Add concurrency levels that reflect real traffic. Then record how TTFT, tokens per second, memory, and error rate change.
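One way to make sure no shape is skipped is to generate the benchmark matrix rather than write it by hand. This is a sketch under assumptions: the token sizes reuse the numbers from the example above, the concurrency levels are placeholders, and the result fields are left empty because they have to come from a real run.

```python
from itertools import product

# Placeholder sizes and concurrency levels; replace with your product's real shapes.
PROMPT_SIZES = {"short": 128, "long": 6000}
OUTPUT_SIZES = {"short": 32, "long": 500}
CONCURRENCY_LEVELS = [1, 8, 32]


def benchmark_cases():
    """Return every combination of prompt size, output size, and concurrency."""
    cases = []
    for (p_name, p_tokens), (o_name, o_tokens), concurrency in product(
        PROMPT_SIZES.items(), OUTPUT_SIZES.items(), CONCURRENCY_LEVELS
    ):
        cases.append({
            "name": f"{p_name}-prompt/{o_name}-output@c{concurrency}",
            "input_tokens": p_tokens,
            "max_output_tokens": o_tokens,
            "concurrency": concurrency,
            # To be filled in from a real benchmark run.
            "ttft_s": None,
            "output_tokens_per_s": None,
            "peak_memory_gb": None,
            "error_rate": None,
        })
    return cases


cases = benchmark_cases()
for case in cases:
    print(case["name"], case["input_tokens"], case["max_output_tokens"])
```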

Separate throughput from latency

Throughput asks how much work the system completes per second: requests per second or tokens per second across all users. Latency asks how long one user waits. You need both.

Batching can improve throughput while hurting individual latency if requests wait too long to join a batch. Chasing the lowest possible latency can also waste hardware if the server refuses to batch enough. This is the central serving tradeoff, and the next course goes deeper into it.
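The tradeoff can be seen in a deliberately simplified model. Everything here is an assumption made for illustration: the server waits until a full batch has arrived, requests arrive at a fixed interval, and a batch costs a fixed amount plus a per-request amount. Real schedulers are more sophisticated, and the numbers are invented.

```python
def batch_tradeoff(batch_size, arrival_interval_s, fixed_cost_s, per_request_cost_s):
    """Toy model: wait until batch_size requests have arrived, then run them together."""
    # The first request in a batch waits for the remaining ones to arrive.
    fill_wait_s = (batch_size - 1) * arrival_interval_s
    # Run time has a fixed part shared by the whole batch plus a per-request part.
    run_time_s = fixed_cost_s + per_request_cost_s * batch_size
    return {
        "batch_size": batch_size,
        # Requests completed per second while the batch is running.
        "throughput_rps": batch_size / run_time_s,
        # Latency seen by the request that arrived first.
        "worst_case_latency_s": fill_wait_s + run_time_s,
    }


rows = [
    batch_tradeoff(b, arrival_interval_s=0.05, fixed_cost_s=1.0, per_request_cost_s=0.05)
    for b in (1, 4, 16)
]
for row in rows:
    print(row)
```

In this toy model, larger batches spread the fixed cost over more requests, so throughput rises, while the earliest request in each batch waits longer both to fill the batch and to finish it.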

A practical trace shape

For each LLM call, log a compact event with request class, model, input tokens, output tokens, queue time, TTFT, decode duration, total latency, finish reason, and error status. Where privacy rules restrict what you can store, log prompt categories or feature names instead of raw user text.

That trace lets you answer real questions: Did latency rise because prompts got longer? Did a new prompt template double output length? Did tail latency spike only when concurrency rose? Did errors cluster around max context?
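The event described above could be sketched as a small record serialized to one JSON line per call. The class name, field names, units, and example values are all hypothetical; the finish reason strings are examples, so use whatever your serving stack actually reports.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class LLMCallEvent:
    """One compact log record per LLM call. Field names are illustrative."""
    request_class: str        # prompt category or feature name, never raw user text
    model: str
    input_tokens: int
    output_tokens: int
    queue_time_s: float
    ttft_s: float
    decode_duration_s: float
    total_latency_s: float
    finish_reason: str        # for example "stop" or "length"
    error: Optional[str] = None


def to_log_line(event: LLMCallEvent) -> str:
    """Serialize the event as a single JSON line for a structured log."""
    return json.dumps(asdict(event), sort_keys=True)


# Made-up example event.
event = LLMCallEvent(
    request_class="support_rag", model="example-model",
    input_tokens=6000, output_tokens=500,
    queue_time_s=0.12, ttft_s=1.4, decode_duration_s=9.8,
    total_latency_s=11.2, finish_reason="stop",
)
line = to_log_line(event)
print(line)
```

A flat record like this is easy to filter and aggregate later, which is what the questions above require.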

Engineering reality

Do not optimize from a single demo request. Inference behavior changes with traffic mix. Measure by route, model, token bucket, and percentile, then tune the workload that actually matters.
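Slicing by route, model, token bucket, and percentile might look like the sketch below. The bucket boundaries, route names, model name, and events are invented for illustration, and the percentile uses the nearest-rank definition.

```python
import math
from collections import defaultdict


def token_bucket(input_tokens):
    """Coarse input-size bucket; the boundaries are arbitrary placeholders."""
    if input_tokens < 1000:
        return "<1k"
    if input_tokens < 8000:
        return "1k-8k"
    return ">=8k"


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    return ordered[max(math.ceil(95 * len(ordered) / 100), 1) - 1]


def p95_by_group(events):
    """Group total latency by (route, model, token bucket) and report p95 per group."""
    groups = defaultdict(list)
    for e in events:
        key = (e["route"], e["model"], token_bucket(e["input_tokens"]))
        groups[key].append(e["total_latency_s"])
    return {key: p95(latencies) for key, latencies in groups.items()}


# Made-up events for illustration.
events = [
    {"route": "/chat", "model": "example-model", "input_tokens": 300, "total_latency_s": 1.1},
    {"route": "/chat", "model": "example-model", "input_tokens": 450, "total_latency_s": 1.3},
    {"route": "/chat", "model": "example-model", "input_tokens": 6000, "total_latency_s": 7.9},
    {"route": "/summarize", "model": "example-model", "input_tokens": 12000, "total_latency_s": 15.2},
]
report = p95_by_group(events)
for key, value in sorted(report.items()):
    print(key, value)
```

With only a few samples per group, as here, a p95 is not meaningful; the point is the grouping key, which keeps a slow long-prompt bucket from being averaged into a fast short-prompt one.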

Checkpoint

You have the lesson if you can answer these from memory:

Why is total latency alone not enough?
What does a slow time to first token usually point toward?
Why should input and output tokens be logged separately?
Why can batching improve throughput while hurting latency?
Why are p95 and p99 important for user-facing AI products?

Quick check

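What does a slow time to first token usually point toward?

Input token count and prefill work
Decode kernel speed
Temperature

Why are p95 and p99 important for user-facing AI products?

They replace token counts
They expose the slow requests hidden by averages
They tell you which model was trained longest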