Lesson 05

Streaming and the latency budget

Streaming does not make the model faster. It changes what the user feels and what the server has to manage. This lesson breaks latency into the parts a user actually experiences and covers the two things streaming forces you to handle: backpressure and cancellation.

The one idea

Without streaming, the user waits for the whole answer and feels total latency. With streaming, they feel two numbers separately: the wait for the first token (TTFT) and the rhythm of the rest (inter-token latency). Designing for perceived latency means budgeting those two, and handling what happens when the client is slow or walks away.

Perceived latency is not total latency

Suppose an answer takes four seconds to generate in full. Delivered all at once, the user stares at a spinner for four seconds, then gets a wall of text. Streamed token by token, the user sees the first words in maybe 300 milliseconds and then watches the answer arrive at reading speed. Same total time, completely different experience. The streamed version feels fast because the wait before anything happens is short, and humans weigh that opening silence far more heavily than the rest.

This is why streaming is the default for chat. It does not reduce the work the GPU does or the throughput of the fleet. It reshapes the same total latency into a form people tolerate, by paying out progress continuously instead of in one lump at the end.

The two numbers users feel

Streaming splits the user's experience into two clocks, and they have different causes and different fixes.

TTFT is the opening silence. Inter-token latency is the spacing of everything after. Users judge both.

TTFT (time to first token) is set by prefill cost and by queueing. Long prompts, cold starts, and a request waiting behind a full batch all inflate it. The fixes are the ones from earlier lessons: trim the prompt, schedule prefill well, and keep enough capacity that requests do not queue. This is the same metric defined in the Inference L06 glossary: wall-clock time from the moment the request is accepted to the moment the first output token is available.

TTFB (time to first byte) is the HTTP term for when the client receives the first byte of the response body. In a streamed LLM API, TTFB usually tracks TTFT closely, but they are not identical. TTFB includes TLS, proxy buffering, and SSE framing overhead. A deployment can have good server-side TTFT and poor TTFB if a gateway buffers events. When someone cites TTFB in a web benchmark, ask whether they mean server TTFT or bytes on the wire.
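One way to see the difference in practice is to time the stream from the client side. The sketch below is a minimal illustration, not any particular SDK's API: `measure_stream` is a hypothetical helper, and it assumes `chunks` is a lazy iterator that sends the request when iteration starts. Because it runs on the client, the first number it reports includes gateway and network delay, so it is closer to what the user sees than to server-side TTFT.

```python
import time
from typing import Callable, Iterable, List, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, List[float]]:
    """Return (time to first chunk, list of gaps between later chunks).

    Hypothetical helper: `chunks` is assumed to be a lazy iterator that
    issues the request when iteration begins.
    """
    start = clock()
    arrivals: List[float] = []
    for _ in chunks:
        # Record when each chunk became visible to the client.
        arrivals.append(clock())
    if not arrivals:
        raise ValueError("stream produced no chunks")
    first_chunk_delay = arrivals[0] - start
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return first_chunk_delay, gaps
```

Comparing these client-side numbers with the server's own TTFT for the same request is a simple way to spot a proxy that buffers events.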

Inter-token latency (ITL, also called TPOT, time per output token) is set by decode speed: model size, hardware, batch size, and KV cache pressure. A useful sanity check is that ITL should sit comfortably under reading speed. People read at roughly 200 to 300 words per minute, or about 3 to 5 words per second, and a word is typically a little more than one token. Once tokens arrive comfortably faster than that, making them even faster does little for perceived quality, and you are better off spending that GPU on throughput.
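The budget arithmetic can be sketched in a few lines. This is an illustration under stated assumptions: the function names are made up for this example, 5 words per second is used as an illustrative reading speed, and 1.3 tokens per English word is a common rule of thumb rather than a figure from this lesson.

```python
def total_latency_s(ttft_s: float, itl_s: float, n_tokens: int) -> float:
    """Time until the last token arrives: the first token, then n-1 gaps."""
    if n_tokens < 1:
        raise ValueError("need at least one output token")
    return ttft_s + (n_tokens - 1) * itl_s


def max_itl_for_reading_s(words_per_second: float, tokens_per_word: float) -> float:
    """Largest inter-token gap that still keeps pace with a reader."""
    return 1.0 / (words_per_second * tokens_per_word)


if __name__ == "__main__":
    # Assumed numbers: 300 ms TTFT, 37 ms ITL, 101 output tokens.
    print(f"total: {total_latency_s(0.3, 0.037, 101):.2f} s")
    # Assumed reading speed of 5 words/s and 1.3 tokens per word.
    print(f"ITL budget: {max_itl_for_reading_s(5.0, 1.3) * 1000:.0f} ms")
```

With those assumed inputs the total comes to about four seconds, of which the user waits on a blank screen for only 0.3 seconds, and the reading-speed ceiling on ITL lands near 150 ms.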

How tokens reach the client

Mechanically, the server holds the connection open and pushes each token as it is decoded, rather than buffering the whole answer. The common transports are server-sent events (SSE), which is what most LLM HTTP APIs use, plain chunked HTTP responses, and WebSockets for bidirectional cases like voice. The transport details differ, but the shape is the same: a long-lived response that emits a small event per token or per small group of tokens.
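For SSE specifically, each event is a few lines of text terminated by a blank line. The sketch below shows that framing with plain strings. The helper names and the `done` event with a `[DONE]` payload are hypothetical conventions chosen for this example; real APIs define their own event names and payloads.

```python
from typing import Iterable, Iterator, Optional


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Frame one server-sent event as text.

    Each line of the payload gets its own `data:` field, and a blank
    line ends the event.
    """
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per token, then a hypothetical end marker."""
    for token in tokens:
        yield sse_event(token)
    yield sse_event("[DONE]", event="done")  # assumed end-of-stream convention
```

A server would write each yielded string to the open response as soon as it is produced. Anything between the server and the browser that collects these frames before forwarding them undoes the benefit.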

That long-lived connection is the source of the two problems streaming adds. The server is now coupled to a client for the entire duration of generation, and clients are slow, flaky, and prone to leaving. You have to plan for both.

Backpressure: when the client can't keep up

The GPU can produce tokens faster than a client on a weak connection can receive them. If the server keeps decoding and buffering tokens the client has not drained, that buffer grows, and memory you needed for KV cache and other requests gets eaten by the backlog. Multiply that across many slow clients and the server degrades for everyone.

Backpressure is the mechanism that lets a slow consumer signal "wait" up the chain. A well-behaved streaming server watches whether the client is keeping up and stops producing into a full buffer rather than letting it balloon. In practice you rely on the framework's flow control, cap per-connection buffers, and decide a policy for the truly stuck client: slow it, or cut it. The point to internalize is that a streamed request holds resources for its whole life, so a slow reader is a resource leak unless backpressure is handled.
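A bounded queue is the simplest way to see backpressure at work. The sketch below is a toy built on `asyncio.Queue`, not a real serving loop: the producer stands in for decode, the consumer for a slow client, and all names and sizes are assumptions for illustration. When the queue is full, `put` suspends the producer, so the backlog never exceeds the cap.

```python
import asyncio
from typing import Dict, List, Tuple


async def producer(queue: asyncio.Queue, n_tokens: int, stats: Dict[str, int]) -> None:
    for i in range(n_tokens):
        # Suspends here whenever the buffer is full: that is the backpressure.
        await queue.put(f"tok{i}")
        stats["max_buffered"] = max(stats["max_buffered"], queue.qsize())
    await queue.put(None)  # sentinel: generation finished


async def slow_consumer(queue: asyncio.Queue, delay_s: float, received: List[str]) -> None:
    while True:
        token = await queue.get()
        if token is None:
            break
        received.append(token)
        await asyncio.sleep(delay_s)  # stands in for a slow network write


async def run(n_tokens: int = 50, max_buffer: int = 4,
              delay_s: float = 0.001) -> Tuple[List[str], int]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_buffer)
    stats = {"max_buffered": 0}
    received: List[str] = []
    await asyncio.gather(
        producer(queue, n_tokens, stats),
        slow_consumer(queue, delay_s, received),
    )
    return received, stats["max_buffered"]
```

In a batched server you would not stall the whole batch for one reader. The per-connection cap is what tells you which sequence to slow or cut.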

Cancellation: when the user walks away

Users close tabs, hit stop, and navigate away mid-answer. If the server does not notice, it keeps decoding tokens nobody will ever read, holding a precious batch slot and its KV cache to finish an essay for an empty room. Under load this is pure waste, and it is common: people frequently stop a generation the moment they have seen enough.

So cancellation has to flow all the way to the scheduler. When the client disconnects, the server should detect it, evict that sequence from the running batch, and free its KV blocks immediately, the same way continuous batching evicts a finished sequence. Done right, a user pressing stop instantly returns capacity to everyone else. Done wrong, abandoned generations quietly consume a chunk of your fleet. This is one of the highest-leverage and most overlooked pieces of streaming hygiene.
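The shape of that wiring can be sketched with `asyncio` task cancellation. `ToyScheduler` below is a hypothetical stand-in, not a real engine's scheduler: the set of active sequences represents batch slots and KV blocks, and the sleep represents one decode step. The point it illustrates is that cleanup lives in a `finally` block, so it runs both when generation finishes and when the request is cancelled.

```python
import asyncio


class ToyScheduler:
    """Hypothetical scheduler: `active` models sequences holding a slot and KV blocks."""

    def __init__(self) -> None:
        self.active = set()
        self.decoded = 0

    async def generate(self, seq_id: str, n_tokens: int) -> None:
        self.active.add(seq_id)
        try:
            for _ in range(n_tokens):
                await asyncio.sleep(0.001)  # stands in for one decode step
                self.decoded += 1
        finally:
            # Runs on normal completion and on cancellation alike:
            # evict the sequence and release its resources.
            self.active.discard(seq_id)


async def demo() -> ToyScheduler:
    sched = ToyScheduler()
    task = asyncio.create_task(sched.generate("seq-1", 10_000))
    await asyncio.sleep(0.02)  # the client reads for a moment...
    task.cancel()              # ...then disconnects
    try:
        await task
    except asyncio.CancelledError:
        pass
    return sched
```

After `demo()` returns, the sequence is gone from `active` and only a small fraction of the 10,000 requested steps were ever run. In a real deployment, the disconnect signal from your web framework is what has to trigger the equivalent of `task.cancel()`.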

Practical read

Test cancellation on purpose. Fire a request, drop the connection mid-stream, and confirm in your server metrics that the sequence actually leaves the batch and the KV memory is released. Many setups stream correctly but never wire disconnects through to the scheduler, so canceled work runs to completion invisibly. You only find out when peak-load capacity is mysteriously short.

Prefix caching and perceived TTFT

Many chat products reuse the same system prompt, tool definitions, or RAG document prefix across thousands of requests. Without caching, the server prefills that shared prefix from scratch every time, and TTFT scales with the full prompt length even when only the user's last message changed.

Prefix caching (also called prompt caching) stores the KV blocks for a shared prefix after the first request and reuses them on later requests that start with the same bytes. The second request skips most prefill work and TTFT drops to roughly queue time plus prefill of the new suffix. Users feel this as "the app got faster" even though decode speed is unchanged.
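A toy model makes the saving concrete. The sketch below is an illustration only: `ToyPrefixCache` and the block size of 4 tokens are assumptions made for this example, and it counts tokens instead of storing real KV tensors. A block is reusable only if every token before it also matches, which is why each cache key is the whole prefix up to the end of that block.

```python
from typing import List, Set, Tuple

BLOCK = 4  # tokens per cached block (hypothetical; engines choose their own size)


class ToyPrefixCache:
    """Counts how many prompt tokens still need prefill after cache reuse."""

    def __init__(self) -> None:
        self._prefixes: Set[Tuple[int, ...]] = set()

    def tokens_to_prefill(self, tokens: List[int]) -> int:
        # Find the longest run of whole blocks already seen as a prefix.
        cached = 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            if tuple(tokens[:end]) in self._prefixes:
                cached = end
            else:
                break
        # Remember every whole-block prefix of this prompt for later requests.
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            self._prefixes.add(tuple(tokens[:end]))
        return len(tokens) - cached


if __name__ == "__main__":
    cache = ToyPrefixCache()
    system_prompt = list(range(16))  # 16 shared tokens
    print(cache.tokens_to_prefill(system_prompt + [100, 101]))
    print(cache.tokens_to_prefill(system_prompt + [200, 201, 202]))
```

The first request pays for all 18 tokens. The second shares the 16-token prefix and only has to prefill its 3 new tokens, which is where the TTFT drop comes from.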

Caching helps most when prefixes are long and stable: big system prompts, repeated RAG chunks, multi-turn threads where earlier turns are identical. It does not help unique one-shot prompts. Pair it with the scheduler knobs in lesson 06: a cache hit is wasted if the request still queues behind a monolithic prefill for someone else.

Engineering reality

Set explicit SLOs on TTFT and ITL at the tail, not the average, and measure them at the client, not just at the GPU. A token the model produced is not a token the user saw: it still has to cross your gateway, load balancer, and the public internet. A deployment can hit a great server-side ITL and still feel choppy because tokens bunch up at a proxy that buffers. Budget the whole path from GPU to glass, and watch p95 and p99.
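To see why the tail matters more than the average, consider a nearest-rank percentile over client-side ITL samples. This is a minimal sketch: the `percentile` helper and the sample values are made up for illustration, and a production setup would normally take percentiles from its metrics system.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Hypothetical gaps in milliseconds: mostly smooth, with a few stalls
    # of the kind a buffering proxy can introduce.
    itl_ms = [40.0] * 95 + [900.0] * 5
    print("mean:", sum(itl_ms) / len(itl_ms))
    print("p50:", percentile(itl_ms, 50))
    print("p99:", percentile(itl_ms, 99))
```

In this made-up sample the median gap is a smooth 40 ms while the p99 is a 900 ms stall. The mean sits in between and describes neither experience, which is why the SLO belongs on the tail.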

Checkpoint

You're ready for the next lesson if you can answer these from memory:

Why does streaming improve perceived latency without changing total latency?
What causes high TTFT versus high inter-token latency, and how do the fixes differ?
What is backpressure, and why does a slow client threaten the whole server?
Why must client cancellation reach the scheduler, and what is wasted if it does not?
How does prefix caching change TTFT for repeated system prompts or RAG prefixes?
When does TTFB diverge from TTFT?

Quick check

Streaming shortens the opening silence and shows steady progress, which is what users judge
Streaming makes the GPU generate the answer faster
The streamed system produces a shorter answer

The client is reading too slowly
Disconnects are not reaching the scheduler, so abandoned generations finish and hold their slots
SSE is slower than WebSockets