Track 4 · Inference & serving
How Inference Works
Inference is where an LLM turns a prompt into tokens, one step at a time. This course explains the mechanics underneath the API call so latency, cost, context length, and hardware tradeoffs stop feeling mysterious.
01 What is LLM inference?
The serving-time view of a model call: tokens in, forward passes, tokens out, with no training happening.
02 Prefill vs decode
Why the prompt is processed in one phase and the answer is generated in another, and why they feel different to users.
03 KV cache explained
How cached keys and values keep generation from recomputing the whole prompt for every new token.
04 Why LLM inference is memory-bound
Why GPUs can have plenty of math available and still wait on weights, activations, and cache reads.
05 Context length and output tokens
How input length, chat history, retrieval chunks, and generated tokens drive work, memory, and cost.
06 Measure LLM inference latency
TTFT, tokens per second, tail latency, throughput, and the small benchmark shape every team should collect.
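As a taste of the last lesson, here is a minimal sketch of how the four numbers it names could be computed from timestamps a client records while streaming responses. The `RequestTiming` record, the function names, and the nearest-rank percentile are illustrative assumptions, not part of any particular serving stack, and the course may define the metrics slightly differently.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timestamps in seconds for one streamed request (hypothetical record)."""
    sent_at: float            # when the client sent the request
    token_times: list[float]  # arrival time of each output token, in order


def ttft(r: RequestTiming) -> float:
    """Time to first token: queueing plus prefill, as the user experiences it."""
    return r.token_times[0] - r.sent_at


def decode_tokens_per_second(r: RequestTiming) -> float:
    """Decode rate, measured over the tokens that arrive after the first one."""
    elapsed = r.token_times[-1] - r.token_times[0]
    if len(r.token_times) < 2 or elapsed <= 0:
        return 0.0
    return (len(r.token_times) - 1) / elapsed


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize(requests: list[RequestTiming]) -> dict[str, float]:
    """Aggregate per-request timings into a small benchmark summary."""
    total_tokens = sum(len(r.token_times) for r in requests)
    # Wall-clock span from the first request sent to the last token received.
    wall = max(r.token_times[-1] for r in requests) - min(r.sent_at for r in requests)
    ttfts = [ttft(r) for r in requests]
    end_to_end = [r.token_times[-1] - r.sent_at for r in requests]
    return {
        "ttft_p50": percentile(ttfts, 50),
        "ttft_p95": percentile(ttfts, 95),              # tail latency of first token
        "e2e_p95": percentile(end_to_end, 95),          # tail latency of full response
        "throughput_tok_per_s": total_tokens / wall,    # across all requests
    }
```

In this sketch, tokens per second is a per-request number while throughput is aggregated over every request in the run, which is why the two can move in opposite directions as concurrency rises.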