Track 4 · Inference & serving

How Inference Works

Inference is where an LLM turns a prompt into tokens, one step at a time. This course explains the mechanics underneath the API call so latency, cost, context length, and hardware tradeoffs stop feeling mysterious.

6 lessons · Intermediate · After The Transformer & LLMs

What is LLM inference?

The serving-time view of a model call: tokens in, forward passes, tokens out, with no training happening.

Prefill vs decode

Why the prompt is processed in one phase and the answer is generated in another, and why they feel different to users.
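
The two phases can be sketched as a toy loop. This is a minimal illustration, not a real serving stack: `fake_forward` is a hypothetical stand-in for a model forward pass, and the sketch assumes a KV cache so that each decode pass takes only the newest token. It records how many tokens each forward pass processes.

```python
from typing import List, Tuple


def fake_forward(new_tokens: List[int]) -> int:
    """Hypothetical stand-in for a model forward pass: returns a next-token id."""
    return (sum(new_tokens) + len(new_tokens)) % 1000


def generate(prompt: List[int], max_new_tokens: int) -> Tuple[List[int], List[int]]:
    """Return (generated tokens, number of tokens processed in each forward pass)."""
    if max_new_tokens <= 0:
        return [], []
    pass_sizes = [len(prompt)]        # prefill: the whole prompt in one pass
    output = [fake_forward(prompt)]   # prefill yields the first output token
    while len(output) < max_new_tokens:
        step_input = [output[-1]]     # decode: only the newest token is fed in
        pass_sizes.append(len(step_input))
        output.append(fake_forward(step_input))
    return output, pass_sizes


tokens, sizes = generate(prompt=[11, 42, 7, 99, 3], max_new_tokens=4)
print(sizes)  # [5, 1, 1, 1]
```

The first entry is the single large prefill pass; every later entry is a one-token decode pass, which is why the wait for the first token and the pace of the following tokens behave differently.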

KV cache explained

How cached keys and values keep generation from recomputing the whole prompt for every new token.
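
A rough way to see the saving is to count how many tokens need their keys and values computed during one generation. The functions below are an illustrative cost model under simplifying assumptions (one sequence, every token costs the same); the names are hypothetical and not taken from any library.

```python
def kv_work_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Tokens whose keys/values are computed when every step reprocesses the full sequence."""
    total = 0
    for step in range(new_tokens):
        total += prompt_len + step  # prompt plus all tokens generated so far
    return total


def kv_work_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Tokens whose keys/values are computed when earlier results are cached."""
    if new_tokens <= 0:
        return 0
    # The prompt is processed once; each later step adds only the newest token.
    return prompt_len + (new_tokens - 1)


no_cache = kv_work_without_cache(prompt_len=1000, new_tokens=100)
cached = kv_work_with_cache(prompt_len=1000, new_tokens=100)
print(no_cache, cached)  # 104950 1099
```

In this toy count the uncached path grows with prompt length times output length, while the cached path grows with their sum.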

Why LLM inference is memory-bound

Why GPUs can have spare compute throughput and still stall waiting on weight, activation, and KV-cache reads from memory.
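
A back-of-envelope comparison shows the gap. The sketch assumes a single sequence, that every weight is read from memory once per generated token, and roughly two floating-point operations per weight per token; it ignores activation and cache traffic. All hardware and model numbers are hypothetical placeholders, not measurements of any real device.

```python
def decode_step_bounds(
    params: float,
    bytes_per_param: float,
    mem_bandwidth_bytes_per_s: float,
    peak_flops_per_s: float,
) -> tuple:
    """Return (seconds to stream the weights once, seconds to do the math once)."""
    weight_bytes = params * bytes_per_param
    flops = 2 * params  # assumption: about one multiply-add per weight per token
    return weight_bytes / mem_bandwidth_bytes_per_s, flops / peak_flops_per_s


# Hypothetical example: 7e9 parameters at 2 bytes each,
# 1e12 bytes/s of memory bandwidth, 1e14 FLOP/s of peak compute.
memory_s, compute_s = decode_step_bounds(7e9, 2, 1e12, 1e14)
print(f"memory: {memory_s * 1e3:.2f} ms, compute: {compute_s * 1e3:.2f} ms")
```

Under these assumed numbers, moving the weights takes about 100 times longer than the arithmetic, so the arithmetic units spend most of each decode step idle.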

Context length and output tokens

How input length, chat history, retrieval chunks, and generated tokens drive work, memory, and cost.
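
One concrete way to see how tokens drive memory is to estimate the KV cache size for a request. The sketch below uses the common approximation of two tensors (keys and values) per layer per token; the model shape and token counts are hypothetical, and the function name is illustrative.

```python
def kv_cache_bytes(
    total_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> int:
    """Approximate KV cache size for one sequence."""
    # The leading 2 accounts for storing both keys and values.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * total_tokens


# Hypothetical request: every source of tokens counts toward the total.
context = {
    "user_prompt": 200,
    "chat_history": 1500,
    "retrieval_chunks": 2000,
    "generated": 396,
}
total_tokens = sum(context.values())
size = kv_cache_bytes(total_tokens, num_layers=32, num_kv_heads=32, head_dim=128)
print(total_tokens, size / 2**30)  # 4096 2.0
```

With this assumed shape each token costs 0.5 MiB of cache, so the 4096-token request holds about 2 GiB for as long as it is being served.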

Measure LLM inference latency

TTFT (time to first token), tokens per second, tail latency, throughput, and the small set of benchmark measurements every team should collect.
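
These metrics can be computed from simple timestamps. The helpers below are a minimal sketch with hypothetical names; they assume you have recorded the request start time and the arrival time of each streamed token, and they use a nearest-rank percentile for tail latency.

```python
from typing import Sequence


def ttft(request_start: float, token_times: Sequence[float]) -> float:
    """Time to first token: arrival of the first token minus request start."""
    return token_times[0] - request_start


def decode_tokens_per_second(token_times: Sequence[float]) -> float:
    """Tokens per second after the first token has arrived."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure decode rate")
    return (len(token_times) - 1) / (token_times[-1] - token_times[0])


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


# Hypothetical single request: started at t=0.0 s, five tokens streamed back.
arrivals = [0.5, 1.0, 1.5, 2.0, 2.5]
print(ttft(0.0, arrivals), decode_tokens_per_second(arrivals))  # 0.5 2.0

# Hypothetical end-to-end latencies (seconds) from 20 requests.
latencies = [float(n) for n in range(1, 21)]
print(percentile(latencies, 50), percentile(latencies, 95))  # 10.0 19.0
```

Reporting the 95th percentile next to the median keeps slow outliers visible instead of averaging them away.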