Lesson 01

What is LLM inference?

Inference is what happens after a model has already been trained. You send a prompt, the serving system runs the frozen model, and tokens come back. The model is not learning during this call. It is doing a lot of arithmetic, very quickly, under a tight latency budget.

The one idea

LLM inference is the repeated use of a frozen model to predict the next token. A response is not one big prediction. It is a loop that runs once for the prompt, then once for each generated token.

The model call as a pipeline

An API call hides a lot of machinery behind one request. The text is tokenized. Those token IDs are sent to the model server. The model runs a forward pass and produces scores for the next token. A sampler chooses one token. The server appends that token to the context and repeats the process until the model emits a stop token or the request hits its output length limit.

No gradients are computed. No weights are updated. The model is a fixed table of numbers plus a fixed architecture. Inference is the work of reading those numbers and applying them to the current context.
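
To make the loop concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the six-word vocabulary, the WEIGHTS table, and the helper names tokenize, forward, sample, and generate are not part of any real library. A real model looks at the whole context with billions of weights; this stand-in only looks at the last token, which is enough to show the shape of the loop.

```python
# Toy vocabulary. Token ID = position in this list. (Hypothetical.)
VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat"]
EOS_ID = 0

# Frozen "weights": one row of next-token scores per token ID.
# Nothing below ever writes to this table.
WEIGHTS = (
    (1.0, 0.0, 0.0, 0.0, 0.0, 0.0),  # after "<eos>" -> "<eos>"
    (0.0, 0.0, 1.0, 0.0, 0.0, 0.0),  # after "the"   -> "cat"
    (0.0, 0.0, 0.0, 1.0, 0.0, 0.0),  # after "cat"   -> "sat"
    (0.0, 0.0, 0.0, 0.0, 1.0, 0.0),  # after "sat"   -> "on"
    (0.0, 0.0, 0.0, 0.0, 0.0, 1.0),  # after "on"    -> "mat"
    (1.0, 0.0, 0.0, 0.0, 0.0, 0.0),  # after "mat"   -> "<eos>"
)


def tokenize(text):
    """Toy tokenizer: split on whitespace and look up each word's ID."""
    return [VOCAB.index(word) for word in text.split()]


def forward(context):
    """Stand-in for a forward pass: read scores for the next token.

    Assumption: only the last token matters. A real model uses the
    whole context.
    """
    return WEIGHTS[context[-1]]


def sample(scores):
    """Greedy sampler: pick the ID with the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def generate(prompt, max_new_tokens=16):
    """Run the loop: forward pass, sample, append, repeat."""
    context = tokenize(prompt)
    output = []
    for _ in range(max_new_tokens):
        token = sample(forward(context))
        if token == EOS_ID:
            break  # the model signalled that the answer is finished
        context.append(token)  # the new token changes the next step's input
        output.append(token)
    return " ".join(VOCAB[t] for t in output)
```

Calling generate("the cat") walks the table three times before it reaches the stop token. The two exits from the loop, the stop token and max_new_tokens, are the two ways a generation ends in this sketch.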

The serving call is a loop. Every output token changes the context for the next step.

Why inference deserves its own course

It is tempting to think the hard part is training. Training is hard, but most teams building products do not train frontier models. They run models. They pay for input tokens, output tokens, memory, GPUs, queues, retries, and user wait time.

Inference also has different constraints than training. Training wants to keep huge clusters busy for long jobs. Inference wants to answer many small requests with acceptable latency. The server has to juggle short prompts, long prompts, short answers, long answers, streaming clients, and traffic spikes.

That is why the vocabulary changes. You start hearing about time to first token, tokens per second, KV cache memory, batch size, throughput, and tail latency. Those are not academic details. They decide whether your product feels instant, sluggish, or too expensive to ship.

The two kinds of work

An LLM inference request has two main phases. First, the server reads the prompt and builds internal state for all input tokens. That phase is called prefill. Then the server generates output tokens one by one. That phase is called decode.

Prefill can process many prompt tokens in parallel. Decode is sequential: token 17 cannot be generated until token 16 exists. This split explains a lot of real behavior: long prompts make the first token slower, while long answers keep the server busy for longer.
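
A back-of-the-envelope model shows how the two phases add up. This is a rough sketch, not a description of any particular server: the function name estimate_latency and its parameters are invented here, and it assumes prefill and decode each run at a constant rate, which real systems only approximate. It also treats the first token as arriving right when prefill ends.

```python
def estimate_latency(prompt_tokens, output_tokens,
                     prefill_tokens_per_s, decode_tokens_per_s,
                     queue_s=0.0):
    """Rough latency split for one request (all rates are assumptions)."""
    # Prefill handles the whole prompt before any output can appear.
    time_to_first_token_s = queue_s + prompt_tokens / prefill_tokens_per_s
    # Decode emits output tokens one at a time.
    decode_s = output_tokens / decode_tokens_per_s
    return {
        "time_to_first_token_s": time_to_first_token_s,
        "decode_s": decode_s,
        "total_s": time_to_first_token_s + decode_s,
    }


# Hypothetical rates: 5000 prompt tokens/s in prefill, 50 tokens/s in decode.
short_prompt = estimate_latency(200, 200, 5000, 50)
long_prompt = estimate_latency(2000, 200, 5000, 50)
long_answer = estimate_latency(200, 800, 5000, 50)
```

With these made-up rates, growing the prompt tenfold moves the first token from about 0.04 s to about 0.4 s, while growing the answer fourfold adds 12 s of decode time and leaves the first token untouched.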

Vocabulary

Input tokens are the prompt, system message, chat history, and retrieved context. Output tokens are the model's answer. They stress the server in different ways, so track them separately.
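
One simple way to track them separately is to keep two counters and price them independently. The sketch below is illustrative only: the Usage class is hypothetical, and the prices are placeholder numbers, not any provider's actual rates.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Running totals of input and output tokens (hypothetical helper)."""
    input_tokens: int = 0
    output_tokens: int = 0

    def add(self, input_tokens, output_tokens):
        """Record one request's token counts."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    def cost(self, input_price_per_million, output_price_per_million):
        """Total cost, with input and output priced separately."""
        return (self.input_tokens * input_price_per_million
                + self.output_tokens * output_price_per_million) / 1_000_000


usage = Usage()
usage.add(input_tokens=1500, output_tokens=300)  # long prompt, short answer
usage.add(input_tokens=500, output_tokens=200)   # short prompt, short answer

# Placeholder prices: 1.00 per million input tokens, 4.00 per million output.
total = usage.cost(input_price_per_million=1.0, output_price_per_million=4.0)
```

Keeping the two counts apart lets you see whether a cost or latency problem comes from what you send or from what the model writes back.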

What the API does not show you

A hosted API gives you a clean interface: send messages, receive text. Underneath, a scheduler places your request on hardware, batches it with other requests, keeps cache memory around, streams tokens back, and frees resources when the request finishes.

If you only look at total request time, you cannot see where the time went. A bad experience might be slow because the prompt is huge, because the model emits too many tokens, because the queue is backed up, because cache memory is full, or because the server is using a model too large for the hardware.

Engineering reality

When an AI feature feels slow, do not start by rewriting the prompt. First split the request into queue time, time to first token, output tokens per second, and total output length. The fix depends on which number is actually bad.
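
If you can record a few timestamps per request, the split is simple arithmetic. The sketch below assumes you have three things: when the request was enqueued, when the server started working on it, and when each output token arrived. The function name split_request and its field names are invented for this example, and time to first token is measured from the moment the request was enqueued, so it includes queue time.

```python
def split_request(enqueued_at, started_at, token_times):
    """Break one request's timeline into the four numbers above.

    All arguments are timestamps in seconds. token_times holds the
    arrival time of each output token, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token timestamp")

    output_tokens = len(token_times)
    decode_window_s = token_times[-1] - token_times[0]

    # Rate over the tokens that came after the first one. Undefined
    # (None) when there is only one token or no measurable window.
    if output_tokens > 1 and decode_window_s > 0:
        tokens_per_s = (output_tokens - 1) / decode_window_s
    else:
        tokens_per_s = None

    return {
        "queue_s": started_at - enqueued_at,
        "time_to_first_token_s": token_times[0] - enqueued_at,
        "output_tokens_per_s": tokens_per_s,
        "output_tokens": output_tokens,
    }


# Made-up timeline: queued for 0.5 s, first token at 1.0 s,
# then one token every 0.5 s.
report = split_request(0.0, 0.5, [1.0, 1.5, 2.0, 2.5, 3.0])
```

In this made-up timeline, half of the wait for the first token is queue time, which points at capacity rather than at the prompt.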

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What does it mean that inference runs a frozen model?
Why is an LLM response a loop instead of one prediction?
What is the difference between input tokens and output tokens?
Why can inference dominate production AI cost?

Quick check

They are updated after every user request
They stay fixed and are read by the forward pass
They are replaced by token IDs

Every output token needs another model step
The prompt is tokenized again from scratch for each word
The sampler gets slower after each token