Lesson 04

How LLMs generate text

An LLM answer is not produced in one shot. The model predicts one token, appends it to the context, then predicts the next one. The whole response is that loop repeated until something tells it to stop.

The one idea

LLMs generate by repeatedly turning the current context into a probability distribution over the next token, choosing one token, adding it to the context, and running the model again.

Training teaches the next-token game

During pretraining, the model sees a huge amount of text and learns a simple task: given the previous tokens, predict the next token. It does this over and over at every position. That task sounds small, but solving it well forces the model to learn grammar, facts, style, code patterns, reasoning traces, and many other regularities in text.
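To make "at every position" concrete, here is a minimal sketch of how one short text turns into several training examples. The helper names (next_token_examples, token_loss) are made up for this illustration, and the negative-log-probability score is the commonly used cross-entropy loss, which is an assumption here since the lesson does not name a loss.

```python
import math

def next_token_examples(tokens):
    """Return one (context, target) pair for every position after the first."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def token_loss(prob_of_correct_token):
    """Negative log probability: small when the model was confident and right."""
    return -math.log(prob_of_correct_token)

tokens = ["The", "cat", "sat", "down"]
for context, target in next_token_examples(tokens):
    print(context, "->", target)

# A confident correct guess is penalized far less than an unsure one.
print(token_loss(0.9))
print(token_loss(0.1))
```

A four-token text yields three examples, which is why a large corpus provides an enormous number of prediction problems.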

At inference time, we use the same skill differently. We give the model a prompt, ask for the next token, append that token, then ask again.
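The loop described above can be sketched in a few lines. This is a toy illustration, not a real model: TOY_MODEL is a hypothetical lookup table that only looks at the last token, and the sketch simply picks the most likely token at each step.

```python
# Hypothetical stand-in for a model: maps the last token to a
# probability distribution over the next token.
TOY_MODEL = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"<eos>": 1.0},
    "ran": {"<eos>": 1.0},
}

def next_token_probs(context):
    """Return the next-token distribution for the current context."""
    return TOY_MODEL[context[-1]]

def generate(prompt, max_new_tokens=10):
    context = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(context)      # run the "model" once
        token = max(probs, key=probs.get)      # choose one token
        if token == "<eos>":                   # end token: stop the loop
            break
        context.append(token)                  # append it and go again
    return context

print(generate(["the"]))  # ['the', 'cat', 'sat']
```

A real model conditions on the whole context rather than just the last token, but the control flow is the same: predict, choose, append, repeat.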

From logits to probabilities

The model's final layer outputs one score for every token in the vocabulary. These raw scores are called logits. A softmax turns them into probabilities that add up to 1. High-probability tokens are the model's best guesses for what should come next.
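As a rough sketch of that step, the function below applies a softmax to three made-up logits. The token names and scores are invented for illustration, and subtracting the maximum logit is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

# Invented logits for three candidate next tokens.
logits = {"cat": 2.0, "dog": 1.0, "car": 0.1}
probs = dict(zip(logits, softmax(list(logits.values()))))

for token, p in probs.items():
    print(token, round(p, 2))  # roughly 0.66, 0.24, 0.10
```

The ranking of tokens is unchanged by the softmax. It only converts scores into a form that can be sampled from.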

Generation is iterative. The model does not draft a whole answer internally and reveal it. It commits one token at a time.

Greedy decoding vs sampling

The simplest strategy is greedy decoding: always pick the highest-probability token. This is deterministic and often useful for narrow tasks, but it can make text dull, repetitive, or brittle. Sampling instead treats the probabilities as a distribution and randomly chooses from them. More likely tokens are picked more often, but lower-probability tokens still have a chance.

That controlled randomness is why the same prompt can produce different good answers. The model is not "changing its mind." The sampling process is choosing different paths through the probability tree.
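The difference between the two strategies is small in code. The sketch below uses an invented three-token distribution and Python's standard random module. A real decoder would do the same thing over the full vocabulary.

```python
import random

def greedy_pick(probs):
    """Always return the highest-probability token."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Invented distribution for the token after "The sky is".
probs = {"blue": 0.5, "clear": 0.3, "falling": 0.2}
rng = random.Random(0)  # seeded only so runs are repeatable

print(greedy_pick(probs))                            # always "blue"
print([sample_pick(probs, rng) for _ in range(5)])   # a mix of tokens
```

Running the sampling line with a different seed gives a different sequence, which is the code-level version of "different paths through the probability tree."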

Temperature and top-p

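Temperature reshapes the probability distribution by dividing every logit by the temperature value before the softmax. Low temperature (below 1) sharpens the distribution, making the model more likely to pick the obvious token. As temperature approaches 0, sampling behaves like greedy decoding. High temperature (above 1) flattens it, giving unlikely tokens more room. Temperature does not make the model smarter, because the logits themselves are unchanged. It changes how adventurous the sampler is.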
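A minimal sketch of temperature, assuming the usual formulation in which each logit is divided by the temperature before the softmax. The logits are invented for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # invented scores for three tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 2) for p in probs])
```

The top token's share shrinks as the temperature rises, but the ordering of the tokens never changes. Only the sampler's willingness to leave the top choice does.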

Top-p, also called nucleus sampling, ranks tokens from most to least likely and keeps the smallest set whose probabilities add up to at least a chosen threshold p. The kept probabilities are rescaled to sum to 1, and the sampler draws from that set only. This cuts off the long tail of weird options while still allowing variety among plausible ones.
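Here is one way to sketch top-p filtering over a small invented distribution. The function names are made up for this example, and real implementations work on tensors over the full vocabulary rather than on dictionaries.

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose total probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    # Rescale so the surviving probabilities sum to 1 again.
    return {token: prob / total for token, prob in kept}

def sample_top_p(probs, p, rng):
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]

probs = {"blue": 0.5, "clear": 0.3, "falling": 0.15, "purple": 0.05}
print(top_p_filter(probs, 0.9))   # "purple" is cut from the tail
print(sample_top_p(probs, 0.9, random.Random(0)))
```

With p set to 0.9, the three most likely tokens cover 0.95 of the probability mass, so the least likely token can never be drawn.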

Practical default

Use lower randomness for extraction, classification, code edits, and structured output. Allow more randomness for brainstorming, drafting, naming, and creative options. The right setting depends on the task, not on the model alone.

Stopping is part of generation

The loop needs a stop condition. It can stop when the model emits a special end token, when it hits a maximum output length, or when the serving layer sees a stop sequence like </json>. If your product expects JSON, SQL, Markdown, or a tool call, stop conditions and validation matter as much as the prompt.
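The three stop conditions can be sketched as checks inside the loop. Everything here is hypothetical: the token list stands in for a model's output, and cutting the text just before the stop sequence is one possible convention, since serving layers differ on whether the stop sequence is returned.

```python
import json

def collect_until_stop(token_stream, max_tokens, end_token="<eos>", stop_sequences=()):
    """Accumulate tokens and report which stop condition ended the loop."""
    text = ""
    for count, token in enumerate(token_stream, start=1):
        if token == end_token:
            return text, "end_token"
        text += token
        for stop in stop_sequences:
            cut = text.find(stop)
            if cut != -1:
                # Assumption: drop the stop sequence and anything after it.
                return text[:cut], "stop_sequence"
        if count >= max_tokens:
            return text, "max_tokens"
    return text, "stream_ended"

# Invented token stream standing in for model output.
tokens = ['{"city"', ': "Oslo"}', "</json>", " trailing text"]
text, reason = collect_until_stop(tokens, max_tokens=50, stop_sequences=("</json>",))
print(reason, text)

# Validation is a separate step: stopping cleanly does not guarantee valid JSON.
print(json.loads(text))
```

Returning the reason alongside the text lets the caller tell a complete answer from one that was cut off by the length limit.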

Engineering reality

Streaming does not change how the answer is computed. It only changes when the client sees it: tokens are sent as they are generated instead of being held until the response is complete. This improves perceived latency, because the first words appear early, but the server still runs the decode loop token by token. Long answers are slow because every output token requires another model step.
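A Python generator gives a rough picture of streaming. In this sketch the decode function is a made-up stand-in in which appending to step_log represents one model step. The point is that the first token is available after a single step, while the total number of steps stays the same.

```python
def decode(tokens_to_emit, step_log):
    """Yield tokens one at a time, as a streaming server would."""
    for token in tokens_to_emit:
        step_log.append(token)  # stands in for one model step
        yield token             # hand the token over immediately

steps = []
stream = decode(["Stream", "ing", " works"], steps)

first = next(stream)            # client receives the first token...
steps_before_first = len(steps) # ...after only one model step

received = [first] + list(stream)
print(steps_before_first)       # 1
print("".join(received))        # Streaming works
print(len(steps))               # 3: one step per output token, streamed or not
```

Without streaming, the client would wait for all three steps before seeing anything. The total work is identical in both cases.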

Checkpoint

You're ready for the next lesson if you can answer these from memory:

Why is LLM generation a loop?
What are logits, and how do they become probabilities?
When would you prefer greedy decoding over sampling?
What does temperature change, and what does it not change?
Why does streaming help perceived latency but not remove decode cost?

Quick check

What is a logit?
A raw score for a possible next token
The hidden prompt stored by the model
The numeric ID of the chosen token

What does raising the temperature do?
It makes the model more factually correct
It makes sampling more varied and less locked to the top token
It reduces the number of input tokens

Why do long answers take longer to generate?
Each output token requires another model step
They use rarer words
The tokenizer has to relearn the vocabulary