How LLMs generate text
An LLM answer is not produced in one shot. The model predicts one token, appends it to the context, then predicts the next one. The whole response is that loop repeated until something tells it to stop.
LLMs generate by repeatedly turning the current context into a probability distribution over the next token, choosing one token, adding it to the context, and running the model again.
Training teaches the next-token game
During pretraining, the model sees a huge amount of text and learns a simple task: given the previous tokens, predict the next token. It does this over and over at every position. That task sounds small, but solving it well forces the model to learn grammar, facts, style, code patterns, reasoning traces, and many other regularities in text.
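As a rough illustration of "at every position", the sketch below turns one short token sequence into training examples. The helper name `next_token_examples` is made up for this lesson. Real training works on token IDs in large batches and scores predictions with a cross-entropy loss, so this only shows how the examples are formed.

```python
def next_token_examples(tokens):
    """Return (context, target) pairs, one for each position after the first."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Every prefix of the text becomes a question; the token that follows is the answer.
for context, target in next_token_examples(["the", "cat", "sat", "down"]):
    print(context, "->", target)
```

A sequence of four tokens yields three examples, which is why a single document provides many training signals.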
At inference time, we use the same skill differently. We give the model a prompt, ask for the next token, append that token, then ask again.
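A minimal sketch of that loop, assuming a hypothetical `toy_next_token` function in place of a real model. It only looks at the last token and follows a fixed table, and `<eos>` is an invented end marker.

```python
# Toy stand-in for a model: maps the last token to the next one.
# A real model would read the whole context and return probabilities.
TOY_TABLE = {"the": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}

def toy_next_token(context):
    """Hypothetical 'model': choose the next token from the last one."""
    return TOY_TABLE.get(context[-1], "<eos>")

def generate(prompt_tokens, max_new_tokens=10):
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = toy_next_token(context)  # predict one token
        if token == "<eos>":             # the model signals it is done
            break
        context.append(token)            # append, then go around again
    return context

print(generate(["the"]))
```

The shape is the important part: one prediction per pass, with the growing context fed back in each time.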
From logits to probabilities
The model's final layer outputs one score for every token in the vocabulary. These raw scores are called logits. A softmax turns them into probabilities that add up to 1. High-probability tokens are the model's best guesses for what should come next.
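Here is a small sketch of the softmax step using only the standard library. The three-token vocabulary and the logit values are invented for illustration.

```python
import math

def softmax(logits):
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a three-token vocabulary.
vocab = ["cat", "dog", "car"]
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```

Logits can be any real numbers, including negative ones. After softmax, every value is between 0 and 1, and the ordering of the tokens is preserved.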
Greedy decoding vs sampling
The simplest strategy is greedy decoding: always pick the highest-probability token. This is deterministic and often useful for narrow tasks, but it can make text dull, repetitive, or brittle. Sampling instead treats the probabilities as a distribution and randomly chooses from them. More likely tokens are picked more often, but lower-probability tokens still have a chance.
That controlled randomness is why the same prompt can produce different good answers. The model is not "changing its mind." The sampling process is choosing different paths through the probability tree.
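To make the contrast concrete, the sketch below applies both strategies to the same made-up distribution. The function names and probabilities are assumptions for this example, and the random generator is seeded only so the demo is repeatable.

```python
import random

vocab = ["cat", "dog", "car"]
probs = [0.6, 0.3, 0.1]  # hypothetical next-token probabilities

def greedy(vocab, probs):
    # Always return the single most likely token.
    best = max(range(len(probs)), key=lambda i: probs[i])
    return vocab[best]

def sample(vocab, probs, rng):
    # Draw one token in proportion to its probability.
    return rng.choices(vocab, weights=probs, k=1)[0]

rng = random.Random(0)
print("greedy: ", [greedy(vocab, probs) for _ in range(5)])
print("sampled:", [sample(vocab, probs, rng) for _ in range(5)])
```

Greedy returns the same token on every call. Sampling can return any token with nonzero probability, which is where run-to-run variation comes from.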
Temperature and top-p
Temperature reshapes the probability distribution. Low temperature sharpens it, making the model more likely to pick the obvious token. High temperature flattens it, giving unlikely tokens more room. Temperature does not make the model smarter. It changes how adventurous the sampler is.
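In the common formulation, the logits are divided by the temperature before the softmax. The sketch below assumes that formulation and uses invented logits. A temperature of exactly 0 would divide by zero here; serving stacks typically treat it as greedy decoding.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Scale the logits, then apply an ordinary softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical logits
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The ranking of tokens never changes, only how much probability the top token holds compared with the rest.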
Top-p, also called nucleus sampling, ranks tokens by probability and keeps the smallest set whose cumulative probability reaches a chosen threshold p (for example, 0.9), then renormalizes and samples within that set. It cuts off the long tail of weird options while still allowing variety among plausible ones.
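A minimal sketch of nucleus filtering, assuming a made-up four-token distribution. The helper name `top_p_filter` is hypothetical, and production implementations work on tensors rather than Python lists.

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest set of token indices whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1 again.
    return {i: probs[i] / cumulative for i in kept}

vocab = ["cat", "dog", "car", "xylophone"]
probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical distribution
nucleus = top_p_filter(probs, p=0.9)

rng = random.Random(0)  # seeded only so the demo is repeatable
choice = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print([vocab[i] for i in nucleus], "->", vocab[choice])
```

With p set to 0.9, the first three tokens cover 0.95 of the probability mass, so the unlikely fourth token is dropped before sampling.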
Use lower randomness for extraction, classification, code edits, and structured output. Allow more randomness for brainstorming, drafting, naming, and creative options. The right setting depends on the task, not on the model alone.
Stopping is part of generation
The loop needs a stop condition. It can stop when the model emits a special end token, when it hits a maximum output length, or when the serving layer sees a stop sequence like </json>. If your product expects JSON, SQL, Markdown, or a tool call, stop conditions and validation matter as much as the prompt.
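The three stop conditions can be sketched as a single function. The name `decode_with_stops`, its parameters, and the token list are all assumptions made for illustration. Trimming the stop sequence from the returned text is also an assumption, since serving layers differ on that point.

```python
import json

def decode_with_stops(token_stream, eos="<eos>", max_tokens=50,
                      stop_sequences=("</json>",)):
    """Collect string tokens until a stop condition fires.

    token_stream is any iterable of tokens (a stand-in for the model).
    Returns (text, reason).
    """
    text = ""
    for count, token in enumerate(token_stream, start=1):
        if token == eos:
            return text, "end_token"
        text += token
        for stop in stop_sequences:
            if stop in text:
                # Drop the stop sequence and anything after it.
                return text[: text.index(stop)], "stop_sequence"
        if count >= max_tokens:
            return text, "max_tokens"
    return text, "stream_exhausted"

tokens = ['{"ok"', ": ", "true", "}", "</json>", "extra"]
text, reason = decode_with_stops(tokens)
print(reason, json.loads(text))  # json.loads raises ValueError on invalid JSON
```

Parsing the result is the validation step. If the output had been cut off by the length limit, `json.loads` would likely fail, and the caller could retry or report an error.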
Streaming does not change how the answer is computed. It sends tokens to the client as they are generated instead of waiting for the full response. This improves perceived latency, but the server still runs the decode loop token by token. Long answers are slow because every output token requires another model step.
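A Python generator is a convenient way to picture streaming. In this sketch, `decode_stream` and its `step_seconds` delay are hypothetical stand-ins for a real model and its per-token latency.

```python
import time

def decode_stream(tokens, step_seconds=0.0):
    """Yield each token as soon as it is 'generated'."""
    for token in tokens:
        time.sleep(step_seconds)  # simulated cost of one model step
        yield token

tokens = ["Long ", "answers ", "take ", "many ", "steps."]

# Streaming: the client can display each piece immediately.
for piece in decode_stream(tokens, step_seconds=0.01):
    print(piece, end="", flush=True)
print()

# Non-streaming: the same number of steps, but nothing appears until the end.
print("".join(decode_stream(tokens, step_seconds=0.01)))
```

Both versions pay the per-token delay five times. Streaming only changes when the first piece becomes visible.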
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- Why is LLM generation a loop?
- What are logits, and how do they become probabilities?
- When would you prefer greedy decoding over sampling?
- What does temperature change, and what does it not change?
- Why does streaming help perceived latency but not remove decode cost?