Transformer architecture: the LLM block
A modern LLM is mostly one transformer block repeated many times. Once you understand the block, the giant model stops looking like magic and starts looking like a deep stack of the same few operations.
A transformer block alternates between mixing information across tokens with attention and transforming each token's representation with an MLP. Residual paths and normalization keep the deep stack trainable.
From token IDs to vectors
The tokenizer gives the model token IDs. The first learned table in the model turns each ID into a vector, called a token embedding. If the sequence has 200 tokens, the model now has 200 vectors, one per position.
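The lookup described above can be sketched in a few lines. This is a minimal illustration, not any particular model's code: the vocabulary size, vector width, and token IDs are made up, and random numbers stand in for values a real model would learn.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

VOCAB_SIZE = 10  # hypothetical tiny vocabulary
D_MODEL = 4      # hypothetical embedding width

# One row per token ID. Real tables are learned; random values stand in here.
embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Look up one vector per token ID."""
    return [embedding_table[t] for t in token_ids]

token_ids = [3, 7, 3, 1]  # made-up IDs, as if produced by a tokenizer
vectors = embed(token_ids)

print(len(vectors), len(vectors[0]))  # 4 4
print(vectors[0] == vectors[2])       # True: the same ID gives the same vector
```

Note that the two occurrences of ID 3 get identical vectors. At this stage nothing distinguishes where in the sequence a token appears.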
The model also needs position information. "Dog bites man" and "man bites dog" contain the same token vectors in a different order, and attention by itself has no built-in notion of order. Transformers therefore add or encode positional information so every token representation carries both identity and place.
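A toy example can make the ordering problem concrete. The sketch below assumes one simple scheme, adding a per-position vector to each token vector; real models use several different schemes, and every number here is invented for illustration.

```python
# Hypothetical 2-dimensional token vectors and per-position vectors.
token_vec = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}
position_vec = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]

def embed(words, use_positions):
    """Return one vector per word, optionally adding a position vector."""
    out = []
    for pos, word in enumerate(words):
        vec = list(token_vec[word])
        if use_positions:
            vec = [t + p for t, p in zip(vec, position_vec[pos])]
        out.append(vec)
    return out

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# Sorting removes order, so this compares the two sentences as bags of vectors.
same_bag_without = sorted(embed(a, False)) == sorted(embed(b, False))
same_bag_with = sorted(embed(a, True)) == sorted(embed(b, True))

print(same_bag_without)  # True: without positions, the bags are identical
print(same_bag_with)     # False: with positions, "dog" at slot 0 differs from "dog" at slot 2
```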
The repeating block
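After embeddings, the sequence flows through a stack of transformer blocks. A simplified decoder-only LLM block has two sublayers in sequence, causal self-attention and then an MLP, each paired with normalization and wrapped in a residual connection.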
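As a rough sketch of that wiring, the function below treats the sublayers as placeholder callables so only the data flow is visible. It assumes the "pre-norm" ordering, where normalization runs before each sublayer; that ordering is an assumption of this sketch, and other arrangements exist. All function names are hypothetical.

```python
def add(x, update):
    """Element-wise sum of two sequences of vectors (the residual add)."""
    return [[a + b for a, b in zip(row, urow)] for row, urow in zip(x, update)]

def transformer_block(x, attention, mlp, norm1, norm2):
    """x is a list of per-token vectors; the four callables are placeholders."""
    x = add(x, attention(norm1(x)))  # mix across tokens, then add to the stream
    x = add(x, mlp(norm2(x)))        # refine each position, then add again
    return x

def run_stack(x, blocks):
    """Apply the same block structure repeatedly, each with its own sublayers."""
    for attention, mlp, norm1, norm2 in blocks:
        x = transformer_block(x, attention, mlp, norm1, norm2)
    return x

# Stand-in sublayers that do nothing, just to exercise the wiring.
identity = lambda x: x
zeros = lambda x: [[0.0] * len(row) for row in x]

x = [[1.0, 2.0], [3.0, 4.0]]
blocks = [(zeros, zeros, identity, identity)] * 3
print(run_stack(x, blocks) == x)  # True: zero updates leave the stream unchanged
```

The input and output have the same shape, which is what lets the same block be stacked many times.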
Attention mixes across tokens
The attention sublayer is where tokens exchange information. The representation at one position can pull from previous positions, so a pronoun can connect to a noun, a function call can connect to its name, and the last token can summarize what the prompt asked for.
In a decoder-only LLM, causal masking keeps this left-to-right. Each token can use the past and itself, not the future.
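The following is a minimal single-head sketch of causal attention in plain Python. To stay short it reuses the input vectors directly as queries, keys, and values; a real model would first apply learned projections, and would use several heads. The input numbers are invented.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Each position mixes the values of itself and earlier positions only."""
    d = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        visible = range(i + 1)  # causal mask: positions 0..i, never the future
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in visible
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * values[j][k] for j, w in zip(visible, weights))
            for k in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_self_attention(x, x, x)

# Change only the last token and recompute.
x_changed = x[:2] + [[5.0, -3.0]]
out_changed = causal_self_attention(x_changed, x_changed, x_changed)

print(out[0])                        # [1.0, 0.0]: position 0 can only see itself
print(out[:2] == out_changed[:2])    # True: earlier outputs ignore the future token
```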
The MLP works position by position
The feed-forward network, often called the MLP, is different. It does not mix tokens with each other. It applies the same learned transformation to each token position independently. You can think of it as the part that refines what each position knows after attention has gathered context.
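The position-by-position behavior can be checked with a small sketch. It assumes a two-layer MLP with a ReLU in between, which is one simple choice; many current models use other activations. The weights are hand-picked numbers, not learned values.

```python
def linear(v, weights, bias):
    """Multiply a vector by a weight matrix (one output per row) and add a bias."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

# Hypothetical weights: expand from 2 to 4 dimensions, then project back to 2.
W_UP = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0], [-1.0, 1.0]]
B_UP = [0.0, 0.0, 0.0, 0.0]
W_DOWN = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
B_DOWN = [0.0, 0.0]

def mlp_one_position(v):
    return linear(relu(linear(v, W_UP, B_UP)), W_DOWN, B_DOWN)

def mlp(x):
    """Apply the same transformation to every position independently."""
    return [mlp_one_position(v) for v in x]

x = [[1.0, 2.0], [3.0, -1.0], [1.0, 2.0]]
out = mlp(x)

# Replace the first token; the other positions should not notice.
out_changed = mlp([[9.0, 9.0]] + x[1:])

print(out[1:] == out_changed[1:])  # True: no mixing across positions
print(out[0] == out[2])            # True: same input vector, same output
```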
In common configurations, most of each block's parameters live in the MLP rather than in attention. Attention gets the fame because it routes information, but the MLPs hold a lot of the model's learned transformations.
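A back-of-the-envelope count shows where a block's parameters go. The sketch assumes a classic layout: four square attention projections (query, key, value, output) and an MLP whose hidden layer is four times the model width, with biases ignored. Both the layout and the width of 4096 are assumptions for illustration; real models vary.

```python
def attention_params(d_model):
    """Query, key, value, and output projections, each d_model x d_model."""
    return 4 * d_model * d_model

def mlp_params(d_model, expansion=4):
    """Up-projection and down-projection through a wider hidden layer."""
    d_hidden = expansion * d_model
    return 2 * d_model * d_hidden

d_model = 4096  # hypothetical model width

attn = attention_params(d_model)
mlp = mlp_params(d_model)
share = mlp / (attn + mlp)

print(attn)            # 67108864
print(mlp)             # 134217728
print(f"{share:.2f}")  # 0.67
```

Under these assumptions the MLP holds about two thirds of each block's weights.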
Residuals and layer norm keep the stack alive
Deep networks are hard to train if every layer must rewrite everything from scratch. Residual connections give each block a shortcut: the block adds a change to the existing representation instead of replacing it completely. That makes it easier for information and gradients to move through dozens of layers.
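A deliberately tiny numeric toy shows the difference between replacing and adding. It uses a single number in place of a vector and a made-up stand-in sublayer that proposes a small change; it is an illustration of the idea, not a model of real training.

```python
def small_update(x):
    """Stand-in sublayer: proposes a small change proportional to its input."""
    return 0.01 * x

def run_without_residual(x, depth):
    for _ in range(depth):
        x = small_update(x)      # each layer replaces the value
    return x

def run_with_residual(x, depth):
    for _ in range(depth):
        x = x + small_update(x)  # each layer adds a change to the value
    return x

print(run_without_residual(1.0, 24))  # roughly 1e-48: the signal vanishes
print(run_with_residual(1.0, 24))     # roughly 1.27: the signal survives
```

Because this toy is linear, the same per-layer factors multiply the gradient on the way back, so the shortcut protects the backward path in the same way it protects the forward one.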
Layer normalization rescales each token's vector to a stable range as it passes through the stack. Without normalization, values can drift and training becomes brittle. The details matter to model builders, but the practical mental model is enough for now: residuals preserve paths, normalization stabilizes them.
When people talk about "model size," they usually mean parameter count across embeddings, attention projections, MLPs, and output layers. More parameters can store more patterns, but serving cost also rises. Bigger models need more memory, more bandwidth, and often more aggressive batching or quantization to run cheaply.
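The memory side of that trade-off is simple arithmetic. The sketch below estimates the space needed for the weights alone at a few common numeric precisions; the 7-billion-parameter figure is a hypothetical example, and the estimate leaves out everything else a server holds in memory while running the model.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, dtype):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

num_params = 7_000_000_000  # hypothetical 7B-parameter model

for dtype in BYTES_PER_PARAM:
    print(dtype, weight_memory_gb(num_params, dtype))
# fp32 28.0
# fp16 14.0
# int8 7.0
# int4 3.5
```

Halving the bytes per parameter halves the weight footprint, which is why quantization comes up whenever serving cost does.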
Where the output comes from
After the last transformer block, the model has a final vector for each token position. For text generation, we care about the final position: given everything so far, what token should come next? A final projection maps that vector to a score for every token in the vocabulary. Those scores become probabilities in the generation step.
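The final projection can be sketched as one dot product per vocabulary entry. Everything concrete here is made up for illustration: the four-word vocabulary, the projection weights, and the final hidden vector. The softmax at the end is the usual way scores are turned into probabilities.

```python
import math

def logits_from_hidden(hidden, unembedding):
    """One score per vocabulary entry: dot product of hidden with each row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembedding]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["cat", "dog", "sat", "."]  # hypothetical tiny vocabulary
unembedding = [                      # hypothetical weights, one row per token
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, -1.0],
]
final_hidden = [0.5, 2.0]            # vector at the last position

logits = logits_from_hidden(final_hidden, unembedding)
probs = softmax(logits)
best = vocab[probs.index(max(probs))]

print(logits)  # [0.5, 2.0, 2.5, -2.5]
print(best)    # sat
```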
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- What happens between token IDs and the first transformer block?
- What does attention do that the MLP does not?
- Why do transformer blocks use residual connections?
- How does the model turn the final hidden vector into next-token scores?
Quick check
Which part of the block lets tokens exchange information?
- Self-attention
- The MLP only
- Layer normalization
Why does a transformer need positional information?
- Because token IDs already include all word order
- Because order changes meaning, and attention alone has no built-in order
- Because the API bills tokens by position
What does a residual connection do?
- Deleting unimportant tokens
- Letting information flow around a sublayer and adding an update
- Changing token IDs back into text