Lesson 03

Transformer architecture: the LLM block

A modern LLM is mostly one transformer block repeated many times. Once you understand the block, the giant model stops looking like magic and starts looking like a deep stack of the same few operations.

The one idea

A transformer block alternates between mixing information across tokens with attention and transforming each token's representation with an MLP. Residual paths and normalization keep the deep stack trainable.

From token IDs to vectors

The tokenizer gives the model token IDs. The first learned table in the model turns each ID into a vector, called a token embedding. If the sequence has 200 tokens, the model now has 200 vectors, one per position.
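To make the lookup concrete, here is a minimal sketch of an embedding table. The vocabulary size, vector width, and token IDs are made up for illustration, and random numbers stand in for trained weights.

```python
import random

random.seed(0)

VOCAB_SIZE = 50  # hypothetical tiny vocabulary
D_MODEL = 8      # hypothetical embedding width

# The learned table: one row of D_MODEL numbers per token ID.
# Random values stand in for trained weights.
embedding_table = [
    [random.gauss(0.0, 0.02) for _ in range(D_MODEL)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Look up one vector per token ID. No arithmetic, just indexing."""
    return [embedding_table[t] for t in token_ids]

token_ids = [7, 42, 7, 3]  # made-up IDs, as if from a tokenizer
vectors = embed(token_ids)
print(len(vectors), len(vectors[0]))  # 4 8
```

Note that the two occurrences of ID 7 get exactly the same vector. At this stage nothing distinguishes them, which is why position information is needed.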

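The model also needs position information. Without it, "dog bites man" and "man bites dog" would be the same token vectors in a different order, and attention by itself has no built-in notion of order, so the model could not reliably tell the two apart. Transformers therefore inject positional information, for example by adding a position vector to each embedding or by encoding relative position inside attention, so every token representation carries both identity and place.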
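The "dog bites man" problem can be shown with a toy example. This sketch assumes one simple scheme, adding a learned position vector to each token embedding; the two-dimensional vectors are invented for illustration, and real models may encode position differently.

```python
# Hypothetical 2-dimensional token embeddings.
token_emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}
# Hypothetical position vectors, one per position in the sequence.
pos_emb = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.3]]

def embed(words, use_positions):
    """Return one vector per word, optionally adding a position vector."""
    out = []
    for i, word in enumerate(words):
        vec = list(token_emb[word])
        if use_positions:
            vec = [v + p for v, p in zip(vec, pos_emb[i])]
        out.append(vec)
    return out

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# Sorting throws away order, so this compares the sentences as bags of vectors.
same_without_pos = sorted(embed(a, False)) == sorted(embed(b, False))
same_with_pos = sorted(embed(a, True)) == sorted(embed(b, True))
print(same_without_pos, same_with_pos)  # True False
```

Without positions, the two sentences are the same bag of vectors. Once each vector also carries its place, "dog" at position 0 is no longer the same vector as "dog" at position 2.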

The repeating block

After embeddings, the sequence flows through a stack of transformer blocks. A simplified decoder-only LLM block runs two sublayers in order, attention and then the MLP, and wraps each one in a residual path with normalization.

Architectures vary in ordering details, but the core rhythm is stable: attention mixes across positions, the MLP transforms each position, and residual paths carry information around each sublayer.
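The rhythm described above can be sketched in a few lines. This version assumes "pre-norm" ordering (normalize, run the sublayer, then add the result back), which is one common choice and not the only one. The sublayers passed in are toy stand-ins chosen so the numbers are easy to follow; real sublayers have learned weights.

```python
def add(xs, ys):
    """Residual add: element-wise sum of two sequences of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def transformer_block(xs, norm1, attention, norm2, mlp):
    """One block in pre-norm order (an assumed ordering; others exist)."""
    xs = add(xs, attention(norm1(xs)))  # mix across positions, then add
    xs = add(xs, mlp(norm2(xs)))        # transform each position, then add
    return xs

# Toy stand-ins so the sketch runs.
def identity(xs):
    return xs

def causal_mean(xs):
    """Stand-in for attention: average each position with earlier ones."""
    out = []
    for i in range(len(xs)):
        visible = xs[: i + 1]
        out.append([sum(col) / len(visible) for col in zip(*visible)])
    return out

def halve(xs):
    """Stand-in for the MLP: the same function applied at every position."""
    return [[0.5 * v for v in x] for x in xs]

xs = [[1.0, 0.0], [0.0, 1.0]]
out = transformer_block(xs, identity, causal_mean, identity, halve)
print(out)  # [[3.0, 0.0], [0.75, 2.25]]
```

A full model would call a block like this once per layer, each time with that layer's own weights, feeding the output of one block into the next.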

Attention mixes across tokens

The attention sublayer is where tokens exchange information. The representation at one position can pull from previous positions, so a pronoun can connect to a noun, a function call can connect to its name, and the last token can summarize what the prompt asked for.

In a decoder-only LLM, causal masking keeps this left-to-right. Each token can use the past and itself, not the future.
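Here is a minimal single-head sketch of causal self-attention in plain Python. The width, the random weight matrices, and the input vectors are all placeholders, and the sketch leaves out multiple heads and the output projection that real models use. It applies the mask by never scoring future positions; matrix implementations typically get the same effect by setting future scores to negative infinity before the softmax.

```python
import math
import random

random.seed(0)
D = 4  # hypothetical model width

def rand_matrix(rows, cols):
    """Random stand-in for a trained weight matrix."""
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    """Multiply matrix W by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

W_q, W_k, W_v = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)

def causal_self_attention(xs):
    """Single-head attention where position i sees only positions 0..i."""
    qs = [matvec(W_q, x) for x in xs]
    ks = [matvec(W_k, x) for x in xs]
    vs = [matvec(W_v, x) for x in xs]
    outputs, all_weights = [], []
    for i, q in enumerate(qs):
        # Causal mask, done here by never scoring positions after i.
        scores = [dot(q, ks[j]) / math.sqrt(D) for j in range(i + 1)]
        weights = softmax(scores)
        # The output is a weighted average of the visible value vectors.
        mixed = [sum(w * vs[j][d] for j, w in enumerate(weights)) for d in range(D)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

xs = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(3)]
out, weights = causal_self_attention(xs)

# Replace the last token: earlier positions must not notice.
xs_changed = xs[:2] + [[9.0] * D]
out_changed, _ = causal_self_attention(xs_changed)
print(out[:2] == out_changed[:2])  # True
```

The final check is the causal property in action: changing a later token leaves the outputs at earlier positions untouched.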

The MLP works position by position

The feed-forward network, often called the MLP, is different. It does not mix tokens with each other. It applies the same learned transformation to each token position independently. You can think of it as the part that refines what each position knows after attention has gathered context.
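A small sketch makes the "position by position" point visible. It assumes a two-layer MLP with a GELU activation and a hidden width four times the model width, which is a common setup but not universal; biases are omitted and the weights are random placeholders.

```python
import math
import random

random.seed(0)
D_MODEL = 4
D_HIDDEN = 16  # assumed 4x expansion

def rand_matrix(rows, cols):
    """Random stand-in for a trained weight matrix."""
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    """Multiply matrix W by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def gelu(x):
    """GELU activation, tanh approximation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

W_up = rand_matrix(D_HIDDEN, D_MODEL)    # expand to the hidden width
W_down = rand_matrix(D_MODEL, D_HIDDEN)  # project back to the model width

def mlp_one_position(x):
    hidden = [gelu(h) for h in matvec(W_up, x)]
    return matvec(W_down, hidden)

def mlp(xs):
    """Apply the same weights to every position, one position at a time."""
    return [mlp_one_position(x) for x in xs]

xs = [[random.gauss(0.0, 1.0) for _ in range(D_MODEL)] for _ in range(3)]
out = mlp(xs)

# Change only the middle position and run again.
xs_changed = [xs[0], [5.0] * D_MODEL, xs[2]]
out_changed = mlp(xs_changed)
print(out[0] == out_changed[0], out[2] == out_changed[2])  # True True
```

Only the middle output changes. Compare that with attention, where changing one token can change the output at every later position.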

Many of the model's parameters live in these MLP layers. Attention gets the fame because it routes information, but the MLPs hold a lot of the model's learned transformations.
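A quick count shows where the parameters sit in one block. This is a sketch under stated assumptions: standard multi-head attention with four square projection matrices, a plain two-matrix MLP with a 4x hidden width, and no biases or normalization parameters. Models that use grouped-query attention or gated MLPs will give different numbers, and the width of 4096 is only an example.

```python
def block_param_counts(d_model, mlp_ratio=4):
    """Weight-matrix parameters in one block, ignoring biases and norms."""
    attention = 4 * d_model * d_model           # query, key, value, output
    mlp = 2 * d_model * (mlp_ratio * d_model)   # up- and down-projection
    return attention, mlp

attn_params, mlp_params = block_param_counts(d_model=4096)
mlp_share = mlp_params / (attn_params + mlp_params)
print(attn_params, mlp_params, round(mlp_share, 3))  # 67108864 134217728 0.667
```

Under these assumptions the MLP holds about two thirds of each block's weights.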

Residuals and layer norm keep the stack alive

Deep networks are hard to train if every layer must rewrite everything from scratch. Residual connections give each block a shortcut: the block adds a change to the existing representation instead of replacing it completely. That makes it easier for information and gradients to move through dozens of layers.

Layer normalization keeps vector values in a stable range as they pass through the stack. Without normalization, values can drift and training becomes brittle. The details matter to model builders, but the practical mental model is enough for now: residuals preserve paths, normalization stabilizes them.
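Both ideas fit in a few lines. This sketch shows layer normalization without the learned scale and shift that real layers usually include, plus a residual wrapper; the input values are invented to show a vector that has drifted to a large range.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale one vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual(x, sublayer):
    """Add the sublayer's update to x instead of replacing x."""
    return [a + b for a, b in zip(x, sublayer(x))]

x = [100.0, 200.0, 300.0, 400.0]  # values that have drifted to a large range
normed = layer_norm(x)
print([round(v, 3) for v in normed])  # [-1.342, -0.447, 0.447, 1.342]

# A sublayer that proposes no change leaves the representation intact.
unchanged = residual(x, lambda v: [0.0] * len(v))
print(unchanged == x)  # True
```

The second check is the shortcut in miniature: a block that has nothing to add simply passes its input through, so information survives even through layers that contribute little.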

Engineering reality

When people talk about "model size," they usually mean parameter count across embeddings, attention projections, MLPs, and output layers. More parameters can store more patterns, but serving cost also rises. Bigger models need more memory, more bandwidth, and often more aggressive batching or quantization to run cheaply.
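A back-of-the-envelope estimate shows why quantization matters for serving. The 7-billion-parameter figure is a hypothetical example, and the estimate covers the weights alone: it ignores activations, the attention cache, and the small overhead that real quantization formats add.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

num_params = 7_000_000_000  # hypothetical 7-billion-parameter model
for dtype in BYTES_PER_PARAM:
    print(dtype, weight_memory_gb(num_params, dtype))
# fp32 28.0
# fp16 14.0
# int8 7.0
# int4 3.5
```

Halving the bytes per parameter halves the memory the weights need, which is often the difference between fitting on one accelerator and needing several.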

Where the output comes from

After the last transformer block, the model has a final vector for each token position. For text generation, we care about the final position: given everything so far, what token should come next? A final projection maps that vector to one score, called a logit, for every token in the vocabulary. The generation step turns those scores into probabilities, typically with a softmax.
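The last step can be sketched as a matrix multiply followed by a softmax, the usual way to turn raw scores into probabilities. The five-token vocabulary, the random projection weights, and the hidden vectors are all placeholders for illustration.

```python
import math
import random

random.seed(0)
D_MODEL = 4
VOCAB = ["the", "cat", "sat", "mat", "."]  # hypothetical tiny vocabulary

# Output projection: one row of weights per vocabulary entry (random stand-ins).
W_out = [[random.gauss(0.0, 1.0) for _ in range(D_MODEL)] for _ in VOCAB]

def to_logits(hidden):
    """Score every vocabulary token against one hidden vector."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_out]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Pretend these came out of the last block: one vector per position.
hidden_states = [[random.gauss(0.0, 1.0) for _ in range(D_MODEL)] for _ in range(3)]

logits = to_logits(hidden_states[-1])  # only the final position matters here
probs = softmax(logits)
best = max(range(len(VOCAB)), key=lambda i: probs[i])
print(VOCAB[best], round(sum(probs), 6))
```

Softmax preserves the ranking of the scores, so the token with the highest score is also the most probable one. How the model then picks a token from these probabilities belongs to the generation step.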

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What happens between token IDs and the first transformer block?
What does attention do that the MLP does not?
Why do transformer blocks use residual connections?
How does the model turn the final hidden vector into next-token scores?

Quick check

Which sublayer mixes information across token positions?
Self-attention
The MLP only
Layer normalization

Why does the model need position information?
Because token IDs already include all word order
Because order changes meaning, and attention alone has no built-in order
Because the API bills tokens by position

What is the job of a residual connection?
Deleting unimportant tokens
Letting information flow around a sublayer and adding an update
Changing token IDs back into text