Lesson 02

Attention: how tokens look at each other

A token's meaning depends on the tokens around it. Attention is the transformer mechanism that lets every token pull in the parts of the context that matter for the next representation.

The one idea

Attention lets each token build a weighted summary of other tokens. It asks "which previous tokens matter to me right now?" and mixes information from them.

The same word can mean different things

Take the word "bank." In "the river bank was muddy," it means land near water. In "the bank approved the loan," it means a financial institution. The token is the same, but the context changes its meaning. A useful language model needs every token representation to absorb nearby clues.

Older sequence models pushed information forward step by step. Transformers do something more direct: each token can look across the sequence and decide which tokens to read from. That is attention.

Queries, keys, and values

The standard attention explanation has three names: query, key, and value. The names sound more mysterious than the idea.

  • Query: what this token is looking for.
  • Key: what each token advertises so that queries can match against it.
  • Value: the information each token will contribute if it gets attended to.

The model compares a token's query with the key of every token it is allowed to see, itself included. Better matches get higher weights. Then it takes a weighted mix of the values. That mixed vector becomes the information this token carries forward.

[Figure: "it" decides what it refers to. In "The trophy didn't fit because it", the token "it" puts high weight on "trophy" and lower weights elsewhere. The actual weights are learned numbers, not hand-written grammar rules.]
Attention can route information across distance. The token "it" can pull strongly from "trophy" even though other words sit between them.
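
The compare, weight, and mix steps can be sketched for a single token in plain Python. This is a minimal illustration, not how a real model is implemented: the vectors are hand-picked toy numbers (real queries, keys, and values are learned), and the helper names softmax, dot, and attend are made up for this example. The division by the square root of the vector length is the scaling used in standard transformer attention.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product: larger when two vectors point the same way."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Score every key against one query, normalize, then mix the values."""
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in keys]
    weights = softmax(scores)
    width = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return weights, mixed

# Hypothetical toy vectors: the first key matches the query best.
query = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, mixed = attend(query, keys, values)
print(weights)  # the first weight is the largest
print(mixed)    # leans toward the first value vector
```

The best-matching key gets most of the weight, so the mixed vector ends up close to that token's value while still carrying a little of the others.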

Attention is soft lookup

A normal lookup grabs one exact item by key. Attention is softer: it compares against many keys, assigns a score to each, turns those scores into probabilities, then blends the values. If one previous token is clearly relevant, it dominates. If several tokens matter, the model can mix them.
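
To make the contrast concrete, here is a small sketch that puts an exact dictionary lookup next to a soft lookup. The scores and values are invented for illustration, and the values are single numbers to keep the arithmetic visible (in a real model they are vectors). The function name soft_lookup is hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(scores, values):
    """Blend all values, weighted by the softmax of their scores."""
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))

values = [1.0, 2.0, 6.0]

# Hard lookup: one exact key, one item back.
table = {"a": 1.0, "b": 2.0, "c": 6.0}
exact = table["b"]

# Soft lookup, one clearly relevant item: the result sits almost on top of it.
dominated = soft_lookup([0.0, 8.0, 0.0], values)

# Soft lookup, two equally relevant items: the result is a blend of both.
blended = soft_lookup([3.0, 0.0, 3.0], values)

print(exact, dominated, blended)
```

When one score towers over the rest, the soft lookup behaves almost like the exact one. When two scores tie, the result lands between their values.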

The formula is often written like this:

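attention = softmax(query · keys / √d) · values

Here d is the length of the query and key vectors. Dividing by √d keeps the scores from growing too large before the softmax.

Read it as: compare what I need with what every token offers, scale and normalize the scores, then mix the information.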

Causal attention only looks backward

LLMs that generate text are usually trained with causal masking. When predicting the next token, position 5 can look at positions 1 through 5, but not position 6 or 7. During training, the model may process a whole sequence at once, but the mask hides the future so the task stays honest.

This is why generation can run left to right. The model builds each next token from the prompt and the tokens it already generated. It never gets to peek at the answer that has not been written yet.
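
One common way to apply the mask is to set the score for every future position to negative infinity before the softmax, so its weight comes out as exactly zero. The sketch below assumes that approach. The function name causal_weights and the all-zero score table are illustrative only, and positions are counted from 0 here rather than from 1 as in the text.

```python
import math

NEG_INF = float("-inf")

def softmax(scores):
    """Turn raw scores into weights that sum to 1; exp(-inf) is 0.0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_weights(scores):
    """scores[i][j] says how well position i's query matches position j's key.

    Position i may only read from positions 0..i, so later columns are
    masked out before normalizing.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else NEG_INF for j in range(n)]
        weights.append(softmax(masked))
    return weights

# Toy input: every pair matches equally well, so only the mask shapes the result.
scores = [[0.0] * 3 for _ in range(3)]
for row in causal_weights(scores):
    print(row)
```

With equal scores, the first position can only read itself, the second splits its weight over two positions, and the third over three. The weights above the diagonal are always zero.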

Multi-head attention

One attention pattern is not enough. A token might need grammar from one place, factual context from another, and formatting clues from a third. Multi-head attention runs several attention mechanisms in parallel, each with its own learned view of what matters. The outputs are joined and passed onward.
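
A rough sketch of the idea, under simplifying assumptions: each head here works on its own slice of the query, key, and value vectors, and the head outputs are joined end to end. Real transformer blocks also use learned projection matrices before and after this step, which are left out. All names and numbers below are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head attention for one query: score, normalize, mix."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    width = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]

def split_heads(vector, num_heads):
    """Cut one vector into equal slices, one per head."""
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]

def multi_head_attend(query, keys, values, num_heads):
    """Run attention once per head on its own slice, then join the outputs."""
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    output = []
    for h in range(num_heads):
        head_keys = [k[h] for k in k_heads]
        head_values = [v[h] for v in v_heads]
        output.extend(attend(q_heads[h], head_keys, head_values))
    return output

# Toy setup: head 0 matches the first token, head 1 matches the second.
query = [5.0, 0.0, 0.0, 5.0]
keys = [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 5.0]]
values = [[1.0, 1.0, 1.0, 1.0], [9.0, 9.0, 9.0, 9.0]]

out = multi_head_attend(query, keys, values, num_heads=2)
print(out)  # first half is close to 1.0, second half is close to 9.0
```

The two halves of the output come from different tokens, which is the point: each head picked its own source, and the joined result carries both.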

Do not over-interpret individual heads as if each one has a neat human label. Some heads learn visible behaviors, like tracking nearby punctuation or copying names. Many are harder to explain. What matters is that multiple heads give the block several routes for moving information.

Engineering reality

Attention is powerful, but its cost grows quadratically with sequence length. A sequence with twice as many tokens has roughly four times as many token pairs to score during the full prompt pass. This is one reason long context is expensive, even before you ask the model to generate anything.
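
The "four times" figure is easy to check with a little arithmetic. The sketch below only counts token pairs. It ignores the number of heads and layers and the width of the vectors, so treat it as a back-of-the-envelope estimate rather than a cost model. The function names are made up for this example.

```python
def full_pairs(n):
    """Pairs to score when every token can look at every token."""
    return n * n

def causal_pairs(n):
    """Pairs to score when each token sees only itself and earlier tokens."""
    return n * (n + 1) // 2

for n in (1000, 2000, 4000):
    print(n, full_pairs(n), causal_pairs(n))

# Doubling the length multiplies the pair count by about four either way.
print(full_pairs(2000) / full_pairs(1000))      # exactly 4.0
print(causal_pairs(2000) / causal_pairs(1000))  # just under 4
```

Causal masking roughly halves the number of pairs, but the growth is still quadratic: doubling the prompt still means about four times the work.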

Checkpoint

You're ready for the next lesson if you can answer these from memory:

  • What problem does attention solve for token representations?
  • What are queries, keys, and values in plain English?
  • Why does a causal language model mask future tokens?
  • Why does long context make attention more expensive?

Quick check

What does an attention weight represent?

  • How strongly one token reads information from another token
  • How common a token is in the training data
  • How expensive a token is to process

Why does a causal language model mask future tokens?

  • To hide private training examples
  • To stop each position from looking at future tokens
  • To ignore punctuation tokens

What does multi-head attention add?

  • It generates multiple final answers at once
  • It gives the model several learned ways to route information
  • It splits text into tokens more accurately