Lesson 01

Tokenization: how text becomes tokens

An LLM does not read text the way you do. Before the transformer sees anything, the text is chopped into tokens and each token is turned into a number. That small preprocessing step shapes cost, speed, context limits, and some surprisingly weird model behavior.

The one idea

A token is the unit an LLM actually processes. It can be a whole word, part of a word, punctuation, whitespace, or a byte-like chunk. The model predicts tokens, not words.

Why text needs a translation step

Neural networks work with numbers. Raw text is not numbers, so the first job is to turn a string like "The dog ran." into a sequence of token IDs like [791, 5679, 4966, 13]. Those IDs are not meaningful by themselves. They are lookup keys into the model's embedding table, which gives every token a vector the transformer can process.

That means the actual input to an LLM is not a sentence. It is a list of token IDs, then a list of vectors. All the language behavior starts from that list.
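A minimal sketch of that two-step lookup, for illustration only. The four-entry vocabulary, the tiny vector size, and the random values are all assumptions made up for this example; a real model learns its embedding table during training and uses a far larger vocabulary.

```python
import random

# Hypothetical four-entry vocabulary; real tokenizers have tens of thousands of entries.
vocab = {"The": 0, " dog": 1, " ran": 2, ".": 3}

# Toy embedding table: one small vector per token ID.
# Values are random here; a real model learns them during training.
random.seed(0)
EMBED_DIM = 4
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab
]

def encode(pieces):
    """Map already-split text pieces to their token IDs."""
    return [vocab[p] for p in pieces]

def embed(token_ids):
    """Look up one vector per token ID."""
    return [embedding_table[i] for i in token_ids]

token_ids = encode(["The", " dog", " ran", "."])
vectors = embed(token_ids)

print(token_ids)                      # [0, 1, 2, 3]
print(len(vectors), len(vectors[0]))  # 4 4
```

The transformer only ever sees the vectors. The string is gone by the time any "thinking" happens.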

The tokenizer is part of the model contract. Change it and the model is no longer reading the same input it was trained on.

Why not split on words?

A word-based tokenizer sounds obvious until you try to make it real. It needs a giant dictionary. It breaks on typos, names, product IDs, code, URLs, emoji, and languages that do not use spaces the same way English does. It also treats run, runs, running, and rerun as unrelated items unless you add even more rules.

Modern LLMs usually use subword tokenization. Common words can be single tokens. Rare words get split into pieces the model has seen before. So a word like tokenization might become something like token + ization. A strange identifier can be broken into smaller chunks instead of becoming "unknown."

Plain version

Subword tokenization is a compromise: smaller than full words, larger than single characters. It keeps the vocabulary manageable without giving up on weird text.
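To make the splitting concrete, here is a small sketch that uses greedy longest-match against a hand-written vocabulary. This is a simplification chosen for readability, not how any particular production tokenizer works, and the vocabulary below is invented for the example.

```python
def tokenize(text, vocab):
    """Greedy longest-match split: at each position take the longest known piece."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single character
            # so the text never becomes "unknown".
            pieces.append(text[i])
            i += 1
    return pieces

# Hypothetical subword vocabulary.
vocab = {"token", "ization", "run", "ning", "re", "s"}

print(tokenize("tokenization", vocab))  # ['token', 'ization']
print(tokenize("rerunning", vocab))     # ['re', 'run', 'ning']
print(tokenize("xq7", vocab))           # ['x', 'q', '7']
```

Common pieces come out whole, and the strange identifier still gets represented, just with more pieces.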

The tokenizer learns common chunks

A common approach is byte-pair encoding, usually shortened to BPE. The training idea is simple: start with tiny units such as single characters or bytes, then repeatedly merge the adjacent pair that appears together most often. If t and h often appear as th, merge them. If ing appears constantly, later merges build it up and keep it as a useful chunk. After many merges, the tokenizer has a vocabulary of pieces that reflect the text it was trained on.
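The merge loop is short enough to sketch in full. This is a toy version that works on characters and a three-word corpus invented for the example; real tokenizers typically work on bytes, train on far more text, and run tens of thousands of merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical corpus: word -> count. Each word starts as a tuple of characters.
corpus = {"the": 5, "then": 2, "this": 3}
words = {tuple(w): c for w, c in corpus.items()}

merges = []
for _ in range(2):  # two merges are enough to see the pattern
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)  # [('t', 'h'), ('th', 'e')]
print(words)   # {('the',): 5, ('the', 'n'): 2, ('th', 'i', 's'): 3}
```

After two merges, the most frequent word is already a single piece, while the rarer ones still need two or three.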

That history matters. A tokenizer trained mostly on English web text tends to pack English efficiently. A language, script, or domain that was underrepresented may need more tokens for the same amount of human-readable text. More tokens means less fits in context and every request costs more.
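One rough way to see where the gap comes from, assuming a byte-level tokenizer that starts from UTF-8 bytes (an assumption for this sketch; not every tokenizer works this way). Before any merges are learned, each byte is one unit, and scripts outside basic Latin need more bytes per character. Real token counts depend on the learned merges, so treat the byte count as a starting point, not an actual token count.

```python
# Sample greetings chosen for illustration.
samples = {"English": "hello", "Greek": "γεια", "Japanese": "こんにちは"}

byte_counts = {}
for name, text in samples.items():
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))  # units before any merges are applied
    byte_counts[name] = n_bytes
    print(f"{name}: {n_chars} characters, {n_bytes} bytes")

# English: 5 characters, 5 bytes
# Greek: 4 characters, 8 bytes
# Japanese: 5 characters, 15 bytes
```

If the training text contained few examples of a script, fewer merges were learned for it, and more of those bytes survive as separate tokens.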

Tokenization is where cost starts

APIs usually bill by tokens, not words or characters. The transformer also spends compute per token. A 2,000-token prompt is twice as much sequence to process as a 1,000-token prompt, and because standard self-attention compares every token with every other token, that part of the work grows faster than linearly. That is why "just paste the whole document" gets expensive fast.
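The billing arithmetic is simple enough to sketch. The prices below are hypothetical placeholders, not any provider's real rates, and the function name is made up for this example.

```python
# Hypothetical prices in US dollars per 1,000 tokens; check your provider's real rates.
INPUT_PRICE_PER_1K = 0.50
OUTPUT_PRICE_PER_1K = 1.50

def estimate_cost(prompt_tokens, output_tokens):
    """Rough request cost: tokens in each direction times that direction's price."""
    input_cost = prompt_tokens / 1000 * INPUT_PRICE_PER_1K
    output_cost = output_tokens / 1000 * OUTPUT_PRICE_PER_1K
    return input_cost + output_cost

short = estimate_cost(prompt_tokens=1000, output_tokens=0)
longer = estimate_cost(prompt_tokens=2000, output_tokens=0)
print(short, longer)  # 0.5 1.0
```

Doubling the prompt doubles the input bill. Multiply that by every request in a day and the "paste everything" habit shows up on the invoice.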

Engineering reality

Token count is a product constraint. It controls prompt budget, request cost, latency, retrieval chunk size, conversation history length, and the maximum output you can ask for. If a feature accepts user text, logs, documents, or code, count tokens early. Character count is a weak proxy, especially across languages and code-heavy input.
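A sketch of the kind of budget check this implies. The 8,192-token context window and the function name are assumptions for illustration; use the limit documented for the model you actually call, and count tokens with that model's own tokenizer.

```python
# Hypothetical context window size in tokens.
CONTEXT_WINDOW = 8192

def remaining_output_budget(prompt_tokens, history_tokens, retrieved_tokens,
                            context_window=CONTEXT_WINDOW):
    """Tokens left for the model's answer after everything sent in the request."""
    used = prompt_tokens + history_tokens + retrieved_tokens
    return max(context_window - used, 0)

budget = remaining_output_budget(
    prompt_tokens=500, history_tokens=3000, retrieved_tokens=4000
)
print(budget)  # 692

# Too much retrieved text leaves no room for an answer at all.
print(remaining_output_budget(500, 3000, 6000))  # 0
```

Everything in the request competes for the same window: the instructions, the history, the retrieved chunks, and the answer itself.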

Token boundaries can change behavior

Because the model predicts tokens, not characters, tiny text changes can change the token sequence. Leading spaces, capitalization, punctuation, and uncommon spellings may produce different token splits. The model can still handle that, but the splits affect which embeddings it starts with and which continuations look likely.

This is one reason LLMs sometimes feel oddly sensitive to formatting. A clean prompt with stable delimiters is not superstition. It gives the tokenizer and model a more regular pattern to work with.

Many tokenizers encode common words with the preceding space included, because text usually has spaces before words. So " hello" and "hello" can be different tokens. This looks strange until you remember that the tokenizer learned frequent byte patterns, not human grammar.
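A toy encoder makes the leading-space effect visible. The vocabulary and IDs are invented for this example, and the greedy longest-match rule is a simplification, not the exact procedure of any real tokenizer.

```python
# Hypothetical vocabulary: note that "hello" and " hello" are separate entries.
vocab = {"hello": 0, " hello": 1, " world": 2, "world": 3, " ": 4}

def encode(text):
    """Greedy longest-match encoding against the toy vocabulary."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

print(encode("hello world"))   # [0, 2]
print(encode(" hello world"))  # [1, 2]
print(encode("hello  world"))  # [0, 4, 2]  (two spaces add an extra token)
```

Three inputs that look nearly identical to a person produce three different ID sequences for the model.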

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What is a token, and why does the model need token IDs before it can process text?
Why do LLMs usually use subword tokens instead of full words?
How does token count affect context length, cost, and latency?
Why can punctuation, spaces, or uncommon words change the model's input?

Quick check

The next word
The next token
The next character only

They affect context budget, latency, and cost
They decide whether the answer sounds formal
They tell the model which language to use

It forces every word to be a single known dictionary entry
It handles common text efficiently and still represents rare text
It makes the model understand grammar before training