Lesson 06

Embeddings: meaning as geometry

Computers are good at math and bad at meaning. Embeddings bridge the gap: they turn a word, a sentence, an image, or a sound into a point in space, placed so that similar things sit close together. Once meaning is geometry, "find similar" becomes "find nearby", and that one move powers a large share of modern AI products, from search to recommendations to retrieval for LLMs.

The one idea

An embedding is a list of numbers that pins a thing to a location in a high-dimensional space, arranged so that things with similar meaning land near each other. Distance becomes a stand-in for similarity.

From symbols to coordinates

To a computer, the word "cat" is just three characters. It carries no hint that "kitten" is close in meaning and "bulldozer" is not. Text, on its own, has no math you can do with it. You can check if two strings are equal, but "happy" and "joyful" are completely different strings, so equality tells you nothing about meaning.

An embedding fixes that by handing each thing a set of coordinates. Instead of the symbol "cat", you get a list of numbers like [0.12, -0.94, 0.08, ...]. That list is a vector, and a vector is just a point in space, the same way [3, 5] is a point on a 2D graph. The trick is that the model places these points carefully: things that mean similar stuff get put near each other, and unrelated things get put far apart. Meaning turns into position.
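To make the contrast concrete, here is a minimal Python sketch. The three-number vectors are invented for illustration (no real model produced them), and `distance` is a hypothetical helper name. The only point is that strings give you an equality check, while vectors give you a notion of "how far apart."

```python
import math

# String equality says nothing about meaning: synonyms are just different strings.
print("happy" == "joyful")  # False

# Hypothetical 3-number "embeddings", invented for illustration.
# A real model would produce hundreds or thousands of numbers per item.
cat = [0.90, 0.80, 0.10]
kitten = [0.85, 0.90, 0.15]
bulldozer = [0.10, 0.05, 0.95]

def distance(a, b):
    """Straight-line (Euclidean) distance between two points."""
    return math.dist(a, b)

print(distance(cat, kitten))     # small: the points sit close together
print(distance(cat, bulldozer))  # large: the points sit far apart
```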

Plain version

An embedding is a thing's address in "meaning space." Two things with nearby addresses mean roughly the same thing.

What that space looks like

Real embedding spaces have hundreds to a few thousand dimensions, which no one can picture. But the intuition survives in 2D. Below, every word is a point. Similar words clump into neighborhoods, and unrelated neighborhoods sit far apart. The query word "puppy" lands inside the animal cluster, and its nearest neighbors are exactly the words you'd expect.

Words as points. "Puppy" lands in the animal neighborhood, far from the vehicles, and the dashed lines pick out its nearest neighbors. The axes have no human-readable meaning, which is the catch we get to below.

Notice what the picture buys you. To answer "what is most similar to puppy?", you don't parse grammar or look up a dictionary. You just measure which points are closest. The hard problem of meaning collapses into a geometry problem you can solve with arithmetic.
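Here is roughly what "measure which points are closest" looks like in Python. The 2D coordinates are made up by hand to mimic the picture, and `nearest` is a hypothetical helper, so treat this as a sketch of the idea rather than how a real system stores its vectors.

```python
import math

# Hypothetical 2D coordinates, chosen by hand: animals in one corner, vehicles in another.
points = {
    "dog": (1.0, 1.2),
    "cat": (1.4, 0.8),
    "kitten": (1.5, 1.2),
    "truck": (8.0, 7.5),
    "car": (8.4, 7.9),
    "bus": (7.8, 8.2),
}

def nearest(query, points, k=3):
    """Return the k labels whose points are closest to the query point."""
    ranked = sorted(points, key=lambda name: math.dist(query, points[name]))
    return ranked[:k]

puppy = (1.1, 1.0)  # also hand-placed, inside the animal cluster
print(nearest(puppy, points))  # ['dog', 'cat', 'kitten']
```

No grammar, no dictionary: the answer falls out of sorting by distance.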

Where the numbers come from

You don't write these coordinates by hand. They come out of a model that learned them, which ties straight back to the idea from lesson 03: a model doesn't work on raw input, it builds its own internal representation of it. An embedding is exactly that representation, pulled out and used on its own.

The model gets pushed toward useful coordinates during training. The original word-vector models learned by a simple game: predict a word from the words around it. To do that well, the model had to give words that show up in similar contexts similar vectors. "Cat" and "dog" appear near words like "pet", "vet", and "feed", so they drift together. "Truck" appears near "drive", "road", and "diesel", so it ends up elsewhere. Nobody told the model that cats and dogs are both animals. The structure fell out of the patterns in the text. Modern embedding models are bigger and trained differently, but the principle holds: similar usage produces nearby vectors.

The classic word2vec demo showed that you could do arithmetic on word vectors: take the vector for king, subtract man, add woman, and the closest vector is roughly queen. The intuition is gorgeous: a direction in the space seems to encode "royalty" and another encodes "gender", so you can move along them like axes.

Treat it as intuition, not a law. When researchers checked these analogies carefully, they found that many of the clean results lean on a quiet trick: the search for the answer excludes the input words themselves. If you don't exclude them, "king - man + woman" often just returns "king" again. The directions are real and the effect is suggestive, but it is fragile and depends on the exact words. The honest takeaway is the first one: similar meanings sit near each other. The neat algebra is a bonus that doesn't always hold.
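You can see the exclusion trick on a toy example. The 2D vectors below are rigged by hand so the effect shows up, so they say nothing about any real model, and `analogy` is a hypothetical helper name. The sketch computes king - man + woman and looks for the closest word by cosine similarity, once with the input words excluded and once without.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical 2D word vectors, hand-picked for this demo only.
vectors = {
    "king": [0.9, 0.5],
    "queen": [0.9, 1.1],
    "prince": [0.8, 0.3],
    "man": [0.1, 0.5],
    "woman": [0.1, 0.7],
}

def analogy(a, b, c, vectors, exclude_inputs=True):
    """Word whose vector is closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    skip = {a, b, c} if exclude_inputs else set()
    candidates = [word for word in vectors if word not in skip]
    return max(candidates, key=lambda word: cosine(target, vectors[word]))

print(analogy("king", "man", "woman", vectors))                        # queen
print(analogy("king", "man", "woman", vectors, exclude_inputs=False))  # king
```

Because "man" and "woman" sit close together, subtracting one and adding the other barely moves you away from "king", which is why the starting word wins unless you rule it out.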

How you compare two vectors

"Near" needs a number. The usual measure is cosine similarity: it looks at the angle between two vectors rather than how long they are. Point in the same direction and cosine similarity is 1 (identical meaning). At right angles it's 0 (unrelated). Pointing opposite, it's -1. Direction is what carries the meaning, so the angle is what you score, and the length of the vector is mostly ignored.

In code it's a one-liner over the two number lists. The idea, not the syntax, is the point:

cos(a, b) = dot(a, b) / (norm(a) * norm(b))   # 1 = same direction, 0 = unrelated, -1 = opposite direction
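Written out in plain Python, the formula might look like the sketch below. The function name `cosine_similarity` and the zero-vector check are choices made for this example; numerical libraries offer faster equivalents.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (equal-length lists of numbers)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [2, 0]))   # 1.0  (same direction, different length)
print(cosine_similarity([1, 0], [0, 3]))   # 0.0  (right angles)
print(cosine_similarity([1, 0], [-4, 0]))  # -1.0 (opposite direction)
```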

You'll also see plain Euclidean distance (straight-line distance between the points) and the dot product. For text embeddings cosine similarity is the common default, partly because it shrugs off differences in vector length that don't reflect meaning. The mental model stays the same either way: a single number that says how close two things are.
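To see how the three measures can disagree, here is a small sketch with hand-picked 2D vectors (illustrative only). `b` points the same way as `a` but is twice as long, while `c` is about the same length as `a` but points in a different direction.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.dist(a, b)

a = [1, 2]
b = [2, 4]  # same direction as a, twice the length
c = [2, 1]  # same length as a, different direction

for name, other in (("b", b), ("c", c)):
    print(name,
          "cosine:", round(cosine(a, other), 3),
          "euclidean:", round(euclidean(a, other), 3),
          "dot:", dot(a, other))
```

Cosine rates `b` as the better match for `a` because only direction counts. Euclidean distance rates `c` as closer because `b` is farther away in raw position. The dot product also prefers `b`, but its score grows with vector length as well as with alignment.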

Why "near = similar" is so useful

Once everything is a point and you can measure closeness, a surprising number of products become the same operation: embed the query, then find the nearest points.

Semantic search. Old search matched keywords, so "how do I fix a flat tire" missed a page titled "repairing a punctured wheel." Embed both and they land near each other, because they mean the same thing. You search by meaning, not by exact words.
Recommendations. Embed songs, products, or articles, then suggest the nearest neighbors of what someone already liked. "More like this" is literally "closest points to this."
RAG (retrieval-augmented generation). Before an LLM answers, you embed the question, find the most similar chunks of your documents, and paste those into the prompt. The model answers from your data instead of guessing. The retrieval step is pure nearest-neighbor search over embeddings, which is why this lesson sits under everything in the RAG track.
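All three uses share the same retrieval step, sketched below for the RAG case. Everything here is a stand-in: the chunk vectors and the query vector are invented by hand in place of a real embedding model, and `retrieve` and `build_prompt` are hypothetical helper names, not any library's API.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical pre-computed index: (chunk text, hand-made embedding).
index = [
    ("Repairing a punctured wheel: remove the tire and patch the tube.", [0.9, 0.1, 0.1]),
    ("Our refund policy for online orders.", [0.1, 0.9, 0.2]),
    ("Checking tire pressure before a long ride.", [0.8, 0.2, 0.3]),
]

def retrieve(query_vector, index, k=2):
    """Return the k chunk texts whose embeddings are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, chunks):
    """Paste the retrieved chunks into the prompt ahead of the question."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

question = "How do I fix a flat tire?"
query_vector = [0.85, 0.15, 0.15]  # stand-in for what an embedding model would return

chunks = retrieve(query_vector, index)
print(build_prompt(question, chunks))
```

Swap the chunks for songs or products and drop the prompt step, and the same `retrieve` function is a "more like this" recommender.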

And it isn't only text. The same idea applies to images and audio: a model can embed a photo or a sound clip into the same kind of vector, so "find similar images" or "find this song" is again "find nearby points." Some models even embed text and images into one shared space, so you can search images with a text query. Different modality, identical move.

The catch: the axes mean nothing to you

It's tempting to think dimension 1 is "animal-ness" and dimension 7 is "speed." It almost never is. The model arranges the space to make its training task work, not to be readable by humans. Individual dimensions usually carry no clean label. What's meaningful is the relative geometry: which points are near which, and roughly which directions separate groups. You get reliable "these two are similar," not "dimension 4 is the formality knob." Don't try to interpret a single coordinate.
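One way to convince yourself that single coordinates carry no guaranteed meaning: rotate the whole space. In the sketch below (hand-made 2D vectors, hypothetical helper names), every coordinate changes, yet every pairwise similarity stays the same. This is not a claim about how any particular model picks its axes, only a demonstration that the same relative geometry can be written in many coordinate systems.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rotate(v, degrees):
    """Rotate a 2D vector counterclockwise by the given angle."""
    t = math.radians(degrees)
    x, y = v
    return [x * math.cos(t) - y * math.sin(t), x * math.sin(t) + y * math.cos(t)]

cat, dog, truck = [0.9, 0.3], [0.8, 0.4], [-0.2, 0.9]
before = (cosine(cat, dog), cosine(cat, truck))

r_cat, r_dog, r_truck = (rotate(v, 40) for v in (cat, dog, truck))
after = (cosine(r_cat, r_dog), cosine(r_cat, r_truck))

print(cat, "becomes", r_cat)  # the individual coordinates are different
print(before)
print(after)                  # the similarities match, up to floating-point rounding
```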

Engineering reality

Embeddings are cheap to love in a demo and full of sharp edges in production. The ones that bite:

You need a vector database. Comparing a query against every stored vector, one by one, costs time proportional to the size of the corpus, which gets too slow once you have millions of vectors and real query traffic. Real systems use approximate nearest neighbor (ANN) search through a vector library, database extension, or managed store (FAISS, pgvector, Pinecone, and friends). "Approximate" is the deal: you trade a tiny bit of accuracy for a massive speedup, and you tune that trade-off.
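To give a feel for the "approximate" trade, here is a toy index using random-hyperplane hashing, one simple ANN technique. It is an assumption-laden sketch, not how FAISS, pgvector, or Pinecone work internally: the class `HyperplaneIndex` and its parameters are invented for this example, and production libraries use more sophisticated index structures. The idea is to sort vectors into buckets by which side of a few random hyperplanes they fall on, then score only the query's bucket.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class HyperplaneIndex:
    """Toy approximate index: bucket vectors by their side of random hyperplanes."""

    def __init__(self, dim, n_planes=4, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
        self.buckets = {}

    def _key(self, v):
        # One boolean per hyperplane: is the vector on its positive side?
        return tuple(sum(p * x for p, x in zip(plane, v)) >= 0 for plane in self.planes)

    def add(self, label, v):
        self.buckets.setdefault(self._key(v), []).append((label, v))

    def search(self, query, k=1):
        # Only the query's own bucket is scored, so true neighbors in other buckets are missed.
        candidates = self.buckets.get(self._key(query), [])
        ranked = sorted(candidates, key=lambda item: cosine(query, item[1]), reverse=True)
        return [label for label, _ in ranked[:k]]

rng = random.Random(1)
vectors = {f"doc{i}": [rng.gauss(0, 1) for _ in range(8)] for i in range(200)}
index = HyperplaneIndex(dim=8)
for label, vec in vectors.items():
    index.add(label, vec)

query = vectors["doc17"]
print(index.search(query, k=3))  # doc17 comes first, since it shares its own bucket
print(len(index.buckets[index._key(query)]), "of", len(vectors), "vectors scored")
```

More hyperplanes mean smaller buckets and faster searches, but a higher chance that a true neighbor lands in a different bucket. That is the accuracy-for-speed dial in miniature.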

Embeddings from different models are not comparable. A vector from one model and a vector from another live in different spaces. Their cosine similarity is meaningless. Everything you compare must come from the same model and the same version.

Swapping models means re-embedding everything. Because of the point above, the day you upgrade your embedding model, every stored vector is stale. You have to re-embed your whole corpus and rebuild the index. That's a real cost and a real outage risk, so plan for it.
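One way to guard against both problems is to store the model identifier next to the vectors and refuse to mix them. The sketch below is a minimal illustration under stated assumptions: `Index`, `rebuild`, and the two `embed_*` functions are hypothetical, and the embedders are trivial stand-ins for real models. It also shows a rebuild done alongside the live index, with the swap happening only once the new one is complete.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class Index:
    """Toy in-memory index that remembers which model produced its vectors."""

    def __init__(self, model_id, embed_fn):
        self.model_id = model_id
        self.embed_fn = embed_fn
        self.vectors = {}

    def add(self, doc_id, text):
        self.vectors[doc_id] = self.embed_fn(text)

    def search(self, query_vector, query_model_id):
        if query_model_id != self.model_id:
            raise ValueError(
                f"query embedded with {query_model_id!r}, index built with {self.model_id!r}"
            )
        return max(self.vectors, key=lambda d: cosine(query_vector, self.vectors[d]))

def rebuild(corpus, model_id, embed_fn):
    """Re-embed the whole corpus with the given model into a fresh index."""
    fresh = Index(model_id, embed_fn)
    for doc_id, text in corpus.items():
        fresh.add(doc_id, text)
    return fresh

# Stand-in embedders, invented for this sketch; real ones would call a model.
def embed_v1(text):
    return [float(len(text)), 1.0]

def embed_v2(text):
    return [1.0, float(len(text)), float(text.count(" ") + 1)]

corpus = {"a": "flat tire repair", "b": "refund policy"}
live = rebuild(corpus, "toy-v1", embed_v1)
staged = rebuild(corpus, "toy-v2", embed_v2)  # built alongside the live index
live = staged                                 # swap only once the rebuild is complete
print(live.search(embed_v2("refund policy"), "toy-v2"))  # b
```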

Dimensions cost money. More dimensions (say 3072 vs 384) can capture more nuance but take more storage and make every search slower. For millions of vectors that adds up fast, so bigger is not automatically better.
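A back-of-the-envelope calculation shows how fast it adds up. The sketch assumes each number is stored as a 4-byte float and counts only the raw vectors, ignoring any index overhead. Both are assumptions, and real stores may compress or quantize.

```python
def raw_vector_bytes(n_vectors, dims, bytes_per_value=4):
    """Storage for the raw vectors alone, assuming fixed-width values."""
    return n_vectors * dims * bytes_per_value

for dims in (384, 3072):
    gigabytes = raw_vector_bytes(1_000_000, dims) / 1e9
    print(f"{dims} dims x 1,000,000 vectors: {gigabytes:.2f} GB")
# 384 dims x 1,000,000 vectors: 1.54 GB
# 3072 dims x 1,000,000 vectors: 12.29 GB
```

Going from 384 to 3072 dimensions is an 8x jump in storage, and each similarity computation touches 8x as many numbers.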

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What is an embedding, and what does "distance" represent in an embedding space?
Where do the numbers in an embedding come from? Tie it back to what a neural network learns.
What does cosine similarity measure, and what do values of 1, 0, and -1 mean?
Name three products that are really just "find the nearest points," and say what gets embedded in each.
Why can't you compare an embedding from one model against an embedding from another?

Quick check

The two things they represent have similar meaning
The two words have a similar number of letters
The two words are next to each other alphabetically

They mean almost exactly the same thing
They are essentially unrelated in meaning
They have directly opposite meanings

Nothing, old and new embeddings can be compared directly
You must re-embed your whole corpus with the new model and rebuild the index
You just pad the old vectors with zeros to match the new size