Embeddings and vector search for RAG

Vector search is the usual entry point into RAG because it can find text that is similar in meaning, not just text that shares keywords. That makes it powerful, but it also gives people too much confidence too quickly.

The one idea

An embedding model turns text into a vector. Vector search retrieves chunks whose vectors are close to the query vector. Close usually means semantically related; it does not guarantee relevance.

What gets embedded

During indexing, each chunk is passed through an embedding model. The result is a list of numbers. You store that vector alongside the chunk text and metadata. During a search, the user query is embedded with the same model, and the index returns nearby chunk vectors.
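As a rough illustration of the indexing and query flow described above, here is a minimal sketch. The `toy_embed` function is a hypothetical stand-in for a real embedding model: it only hashes words into buckets, so it captures word overlap rather than meaning. The chunk fields and file names are made up for the example.

```python
import hashlib
import math

DIM = 256  # toy dimensionality; real models define their own vector size


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.sha256(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit length


def cosine(a: list[float], b: list[float]) -> float:
    # Both vectors are unit length, so the dot product equals the cosine.
    return sum(x * y for x, y in zip(a, b))


def build_index(chunks: list[dict]) -> list[dict]:
    # Store the vector alongside the chunk text and metadata.
    return [{**chunk, "vector": toy_embed(chunk["text"])} for chunk in chunks]


def search(index: list[dict], query: str, k: int = 2) -> list[dict]:
    # The query goes through the same embedding function as the chunks.
    q = toy_embed(query)
    ranked = sorted(index, key=lambda row: cosine(q, row["vector"]), reverse=True)
    return ranked[:k]


chunks = [
    {"id": "a", "text": "rotate api keys every ninety days", "source": "security.md"},
    {"id": "b", "text": "keyboard shortcuts for the editor", "source": "editor.md"},
    {"id": "c", "text": "billing invoices are sent monthly", "source": "billing.md"},
]
index = build_index(chunks)
top = search(index, "rotate api keys", k=1)
print(top[0]["id"], top[0]["source"])
```

The detail that carries over to a real system is that `build_index` and `search` call the same embedding function. Changing the model on one side without re-embedding the other would place queries and chunks in different spaces.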

The embedding model matters because it defines the geometry of your search space. If it was not trained to handle your language, domain, code style, or query shape, the nearest chunks can be wrong even though the search itself ran correctly.

Similarity is not relevance

Vector similarity answers a narrow question: are these pieces of text close in embedding space? A relevant answer also depends on freshness, permissions, document type, exact terms, product version, and whether the chunk contains the answer rather than merely discusses the topic.

A query like "how do I rotate keys?" might retrieve chunks about API keys, encryption keys, keyboard shortcuts, or SSH keys. Semantic similarity can get you into the right neighborhood. It does not always pick the right door.

Approximate nearest neighbor search

Exact nearest neighbor search compares the query vector with every stored vector, so its cost grows linearly with the number of vectors and becomes expensive as the index grows. Vector databases usually use approximate nearest neighbor indexes, often called ANN indexes, to trade a little recall for speed.

That tradeoff is reasonable, but it is still a tradeoff. Index settings can affect whether the best chunk is returned, how long the query takes, and how much memory the index uses.
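To make the tradeoff concrete, the sketch below compares exact search against a deliberately simple approximate scheme: random-hyperplane hashing, where only vectors in the query's bucket are scored. This is an assumption chosen for brevity, not what any particular vector database does; production systems typically use structures such as HNSW or IVF. The data is random and every constant (`DIM`, `N`, `K`, `BITS`) is arbitrary.

```python
import math
import random

random.seed(0)
DIM, N, K, BITS = 16, 2000, 5, 4  # arbitrary toy settings


def unit_vec() -> list[float]:
    v = [random.gauss(0, 1) for _ in range(DIM)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    # On unit vectors the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


vectors = [unit_vec() for _ in range(N)]
planes = [unit_vec() for _ in range(BITS)]


def bucket_of(v: list[float]) -> tuple:
    # Signature: which side of each random hyperplane the vector falls on.
    return tuple(dot(v, p) >= 0 for p in planes)


# Approximate index: bucket signature -> ids of the vectors in that bucket.
buckets: dict[tuple, list[int]] = {}
for i, v in enumerate(vectors):
    buckets.setdefault(bucket_of(v), []).append(i)


def exact_top_k(q: list[float]) -> list[int]:
    # Scores the query against every stored vector.
    return sorted(range(N), key=lambda i: dot(q, vectors[i]), reverse=True)[:K]


def approx_top_k(q: list[float]) -> tuple[list[int], int]:
    # Scores only the vectors that share the query's bucket.
    candidates = buckets.get(bucket_of(q), [])
    ranked = sorted(candidates, key=lambda i: dot(q, vectors[i]), reverse=True)
    return ranked[:K], len(candidates)


queries = [unit_vec() for _ in range(50)]
hits = scanned = 0
for q in queries:
    exact = set(exact_top_k(q))
    approx, n_scanned = approx_top_k(q)
    hits += len(exact & set(approx))
    scanned += n_scanned

recall = hits / (K * len(queries))
avg_scanned = scanned / len(queries)
print(f"recall@{K} vs exact: {recall:.2f}; avg vectors scored: {avg_scanned:.0f} of {N}")
```

Raising `BITS` makes buckets smaller, so fewer vectors are scored per query and more true neighbors are missed. Real ANN indexes expose different knobs, but they move along the same recall, latency, and memory axes.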

Vector database responsibilities

A vector database or vector index usually handles:

Storage for vectors, chunk IDs, and metadata.
Nearest neighbor search over vector embeddings.
Metadata filters such as tenant, source, language, product, or date.
Deletes and updates when source documents change.
Index tuning for latency, memory, and recall.

The database does not understand your product truth by itself. It only searches what you stored and filters by what you modeled.
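A small sketch of metadata filtering, assuming a hypothetical `Row` record with `tenant` and `product_version` fields and hand-written two-dimensional vectors. It filters first and ranks second, so a closer vector from the wrong tenant or version can never be returned.

```python
from dataclasses import dataclass


@dataclass
class Row:
    chunk_id: str
    vector: list[float]
    tenant: str            # hypothetical metadata fields for this example
    product_version: str


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def filtered_search(rows, query_vec, *, tenant, product_version, k=3):
    # Apply the metadata filter before ranking by similarity.
    allowed = [
        r for r in rows
        if r.tenant == tenant and r.product_version == product_version
    ]
    return sorted(allowed, key=lambda r: dot(query_vec, r.vector), reverse=True)[:k]


rows = [
    Row("acme-v2-keys", [0.9, 0.1], "acme", "v2"),
    Row("acme-v1-keys", [1.0, 0.0], "acme", "v1"),      # closer, wrong version
    Row("globex-v2-keys", [1.0, 0.0], "globex", "v2"),  # closer, wrong tenant
]
results = filtered_search(rows, [1.0, 0.0], tenant="acme", product_version="v2")
print([r.chunk_id for r in results])
```

The filter can only use fields that were stored at indexing time. If `product_version` had never been recorded, no query-time setting could separate the v1 chunk from the v2 chunk.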

Common vector search failures

Semantic drift. Results are topically related but do not answer the question.
Exact-token misses. Error codes, function names, SKUs, and IDs may need keyword search.
Embedding mismatch. The model embeds queries and documents poorly for your domain.
Filter mistakes. Search finds good chunks from the wrong tenant, version, region, or document status.
Score over-trust. A high similarity score is treated as proof that the chunk supports the answer.
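One way to guard against exact-token misses is to combine vector results with a literal match on identifier-like tokens. The sketch below is a simplified stand-in for real keyword search: the `IDENT` pattern (any token containing a digit or underscore) is a hypothetical heuristic, the chunk texts are invented, and `vector_ranked` is assumed to be the ordered output of a separate vector search.

```python
import re

# Hypothetical heuristic: a token with a digit or underscore looks like an identifier.
IDENT = re.compile(r"[A-Za-z0-9_-]*[0-9_][A-Za-z0-9_-]*")


def exact_token_hits(query: str, chunks: dict[str, str]) -> list[str]:
    # Chunk ids whose text contains any identifier from the query verbatim.
    idents = IDENT.findall(query)
    return [cid for cid, text in chunks.items() if any(tok in text for tok in idents)]


def hybrid_rank(query: str, chunks: dict[str, str], vector_ranked: list[str]) -> list[str]:
    # Exact identifier matches first, then the remaining vector results in order.
    exact = exact_token_hits(query, chunks)
    rest = [cid for cid in vector_ranked if cid not in exact]
    return exact + rest


chunks = {
    "errors": "E1042 means the access token expired",
    "auth": "authentication errors and how to debug them",
    "retry": "retry policy for failed requests",
}
# Assumed vector-search output: topically related chunks outrank the exact match.
vector_ranked = ["auth", "retry", "errors"]

print(hybrid_rank("what does error E1042 mean", chunks, vector_ranked))
```

The substring match here is case-sensitive and unweighted. A production setup would more likely run a proper keyword index next to the vector index and merge the two ranked lists, but the reason for doing so is the same.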

Engineering reality

Track retrieval recall against a set of known question-and-source pairs. If the right chunk is not in the retrieved candidate set, prompt tweaks and better answer formatting will not fix the system.
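A minimal sketch of that measurement, assuming each evaluation item pairs a question with the ID of the one chunk that should be retrieved. The `retrieve` callable, the chunk IDs, and the canned results are all hypothetical; in practice `retrieve` would wrap your actual search call.

```python
from typing import Callable


def recall_at_k(
    eval_set: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> tuple[float, list[str]]:
    """Return the fraction of questions whose expected chunk is in the top k,
    plus the questions that missed, for debugging."""
    misses = []
    for question, expected_id in eval_set:
        if expected_id not in retrieve(question, k):
            misses.append(question)
    return 1 - len(misses) / len(eval_set), misses


eval_set = [
    ("how do I rotate API keys?", "security.md#rotate"),
    ("what does E1042 mean?", "errors.md#E1042"),
    ("how do I export invoices?", "billing.md#export"),
]

# Canned results standing in for a real retriever.
fake_results = {
    "how do I rotate API keys?": ["security.md#rotate", "ssh.md#keys"],
    "what does E1042 mean?": ["auth.md#debug", "retry.md#policy"],
    "how do I export invoices?": ["billing.md#export"],
}


def fake_retrieve(question: str, k: int) -> list[str]:
    return fake_results[question][:k]


recall, misses = recall_at_k(eval_set, fake_retrieve, k=2)
print(f"recall@2 = {recall:.2f}")
print("missed:", misses)
```

The list of missed questions is often more useful than the score itself, because it shows which kinds of queries the retriever fails on.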

Checkpoint

You're ready for the next lesson if you can answer these from memory:

Why must queries and document chunks be embedded with the same (or a compatible) model?
Why is vector similarity different from answer relevance?
What tradeoff does approximate nearest neighbor search make?
What kinds of queries often need more than vector search?
