The RAG retrieval pipeline

A working retrieval-augmented generation (RAG) system has two paths: an offline path that prepares documents for search, and an online path that answers a user request. Keeping those paths separate makes the system much easier to debug.

The one idea

RAG quality depends on the whole pipeline, not just the model call. Bad parsing, missing metadata, weak ranking, or messy context can break the answer before generation starts.

Two paths, one answer

The offline path runs before the user asks anything. It turns source material into searchable units. The online path runs at request time. It interprets the question, searches the index, selects evidence, and asks the model to answer.

This split matters because the work has different constraints. Offline ingestion can be slow and careful. Online retrieval must be fast enough for a user-facing request.

The index is built ahead of time. At request time, the system only searches that index and assembles evidence for the model.

The offline path

Ingestion starts with source documents: HTML pages, PDFs, Markdown files, tickets, database rows, transcripts, code, or anything else the system should know how to search. The job is to turn those sources into chunks with stable IDs, clean text, metadata, and optional embeddings.

A good ingestion pipeline records where every chunk came from. At minimum, keep the source URL or file path, title, section heading, updated time, access scope, and a chunk ID. Later, when an answer cites something, those fields are what make the citation real.
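One way to picture the provenance fields listed above is a small record type. This is a minimal sketch, not a prescribed schema: the `Chunk` class, its field names, and the `make_chunk_id` helper are all assumptions made for illustration. The ID is derived from where the chunk sits in the source rather than from its text, so it stays the same when the wording is edited.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Chunk:
    """One searchable unit plus the provenance needed to cite it (hypothetical schema)."""
    chunk_id: str
    text: str
    source: str          # source URL or file path
    title: str
    section: str         # section heading the chunk came from
    updated_at: datetime
    access_scope: str    # who is allowed to retrieve this chunk


def make_chunk_id(source: str, section: str, position: int) -> str:
    """Derive a stable ID from the chunk's location, not from its text."""
    key = f"{source}\n{section}\n{position}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


# Example values below are invented for illustration.
chunk = Chunk(
    chunk_id=make_chunk_id("docs/hr/vacation.md", "Accrual", 0),
    text="Employees accrue vacation days monthly.",
    source="docs/hr/vacation.md",
    title="Vacation policy",
    section="Accrual",
    updated_at=datetime(2024, 1, 15, tzinfo=timezone.utc),
    access_scope="all-employees",
)
```

Whether the ID should depend on position, on content, or on both is a design choice. A position-based ID survives text edits but shifts if sections are reordered.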

Do not treat parsing as boring plumbing. PDF footers, nav menus, duplicated sidebars, broken tables, and weird Unicode can poison retrieval. If the index stores junk, search will confidently find junk.
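As a rough illustration of what cleaning can look like, the sketch below normalizes Unicode and drops lines that repeat across most pages, which is a common signature of headers and footers. The function names and the `min_fraction` threshold are assumptions, and real parsers usually need format-specific handling on top of this.

```python
import unicodedata
from collections import Counter

# Zero-width characters that often survive PDF and HTML extraction.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))


def normalize_text(text: str) -> str:
    """Fold compatibility characters (ligatures, odd spaces) and drop zero-width ones."""
    text = unicodedata.normalize("NFKC", text)
    return text.translate(ZERO_WIDTH)


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Remove lines that appear on at least min_fraction of pages (likely headers or footers)."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, int(len(pages) * min_fraction))
    boilerplate = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in boilerplate]
        cleaned.append("\n".join(kept).strip())
    return cleaned
```

A frequency threshold like this can also remove legitimate repeated lines, so it is worth spot-checking what gets dropped before indexing.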

The online path

At request time, the system receives a question and turns it into one or more searches. Sometimes the raw user question is enough. Sometimes you rewrite it, expand acronyms, add filters, or split a broad question into sub-queries.

The first retrieval pass usually grabs more candidates than the model will see. Then ranking, filtering, deduplication, and context assembly decide which chunks make it into the prompt. This is where a lot of quality is won or lost.
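The narrowing step from many candidates to a small context can be sketched like this. The `Candidate` type, the whitespace-and-case fingerprint used for deduplication, and the `max_chunks` and `max_chars` budgets are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    chunk_id: str
    text: str
    score: float  # higher means more relevant


def assemble_context(candidates: list[Candidate], max_chunks: int = 3,
                     max_chars: int = 2000) -> list[Candidate]:
    """Pick the best-scoring, non-duplicate chunks that fit the context budget."""
    seen: set[str] = set()
    selected: list[Candidate] = []
    used = 0
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        # Treat chunks that differ only in case or whitespace as duplicates.
        fingerprint = " ".join(cand.text.lower().split())
        if fingerprint in seen:
            continue
        # Skip chunks that would exceed the character budget.
        if used + len(cand.text) > max_chars:
            continue
        seen.add(fingerprint)
        selected.append(cand)
        used += len(cand.text)
        if len(selected) == max_chunks:
            break
    return selected
```

A real pipeline would often put a reranking model between the first pass and this step. The sketch only shows the filtering and budgeting logic.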

Metadata is part of retrieval

Embeddings capture semantic similarity, but metadata gives you control. If a user asks about the current vacation policy, you probably want filters like doc_type:policy, status:published, and department:hr. If a customer asks about their contract, access control and account ID matter more than semantic similarity.

Without metadata, the retriever has to infer everything from text similarity. That is fragile. With metadata, you can narrow the search space before ranking.
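To make "narrow the search space before ranking" concrete, here is a small in-memory sketch. The chunk dictionaries, their `metadata` and `embedding` keys, and the toy two-dimensional vectors are assumptions. A real vector store would apply the filter inside the index rather than in a Python loop.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec: list[float], chunks: list[dict],
                    filters: dict[str, str], top_k: int = 5) -> list[dict]:
    """Keep only chunks whose metadata matches every filter, then rank by similarity."""
    eligible = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(eligible, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return ranked[:top_k]


# Toy data: the draft is the closest match by similarity but must not be returned.
chunks = [
    {"id": "policy-draft", "embedding": [1.0, 0.0],
     "metadata": {"doc_type": "policy", "status": "draft"}},
    {"id": "policy-2024", "embedding": [0.8, 0.6],
     "metadata": {"doc_type": "policy", "status": "published"}},
    {"id": "faq-vacation", "embedding": [0.9, 0.1],
     "metadata": {"doc_type": "faq", "status": "published"}},
]
results = filtered_search([1.0, 0.0], chunks, {"doc_type": "policy", "status": "published"})
```

In this toy data the draft has the highest similarity to the query, yet the filter removes it before ranking, which is the behavior the paragraph above describes.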

Common pipeline failures

Stale index. The source changed, but the indexed chunks did not.
Lost provenance. The chunk has text, but no useful source URL, title, or section.
Bad access control. A user can retrieve content they should not see.
Overstuffed context. Too many chunks bury the useful evidence.
No retrieval logs. You cannot inspect which chunks were returned for a bad answer.
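Two of these failures, the stale index and bad access control, lend themselves to simple checks. The sketch below assumes the index stores a content hash per source and an `access_scope` per chunk. Both the storage layout and the function names are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Fingerprint of a source's current text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale_sources(current_sources: dict[str, str],
                       indexed_hashes: dict[str, str]) -> list[str]:
    """Return IDs of sources that changed since indexing or were never indexed."""
    return sorted(
        source_id
        for source_id, text in current_sources.items()
        if indexed_hashes.get(source_id) != content_hash(text)
    )


def visible_chunks(chunks: list[dict], user_scopes: set[str]) -> list[dict]:
    """Drop any chunk whose access scope the user does not hold."""
    return [c for c in chunks if c["access_scope"] in user_scopes]
```

The access check here runs on retrieved chunks. Applying the same scope as a search filter as well avoids spending the candidate budget on chunks the user cannot see.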

Engineering reality

Log the query, filters, candidate IDs, scores, final context IDs, model response, and citations for every RAG request. Without that trace, debugging becomes guesswork.
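A trace with the fields listed above could be written as one JSON line per request. This is a sketch under assumptions: the field names, the `build_trace` and `trace_to_json_line` helpers, and the choice of JSON Lines are illustrative, and any structured logging setup would serve.

```python
import json
import time
import uuid


def build_trace(query: str, filters: dict, candidates: list[tuple[str, float]],
                context_ids: list[str], response: str, citations: list[str]) -> dict:
    """Collect everything needed to replay one RAG request (hypothetical field names)."""
    return {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "filters": filters,
        "candidates": [{"chunk_id": cid, "score": score} for cid, score in candidates],
        "context_ids": context_ids,
        "response": response,
        "citations": citations,
    }


def trace_to_json_line(trace: dict) -> str:
    """Serialize a trace as a single line suitable for an append-only log file."""
    return json.dumps(trace, sort_keys=True) + "\n"
```

With both candidate IDs and final context IDs in the trace, you can tell whether a bad answer came from retrieval (the right chunk was never a candidate) or from selection (it was a candidate but got dropped).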

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What work belongs in the offline indexing path?
What work belongs in the online request path?
Why is metadata not optional in a serious RAG system?
What should you log for a RAG request?

Quick check

Why keep the offline and online paths separate?
They have different latency and quality constraints
Because offline uses small models and online uses large models
Because offline quality does not affect answers

What does a chunk need so that a citation can point back to its source?
A short title that sounds official
A stable source ID, title, URL or path, and section metadata
Only the vector similarity score