Lesson 01

What is retrieval-augmented generation?

Retrieval-augmented generation (RAG) gives an LLM a search step before it answers. Instead of asking the model to rely only on what it learned during training, you retrieve relevant source material and ask the model to answer from that context.

The one idea

RAG is a system pattern: retrieve evidence first, generate second. The model still writes the answer, but the system decides what facts are allowed into the room.

The basic loop

A plain LLM call has one main input: the prompt. The model reads it, uses its learned weights and the current context window, then generates tokens. That works well for general reasoning and writing, but it is weak when the answer depends on private docs, fresh product state, internal policies, tickets, logs, or exact citations.

RAG adds a retrieval step before generation:

  1. The user asks a question.
  2. The system searches a document collection for relevant chunks.
  3. The system puts those chunks into the model context.
  4. The model answers using the retrieved evidence.

[Diagram: Question ("What does policy say?") -> Retriever (find matching source chunks) -> Prompt (question + evidence + rules) -> A]

RAG is not a model type. It is the retrieval and prompting machinery around a model.
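
The four steps above can be sketched in a few lines of Python. This is a toy illustration, not a production design. The corpus, the chunk ids, the word-overlap scoring, and the `generate` placeholder are all assumptions made for the example. A real system would typically use a proper search index and a real model call.

```python
import re

# Hypothetical in-memory corpus; a real system would load and chunk documents.
CHUNKS = [
    {"id": "policy-1", "text": "Employees may work remotely up to three days per week."},
    {"id": "policy-2", "text": "Expense reports are due by the fifth of each month."},
    {"id": "faq-1", "text": "The office cafeteria opens at eight in the morning."},
]


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, chunks, k=2):
    """Step 2: rank chunks by word overlap with the question (toy scoring)."""
    q = tokenize(question)
    scored = [(len(q & tokenize(c["text"])), c) for c in chunks]
    scored = [(score, c) for score, c in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def build_prompt(question, evidence):
    """Step 3: combine the rules, the evidence, and the question into one prompt."""
    sources = "\n".join(f"[{c['id']}] {c['text']}" for c in evidence)
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def generate(prompt):
    """Step 4: placeholder standing in for a real LLM call."""
    return f"(model output for a prompt of {len(prompt)} characters)"


def answer(question):
    """Run the loop: retrieve evidence first, generate second."""
    evidence = retrieve(question, CHUNKS)
    return generate(build_prompt(question, evidence))
```

Note that the model never sees the whole corpus. `retrieve` decides which chunks enter the prompt, and `generate` only works with what it is handed.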

Why RAG exists

Training bakes information into model weights. That is useful for broad language and reasoning patterns, but bad for facts that change or facts that should stay outside the model. Your company handbook, current pricing, recent incidents, customer-specific contract terms, and private codebase details should not require retraining the model every time they change.

RAG keeps those facts in a source system you can update. The model gets relevant slices at request time. That gives you a cleaner operational story: update the docs, rebuild the index, and the next answer can use the new evidence.
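
A minimal sketch of that update cycle, assuming a toy in-memory keyword index. The `build_index` and `search` helpers, the document id, and the prices are hypothetical. The index stores a snapshot of the documents, so an edit only becomes visible after a rebuild.

```python
import re
from collections import defaultdict


def words(text):
    """Lowercase word tokens of a string."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Snapshot the documents and map each word to the ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in words(text):
            postings[word].add(doc_id)
    return {"postings": postings, "texts": dict(docs)}


def search(index, query):
    """Return the indexed text of every document sharing a word with the query."""
    hits = set()
    for word in words(query):
        hits |= index["postings"].get(word, set())
    return [index["texts"][doc_id] for doc_id in sorted(hits)]


# Hypothetical source document and made-up prices.
docs = {"pricing": "The Pro plan costs 20 dollars per month."}
index = build_index(docs)

# The fact changes in the source system.
docs["pricing"] = "The Pro plan costs 25 dollars per month."
stale = search(index, "Pro plan price")  # old index still serves the old text

# Rebuild the index so the next request sees the new evidence.
index = build_index(docs)
fresh = search(index, "Pro plan price")
```

No model weights change anywhere in this cycle. The same snapshot behavior is also why a stale index serves stale facts until it is rebuilt.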

What RAG is good at

RAG is a strong fit when the answer should be grounded in a known corpus:

  • Internal knowledge base search with natural language answers.
  • Support agents that cite help center pages.
  • Developer assistants that read project docs and code snippets.
  • Legal, policy, or compliance lookup where source traceability matters.
  • Research assistants that summarize a bounded set of documents.

The common thread is not "make the model smarter." The goal is narrower: put the right evidence in front of the model at the right time.

What RAG does not solve

RAG reduces unsupported answers, but it does not make a system automatically correct. Retrieval can miss the right document. Chunking can split away the needed detail. The model can ignore a source, over-read a source, or cite a chunk that does not support the claim. A stale index can serve stale facts.
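
One of these failure modes, chunking that splits away the needed detail, is easy to reproduce. The sketch below assumes a naive fixed-size word chunker. The `chunk_words` helper and the sample policy text are made up for illustration. Overlap is shown as one common mitigation, not a guaranteed fix.

```python
def chunk_words(text, size, overlap=0):
    """Split text into chunks of `size` words, repeating `overlap` words between neighbors."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks


# Hypothetical policy text: a rule followed by an exception.
text = (
    "Refunds are available within 30 days. "
    "Exception: digital goods are not refundable."
)

# Without overlap, the exception sentence is cut across two chunks.
plain = chunk_words(text, size=8)

# With overlap, one chunk contains the whole exception sentence.
overlapped = chunk_words(text, size=8, overlap=4)
```

With the plain chunks, a retriever can return the refund rule without the exception, and the model will answer confidently from incomplete evidence.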

This is why production RAG is more than vector search. It is content processing, retrieval, ranking, context budgeting, prompt design, citation control, logging, and evaluation.
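
As a small illustration of one item on that list, context budgeting, here is a sketch that keeps the best-ranked chunks that fit a budget and skips duplicates. The `select_context` helper is hypothetical, and word count is used only as a rough stand-in for a real tokenizer.

```python
def select_context(ranked_chunks, max_words):
    """Keep the highest-ranked chunks that fit the budget, skipping duplicates."""
    seen = set()
    selected = []
    used = 0
    for chunk in ranked_chunks:  # assumed to be sorted best-first
        key = " ".join(chunk.lower().split())  # normalize case and whitespace
        if key in seen:
            continue  # duplicate evidence wastes context space
        cost = len(chunk.split())  # word count as a rough proxy for tokens
        if used + cost > max_words:
            continue  # does not fit; a shorter chunk further down still might
        seen.add(key)
        selected.append(chunk)
        used += cost
    return selected


# Hypothetical ranked retrieval results, best first.
ranked = [
    "Refunds are available within 30 days.",
    "refunds are available  within 30 days.",
    "Digital goods are not refundable.",
    "Shipping takes five to seven business days in most regions.",
]
context = select_context(ranked, max_words=12)
```

Here the second chunk is dropped as a duplicate of the first, and the last chunk is dropped because it would exceed the budget.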

Engineering reality

Most bad RAG systems fail before the LLM call. The evidence is missing, stale, too broad, duplicated, or badly ranked. The model then writes a polished answer from weak inputs, and everyone blames the model.

The boundary between retrieval and generation

Keep this separation clear:

  • Retrieval decides what evidence the model sees. This is where recall, ranking, filtering, access control, and freshness live.
  • Generation decides how to turn evidence into an answer. This is where summarization, tone, abstention, and citation formatting live.

When a RAG answer is wrong, debug the two halves separately. First ask whether the right evidence was retrieved. Then ask whether the model used it correctly.
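
That two-step debugging order can be sketched as a small check. Everything here is an assumption made for illustration: the function names, the idea of a known `expected_id` for the chunk that holds the answer, and the substring test. The substring test in particular is a crude proxy for whether the model used the evidence.

```python
def retrieval_hit(retrieved_ids, expected_id):
    """Retrieval check: did the chunk known to hold the answer come back?"""
    return expected_id in retrieved_ids


def answer_uses_evidence(answer, required_phrase):
    """Generation check: crude test that the answer states the expected fact."""
    return required_phrase.lower() in answer.lower()


def diagnose(retrieved_ids, expected_id, answer, required_phrase):
    """Check retrieval first, then generation, and name the failing half."""
    if not retrieval_hit(retrieved_ids, expected_id):
        return "retrieval"  # the right evidence never reached the model
    if not answer_uses_evidence(answer, required_phrase):
        return "generation"  # the evidence was present but not used correctly
    return "ok"
```

For example, `diagnose(["faq-1"], "policy-1", "Three days per week.", "three days")` returns `"retrieval"`. The answer happens to be right, but the expected chunk was never retrieved, so the system got lucky rather than grounded.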

Checkpoint

You're ready for the next lesson if you can answer these from memory:

  • What extra step does RAG add before generation?
  • Why is RAG useful for private or changing facts?
  • What is the difference between retrieval quality and generation quality?
  • Why does RAG not remove the need for evaluation?
