Lesson 07

Evaluating RAG systems

You do not know a RAG system works because one demo answer looks good. You know it works when retrieval, grounding, citations, latency, and cost keep passing on a test set that looks like real use.

The one idea

Evaluate retrieval and generation separately first, then evaluate the full user-visible answer. That tells you where the failure lives.

Start with real questions

A RAG eval set should contain questions users actually ask or questions that behave like them. Include easy lookups, ambiguous questions, exact identifier searches, multi-hop questions, stale-document traps, permission-sensitive questions, and questions the system should refuse.

For each question, store the expected source chunks or source documents. If you only store a golden answer, you will miss retrieval failures that happen to produce a similar-looking response.
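One way to store such a set is a small record per question that keeps the expected sources next to the question. The sketch below is illustrative only: the `EvalCase` fields, the category names, and the source ids are hypothetical, not a standard schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One eval question plus the evidence it should be answered from."""
    question: str
    category: str                               # hypothetical labels, e.g. "multi_hop"
    expected_source_ids: frozenset = frozenset()
    reference_answer: str = ""                  # optional golden answer
    should_refuse: bool = False                 # True when the system should decline


def validate(cases):
    """Return a list of problems; an empty list means the set is usable."""
    problems = []
    for i, case in enumerate(cases):
        if not case.question.strip():
            problems.append(f"case {i}: empty question")
        # A case that expects an answer needs sources to check retrieval against.
        if not case.should_refuse and not case.expected_source_ids:
            problems.append(f"case {i}: no expected sources recorded")
    return problems


# Made-up example cases; ids like "billing-policy#refunds" are placeholders.
cases = [
    EvalCase(
        question="What is the refund window for annual plans?",
        category="easy_lookup",
        expected_source_ids=frozenset({"billing-policy#refunds"}),
    ),
    EvalCase(
        question="What does error E-4012 mean?",
        category="exact_identifier",
        expected_source_ids=frozenset({"error-codes#E-4012"}),
    ),
    EvalCase(
        question="What is the CEO's home address?",
        category="should_refuse",
        should_refuse=True,
    ),
]

problems = validate(cases)
```

Running a check like `validate` whenever the set changes catches cases that carry only a golden answer and no expected sources.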

Evaluate retrieval first

Retrieval recall asks whether the right evidence appears in the candidate set. For example: did the expected source appear in the top 10 candidates? Did it survive reranking into the final context? If not, generation cannot be blamed for missing the fact.

Track recall at multiple points: initial vector search, hybrid candidate merge, post-filter candidates, post-rerank context. This turns "RAG is bad" into a specific pipeline bug.
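Per-stage recall can be sketched as follows, assuming each request's trace records the ordered source ids that survived each stage. The stage names and document ids are invented for illustration.

```python
def recall_at_stage(expected, retrieved, k=None):
    """Fraction of expected source ids found in the first k retrieved ids."""
    expected = set(expected)
    if not expected:
        raise ValueError("recall is undefined without expected sources")
    window = retrieved if k is None else retrieved[:k]
    return len(expected & set(window)) / len(expected)


# Hypothetical trace for one question: ids that survived each pipeline stage.
trace = {
    "vector_search": ["d7", "d2", "d9", "d4"],
    "hybrid_merge": ["d2", "d7", "d5", "d9"],
    "post_filter": ["d2", "d5", "d9"],
    "post_rerank": ["d5", "d9"],
}
expected = {"d2", "d5"}

per_stage = {stage: recall_at_stage(expected, ids) for stage, ids in trace.items()}
for stage, value in per_stage.items():
    print(f"{stage:14s} recall={value:.2f}")
```

In this made-up trace, recall is 0.5 after vector search, 1.0 after the hybrid merge and the filter, and 0.5 after reranking. That points at the reranker, which dropped `d2`, not at the index or the prompt.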

Evaluate the answer

Answer quality has several parts:

  • Faithfulness. The answer is supported by the retrieved sources.
  • Correctness. The answer resolves the user's task.
  • Citation quality. Citations point to sources that actually support the claim.
  • Abstention. The system refuses or asks for clarification when evidence is missing.
  • Usefulness. The answer is clear enough for the user to act on.

Some of these can be scored by humans, some by scripted checks, and some by an LLM judge calibrated against human labels. Do not trust an automated judge until you have checked its agreement on your task.
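Checking a judge's agreement can be as simple as scoring the same answers with humans and with the judge, then comparing the labels. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The labels are made-up sample data, and the level of agreement that counts as good enough depends on the task.

```python
from collections import Counter


def agreement_and_kappa(human, judge):
    """Return (observed agreement, Cohen's kappa) for two label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Chance agreement: probability both raters pick the same label independently.
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(h_counts) | set(j_counts)
    chance = sum(h_counts[label] * j_counts[label] for label in labels) / (n * n)

    if chance == 1.0:
        return observed, None  # both raters used a single label; kappa is undefined
    return observed, (observed - chance) / (1 - chance)


# Illustrative labels for eight answers, scored for faithfulness.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]

observed, kappa = agreement_and_kappa(human, judge)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

Here raw agreement is 0.75, but kappa is lower because two raters who pass most answers agree often by chance alone.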

Measure production constraints

RAG systems also fail by being too slow or too expensive. Track retrieval latency, reranker latency, model latency, input tokens, output tokens, total cost per request, cache hit rate, and index freshness. These numbers shape the product as much as answer quality does.

A pipeline that answers perfectly after 18 seconds may be useless for chat support. A cheap pipeline with weak citations may be useless for compliance review. Evaluation has to match the product.
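These numbers are easiest to track when every request emits one metrics record. The sketch below is a minimal version: the field names, the per-token prices, and the sample values are all assumptions, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestMetrics:
    retrieval_ms: int
    rerank_ms: int
    model_ms: int
    input_tokens: int
    output_tokens: int
    cache_hit: bool

    @property
    def total_ms(self):
        return self.retrieval_ms + self.rerank_ms + self.model_ms


def cost_usd(m, in_price_per_1k, out_price_per_1k):
    """Token cost of one request; the prices are supplied by the caller."""
    return (m.input_tokens / 1000) * in_price_per_1k + (m.output_tokens / 1000) * out_price_per_1k


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up sample of four requests.
requests = [
    RequestMetrics(40, 60, 900, 3000, 200, False),
    RequestMetrics(35, 55, 1110, 2800, 250, False),
    RequestMetrics(50, 70, 2880, 6000, 400, False),
    RequestMetrics(5, 0, 95, 0, 0, True),
]

p95_ms = percentile([r.total_ms for r in requests], 95)
cache_hit_rate = sum(r.cache_hit for r in requests) / len(requests)
# Hypothetical prices in USD per 1,000 tokens.
mean_cost = sum(cost_usd(r, 0.001, 0.002) for r in requests) / len(requests)

print(f"p95={p95_ms} ms  cache_hit_rate={cache_hit_rate:.2f}  mean_cost=${mean_cost:.4f}")
```

Comparing `p95_ms` and `mean_cost` against budgets set by the product, such as a chat latency target, makes those constraints testable in the same way as answer quality.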

Build a failure taxonomy

Every bad answer should get a cause label. Useful labels include:

  • Right source not indexed.
  • Right source indexed but not retrieved.
  • Right source retrieved but not selected for context.
  • Evidence selected but answer ignored it.
  • Answer cited the wrong source.
  • Source was stale or conflicting.
  • User lacked permission for the needed source.

Those labels tell you what to fix next. Without them, teams tend to randomly change chunk size, swap embedding models, rewrite prompts, and hope.
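A fixed set of labels also makes failures countable. The sketch below encodes the labels above as an enum and tallies them over a batch of reviewed failures. The enum member names and question ids are hypothetical, and the labels are assumed to come from a human reviewer reading each trace.

```python
from collections import Counter
from enum import Enum


class Failure(Enum):
    NOT_INDEXED = "right source not indexed"
    NOT_RETRIEVED = "right source indexed but not retrieved"
    NOT_SELECTED = "right source retrieved but not selected for context"
    EVIDENCE_IGNORED = "evidence selected but answer ignored it"
    WRONG_CITATION = "answer cited the wrong source"
    STALE_OR_CONFLICTING = "source was stale or conflicting"
    NO_PERMISSION = "user lacked permission for the needed source"


# Made-up review results: (question id, label assigned by the reviewer).
reviewed = [
    ("q-014", Failure.NOT_RETRIEVED),
    ("q-027", Failure.NOT_RETRIEVED),
    ("q-031", Failure.WRONG_CITATION),
    ("q-040", Failure.NOT_RETRIEVED),
    ("q-052", Failure.STALE_OR_CONFLICTING),
]

counts = Counter(label for _, label in reviewed)
for label, n in counts.most_common():
    print(f"{n:2d}  {label.value}")

top_label, top_count = counts.most_common(1)[0]
```

In this sample the largest bucket is "indexed but not retrieved", which argues for working on retrieval before touching the prompt.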

Engineering reality

Keep a small regression suite that runs on every pipeline change. Chunking, embedding model swaps, index settings, ranking weights, and prompt changes can all move quality. If you cannot compare before and after, you are tuning blind.
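A before-and-after comparison can be a short script that fails the change when a metric drops by more than an agreed tolerance. The sketch below assumes every metric is higher-is-better. The metric names, the scores, and the 0.02 tolerance are illustrative only.

```python
def find_regressions(baseline, candidate, max_drop=0.02):
    """Return {metric: (before, after)} for metrics that dropped too far.

    Assumes higher is better for every metric. A metric missing from the
    candidate run is reported with after=None.
    """
    regressions = {}
    for name, before in baseline.items():
        after = candidate.get(name)
        if after is None or before - after > max_drop:
            regressions[name] = (before, after)
    return regressions


# Made-up scores from the same eval set, before and after a chunking change.
baseline = {"recall_at_10": 0.91, "faithfulness": 0.88, "abstention_accuracy": 0.80}
candidate = {"recall_at_10": 0.93, "faithfulness": 0.81, "abstention_accuracy": 0.79}

regressions = find_regressions(baseline, candidate)
for name, (before, after) in regressions.items():
    print(f"REGRESSION {name}: {before} -> {after}")
```

Here retrieval improved while faithfulness fell past the tolerance, which is the kind of trade-off a single demo answer would hide.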

What good looks like

A healthy RAG system has a trace for every answer, a labeled eval set, retrieval metrics, answer metrics, latency and cost metrics, and a process for reviewing failures. It also has a clear answer to "what should happen when the evidence is missing?"

That is the main lesson of the course. RAG is not just embeddings plus a prompt. It is an evidence pipeline with a language model at the end.

Checkpoint

You're done with this course if you can answer these from memory:

  • Why should retrieval and generation be evaluated separately?
  • What does retrieval recall tell you?
  • What makes an answer faithful, and what makes a citation good?
  • Which latency and cost metrics matter for RAG?
  • How does a failure taxonomy guide improvements?
