Lesson 03

Chunking documents for RAG

Chunking is the part of RAG that sounds simple until it ruins your answers. The retriever does not search a whole document. It searches the pieces you chose to create.

The one idea

A chunk should be small enough to retrieve precisely and large enough to carry the context needed to answer. The right boundary is usually a semantic boundary, not an arbitrary character count.

Why chunks exist

Most source documents are too large to search, rank, and place into an LLM prompt as one unit. A policy page might cover vacation, sick leave, holidays, payroll, and approval rules. If you embed the whole page, a question about sick leave may retrieve the page, but the model still has to sift through unrelated text.

Chunks make retrieval more precise. They also make citations better. A citation to a specific section is more useful than a citation to a 40-page PDF.

The chunk size tradeoff

Small chunks are easy to match precisely, but they can lose the surrounding context. Large chunks preserve context, but they can match too many questions and waste prompt budget. There is no universal best number.

For prose, a reasonable starting point is often a few hundred tokens per chunk with modest overlap. For API docs, code, legal text, tables, and transcripts, the right unit may be a heading section, function, clause, row group, or speaker turn.

Practical rule

Start with semantic boundaries, then use size limits as guardrails. Split by headings, paragraphs, list items, or code blocks before falling back to fixed token windows.
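One way to picture this rule is the minimal sketch below. It assumes paragraphs are separated by blank lines and approximates tokens with whitespace-separated words; a real pipeline would count with the tokenizer of its embedding model. The name chunk_text and the max_tokens default are illustrative, not part of any library.

```python
def chunk_text(text: str, max_tokens: int = 200) -> list[str]:
    """Pack whole paragraphs into chunks; window only oversized paragraphs."""
    # Assumption: a blank line marks a paragraph (semantic) boundary.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []  # paragraphs packed into the chunk being built
    current_len = 0          # approximate token count of `current`

    def flush() -> None:
        nonlocal current, current_len
        if current:
            chunks.append("\n\n".join(current))
        current, current_len = [], 0

    for para in paragraphs:
        words = para.split()  # rough stand-in for real tokenization
        if len(words) > max_tokens:
            # Guardrail: this paragraph alone is too big, so fall back
            # to fixed windows for this paragraph only.
            flush()
            for i in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[i:i + max_tokens]))
            continue
        if current_len + len(words) > max_tokens:
            flush()  # adding this paragraph would exceed the limit
        current.append(para)
        current_len += len(words)

    flush()
    return chunks
```

Paragraphs are never cut unless one of them exceeds the limit on its own, so the size limit acts as a guardrail rather than as the primary splitter.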

Overlap helps, but it is not free

Overlap repeats a little text between adjacent chunks so an answer is less likely to depend on a sentence that got cut away. It helps with boundary problems, especially in plain prose.

But overlap increases index size, retrieval duplicates, and prompt repetition. Too much overlap can make the top results look diverse while actually returning the same paragraph five times. Use enough to preserve continuity, then dedupe during retrieval.
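Both halves of that advice can be sketched in a few lines. The example below assumes chunks are fixed token windows and that each retrieved hit records its source and token span; Hit, window_chunks, dedupe_hits, and the max_shared threshold are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


def window_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Fixed windows of `size` tokens that share `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


class Hit(NamedTuple):
    source: str   # document the chunk came from
    start: int    # token offset, inclusive
    end: int      # token offset, exclusive
    score: float  # retrieval score, higher is better


def dedupe_hits(hits: list[Hit], max_shared: float = 0.5) -> list[Hit]:
    """Keep the best-scoring hits; drop ones that mostly repeat a kept hit."""
    kept: list[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        length = hit.end - hit.start
        redundant = any(
            k.source == hit.source
            and min(k.end, hit.end) - max(k.start, hit.start) > max_shared * length
            for k in kept
        )
        if not redundant:
            kept.append(hit)
    return kept
```

Deduping by span rather than by exact text catches neighbors that differ only by the overlap region. The 0.5 threshold is an arbitrary starting point, not a recommended value.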

Titles and headings carry meaning

A chunk often makes more sense when you prepend its document title and section path. The text "Requests must be approved by a manager" is vague on its own. Prefixed with "Expense policy > Travel meals > Approval", it is much easier for retrieval and generation to interpret.

This is one reason HTML and Markdown are easier to index than messy PDFs. Their structure is explicit. When structure exists, keep it.
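As a rough illustration, the sketch below walks a Markdown document, tracks the current heading stack, and prefixes each section body with the title and section path. It assumes ATX-style headings (lines starting with #) and does not handle # characters inside fenced code blocks; sections_with_paths is a hypothetical helper name.

```python
import re

# Assumption: headings use the "# Heading" (ATX) style.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def sections_with_paths(markdown: str, title: str) -> list[str]:
    """Return one chunk per section, prefixed with 'title > heading > subheading'."""
    path: list[str] = []  # stack of headings leading to the current section
    body: list[str] = []  # lines collected for the current section
    out: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            out.append(" > ".join([title, *path]) + "\n\n" + text)
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # leave deeper or same-level headings
            path.append(match.group(2).strip())
        else:
            body.append(line)

    flush()
    return out
```

Embedding the prefixed text, rather than the bare body, is one way to give the retriever the context that the section body alone lacks.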

Special cases

  • Tables. Preserve headers with each row or row group. A row without column names is often meaningless.
  • Code. Keep functions, classes, imports, and comments together when possible. Splitting in the middle of a function hurts both retrieval and explanation.
  • Transcripts. Keep speaker labels and timestamps. The same sentence can mean different things depending on who said it.
  • FAQs. Keep question and answer together. Embedding only the answer loses the phrasing users may search for.
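Two of these cases are easy to sketch. The example below assumes a table is already parsed into a header list and row lists, and that an FAQ entry is a question and answer pair; table_chunks, faq_chunk, and the sample values in the usage note are made up for illustration.

```python
def table_chunks(
    header: list[str], rows: list[list[str]], rows_per_chunk: int = 2
) -> list[str]:
    """Emit row groups, repeating the header line at the top of every chunk."""
    header_line = " | ".join(header)
    chunks: list[str] = []
    for i in range(0, len(rows), rows_per_chunk):
        group = rows[i:i + rows_per_chunk]
        lines = [header_line] + [" | ".join(row) for row in group]
        chunks.append("\n".join(lines))
    return chunks


def faq_chunk(question: str, answer: str) -> str:
    """Keep the question with its answer so user phrasing stays searchable."""
    return f"Q: {question}\nA: {answer}"
```

For instance, table_chunks(["Leave type", "Days"], [["Vacation", "20"], ["Sick", "10"], ["Holiday", "8"]]) yields two chunks, and each one starts with the "Leave type | Days" line.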

Chunk IDs and source traceability

Every chunk should have a stable ID. That ID should point back to a source, version, and position. If a user reports a bad answer, you want to see exactly which chunk was used and whether that chunk still exists in the current source.

Good chunk metadata lets you rebuild indexes, compare old and new retrieval behavior, enforce permissions, and render citations that users can inspect.
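A minimal sketch of such metadata follows, assuming the ID is derived by hashing the source, version, and position so that re-running the pipeline on unchanged input reproduces the same ID. ChunkMeta and all of its field names are illustrative, not a standard schema.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkMeta:
    source_uri: str       # where the chunk came from
    source_version: str   # e.g. a commit hash or last-modified stamp
    start: int            # position in the source, inclusive
    end: int              # position in the source, exclusive
    section_path: str = ""               # used when rendering citations
    allowed_groups: tuple[str, ...] = ()  # hypothetical permission field

    @property
    def chunk_id(self) -> str:
        """Deterministic ID: same source, version, and position give the same ID."""
        key = f"{self.source_uri}|{self.source_version}|{self.start}|{self.end}"
        return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Because the ID includes the source version, a chunk from an edited document gets a new ID, which makes it easy to tell whether a reported answer relied on text that no longer exists.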

Engineering reality

Chunking changes are data migrations. If you change chunk size or boundaries, old retrieval scores, cached answers, and evaluation baselines may stop being comparable. Version your chunking strategy.
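One lightweight way to version a strategy is to fingerprint its settings and store that fingerprint next to the index, cached answers, and evaluation runs. The sketch below assumes the strategy is fully described by a few parameters; ChunkingConfig, its fields, and its default values are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkingConfig:
    splitter: str = "markdown-headings"  # illustrative strategy name
    max_tokens: int = 300
    overlap_tokens: int = 30

    def fingerprint(self) -> str:
        """Short hash that changes whenever any chunking setting changes."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def comparable(a: ChunkingConfig, b: ChunkingConfig) -> bool:
    """Scores and baselines are only comparable under the same strategy."""
    return a.fingerprint() == b.fingerprint()
```

Tagging every evaluation run with the fingerprint lets you refuse to compare baselines that were produced under different chunking settings.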

Checkpoint

You're ready for the next lesson if you can answer these from memory:

  • Why can both tiny chunks and huge chunks hurt answer quality?
  • When does overlap help?
  • Why should titles and headings be attached to chunks?
  • What metadata should every chunk carry?

Quick check

Why split on semantic boundaries before falling back to fixed token windows?

  • Because fixed windows are always too slow
  • Because fixed windows can cut through the middle of a meaningful unit
  • Because embedding models cannot embed fixed windows

Why should titles and headings be attached to chunks?

  • They tell the retriever and model what the chunk is about
  • They reduce token count
  • They automatically enforce access control