Hybrid search and reranking

Vector search is useful, but production retrieval usually needs more than one signal. Hybrid search combines semantic matching, exact matching, metadata filters, and a final ranking pass.

The one idea

Use broad retrieval to collect plausible candidates, then use stronger ranking to choose the few chunks the model should actually see.

Why keyword search still matters

Keyword search is good at exact terms: error codes, function names, part numbers, product SKUs, quoted phrases, and legal terms. Vector search is good at semantic similarity. They fail differently, which makes them useful together.

If a user asks about ERR_AUTH_4017, you do not want a semantically similar authentication overview. You want the exact error code. If a user asks "can contractors expense meals?", exact terms may not be enough, and semantic matching can help.
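To make the failure mode concrete, here is a minimal sketch of exact-term matching. The documents, the `tokenize` helper, and the overlap scoring are all made up for illustration; a real keyword index would typically use something like BM25 instead.

```python
import re

# Hypothetical mini-corpus, invented for this example
DOCS = {
    "auth-overview": "Overview of authentication: tokens, sessions, and login flows.",
    "err-4017": "ERR_AUTH_4017 means the refresh token has expired. Re-authenticate the user.",
    "expenses": "Freelancers may be reimbursed for food during approved travel.",
}

def tokenize(text):
    # Lowercase and keep identifiers such as ERR_AUTH_4017 as single tokens
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def keyword_search(query, docs):
    """Return ids of docs sharing at least one token with the query, most overlap first."""
    q = tokenize(query)
    scored = [(len(q & tokenize(text)), doc_id) for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]

exact_hits = keyword_search("ERR_AUTH_4017", DOCS)
paraphrase_hits = keyword_search("can contractors expense meals?", DOCS)
```

The error-code query finds only the document that contains the exact code. The expense question shares no words with the "expenses" document, so exact matching returns nothing, which is the gap semantic matching is meant to cover.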

Hybrid retrieval

A common pattern is to run multiple retrieval strategies, merge their candidates, remove duplicates, then rank the combined set. The candidate pool might include vector results, keyword results, filtered results, and boosted results from trusted sources.

The point is not to make retrieval fancy. The point is to improve recall before you make the final selection. A reranker cannot rank a chunk it never receives.
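One way to merge and deduplicate candidate lists is reciprocal rank fusion (RRF). This lesson does not prescribe a fusion method, so treat the sketch below as one common option rather than the required approach. The chunk ids and the constant `k=60` are assumptions chosen for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of chunk ids into one deduplicated list, best first.

    Each list contributes 1 / (k + rank) for every chunk it contains, so a
    chunk returned by several retrievers accumulates a higher fused score.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, then by id so ties come out in a stable order
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

# Hypothetical results from two retrievers
vector_hits = ["c7", "c2", "c9"]
keyword_hits = ["c2", "c4"]
candidates = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Here `c2` moves to the front because both retrievers returned it, and every chunk from either list stays in the pool for the later ranking step.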

Reranking

A reranker scores each query and chunk pair more carefully than the first retrieval pass. It is usually slower per chunk than vector search, so you run it on a limited candidate set: for example, the top 50 candidates from hybrid retrieval, not the whole corpus.

Reranking often improves the top few chunks because the reranker reads the query and the candidate text together. That is different from vector search, where query and document vectors are created separately and compared by distance.
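The shape of a reranking step can be sketched as follows. The `rerank` function and the `toy_overlap_score` scorer are hypothetical: the scorer is a crude word-overlap stand-in, where a real system would call a reranking model that reads the query and chunk together.

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[tuple[str, str]],
    score_pair: Callable[[str, str], float],
    top_n: int = 50,
    keep: int = 5,
) -> list[str]:
    """Score only the first top_n (chunk_id, text) pairs and return the best `keep` ids."""
    pool = candidates[:top_n]  # limit the expensive scoring to a small pool
    scored = [(score_pair(query, text), chunk_id) for chunk_id, text in pool]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunk_id for _, chunk_id in scored[:keep]]

def toy_overlap_score(query: str, text: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the chunk."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

# Hypothetical candidates in first-pass order
pool = [
    ("a", "reset your password from the login page"),
    ("b", "password rotation policy for service accounts"),
    ("c", "how to reset a forgotten password"),
]
best = rerank("reset forgotten password", pool, toy_overlap_score, keep=2)
```

Passing the scorer in as a parameter keeps the candidate limit and the selection logic separate from whichever model does the scoring.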

Filters before ranking

Some constraints are not ranking preferences. They are hard filters. Tenant, user permissions, document status, language, product version, and region should usually filter the candidate set before ranking. A highly relevant forbidden document is still forbidden.

Separate hard filters from soft boosts. "Only show documents this user can access" is a hard filter. "Prefer newer docs" might be a boost or a tie breaker, depending on the product.
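The difference between a hard filter and a soft boost can be sketched in a few lines. The `Chunk` fields, the group names, and the `penalty_per_year` value are assumptions invented for this example, not a recommended schema.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    tenant: str
    allowed_groups: frozenset
    year: int
    score: float  # relevance score from an earlier retrieval step

def apply_hard_filters(chunks, tenant, user_groups):
    """Drop chunks the user must never see, however relevant they are."""
    return [
        c for c in chunks
        if c.tenant == tenant and c.allowed_groups & user_groups
    ]

def apply_recency_preference(chunks, current_year, penalty_per_year=0.02):
    """Soft preference: older chunks lose a little score but stay in the list."""
    def adjusted(c):
        age = max(0, current_year - c.year)
        return c.score - penalty_per_year * age
    return sorted(chunks, key=lambda c: (-adjusted(c), c.chunk_id))

chunks = [
    Chunk("secret", "acme", frozenset({"finance"}), 2024, 0.99),
    Chunk("old-guide", "acme", frozenset({"support"}), 2019, 0.80),
    Chunk("new-guide", "acme", frozenset({"support"}), 2024, 0.78),
    Chunk("other-tenant", "globex", frozenset({"support"}), 2024, 0.95),
]
visible = apply_hard_filters(chunks, "acme", frozenset({"support"}))
ranked = apply_recency_preference(visible, current_year=2024)
```

The two highest-scoring chunks never reach the ranking step because the user cannot access them. The recency preference only reorders what is left.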

How many chunks should go to the model?

More chunks increase the chance that the right evidence is present. They also increase cost, latency, distraction, and the chance that conflicting evidence slips into the prompt. The right number depends on the task and on chunk size: a single-fact lookup usually needs fewer chunks than a question that compares several documents.

A practical pattern is to retrieve broadly, rerank, dedupe, then pass only the best evidence that fits the prompt budget. Log both the candidate pool and the final selected chunks so you can inspect where a miss happened.
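The final selection step might look like the sketch below. The function name `select_for_prompt`, the word-count token estimate, and the logger name are assumptions; a real system would count tokens with the model's own tokenizer.

```python
import logging

logger = logging.getLogger("retrieval")

def select_for_prompt(ranked_chunks, token_budget,
                      count_tokens=lambda text: len(text.split())):
    """Walk (chunk_id, text) pairs best first, skipping duplicates and chunks that do not fit."""
    selected, seen, used = [], set(), 0
    for chunk_id, text in ranked_chunks:
        key = " ".join(text.lower().split())  # normalized text, catches exact repeats only
        if key in seen:
            continue
        seen.add(key)
        cost = count_tokens(text)
        if used + cost > token_budget:
            continue  # too big for the remaining budget; a shorter chunk may still fit
        selected.append(chunk_id)
        used += cost
    # Log both sides so a miss can be traced to retrieval or to selection
    logger.info("candidates=%s selected=%s tokens=%d",
                [cid for cid, _ in ranked_chunks], selected, used)
    return selected

# Hypothetical reranked chunks, best first
ranked = [
    ("a", "refunds are issued within five days"),
    ("b", "Refunds are issued within five days"),
    ("c", "a very long chunk " * 10),
    ("d", "contact support for exceptions"),
]
chosen = select_for_prompt(ranked, token_budget=12)
```

If the right chunk appears in the logged candidates but not in the selection, the miss happened during ranking or budgeting. If it is missing from the candidates, the problem is upstream in retrieval.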

Engineering reality

Rerankers add latency. They are often worth it for high-value answers, but you should measure the tradeoff. A support search box may tolerate a slower but better answer. Autocomplete probably will not.
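A small timing helper is enough to start measuring the tradeoff. The helper name `measure_latency_ms` and the two placeholder pipelines are hypothetical; in practice you would pass in your own retrieval functions and replay real queries.

```python
import statistics
import time

def measure_latency_ms(fn, *args, runs=20):
    """Call fn(*args) `runs` times and report median and worst latency in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    return {"p50": statistics.median(samples), "max": max(samples)}

# Placeholder workloads standing in for "retrieval only" and "retrieval plus reranking"
def retrieve_only(query):
    return sorted(range(1_000))

def retrieve_and_rerank(query):
    return sorted(range(50_000))

baseline = measure_latency_ms(retrieve_only, "example query")
with_rerank = measure_latency_ms(retrieve_and_rerank, "example query")
```

Comparing the two results against the latency the product can tolerate tells you whether the reranker belongs on that path.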

Checkpoint

You're ready for the next lesson if you can answer these from memory:

Why do vector and keyword search complement each other?
What is the difference between retrieval and reranking?
Which constraints should be hard filters?
Why can too much context make answers worse?

Quick check

What is the main job of a reranker?

To bypass metadata filters
To choose the best evidence from a candidate pool
To make every query faster

Which of these should be a hard filter?

Document permissions for the current user
A slight preference for newer docs
A preference for shorter chunks