Fetch and Filter the Right Documents for RAG: Recall, Precision, and Accurate Answers

RAG succeeds or fails on retrieval: fetch the right chunks, filter the noise. Use hybrid search, context-enriched chunks, reranking, and LLM checks for precise, cheaper answers.

Published on: Sep 20, 2025

RAG Retrieval That Actually Works: Fetch and Filter What Matters

Your RAG system lives or dies by document retrieval. If the right context doesn't reach the model, answers go off-track, costs rise, and users lose trust. The goal here is simple: fetch the most relevant documents and filter out the noise.

Table of contents

  • Why is optimal document retrieval important?
  • Traditional approaches
  • Techniques to fetch more relevant documents (Recall)
  • Precision: filter away irrelevant documents
  • Benefits of improving document retrieval
  • Summary

Why is optimal document retrieval important?

A typical RAG flow looks like this:

  • User submits a query.
  • The query is embedded and compared to document (or chunk) embeddings.
  • The top K most similar chunks are fetched.
  • Those chunks plus the query are fed to an LLM to produce an answer.
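In code, this flow is only a few lines. Here is a minimal sketch, assuming a `retrieve_top_k` function (built up in the sections below) and an `llm` callable that maps a prompt to text; both are placeholders for whatever retriever and model client you actually use.

```python
def rag_answer(query: str, retrieve_top_k, llm, k: int = 10) -> str:
    """Minimal RAG loop: retrieve, assemble context, generate."""
    # 1. Fetch the K most similar chunks for the query (see retrieval sketches below).
    chunks = retrieve_top_k(query, k=k)

    # 2. Assemble the prompt: the retrieved evidence plus the user's question.
    context = "\n\n---\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # 3. Generate the answer. `llm` is any callable that maps prompt -> text.
    return llm(prompt)
```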

Every component matters: your embedding model, your LLM, your chunking strategy, and K. But chunk selection matters most. If you don't fetch the right chunks, the rest of the pipeline can't save the answer.

Traditional approaches

Embedding similarity

This is the default for most RAG stacks. You embed the query, compare it against precomputed document or chunk embeddings, and keep the top K matches (often 10-20, but test for your domain). It's fast and good enough for many use cases.
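A minimal sketch of this default path, assuming the sentence-transformers package; the model name is only an example, and any embedding model with a similar encode interface will do.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Example model; swap in whatever embedding model you already use.
model = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks: list[str]) -> np.ndarray:
    # Precompute and normalize chunk embeddings once, at ingestion time.
    return model.encode(chunks, normalize_embeddings=True)

def embedding_top_k(query: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 10) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)[0]
    # With normalized vectors, the dot product is cosine similarity.
    sims = chunk_vecs @ q
    top = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in top]
```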

Keyword search

Keyword search methods like TF-IDF or BM25 still work well, especially for exact terms and domains where vocabulary is consistent. The trade-off: they miss paraphrases and semantic matches. A hybrid setup often performs better than either alone.

BM25 is a solid baseline to pair with embeddings.
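As a sketch, the rank_bm25 package gets you that baseline in a few lines; the whitespace tokenization here is a simplification, so use a proper tokenizer for your domain.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def bm25_top_k(query: str, chunks: list[str], k: int = 10) -> list[str]:
    # Simple lowercase whitespace tokenization; replace with a real tokenizer in production.
    tokenized = [c.lower().split() for c in chunks]
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(query.lower().split())
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in top]
```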

Techniques to fetch more relevant documents (Recall)

Recall means pulling in as many truly relevant chunks as possible. These tactics increase the odds that the right evidence makes it into your context window.

Contextual retrieval

Two ideas make this work:

  • Add context to chunks: For each chunk, use an LLM to rewrite it with key context from its parent document (e.g., add the project name, address, date, or parties mentioned earlier). This helps semantic search "see" relevance that would otherwise be split across chunks.
  • Hybrid search: Retrieve with both BM25 and vector similarity, then merge and deduplicate before reranking.

Reference: Anthropic's write-up on contextual retrieval.
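Below is a sketch of the chunk-enrichment step, loosely following the prompt pattern from that write-up; the `llm` callable and the exact wording are placeholders.

```python
CONTEXT_PROMPT = """<document>
{document}
</document>

Here is a chunk from that document:
<chunk>
{chunk}
</chunk>

Write one short sentence situating this chunk within the document
(project, parties, dates, section) to improve search retrieval.
Answer with the sentence only."""

def contextualize_chunk(document: str, chunk: str, llm) -> str:
    # Prepend the generated context so both BM25 and embeddings can "see" it.
    context = llm(CONTEXT_PROMPT.format(document=document, chunk=chunk))
    return f"{context.strip()}\n\n{chunk}"
```

Run this once at ingestion time and index the enriched text with both BM25 and embeddings, so the extra LLM cost is paid per document rather than per query.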

Fetch more chunks (carefully)

  • Increasing K boosts recall. You're more likely to capture the right evidence.
  • Downsides: more irrelevant context (lower precision) and more tokens, which may degrade answer quality and increase cost.

Use this in combination with reranking and filtering to keep quality high.

Reranking for recall

Rerankers reorder candidates so relevant chunks climb into the top K. A simple pattern:

  • Initial retrieval: take top N by embeddings and BM25 (e.g., N=100).
  • Rerank with a cross-encoder or dedicated reranker (e.g., Qwen Reranker or similar).
  • Keep the top K after reranking.

This ensures borderline-but-relevant chunks aren't lost due to raw similarity scores.
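A sketch using the CrossEncoder class from sentence-transformers; the model name is an example, and a hosted reranker with query-document scoring slots in the same way.

```python
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

# Example model; substitute your preferred cross-encoder or hosted reranker.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], k: int = 10) -> list[str]:
    # Score each (query, chunk) pair jointly, which captures relevance
    # that raw embedding similarity can miss.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:k]]
```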

Precision: filter away irrelevant documents

Precision keeps the junk out. The goal is to prevent noise, wrong context, and token bloat from polluting the answer.

Reranking for precision

Rerankers also improve precision by pushing unrelated chunks down the list. This is where a strong cross-encoder shines. You'll see tighter context and fewer off-topic citations in answers.

LLM verification

Use the LLM as a judge on chunk relevance before final context assembly. A simple flow:

  • For each candidate chunk, ask the LLM: "Is this relevant to the query? Return {relevant: true/false}."
  • Only keep chunks marked relevant.
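A sketch of that check; the `llm` callable and prompt wording are placeholders, and in practice you would batch or parallelize these calls.

```python
import json

JUDGE_PROMPT = """Query: {query}

Chunk:
{chunk}

Is this chunk relevant to answering the query?
Respond with JSON only: {{"relevant": true}} or {{"relevant": false}}"""

def filter_relevant(query: str, chunks: list[str], llm) -> list[str]:
    kept = []
    for chunk in chunks:
        raw = llm(JUDGE_PROMPT.format(query=query, chunk=chunk))
        try:
            if json.loads(raw).get("relevant") is True:
                kept.append(chunk)
        except (json.JSONDecodeError, AttributeError):
            # If the judge output isn't valid JSON, keep the chunk rather than drop evidence.
            kept.append(chunk)
    return kept
```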

Trade-offs:

  • Cost: Many small LLM calls add up.
  • Latency: Extra calls can slow responses. Consider caching and running this only on near-threshold chunks.

Practical setup and defaults

  • Chunking: 300-800 tokens with 10-20% overlap is a solid starting point. Include titles and metadata in every chunk.
  • K: Start with 10-20. If answers miss details, raise K and add reranking + filtering.
  • Hybrid retrieval: Combine BM25 and embeddings. Merge top candidates, deduplicate, then rerank (see the fusion sketch after this list).
  • Reranking: Retrieve N=50-200, rerank down to K.
  • LLM verification: Optional. Use for high-stakes queries or when precision is critical.
  • Telemetry: Log which chunks were used and whether the answer was correct for continuous tuning.
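One common way to implement the merge-and-deduplicate step above is reciprocal rank fusion (RRF), sketched below; k=60 is the conventional smoothing constant, not a tuned value.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60, top_n: int = 100) -> list[str]:
    """Merge ranked lists (e.g., BM25 and embedding results) by summing 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk in enumerate(results):
            scores[chunk] = scores.get(chunk, 0.0) + 1.0 / (k + rank + 1)
    # Deduplication falls out naturally: each chunk appears once in the score map.
    merged = sorted(scores, key=scores.get, reverse=True)
    return merged[:top_n]

# Usage with the earlier sketches:
# candidates = reciprocal_rank_fusion([bm25_top_k(q, chunks, 100),
#                                      embedding_top_k(q, chunks, vecs, 100)])
# final = rerank(q, candidates, k=10)
```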

Benefits of improving document retrieval

  • Better answers: The LLM spends tokens on the right evidence.
  • Fewer hallucinations: Less noise and misdirection in the context.
  • Higher success rate: More user queries answered correctly end to end.
  • Lower cost over time: Cleaner context often lets you use smaller models or fewer tokens.

Summary

Get retrieval right, and your RAG stack becomes consistent and trustworthy. Use hybrid search, add context to chunks, increase K strategically, rerank aggressively, and filter with LLM verification when accuracy matters.

Keep it practical: measure success by correct answers per query, not just similarity scores. Iterate on chunking, K, and reranking until your metrics move in the right direction.

Want more hands-on training and resources for building AI systems? Check out the latest curated courses at Complete AI Training.