A production RAG checklist (the boring half)

What separates a working RAG demo from a production RAG system isn't the retrieval — it's the evaluation, observability, and failure-mode handling around it.

A RAG pipeline is easy to demo and hard to ship. The flashy part — embedding documents, vector search, generation — is roughly 30% of the work. The other 70% is what determines whether you wake up at 3am.

Here's the checklist I run through before calling any RAG system "production."

Retrieval quality

  • You have a labeled eval set of at least 50 query/answer pairs from real user queries (not synthetic).
  • Retrieval recall@k is measured and tracked over time. If you don't know your recall@5, you don't have a system, you have hope.
  • You've tested chunking strategies (paragraph, semantic, sliding-window) against the actual eval set. Pick the one that wins on recall, not the one that's prettiest.
  • Hybrid search (dense + BM25) is in place. Pure vector search loses to hybrid on most real corpora.
  • A reranking step is configured (Cohere Rerank, BGE reranker, or similar). Reranking the top 50 candidates down to the top 5 routinely beats raw retrieval by 10+ recall points.
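
For the recall@k item, here is a minimal sketch of the metric, assuming each eval query is labeled with the set of chunk IDs that should come back. The names (recall_at_k, mean_recall_at_k, the q1/c1 IDs) are made up for illustration and not tied to any particular library.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("each eval query needs at least one relevant chunk ID")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every labeled query."""
    scores = [recall_at_k(runs[query], labels[query], k) for query in labels]
    return sum(scores) / len(scores)


# Hypothetical eval labels (query -> relevant chunk IDs) and
# retriever output (query -> ranked chunk IDs).
labels = {"q1": {"c1", "c7"}, "q2": {"c3"}}
runs = {"q1": ["c1", "c2", "c7", "c9"], "q2": ["c5", "c6", "c3"]}

for k in (2, 3):
    print(f"recall@{k} = {mean_recall_at_k(runs, labels, k):.2f}")
```

Run the same function over the same eval set for every chunking strategy, retriever, and reranker you try, and store the number with a timestamp so you can see it move.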

Generation quality

  • You have a rubric-based eval for generated answers (faithfulness, relevance, completeness). Use LLM-as-judge with calibrated prompts, not vibes.
  • Citations are required and verified. If the model claims a fact, the chunk it came from is shown.
  • Refusal behavior is defined and tested. The system says "I don't know" when retrieval comes back empty instead of hallucinating.
  • Prompt is versioned in code, not hand-edited in a UI somewhere.
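
The citation and refusal items can be enforced with a thin wrapper around generation. This is a sketch under assumptions: Chunk, Answer, answer_or_refuse, and the generate callable are hypothetical names, and the check only confirms that every cited ID belongs to a chunk that was actually retrieved. It does not prove the chunk supports the claim; that is what the faithfulness eval is for.

```python
from dataclasses import dataclass
from typing import Callable, List

REFUSAL = "I don't know."


@dataclass
class Chunk:
    chunk_id: str
    text: str


@dataclass
class Answer:
    text: str
    citations: List[str]  # chunk IDs the answer claims to rely on


def answer_or_refuse(
    query: str,
    chunks: List[Chunk],
    generate: Callable[[str, List[Chunk]], Answer],
) -> Answer:
    """Refuse on empty retrieval or on citations that don't match retrieved chunks."""
    if not chunks:
        # Nothing retrieved: never call the model with an empty context.
        return Answer(REFUSAL, [])
    answer = generate(query, chunks)
    known_ids = {chunk.chunk_id for chunk in chunks}
    if not answer.citations or not set(answer.citations) <= known_ids:
        # Missing or invented citations are treated the same as no answer.
        return Answer(REFUSAL, [])
    return answer
```

Both refusal paths are cheap to unit test with a stubbed generate function, which is the "tested" half of the checklist item.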

Operational

  • Latency budgets per stage are measured: embedding, retrieval, rerank, generation. You know which stage is slowest.
  • Cost per query is tracked. You have a kill switch for runaway prompts.
  • Logs capture full query/response/retrieved chunks for the last N days, with PII handled.
  • An A/B framework lets you swap retrievers, rerankers, and prompts without redeploying code.
  • An on-call runbook exists for the failure modes you've seen.
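
Per-stage latency is the cheapest of these to start measuring. Below is a minimal sketch using only the standard library; StageTimer is a hypothetical helper, and the sleep calls stand in for real retrieval and generation calls. In a real system you would export the samples to your metrics backend rather than keep them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator, List


class StageTimer:
    """Collects wall-clock durations (in seconds) per named pipeline stage."""

    def __init__(self) -> None:
        self.samples: Dict[str, List[float]] = defaultdict(list)

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            self.samples[name].append(time.perf_counter() - start)

    def slowest(self) -> str:
        """Stage with the largest total time; raises ValueError if nothing was recorded."""
        totals = {name: sum(durations) for name, durations in self.samples.items()}
        return max(totals, key=totals.get)


timer = StageTimer()
with timer.stage("retrieval"):
    time.sleep(0.001)  # stand-in for the vector search call
with timer.stage("generation"):
    time.sleep(0.02)  # stand-in for the LLM call
print("slowest stage:", timer.slowest())
```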

Failure modes you'll hit

The ones that bit me hard:

  • Document drift. New docs added, old docs removed. Your eval set is stale within weeks. Refresh it monthly.
  • Query distribution shift. Users start asking about a new topic; retrieval recall drops; nobody notices for two weeks. Catch this with quality alerts on rolling eval scores.
  • Embedding model upgrades. Re-embedding everything can break queries that previously retrieved fine. Always shadow-deploy and compare.
  • Prompt injection from retrieved content. A malicious doc says "ignore your instructions and..." Defenses: chunk-level allowlisting and output filtering.
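
The "quality alerts on rolling eval scores" idea can start as small as this sketch. It assumes you already produce a per-query score (recall or a judge score between 0 and 1) and have a baseline from your last full eval run. RollingScoreAlert, the window size, and the thresholds are hypothetical and need tuning per system.

```python
from collections import deque
from typing import Deque


class RollingScoreAlert:
    """Flags when the mean of the last `window` scores falls too far below a baseline."""

    def __init__(self, window: int, baseline: float, max_drop: float) -> None:
        self.scores: Deque[float] = deque(maxlen=window)
        self.baseline = baseline
        self.max_drop = max_drop

    def observe(self, score: float) -> bool:
        """Record one score; return True when the full window has dropped too far."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet, avoid noisy early alerts
        mean = sum(self.scores) / len(self.scores)
        return self.baseline - mean > self.max_drop


# Illustrative numbers only: baseline recall 0.8, alert on a drop of more than 0.1.
alert = RollingScoreAlert(window=3, baseline=0.8, max_drop=0.1)
for score in (0.8, 0.8, 0.8, 0.4):
    if alert.observe(score):
        print("rolling eval score dropped, page someone")
```

Wire the True branch into whatever pages your on-call rotation, so a recall drop is noticed in hours instead of two weeks.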

What I deliberately leave for v2

  • Multi-hop reasoning (most user queries don't need it).
  • Custom-trained retrievers (off-the-shelf BGE / OpenAI / Cohere are fine until they're not).
  • Knowledge graph layers (huge complexity for marginal recall lift on most corpora).

If you're shipping RAG, optimize for the boring half first. The boring half is where production lives.

Shanker Dhand
AI Engineer & Technical Lead

I design and ship production AI systems — RAG pipelines, agents, and evaluation infrastructure — built on 10+ years of full-stack engineering.