A production RAG checklist (the boring half)
What separates a working RAG demo from a production RAG system isn't the retrieval — it's the evaluation, observability, and failure-mode handling around it.
A RAG pipeline is easy to demo and hard to ship. The flashy part — embedding documents, vector search, generation — is roughly 30% of the work. The other 70% is what determines whether you wake up at 3am.
Here's the checklist I run through before calling any RAG system "production."
Retrieval quality
- You have a labeled eval set of at least 50 query/answer pairs from real user queries (not synthetic).
- Retrieval recall@k is measured and tracked over time. If you don't know your recall@5, you don't have a system, you have hope.
- You've tested chunking strategies with the actual eval set (paragraph, semantic, sliding-window). Pick the one that wins on recall, not the one that's prettiest.
- Hybrid search (dense + BM25) is in place. Pure vector search loses to hybrid on most real corpora.
- Reranking step is configured (Cohere Rerank, BGE reranker, or similar). Reranking on top-50 → top-5 routinely beats raw retrieval by 10+ recall points.
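For the recall@k item, here is a minimal sketch of how the number might be computed over a labeled eval set. The chunk-ID format, the dictionary shapes, and the two example queries are assumptions for illustration, not part of any particular retriever's API.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        raise ValueError("each eval query needs at least one relevant chunk ID")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(
    results: Dict[str, List[str]],  # query -> ranked chunk IDs from the retriever
    labels: Dict[str, Set[str]],    # query -> chunk IDs a human marked relevant
    k: int = 5,
) -> float:
    """Average recall@k over every labeled query in the eval set."""
    # A query the retriever returned nothing for counts as recall 0.
    scores = [recall_at_k(results.get(q, []), rel, k) for q, rel in labels.items()]
    return sum(scores) / len(scores)


# Tiny hypothetical eval set, for illustration only.
labels = {"reset password": {"doc1#2"}, "refund policy": {"doc7#0", "doc7#1"}}
results = {
    "reset password": ["doc1#2", "doc3#0", "doc9#4"],
    "refund policy": ["doc7#0", "doc2#1", "doc5#5"],
}
print(mean_recall_at_k(results, labels, k=3))  # 0.75
```

Running the same function against each chunking strategy, and with and without the reranker, gives you one comparable number per configuration.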
Generation quality
- You have a rubric-based eval for generated answers (faithfulness, relevance, completeness). Use LLM-as-judge with calibrated prompts, not vibes.
- Citations are required and verified. If the model claims a fact, the chunk it came from is shown.
- Refusal behavior is defined and tested. The system says "I don't know" when retrieval comes back empty, rather than hallucinating an answer.
- Prompt is versioned in code, not hand-edited in a UI somewhere.
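The refusal and citation items can be enforced with a couple of small guards around the generation call. This is a sketch under assumptions: the `[chunk_id]` citation format, the `score` field on each chunk, and the 0.3 score floor are all hypothetical choices you would replace with your own.

```python
import re
from typing import Dict, List, Set

REFUSAL = "I don't know."


def should_refuse(chunks: List[Dict], min_score: float = 0.3) -> bool:
    """Refuse when nothing was retrieved or nothing clears the score floor."""
    return not any(c["score"] >= min_score for c in chunks)


def cited_ids(answer: str) -> Set[str]:
    """Extract citation markers written as [chunk_id] from the answer text."""
    return set(re.findall(r"\[([^\[\]]+)\]", answer))


def citations_valid(answer: str, chunks: List[Dict]) -> bool:
    """True only if the answer cites something and every citation was retrieved."""
    cited = cited_ids(answer)
    retrieved = {c["id"] for c in chunks}
    return bool(cited) and cited <= retrieved


# Hypothetical retrieved chunk and answers, for illustration only.
chunks = [{"id": "doc1#2", "score": 0.82}]
print(should_refuse([]))                                         # True
print(should_refuse(chunks))                                     # False
print(citations_valid("Use the reset link [doc1#2].", chunks))   # True
print(citations_valid("Use the reset link [doc4#0].", chunks))   # False
```

Note what this does and does not check: it confirms that every citation points at a chunk that was actually retrieved, not that the chunk supports the claim. The second part is what the faithfulness rubric is for.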
Operational
- Latency budgets per stage are measured: embedding, retrieval, rerank, generation. You know which stage is slowest.
- Cost per query is tracked. You have a kill switch for runaway prompts.
- Logs capture full query/response/retrieved chunks for the last N days, with PII handled.
- A/B framework lets you swap retrievers, rerankers, prompts without redeploying code.
- An on-call runbook exists for the failure modes you've seen.
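Per-stage latency is cheap to capture with a context manager wrapped around each step. The sketch below assumes an in-process pipeline and a hypothetical `StageTimer` class; the `time.sleep` calls stand in for real retrieval and generation work, and in production you would likely export these samples to your metrics system instead of keeping them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator, List


class StageTimer:
    """Collects wall-clock durations per named pipeline stage."""

    def __init__(self) -> None:
        self.samples: Dict[str, List[float]] = defaultdict(list)

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            self.samples[name].append(time.perf_counter() - start)

    def mean(self, name: str) -> float:
        """Mean latency in seconds for one stage."""
        durations = self.samples[name]
        return sum(durations) / len(durations)

    def slowest(self) -> str:
        """Name of the stage with the highest mean latency so far."""
        return max(self.samples, key=self.mean)


timer = StageTimer()
with timer.stage("retrieval"):
    time.sleep(0.01)   # placeholder for the real retrieval call
with timer.stage("generation"):
    time.sleep(0.05)   # placeholder for the real LLM call
print(timer.slowest())
```

Because the timer records on exit even when a stage throws, timeouts and errors still show up in the latency data.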
Failure modes you'll hit
The ones that bit me hard:
- Document drift. New docs added, old docs removed. Your eval set is stale within weeks. Refresh it monthly.
- Query distribution shift. Users start asking about a new topic; retrieval recall drops; nobody notices for two weeks. Catch this with quality alerts on rolling eval scores.
- Embedding model upgrades. Re-embedding everything can break queries that used to retrieve fine. Always shadow-deploy the new model and compare results before switching.
- Prompt injection from retrieved content. A malicious doc says "ignore your instructions and..." Defense: chunk-level allowlisting and output filtering.
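The "quality alerts on rolling eval scores" idea for query distribution shift can start as simply as this. It is a sketch, not a monitoring system: the `RollingScoreAlert` class, the baseline of 0.80, the allowed drop of 0.05, and the window size are all assumed values you would tune against your own eval history.

```python
from collections import deque
from typing import Deque


class RollingScoreAlert:
    """Fires when the rolling mean of eval scores drops too far below a baseline."""

    def __init__(self, baseline: float, max_drop: float = 0.05, window: int = 100) -> None:
        self.baseline = baseline
        self.max_drop = max_drop
        self.window = window
        self.scores: Deque[float] = deque(maxlen=window)

    def add(self, score: float) -> bool:
        """Record a score; return True if the rolling mean has fallen too far."""
        self.scores.append(score)
        if len(self.scores) < self.window:
            return False  # not enough data yet; avoid alerting on a partial window
        mean = sum(self.scores) / len(self.scores)
        return self.baseline - mean > self.max_drop


# Hypothetical per-query scores: quality holds, then drops.
alert = RollingScoreAlert(baseline=0.80, max_drop=0.05, window=3)
fired = [alert.add(s) for s in [0.8, 0.8, 0.8, 0.6, 0.6]]
print(fired)  # [False, False, False, True, True]
```

The score fed in can be recall@k on a sampled eval run or an LLM-as-judge rating on live traffic; the point is that something pages you before two weeks go by.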
What I deliberately leave for v2
- Multi-hop reasoning (most user queries don't need it).
- Custom-trained retrievers (off-the-shelf BGE / OpenAI / Cohere are fine until they're not).
- Knowledge graph layers (huge complexity for marginal recall lift on most corpora).
If you're shipping RAG, optimize for the boring half first. The boring half is where production lives.