A production RAG checklist (the boring half)

You have a labeled eval set of at least 50 query/answer pairs from real user queries (not synthetic).

Retrieval recall@k is measured and tracked over time. If you don't know your recall@5, you don't have a system, you have hope.

You've tested chunking strategies with the actual eval set (paragraph, semantic, sliding-window). Pick the one that wins on recall, not the one that's prettiest.

Hybrid search (dense + BM25) is in place. Pure vector search loses to hybrid on most real corpora.

Reranking step is configured (Cohere Rerank, BGE reranker, or similar). Reranking on top-50 → top-5 routinely beats raw retrieval by 10+ recall points.

A production RAG checklist (the boring half)

Retrieval quality

Generation quality

Operational

Failure modes you'll hit

What I deliberately leave for v2