The RAG demo always works. The first thirty queries are the ones the founder thought of when they built it, and the model handles them beautifully. Then real users arrive and the system breaks at six predictable points — chunking, embedding drift, retrieval relevance, context overflow, freshness, and citation faithfulness. The good news: each one has a known control, and you can fit all six into a fortnight of work.
Retrieval-augmented generation has the friendliest demo in AI. Drop a thousand PDFs into a vector store, hand the model the top-five chunks, ask a question, get a credible answer. The first founder demo lands in a weekend. The investor demo lands in a fortnight. Then the system meets a real user, and one of six things happens.
None of these are about the model. All of them are about the pipeline around it. Audit work confirms the pattern: when a RAG product underperforms in production, the retrieval step is the cause four times out of five, not the generation step. Fix retrieval, and the answer quality follows.
The default chunking strategy in every RAG tutorial is “split every 1,000 characters with 200 overlap.” It is fine for blog posts. It is catastrophic for almost everything else. A 1,000-character split through a contract lands halfway through a clause, separating “notwithstanding the foregoing” from the obligation it modifies. A split through a how-to article cuts a numbered list across two chunks, so the model sees step 4 without steps 1–3.
The control is to chunk on semantic boundaries, not byte counts. For most corpora that means a hierarchical splitter that respects, in order: document → section heading → paragraph → sentence. Each chunk is annotated with its parent heading so the embedding picks up topical context the chunk itself lacks. Token-length distribution is monitored: when a single document produces a 50-token chunk and a 4,000-token chunk, the splitter is wrong, not the document.
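A minimal sketch of the heading-aware splitter described above. It assumes Markdown-style `#` headings set off by blank lines, uses word counts as a rough stand-in for tokens, and omits the sentence-level fallback, so an oversized paragraph simply becomes its own chunk. The names `Chunk` and `chunk_document` are illustrative and do not come from any particular library.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    heading_path: str  # e.g. "Refunds > Eligibility"
    text: str

    def embed_text(self) -> str:
        # Text sent to the embedder: parent headings plus the chunk body.
        if self.heading_path:
            return f"{self.heading_path}\n\n{self.text}"
        return self.text


def chunk_document(doc: str, max_words: int = 200) -> list[Chunk]:
    chunks: list[Chunk] = []
    path: list[str] = []    # current heading stack, outermost first
    buffer: list[str] = []  # paragraphs waiting to be emitted together

    def flush() -> None:
        if buffer:
            chunks.append(Chunk(" > ".join(path), "\n\n".join(buffer)))
            buffer.clear()

    for block in re.split(r"\n\s*\n", doc.strip()):
        block = block.strip()
        if not block:
            continue
        match = HEADING.match(block)
        if match:
            # A heading closes the current chunk and resets deeper levels.
            flush()
            level = len(match.group(1))
            del path[level - 1:]
            path.append(match.group(2).strip())
            continue
        pending = sum(len(p.split()) for p in buffer) + len(block.split())
        if buffer and pending > max_words:
            flush()  # never split inside a paragraph, only between them
        buffer.append(block)
    flush()
    return chunks
```

Logging `len(chunk.text.split())` for every chunk gives a quick view of the length distribution that the paragraph above suggests watching.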
For tabular data, code, and structured records, do not chunk text at all. Render the row or function as a small canonical document, embed that, and store a back-pointer to the source.
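As an illustration of the row-as-document idea, the sketch below renders one table row into a small canonical text and keeps a back-pointer to the source. The `Record` type, the `key: value` layout, and the `table#row=N` pointer format are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    text: str    # canonical document that gets embedded
    source: str  # back-pointer to the original row


def render_row(table: str, row_index: int, row: dict[str, object]) -> Record:
    lines = [f"table: {table}"]
    # Sorted keys keep the rendering stable across re-indexing runs.
    for key in sorted(row):
        value = row[key]
        if value is None or value == "":
            continue  # skip empty cells so they do not dilute the embedding
        lines.append(f"{key}: {value}")
    return Record("\n".join(lines), f"{table}#row={row_index}")
```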
The embedding model that indexed your corpus eighteen months ago is not the embedding model that should be running today. Providers ship new versions. The new one is usually better, but it produces vectors in a different space. If you embed queries with the new model and search against vectors built with the old one, you are comparing nonsense to nonsense.
Two controls. First, version-pin the embedding model in the index metadata: every record carries the model name and version that produced it. The query path checks that the query embedder matches before searching, and refuses to compare across versions. Second, when you upgrade the embedder, treat it as a full reindex — blue/green the new index, run an eval comparing top-k recall on a frozen test set, and only switch when the new index clears the threshold.
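The first control can be sketched as a guard on the query path. The types and the `check_embedder` function below are hypothetical. A real vector store would more likely hold the embedder version once in index-level metadata and check it once per query, instead of scanning records.

```python
from dataclasses import dataclass


class EmbedderMismatch(RuntimeError):
    """Raised when query and index vectors come from different embedders."""


@dataclass(frozen=True)
class EmbedderVersion:
    name: str
    version: str


@dataclass(frozen=True)
class IndexRecord:
    doc_id: str
    vector: tuple[float, ...]
    embedder: EmbedderVersion  # the model that produced this vector


def check_embedder(query_embedder: EmbedderVersion, records: list[IndexRecord]) -> None:
    # Any record built by a different model or version makes the search invalid.
    foreign = {r.embedder for r in records} - {query_embedder}
    if foreign:
        found = ", ".join(sorted(f"{e.name}@{e.version}" for e in foreign))
        raise EmbedderMismatch(
            f"query uses {query_embedder.name}@{query_embedder.version}, "
            f"index contains {found}; refusing to search"
        )
```

Failing loudly here is the point: a config bump that changes the embedder then produces an immediate error instead of a silent recall drop.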
The thing that breaks teams here is doing it casually. Someone bumps the model name in a config file because the docs recommend it. Recall silently drops 12%. The team blames the prompt for three weeks before tracing it back.
Dense vector search is great at semantic similarity and bad at exact-match. A user asks “What does error E_AUTH_437 mean?” and the cosine-similarity search returns five chunks about authentication in general, none of which contain the literal string. The right chunk — one that mentions E_AUTH_437 by name — ranks tenth.
The control is hybrid search: dense retrieval combined with a sparse lexical retriever (BM25 over the same corpus), then fused with reciprocal rank fusion. Adding the sparse channel costs almost nothing and recovers a class of queries that pure-vector search consistently misses — product codes, acronyms, names, IDs, anything where the user remembers the exact token.
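Reciprocal rank fusion itself is only a few lines. The sketch below assumes each retriever returns an ordered list of chunk IDs, best first, and uses the conventional constant k = 60. The function name and the ID strings are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; each list contributes 1 / (k + rank) per ID."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Dense search ranks the exact-match chunk tenth; BM25 ranks it first.
dense = [f"auth_general_{i}" for i in range(1, 10)] + ["chunk_e_auth_437"]
sparse = ["chunk_e_auth_437", "unrelated_chunk"]
fused = reciprocal_rank_fusion([dense, sparse])
```

Because RRF uses only ranks, it needs no score normalisation between the dense and sparse channels, which is part of why adding the sparse channel is cheap.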
On top of fused retrieval, add a small reranker. Cross-encoder rerankers (a few hundred ms of latency, models like BGE-reranker or Cohere’s rerank API) take the top-25 fused candidates and re-score them against the query in a way pure embedding similarity cannot. On most production corpora, fused retrieval plus a reranker pulls top-3 precision from somewhere around 0.55 to somewhere around 0.85. That is the difference between “sometimes useful” and “product”.
You have a 200,000-token context window, so you stuff thirty chunks into the prompt. The model produces a confident answer that ignores chunks five through twenty-five. This is the lost-in-the-middle phenomenon, and it has been documented across long-context models generally. Past a certain point, adding more context makes performance worse, not better, and adds latency and cost on the way down.
The control is discipline. Keep the packed context tight: typically three to eight chunks for a question-answering task, ranked by reranker score, deduplicated against each other. If a chunk overlaps another by more than 60% on a tri-gram comparison, drop the lower-ranked one. Order chunks by relevance descending, not by source order. Include a short header per chunk with its source document and section so the model can produce citations — this is also the foundation for failure 06.
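One way to read the deduplication rule is sketched below. It assumes word-level trigrams and measures overlap against the smaller of the two chunks. The text above does not specify either choice, so treat both as assumptions. Candidates are hypothetical `(chunk_id, text, reranker_score)` tuples.

```python
def trigrams(text: str) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def overlap(a: str, b: str) -> float:
    """Share of the smaller chunk's word trigrams that also appear in the other."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def pack_context(
    candidates: list[tuple[str, str, float]],
    max_chunks: int = 8,
    max_overlap: float = 0.6,
) -> list[tuple[str, str, float]]:
    kept: list[tuple[str, str, float]] = []
    # Walk candidates best-first so the lower-ranked duplicate is the one dropped.
    for cand in sorted(candidates, key=lambda c: c[2], reverse=True):
        if any(overlap(cand[1], k[1]) > max_overlap for k in kept):
            continue
        kept.append(cand)
        if len(kept) == max_chunks:
            break
    return kept  # already ordered by relevance, descending
```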
Monitor the token budget: track context_tokens_used / context_window as a histogram and alert when P95 exceeds 60%. If you are routinely filling the window, you are losing answer quality, not gaining it.
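The budget alert can be approximated without a metrics stack. The sketch below uses a nearest-rank P95 over a batch of recent requests. The function names are illustrative, and a production setup would more likely compute the percentile in its monitoring system.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("p95 of an empty sample is undefined")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def budget_alert(
    context_tokens_used: list[int],
    context_window: int,
    threshold: float = 0.60,
) -> bool:
    # One ratio per request: how much of the window the packed context filled.
    ratios = [used / context_window for used in context_tokens_used]
    return p95(ratios) > threshold
```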
The index was built off a Notion export from December. The team rewrote the refund policy in February. The bot is still telling customers about the December version, confidently and with a citation, because the citation is — technically — correct. It is the corpus that is wrong.
The control is a freshness SLO with the same seriousness as a latency SLO. Every chunk has a source_updated_at. The pipeline has a target lag (e.g. source change to indexed and serving in under one hour). A monitor compares the latest source_updated_at in the upstream system to the latest in the index every five minutes; lag > SLO pages oncall. The reindexer is owned by a person, not a cron job that nobody checks.
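A sketch of the freshness check, under one possible reading of "lag": how long the newest upstream change has been waiting without appearing in the index. The function names and the one-hour default are illustrative. Fetching the two timestamps from the source system and the index is left out.

```python
from datetime import datetime, timedelta, timezone


def unindexed_lag(
    latest_source_update: datetime,
    latest_indexed_update: datetime,
    now: datetime,
) -> timedelta:
    # If the index already holds the newest source change, nothing is waiting.
    if latest_indexed_update >= latest_source_update:
        return timedelta(0)
    # Otherwise the newest change has been unindexed since it was made.
    return now - latest_source_update


def breaches_slo(
    latest_source_update: datetime,
    latest_indexed_update: datetime,
    now: datetime,
    slo: timedelta = timedelta(hours=1),
) -> bool:
    return unindexed_lag(latest_source_update, latest_indexed_update, now) > slo


now = datetime(2026, 2, 10, 12, 0, tzinfo=timezone.utc)
stale = breaches_slo(now - timedelta(minutes=90), now - timedelta(days=1), now)
```

Measuring from the time of the source change, and not from the gap between the two timestamps, avoids paging when a document that sat untouched for months was edited five minutes ago.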
Bonus control: filter retrieval results by recency where the data is time-sensitive. The bot answering a 2026 customer should not be matching against 2022 docs unless explicitly asked.
The model produces an answer with footnotes. The footnotes look authoritative. Half of them point to chunks the answer did not actually use. The other half cite the right chunk for the wrong claim. Users trust the footnotes for two weeks until one of them clicks through and the trust cliff arrives.
The control is structural, not prompt-engineering. Pack each chunk with an explicit ID in the context ([doc_id: 84]), and instruct the model to return its answer as JSON with each claim mapped to the IDs that support it. A post-hoc validator checks that (a) every cited ID is one we actually retrieved, (b) each cited chunk actually supports the claim it is attached to, as judged by a small judge prompt, and (c) the answer contains no claims without at least one valid citation. Failures rerun once with a corrective system message; if they fail twice, the answer is suppressed and a fallback (“I don’t have a confident answer for that”) is returned.
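The validator and the retry-then-suppress loop can be sketched as follows. The JSON shape (`{"claims": [{"text": ..., "citations": [...]}]}`) is a hypothetical schema. `supports` stands in for the judge prompt and `generate` for the model call, and both are passed in as plain callables, so the sketch itself makes no API calls.

```python
import json
from typing import Callable, Optional

FALLBACK = "I don’t have a confident answer for that"


def validate_answer(
    raw_json: str,
    retrieved: dict[str, str],             # chunk ID -> chunk text
    supports: Callable[[str, str], bool],  # judge: does chunk support claim?
) -> list[str]:
    """Return a list of problems; an empty list means the answer passes."""
    try:
        claims = json.loads(raw_json)["claims"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return ["output is not the expected JSON shape"]
    if not isinstance(claims, list):
        return ["'claims' is not a list"]
    problems: list[str] = []
    for i, claim in enumerate(claims):
        if not isinstance(claim, dict):
            problems.append(f"claim {i}: not an object")
            continue
        text, cited = claim.get("text", ""), claim.get("citations", [])
        valid = 0
        for doc_id in cited:
            if doc_id not in retrieved:                  # check (a)
                problems.append(f"claim {i}: cites unknown ID {doc_id}")
            elif not supports(text, retrieved[doc_id]):  # check (b)
                problems.append(f"claim {i}: {doc_id} does not support it")
            else:
                valid += 1
        if valid == 0:                                   # check (c)
            problems.append(f"claim {i}: no valid citation")
    return problems


def answer_with_validation(
    generate: Callable[[Optional[str]], str],  # takes an optional corrective message
    retrieved: dict[str, str],
    supports: Callable[[str, str], bool],
) -> str:
    correction: Optional[str] = None
    for _ in range(2):  # first attempt plus one corrective rerun
        raw = generate(correction)
        problems = validate_answer(raw, retrieved, supports)
        if not problems:
            return raw
        correction = "Fix these citation problems: " + "; ".join(problems)
    return FALLBACK
```

Because `validate_answer` returns a plain list of problems, the same function can score an eval set offline and back a CI gate.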
This is the single highest-leverage control in the whole pipeline. It catches hallucination before the user sees it, gives compliance reviewers a defensible audit trail, and turns the eval from “does this look right” into something a CI gate can score.
Stack the controls together and you get a system that looks like this: hierarchical chunking, version-pinned embeddings, hybrid retrieval (dense + BM25), a cross-encoder reranker, a three-to-eight-chunk packed context with explicit chunk IDs, a freshness SLO on the index, structured cited-JSON output, a post-hoc citation validator, and an eval set with faithfulness and citation-precision as gates in CI.
Nine components. Roughly two engineer-weeks to assemble from open-source pieces if you have someone who has done it before, four weeks if you do not. None of it is novel work. All of it is the difference between a RAG that wins the demo and a RAG that survives the year.
The two-week production-readiness audit covers retrieval pipelines end-to-end — chunking strategy, embedding hygiene, retrieval-eval baseline, citation faithfulness, freshness controls. Written report, ranked gaps, fixed price.