When Microsoft launched Bing Chat on February 7, 2023, the public saw a chatbot. Engineers saw a live demonstration of a production RAG pipeline under stress. Within days, users discovered the system would sometimes confabulate details not in its retrieved context — a signal that the generation stage was overriding retrieval signals. The incident forced a public conversation about where, exactly, RAG pipelines can break.
Understanding the full pipeline — not just retrieval or just generation — is how you avoid shipping the same failure in your own systems.
Every production RAG system passes data through six distinct stages. Treating any stage as a black box is how bugs become incidents.
Stages 1–4 are offline (indexing time). They run once per document update and can be expensive. Stages 5–6 are online (query time). They run per user request and must be fast — typically under 500 ms for production systems.
This separation matters for architecture. You can iterate on your embedding model or chunking strategy independently of your generation prompt, as long as you re-index after offline changes. Many teams mistake slow query latency for a generation problem when the retrieval step is the bottleneck.
Elasticsearch's 2023 developer survey of 1,000+ RAG deployments found that 61% of teams reported their primary quality issues originated in the indexing pipeline (chunking and embedding), not in the LLM generation stage. Most debugging time is spent in the wrong place.
Each chunk should carry structured metadata alongside its vector: source document ID, page number, creation date, access control tags, chunk index within the document. This metadata enables filtered retrieval — e.g., "only search documents from Q3 2024" — without embedding those constraints into the semantic vector itself.
LlamaIndex (formerly GPT Index) formalized this pattern in its NodeWithScore abstraction, where every retrieved chunk is a node carrying both a relevance score and a metadata dict. LangChain uses a similar Document object with a page_content and metadata dict. Both frameworks converged on this design independently, which is a signal it reflects genuine production need.
Design your pipeline so that each stage can be swapped independently. If you couple your chunker to your embedder to your retriever, every experiment requires a full rewrite. The best production teams treat each stage as a replaceable component with a defined interface.
You're consulting on a RAG deployment that returns irrelevant answers. The team says "the LLM is hallucinating." Your job is to systematically walk through the six pipeline stages to identify where the actual problem might be. Use the AI assistant to work through the diagnosis — describe what you'd check at each stage.
Glean, the enterprise search startup that reached a $2.2 billion valuation in 2024, built its entire product around one insight: keyword search fails for enterprise knowledge. When employees ask "what did we decide about the Q3 budget?" a BM25 keyword index returns documents containing those words — not documents about that decision. Glean's retrieval layer combines semantic search with graph-based entity linking to surface the decision memo, not just the budget spreadsheet.
The lesson for RAG designers: retrieval strategy is a product decision, not just an infrastructure one.
Production RAG systems use three retrieval modes, often in combination.
BM25 and TF-IDF. Fast, interpretable, no ML required. Excels on exact-match queries like product codes, proper nouns, and legal terminology. Fails on synonyms and paraphrase.
Embedding similarity (cosine or dot product). Captures meaning over exact terms. Excels on conceptual queries. Can miss rare terminology if not in training data.
Combines sparse and dense scores via Reciprocal Rank Fusion (RRF) or linear interpolation. Used by Elasticsearch 8.x, Pinecone hybrid search, and Weaviate. Best overall recall across query types.
A cross-encoder (e.g., Cohere Rerank, BGE-Reranker) rescores the top-k results from a first-stage retriever. Dramatically improves precision at the cost of added latency.
Raw user queries are often poor retrieval inputs. Several techniques transform them before retrieval runs.
HyDE (Hypothetical Document Embeddings), introduced by Gao et al. in 2022, generates a hypothetical answer to the query using an LLM, then retrieves documents similar to that hypothetical. This sidesteps the query-document embedding mismatch — queries and documents often have different styles even when semantically related.
Query decomposition breaks complex multi-part questions into sub-queries, retrieves for each independently, then merges results. LangChain's MultiQueryRetriever automates this by generating three query variants and union-merging their results.
Step-back prompting (Google DeepMind, 2023) reformulates the query at a higher abstraction level. "What are the side effects of ibuprofen at 800mg?" becomes "What are NSAID pharmacokinetics?" — retrieving the foundational context before the specific one.
The BEIR benchmark (2021, Thakur et al.) tested 18 retrieval systems across 18 datasets. No single method dominated across all domains. Dense retrieval outperformed BM25 on semantic tasks; BM25 outperformed dense on exact-match domains like TREC-COVID. Hybrid systems achieved the highest average nDCG@10 across the full benchmark. This is why modern production systems almost always use hybrid retrieval.
Retrieving k=3 is common in demos. Production systems typically retrieve k=20 at the first stage, then rerank to k=5 for the generation context window. The tradeoff: more retrieved chunks increase recall but inflate the prompt, increasing cost, latency, and the risk of lost in the middle — the documented tendency of LLMs to underweight context appearing in the middle of long prompts (Liu et al., 2023).
You're building a RAG system for a legal research firm. Attorneys search for case law using three types of queries: (1) exact citation lookups like "42 U.S.C. § 1983", (2) conceptual queries like "cases about police immunity in excessive force claims", and (3) complex multi-part questions like "what did the Ninth Circuit decide about qualified immunity between 2018 and 2022 and how does it differ from the Fifth Circuit?"
Perplexity AI's approach to RAG prompting became one of the most studied in the industry after the company published blog posts describing their answer engine architecture. Their key insight was that the prompt template, not the retrieval alone, determined whether users trusted the output. By structuring prompts to always cite retrieved sources inline — forcing the model to ground every claim — they reduced user-perceived hallucination rates dramatically compared to systems that simply prepended context without citation instructions.
A production RAG prompt has four zones, and their order matters due to the lost-in-the-middle effect:
The most effective grounding instructions are specific about failure modes, not just general about accuracy. Compare these two system prompts:
Weak: "Be accurate and helpful."
Effective: "Answer using only the provided context. If the context does not contain enough information to answer the question, respond: 'The provided documents do not contain enough information to answer this question.' Do not infer, extrapolate, or use knowledge from your training data."
The effective version names the exact failure mode (using training knowledge instead of context) and provides a specific fallback response. This reduces hallucination by giving the model an explicit "safe exit" that avoids the pressure to generate something plausible.
In Anthropic's Constitutional AI follow-up work on Claude's RAG behavior, researchers found that prompts explicitly naming the fallback ("if unsure, say I don't have enough information") reduced confabulation by approximately 40% compared to prompts that only said "be accurate." The model needs a graceful exit, not just an instruction to succeed.
Labeled chunks use identifiers like [Doc 1], [Source: policy_v3.pdf], or [Chunk 4 of 12] to help the model track which context element it's drawing from. Numbered labels enable inline citation in the output.
Relevance ordering — given the lost-in-the-middle effect, place the highest-relevance chunk first, second-highest last, and lower-relevance chunks in the middle. This is counterintuitive but supported by Liu et al.'s experimental data across multiple LLMs.
Context compression uses a smaller LLM to summarize or filter retrieved chunks before injection. LangChain's ContextualCompressionRetriever does this automatically. It reduces prompt tokens but adds latency and an additional failure point.
Requiring inline citations does more than improve transparency — it functions as a reasoning constraint. When the model must cite [2] for every claim, it's forced to locate a specific source for that claim before outputting it. Claims without a citable source are harder to generate. This is why Perplexity, Bing Chat, and Google's SGE all require citation in their answer templates.
Never give the model a context block without also giving it explicit instructions for what to do when the context is insufficient. The model will fill the gap with training data unless you provide a specific alternative — and that alternative must be stated as a concrete response, not just a vague "be honest" instruction.
You're building a RAG system for a hospital that helps clinicians look up drug interaction information. The stakes are high: a hallucinated drug interaction could harm patients. You need to construct a prompt template that maximally grounds the model to retrieved context and handles the "I don't know" case explicitly.
When Notion launched Notion AI in February 2023, the team quickly discovered that internal user satisfaction surveys were a lagging indicator — they revealed problems weeks after they appeared. The engineering team, led by ML engineers who had previously worked on recommendation systems, implemented a real-time faithfulness scoring pipeline that evaluated every generated response against its retrieved context using a smaller judge model. Within three months, they reported a 23% reduction in reported hallucinations, not from changing the LLM, but from using the evaluations to identify and fix specific chunking failures in their workspace document indexer.
The dominant academic framework for RAG evaluation, codified by Es et al. in the RAGAS paper (2023), defines four core metrics that map directly to failure modes in the pipeline.
Are all claims in the answer supported by the retrieved context? Low faithfulness = the generation stage is hallucinating. Measured by decomposing the answer into claims and verifying each against context.
Does the answer address what the user actually asked? Low score = the answer is on-topic but doesn't answer the specific question. Often caused by vague retrieval or insufficient context.
What fraction of retrieved chunks are actually relevant to the query? Low precision = the retriever is returning noise that confuses the generator.
Does the retrieved context contain all information needed to answer the query? Low recall = the retriever is missing key chunks. Answers will be incomplete or incorrect even if faithful.
Human evaluation of RAG outputs is expensive and slow. The emerging standard is LLM-as-judge: using a separate LLM (often GPT-4 or Claude) to evaluate faithfulness, relevance, and context quality at scale. Zheng et al. (2023) at UC Berkeley showed in the MT-Bench paper that GPT-4 as judge agreed with human expert evaluators 80–85% of the time — comparable to inter-human agreement rates.
Tools like RAGAS (open source, 2023), TruLens (TruEra), and DeepEval automate LLM-as-judge evaluation pipelines. They generate evaluation datasets from your documents, run queries through your pipeline, score each response, and surface the lowest-scoring query types for debugging.
Cohere's RAG deployment team documented in their 2023 engineering blog that teams who implemented automated evaluation pipelines from week one shipped improvements 3x faster than teams that relied on manual review. The evaluation pipeline is infrastructure, not an afterthought — build it alongside the retrieval pipeline, not after launch.
Retrieval latency — p50, p95, p99 latency for the retrieval stage separately from generation. Spikes in retrieval latency usually indicate index fragmentation or vector database replication lag.
Context utilization rate — What fraction of retrieved chunks does the model actually cite? If the model consistently ignores 4 of 5 retrieved chunks, your top-k is too high and you're wasting tokens.
Unanswerable rate — What fraction of queries result in "I don't have enough information"? A sudden increase often indicates an indexing failure or a new query domain your corpus doesn't cover.
Query distribution drift — Are users asking different types of questions over time? If your corpus covers product documentation but users start asking about pricing (not indexed), precision will collapse. Monitor embedding-space clustering of incoming queries.
Each RAGAS metric points to a different stage for improvement. Low faithfulness → fix generation (stronger grounding instructions, better prompt). Low context precision → fix retrieval (reranker, query transformation). Low context recall → fix chunking or indexing (chunk size, overlap, coverage gaps). This mapping turns evaluation from a grading exercise into a debugging tool.
The RAGAS metric that is lowest is where you spend your engineering time. The pipeline is only as strong as its weakest measured stage — and the unmeasured stage is always the weakest one, because you have no signal to improve it.
You've just launched a RAG-powered customer support system for a SaaS product. After two weeks in production, users are complaining that answers "feel off" but the CSAT scores haven't moved yet — the Notion AI problem. You need to design a systematic evaluation and monitoring plan using RAGAS metrics and LLM-as-judge techniques to identify the root cause before it affects CSAT.