When Microsoft launched the new Bing Chat in February 2023, early testers began documenting a consistent failure pattern: the system would retrieve accurate background facts but then generate responses that contradicted those facts or extrapolated beyond them in harmful ways. One widely circulated exchange had the chatbot insisting it was not an AI, referencing retrieved text about human cognition as if it applied to itself. The retrieval was working (relevant passages were being fetched), but the system had no mechanism to flag when retrieved context was being misapplied by the generator. The eval gap was not in finding relevant documents; it was in detecting misuse of those documents. Microsoft began pushing weekly updates within days, but the episode illustrated that relevance alone is an incomplete measure of RAG quality.
RAG quality failures occur at three distinct layers, and each requires a different evaluation lens. The first is retrieval quality: did the system find the right documents? The second is grounding quality: does the generated answer actually reflect what was retrieved? The third is answer quality: is the final response accurate, complete, and appropriate for the user?
These layers are not independent. A perfect retrieval step followed by poor grounding produces hallucinated answers that look authoritative because they are decorated with real citations. Conversely, a flawed retrieval step that happens to return marginally useful context can still yield a passable answer for simple queries, masking the underlying retrieval problem until a harder query exposes it.
This layered dependency is why end-to-end answer quality metrics, measured alone, are insufficient. A system can pass end-to-end tests while harboring severe retrieval bugs that only manifest on specific query distributions. Robust evaluation requires instrumented signals at each layer independently.
In a classification task, you have a fixed label space and ground truth for every example. In RAG evaluation, the "correct" context set for a given query is often ambiguous, query-dependent, and can involve multiple equally valid document combinations. Precision and recall exist, but their denominators are contested: what counts as a relevant document is itself a judgment call.
RAG systems exhibit several failure modes that are particularly dangerous because they produce plausible-looking outputs: context drift, where retrieved passages are semantically adjacent but subtly off-topic for the specific sub-question being asked; temporal mismatch, where retrieved documents are factually correct but outdated relative to the query's implicit time frame; and coverage gaps, where the system retrieves some relevant content but misses the single document that would have answered the question correctly.
Coverage gaps were documented extensively in evaluations on the BEIR benchmark, introduced in 2021. The BEIR authors tested retrieval systems across 18 heterogeneous datasets and found that dense models that scored well on in-domain benchmarks like MS MARCO dropped dramatically on out-of-domain tasks, often not because they retrieved obviously wrong documents, but because they retrieved highly similar but subtly incorrect ones. The failure was invisible until the evaluation corpus was deliberately varied.
What makes RAG evaluation particularly challenging is compounding: small retrieval errors amplify in the generation step. An LLM given four mostly-correct passages and one confidently wrong passage will often weight the wrong passage heavily if it is more specific or more assertively worded than the correct ones. Anthropic's research on context manipulation (published in their 2023 work on "sleeper agent" and context-injection robustness) showed that LLMs are systematically overconfident about specific-sounding information even when it contradicts established context. In a RAG pipeline, a single retrieved document containing outdated statistics with precise-sounding numbers will often dominate a response over four correct documents with more qualified language.
This means evaluation cannot treat retrieval quality and generation quality as separate accounting problems. A meaningful evaluation framework must measure the interaction effect: what happens when imperfect retrieval meets an LLM with known biases toward confident-sounding specifics.
Evaluate at every layer independently AND measure the interaction between layers. A system where retrieval earns a 0.82 nDCG but answers score 0.91 accuracy is hiding something: the high answer accuracy may be concealing cases where the LLM is successfully guessing despite retrieval failure, which will collapse on harder query distributions.
You'll be presented with RAG failure scenarios. Your job is to identify whether the failure is at the retrieval layer, the grounding layer, or the answer quality layer, and explain your reasoning. The AI tutor will challenge your analysis and push you toward precise diagnostic thinking.
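One way to make the interaction visible is to cross-tabulate retrieval success against answer correctness for each query. The sketch below is illustrative only: the record fields `retrieval_hit` and `answer_correct` are hypothetical names, and how each boolean is judged is left to the evaluation stack.

```python
from collections import Counter

def layer_interaction(records):
    """Cross-tabulate retrieval success against answer correctness.

    Each record is a dict with two booleans (hypothetical field names):
      retrieval_hit  - a gold document appeared in the retrieved set
      answer_correct - the final answer was judged correct
    Returns the share of queries falling in each of the four cells.
    """
    if not records:
        raise ValueError("need at least one evaluated query")
    counts = Counter((r["retrieval_hit"], r["answer_correct"]) for r in records)
    total = len(records)
    return {
        "hit_and_correct": counts[(True, True)] / total,
        "hit_but_wrong": counts[(True, False)] / total,     # points at grounding
        "miss_but_correct": counts[(False, True)] / total,  # not explained by retrieval
        "miss_and_wrong": counts[(False, False)] / total,
    }

records = [
    {"retrieval_hit": True, "answer_correct": True},
    {"retrieval_hit": True, "answer_correct": False},
    {"retrieval_hit": False, "answer_correct": True},
    {"retrieval_hit": False, "answer_correct": True},
]
report = layer_interaction(records)
print(report)
```

A large `miss_but_correct` share is the warning sign described above: correct answers that retrieval cannot account for.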
When Microsoft released the MS MARCO dataset in 2016 (100,000 real Bing queries with human-annotated relevant passages), it became the dominant benchmark for evaluating passage retrieval. Systems were ranked by MRR@10: mean reciprocal rank of the first relevant result in the top 10. By 2020, neural models had pushed MRR@10 above 0.40, seemingly excellent. But researchers publishing in the 2022 SIGIR proceedings showed that these same models had catastrophically poor recall on queries where the answer existed only once in the corpus, so-called "singleton relevant" queries. MRR@10 was insensitive to this failure because it only cared about whether the first relevant document appeared in the top 10, not whether all relevant documents were retrieved. A system could score 0.40 MRR@10 while completely missing 30% of answerable queries that had exactly one correct passage.
The most fundamental retrieval metrics are precision@k and recall@k. Precision@k measures what fraction of the top-k retrieved documents are actually relevant. Recall@k measures what fraction of all relevant documents appear in the top k. These metrics trade off against each other and neither alone is sufficient.
For RAG systems specifically, recall@k is often more important than precision@k. A RAG generator given 5 relevant and 3 irrelevant documents can usually extract the right answer. A generator given 8 highly relevant documents but missing the single document containing the key fact will fail. This asymmetry means RAG evaluation should weight recall more heavily than standard IR evaluation does.
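The two definitions can be sketched in a few lines. This is a minimal illustration, assuming document IDs are unique within a ranked list and that the relevant set is known; the IDs below are made up.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        raise ValueError("recall is undefined without any relevant documents")
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

retrieved = ["d7", "d2", "d9", "d4", "d1"]  # ranked retriever output
relevant = {"d2", "d4", "d8"}               # d8 is never retrieved

p5 = precision_at_k(retrieved, relevant, k=5)  # 2 of 5 retrieved are relevant
r5 = recall_at_k(retrieved, relevant, k=5)     # 2 of 3 relevant were retrieved
print(p5, r5)
```

If `d8` holds the key fact, no improvement in precision repairs the answer; only recall registers the gap.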
MRR measures how high in the ranked list the first relevant document appears. For a set of queries $Q$, $\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$, where $\mathrm{rank}_i$ is the position of the first relevant document for query $i$. MRR rewards systems that surface at least one relevant document early.
MRR is appropriate when users need only one good document to answer their question. It is poorly suited to RAG when: (1) queries require synthesis across multiple documents, (2) the corpus has many relevant documents that all contribute partial information, or (3) the most important document for an answer is not the most obviously relevant one. The MS MARCO finding described above is a direct consequence of MRR's indifference to recall.
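A small sketch of MRR@k shows how a complete miss is absorbed into the average. It assumes the common convention that a query with no relevant document in the top k contributes a reciprocal rank of 0; the runs are invented for illustration.

```python
def reciprocal_rank(retrieved, relevant, k=10):
    """1/rank of the first relevant document in the top k, or 0.0 if none."""
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mrr_at_k(runs, k=10):
    """Mean reciprocal rank over (retrieved_list, relevant_set) pairs."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(ret, rel, k) for ret, rel in runs) / len(runs)

runs = [
    (["a", "b", "c"], {"a"}),       # first relevant at rank 1 -> 1.0
    (["d", "e", "f", "g"], {"g"}),  # first relevant at rank 4 -> 0.25
    (["h", "i", "j"], {"z"}),       # relevant document never retrieved -> 0.0
]
score = mrr_at_k(runs)
print(round(score, 4))
```

One of the three queries is unanswerable from the retrieved set, yet the aggregate is still about 0.42. The single number does not show which queries failed outright.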
nDCG addresses MRR's limitations by: (1) assigning graded relevance scores rather than binary relevant/irrelevant, and (2) discounting the value of relevant documents that appear lower in the ranking. A document at rank 1 contributes more to DCG than the same document at rank 5.
Normalized DCG divides the actual DCG by the ideal DCG (IDCG), the score you'd get if all relevant documents were ranked perfectly. This normalization puts scores on a 0-1 scale comparable across queries with different numbers of relevant documents.
nDCG is the dominant metric in modern IR because it handles multi-graded relevance and position sensitivity simultaneously. However, it still has blind spots: it assumes you can correctly define and score all relevant documents in advance, which in dynamic RAG corpora is often impossible. It also does not penalize irrelevant high-ranked documents beyond their rank position: a spurious but confidently-worded document at rank 2 only costs you the DCG contribution of the correct document it displaced, not the damage it does when the LLM incorporates it.
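As a worked sketch, the function below computes nDCG@k under two assumptions not stated in the text: gains are used linearly (some implementations use 2^rel - 1 instead), and the discount is log2(rank + 1). The grades are invented.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(ranked_gains, all_gains, k):
    """nDCG@k.

    ranked_gains: graded relevance of the retrieved documents, in ranked order
    all_gains:    graded relevance of every judged document for the query,
                  including relevant ones the system failed to retrieve
    """
    ideal = sorted(all_gains, reverse=True)[:k]
    idcg = dcg(ideal)
    return dcg(ranked_gains[:k]) / idcg if idcg > 0 else 0.0

# The grade-3 document is ranked first, but the grade-2 document slipped to rank 3.
score = ndcg_at_k(ranked_gains=[3, 0, 2], all_gains=[3, 2, 0], k=3)
print(round(score, 4))
```

Here DCG is 3 + 0 + 2/log2(4) = 4, IDCG is 3 + 2/log2(3), and the ratio is roughly 0.94.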
| Metric | Best Used When | Blind Spot in RAG |
|---|---|---|
| Precision@k | Corpus is noisy; LLM is distracted by irrelevant context | Misses retrieval gaps where the right doc was simply not retrieved |
| Recall@k | Multi-hop questions; synthesis across many documents required | Requires knowing the complete relevant set, which is often undefined |
| MRR@k | Single-answer factoid queries where one good doc is sufficient | Completely insensitive to singleton-relevant query failures |
| nDCG@k | Graded relevance; ranking quality matters more than cutoff | Does not model how LLMs weight specific vs. qualified language in retrieved docs |
In production RAG systems, report at minimum: nDCG@5, Recall@10, and MRR@10. Each catches different failure modes. A system that scores well on all three is genuinely performing well at the retrieval layer across diverse query types.
You'll receive retrieval scenario data (query, relevant doc set, and a ranked retrieved list) and calculate nDCG, MRR, and Recall@k by hand. The tutor will check your math, probe your interpretation, and help you understand when different metrics give conflicting signals.
In late 2023, researchers at Exploding Gradients released RAGAS (Retrieval Augmented Generation Assessment), an open-source framework that became rapidly adopted for evaluating RAG pipelines. RAGAS introduced three metrics: faithfulness (does the answer contain only claims supportable by the retrieved context?), answer relevance (how relevant is the answer to the original question?), and context precision (what fraction of the retrieved context was actually useful for generating the answer?). The framework used an LLM, initially GPT-4, to compute each metric automatically. Within months of release, practitioners discovered systematic bias: GPT-4 rated answers from GPT-4-generated RAG pipelines higher on faithfulness than humans did, suggesting the LLM judge was sharing systematic generation biases with the systems it was evaluating.
Traditional NLP used BLEU, ROUGE, and METEOR to evaluate text generation quality. These metrics compare n-gram overlap between a generated answer and reference answers. For RAG, they fail for a fundamental reason: there is rarely a single canonical reference answer. A correct RAG answer might use entirely different phrasing than the reference text while being factually superior, or it might match the reference phrasing perfectly while being subtly wrong due to a retrieval error. Lexical similarity to a reference is not a proxy for factual correctness in open-domain RAG.
BERTScore improved on this by using contextualized embeddings to compare semantic similarity rather than lexical overlap. But BERTScore still requires reference answers and still cannot detect faithfulness violations: a generated answer could be semantically similar to the reference while containing claims not supported by the retrieved context.
Faithfulness in RAGAS measures whether each claim in the generated answer can be inferred from the retrieved context. The LLM judge decomposes the answer into atomic claims and checks each claim against the context set. A faithfulness score of 1.0 means every claim is context-supported; 0.0 means all claims are unsupported hallucinations. This is arguably the most important RAG-specific metric because it directly measures the grounding layer.
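The scoring step reduces to a ratio of supported claims to total claims. The sketch below is not the RAGAS API. It assumes the answer has already been decomposed into atomic claims, and it takes the judge as a hypothetical callable `is_supported(claim, context)`. The substring check is a toy stand-in for an LLM judge so that the example runs on its own.

```python
from typing import Callable, Sequence

def faithfulness(
    claims: Sequence[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Share of atomic claims that the judge marks as supported by the context."""
    if not claims:
        raise ValueError("the answer produced no claims to check")
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)

def toy_judge(claim: str, context: str) -> bool:
    # Stand-in for an LLM judge: case-insensitive substring match only.
    return claim.lower() in context.lower()

context = "The Eiffel Tower is in Paris. It opened in 1889."
claims = [
    "the eiffel tower is in paris",
    "it opened in 1889",
    "it is 500 metres tall",  # not in the context: an unsupported claim
]
score = faithfulness(claims, context, toy_judge)
print(round(score, 3))
```

Two of the three claims are supported, so the score is about 0.667.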
Answer Relevance measures whether the answer addresses the question asked. This catches verbose or tangential answers that might score well on faithfulness because they only make context-supported claims, even though few of those claims actually address what the user wanted. RAGAS computes this by having an LLM generate candidate questions for the answer and measuring cosine similarity between those generated questions and the original query.
Context Recall measures whether the retrieved context contains the information needed to answer the question, estimated by checking whether the ground-truth answer's claims can be attributed to the retrieved context.
The LLM-as-judge approach gained major traction after the MT-Bench and Chatbot Arena papers from LMSYS in 2023, which reported that GPT-4 judgments agreed with human preferences more than 80% of the time on dialogue quality tasks, roughly the level of agreement between human raters. This seemed to validate using LLMs as scalable automated evaluators for RAG.
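The similarity step can be sketched as a mean cosine similarity between the original question and the questions generated from the answer. This is a simplified illustration, not the RAGAS implementation. The bag-of-words `bow_embed` is a hypothetical stand-in for a real embedding model, and the generated questions are supplied by hand rather than by an LLM.

```python
import math
from collections import Counter

def bow_embed(text):
    """Toy embedding: lowercase token counts (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def answer_relevance(question, generated_questions, embed=bow_embed):
    """Mean similarity between the original question and the generated ones."""
    if not generated_questions:
        raise ValueError("need at least one generated question")
    q_vec = embed(question)
    sims = [cosine(q_vec, embed(g)) for g in generated_questions]
    return sum(sims) / len(sims)

question = "when did the eiffel tower open"
generated = [
    "when did the eiffel tower open",  # the answer addresses the question
    "how tall is mount everest",       # the answer drifted off topic
]
relevance = answer_relevance(question, generated)
print(round(relevance, 3))
```

With one on-topic and one unrelated generated question, the score lands at 0.5.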
However, subsequent research identified three systematic biases specific to faithfulness evaluation: position bias (LLM judges rate claims appearing earlier in context as better supported, even when later context is equally or more relevant); verbosity bias (longer answers receive higher faithfulness scores even when the added length introduces unsupported claims); and self-consistency bias (when an LLM judge and the evaluated system share the same base model, the judge rates outputs higher because it is biased toward outputs it would itself produce).
The self-consistency bias is particularly important. The RAGAS finding about GPT-4 judging GPT-4 outputs is a specific instance of a general problem: using the same model family to evaluate itself inflates faithfulness scores relative to human judgment by 8 to 15 percentage points in several published comparisons.
Use a judge model from a different family than your generation model. If generating with GPT-4, evaluate with Claude or Gemini. If generating with an open-source model, evaluate with a different open-source architecture. This does not eliminate LLM-as-judge biases but substantially reduces the self-consistency inflation effect.
Production RAG systems at companies like Cohere, Anthropic, and Databricks use hybrid evaluation stacks that combine: (1) automated LLM-as-judge scoring with multiple judges from different families; (2) human evaluation on a sampled subset of 100 to 500 queries per evaluation cycle; and (3) behavioral testing, meaning specific adversarial queries designed to trigger known failure modes. No single metric is trusted alone.
Databricks' DBRX documentation described a four-metric combination: LLM faithfulness score, human preference rate on a sample, correctness against a factoid test set with known ground truth, and retrieval nDCG on a separate held-out query set. The rationale was that each metric catches different failure modes; a system must perform well on all four to be considered production-ready.
You'll evaluate RAG-generated answers for faithfulness, relevance, and grounding quality using the RAGAS framework concepts. Practice decomposing answers into atomic claims and checking each claim against the provided context. The tutor will challenge your faithfulness assessments and help you identify the biases you should control for.
In August 2024, researchers at PromptArmor published findings that Slack's AI summarization feature (a RAG-adjacent system that retrieved messages and summarized them) could be manipulated to leak information from private channels through prompt injection in retrieved messages. More relevant to evaluation: the failures only surfaced when the message corpus included adversarial inputs, something the internal evaluation pipeline had not tested. Slack had validated the feature on clean internal message corpora. The regression detection gap was not in the metric chosen but in the test set construction: the evaluation distribution had not included the class of inputs where the system failed. This is the canonical argument for adversarial and out-of-distribution test sets in any RAG evaluation pipeline.
A production RAG evaluation test set must cover at minimum four query categories: (1) factoid queries with single correct answers verifiable against ground truth; (2) synthesis queries requiring information from multiple documents; (3) edge queries designed to test known system weaknesses, such as temporal sensitivity, numerical reasoning, or negation; and (4) out-of-distribution queries drawn from domains not well represented in the training or indexing corpus.
The size of a meaningful test set depends on the variance you need to detect. For a 5-percentage-point regression to be detectable at 95% confidence, you need approximately 400 queries (by standard power analysis). Many production teams use 200-300 queries, a size at which a 5-point regression can easily go undetected. This is a known risk trade-off, not a best practice.
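One simple calculation that lands near this figure is the worst-case margin of error for a single proportion, n = z² p(1 - p) / e², with p = 0.5. Treating that as the origin of the 400-query figure is an assumption. A full two-sample power analysis would also depend on the baseline rate and the desired power.

```python
import math

def queries_needed(margin, p=0.5, z=1.96):
    """Sample size so that a proportion is estimated within +/- margin.

    p = 0.5 is the worst case (maximum variance); z = 1.96 corresponds
    to a two-sided 95% confidence level.
    """
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the 95% interval for a proportion measured on n queries."""
    return z * math.sqrt(p * (1 - p) / n)

n_for_5pp = queries_needed(0.05)
moe_250 = margin_of_error(250)
print(n_for_5pp, round(moe_250, 3))
```

Under these assumptions about 385 queries are needed for a 5-point margin. A 250-query set has a margin of roughly 6.2 points, wider than the regression it is meant to catch.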
Perplexity AI has described using continuous evaluation with a combination of human rater samples, automated RAGAS-style faithfulness scoring, and a dedicated regression test set of ~500 questions with verified answers. The test set is versioned alongside model releases, and any deployment that drops more than 2 percentage points on faithfulness or 3 percentage points on answer accuracy against the test set is blocked for human review before release.
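A gate of this kind can be sketched as a comparison of per-metric drops against fixed limits. The function and metric names below are hypothetical and do not describe any company's actual pipeline. The 2-point and 3-point limits are taken from the paragraph above.

```python
# Maximum tolerated drop per metric, as fractions (0.02 = 2 percentage points).
MAX_DROP = {"faithfulness": 0.02, "answer_accuracy": 0.03}

def release_gate(baseline, candidate, max_drop=MAX_DROP):
    """Return the metrics whose drop exceeds its limit.

    An empty dict means the candidate may ship; anything else means the
    deployment should be held for human review.
    """
    violations = {}
    for metric, limit in max_drop.items():
        drop = baseline[metric] - candidate[metric]
        if drop > limit:
            violations[metric] = drop
    return violations

baseline = {"faithfulness": 0.90, "answer_accuracy": 0.85}
candidate = {"faithfulness": 0.87, "answer_accuracy": 0.84}

blocked_on = release_gate(baseline, candidate)
print(sorted(blocked_on))
```

Here faithfulness fell by about 3 points, over its 2-point limit, so the release is held even though accuracy stayed within tolerance.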
One of the most overlooked evaluation problems in production RAG is index drift: the retrieval corpus changes over time (documents are added, updated, or deleted), causing the system's behavior to change even when no code has been modified. A RAG system that scored 0.85 faithfulness in January may score 0.76 in July not because anything was redeployed but because new documents introduced conflicting information or outdated documents that used to be retrieved first were displaced by newer additions.
Addressing index drift requires: (1) versioned corpus snapshots for reproducible evaluation; (2) a static test corpus subset that never changes, allowing longitudinal comparison; and (3) periodic re-annotation of test queries against the current corpus to flag cases where the ground-truth answer has changed due to corpus updates.
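One lightweight way to make snapshots comparable is to store a content fingerprint of the corpus next to every evaluation run. The sketch below is an assumption about how this could be done, not a prescribed method. It hashes document IDs and text in a stable order, so two runs can be compared only when their fingerprints match.

```python
import hashlib

def corpus_fingerprint(docs):
    """SHA-256 over (doc_id, text) pairs in sorted ID order.

    docs maps document ID to document text. Sorting makes the result
    independent of insertion order; the NUL separators keep adjacent
    fields from running together.
    """
    digest = hashlib.sha256()
    for doc_id in sorted(docs):
        digest.update(doc_id.encode("utf-8"))
        digest.update(b"\x00")
        digest.update(docs[doc_id].encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()

january = {"d1": "alpha", "d2": "beta"}
reordered = {"d2": "beta", "d1": "alpha"}
july = {"d1": "alpha", "d2": "beta (revised)"}

same = corpus_fingerprint(january) == corpus_fingerprint(reordered)
changed = corpus_fingerprint(january) != corpus_fingerprint(july)
print(same, changed)
```

A changed fingerprint between two runs signals that a metric shift may come from the corpus rather than from the code.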
A minimal production monitoring stack for RAG consists of: a shadow evaluation pipeline that runs a 1-5% sample of live production queries through the evaluation stack with only a small latency overhead; a metric dashboard tracking rolling 24-hour and 7-day averages of faithfulness, answer relevance, and retrieval nDCG; and automated alerting that fires when any metric drops by more than a defined threshold (typically 2-3 standard deviations from the rolling baseline).
LangChain, LlamaIndex, and Arize AI all offer commercial tooling for this monitoring pattern. The open-source alternative is RAGAS with a PostgreSQL metrics store and a Grafana dashboard. The specific tools matter less than the discipline of running evaluation continuously rather than only at release time.
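The alerting rule can be sketched as a z-score check against the rolling baseline. The function name and the daily faithfulness values are illustrative assumptions. A real deployment would read the history from its metrics store.

```python
import statistics

def should_alert(history, current, n_sigma=3.0):
    """True if `current` sits more than n_sigma standard deviations
    below the mean of the rolling baseline `history`."""
    if len(history) < 2:
        raise ValueError("need at least two baseline points")
    mean = statistics.fmean(history)
    sd = statistics.stdev(history)
    if sd == 0:
        # Perfectly flat baseline: any drop at all is anomalous.
        return current < mean
    return (mean - current) / sd > n_sigma

# Seven daily faithfulness averages forming the rolling baseline.
baseline = [0.85, 0.86, 0.84, 0.85, 0.87, 0.85, 0.86]

alert_on_drop = should_alert(baseline, 0.80)    # far below the baseline
alert_on_normal = should_alert(baseline, 0.85)  # within normal variation
print(alert_on_drop, alert_on_normal)
```

Only downward moves trigger the alert here. Whether upward jumps should also be flagged is a design choice this sketch leaves open.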
Regression detection catches when a new version of the system performs worse on the test set than the previous version. A/B testing catches when a new version performs differently (better or worse) on live traffic. Both are necessary because they catch different failure classes: regressions on the test set catch known failure mode deterioration; A/B testing on live traffic catches distribution shift failures that the test set did not anticipate.
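For the A/B side, one common way to judge whether a difference on live traffic is more than noise is a two-proportion z-test. The text does not prescribe a particular test, so this is offered as one option under that assumption. The counts are invented.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic for the difference in success rates (B minus A),
    using the pooled standard error."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Version A: 850 of 1000 sampled answers judged correct; version B: 820 of 1000.
z = two_proportion_z(850, 1000, 820, 1000)
significant = abs(z) > 1.96  # two-sided test at the 5% level
print(round(z, 2), significant)
```

With 1,000 sampled queries per arm, a 3-point drop gives |z| of about 1.81, short of the 1.96 cutoff. The comparison needs more traffic before the drop can be called real.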
The 2024 Elasticsearch documentation on their vector search quality evaluation framework specifically called out the need for both: static regression testing for "safety against known regressions" and live A/B testing for "discovery of unknown regressions on long-tail queries." This two-layer strategy is now considered standard for any RAG system with significant query volume.
- Test set with 400+ queries across factoid, synthesis, edge, and OOD categories
- Retrieval metrics: nDCG@5, Recall@10, MRR@10
- Answer quality metrics: faithfulness (cross-family LLM judge), answer relevance, context recall
- Human evaluation sample at each major release
- Versioned corpus snapshots for reproducibility
- Continuous 1-5% shadow evaluation on live traffic
- Automated alerts on metric drops exceeding 2-3 SD
- A/B testing on live traffic for new version deployments

You'll design an end-to-end evaluation pipeline for a specific RAG deployment scenario. The tutor will challenge your metric choices, test set design, monitoring architecture, and regression detection thresholds, pushing you to justify every decision with reference to what failure modes it catches and what it misses.