Module 6 · Lesson 1

Why RAG Quality Is Hard to Measure

Retrieval errors are silent, compounding, and context-dependent, making evaluation fundamentally different from classification tasks.
What actually goes wrong when retrieved context is subtly off, and how do we even detect it?

When Microsoft launched the new Bing Chat in February 2023, early testers began documenting a consistent failure pattern: the system would retrieve accurate background facts but then generate responses that contradicted those facts or extrapolated beyond them in harmful ways. One widely circulated exchange had the chatbot insisting it was not an AI, referencing retrieved text about human cognition as if it applied to itself. The retrieval was working (relevant passages were being fetched), but the system had no mechanism to flag when the generator misapplied retrieved context. The evaluation gap was not in finding relevant documents; it was in detecting misuse of those documents. Microsoft began shipping frequent updates within days, but the episode illustrated that relevance alone is an incomplete measure of RAG quality.

The Three-Layer Quality Problem

RAG quality failures occur at three distinct layers, and each requires a different evaluation lens. The first is retrieval quality: did the system find the right documents? The second is grounding quality: does the generated answer actually reflect what was retrieved? The third is answer quality: is the final response accurate, complete, and appropriate for the user?

These layers are not independent. A perfect retrieval step followed by poor grounding produces hallucinated answers that look authoritative because they are decorated with real citations. Conversely, a flawed retrieval step that happens to return marginally useful context can still yield a passable answer for simple queries, masking the underlying retrieval problem until a harder query exposes it.

This layered dependency is why end-to-end answer quality metrics, measured alone, are insufficient. A system can pass end-to-end tests while harboring severe retrieval bugs that only manifest on specific query distributions. Robust evaluation requires instrumented signals at each layer independently.

Why Classification Metrics Don't Transfer

In a classification task, you have a fixed label space and ground truth for every example. In RAG evaluation, the "correct" context set for a given query is often ambiguous, query-dependent, and can involve multiple equally valid document combinations. Precision and recall exist, but their denominators are contested: what counts as a relevant document is itself a judgment call.

Silent Failure Modes

RAG systems exhibit several failure modes that are particularly dangerous because they produce plausible-looking outputs: context drift, where retrieved passages are semantically adjacent but subtly off-topic for the specific sub-question being asked; temporal mismatch, where retrieved documents are factually correct but outdated relative to the query's implicit time frame; and coverage gaps, where the system retrieves some relevant content but misses the single document that would have answered the question correctly.

Failures of this kind were documented in the BEIR benchmark (2021). Researchers at UKP Lab, TU Darmstadt tested retrieval systems across 18 heterogeneous datasets and found that dense models that scored well in-domain on MS MARCO dropped sharply on out-of-domain tasks, in many cases falling behind a BM25 baseline. The retrieved documents often looked highly similar to the query while being subtly incorrect for it. The failure was invisible until the evaluation corpus was deliberately varied.

Context Drift: Retrieval of passages that are semantically related to the query but do not contain the specific information needed to answer it correctly.
Coverage Gap: A failure where all retrieved documents are genuinely relevant but the single most important document for answering the query was not retrieved.
Grounding Failure: A generation-layer error where the model produces claims that are not supported by, or contradict, the retrieved context it was given.
The Compounding Problem

What makes RAG evaluation particularly challenging is compounding: small retrieval errors amplify in the generation step. An LLM given four mostly-correct passages and one confidently wrong passage will often weight the wrong passage heavily if it is more specific or more assertively worded than the correct ones. Research on how LLMs resolve conflicting context suggests that models tend to favor specific, confident-sounding evidence even when it contradicts other passages in the prompt. In a RAG pipeline, a single retrieved document containing outdated statistics with precise-sounding numbers can dominate a response over four correct documents with more qualified language.

This means evaluation cannot treat retrieval quality and generation quality as separate accounting problems. A meaningful evaluation framework must measure the interaction effect: what happens when imperfect retrieval meets an LLM with known biases toward confident-sounding specifics.

Design Principle

Evaluate at every layer independently AND measure the interaction between layers. A system where retrieval earns 0.82 nDCG but answers score 0.91 accuracy deserves a closer look: the high answer accuracy may be concealing cases where the LLM is guessing correctly despite retrieval failure, a pattern that will collapse on harder query distributions.
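One way to look for the interaction described above is to cross-tabulate per-query retrieval outcomes against answer outcomes. The sketch below is only illustrative: the record fields `retrieval_hit` and `answer_correct` are assumed names, and it presumes each test query has already been labeled at both layers.

```python
from collections import Counter

def layer_crosstab(records):
    """Count queries by (retrieval hit?, answer correct?)."""
    counts = Counter()
    for r in records:
        counts[(r["retrieval_hit"], r["answer_correct"])] += 1
    return counts

def lucky_guess_rate(records):
    """Share of correct answers produced without the needed document retrieved."""
    correct = [r for r in records if r["answer_correct"]]
    if not correct:
        return 0.0
    lucky = sum(1 for r in correct if not r["retrieval_hit"])
    return lucky / len(correct)

# Hypothetical per-query labels, for illustration only.
records = [
    {"retrieval_hit": True, "answer_correct": True},
    {"retrieval_hit": True, "answer_correct": False},   # likely grounding failure
    {"retrieval_hit": False, "answer_correct": True},   # correct despite a retrieval miss
    {"retrieval_hit": False, "answer_correct": False},  # retrieval failure
    {"retrieval_hit": True, "answer_correct": True},
]

print(layer_crosstab(records))
print(lucky_guess_rate(records))
```

A high lucky-guess rate suggests the answer accuracy number depends on the model's prior knowledge rather than on retrieval, which is the case the design principle warns about.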

Lesson 1 Quiz

Why RAG Quality Is Hard to Measure
1. The Bing Chat failures in February 2023 primarily illustrated which evaluation gap?
Correct. Retrieval was functioning but the generator misapplied the retrieved content: a grounding failure invisible to standard retrieval metrics.
Not quite. The issue was not retrieval relevance but what the generator did with retrieved content, a grounding-layer problem.
2. What distinguishes "context drift" from a straightforward retrieval miss?
Correct. Documents retrieved during context drift pass surface relevance checks but lack the specific sub-topic information the query required.
Context drift is specifically about retrieving plausibly related but subtly off-target documents, which is harder to detect than an outright miss.
3. According to the BEIR benchmark findings (2021), what type of failure did dense retrieval models exhibit on out-of-domain tasks?
Correct. The BEIR findings showed that good in-domain performance masked brittle out-of-domain behavior caused by near-miss retrievals.
The BEIR finding was more subtle: models retrieved convincingly similar but wrong documents, only detectable through deliberate corpus variation.
4. Why is measuring only end-to-end answer quality insufficient for RAG evaluation?
Correct. LLMs can guess correctly despite retrieval failure, masking the underlying problem until query distribution shifts expose it.
The key insight is that LLMs can compensate for retrieval failures on easy queries, making end-to-end metrics blind to retrieval-layer bugs.

Lab 1: Diagnosing RAG Failure Modes

Practice identifying which layer of a RAG system has failed based on symptom descriptions.

Your Task

You'll be presented with RAG failure scenarios. Your job is to identify whether the failure is at the retrieval layer, the grounding layer, or the answer quality layer, and to explain your reasoning. The AI tutor will challenge your analysis and push you toward precise diagnostic thinking.

Starter: "A RAG system is asked 'What is the current Fed funds rate?' It retrieves three documents about the Federal Reserve from 2021 and generates a confident answer citing those documents. The answer is wrong. Which layer failed and why?"
RAG Eval Tutor
Lab 1
Welcome to Lab 1. I'm your RAG evaluation diagnostic tutor. Let's work through failure mode identification together. Start with the scenario above, or describe a RAG failure you want to analyze. I'll challenge your reasoning until we've precisely pinpointed the layer that failed and why.
Module 6 · Lesson 2

Retrieval Metrics: Precision, Recall, nDCG, MRR

The mathematical vocabulary of retrieval quality, and the specific failure conditions each metric is blind to.
When does a high nDCG score mask a broken retrieval system?

When Microsoft released the MS MARCO dataset in 2016, with 100,000 real Bing queries and human-annotated relevant passages, it became the dominant benchmark for evaluating passage retrieval. Systems were ranked by MRR@10: the mean reciprocal rank of the first relevant result in the top 10. By 2020, neural models had pushed MRR@10 above 0.40, seemingly excellent. But researchers publishing in the 2022 SIGIR proceedings showed that these same models had catastrophically poor recall on queries where the answer existed only once in the corpus, so-called "singleton relevant" queries. The headline MRR@10 hid this failure for two reasons: it scores only the first relevant document per query, ignoring whether the rest were retrieved, and it is an average, so strong ranks on easy queries offset zeros on hard ones. A system could score 0.40 MRR@10 while completely missing 30% of answerable queries that had exactly one correct passage.

Precision and Recall at Rank k

The most fundamental retrieval metrics are precision@k and recall@k. Precision@k measures what fraction of the top-k retrieved documents are actually relevant. Recall@k measures what fraction of all relevant documents appear in the top k. These metrics trade off against each other and neither alone is sufficient.

For RAG systems specifically, recall@k is often more important than precision@k. A RAG generator given 5 relevant and 3 irrelevant documents can usually extract the right answer. A generator given 8 highly relevant documents but missing the single document containing the key fact will fail. This asymmetry means RAG evaluation should weight recall more heavily than standard IR evaluation does.

Precision@k = |{relevant docs} ∩ {retrieved top-k}| / k
Recall@k = |{relevant docs} ∩ {retrieved top-k}| / |{all relevant docs}|
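The two formulas above can be sketched directly in Python. This is a minimal illustration, assuming document IDs are hashable, the retrieved list has no duplicates, and the relevant set is fully known; the function names and the sample IDs are made up for the example.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved doc IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant doc IDs that appear in the top k."""
    if not relevant:
        return 0.0  # assumption: recall is reported as 0 when nothing is relevant
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

# Hypothetical ranked list and relevance judgments.
retrieved = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "z"}  # "z" was never retrieved

print(precision_at_k(retrieved, relevant, 3))  # one hit in the top 3
print(recall_at_k(retrieved, relevant, 5))     # two of three relevant docs found
```

Note that the denominator of precision is k even when fewer than k documents are returned, which matches the formula as written.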
Mean Reciprocal Rank (MRR)

MRR measures how high in the ranked list the first relevant document appears. For a set of queries Q, MRR = (1/|Q|) × Σ (1/rank_i), where rank_i is the position of the first relevant document for query i. MRR rewards systems that surface at least one relevant document early.

MRR is appropriate when users need only one good document to answer their question. It is poorly suited to RAG when: (1) queries require synthesis across multiple documents, (2) the corpus has many relevant documents that all contribute partial information, or (3) the most important document for an answer is not the most obviously relevant one. The MS MARCO finding described above is a direct consequence of MRR's indifference to recall.

MRR = (1/|Q|) × Σᵢ (1 / rank of first relevant doc for query i)
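A small sketch of the MRR formula follows, with a cutoff k so it can be read as MRR@k. The function names and the toy runs are illustrative assumptions; a query whose relevant document never appears in the top k contributes 0.

```python
def reciprocal_rank(retrieved, relevant, k=10):
    """1/rank of the first relevant doc in the top k, or 0.0 if none appears."""
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs, k=10):
    """runs: list of (ranked doc IDs, set of relevant doc IDs), one per query."""
    if not runs:
        raise ValueError("at least one query is required")
    return sum(reciprocal_rank(r, rel, k) for r, rel in runs) / len(runs)

# Hypothetical runs: one hit at rank 2, one at rank 1, one complete miss.
runs = [
    (["x", "a"], {"a"}),
    (["b"], {"b"}),
    (["c", "d"], {"z"}),
]
print(mean_reciprocal_rank(runs))
```

The third query scores zero, yet the average stays at 0.5. Reporting the share of zero-score queries alongside the mean is one way to keep such misses visible.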
Normalized Discounted Cumulative Gain (nDCG)

nDCG addresses MRR's limitations by: (1) assigning graded relevance scores rather than binary relevant/irrelevant, and (2) discounting the value of relevant documents that appear lower in the ranking. A document at rank 1 contributes more to DCG than the same document at rank 5.

Normalized DCG divides the actual DCG by the ideal DCG (IDCG), the score you'd get if all relevant documents were ranked perfectly. This normalization puts scores on a 0–1 scale comparable across queries with different numbers of relevant documents.

nDCG is the dominant metric in modern IR because it handles multi-graded relevance and position sensitivity simultaneously. However, it still has blind spots: it assumes you can correctly define and score all relevant documents in advance, which in dynamic RAG corpora is often impossible. It also does not penalize irrelevant high-ranked documents beyond their rank position: a spurious but confidently-worded document at rank 2 only costs you the DCG contribution of the correct document it displaced, not the damage it does when the LLM incorporates it.

DCG@k = Σᵢ₌₁ᵏ (2^relᵢ − 1) / log₂(i + 1)
nDCG@k = DCG@k / IDCG@k
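The DCG and nDCG formulas can be sketched as below, using the same exponential gain (2^rel − 1) and log₂(i + 1) discount. The function names are illustrative, and the sketch assumes graded relevance labels are available for every judged document, including relevant ones the system failed to retrieve.

```python
import math

def dcg_at_k(gains, k):
    """DCG over the first k graded relevance labels, in ranked order."""
    return sum(
        (2 ** g - 1) / math.log2(i + 1)
        for i, g in enumerate(gains[:k], start=1)
    )

def ndcg_at_k(ranked_gains, all_gains, k):
    """ranked_gains: labels of retrieved docs in rank order.
    all_gains: labels of every judged doc for the query (used for the ideal ranking)."""
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    if ideal == 0:
        return 0.0  # assumption: queries with no relevant docs score 0
    return dcg_at_k(ranked_gains, k) / ideal

# Hypothetical labels: the best document (grade 2) was ranked second.
print(ndcg_at_k([0, 2, 1], [2, 1, 0], k=3))
print(ndcg_at_k([2, 1, 0], [2, 1, 0], k=3))  # ideal ordering
```

Passing the full judged set as `all_gains` matters: computing the ideal ranking only from retrieved documents would hide coverage gaps.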
Metric Selection Guide for RAG
Metric | Best Used When | Blind Spot in RAG
Precision@k | Corpus is noisy; LLM is distracted by irrelevant context | Misses retrieval gaps where the right doc was simply not retrieved
Recall@k | Multi-hop questions; synthesis across many documents required | Requires knowing the complete relevant set, which is often undefined
MRR@k | Single-answer factoid queries where one good doc is sufficient | Averaging can hide singleton-relevant query failures; ignores every relevant doc after the first
nDCG@k | Graded relevance; ranking quality matters more than cutoff | Does not model how LLMs weight specific vs. qualified language in retrieved docs
Practitioner Note

In production RAG systems, report at minimum: nDCG@5, Recall@10, and MRR@10. Each catches different failure modes. A system that scores well on all three is likely performing well at the retrieval layer across diverse query types, although none of the three measures how the generator uses what was retrieved.

Lesson 2 Quiz

Retrieval Metrics: Precision, Recall, nDCG, MRR
1. Why did high MRR@10 scores on MS MARCO mask a severe recall failure?
Correct. MRR@10 scores only the first relevant document and averages across queries, so "singleton relevant" queries (one correct doc) could be missed entirely while the aggregate MRR remained high.
The issue is structural: MRR only measures rank of first relevant doc. Queries with exactly one answerable passage can be missed entirely without affecting MRR if other queries compensate.
2. For a RAG system that requires synthesizing information across multiple documents, which metric is most important to optimize?
Correct. Multi-document synthesis fails when relevant documents are missing from the retrieved set, making recall the critical metric.
For synthesis tasks, recall matters most: the generator needs all partial-information documents, so any missed relevant doc degrades the answer.
3. What does "normalization" accomplish in nDCG that raw DCG does not provide?
Correct. Dividing by IDCG puts every query's score on a 0–1 scale regardless of how many relevant documents exist for that query.
Normalization means dividing actual DCG by ideal DCG (IDCG), creating a 0–1 scale that allows fair comparison across queries with varying relevance set sizes.
4. Which nDCG blind spot is specifically relevant to RAG but not to traditional IR?
Correct. A spurious high-ranked document only loses you its rank-position DCG contribution in standard nDCG, but in RAG it can corrupt the entire generated answer if the LLM over-weights it.
The RAG-specific blind spot is the LLM's tendency to over-weight specific-sounding documents: nDCG only penalizes misranking by position, not by downstream generation damage.

Lab 2: Computing and Interpreting Retrieval Metrics

Work through metric calculations and interpret what the numbers reveal, and conceal, about retrieval quality.

Your Task

You'll receive retrieval scenario data (query, relevant doc set, and a ranked retrieved list) and calculate nDCG, MRR, and Recall@k by hand. The tutor will check your math, probe your interpretation, and help you understand when different metrics give conflicting signals.

Starter: "For query Q, the relevant documents are {D1, D3, D5}. The system retrieved, in rank order: [D2, D1, D4, D3, D6]. Calculate Precision@3, Recall@5, and MRR. Then tell me what these numbers collectively say about retrieval quality."
Metrics Tutor
Lab 2
Welcome to Lab 2. I'll walk through retrieval metric calculations with you and help you interpret what each metric does and doesn't reveal. Use the starter scenario above, or bring your own retrieval result to analyze. Show your work, and I'll verify it and probe deeper.
Module 6 · Lesson 3

Answer Quality Metrics: RAGAS, Faithfulness, and LLM-as-Judge

How the field moved from BLEU scores to embedding-based and LLM-based evaluation, and what each approach still gets wrong.
Can an LLM reliably judge whether another LLM's RAG answer is faithful to its retrieved context?

In late 2023, researchers at Exploding Gradients released RAGAS (Retrieval Augmented Generation Assessment), an open-source framework that became rapidly adopted for evaluating RAG pipelines. RAGAS introduced three metrics: faithfulness (does the answer contain only claims supportable by the retrieved context?), answer relevance (how relevant is the answer to the original question?), and context precision (what fraction of the retrieved context was actually useful for generating the answer?). The framework used an LLM, initially GPT-4, to compute each metric automatically. Within months of release, practitioners discovered systematic bias: GPT-4 rated answers from GPT-4-generated RAG pipelines higher on faithfulness than humans did, suggesting the LLM judge was sharing systematic generation biases with the systems it was evaluating.

The Limits of Lexical Metrics

Traditional NLP used BLEU, ROUGE, and METEOR to evaluate text generation quality. These metrics compare n-gram overlap between a generated answer and reference answers. For RAG, they fail for a fundamental reason: there is rarely a single canonical reference answer. A correct RAG answer might use entirely different phrasing than the reference text while being factually superior, or it might match the reference phrasing perfectly while being subtly wrong due to a retrieval error. Lexical similarity to a reference is not a proxy for factual correctness in open-domain RAG.

BERTScore improved on this by using contextualized embeddings to compare semantic similarity rather than lexical overlap. But BERTScore still requires reference answers and still cannot detect faithfulness violations: a generated answer could be semantically similar to the reference while containing claims not supported by the retrieved context.

RAGAS Metrics in Detail

Faithfulness in RAGAS measures whether each claim in the generated answer can be inferred from the retrieved context. The LLM judge decomposes the answer into atomic claims and checks each claim against the context set. A faithfulness score of 1.0 means every claim is context-supported; 0.0 means all claims are unsupported hallucinations. This is arguably the most important RAG-specific metric because it directly measures the grounding layer.
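The scoring step described above reduces to a simple ratio once the answer has been decomposed into claims. The sketch below is not the RAGAS implementation: `faithfulness_score` and `toy_judge` are hypothetical names, and the substring check stands in for the LLM call that a real judge would make.

```python
from typing import Callable, Sequence

def faithfulness_score(
    claims: Sequence[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of atomic claims that the judge marks as supported by the context."""
    if not claims:
        return 0.0  # assumption: an answer with no checkable claims scores 0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)

def toy_judge(claim: str, context: str) -> bool:
    """Placeholder judge: exact substring match. A real judge would be an LLM."""
    return claim.lower() in context.lower()

# Made-up context and claims, for illustration only.
context = "The library was founded in 1901. It holds 2 million volumes."
claims = [
    "The library was founded in 1901",
    "It holds 2 million volumes",
    "It is the largest in the region",  # not stated in the context
]
print(faithfulness_score(claims, context, toy_judge))
```

Keeping the judge as a pluggable function makes it straightforward to swap in judges from different model families and compare their scores.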

Answer Relevance measures whether the answer addresses the question asked. This catches verbose or tangential answers that might score well on faithfulness because they only make context-supported claims, even though few of those claims actually address what the user wanted. RAGAS computes this by having an LLM generate candidate questions for the answer and measuring cosine similarity between those generated questions and the original query.
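The similarity step can be sketched as a mean cosine similarity between the original query's embedding and the embeddings of the generated questions. This is an illustration under assumptions: the vectors here are toy values, and in practice they would come from whichever embedding model the pipeline uses; the function names are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def answer_relevance(query_vec, generated_question_vecs):
    """Mean cosine similarity between the query and questions generated from the answer."""
    if not generated_question_vecs:
        return 0.0
    total = sum(cosine(query_vec, g) for g in generated_question_vecs)
    return total / len(generated_question_vecs)

# Toy 2-d embeddings: one generated question matches the query, one is unrelated.
print(answer_relevance([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```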

Context Recall measures whether the retrieved context contains the information needed to answer the question, estimated by checking whether the ground-truth answer's claims can be attributed to the retrieved context.

Faithfulness: The fraction of claims in a generated answer that are directly supported by the retrieved context; the primary measure of grounding quality.
Answer Relevance: A measure of whether the generated answer directly addresses the user's query, independent of whether it is factually correct.
LLM-as-Judge: Using a large language model to score or compare outputs from another model, replacing or supplementing human annotation.
LLM-as-Judge: Promise and Systematic Bias

The LLM-as-judge approach gained major traction after the MT-Bench and Chatbot Arena work from LMSYS in 2023, which showed that GPT-4 judgments agreed with human preferences more than 80% of the time on dialogue quality tasks, comparable to the agreement between human raters. This seemed to validate using LLMs as scalable automated evaluators for RAG.

However, subsequent research identified three systematic biases that also affect faithfulness evaluation: position bias (LLM judges rate claims appearing earlier in context as better supported, even when later context is equally or more relevant); verbosity bias (longer answers receive higher faithfulness scores even when the added length introduces unsupported claims); and self-consistency bias, usually called self-enhancement bias in the LLM-as-judge literature (when an LLM judge and the evaluated system share the same base model, the judge rates outputs higher because it is biased toward outputs it would itself produce).

The self-consistency bias is particularly important. The RAGAS finding about GPT-4 judging GPT-4 outputs is a specific instance of a general problem: using the same model family to evaluate itself inflates faithfulness scores relative to human judgment by 8–15 percentage points in several published comparisons.

Practical Mitigation

Use a judge model from a different family than your generation model. If generating with GPT-4, evaluate with Claude or Gemini. If generating with an open-source model, evaluate with a different open-source architecture. This does not eliminate LLM-as-judge biases but substantially reduces the self-consistency inflation effect.
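A sketch of how this mitigation might be enforced in an evaluation harness: scores from judges in the same family as the generator are excluded before averaging. The function name, the family labels, and the scores below are all hypothetical.

```python
def cross_family_score(generator_family, judge_scores):
    """Average faithfulness over judges whose family differs from the generator's.

    judge_scores: mapping of judge name -> (family label, score).
    """
    eligible = [
        score
        for family, score in judge_scores.values()
        if family != generator_family
    ]
    if not eligible:
        raise ValueError("no judge from a different model family is available")
    return sum(eligible) / len(eligible)

# Hypothetical judges; "judge_1" shares a family with the generator and is skipped.
judge_scores = {
    "judge_1": ("family_a", 0.95),
    "judge_2": ("family_b", 0.80),
    "judge_3": ("family_c", 0.70),
}
print(cross_family_score("family_a", judge_scores))
```

Logging the excluded same-family score next to the cross-family average gives a rough view of how large the inflation is for a given pipeline.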

Building a Hybrid Evaluation Stack

Production RAG systems at companies like Cohere, Anthropic, and Databricks use hybrid evaluation stacks that combine: (1) automated LLM-as-judge scoring with multiple judges from different families; (2) human evaluation on a sampled subset of 100–500 queries per evaluation cycle; and (3) behavioral testing, meaning specific adversarial queries designed to trigger known failure modes. No single metric is trusted alone.

Databricks' DBRX documentation described a four-metric combination: LLM faithfulness score, human preference rate on a sample, correctness against a factoid test set with known ground truth, and retrieval nDCG on a separate held-out query set. The rationale was that each metric catches different failure modes, so a system must perform well on all four to be considered production-ready.

Lesson 3 Quiz

Answer Quality Metrics: RAGAS, Faithfulness, and LLM-as-Judge
1. What specific bias was discovered when GPT-4 was used to evaluate GPT-4-generated RAG answers in the RAGAS framework?
Correct. Shared generation biases between the judge model and the evaluated system inflate faithfulness scores by 8–15 percentage points relative to human judgment.
The self-consistency bias specifically causes the same-family LLM judge to rate outputs from its own family higher than human evaluators would: not lower, and not restricted to length.
2. Why do BLEU and ROUGE fail as RAG answer quality metrics?
Correct. In open-domain RAG, there is no single reference answer and factual correctness is not correlated with n-gram overlap against any particular phrasing.
The core failure is conceptual: n-gram overlap against a reference text is not a proxy for factual accuracy in open-domain question answering where many valid phrasings exist.
3. RAGAS "Answer Relevance" specifically catches which failure mode that faithfulness misses?
Correct. A highly faithful answer can still be irrelevant if the generator produces context-grounded text that sidesteps the actual question asked.
Faithfulness checks if claims are context-supported; relevance checks if the answer addresses the question. A verbose, context-grounded but question-avoiding answer would score high on faithfulness but low on relevance.
4. What practical mitigation reduces LLM-as-judge self-consistency bias in RAG evaluation?
Correct. Cross-family evaluation (e.g., Claude judging GPT-4 outputs) substantially reduces the shared-bias inflation effect without eliminating all LLM-judge biases.
Self-consistency bias arises from shared training distributions, so the only structural fix is using a judge from a different model family, not running the same judge multiple times.

Lab 3: Evaluating RAG Answer Quality

Apply faithfulness scoring and LLM-as-judge techniques to real RAG answer examples.

Your Task

You'll evaluate RAG-generated answers for faithfulness, relevance, and grounding quality using the RAGAS framework concepts. Practice decomposing answers into atomic claims and checking each claim against the provided context. The tutor will challenge your faithfulness assessments and help you identify the biases you should control for.

Starter: "Context: 'The Federal Reserve raised interest rates seven times in 2022, reaching a range of 4.25%–4.50% by December.' Answer: 'The Federal Reserve implemented aggressive monetary tightening throughout 2022, raising rates multiple times to reach historically high levels by year-end, effectively combating inflation with unprecedented speed.' Evaluate faithfulness: identify which claims are context-supported and which are not."
Faithfulness Evaluator
Lab 3
Welcome to Lab 3. We're practicing faithfulness evaluation, the most important RAG-specific quality metric. I'll give you answer/context pairs and you'll decompose answers into atomic claims, then judge each claim's support status. Start with the scenario above or bring your own example. Be precise: which exact words are the unsupported claims?
Module 6 · Lesson 4

Building an Evaluation Pipeline: Test Sets, Continuous Monitoring, and Regression Detection

From one-time benchmarks to production-grade evaluation infrastructure that catches regressions before users do.
How do you build an evaluation system that reliably catches when a RAG system has gotten worse?

In August 2024, researchers at PromptArmor published findings that Slack's AI summarization feature, a RAG-adjacent system that retrieved messages and summarized them, could be manipulated to leak information from private channels through prompt injection in retrieved messages. More relevant to evaluation: the failures only surfaced when the message corpus included adversarial inputs, something the internal evaluation pipeline had apparently not tested because the feature had been validated on clean internal message corpora. The regression detection gap was not in the metric chosen but in the test set construction: the evaluation distribution had not included the class of inputs where the system failed. This is the canonical argument for adversarial and out-of-distribution test sets in any RAG evaluation pipeline.

Test Set Construction Principles

A production RAG evaluation test set must cover at minimum four query categories: (1) factoid queries with single correct answers verifiable against ground truth; (2) synthesis queries requiring information from multiple documents; (3) edge queries designed to test known system weaknesses, such as temporal sensitivity, numerical reasoning, or negation; and (4) out-of-distribution queries drawn from domains not well represented in the training or indexing corpus.

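The size of a meaningful test set depends on the size of the change you need to detect. Estimating a metric such as answer accuracy to within 5 percentage points at 95% confidence takes approximately 400 queries: the worst-case margin-of-error formula gives n ≈ 1.96² × 0.25 / 0.05² ≈ 384. Reliably detecting a 5-point regression between two versions can take more than that unless both versions are scored on the same queries and compared pairwise. Many production teams use 200–300 queries, which cannot reliably detect 5-point regressions. This is a known risk trade-off, not a best practice.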
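The figure of roughly 400 queries is consistent with the worst-case margin-of-error formula for a proportion, n = z² · p(1 − p) / m², with p = 0.5 and z = 1.96. The sketch below works that arithmetic in both directions; the function names are illustrative, and the normal approximation is an assumption that weakens for very small test sets or accuracies near 0 or 1.

```python
import math

def queries_for_margin(margin, p=0.5, z=1.96):
    """Test-set size needed for a given margin of error on a proportion."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

def margin_for_queries(n, p=0.5, z=1.96):
    """Margin of error on a proportion measured over n queries."""
    return z * math.sqrt(p * (1 - p) / n)

print(queries_for_margin(0.05))   # size for a +/- 5 point margin
print(margin_for_queries(250))    # margin achieved with a 250-query test set
```

With 250 queries the margin is a little over 6 points, which is why a 5-point regression can slip through a test set of that size.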

Real-World Example: Perplexity AI Evaluation

Perplexity AI has described using continuous evaluation with a combination of human rater samples, automated RAGAS-style faithfulness scoring, and a dedicated regression test set of ~500 questions with verified answers. The test set is versioned alongside model releases, and any deployment that drops more than 2 percentage points on faithfulness or 3 percentage points on answer accuracy against the test set is blocked for human review before release.
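A release gate of the kind described above can be sketched as a comparison of candidate metrics against a baseline with per-metric drop limits. The 2-point and 3-point thresholds come from the example above; the function name, metric keys, and sample scores are assumptions made for illustration.

```python
def release_gate(baseline, candidate, max_drop=None):
    """Return metrics whose drop from baseline exceeds the allowed limit.

    An empty result means the candidate passes; otherwise it should be
    held for human review.
    """
    if max_drop is None:
        max_drop = {"faithfulness": 0.02, "answer_accuracy": 0.03}
    violations = {}
    for metric, limit in max_drop.items():
        drop = baseline[metric] - candidate[metric]
        if drop > limit:
            violations[metric] = round(drop, 4)
    return violations

# Hypothetical scores on the versioned regression test set.
baseline = {"faithfulness": 0.90, "answer_accuracy": 0.85}
candidate = {"faithfulness": 0.87, "answer_accuracy": 0.84}
print(release_gate(baseline, candidate))
```

Here faithfulness fell by about 3 points, so the deployment would be blocked even though answer accuracy stayed within its limit.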

Corpus Versioning and Index Drift

One of the most overlooked evaluation problems in production RAG is index drift: the retrieval corpus changes over time (documents are added, updated, or deleted), causing the system's behavior to change even when no code has been modified. A RAG system that scored 0.85 faithfulness in January may score 0.76 in July not because anything was redeployed but because new documents introduced conflicting information or outdated documents that used to be retrieved first were displaced by newer additions.

Addressing index drift requires: (1) versioned corpus snapshots for reproducible evaluation; (2) a static test corpus subset that never changes, allowing longitudinal comparison; and (3) periodic re-annotation of test queries against the current corpus to flag cases where the ground-truth answer has changed due to corpus updates.
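For the versioned-snapshot requirement, one lightweight option is to record a content fingerprint of the corpus next to every evaluation run, so two scores are only compared when the fingerprints match. This is a sketch under assumptions: the corpus fits in memory as a mapping of document ID to text, and `corpus_fingerprint` is a made-up helper name.

```python
import hashlib

def corpus_fingerprint(docs):
    """Order-independent SHA-256 fingerprint of a {doc_id: text} corpus."""
    digest = hashlib.sha256()
    for doc_id in sorted(docs):
        digest.update(doc_id.encode("utf-8"))
        digest.update(b"\x00")  # separator between the ID and the content hash
        digest.update(hashlib.sha256(docs[doc_id].encode("utf-8")).digest())
    return digest.hexdigest()

# Toy snapshots: the second one has an updated document.
january = {"doc-1": "Plan A costs $10.", "doc-2": "Support hours are 9 to 5."}
july = {"doc-1": "Plan A costs $12.", "doc-2": "Support hours are 9 to 5."}

print(corpus_fingerprint(january) == corpus_fingerprint(july))
```

A changed fingerprint does not say whether quality moved, only that a metric difference may be caused by the corpus rather than by the system.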

Continuous Monitoring Architecture

A minimal production monitoring stack for RAG consists of: a shadow evaluation pipeline that samples 1–5% of live traffic and runs those queries through the evaluation stack asynchronously, keeping latency overhead small; a metric dashboard tracking rolling 24-hour and 7-day averages of faithfulness, answer relevance, and retrieval nDCG; and automated alerting that fires when any metric drops more than a defined threshold (typically 2–3 standard deviations from the rolling baseline).
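The sampling and alerting pieces of this stack can be sketched with the standard library alone. The names `should_sample` and `metric_alert`, the hash-based sampling scheme, and the sample history are assumptions for illustration, not a description of any particular vendor's tooling.

```python
import hashlib
import statistics

def should_sample(query_id, rate=0.02):
    """Deterministically select roughly `rate` of queries for shadow evaluation."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # value in [0, 1)
    return bucket < rate

def metric_alert(history, current, n_sigma=3.0):
    """True when `current` falls more than n_sigma SDs below the rolling mean.

    history needs at least two values for a standard deviation to exist.
    """
    mean = statistics.fmean(history)
    sd = statistics.stdev(history)
    if sd == 0:
        return current < mean
    return (mean - current) / sd > n_sigma

# Hypothetical daily faithfulness averages for the past week.
history = [0.84, 0.85, 0.86, 0.85, 0.85, 0.84, 0.86]
print(metric_alert(history, 0.80))  # a large drop
print(metric_alert(history, 0.84))  # within normal variation
```

Hashing the query ID keeps the sampling decision reproducible, so the same query is either always or never evaluated across replays.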

LangChain, LlamaIndex, and Arize AI all offer commercial tooling for this monitoring pattern. The open-source alternative is RAGAS with a PostgreSQL metrics store and a Grafana dashboard. The specific tools matter less than the discipline of running evaluation continuously rather than only at release time.

Regression Detection vs. A/B Testing

Regression detection catches when a new version of the system performs worse on the test set than the previous version. A/B testing catches when a new version performs differently (better or worse) on live traffic. Both are necessary because they catch different failure classes: regressions on the test set catch known failure mode deterioration; A/B testing on live traffic catches distribution shift failures that the test set did not anticipate.
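For the static regression side, scoring both versions on the same test queries allows a paired comparison. The sketch below uses an exact one-sided sign test on the queries where the two versions disagree; this is one common choice rather than something the lesson prescribes, and the function name and sample data are hypothetical.

```python
from math import comb

def paired_regression_pvalue(old_correct, new_correct):
    """One-sided exact sign test: probability of seeing at least this many
    old-right/new-wrong queries if neither version were actually better."""
    pairs = list(zip(old_correct, new_correct))
    worse = sum(1 for old, new in pairs if old and not new)
    better = sum(1 for old, new in pairs if new and not old)
    discordant = worse + better
    if discordant == 0:
        return 1.0
    # P(X >= worse) for X ~ Binomial(discordant, 0.5)
    tail = sum(comb(discordant, i) for i in range(worse, discordant + 1))
    return tail / 2 ** discordant

# Hypothetical per-query correctness for the old and new versions.
old_correct = [True] * 10
new_correct = [False] * 8 + [True] * 2
print(paired_regression_pvalue(old_correct, new_correct))
```

A small p-value indicates the new version's losses are unlikely to be noise; queries where both versions agree carry no information and are ignored by the test.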

The 2024 Elasticsearch documentation on its vector search quality evaluation framework specifically called out the need for both: static regression testing for "safety against known regressions" and live A/B testing for "discovery of unknown regressions on long-tail queries." This two-layer strategy is now considered standard for any RAG system with significant query volume.

Complete Evaluation Pipeline Checklist

✓ Test set with 400+ queries across factoid, synthesis, edge, and OOD categories
✓ Retrieval metrics: nDCG@5, Recall@10, MRR@10
✓ Answer quality metrics: faithfulness (cross-family LLM judge), answer relevance, context recall
✓ Human evaluation sample at each major release
✓ Versioned corpus snapshots for reproducibility
✓ Continuous 1–5% shadow evaluation on live traffic
✓ Automated alerts on metric drops exceeding 2–3 SD
✓ A/B testing on live traffic for new version deployments

Lesson 4 Quiz

Building an Evaluation Pipeline
1. The Slack AI evaluation failure in 2024 illustrates which specific evaluation pipeline deficiency?
Correct. The evaluation pipeline validated on clean inputs and never tested adversarial message content: a test set construction gap, not a metric gap.
The Slack failure was about test set scope, not metric choice. The evaluation corpus didn't include adversarial inputs, so the evaluation never observed the failure class.
2. Approximately how many test queries are required to detect a 5-percentage-point regression at 95% confidence?
Correct. Standard power analysis for detecting a 5-point regression at 95% confidence requires approximately 400 queries. Many production teams use fewer, accepting the risk of missing small regressions.
Power analysis for 5-point regression detection at 95% confidence requires approximately 400 queries. Fewer queries can miss regressions of this magnitude reliably.
3. What is "index drift" and why does it create evaluation challenges?
Correct. Corpus changes cause retrieval behavior to change silently, making longitudinal metric comparisons unreliable without versioned corpus snapshots and static test subsets.
Index drift specifically refers to the document corpus changing over time, not the model or query distribution. New documents can displace previously top-ranked results, changing system behavior without any code change.
4. Why are both static regression testing AND live A/B testing necessary in a production RAG evaluation strategy?
Correct. Each method catches what the other misses: static tests guard known failure modes, A/B tests discover unknown regressions on unanticipated live query distributions.
Static tests and A/B tests catch complementary failure classes; neither alone is sufficient. Elastic's evaluation framework documentation explicitly recommends both for this reason.
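
The "approximately 400 queries" figure from question 2 can be sanity-checked with a back-of-the-envelope calculation. The sketch below uses the standard margin-of-error formula for a proportion, n = z² · p(1 − p) / E², which is a simplification and an assumption here: the lesson refers to a power analysis, and a full two-sample power calculation would also fix a target power and a baseline rate. The function name `queries_for_margin` is hypothetical.

```python
import math


def queries_for_margin(margin, z=1.96, p=0.5):
    """Queries needed so a measured pass rate has a confidence-interval
    half-width of `margin`.

    z=1.96 corresponds to 95% confidence; p=0.5 is the worst-case
    (maximum-variance) pass rate. Hypothetical helper for illustration.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


print(queries_for_margin(0.05))  # 385 queries for a 5-point margin
print(queries_for_margin(0.10))  # 97 queries for a 10-point margin
```

Under these assumptions, a 5-point margin needs 385 queries, which rounds to the 400 used in the lesson. Halving the test set roughly doubles nothing, but it does widen the margin enough that a 5-point regression falls inside the noise.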

Lab 4 β€” Designing a RAG Evaluation Pipeline

Design a complete evaluation strategy for a real-world RAG deployment scenario.

Your Task

You'll design an end-to-end evaluation pipeline for a specific RAG deployment scenario. The tutor will challenge your metric choices, test set design, monitoring architecture, and regression detection thresholds, pushing you to justify every decision with reference to the failure modes it catches and those it misses.

Starter: "You're building evaluation infrastructure for a customer support RAG system that answers questions about a SaaS product using a corpus of 50,000 support documents updated weekly. The system handles 10,000 queries per day. Design your evaluation pipeline: what test set, what metrics, what monitoring, and what thresholds?"
Eval Pipeline Advisor
Lab 4
Welcome to Lab 4, the capstone lab for Module 6. You're designing a complete RAG evaluation pipeline for a real deployment scenario. I'll challenge every design choice: metric selection, test set coverage, monitoring architecture, alert thresholds, and corpus versioning strategy. Use the scenario above or propose your own. Start with your test set design and justify each category of queries you'd include.

Module 6 Test

Retrieval Quality and Evaluation · 15 questions · 80% to pass
1. Which layer of a RAG system is responsible for "grounding failures" (answers that contradict the retrieved context)?
Correct. Grounding failures occur in generation: the LLM misuses, ignores, or contradicts retrieved context.
Grounding failures are a generation-layer problem: the retrieval layer may have worked correctly while the LLM produced unsupported claims.
2. What does Precision@k measure in information retrieval?
Correct. Precision@k = |relevant ∩ top-k| / k.
Precision@k measures the accuracy of the top-k set: what fraction of retrieved documents are relevant, not what fraction of relevant documents were retrieved (that's recall).
3. MRR@10 of 0.40 could mask which specific retrieval failure?
Correct. MRR rewards only the first relevant document in the top 10 for each query and then averages across queries, so singleton-relevant queries that are missed entirely (contributing 0) can stay hidden in the average when other queries compensate.
MRR's blind spot is singleton-relevant queries. Queries with exactly one answerable document can be totally missed as long as other multi-relevant queries compensate.
4. In the nDCG formula, what is the purpose of the log₂(i+1) denominator?
Correct. The logarithmic discount reflects the assumption that users are less likely to examine documents at lower rank positions.
The log denominator provides position discounting: a relevant document at rank 1 contributes much more to DCG than the same document at rank 10.
5. For a RAG system answering multi-hop questions requiring synthesis across 5 documents, which metric combination is most appropriate?
Correct. Multi-hop synthesis needs all relevant documents (Recall@10) and grounding quality (faithfulness); MRR and lexical metrics are both poorly suited.
Synthesis tasks prioritize recall (all pieces needed must be retrieved) and faithfulness (the synthesis must reflect the retrieved pieces) over precision-focused or lexical metrics.
6. RAGAS "Context Precision" measures which specific quality dimension?
Correct. Context Precision asks: of all the retrieved context given to the LLM, what fraction actually contributed to the answer generated?
Context Precision is about the utility of retrieved context for the generation: it penalizes retrieving lots of irrelevant passages even if they're topically related to the query.
7. The verbosity bias in LLM-as-judge evaluation means:
Correct. LLM judges tend to conflate thoroughness with accuracy, systematically overrating longer answers.
Verbosity bias means the LLM judge is systematically fooled by length: longer answers get higher ratings even when length is achieved by adding unsupported claims.
8. What is the IDCG in the nDCG formula and why is it necessary?
Correct. Dividing by IDCG normalizes DCG so queries with many relevant documents are comparable to queries with few relevant documents.
IDCG is the maximum achievable DCG for a given query: the score obtained if the results were ranked perfectly by relevance. Dividing by it gives nDCG its normalized 0–1 scale.
9. Why does the BEIR benchmark finding matter for RAG system evaluation?
Correct. BEIR showed strong MS MARCO performance masking severe out-of-domain degradation, a direct argument for OOD test sets in any RAG evaluation pipeline.
BEIR's key lesson: high in-domain scores give false confidence. Systems need evaluation on varied query and document distributions, not just the training domain.
10. What is the recommended mitigation for position bias in LLM-as-judge faithfulness evaluation?
Correct. Averaging over multiple context orderings neutralizes the systematic bias toward claims supported by early-appearing context passages.
Position bias causes LLM judges to over-credit claims appearing early in context. Mitigations include multiple-ordering evaluation averaging or using judges specifically tested for position robustness.
11. What sample percentage of live traffic is typically used in shadow evaluation pipelines for production RAG monitoring?
Correct. 1–5% sampling balances monitoring coverage against computational cost while providing statistically meaningful signals.
Standard practice is 1–5% sampling: enough for statistical significance on daily volumes without the cost of evaluating every query.
12. A RAG answer scores 0.95 on faithfulness but 0.40 on answer relevance. What does this indicate?
Correct. High faithfulness with low relevance means the answer is context-grounded but tangential: the generator discussed related content without addressing the specific question.
These scores can absolutely coexist: faithfulness measures grounding (all claims are supported) while relevance measures question-addressing (the answer actually answers what was asked). High faithfulness with low relevance is the verbose-tangential pattern.
13. What is the primary purpose of maintaining a versioned corpus snapshot in a RAG evaluation infrastructure?
Correct. Without versioned snapshots, it is impossible to know whether a metric change between evaluations reflects a code regression or index drift.
Versioned corpus snapshots solve the index drift evaluation problem β€” you can't reliably compare evaluation runs over time if the corpus underneath has changed.
14. The Databricks DBRX evaluation approach used four metrics together. Which combination was described?
Correct. The four-metric combination was designed so each metric catches a different failure mode class that the others miss.
Databricks described: LLM faithfulness + human preference sample + factoid ground-truth correctness + retrieval nDCG, covering grounding, human judgment, factual accuracy, and retrieval quality independently.
15. Which claim about BERTScore's limitation for RAG evaluation is accurate?
Correct. BERTScore improves on n-gram overlap but still requires references and is blind to faithfulness: a hallucinated answer that happens to be semantically similar to the reference will score well.
BERTScore's fundamental limitation for RAG: it requires reference answers and measures semantic similarity to that reference, not faithfulness to retrieved context, so it falls short on the two requirements that matter most for RAG.
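
The retrieval metrics covered in questions 2 through 4 and 8 can be written out directly. The sketch below is illustrative: the function names and document IDs are made up, relevance is represented as a set of document IDs (or a dict of graded gains for nDCG), and DCG uses the linear-gain form rel_i / log₂(i+1), which is an assumption since some formulations use 2^rel − 1 as the gain.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant, k):
    """1 / rank of the first relevant result in the top k, else 0.
    MRR@k is the mean of this value over all test queries."""
    for i, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0


def ndcg_at_k(ranked, gains, k):
    """DCG@k divided by IDCG@k, with linear gain and log2(i+1) discount.
    `gains` maps document ID -> graded relevance (missing means 0)."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# One hypothetical query: three relevant documents, two retrieved in the top 5.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
gains = {d: 1 for d in relevant}

print(precision_at_k(ranked, relevant, 5))   # 2 of 5 retrieved are relevant
print(recall_at_k(ranked, relevant, 5))      # 2 of 3 relevant were retrieved
print(reciprocal_rank(ranked, relevant, 10)) # first relevant hit at rank 2
print(ndcg_at_k(ranked, gains, 5))
```

In this example the reciprocal rank is 0.5 even though one relevant document (d5) was never retrieved, which shows why MRR alone is a poor fit for synthesis tasks that need every relevant passage.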