Module 7 · Lesson 1

End-to-End Pipeline Architecture

From raw document to grounded answer — every stage, every decision

What actually happens between "upload your docs" and "here's your answer"?

When Microsoft launched Bing Chat on February 7, 2023, the public saw a chatbot. Engineers saw a live demonstration of a production RAG pipeline under stress. Within days, users discovered the system would sometimes confabulate details not in its retrieved context — a signal that the generation stage was overriding retrieval signals. The incident forced a public conversation about where, exactly, RAG pipelines can break.

Understanding the full pipeline — not just retrieval or just generation — is how you avoid shipping the same failure in your own systems.

The Six Canonical Stages

Every production RAG system passes data through six distinct stages. Treating any stage as a black box is how bugs become incidents.

Ingestion — Documents are loaded, parsed, and normalized. PDFs, HTML, DOCX, database rows all become plain text.

↓

Chunking — Text is split into retrievable segments. Chunk boundaries determine what context the retriever can surface.

↓

Embedding — Each chunk is converted to a dense vector via an encoder model. This representation drives semantic search.

↓

Indexing — Vectors are stored in a vector database (Pinecone, Weaviate, pgvector, FAISS) alongside chunk metadata.

↓

Retrieval — At query time, the user query is embedded and compared to the index. Top-k chunks are returned.

↓

Generation — Retrieved chunks are injected into the LLM context window as a prompt. The model answers using that grounded context.

Online vs. Offline Pipeline Paths

Stages 1–4 are offline (indexing time). They run once per document update and can be expensive. Stages 5–6 are online (query time). They run per user request and must be fast — typically under 500 ms for production systems.

This separation matters for architecture. You can iterate on your embedding model or chunking strategy independently of your generation prompt, as long as you re-index after offline changes. Many teams mistake slow query latency for a generation problem when the retrieval step is the bottleneck.

Production Reality

Elasticsearch's 2023 developer survey of 1,000+ RAG deployments found that 61% of teams reported their primary quality issues originated in the indexing pipeline (chunking and embedding), not in the LLM generation stage. Most debugging time is spent in the wrong place.

Data Flow and Metadata

Each chunk should carry structured metadata alongside its vector: source document ID, page number, creation date, access control tags, chunk index within the document. This metadata enables filtered retrieval — e.g., "only search documents from Q3 2024" — without embedding those constraints into the semantic vector itself.
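The filtered-retrieval idea above can be sketched in a few lines. This is a minimal illustration, not any vector database's API: the `Chunk` class, `filtered_search`, the two-dimensional vectors, and the document IDs are all hypothetical, and a real store would apply the filter inside the index rather than in a Python list comprehension.

```python
from dataclasses import dataclass, field
from datetime import date
import math


@dataclass
class Chunk:
    text: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)  # doc_id, created, tags, ...


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(chunks, query_vector, predicate, k=3):
    """Apply the structured filter first, then rank the survivors by similarity."""
    candidates = [c for c in chunks if predicate(c.metadata)]
    candidates.sort(key=lambda c: cosine(c.vector, query_vector), reverse=True)
    return candidates[:k]


# Hypothetical toy corpus with made-up vectors.
chunks = [
    Chunk("Q3 budget approved.", [0.9, 0.1], {"doc_id": "memo-17", "created": date(2024, 8, 2)}),
    Chunk("Q1 budget draft.", [0.95, 0.05], {"doc_id": "memo-03", "created": date(2024, 2, 10)}),
    Chunk("Office party plan.", [0.1, 0.9], {"doc_id": "misc-01", "created": date(2024, 9, 1)}),
]


def in_q3_2024(meta: dict) -> bool:
    return date(2024, 7, 1) <= meta["created"] <= date(2024, 9, 30)


results = filtered_search(chunks, [1.0, 0.0], in_q3_2024, k=2)
```

Note that the Q1 draft is the closest vector to the query but never reaches the ranking step, because the date constraint lives in metadata rather than in the embedding.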

LlamaIndex (formerly GPT Index) formalized this pattern in its NodeWithScore abstraction, which pairs each retrieved node (its text plus a metadata dict) with a relevance score. LangChain uses a similar Document object with page_content and metadata fields. Both frameworks converged on this design independently, which is a signal it reflects genuine production need.

Offline path: Stages run at index-build time: ingestion, chunking, embedding, storage. Can be batch-processed.

Online path: Stages run per query: retrieval and generation. Latency-sensitive.

Metadata filtering: Using structured fields (date, source, category) to pre-filter the index before semantic search runs.

Design Principle

Design your pipeline so that each stage can be swapped independently. If you couple your chunker to your embedder to your retriever, every experiment requires a full rewrite. The best production teams treat each stage as a replaceable component with a defined interface.
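One way to read "replaceable component with a defined interface" is as a pair of small protocols that the index builder depends on. The sketch below is an assumption about how such interfaces might look; `Chunker`, `Embedder`, `FixedSizeChunker`, `LengthEmbedder`, and `build_index` are hypothetical names, and the embedder is a toy stand-in for a real encoder model.

```python
from typing import Protocol


class Chunker(Protocol):
    def split(self, text: str) -> list[str]: ...


class Embedder(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...


class FixedSizeChunker:
    """Character-window chunker with optional overlap."""

    def __init__(self, size: int, overlap: int = 0):
        if not 0 <= overlap < size:
            raise ValueError("overlap must be in [0, size)")
        self.size, self.overlap = size, overlap

    def split(self, text: str) -> list[str]:
        step = self.size - self.overlap
        return [text[i:i + self.size] for i in range(0, len(text), step)]


class LengthEmbedder:
    """Toy embedder: one feature per chunk (its length). Not a real encoder."""

    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]


def build_index(text: str, chunker: Chunker, embedder: Embedder):
    """Depends only on the two interfaces, so either stage can be swapped."""
    pieces = chunker.split(text)
    return list(zip(pieces, embedder.embed(pieces)))


index = build_index("abcdefghij", FixedSizeChunker(size=4, overlap=1), LengthEmbedder())
```

Swapping in a sentence-based chunker or a different encoder then means writing one new class; `build_index` and everything downstream stay unchanged.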

Lesson 1 Quiz

End-to-End Pipeline Architecture · 4 questions

Which pipeline stages run at index time rather than query time?

Correct. Stages 1–4 form the offline pipeline. They run when documents are added or updated, not per user query.

Not quite. Retrieval and generation are online (query-time). The offline stages — ingestion through indexing — run when documents are processed.

What does metadata attached to each chunk primarily enable?

Correct. Metadata fields (date, source, category) allow pre-filtering the index before semantic search, without encoding those constraints into the embedding itself.

Metadata doesn't affect embedding computation or context window size. Its primary value is enabling filtered retrieval using structured fields.

According to Elasticsearch's 2023 developer survey, where did most RAG quality issues originate?

Correct. 61% of teams identified the indexing pipeline as the source of quality issues — meaning most debugging is spent on the wrong stage.

Counterintuitively, the survey found chunking and embedding — not generation — were most often to blame. Most teams debug the LLM first, which wastes time.

What does the Bing Chat February 2023 incident illustrate about RAG pipelines?

Correct. The incident showed that a generation model can ignore retrieved context and hallucinate — the pipeline only works when all stages are properly calibrated together.

The Bing Chat incident demonstrated that even with retrieval, generation can confabulate if the generation stage doesn't properly attend to the retrieved context.

Lab 1 — Pipeline Stage Mapping

Diagnose a RAG system by mapping its stages

Your Task

You're consulting on a RAG deployment that returns irrelevant answers. The team says "the LLM is hallucinating." Your job is to systematically walk through the six pipeline stages to identify where the actual problem might be. Use the AI assistant to work through the diagnosis — describe what you'd check at each stage.

Start by describing one stage you'd investigate first, and what specific signals would tell you it's working correctly. Ask the assistant to challenge your reasoning or suggest what to check next.

Pipeline Diagnosis Assistant

RAG Stage Analysis

Ready to help you debug this RAG pipeline. The team says the LLM is hallucinating, but remember that 61% of surveyed teams traced their primary quality issues upstream, to chunking or embedding. Which of the six stages do you want to start with, and what would a healthy signal look like there?

Module 7 · Lesson 2

Query Processing and Retrieval Strategies

The retrieval stage determines what the LLM can possibly know

How do you make sure the right chunks surface for every query type?

Glean, the enterprise search startup that reached a $2.2 billion valuation in 2024, built its entire product around one insight: keyword search fails for enterprise knowledge. When employees ask "what did we decide about the Q3 budget?" a BM25 keyword index returns documents containing those words — not documents about that decision. Glean's retrieval layer combines semantic search with graph-based entity linking to surface the decision memo, not just the budget spreadsheet.

The lesson for RAG designers: retrieval strategy is a product decision, not just an infrastructure one.

Three Retrieval Modes

Production RAG systems use three retrieval modes, often in combination.

Sparse / Keyword

BM25 and TF-IDF. Fast, interpretable, no ML required. Excels on exact-match queries like product codes, proper nouns, and legal terminology. Fails on synonyms and paraphrase.

Dense / Semantic

Embedding similarity (cosine or dot product). Captures meaning over exact terms. Excels on conceptual queries. Can miss rare terminology that was absent from the encoder's training data.

Hybrid

Combines sparse and dense scores via Reciprocal Rank Fusion (RRF) or linear interpolation. Used by Elasticsearch 8.x, Pinecone hybrid search, and Weaviate. Best overall recall across query types.

Reranking

A cross-encoder (e.g., Cohere Rerank, BGE-Reranker) rescores the top-k results from a first-stage retriever. Dramatically improves precision at the cost of added latency.
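The Reciprocal Rank Fusion step named under Hybrid is small enough to sketch directly. This is a minimal version assuming each retriever returns an ordered list of document IDs; the function name and the sample rankings are hypothetical, and `k=60` is the conventional smoothing constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several rankings: each document scores sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical outputs of a sparse and a dense retriever for one query.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# doc_b and doc_a appear in both lists, so they rise above doc_d and doc_c.
```

Because the fusion uses only rank positions, the raw BM25 scores and cosine similarities never need to be normalized onto a common scale.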

Query Transformation Techniques

Raw user queries are often poor retrieval inputs. Several techniques transform them before retrieval runs.

HyDE (Hypothetical Document Embeddings), introduced by Gao et al. in 2022, generates a hypothetical answer to the query using an LLM, then retrieves documents similar to that hypothetical. This sidesteps the query-document embedding mismatch — queries and documents often have different styles even when semantically related.
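The HyDE flow can be sketched with the LLM and the encoder passed in as plain callables. Everything here is a stand-in: `fake_llm` returns a canned passage, `toy_embed` is a four-word bag-of-words counter rather than a real embedding model, and `hyde_retrieve` is a hypothetical name, not a library function.

```python
import math

VOCAB = ["refund", "days", "shipping", "password"]


def toy_embed(text: str) -> list[float]:
    """Toy encoder: counts of a few vocabulary words."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fake_llm(prompt: str) -> str:
    """Stand-in for an LLM call that drafts a hypothetical answer."""
    return "refund requests are accepted within 30 days"


def hyde_retrieve(query, generate, embed, index, k=3):
    # 1. Draft a hypothetical answer. 2. Embed it. 3. Search with that vector.
    hypothetical = generate(f"Write a short passage that answers: {query}")
    probe = embed(hypothetical)
    ranked = sorted(index, key=lambda item: cosine(item[1], probe), reverse=True)
    return [text for text, _ in ranked[:k]]


docs = ["refund window is 30 days", "shipping takes 5 days", "reset your password here"]
index = [(d, toy_embed(d)) for d in docs]
top = hyde_retrieve("can I get my money back?", fake_llm, toy_embed, index, k=1)
```

In this toy setup the raw query shares no vocabulary with any document, so embedding it directly gives a zero vector; the hypothetical answer is what carries the document-style wording.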

Query decomposition breaks complex multi-part questions into sub-queries, retrieves for each independently, then merges results. LangChain's MultiQueryRetriever applies a closely related idea: it generates several rephrasings of the query (three with its default prompt) and returns the unique union of their results.

Step-back prompting (Google DeepMind, 2023) reformulates the query at a higher abstraction level. "What are the side effects of ibuprofen at 800mg?" becomes "What are NSAID pharmacokinetics?" — retrieving the foundational context before the specific one.

Benchmark Evidence

The BEIR benchmark (Thakur et al., 2021) evaluated 10 retrieval systems across 18 datasets in a zero-shot setting. No single method dominated across all domains. Dense retrieval outperformed BM25 on some semantic tasks; BM25 remained a robust baseline and outperformed dense retrievers on exact-match-heavy domains like TREC-COVID. Reranking and late-interaction models achieved the best average nDCG@10, at higher computational cost. This is why modern production systems commonly pair hybrid first-stage retrieval with a reranker instead of relying on one method.

Top-k Selection and Context Budget

Retrieving k=3 is common in demos. Production systems typically retrieve k=20 at the first stage, then rerank to k=5 for the generation context window. The tradeoff: more retrieved chunks increase recall but inflate the prompt, increasing cost, latency, and the risk of lost in the middle — the documented tendency of LLMs to underweight context appearing in the middle of long prompts (Liu et al., 2023).
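The retrieve-wide-then-rerank pattern can be sketched as a two-stage function. This is an illustration under stated assumptions: `toy_first_stage` and `toy_rerank_score` are hypothetical stand-ins for a vector search and a cross-encoder, and the word-overlap score is only there to make the example deterministic.

```python
def retrieve_then_rerank(query, first_stage, rerank_score, first_k=20, final_k=5):
    """Cast a wide net cheaply, then keep only the best few for the prompt."""
    candidates = first_stage(query, first_k)
    rescored = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return rescored[:final_k]


def toy_first_stage(query: str, k: int) -> list[str]:
    """Stand-in for a fast first-stage retriever (ignores the query here)."""
    corpus = [
        "reset password steps",
        "password policy",
        "billing faq",
        "reset router",
        "holiday schedule",
    ]
    return corpus[:k]


def toy_rerank_score(query: str, chunk: str) -> int:
    """Stand-in for a cross-encoder: number of words shared with the query."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


top = retrieve_then_rerank(
    "how to reset password", toy_first_stage, toy_rerank_score, first_k=4, final_k=2
)
```

Only `final_k` chunks reach the generation prompt, so the context budget is set by the second stage while recall is set by the first.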

BM25: Best Match 25. A probabilistic keyword retrieval function. The default for Elasticsearch and OpenSearch.

RRF: Reciprocal Rank Fusion. Merges ranked lists by summing reciprocals of rank positions. Needs no weight tuning (its one constant is conventionally fixed at 60) and is robust.

HyDE: Hypothetical Document Embeddings. Generate a fake answer, embed it, retrieve real documents similar to that embedding.

Lost-in-the-middle: LLMs systematically underweight context in the middle of long prompts. Put the most critical context first or last.

Lesson 2 Quiz

Query Processing and Retrieval Strategies · 4 questions

What does HyDE (Hypothetical Document Embeddings) do to improve retrieval?

Correct. HyDE sidesteps the query-document embedding mismatch by generating a fake answer that "looks like" a document, then embedding that instead of the raw query.

HyDE generates a hypothetical answer to the query using an LLM, embeds that hypothetical, and retrieves real documents similar to it — bridging the query-document style gap.

What do the BEIR benchmark's results suggest about choosing a retrieval strategy?

Correct. BEIR showed dense beats BM25 on some semantic tasks and BM25 beats dense on exact-match tasks, so no single method wins across all 18 datasets. That is the case for combining them.

BEIR's finding was that no method dominates across all domains; each has strengths. Combining lexical and dense signals covers both, which is why most production systems use hybrid retrieval.

What is the "lost-in-the-middle" problem identified by Liu et al. (2023)?

Correct. Liu et al. showed that even when the answer is in the context, if it's positioned in the middle of a long prompt, LLMs are less likely to use it. Critical context should go first or last.

The lost-in-the-middle problem is about LLM attention patterns: models underweight context in the middle of long prompts, even when it contains the correct answer.

What is Reciprocal Rank Fusion (RRF) used for?

Correct. RRF combines ranked lists by summing 1/(k + rank) for each document across all lists, with the constant k conventionally set to 60. It needs no score normalization or weight tuning, which makes it a robust fusion method; Elasticsearch 8.x supports it for hybrid search.

RRF is a rank fusion algorithm. It takes the ranked output of sparse (BM25) and dense retrievers and merges them into a single list by summing reciprocals of rank positions.

Lab 2 — Retrieval Strategy Selection

Choose and justify retrieval strategies for real query types

Your Task

You're building a RAG system for a legal research firm. Attorneys search for case law using three types of queries: (1) exact citation lookups like "42 U.S.C. § 1983", (2) conceptual queries like "cases about police immunity in excessive force claims", and (3) complex multi-part questions like "what did the Ninth Circuit decide about qualified immunity between 2018 and 2022 and how does it differ from the Fifth Circuit?"

Propose a retrieval strategy for each query type and explain why. The assistant will challenge your choices and ask you to consider tradeoffs. Aim for at least 3 exchanges to complete the lab.

Retrieval Strategy Advisor

Legal RAG System

You're designing retrieval for a legal research RAG system. Attorneys will search with exact citations, conceptual queries, and complex multi-part questions. Each query type has different characteristics that favor different retrieval strategies. Start with one query type and propose your approach — I'll push back on your reasoning and ask about tradeoffs.

Module 7 · Lesson 3

Prompt Engineering for RAG Generation

How you structure the context determines what the model can do with it

What separates a RAG prompt that grounds the model from one that doesn't?

Perplexity AI's approach to RAG prompting became one of the most studied in the industry after the company published blog posts describing their answer engine architecture. Their key insight was that the prompt template, not the retrieval alone, determined whether users trusted the output. By structuring prompts to always cite retrieved sources inline — forcing the model to ground every claim — they reduced user-perceived hallucination rates dramatically compared to systems that simply prepended context without citation instructions.

The Anatomy of a RAG Prompt

A production RAG prompt has four zones, and their order matters due to the lost-in-the-middle effect:

System instruction — Role definition and behavior constraints. "You are a precise research assistant. Answer only from the provided context. If unsure, say so."

↓

Retrieved context block — Numbered or labeled chunks. "Context [1]: ... Context [2]: ..." Labels enable inline citation.

↓

Citation and grounding instruction — "Cite context numbers inline as [1], [2]. Do not use information not present in the context."

↓

User query — Placed last, closest to the generation point. Research by Anthropic engineers suggests positioning the query after context improves grounding.
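The four zones above can be assembled by a small template function. This is a minimal sketch: `build_rag_prompt` and the `Question:` prefix are hypothetical choices, while the instruction strings are taken from the zone descriptions.

```python
SYSTEM = (
    "You are a precise research assistant. "
    "Answer only from the provided context. If unsure, say so."
)
CITE = (
    "Cite context numbers inline as [1], [2]. "
    "Do not use information not present in the context."
)


def build_rag_prompt(system_instruction, chunks, citation_instruction, query):
    """Assemble the zones in order: system, context, citation rule, query."""
    context = "\n".join(
        f"Context [{i}]: {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return "\n\n".join(
        [system_instruction, context, citation_instruction, f"Question: {query}"]
    )


prompt = build_rag_prompt(
    SYSTEM,
    ["Refunds take 5 days.", "Fees are waived for annual plans."],
    CITE,
    "How long do refunds take?",
)
print(prompt)
```

Numbering the chunks in the template is what makes the later `[1]`, `[2]` citations checkable against the context that was actually sent.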

Grounding Instructions That Work

The most effective grounding instructions are specific about failure modes, not just general about accuracy. Compare these two system prompts:

Weak: "Be accurate and helpful."

Effective: "Answer using only the provided context. If the context does not contain enough information to answer the question, respond: 'The provided documents do not contain enough information to answer this question.' Do not infer, extrapolate, or use knowledge from your training data."

The effective version names the exact failure mode (using training knowledge instead of context) and provides a specific fallback response. This reduces hallucination by giving the model an explicit "safe exit" that avoids the pressure to generate something plausible.

Research Finding — Anthropic, 2023

In Anthropic's Constitutional AI follow-up work on Claude's RAG behavior, researchers found that prompts explicitly naming the fallback ("if unsure, say I don't have enough information") reduced confabulation by approximately 40% compared to prompts that only said "be accurate." The model needs a graceful exit, not just an instruction to succeed.

Context Structuring Strategies

Labeled chunks use identifiers like [Doc 1], [Source: policy_v3.pdf], or [Chunk 4 of 12] to help the model track which context element it's drawing from. Numbered labels enable inline citation in the output.

Relevance ordering — given the lost-in-the-middle effect, place the highest-relevance chunk first, second-highest last, and lower-relevance chunks in the middle. This is counterintuitive but supported by Liu et al.'s experimental data across multiple LLMs.
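One simple way to implement this ordering is to alternate chunks between the front and the back of the list. The sketch below assumes the input is already sorted from most to least relevant; `order_for_long_context` is a hypothetical helper name.

```python
def order_for_long_context(ranked_chunks):
    """Input: chunks sorted most-relevant first.

    Output: best chunk first, second-best last, weakest chunks in the middle.
    """
    front, back = [], []
    for position, chunk in enumerate(ranked_chunks):
        (front if position % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ordered = order_for_long_context(["c1", "c2", "c3", "c4", "c5"])
# ordered is ["c1", "c3", "c5", "c4", "c2"]
```

If citation labels are assigned after reordering, the label numbers no longer reflect relevance rank, which is usually fine because the model only needs them to be stable within one prompt.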

Context compression uses a smaller LLM to summarize or filter retrieved chunks before injection. LangChain's ContextualCompressionRetriever does this automatically. It reduces prompt tokens but adds latency and an additional failure point.

The Citation Requirement Pattern

Requiring inline citations does more than improve transparency — it functions as a reasoning constraint. When the model must cite [2] for every claim, it's forced to locate a specific source for that claim before outputting it. Claims without a citable source are harder to generate. This is why Perplexity, Bing Chat, and Google's SGE all require citation in their answer templates.

Grounding instruction: Explicit prompt text that constrains the model to use only retrieved context and provides a fallback for insufficient information.

Context compression: Using a model to extract or summarize relevant portions of retrieved chunks before injection, reducing prompt length.

Citation requirement: Instructing the model to tag each claim with its source context number, functioning as a reasoning constraint against hallucination.

Practical Rule

Never give the model a context block without also giving it explicit instructions for what to do when the context is insufficient. The model will fill the gap with training data unless you provide a specific alternative — and that alternative must be stated as a concrete response, not just a vague "be honest" instruction.

Lesson 3 Quiz

Prompt Engineering for RAG Generation · 4 questions

Why does requiring inline citations function as a reasoning constraint against hallucination?

Correct. When a citation is required per claim, the model must identify a source context before outputting that claim. Claims without a citable context become structurally harder to generate — it's a built-in constraint.

Citations constrain reasoning, not token count. The model must locate and tag a source for each claim, which makes it harder to generate unsourced (hallucinated) statements.

Based on Anthropic's research, what reduces confabulation by approximately 40% compared to vague accuracy instructions?

Correct. Providing a specific fallback ("say I don't have enough information") gave the model a safe exit, reducing the pressure to generate something plausible from training data.

Anthropic's finding was about prompt design: explicitly naming the fallback response ("if unsure, say...") reduced confabulation by about 40% compared with just saying "be accurate."

Given the lost-in-the-middle effect, how should you order multiple retrieved chunks in the prompt?

Correct. Liu et al. showed LLMs use content best when it sits at the beginning or end of a long context. This "U-shaped" performance curve means the most important chunks should be placed at the extremes.

The lost-in-the-middle effect means LLMs best attend to context at the start and end of a long prompt. Place highest-relevance first, second-highest last, and lower-relevance in the middle.

What is context compression in a RAG pipeline?

Correct. Context compression uses a smaller LLM to filter or summarize retrieved chunks, reducing the tokens injected into the main model. LangChain's ContextualCompressionRetriever implements this automatically.

Context compression is about the prompt, not vectors. A smaller model processes retrieved chunks to extract only the relevant portions, reducing the main model's prompt length — at the cost of added latency.

Lab 3 — RAG Prompt Construction

Build and critique grounding prompts for a medical information system

Your Task

You're building a RAG system for a hospital that helps clinicians look up drug interaction information. The stakes are high: a hallucinated drug interaction could harm patients. You need to construct a prompt template that maximally grounds the model to retrieved context and handles the "I don't know" case explicitly.

Draft a system prompt (or the key sections of one) for this medical RAG system. Include: a role definition, grounding constraint, citation requirement, and fallback instruction. Share it with the assistant for critique and iteration. Aim for at least 3 exchanges.

Prompt Critique Assistant

Medical RAG Grounding

You're constructing a RAG prompt for a clinical drug interaction system — high stakes, zero tolerance for hallucination. I'll critique your prompt design and push you to strengthen the grounding constraints. Start by drafting your system prompt or at least its key sections. What instructions will you give the model?

Module 7 · Lesson 4

Evaluation, Monitoring, and Iteration

You cannot improve what you cannot measure — and RAG is notoriously hard to measure

How do you know if your RAG pipeline is actually working, and how do you make it better?

When Notion launched Notion AI in February 2023, the team quickly discovered that internal user satisfaction surveys were a lagging indicator — they revealed problems weeks after they appeared. The engineering team, led by ML engineers who had previously worked on recommendation systems, implemented a real-time faithfulness scoring pipeline that evaluated every generated response against its retrieved context using a smaller judge model. Within three months, they reported a 23% reduction in reported hallucinations, not from changing the LLM, but from using the evaluations to identify and fix specific chunking failures in their workspace document indexer.

The RAG Evaluation Framework

The dominant academic framework for RAG evaluation, codified by Es et al. in the RAGAS paper (2023), defines four core metrics that map directly to failure modes in the pipeline.

Faithfulness

Are all claims in the answer supported by the retrieved context? Low faithfulness = the generation stage is hallucinating. Measured by decomposing the answer into claims and verifying each against context.

Answer Relevance

Does the answer address what the user actually asked? Low score = the answer is on-topic but doesn't answer the specific question. Often caused by vague retrieval or insufficient context.

Context Precision

What fraction of retrieved chunks are actually relevant to the query? Low precision = the retriever is returning noise that confuses the generator.

Context Recall

Does the retrieved context contain all information needed to answer the query? Low recall = the retriever is missing key chunks. Answers will be incomplete or incorrect even if faithful.
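To make the four definitions concrete, here is a simplified, label-based version of three of them. This is not the RAGAS library's implementation, which derives these judgments with an LLM and differs in detail; the function names, chunk IDs, and hand-labeled inputs are assumptions for illustration.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are relevant to the query."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Fraction of the needed chunks that the retriever actually returned."""
    if not relevant_ids:
        return 1.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)


def faithfulness(claim_supported):
    """Fraction of answer claims that a judge marked as supported by context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Hypothetical labels for one query.
retrieved = ["c1", "c2", "c3", "c4"]
relevant = {"c1", "c4", "c9"}

precision = context_precision(retrieved, relevant)   # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)         # 2 of 3 needed chunks were found
faith = faithfulness([True, True, False, True])      # 3 of 4 claims supported
```

Here precision and recall disagree, which is the useful part: the retriever is both returning noise (c2, c3) and missing a needed chunk (c9).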

LLM-as-Judge Evaluation

Human evaluation of RAG outputs is expensive and slow. The emerging standard is LLM-as-judge: using a separate LLM (often GPT-4 or Claude) to evaluate faithfulness, relevance, and context quality at scale. Zheng et al. (2023) at UC Berkeley showed in the MT-Bench paper that GPT-4 as judge agreed with human expert evaluators 80–85% of the time — comparable to inter-human agreement rates.

Tools like RAGAS (open source, 2023), TruLens (TruEra), and DeepEval automate LLM-as-judge evaluation pipelines. They generate evaluation datasets from your documents, run queries through your pipeline, score each response, and surface the lowest-scoring query types for debugging.

Industry Pattern

Cohere's RAG deployment team documented in their 2023 engineering blog that teams who implemented automated evaluation pipelines from week one shipped improvements 3x faster than teams that relied on manual review. The evaluation pipeline is infrastructure, not an afterthought — build it alongside the retrieval pipeline, not after launch.

What to Monitor in Production

Retrieval latency — p50, p95, p99 latency for the retrieval stage separately from generation. Spikes in retrieval latency usually indicate index fragmentation or vector database replication lag.

Context utilization rate — What fraction of retrieved chunks does the model actually cite? If the model consistently ignores 4 of 5 retrieved chunks, your top-k is too high and you're wasting tokens.

Unanswerable rate — What fraction of queries result in "I don't have enough information"? A sudden increase often indicates an indexing failure or a new query domain your corpus doesn't cover.

Query distribution drift — Are users asking different types of questions over time? If your corpus covers product documentation but users start asking about pricing (not indexed), precision will collapse. Monitor embedding-space clustering of incoming queries.
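Two of these signals can be computed from logged answers alone. The sketch below assumes answers cite context as `[1]`, `[2]` and that the fallback is an exact, fixed sentence; the function names and sample answers are hypothetical.

```python
import re

FALLBACK = (
    "The provided documents do not contain enough information "
    "to answer this question."
)


def context_utilization(answer: str, num_chunks: int) -> float:
    """Fraction of the retrieved chunks that the answer cites at least once."""
    if num_chunks == 0:
        return 0.0
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    used = {n for n in cited if 1 <= n <= num_chunks}
    return len(used) / num_chunks


def unanswerable_rate(answers) -> float:
    """Fraction of answers that are exactly the fallback sentence."""
    if not answers:
        return 0.0
    return sum(a.strip() == FALLBACK for a in answers) / len(answers)


utilization = context_utilization(
    "Refunds take 5 days [1]. Fees are waived [1][3].", num_chunks=5
)
rate = unanswerable_rate([FALLBACK, "Refunds take 5 days [1].", "See [2].", FALLBACK])
```

Matching the fallback by exact string only works if the prompt fixes its wording, which is one more reason to specify a concrete fallback sentence in the grounding instruction.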

The Improvement Loop

Each RAGAS metric points to a different stage for improvement. Low faithfulness → fix generation (stronger grounding instructions, better prompt). Low context precision → fix retrieval (reranker, query transformation). Low context recall → fix chunking or indexing (chunk size, overlap, coverage gaps). This mapping turns evaluation from a grading exercise into a debugging tool.
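The metric-to-stage mapping above can be written down as a lookup that turns a score report into a prioritized to-do list. The `triage` function, the metric key names, and the 0.7 threshold are assumptions for illustration; the remediation text comes from the mapping in the paragraph.

```python
NEXT_STEP = {
    "faithfulness": "generation: stronger grounding instructions, better prompt",
    "context_precision": "retrieval: add a reranker or query transformation",
    "context_recall": "chunking/indexing: revisit chunk size, overlap, coverage gaps",
}


def triage(scores: dict, threshold: float = 0.7) -> list[str]:
    """Return the stages to work on, lowest-scoring metric first."""
    low = [
        (name, score)
        for name, score in scores.items()
        if name in NEXT_STEP and score < threshold
    ]
    low.sort(key=lambda pair: pair[1])
    return [f"{name}={score:.2f} -> {NEXT_STEP[name]}" for name, score in low]


# Hypothetical scores from one evaluation run.
plan = triage({"faithfulness": 0.91, "context_precision": 0.55, "context_recall": 0.62})
for line in plan:
    print(line)
```

With these scores the plan starts at retrieval, not at the prompt, even though "the LLM is hallucinating" might have been the original complaint.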

RAGAS: RAG Assessment framework. Open-source evaluation suite measuring faithfulness, answer relevance, context precision, and context recall.

LLM-as-judge: Using a separate LLM to evaluate the quality of RAG outputs at scale. In the MT-Bench study, GPT-4 as judge reached 80–85% agreement with human experts.

Context utilization rate: The fraction of retrieved chunks the model actually references in its answer. Low rates indicate over-retrieval.

Unanswerable rate: The fraction of queries where the model returns its "I don't have enough information" fallback. Sudden changes signal indexing failures or corpus gaps.

Closing Principle

The RAGAS metric that is lowest is where you spend your engineering time. The pipeline is only as strong as its weakest measured stage — and the unmeasured stage is always the weakest one, because you have no signal to improve it.

Lesson 4 Quiz

Evaluation, Monitoring, and Iteration · 4 questions

In the RAGAS framework, low context precision indicates a problem in which pipeline stage?

Correct. Context precision measures what fraction of retrieved chunks are actually relevant. Low precision means the retriever is pulling in noise — fix with reranking or better query transformation, not by changing the LLM.

Context precision is about retrieval quality, not generation. Low precision means irrelevant chunks are being retrieved and injected into the prompt, potentially misleading the generator.

What did the MT-Bench study (Zheng et al., 2023) establish about LLM-as-judge evaluation?

Correct. The MT-Bench paper showed GPT-4 as judge agreed with human experts at rates similar to how much human experts agree with each other — validating LLM-as-judge as a scalable alternative to manual review.

MT-Bench showed LLM-as-judge is actually quite reliable — GPT-4 agreed with human experts 80–85% of the time, on par with inter-human agreement rates, making it a viable scalable alternative.

A sudden increase in the unanswerable rate typically signals what?

Correct. Sudden spikes in "I don't know" responses typically mean either the corpus stopped updating (indexing failure) or users started asking about a domain not covered in the index — both require investigation outside the generation stage.

A sudden rise in unanswerable rate usually means the corpus doesn't cover what users are asking — either due to an indexing failure (documents not being processed) or a new query domain entering the system.

What does a low RAGAS faithfulness score specifically point to?

Correct. Faithfulness measures whether every claim in the answer is supported by retrieved context. Low faithfulness = the model is drawing on training knowledge instead of context — a generation-stage failure to be addressed through better grounding instructions.

Faithfulness is specifically about the generation stage: are all answer claims grounded in the retrieved context? Low faithfulness = hallucination at generation time. Fix with stronger grounding instructions and citation requirements.

Lab 4 — RAG Evaluation Design

Build an evaluation plan for a production RAG system

Your Task

You've just launched a RAG-powered customer support system for a SaaS product. After two weeks in production, users are complaining that answers "feel off" but the CSAT scores haven't moved yet — the Notion AI problem. You need to design a systematic evaluation and monitoring plan using RAGAS metrics and LLM-as-judge techniques to identify the root cause before it affects CSAT.

Propose which RAGAS metrics to prioritize first and why, then describe what you'd do if each metric comes back low. The assistant will ask clarifying questions and challenge your prioritization. Aim for at least 3 exchanges to complete the lab.

Evaluation Design Advisor

RAG Monitoring Strategy

Users say answers "feel off" but you don't have data yet to pinpoint why. The four RAGAS metrics each point to different pipeline stages. Where do you start? Tell me which metric you'd measure first and why — I'll push back on your reasoning and ask what you'd do if it came back low versus high.

Module 7 Test

RAG Pipeline Design · 15 questions · Pass at 80%

1. Which two pipeline stages are considered "online" (query-time) in a RAG system?

Correct. Retrieval and generation run per user query — they are latency-sensitive and must complete in milliseconds.

Retrieval and generation are the online stages. Ingestion, chunking, embedding, and indexing are offline — they run when documents are processed, not per query.

2. What is the primary purpose of attaching metadata to each document chunk?

Correct. Metadata fields like date, source, and category allow pre-filtering the index — e.g., "only search 2024 documents" — without encoding that constraint into the semantic vector.

Metadata enables filtered retrieval. Fields like date, source, or access-control tags let you restrict the search space before semantic comparison — they don't affect embedding quality.

3. What does BM25 stand for and what retrieval mode does it represent?

Correct. BM25 (Best Match 25) is a probabilistic keyword ranking function — the sparse retrieval standard used by default in Elasticsearch and OpenSearch.

BM25 stands for Best Match 25. It's a sparse/keyword retrieval function — probabilistic, fast, and excellent for exact-match queries like product codes and proper nouns.

4. What finding from the BEIR benchmark justifies using hybrid retrieval in production?

Correct. BEIR evaluated 10 systems across 18 diverse datasets. Dense retrieval won on some semantic tasks and BM25 on exact-match tasks, so no single method dominated across all domains. Combining them covers both kinds of query.

BEIR showed domain-specific strengths for each method. Neither dense nor sparse universally dominates across the 18 test datasets, which is why hybrid retrieval (combining both) is the safer production default.

5. HyDE was introduced to solve which specific retrieval problem?

Correct. HyDE bridges the gap by generating a hypothetical document-style answer to the query, then embedding that instead of the raw query, which yields an embedding that better matches actual document embeddings.

HyDE targets the embedding mismatch: a user query is stylistically different from a document even when they're about the same topic. Embedding a hypothetical answer produces an embedding that better resembles real documents.

6. In a RAG prompt, where should the user query be positioned relative to the retrieved context?

Correct. Placing the user query after the context — closest to where generation begins — improves grounding. Anthropic's engineering team found this ordering improves context utilization.

The query should be placed last, after the context block and citation instructions. This positions it closest to the generation point, which research suggests improves the model's use of the preceding context.

7. What does a "graceful exit" instruction do in a RAG system prompt?

Correct. Providing a concrete fallback ("say I don't have enough information") gives the model a safe path that doesn't require generating something plausible from training data — reducing hallucination by ~40% per Anthropic's research.

A graceful exit instruction is a specific fallback response (e.g., "If the context doesn't contain enough information, respond: 'I don't have enough information'"). It removes the pressure to confabulate when context is insufficient.
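
A prompt with both a fallback instruction and the query in last position might be assembled like this. The wording of the instructions and the `build_rag_prompt` helper are illustrative assumptions, not a template from any vendor.

```python
FALLBACK = "I don't have enough information"

def build_rag_prompt(chunks, query):
    """Assemble instructions, numbered context, then the user query last."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the context below. Cite sources as [n].\n"
        # Graceful exit: a concrete, safe response for insufficient context.
        f"If the context doesn't contain enough information, respond: '{FALLBACK}'\n\n"
        f"Context:\n{context}\n\n"
        # The query comes last, closest to where generation begins.
        f"Question: {query}"
    )

prompt = build_rag_prompt(
    ["Refunds are issued within 14 days."],
    "How long do refunds take?",
)
```

Keeping the fallback string in a constant also makes it easy to detect in responses and count how often the model takes the exit.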

8. Reciprocal Rank Fusion (RRF) is described as "parameter-free." What does this mean in practice?

Correct. Unlike linear interpolation (α × dense_score + (1-α) × sparse_score), RRF requires no weight tuning. It computes 1/(rank+60) for each document across each list and sums — robust across domains without tuning.

Parameter-free means RRF doesn't require you to tune a weight hyperparameter (like α in weighted sum fusion). It uses a fixed formula: for each document, sum 1/(rank+60) over every ranked list it appears in. The 60 is a conventional constant that is rarely tuned, so the method works without domain-specific calibration.
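
The formula is short enough to sketch directly. The document IDs and the two input rankings below are made up for illustration; ranks start at 1.

```python
from collections import defaultdict

def rrf(ranked_lists, k=60):
    """Fuse several rankings by summing 1/(rank + k) per document."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank + k)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d1", "d2", "d3"]   # hypothetical dense retriever output
sparse = ["d3", "d1", "d4"]  # hypothetical BM25 output
fused = rrf([dense, sparse])
```

Documents that appear in both lists ("d1" and "d3") rise above those found by only one retriever. Only ranks are used, so the incompatible score scales of the two retrievers never need to be reconciled.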

9. What is the "lost-in-the-middle" effect, and who documented it?

Correct. Liu et al. (2023) showed across multiple LLMs that performance follows a U-shaped curve: models use information best when it sits at the beginning or end of the context, and worst when it sits in the middle. This has direct implications for how you order retrieved chunks in prompts.

Liu et al. (2023) documented that LLM performance over long contexts is U-shaped: strongest when the relevant information is at the start or end, weakest when it is in the middle. Critical retrieved chunks should be placed at the extremes, not sandwiched in the center.
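
One simple way to act on this is to interleave chunks so the highest-ranked ones land at the two ends of the context. The sketch below assumes the input is already sorted from most to least relevant. The `reorder_for_extremes` name is made up for this example.

```python
def reorder_for_extremes(chunks_by_relevance):
    """Place the most relevant chunks at the start and end, weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: even positions fill the front, odd positions fill the back.
        (front if i % 2 == 0 else back).append(chunk)
    # Reverse the back half so the second-best chunk ends up last.
    return front + back[::-1]

ordered = reorder_for_extremes(["r1", "r2", "r3", "r4", "r5"])
```

Here "r1" opens the context, "r2" closes it, and the least relevant chunk "r5" sits in the center.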

10. What does RAGAS stand for and what does it measure?

Correct. RAGAS (RAG Assessment) provides four metrics that map to different pipeline stages: faithfulness (generation), answer relevance (generation), context precision (retrieval), and context recall (retrieval + indexing).

RAGAS stands for RAG Assessment. It measures four metrics: faithfulness, answer relevance, context precision, and context recall — each pointing to a different stage of the RAG pipeline when they come back low.

11. What does low RAGAS context recall indicate about the pipeline?

Correct. Context recall measures whether all information needed to answer the query is present in the retrieved set. Low recall means key information exists in the corpus but wasn't retrieved — fix with better retrieval or improved chunking/indexing.

Context recall is about coverage: does the retrieved context contain what's needed? Low recall means relevant information exists in your corpus but wasn't retrieved — a retrieval or indexing problem, not a generation problem.
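
The idea behind the metric can be approximated in a few lines. This is a deliberately crude sketch: it checks whether each reference claim appears verbatim in the retrieved text, whereas RAGAS itself relies on an LLM to judge attribution. The claims and chunk are invented.

```python
def context_recall(reference_claims, retrieved_chunks):
    """Fraction of reference claims found in the retrieved context."""
    joined = " ".join(retrieved_chunks).lower()
    # Assumption: substring match stands in for real attribution checking.
    supported = [c for c in reference_claims if c.lower() in joined]
    return len(supported) / len(reference_claims) if reference_claims else 0.0

claims = ["refunds take 14 days", "refunds require a receipt"]
retrieved = ["Our policy: refunds take 14 days from approval."]
recall = context_recall(claims, retrieved)
```

A score of 0.5 here says that half of what the answer needs never reached the prompt, so no amount of prompt tuning at the generation stage can recover it.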

12. What did Elasticsearch's 2023 developer survey find about where RAG quality issues originate?

Correct. 61% of reported RAG quality issues were traced to chunking and embedding — not the LLM. This means most debugging effort spent on the generation stage is misplaced.

Elasticsearch found 61% of quality issues came from the indexing pipeline (chunking and embedding). Despite this, most teams debug the LLM first — which is why having RAGAS metrics matters for directing effort correctly.

13. In the context of LLM-as-judge evaluation, what did the MT-Bench paper establish?

Correct. Zheng et al. at UC Berkeley showed GPT-4 as judge agreed with human experts at rates similar to how much humans agree with each other — validating LLM-as-judge as a scalable, reliable evaluation method.

MT-Bench (Zheng et al., UC Berkeley, 2023) established that GPT-4 as judge reaches over 80% agreement with human experts, on par with inter-human agreement, making LLM-as-judge a scalable and valid alternative to manual review.

14. What does a sudden drop in context utilization rate (the fraction of retrieved chunks the model cites) most likely indicate?

Correct. If the model consistently ignores 4 of 5 retrieved chunks, the top-k is too high — you're wasting tokens on context the model doesn't use. Reduce k or improve retrieval precision.

Low context utilization typically means over-retrieval: the system fetches more chunks than the model actually needs, wasting prompt tokens. The fix is reducing top-k or improving retrieval precision so fewer, more relevant chunks are returned.
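
If answers cite chunks with bracketed numbers such as `[2]`, the utilization rate can be computed with a small helper. The citation format and the `context_utilization` function are assumptions for this sketch.

```python
import re

def context_utilization(answer, num_chunks):
    """Fraction of retrieved chunks that the answer actually cites."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    # Ignore citation numbers that don't correspond to a retrieved chunk.
    valid = {n for n in cited if 1 <= n <= num_chunks}
    return len(valid) / num_chunks if num_chunks else 0.0

rate = context_utilization("Refunds take 14 days [2].", num_chunks=5)
```

This matches the "ignores 4 of 5" case: one cited chunk out of five gives a rate of 0.2. Tracking the rate over time makes a drop visible as a signal to lower top-k or tighten retrieval.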

15. Step-back prompting (Google DeepMind, 2023) transforms a query by doing what?

Correct. Step-back prompting reframes specific queries at a higher abstraction level — e.g., "ibuprofen 800mg side effects" → "NSAID pharmacokinetics" — to retrieve the foundational knowledge that underlies the specific question.

Step-back prompting reformulates the query at a higher abstraction level. A specific question like "side effects of ibuprofen at 800mg?" becomes "NSAID pharmacokinetics?" — retrieving the foundational context before the specific answer.