RAG Systems from Scratch

1. What distinguishes semantic embedding-based retrieval from traditional keyword search?

Correct. Embedding models map semantically similar text to nearby points in vector space — enabling "buy" to match "purchase" without any synonym list or pre-labeling.

Incorrect. The key distinction is semantic matching in continuous vector space, not hash tables, human labels, or live LLM scoring of every document.
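To make "nearby points in vector space" concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors are hand-made, hypothetical values chosen for illustration; they are not the output of any real embedding model, which would typically produce hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, NOT real model output.
toy_embeddings = {
    "buy": [0.90, 0.10, 0.00],
    "purchase": [0.85, 0.15, 0.05],
    "banana": [0.00, 0.20, 0.95],
}

def nearest(word):
    """Return the other word whose vector is most similar to `word`."""
    query = toy_embeddings[word]
    others = [w for w in toy_embeddings if w != word]
    return max(others, key=lambda w: cosine_similarity(query, toy_embeddings[w]))
```

With these toy values, `nearest("buy")` returns `"purchase"` even though the two strings share no characters, which is the behavior keyword matching cannot provide without a synonym list.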

2. What is the recommended mitigation for position bias in LLM-as-judge faithfulness evaluation?

Correct. Averaging over multiple context orderings neutralizes the systematic bias toward claims supported by early-appearing context passages.

Position bias causes LLM judges to over-credit claims supported by passages that appear early in the context. Mitigations include averaging the evaluation over multiple context orderings or using judges specifically tested for position robustness.
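One way to sketch the multiple-ordering idea is to score the same answer under every cyclic rotation of the passages, so each passage occupies each position exactly once, and then average. The `judge` callable is a hypothetical stand-in for an LLM-as-judge call, and `biased_judge` is a deliberately broken toy used only to show the effect.

```python
def rotations(passages):
    """All cyclic shifts of the list; each passage appears once in each slot."""
    return [passages[i:] + passages[:i] for i in range(len(passages))]

def averaged_faithfulness(judge, answer, passages):
    """Average a judge's score over every rotation of the context passages."""
    if not passages:
        raise ValueError("at least one passage is required")
    scores = [judge(answer, order) for order in rotations(passages)]
    return sum(scores) / len(scores)

def biased_judge(answer, passages):
    """Toy judge with extreme position bias: it only reads the first passage."""
    return 1.0 if answer in passages[0] else 0.0
```

With the toy judge, a single evaluation returns 1.0 or 0.0 depending purely on where the supporting passage happens to sit, while the averaged score is the same for any starting order.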

3. In a RAG prompt, where should the user query be positioned relative to the retrieved context?

Correct. Placing the user query after the context — closest to where generation begins — improves grounding. Anthropic's engineering team found this ordering improves context utilization.

The query should be placed last, after the context block and citation instructions. This positions it closest to the generation point, which research suggests improves the model's use of the preceding context.
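The ordering described above (context block, then citation instructions, then the query last) can be sketched as a small prompt builder. The function name, the bracketed numbering scheme, and the instruction wording are illustrative assumptions, not a prescribed template.

```python
def build_rag_prompt(context_chunks, query):
    """Assemble a prompt with the user query placed last, after the context."""
    # Number each chunk so the model can cite it as [1], [2], ...
    context_block = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    # Hypothetical instruction wording.
    instructions = (
        "Answer using only the context above. Cite passages by their [number]."
    )
    return f"Context:\n{context_block}\n\n{instructions}\n\nQuestion: {query}"
```

The resulting string always ends with the question, so the query sits immediately before the point where generation begins.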

4. What is the correct way to handle a document update in a production RAG index to avoid update drift?

Correct. Treating updates as delete-then-reinsert operations ensures the index always contains exactly one version of each document's content, without requiring complex differential logic.

Delete-then-reinsert is the cleanest pattern: remove all chunks from the old version, insert all chunks from the new version. This avoids mixed-version retrieval with no complex diff logic needed.
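A minimal in-memory sketch of the delete-then-reinsert pattern follows. `ChunkIndex` and its methods are hypothetical names; a real vector store would expose its own delete-by-filter and insert calls, but the order of operations is the point here.

```python
class ChunkIndex:
    """Toy chunk store keyed by chunk id; stands in for a real vector index."""

    def __init__(self):
        self._chunks = {}  # chunk_id -> (doc_id, text)

    def delete_document(self, doc_id):
        """Remove every chunk that belongs to doc_id."""
        stale = [cid for cid, (owner, _) in self._chunks.items() if owner == doc_id]
        for cid in stale:
            del self._chunks[cid]

    def upsert_document(self, doc_id, chunks):
        """Delete the old version first, then insert all chunks of the new one."""
        self.delete_document(doc_id)
        for i, text in enumerate(chunks):
            self._chunks[f"{doc_id}:{i}"] = (doc_id, text)

    def chunks_for(self, doc_id):
        return [text for owner, text in self._chunks.values() if owner == doc_id]
```

If the new version has fewer chunks than the old one, skipping the delete step would leave the old trailing chunks in the index; deleting first rules that out.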

5. What does BM25 stand for and what retrieval mode does it represent?

Correct. BM25 (Best Match 25) is a probabilistic keyword ranking function — the sparse retrieval standard used by default in Elasticsearch and OpenSearch.

BM25 stands for Best Match 25. It's a sparse/keyword retrieval function — probabilistic, fast, and excellent for exact-match queries like product codes and proper nouns.
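For readers who want to see the ranking function itself, here is a compact sketch of BM25 scoring. It assumes whitespace tokenization, the commonly used defaults k1=1.5 and b=0.75, and the "+1 inside the log" IDF variant that keeps scores non-negative; production engines differ in tokenization and default parameters.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # sparse: terms absent from the doc contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores
```

A query such as a product code only scores documents that contain that exact token, which is why BM25 does well on exact-match lookups.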

6. Which of the following best describes why chunking is described as a "resource allocation decision"?

Correct.

The key insight is that the generation model has a fixed context window shared by all inputs; chunk size determines how many tokens of that window are consumed by retrieved content.
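The budget arithmetic behind this can be sketched in a few lines. All token counts in the usage note are hypothetical, and the sketch assumes every chunk uses its full size.

```python
def max_chunks_in_budget(context_window, prompt_overhead, reserved_for_answer, chunk_tokens):
    """How many full-size chunks fit once fixed costs are subtracted."""
    available = context_window - prompt_overhead - reserved_for_answer
    return max(available // chunk_tokens, 0)
```

For example, with an assumed 8,192-token window, 500 tokens of instructions, and 1,024 tokens reserved for the answer, 512-token chunks leave room for 13 chunks, while 256-token chunks leave room for 26. Smaller chunks buy more distinct passages at the cost of less surrounding text per passage.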

7. What finding from the BEIR benchmark justifies using hybrid retrieval in production?

Correct. BEIR evaluated retrieval systems across 18 diverse datasets. Dense retrieval won on semantic tasks, BM25 on exact-match tasks, and hybrid systems achieved the best average across all domains.

BEIR showed domain-specific strengths for each method. Neither dense nor sparse universally dominates. Hybrid retrieval (combining both) achieved the highest average performance across all 18 test datasets.

8. What is the primary purpose of maintaining a versioned corpus snapshot in a RAG evaluation infrastructure?

Correct. Without versioned snapshots, it is impossible to know whether a metric change between evaluations reflects a code regression or index drift.

Versioned corpus snapshots solve the index drift evaluation problem — you can't reliably compare evaluation runs over time if the corpus underneath has changed.
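One lightweight way to make a snapshot identifiable is to record a content fingerprint alongside each evaluation run. The sketch below is an assumption about how that could be done, using a SHA-256 digest over sorted document ids and texts; `corpus_fingerprint` is a hypothetical helper, not part of any evaluation framework.

```python
import hashlib

def corpus_fingerprint(documents):
    """Deterministic digest of a {doc_id: text} corpus, independent of dict order."""
    digest = hashlib.sha256()
    for doc_id in sorted(documents):
        digest.update(doc_id.encode("utf-8"))
        digest.update(b"\x00")  # separator so ids and texts cannot run together
        digest.update(documents[doc_id].encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()
```

If two evaluation runs report different fingerprints, a metric change between them may come from the corpus rather than from the code.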

9. What does a "graceful exit" instruction do in a RAG system prompt?

Correct. Providing a concrete fallback ("say I don't have enough information") gives the model a safe path that doesn't require generating something plausible from training data — reducing hallucination by ~40% per Anthropic's research.

A graceful exit instruction is a specific fallback response (e.g., "If the context doesn't contain enough information, respond: 'I don't have enough information'"). It removes the pressure to confabulate when context is insufficient.

10. Which RAGAS metric would you monitor to detect cases where the LLM adds information not present in the retrieved context?

Correct. Faithfulness verifies that every claim in the generated answer is grounded in the retrieved context — directly detecting LLM-introduced hallucinations.

Faithfulness is the RAGAS metric for detecting unsupported claims — it checks whether each assertion in the answer can be traced to retrieved chunks.

11. Notion's 2023 engineering blog reported a 2.4× improvement in what metric after adding Pinecone vector search?

Correct. Notion benchmarked "relevant block in top-3" on internal query sets and saw a 2.4× improvement over their prior keyword-only backend.

Notion measured retrieval quality. Their "relevant block retrieved in top-3" rate improved 2.4× after adding vector search via Pinecone.

12. For a RAG system answering multi-hop questions requiring synthesis across 5 documents, which metric combination is most appropriate?

Correct. Multi-hop synthesis needs all relevant documents (Recall@10) and grounding quality (faithfulness) — MRR and lexical metrics are both poorly suited.

Synthesis tasks prioritize recall (all pieces needed must be retrieved) and faithfulness (the synthesis must reflect the retrieved pieces) over precision-focused or lexical metrics.
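Recall@k, the retrieval half of this pairing, is simple to compute once relevance labels exist. The sketch below assumes document ids are hashable and that the labeled relevant set is complete for the query.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)
```

For a five-document synthesis question, retrieving only three of the five needed documents in the top 10 gives a Recall@10 of 0.6, which signals that the answer cannot be fully grounded no matter how well the generator behaves.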

13. The Databricks DBRX evaluation approach used four metrics together. Which combination was described?

Correct. The four-metric combination was designed so each metric catches a different failure mode class that the others miss.

Databricks described: LLM faithfulness + human preference sample + factoid ground-truth correctness + retrieval nDCG — covering grounding, human judgment, factual accuracy, and retrieval quality independently.

14. LlamaIndex benchmarking showed that switching from 512-token fixed chunks to 256-token chunks with 64-token overlap improved exact-match retrieval recall on multi-hop questions by approximately how much?

Correct.

LlamaIndex benchmarking on SQuAD and HotpotQA showed an improvement of roughly 12 percentage points from this relatively simple change: meaningful but not dramatic.

15. What is the primary semantic property of a well-trained embedding model?

Correct. Proximity in embedding space reflects semantic similarity — this is the foundational property that makes vector retrieval useful for RAG.

The key property is geometric proximity reflecting semantic similarity. Dimensions are not interpretable, and magnitude is not a confidence signal.

16. Reciprocal Rank Fusion was introduced at which conference and in which year?

Correct. Cormack, Clarke, and Buettcher published RRF at SIGIR 2009. Its simplicity and robustness have made it the default fusion method 15 years later.

RRF was published at SIGIR 2009 by Cormack, Clarke, and Buettcher.

17. What is the IDCG in the nDCG formula and why is it necessary?

Correct. Dividing by IDCG normalizes DCG so queries with many relevant documents are comparable to queries with few relevant documents.

IDCG (ideal DCG) is the maximum achievable DCG for a given query: the score the result list would earn if it were sorted in perfect relevance order. Dividing by it gives nDCG its normalized 0–1 scale.
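The normalization step can be sketched directly. This version assumes the common log2 discount with linear gain, and it treats the graded relevances of the returned list as the full set of judged documents for the query; other gain functions and cutoffs are in use.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each gain is discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the same items in ideal (descending) order."""
    idcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / idcg if idcg > 0 else 0.0
```

A list already in ideal order scores exactly 1.0 whether it contains two relevant documents or twenty, which is what makes scores comparable across queries.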

18. Reciprocal Rank Fusion (RRF) is described as "parameter-free." What does this mean in practice?

Correct. Unlike linear interpolation (α × dense_score + (1-α) × sparse_score), RRF requires no weight tuning. It computes 1/(rank+60) for each document across each list and sums — robust across domains without tuning.

Parameter-free means RRF doesn't require you to tune a weight hyperparameter (like α in weighted-sum fusion). It uses a fixed formula, the sum of 1/(rank+60) across all ranked lists, where 60 is a smoothing constant that is conventionally left unchanged. The result is effective without any domain-specific calibration.
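The formula translates almost line for line into code. This sketch assumes each input is a list of document ids ordered best first, with ranks starting at 1; the constant 60 is exposed as `k` only so the formula is visible, not because it is expected to be tuned.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several rankings by summing 1 / (k + rank) for each document."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale, which is the practical benefit over weighted-sum fusion.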

19. Scalar Quantization (SQ8) converts each float32 dimension to int8. What is the compression ratio and typical recall impact?

Correct. float32 (4 bytes) → int8 (1 byte) = 4× compression. Distance arithmetic on int8 is faster, and recall loss is typically under 1% for well-tuned SQ8.

SQ8 compresses 4× (float32 to int8). Recall loss is typically under 1%. Product Quantization can achieve around 64× compression, depending on configuration, but with larger (~7–15%) recall loss.
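A rough sketch of the idea, under simplifying assumptions: it quantizes one vector using that vector's own min and max, and stores unsigned 8-bit codes (0 to 255) rather than signed int8. Real implementations usually learn value ranges from the dataset and may differ in rounding and layout.

```python
def sq8_encode(vector):
    """Map each float to one byte using a linear scale over [min, max]."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for constant vectors
    codes = bytes(min(255, round((x - lo) / scale)) for x in vector)
    return codes, lo, scale

def sq8_decode(codes, lo, scale):
    """Approximate reconstruction of the original floats."""
    return [lo + code * scale for code in codes]
```

Each dimension now occupies 1 byte instead of the 4 bytes of a float32, and the reconstruction error per dimension is bounded by half a quantization step.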

20. What is the primary reason SPLADE vectors can use inverted index retrieval despite having ~30,000 dimensions?

Correct. Sparsity is the key. An inverted index stores only non-zero entries — 50–200 entries per document is trivial. Dimensionality only matters when the vector is dense.

The critical property is true sparsity: most of the ~30,000 dimensions are exactly zero, so only the non-zero entries need storage, just as BM25 stores only the terms that appear in a document.
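The storage argument can be illustrated with a toy inverted index over sparse vectors. Vectors are represented as `{dimension_id: weight}` dictionaries holding only non-zero entries; the dimension ids and weights in the usage note are hypothetical, not the output of a SPLADE model.

```python
from collections import defaultdict

def build_inverted_index(sparse_docs):
    """Map each non-zero dimension to the (doc_id, weight) pairs that use it."""
    index = defaultdict(list)
    for doc_id, weights in sparse_docs.items():
        for dim, weight in weights.items():
            if weight != 0.0:
                index[dim].append((doc_id, weight))
    return index

def search(index, sparse_query):
    """Dot-product scoring that only touches dimensions present in the query."""
    scores = defaultdict(float)
    for dim, q_weight in sparse_query.items():
        for doc_id, d_weight in index.get(dim, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For instance, indexing `{"d1": {5: 1.2, 900: 0.4}, "d2": {5: 0.3, 17000: 2.0}}` creates postings for only three dimensions, regardless of how large the nominal vocabulary is, and a query touches only the postings for its own non-zero dimensions.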

Final Exam