Module 4 · Lesson 1

RAG Fundamentals: Why Retrieval Changes Everything

From static weights to living knowledge — the architecture that made enterprise AI real

Why can't you just put all your documents in the context window and call it done?

In September 2023, Databricks published benchmarks showing their RAG-enabled DBRX model answering questions about proprietary internal documentation with 78% accuracy — compared to 31% accuracy from the same base model without retrieval. The difference wasn't a smarter model. It was the pipeline feeding the model the right chunks of text at the right moment.

That same year, Klarna deployed a RAG system over 200+ product policy documents. Customer service resolution time dropped from 11 minutes to under 2 minutes. The AI wasn't hallucinating policies — it was reading them.

The Core Problem RAG Solves

Large language models are trained on a snapshot of the world. That snapshot has a cutoff date. It contains no knowledge of your internal documents, your company's pricing, your users' history, or anything that happened after training ended. When you ask a model about these things without RAG, it either refuses or — far more dangerously — confabulates a plausible-sounding answer.

Retrieval-Augmented Generation (RAG) solves this by separating the knowledge store from the reasoning engine. At inference time, a retrieval system fetches relevant document chunks, injects them into the model's context, and the model reasons over real evidence rather than learned weights. The model becomes a reader and reasoner; your database becomes the source of truth.

Why Not Just Fine-Tune?

Fine-tuning bakes knowledge into weights. Updating it requires retraining, which is expensive, slow, and error-prone. RAG updates as fast as you can re-index: change a document in your store, re-chunk and re-embed it, and the next query reflects it. Fine-tuning is for teaching style and capability; RAG is for supplying facts.

The RAG Pipeline — Step by Step

Every production RAG system, from LlamaIndex deployments to enterprise Vertex AI Search, shares the same fundamental pipeline in two phases: indexing time and query time.

1

Document Ingestion

Raw documents (PDF, HTML, Markdown, SQL exports) are loaded, cleaned, and parsed. Metadata is extracted: source URL, date, author, section headers.

2

Chunking

Documents are split into chunks, typically 256–1024 tokens each, with overlap of 10–20% to preserve context across boundaries. Chunking strategy has a large impact on retrieval quality.

3

Embedding

Each chunk is run through an embedding model (OpenAI text-embedding-3-small, Cohere embed-english-v3, or open-source alternatives) to produce a dense vector representation.

4

Vector Store

Vectors are stored in a vector database: Pinecone, Weaviate, Qdrant, pgvector, or Chroma. Each vector is linked to its source chunk and metadata.

5

Query Embedding + Retrieval

At query time, the user's question is embedded with the same model. The vector store returns the top-k most similar chunks via approximate nearest neighbor (ANN) search.

6

Augmented Generation

Retrieved chunks are injected into the LLM prompt as context. The model generates an answer grounded in those chunks. Citations can reference source documents.

User Query

↓

Query Embedding Model

↓

Vector Database
ANN Search → Top-K Chunks

↓

LLM + Context Window
[Retrieved Chunks] + [Question]

↓

Grounded Answer + Sources
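
The two phases above can be sketched end to end in a few lines. This is a toy illustration, not a production design: the bag-of-words `embed` function stands in for a real embedding model, a Python list stands in for the vector database, and the sample chunks are invented for the example.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Indexing time: embed each chunk once and keep vector and text together."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Query time: embed the question with the same function, return the top-k chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question: str, chunks: list[str]) -> str:
    """Augmentation: place the retrieved chunks ahead of the question."""
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

# Invented sample corpus for illustration
chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Enterprise plans include a dedicated support channel.",
    "Annual subscriptions renew automatically unless cancelled 30 days before renewal.",
]
index = build_index(chunks)
question = "How many days until refunds are issued?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
```

The resulting `prompt` string is what would be sent to the LLM in the augmented-generation step.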

Naive RAG vs. Advanced RAG

The simplest RAG implementation — embed everything, retrieve top-5, stuff into prompt — works for demos. Production systems that handle thousands of queries per day require more sophistication. Advanced RAG introduces query rewriting, hybrid search (dense + sparse/BM25), re-ranking models, and multi-step retrieval.

Cohere's 2024 RAG Evaluation Report found that adding a cross-encoder reranker on top of baseline retrieval improved answer accuracy by 19 percentage points on enterprise document QA benchmarks, with no change to the underlying LLM.

Dense Retrieval Semantic search via embedding similarity. Finds conceptually related content even with different vocabulary. Can miss exact keyword matches.

Sparse Retrieval (BM25) Keyword-based retrieval using term frequency statistics. Excellent for exact matches, product codes, names. Misses paraphrases.

Hybrid Search Combines dense + sparse results, either by rank (Reciprocal Rank Fusion) or by a weighted sum of normalized scores. Industry standard for production RAG as of 2024.

Re-ranking A second-pass model (cross-encoder) re-scores retrieved candidates for relevance. Typically boosts top-k quality significantly at modest latency cost.
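
Reciprocal Rank Fusion, mentioned under hybrid search, can be sketched as follows. The chunk IDs and the two input rankings are invented for illustration; `k=60` is a commonly used smoothing constant and is an assumption here rather than something the lesson prescribes.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_ranking = ["c3", "c1", "c7"]    # hypothetical output of vector search
sparse_ranking = ["c1", "c9", "c3"]   # hypothetical output of BM25
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Because RRF uses only ranks, it needs no score normalization between the dense and sparse retrievers, which is one reason it is a popular default.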

Real Deployment — Morgan Stanley, 2023

Morgan Stanley built a RAG system over 100,000+ research reports and internal documents using GPT-4 and custom embeddings. Advisors query the system for client-specific market intelligence. The bank publicly cited it as a key AI productivity initiative at their 2023 investor day. The system retrieves and cites source reports rather than generating financial claims from model weights — a critical compliance distinction.

Chunking Strategies — The Decisions That Matter Most

Poor chunking is the most common reason RAG systems fail in production. Too large: retrieval precision drops, context window fills with irrelevant content. Too small: retrieved chunks lack the surrounding context needed to answer the question.

Fixed-size chunking (simple, fast) splits every N tokens regardless of semantic boundaries. Works for uniform documents. Sentence/paragraph chunking respects natural language boundaries. Semantic chunking (used by LlamaIndex's SemanticSplitterNodeParser) embeds sentences and splits when cosine similarity between adjacent sentences drops below a threshold. Hierarchical chunking stores both summary-level and detail-level chunks, enabling coarse-to-fine retrieval.
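
The semantic chunking idea can be sketched without any library. This is a simplified illustration, not LlamaIndex's implementation: the word-count `embed` function stands in for a real sentence embedding model, and the `threshold` value and sample sentences are assumptions chosen for the example.

```python
import math
import re
from collections import Counter

def embed(sentence: str) -> Counter:
    """Toy sentence embedding: word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk whenever adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            groups.append([cur])       # topic shift: open a new chunk
        else:
            groups[-1].append(cur)     # same topic: extend the current chunk
    return [" ".join(group) for group in groups]

sentences = [
    "The termination clause requires thirty days notice.",
    "Notice of termination must be sent in writing.",
    "Payment is due within sixty days of invoice.",
]
result = semantic_chunks(sentences)
```

Here the first two sentences share enough vocabulary to stay together, while the payment sentence opens a new chunk.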

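# Python: token-based chunking with overlap using LangChain
# (requires the langchain-text-splitters and tiktoken packages)
from langchain_text_splitters import RecursiveCharacterTextSplitter

# The plain constructor measures chunk_size in characters;
# from_tiktoken_encoder measures it in tokens instead.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokenizer used for counting
    chunk_size=512,        # tokens per chunk
    chunk_overlap=64,      # 12.5% overlap for boundary continuity
    separators=["\n\n", "\n", ". ", " "]  # try natural breaks first
)

# Placeholder input: any plain-text document will do
document_text = open("contract.txt", encoding="utf-8").read()
chunks = splitter.split_text(document_text)
# Each chunk: str, ready for embedding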

512

Typical chunk size (tokens)

Sweet spot for most use cases

10–20%

Recommended overlap

Preserves cross-boundary context

Top-5

Typical k for retrieval

Balanced recall vs. noise

+19 pp

Accuracy gain with reranking

Cohere RAG Report, 2024

L1 Quiz — RAG Fundamentals

Three questions · instant feedback

1. What is the primary advantage of RAG over fine-tuning for incorporating new factual knowledge?

Correct — RAG separates knowledge from weights. Update a document in your store and the next query reflects it. Fine-tuning bakes knowledge into model parameters, requiring retraining to update.

Not quite. The key advantage is update speed and cost: RAG lets you change knowledge by updating a database, not by retraining a model. Fine-tuning is expensive and slow to refresh.

2. In a hybrid search RAG system, what two retrieval methods are typically combined?

Correct — hybrid search combines dense (vector similarity, semantic) with sparse (BM25, keyword frequency) retrieval, then merges rankings via RRF or weighted scoring. Industry standard for production RAG.

Not quite. Hybrid search combines dense vector retrieval (semantic similarity via embeddings) with sparse retrieval (BM25 keyword matching). Each method catches what the other misses.

3. Why is chunk overlap (e.g., 10–20% of chunk size) recommended in RAG chunking strategies?

Correct — when a sentence or concept spans a chunk boundary, overlap ensures both adjacent chunks contain enough surrounding context to be useful when retrieved independently.

Not quite. Overlap preserves contextual continuity: if a key idea straddles a chunk boundary, overlapping tokens ensure neither chunk is missing critical context when retrieved.

Lab 1 — Design Your RAG Pipeline

Hands-on · AI-assisted · Minimum 3 exchanges to complete

Scenario: You are building a RAG system for a legal tech startup

The startup has 50,000 contract documents (NDAs, MSAs, SOWs) in PDF format. Users need to query these with natural language questions like "What are the termination clauses in our enterprise agreements?" Your AI lab partner will help you design the full pipeline — from chunking strategy to vector store choice to query architecture.

Work through the design decisions step by step. Ask about chunking, embedding choices, vector DB selection, hybrid search, and how you'd handle metadata filtering for date ranges or contract types.

💬 Start by telling your assistant what you know about the document corpus: approximately how long are the contracts, what types of queries users will make, and what your latency requirements are. Then ask for chunking strategy recommendations.

Lab Goals — work through these in conversation

1. Describe the corpus and query types, get chunking strategy advice

2. Ask about embedding model choices and trade-offs (OpenAI vs. open-source)

3. Discuss vector store options and metadata filtering for contract type/date

4. Ask how you'd implement hybrid search and reranking for this legal use case

5. Get a final architecture recommendation you could present to a CTO

RAG Architecture Lab

Legal Tech Scenario

Welcome to the RAG pipeline design lab. I'm your AI architecture partner for this session.

You're building a RAG system over 50,000 legal contracts. This is a genuinely interesting challenge — legal documents have very specific retrieval needs: clause-level precision, metadata-heavy filtering (by contract type, party, date), and users who need citable answers.

To give you the best chunking strategy recommendation, tell me: roughly how long are these contracts (pages), what's the most common query type you expect, and do you have any latency constraints (e.g., must respond in under 2 seconds)?

Module 4 · Lesson 2

Vector Databases & Embedding Engineering

Choosing the right store, engineering better representations, and surviving production at scale

When every embedding model promises "best-in-class," how do you actually choose — and what breaks first at 10 million vectors?

In October 2023, the ANN-Benchmarks project published results showing Qdrant achieving 1.2 million queries per second on a 10M-vector SIFT dataset with 95% recall at 1ms p99 latency on a single machine. Pinecone's managed service processed over 1 billion vector queries per day by Q4 2023 across all customers. These are not edge cases — they define what production looks like.

The Vector Database Landscape

By 2024, the vector database market had fragmented into four categories: dedicated vector stores, vector-capable traditional databases, cloud-managed services, and in-process libraries. Each has genuine trade-offs that matter for RAG deployments.

Pinecone

Managed cloud, zero ops

Best for: fast prototyping, enterprise SLA needs

Qdrant

Self-hosted, Rust-based

Best for: high perf, full control, on-prem

pgvector

PostgreSQL extension

Best for: teams already on Postgres, hybrid SQL+vector

Chroma

Open-source, embeddable

Best for: development, small-scale, local RAG

Weaviate

Self-hosted or cloud

Best for: hybrid search built-in, GraphQL API

Milvus

Distributed, open-source

Best for: billion-scale deployments, cloud-native

ANN Algorithms — What's Actually Happening

Exact nearest neighbor search across millions of vectors is prohibitively slow — O(n·d) per query where d is embedding dimensionality. All production vector databases use Approximate Nearest Neighbor (ANN) algorithms that trade a small accuracy loss for orders-of-magnitude speedup.
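
For reference, the exact baseline that ANN methods approximate looks like this. It is a minimal sketch: the vectors are invented, and `recall_at_k` is the usual way to score an approximate result against the exact one.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def exact_top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Brute force: score every stored vector, so the cost is O(n * d) per query."""
    order = sorted(range(len(vectors)), key=lambda i: cosine_distance(query, vectors[i]))
    return order[:k]

def recall_at_k(approx_ids: list[int], exact_ids: list[int]) -> float:
    """Fraction of the true top-k that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
truth = exact_top_k([1.0, 0.0], vectors, k=2)
```

An ANN index that returned `[0, 3]` for this query would score `recall_at_k([0, 3], truth) == 0.5`.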

HNSW (Hierarchical Navigable Small World) is the dominant algorithm in 2024. It builds a multi-layer graph where upper layers provide coarse navigation and lower layers refine to exact neighborhoods. Qdrant and Weaviate use HNSW by default, and pgvector offers it as an index type alongside IVFFlat. Key parameters: ef_construction (graph quality at index time, higher = slower build, better recall) and ef (search quality at query time, higher = slower query, better recall). Exact parameter names vary by engine; Qdrant, for example, calls them ef_construct and hnsw_ef.
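
To make the role of `ef` concrete, here is a deliberately simplified sketch. It is not HNSW: it uses a single-layer neighbour graph built by brute force, with no hierarchy, and runs the kind of beam search that HNSW performs on its bottom layer. The function names and the one-dimensional test data are assumptions for illustration.

```python
import heapq

def dist(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_knn_graph(vectors: list[list[float]], m: int = 2) -> dict[int, set[int]]:
    """Link each point to its m nearest neighbours, in both directions."""
    graph: dict[int, set[int]] = {i: set() for i in range(len(vectors))}
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        for j in sorted(others, key=lambda j: dist(v, vectors[j]))[:m]:
            graph[i].add(j)
            graph[j].add(i)
    return graph

def beam_search(query, vectors, graph, entry: int, ef: int, k: int) -> list[int]:
    """Graph traversal that tracks the best `ef` nodes seen so far."""
    d0 = dist(query, vectors[entry])
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    results = [(-d0, entry)]     # max-heap via negation: worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break                # nothing left that can improve the beam
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(query, vectors[nb])
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)   # evict the current worst
    best = sorted((-neg_d, n) for neg_d, n in results)
    return [n for _, n in best[:k]]

vectors = [[float(i)] for i in range(20)]   # points 0..19 on a line
graph = build_knn_graph(vectors, m=2)
nearest = beam_search([13.2], vectors, graph, entry=0, ef=5, k=3)
```

A larger `ef` keeps more candidates alive, so the search visits more nodes and is less likely to stop in a poor local neighbourhood; that is the recall versus latency dial the paragraph describes.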

IVF-PQ (Inverted File with Product Quantization) is used by Faiss and Milvus for billion-scale. Compresses vectors to reduce memory footprint — critical when 100M 1536-dim float32 vectors would require 600GB RAM without compression.

# Qdrant — create collection with HNSW config
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, HnswConfigDiff

client = QdrantClient("localhost", port=6333)

client.create_collection(
    collection_name="legal_contracts",
    vectors_config=VectorParams(
        size=1536,           # OpenAI text-embedding-3-small dim
        distance=Distance.COSINE
    ),
    hnsw_config=HnswConfigDiff(
        m=16,               # edges per node — higher = better recall, more RAM
        ef_construct=200,  # build quality — increase for better index
    )
)
    

Embedding Model Selection

The MTEB (Massive Text Embedding Benchmark) leaderboard, maintained on HuggingFace, is the authoritative source for embedding model comparison. As of early 2024, the top performers on the retrieval subtask included text-embedding-3-large (OpenAI), embed-english-v3.0 (Cohere), and open-source models like bge-large-en-v1.5 (BAAI) and e5-mistral-7b-instruct (Microsoft).

Critical: embedding models must match at indexing and query time. You cannot index with OpenAI embeddings and query with Cohere. If you switch models, you must re-embed your entire corpus — plan for this from day one.

Dimensionality text-embedding-3-small: 1536 dims. text-embedding-3-large: 3072 dims (can reduce via Matryoshka). bge-large: 1024 dims. Higher dims = more expressive, more storage, slower search.

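Matryoshka Embeddings OpenAI's text-embedding-3 series supports dimension truncation: the API's dimensions parameter lets you request, for example, 256, 512, or 1536 dims from the same model. Fewer dims means proportionally less storage (1536 → 256 is a 6x saving; 3072 → 256 is 12x) with modest accuracy loss.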

Bi-encoder vs. Cross-encoder Bi-encoders embed query and doc independently (fast retrieval). Cross-encoders process them jointly (slow but highly accurate) — used for reranking, not first-pass retrieval.
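
Client-side truncation of a Matryoshka-style embedding can be sketched as below. This assumes the embedding came from a model trained to support truncation; cutting down an arbitrary model's vectors this way would generally hurt quality much more. The function name and the sample vector are illustrative.

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values, then re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

full = [3.0, 4.0, 12.0]             # stand-in for a 1536-dim vector
short = truncate_embedding(full, 2)
```

Re-normalizing matters because cosine and dot-product scores assume unit-length vectors, and a truncated vector no longer has length 1.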

Metadata Filtering — The Production Requirement

Pure semantic search is rarely enough in production. Users need "contracts signed after Jan 2023" or "NDAs involving counterparty Acme Corp." Vector databases support metadata filtering — attached structured fields that constrain the ANN search space before or during vector comparison.

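Implementation strategy matters: pre-filter (filter then search: exact over the matching subset, but it cannot use the ANN graph as built, so it is slow for broad filters and the graph can fragment under narrow ones) vs. post-filter (search then filter: can return fewer than k results, or miss the true top-k) vs. in-filter (Qdrant's approach, which its documentation calls filterable HNSW: it interleaves filtering with HNSW graph traversal for the best accuracy/speed balance).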
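
The difference between the first two strategies shows up even on four points. This sketch uses brute-force scoring in place of an ANN index, and the point IDs, vectors, and payloads are invented for illustration.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def post_filter(query, points, predicate, k):
    """Search first, then drop non-matching hits: may return fewer than k."""
    top = sorted(points, key=lambda p: -dot(query, p["vector"]))[:k]
    return [p["id"] for p in top if predicate(p["payload"])]

def pre_filter(query, points, predicate, k):
    """Filter first, then search only the surviving points."""
    allowed = [p for p in points if predicate(p["payload"])]
    top = sorted(allowed, key=lambda p: -dot(query, p["vector"]))[:k]
    return [p["id"] for p in top]

points = [
    {"id": "a", "vector": [1.0, 0.0], "payload": {"contract_type": "NDA"}},
    {"id": "c", "vector": [0.9, 0.1], "payload": {"contract_type": "MSA"}},
    {"id": "b", "vector": [0.8, 0.2], "payload": {"contract_type": "NDA"}},
    {"id": "d", "vector": [0.1, 0.9], "payload": {"contract_type": "MSA"}},
]
is_msa = lambda payload: payload["contract_type"] == "MSA"

after = post_filter([1.0, 0.0], points, is_msa, k=2)
before = pre_filter([1.0, 0.0], points, is_msa, k=2)
```

Post-filtering asked for two results and delivered one, because an NDA occupied a top-2 slot; pre-filtering returned both MSAs but had to scan every matching point.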

# Qdrant: vector search with metadata filter
# Assumes `client` from the collection setup above and a user-defined
# embed(text) helper that returns a 1536-dim list of floats.
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

results = client.search(
    collection_name="legal_contracts",
    query_vector=embed("termination clause 30 day notice"),
    query_filter=Filter(
        must=[  # every condition in `must` has to hold
            FieldCondition(key="contract_type", match=MatchValue(value="MSA")),
            FieldCondition(key="signed_year", range=Range(gte=2022))
        ]
    ),
    limit=10
)

The Embedding Drift Problem

When you update your embedding model (new version, better accuracy), your index is stale. The new model's vector space may not align with the old one. Production systems need a re-indexing strategy: background job to re-embed all chunks, swap index atomically when complete. Pinterest engineering blog (2023) described this as their #1 operational pain point in their recommendation RAG system.
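
The re-index-then-swap strategy can be sketched with an in-memory stand-in. The `VectorStore` class, its alias mechanism, and the model names below are hypothetical; some vector databases expose collection aliases that serve the same purpose, but check your engine's documentation for the actual API.

```python
class VectorStore:
    """In-memory stand-in for a vector database with collection aliases."""

    def __init__(self):
        self.collections = {}   # collection name -> {"model": ..., "vectors": ...}
        self.aliases = {}       # alias -> collection name currently served

    def reindex(self, alias, new_name, chunks, embed_fn, model_name):
        # Build the new collection completely before anything reads from it
        self.collections[new_name] = {
            "model": model_name,
            "vectors": {cid: embed_fn(text) for cid, text in chunks.items()},
        }
        old_name = self.aliases.get(alias)
        self.aliases[alias] = new_name   # the swap: one assignment, no partial state
        return old_name                  # caller decides when to delete the old index

    def resolve(self, alias):
        return self.collections[self.aliases[alias]]

store = VectorStore()
chunks = {"c1": "termination clause", "c2": "payment terms"}
store.reindex("contracts", "contracts_v1", chunks, lambda t: [float(len(t))], "model-a")
previous = store.reindex(
    "contracts", "contracts_v2", chunks, lambda t: [float(len(t)), 1.0], "model-b"
)
```

Queries always go through the alias, so they never see a half-built index, and keeping `contracts_v1` around until the new index is validated leaves a rollback path. Recording the model name with each collection also guards against querying with a mismatched embedder.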

Scaling Considerations

At 1M vectors with 1536 dims in float32: about 6GB RAM minimum just for vectors (1M × 1536 × 4 bytes), before index overhead. HNSW typically adds 1.2–2x memory overhead for the graph structure. At this scale, Qdrant or Pinecone handle it fine. At 100M+ vectors the same arithmetic gives roughly 600GB, so you need quantization (scalar quantization to int8 cuts memory 4x, for example from 6GB to 1.5GB per 1M vectors; product quantization compresses further) or distributed deployment (Milvus sharding).
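
The sizing arithmetic is easy to reproduce. This sketch counts raw vector bytes only, in decimal gigabytes, and deliberately leaves out index overhead.

```python
def raw_vector_gb(n_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in decimal GB, excluding any index overhead."""
    return n_vectors * dims * bytes_per_value / 1e9

float32_1m = raw_vector_gb(1_000_000, 1536)                        # float32: 4 bytes per value
int8_1m = raw_vector_gb(1_000_000, 1536, bytes_per_value=1)        # int8 scalar quantization
float32_100m = raw_vector_gb(100_000_000, 1536)
int8_100m = raw_vector_gb(100_000_000, 1536, bytes_per_value=1)
```

The results are about 6.1 GB and 1.5 GB at 1M vectors, and about 614 GB and 154 GB at 100M, which is why quantization or sharding becomes unavoidable at the larger scale.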

6 GB

RAM for 1M × 1536-dim vectors

Before HNSW index overhead

4×

Scalar quantization compression

Float32 → Int8, ~1% recall loss

<5ms

p99 query latency (Qdrant, 10M vecs)

ANN-Benchmarks 2023

95%+

Recall@10 with good HNSW config

vs. exact brute force
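
The 4x figure for scalar quantization comes from storing one byte per value instead of four. A minimal sketch follows, assuming a simple per-vector min/max mapping; production engines may calibrate the value range differently, so treat this as an illustration of the idea only.

```python
def quantize(vec: list[float]) -> tuple[bytes, float, float]:
    """Map each float onto 0..255 using the vector's own min and max."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0   # avoid a zero scale when all values are equal
    codes = bytes(round((x - lo) / scale) for x in vec)
    return codes, lo, scale

def dequantize(codes: bytes, lo: float, scale: float) -> list[float]:
    """Approximate reconstruction of the original values."""
    return [lo + c * scale for c in codes]

vec = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, lo, scale = quantize(vec)
restored = dequantize(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))
```

Each value now occupies one byte, and the reconstruction error is bounded by half a quantization step, which is why the recall loss is usually small.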

L2 Quiz — Vector Databases & Embeddings

Three questions · instant feedback

1. You switch from text-embedding-3-small to a new open-source embedding model for better performance. What must you do to your vector database?

Correct — embedding models produce vectors in their own learned space. Switching models invalidates all existing vectors; you must re-embed your entire corpus and rebuild the index. This is a major operational cost to plan for.

Not quite. Embedding models produce vectors in their own unique vector space. Vectors from different models are incompatible — you must re-embed the entire corpus with the new model and rebuild the index.

2. HNSW's ef parameter controls what, and what is the trade-off when you increase it?

Correct — ef (or ef_search) controls how wide the HNSW graph traversal beam is during search. More candidates examined = better recall, but each candidate comparison costs time. It's a direct recall/latency dial.

Not quite. The ef parameter in HNSW controls search beam width — how many candidate vectors are tracked during graph traversal. Higher ef examines more candidates, improving recall at the cost of query latency.

3. For a RAG use case requiring filtering by document date AND semantic similarity, what is Qdrant's recommended approach (vs. naive post-filter)?

Correct — Qdrant's in-filter approach checks metadata conditions during graph traversal rather than before or after. This avoids the "recall cliff" of aggressive pre-filtering while maintaining speed by skipping non-matching candidates during navigation.

Not quite. Post-filtering can miss the true top-k (you might search for 10 but filter leaves you with 3). Qdrant's in-filter approach interleaves filtering during HNSW traversal, maintaining recall without the cost of pre-filtering everything.

Lab 2 — Vector Store & Embedding Decisions

Hands-on · AI-assisted · Minimum 3 exchanges to complete

Scenario: You're scaling a RAG system from 10K to 10M documents

Your startup's RAG prototype works great at 10,000 documents using Chroma + OpenAI embeddings. You've just closed a Series A and need to scale to 10 million documents within 6 months. Your AI partner will help you architect the production vector infrastructure — covering database selection, HNSW tuning, quantization, and embedding model trade-offs at scale.

💬 Start by describing your current setup (Chroma + OpenAI, 10K docs) and ask your assistant to diagnose what will break first and what you should migrate to. Then dig into specific configuration decisions for your target scale.

Lab Goals — work through these in conversation

1. Identify bottlenecks in Chroma + OpenAI at 10M scale

2. Get a specific vector database recommendation with justification

3. Discuss HNSW m and ef_construction settings for your scale

4. Understand when to apply scalar vs. product quantization

5. Plan the migration: how to re-embed 10M docs without downtime

Vector Scale Lab

10M Document Challenge

Great scenario — this is exactly the scaling cliff most RAG startups hit after their first funding round.

Chroma is excellent for development and small-scale prototypes, but at 10M documents with 1536-dim OpenAI embeddings you're looking at roughly 60GB of raw vectors before HNSW index overhead. Chroma's SQLite-backed storage wasn't designed for this.

Tell me more about your current setup: what hardware are you on, what's your p99 latency requirement, and do you need to run on-premise or is cloud OK? That'll help me give you a specific migration path rather than a generic "use X" recommendation.

Module 4 · Lesson 3

Advanced Retrieval: Reranking, Query Rewriting & Multi-Step RAG

Moving beyond top-k: the techniques that separate 70% accuracy from 90%

If your retrieval finds the right documents 80% of the time, what's getting in the way of the other 20% — and how do you systematically close that gap?

In 2023, Notion's AI team published a technical blog detailing their RAG system for the Notion AI product. Their biggest accuracy gain came not from a better LLM but from query rewriting: before hitting the vector store, user questions were rewritten into 3 alternative phrasings by a small LLM, all three were retrieved against, and results were merged. This HyDE-adjacent technique improved answer relevance on internal evaluations by approximately 22% compared to single-query retrieval.

Why Naive Retrieval Fails at the Edges

Standard RAG retrieves the top-k most semantically similar chunks to a query. This works well when the user's phrasing closely matches document language. It fails when:

Vocabulary mismatch: The user asks "how do I cancel my subscription" but documents say "termination procedures for recurring billing." Semantically similar, but embedding distance may be high.

Multi-hop reasoning: The answer requires facts from two different documents; neither alone is sufficient.

Ambiguous queries: "What's our policy on this?" has no grounding unless the query is expanded.

Long documents with sparse relevance: The relevant sentence is buried in a 10,000-word report; the chunk containing it may not rank in the top-k.

Query Rewriting & Expansion

Before retrieval, a small LLM transforms the user query to improve retrieval performance. Several strategies exist with different trade-offs:

A

Multi-Query Expansion

Generate N alternative phrasings of the query. Retrieve against all N. Merge results via Reciprocal Rank Fusion (RRF). Implemented in LangChain as MultiQueryRetriever. Reduces vocabulary mismatch at cost of N× retrieval latency.

B

HyDE (Hypothetical Document Embeddings)

Ask LLM to generate a hypothetical ideal answer document, then embed that document for retrieval. Published by Gao et al. (2022), widely adopted. Particularly powerful for technical/expert domains where questions and answers use different vocabulary.

C

Step-Back Prompting

Rephrase specific question as more general concept (Google DeepMind, 2023). "What is the cancellation policy for annual subscriptions?" → "What are the general billing and cancellation terms?" Retrieves background knowledge before drilling into specifics.

D

Query Decomposition

Break complex questions into sub-questions. Retrieve for each. Synthesize. Used for multi-hop reasoning. "Compare Q1 and Q3 revenue and explain any policy changes between them" → two separate retrieval queries + synthesis.
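
Strategy A, multi-query expansion, can be sketched independently of any framework. The `rewrite` and `retrieve` callables are placeholders for an LLM call and a vector search; the stub implementations, document IDs, and the `rrf_k=60` constant are assumptions for illustration.

```python
from collections import defaultdict

def multi_query_retrieve(question, rewrite, retrieve, n=3, k=5, rrf_k=60):
    """Retrieve for the original question plus n rewrites, then fuse by rank."""
    queries = [question] + rewrite(question, n)
    scores = defaultdict(float)
    for q in queries:                                   # one retrieval call per phrasing
        for rank, doc_id in enumerate(retrieve(q), start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)      # Reciprocal Rank Fusion
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:k]

# Stubs standing in for the LLM rewriter and the vector store
def fake_rewrite(question, n):
    variants = ["termination procedures for recurring billing", "end a recurring plan"]
    return variants[:n]

FAKE_INDEX = {
    "how do I cancel my subscription": ["faq_12", "billing_3"],
    "termination procedures for recurring billing": ["policy_7", "billing_3"],
    "end a recurring plan": ["policy_7", "faq_12"],
}

def fake_retrieve(query):
    return FAKE_INDEX.get(query, [])

merged = multi_query_retrieve(
    "how do I cancel my subscription", fake_rewrite, fake_retrieve, n=2, k=2
)
```

The policy document never matched the user's own wording, yet it ranks first after fusion because two rewrites surfaced it. The cost is one retrieval call per phrasing.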

# LangChain: HyDE retrieval
from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant

llm = ChatOpenAI(model="gpt-4o-mini")
base_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Wrap the base embedder: embed_query() asks the LLM for a hypothetical
# answer and embeds that text; stored documents are embedded as usual.
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm, base_embeddings, prompt_key="web_search"
)

vectorstore = Qdrant.from_existing_collection(
    embedding=hyde_embeddings,
    collection_name="legal_contracts",
    url="http://localhost:6333",  # assumes a local Qdrant instance
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

docs = retriever.invoke("termination notice period for enterprise MSA")
# LLM first generates a hypothetical MSA termination clause,
# then retrieves real docs similar to that hypothesis

Re-ranking: The Highest ROI Retrieval Improvement

After first-pass retrieval returns candidate chunks, a cross-encoder reranker re-scores each candidate by processing the query and candidate together in a single forward pass. This is far more accurate than bi-encoder cosine similarity (which processes them independently) but too slow for first-pass retrieval across millions of vectors — hence the two-stage approach.

Cohere Rerank API (2023): the most widely used managed reranking service. BGE-Reranker (BAAI): state-of-the-art open-source cross-encoder. ms-marco-MiniLM: lightweight cross-encoder trained on Microsoft's MARCO QA dataset, fast enough for real-time reranking of top-20 candidates.

# Two-stage: retrieve 20, rerank to top-5
# Assumes `vectorstore` and `query` are already defined.
import cohere

co = cohere.Client("your-api-key")

# Stage 1: fast ANN retrieval, get top-20 candidates
candidates = vectorstore.similarity_search(query, k=20)
candidate_texts = [doc.page_content for doc in candidates]

# Stage 2: cross-encoder reranking, reorder by relevance
rerank_results = co.rerank(
    query=query,
    documents=candidate_texts,
    model="rerank-english-v3.0",
    top_n=5
)

# Each result carries the index of its document in the input list
final_chunks = [candidate_texts[r.index] for r in rerank_results.results]

Multi-Step RAG & Agentic Retrieval

Complex queries sometimes cannot be answered with a single retrieval step. Iterative RAG uses the LLM's output to generate follow-up retrieval queries. FLARE (Forward-Looking Active REtrieval, Jiang et al. 2023) drafts the answer one sentence at a time: when a drafted sentence contains tokens whose probability falls below a threshold, it uses that sentence as a retrieval query and regenerates it with the retrieved context, so retrieval happens only when the model is uncertain.
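
The confidence-triggered loop can be sketched with stub components. This is a simplified illustration of the FLARE idea rather than the paper's implementation: `draft_sentence` stands in for an LLM that returns a sentence with its token probabilities, and the threshold, stub outputs, and retrieved text are invented.

```python
def flare_generate(question, draft_sentence, retrieve, threshold=0.6, max_sentences=5):
    """Draft sentence by sentence; retrieve and redraft when confidence is low."""
    answer, context = [], []
    for _ in range(max_sentences):
        sentence, probs = draft_sentence(question, answer, context)
        if sentence is None:
            break
        if min(probs) < threshold:
            context = retrieve(sentence)   # the uncertain draft becomes the query
            sentence, probs = draft_sentence(question, answer, context)
            if sentence is None:
                break
        answer.append(sentence)
    return " ".join(answer)

# Stubs standing in for the LLM and the retriever
def fake_draft(question, answer, context):
    if len(answer) == 0:
        return "The agreement can be terminated by either party.", [0.9, 0.95]
    if len(answer) == 1:
        if context:
            return "The notice period is 30 days.", [0.9, 0.92]
        return "The notice period is 60 days.", [0.9, 0.3]   # low-confidence guess
    return None, []

retrieval_calls = []

def fake_retrieve(query):
    retrieval_calls.append(query)
    return ["Either party may terminate with 30 days written notice."]

result = flare_generate("What is the notice period?", fake_draft, fake_retrieve)
```

Only the second sentence triggered retrieval, and the low-confidence guess was replaced by a sentence grounded in the retrieved clause.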

LlamaIndex's Sub-Question Query Engine decomposes complex questions into sub-questions, dispatches retrieval for each, then synthesizes. This is the architecture behind many enterprise "research assistant" RAG deployments documented in 2023–2024.

Complex User Query

↓

Query Decomposer LLM

↓

Sub-Q 1 → Retrieval

Sub-Q 2 → Retrieval

Sub-Q 3 → Retrieval

↓

Synthesis LLM

↓

Comprehensive Answer + Citations
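
The flow in the diagram reduces to a small orchestration function. The three callables are placeholders for a decomposer LLM, a retriever, and a synthesis LLM; the stubs and chunk IDs below are hypothetical and exist only to make the sketch runnable.

```python
def answer_with_decomposition(question, decompose, retrieve, synthesize):
    """Split the question, retrieve per sub-question, then synthesize one answer."""
    evidence = {}
    for sub_question in decompose(question):
        evidence[sub_question] = retrieve(sub_question)   # independent retrievals
    return synthesize(question, evidence)

# Stubs standing in for the two LLM calls and the vector store
def fake_decompose(question):
    return ["What was Q1 revenue?", "What was Q3 revenue?"]

def fake_retrieve(sub_question):
    quarter = "Q1" if "Q1" in sub_question else "Q3"
    return [f"{quarter}_report_chunk"]

def fake_synthesize(question, evidence):
    cited = sorted(chunk for chunks in evidence.values() for chunk in chunks)
    return f"Answer to '{question}' citing {', '.join(cited)}"

answer = answer_with_decomposition(
    "Compare Q1 and Q3 revenue", fake_decompose, fake_retrieve, fake_synthesize
)
```

Keeping evidence keyed by sub-question lets the synthesis step attribute each claim to the retrieval that supported it.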

RAG Evaluation — Measuring What Matters

You cannot improve what you don't measure. The RAGAs framework (open-source, 2023) provides automated, LLM-scored evaluation across four dimensions. Most of its metrics need no ground-truth labels; context recall is the exception, since it compares the retrieved context against a reference answer. ARES (Stanford, 2023) uses a trained LLM judge to evaluate context relevance and answer faithfulness against a small labeled dataset. Both are widely used in production RAG evaluation pipelines.

Context Recall

Did retrieval find the right chunks?

RAGAs metric — compares retrieved vs. ground-truth

Context Precision

Are retrieved chunks actually relevant?

Signal-to-noise in the retrieved context

Faithfulness

Is the answer grounded in retrieved chunks?

Detects hallucination despite having context

Answer Relevance

Does the answer address the question?

End-to-end answer quality measurement
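
When you have labeled relevant chunks, the two retrieval metrics can be approximated with plain set arithmetic. This is a simplified, label-based sketch and not how RAGAs computes its metrics (RAGAs uses an LLM to judge statements); the chunk IDs are invented.

```python
def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Share of the known-relevant chunks that retrieval actually returned."""
    if not relevant_ids:
        return 1.0
    return len(set(retrieved_ids) & relevant_ids) / len(relevant_ids)

def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Share of the retrieved chunks that are known to be relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)

retrieved = ["c1", "c4", "c9", "c2"]
relevant = {"c1", "c2", "c3"}
recall = context_recall(retrieved, relevant)
precision = context_precision(retrieved, relevant)
```

Here recall is 2/3 (chunk `c3` was missed) and precision is 1/2 (two of four retrieved chunks were noise), which points at retrieval rather than generation as the place to improve.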

L3 Quiz — Advanced Retrieval

Three questions · instant feedback

1. What is HyDE (Hypothetical Document Embeddings) and when is it most powerful?

Correct — HyDE bridges the vocabulary gap between questions and answers. A user asking "how do I cancel?" and a document about "account termination procedures" may be semantically distant at the question level but close at the answer-document level.

Not quite. HyDE asks an LLM to generate a hypothetical answer document, then retrieves real documents similar to that hypothesis. It's especially powerful for expert/technical domains where user questions and document language differ significantly.

2. In a two-stage reranking pipeline, what is the typical first stage and why isn't the second stage used alone?

Correct — cross-encoders process query+document jointly (slow), so they can only practically score a small candidate set (20–50). Fast bi-encoder ANN retrieval generates that candidate set; the cross-encoder then reorders it accurately.

Not quite. Cross-encoders are highly accurate but slow — they process query and document jointly and can't scale to millions of vectors at query time. The bi-encoder's fast ANN retrieval generates a manageable candidate pool (20–50), which the cross-encoder then re-scores accurately.

3. The RAGAs framework's "Faithfulness" metric measures what specific failure mode?

Correct — faithfulness catches a critical failure mode: the LLM ignoring the retrieved context and generating from its parametric memory instead. An LLM can hallucinate even when given correct context if it overrides retrieval with trained knowledge.

Not quite. Faithfulness specifically measures whether the answer's claims are supported by the retrieved chunks. It targets the failure mode where the LLM ignores provided context and generates answers from its training weights instead — hallucinating despite correct retrieval.

Lab 3 — Building an Advanced Retrieval Pipeline

Hands-on · AI-assisted · Minimum 3 exchanges to complete

Scenario: Your baseline RAG is at 68% accuracy — your CTO wants 85%

You have a working RAG system for a healthcare benefits platform. Users query a corpus of insurance policy documents. Your baseline (simple top-5 retrieval, no reranking) scores 68% on your internal QA evaluation set. Your CTO has set an 85% target before launch. Work with your AI partner to diagnose the failure modes and implement a multi-technique improvement strategy.

💬 Share your current pipeline setup: "We have top-5 cosine similarity retrieval, text-embedding-3-small, Qdrant, GPT-4o as generator, 512-token chunks with 10% overlap. Our eval set shows we fail most on multi-part benefit questions and questions with technical medical terminology." Then ask for a diagnosis and improvement plan.

Lab Goals — work through these in conversation

1. Get a diagnosis of which failure modes explain the accuracy gap

2. Design a query rewriting strategy for medical terminology mismatch

3. Plan a reranking implementation with specific model recommendation

4. Discuss whether HyDE or multi-query makes more sense for healthcare benefits queries

5. Build a prioritized improvement roadmap to reach the 85% target

Advanced Retrieval Lab

Healthcare Benefits RAG

Healthcare benefits RAG is a great case study — it has almost every retrieval challenge in one package: technical vocabulary mismatch (medical codes vs. plain language), multi-part questions (does my plan cover X, and what's the deductible, and does it require pre-auth?), and high accuracy stakes.

68% is a solid baseline but you have a meaningful gap to close. To diagnose the failure modes precisely: can you share roughly what percentage of failures fall into "retrieved the wrong chunks" vs. "retrieved right chunks but got a wrong answer"? If you've run any manual error analysis on the eval set, even informal patterns help a lot here. Then we can decide which intervention to tackle first.

Module 4 · Lesson 4

Production RAG: Evaluation, Observability & Knowledge Graph Augmentation

Shipping RAG is easy. Keeping it accurate at production scale, with evolving documents and real users, is the actual engineering challenge.

How do you know when your RAG system is quietly getting worse — and how do GraphRAG and knowledge-augmented retrieval change the ceiling?

In July 2024, Microsoft Research released GraphRAG, an open-source system combining knowledge graph extraction with traditional vector retrieval. On benchmark "global reasoning" queries (questions requiring synthesis across many documents), GraphRAG achieved a 72% win rate against naive RAG on community-level sensemaking tasks. The improvement came from extracted entity relationships, not raw text retrieval.

That same quarter, LlamaIndex released LlamaCloud citing a recurring pattern from enterprise customers: RAG systems that scored 85%+ on launch-day evaluations degraded to 70% within 90 days as underlying documents changed and query distributions shifted. Continuous evaluation pipelines had become non-negotiable.

The Production Evaluation Stack

Production RAG evaluation operates at three layers. Offline evaluation runs a curated QA benchmark before every deployment — similar to a test suite. Online evaluation scores live queries using LLM judges and tracks metrics over time. Human evaluation samples a percentage of production queries weekly for expert review — the ground truth check.

1. Build a Golden Dataset

100–500 question/answer pairs with source citations. Expert-written. Must represent the real query distribution. Updated quarterly as documents and use cases evolve.

2. Automated Eval on Golden Set

Run RAGAs or ARES against the golden dataset on every code change. Track context recall, faithfulness, and answer relevance as CI/CD pipeline metrics. Alert on >3% regression.

3. Online Scoring via LLM Judge

Sample 5–10% of live queries. Score them with GPT-4 as the judge (faithfulness, relevance). Store results in a time-series DB. Build a dashboard showing the rolling 7-day accuracy trend.

4. Retrieval Tracing & Logging

Log every query: raw query, rewritten query, retrieved chunk IDs + scores, final answer, latency. Use LangSmith, Phoenix (Arize), or LlamaTrace. This enables root-cause analysis of failures.

5. Query Distribution Monitoring

Embed all queries and visualize their distribution over time. Cluster drift indicates users are asking new types of questions your system wasn't optimized for, which should trigger an evaluation refresh.
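
Query distribution monitoring can be approximated cheaply before any clustering is in place. The sketch below is a simplification, not full cluster-drift analysis: it compares the centroid of recent query embeddings against a baseline centroid using cosine distance. The function names and the 0.15 threshold are illustrative assumptions, and the embeddings are assumed to come from whatever model your retriever already uses.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> list[float]:
    """Mean vector of a non-empty batch of equal-length embeddings."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def query_drift(
    baseline: Sequence[Vector],
    recent: Sequence[Vector],
    threshold: float = 0.15,  # assumed value; tune on your own traffic
) -> tuple[float, bool]:
    """Return (distance between centroids, whether it exceeds the threshold)."""
    distance = cosine_distance(centroid(baseline), centroid(recent))
    return distance, distance > threshold


# Toy 2-D "embeddings" for illustration only
baseline_queries = [[1.0, 0.0], [0.9, 0.1]]
similar_queries = [[1.0, 0.05], [0.95, 0.0]]
shifted_queries = [[0.0, 1.0], [0.1, 0.9]]

print(query_drift(baseline_queries, similar_queries))
print(query_drift(baseline_queries, shifted_queries))
```

A centroid comparison misses drift that keeps the mean in place (for example, one cluster splitting into two), so it works best as an early-warning signal alongside the visual cluster inspection described above.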

# RAGAs evaluation pipeline: automated scoring on the golden dataset.
# Column names follow the ragas 0.1.x schema; check the schema for your installed version.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# questions, generated_answers, retrieved_contexts and gold_answers are
# equal-length lists loaded from your golden set and your RAG system's output.
golden_data = Dataset.from_dict({
    "question": questions,            # your 200 golden questions
    "answer": generated_answers,      # your RAG system's answers
    "contexts": retrieved_contexts,   # list of retrieved chunks per question
    "ground_truth": gold_answers,     # expert-written ground truth
})

result = evaluate(
    golden_data,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)
# Prints the aggregate score per metric; result.to_pandas() gives the per-question breakdown
print(result)
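
The ">3% regression" alert from the automated eval step could be wired into CI as a small gate. This is a minimal sketch under stated assumptions: metric scores are plain floats on a 0–1 scale (for example, copied out of the RAGAs result), the baseline comes from the last accepted run, and "3%" is interpreted as an absolute drop of 0.03. The function name `find_regressions` is hypothetical.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop: float = 0.03,  # assumption: absolute drop on a 0-1 scale
) -> dict[str, float]:
    """Return {metric: drop} for every metric that fell by more than max_drop.

    A metric missing from `current` is treated as a score of 0.0, so it is
    always reported rather than silently skipped.
    """
    regressions = {}
    for metric, base_score in baseline.items():
        drop = base_score - current.get(metric, 0.0)
        if drop > max_drop:
            regressions[metric] = round(drop, 4)
    return regressions


# Illustrative numbers, not measured results
baseline_scores = {"faithfulness": 0.91, "context_recall": 0.84}
current_scores = {"faithfulness": 0.90, "context_recall": 0.79}

regressions = find_regressions(baseline_scores, current_scores)
print(regressions)
# In a CI job you would typically fail the build here, e.g.:
#     if regressions: raise SystemExit(1)
```

Whether the threshold should be absolute or relative, and whether it applies per metric or to an aggregate, is a team decision; the gate only makes that decision explicit and repeatable.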
    

Observability: Knowing When Things Break

RAG systems fail silently. Unlike a 500 error, a hallucinated answer looks identical to a correct one from the outside. Production observability requires instrumenting the full trace — not just the final answer.

LangSmith (LangChain's observability platform, launched 2023) logs the full LLM call tree including retrieval steps, latencies, and prompt/response content. Phoenix by Arize provides embedding visualization and retrieval quality metrics. Langfuse (open-source alternative) supports cost tracking and custom evaluation scoring via API.
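
Whichever tool is used, the underlying record is the same per-query trace listed under Retrieval Tracing & Logging. The sketch below shows one possible shape for that record as a JSON line; it does not use any vendor SDK, and the class and field names (`RagTrace`, `RetrievedChunk`, `to_json_line`) are assumptions chosen for illustration.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class RetrievedChunk:
    chunk_id: str
    score: float  # retrieval or reranker score for this chunk


@dataclass
class RagTrace:
    raw_query: str
    rewritten_query: str
    retrieved: list[RetrievedChunk]
    answer: str
    latency_ms: float
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)

    def to_json_line(self) -> str:
        """Serialize the full trace (including nested chunks) as one JSON line."""
        return json.dumps(asdict(self), ensure_ascii=False)


trace = RagTrace(
    raw_query="Q3 revenue for Acme?",
    rewritten_query="Acme Corp third quarter revenue",
    retrieved=[RetrievedChunk("acme-10q:v2:14", 0.91), RetrievedChunk("acme-10q:v2:15", 0.47)],
    answer="(model answer text)",
    latency_ms=812.5,
)
print(trace.to_json_line())
```

Storing chunk IDs and scores rather than full chunk text keeps each record small while still letting you reconstruct what the model saw during root-cause analysis.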

Key Observability Metrics for RAG

Retrieval latency p95/p99: degradation here often precedes accuracy issues (index fragmentation, DB under load).

Context length utilization: are retrieved chunks using your full context window or wasting it?

Reranking score distribution: a narrowing score gap between top-1 and top-5 reranked candidates often signals that retrieval is returning irrelevant results.

Query refusal rate: the model saying "I don't have enough information" is an informative signal, not a failure.
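
The reranking score gap is straightforward to compute per query. The following is a minimal sketch, assuming reranker scores are available as a list of floats; the function names and the 0.2 minimum gap are illustrative assumptions, and a sensible threshold depends on your reranker's score scale.

```python
from typing import Sequence


def rerank_gap(scores: Sequence[float], k: int = 5) -> float:
    """Gap between the best score and the k-th best score (or the last one if fewer than k)."""
    top = sorted(scores, reverse=True)[:k]
    if len(top) < 2:
        raise ValueError("need at least two reranked candidates to compute a gap")
    return top[0] - top[-1]


def is_flat_rerank(scores: Sequence[float], min_gap: float = 0.2, k: int = 5) -> bool:
    """True when the top-k scores are tightly clustered, i.e. no candidate stands out."""
    return rerank_gap(scores, k) < min_gap


# Illustrative score lists, not measurements
clustered = [0.55, 0.53, 0.52, 0.51, 0.50]
separated = [0.92, 0.45, 0.31, 0.28, 0.22]

print(is_flat_rerank(clustered))
print(is_flat_rerank(separated))
```

Tracking the share of queries flagged this way over time gives a single trend line that tends to move before answer-quality metrics do.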

GraphRAG — When Knowledge Graphs Change the Answer

Standard RAG retrieves text chunks. GraphRAG extracts a knowledge graph from documents — entities, relationships, communities — and adds graph-structured retrieval alongside vector retrieval. This fundamentally changes what kinds of questions can be answered.

Vector RAG excels at: "What does document X say about topic Y?" — local, specific retrieval. GraphRAG excels at: "What is the overall theme connecting these 50 documents?" or "Which entities appear most frequently in the context of risk?" — global, relational queries.

Microsoft's GraphRAG open-source release (July 2024) uses a two-phase approach: offline graph construction (entity extraction → relationship extraction → community detection) and online query routing (local search for specific facts, global search for thematic synthesis). The community detection step (using Leiden algorithm) creates hierarchical summaries at multiple granularities.

Raw Documents

↓ Offline Processing

Entity Extraction (LLM-powered NER) → Relationship Graph (co-occurrence + LLM) → Community Summaries (Leiden clustering)

↓ Query Time

Local Search (specific entity queries) | Global Search (thematic/relational queries)

↓

LLM Generation with Graph + Vector Context
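
To make the offline phase concrete, here is a toy sketch of the graph-building idea. It is not Microsoft's implementation: entity extraction is assumed to have already happened (GraphRAG uses an LLM for this), edges come from simple co-occurrence within a chunk, and connected components stand in for Leiden clustering, which optimizes modularity and produces a hierarchy rather than flat groups. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(chunk_entities: list[list[str]]) -> dict[str, dict[str, int]]:
    """Edge weight = number of chunks in which two entities appear together."""
    graph: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for entities in chunk_entities:
        unique = sorted(set(entities))
        for entity in unique:
            _ = graph[entity]  # register the node even if it has no edges
        for a, b in combinations(unique, 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def connected_components(graph: dict[str, dict[str, int]]) -> list[set[str]]:
    """Flat stand-in for community detection: group entities reachable from each other."""
    seen: set[str] = set()
    communities: list[set[str]] = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(n for n in graph[node] if n not in members)
        seen |= members
        communities.append(members)
    return communities


# Entities per chunk, as an upstream extractor might return them (made-up names)
chunks = [["Acme", "Beta Corp"], ["Beta Corp", "Gamma"], ["Delta", "Epsilon"], ["Zeta"]]
graph = build_cooccurrence_graph(chunks)
communities = connected_components(graph)
print(communities)
```

In the real pipeline each community would then be summarized by an LLM, and those summaries are what global search queries read at query time.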

Document Lifecycle Management

Production knowledge systems must handle the full document lifecycle: ingestion (new documents added), updates (existing documents revised — requires deleting old chunks and re-embedding), and deletion (documents removed — requires removing associated vectors and metadata). Vector stores that lack delete-by-metadata capability create technical debt that compounds over time.

The most robust pattern (used by Notion AI, Morgan Stanley, and Glean according to their respective engineering blogs) is document-level versioning: each document has a version ID as metadata, chunks are tagged with their parent document ID and version, and updates delete all chunks with the old document ID before re-indexing.

Payload Indexing: Creating database indexes on metadata fields in your vector store. Essential for efficient filtered search at scale. In Qdrant, create a payload index on fields you'll frequently filter by.

Soft Deletes: Marking vectors as inactive rather than deleting them immediately. Allows rollback if document update causes regressions. Vectors are truly deleted during scheduled maintenance windows.

Embedding Versioning: Tagging each vector with the embedding model version used to create it. Enables targeted re-embedding of specific chunks when model is updated without full corpus rebuild.
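
The versioning pattern and the three terms above fit together in a few lines. The sketch below uses an in-memory dictionary in place of a real vector store and omits the vectors themselves; `ChunkIndex` and its methods are hypothetical names, and a production store would express the same operations as metadata-filtered updates and deletes.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    doc_id: str
    doc_version: int
    embedding_model: str
    text: str
    active: bool = True  # soft-delete flag


class ChunkIndex:
    def __init__(self) -> None:
        self._chunks: dict[str, Chunk] = {}

    def upsert_document(self, doc_id: str, version: int, texts: list[str], embedding_model: str) -> None:
        """Soft-delete every active chunk of doc_id, then index the new version."""
        for chunk in self._chunks.values():
            if chunk.doc_id == doc_id and chunk.active:
                chunk.active = False
        for i, text in enumerate(texts):
            chunk_id = f"{doc_id}:v{version}:{i}"
            self._chunks[chunk_id] = Chunk(chunk_id, doc_id, version, embedding_model, text)

    def active_chunks(self, doc_id: str) -> list[Chunk]:
        return [c for c in self._chunks.values() if c.doc_id == doc_id and c.active]

    def stale_embeddings(self, current_model: str) -> list[str]:
        """IDs of active chunks embedded with a model other than current_model."""
        return [c.chunk_id for c in self._chunks.values() if c.active and c.embedding_model != current_model]

    def purge_inactive(self) -> int:
        """Hard-delete soft-deleted chunks (the maintenance-window step); returns count removed."""
        dead = [cid for cid, c in self._chunks.items() if not c.active]
        for cid in dead:
            del self._chunks[cid]
        return len(dead)


index = ChunkIndex()
index.upsert_document("acme-10k", 1, ["old intro", "old risks"], "embed-v1")
index.upsert_document("acme-10k", 2, ["new intro"], "embed-v2")
index.upsert_document("memo-7", 1, ["memo body"], "embed-v1")
print([c.chunk_id for c in index.active_chunks("acme-10k")])
print(index.stale_embeddings("embed-v2"))
```

Because the old version is only soft-deleted, a rollback amounts to flipping the `active` flags back before the next purge runs.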

The Glean Architecture (2024)

Glean, the enterprise search company valued at $4.6B in 2024, published engineering posts describing their RAG infrastructure. Key features: per-user and per-department access control enforced at the vector search layer via metadata filters (not post-retrieval); real-time document ingestion pipelines targeting under 60-second latency from document update to searchable chunk; and hybrid retrieval (BM25 + dense) across 100+ enterprise SaaS connectors. Document lifecycle management — handling version updates across Confluence, Salesforce, Slack, etc. — is described as their primary engineering complexity.
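
The distinction between filtering at the search layer and filtering after retrieval can be shown with a brute-force search. This is a sketch of the general idea only, not Glean's system: the `allowed_groups` metadata field, the class name, and the function names are assumptions, and a real vector store would apply the filter inside its ANN index rather than over a Python list.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    vector: tuple[float, ...]
    allowed_groups: frozenset[str]  # access-control metadata stored with the vector


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(
    query_vec: tuple[float, ...],
    index: list[IndexedChunk],
    user_groups: set[str],
    top_k: int = 3,
) -> list[str]:
    """Apply the access filter BEFORE ranking, so top_k is taken only from visible chunks."""
    visible = [c for c in index if c.allowed_groups & user_groups]
    ranked = sorted(visible, key=lambda c: cosine(query_vec, c.vector), reverse=True)
    return [c.chunk_id for c in ranked[:top_k]]


chunk_index = [
    IndexedChunk("hr-1", (1.0, 0.0), frozenset({"hr"})),
    IndexedChunk("sales-1", (0.9, 0.1), frozenset({"sales"})),
    IndexedChunk("all-1", (0.0, 1.0), frozenset({"hr", "sales"})),
]

print(filtered_search((1.0, 0.0), chunk_index, {"sales"}, top_k=2))
```

Filtering after retrieval instead would rank `hr-1` first for this query and then drop it, leaving a sales user with fewer than `top_k` results, and it would mean restricted chunks pass through the pipeline before being discarded.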

Putting It All Together: Production RAG Architecture

A mature production RAG system layers all the techniques from this module: hybrid retrieval as the first pass, cross-encoder reranking as the second pass, query rewriting to handle vocabulary gaps, metadata filtering for access control and relevance scoping, continuous evaluation via RAGAs on a golden dataset, full trace observability in LangSmith or Phoenix, and document lifecycle management with versioned chunks. GraphRAG adds an optional third retrieval path for global/relational queries.

60s: Target ingestion latency (Glean). Doc update → searchable chunk

72%: GraphRAG win rate vs. naive RAG. Global reasoning queries (Microsoft, 2024)

5–10%: Live query sampling rate. For online LLM judge evaluation

3%: Regression alert threshold. On golden dataset; triggers investigation

L4 Quiz — Production RAG & GraphRAG

Three questions · instant feedback

1. What type of query does GraphRAG handle significantly better than standard vector RAG, and why?

Correct — standard RAG retrieves local chunks and cannot answer "what themes connect these 1000 documents?" GraphRAG's community detection and hierarchical summaries encode corpus-wide relational structure, enabling synthesis across document sets.

Not quite. GraphRAG's advantage is on global/relational queries — "what are the main themes," "which entities are most associated with X across all documents." These require corpus-level understanding that individual chunks cannot provide.

2. When implementing document lifecycle management in a production RAG system, what is the recommended pattern for handling a document update (revised version)?

Correct — leaving old chunks alongside new chunks causes the system to retrieve both versions, leading to contradictory context and confused answers. Delete-by-document-ID + re-ingest is the clean pattern used in production systems like Glean and Notion AI.

Not quite. Leaving old chunks in the index alongside new ones causes the system to potentially retrieve outdated information alongside current information, producing contradictory context. The correct approach is delete old chunks by document ID, then re-index the updated document.

3. In RAG observability, a narrowing score gap between the top-1 and top-5 reranked candidates typically signals what?

Correct — when the reranker sees candidates 1–5 all scoring similarly (e.g., 0.55, 0.53, 0.52, 0.51, 0.50), it means none are clearly relevant. In a healthy system, the true answer chunk should score distinctly higher (e.g., 0.92, 0.45, 0.31...). Narrow gaps indicate the retrieval stage is failing.

Not quite. In a healthy retrieval pipeline, the top-ranked chunk should score distinctly higher than lower candidates. When scores are tightly clustered (all ~0.5), it means the first-pass retrieval returned a set of candidates that are all only marginally relevant — a retrieval failure signal.

Lab 4 — Production Evaluation & Observability Design

Hands-on · AI-assisted · Minimum 3 exchanges to complete

Scenario: Build the evaluation and observability stack for a financial research RAG system

You're the ML engineer at a hedge fund that has deployed a RAG system over 10 years of analyst reports, earnings transcripts, and SEC filings. The system is live with 50 analyst users. Your task: design the full evaluation and observability stack that will catch regressions before users do, handle quarterly document updates cleanly, and decide whether to add GraphRAG for cross-corpus thematic queries.

💬 Start by describing the system: "We have 2M document chunks covering 10 years of financial data. Analysts query for specific company financials (local) and thematic market trends (global). Currently zero observability — no logging, no eval pipeline. Where do I start?" Then work through the design with your AI partner.

Lab Goals — work through these in conversation

1. Design a golden dataset strategy for financial QA (what questions, how many, who writes them)

2. Plan the automated eval pipeline (RAGAs metrics, CI/CD integration, alert thresholds)

3. Design observability tracing with LangSmith or Phoenix for financial use case

4. Design document lifecycle management for quarterly earnings updates

5. Evaluate whether GraphRAG is worth implementing for thematic market analysis queries

Production RAG Ops Lab

Financial Research System


Financial research RAG with zero current observability — this is actually a very common situation six to twelve months after initial deployment. You built the thing, it shipped, everyone's using it, and suddenly you realize you have no idea how it's actually performing.

Good news: you can prioritize these interventions by impact-to-effort ratio. The golden dataset and automated eval pipeline are your highest-leverage first move — they give you a baseline measurement before you change anything else. Observability and lifecycle management are parallel tracks you can build while the eval pipeline runs.

Tell me more about your current query distribution: roughly what fraction of analyst queries are local (specific company/date lookups) vs. global (market themes, cross-company patterns)? That ratio will heavily influence whether GraphRAG is worth the implementation cost for your team.

Module 4 Test — RAG & Knowledge Systems

15 questions · 80% to pass · covers all four lessons

1. What fundamental problem does RAG solve that fine-tuning cannot address efficiently?

Correct — RAG separates knowledge from weights, enabling instant knowledge updates by changing a database rather than retraining a model.

RAG's primary advantage is knowledge update speed. Fine-tuning is slow and expensive to refresh; RAG updates instantly when source documents change.

2. In the RAG pipeline, what happens at "indexing time" vs. "query time"?

Correct — the two-phase architecture is fundamental to RAG: offline indexing creates the searchable knowledge base; online query time retrieves and generates.

Indexing is the offline preparation phase (chunk → embed → store). Query time is the live inference phase (embed query → retrieve → generate).

3. What is Reciprocal Rank Fusion (RRF) used for in RAG systems?

Correct — RRF is the standard algorithm for combining result sets from dense and sparse retrieval in hybrid search. It scores candidates by their rank positions across lists rather than raw scores.

RRF is a rank aggregation algorithm. It combines ranked lists from multiple retrieval methods into one list, used in hybrid search to merge dense and sparse results.

4. Why does chunk size significantly affect RAG system quality?

Correct — chunk size is one of the most impactful tuning decisions in RAG. The sweet spot balances retrieval precision (smaller = more precise) with context sufficiency (larger = more context).

Chunk size critically affects both retrieval precision and answer quality. Large chunks introduce noise; small chunks lose context. 512 tokens with overlap is a common starting point.

5. What is the HNSW algorithm and why is it used in production vector databases?

Correct: HNSW was the dominant ANN algorithm as of 2024. It is the default index in Qdrant and Weaviate and is available in pgvector alongside IVFFlat. Its multi-layer graph structure enables millisecond-scale queries across millions of vectors.

HNSW (Hierarchical Navigable Small World) is an approximate nearest neighbor graph algorithm. It makes vector search tractable at millions of vectors with very high recall (95%+) at low latency.

6. What does "Matryoshka Embeddings" refer to in the context of OpenAI's text-embedding-3 models?

Correct — Matryoshka Representation Learning (MRL) trains models so that shorter prefixes of the full embedding vector are still useful. OpenAI's text-embedding-3 supports this, enabling 5–10x storage savings with modest accuracy loss.

Matryoshka embeddings are vectors where shorter prefixes remain meaningful. You can use 256, 512, or 1536 dimensions from the same text-embedding-3-small model — useful for storage/latency optimization.

7. In a two-stage retrieval pipeline, what role does the cross-encoder play and how does it differ from a bi-encoder?

Correct — bi-encoders enable fast ANN retrieval (pre-compute doc embeddings, compare at query time). Cross-encoders produce more accurate relevance scores by attending to both query and document together — but can only do this for a small candidate set.

Bi-encoders embed query and document separately (fast, scalable to millions). Cross-encoders process them jointly (slow, highly accurate). Use bi-encoder for first-