RAG Systems from Scratch · Introduction

LLMs know a lot. They don't know about your data.

RAG is the bridge between the general intelligence of a model and the specific intelligence of your documents.

A large language model in 2026 has been trained on a significant fraction of the public text on the internet. It knows a lot. It does not know what's in your company's support tickets, your legal contracts, your proprietary research, your private Slack, your internal wiki, or the email thread where you actually made the decision.

Retrieval-Augmented Generation (RAG) is the pattern for bridging that gap. You turn your private documents into a searchable index, retrieve the relevant chunks at query time, and give them to the model as context. The model reasons across the general and the specific. It has become one of the most widely used architecture patterns in applied AI, and many companies building production AI systems run some version of it.

This course builds RAG from scratch — not by calling a hosted vector database and hoping for the best, but by actually implementing the pipeline end to end. It covers document chunking strategies, embedding models, vector databases, hybrid search, re-ranking, prompt design for retrieved context, evaluation, and the real-world failure modes (chunk boundary issues, stale indexes, adversarial queries) that kill production RAG systems.

Lesson 1 · Why RAG Exists

The Knowledge Cutoff Problem

Large language models are trained once, then frozen. The world keeps moving.

Why can't we just retrain the model whenever we need new facts?

When Microsoft launched Bing Chat in February 2023, it was one of the first mass-market deployments of a large language model in a live search product. Within days, reporters discovered the model confidently citing outdated information: store hours that had changed, executives who had resigned, prices that no longer existed. On its own, the underlying model knew nothing beyond its training cutoff. Users were furious. The problem had a name: knowledge staleness.

Microsoft's engineers already had a partial answer in production: they were feeding retrieved web content directly into the model's context window alongside the user's query. That architectural choice — retrieve, then generate — is exactly what we study in this course.

What a Language Model Actually Knows

A large language model is, at its core, a compressed statistical summary of the text it was trained on. During pre-training, billions of parameters are adjusted to predict the next token across hundreds of billions of words. Once training ends, those parameters are frozen. The model has no mechanism to update itself when a company files for bankruptcy, a law changes, or a new product ships.

This is not a bug — it is an architectural fact. Training GPT-4, according to public estimates, cost tens of millions of dollars in compute. Retraining to add a week's worth of news is economically absurd. Fine-tuning on new data is cheaper, but it introduces its own hazards: catastrophic forgetting, where updating weights for new facts degrades performance on old ones. Neither path scales to the pace of real-world knowledge change.

The result: every deployed LLM has a knowledge cutoff — the date beyond which it has no direct information. OpenAI's GPT-4 Turbo, released in November 2023, has a training cutoff of April 2023. Claude 3 Opus has a cutoff of August 2023. Any question that depends on facts after those dates is answered from inference and interpolation, not from evidence.

$50M+: Estimated GPT-4 training cost

6–18 mo: Typical gap, cutoff → deployment

~1%: Daily web content change rate

Hallucination: The Downstream Symptom

When a model is asked about something beyond its training data, it does not say "I don't know" by default — it generates the most statistically plausible continuation of the prompt. That continuation may be factually wrong. This phenomenon, called hallucination, became a defining concern of enterprise AI deployment in 2022–2023.

In 2023, two lawyers in New York, Steven Schwartz and Peter LoDuca, submitted a legal brief to federal court containing citations to six cases that did not exist. ChatGPT had fabricated them with confident, realistic-looking formatting. Judge P. Kevin Castel imposed a $5,000 sanction on the attorneys and their firm and issued a blistering opinion. The incident became one of the most-cited examples of LLM hallucination in a high-stakes professional context.

The structural cause: the model had no access to an actual legal database during generation. It was producing tokens that looked like Westlaw citations because that pattern appeared in training data — not because it had retrieved and verified the underlying documents.

Real Case — Mata v. Avianca, 2023

Federal judge P. Kevin Castel sanctioned attorneys who submitted a ChatGPT-generated brief containing citations to entirely fabricated cases. The model had no retrieval mechanism. It generated plausible-looking legal citations from statistical pattern completion. RAG systems address exactly this failure mode by grounding generation in retrieved, verifiable documents.

Why This Matters for Every Business Application

Enterprise AI applications almost universally require facts the base model cannot have: internal policy documents updated last week, customer records, product specs, regulatory filings, support tickets. A model trained on public internet data has none of this. Asking it to answer questions about your company's Q3 pricing strategy or your hospital's current formulary is asking it to invent.

The naive fix, stuffing everything into the system prompt, runs into context window limits. A 128,000-token context window sounds large until you realize that a medium-sized company's internal knowledge base might contain 50 million tokens of documentation. Even with 1-million-token windows (Google's Gemini 1.5 Pro, 2024), loading entire knowledge bases into every request is too expensive and too slow to be practical at scale.
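To make the mismatch concrete, here is a minimal arithmetic sketch using the two figures quoted above (a 128,000-token window and a 50-million-token knowledge base). The variable names are illustrative only.

```python
import math

CONTEXT_WINDOW_TOKENS = 128_000      # window size quoted in the text
KNOWLEDGE_BASE_TOKENS = 50_000_000   # knowledge base size quoted in the text

# Fraction of the knowledge base that fits into one request.
fraction_that_fits = CONTEXT_WINDOW_TOKENS / KNOWLEDGE_BASE_TOKENS

# Number of full-window requests needed just to read everything once.
requests_to_read_all = math.ceil(KNOWLEDGE_BASE_TOKENS / CONTEXT_WINDOW_TOKENS)

print(f"Fits in one request: {fraction_that_fits:.3%}")
print(f"Requests to cover the whole base: {requests_to_read_all}")
```

Under these assumptions, a single request sees roughly a quarter of one percent of the knowledge base, which is why some form of selection has to happen before the model is called.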

RAG solves this by retrieving only the relevant subset of knowledge at query time. Instead of loading everything, the system finds the three or five most relevant passages and includes only those. The model generates from evidence, not from statistical memory.

Core Insight

RAG separates knowledge storage (a retrieval index that can be updated in minutes) from reasoning capability (the frozen model). You update the index without touching the model. The model reasons over whatever evidence you retrieve. These two concerns, previously fused in one expensive training run, are now independently manageable.

Knowledge Cutoff: The date after which a model has no training data. Facts beyond this date must come from retrieval, tool use, or user input.

Hallucination: When a model generates confident, plausible-sounding text that is factually incorrect or entirely fabricated. It is especially likely when the model generates without retrieved evidence.

Context Window: The maximum number of tokens an LLM can process in a single inference call. Even very large windows cannot hold entire enterprise knowledge bases.

Quiz · Lesson 1

The Knowledge Cutoff Problem

Three questions. Select the best answer for each.

Why is retraining an LLM on new data considered impractical for keeping it current with daily world events?

Correct. GPT-4-scale training runs cost an estimated $50M+ in compute. Even cheaper fine-tuning introduces catastrophic forgetting — new facts can degrade performance on old ones. Neither approach is economically viable for keeping pace with daily knowledge change.

Not quite. The barrier is economic and technical (cost + catastrophic forgetting), not legal or API-based. Training can be done again — it's just wildly impractical for daily updates.

In the 2023 Mata v. Avianca case, why did ChatGPT produce citations to nonexistent legal cases?

Correct. With no retrieval mechanism, the model produced tokens that matched the statistical pattern of Westlaw citations. It had seen many real citations in training data and could mimic the format — but it had no access to an actual database to verify existence.

Not correct. The model was not given intentionally false data, nor were the cases real. The structural cause is that LLMs generate plausible-looking text without verification when no retrieval grounding is present.

What is the core architectural insight that makes RAG more practical than loading an entire enterprise knowledge base into every LLM context window?

Correct. A 50-million-token enterprise knowledge base cannot fit in any current context window and would be prohibitively expensive even if it could. RAG solves this by selecting only the relevant passages — typically three to five — for each specific query.

Not quite. RAG doesn't compress, cache uniquely, or encode knowledge into weights. Its key insight is selective retrieval: find the relevant slice of the knowledge base at query time and pass only that slice to the model.

Lab · Lesson 1

Diagnosing Cutoff Failures

Interact with the AI tutor to explore knowledge cutoff scenarios and hallucination patterns.

Your Task

Work with the AI tutor to understand the knowledge cutoff problem from multiple angles. Ask it to explain scenarios where a model without retrieval would fail, discuss why hallucination is structurally linked to missing retrieval, and explore the cost tradeoffs of retraining vs. RAG.

Suggested starting points: "Give me three real examples of business questions that would require RAG rather than a base LLM." / "Why does hallucination happen structurally, not just randomly?" / "What's the difference between a knowledge cutoff and a context window limit?"

RAG Tutor · Lab 1 · Knowledge Cutoff & Hallucination

Welcome to Lab 1. I'm here to help you understand why RAG exists — specifically the knowledge cutoff problem and its downstream effect: hallucination. Ask me about scenarios where a base LLM fails without retrieval, why retraining doesn't solve the problem, or how the Mata v. Avianca case illustrates the structural issue. What would you like to explore?

Lesson 2 · Why RAG Exists

The Context Window Is Not a Database

Bigger windows raise the ceiling. They don't remove it.

What happens when a business's entire knowledge base exceeds what any model can read?

When Google announced Gemini 1.5 Pro with a one-million-token context window in February 2024, many observers declared the RAG debate over. If you could fit a thousand-page PDF into a single prompt, why bother with retrieval pipelines? The excitement was legitimate but the conclusion was premature. Earlier long-context research had already shown that models do not use a long context uniformly: facts buried in the middle are recalled significantly less reliably than facts near the edges. The phenomenon already had a name: the "lost in the middle" problem, documented by Liu et al. in 2023.

More practically: if your company has 200,000 internal support tickets plus 5,000 product documentation pages plus 3 years of Slack archives, you are not fitting that in any current context window. And even if you could, you would be paying for it every single query.

How Context Windows Actually Work

A transformer model processes all tokens in its context window through self-attention — every token attends to every other token. This means computational cost scales as O(n²) with context length. Doubling the context window roughly quadruples the attention computation. At one million tokens, the cost per inference call is substantial, even with efficient attention variants.
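The quadratic claim can be checked with a few lines of arithmetic. This sketch counts only the token pairs that full self-attention has to score for one sequence. It deliberately ignores layers, heads, constant factors, and efficient attention variants, so the ratios are illustrative and not a cost model. The function name is hypothetical.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Number of query-key pairs full self-attention scores for one sequence."""
    return num_tokens * num_tokens

base = attention_pair_count(128_000)
doubled = attention_pair_count(256_000)
million = attention_pair_count(1_000_000)

# Ratio after doubling the context length.
print(doubled / base)
# Ratio when going from 128K tokens to 1M tokens.
print(million / base)
```

Doubling the length gives a ratio of 4, and moving from 128K to one million tokens (about 7.8 times longer) multiplies the pair count by about 61.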

At current API pricing (2024), sending one million tokens to Gemini 1.5 Pro costs approximately $3.50 per query. A customer service system handling 100,000 queries per day, if it loaded its full knowledge base each time, would cost $350,000 per day in input tokens alone. RAG, by retrieving 3–5 relevant passages (perhaps 2,000 tokens), reduces that input cost by 99.8%.
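The cost comparison above can be reproduced directly. This is a sketch that takes the figures in the paragraph at face value: the price per million input tokens, the query volume, and the 2,000-token retrieval budget all come from the text and are not current pricing.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 3.50   # USD, figure quoted in the text
QUERIES_PER_DAY = 100_000
FULL_CONTEXT_TOKENS = 1_000_000         # full-context request size
RAG_CONTEXT_TOKENS = 2_000              # 3-5 retrieved passages, per the text

def daily_input_cost(tokens_per_query: int) -> float:
    """Daily input-token spend in USD for a given request size."""
    cost_per_query = tokens_per_query / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS
    return cost_per_query * QUERIES_PER_DAY

full_cost = daily_input_cost(FULL_CONTEXT_TOKENS)
rag_cost = daily_input_cost(RAG_CONTEXT_TOKENS)
reduction = 1 - rag_cost / full_cost

print(f"Full context: ${full_cost:,.0f} per day")
print(f"RAG:          ${rag_cost:,.0f} per day")
print(f"Reduction:    {reduction:.1%}")
```

The reduction depends only on the ratio of tokens sent (2,000 versus 1,000,000), so it stays at 99.8% whatever the per-token price is. Output tokens and the cost of running retrieval itself are left out of this sketch.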

This is not a hypothetical concern. Glean, the enterprise search startup, explicitly built its architecture around retrieval rather than context stuffing precisely because enterprise knowledge bases are orders of magnitude larger than any realistic context window.

$3.50: Cost to send 1M tokens to Gemini 1.5 Pro (2024)

O(n²): Attention complexity scaling with context length

99.8%: Token cost reduction, full-context vs RAG retrieval

The "Lost in the Middle" Problem

Even when information is present in the context window, models do not attend to it uniformly. The 2023 paper "Lost in the Middle: How Language Models Use Long Contexts" by Liu, Lin, et al. at Stanford demonstrated that retrieval performance degrades significantly for documents placed in the middle of a long context. Models are strongly biased toward information at the beginning and end of their context.

This has a practical implication: naively stuffing a knowledge base into a long context does not guarantee the model will find the relevant fact. If the answer happens to be in document 47 of 100 documents, the model may effectively ignore it. A well-designed RAG pipeline that surfaces only the top-3 relevant documents produces more reliable answers than a context dump that buries the answer in the middle.
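One way to probe this effect on your own model is to build the same prompt several times with the relevant document moved to different positions, then compare answer accuracy per position. The sketch below only constructs the prompt variants; the model call and scoring are omitted, and the helper names and sample passages are hypothetical.

```python
def place_relevant(distractors: list[str], relevant: str, position: int) -> list[str]:
    """Return the documents with the relevant one inserted at the given index."""
    docs = list(distractors)
    docs.insert(position, relevant)
    return docs

def format_context(docs: list[str]) -> str:
    """Number the documents so an answer can point back to one of them."""
    return "\n\n".join(f"Document [{i + 1}]: {doc}" for i, doc in enumerate(docs))

distractors = [f"Filler passage number {i}." for i in range(9)]
relevant = "The refund window is 30 days."

# One prompt variant per position: start (0), middle (4), end (9).
variants = {
    pos: format_context(place_relevant(distractors, relevant, pos))
    for pos in (0, 4, 9)
}
```

Sending each variant with the same question and tallying correct answers by position gives a rough, small-scale version of the position experiment described above.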

Research Finding — Liu et al., 2023

"Lost in the Middle" (arXiv:2307.03172) tested GPT-3.5-Turbo and Claude across multi-document QA tasks with varying document counts and positions. Performance dropped significantly when the relevant document was placed in the middle of the context. The finding holds across model families and persists even with extended context windows. Precise retrieval consistently outperformed context stuffing.

Private Knowledge: What No Model Has Seen

Beyond staleness and window limits, there is a third structural reason RAG exists: private knowledge. The vast majority of enterprise knowledge was never on the public internet and thus never in any model's training data. Your internal CRM notes, your legal contracts, your engineering runbooks, your customer correspondence — none of this was in the CommonCrawl dataset that trained most LLMs.

No amount of window size or model recency solves this. The only way to make a model reason about your proprietary documents is to provide them at inference time. RAG is the scalable mechanism for doing that selectively and efficiently.

Problem 1

Staleness

Training data has a cutoff. The model knows nothing after it. Retrieval from a live index solves this — the index can be updated in minutes.

Problem 2

Context Cost

Even million-token windows are economically and computationally impractical for full knowledge base loading at production query volumes.

Problem 3

Attention Degradation

Models do not attend uniformly to long contexts. Facts buried in the middle are recalled less reliably than focused retrieved passages.

Problem 4

Private Knowledge

Internal documents, CRM records, and proprietary data were never in training data. Only inference-time retrieval can make them available.

Lost in the Middle: The empirically documented phenomenon where LLMs fail to reliably use information positioned in the middle of long context windows, performing better with information at the edges.

Attention Complexity: Self-attention scales as O(n²) with sequence length. Doubling context length roughly quadruples compute cost, making unlimited context economically impractical.

Private Knowledge: Proprietary or internal documents that were never part of any model's training data and can only be accessed at inference time via retrieval.

Quiz · Lesson 2

The Context Window Is Not a Database

Three questions. Select the best answer for each.

Why did Google's announcement of Gemini 1.5 Pro's one-million-token context window not eliminate the case for RAG?

Correct. Two problems persist: the "lost in the middle" phenomenon means the model doesn't reliably attend to information buried in long contexts, and at $3.50 per million tokens, loading a full knowledge base on every query is financially untenable at scale.

Not quite. The model wasn't discontinued, and enterprise knowledge bases absolutely contain documents. The real issues are uneven attention quality and prohibitive per-query cost when loading full knowledge bases.

What did Liu et al.'s "Lost in the Middle" paper (2023) demonstrate about long-context LLM behavior?

Correct. Liu et al. showed that across GPT-3.5-Turbo and Claude, performance on multi-document QA tasks dropped significantly when the relevant document was in the middle of the context. This directly undermines the "just stuff everything in" approach.

Not quite. The paper found the opposite of uniform attention — models are biased toward the beginning and end of the context. Middle-positioned information is systematically underutilized.

Which of the following best explains why private enterprise knowledge cannot be solved by simply training a better base model?

Correct. Private knowledge was never on the public internet, so it was never in CommonCrawl or any standard pre-training corpus. No model trained on public data has ever seen it. The only solution is to inject it at inference time — which is exactly what retrieval provides.

Not quite. The core issue isn't format, regulation, or change rate. Private data simply was not in the training set, so it has to be supplied separately, through dedicated fine-tuning or a RAG approach.

Lab · Lesson 2

Context Window Economics

Explore the cost and quality tradeoffs of context stuffing versus targeted retrieval.

Your Task

Work through the economics and attention quality issues of long-context approaches with the AI tutor. Calculate real costs, reason through the "lost in the middle" problem, and understand why selective retrieval beats context stuffing even when windows are large.

Try asking: "Walk me through the cost math for a 100,000-query-per-day system using full context vs RAG." / "If the model doesn't attend uniformly to long contexts, does that mean RAG retrieval quality matters enormously?" / "What determines how many documents to retrieve in a RAG system?"

RAG Tutor · Lab 2 · Context Window Economics

Welcome to Lab 2. We're focusing on why the context window is not a substitute for a proper retrieval system — both economically and in terms of attention quality. Ask me to walk through cost calculations, explain the "lost in the middle" problem in depth, or reason about how many documents a RAG system should retrieve. What would you like to work through?

Lesson 3 · Why RAG Exists

RAG vs. Fine-Tuning: Different Tools for Different Problems

Fine-tuning teaches style and behavior. RAG provides facts. Confusing them is expensive.

When should you fine-tune, when should you use RAG, and when do you need both?

In 2022, Bloomberg LP spent substantial engineering resources training a 50-billion-parameter LLM on a corpus heavy in financial text. The result, BloombergGPT, was announced in March 2023. It produced more accurate financial terminology and outperformed general models on several finance-specific benchmarks. But BloombergGPT still did not know the price of Apple stock yesterday. It still could not answer questions about a specific Bloomberg Terminal news article from last week. Domain-specific training, like fine-tuning, had taught the model how to talk about finance. It had not given the model access to current financial data.

Bloomberg's production systems use retrieval to pull live data alongside the fine-tuned model. The fine-tuning and the RAG serve orthogonal needs. Understanding this distinction is one of the most practically important concepts in deploying AI systems.

What Fine-Tuning Actually Changes

Fine-tuning modifies a model's weights — the billions of numerical parameters that encode its behavior. It is most effective at changing three things: output format (producing structured JSON, following a specific template), tone and style (more terse, more formal, using domain vocabulary naturally), and task specialization (improving performance on a narrow task type like extracting contract clauses or classifying support ticket urgency).

What fine-tuning does poorly: injecting specific facts. When you try to fine-tune a model to "know" that your company's refund policy is 30 days, the model distributes that information across weight adjustments throughout the network. It may recall it correctly most of the time — but it may also confabulate variations, blend it with similar policies it saw in training, or override it when contradicting context appears. Facts stored in weights are lossy and unreliable compared to facts retrieved verbatim from a document.

The research on this is consistent: work on knowledge editing (ROME, MEMIT) shows that inserting or editing specific factual knowledge in model weights is technically possible but unreliable at scale. A change to one fact can propagate inconsistently to related assertions.

Fine-Tuning vs. RAG — What Each Modifies

Fine-Tuning Changes

✓ Output style and format
✓ Domain vocabulary fluency
✓ Task-specific behavior
✓ Instruction-following patterns
✗ Specific verifiable facts
✗ Real-time information
✗ Private document content

RAG Provides

✓ Specific verifiable facts
✓ Current information
✓ Private document content
✓ Citeable source passages
✓ Updateable without retraining
✗ Style or behavior changes
✗ Improved reasoning capability

The BloombergGPT Lesson

BloombergGPT was trained on 363 billion tokens of financial text (filings, news articles, analyst reports spanning decades), mixed with general-purpose data. The result was a model with better financial domain fluency. On finance tasks such as Financial PhraseBank sentiment classification, it outperformed the similarly sized open models Bloomberg compared it against. The domain-specific training was genuinely valuable.

But financial AI applications need current data. A model that knows financial vocabulary but cannot access yesterday's earnings announcement is not useful for most Bloomberg Terminal workflows. Bloomberg's real production systems connect the fine-tuned model to live data feeds and retrieval systems. The two layers solve different problems: fine-tuning for domain competence, RAG for factual currency.

This pattern repeats across industries. In medical AI, fine-tuning on clinical notes improves clinical terminology and clinical reasoning patterns. But a doctor asking about a specific patient's medication history needs retrieval from that patient's actual EHR record, not from statistical patterns in a training corpus. Google's Med-PaLM 2 (2023) was fine-tuned on medical text and achieved expert-level performance on USMLE-style questions, but real clinical deployment requires retrieval from patient records and current clinical guidelines.

Decision Framework

Use fine-tuning when: you need consistent output format, domain-specific style, improved task specialization, or behavior that must persist across all queries regardless of retrieved context.

Use RAG when: you need access to specific facts, current information, private documents, or cited sources. RAG is almost always needed in enterprise applications.

Use both when: you need domain competence (fine-tuning) AND factual currency or private knowledge (RAG). Most sophisticated production systems combine them.
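The three rules above can be written down as a small decision helper. This is only a restatement of the framework, not a substitute for judgment. The function name, the two boolean inputs, and the "base model" fallback for the case where neither need applies are assumptions added for illustration.

```python
def choose_approach(needs_style_or_behavior: bool, needs_facts_or_private_docs: bool) -> str:
    """Map the two kinds of need from the decision framework to an approach."""
    if needs_style_or_behavior and needs_facts_or_private_docs:
        return "both"
    if needs_facts_or_private_docs:
        return "rag"
    if needs_style_or_behavior:
        return "fine-tuning"
    # Assumption: with neither need, a prompted base model may be enough.
    return "base model"

# A support bot that must quote the current return policy in a house tone.
print(choose_approach(needs_style_or_behavior=True, needs_facts_or_private_docs=True))
```

Real projects rarely reduce to two booleans, but forcing a scenario into this shape is a quick check on whether the requirement is about how the model behaves or about what it needs to know.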

When Fine-Tuning Is Actually Harmful

There is a failure mode that appears repeatedly in enterprise AI projects: teams spend months fine-tuning a model on proprietary documents, expecting it to "learn" their knowledge base. The model does learn statistical patterns from the documents, but it does so imprecisely and without attribution. Specific factual answers become blurry. Worse, the model may confidently produce information that blends multiple documents, or multiple versions of the same document, from the fine-tuning data, making it harder to trace errors to their source.

A RAG system, by contrast, returns a retrieved passage verbatim and can cite the source document and even the page or paragraph. When the retrieved content is wrong, you can audit the retrieval pipeline. When a fine-tuned model produces wrong facts, you cannot easily determine whether the error came from the original training data, the fine-tuning corpus, or the model's generalization.

Auditability is increasingly a legal and regulatory requirement. The EU AI Act (2024) mandates traceability for high-risk AI systems. A RAG system's retrieved passages are inherently more auditable than knowledge encoded opaquely in model weights.

Fine-Tuning: Updating a model's weights on a curated dataset to change its behavior, style, or task performance. Does not reliably inject specific verifiable facts.

Weight-Encoded Knowledge: Information stored implicitly in model parameters during training. Difficult to update, verify, or audit, in contrast to retrieved knowledge, which is explicit and citable.

Auditability: The ability to trace an AI output back to its source evidence. RAG provides this naturally; weight-encoded knowledge does not.

Quiz · Lesson 3

RAG vs. Fine-Tuning

Three questions. Select the best answer for each.

What was the primary limitation of BloombergGPT despite its extensive fine-tuning on financial text?

Correct. BloombergGPT excelled at financial terminology and sentiment tasks — things fine-tuning is good for. But it had no mechanism to access current prices, recent filings, or live news. That requires retrieval from live data sources, which Bloomberg's production systems provide separately.

Not quite. BloombergGPT actually performed well on financial reasoning benchmarks. The limitation is simpler and more fundamental: fine-tuning cannot give a model access to events that occurred after its training data cutoff.

Why is fine-tuning a poor mechanism for injecting specific verifiable facts into a model?

Correct. When a fact is fine-tuned into a model, it's distributed across weight adjustments throughout the network. The model may recall it correctly most of the time, but may also confabulate variations or blend it with similar training patterns. This is fundamentally less reliable than retrieving the fact verbatim from a document.

Not quite. The barrier is not technical limits or provider policies — it's that weight-encoded knowledge is inherently lossy and unreliable. Facts spread across billions of parameters cannot be recalled with the precision and verifiability of retrieved document passages.

From an auditability standpoint, why does RAG have a structural advantage over fine-tuning for enterprise deployments subject to regulatory scrutiny?

Correct. When a RAG system produces a wrong answer, you can trace it to the specific retrieved document and passage. When a fine-tuned model hallucinates, you cannot determine whether the error originated in pre-training data, fine-tuning data, or model generalization — making it essentially unauditable at the fact level.

Not quite. There's no specific ISO certification or blanket legal mandate involved. The structural advantage is simpler: RAG outputs are grounded in specific retrievable passages that can be examined, verified, and corrected. Fine-tuned weight knowledge has no such explainability.

Lab · Lesson 3

Fine-Tuning vs. RAG Decision Making

Practice reasoning through when to fine-tune, when to use RAG, and when to combine both.

Your Task

Present the AI tutor with real enterprise AI scenarios and reason through whether fine-tuning, RAG, or both are appropriate. The tutor will challenge your reasoning and help you sharpen the distinctions.

Try: "A legal firm wants AI to draft contracts in their house style using their standard clauses — fine-tune, RAG, or both?" / "An e-commerce company wants AI to answer customer questions about product inventory and return policies — which approach?" / "A hospital wants AI to assist with patient diagnosis using current clinical guidelines — what's the right architecture?"

RAG Tutor · Lab 3 · Fine-Tuning vs. RAG Decisions

Welcome to Lab 3. We're working on the most practical decision in enterprise AI architecture: when to fine-tune, when to use RAG, and when to combine them. Present me with a real or hypothetical use case, and let's reason through the decision together. I'll ask you to justify your choices and point out tradeoffs you might have missed. What scenario would you like to start with?

Lesson 4 · Why RAG Exists

The RAG Architecture: A First Look

Before we build it piece by piece, understand the whole system and how the pieces connect.

What exactly happens between a user's question and a RAG system's answer?

When Notion launched Notion AI in November 2022, it faced a concrete engineering challenge: users wanted AI to answer questions about their own workspace documents. A user asking "What did we decide about the product roadmap in last Tuesday's meeting?" needed the system to surface the right note, not generate a plausible-sounding fabrication. The engineering answer was a retrieval pipeline: embed every Notion page, store embeddings in a vector index, retrieve semantically similar pages at query time, and pass them to the generation model.

That pipeline — embed, index, retrieve, generate — is the canonical RAG architecture. You will build every component of it in this course. But first, you need to understand what role each piece plays and why the sequence matters.

The Four Stages of a RAG Pipeline

Every RAG system, from the simplest prototype to production systems at companies like Notion, Glean, and Perplexity, implements some version of four core stages. Understanding these stages at a conceptual level before diving into implementation is essential for reasoning about where failures occur and where to optimize.

Stage 1

Indexing — Building the Knowledge Base

Documents are loaded, chunked into passages, converted into embedding vectors by an embedding model, and stored in a vector database. This stage happens offline, before any user query. The quality of chunking and embedding choices here determines the ceiling of retrieval quality.

Stage 2

Query Processing — Understanding the Request

At query time, the user's question is converted into an embedding using the same embedding model used for indexing. Some systems also rewrite or expand the query at this stage to improve retrieval recall. The query embedding is what we search against the index.

Stage 3

Retrieval — Finding Relevant Passages

The query embedding is compared against stored document embeddings using similarity search, commonly cosine similarity computed either exactly or through an approximate nearest neighbor (ANN) index. The top-k most similar passages are returned. This is the stage where semantic understanding happens: "purchase" matches "buy" even though the words differ.

Stage 4

Generation — Producing the Answer

The retrieved passages are combined with the user's question into a prompt sent to the LLM. The model generates an answer grounded in those passages. Well-designed prompts instruct the model to cite sources and to say "I don't know" if the retrieved passages don't contain the answer.

RAG Pipeline — Data Flow

User Query → Embedding Model → Vector Search → Top-k Passages → LLM + Prompt → Grounded Answer
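The four stages can be sketched end to end in plain Python. This is a toy under loud assumptions: `embed` is a word-count stand-in for a learned embedding model, so it only matches exact words and will not match "purchase" to "buy"; the index is an in-memory list, not a vector database; and the LLM call is omitted, so the sketch stops at the assembled prompt. All function names and sample passages are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system uses a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Stage 1: indexing (offline). Each passage is stored with its vector.
passages = [
    "Refunds are accepted within 30 days of purchase.",
    "Support is available Monday through Friday.",
    "Shipping to Canada takes five business days.",
]
index = [(p, embed(p)) for p in passages]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Stage 2: embed the query with the same function used for indexing.
    q = embed(query)
    # Stage 3: rank stored passages by similarity and keep the top k.
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [p for p, _ in ranked[:k]]

def build_prompt(query: str, retrieved: list[str]) -> str:
    # Stage 4: combine passages and question into the prompt for the LLM.
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(retrieved))
    return (
        "Answer using only the sources below and cite them by number. "
        "If they do not contain the answer, say \"I don't know\".\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )

query = "Are refunds accepted after purchase?"
top = retrieve(query, k=2)
prompt = build_prompt(query, top)
print(prompt)
```

Swapping `embed` for a real embedding model and the list for a vector index changes the quality of each stage but not the shape of the flow, which is the point of the diagram.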

Why the Separation of Retrieval and Generation Matters

The most important design insight in RAG is that retrieval and generation are separate, independently improvable components. This is unlike a pure LLM, where the "retrieval" happens implicitly inside the model's parameters, opaque and unmodifiable at inference time.

In a RAG system, if answers are wrong, you can diagnose whether the problem is in retrieval (the right passages are not being found) or generation (the right passages are found but the model misuses them). Each stage has its own metrics: retrieval precision and recall, answer faithfulness, answer relevance. Each can be improved independently.
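The retrieval-side metrics mentioned above are simple to compute once you have, for a test question, the ranked list of retrieved passage IDs and a hand-labeled set of relevant IDs. The sketch below uses the common "at k" definitions. The IDs and labels are made up for illustration.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k retrieved passages that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant passages that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

retrieved_ids = ["d3", "d7", "d1", "d9"]   # ranked output of the retriever
relevant_ids = {"d1", "d2", "d3"}          # hand-labeled ground truth

p_at_4 = precision_at_k(retrieved_ids, relevant_ids, k=4)
r_at_4 = recall_at_k(retrieved_ids, relevant_ids, k=4)
print(f"precision@4 = {p_at_4:.2f}, recall@4 = {r_at_4:.2f}")
```

If recall is low, the generator never sees the evidence and no prompt change will fix the answer. If recall is high but answers are still wrong, the problem sits on the generation side.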

This separation also enables knowledge updates without model retraining. When your company's refund policy changes from 30 to 45 days, you update one document in the index. The LLM is unchanged. The next query about return policies retrieves the updated passage. Compare this to fine-tuning: you would need to fine-tune the model again, verify the update didn't degrade other capabilities, and redeploy — a cycle that takes days to weeks.
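The refund-policy update can be sketched as an upsert into the index. The assumptions here are a dictionary standing in for a vector store and a word-count `embed` standing in for a real embedding model. The names are hypothetical. The point is that only the changed entry is re-embedded and nothing about the model is touched.

```python
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

index: dict[str, dict] = {}

def upsert(doc_id: str, text: str) -> None:
    """Insert or replace one document; only this entry is re-embedded."""
    index[doc_id] = {"text": text, "vector": embed(text)}

upsert("refund-policy", "Refunds are accepted within 30 days of purchase.")
upsert("support-hours", "Support is available Monday through Friday.")
untouched_vector = index["support-hours"]["vector"]

# The policy changes: update one document, leave everything else alone.
upsert("refund-policy", "Refunds are accepted within 45 days of purchase.")

print(index["refund-policy"]["text"])
```

A production index also has to handle deletions and documents that span several chunks, which is where the stale-index failures covered later in the course come from.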

The Naive RAG Baseline and Its Failure Modes

The simplest RAG system — embed documents, do cosine similarity search, stuff top results into a prompt — is called naive RAG. It is the starting point for every RAG project and the architecture that most tutorials implement. It also has well-documented failure modes that the rest of this course teaches you to address.

Chunking failures: If a document is split in a way that cuts across the answer, neither chunk alone contains enough context. The retrieved passage is technically relevant but incomplete.

Retrieval failures: Embedding similarity finds semantically similar text, but may miss exact-match answers or technical terms not well-represented in the embedding space.

Generation failures: Even with correct passages, the model may ignore them in favor of its parametric knowledge, hallucinate additional facts not in the passage, or fail to synthesize across multiple retrieved chunks.
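The chunking failure is easy to reproduce. The sketch below splits a short text into fixed-size word chunks, first without overlap and then with it. `chunk_words` is a hypothetical helper, and overlapping chunks are shown only as one common mitigation, under the assumption that the answer fits within the overlap window; they are not presented as the approach later modules take.

```python
def chunk_words(text, size, overlap=0):
    """Split text into chunks of `size` words, sharing `overlap` words
    between consecutive chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks

doc = ("Our store accepts returns on most items. Refunds are issued within "
       "45 days of purchase if the item is unused.")
answer = "Refunds are issued within 45 days"

no_overlap = chunk_words(doc, size=10)
with_overlap = chunk_words(doc, size=10, overlap=4)

# Without overlap the boundary falls between "issued" and "within".
print(any(answer in chunk for chunk in no_overlap))
# With overlap, one chunk contains the whole sentence fragment.
print(any(answer in chunk for chunk in with_overlap))
```

Without overlap, one chunk ends at "Refunds are issued" and the next starts at "within 45 days", so neither chunk states the policy on its own. Overlap costs extra storage and some duplicated retrieval results in exchange for fewer split answers.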

Understanding these failure modes — before you ever write code — is what separates engineers who ship reliable RAG systems from those who ship demos that break in production.

What This Course Builds

Module 1 (this module) establishes why RAG exists. Modules 2–6 build each component: document processing and chunking (M2), embedding models (M3), vector databases and retrieval (M4), prompt engineering for generation (M5), and evaluation and production hardening (M6). By the end, you will have built a complete, evaluatable RAG system from scratch — not a demo, but an architecture you can reason about, audit, and improve.

Naive RAG: The baseline RAG architecture: embed documents, run cosine similarity search, and stuff the top-k results into the prompt. It is the starting point, not the destination, and has documented failure modes that advanced techniques address.

Indexing: The offline stage where documents are chunked, embedded, and stored in a vector database. Determines the ceiling of retrieval quality.

Top-k Retrieval: Returning the k most similar passages to a query embedding from the vector index. k is a design choice trading off context length against retrieval coverage.

Quiz · Lesson 4

The RAG Architecture

Three questions. Select the best answer for each.

In a RAG pipeline, which stage happens offline (before any user query is received)?

Correct. Indexing is the offline preprocessing stage. Documents are chunked, embedded, and stored in the vector database before any user query arrives. Query processing, retrieval, and generation all happen at inference time, in response to each user query.

Not quite. That stage happens at inference time (when a user asks a question). The offline stage is indexing — the preprocessing step where documents are chunked, embedded, and loaded into the vector database in advance.

What is the key practical advantage of RAG's separation of retrieval and generation as independently improvable components?

Correct. In a pure LLM, "retrieval" happens opaquely inside the model's weights, so you can't distinguish a retrieval failure from a generation failure. In RAG, if an answer is wrong, you can check whether the right passage was retrieved (retrieval precision/recall) versus whether the model misused a correct passage (faithfulness). Each is independently fixable.

Not quite. The advantage is diagnostic and engineering, not hardware cost or legal. Being able to measure and improve retrieval quality independently from generation quality is what allows RAG systems to be systematically debugged and improved in production.

Why does semantic embedding-based retrieval find the passage "how do I purchase a subscription" when a user asks "how do I buy a plan" — even though the words don't match?

Correct. This is the fundamental value of dense retrieval over keyword search. Embedding models are trained to place semantically similar text near each other in the high-dimensional vector space. "Buy" and "purchase" end up close together because they appear in similar contexts throughout the training corpus, so queries using one will retrieve documents using the other.

Not quite. The match doesn't rely on synonym dictionaries, LLM query rewriting, or database-side synonym expansion. The magic happens in the embedding model itself: semantically similar meanings produce geometrically similar vectors, so similarity search naturally finds relevant passages regardless of exact word choice.

Lab · Lesson 4

Designing a RAG Pipeline

Work through the architectural decisions in a RAG system before writing any code.

Your Task

Use the AI tutor to reason through concrete RAG design decisions. The goal is to develop intuition for how each stage of the pipeline affects overall system behavior — before you encounter these decisions in code in later modules.

Try asking: "If my indexing stage chunks documents poorly, how does that affect the generation stage?" / "Walk me through what would go wrong if I used a different embedding model for queries than for documents." / "What are the tradeoffs of retrieving k=3 versus k=10 passages?"


Module Test · M1

Why RAG Exists

15 questions. Score 80% or higher to pass this module.

1. What architectural fact explains why all deployed LLMs have a knowledge cutoff?

Correct. Parameters are frozen after training. No self-update mechanism exists.

Incorrect. The cutoff is an architectural fact — parameters are frozen after training — not a regulatory or intentional design choice.

2. Which of the following best describes catastrophic forgetting in the context of LLM updates?

Correct. Catastrophic forgetting is the degradation of prior capabilities when model weights are updated on new data — a key reason why continuous retraining is not a practical solution to staleness.

Incorrect. Catastrophic forgetting refers to weight-level degradation: updating a model on new training data can overwrite previously learned knowledge.

3. The Mata v. Avianca case (2023) resulted in sanctions against lawyers who submitted ChatGPT-generated briefs. What was the root structural cause?

Correct. No retrieval = no verification. The model produced tokens matching the statistical pattern of real citations, not actual cases.

Incorrect. The model was not given false data, nor were the cases sealed. With no retrieval, the model generated plausible-looking citations from statistical patterns with no verification.

4. Why is loading an enterprise's entire knowledge base into every LLM context window economically impractical at production query volumes?

Correct. The math is straightforward: large contexts multiplied by high query volumes create prohibitive input token costs before any other consideration.

Incorrect. Context windows can be very large (1M tokens for Gemini 1.5 Pro), and the issue is not format compatibility or regulations — it's pure economics at scale.

5. According to Liu et al.'s "Lost in the Middle" research (2023), where in a long context do LLMs recall information most reliably?

Correct. The "lost in the middle" effect: information at the beginning and end of long contexts is recalled more reliably than information in the middle.

Incorrect. Attention is far from uniform. The paper documented a U-shaped performance curve — strong recall at the beginning and end, significantly degraded recall in the middle.

6. What makes private enterprise knowledge impossible to access via a base LLM, regardless of context window size or recency?

Correct. Private knowledge was never in CommonCrawl or any standard pre-training dataset. No amount of model improvement changes this. The knowledge has to be supplied at inference time, which is what RAG does.

Incorrect. The barrier is not encryption, network policy, or tokenization — it's simply that private data was never in the training set and thus was never encoded into model weights.

7. Why is fine-tuning a poor mechanism for injecting your company's current refund policy into a production AI system?

Correct. Fine-tuned facts are lossy, inconsistently recalled, and every update requires another fine-tuning cycle. A RAG document update takes minutes and is retrieved verbatim.

Incorrect. The issue is not document length, liability, or knowledge isolation. It's that weight-encoded facts are recalled unreliably and every change requires another fine-tuning cycle.

8. BloombergGPT (2023) demonstrated that training on domain-specific data improves which capability, and fails to provide which other?

Correct. BloombergGPT showed that training on 363B tokens of financial text (mixed with general-purpose data) produces better domain fluency, but it cannot provide access to yesterday's earnings announcement. That requires retrieval.

Incorrect. Domain-specific training, like fine-tuning, improves domain competence, not real-time data access. BloombergGPT excelled on financial benchmarks but still needed retrieval for current data in production.

9. In a RAG pipeline, which stage determines the ceiling of overall system quality?

Correct. Indexing is the ceiling — if relevant content isn't indexed or is chunked so that the answer is split across boundaries, retrieval cannot find it and generation cannot produce it. Garbage in, garbage out.

Incorrect. While the LLM, query, and database all matter, indexing is the ceiling: if the right passage isn't in the index or is poorly chunked, no downstream optimization can compensate.

10. What distinguishes semantic embedding-based retrieval from traditional keyword search?

Correct. Embedding models map semantically similar text to nearby points in vector space — enabling "buy" to match "purchase" without any synonym list or pre-labeling.

Incorrect. The key distinction is semantic matching in continuous vector space, not hash tables, human labels, or live LLM scoring of every document.

11. What is the primary auditability advantage of RAG over fine-tuning for high-stakes enterprise applications?

Correct. Retrieved passages are explicit and citable. When a RAG answer is wrong, you can trace it to the specific passage. Fine-tuned knowledge is distributed across billions of weights with no traceable source.

Incorrect. The advantage is structural: retrieved evidence is explicit and traceable, weight-encoded knowledge is not. This is independent of auto-reporting, specific approvals, or default logging.

12. Which of the following is a documented failure mode of naive RAG that more advanced techniques address?

Correct. Poor chunking is one of naive RAG's most common failure modes — the answer exists in the knowledge base but the chunk boundary means no single retrieved passage contains enough context for the model to answer correctly.

Incorrect. Safety refusals, document count limits, and length-biased scores are not the primary naive RAG failure modes. Poor chunking, retrieval misses on technical terms, and model ignoring retrieved context are the documented issues.

13. When Microsoft deployed Bing Chat in February 2023 using a retrieve-then-generate architecture, what problem was this architecture specifically designed to mitigate?

Correct. Bing Chat's core problem was that the base LLM's knowledge was frozen. Retrieving live web content and injecting it into the context was the architectural response to knowledge staleness.

Incorrect. The primary motivation for RAG in Bing Chat was knowledge staleness — the model needed current web information that its training cutoff couldn't provide.

14. Self-attention in transformer models scales as O(n²) with context length. What does this imply for the viability of very long contexts?

Correct. Quadratic scaling means very long contexts become rapidly more expensive. Even with FlashAttention and other efficiency improvements, unlimited context is not economically viable at scale — reinforcing the need for selective retrieval.

Incorrect. O(n²) applies broadly to self-attention. While efficient variants reduce constants, the quadratic relationship means large contexts still impose significant costs that prevent them from replacing selective retrieval at scale.

15. A company wants AI to handle customer support queries about their product catalog (10,000 SKUs updated weekly) and to respond in a friendly, brand-consistent tone. What is the most appropriate architecture?

Correct. This is a canonical "both" scenario. Fine-tuning handles the style and behavioral consistency; RAG handles the weekly-updated factual product information. Neither alone solves both problems.

Incorrect. Fine-tuning alone can't keep pace with weekly catalog updates. RAG alone won't reliably produce brand-consistent tone. Full-context loading is expensive and has attention quality issues. The right answer combines fine-tuning for behavior and RAG for current facts.