A large language model in 2026 has been trained on a significant fraction of the public text on the internet. It knows a lot. It does not know what's in your company's support tickets, your legal contracts, your proprietary research, your private Slack, your internal wiki, or the email thread where you actually made the decision.
Retrieval-Augmented Generation (RAG) is the pattern for bridging that gap. You turn your private documents into a searchable index, retrieve the relevant chunks at query time, and give them to the model as context. The model reasons across the general and the specific. It's the single most practical AI architecture pattern of the 2020s, and nearly every company building serious AI is running some version of it.
This course builds RAG from scratch — not by calling a hosted vector database and hoping for the best, but by actually implementing the pipeline end to end. It covers document chunking strategies, embedding models, vector databases, hybrid search, re-ranking, prompt design for retrieved context, evaluation, and the real-world failure modes (chunk boundary issues, stale indexes, adversarial queries) that kill production RAG systems.
When Microsoft launched Bing Chat in February 2023, it was one of the first mass-market deployments of a large language model in a live search product. Within days, reporters discovered the model confidently citing outdated information: store hours that had changed, executives who had resigned, prices that no longer existed. On its own, the underlying model knew nothing beyond its training cutoff. Users were furious. The problem had a name: knowledge staleness.
Microsoft's engineers already had a partial answer in production: they were feeding retrieved web content directly into the model's context window alongside the user's query. That architectural choice — retrieve, then generate — is exactly what we study in this course.
A large language model is, at its core, a compressed statistical summary of the text it was trained on. During pre-training, billions of parameters are adjusted to predict the next token across hundreds of billions of words. Once training ends, those parameters are frozen. The model has no mechanism to update itself when a company files for bankruptcy, a law changes, or a new product ships.
This is not a bug — it is an architectural fact. Training GPT-4, according to public estimates, cost tens of millions of dollars in compute. Retraining to add a week's worth of news is economically absurd. Fine-tuning on new data is cheaper, but it introduces its own hazards: catastrophic forgetting, where updating weights for new facts degrades performance on old ones. Neither path scales to the pace of real-world knowledge change.
The result: every deployed LLM has a knowledge cutoff — the date beyond which it has no direct information. OpenAI's GPT-4 Turbo, released in November 2023, has a training cutoff of April 2023. Claude 3 Opus has a cutoff of August 2023. Any question that depends on facts after those dates is answered from inference and interpolation, not from evidence.
When a model is asked about something beyond its training data, it does not say "I don't know" by default — it generates the most statistically plausible continuation of the prompt. That continuation may be factually wrong. This phenomenon, called hallucination, became a defining concern of enterprise AI deployment in 2022–2023.
In 2023, two New York lawyers, Steven Schwartz and Peter LoDuca, submitted a legal brief to federal court containing citations to six cases that did not exist. ChatGPT had fabricated them with confident, realistic-looking formatting. In June 2023, Judge P. Kevin Castel fined the lawyers and their firm $5,000 and issued a blistering opinion. The incident (Mata v. Avianca) became one of the most-cited examples of LLM hallucination in a high-stakes professional context.
The structural cause: the model had no access to an actual legal database during generation. It was producing tokens that looked like Westlaw citations because that pattern appeared in training data — not because it had retrieved and verified the underlying documents.
Federal judge P. Kevin Castel sanctioned attorneys who submitted a ChatGPT-generated brief containing citations to entirely fabricated cases. The model had no retrieval mechanism. It generated plausible-looking legal citations from statistical pattern completion. RAG systems address exactly this failure mode by grounding generation in retrieved, verifiable documents.
Enterprise AI applications almost universally require facts the base model cannot have: internal policy documents updated last week, customer records, product specs, regulatory filings, support tickets. A model trained on public internet data has none of this. Asking it to answer questions about your company's Q3 pricing strategy or your hospital's current formulary is asking it to invent.
The naive fix, stuffing everything into the system prompt, runs into context window limits. A 128,000-token context window sounds large until you realize that a medium-sized company's internal knowledge base might contain 50 million tokens of documentation, roughly 390 times what fits in that window. Even with 1-million-token windows (Google's Gemini 1.5 Pro, 2024), loading entire knowledge bases into every request is impractical at scale in both cost and latency.
RAG solves this by retrieving only the relevant subset of knowledge at query time. Instead of loading everything, the system finds the three or five most relevant passages and includes only those. The model generates from evidence, not from statistical memory.
RAG separates knowledge storage (a retrieval index that can be updated in minutes) from reasoning capability (the frozen model). You update the index without touching the model. The model reasons over whatever evidence you retrieve. These two concerns, previously fused in one expensive training run, are now independently manageable.
Work with the AI tutor to understand the knowledge cutoff problem from multiple angles. Ask it to explain scenarios where a model without retrieval would fail, discuss why hallucination is structurally linked to missing retrieval, and explore the cost tradeoffs of retraining vs. RAG.
When Google announced Gemini 1.5 Pro with a one-million-token context window in February 2024, many observers declared the RAG debate over. If you could fit a thousand-page PDF into a single prompt, why bother with retrieval pipelines? The excitement was legitimate but the conclusion was premature. Earlier research on long-context models had already shown that attention is not uniform across a long window: facts buried in the middle tend to be recalled less reliably than facts near the edges. The phenomenon already had a name: the "lost in the middle" problem, documented by Liu et al. in 2023.
More practically: if your company has 200,000 internal support tickets plus 5,000 product documentation pages plus 3 years of Slack archives, you are not fitting that in any current context window. And even if you could, you would be paying for it every single query.
A transformer model processes all tokens in its context window through self-attention — every token attends to every other token. This means computational cost scales as O(n²) with context length. Doubling the context window roughly quadruples the attention computation. At one million tokens, the cost per inference call is substantial, even with efficient attention variants.
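As a rough worked example of that quadratic scaling, the relation below counts only pairwise attention scores and ignores constant factors and efficient attention variants. The symbol C(n) for attention cost at context length n is introduced here for illustration, and the 2,000-token prompt is an assumed size for a short retrieval-based prompt.

```latex
\[
C(n) \propto n^2
\quad\Longrightarrow\quad
\frac{C(2n)}{C(n)} = \frac{(2n)^2}{n^2} = 4,
\qquad
\frac{C(10^6)}{C(2 \times 10^3)} = \left(\frac{10^6}{2 \times 10^3}\right)^2 = 2.5 \times 10^5 .
\]
```

Under these simplifying assumptions, a one-million-token prompt needs about 250,000 times the attention computation of a 2,000-token prompt, even though it is only 500 times longer.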
At current API pricing (2024), sending one million tokens to Gemini 1.5 Pro costs approximately $3.50 per query. A customer service system handling 100,000 queries per day, if it loaded its full knowledge base each time, would cost $350,000 per day in input tokens alone. RAG, by retrieving 3–5 relevant passages (perhaps 2,000 tokens), reduces that input cost by 99.8%.
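The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that assumes a flat per-token input price (the $3.50 per million tokens quoted above) and ignores output tokens, caching, and tiered pricing. The function name `daily_input_cost` is hypothetical.

```python
def daily_input_cost(tokens_per_query: int, queries_per_day: int,
                     usd_per_million_tokens: float) -> float:
    """Input-token cost per day in USD (flat price assumed, output tokens ignored)."""
    cost_per_query = tokens_per_query / 1_000_000 * usd_per_million_tokens
    return cost_per_query * queries_per_day


PRICE = 3.50        # USD per million input tokens, the figure quoted in the text
QUERIES = 100_000   # queries per day

full_context = daily_input_cost(1_000_000, QUERIES, PRICE)  # whole knowledge base
rag_context = daily_input_cost(2_000, QUERIES, PRICE)       # 3-5 retrieved passages
savings = 1 - rag_context / full_context

print(f"Full knowledge base: ${full_context:,.0f} per day")
print(f"RAG (2,000 tokens):  ${rag_context:,.0f} per day")
print(f"Reduction:           {savings:.1%}")
```

Changing `PRICE` or the token counts shows how sensitive the comparison is to each assumption. The reduction depends only on the ratio of the two prompt sizes, not on the price itself.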
This is not a hypothetical concern. Glean, the enterprise search startup, explicitly built its architecture around retrieval rather than context stuffing precisely because enterprise knowledge bases are orders of magnitude larger than any realistic context window.
Even when information is present in the context window, models do not attend to it uniformly. The 2023 paper "Lost in the Middle: How Language Models Use Long Contexts" by Liu, Lin, et al. at Stanford demonstrated that answer accuracy degrades significantly when the relevant document sits in the middle of a long context. Models are strongly biased toward information at the beginning and end of their context.
This has a practical implication: naively stuffing a knowledge base into a long context does not guarantee the model will find the relevant fact. If the answer happens to be in document 47 of 100 documents, the model may effectively ignore it. A well-designed RAG pipeline that surfaces only the top-3 relevant documents produces more reliable answers than a context dump that buries the answer in the middle.
"Lost in the Middle" (arXiv:2307.03172) tested GPT-3.5-Turbo and Claude across multi-document QA tasks with varying document counts and positions. Performance dropped significantly when the relevant document was placed in the middle of the context. The finding holds across model families and persists even with extended context windows. Precise retrieval consistently outperformed context stuffing.
Beyond staleness and window limits, there is a third structural reason RAG exists: private knowledge. The vast majority of enterprise knowledge was never on the public internet and thus never in any model's training data. Your internal CRM notes, your legal contracts, your engineering runbooks, your customer correspondence: none of this was in the Common Crawl data that makes up much of the training corpus for most LLMs.
No amount of window size or model recency solves this. The only way to make a model reason about your proprietary documents is to provide them at inference time. RAG is the scalable mechanism for doing that selectively and efficiently.
Work through the economics and attention quality issues of long-context approaches with the AI tutor. Calculate real costs, reason through the "lost in the middle" problem, and understand why selective retrieval beats context stuffing even when windows are large.
In 2022, Bloomberg LP spent substantial engineering resources training a 50-billion-parameter LLM on a corpus weighted heavily toward financial text. The result, BloombergGPT, was announced in March 2023. It produced more accurate financial terminology and outperformed general models on several finance-specific benchmarks. But BloombergGPT still did not know the price of Apple stock yesterday. It still could not answer questions about a specific Bloomberg Terminal news article from last week. Domain-specific training, like fine-tuning, had taught the model how to talk about finance. It had not given the model access to current financial data.
Bloomberg's production systems use retrieval to pull live data alongside the fine-tuned model. The fine-tuning and the RAG serve orthogonal needs. Understanding this distinction is one of the most practically important concepts in deploying AI systems.
Fine-tuning modifies a model's weights — the billions of numerical parameters that encode its behavior. It is most effective at changing three things: output format (producing structured JSON, following a specific template), tone and style (more terse, more formal, using domain vocabulary naturally), and task specialization (improving performance on a narrow task type like extracting contract clauses or classifying support ticket urgency).
What fine-tuning does poorly: injecting specific facts. When you try to fine-tune a model to "know" that your company's refund policy is 30 days, the model distributes that information across weight adjustments throughout the network. It may recall it correctly most of the time — but it may also confabulate variations, blend it with similar policies it saw in training, or override it when contradicting context appears. Facts stored in weights are lossy and unreliable compared to facts retrieved verbatim from a document.
The research on this is consistent: the 2023 paper "How Do Large Language Models Handle Privacy Sensitive Text?" and related work on knowledge editing (ROME, MEMIT) show that inserting or editing specific factual knowledge in model weights is technically possible but unreliable at scale. A change to one fact can propagate inconsistently to related assertions.
BloombergGPT was trained on 363 billion tokens of financial text (filings, news articles, analyst reports spanning decades) alongside general-purpose data. The result was a model with dramatically better financial domain fluency. On financial benchmarks such as the Financial PhraseBank sentiment classification task, it outperformed the similarly sized open models it was evaluated against. The domain-specific training was genuinely valuable.
But financial AI applications need current data. A model that knows financial vocabulary but cannot access yesterday's earnings announcement is not useful for most Bloomberg Terminal workflows. Bloomberg's real production systems connect the fine-tuned model to live data feeds and retrieval systems. The two layers solve different problems: fine-tuning for domain competence, RAG for factual currency.
This pattern repeats across industries. In medical AI, fine-tuning on clinical notes improves the model's command of clinical terminology and clinical reasoning patterns. But a doctor asking about a specific patient's medication history needs retrieval from that patient's actual electronic health record (EHR), not from statistical patterns in a training corpus. Google's Med-PaLM 2 (2023) was fine-tuned on medical text and achieved expert-level performance on USMLE-style questions, but real clinical deployment requires retrieval from patient records and current clinical guidelines.
Use fine-tuning when: you need consistent output format, domain-specific style, improved task specialization, or behavior that must persist across all queries regardless of retrieved context.
Use RAG when: you need access to specific facts, current information, private documents, or cited sources. RAG is almost always needed in enterprise applications.
Use both when: you need domain competence (fine-tuning) AND factual currency or private knowledge (RAG). Most sophisticated production systems combine them.
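The three rules above can be written down as a small decision function. This is only a sketch of the heuristic as stated in this lesson, not a complete selection procedure. The `Requirements` fields and the `recommend` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    # Behaviour needs: addressed by changing the model's weights.
    consistent_format: bool = False
    domain_style: bool = False
    task_specialization: bool = False
    # Knowledge needs: addressed by supplying documents at inference time.
    specific_facts: bool = False
    current_information: bool = False
    private_documents: bool = False
    cited_sources: bool = False


def recommend(req: Requirements) -> str:
    """Map the stated needs to 'fine-tuning', 'RAG', 'both', or 'base model'."""
    needs_behaviour = (req.consistent_format or req.domain_style
                       or req.task_specialization)
    needs_knowledge = (req.specific_facts or req.current_information
                       or req.private_documents or req.cited_sources)
    if needs_behaviour and needs_knowledge:
        return "both"
    if needs_knowledge:
        return "RAG"
    if needs_behaviour:
        return "fine-tuning"
    return "base model"


support_bot = Requirements(private_documents=True, cited_sources=True)
contract_extractor = Requirements(consistent_format=True, current_information=True)
print(recommend(support_bot))         # RAG
print(recommend(contract_extractor))  # both
```

Real projects weigh cost, latency, and data volume as well, so treat the output as a starting point for discussion.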
There is a failure mode that appears repeatedly in enterprise AI projects: teams spend months fine-tuning a model on proprietary documents, expecting it to "learn" their knowledge base. The model does learn statistical patterns from the documents — but it does so imprecisely and without attribution. Specific factual answers become blurry. Worse, the model may confidently produce information that blends multiple documents or updates from the fine-tuning data, making it harder to trace errors to their source.
A RAG system, by contrast, returns a retrieved passage verbatim and can cite the source document and even the page or paragraph. When the retrieved content is wrong, you can audit the retrieval pipeline. When a fine-tuned model produces wrong facts, you cannot easily determine whether the error came from the original training data, the fine-tuning corpus, or the model's generalization.
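To make the attribution point concrete, here is a minimal sketch of how retrieved passages might carry their source with them into the prompt. The `Passage` structure, the `build_cited_context` helper, and the file names are all hypothetical. Real systems store richer metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    text: str     # verbatim text of the retrieved chunk
    source: str   # document the chunk came from
    page: int     # location inside that document


def build_cited_context(passages: list[Passage]) -> str:
    """Number each passage so an answer can cite it as [1], [2], ..."""
    lines = []
    for i, p in enumerate(passages, start=1):
        lines.append(f"[{i}] ({p.source}, p. {p.page}) {p.text}")
    return "\n".join(lines)


retrieved = [
    Passage("Refunds are accepted within 30 days of purchase.", "refund_policy.pdf", 2),
    Passage("Refunds are issued to the original payment method.", "billing_faq.md", 1),
]
context = build_cited_context(retrieved)
print(context)
```

Because each numbered line names its document and page, a wrong answer can be traced back to the passage that produced it.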
Auditability is increasingly a legal and regulatory requirement. The EU AI Act (2024) mandates traceability for high-risk AI systems. A RAG system's retrieved passages are inherently more auditable than knowledge encoded opaquely in model weights.
Present the AI tutor with real enterprise AI scenarios and reason through whether fine-tuning, RAG, or both are appropriate. The tutor will challenge your reasoning and help you sharpen the distinctions.
When Notion launched Notion AI in November 2022, it faced a concrete engineering challenge: users wanted AI to answer questions about their own workspace documents. A user asking "What did we decide about the product roadmap in last Tuesday's meeting?" needed the system to surface the right note, not generate a plausible-sounding fabrication. The engineering answer was a retrieval pipeline: embed every Notion page, store embeddings in a vector index, retrieve semantically similar pages at query time, and pass them to the generation model.
That pipeline — embed, index, retrieve, generate — is the canonical RAG architecture. You will build every component of it in this course. But first, you need to understand what role each piece plays and why the sequence matters.
Every RAG system, from the simplest prototype to production systems at companies like Notion, Glean, and Perplexity, implements some version of four core stages. Understanding these stages at a conceptual level before diving into implementation is essential for reasoning about where failures occur and where to optimize.
The most important design insight in RAG is that retrieval and generation are separate, independently improvable components. This is unlike a pure LLM, where the "retrieval" happens implicitly: facts are recalled from knowledge encoded in the model's weights during training, a process that is opaque and unmodifiable at inference time.
In a RAG system, if answers are wrong, you can diagnose whether the problem is in retrieval (the right passages are not being found) or generation (the right passages are found but the model misuses them). Each stage has its own metrics: retrieval precision and recall, answer faithfulness, answer relevance. Each can be improved independently.
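The retrieval metrics named above are simple to compute once you have a ranked list of retrieved document IDs and a set of IDs judged relevant. The sketch below uses the common precision@k and recall@k definitions. The document IDs are made up, and the ranked list is assumed to contain no duplicates.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)


ranked = ["doc7", "doc2", "doc9", "doc4"]   # retriever output, best first
relevant = {"doc2", "doc4", "doc5"}         # human-labelled ground truth

print(precision_at_k(ranked, relevant, k=3))  # 1 of the top 3 is relevant
print(recall_at_k(ranked, relevant, k=3))     # 1 of the 3 relevant docs was found
```

Low recall points to a retrieval problem. High recall with wrong answers points to the generation stage instead.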
This separation also enables knowledge updates without model retraining. When your company's refund policy changes from 30 to 45 days, you update one document in the index. The LLM is unchanged. The next query about return policies retrieves the updated passage. Compare this to fine-tuning: you would need to fine-tune the model again, verify the update didn't degrade other capabilities, and redeploy — a cycle that takes days to weeks.
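The refund-policy update can be sketched with a toy in-memory index. Everything here is an illustrative assumption: `PassageIndex` is a hypothetical class, and retrieval is plain word overlap standing in for a real embedding search. The point is only that the update touches the index and nothing else.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set used for the toy overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class PassageIndex:
    """Toy index keyed by document ID; search returns the best word-overlap match."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        # Writing to an existing ID replaces the old text.
        self._docs[doc_id] = text

    def search(self, query: str) -> str:
        if not self._docs:
            raise LookupError("index is empty")
        q = tokens(query)
        return max(self._docs.values(), key=lambda text: len(q & tokens(text)))


index = PassageIndex()
index.upsert("refund-policy", "Refund requests are accepted within 30 days of purchase.")
index.upsert("shipping", "Orders ship within 2 business days.")

question = "How many days do I have to request a refund?"
before = index.search(question)

# The policy changes: update one document. No model is retrained or redeployed.
index.upsert("refund-policy", "Refund requests are accepted within 45 days of purchase.")
after = index.search(question)

print(before)
print(after)
```

The same question returns the 30-day passage before the update and the 45-day passage after it, with no change to any model.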
The simplest RAG system — embed documents, do cosine similarity search, stuff top results into a prompt — is called naive RAG. It is the starting point for every RAG project and the architecture that most tutorials implement. It also has well-documented failure modes that the rest of this course teaches you to address.
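A naive RAG loop fits in a few dozen lines. The sketch below is deliberately simplified: `embed` is a toy word-count vector standing in for a real embedding model, the "index" is a Python list, and the generation stage stops at building the prompt that would be sent to an LLM. All function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system calls an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_index(docs: list[str]) -> list[tuple[str, Counter]]:
    """Index stage: store each document next to its vector."""
    return [(doc, embed(doc)) for doc in docs]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Retrieve stage: rank by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query: str, passages: list[str]) -> str:
    """Generate stage, input side: the prompt that would be sent to the model."""
    context = "\n".join(f"- {p}" for p in passages)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")


docs = [
    "The product roadmap meeting decided to ship the mobile app in Q3.",
    "The office kitchen will be closed for cleaning on Friday.",
    "Quarterly security training is mandatory for all engineers.",
]
index = build_index(docs)
query = "What did we decide about the product roadmap?"
top = retrieve(index, query, k=1)
print(build_prompt(query, top))
```

Word counts only match exact tokens, so this toy retriever misses paraphrases that a real embedding model would catch.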
Chunking failures: If a document is split in a way that cuts across the answer, neither chunk alone contains enough context. The retrieved passage is technically relevant but incomplete.
Retrieval failures: Embedding similarity finds semantically similar text, but may miss exact-match answers or technical terms not well-represented in the embedding space.
Generation failures: Even with correct passages, the model may ignore them in favor of its parametric knowledge, hallucinate additional facts not in the passage, or fail to synthesize across multiple retrieved chunks.
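The chunking failure is easy to reproduce. The sketch below splits a short text into fixed-size word chunks and checks whether any single chunk contains the full answer phrase. The `chunk_words` helper is hypothetical. Overlapping chunks are shown as one common mitigation, on the assumption that it is among the strategies covered later.

```python
def chunk_words(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of `size` words; neighbours share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


text = ("Customers may return any item for a full refund. "
        "The refund window is 45 days from the delivery date.")
answer = "refund window is 45 days"

no_overlap = chunk_words(text, size=12)
with_overlap = chunk_words(text, size=12, overlap=6)

# Without overlap the boundary falls between "window" and "is".
print(any(answer in chunk for chunk in no_overlap))    # False
# With a 6-word overlap, one chunk contains the whole phrase.
print(any(answer in chunk for chunk in with_overlap))  # True
```

Overlap costs extra storage and can return near-duplicate passages, so it trades index size for completeness.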
Understanding these failure modes — before you ever write code — is what separates engineers who ship reliable RAG systems from those who ship demos that break in production.
Module 1 (this module) establishes why RAG exists. Modules 2–6 build each component: document processing and chunking (M2), embedding models (M3), vector databases and retrieval (M4), prompt engineering for generation (M5), and evaluation and production hardening (M6). By the end, you will have built a complete, evaluatable RAG system from scratch — not a demo, but an architecture you can reason about, audit, and improve.
Use the AI tutor to reason through concrete RAG design decisions. The goal is to develop intuition for how each stage of the pipeline affects overall system behavior — before you encounter these decisions in code in later modules.