Module 8 · Lesson 1

Scaling the Pipeline: Infrastructure Choices

From a working prototype to a system that handles millions of queries without crumbling

What breaks first when a RAG prototype meets real production traffic?

In November 2023, Klarna publicly described its AI assistant — built on a retrieval-augmented stack — handling over 2.3 million conversations in its first month. The prototype that preceded it had been tested at roughly 500 queries per day. The gap between those two numbers is where production RAG architecture lives.

Why Prototype RAG Fails at Scale

Most RAG prototypes share a single-process architecture: one embedding model, one vector store connection, one LLM call per request — all synchronous, all in sequence. At low volume this is fine. Under load, three failure modes emerge simultaneously.

Embedding bottleneck. Embedding a user query requires a model inference call. At 10 queries per second, a 40ms embedding latency consumes 400ms of CPU time per second — already near saturation on a single core. At 100 QPS the queue grows unbounded.

Vector store connection exhaustion. Pinecone, Weaviate, and Qdrant all enforce per-account connection limits. A naive implementation opens a new connection per request; under load, connection setup latency (10–50ms) compounds with query latency and the connection pool exhausts.

LLM rate limits. OpenAI's GPT-4 Turbo has a default tokens-per-minute (TPM) limit of 450,000 for new tier accounts. A RAG response averaging 1,200 tokens means roughly 375 requests per minute before rate-limit errors begin. No queue means dropped requests.

Real Incident

When Notion AI launched in November 2022, users reported multi-second latencies and partial outages. Notion's engineering team later attributed the instability partly to synchronous embedding calls blocking response threads — a classic single-process bottleneck at scale.

The Production RAG Stack

A production-grade RAG system separates concerns into at least four independently scalable tiers: an ingestion pipeline, a retrieval service, a generation service, and an orchestration layer. Each can scale horizontally without affecting the others.

Production RAG Architecture — Request Flow

Client Request

→

API Gateway / Auth

→

Orchestration Layer

↓

Query Embedding Service

→

Vector Store

→

Re-ranker

→

Context Builder

↓

LLM Generation Service

→

Response Cache

→

Client

Async Queues and Worker Pools

The core shift in production RAG is moving from synchronous to asynchronous processing. A message queue — Redis Streams, Amazon SQS, or Apache Kafka — decouples request acceptance from request processing. The API gateway acknowledges the request immediately; a pool of workers picks it up for processing.

For the embedding service specifically, batching transforms economics dramatically. Sending 32 queries to OpenAI's text-embedding-3-small in a single API call costs the same as sending 1 query but produces 32 embeddings. With a queue, a worker can collect 32 pending queries before issuing the batch call — achieving near-linear throughput improvement at no additional cost.

Worker pool sizing follows Little's Law: L = λW, where L is the average number of items in the system, λ is throughput, and W is average processing time. If embedding takes 80ms and you target 500 QPS, you need at least 40 concurrent embedding workers just for that tier.

Caching Layers

Production RAG stacks typically implement two distinct caches. The semantic cache stores (query-embedding → retrieved-chunks) mappings. If a new query's embedding is within cosine distance 0.05 of a cached query, the retrieval step is skipped entirely. This is the approach used by GPTCache, open-sourced by Zilliz in 2023.

The response cache stores (query-hash → final-LLM-response) mappings for exact or near-exact repeated queries. In enterprise deployments where many users ask the same FAQ-style questions, response cache hit rates of 30–60% are common, eliminating LLM calls entirely for those requests.

Key Infrastructure Principle

Each tier in production RAG should have its own health check, circuit breaker, and independent scaling policy. A spike in LLM latency should never cause the embedding service to queue-starve or the vector store to timeout. Isolation is the foundation of resilience.

Circuit BreakerA pattern that detects when a downstream service is failing and stops sending requests to it for a cooldown period, preventing cascading failures across the entire pipeline.

Little's LawL = λW: a queueing theory formula relating average system occupancy (L), throughput rate (λ), and average processing time (W) — used to size worker pools.

Semantic CacheA cache that indexes query embeddings and returns pre-computed retrieval results for sufficiently similar future queries, bypassing the vector store lookup.

Lesson 1 Quiz

Scaling the Pipeline: Infrastructure Choices

1. Which failure mode typically hits first when a single-process RAG prototype is placed under high query volume?

Correct. Synchronous embedding calls block response threads; at sustained load the embedding bottleneck saturates before most other components.

Not quite. The embedding bottleneck — blocking synchronous calls per request — is almost always the first saturation point in a naive single-process architecture.

2. In the context of production RAG, what does a semantic cache store?

Correct. A semantic cache indexes query embeddings; sufficiently similar future queries skip vector store retrieval entirely by returning the cached chunk set.

That describes a response cache (exact query hash → LLM output). A semantic cache operates at the embedding level, caching retrieval results for similar queries.

3. According to Little's Law (L = λW), if your embedding service processes each query in 80ms and you need to sustain 500 queries per second, approximately how many concurrent embedding workers do you need?

Correct. L = λW = 500 × 0.08 = 40. You need 40 concurrent workers to keep the queue from growing under that load.

Apply Little's Law: L = λW = 500 requests/sec × 0.08 sec/request = 40 concurrent workers needed at minimum.

4. What is the primary purpose of a circuit breaker in a production RAG pipeline?

Correct. A circuit breaker detects downstream failures and halts request forwarding during a cooldown, preventing one slow service from causing system-wide failure.

A circuit breaker is a resilience pattern — it detects failing services and temporarily stops calling them, allowing the rest of the pipeline to continue operating.

Lab 1: Infrastructure Architecture Design

Explore production RAG infrastructure decisions with your AI lab assistant

Scenario: Scaling a RAG System to Production

You are an engineer at a mid-size SaaS company. Your RAG-powered support assistant prototype works well at 50 queries per day. The product team has committed to a public launch expected to drive 5,000 queries per hour at peak. You need to redesign the architecture.

Discuss infrastructure choices, bottleneck analysis, caching strategies, and worker pool sizing with the lab assistant. Complete at least 3 exchanges to finish the lab.

Start by describing your current single-process architecture and asking what you should change first.

Production RAG Architecture Lab

L1 · Infrastructure

Welcome to Lab 1. I'm your production RAG infrastructure advisor. You're moving from 50 queries/day to 5,000 queries/hour — that's roughly a 2,400× traffic increase. Tell me about your current architecture: how do you handle embedding, retrieval, and generation right now? We'll figure out what needs to change first.

Module 8 · Lesson 2

Observability: Knowing What Your System Is Actually Doing

You cannot improve what you cannot measure — and in RAG, measurement is surprisingly hard

How do you detect that your RAG system is retrieving the wrong documents before your users do?

In March 2023, Air Canada's RAG-based support chatbot told a passenger that a bereavement discount could be claimed retroactively — a policy that did not exist. The system had retrieved an outdated policy document and the LLM generated a confident, incorrect response. The case went to the British Columbia Civil Resolution Tribunal, which ruled against Air Canada. The airline was ordered to pay damages.

The failure was not in the LLM's reasoning. It was in the absence of any monitoring that would have flagged which documents were being retrieved for policy queries — and whether those documents were current.

The Three Observability Layers

Production RAG observability operates at three distinct levels that must be monitored independently. Infrastructure metrics tell you whether the system is running. Pipeline metrics tell you how well individual components are performing. Quality metrics tell you whether the answers are actually good.

Most teams instrument infrastructure metrics — latency, error rates, CPU usage — well from day one. Pipeline and quality metrics are where production RAG systems consistently fail silently.

Infrastructure

P99

End-to-end latency at 99th percentile; the experience of your slowest 1% of users

Pipeline

MRR

Mean Reciprocal Rank of retrieved chunks; how often the most relevant chunk appears first

Quality

RAGAS

Automated faithfulness + answer relevance scoring against reference answers

Business

CSAT

Customer satisfaction scores correlated with RAG quality metrics to validate proxies

Tracing the Full RAG Request

Every production RAG request should emit a structured trace containing: the original query, the query embedding (or a hash of it), the top-k retrieved chunk IDs and their similarity scores, the assembled context window, the LLM prompt, and the final response. This trace is the foundation of all downstream debugging and quality analysis.

Frameworks built for this purpose include LangSmith (LangChain's observability platform, released publicly in 2023), Arize Phoenix, and Weights & Biases Weave. Each captures span-level traces across the full pipeline. Without traces, diagnosing a quality regression reduces to guesswork.

The critical trace field that most teams initially omit is chunk provenance — specifically which document, which version of that document, and which ingestion timestamp each retrieved chunk came from. The Air Canada failure was precisely a provenance failure: a stale document version was in the index with no mechanism to detect or flag it.

Document Versioning in Practice

Elastic's internal RAG deployment (described at their 2023 engineering summit) stores a last_modified timestamp and a source_hash alongside each vector in their Elasticsearch store. A daily job flags chunks older than 90 days for human review. Any query that retrieves a flagged chunk automatically appends a confidence warning to the response.

Automated Quality Evaluation with RAGAS

RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework published by Explorazure in 2023 that computes four automated metrics without requiring human-labeled ground truth for every query.

Faithfulness measures whether every factual claim in the generated answer can be traced to the retrieved context — catching hallucinations introduced by the LLM despite having correct source material. Answer Relevancy measures whether the answer actually addresses the question. Context Precision measures whether retrieved chunks are signal or noise. Context Recall measures whether all information needed to answer the question was actually retrieved.

Running RAGAS on a random sample of 100–500 queries per day provides a continuous quality signal that can alert engineers before users surface problems. Databricks reported in 2024 that automated RAGAS monitoring caught a 12-percentage-point drop in faithfulness scores caused by an inadvertent index update — 18 hours before any user complaints arrived.

Alerting Thresholds

Effective RAG observability requires setting alert thresholds at each layer. A reasonable baseline for a customer-facing system: P99 latency above 4 seconds triggers a page; MRR below 0.6 triggers a Slack alert; RAGAS faithfulness below 0.75 triggers an engineering review; daily CSAT correlation below 0.5 triggers a model/pipeline audit.

These thresholds are starting points, not universal standards. They should be calibrated against your baseline during the first 30 days of production operation, then tightened as you understand your system's normal variance.

MRRMean Reciprocal Rank — the average of 1/rank where rank is the position of the first relevant retrieved chunk. MRR of 1.0 means the best chunk is always first; 0.5 means it's typically second.

RAGASRetrieval Augmented Generation Assessment — an automated evaluation framework computing faithfulness, answer relevancy, context precision, and context recall without human labels.

Chunk ProvenanceMetadata tracking exactly which source document, document version, and ingestion timestamp produced each vector chunk — essential for debugging stale-data failures.

Lesson 2 Quiz

Observability: Knowing What Your System Is Actually Doing

1. The Air Canada chatbot case (2023) illustrates which specific observability failure in a RAG system?

Correct. The chatbot retrieved an outdated policy document. No monitoring tracked which document versions were being retrieved, so the stale data went undetected.

The failure was a provenance problem — stale documents were in the index with no mechanism to detect or flag them. The LLM was working correctly on wrong inputs.

2. RAGAS measures four automated metrics. Which one specifically detects hallucinations introduced by the LLM after correct chunks are retrieved?

Correct. Faithfulness checks whether every factual claim in the generated answer can be traced back to the retrieved context — catching cases where the LLM adds unsupported information.

Faithfulness is the RAGAS metric that verifies every claim in the answer is supported by the retrieved chunks — it specifically catches LLM-introduced hallucinations.

3. What is Mean Reciprocal Rank (MRR) measuring in a RAG retrieval pipeline?

Correct. MRR = average of 1/rank where rank is the position of the first relevant chunk. High MRR means the retriever consistently surfaces the best material first.

MRR (Mean Reciprocal Rank) measures rank quality: the average of 1/rank for the first relevant result across queries. MRR of 1.0 means the best chunk is always position 1.

4. According to the Databricks 2024 example, what was the practical benefit of running automated RAGAS monitoring on daily query samples?

Correct. Automated daily RAGAS monitoring flagged an 18-hour early warning of a faithfulness drop caused by an inadvertent index update — before users reported problems.

The Databricks example demonstrated early detection: RAGAS monitoring caught a 12-point faithfulness drop 18 hours before user complaints — exactly the proactive signal observability should provide.

Lab 2: RAG Observability Design

Design a monitoring and alerting strategy for a production RAG deployment

Scenario: Building a Monitoring Stack

Your RAG-powered internal knowledge assistant just went live for 800 employees. Within two weeks, the HR team reports that answers about benefits policies seem outdated. You have no current monitoring in place — just basic server uptime checks.

Work with the lab assistant to design a comprehensive observability strategy covering infrastructure, pipeline quality, and automated alerting. Complete at least 3 exchanges to finish the lab.

Ask the assistant how you would have caught the benefits policy problem before the HR team noticed it.

RAG Observability Lab

L2 · Monitoring

Welcome to Lab 2. You've got a classic silent-failure scenario — stale documents being retrieved without any signal reaching the engineering team. Let's build the monitoring stack that would have caught this. What's your first instinct about where the failure point was, and what kind of visibility would have helped?

Module 8 · Lesson 3

Security, Access Control, and Data Isolation

When your RAG system can retrieve anything, it can leak anything — unless you build controls from the start

How do you ensure a RAG system never returns documents a user isn't authorized to see?

In February 2024, Samsung's internal ChatGPT deployment — used by engineers to query internal documentation — became the subject of a data leak investigation. Engineers had pasted confidential source code into prompts and internal meeting notes into the system, which then stored that content in external model training pipelines. Samsung subsequently banned the use of generative AI tools on company networks.

A self-hosted RAG system with proper data isolation would have prevented the external exposure. But self-hosted RAG introduces its own security surface: who can retrieve what. Without row-level security on the vector store, every user can potentially retrieve every document.

The Authorization Gap in Naive RAG

Standard vector similarity search is inherently authorization-blind. When you query a vector database for the top-k most similar chunks to a given embedding, the database returns whichever chunks score highest — regardless of who originally had access to the source document. In an enterprise context where the index contains documents from Legal, HR, Finance, and Engineering, this means every user potentially reaches every document.

This is not a theoretical risk. When Salesforce deployed an internal RAG assistant in 2022, engineers discovered in internal testing that queries about compensation packages were retrieving confidential HR salary band documents that had been inadvertently ingested into the shared index — accessible to all employees.

Metadata Filtering for Access Control

All major vector databases — Pinecone, Weaviate, Qdrant, Chroma, and pgvector — support metadata filters applied at query time. Every chunk, at ingestion, receives metadata tags including owner_department, clearance_level, allowed_user_ids, or allowed_groups depending on the access model.

At query time, the retrieval service injects a metadata filter based on the authenticated user's attributes. The vector similarity search only scores chunks whose metadata satisfies the filter. This approach is called pre-retrieval filtering and is the most reliable pattern because unauthorized documents are never scored — they are excluded before the similarity computation even runs.

The alternative — post-retrieval filtering, where results are retrieved and then filtered by authorization — is significantly weaker because it reduces effective k (you may retrieve k=10 but return only k=3 after filtering) and introduces information leakage risk if filter logic has bugs.

Pre-Retrieval vs Post-Retrieval Authorization

Pre-Retrieval (Secure)

User Auth → Inject Filter → Vector Search (filtered) → Results

Post-Retrieval (Risky)

Vector Search (unfiltered) → Auth Check → Drop unauthorized → Results

Namespace Isolation

For hard multi-tenant isolation — where different organizations or business units must never share any vector space — namespace isolation is the appropriate pattern. Pinecone's namespaces and Qdrant's collection partitioning allow completely separate vector spaces within the same physical infrastructure, with no cross-namespace query possible at the database level.

Namespace isolation eliminates the metadata filter injection requirement — there is simply no mechanism to query across namespaces — at the cost of higher operational complexity (each tenant's namespace must be independently managed for ingestion, indexing, and deletion).

A practical rule: use metadata filtering for role-based access within a single organization; use namespace isolation for true multi-tenant deployments serving different legal entities.

Prompt Injection and Adversarial Retrieval

Production RAG systems face a novel attack surface: adversarial content embedded in documents that, when retrieved, manipulates the LLM's behavior. This is called indirect prompt injection. In 2023, researchers at ETH Zurich demonstrated that injecting hidden instructions into documents indexed by a RAG system could cause the LLM to leak other retrieved documents to the attacker through the generated response.

Defenses include: context sanitization (stripping instruction-like patterns from retrieved text before insertion into the prompt), privilege separation (the LLM layer should never have write access to the vector store or any persistent storage), and response auditing (flagging responses that contain structural patterns resembling data exfiltration — e.g., JSON blobs or base64 strings in answers to natural language questions).

OWASP LLM Top 10 — 2023

The OWASP Top 10 for LLM Applications, published in 2023, lists "Insecure Output Handling" and "Sensitive Information Disclosure" among the top vulnerabilities. Both are directly applicable to RAG systems: outputs that include retrieved confidential material and systems that retrieve more data than the user is authorized to see.

Pre-Retrieval FilteringAuthorization applied as a metadata constraint before vector similarity scoring runs — unauthorized documents are excluded entirely from the search space.

Namespace IsolationPhysically separate vector spaces within a database, making cross-tenant queries architecturally impossible rather than policy-enforced.

Indirect Prompt InjectionAn attack where malicious instructions embedded in indexed documents are retrieved and cause the LLM to behave in ways not intended by the system designer.

Lesson 3 Quiz

Security, Access Control, and Data Isolation

1. Why is post-retrieval authorization filtering considered weaker than pre-retrieval filtering in a RAG system?

Correct. In post-retrieval filtering, unauthorized documents are retrieved and scored before authorization is checked. Bugs in the filter logic can expose them, and effective k drops unpredictably.

The risk is that unauthorized documents enter the pipeline before being checked. Any bug in filter logic — or a race condition — can result in leakage. Pre-retrieval filtering prevents unauthorized docs from being scored at all.

2. What is indirect prompt injection in the context of RAG security?

Correct. Indirect prompt injection embeds instructions in source documents. When those documents are retrieved and inserted into the LLM context, the instructions execute — potentially leaking other retrieved content.

Indirect prompt injection is subtler than direct user-input attacks. The malicious content lives in documents in the knowledge base; when retrieved, it's inserted into the LLM prompt as "trusted" context and can hijack behavior.

3. When should you use namespace isolation rather than metadata filtering for access control in a RAG deployment?

Correct. Namespace isolation is the right pattern for true multi-tenant deployments with different legal entities — it makes cross-tenant queries architecturally impossible rather than merely policy-prohibited.

Namespace isolation is for hard multi-tenancy between legal entities. For role-based access within a single organization, metadata filtering is simpler and sufficient. The key criterion is whether cross-query must be architecturally impossible.

4. Which of the following is a recommended defense against indirect prompt injection in production RAG?

Correct. Context sanitization removes instruction-pattern text from retrieved chunks before they enter the LLM prompt, reducing the attack surface for indirect injection.

Context sanitization is the key defense: scrubbing retrieved text for instruction-like patterns before inserting it into the LLM prompt prevents injected commands from executing. Encryption and MFA don't address the injection vector.

Lab 3: RAG Security Architecture

Design access controls and injection defenses for a multi-department RAG deployment

Scenario: Enterprise Knowledge Base Security Audit

You are the security engineer for a 2,000-person company deploying a unified RAG assistant over documents from Legal, HR, Finance, Engineering, and Sales. All documents are currently in a single unfiltered index. A routine audit has flagged that any employee can potentially retrieve any document.

Work through the security architecture with the lab assistant: access control model, namespace vs. metadata filtering decisions, prompt injection defenses, and audit logging. Complete at least 3 exchanges to finish the lab.

Start by describing the access control problem and asking what your first design decision should be.

RAG Security Lab

L3 · Access Control

Welcome to Lab 3. You've got a classic enterprise RAG security problem: a shared index with no authorization layer. Before we design solutions, let's scope the threat model. What are the highest-risk data categories in your index, and do you have existing identity infrastructure (SSO, LDAP, Active Directory) we can build on for the access control model?

Module 8 · Lesson 4

Continuous Improvement: Feedback Loops and Index Maintenance

A RAG system that isn't actively improved degrades — not because it changes, but because the world does

How do you build the operational discipline to keep a RAG system accurate six months after launch?

In January 2024, Morgan Stanley disclosed that its OpenAI-powered financial advisor assistant — which had been trained on over 100,000 proprietary research documents — required a dedicated team of four ML engineers to manage index freshness, handle document retirement, and run continuous evaluation. The system that launched in 2023 shared little more than architecture with the system running twelve months later.

What had changed was not the model. It was the corpus, the evaluation benchmarks, and the retrieval strategy — all evolved through a structured feedback loop that treated production traffic as a continuous source of training signal.

User Feedback as Ground Truth

The most valuable signal for improving a production RAG system is structured user feedback. A minimal implementation captures binary thumbs-up/thumbs-down on each response. More valuable is a correction capture — when a user indicates an answer is wrong and provides the correct information, that pair (query, correct answer) becomes a labeled evaluation example.

Over 90 days of production operation, even a system handling 1,000 queries per day with a 5% explicit feedback rate generates 4,500 labeled examples. This corpus enables offline evaluation: replay historical queries through updated pipeline configurations and measure quality improvement before deploying changes.

Bing's AI team described this approach at the 2023 ACL workshop on industrial NLP: they maintained a "golden set" of queries with human-verified correct answers, growing it continuously from user feedback, and used it to gate any change to the retrieval pipeline before deployment.

Index Drift and Document Lifecycle Management

Document indices drift from reality along three axes: deletion drift (source documents are removed or superseded but their chunks remain indexed), update drift (documents are revised but old chunks persist alongside new ones), and growth drift (the index grows so large that relevant documents are diluted by noise, lowering precision).

Managing deletion drift requires a delete-propagation pipeline: when a document is removed from the source system, all chunks derived from it must be deleted from the vector index. This requires maintaining a document_id → [chunk_ids] mapping at ingestion time — a step that is frequently omitted in prototype RAG systems and then painful to retrofit.

Update drift is handled by treating document updates as delete-then-reinsert operations. When a document is modified, all its old chunks are deleted and replaced with chunks from the new version. This maintains chunk-level consistency without requiring complex differential update logic.

Confluence + Notion Integrations

Glean, the enterprise search startup, built their production RAG ingestion pipeline to listen to webhook events from Confluence and Notion. Any page update or deletion triggers a real-time reindexing job. Their 2023 engineering blog post reported that without webhook-driven reindexing, their index had 23% stale content within two weeks of initial ingestion at a 500-employee company.

Retrieval Strategy Evolution

Production RAG systems typically evolve through three retrieval strategy phases. Dense-only retrieval (pure vector similarity) is where most systems start. Hybrid retrieval (dense + BM25 sparse) is added when users report that exact-match queries — product names, error codes, specific dates — are failing because dense embeddings don't preserve lexical identity well. Reranking is added when retrieval precision is acceptable but ordering is poor — the right documents are in the top-20 but not consistently in the top-3.

The decision to add each layer should be data-driven: a drop in MRR below 0.65 for a class of queries is the signal to investigate hybrid retrieval; a gap between context recall and faithfulness (chunks are retrieved but answers are unfaithful) suggests reranker addition. These thresholds are not guesses — they are derived from the labeled evaluation set built through user feedback.

A/B Testing Retrieval Changes

Any change to a production RAG pipeline — new embedding model, updated chunking strategy, additional reranker — should be validated through A/B testing before full deployment. The traffic split sends a percentage of queries through the new configuration; automated RAGAS scoring and user feedback rates are compared between control and treatment.

LangSmith's production deployment dashboard, introduced in late 2023, supports this pattern natively: you define two pipeline configurations, route traffic fractions to each, and view side-by-side quality metric comparisons. Cloudflare's AI team described using a similar pattern when migrating their internal documentation RAG system from Ada-002 to text-embedding-3-large in early 2024 — the A/B test ran for 72 hours over 18,000 queries before they committed to the migration.

Operational Cadence

Production RAG maintenance should follow a structured weekly cadence: daily automated RAGAS scores reviewed for threshold violations; weekly review of low-rated user feedback examples (bottom 50 by score); monthly retrieval strategy review against the golden evaluation set; quarterly full corpus freshness audit and embedding model evaluation against newer alternatives.

This cadence is not optional overhead — it is the mechanism by which a RAG system remains useful rather than drifting into confident incorrectness, which is the failure mode that erodes user trust faster than any single outage.

The Compounding Return

Every user feedback example you capture, every evaluation run you conduct, every index freshness audit you complete compounds. Teams that establish these practices in the first 90 days of production operation consistently outperform teams that retrofit them at month six — not because the technology is different, but because they have accumulated ground truth that cannot be reconstructed retrospectively.

Index DriftThe progressive divergence between the vector index and the current state of source documents — through deletions, updates, and volume growth — that degrades retrieval quality over time.

Delete PropagationThe pipeline that removes all vector chunks derived from a source document when that document is deleted or superseded — requires a document_id → [chunk_ids] mapping maintained at ingestion.

Golden Evaluation SetA curated, human-verified set of (query, correct answer) pairs grown from production feedback and used to gate pipeline changes through offline evaluation before deployment.

Lesson 4 Quiz

Continuous Improvement: Feedback Loops and Index Maintenance

1. What was the key operational insight from Morgan Stanley's RAG assistant deployment described in early 2024?

Correct. Morgan Stanley maintained four dedicated ML engineers for index freshness, document retirement, and continuous evaluation — the system evolved continuously through structured operational discipline.

The Morgan Stanley case illustrated that production RAG is not "deploy and forget." What changed over 12 months was the corpus, evaluation benchmarks, and retrieval strategy — all requiring dedicated operational investment.

2. Why does a prototype RAG system frequently omit the document_id → [chunk_ids] mapping, and why does this matter in production?

Correct. Prototypes work with a static corpus, so deletion propagation seems unnecessary. In production, when documents are removed or updated, you need this mapping to find and delete the associated chunks — without it, stale content accumulates.

The mapping is not needed when the corpus never changes. In production, documents are deleted and updated constantly. Without knowing which chunks came from which document, you cannot clean up old content — leading to deletion drift.

3. According to the Glean engineering example, what was the stale content rate in a 500-person company's index within two weeks without webhook-driven reindexing?

Correct. Glean's engineering blog reported 23% stale content within two weeks of initial ingestion without continuous webhook-triggered reindexing — a surprisingly fast drift rate in an active company.

Glean reported 23% stale content within two weeks. At an active company, Confluence pages and Notion documents are updated constantly — without real-time reindexing triggers, index freshness degrades rapidly.

4. What specific quality signal should trigger an investigation into adding hybrid retrieval (dense + BM25) to a production RAG system?

Correct. Dense embeddings don't preserve lexical identity well — exact-match terms like error codes or proper nouns are where BM25 sparse retrieval adds the most value. MRR below 0.65 on those query classes is the signal to act.

The signal for hybrid retrieval is poor performance on lexically specific queries (error codes, product names, specific dates) — where dense embeddings fail to preserve exact terms. MRR below 0.65 on that query class is the threshold to investigate.

Lab 4: RAG Continuous Improvement Strategy

Design a feedback loop and operational cadence for a live RAG deployment

Scenario: Six-Month Post-Launch Degradation

You launched a RAG-powered product documentation assistant six months ago. At launch it answered 78% of queries correctly (by RAGAS faithfulness). Today that number is 61%. Support tickets have increased 40%. Leadership wants a plan to recover quality and prevent future degradation.

Work with the lab assistant to diagnose the degradation, design a feedback capture system, establish an index maintenance strategy, and build an operational cadence. Complete at least 3 exchanges to finish the lab.

Start by asking the assistant how to diagnose whether the degradation is from index drift, retrieval strategy drift, or LLM quality changes.

RAG Continuous Improvement Lab

L4 · Feedback Loops

Welcome to Lab 4. A 17-point drop in faithfulness over six months is serious but diagnosable. Before we prescribe solutions, we need to localize the failure. There are three candidate causes: your index has drifted (stale or deleted documents), your retrieval quality has degraded (MRR drop), or your LLM handling has changed (prompt drift, model updates). Do you have any historical RAGAS scores, MRR trends, or retrieval logs we can look at to narrow this down?

Module 8 Test

Production RAG Architecture — 15 questions · 80% to pass

1. Which of the following best describes the primary bottleneck in a single-process RAG architecture under high load?

Correct. Synchronous embedding per request is the fastest-saturating component; it blocks threads and causes unbounded queue growth under load.

The embedding bottleneck — blocking synchronous inference calls — is the first failure mode in single-process RAG under load.

2. Little's Law states L = λW. If you need 200 QPS throughput and each request takes 60ms to process, how many concurrent workers are needed?

Correct. L = 200 × 0.06 = 12 concurrent workers needed.

L = λW = 200 × 0.06 = 12. You need 12 concurrent workers to maintain that throughput at that processing time.

3. What distinguishes a semantic cache from a standard response cache in a RAG system?

Correct. The semantic cache operates at the retrieval layer (embedding similarity → cached chunks); the response cache operates at the output layer (query hash → cached answer).

They operate at different pipeline stages: semantic cache at retrieval (similarity-based), response cache at generation output (exact-match hash).

4. The Notion AI launch instability in November 2022 was partly attributed to which architectural issue?

Correct. Notion's engineering team attributed instability to synchronous embedding calls blocking response threads — the classic single-process bottleneck.

Notion's public post-mortem pointed to synchronous embedding calls as a key contributor to the multi-second latencies at launch.

5. Which RAGAS metric would you monitor to detect cases where the LLM adds information not present in the retrieved context?

Correct. Faithfulness verifies that every claim in the generated answer is grounded in the retrieved context — directly detecting LLM-introduced hallucinations.

Faithfulness is the RAGAS metric for detecting unsupported claims — it checks whether each assertion in the answer can be traced to retrieved chunks.

6. What critical trace field do most teams initially omit from RAG observability instrumentation?

Correct. Chunk provenance — which document version each chunk came from and when it was ingested — is the field most often omitted and most critical for diagnosing stale-data failures.

Chunk provenance is the critical missing field. Without knowing which document version was retrieved and when it was indexed, stale-data failures like the Air Canada case are impossible to diagnose.

7. Why is pre-retrieval authorization filtering more secure than post-retrieval filtering?

Correct. In pre-retrieval filtering, unauthorized documents are never scored. In post-retrieval, they enter the pipeline and a filter bug can expose them.

The security difference is fundamental: pre-retrieval keeps unauthorized data entirely out of the pipeline; post-retrieval requires the filter to work correctly after unauthorized data has already been processed.

8. When should namespace isolation be preferred over metadata filtering for access control?

Correct. Namespace isolation makes cross-tenant queries impossible at the database level — essential when serving separate legal entities rather than just different roles within one organization.

The criterion is tenant isolation strength: different legal entities require architectural impossibility of cross-queries (namespace isolation); different roles within one org can use metadata filtering.

9. What does indirect prompt injection exploit in a RAG system?

Correct. The LLM treats retrieved context as trusted input. Instructions hidden in documents execute with the same authority as legitimate system prompts when those documents are retrieved.

Indirect injection exploits the LLM's trust in retrieved context — it processes instructions in documents the same way it processes system-level commands.

10. What is the correct way to handle a document update in a production RAG index to avoid update drift?

Correct. Treating updates as delete-then-reinsert operations ensures the index always contains exactly one version of each document's content, without requiring complex differential logic.

Delete-then-reinsert is the cleanest pattern: remove all chunks from the old version, insert all chunks from the new version. This avoids mixed-version retrieval with no complex diff logic needed.

11. What specific metric change should trigger investigation into adding a reranker to a production RAG pipeline?

Correct. If recall is good (right documents are retrieved) but faithfulness is low (answers are poor), the issue is likely that the best chunks are buried in position 8–15 rather than 1–3. A reranker fixes ordering, not retrieval coverage.

The reranker signal is high recall + low faithfulness: the right documents are in the retrieved set but not in the top positions the LLM weighs most heavily. A reranker improves ordering without changing what is retrieved.

12. How did Databricks use automated RAGAS monitoring to demonstrate production value in 2024?

Correct. Databricks reported RAGAS monitoring detected a 12-percentage-point faithfulness drop 18 hours before any user complaints surfaced — the canonical example of proactive quality observability.

The Databricks case showed RAGAS monitoring catching a regression 18 hours before users noticed — demonstrating that automated quality monitoring enables proactive rather than reactive incident response.

13. What is the recommended operational cadence for reviewing low-rated user feedback examples in a production RAG system?

Correct. The recommended cadence is weekly review of low-rated examples, with daily automated RAGAS monitoring, monthly retrieval strategy review, and quarterly full corpus audits.

The full cadence: daily automated RAGAS checks, weekly review of low-rated feedback, monthly retrieval strategy review against the golden set, quarterly corpus freshness audit.

14. The Samsung generative AI data leak incident in 2024 highlighted which risk that a properly architected self-hosted RAG system would have mitigated?

Correct. Engineers pasted confidential code and notes into an externally hosted system. A self-hosted RAG deployment with proper data isolation would have kept that data within Samsung's controlled infrastructure.

The Samsung incident involved confidential data leaving Samsung's control to external model infrastructure. Self-hosted RAG with data isolation keeps all document content and queries within the organization's own systems.

15. Cloudflare validated their migration from Ada-002 to text-embedding-3-large using A/B testing. What were the key parameters of that test?

Correct. Cloudflare ran a 72-hour A/B test over 18,000 queries before committing to the migration — validating quality improvement at real production scale before full deployment.

Cloudflare's approach: 72 hours, 18,000 queries, side-by-side quality metrics. This provided statistical confidence from real production traffic before committing to the embedding model change.