Module 6 · Lesson 1

RAG Architecture & Corpus Ingestion

From raw documents to a queryable knowledge store — understanding the full ingestion pipeline.

What actually happens between uploading a PDF and getting a grounded answer?

In 2023, Google DeepMind's internal documentation team confronted a problem shared by thousands of enterprises: a corpus of tens of thousands of engineering documents that language models hallucinated over freely. The fix was not a bigger model — it was a retrieval layer. Their approach, now published in the Vertex AI documentation, became the template for production RAG on Google Cloud.

What RAG Actually Is

Retrieval-Augmented Generation (RAG) addresses a structural problem with large language models: they encode knowledge at training time, making that knowledge static, unverifiable, and sometimes fabricated. RAG separates the knowledge store from the reasoning engine, letting you update one without retraining the other.

The canonical architecture has two phases: an offline ingestion phase that converts documents into searchable vector representations, and an online query phase that retrieves relevant chunks and passes them as context to the model at inference time.

Offline Phase

Ingestion Pipeline

Load → chunk → embed → store. Documents enter as raw files and exit as dense vectors in a vector database.

Online Phase

Query Pipeline

Embed query → ANN search → retrieve top-k chunks → assemble prompt → generate grounded answer.

Key Property

Verifiability

Every claim in the response can be traced back to a specific source chunk with a document ID and offset.

Key Property

Freshness

The knowledge store can be updated continuously without model retraining — new documents simply flow through the ingestion pipeline.
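The two phases can be sketched end to end with toy components. This is a minimal illustration, assuming a bag-of-words counter stands in for a real embedding model and a Python list stands in for a vector database; `embed`, `ingest`, and `query` are hypothetical names and not part of any Vertex AI API.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def ingest(docs: dict[str, str], chunk_words: int = 8) -> list[dict]:
    """Offline phase: load -> chunk -> embed -> store."""
    store = []
    for doc_id, text in docs.items():
        words = text.split()
        for offset in range(0, len(words), chunk_words):
            chunk = " ".join(words[offset:offset + chunk_words])
            # Keep the document ID and offset so answers can be traced to a source.
            store.append({"doc_id": doc_id, "offset": offset, "text": chunk, "vector": embed(chunk)})
    return store

def query(store: list[dict], question: str, top_k: int = 2) -> list[dict]:
    """Online phase: embed the query, rank stored chunks, return the top-k."""
    q = embed(question)
    ranked = sorted(store, key=lambda rec: cosine(q, rec["vector"]), reverse=True)
    return ranked[:top_k]

docs = {
    "manual": "Reset the pump by holding the red button for five seconds then release",
    "policy": "Vacation requests must be submitted two weeks before the start date",
}
store = ingest(docs)
hits = query(store, "how do I reset the pump", top_k=1)
print(hits[0]["doc_id"], hits[0]["offset"])  # prints: manual 0
```

The search here is an exact scan over every stored chunk; a production store replaces it with approximate nearest neighbor search, and the retrieved chunks would then be assembled into a prompt for the model.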

The Ingestion Pipeline in Detail

Ingestion is more complex than it appears. A naive approach — split every document at 512-token boundaries — produces chunks that fracture sentences, sever tables, and destroy context. Production ingestion requires deliberate choices at every stage.

Document Loading & Parsing

Raw files (PDF, DOCX, HTML, code) are parsed into structured text. This step must preserve semantic boundaries: headings, paragraphs, tables, and code blocks. On Vertex AI, Document AI handles complex layouts, including scanned PDFs via OCR, and returns the extracted structure as a Document object.

Chunking Strategy

Text is split into overlapping segments. Fixed-size chunking (e.g., 512 tokens, 50-token overlap) is simple but brittle. Semantic chunking — splitting at paragraph or section boundaries — preserves coherence. Recursive character splitting is a practical middle ground used by LangChain and LlamaIndex integrations on Vertex.

Embedding Generation

Each chunk is passed to an embedding model to produce a dense vector. Vertex AI's text-embedding-004 model (768 dimensions) is the current recommended choice. It supports task-type parameters: RETRIEVAL_DOCUMENT for corpus chunks, RETRIEVAL_QUERY for query-time embeddings.

Vector Store Indexing

Vectors are written to Vertex AI Vector Search (formerly Matching Engine), which uses ScaNN (Scalable Nearest Neighbors) for approximate nearest neighbor search at billion-scale. Alternatively, AlloyDB for PostgreSQL with pgvector provides a SQL-native option for workloads that combine vector search with transactional SQL queries.

Metadata Attachment

Each vector is stored with metadata: source file path, document title, section heading, creation date, access control labels. This enables pre-filtering at retrieval time — for example, restricting search to documents modified in the last 90 days or documents from a specific department.

Production Pattern — Vertex AI RAG Engine

As of late 2024, Vertex AI ships a managed RAG Engine API that handles ingestion and retrieval as a service. You call ImportRagFiles with a GCS path, and the API parses, chunks, embeds, and indexes automatically. This is appropriate for prototypes and moderate-scale workloads. For pipelines requiring custom chunking logic, metadata enrichment, or multi-modal ingestion, the custom pipeline approach covered in this module is still necessary.

Chunking Trade-offs

Chunk size is the single most impactful ingestion decision. Small chunks (128–256 tokens) capture precise facts but lose surrounding context, making retrieved chunks hard for the model to interpret in isolation. Large chunks (1024–2048 tokens) preserve context but dilute relevance — the retrieved chunk contains more noise alongside the relevant passage.

The parent-child chunking pattern resolves this: index small child chunks for precise retrieval, but at query time return the larger parent chunk to the model so that it sees the surrounding context. This is natively supported in LlamaIndex and is available in LangChain as ParentDocumentRetriever.
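The parent-child pattern can be sketched in a few lines. This is an illustration only, assuming word counts stand in for token counts and keyword overlap stands in for vector search; `build_parent_child_index` and `retrieve_parent` are hypothetical helpers, not the LlamaIndex or LangChain implementations.

```python
def build_parent_child_index(text: str, parent_size: int = 40, child_size: int = 10):
    """Split text into parent chunks, then split each parent into child chunks.

    Returns (parents, children); each child records the index of its parent.
    Sizes are in words for simplicity; a real pipeline would count tokens.
    """
    words = text.split()
    parents, children = [], []
    for p_start in range(0, len(words), parent_size):
        parent_words = words[p_start:p_start + parent_size]
        parent_id = len(parents)
        parents.append(" ".join(parent_words))
        for c_start in range(0, len(parent_words), child_size):
            children.append({
                "parent_id": parent_id,
                "text": " ".join(parent_words[c_start:c_start + child_size]),
            })
    return parents, children

def retrieve_parent(query_terms: set[str], parents: list[str], children: list[dict]) -> str:
    """Match against the small child chunks, but return the enclosing parent chunk."""
    def overlap(child: dict) -> int:
        return len(query_terms & set(child["text"].lower().split()))
    best_child = max(children, key=overlap)
    return parents[best_child["parent_id"]]

text = " ".join(f"w{i}" for i in range(100))
parents, children = build_parent_child_index(text)
context = retrieve_parent({"w57"}, parents, children)
print(len(parents), len(children), context.split()[0])  # prints: 3 10 w40
```

The match is made on a 10-word child, while the model would receive the 40-word parent that contains it.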

Overlap between consecutive chunks (typically 10–15% of chunk size) reduces information loss at boundaries. A fact that straddles a chunk boundary still appears whole in at least one chunk, provided the fact is no longer than the overlap.
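Fixed-size chunking with overlap amounts to sliding a window whose step is the chunk size minus the overlap. A minimal sketch, assuming the text has already been tokenized into a list; `chunk_with_overlap` is a hypothetical helper name.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

tokens = [str(i) for i in range(10)]
for chunk in chunk_with_overlap(tokens, chunk_size=4, overlap=1):
    print(chunk)
# ['0', '1', '2', '3']
# ['3', '4', '5', '6']
# ['6', '7', '8', '9']
```

Each window repeats the last token of the previous one, so any span no longer than the overlap is contained in at least one chunk.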

ScaNN Google's Scalable Nearest Neighbors library, the underlying search algorithm for Vertex AI Vector Search. Designed for low-latency search at billion-vector scale via learned quantization and tree-based partitioning.

text-embedding-004 Vertex AI's current recommended text embedding model. 768-dimensional output, supports the task_type parameter for asymmetric retrieval (documents and queries are embedded with different task types into the same vector space).

ANN Search Approximate Nearest Neighbor search. Returns vectors that are very likely, but not guaranteed, to be the true nearest neighbors, trading a small loss in recall for orders-of-magnitude speed improvement over exact search.

Real Deployment — Lufthansa Group (2024)

Lufthansa Group's technical operations team built a RAG pipeline on Google Cloud to answer mechanic queries against 2.4 million pages of aircraft maintenance manuals. The ingestion pipeline used Document AI for PDF parsing, custom semantic chunking aligned to maintenance procedure sections, and Vertex AI Vector Search for retrieval. Query latency averaged 1.2 seconds end-to-end. The system reduced manual document search time by an estimated 40% per query session, as reported in Google Cloud Next 2024 session content.

Vertex AI RAG Engine vs. Custom Pipeline

The managed RAG Engine trades flexibility for simplicity. It handles the entire ingestion pipeline internally, supports direct GCS and Google Drive sources, and exposes a single RetrieveContexts API for query time. For many enterprise use cases — internal knowledge bases, support document retrieval, policy Q&A — it is the right choice.

Custom pipelines are warranted when you need: custom chunking logic (e.g., splitting by XML tags or code function boundaries), multi-step metadata enrichment (calling an LLM to generate summary embeddings alongside content embeddings), hybrid retrieval combining vector search with keyword BM25, or cross-corpus federation across multiple vector stores.

Lesson 1 Quiz

RAG Architecture & Corpus Ingestion · 4 questions

What problem does RAG fundamentally solve that fine-tuning does not?

Correct. Fine-tuning bakes knowledge into weights at training time — updating it requires another training run. RAG separates the knowledge store from the model, so the corpus can be updated continuously without touching model weights.

Not quite. RAG's core value is decoupling knowledge from model weights, enabling updates without retraining. Inference speed, parameter count, and prompt engineering are separate concerns.

Which Vertex AI embedding model is currently recommended for RAG document ingestion, and what dimension does it output?

Correct. text-embedding-004 produces 768-dimensional vectors and supports the task_type parameter for asymmetric retrieval — using RETRIEVAL_DOCUMENT during ingestion and RETRIEVAL_QUERY at query time.

Not correct. The current recommendation is text-embedding-004 at 768 dimensions. textembedding-gecko was an earlier generation; multimodalembedding handles images and video.

What is the "parent-child chunking" pattern designed to achieve?

Correct. Small child chunks are indexed for precision — the ANN search finds the most relevant passage exactly. But the model receives the larger parent chunk, which supplies the surrounding context needed to interpret the retrieved fact.

Not correct. The pattern is about retrieval precision vs. context richness: use small chunks to find the right passage, then return the larger surrounding context to the model.

In the Vertex AI RAG Engine managed service, which API call initiates document ingestion from a GCS path?

Correct. ImportRagFiles accepts a GCS URI (or Google Drive URL) and handles parsing, chunking, embedding, and indexing internally. The companion query-time call is RetrieveContexts.

Not correct. The managed RAG Engine uses ImportRagFiles for ingestion and RetrieveContexts for querying. The other names are not part of the RAG Engine API surface.

Lab 1: Designing a Corpus Ingestion Pipeline

Interactive AI lab — practice RAG ingestion architecture decisions

Your Scenario

You are a data engineer at a financial services firm. Your team has 85,000 regulatory compliance documents (PDFs, HTML pages, Word files) ranging from 2 to 400 pages each. You need to design a RAG ingestion pipeline on Google Cloud that supports daily updates, metadata filtering by regulation type and jurisdiction, and sub-2-second query latency.

Ask the assistant about chunking strategies, embedding model choices, vector store selection, metadata schema design, and pipeline orchestration for this scenario. Try at least 3 exchanges to complete the lab.

RAG Pipeline Design Assistant

Lab 1

Ready to help you design your compliance document ingestion pipeline. You have 85,000 documents of highly variable length — that's a common enterprise RAG challenge. Where would you like to start? I can walk you through chunking strategy, embedding model selection, vector store options on Vertex AI, or metadata schema design for regulatory filtering.

Module 6 · Lesson 2

Embedding Strategy & Vector Search

Choosing the right embedding model, indexing configuration, and retrieval parameters for production.

Why do two RAG systems with the same LLM produce radically different answer quality?

In 2024, Spotify's engineering team published a post-mortem on their internal knowledge base search system. Their first RAG prototype used general-purpose embeddings and returned irrelevant chunks for music-industry-specific terminology — "key" meaning musical key retrieved documents about API keys, "release" meaning album release retrieved documents about software releases. Switching to a domain-adapted embedding model, fine-tuned on Spotify's internal corpus, reduced irrelevant retrieval by 60%. The lesson: embedding model selection is not a default choice.

Understanding Embedding Space

An embedding model maps text to a point in a high-dimensional vector space such that semantically similar texts are geometrically close. The quality of this mapping determines retrieval quality — no amount of downstream prompt engineering compensates for an embedding model that cannot distinguish relevant from irrelevant content in your specific domain.

Two properties matter: semantic faithfulness (does the model understand the meaning of your domain's vocabulary?) and retrieval asymmetry (can the model bridge the gap between how users phrase questions and how documents express answers?). General-purpose models handle common knowledge well but degrade on technical, legal, medical, or other domain-specific corpora.

Vertex AI Embedding Model Options

Vertex AI exposes several embedding models with different trade-offs:

General Purpose

text-embedding-004

768 dims. Best default for mixed-domain corpora. Supports task_type for asymmetric retrieval. Performs strongly on the MTEB benchmark. Maximum input: 2,048 tokens.

Multilingual

text-multilingual-embedding-002

768 dims. 100+ languages. Use when your corpus includes multilingual content or when document language is heterogeneous across the corpus.

Code

code-embedding models

Dedicated models for code retrieval. Understands function signatures, variable names, and programming-language-specific semantics that general text models miss.

Custom

Fine-tuned via Vertex

Vertex AI allows embedding model fine-tuning on domain-specific data using supervised contrastive learning. Requires labeled (query, relevant chunk) pairs.

Asymmetric Retrieval — task_type Parameter

A key insight in modern retrieval: the optimal embedding for a document chunk and the optimal embedding for a query about that chunk are different. Documents are dense, formal, and contain answers. Queries are short, informal, and contain questions. Using the same embedding function for both produces a systematic mismatch.

Vertex AI's text-embedding-004 addresses this via the task_type parameter. During ingestion, set task_type to RETRIEVAL_DOCUMENT. At query time, set it to RETRIEVAL_QUERY. The model applies different projection heads for each, effectively aligning query space to document space. This consistently improves recall@10 by 5–15% over symmetric retrieval on BEIR benchmarks.

Measured Impact

Google's internal evaluation on the Natural Questions benchmark shows that switching from symmetric to asymmetric retrieval (same model, different task_type) improves top-1 retrieval accuracy by approximately 8 percentage points. At that rate, across 100,000 queries, roughly 8,000 more would retrieve the correct chunk first, with no other change.

Vertex AI Vector Search Architecture

Vertex AI Vector Search (formerly Matching Engine) is a managed approximate nearest neighbor service built on Google's ScaNN algorithm. Four aspects of the service are relevant to RAG pipelines:

Index Creation

Create an Index resource specifying dimensions (must match your embedding model), distance measure (DOT_PRODUCT_DISTANCE for normalized embeddings, COSINE_DISTANCE for unnormalized), and approximate neighbor count. The index is built asynchronously from a JSONL file in GCS.

Index Endpoint Deployment

Deploy the Index to an IndexEndpoint on dedicated machine types (e2-standard-2 to n2-highcpu-32). The endpoint exposes a gRPC or REST FindNeighbors API. Dedicated endpoints provide deterministic latency; public endpoints are shared and cheaper for development.

Upsert & Streaming Updates

Vectors can be streamed into a deployed index via UpsertDatapoints, with no full rebuild required, provided the index was created with the STREAM_UPDATE update method. This is critical for RAG pipelines with frequent document updates. Deletion is supported via RemoveDatapoints. Updates are eventually consistent and appear in search within seconds.

Filtering with Restricts

Each vector can carry token restricts (string labels for categorical filters) and numeric restricts (for range queries). At query time, pass filter conditions alongside the query vector. Pre-filtering reduces the ANN search space before distance computation — critical for large corpora with access controls or date-range requirements.
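The effect of pre-filtering can be illustrated with a brute-force search in plain Python. This is a sketch of the idea only, not the Vector Search API: `Datapoint`, `filtered_search`, and the field names `tokens` and `numbers` are hypothetical, and the exact scan stands in for ANN search.

```python
from dataclasses import dataclass, field

@dataclass
class Datapoint:
    id: str
    vector: list[float]
    tokens: dict[str, str] = field(default_factory=dict)     # categorical labels
    numbers: dict[str, float] = field(default_factory=dict)  # numeric attributes

def filtered_search(points, query, allow=None, min_values=None, top_k=3):
    """Apply metadata filters first, then rank the surviving points by dot product."""
    allow = allow or {}
    min_values = min_values or {}
    candidates = [
        p for p in points
        if all(p.tokens.get(k) == v for k, v in allow.items())
        and all(p.numbers.get(k, float("-inf")) >= v for k, v in min_values.items())
    ]

    def score(p: Datapoint) -> float:
        return sum(a * b for a, b in zip(p.vector, query))

    return sorted(candidates, key=score, reverse=True)[:top_k]

points = [
    Datapoint("a", [1.0, 0.0], {"department": "legal"}, {"year": 2024}),
    Datapoint("b", [0.9, 0.1], {"department": "finance"}, {"year": 2024}),
    Datapoint("c", [0.8, 0.2], {"department": "finance"}, {"year": 2021}),
]
hits = filtered_search(points, [1.0, 0.0], allow={"department": "finance"}, min_values={"year": 2023})
print([p.id for p in hits])  # prints: ['b']
```

Point "a" is the closest vector overall, but it is excluded before any distance is computed because it fails the department filter.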

Retrieval Parameters: top_k and Score Thresholds

The top_k parameter controls how many chunks are retrieved per query. The right value is not obvious: too low and you miss relevant chunks; too high and you flood the model context with noise, degrading generation quality. Empirical evaluation on a test set of query-answer pairs is the only reliable method for calibrating this.

In practice, retrieve more (k=20) and apply a reranking step before passing to the model. A cross-encoder reranker (such as Vertex AI's built-in reranking or a custom model) re-scores each retrieved chunk against the query using full attention, selecting the final top 3–5 for the prompt. Cross-encoders are too slow to run over the entire corpus but fast enough over 20 candidates.

Score thresholds (minimum cosine similarity) filter out retrievals where even the top-k results are semantically distant from the query. If all retrieved chunks fall below threshold, return "I don't know" rather than hallucinate — a critical quality gate for regulated industries.
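The retrieve-wide, rerank, then gate flow can be sketched as follows. This is a minimal illustration under stated assumptions: `score_fn` is where a cross-encoder would be plugged in, `word_overlap` is a toy stand-in for it, and the function names and the 0.5 threshold are hypothetical.

```python
from typing import Callable

def word_overlap(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words that appear in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def rerank_and_gate(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    final_k: int = 5,
    min_score: float = 0.5,
) -> list[tuple[str, float]]:
    """Re-score every candidate, drop those below min_score, keep the best final_k."""
    scored = [(chunk, score_fn(query, chunk)) for chunk in candidates]
    kept = [pair for pair in scored if pair[1] >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:final_k]

def answer_or_abstain(query: str, candidates: list[str], score_fn, generate) -> str:
    """Short-circuit to an abstention when no chunk clears the threshold."""
    context = rerank_and_gate(query, candidates, score_fn)
    if not context:
        return "I don't know"
    return generate(query, [chunk for chunk, _ in context])

candidates = ["refund policy for cancelled flights", "office parking rules"]
print(rerank_and_gate("refund policy", candidates, word_overlap))
# [('refund policy for cancelled flights', 1.0)]
print(answer_or_abstain("quantum physics", candidates, word_overlap, lambda q, ctx: "unused"))
# I don't know
```

The threshold value has to be calibrated on a labeled test set, in the same way as top_k.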

Cross-encoder A reranking model that takes a (query, chunk) pair as joint input and outputs a relevance score. More accurate than bi-encoder retrieval but too slow to run over full corpora — used as a reranking step over a small candidate set.

Restricts Vertex AI Vector Search metadata filters attached to each datapoint. Token restricts support categorical filtering (e.g., department=legal). Numeric restricts support range queries over numeric values (e.g., a date encoded as a number, created_date >= 20240101).

DOT_PRODUCT_DISTANCE Distance measure for Vector Search. Equivalent to cosine similarity when vectors are L2-normalized (as Vertex embeddings are by default). Faster to compute than cosine on normalized vectors.
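The equivalence between dot product and cosine similarity on L2-normalized vectors can be checked numerically. A small sketch in plain Python; the helper names are illustrative.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [1.0, 2.0]

raw_dot = dot(a, b)                                     # depends on vector lengths
cos = cosine_similarity(a, b)                           # length-independent
normalized_dot = dot(l2_normalize(a), l2_normalize(b))  # matches cos

print(raw_dot)                            # prints: 11.0
print(math.isclose(cos, normalized_dot))  # prints: True
```

On unnormalized vectors the raw dot product (11.0 here) differs from the cosine, which is why the distance measure must match how the embeddings were produced.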

AlloyDB pgvector Alternative

For RAG pipelines that need to combine vector search with structured SQL queries — for example, retrieving compliance documents matching a vector query AND authored by a specific regulatory body AND effective after a specific date — AlloyDB for PostgreSQL with the pgvector extension provides a unified SQL interface. The trade-off: lower ANN throughput than dedicated Vector Search at multi-million scale, but simpler architecture and full ACID compliance for transactional workloads.

Lesson 2 Quiz

Embedding Strategy & Vector Search · 4 questions

What does setting task_type=RETRIEVAL_DOCUMENT vs RETRIEVAL_QUERY do in Vertex AI text-embedding-004?

Correct. The model uses different learned projection heads for documents and queries, bridging the gap between formal document language and informal query phrasing. This consistently improves retrieval recall by 5–15% over symmetric retrieval.

Not quite. The dimensions remain 768 in both cases. The task_type parameter applies different projection heads so that query vectors and document vectors are aligned in embedding space despite their stylistic differences.

When would you choose AlloyDB with pgvector over Vertex AI Vector Search for a RAG pipeline?

Correct. AlloyDB with pgvector's key advantage is combining vector ANN search with SQL predicates in a single query — essential for workloads that filter on structured metadata alongside semantic similarity. Vector Search offers higher throughput at massive scale.

Not correct. AlloyDB pgvector shines when you need to combine vector search with structured SQL in a single query. For pure ANN at billion-scale with maximum throughput, Vertex AI Vector Search is the better choice.

What is the purpose of a cross-encoder reranker in a RAG pipeline?

Correct. The bi-encoder retrieval stage (ANN search) is fast but coarser. A cross-encoder reranker takes the top-k candidates (e.g., 20) and re-scores each (query, chunk) pair with full cross-attention, producing a more accurate relevance ranking for the final top-3 to 5 chunks passed to the model.

Not quite. A cross-encoder reranker runs after ANN retrieval on a small candidate set. It processes (query, chunk) pairs jointly with full attention — more accurate than bi-encoder scoring but too slow to run over the entire corpus.

What does the Vertex AI Vector Search "Restricts" feature enable?

Correct. Token restricts (string labels) and numeric restricts (range filters) are attached to each datapoint. At query time, filters are applied before ANN distance computation, reducing the effective search space and enabling use cases like "find the most relevant compliance chunks in the finance department from the last 12 months."

Not correct. Restricts are metadata filters on the vector datapoints themselves — string labels (token restricts) and numeric ranges. They pre-filter the ANN search space at query time, not IAM access controls.

Lab 2: Embedding Model Selection & Vector Search Configuration

Interactive AI lab — practice embedding and retrieval design decisions

Your Scenario

You are evaluating embedding strategies for a RAG system serving medical research queries over 500,000 PubMed abstracts. Users phrase questions in natural language ("what drugs reduce LDL in diabetic patients?") while the abstracts use clinical and pharmacological terminology. Your retrieval evaluation shows the current symmetric general-purpose embeddings have 43% top-1 accuracy on a test set.

Discuss embedding model options, the asymmetric retrieval task_type approach, whether to fine-tune, Vector Search index configuration, and reranking strategy. Try at least 3 exchanges to complete the lab.

Embedding & Retrieval Strategy Assistant

Lab 2

43% top-1 accuracy on medical queries is a common starting point with generic symmetric embeddings — and it's very improvable. The vocabulary mismatch between patient-language questions and clinical-language abstracts is your main obstacle. Let's work through your options. Do you want to start with the quick wins (task_type configuration, retrieval parameter tuning) or the higher-effort paths (domain fine-tuning, reranker integration)?

Module 6 · Lesson 3

Query Pipeline & Grounded Generation

From user question to grounded answer — retrieval orchestration, prompt assembly, and citation handling.

How do you ensure an LLM uses only the retrieved context and doesn't silently fall back to its parametric knowledge?

In February 2024, British Columbia's Civil Resolution Tribunal ruled against Air Canada in Moffatt v. Air Canada after the airline's customer service chatbot gave a passenger incorrect information about its bereavement fare policy, telling him he could claim the discount retroactively when the actual policy did not allow it. The tribunal held Air Canada responsible for its chatbot's statements. For RAG designers the case illustrates a familiar failure mode: when retrieval does not surface the governing policy text, a model can fall back on parametric knowledge and produce a plausible-sounding but wrong answer. The case became a canonical reference in enterprise AI risk management for the necessity of retrieval failure detection and confidence-gated responses.

The Query Pipeline

Retrieval itself completes in milliseconds, but the query pipeline involves several distinct steps, each with failure modes that must be handled explicitly in a production system. Treating the pipeline as a single black-box call is the most common source of quality and safety failures.

Query Analysis & Transformation

The raw user query may need transformation before embedding. Query expansion adds synonyms or related terms. HyDE (Hypothetical Document Embeddings) generates a hypothetical answer to the question, then embeds that — exploiting the fact that the embedding of a hypothetical answer is closer to real document embeddings than the query itself. Step-back prompting rewrites a specific question as a more general one to improve retrieval of background knowledge.

Vector Retrieval

The transformed query is embedded with task_type=RETRIEVAL_QUERY and passed to Vector Search FindNeighbors. Top-k candidates are returned with their datapoint IDs and distances. Metadata restricts are applied at this stage. The raw chunks are fetched from the backing store (GCS, Firestore, or Cloud SQL) using the datapoint IDs.

Reranking & Score Gating

Retrieved chunks are reranked by a cross-encoder. Chunks below a minimum relevance score threshold are dropped. If zero chunks remain above threshold, the pipeline short-circuits to a "no relevant information found" response, avoiding hallucination on unknown questions. This gate is the safeguard against the failure mode illustrated by the Air Canada case.

Prompt Assembly

Retrieved chunks are formatted into the context window with explicit structural markers. The system prompt instructs the model to answer only from provided context, cite sources by chunk ID, and explicitly state "I don't have enough information" when the context is insufficient — not fabricate a plausible answer.

Grounded Generation

Gemini 1.5 Pro or Flash generates the answer. Vertex AI's Grounding with Google Search (or with your corpus) can be enabled directly in the Vertex AI API, which handles retrieval and generation in a single call and attaches grounding metadata to the response automatically.

Citation Extraction & Verification

Post-generation, citations referenced by the model are verified against the actual retrieved chunks. This post-hoc attribution check catches cases where the model references a source ID it hallucinated rather than one that was provided. Only verified citations are returned to the user.
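The citation check in the last step can be sketched with a regular expression. This assumes the model was instructed to cite in the form [SOURCE: chunk_id]; `verify_citations` is a hypothetical helper name.

```python
import re

# Matches citations of the form [SOURCE: chunk_id]
CITATION_PATTERN = re.compile(r"\[SOURCE:\s*([^\]\s]+)\s*\]")

def verify_citations(answer: str, retrieved_ids: set[str]) -> dict[str, list[str]]:
    """Separate cited chunk IDs into those that were actually retrieved and those that were not."""
    cited = CITATION_PATTERN.findall(answer)
    return {
        "verified": [cid for cid in cited if cid in retrieved_ids],
        "unverified": [cid for cid in cited if cid not in retrieved_ids],
    }

answer = "Fees are waived [SOURCE: doc1-3]. Refunds take 5 days [SOURCE: doc9-1]."
result = verify_citations(answer, retrieved_ids={"doc1-3", "doc2-7"})
print(result)  # prints: {'verified': ['doc1-3'], 'unverified': ['doc9-1']}
```

An ID in the unverified list was never supplied to the model, so the claim attached to it should be removed or flagged before the answer is returned. Note that this check confirms only that the cited chunk exists, not that the chunk supports the claim.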

Prompt Design for Grounded Generation

The system prompt in a RAG pipeline carries unusual weight. It must override the model's default behavior — which is to be helpful by filling gaps from parametric knowledge — and enforce strict grounding. The following system prompt pattern has been validated in production deployments:

Validated System Prompt Pattern

Answer exclusively from the provided CONTEXT sections below. Do not use any knowledge not present in the context. If the context does not contain sufficient information to answer the question, respond with exactly: "I don't have enough information in the provided documents to answer this question." Do not speculate, extrapolate, or fill gaps. For each factual claim in your answer, append [SOURCE: chunk_id] referencing the context section that contains that claim.
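A system prompt like this depends on the context sections being clearly labeled with their chunk IDs. A minimal sketch of the assembly step, assuming each retrieved chunk is a dict with `chunk_id` and `text` keys; the `CONTEXT [id]` marker format and the function name `assemble_context` are illustrative choices, not a required format.

```python
def assemble_context(question: str, chunks: list[dict]) -> str:
    """Format retrieved chunks as labeled CONTEXT sections followed by the question."""
    sections = []
    for chunk in chunks:
        # The label carries the chunk ID so the model can cite it as [SOURCE: chunk_id].
        sections.append(f"CONTEXT [{chunk['chunk_id']}]\n{chunk['text'].strip()}")
    sections.append(f"QUESTION: {question.strip()}")
    return "\n\n".join(sections)

prompt = assemble_context(
    "What is the fee?",
    [{"chunk_id": "doc1-3", "text": " The fee is $20. "}],
)
print(prompt)
# CONTEXT [doc1-3]
# The fee is $20.
#
# QUESTION: What is the fee?
```

The assembled string would be sent as the user message, with the grounding instructions supplied separately as the system prompt.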

HyDE: Hypothetical Document Embeddings

HyDE is a retrieval technique published by Luyu Gao et al. (CMU, 2022) that addresses the query-document vocabulary mismatch more aggressively than asymmetric task_type alone. Instead of embedding the user's question directly, you first prompt an LLM to generate a hypothetical document that would answer the question, then embed that hypothetical document for retrieval.

The intuition: a hypothetical answer to "what is the dosage of metformin for type 2 diabetes?" will use the same clinical vocabulary as actual dosage documentation, even if the user's query was phrased informally. Empirically, HyDE improves recall on technical corpora by 15–25% over direct query embedding, at the cost of one additional LLM call per query (typically using a fast, cheap model like Gemini Flash for the hypothetical generation).
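The HyDE flow reduces to one extra generation step before embedding. A minimal sketch, assuming the LLM and the embedding model are passed in as plain callables; `hyde_query_vector` and the prompt wording are hypothetical, and the stubs below only stand in for real model calls.

```python
from typing import Callable, Sequence

def hyde_query_vector(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
) -> Sequence[float]:
    """Embed a hypothetical answer to the question instead of the question itself."""
    prompt = (
        "Write a short passage that answers the question.\n"
        f"Question: {question}\n"
        "Passage:"
    )
    hypothetical_doc = generate(prompt)  # one additional LLM call per query
    return embed(hypothetical_doc)       # this vector is used for the nearest neighbor search

# Stubs standing in for a fast LLM and an embedding model.
def fake_generate(prompt: str) -> str:
    return "Metformin is typically started at a low dose"

def fake_embed(text: str) -> list[float]:
    return [float(len(text.split()))]

print(hyde_query_vector("metformin dosage?", fake_generate, fake_embed))  # prints: [8.0]
```

The hypothetical passage is used only for retrieval; it is discarded afterwards and never shown to the user, since its content may be wrong.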

Multi-turn RAG: Conversation History

Production RAG systems must handle conversational context. A follow-up question like "what about the dosage for children?" has no standalone meaning without the prior turn. Two approaches:

Conversation condensation: Before retrieval, an LLM rewrites the current question incorporating context from previous turns into a standalone query. Simple and effective.

Multi-query retrieval: Retrieve separately for the current question and for a condensed summary of the conversation, then merge the result sets before reranking. More thorough but doubles retrieval latency.

Vertex AI's Conversation API can be combined with RAG by maintaining the conversation history in application code and condensing it before each retrieval call.
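Conversation condensation held in application code can be sketched as a prompt builder plus one LLM call. This is an illustration under assumptions: `llm` is any callable that maps a prompt string to a completion, and the function names and prompt wording are hypothetical.

```python
from typing import Callable

def build_condense_prompt(history: list[tuple[str, str]], follow_up: str) -> str:
    """Build a prompt asking the model to rewrite a follow-up as a standalone question."""
    lines = ["Rewrite the follow-up question as a standalone question.", "", "Conversation:"]
    for role, text in history:
        lines.append(f"{role}: {text}")
    lines += ["", f"Follow-up: {follow_up}", "Standalone question:"]
    return "\n".join(lines)

def condense(history: list[tuple[str, str]], follow_up: str, llm: Callable[[str], str]) -> str:
    """Return a retrievable standalone query; the first turn needs no rewriting."""
    if not history:
        return follow_up
    return llm(build_condense_prompt(history, follow_up)).strip()

history = [
    ("user", "What is the dosage of metformin for type 2 diabetes?"),
    ("assistant", "Adults usually start at a low dose."),
]
# A stub in place of a real model call.
stub_llm = lambda prompt: " What is the pediatric dosage of metformin?\n"
print(condense(history, "What about the dosage for children?", stub_llm))
# What is the pediatric dosage of metformin?
```

The condensed query, not the raw follow-up, is what gets embedded and sent to retrieval.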

Real Deployment — Box (2024)

Box integrated Vertex AI RAG into Box AI, enabling users to query their enterprise documents conversationally. Their engineering team published implementation details at Google Cloud Next 2024, noting that conversation condensation (rewriting multi-turn follow-up questions as standalone queries before retrieval) was the single highest-impact change in their evaluation: it reduced retrieval misses on conversational queries from 31% to 9%.

Vertex AI Grounding API

For teams that want managed grounding without building a retrieval pipeline, Vertex AI's Grounding feature handles retrieval and citation natively. When you call the Gemini API with grounding enabled, Vertex AI automatically retrieves from Google Search (for public knowledge) or from a configured RAG corpus (for private data), injects the retrieved context, generates a grounded response, and returns GroundingMetadata including source URLs, grounding chunks, and a grounding support score per claim.

The grounding support score (0–1) per claim is particularly valuable: it allows post-processing to filter low-confidence claims before presenting them to users, implementing a confidence-gated response strategy without custom pipeline code.
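Confidence gating on per-claim scores can be sketched as a simple partition. This assumes the scores have already been extracted from the response into a list of dicts; the `text` and `support_score` keys, the 0.7 threshold, and `gate_claims` are hypothetical and do not mirror the actual GroundingMetadata schema.

```python
def gate_claims(claims: list[dict], min_score: float = 0.7) -> tuple[list[dict], list[dict]]:
    """Partition claims into those at or above the support threshold and those below it."""
    kept = [c for c in claims if c["support_score"] >= min_score]
    dropped = [c for c in claims if c["support_score"] < min_score]
    return kept, dropped

claims = [
    {"text": "The filing deadline is 30 days.", "support_score": 0.93},
    {"text": "Extensions are granted automatically.", "support_score": 0.41},
]
kept, dropped = gate_claims(claims)
print([c["text"] for c in kept])     # prints: ['The filing deadline is 30 days.']
print([c["text"] for c in dropped])  # prints: ['Extensions are granted automatically.']
```

Dropped claims can be removed from the answer or shown with a warning, depending on how conservative the application needs to be.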

HyDE Hypothetical Document Embeddings. A retrieval technique that generates a hypothetical answer to the query and embeds that hypothetical document for retrieval, improving recall on technical corpora by bridging query-document vocabulary gaps.

Grounding Support Score Vertex AI Grounding API output. A per-claim confidence score (0–1) indicating how well each claim in the generated response is supported by the retrieved context. Used for confidence-gated response filtering.

Lesson 3 Quiz

Query Pipeline & Grounded Generation · 4 questions

What specific failure in the Air Canada chatbot case made it a canonical reference for RAG safety?

Correct. The chatbot stated bereavement fare terms that the airline's actual policy did not contain, and the tribunal held Air Canada responsible. This is the canonical argument for retrieval failure detection and confidence-gated "I don't know" responses.

Not correct. The failure was the chatbot filling a gap with policy terms that did not exist. It wasn't outdated data or a mismatch between documents: the stated terms were invented.

Why does HyDE improve retrieval quality on technical corpora?

Correct. A hypothetical answer to "what drugs treat condition X?" will naturally use the same clinical/technical terminology as actual documents in the corpus, even if the user's query was informal. This closes the vocabulary gap more effectively than task_type alone.

Not quite. HyDE works by generating a hypothetical answer (using a fast LLM call) and embedding that — because a hypothetical answer naturally uses the same domain vocabulary as real corpus documents, making it closer in embedding space to relevant chunks.

In a multi-turn RAG system, what does "conversation condensation" solve?

Correct. "What about the dosage for children?" cannot be retrieved against in isolation because the retriever doesn't know which drug the question is about. Conversation condensation rewrites this as "what is the pediatric dosage for metformin?" using the prior context, making it a retrievable standalone query.

Not quite. Conversation condensation addresses the referential ambiguity problem: follow-up questions often reference prior turns implicitly ("what about that?") and can't be retrieved from in isolation. The fix is rewriting them as standalone queries.

What does the Vertex AI Grounding API's "grounding support score" enable in a production RAG system?

Correct. The grounding support score is per-claim, not per-response. It lets post-processing logic identify which specific claims are well-supported by retrieved context vs. potentially extrapolated — implementing fine-grained confidence-gating without custom pipeline code.

Not quite. The grounding support score is a per-claim score in the generated response (0–1 per factual claim), used post-generation to filter out low-confidence claims before they reach the user. It operates after generation, not before.

Lab 3: Designing the Query Pipeline

Interactive AI lab — practice query orchestration and grounded generation design

Your Scenario

You are building a RAG system for a legal firm where associates query a corpus of 200,000 case law documents. The system must never fabricate case citations, must support multi-turn conversational queries ("what about the exception in the 1987 case?"), and must provide confidence levels that senior partners can use to decide whether to verify a reference manually.

Design the query pipeline: discuss retrieval failure handling, conversation condensation, HyDE applicability for legal queries, grounding support scores, and the system prompt strategy. Try at least 3 exchanges to complete the lab.

Query Pipeline Design Assistant

Lab 3

Legal RAG is one of the highest-stakes RAG applications precisely because a fabricated case citation looks entirely plausible. The Air Canada case you've likely read about is instructive here — retrieval failure with no confidence gate led to a fabricated policy being stated as fact. For legal, the bar is higher: zero tolerance for hallucinated citations. Let's design your pipeline with that constraint as the primary requirement. Where do you want to start — the retrieval failure handling architecture, or the multi-turn conversation strategy?

Module 6 · Lesson 4

Evaluation, Monitoring & Production Hardening

Measuring RAG quality, detecting drift, and building the operational infrastructure for a live pipeline.

How do you know your RAG pipeline is actually working — and how do you know when it stops?

After Anthropic published its Constitutional AI paper in 2022, several enterprises reported deploying RAG systems that performed well in pre-production evaluation but silently degraded over the first six months in production. The common cause: corpus drift — the document base was updated but the embeddings were stale, causing retrieval to return outdated chunks while the model generated answers that contradicted current policy. Without monitoring, this drift went undetected for weeks before support tickets revealed the discrepancy.

The RAG Evaluation Framework: RAGAS

RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework developed at Exploding Gradients (2023) that defines the key metrics for RAG quality evaluation. It is widely used for RAG evaluation and integrates directly with Vertex AI Evaluation Service.

RAGAS Metric

Context Precision

Of the chunks retrieved, what fraction were actually relevant to the question? Measures retrieval precision — low score means noisy retrieval filling the context with irrelevant text.

RAGAS Metric

Context Recall

Of all the information needed to answer the question, what fraction was present in the retrieved chunks? Measures retrieval coverage — low score means the model lacks key facts.

RAGAS Metric

Faithfulness

Are all claims in the generated answer supported by the retrieved context? Directly measures hallucination rate. The most important safety metric for regulated applications.

RAGAS Metric

Answer Relevance

Does the generated answer actually address the question asked? A response can be faithful to context but still not answer the question (e.g., answering a different but related question).
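To make the ratios behind the first three metrics concrete, here is a simplified Python sketch. It assumes the per-chunk, per-fact, and per-claim judgments have already been made (in RAGAS an LLM produces them), and it computes plain fractions as described above; the library's own formulas may differ, for example by weighting chunk rank. Answer Relevance is omitted because it does not reduce to a simple ratio over labels. All variable names are illustrative.

```python
def fraction(flags: list[bool]) -> float:
    """Share of True values in a list of judgments; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


# Judgments for one evaluated question (assumed to come from a judge).
chunk_is_relevant = [True, True, False, True]  # one flag per retrieved chunk
fact_is_covered = [True, False]                # one flag per fact in the ground-truth answer
claim_is_supported = [True, True, True]        # one flag per claim in the generated answer

context_precision = fraction(chunk_is_relevant)  # relevant chunks / retrieved chunks
context_recall = fraction(fact_is_covered)       # covered facts / needed facts
faithfulness = fraction(claim_is_supported)      # supported claims / all claims
```

In this example the retriever returned one noisy chunk (precision 0.75) and missed one needed fact (recall 0.5), while every generated claim was supported (faithfulness 1.0).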

Vertex AI Evaluation Service for RAG

Vertex AI Evaluation Service (part of Vertex AI Studio) provides a managed environment for running RAGAS and custom metrics against a test set. You provide a dataset of (question, ground-truth-answer, retrieved-context, generated-answer) tuples, and the service computes metrics using an LLM judge (typically Gemini Pro) to assess faithfulness and relevance.

Key integration points: the Evaluation Service connects to your Vertex AI pipeline runs, allowing you to automatically run evaluation after each corpus update or model change. Results are logged to Vertex AI Experiments for tracking metric trends over time — this is how you detect corpus drift systematically rather than waiting for user complaints.

LLM-as-Judge Calibration

The RAGAS faithfulness metric uses an LLM to judge whether each claim in the response is supported by the retrieved context. This works well for factual claims but has a known failure mode: the judge LLM may be too lenient on claims that are "directionally correct" but slightly wrong on the facts. For high-stakes applications (legal, medical, financial), augment LLM-as-judge with deterministic citation verification: use string matching to check whether each cited chunk ID actually contains the quoted text.
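A deterministic citation check of this kind might look like the following Python sketch. The chunk ID format, the `(chunk_id, quote)` pair shape, and the whitespace and case normalization are assumptions made for illustration; a real system would read chunks from its own store.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not cause misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_citations(
    citations: list[tuple[str, str]], chunks: dict[str, str]
) -> list[tuple[str, str]]:
    """Return the (chunk_id, quote) pairs whose quote is NOT found in the cited chunk."""
    failures = []
    for chunk_id, quote in citations:
        chunk_text = chunks.get(chunk_id, "")  # unknown chunk IDs fail the check
        if normalize(quote) not in normalize(chunk_text):
            failures.append((chunk_id, quote))
    return failures


chunks = {"doc7#3": "The limitation period is six years\nfrom the date of breach."}
citations = [
    ("doc7#3", "six years from the date of breach"),    # present in the chunk
    ("doc7#3", "three years from the date of breach"),  # misquoted
    ("doc9#1", "six years"),                            # cites a chunk that does not exist
]
failures = verify_citations(citations, chunks)
```

Exact substring matching is deliberately strict: a paraphrased quote fails, which is the intended behavior when the goal is to catch fabricated citations.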

Production Monitoring Architecture

A production RAG pipeline requires monitoring at three levels: retrieval quality, generation quality, and pipeline health.

Retrieval Quality Signals

Log the top-k retrieval scores for every query. Track the distribution of max-similarity scores over time; a downward shift indicates corpus staleness or embedding model mismatch. Alert when more than 5% of queries have a max similarity below a chosen floor (e.g., 0.7). Implement this in Cloud Monitoring with custom metrics emitted from your application.
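The alert condition can be expressed as a small Python check over a window of logged scores. This is a sketch only: the function names are hypothetical, the 0.7 floor and 5% limit are the example values from the text, and publishing the result to Cloud Monitoring is not shown.

```python
def low_similarity_fraction(max_scores: list[float], floor: float = 0.7) -> float:
    """Fraction of queries whose best retrieval score fell below the floor."""
    if not max_scores:
        return 0.0
    return sum(score < floor for score in max_scores) / len(max_scores)


def should_alert(
    max_scores: list[float], floor: float = 0.7, max_fraction: float = 0.05
) -> bool:
    """True when too many queries in the window had weak top matches."""
    return low_similarity_fraction(max_scores, floor) > max_fraction


# Max-similarity score per query over one monitoring window.
window = [0.91, 0.84, 0.66, 0.88, 0.79, 0.93, 0.62, 0.87, 0.90, 0.85]
alert = should_alert(window)  # 2 of 10 queries are below 0.7, well above the 5% limit
```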

Generation Quality Signals

Run online faithfulness checks on a sampled fraction (1–5%) of production responses using Vertex AI Evaluation. Log grounding support scores from the Grounding API on all responses. Track the rate of "I don't have enough information" responses — a rising rate may indicate corpus gaps or retrieval degradation.

Pipeline Health Signals

Monitor ingestion pipeline latency and error rates via Cloud Monitoring. Alert on corpus update failures — a failed ingestion that isn't detected means the knowledge store silently falls behind. Track Vector Search index freshness: the gap between document creation time and vector upsert time should stay below your SLO (e.g., 15 minutes for near-real-time use cases).
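The freshness check reduces to comparing two timestamps per document. The Python sketch below assumes you already record a creation time and an upsert time for each document; the record shape and function name are hypothetical, and the 15-minute SLO is the example value from the text.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness_violations(
    records: list[tuple[str, datetime, Optional[datetime]]],
    now: datetime,
    slo: timedelta = timedelta(minutes=15),
) -> list[str]:
    """Return IDs of documents whose creation-to-upsert gap exceeds the SLO.

    A document that has not been upserted yet is measured against `now`.
    """
    violations = []
    for doc_id, created_at, upserted_at in records:
        lag = (upserted_at or now) - created_at
        if lag > slo:
            violations.append(doc_id)
    return violations


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
records = [
    ("a.pdf", t0, t0 + timedelta(minutes=4)),   # indexed within the SLO
    ("b.pdf", t0, t0 + timedelta(minutes=40)),  # indexed late
    ("c.pdf", t0, None),                        # never indexed
]
late = freshness_violations(records, now=t0 + timedelta(hours=1))
```

Treating a missing upsert time as "still waiting" is what surfaces silently failed ingestions, not just slow ones.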

User Feedback Loop

Instrument explicit thumbs up/down on responses. Log the query, retrieved chunks, and response for each negative feedback instance. Route high-confidence negative feedback to human review and use it to extend the RAGAS evaluation test set over time. This creates a flywheel: production failures improve the test set, which improves evaluation coverage.

Corpus Drift Detection & Remediation

Corpus drift occurs in two forms. Content drift: documents are updated but the old embeddings remain indexed, causing the retriever to return outdated chunks. Coverage drift: new documents covering new topics are added but not yet indexed, causing retrieval to fail on queries about those topics.

Detection: compare your evaluation set's context recall metric week-over-week. A drop in context recall typically indicates coverage drift. A drop in faithfulness with stable recall indicates content drift (correct documents retrieved but their content has changed).
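The detection rule can be written as a simple week-over-week comparison. In this Python sketch the metric dictionaries, the 0.05 minimum drop, and the handling of the case where both metrics fall are assumptions for illustration; the text above only specifies the two single-metric cases.

```python
def classify_drift(
    previous: dict[str, float], current: dict[str, float], min_drop: float = 0.05
) -> str:
    """Map week-over-week metric drops to the drift type they most likely indicate."""
    recall_dropped = previous["context_recall"] - current["context_recall"] >= min_drop
    faith_dropped = previous["faithfulness"] - current["faithfulness"] >= min_drop
    if recall_dropped and faith_dropped:
        return "coverage and content drift both possible"  # assumed handling
    if recall_dropped:
        return "coverage drift suspected"
    if faith_dropped:
        return "content drift suspected"
    return "no drift signal"


last_week = {"context_recall": 0.88, "faithfulness": 0.94}
coverage_case = classify_drift(last_week, {"context_recall": 0.71, "faithfulness": 0.93})
content_case = classify_drift(last_week, {"context_recall": 0.87, "faithfulness": 0.79})
```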

Remediation: implement a document fingerprinting system — hash each document's content at ingestion time, and re-run hashes periodically. Documents whose content hash has changed trigger automatic re-embedding and vector upsert. On Vertex AI, this can be implemented as a Cloud Scheduler job that triggers a Dataflow pipeline checking GCS object metadata and running incremental re-indexing.
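The core of fingerprinting is small enough to show directly. This Python sketch uses SHA-256 from the standard library and in-memory dictionaries as stand-ins for a metadata store and the live corpus; the function names are hypothetical, and the scheduling and re-embedding steps are not shown.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Content hash recorded for a document at ingestion time."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_documents(stored: dict[str, str], live: dict[str, str]) -> list[str]:
    """Return IDs of documents that are new or whose content hash has changed.

    stored maps doc_id to the hash saved at ingestion; live maps doc_id to current text.
    """
    return sorted(
        doc_id for doc_id, text in live.items() if stored.get(doc_id) != fingerprint(text)
    )


stored = {
    "policy.txt": fingerprint("Refunds within 30 days."),
    "faq.txt": fingerprint("Opening hours: 9 to 5."),
}
live = {
    "policy.txt": "Refunds within 14 days.",  # edited since ingestion
    "faq.txt": "Opening hours: 9 to 5.",      # unchanged
    "new.txt": "Brand new document.",         # never ingested
}
to_reembed = changed_documents(stored, live)
```

This sketch flags changed and new documents only; detecting deleted documents would need the reverse comparison (IDs in `stored` that are missing from `live`).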

End-to-End Pipeline on Vertex AI

A fully productionized RAG pipeline on Google Cloud uses the following architecture:

Ingestion: Cloud Storage → Eventarc trigger → Cloud Run ingestion service (Document AI + chunking + text-embedding-004) → Vector Search UpsertDatapoints + Firestore (metadata + raw chunks).

Query: API Gateway → Cloud Run query service → text-embedding-004 (query) → Vector Search FindNeighbors → Firestore fetch → optional reranker → Gemini 1.5 Pro with grounding → response with citations.

Monitoring: All service logs → Cloud Logging → BigQuery export → RAGAS evaluation via Vertex AI Evaluation Service (scheduled via Cloud Scheduler) → dashboards in Looker Studio.
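The query path above can be outlined as a single orchestration function. The Python sketch below is not the Vertex AI SDK: `embed`, `search`, `fetch`, and `generate` are hypothetical callables standing in for the embedding model, Vector Search, Firestore, and Gemini respectively, the optional reranker is omitted, and the 0.7 similarity floor is an assumed value.

```python
from typing import Callable

Vector = list[float]


def answer_query(
    question: str,
    embed: Callable[[str], Vector],
    search: Callable[[Vector, int], list[tuple[str, float]]],  # returns (chunk_id, similarity)
    fetch: Callable[[list[str]], dict[str, str]],              # returns chunk_id -> chunk text
    generate: Callable[[str, list[str]], str],
    top_k: int = 5,
    min_similarity: float = 0.7,
) -> dict:
    """Embed the query, retrieve chunks, and generate an answer with citations."""
    hits = search(embed(question), top_k)
    kept = [chunk_id for chunk_id, score in hits if score >= min_similarity]
    if not kept:
        # Nothing relevant retrieved: refuse instead of letting the model guess.
        return {"answer": "I don't have enough information to answer that.", "citations": []}
    texts = fetch(kept)
    answer = generate(question, [texts[chunk_id] for chunk_id in kept])
    return {"answer": answer, "citations": kept}


# Toy stand-ins so the flow can be run end to end.
store = {"c1": "Refunds are issued within 14 days.", "c2": "Offices close at 5pm."}
result = answer_query(
    "How long do refunds take?",
    embed=lambda text: [float(len(text))],
    search=lambda vector, k: [("c1", 0.83), ("c2", 0.35)][:k],
    fetch=lambda ids: {chunk_id: store[chunk_id] for chunk_id in ids},
    generate=lambda question, context: f"Based on {len(context)} source(s): {context[0]}",
)
```

Passing the services in as callables keeps the orchestration logic testable without any cloud dependency.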

Real Deployment — BBVA (2024)

BBVA deployed a RAG pipeline on Google Cloud for its internal regulatory compliance assistant, querying 1.2 million regulatory documents across 30+ jurisdictions. Their production monitoring architecture, described at Google Cloud Next 2024, runs RAGAS evaluation on a 500-question test set after every corpus update (triggered by document change events). Faithfulness dropped from 94% to 79% in Q2 2024 when a batch of regulatory PDFs with non-standard table layouts caused chunking failures — detected within 4 hours by the automated evaluation, before user complaints. Manual re-ingestion with Document AI table extraction restored faithfulness to 93% within the same business day.

RAGAS: Retrieval Augmented Generation Assessment. Open-source framework defining Context Precision, Context Recall, Faithfulness, and Answer Relevance metrics for systematic RAG evaluation. Integrates with Vertex AI Evaluation Service.

Corpus Drift: The degradation of retrieval quality over time as the document corpus changes while stored embeddings remain stale. Manifests as declining context recall (coverage drift) or declining faithfulness (content drift).

Document Fingerprinting: Storing a content hash of each document at ingestion time. Periodic hash comparison detects changed documents and triggers automatic re-embedding and vector upsert, preventing content drift.

Lesson 4 Quiz

Evaluation, Monitoring & Production Hardening · 4 questions

Which RAGAS metric most directly measures whether a RAG system is hallucinating?

Correct. Faithfulness measures whether every claim in the generated response is supported by the retrieved context. A claim not traceable to a retrieved chunk is, by definition, hallucinated — either from parametric knowledge or confabulated. This is the primary safety metric for regulated applications.

Not correct. Faithfulness is the RAGAS metric that directly measures hallucination — whether each claim in the response is supported by the retrieved context. Context Precision/Recall measure retrieval quality; Answer Relevance measures whether the response addresses the question.

A production RAG system's context recall metric drops 15% over 4 weeks while faithfulness remains stable. What is the most likely cause?

Correct. Falling context recall means the retriever is failing to find information needed to answer questions — indicating coverage gaps in the index. If faithfulness were falling too, it would suggest content drift (stale embeddings of changed documents). Stable faithfulness means the retrieved documents are accurate, just incomplete.

Not quite. Content drift would cause faithfulness to fall (stale embeddings return outdated content that contradicts correct answers). Coverage drift causes context recall to fall — needed information isn't in the index at all. Stable faithfulness with falling recall points to coverage drift.

In the BBVA deployment described in this lesson, what triggered the automated detection of the faithfulness drop from 94% to 79%?

Correct. BBVA's architecture runs automated RAGAS evaluation on a 500-question test set after every corpus update. The non-standard PDF table layouts caused chunking failures in the new regulatory batch — faithfulness dropped from 94% to 79% and was caught by the automated evaluation run within 4 hours, before any user complaints.

Not quite. The detection mechanism was automated RAGAS evaluation triggered by corpus update events — specifically, an evaluation run on a 500-question test set that fires after each batch ingestion, catching quality degradation before users notice.

What is "document fingerprinting" in the context of RAG pipeline maintenance?

Correct. Document fingerprinting stores a content hash (e.g., SHA-256) of each document's text at ingestion. Periodic jobs re-hash the live corpus and compare — any document with a changed hash is flagged for re-embedding and vector upsert, preventing content drift from silently degrading retrieval quality.

Not quite. Document fingerprinting is a corpus maintenance technique: hash each document's content at ingestion, periodically re-hash, and re-embed any document whose content has changed. It's the primary mechanism for preventing content drift.

Lab 4: RAG Evaluation & Monitoring Strategy

Interactive AI lab — practice RAGAS metrics and production monitoring design

Your Scenario

You have just deployed a RAG pipeline for an insurance company's underwriting team. It queries 300,000 policy documents and historical claims data. Six weeks post-launch, the underwriting team reports that the system is "giving different answers to the same questions" and seems "confused about our new policy updates from last month." You need to diagnose the problem and design a monitoring system to prevent recurrence.

Diagnose whether this is content drift or coverage drift, discuss what RAGAS metrics to run and how to interpret them, design the monitoring architecture, and plan the remediation. Try at least 3 exchanges to complete the lab.

RAG Evaluation & Monitoring Assistant


"Different answers to the same questions" combined with "confused about new policy updates from last month" is a strong signal — and it points in a specific direction. When users describe inconsistency, it often means the retriever is returning a mix of old and new content: some queries hit the updated policy chunks, others hit stale embeddings of the old version. Let's work through the diagnosis. First question: do you know whether the policy updates from last month were re-ingested through the pipeline, or were they just added to the source system without triggering a re-embedding run?

Module 6 Test

Building a RAG Pipeline End to End · 15 questions · 80% to pass

1. What is the primary architectural difference between RAG and fine-tuning as approaches to grounding LLM responses in specific knowledge?

Correct.

Incorrect. The key architectural difference is where knowledge lives — external updatable store (RAG) vs. model weights (fine-tuning).

2. Which Vertex AI service handles document parsing including scanned PDF OCR as part of the RAG ingestion pipeline?

Correct. Document AI handles complex PDF layouts, scanned documents via OCR, and produces structured Document objects that preserve semantic boundaries for downstream chunking.

Incorrect. Document AI is the Google Cloud service for complex document parsing including OCR on scanned PDFs, preserving structural information like tables, headings, and form fields.

3. What is the "parent-child chunking" pattern's core trade-off?

Correct. Index small for precision retrieval; return large for model context quality. This pattern is native to LlamaIndex and replicable via LangChain's ParentDocumentRetriever.

Incorrect. Parent-child chunking indexes small chunks for ANN precision but passes the larger parent chunk to the model, providing better context for generating a coherent answer.

4. In Vertex AI Vector Search, what is the purpose of "token restricts" on datapoints?

Correct. Token restricts are string labels attached to each vector datapoint. At query time, you pass filter conditions (e.g., department=legal, status=active) that pre-filter the search space before ANN distance computation.

Incorrect. Token restricts are string label metadata on vectors — categorical filters applied before ANN search. They're unrelated to authentication or text truncation.

5. Why is chunk overlap (e.g., 50 tokens of overlap between consecutive 512-token chunks) important?

Correct. Without overlap, a sentence split across the boundary of two consecutive chunks might not appear whole in either chunk, so neither chunk retrieves it well. Overlap ensures boundary-spanning information is captured in at least one chunk.

Incorrect. Overlap ensures information at chunk boundaries appears fully in at least one chunk. A sentence split across a hard boundary with no overlap might appear incomplete in both adjacent chunks.
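A sliding window makes the overlap mechanics concrete. The Python sketch below works on a pre-tokenized list and uses hypothetical names; the defaults mirror the 512-token chunks with 50 tokens of overlap from the question, while the demonstration uses tiny values so the result is easy to inspect.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size` where consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, size=4, overlap=1)
```

Here the last token of each chunk is repeated as the first token of the next, so content at a boundary is present in both neighbors.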

6. What Vertex AI managed service call initiates ingestion from a GCS path in the RAG Engine?

Correct. ImportRagFiles handles the full pipeline — parse, chunk, embed, index — from a GCS or Drive source. The query-time companion is RetrieveContexts.

Incorrect. The RAG Engine uses ImportRagFiles for ingestion and RetrieveContexts for querying.

7. The Spotify engineering team reduced irrelevant retrieval by 60% primarily by doing what?

Correct. General-purpose embeddings failed on music-industry-specific vocabulary. A domain-adapted model fine-tuned on Spotify's corpus resolved the vocabulary mismatch, reducing irrelevant retrieval by 60%.

Incorrect. Spotify's improvement came from domain-adapted embedding fine-tuning. General models couldn't disambiguate music-industry terms (like "key" or "release") from their generic meanings.

8. Which statement correctly describes HyDE (Hypothetical Document Embeddings)?

Correct. A hypothetical answer naturally uses the vocabulary of real documents, making it closer in embedding space to relevant corpus chunks than a casual user query would be. Published by Luyu Gao et al. (CMU, 2022).

Incorrect. HyDE generates a hypothetical document that would answer the query — using domain vocabulary — then embeds that hypothetical document for ANN search rather than embedding the original query.

9. What is the role of a score threshold (minimum similarity) in the retrieval pipeline?

Correct. Score thresholds are the primary defense against "nothing relevant retrieved → model hallucinated a plausible answer" — the exact Air Canada failure mode. If retrieval confidence is below threshold, short-circuit to "I don't know."

Incorrect. Score thresholds gate the pipeline on retrieval quality. If no retrieved chunk exceeds the minimum similarity, the system returns a "no information found" response rather than risking hallucination.

10. In the RAGAS framework, a high Context Precision score combined with a low Context Recall score indicates what condition?

Correct. High precision = what is retrieved is relevant. Low recall = not everything relevant is being retrieved. This pattern suggests coverage gaps in the corpus or a top-k value that's too small.

Incorrect. High precision with low recall means the retriever finds good chunks when it finds anything, but misses significant portions of the relevant information — pointing to coverage gaps or insufficient top-k.

11. What is "conversation condensation" and why is it needed in multi-turn RAG?

Correct. "What about the dosage for children?" is not a retrievable query without knowing from prior context that the conversation is about metformin. Condensation rewrites it as "what is the pediatric dosage for metformin?" — a standalone retrievable query.

Incorrect. Conversation condensation rewrites implicit follow-up questions as standalone queries. "What about the exception?" is meaningless to a retriever without the prior context that establishes which exception is meant.

12. Which Vertex AI Grounding API output enables per-claim confidence gating of generated responses?

Correct. The grounding support score is a per-claim 0–1 value in the GroundingMetadata. Post-processing logic can filter claims below a threshold (e.g., 0.7) from the displayed response, implementing fine-grained confidence gating.

Incorrect. The Grounding API returns a grounding support score per claim (0–1). This is distinct from a single overall response score and allows selective filtering of low-confidence individual claims.

13. What is "document fingerprinting" designed to prevent in a production RAG pipeline?

Correct. Document fingerprinting (content hashing) detects when source documents have changed since last ingestion. Changed documents trigger re-embedding and vector upsert, preventing content drift where the retriever returns chunks reflecting old policy or outdated facts.

Incorrect. Document fingerprinting specifically addresses content drift: detecting that a document has changed since it was last embedded, and triggering re-ingestion so the vector store reflects current content.

14. Box's engineering team reported that one change reduced retrieval misses on conversational queries from 31% to 9%. What was it?

Correct. As reported at Google Cloud Next 2024, conversation condensation was Box AI's highest single-impact improvement — resolving the implicit reference problem in conversational follow-up queries.

Incorrect. Box's reported highest-impact change was conversation condensation, rewriting follow-up questions that depend on prior context into standalone retrieval queries.

15. In the full end-to-end Vertex AI RAG architecture described in Lesson 4, what triggers the ingestion pipeline when a new document is added to Cloud Storage?

Correct. The production architecture uses Eventarc to respond to GCS object creation events in near-real-time, triggering the Cloud Run ingestion service. This is the event-driven pattern for maintaining index freshness without polling overhead.

Incorrect. The architecture uses Eventarc for event-driven ingestion — GCS object creation events trigger Eventarc, which invokes the Cloud Run ingestion service. This avoids polling and achieves near-real-time index freshness.