In 2023, Google DeepMind's internal documentation team confronted a problem shared by thousands of enterprises: a corpus of tens of thousands of engineering documents that language models hallucinated over freely. The fix was not a bigger model — it was a retrieval layer. Their approach, now published in the Vertex AI documentation, became the template for production RAG on Google Cloud.
Retrieval-Augmented Generation (RAG) addresses a structural problem with large language models: they encode knowledge at training time, making that knowledge static, unverifiable, and sometimes fabricated. RAG separates the knowledge store from the reasoning engine, letting you update one without retraining the other.
The canonical architecture has two phases: an offline ingestion phase that converts documents into searchable vector representations, and an online query phase that retrieves relevant chunks and passes them as context to the model at inference time.
Ingestion is more complex than it appears. A naive approach — split every document at 512-token boundaries — produces chunks that fracture sentences, sever tables, and destroy context. Production ingestion requires deliberate choices at every stage.
As of late 2024, Vertex AI ships a managed RAG Engine API that handles ingestion and retrieval as a service. You call ImportRagFiles with a GCS path, and the API parses, chunks, embeds, and indexes automatically. This is appropriate for prototypes and moderate-scale workloads. For pipelines requiring custom chunking logic, metadata enrichment, or multi-modal ingestion, the custom pipeline approach covered in this module is still necessary.
Chunk size is the single most impactful ingestion decision. Small chunks (128–256 tokens) capture precise facts but lose surrounding context, making retrieved chunks hard for the model to interpret in isolation. Large chunks (1024–2048 tokens) preserve context but dilute relevance — the retrieved chunk contains more noise alongside the relevant passage.
The parent-child chunking pattern resolves this tension: index small child chunks for precise retrieval, but at query time return each matched child's parent chunk, which carries the larger surrounding context, to the model. LlamaIndex supports this pattern, and LangChain provides it as ParentDocumentRetriever.
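A minimal, framework-independent sketch of the parent-child pattern follows. It assumes whitespace-separated words as a stand-in for tokens, and the names `Child`, `build_parent_child_index`, and `parents_for_hits` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Child:
    child_id: str
    parent_id: str
    text: str


def split_words(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_parent_child_index(doc_id: str, text: str,
                             parent_size: int = 8, child_size: int = 4):
    """Return (parents, children): parents keyed by id, children to be embedded."""
    parents: dict[str, str] = {}
    children: list[Child] = []
    for p, parent_text in enumerate(split_words(text, parent_size)):
        parent_id = f"{doc_id}-p{p}"
        parents[parent_id] = parent_text
        for c, child_text in enumerate(split_words(parent_text, child_size)):
            children.append(Child(f"{parent_id}-c{c}", parent_id, child_text))
    return parents, children


def parents_for_hits(hits: list[Child], parents: dict[str, str]) -> list[str]:
    """Map retrieved children to parent texts, de-duplicated in hit order."""
    seen: list[str] = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.append(hit.parent_id)
    return [parents[pid] for pid in seen]
```

Only the child texts would be embedded and indexed; the parent dictionary lives in an ordinary key-value store and is consulted after retrieval.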
Overlap between consecutive chunks (typically 10–15% of chunk size) limits information loss at boundaries: a fact that would otherwise be split across two chunks appears in full in at least one of them, provided the fact is no longer than the overlap.
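The sliding-window behaviour can be sketched as below. This is a simplified illustration that operates on a pre-tokenized list and ignores sentence or table boundaries; `chunk_with_overlap` is a hypothetical helper name.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int,
                       overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With a chunk size of 512 tokens, an overlap of 10–15% corresponds to roughly 51–77 shared tokens between neighbours.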
Lufthansa Group's technical operations team built a RAG pipeline on Google Cloud to answer mechanic queries against 2.4 million pages of aircraft maintenance manuals. The ingestion pipeline used Document AI for PDF parsing, custom semantic chunking aligned to maintenance procedure sections, and Vertex AI Vector Search for retrieval. Query latency averaged 1.2 seconds end-to-end. The system reduced manual document search time by an estimated 40% per query session, as reported in Google Cloud Next 2024 session content.
The managed RAG Engine trades flexibility for simplicity. It handles the entire ingestion pipeline internally, supports direct GCS and Google Drive sources, and exposes a single RetrieveContexts API for query time. For many enterprise use cases — internal knowledge bases, support document retrieval, policy Q&A — it is the right choice.
Custom pipelines are warranted when you need: custom chunking logic (e.g., splitting by XML tags or code function boundaries), multi-step metadata enrichment (calling an LLM to generate summary embeddings alongside content embeddings), hybrid retrieval combining vector search with keyword BM25, or cross-corpus federation across multiple vector stores.
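As an illustration of the hybrid retrieval case, the sketch below merges a vector-search ranking and a BM25 ranking with reciprocal rank fusion (RRF). RRF is one common fusion method chosen here as an assumption; the text does not prescribe a specific one, and the constant `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) per chunk, so chunks ranked
    highly by more than one retriever rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))
```

Because RRF uses only ranks, it sidesteps the problem that cosine similarities and BM25 scores live on incompatible scales.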
You are a data engineer at a financial services firm. Your team has 85,000 regulatory compliance documents (PDFs, HTML pages, Word files) ranging from 2 to 400 pages each. You need to design a RAG ingestion pipeline on Google Cloud that supports daily updates, metadata filtering by regulation type and jurisdiction, and sub-2-second query latency.
In 2024, Spotify's engineering team published a post-mortem on their internal knowledge base search system. Their first RAG prototype used general-purpose embeddings and returned irrelevant chunks for music-industry-specific terminology — "key" meaning musical key retrieved documents about API keys, "release" meaning album release retrieved documents about software releases. Switching to a domain-adapted embedding model, fine-tuned on Spotify's internal corpus, reduced irrelevant retrieval by 60%. The lesson: embedding model selection is not a default choice.
An embedding model maps text to a point in a high-dimensional vector space such that semantically similar texts are geometrically close. The quality of this mapping determines retrieval quality — no amount of downstream prompt engineering compensates for an embedding model that cannot distinguish relevant from irrelevant content in your specific domain.
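"Geometrically close" is usually measured with cosine similarity. The sketch below shows the computation on toy two-dimensional vectors; real embeddings have hundreds of dimensions, and a production system would use an approximate nearest neighbor index rather than this brute-force scan. The function names are illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query: list[float], corpus: dict[str, list[float]],
            top_k: int) -> list[str]:
    """Return the ids of the top_k corpus vectors most similar to the query."""
    ranked = sorted(corpus.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:top_k]]
```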
Two dimensions matter: semantic faithfulness (does the model understand the meaning of your domain's vocabulary?) and retrieval asymmetry (can the model bridge the gap between how users phrase questions and how documents express answers?). General-purpose models handle common knowledge well but degrade on technical, legal, medical, or domain-specific corpora.
Vertex AI exposes several embedding models with different trade-offs:
A key insight in modern retrieval: the optimal embedding for a document chunk and the optimal embedding for a query about that chunk are different. Documents are dense, formal, and contain answers. Queries are short, informal, and contain questions. Using the same embedding function for both produces a systematic mismatch.
Vertex AI's text-embedding-004 addresses this via the task_type parameter. During ingestion, set task_type to RETRIEVAL_DOCUMENT. At query time, set it to RETRIEVAL_QUERY. The model then produces embeddings tuned to each role, which in effect aligns query space with document space. Reported gains are 5–15% in recall@10 over symmetric retrieval on BEIR benchmarks; measure the effect on your own corpus before relying on that range.
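A sketch of how the two task types might be wired into one helper using the Vertex AI Python SDK. It assumes the `google-cloud-aiplatform` package is installed, that the application has already called `vertexai.init(...)` with its own project and location, and that credentials are available. `task_type_for` and `embed` are hypothetical helper names.

```python
TASK_TYPES = {
    "document": "RETRIEVAL_DOCUMENT",  # used during ingestion
    "query": "RETRIEVAL_QUERY",        # used at query time
}


def task_type_for(role: str) -> str:
    """Translate a pipeline role into the embedding task_type value."""
    try:
        return TASK_TYPES[role]
    except KeyError:
        raise ValueError(f"role must be one of {sorted(TASK_TYPES)}") from None


def embed(texts: list[str], role: str,
          model_name: str = "text-embedding-004") -> list[list[float]]:
    """Embed texts with the task_type matching their role."""
    # Imported lazily so this module loads even where the SDK is absent.
    from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

    model = TextEmbeddingModel.from_pretrained(model_name)
    inputs = [TextEmbeddingInput(text=t, task_type=task_type_for(role))
              for t in texts]
    return [embedding.values for embedding in model.get_embeddings(inputs)]
```

Routing every call through a single helper makes it harder to embed a query with the document task type by accident, which would silently reintroduce symmetric retrieval.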
Google's internal evaluation on the Natural Questions benchmark shows that switching from symmetric to asymmetric retrieval (same model, different task_type) improves top-1 retrieval accuracy by approximately 8 percentage points. At a volume of 100,000 queries, an 8-point gain means roughly 8,000 additional queries retrieve the correct chunk first, with no other change to the pipeline.
Vertex AI Vector Search (formerly Matching Engine) is a managed approximate nearest neighbor service built on Google's ScaNN algorithm. It supports three deployment modes relevant to RAG pipelines:
The top_k parameter controls how many chunks are retrieved per query. The right value is not obvious: too low and you miss relevant chunks; too high and you flood the model context with noise, degrading generation quality. Empirical evaluation on a test set of query-answer pairs is the only reliable method for calibrating this.
In practice, retrieve more (k=20) and apply a reranking step before passing to the model. A cross-encoder reranker (such as Vertex AI's built-in reranking or a custom model) re-scores each retrieved chunk against the query using full attention, selecting the final top 3–5 for the prompt. Cross-encoders are too slow to run over the entire corpus but fast enough over 20 candidates.
Score thresholds (minimum cosine similarity) filter out retrievals where even the top-k results are semantically distant from the query. If all retrieved chunks fall below threshold, return "I don't know" rather than hallucinate — a critical quality gate for regulated industries.
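The retrieve-wide, threshold, then rerank flow can be sketched as follows. The word-overlap scorer is a deliberately crude stand-in for a cross-encoder, and the threshold of 0.5 and `final_k` of 3 are placeholder values to be calibrated on an evaluation set; all names are illustrative.

```python
from typing import Callable


def overlap_score(query: str, chunk: str) -> float:
    """Toy reranker: number of distinct words shared by query and chunk."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def select_context(query: str,
                   candidates: list[tuple[str, float]],
                   rerank_score: Callable[[str, str], float] = overlap_score,
                   min_similarity: float = 0.5,
                   final_k: int = 3) -> list[str]:
    """Filter first-stage hits by similarity, then rerank and truncate.

    candidates holds (chunk_text, cosine_similarity) pairs from vector search.
    An empty result signals that the caller should abstain.
    """
    kept = [text for text, similarity in candidates
            if similarity >= min_similarity]
    kept.sort(key=lambda text: rerank_score(query, text), reverse=True)
    return kept[:final_k]
```

When `select_context` returns an empty list, the calling code should return the "I don't know" response without invoking the model at all.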
For RAG pipelines that need to combine vector search with structured SQL queries — for example, retrieving compliance documents matching a vector query AND authored by a specific regulatory body AND effective after a specific date — AlloyDB for PostgreSQL with the pgvector extension provides a unified SQL interface. The trade-off: lower ANN throughput than dedicated Vector Search at multi-million scale, but simpler architecture and full ACID compliance for transactional workloads.
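A sketch of the combined vector-plus-filter query, built as a parameterized statement. The table `compliance_chunks` and its columns are hypothetical; `<=>` is pgvector's cosine distance operator. Executing the statement would require a PostgreSQL driver such as psycopg, which is not shown.

```python
from datetime import date


def build_hybrid_query(query_embedding: list[float], regulator: str,
                       effective_after: date, top_k: int = 20):
    """Return (sql, params) for a filtered nearest-neighbor search."""
    # pgvector accepts vectors as a bracketed, comma-separated literal.
    vector_literal = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"
    sql = (
        "SELECT chunk_id, content, embedding <=> %s::vector AS distance "
        "FROM compliance_chunks "          # hypothetical table name
        "WHERE regulator = %s AND effective_date > %s "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    params = (vector_literal, regulator, effective_after,
              vector_literal, top_k)
    return sql, params
```

Passing values as parameters rather than formatting them into the SQL string keeps user-supplied filter values out of the statement text.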
You are evaluating embedding strategies for a RAG system serving medical research queries over 500,000 PubMed abstracts. Users phrase questions in natural language ("what drugs reduce LDL in diabetic patients?") while the abstracts use clinical and pharmacological terminology. Your retrieval evaluation shows the current symmetric general-purpose embeddings have 43% top-1 accuracy on a test set.
In February 2024, Air Canada lost a case before British Columbia's Civil Resolution Tribunal after its customer service chatbot gave a customer inaccurate information about bereavement fares, describing a retroactive refund option that the airline's actual policy did not allow. The tribunal held Air Canada responsible for its chatbot's statements. The failure pattern is the one every RAG system must guard against: when retrieval does not surface the governing policy text, the model falls back on parametric knowledge and generates a plausible-sounding but incorrect answer. The case became a canonical reference in enterprise AI risk management for the necessity of retrieval failure detection and confidence-gated responses.
The query pipeline completes within a second or two but involves several distinct steps, each with failure modes that must be handled explicitly in a production system. Treating the pipeline as a single black-box call is the most common source of quality and safety failures.
The system prompt in a RAG pipeline carries unusual weight. It must override the model's default behavior — which is to be helpful by filling gaps from parametric knowledge — and enforce strict grounding. The following system prompt pattern has been validated in production deployments:
Answer exclusively from the provided CONTEXT sections below. Do not use any knowledge not present in the context. If the context does not contain sufficient information to answer the question, respond with exactly: "I don't have enough information in the provided documents to answer this question." Do not speculate, extrapolate, or fill gaps. For each factual claim in your answer, append [SOURCE: chunk_id] referencing the context section that contains that claim.
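One way to pair this system prompt with retrieved chunks is to label each context section with its chunk id, so the model has something concrete to cite, and to detect the exact fallback sentence in the response. The section layout and function names below are assumptions for illustration, not a prescribed format.

```python
FALLBACK = ("I don't have enough information in the provided documents "
            "to answer this question.")


def build_grounded_prompt(question: str, chunks: dict[str, str]) -> str:
    """Render retrieved chunks as labeled CONTEXT sections plus the question."""
    sections = [f"CONTEXT [chunk_id: {chunk_id}]\n{text}"
                for chunk_id, text in chunks.items()]
    return "\n\n".join(sections + [f"QUESTION: {question}"])


def is_abstention(response: str) -> bool:
    """True when the model returned the exact fallback sentence."""
    return response.strip() == FALLBACK
```

Checking for the exact fallback string lets the application log retrieval misses separately from answered queries.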
Hypothetical Document Embeddings (HyDE) is a retrieval technique published by Luyu Gao et al. (CMU, 2022) that addresses the query-document vocabulary mismatch more aggressively than asymmetric task_type alone. Instead of embedding the user's question directly, you first prompt an LLM to generate a hypothetical document that would answer the question, then embed that hypothetical document for retrieval.
The intuition: a hypothetical answer to "what is the dosage of metformin for type 2 diabetes?" will use the same clinical vocabulary as actual dosage documentation, even if the user's query was phrased informally. Empirically, HyDE improves recall on technical corpora by 15–25% over direct query embedding, at the cost of one additional LLM call per query (typically using a fast, cheap model like Gemini Flash for the hypothetical generation).
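The HyDE flow reduces to three calls, sketched below with the LLM, the embedding model, and the vector index injected as plain callables. The callables and the prompt wording are placeholders; in practice they would wrap a fast generation model, the embedding model, and the vector store.

```python
from typing import Callable


def hyde_retrieve(question: str,
                  generate: Callable[[str], str],
                  embed: Callable[[str], list[float]],
                  search: Callable[[list[float], int], list[str]],
                  top_k: int = 20) -> list[str]:
    """Retrieve using the embedding of a hypothetical answer, not the question."""
    prompt = ("Write a short passage that answers the question.\n"
              f"Question: {question}\nPassage:")
    hypothetical = generate(prompt)           # one extra LLM call per query
    return search(embed(hypothetical), top_k)  # the question is never embedded
```

The hypothetical passage is used only for retrieval and is discarded afterwards; it must never be shown to the user or passed to the answering model as context, since it may contain fabricated details.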
Production RAG systems must handle conversational context. A follow-up question like "what about the dosage for children?" has no standalone meaning without the prior turn. Two approaches:
Conversation condensation: Before retrieval, an LLM rewrites the current question incorporating context from previous turns into a standalone query. Simple and effective.
Multi-query retrieval: Retrieve separately for the current question and for a condensed summary of the conversation, then merge the result sets before reranking. More thorough, but it doubles the number of retrieval calls, and the retrieval latency too unless the two calls run in parallel.
Vertex AI's Conversation API can be combined with RAG by maintaining the conversation history in application code and condensing it before each retrieval call.
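Both approaches can be sketched with history kept in application code. The `rewrite` callable stands in for an LLM call, the prompt wording is an assumption, and `condense` and `merge_results` are hypothetical helper names.

```python
from typing import Callable


def condense(history: list[tuple[str, str]], question: str,
             rewrite: Callable[[str], str]) -> str:
    """Rewrite a follow-up question as a standalone query.

    history is a list of (role, text) turns, oldest first.
    """
    if not history:
        return question  # nothing to resolve, skip the LLM call
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    prompt = ("Rewrite the final question so it can be understood without "
              "the conversation.\n"
              f"{transcript}\nFinal question: {question}\nStandalone question:")
    return rewrite(prompt).strip()


def merge_results(*result_sets: list[str]) -> list[str]:
    """Union of chunk ids across result sets, first occurrence first."""
    return list(dict.fromkeys(cid for results in result_sets
                              for cid in results))
```

For multi-query retrieval, the merged list then goes to the reranker, which decides the final ordering.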
Box integrated Vertex AI RAG into Box AI, enabling users to query their enterprise documents conversationally. Their engineering team published implementation details at Google Cloud Next 2024, noting that conversation condensation (rewriting multi-turn follow-up questions as standalone queries before retrieval) was the single highest-impact change in their evaluation: it reduced retrieval misses on conversational queries from 31% to 9%.
For teams that want managed grounding without building a retrieval pipeline, Vertex AI's Grounding feature handles retrieval and citation natively. When you call the Gemini API with grounding enabled, Vertex AI automatically retrieves from Google Search (for public knowledge) or from a configured RAG corpus (for private data), injects the retrieved context, generates a grounded response, and returns GroundingMetadata including source URLs, grounding chunks, and a grounding support score per claim.
The grounding support score (0–1) per claim is particularly valuable: it allows post-processing to filter low-confidence claims before presenting them to users, implementing a confidence-gated response strategy without custom pipeline code.
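A sketch of confidence gating over per-claim scores. The `ScoredClaim` structure is a simplified, hypothetical stand-in for whatever the application extracts from the grounding metadata; it does not mirror the actual response schema, and the 0.7 threshold is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class ScoredClaim:
    text: str
    support: float                      # grounding support score in [0, 1]
    sources: list[str] = field(default_factory=list)


def gate_claims(claims: list[ScoredClaim], threshold: float = 0.7):
    """Split claims into (kept, flagged) by support score and sourcing."""
    kept: list[ScoredClaim] = []
    flagged: list[ScoredClaim] = []
    for claim in claims:
        # A claim needs both a sufficient score and at least one source.
        if claim.support >= threshold and claim.sources:
            kept.append(claim)
        else:
            flagged.append(claim)
    return kept, flagged
```

Flagged claims can be dropped, or shown with a visible "unverified" marker when a human reviewer is in the loop.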
You are building a RAG system for a legal firm where associates query a corpus of 200,000 case law documents. The system must never fabricate case citations, must support multi-turn conversational queries ("what about the exception in the 1987 case?"), and must provide confidence levels that senior partners can use to decide whether to verify a reference manually.
After Anthropic published its Constitutional AI paper in 2022, several enterprises reported deploying RAG systems that performed well in pre-production evaluation but silently degraded over the first six months in production. The common cause: corpus drift — the document base was updated but the embeddings were stale, causing retrieval to return outdated chunks while the model generated answers that contradicted current policy. Without monitoring, this drift went undetected for weeks before support tickets revealed the discrepancy.
RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework developed at Exploding Gradients (2023) that defines the key metrics for RAG quality evaluation. It is the de facto standard for RAG evaluation and integrates directly with Vertex AI Evaluation Service.
Vertex AI Evaluation Service (part of Vertex AI Studio) provides a managed environment for running RAGAS and custom metrics against a test set. You provide a dataset of (question, ground-truth-answer, retrieved-context, generated-answer) tuples, and the service computes metrics using an LLM judge (typically Gemini Pro) to assess faithfulness and relevance.
Key integration points: the Evaluation Service connects to your Vertex AI pipeline runs, allowing you to automatically run evaluation after each corpus update or model change. Results are logged to Vertex AI Experiments for tracking metric trends over time — this is how you detect corpus drift systematically rather than waiting for user complaints.
The RAGAS faithfulness metric uses an LLM to judge whether each claim in the response is supported by the retrieved context. This works well for factual claims but has known failure modes: the judge LLM may be too lenient on claims that are "directionally correct" but factually slightly wrong. For high-stakes applications (legal, medical, financial), augment LLM-as-judge with deterministic citation verification: check whether each cited chunk ID actually contains the quoted text using string matching.
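Deterministic citation verification can be as simple as the sketch below. It assumes the application has already parsed the response into (chunk_id, quoted_text) pairs; matching is case-insensitive with whitespace collapsed, which is a design choice rather than a requirement.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not cause misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_citations(citations: list[tuple[str, str]],
                     chunks: dict[str, str]) -> list[tuple[str, str]]:
    """Return the (chunk_id, quote) pairs that fail verification.

    A citation fails if the chunk id is unknown or the quote does not
    appear verbatim (after normalization) in that chunk.
    """
    failures = []
    for chunk_id, quote in citations:
        chunk = chunks.get(chunk_id)
        if chunk is None or normalize(quote) not in normalize(chunk):
            failures.append((chunk_id, quote))
    return failures
```

An empty return value means every citation checked out; any failure is a strong signal to withhold or flag the response.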
A production RAG pipeline requires monitoring at three levels: retrieval quality, generation quality, and pipeline health.
Corpus drift occurs in two forms. Content drift: documents are updated but the old embeddings remain indexed, causing the retriever to return outdated chunks. Coverage drift: new documents covering new topics are added but not yet indexed, causing retrieval to fail on queries about those topics.
Detection: compare your evaluation set's context recall metric week-over-week. A drop in context recall typically indicates coverage drift. A drop in faithfulness with stable recall indicates content drift (correct documents retrieved but their content has changed).
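The detection rule translates into a small comparison over two evaluation runs. The metric keys and the 0.05 tolerance below are assumptions; the tolerance should reflect the normal run-to-run noise of your own evaluation set.

```python
def classify_drift(previous: dict[str, float], current: dict[str, float],
                   tolerance: float = 0.05) -> str:
    """Classify week-over-week metric changes into a drift category."""
    recall_drop = previous["context_recall"] - current["context_recall"]
    faithfulness_drop = previous["faithfulness"] - current["faithfulness"]
    if recall_drop > tolerance:
        # Retrieval no longer finds what it used to: likely unindexed topics.
        return "coverage_drift"
    if faithfulness_drop > tolerance:
        # Recall is stable but answers diverge: likely stale embeddings.
        return "content_drift"
    return "stable"
```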
Remediation: implement a document fingerprinting system — hash each document's content at ingestion time, and re-run hashes periodically. Documents whose content hash has changed trigger automatic re-embedding and vector upsert. On Vertex AI, this can be implemented as a Cloud Scheduler job that triggers a Dataflow pipeline checking GCS object metadata and running incremental re-indexing.
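The fingerprinting logic itself is independent of the scheduler and pipeline that run it. A minimal sketch using SHA-256 follows; `fingerprint` and `plan_reindex` are hypothetical helper names, and how the stored hashes are persisted is left to the application.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Stable content hash for a document."""
    return hashlib.sha256(content).hexdigest()


def plan_reindex(stored: dict[str, str],
                 current: dict[str, bytes]) -> dict[str, list[str]]:
    """Compare stored hashes with current content and plan index updates.

    stored maps doc_id to the hash recorded at last ingestion;
    current maps doc_id to the document bytes as they exist now.
    """
    return {
        # Never ingested: embed and insert.
        "new": sorted(d for d in current if d not in stored),
        # Hash mismatch: re-embed and upsert.
        "changed": sorted(d for d in current
                          if d in stored and fingerprint(current[d]) != stored[d]),
        # Gone from the source: remove vectors from the index.
        "deleted": sorted(d for d in stored if d not in current),
    }
```

Handling the "new" and "deleted" cases alongside "changed" covers coverage drift and stale chunks from removed documents in the same pass.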
A fully productionized RAG pipeline on Google Cloud uses the following architecture:
Ingestion: Cloud Storage → Eventarc trigger → Cloud Run ingestion service (Document AI + chunking + text-embedding-004) → Vector Search UpsertDatapoints + Firestore (metadata + raw chunks).
Query: API Gateway → Cloud Run query service → text-embedding-004 (query) → Vector Search FindNeighbors → Firestore fetch → optional reranker → Gemini 1.5 Pro with grounding → response with citations.
Monitoring: All service logs → Cloud Logging → BigQuery export → RAGAS evaluation via Vertex AI Evaluation Service (scheduled via Cloud Scheduler) → dashboards in Looker Studio.
BBVA deployed a RAG pipeline on Google Cloud for its internal regulatory compliance assistant, querying 1.2 million regulatory documents across 30+ jurisdictions. Their production monitoring architecture, described at Google Cloud Next 2024, runs RAGAS evaluation on a 500-question test set after every corpus update (triggered by document change events). Faithfulness dropped from 94% to 79% in Q2 2024 when a batch of regulatory PDFs with non-standard table layouts caused chunking failures — detected within 4 hours by the automated evaluation, before user complaints. Manual re-ingestion with Document AI table extraction restored faithfulness to 93% within the same business day.
You have just deployed a RAG pipeline for an insurance company's underwriting team. It queries 300,000 policy documents and historical claims data. Six weeks post-launch, the underwriting team reports that the system is "giving different answers to the same questions" and seems "confused about our new policy updates from last month." You need to diagnose the problem and design a monitoring system to prevent recurrence.