In 2023, Google published the Retrieval-Augmented Generation (RAG) evaluation framework that underpins Vertex AI's grounding checks. The motivation was concrete: internal teams found that even well-tuned models produced plausible-sounding but unsupported claims in roughly 15–20% of long-form responses when grounding was not explicitly measured. Measuring faithfulness — not just fluency — became the engineering priority.
A grounded agent retrieves documents, then synthesises an answer. Two failure modes are equally damaging: the agent ignores retrieved content and invents an answer, or it retrieves irrelevant chunks and attempts to reconcile them. Both failures are nearly invisible to human reviewers reading only the final output — the text sounds coherent either way.
Vertex AI's grounding evaluation layer, surfaced through the Vertex AI RAG Engine and Agent Evaluation Service (generally available as of Q1 2025), attacks this by scoring three distinct signals independently rather than collapsing them into a single quality score.
Vertex AI's Agent Evaluation Service (part of Vertex AI Generative AI Studio) lets teams define an evaluation dataset, a JSONL file in which each record holds a query, an expected answer, and the retrieved context, and run automated scoring with Gemini as the judge model. This is the same approach used by Google's own internal GenAI product teams, documented in the 2024 Google Cloud Next session "Evaluating Production GenAI Applications."
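As a rough illustration of the dataset shape described above, the sketch below writes and reads a one-record JSONL file. The field names (query, expected_answer, retrieved_context) and the sample record are assumptions made for this example, not a confirmed schema for the service.

```python
import json

# Field names below are illustrative assumptions, not a confirmed schema.
REQUIRED_FIELDS = ("query", "expected_answer", "retrieved_context")


def validate_record(record):
    """Raise ValueError if a required field is missing or empty."""
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            raise ValueError(f"missing or empty field: {field}")
    return record


def to_jsonl(records):
    """Serialize validated records, one JSON object per line."""
    return "\n".join(
        json.dumps(validate_record(r), ensure_ascii=False) for r in records
    )


def from_jsonl(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical sample record for a policy-library agent.
records = [
    {
        "query": "How many weeks of paid parental leave do employees get?",
        "expected_answer": "Sixteen weeks of paid parental leave.",
        "retrieved_context": [
            "Policy HR-12: Employees receive sixteen weeks of paid parental leave."
        ],
    },
]
jsonl_text = to_jsonl(records)
```

Validating at write time keeps incomplete records out of the golden set, where a missing context field would otherwise show up later as a misleading low score.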
The service returns per-metric scores between 0 and 1, a per-question verdict, and a calibrated overall grounding score. Teams at Wayfair and Deutsche Telekom (both documented Google Cloud case studies, 2024) used this pipeline to gate deployments: agents with a faithfulness score below 0.85 are held back for additional retriever tuning before reaching production traffic.
Faithfulness measures model behaviour given what was retrieved. Context recall measures retriever behaviour. A faithfulness score of 1.0 on low-recall retrieval still produces wrong answers — the model is faithfully repeating incomplete information. Both dimensions must be tracked independently.
In Google's 2024 internal RAG benchmark across 12 document-grounded applications, teams that tracked all four metrics (recall, precision, faithfulness, relevance) independently reduced production hallucination incidents by 61% compared to teams using only end-to-end BLEU/ROUGE scores. The insight: composite scores hide which component is failing.
Your team is deploying a document-grounded agent on Vertex AI that answers questions from a company's internal policy library (PDF, 400+ documents). Before going to production, you need to build an evaluation dataset and define your metric thresholds. Discuss the design with the lab assistant below.
In 2024, the Vertex AI team published a reference architecture for continuous RAG evaluation as part of the Google Cloud Architecture Center. The pattern emerged from observing that teams deploying retrieval-augmented pipelines had no systematic way to detect grounding regressions between model updates. In one example, a Gemini 1.0 to 1.5 upgrade improved fluency but simultaneously degraded faithfulness on a specific document category, and the regression went undetected for three weeks in teams without automated eval pipelines.
The core idea is to treat evaluation like a test suite in software engineering: eval runs automatically on every significant change to the agent (model version, prompt, retriever config, chunking strategy). The pipeline is defined in code, version-controlled, and triggered by CI/CD events.
In Vertex AI's architecture, this means a Cloud Build or GitHub Actions pipeline that: (1) samples from a golden evaluation dataset, (2) runs the full agent pipeline end-to-end against those samples, (3) submits results to the Vertex AI Agent Evaluation Service, and (4) gates the deployment if any metric falls below threshold.
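Steps (1) and (4) of this pipeline can be sketched in plain Python. This is a minimal sketch, assuming the metric scores arrive as a dictionary. Steps (2) and (3), running the agent and calling the evaluation service, are left out. The 0.85 faithfulness gate comes from this lesson; the context recall floor and all function names are hypothetical.

```python
import random

# 0.85 is the faithfulness gate cited in this lesson;
# the context_recall floor is a hypothetical value for illustration.
THRESHOLDS = {"faithfulness": 0.85, "context_recall": 0.80}


def sample_golden_set(dataset, n, seed=0):
    """Step 1: draw a reproducible sample from the golden dataset."""
    rng = random.Random(seed)
    return rng.sample(dataset, min(n, len(dataset)))


def find_failures(scores, thresholds=THRESHOLDS):
    """Step 4: return every metric that is missing or below its threshold."""
    failures = {}
    for name, minimum in thresholds.items():
        value = scores.get(name)
        if value is None or value < minimum:
            failures[name] = (value, minimum)
    return failures


def exit_code(scores, thresholds=THRESHOLDS):
    """A non-zero exit code fails the CI job and blocks the deployment."""
    return 1 if find_failures(scores, thresholds) else 0
```

In a CI script, the final line would typically be sys.exit(exit_code(scores)), so that Cloud Build or GitHub Actions marks the step as failed. Treating a missing metric as a failure is a deliberate choice here: a silent scoring outage should not let a build through.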
The vertexai.evaluation Python SDK (released October 2024) provides the primary interface. The key object is EvalTask, which accepts a dataset, a list of metric names, and an experiment name for Vertex AI Experiments tracking. Running eval_task.evaluate() submits the job to managed infrastructure — no GPU allocation required, since Gemini handles scoring server-side.
Custom metrics can be defined as CustomMetric objects with a scoring function that returns a float between 0 and 1. This allows domain-specific checks — for example, verifying that financial figures cited in an answer match source documents to within a specified tolerance, which is how one documented Google Cloud banking customer implemented their compliance check in 2024.
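The scoring function behind such a check might look like the sketch below. It is a plain Python function that returns a float between 0 and 1. Wiring it into a CustomMetric object is not shown, and the function names, the number-extraction regex, and the default tolerance are all assumptions made for this example.

```python
import re

# Matches figures such as 4,210 or 12.5; a deliberately simple assumption
# that ignores signs, currency symbols, and unit words like "million".
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_figures(text):
    """Return every number in the text as a float, ignoring thousands commas."""
    return [float(m.replace(",", "")) for m in _NUMBER.findall(text)]


def figure_match_score(answer, source, rel_tol=0.001):
    """Fraction of figures in the answer that appear in the source
    within a relative tolerance. Returns 1.0 if the answer cites no figures."""
    cited = extract_figures(answer)
    if not cited:
        return 1.0
    available = extract_figures(source)
    matched = sum(
        1
        for c in cited
        if any(abs(c - s) <= rel_tol * max(abs(s), 1e-12) for s in available)
    )
    return matched / len(cited)
```

For example, an answer citing 4,300 and 9,000 against a source containing 4,210 and 9,000 scores 0.5, because only one of the two cited figures is supported.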
Deutsche Bank's Google Cloud deployment (documented at Google Cloud Next 2024) runs eval pipelines on 2% of live traffic in addition to pre-deployment golden-set evaluation. This practice, known as shadow scoring, uses the same Gemini judge but operates asynchronously on sampled production queries. The result is a continuous signal of real-world grounding quality, separate from the controlled eval dataset.
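One common way to pick a stable 2% sample is to hash a request identifier and compare it to the sampling rate. The sketch below shows that technique. It is an assumption about how sampling could be done, not a description of the documented deployment, and the function name is hypothetical.

```python
import hashlib


def in_shadow_sample(request_id: str, rate: float = 0.02) -> bool:
    """Deterministically decide whether a request is sent for shadow scoring.

    The request ID is hashed to a number in [0, 1); requests whose bucket
    falls below the rate are sampled. The same ID always gets the same answer.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Hash-based sampling needs no shared state between serving instances, and a retried request lands in the same bucket, so it is never scored twice under different decisions.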
LLM-as-judge scoring is not free. Running Gemini Pro on a 500-example eval set costs roughly $0.50–$2.00 per run depending on context length. Teams typically run full eval on model/prompt changes and lighter heuristic checks (chunk retrieval overlap, response length distribution) on every commit.
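The two lightweight checks named above can be computed without any model call. The sketch below is one possible interpretation: chunk retrieval overlap as Jaccard similarity between the chunk IDs retrieved before and after a change, and response length shift as the relative change in mean length. Both definitions and the function names are assumptions for illustration.

```python
from statistics import mean


def retrieval_overlap(baseline_ids, candidate_ids):
    """Jaccard similarity between two sets of retrieved chunk IDs."""
    a, b = set(baseline_ids), set(candidate_ids)
    if not a and not b:
        return 1.0  # nothing retrieved in either run counts as identical
    return len(a & b) / len(a | b)


def length_shift(baseline_lengths, candidate_lengths):
    """Relative change in mean response length versus the baseline."""
    base = mean(baseline_lengths)
    return (mean(candidate_lengths) - base) / base
```

A sharp drop in overlap or a large length shift on a commit does not prove a grounding regression, but it is a cheap signal that a full judge-based eval run is worth paying for.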
You're a data engineer at a financial services firm. Your grounded agent answers questions from 10-K filings. A Gemini model upgrade is scheduled for next month, and you need a CI/CD eval pipeline that gates deployment on grounding quality. Discuss implementation details with the assistant.
In late 2023, the Google Cloud documentation team deployed a RAG-based assistant for developer documentation. Initial context recall was 0.61, meaning the retriever failed to surface roughly 39% of the relevant passages. The root cause was that developer questions used colloquial phrasing ("how do I make my function run faster") while documentation used formal terminology ("Cloud Functions performance optimization"). The fix required not a model change, but a retrieval architecture change: query rewriting before embedding lookup.
Evaluation data reveals specific failure patterns. Each has a targeted fix within the Vertex AI RAG Engine:
The Vertex AI RAG Engine supports a query transformation step before vector lookup. Configured via the retrieval_config parameter in the RAG Engine API, teams can enable Gemini-powered query rewriting with a system prompt that specifies domain vocabulary. The rewritten query is embedded and used for vector search; the original query is retained for the final answer generation prompt.
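The data flow described above, where the rewritten query drives the search and the original is kept for generation, can be sketched as follows. The dictionary-based rewriter is a stand-in for the Gemini-powered rewriting step, and every name here (DOMAIN_TERMS, RetrievalRequest, rewrite_query, build_request) is hypothetical rather than part of the RAG Engine API.

```python
from dataclasses import dataclass

# Hypothetical vocabulary map standing in for the Gemini-powered rewriter.
DOMAIN_TERMS = {
    "run faster": "performance optimization",
    "function": "Cloud Functions",
}


@dataclass(frozen=True)
class RetrievalRequest:
    original_query: str  # kept for the final answer-generation prompt
    search_query: str    # rewritten form, embedded for vector search


def rewrite_query(query, terms=DOMAIN_TERMS):
    """Replace colloquial phrases with the documentation's formal vocabulary."""
    rewritten = query
    for colloquial, formal in terms.items():
        rewritten = rewritten.replace(colloquial, formal)
    return rewritten


def build_request(query):
    """Pair the untouched user query with its rewritten search form."""
    return RetrievalRequest(original_query=query, search_query=rewrite_query(query))
```

Keeping both strings in one immutable object makes it harder to pass the rewritten query to the generation prompt by accident, which would change what the user appears to have asked.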
Google's internal documentation team reported recall improvement from 0.61 to 0.82 using this approach alone, with no changes to the underlying document corpus or chunk boundaries.
Vertex AI Vector Search supports hybrid queries that combine dense vector similarity with sparse BM25 keyword matching. The API accepts a sparse_embedding field alongside the dense embedding vector. Vertex AI handles the fusion using Reciprocal Rank Fusion (RRF), a method that combines the ranked lists from both retrieval systems using rank positions alone, so dense and sparse scores never need to be normalized onto a common scale.
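To make the fusion step concrete, here is a minimal sketch of standard Reciprocal Rank Fusion: each document scores the sum of 1 / (k + rank) over every list it appears in. The smoothing constant k = 60 is the conventional default from the RRF literature and is an assumption here. This shows the general algorithm, not Vertex AI's internal implementation.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list that contains it,
    with ranks starting at 1. Ties are broken by document ID for determinism.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from dense and sparse retrieval.
dense_results = ["a", "b", "c"]
sparse_results = ["c", "a", "d"]
fused = reciprocal_rank_fusion([dense_results, sparse_results])
```

Documents that appear in both lists ("a" and "c") rise above those found by only one retriever, which is the behaviour that helps exact-match queries without hurting semantic ones.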
Hybrid retrieval particularly benefits structured queries ("find all mentions of Section 5.2.1") and exact-match requirements (product codes, legal citations) where semantic similarity alone is insufficient. In a 2024 Google Cloud case study, a legal tech firm improved context recall on regulatory document queries from 0.68 to 0.87 by adding BM25 as a secondary retrieval signal.
Even with improved retrieval, the top-k chunks may not be optimally ordered for the generative model. Vertex AI integrates with the Vertex AI Ranking API (generally available 2024), a cross-encoder reranker that takes (query, chunk) pairs and scores relevance more accurately than the initial embedding similarity, at the cost of additional latency.
The Ranking API is called after initial vector retrieval (e.g., top 20 chunks) and re-orders them before the top 5–8 are passed to the generative model. This two-stage approach — fast approximate retrieval followed by precise reranking — balances latency and accuracy in production RAG pipelines.
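The two-stage shape can be sketched with pluggable scoring functions. In the sketch below, the toy scorers (token_overlap and overlap_density) are stand-ins for embedding similarity and the cross-encoder reranker. They and the function names are assumptions for illustration and do not call the Ranking API.

```python
def two_stage_retrieve(query, chunks, coarse_score, fine_score, first_k=20, final_k=5):
    """Stage 1: keep the first_k chunks by a cheap score.
    Stage 2: re-order those candidates by a more precise score, keep final_k."""
    candidates = sorted(chunks, key=lambda c: coarse_score(query, c), reverse=True)[:first_k]
    reranked = sorted(candidates, key=lambda c: fine_score(query, c), reverse=True)
    return reranked[:final_k]


def token_overlap(query, chunk):
    """Toy stand-in for embedding similarity: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def overlap_density(query, chunk):
    """Toy stand-in for a cross-encoder: shared tokens per chunk token."""
    tokens = chunk.lower().split()
    return token_overlap(query, chunk) / len(tokens) if tokens else 0.0
```

The expensive scorer only ever sees first_k candidates, which is what keeps reranking latency bounded regardless of corpus size.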
In the Google Cloud Architecture Center's RAG optimization guide (published March 2024), teams using query rewriting + hybrid retrieval + reranking together achieved an average context recall improvement of 28 percentage points over baseline dense-only retrieval, across five documented customer deployments in legal, financial, and healthcare domains.
Your healthcare document agent has a context recall of 0.58 and faithfulness of 0.91. Evaluation shows failures cluster around two query types: questions using patient-facing terminology versus clinical terminology, and multi-part questions spanning two document sections. Discuss diagnosis and remediation with the assistant.
In 2024, Anthropic published research (documented in their model card for Claude 3) showing that faithfulness failures in RAG settings are not random — they cluster around specific prompt structures. When retrieved context is presented before the question, faithfulness improves by 8–12 percentage points compared to presenting context after the question. Google replicated this finding internally for Gemini models, and Vertex AI's RAG Engine prompt templates now default to context-first ordering as a result.
If evaluation shows high context recall (the right passages are being retrieved) but low faithfulness (the model is not staying within them), the problem is in the generation stage. Three patterns account for the majority of generation-level grounding failures:
1. Parametric Preference: The model defaults to knowledge from pre-training when retrieved context conflicts with its priors. This is especially common for well-known entities where the model has strong parametric memory.
2. Context Diffusion: When too many retrieved chunks are presented (high top-k), the model averages across them rather than citing specific passages, producing vague answers that aren't strictly grounded in any single source.
3. Instruction Drift: Verbose or complex system prompts cause the model to follow the spirit of the instruction rather than its letter, particularly around citation constraints.
Vertex AI's grounding-optimized prompt templates, available in Prompt Design in Generative AI Studio, implement several documented best practices:
Context-first ordering: Retrieved passages appear before the question in the prompt, not after. The model has already processed the context by the time it reads the question, and the question sits immediately before the point where generation begins, which reduces parametric override.
Explicit citation constraint: System prompt language such as "Answer using only information from the provided documents. If the answer is not in the documents, say 'I don't have information on this in the provided context.'" This instruction, verified across Gemini 1.5 and 2.0 models (2024 internal evaluation), reduces parametric override by forcing explicit acknowledgment of knowledge boundaries.
Per-passage labeling: Numbering retrieved passages (e.g., [Document 1], [Document 2]) and instructing the model to cite source numbers in its answer creates an auditable grounding trail and measurably improves faithfulness in Gemini models by surfacing which passages were actually used.
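The three practices above can be combined in a small prompt builder. This is a sketch under stated assumptions: the layout, the function name build_grounded_prompt, and the exact instruction wording beyond the quoted refusal sentence are illustrative, not the contents of Vertex AI's own templates.

```python
REFUSAL = "I don't have information on this in the provided context."


def build_grounded_prompt(question, passages):
    """Assemble a prompt with the citation constraint, then numbered
    passages, then the question last (context-first ordering)."""
    labeled = "\n\n".join(
        f"[Document {i}]\n{passage.strip()}"
        for i, passage in enumerate(passages, start=1)
    )
    instructions = (
        "Answer using only information from the provided documents. "
        "Cite the document numbers you used, for example [Document 1]. "
        f'If the answer is not in the documents, say "{REFUSAL}"'
    )
    return f"{instructions}\n\n{labeled}\n\nQuestion: {question.strip()}\nAnswer:"


# Hypothetical passages and question for illustration.
prompt = build_grounded_prompt(
    "What is the deductible for water damage?",
    ["Section 4: Water damage claims carry a $1,000 deductible.", "Section 7: Claims must be filed within 30 days."],
)
```

Because the labels are generated from the passage order, the document numbers in the model's answer can be mapped back to chunk IDs for auditing.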
Context diffusion is a real phenomenon. It is documented in Stanford's "Lost in the Middle" paper (Liu et al., 2023) and confirmed in Google's Gemini evaluation: models pay less attention to passages in the middle of long context windows. In grounded RAG applications, this means that padding the context with marginally relevant chunks can degrade faithfulness even if recall is high.
The operational fix is dynamic top-k: instead of always retrieving k=10 chunks, retrieve based on a similarity score threshold (e.g., only include chunks with cosine similarity above 0.72). Vertex AI Vector Search supports a distance_threshold parameter in the FindNeighbors API for exactly this purpose. Teams that switched from fixed top-k to threshold-based retrieval in documented 2024 deployments reduced context window size by 40% while maintaining or improving faithfulness scores.
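Threshold-based selection is easy to express client-side once each chunk carries a similarity score. The sketch below applies the 0.72 example cutoff from the text and keeps a cap on the number of chunks. The function name and the max_k cap are assumptions, and it filters results already returned rather than calling the FindNeighbors API.

```python
def select_by_threshold(scored_chunks, min_similarity=0.72, max_k=10):
    """Keep chunks whose cosine similarity is above the threshold,
    best first, never returning more than max_k."""
    kept = [(chunk, score) for chunk, score in scored_chunks if score > min_similarity]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_k]


# Hypothetical (chunk_id, cosine_similarity) pairs from a retrieval call.
retrieved = [("a", 0.91), ("b", 0.70), ("c", 0.75), ("d", 0.72)]
selected = select_by_threshold(retrieved)
```

The number of chunks passed to the model now varies per query: a narrow question may yield two strong passages, while a broad one may reach the cap.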
For Gemini models called via Vertex AI, the GroundingConfig object can be passed in generation requests to enable server-side grounding enforcement. When disable_attribution is False (default), Gemini returns a grounding_attributions field in its response — a list of (text_span, source_segment) pairs linking each claim in the generated text to its supporting retrieved passage.
These attributions enable post-hoc faithfulness verification and are the basis for the grounding score shown in Vertex AI's Generative AI Studio playground. In production, they can be logged to BigQuery alongside responses to enable the shadow scoring pipeline described in Lesson 2.
Higher generation temperature increases creativity but reduces faithfulness in grounded settings. Google's internal testing across Gemini 1.5 Pro found that temperature above 0.3 measurably degraded faithfulness scores for document-grounded QA tasks. Production grounded agents on Vertex AI typically set temperature to 0.0–0.2 for maximum faithfulness, reserving higher temperatures for summarization or creative synthesis tasks where some interpretation is acceptable.
The full improvement cycle documented in Google Cloud's Architecture Center: (1) Run eval → identify whether the failure is in the retrieval or generation stage. (2) If retrieval: apply query rewriting, hybrid search, reranking. (3) If generation: tighten prompt constraints, reduce top-k via threshold, lower temperature, enable grounding attributions. (4) Re-run eval to confirm improvement. (5) Gate deployment on the combined metric suite. Teams that follow this cycle systematically report needing 2–3 iterations to reach production-grade grounding quality from a baseline RAG deployment.
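Steps (1) to (3) of the cycle amount to a small triage rule, sketched below. The 0.85 faithfulness floor is the deployment gate cited earlier in this course. The 0.80 recall floor and the function and constant names are hypothetical values chosen for illustration.

```python
RETRIEVAL_FIXES = ["query rewriting", "hybrid search", "reranking"]
GENERATION_FIXES = [
    "tighten prompt constraints",
    "reduce top-k via threshold",
    "lower temperature",
    "enable grounding attributions",
]


def triage(context_recall, faithfulness, recall_min=0.80, faithfulness_min=0.85):
    """Map eval scores to the failing stage(s) and their candidate fixes.

    An empty result means both metrics clear their floors.
    """
    plan = {}
    if context_recall < recall_min:
        plan["retrieval"] = RETRIEVAL_FIXES
    if faithfulness < faithfulness_min:
        plan["generation"] = GENERATION_FIXES
    return plan
```

With recall 0.58 and faithfulness 0.91 the plan contains only retrieval fixes, while recall 0.84 and faithfulness 0.67 yields only generation fixes, which is why the two metrics are tracked separately.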
Your insurance claims agent retrieves relevant policy documents with 0.84 recall, but faithfulness is only 0.67. Analysis shows the model frequently cites figures from its parametric memory (e.g., "standard deductibles are typically $500") rather than from retrieved policy documents, and elaborates beyond what the documents state. Work through generation-stage fixes with the lab assistant.