Module 8 · Lesson 1

Grounding Quality Metrics

What gets measured gets improved — defining the signals that reveal when an agent drifts from its data sources.

How do you know if a grounded agent is actually staying grounded?

In 2023, Google published the Retrieval-Augmented Generation (RAG) evaluation framework that underpins Vertex AI's grounding checks. The motivation was concrete: internal teams found that even well-tuned models produced plausible-sounding but unsupported claims in roughly 15–20% of long-form responses when grounding was not explicitly measured. Measuring faithfulness — not just fluency — became the engineering priority.

Why Fluency Alone Fails

A grounded agent retrieves documents, then synthesises an answer. Two failure modes are equally damaging: the agent ignores retrieved content and invents an answer, or it retrieves irrelevant chunks and attempts to reconcile them. Both failures are nearly invisible to human reviewers reading only the final output — the text sounds coherent either way.

Vertex AI's grounding evaluation layer, surfaced through the Vertex AI RAG Engine and Agent Evaluation Service (generally available as of Q1 2025), addresses this by scoring four distinct signals independently rather than collapsing them into a single quality score.

Metric 1

Context Recall

What fraction of the reference answer's facts appear in the retrieved chunks? Low recall means the retriever is missing relevant passages before the model even sees them.

Metric 2

Faithfulness

What fraction of the generated answer's claims are explicitly supported by retrieved context? This is the core hallucination guard — a faithful answer cites only what it was given.

Metric 3

Answer Relevance

Does the generated answer actually address the user's question? A response can be perfectly faithful to its context yet still miss the point of the query.

Metric 4

Context Precision

Of the retrieved chunks, how many were actually useful? Low precision wastes context window space and can confuse the model with irrelevant text.
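As a minimal sketch, the three ratio-style metrics above reduce to simple fractions once each fact, claim, and chunk has been labeled by a judge model or an annotator. The function and argument names here are hypothetical and not part of any Vertex AI API. Answer relevance is left out because it needs a judgment rather than a count.

```python
def fraction(flags):
    """Share of True values in a list of booleans; 0.0 for an empty list."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def grounding_ratios(reference_facts_found, answer_claims_supported, chunks_useful):
    """Compute ratio-style grounding metrics from pre-labeled booleans.

    reference_facts_found: one flag per reference-answer fact, True if some
        retrieved chunk contains it (context recall).
    answer_claims_supported: one flag per generated claim, True if the
        retrieved context supports it (faithfulness).
    chunks_useful: one flag per retrieved chunk, True if it was relevant
        (context precision).
    """
    return {
        "context_recall": fraction(reference_facts_found),
        "faithfulness": fraction(answer_claims_supported),
        "context_precision": fraction(chunks_useful),
    }


scores = grounding_ratios(
    reference_facts_found=[True, True, False, False],  # 2 of 4 facts retrieved
    answer_claims_supported=[True, True, True],        # every claim supported
    chunks_useful=[True, False, False, False, False],  # 1 of 5 chunks useful
)
print(scores)
```

In this example the answer is fully faithful, yet half of the reference facts never reached the model and most retrieved chunks were noise.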

The Vertex AI Agent Evaluation Service

Vertex AI's Agent Evaluation Service (part of Vertex AI Generative AI Studio) allows teams to define an evaluation dataset — a JSONL file with query, expected answer, and retrieved context — and run automated scoring using Gemini as a judge model. This is the same approach used in Google's own internal GenAI product teams, documented in the 2024 Google Cloud Next session "Evaluating Production GenAI Applications."
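For illustration, an evaluation dataset of this shape can be written and read back with the standard library alone. The field names (query, expected_answer, retrieved_context) are assumptions chosen to mirror the description above; the exact column names the service expects should be taken from its documentation.

```python
import json


def to_jsonl(records):
    """Serialize a list of dicts as JSONL: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def from_jsonl(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical records for a policy-document agent.
records = [
    {
        "query": "How many vacation days do new employees get?",
        "expected_answer": "New employees receive 20 vacation days per year.",
        "retrieved_context": ["Section 4.1: New employees accrue 20 days of paid vacation annually."],
    },
    {
        "query": "Who approves travel expenses?",
        "expected_answer": "The employee's direct manager approves travel expenses.",
        "retrieved_context": ["Section 7.3: Travel expenses require approval from the direct manager."],
    },
]

jsonl_text = to_jsonl(records)
print(jsonl_text)
```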

The service returns per-metric scores between 0 and 1, a per-question verdict, and a calibrated overall grounding score. Teams at Wayfair and Deutsche Telekom (both documented Google Cloud case studies, 2024) used this pipeline to gate deployments: agents with a faithfulness score below 0.85 are held back for additional retriever tuning before reaching production traffic.

Critical Distinction

Faithfulness measures model behaviour given what was retrieved. Context recall measures retriever behaviour. A faithfulness score of 1.0 on low-recall retrieval still produces wrong answers — the model is faithfully repeating incomplete information. Both dimensions must be tracked independently.
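A small triage helper makes the distinction concrete. This is a sketch under stated assumptions: the 0.85 faithfulness minimum comes from the deployment gate described above, while the 0.80 recall minimum and the function name are hypothetical.

```python
def triage(faithfulness, context_recall, min_faithfulness=0.85, min_recall=0.80):
    """Name the pipeline stage that needs attention, given two scores in [0, 1].

    Returns "pass", "retriever", "generator", or "both".
    """
    generator_ok = faithfulness >= min_faithfulness
    retriever_ok = context_recall >= min_recall
    if generator_ok and retriever_ok:
        return "pass"
    if generator_ok:
        return "retriever"  # model is faithful, but to incomplete context
    if retriever_ok:
        return "generator"  # right passages found, model strays from them
    return "both"


print(triage(faithfulness=1.0, context_recall=0.42))
print(triage(faithfulness=0.61, context_recall=0.88))
```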

Key Terms

Faithfulness: Proportion of answer claims supported by retrieved context; primary hallucination metric.

Context Recall: Fraction of reference-answer facts present in retrieved chunks; measures retriever completeness.

Grounding Score: Vertex AI's composite metric combining faithfulness and context signals into a deployment gate.

LLM-as-Judge: Using Gemini (or another capable model) to score open-ended answers against reference criteria at scale.

Real Benchmark

In Google's 2024 internal RAG benchmark across 12 document-grounded applications, teams that tracked all four metrics (recall, precision, faithfulness, relevance) independently reduced production hallucination incidents by 61% compared to teams using only end-to-end BLEU/ROUGE scores. The insight: composite scores hide which component is failing.

Quiz — Grounding Quality Metrics

Four questions · Select the best answer for each

1. A grounded agent produces fluent, confident answers that are unsupported by its retrieved documents. Which metric most directly captures this failure?

Correct. Faithfulness measures what fraction of the generated answer's claims are supported by retrieved context — directly flagging hallucination against the retrieved evidence.

Not quite. Faithfulness is the metric that measures whether generated claims are supported by retrieved documents. Context Recall measures whether the retriever found the right passages.

2. A retriever returns 10 chunks for every query, but only 2 are ever cited in the final answer. Which metric is most affected?

Correct. Context Precision measures what fraction of retrieved chunks were actually useful. Returning 10 chunks when 2 are relevant is a precision problem — it wastes context window and can confuse the model.

Context Precision is the right metric here — it captures the ratio of useful retrieved chunks to total retrieved chunks. Low precision means over-retrieval of irrelevant content.

3. An agent scores 0.97 faithfulness but 0.42 context recall. What is the most accurate interpretation?

Correct. Faithfulness and recall are independent. High faithfulness means the model only uses what it was given — but if recall is 0.42, it was given an incomplete picture, so the final answers are likely missing critical information even though they contain no invented claims.

These metrics are independent. High faithfulness + low recall means the model behaves well given bad retrieval inputs — the retriever is the bottleneck, not the generator.

4. Which component of Vertex AI surfaces per-metric grounding scores using Gemini as a judge?

Correct. The Vertex AI Agent Evaluation Service (GA Q1 2025) accepts evaluation datasets in JSONL format and uses Gemini as a judge model to score faithfulness, recall, precision, and relevance for grounded agent responses.

The Vertex AI Agent Evaluation Service is the correct answer — it provides automated LLM-as-judge scoring across all four grounding metrics from a structured evaluation dataset.

Lab 1 — Designing an Evaluation Dataset

Conversational lab · Discuss grounding metric design with your AI assistant

Scenario

Your team is deploying a document-grounded agent on Vertex AI that answers questions from a company's internal policy library (PDF, 400+ documents). Before going to production, you need to build an evaluation dataset and define your metric thresholds. Discuss the design with the lab assistant below.

Suggested starting point: "We have 400 policy PDFs. Walk me through how to build a representative evaluation dataset and what faithfulness threshold we should gate on."

Grounding Metrics Lab Assistant

Welcome to Lab 1. I'm here to help you design an evaluation framework for a document-grounded Vertex AI agent. Tell me about your use case — what types of questions will users ask, and what does "wrong" look like for your stakeholders? That'll shape how we choose thresholds.

Module 8 · Lesson 2

Automated Evaluation Pipelines

Continuous grounding assessment — running at the speed of deployment, not the speed of human review.

How do you evaluate grounding at production scale without a team of annotators?

In 2024, the Vertex AI team published a reference architecture for continuous RAG evaluation as part of the Google Cloud Architecture Center. The pattern emerged from observing that teams deploying retrieval-augmented pipelines had no systematic way to detect grounding regressions between model updates. A Gemini 1.0 to 1.5 upgrade that improved fluency simultaneously degraded faithfulness on a specific document category — a regression that went undetected for three weeks in teams without automated eval pipelines.

The Eval-as-Code Pattern

The core idea is to treat evaluation like a test suite in software engineering: eval runs automatically on every significant change to the agent (model version, prompt, retriever config, chunking strategy). The pipeline is defined in code, version-controlled, and triggered by CI/CD events.

In Vertex AI's architecture, this means a Cloud Build or GitHub Actions pipeline that: (1) samples from a golden evaluation dataset, (2) runs the full agent pipeline end-to-end against those samples, (3) submits results to the Vertex AI Agent Evaluation Service, and (4) gates the deployment if any metric falls below threshold.
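Step (4) can be sketched as a function that turns metric scores into a process exit code, which is how most CI systems decide whether a job passed. The names below are hypothetical; the 0.85 faithfulness minimum is the figure from Lesson 1, and the context recall minimum is an illustrative assumption.

```python
import sys

# Minimum acceptable score per metric. The context_recall value is assumed.
THRESHOLDS = {"faithfulness": 0.85, "context_recall": 0.80}


def failing_metrics(scores, thresholds=THRESHOLDS):
    """Return {metric: (score, minimum)} for every metric below its minimum.

    A metric missing from `scores` counts as a failure.
    """
    failures = {}
    for name, minimum in thresholds.items():
        score = scores.get(name)
        if score is None or score < minimum:
            failures[name] = (score, minimum)
    return failures


def gate_exit_code(scores, thresholds=THRESHOLDS):
    """Report failures on stderr and return 1 if any metric fails, else 0."""
    failures = failing_metrics(scores, thresholds)
    for name, (score, minimum) in failures.items():
        print(f"FAIL {name}: {score} is below {minimum}", file=sys.stderr)
    return 1 if failures else 0


exit_code = gate_exit_code({"faithfulness": 0.82, "context_recall": 0.88})
# A CI step would end with sys.exit(exit_code); a non-zero code fails the job.
```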

Golden Dataset Sampling

A curated JSONL file with (query, reference_answer, relevant_chunk_ids) tuples. Google recommends 200–500 examples covering all major query types and edge cases. Stored in Cloud Storage, versioned with the agent code.

Agent Pipeline Execution

The full retrieve→augment→generate pipeline runs against each eval sample. Retrieved chunks and generated answers are captured alongside the query. This exercises the actual production code path, not a simplified proxy.

Metric Computation via Gemini Judge

The Vertex AI Evaluation SDK (vertexai.evaluation) submits (query, context, answer, reference) tuples to Gemini for scoring. Each metric is scored independently using structured prompts from Google's evaluation prompt library.

Threshold Gating and Alerting

Scores are written to BigQuery. Cloud Monitoring alerts fire on threshold violations. The CI/CD pipeline fails the deployment job. Teams receive a structured report linking low-scoring examples to specific retrieved chunks.

Trend Tracking

BigQuery stores historical eval runs. Looker Studio dashboards surface metric trends across agent versions. Regression detection uses a rolling 7-day baseline — a drop of more than 3 percentage points in faithfulness triggers a page.
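The regression rule above (rolling 7-day baseline, drop of more than 3 percentage points) can be sketched in plain Python. This assumes the historical scores have already been read from storage into (date, score) pairs; the function name is hypothetical.

```python
from datetime import date, timedelta


def faithfulness_regressed(history, today, current, window_days=7, max_drop=0.03):
    """Return True if `current` is more than `max_drop` below the rolling baseline.

    history: iterable of (date, score) pairs from earlier eval runs.
    The baseline is the mean score over the `window_days` before `today`.
    With no runs in the window there is no baseline, so nothing is flagged.
    """
    start = today - timedelta(days=window_days)
    window = [score for day, score in history if start <= day < today]
    if not window:
        return False
    baseline = sum(window) / len(window)
    return (baseline - current) > max_drop


history = [(date(2025, 1, d), 0.90) for d in range(1, 8)]  # seven daily runs
today = date(2025, 1, 8)
print(faithfulness_regressed(history, today, current=0.86))  # 4-point drop
print(faithfulness_regressed(history, today, current=0.88))  # 2-point drop
```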

Vertex AI Evaluation SDK: Key Classes

The vertexai.evaluation Python SDK (released October 2024) provides the primary interface. The key object is EvalTask, which accepts a dataset, a list of metric names, and an experiment name for Vertex AI Experiments tracking. Running eval_task.evaluate() submits the job to managed infrastructure — no GPU allocation required, since Gemini handles scoring server-side.

Custom metrics can be defined as CustomMetric objects with a scoring function that returns a float between 0 and 1. This allows domain-specific checks — for example, verifying that financial figures cited in an answer match source documents to within a specified tolerance, which is how one documented Google Cloud banking customer implemented their compliance check in 2024.
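The scoring logic for such a tolerance check might look like the sketch below. It assumes the figures have already been extracted from the answer and the source into dictionaries keyed by a label, which is the hard part and is not shown. The function name is hypothetical, and how the function is wrapped in a CustomMetric should be checked against the SDK reference.

```python
def figures_within_tolerance(answer_figures, source_figures, rel_tol=0.001):
    """Return 1.0 if every cited figure matches its source value, else 0.0.

    answer_figures / source_figures: dicts mapping a label (for example
    "revenue_2023") to a number. A figure matches when it differs from the
    source value by at most rel_tol (0.001 = 0.1%) of that value. A cited
    figure with no source counterpart counts as a mismatch.
    """
    for label, cited in answer_figures.items():
        expected = source_figures.get(label)
        if expected is None:
            return 0.0
        if abs(cited - expected) > rel_tol * abs(expected):
            return 0.0
    return 1.0


source = {"revenue_2023": 1000.0, "net_income_2023": 250.0}
print(figures_within_tolerance({"revenue_2023": 1000.5}, source))
print(figures_within_tolerance({"revenue_2023": 1002.0}, source))
```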

Production Pattern

Deutsche Bank's Google Cloud deployment (documented at Google Cloud Next 2024) runs eval pipelines on 2% of live traffic in addition to pre-deployment golden-set evaluation. Shadow scoring uses the same Gemini judge but operates asynchronously on sampled production queries, creating a continuous signal of real-world grounding quality separate from the controlled eval dataset.
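One common way to pick a stable share of traffic for shadow scoring is to hash a request identifier into the range [0, 1) and keep requests that fall below the sampling rate. The sketch below illustrates that general technique; it is not a description of the deployment above, and the function name is hypothetical.

```python
import hashlib


def in_shadow_sample(request_id, rate=0.02):
    """Deterministically decide whether a request is sampled for shadow scoring.

    The SHA-256 digest of the request ID is mapped to a number in [0, 1).
    The same ID always gets the same decision, so retries are not re-sampled.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


ids = [f"req-{i}" for i in range(20000)]
sampled_share = sum(in_shadow_sample(i) for i in ids) / len(ids)
print(f"sampled share: {sampled_share:.4f}")
```

Sampled requests would then be queued for asynchronous judge scoring so the production response is never blocked.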

Cost Consideration

LLM-as-judge scoring is not free. Running Gemini Pro on a 500-example eval set costs roughly $0.50–$2.00 per run depending on context length. Teams typically run full eval on model/prompt changes and lighter heuristic checks (chunk retrieval overlap, response length distribution) on every commit.

Key Terms

EvalTask: Vertex AI Evaluation SDK class that runs grounding metrics against a dataset using managed Gemini scoring.

Golden Dataset: Curated evaluation set with verified reference answers, used as the ground truth for automated metric computation.

Shadow Scoring: Asynchronous evaluation applied to sampled production traffic to detect real-world grounding drift.

Regression Gate: CI/CD check that blocks deployment when any grounding metric falls below a predefined threshold vs. baseline.

Quiz — Automated Evaluation Pipelines

Four questions · Select the best answer for each

1. What is the primary purpose of a "golden dataset" in a Vertex AI evaluation pipeline?

Correct. A golden dataset contains (query, reference_answer, relevant_chunk_ids) tuples with verified answers — the ground truth against which automated metrics like faithfulness and recall are computed.

The golden dataset provides verified reference answers that serve as ground truth for grounding metrics. It's not used to fine-tune the judge model — the judge model scores against these references.

2. In the Vertex AI Evaluation SDK, which class is the primary interface for running grounding metrics against an evaluation dataset?

Correct. EvalTask is the primary class in vertexai.evaluation — it accepts a dataset, metric list, and experiment name, and runs managed Gemini-based scoring server-side.

EvalTask is the correct answer. It's the primary class in the vertexai.evaluation SDK for running automated grounding metric evaluation.

3. Shadow scoring in a production grounded agent pipeline differs from pre-deployment eval in that it:

Correct. Shadow scoring is asynchronous — it samples live traffic and scores it with the Gemini judge without blocking the production response. It provides a real-world signal complementary to the controlled golden set.

Shadow scoring is asynchronous and non-blocking. It applies grounding metrics to sampled production queries to detect drift in real-world conditions that may not appear in the golden dataset.

4. A banking team wants to verify that financial figures in agent answers match source documents within a 0.1% tolerance. How should they implement this in Vertex AI?

Correct. The vertexai.evaluation SDK's CustomMetric class allows teams to define arbitrary scoring functions. A numerical tolerance check returns 0 or 1 based on whether extracted figures match source values within the specified threshold.

CustomMetric is the right approach — it allows teams to define domain-specific scoring functions (like numerical tolerance checks) that return values between 0 and 1, integrating cleanly with the standard EvalTask pipeline.

Lab 2 — Building an Eval Pipeline

Conversational lab · Design a CI/CD-integrated evaluation pipeline

Scenario

You're a data engineer at a financial services firm. Your grounded agent answers questions from 10-K filings. A Gemini model upgrade is scheduled for next month, and you need a CI/CD eval pipeline that gates deployment on grounding quality. Discuss implementation details with the assistant.

Suggested starting point: "We need to gate Gemini model upgrades on faithfulness scores from our 10-K filing agent. Walk me through the Cloud Build pipeline structure and how to fail the build on a regression."

Eval Pipeline Lab Assistant

Welcome to Lab 2. Let's design a CI/CD-integrated evaluation pipeline for your financial document agent. To start — what does your current deployment process look like? Are you using Cloud Build, GitHub Actions, or something else? And do you already have a golden dataset, or do we need to build that too?

Module 8 · Lesson 3

Retriever Improvement Strategies

When evaluation reveals the retriever is the problem — systematic techniques for closing the recall and precision gap.

Once evaluation tells you what's wrong with retrieval, how do you fix it?

In late 2023, the Google Cloud documentation team deployed a RAG-based assistant for developer documentation. Initial context recall was 0.61 — meaning the retriever missed relevant passages in nearly 40% of queries. The root cause was that developer questions used colloquial phrasing ("how do I make my function run faster") while documentation used formal terminology ("Cloud Functions performance optimization"). The fix required not a model change, but a retrieval architecture change: query rewriting before embedding lookup.

The Four Retriever Failure Modes

Evaluation data reveals specific failure patterns. Each has a targeted fix within the Vertex AI RAG Engine:

Failure Mode 1

Vocabulary Mismatch

User terms don't match document terminology. Fix: query rewriting with Gemini before embedding. The model paraphrases the query using domain vocabulary before vector search.

Failure Mode 2

Chunk Boundary Splits

Relevant information is split across chunk boundaries, degrading both precision and recall. Fix: semantic chunking (Vertex AI RAG Engine's sentence-boundary chunking) or overlapping chunks with 15–20% overlap ratio.

Failure Mode 3

Single-Vector Collapse

A single query embedding conflates multiple intents. Fix: HyDE (Hypothetical Document Embeddings) — generate a hypothetical answer, embed it, and retrieve against the answer's embedding space instead of the question's.

Failure Mode 4

Recency Blindness

Vector similarity ignores document freshness. Older but semantically similar documents outrank newer relevant ones. Fix: hybrid retrieval combining vector similarity with BM25 keyword score and a recency decay factor in Vertex AI Vector Search.
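The overlapping-chunk fix for chunk boundary splits can be sketched as a sliding window over tokens. This is a generic illustration, not the RAG Engine's chunker; the function name and defaults are assumptions, with the overlap ratio taken from the 15–20% range mentioned above.

```python
def chunk_with_overlap(tokens, chunk_size=200, overlap_ratio=0.2):
    """Split a token list into chunks that share a fraction of their tokens.

    Each chunk starts (chunk_size - overlap) tokens after the previous one,
    so text near a boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = round(chunk_size * overlap_ratio)
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = chunk_with_overlap(list(range(10)), chunk_size=5, overlap_ratio=0.2)
print(chunks)
```

With a chunk size of 5 and 20% overlap, each chunk repeats the final token of the one before it.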

Query Rewriting in Vertex AI

The Vertex AI RAG Engine supports a query transformation step before vector lookup. Configured via the retrieval_config parameter in the RAG Engine API, teams can enable Gemini-powered query rewriting with a system prompt that specifies domain vocabulary. The rewritten query is embedded and used for vector search; the original query is retained for the final answer generation prompt.

Google's internal documentation team reported recall improvement from 0.61 to 0.82 using this approach alone, with no changes to the underlying document corpus or chunk boundaries.
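The control flow of rewrite-then-retrieve can be sketched with the rewriter and the search backend passed in as plain functions. Everything here is a stand-in: the toy glossary rewrite replaces the Gemini call, the keyword search replaces vector search, and none of the names come from the RAG Engine API.

```python
def retrieve_with_rewrite(query, rewrite, search, top_k=5):
    """Search with a rewritten query but keep the original for generation."""
    rewritten = rewrite(query)
    return {
        "original_query": query,      # used later in the answer prompt
        "search_query": rewritten,    # used only for retrieval
        "chunks": search(rewritten, top_k),
    }


CORPUS = [
    "Cloud Functions performance optimization guide",
    "Billing export setup",
]


def keyword_search(query, top_k):
    """Toy stand-in for vector search: rank documents by shared words."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in CORPUS]
    hits = [doc for score, doc in sorted(scored, key=lambda p: -p[0]) if score > 0]
    return hits[:top_k]


def glossary_rewrite(query):
    """Toy stand-in for a model rewrite: map one colloquial phrase to a domain term."""
    return query.replace("run faster", "performance optimization")


query = "how do I make my function run faster"
direct_hits = keyword_search(query, 5)
result = retrieve_with_rewrite(query, glossary_rewrite, keyword_search)
print(direct_hits)
print(result["chunks"])
```

The colloquial query finds nothing on its own, while the rewritten query surfaces the performance guide.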

Hybrid Retrieval: Vector + BM25

Vertex AI Vector Search supports hybrid queries that combine dense vector similarity with sparse BM25 keyword matching. The API accepts a sparse_embedding field alongside the dense embedding vector. Vertex AI handles the fusion using Reciprocal Rank Fusion (RRF), which combines the ranked lists from both retrieval systems using rank positions rather than raw scores, so the two scoring scales do not need to be calibrated against each other and little tuning is required.

Hybrid retrieval particularly benefits structured queries ("find all mentions of Section 5.2.1") and exact-match requirements (product codes, legal citations) where semantic similarity alone is insufficient. In a 2024 Google Cloud case study, a legal tech firm improved context recall on regulatory document queries from 0.68 to 0.87 by adding BM25 as a secondary retrieval signal.
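For reference, the standard form of Reciprocal Rank Fusion gives each document a score of 1 / (k + rank) in every list it appears in and sums those scores. The sketch below implements that textbook formula with the commonly used constant k = 60; it is not a description of how Vertex AI implements fusion internally.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of lists, each ordered from best to worst.
    Each appearance contributes 1 / (k + rank), with rank starting at 1.
    Ties are broken alphabetically by document ID for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["a", "b", "c"]    # from vector similarity
sparse_ranking = ["c", "a", "d"]   # from keyword matching
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Documents that appear in both lists ("a" and "c") rise above those found by only one retriever.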

Reranking After Retrieval

Even with improved retrieval, the top-k chunks may not be optimally ordered for the generative model. Vertex AI integrates with Vertex AI Ranking API (generally available 2024) — a cross-encoder reranker that takes (query, chunk) pairs and scores relevance more accurately than the initial retrieval embedding similarity, at the cost of additional latency.

The Ranking API is called after initial vector retrieval (e.g., top 20 chunks) and re-orders them before the top 5–8 are passed to the generative model. This two-stage approach — fast approximate retrieval followed by precise reranking — balances latency and accuracy in production RAG pipelines.
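The two-stage pattern can be sketched with both stages injected as functions. The first-stage retriever and the pair scorer below are toy stand-ins; in practice the scorer would be a call to a reranking service, whose request format is not shown here.

```python
def retrieve_then_rerank(query, first_stage, score_pair, candidates=20, final_k=5):
    """Fetch a wide candidate pool, rescore each (query, chunk) pair, keep the best."""
    pool = first_stage(query, candidates)
    reranked = sorted(pool, key=lambda chunk: score_pair(query, chunk), reverse=True)
    return reranked[:final_k]


def toy_first_stage(query, n):
    """Stand-in for approximate vector retrieval: returns chunks in rough order."""
    pool = [
        "office hours",
        "annual plans overview",
        "refund rules for annual plans",
    ]
    return pool[:n]


def toy_score_pair(query, chunk):
    """Stand-in for a cross-encoder: count words shared by query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


top = retrieve_then_rerank(
    "refund annual plans", toy_first_stage, toy_score_pair, candidates=20, final_k=2
)
print(top)
```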

Documented Result

In the Google Cloud Architecture Center's RAG optimization guide (published March 2024), teams using query rewriting + hybrid retrieval + reranking together achieved an average context recall improvement of 28 percentage points over baseline dense-only retrieval, across five documented customer deployments in legal, financial, and healthcare domains.

Key Terms

Query Rewriting: Using Gemini to paraphrase user queries using domain vocabulary before embedding and vector search.

HyDE: Hypothetical Document Embeddings; generating a hypothetical answer and retrieving against its embedding.

Hybrid Retrieval: Combining dense vector similarity (embedding) with sparse BM25 keyword matching via Reciprocal Rank Fusion.

Vertex AI Ranking API: Cross-encoder reranker that rescores retrieved chunks against the query for more accurate relevance ordering.

Quiz — Retriever Improvement Strategies

Four questions · Select the best answer for each

1. A developer documentation assistant fails to retrieve relevant results when users ask colloquially ("make my app faster") even though documentation uses formal terms ("performance optimization"). What is the most targeted fix?

Correct. Vocabulary mismatch between colloquial queries and formal documentation is a classic retrieval failure. Query rewriting with Gemini bridges this gap by paraphrasing into domain terminology before embedding — this was exactly the documented fix for the Google Cloud developer docs assistant.

Query rewriting is the targeted fix for vocabulary mismatch — it transforms colloquial user language into domain terminology before vector search, without requiring any changes to the document corpus or embedding model.

2. HyDE (Hypothetical Document Embeddings) improves retrieval by:

Correct. HyDE generates a hypothetical answer (not a paraphrase of the question) and embeds that answer. Since document chunks are also in answer-space, the embedding similarity is computed between two answer-like texts, improving retrieval especially for knowledge-intensive queries.

HyDE works by generating a hypothetical answer to the query, embedding that answer text, and using it for vector search. This aligns the query embedding with the embedding space of actual document answers rather than question phrasing.

3. Vertex AI Vector Search's hybrid retrieval combines dense and sparse signals using:

Correct. Vertex AI Vector Search uses Reciprocal Rank Fusion to combine dense and sparse ranked lists. RRF works on rank positions rather than raw scores, so it needs little tuning, which makes it robust and easy to deploy.

Reciprocal Rank Fusion (RRF) is the fusion method used by Vertex AI — it combines ranked lists from dense and sparse retrieval without requiring weight tuning, which is why it's practical for production deployments.

4. The Vertex AI Ranking API is typically inserted into the RAG pipeline:

Correct. The Ranking API is a two-stage reranker — it operates after fast approximate vector retrieval (e.g., top 20 chunks) and reorders those candidates more accurately before the generative model receives the final top-k context.

The Ranking API operates after initial retrieval — this two-stage pattern uses fast vector search to get candidates, then applies accurate cross-encoder scoring to reorder them before passing to the generative model.

Lab 3 — Diagnosing Retrieval Failures

Conversational lab · Diagnose and fix low context recall in a production RAG agent

Scenario

Your healthcare document agent has a context recall of 0.58 and faithfulness of 0.91. Evaluation shows failures cluster around two query types: questions using patient-facing terminology versus clinical terminology, and multi-part questions spanning two document sections. Discuss diagnosis and remediation with the assistant.

Suggested starting point: "Our recall is 0.58. Failures cluster around terminology mismatches and multi-part questions. What's my diagnostic path and which retriever improvements should I prioritize?"

Retrieval Diagnostics Lab Assistant

Welcome to Lab 3. A recall of 0.58 with two failure clusters is actually a tractable problem — each cluster likely needs a different fix. Before I recommend solutions, let me ask: for the terminology mismatch failures, do you know whether the issue is that the right chunks don't surface at all, or that they surface but rank below your top-k cutoff? That distinction changes the fix significantly.

Module 8 · Lesson 4

Prompt and Generation Improvement

When evaluation reveals that retrieval is sound but generation still drifts — systematic techniques for constraining the model to its context.

When the retriever is working but the model still hallucinates — what changes?

In 2024, Anthropic published research (documented in their model card for Claude 3) showing that faithfulness failures in RAG settings are not random — they cluster around specific prompt structures. When retrieved context is presented before the question, faithfulness improves by 8–12 percentage points compared to presenting context after the question. Google replicated this finding internally for Gemini models, and Vertex AI's RAG Engine prompt templates now default to context-first ordering as a result.

The Generation Failure Modes

If evaluation shows high context recall (the right passages are being retrieved) but low faithfulness (the model is not staying within them), the problem is in the generation stage. Three patterns account for the majority of generation-level grounding failures:

1. Parametric Preference: The model defaults to knowledge from pre-training when retrieved context conflicts with its priors. This is especially common for well-known entities where the model has strong parametric memory.

2. Context Diffusion: When too many retrieved chunks are presented (high top-k), the model averages across them rather than citing specific passages, producing vague answers that aren't strictly grounded in any single source.

3. Instruction Drift: Verbose or complex system prompts cause the model to follow the spirit of the instruction rather than its letter, particularly around citation constraints.

Prompt Engineering for Faithfulness

Vertex AI's grounding-optimized prompt templates, available in Prompt Design in Generative AI Studio, implement several documented best practices:

Context-first ordering: Retrieved passages appear before the question in the prompt, not after. The question then sits directly before the point where generation begins, after the model has already read the context, which reduces parametric override.

Explicit citation constraint: System prompt language such as "Answer using only information from the provided documents. If the answer is not in the documents, say 'I don't have information on this in the provided context.'" This instruction, verified across Gemini 1.5 and 2.0 models (2024 internal evaluation), reduces parametric override by forcing explicit acknowledgment of knowledge boundaries.

Per-passage labeling: Numbering retrieved passages (e.g., [Document 1], [Document 2]) and instructing the model to cite source numbers in its answer creates an auditable grounding trail and measurably improves faithfulness in Gemini models by surfacing which passages were actually used.
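The three practices above can be combined in a single prompt builder. This is a sketch of the idea, not one of Vertex AI's templates: the function name and overall layout are assumptions, and the refusal sentence is the one quoted above.

```python
SYSTEM_INSTRUCTION = (
    "Answer using only information from the provided documents. "
    "If the answer is not in the documents, say "
    "'I don't have information on this in the provided context.' "
    "Cite the number of each document you used."
)


def build_grounded_prompt(question, passages):
    """Assemble a prompt: instruction, numbered passages, then the question last."""
    labeled = [
        f"[Document {number}]\n{text}"
        for number, text in enumerate(passages, start=1)
    ]
    return "\n\n".join([SYSTEM_INSTRUCTION, *labeled, f"Question: {question}"])


prompt = build_grounded_prompt(
    "What is the deductible for water damage?",
    [
        "Policy section 3.2: The water damage deductible is 750 USD.",
        "Policy section 5.1: Claims must be filed within 30 days.",
    ],
)
print(prompt)
```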

Controlling Context Window Scope

Context diffusion is a documented phenomenon. Stanford's "Lost in the Middle" paper (Liu et al., 2023) showed, and Google's Gemini evaluation confirmed, that models pay less attention to passages in the middle of long context windows. In grounded RAG applications, this means that padding the context with marginally relevant chunks can degrade faithfulness even if recall is high.

The operational fix is dynamic top-k: instead of always retrieving k=10 chunks, retrieve based on a similarity score threshold (e.g., only include chunks with cosine similarity above 0.72). Vertex AI Vector Search supports a distance_threshold parameter in the FindNeighbors API for exactly this purpose. Teams that switched from fixed top-k to threshold-based retrieval in documented 2024 deployments reduced context window size by 40% while maintaining or improving faithfulness scores.
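Applied client-side, threshold-based selection is a short filter over scored results. The sketch assumes the retriever returns (chunk, cosine similarity) pairs; the 0.72 cutoff is the example value from the text, and the function name and the cap of 10 chunks are hypothetical.

```python
def select_by_threshold(scored_chunks, min_similarity=0.72, max_chunks=10):
    """Keep chunks at or above a similarity cutoff, best first, up to a cap.

    scored_chunks: iterable of (chunk, similarity) pairs.
    The cap guards against very broad queries that match many chunks.
    """
    kept = [(chunk, sim) for chunk, sim in scored_chunks if sim >= min_similarity]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_chunks]


results = [("a", 0.91), ("b", 0.70), ("c", 0.75), ("d", 0.40)]
selected = select_by_threshold(results)
print(selected)
```

A query with few strong matches now yields a short context instead of being padded out to a fixed k.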

Grounding Config in Vertex AI

For Gemini models called via Vertex AI, the GroundingConfig object can be passed in generation requests to enable server-side grounding enforcement. When disable_attribution is False (default), Gemini returns a grounding_attributions field in its response — a list of (text_span, source_segment) pairs linking each claim in the generated text to its supporting retrieved passage.

These attributions enable post-hoc faithfulness verification and are the basis for the grounding score shown in Vertex AI's Generative AI Studio playground. In production, they can be logged to BigQuery alongside responses to enable the shadow scoring pipeline described in Lesson 2.

Temperature and Faithfulness

Higher generation temperature increases creativity but reduces faithfulness in grounded settings. Google's internal testing across Gemini 1.5 Pro found that temperature above 0.3 measurably degraded faithfulness scores for document-grounded QA tasks. Production grounded agents on Vertex AI typically set temperature to 0.0–0.2 for maximum faithfulness, reserving higher temperatures for summarization or creative synthesis tasks where some interpretation is acceptable.

End-to-End Pattern

The full improvement cycle documented in Google Cloud's Architecture Center: (1) Run eval → identify whether failure is retrieval or generation stage. (2) If retrieval: apply query rewriting, hybrid search, reranking. (3) If generation: tighten prompt constraints, reduce top-k via threshold, lower temperature, enable grounding attributions. (4) Re-run eval to confirm improvement. (5) Gate deployment on the combined metric suite. Teams that follow this cycle systematically report 2–3 iteration cycles to reach production-grade grounding quality from a baseline RAG deployment.

Key Terms

Parametric Preference: Model defaulting to pre-training knowledge over retrieved context when they conflict.

Context Diffusion: Model producing vague ungrounded answers when too many chunks compete for attention in the context window.

Dynamic Top-K: Threshold-based retrieval that includes only chunks above a similarity score, rather than a fixed count.

GroundingConfig: Vertex AI API object that enables grounding attributions (source citations for each generated claim).

Grounding Attributions: Per-claim source linkages returned by Gemini when GroundingConfig is enabled; basis for production faithfulness logging.

Quiz — Prompt and Generation Improvement

Four questions · Select the best answer for each

1. An agent has high context recall (0.88) but low faithfulness (0.61). The most likely diagnosis is:

Correct. High recall means the retriever is finding relevant passages. Low faithfulness despite good retrieval points to a generation-stage failure — either parametric override or context diffusion from excessive chunks in the prompt.

When recall is high but faithfulness is low, the retriever is not the problem — the right passages are being found but the model isn't staying within them. This is a generation-stage failure: parametric preference or context diffusion.

2. Research from both Anthropic (2024) and Google's internal Gemini evaluation found that faithfulness improves when:

Correct. Context-first ordering — presenting retrieved passages before the question — measurably improves faithfulness. Vertex AI's RAG Engine prompt templates default to this ordering following internal validation of the finding.

Context-first ordering is the documented best practice — placing retrieved passages before the question in the prompt improves faithfulness by 8–12 percentage points in documented evaluations across Gemini and Claude models.

3. The "Lost in the Middle" phenomenon (Liu et al., 2023) implies which retrieval practice for grounded agents?

Correct. "Lost in the Middle" shows models pay less attention to middle-position context. The practical implication: don't pad the context window with low-relevance chunks. Dynamic top-k using a similarity threshold keeps only genuinely relevant passages, reducing diffusion effects.

The "Lost in the Middle" finding implies that longer context with more chunks doesn't always help — models lose track of middle-position information. The fix is dynamic top-k using a similarity threshold, so only genuinely relevant chunks are included.

4. Vertex AI's GroundingConfig and grounding_attributions in Gemini API responses enable:

Correct. GroundingConfig with disable_attribution=False causes Gemini to return grounding_attributions — a structured list linking each text span in the generated answer to its source passage. These can be logged for shadow scoring and audit trails.

GroundingConfig enables grounding attributions — per-claim source linkages that connect each generated text span to its supporting retrieved passage. This enables auditable RAG pipelines and is the basis for production faithfulness logging.

Lab 4 — Tightening Generation Faithfulness

Conversational lab · Fix generation-stage grounding failures through prompt and config changes

Scenario

Your insurance claims agent retrieves relevant policy documents with 0.84 recall, but faithfulness is only 0.67. Analysis shows the model frequently cites figures from its parametric memory (e.g., "standard deductibles are typically $500") rather than from retrieved policy documents, and elaborates beyond what the documents state. Work through generation-stage fixes with the lab assistant.

Suggested starting point: "Our retrieval is decent but generation keeps pulling in parametric knowledge. Walk me through prompt changes and Vertex AI config settings to clamp the model to only its retrieved context."

Generation Faithfulness Lab Assistant

Module 8 · Lab 4

Welcome to Lab 4. Parametric override on well-known domain facts (like insurance deductibles) is one of the trickier faithfulness problems — the model "knows" the right-sounding answer and it bleeds through. Let's work through this systematically. First: what does your current system prompt look like? Specifically, how are you framing the grounding constraint for the model right now?

Module 8 — Final Test

15 questions · 80% required to pass · Covers all four lessons

1. Which grounding metric specifically measures whether the model's generated claims are supported by retrieved context?

Correct. Faithfulness measures the fraction of generated claims supported by retrieved context — the primary hallucination detection metric.

Faithfulness is the correct answer — it measures whether generated claims are supported by retrieved context.

2. An agent scores faithfulness 0.95 but context recall 0.40. What is the most accurate operational interpretation?

Correct. High faithfulness + low recall = model stays within what it was given, but retrieval is incomplete. The agent produces grounded but incomplete answers.

High faithfulness + low recall means the model is faithful to poor retrieval inputs — the retriever is the bottleneck.
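The interpretation rule above can be written as a small triage helper. This is a sketch only: the single 0.8 cut-off and the label strings are assumptions made for illustration, and a real pipeline would likely set separate thresholds per metric.

```python
def diagnose(context_recall: float, faithfulness: float, threshold: float = 0.8) -> str:
    """Map a (recall, faithfulness) pair to the stage that most needs attention."""
    recall_ok = context_recall >= threshold
    faithful_ok = faithfulness >= threshold
    if recall_ok and faithful_ok:
        return "healthy"
    if faithful_ok:
        # Model stays within its context, but the context is incomplete.
        return "retrieval bottleneck"
    if recall_ok:
        # Right passages were found, but the answer strays from them.
        return "generation bottleneck"
    return "both stages failing"


print(diagnose(context_recall=0.40, faithfulness=0.95))
print(diagnose(context_recall=0.84, faithfulness=0.67))
```

The first call mirrors this question (faithful but incomplete answers); the second mirrors the Lab 4 scenario, where retrieval is adequate and the generator is the problem.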

3. Which Vertex AI SDK class is the primary interface for running automated grounding evaluation with Gemini as judge?

Correct. EvalTask in vertexai.evaluation is the primary class — it accepts a dataset, metric list, and experiment name, running managed Gemini-based scoring.

EvalTask is the correct answer — it's the primary evaluation orchestration class in the Vertex AI Evaluation SDK.

4. Shadow scoring in a production grounded agent is best described as:

Correct. Shadow scoring is asynchronous, non-blocking, and applies to sampled live traffic to detect grounding drift in real-world conditions.

Shadow scoring is asynchronous evaluation of sampled production traffic — not blocking, and separate from pre-deployment eval.
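One way to picture shadow scoring is a sampler on the request path feeding a background worker. The sketch below uses only the Python standard library. The `ShadowScorer` class, the record keys, and the 5% default sample rate are hypothetical, and `score_fn` stands in for whatever faithfulness scorer a team actually uses.

```python
import queue
import random
import threading
from typing import Callable, Optional


class ShadowScorer:
    """Scores a sample of live responses on a background thread."""

    def __init__(
        self,
        score_fn: Callable[[dict], float],
        sample_rate: float = 0.05,  # illustrative default
        rng: Optional[random.Random] = None,
    ):
        self.score_fn = score_fn
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()
        self.results = []  # (request_id, score) pairs
        self.errors = 0
        self._queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, record: dict) -> bool:
        """Called on the request path; enqueues a sampled record and returns at once."""
        if self.rng.random() >= self.sample_rate:
            return False
        self._queue.put(record)
        return True

    def _run(self) -> None:
        while True:
            record = self._queue.get()
            try:
                self.results.append((record["request_id"], self.score_fn(record)))
            except Exception:
                # A scoring failure must never affect live traffic.
                self.errors += 1
            finally:
                self._queue.task_done()

    def drain(self) -> None:
        """Block until every queued record is scored (for tests and shutdown)."""
        self._queue.join()
```

The serving code calls `submit` after the response has been sent, so scoring latency and scoring errors never reach the user; dashboards and alerts read from `results` instead.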

5. The Google Cloud Architecture Center reference (2024) found that teams using query rewriting, hybrid retrieval, and reranking together achieved what average context recall improvement over baseline dense-only retrieval?

Correct. The Google Cloud Architecture Center's RAG optimization guide documented an average 28 percentage point recall improvement across five customer deployments when all three retrieval improvements were combined.

28 percentage points is the documented figure from Google Cloud's Architecture Center RAG optimization guide, across five enterprise deployments.

6. Vertex AI Vector Search's FindNeighbors API supports which parameter for implementing dynamic top-k?

Correct. The distance_threshold parameter in the FindNeighbors API enables threshold-based retrieval — only returning chunks within a specified distance of the query embedding.

distance_threshold is the correct parameter — it enables threshold-based retrieval in Vertex AI Vector Search, implementing dynamic top-k.

7. The "Lost in the Middle" (Liu et al., 2023) finding implies that for grounded RAG agents, you should:

Correct. "Lost in the Middle" shows models lose track of middle-positioned context in long windows. The fix is to keep only genuinely relevant chunks using similarity thresholds, not to pad with marginal content.

The implication is to avoid padding context with low-relevance chunks — models lose track of middle content in long context windows, so quality matters more than quantity.

8. Vertex AI Vector Search uses which method to combine dense and sparse retrieval signals in hybrid search?

Correct. Vertex AI uses Reciprocal Rank Fusion, which combines the ranked lists from dense and sparse retrieval by rank position rather than raw score, so the two score scales never need to be normalised against each other.

Reciprocal Rank Fusion (RRF) is the fusion method: it merges the ranked lists from both retrieval systems using rank positions only, so no score normalisation is required.
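For reference, the textbook form of Reciprocal Rank Fusion fits in a few lines. This is a generic sketch of the algorithm, not Vertex AI's internal implementation. The one constant in the classic formulation is a smoothing term `k`, conventionally set to 60, and the document IDs here are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); raw scores are never used.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["a", "b", "c"]   # hypothetical dense (embedding) ranking
sparse = ["b", "d", "a"]  # hypothetical sparse (keyword) ranking
fused = reciprocal_rank_fusion([dense, sparse])
```

Documents that appear near the top of both lists accumulate the most score, so agreement between the two retrievers is rewarded without comparing cosine similarities to keyword scores directly.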

9. HyDE improves retrieval by:

Correct. HyDE generates a hypothetical answer (not a paraphrase), embeds it, and retrieves against document embeddings — aligning query and document in answer-space rather than question-space.

HyDE generates a hypothetical answer, embeds it, and retrieves using that embedding — improving alignment between query and document embedding spaces.
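The HyDE flow can be sketched with the model call and the embedding call passed in as plain functions. Everything named here (`hyde_retrieve`, `generate_answer`, `embed`, the in-memory `doc_vectors` dictionary) is a placeholder for illustration. A real system would call an LLM, an embedding model, and a vector index instead.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_retrieve(
    question: str,
    generate_answer: Callable[[str], str],  # placeholder for an LLM call
    embed: Callable[[str], Vector],         # placeholder for an embedding model
    doc_vectors: dict[str, Vector],         # doc_id -> precomputed embedding
    top_k: int = 3,
) -> list[str]:
    """Retrieve using the embedding of a hypothetical answer, not the question."""
    hypothetical = generate_answer(question)  # step 1: draft a plausible answer
    query_vector = embed(hypothetical)        # step 2: embed that draft
    ranked = sorted(                          # step 3: nearest documents first
        doc_vectors,
        key=lambda doc_id: cosine(query_vector, doc_vectors[doc_id]),
        reverse=True,
    )
    return ranked[:top_k]
```

The hypothetical answer is discarded after retrieval. It may well be wrong on the facts; its job is only to look like the kind of passage that contains the real answer.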

10. The Vertex AI Ranking API is positioned in the RAG pipeline:

Correct. The Ranking API is the second stage of a two-stage pipeline: fast vector retrieval gathers candidates, the Ranking API reorders them by relevance, and the top reranked chunks go to the generative model.

The Ranking API operates after initial retrieval but before generation — it reorders retrieved candidates more accurately before the model sees them.
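The position of the reranker in the pipeline can be shown with a short sketch in which both stages are injected as functions. `retrieve` and `rerank_score` are stand-ins. The second would wrap a call to a reranking service in practice, and the candidate and final counts are illustrative.

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], list[str]],  # stage 1: fast, approximate
    rerank_score: Callable[[str, str], float],  # stage 2: slower, more accurate
    candidate_k: int = 50,
    final_k: int = 5,
) -> list[str]:
    """Fetch a wide candidate set, reorder it, and keep only the best few."""
    candidates = retrieve(query, candidate_k)
    reranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return reranked[:final_k]  # only these chunks reach the generative model
```

Retrieving wide and reranking narrow lets the cheap stage optimise for recall while the expensive stage optimises for precision on a small candidate set.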

11. Context diffusion, as a generation-stage failure mode, is most directly caused by:

Correct. Context diffusion occurs when too many chunks crowd the context window: instead of citing concrete retrieved text, the model produces vague answers that are not grounded in any specific passage.

Context diffusion is caused by excessive retrieved chunks competing for model attention — the fix is reducing context size via similarity thresholds.

12. Google's internal testing found that generation temperature above what value measurably degraded faithfulness in document-grounded Gemini 1.5 Pro tasks?

Correct. Google's internal evaluation found temperature above 0.3 measurably degraded faithfulness for document-grounded QA. Production grounded agents typically use 0.0–0.2.

Temperature above 0.3 was the documented threshold — production grounded agents on Vertex AI typically use 0.0–0.2 for maximum faithfulness.

13. Enabling GroundingConfig with disable_attribution=False in Gemini API calls returns which response field?

Correct. grounding_attributions is returned — a list of (text_span, source_segment) pairs linking generated text to supporting retrieved passages.

grounding_attributions is the returned field — structured per-claim source linkages that enable auditable RAG pipelines and production faithfulness logging.

14. A financial services firm wants to implement domain-specific numerical accuracy checks (within 0.1% tolerance) as part of their grounding evaluation. The correct implementation uses:

Correct. CustomMetric in vertexai.evaluation allows arbitrary domain-specific scoring functions — including numerical tolerance checks — that integrate with the standard EvalTask pipeline.

CustomMetric with a domain-specific scoring function is the right approach — it allows numerical tolerance checks to integrate cleanly with the Vertex AI evaluation pipeline.
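A scoring function for such a check might look like the sketch below. It is written as a plain Python function so it runs without any SDK installed. The instance keys (`"response"`, `"reference"`), the returned dictionary shape, and the metric name are assumptions about how the function would be wired into a CustomMetric and should be checked against the SDK documentation.

```python
import re

# Matches unsigned figures such as 500, 1,250 or 1,250.50 (signs are ignored here).
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> list[float]:
    return [float(match.replace(",", "")) for match in _NUMBER.findall(text)]


def numeric_accuracy(instance: dict, rel_tol: float = 0.001) -> dict:
    """Fraction of reference figures that the response reproduces within rel_tol."""
    expected = extract_numbers(instance["reference"])
    found = extract_numbers(instance["response"])
    if not expected:
        # Nothing numeric to verify, so the check passes trivially.
        return {"numeric_accuracy": 1.0}
    matched = sum(
        any(abs(value - ref) <= rel_tol * abs(ref) for value in found)
        for ref in expected
    )
    return {"numeric_accuracy": matched / len(expected)}


result = numeric_accuracy({
    "reference": "Premium is 1,250.00 with a 500 deductible.",
    "response": "The premium is $1,250.50 and the deductible is $550.",
})
```

Here the premium is within 0.1% of the reference and the deductible is not, so half of the reference figures are matched. A stricter variant could also penalise extra figures in the response that appear nowhere in the reference.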

15. According to Google's internal RAG benchmark across 12 document-grounded applications, teams that tracked all four grounding metrics independently (vs. composite BLEU/ROUGE) reduced production hallucination incidents by approximately:

Correct. Google's internal benchmark found a 61% reduction in production hallucination incidents for teams tracking all four metrics independently vs. composite scores. The insight: composite scores hide which component is failing.

61% is the documented figure — tracking four metrics independently revealed which component (retriever vs. generator) was failing, enabling targeted fixes that composite scores obscured.