Module 8 · Lesson 1

Deployment Strategies for Vertex AI Agents

From notebook to production: canary releases, blue/green deployments, and traffic management on Vertex AI Agent Engine.

How do you move a validated agent into production without risking a catastrophic rollout?

In March 2024, Google publicly documented how its internal Search generative experience teams used staged rollout patterns on Vertex AI before expanding AI answers to all Search users. Initial exposure was held at 1% of traffic while latency, grounding accuracy, and user satisfaction signals were evaluated — a playbook now built into Vertex AI Agent Engine's deployment controls.

Why Deployment Strategy Matters for Agents

Deploying a language-model agent is categorically different from deploying a stateless microservice. Agents make multi-step decisions, call external tools, maintain conversation state, and can produce outputs that are difficult to fully evaluate in staging. A bad deployment can generate harmful, incorrect, or embarrassing responses at scale before automated checks catch them.

Vertex AI Agent Engine (the managed runtime that backs Vertex AI Agents, formerly Vertex AI Conversation) exposes three primary deployment controls: traffic splits, agent versions, and rollout schedules. These map onto classic software patterns but carry additional nuance for stateful, probabilistic systems.

Blue/Green Deployment

In a blue/green deployment, you maintain two complete, independent environments — the live (blue) and the staged candidate (green). Traffic is switched atomically. For agents, this means two fully provisioned agent versions with separate Datastore or tool integrations. The advantage is instant rollback: if green misbehaves, you redirect the load balancer back to blue within seconds.

Vertex AI's agent versioning feature supports this natively. Each published version receives a stable endpoint URL. A Cloud Load Balancing URL map or an API Gateway route rule can toggle between them without redeploying anything.

Architecture Note

Blue/green works best when your agent's external tool bindings (e.g., Cloud Functions, BigQuery connections) are environment-scoped. Sharing a single database between blue and green agents creates state-mutation conflicts that make rollback hazardous.

Canary Releases

A canary release routes a small percentage of real traffic — typically 1–5% — to the new agent version while the majority continues hitting the stable version. Vertex AI Agent Engine supports traffic percentage splits directly on agent endpoints, letting you specify e.g. 95% to version A and 5% to version B.
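The routing idea behind a percentage split can be sketched at the application layer. This is not Agent Engine's own configuration syntax. It is a minimal illustration that assumes you hash a session ID so that every turn of a conversation lands on the same version; `choose_version` and the labels `"stable"` and `"canary"` are hypothetical names.

```python
import hashlib


def choose_version(session_id: str, canary_percent: float) -> str:
    """Deterministically map a session to 'canary' or 'stable'.

    Hashing the session ID (instead of choosing randomly per request)
    keeps a multi-turn conversation on one agent version.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto 10,000 buckets (0.01% each).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "canary" if bucket < canary_percent * 100 else "stable"
```

Calling `choose_version("session-123", 5)` repeatedly returns the same answer, which is the property that matters for stateful agents.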

The signal collection window matters enormously. Google's documented recommendation for production agents is a minimum of 48–72 hours at each traffic tier before advancing, because LLM quality regressions often surface only in long-tail queries, which show up in meaningful numbers only after enough traffic has accumulated. A 1% canary on a low-traffic internal agent might not produce statistically meaningful signal for days.
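A quick way to sanity-check an observation window is to estimate how long a tier needs to accumulate a target number of canary conversations. The sketch below is plain arithmetic; the target of 2,000 conversations is an illustrative assumption, not a figure from Google's guidance, and `days_to_collect` is a hypothetical helper.

```python
import math


def days_to_collect(daily_conversations: int, canary_percent: float,
                    target_samples: int) -> int:
    """Days needed for the canary tier to see `target_samples` conversations."""
    per_day = daily_conversations * canary_percent / 100
    if per_day <= 0:
        raise ValueError("canary tier receives no traffic")
    return math.ceil(target_samples / per_day)


# 50,000 conversations/day at 1% is 500 canary conversations per day.
print(days_to_collect(50_000, 1, 2_000))   # 4
# A low-traffic internal agent at the same split takes far longer.
print(days_to_collect(2_000, 1, 2_000))    # 100
```

If the estimate exceeds the 48–72 hour minimum, the sample target rather than the clock is the binding constraint.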

Pattern

Canary Release

1–5% real traffic to new version. Gradual promotion. Best for quality regressions.

Pattern

Blue / Green

Two full environments. Atomic switch. Best for infrastructure changes and instant rollback.

Pattern

Shadow Mode

New version receives mirrored traffic but responses are discarded. Zero user risk. Expensive.

Pattern

Feature Flags

Route specific user cohorts or session IDs to new agent logic. Fine-grained control.

Shadow Mode Testing

Shadow mode — sometimes called dark launch — duplicates every inbound request to both the current and candidate agent. The candidate's responses are logged but never returned to users. This is the safest evaluation method for high-stakes agents (financial, medical, legal) because no real user ever sees an unvalidated response.

The cost is real: you pay for two inference runs per request. On Vertex AI, shadow mode is typically implemented by adding a second Cloud Run service that mirrors traffic via a request fanout, then comparing outputs offline using the Vertex AI Model Evaluation pipeline.
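The fanout itself can be sketched without any Google Cloud dependency. In this minimal version, `stable` and `candidate` are hypothetical callables standing in for the two agent versions, and `sink` stands in for whatever log or table you compare offline. The user only ever receives the stable response.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable

# Small shared pool: shadow calls run off the user's request path.
_shadow_pool = ThreadPoolExecutor(max_workers=4)


def _shadow_call(request: str, stable_response: str,
                 candidate: Callable[[str], str],
                 sink: Callable[[dict], None]) -> None:
    record = {"request": request, "stable": stable_response}
    try:
        record["candidate"] = candidate(request)
    except Exception as exc:  # a failing candidate must never affect users
        record["candidate_error"] = repr(exc)
    sink(record)  # stored for offline comparison only


def handle_request(request: str, stable: Callable[[str], str],
                   candidate: Callable[[str], str],
                   sink: Callable[[dict], None]) -> tuple[str, Future]:
    """Return the stable response; mirror the request to the candidate."""
    response = stable(request)
    future = _shadow_pool.submit(_shadow_call, request, response, candidate, sink)
    return response, future
```

The returned future is there so a test or batch job can wait for the shadow call; a request handler would normally ignore it.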

Rollback Procedures

Every production deployment must have a documented rollback procedure with a defined trigger threshold. A practical template:

Trigger Metric: Error rate >2%, task completion rate drop >5%, or P95 latency >4s sustained for 10 minutes.

Rollback Action: Re-point traffic split to 100% stable version via Agent Engine traffic controls or API Gateway route update.

RTO Target: Recovery Time Objective should be <5 minutes for customer-facing agents. Automate via Cloud Monitoring alert → Cloud Function → Agent Engine Admin API.
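The trigger thresholds in this template can be written down as a small check. The sketch makes two assumptions that the template leaves open: the completion-rate drop is measured in percentage points against a baseline, and the "sustained for 10 minutes" condition is applied to any breach, evaluated over one sample per minute. `MinuteSample` and `should_roll_back` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class MinuteSample:
    error_rate: float       # fraction, 0.02 means 2%
    completion_rate: float  # fraction of conversations that completed the task
    p95_latency_s: float    # seconds


def breaches(sample: MinuteSample, baseline_completion: float) -> bool:
    """True if this minute violates any of the three template thresholds."""
    return (
        sample.error_rate > 0.02
        or baseline_completion - sample.completion_rate > 0.05
        or sample.p95_latency_s > 4.0
    )


def should_roll_back(samples: list[MinuteSample], baseline_completion: float,
                     sustained_minutes: int = 10) -> bool:
    """Roll back only when the most recent minutes all breach a threshold."""
    if len(samples) < sustained_minutes:
        return False
    recent = samples[-sustained_minutes:]
    return all(breaches(s, baseline_completion) for s in recent)
```

Requiring a sustained breach avoids rolling back on a single noisy minute, at the cost of reacting up to ten minutes late.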

Production Pattern

Google's Site Reliability Engineering documentation recommends automating rollback for agents using a release gate: a Cloud Monitoring alert that fires when SLO burn rate exceeds a threshold, triggering a Pub/Sub message consumed by a Cloud Function that calls the Agent Engine Admin API to reset the traffic split.
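The body of such a function might look like the sketch below. It assumes the alert arrives as a base64-encoded JSON message with an `incident.state` field, which matches my understanding of Cloud Monitoring's Pub/Sub notification payload but should be verified against a real message. `set_traffic_split` is a hypothetical callable standing in for whichever Agent Engine administrative call you use; no real API signature is implied.

```python
import base64
import json
from typing import Callable


def handle_alert(event: dict, set_traffic_split: Callable[[dict], None]) -> bool:
    """Reset traffic to the stable version when an incident opens.

    `event` mimics a Pub/Sub-triggered function event whose "data"
    field holds the base64-encoded notification JSON.
    Returns True if a rollback was issued.
    """
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    incident = payload.get("incident", {})
    if incident.get("state") != "open":
        # Ignore "closed" notifications so recovery does not re-trigger rollback.
        return False
    set_traffic_split({"stable": 100, "canary": 0})
    return True
```

Passing the traffic-split call in as a parameter keeps the handler testable without cloud credentials.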

Lesson 1 Quiz

Deployment Strategies · 4 questions

What is the primary advantage of a blue/green deployment over a canary release for Vertex AI agents?

Correct. Blue/green maintains two full environments, so switching back to the stable version is an atomic traffic redirect — no gradual ramp-down required.

Not quite. Blue/green actually requires more infrastructure (two full environments). Its key advantage is instant rollback via atomic traffic switching.

Why does Google's documented guidance recommend a minimum 48–72 hour observation window at each canary tier for production agents?

Correct. Long-tail queries — rare but valid user inputs — only appear in meaningful volume after sustained traffic exposure. A short window may miss important failure modes.

The reason is statistical: LLM failures often hide in rare query patterns. You need enough traffic volume across enough time to surface these edge cases.

In shadow mode deployment, what happens to the candidate agent's responses?

Correct. Shadow mode is a "dark launch" — the candidate receives real traffic mirrors but its outputs are discarded from the user-facing response. Zero user risk.

Shadow mode never exposes candidate responses to users. Responses are captured for offline evaluation only — that's the entire point of the pattern.

Which Google Cloud service can be used to automate agent rollback when a Cloud Monitoring SLO burn rate alert fires?

Correct. The pattern is: Alert → Pub/Sub notification → Cloud Function → Agent Engine Admin API call to reset traffic split. This keeps rollback latency under 5 minutes.

The recommended automation pattern chains: Cloud Monitoring alert fires → Pub/Sub topic → Cloud Function subscriber → Agent Engine Admin API to update the traffic split.

Lab 1: Deployment Strategy Design

Practice designing agent deployment plans with an AI assistant

Your Scenario

You are deploying a new version of a customer-service agent for a mid-size e-commerce company. The agent handles order lookups, returns, and product questions. Current traffic is approximately 50,000 conversations per day. The new version includes a redesigned returns flow and updated product catalog grounding.

Ask the assistant to help you design a canary rollout plan, define rollback triggers, or compare deployment strategies for this specific scenario. Try at least 3 exchanges to complete the lab.

Deployment Strategy Assistant


I'm your deployment strategy assistant for Vertex AI agents. You're working on a canary rollout for a customer-service agent handling 50,000 conversations/day. What aspect would you like to design first — the traffic split schedule, rollback triggers, or something else?

Module 8 · Lesson 2

Observability and Monitoring for Production Agents

Tracing multi-step reasoning, measuring quality at scale, and building dashboards that actually surface agent failures.

What does "healthy" mean for an agent that makes probabilistic decisions across dozens of tool calls per session?

When Duolingo deployed its Duolingo Max AI tutoring features (powered by GPT-4 via Azure OpenAI) in 2023, the engineering team published that their biggest monitoring challenge was not latency or errors — it was detecting pedagogically incorrect explanations at scale. Standard HTTP monitoring was useless. They built custom evaluation pipelines that sampled 1% of conversations and ran them through a secondary LLM judge to flag quality regressions. Vertex AI's Online Evaluation feature is Google's productized answer to exactly this problem.

The Three Layers of Agent Observability

Agent monitoring requires three distinct layers that do not overlap cleanly with traditional APM:

Layer 1

Infrastructure Metrics

Latency (P50/P95/P99), error rates, token consumption, quota utilization. Standard Cloud Monitoring dashboards cover this layer.

Layer 2

Behavioral Traces

Which tools were called, in what order, with what inputs/outputs. Cloud Trace plus LangSmith-style structured logging cover this layer.

Layer 3

Quality Signals

Groundedness, task completion, user satisfaction, factual accuracy. Requires LLM-based evaluation or human review. Most teams neglect this layer.

Cloud Monitoring Integration

Vertex AI Agent Engine automatically emits metrics to Cloud Monitoring under the aiplatform.googleapis.com namespace. Key metrics to dashboard immediately after deployment:

prediction/latencies: Distribution of response latency in milliseconds. Alert on P95 exceeding your SLO target.

prediction/error_count: Total errors by error type. Distinguish model errors from tool call failures and from quota exhaustion.

prediction/request_count: QPS by endpoint version. Essential for verifying canary traffic splits are behaving as configured.

token_count (input/output): Per-session token consumption. Sudden spikes often indicate prompt injection or runaway conversation loops.
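Verifying a canary split from request counts is simple arithmetic once you have per-version totals. The sketch below assumes you have already queried those totals into a dictionary; the version labels and the one-percentage-point tolerance are illustrative assumptions.

```python
def canary_share(request_counts: dict[str, int], canary_version: str) -> float:
    """Fraction of all requests served by the canary version."""
    total = sum(request_counts.values())
    if total == 0:
        raise ValueError("no requests observed")
    return request_counts.get(canary_version, 0) / total


def split_is_healthy(request_counts: dict[str, int], canary_version: str,
                     expected_percent: float,
                     tolerance_points: float = 1.0) -> bool:
    """True if the observed canary share is within tolerance of the target."""
    observed = canary_share(request_counts, canary_version) * 100
    return abs(observed - expected_percent) <= tolerance_points
```

For example, counts of 9,500 and 500 requests pass a 5% target, while 8,000 and 2,000 do not.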

Distributed Tracing with Cloud Trace

For agents built with the Vertex AI SDK, OpenTelemetry instrumentation is the recommended tracing approach. When you initialize your agent application, wrapping it with an OTEL tracer context that exports to Cloud Trace gives you span-level visibility into each reasoning step.

A single user conversation might produce a trace with spans for: initial LLM call → tool selection → Vertex AI Search retrieval → second LLM call → response generation. Without this trace, you cannot distinguish "the agent was slow" from "the search retrieval was slow and forced a retry."

Critical Pattern

Always propagate a session ID and conversation turn ID as trace attributes. This lets you reconstruct the full conversation history for any flagged trace — without these IDs, individual spans are diagnostically useless when user complaints arrive.

Vertex AI Online Evaluation

Vertex AI's Online Evaluation feature (GA as of 2024) allows you to configure automated quality assessments that run asynchronously on sampled production traffic. You define evaluation criteria — groundedness, coherence, instruction following, custom rubrics — and specify a sampling rate (e.g., 5%). A judge model scores each sampled conversation and writes results to BigQuery.

This is the productized equivalent of what Duolingo built manually. The BigQuery output can drive a Looker Studio dashboard or feed back into Cloud Monitoring as custom metrics to trigger alerts when quality scores drop below a threshold.

Groundedness: Is the response supported by retrieved context?

Coherence: Is the multi-turn conversation logically consistent?

Safety: Does the response violate configured content policies?

Task Completion: Did the agent accomplish the stated user goal?
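The sample-then-judge loop described in this section can be sketched independently of the managed feature. The code below is not the Online Evaluation API. It assumes a hypothetical `judge` callable that returns a score per criterion, and it produces rows shaped the way you might write them to a results table.

```python
import hashlib
from statistics import mean
from typing import Callable


def is_sampled(conversation_id: str, sample_percent: float) -> bool:
    """Deterministic sampling: the same conversation is always in or out."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 10_000 < sample_percent * 100


def evaluate_sample(conversations: dict[str, str],
                    judge: Callable[[str], dict[str, float]],
                    sample_percent: float) -> list[dict]:
    """Score only the sampled conversations; one result row per conversation."""
    return [
        {"conversation_id": cid, **judge(transcript)}
        for cid, transcript in conversations.items()
        if is_sampled(cid, sample_percent)
    ]


def below_threshold(rows: list[dict], metric: str, threshold: float) -> bool:
    """True if the mean score for `metric` has dropped below the threshold."""
    return bool(rows) and mean(r[metric] for r in rows) < threshold
```

`below_threshold` is the piece you would wire to an alert; the threshold value itself is a product decision, not something this sketch can supply.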

User Feedback Integration

Explicit user feedback signals (thumbs up/down, satisfaction ratings, explicit corrections) are among the highest-value monitoring inputs available. In Vertex AI Agent Builder, you can log feedback events to Cloud Logging using the Conversation API's feedback endpoint. Aggregate these in BigQuery alongside your quality evaluation scores to build a composite health index.

Google's internal guidance documents an important caveat: user satisfaction and quality are not the same signal. Users often rate confident wrong answers positively. Layer both explicit feedback and automated quality evaluation for a complete picture.
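One way to layer the two signals is to compute a composite index and, separately, list the turns where they disagree. In the sketch below the field names (`thumbs_up`, `groundedness`), the equal weighting, and the 0.5 quality threshold are all illustrative assumptions.

```python
from statistics import mean


def confident_wrong(turns: list[dict], quality_threshold: float = 0.5) -> list[dict]:
    """Turns the user rated positively but the judge scored as low quality."""
    return [t for t in turns
            if t["thumbs_up"] and t["groundedness"] < quality_threshold]


def health_index(turns: list[dict], feedback_weight: float = 0.5) -> float:
    """Weighted blend of satisfaction rate and mean quality score (0 to 1)."""
    satisfaction = mean(1.0 if t["thumbs_up"] else 0.0 for t in turns)
    quality = mean(t["groundedness"] for t in turns)
    return feedback_weight * satisfaction + (1 - feedback_weight) * quality
```

The disagreement list is usually more actionable than the index: it points directly at the confident wrong answers that satisfaction metrics hide.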

Lesson 2 Quiz

Observability and Monitoring · 4 questions

Which observability layer do most production agent teams neglect, according to the lesson?

Correct. Quality signals — groundedness, task completion, factual accuracy — require LLM-based evaluation or human review, making them harder to instrument. Most teams stop at layers 1 and 2.

Quality signals (layer 3) are what most teams neglect. Infrastructure and trace metrics are well-served by existing tooling, but quality requires custom evaluation pipelines.

A sudden spike in the per-session token_count metric often indicates what type of problem?

Correct. Token spikes per session are a diagnostic signal for either prompt injection (attackers adding large context payloads) or agents looping — repeatedly calling tools without resolving the user's goal.

Token spikes per session specifically suggest either a prompt injection attack (adding large payloads to context) or a reasoning loop where the agent keeps retrying without converging.

What is the primary purpose of propagating a session ID and turn ID as Cloud Trace attributes?

Correct. Without session and turn IDs attached to spans, individual trace entries are orphaned — you can see that a span failed but cannot find the conversation context that caused it.

Session and turn IDs in trace attributes allow you to reconstruct the full conversation for any problematic span. Without them, debugging specific failures is nearly impossible.

Why does Google's guidance note that user satisfaction ratings and quality evaluation scores are not interchangeable monitoring signals?

Correct. A confident, fluent, but factually wrong agent response often receives positive user ratings. Satisfaction metrics alone give a false sense of quality health.

The key insight is that users rate confidence and fluency, not correctness. An agent that confidently gives wrong answers can score high on satisfaction while failing quality evaluation.

Lab 2: Monitoring Dashboard Design

Design observability strategies for production agents

Your Scenario

Your e-commerce customer-service agent has been running in production for two weeks. You're receiving complaints that some users report receiving incorrect return policy information, but your latency and error dashboards look normal. You need to design a monitoring strategy that would surface this class of issue.

Ask the assistant about setting up quality evaluation, building Cloud Monitoring alert policies, designing BigQuery schemas for conversation logging, or using Vertex AI Online Evaluation. Aim for at least 3 exchanges.

Monitoring Strategy Assistant


I'm here to help you build a monitoring strategy that catches quality issues, not just infrastructure failures. Your situation — normal latency metrics but user complaints about incorrect policy information — is a classic Layer 3 observability gap. Where would you like to start: Vertex AI Online Evaluation configuration, BigQuery logging schema, or Cloud Monitoring alert policies?

Module 8 · Lesson 3

Feedback Loops and Agent Improvement

Closing the loop from production signals to improved agent versions: data flywheels, RLHF on Vertex AI, and prompt iteration at scale.

How do you systematically turn the failures your monitoring surfaces into a better agent, without breaking what already works?

Waymo's 2023 technical documentation on its autonomous vehicle AI described a concept it calls the "data flywheel": every mile driven by a production vehicle produces labeled training data (via the vehicle's own safety systems) that improves the next model version. Google's documentation on Vertex AI agent improvement explicitly borrows this framing — production conversations are treated as a continuous source of training signal, not as disposable logs.

The Agent Data Flywheel

A data flywheel for production agents works in cycles. Each cycle collects production conversations, filters and labels them, uses them to improve the agent (via fine-tuning, prompt updates, or grounding data updates), deploys the improved version, and restarts collection. The key discipline is systematic filtering: not all conversations are equally valuable as training signal.

Step 1

Collect

Log all conversations to Cloud Logging / BigQuery. Include session IDs, turn-level responses, tool call traces, and any user feedback signals.

Step 2

Filter

Select high-signal examples: low-rated turns, fallback activations, long conversations where the agent failed to resolve the task, grounding failures.

Step 3

Label

Human raters or LLM judges annotate filtered conversations with correct responses. Vertex AI Data Labeling Service supports both workflows.

Step 4

Improve

Apply corrections via prompt updates, fine-tuning runs on Vertex AI, grounding data refresh, or tool instruction updates.
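The Filter step is the easiest of the four to express in code. The sketch below selects the high-signal categories named in Step 2; the record fields and the cutoffs (a rating of 2 or lower, 10 or more turns) are assumptions chosen for illustration.

```python
def is_high_signal(conv: dict, long_conversation_turns: int = 10) -> bool:
    """True if a logged conversation is worth sending to labeling."""
    rating = conv.get("user_rating")  # e.g. 1-5 stars, None if unrated
    low_rated = rating is not None and rating <= 2
    unresolved_long = (
        len(conv.get("turns", [])) >= long_conversation_turns
        and not conv.get("resolved", False)
    )
    return bool(
        low_rated
        or conv.get("fallback_triggered")
        or unresolved_long
        or conv.get("grounding_failed")
    )


def filter_for_labeling(conversations: list[dict]) -> list[dict]:
    return [c for c in conversations if is_high_signal(c)]
```

Long conversations that did resolve are deliberately excluded, since length alone is not evidence of failure.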

Prompt Iteration at Scale

For most production agents, prompt engineering is the fastest and cheapest improvement lever — faster than fine-tuning, cheaper than adding infrastructure. Vertex AI Prompt Management (released 2024) provides versioned prompt storage with A/B testing capability directly in the console or via SDK.

A rigorous prompt iteration workflow at scale uses the following discipline: identify a category of failure from production logs → formulate a hypothesis about why the prompt causes the failure → write a candidate prompt → evaluate the candidate against a regression test set of known-correct conversations before deploying. Without the regression test set, "fixing" one failure mode routinely breaks a different one.

Anti-Pattern Warning

The most common agent improvement mistake is optimizing prompts against the complaints in your support queue without checking whether the fix degrades the majority of unproblematic conversations. Always maintain a balanced evaluation set that includes both failure cases and successful baseline conversations.
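A regression gate for prompt changes can be sketched as two pass-rate comparisons, one on the failure cases and one on the healthy baseline. Here `run_agent` is a hypothetical callable that takes a prompt and a query, and correctness is judged by exact match, which is a simplifying assumption; a real evaluation would more likely use a rubric or a judge model.

```python
from typing import Callable

Case = tuple[str, str]  # (user query, expected answer)


def pass_rate(prompt: str, cases: list[Case],
              run_agent: Callable[[str, str], str]) -> float:
    if not cases:
        raise ValueError("evaluation set must not be empty")
    passed = sum(1 for query, expected in cases
                 if run_agent(prompt, query) == expected)
    return passed / len(cases)


def safe_to_deploy(current: str, candidate: str,
                   failure_cases: list[Case], baseline_cases: list[Case],
                   run_agent: Callable[[str, str], str]) -> bool:
    """Candidate must fix more failures AND not regress the healthy baseline."""
    fixes = (pass_rate(candidate, failure_cases, run_agent)
             > pass_rate(current, failure_cases, run_agent))
    no_regression = (pass_rate(candidate, baseline_cases, run_agent)
                     >= pass_rate(current, baseline_cases, run_agent))
    return fixes and no_regression
```

Keeping the two sets separate is the point: a candidate that fixes every complaint but drops the baseline pass rate is rejected.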

Supervised Fine-Tuning on Vertex AI

When prompt iteration hits diminishing returns — typically when the failure mode requires the model to have new knowledge or a fundamentally different behavior pattern — supervised fine-tuning (SFT) on Vertex AI is the next lever. The process uses Vertex AI Generative AI Studio's tuning pipeline:

Dataset Format: JSONL with {"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "model", ...}]} structure. Minimum ~100 examples; 500–1000 for meaningful behavioral change.

Tuning Job: Launched via Vertex AI SDK or Console. Runs on TPU accelerators. Typical cost for Gemini 1.5 Flash SFT: $2–8 per 1,000 training examples.

Evaluation: Run the tuned model through your standard eval set before promoting. Compare against base model on both target task and held-out general capability tests.
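Producing the JSONL file from labeled conversations is mostly serialization. The sketch below follows the messages structure described above, with one loud assumption: the text elides the fields after "role", so the `content` key used here is a guess that must be checked against the tuning documentation for your model version.

```python
import json

MIN_EXAMPLES = 100  # approximate lower bound quoted in the text


def to_sft_record(system: str, user: str, model: str) -> dict:
    # ASSUMPTION: "content" is an illustrative key name; the lesson text
    # elides the actual fields that follow "role".
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "model", "content": model},
    ]}


def to_jsonl(examples: list[dict]) -> str:
    """One JSON object per line; refuses datasets below the stated minimum."""
    if len(examples) < MIN_EXAMPLES:
        raise ValueError(f"need at least {MIN_EXAMPLES} examples, got {len(examples)}")
    return "\n".join(
        json.dumps(to_sft_record(**ex), ensure_ascii=False) for ex in examples
    )
```

Each entry in `examples` is a dictionary with `system`, `user`, and `model` keys holding the corrected conversation text from the labeling step.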

RLHF and Preference Optimization

Vertex AI supports Reinforcement Learning from Human Feedback (RLHF) via its model tuning pipeline for select Gemini models. The practical application for production agents is typically preference tuning: given pairs of agent responses to the same user query (one rated positively, one negatively), the model is tuned to prefer the positive style.

The data collection requirement is the main bottleneck. You need paired preference examples at scale — typically thousands of pairs for meaningful signal. Google's 2024 guidance suggests that for most production agent improvement cycles, SFT on high-quality curated examples outperforms RLHF unless you have access to hundreds of human raters or very high production traffic to generate natural preference pairs.
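Natural preference pairs come from grouping rated responses by query and pairing each positive with each negative. The sketch below assumes a simple record shape (`query`, `response`, and a `rating` that is positive or negative); the output keys `chosen` and `rejected` are illustrative and not a required tuning format.

```python
from collections import defaultdict
from itertools import product


def build_preference_pairs(rated: list[dict]) -> list[dict]:
    """Pair every positively rated response with every negatively rated
    response to the same query. Queries with only one polarity yield nothing."""
    by_query: dict[str, dict[str, list[str]]] = defaultdict(
        lambda: {"pos": [], "neg": []}
    )
    for r in rated:
        side = "pos" if r["rating"] > 0 else "neg"
        by_query[r["query"]][side].append(r["response"])

    pairs = []
    for query, group in by_query.items():
        for chosen, rejected in product(group["pos"], group["neg"]):
            pairs.append({"query": query, "chosen": chosen, "rejected": rejected})
    return pairs
```

Running this over real logs makes the bottleneck visible: most queries are seen once or rated in only one direction, so they contribute no pairs at all.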

Grounding Data Updates

For agents that rely on Vertex AI Search datastores or enterprise data connectors, many factual errors trace not to the model but to stale or incomplete grounding data. A disciplined grounding update cadence — typically weekly for frequently-changing domains like product catalogs or policy documents — can resolve a significant fraction of quality complaints without any model changes.

Monitor retrieval coverage (the fraction of user queries that return relevant grounding documents) in addition to response quality. A drop in retrieval coverage usually indicates that user query patterns have drifted from the current grounding corpus, not that the model has regressed.
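Retrieval coverage as defined here is a single ratio. The sketch below assumes each query's retrieval result is available as a list of relevance scores and treats a document as relevant at a score of 0.5 or higher; both the input shape and the cutoff are assumptions.

```python
def retrieval_coverage(retrievals: list[list[float]],
                       min_relevance: float = 0.5) -> float:
    """Fraction of queries that returned at least one relevant document.

    `retrievals` holds, per user query, the relevance scores of the
    documents the search returned (an empty list means nothing came back).
    """
    if not retrievals:
        raise ValueError("no queries to evaluate")
    covered = sum(
        1 for scores in retrievals
        if any(score >= min_relevance for score in scores)
    )
    return covered / len(retrievals)
```

Tracking this number per topic or intent, not only globally, helps show which part of the corpus user queries have drifted away from.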

Improvement Priority Order

Google's recommended improvement hierarchy for production agents: 1) Grounding data refresh → 2) Prompt iteration with regression testing → 3) Tool/instruction updates → 4) Supervised fine-tuning → 5) RLHF/preference optimization. Start at the top; each step is faster and cheaper than the next.

Lesson 3 Quiz

Feedback Loops and Agent Improvement · 4 questions

What is the most common anti-pattern when iterating on agent prompts based on user complaints?

Correct. Fixing specific complaint cases without a regression test set that includes healthy conversations routinely introduces new failures — a well-documented failure mode in production AI systems.

The classic anti-pattern is fixing the complaints you see while unknowingly breaking the conversations you don't see complaints about. Always test against a balanced evaluation set.

According to Google's recommended improvement hierarchy, what should you try before supervised fine-tuning when a production agent has quality issues?

Correct. The hierarchy is: grounding data refresh → prompt iteration → tool/instruction updates → SFT → RLHF. Start with the fastest, cheapest interventions first.

Fine-tuning is step 4 in Google's hierarchy. Before reaching for SFT, try grounding data refresh, prompt iteration with regression testing, and tool/instruction updates — all faster and cheaper.

What does a drop in retrieval coverage in a Vertex AI Search-grounded agent most likely indicate?

Correct. Retrieval coverage drops when user queries start asking about topics not covered in the indexed corpus. A datastore refresh or expansion resolves this without any model changes.

Coverage drops typically mean the grounding corpus hasn't kept up with what users are asking about — query drift. Update the datastore before investigating model-side causes.

Why does Google's guidance suggest that SFT often outperforms RLHF for most production agent improvement cycles?

Correct. RLHF's data requirement is the bottleneck. Thousands of preference pairs require either high production traffic (to generate natural pairs) or large human rater teams — resources most teams don't have.

The data bottleneck is the key issue: RLHF needs thousands of preference pairs to work well. Without high traffic or a large rater pool, SFT on curated examples is more practical and effective.

Lab 3: Building an Improvement Pipeline

Design a data flywheel and improvement workflow for a production agent

Your Scenario

Your monitoring dashboard has surfaced a pattern: 8% of conversations involving return shipping labels end with users expressing frustration or asking the same question multiple times. The Vertex AI Online Evaluation pipeline shows groundedness scores dropping for queries about "international return shipping." You need to design an improvement pipeline to fix this issue.

Ask the assistant to help you design the data collection strategy, filtering approach, labeling workflow, or decide which improvement lever (grounding refresh, prompt iteration, or fine-tuning) to use first. Try at least 3 exchanges to complete this lab.

Agent Improvement Assistant


I'm your agent improvement assistant. You have a clear failure signal: international return shipping queries have low groundedness scores and cause conversation loops. Before jumping to fine-tuning, let's diagnose properly. Do you want to start with the grounding data investigation, the data collection/filtering strategy, or the evaluation methodology to confirm the root cause?

Module 8 · Lesson 4

Safety, Compliance, and Responsible AI in Production

Content safety controls, PII handling, audit logging, and operationalizing Google's Responsible AI principles on Vertex AI.

What happens when your production agent encounters a safety edge case at 3 AM when no engineer is watching?

In February 2024, Google's Gemini image generation was temporarily disabled after the system produced historically inaccurate images when prompted for historical figures. The post-mortem, discussed publicly by Google CEO Sundar Pichai, centered on insufficient production monitoring for cultural sensitivity edge cases — cases that passed safety benchmarks but failed in real-world production traffic patterns. This event directly influenced Vertex AI's updated safety evaluation guidelines and the emphasis on ongoing production safety monitoring rather than one-time pre-deployment checks.

Vertex AI Safety Controls Architecture

Vertex AI provides a layered safety architecture for production agents. Understanding which layer handles which risk is essential for correct configuration:

Layer 1

Model-Level Safety

Gemini's built-in safety training. Rejects clearly harmful content by default. Cannot be fully disabled for customer-facing deployments.

Layer 2

Safety Filters

Configurable harm category thresholds (HARM_CATEGORY_HATE_SPEECH, DANGEROUS_CONTENT, etc.) set per deployment. Vertex AI Safety API.

Layer 3

Vertex AI Guardrails

Custom content filters, banned topics, output constraints. Configured via Agent Builder safety settings. Can be organization-specific.

Layer 4

Application Controls

Your application code validates, filters, and logs responses before returning to users. Last line of defense. Should never be the only defense.

PII Detection and Handling

Production agents frequently receive sensitive user data — names, account numbers, addresses, health information — in conversation context. Vertex AI integrates with Cloud DLP (Data Loss Prevention) to provide automated PII detection and redaction in conversation logs. The integration works at two points:

Input Inspection: Cloud DLP inspects user inputs before they enter the agent context. Optionally redacts PII in logs while passing unredacted content to the model for processing.

Output Inspection: Cloud DLP inspects agent responses before they're returned to the user. Catches cases where the model inadvertently surfaces PII from its training data or grounding corpus.

Log Redaction: All conversation logs stored in Cloud Logging or BigQuery should have PII redacted at write time. Configure Cloud DLP deidentification templates for your specific data types.
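The three inspection points can be illustrated with a regex-based stand-in. This is not Cloud DLP and is far weaker than it; the two patterns, the `redact` helper, and the `process_turn` flow are hypothetical and exist only to show where redaction sits relative to the model call and the log write.

```python
import re
from typing import Callable

# Toy detectors; a real deployment would rely on Cloud DLP infoTypes instead.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ACCOUNT_NUMBER": re.compile(r"\b\d{8,12}\b"),
}


def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def process_turn(user_input: str, agent: Callable[[str], str],
                 log: list[dict]) -> str:
    # Log redaction: PII is removed at write time.
    log.append({"role": "user", "text": redact(user_input)})
    # Input inspection variant: the model still sees the unredacted input.
    raw_output = agent(user_input)
    # Output inspection: redact before anything reaches the user.
    safe_output = redact(raw_output)
    log.append({"role": "agent", "text": safe_output})
    return safe_output
```

Whether the model should receive unredacted input is a per-deployment decision; the sketch shows the optional variant described under Input Inspection.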

Audit Logging and Compliance

Enterprise deployments require immutable audit logs documenting every agent interaction — who asked what, what the agent responded, what tools were called, and what data was accessed. Vertex AI writes to Cloud Audit Logs automatically for data access events, but conversation-level audit logging requires explicit configuration.

The recommended architecture for regulated industries (financial services, healthcare) logs: timestamp, user identifier (pseudonymized), session ID, input hash, output hash, tool calls list, grounding document IDs accessed. The input/output hashes allow compliance review without storing raw PII in audit logs. Cloud Logging with log sinks to Cloud Storage provides the immutability and retention controls required by most compliance frameworks.
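An audit record with the fields listed above might be assembled as follows. The sketch uses SHA-256 for the input and output hashes and a keyed HMAC for the pseudonymous user identifier; those algorithm choices, the field names, and the `audit_record` helper are assumptions, not a prescribed schema.

```python
import hashlib
import hmac
from datetime import datetime, timezone


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pseudonymize(user_id: str, key: bytes) -> str:
    """Keyed hash: stable per user, not reversible without the key."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def audit_record(user_id: str, session_id: str, user_input: str,
                 agent_output: str, tool_calls: list[str],
                 grounding_doc_ids: list[str], key: bytes) -> dict:
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": pseudonymize(user_id, key),
        "session_id": session_id,
        "input_hash": sha256_hex(user_input),    # raw text is never stored
        "output_hash": sha256_hex(agent_output),
        "tool_calls": list(tool_calls),
        "grounding_doc_ids": list(grounding_doc_ids),
    }
```

A reviewer who holds the original text can recompute the hash to confirm that a logged interaction matches, without the audit log itself containing the text.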

GDPR / CCPA Consideration

If your agent operates in the EU or serves California residents, conversation logs containing user data are subject to data subject access and deletion rights. Design your BigQuery logging schema with a pseudonymous user_id that can be joined to a separate identity table — making deletion (nulling the identity table row) operationally feasible without rewriting conversation logs.
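The deletion design can be demonstrated with an in-memory SQLite database standing in for BigQuery. The table and column names are hypothetical; the point is only that removing one identity row severs the link while the conversation log is left untouched.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE identity (user_id TEXT PRIMARY KEY, email TEXT);
        CREATE TABLE conversation_log (
            user_id TEXT, session_id TEXT, output_hash TEXT
        );
    """)
    return db


def forget_user(db: sqlite3.Connection, user_id: str) -> None:
    """Honor a deletion request by removing only the identity mapping."""
    db.execute("DELETE FROM identity WHERE user_id = ?", (user_id,))
    db.commit()


def identified_rows(db: sqlite3.Connection) -> list[tuple]:
    """Log rows that can still be tied to a real person."""
    return db.execute("""
        SELECT i.email, c.session_id
        FROM conversation_log AS c
        JOIN identity AS i ON i.user_id = c.user_id
        ORDER BY c.session_id
    """).fetchall()
```

After `forget_user`, the user's log rows still exist for audit purposes but no longer join to any identity. Whether that satisfies a specific deletion obligation is a question for legal review, not for this sketch.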

Operationalizing Google's Responsible AI Principles

Google's seven Responsible AI principles (published in 2018 and updated with specific AI guidance through 2024) translate into concrete production controls for Vertex AI agents:

Be Socially Beneficial: Document intended use cases and explicitly out-of-scope uses in your agent's system prompt and in operational runbooks.

Avoid Unfair Bias: Run regular demographic bias audits using Vertex AI Evaluation with stratified test sets across demographic dimensions. Log and act on disparate performance findings.

Be Accountable: Maintain an AI system card documenting the agent's capabilities, limitations, training data lineage, and known failure modes. Update it with each major version.

Preserve Privacy: Implement Cloud DLP integration, pseudonymous logging, and data minimization in grounding corpus construction.

Incident Response for Agent Safety Events

Every production agent deployment should have a documented AI Safety Incident Response Plan before going live — not after the first incident. The plan should define: (1) what constitutes a safety event, (2) who is on-call and how they're alerted, (3) the escalation path, (4) the immediate containment action (traffic shift to safe fallback or agent shutdown), and (5) the post-incident review process.

Google's Gemini image generation incident is instructive: the system was publicly disabled within hours of the issues being identified. Having pre-authorized kill switches — traffic splits that can redirect to a safe fallback agent without requiring change management approval — is a non-negotiable production requirement for public-facing agents.

Production Safety Checklist

Before launching a public-facing agent:
✓ Safety filter thresholds configured and tested
✓ Cloud DLP PII inspection enabled on inputs and outputs
✓ Immutable audit logging configured with appropriate retention
✓ Pre-authorized kill switch with sub-5-minute RTO
✓ Safety incident response runbook documented and on-call rotation established
✓ AI system card published and approved by responsible owner

Lesson 4 Quiz

Safety, Compliance, and Responsible AI · 4 questions

Why does Vertex AI's safety architecture use multiple layers rather than relying solely on Gemini's built-in model safety training?

Correct. Model-level safety addresses broadly harmful content, but organizational policies, banned topics, and domain-specific risks require additional configurable layers that the model alone cannot anticipate.

Defense in depth is the principle: model safety covers generic harms, but each production context has unique domain-specific risks that require additional layered controls beyond the model's training.

What is the recommended approach for storing conversation audit logs to comply with data subject deletion rights under GDPR?

Correct. Pseudonymization with a joinable identity table allows deletion (by removing the identity mapping) without rewriting immutable conversation logs — satisfying both audit requirements and deletion rights.

The recommended approach uses a pseudonymous user_id in logs joined to a separate identity table. Deleting the identity row effectively anonymizes the historical records, because the pseudonym can no longer be linked to a person, without rewriting immutable logs.

What key lesson does Google's February 2024 Gemini image generation incident illustrate about AI safety in production?

Correct. The incident involved cases that passed standard safety benchmarks but failed in real production traffic patterns — a clear demonstration that pre-deployment evaluation is necessary but not sufficient.

The incident showed that safety benchmarks can miss real-world failure patterns. Production monitoring for edge cases — not just pre-launch evaluation — is necessary for responsible AI deployment.

What does Cloud DLP's output inspection in a Vertex AI agent pipeline protect against?

Correct. Output DLP inspection catches the specific risk of model memorization — where a model may reproduce personal data from its training set — as well as PII leakage from grounding documents.

Output inspection protects against the model surfacing PII from its own training data (memorization) or from grounding documents. This is a distinct risk from input PII and requires separate handling.

Lab 4: Safety and Compliance Design

Design safety controls and compliance architecture for a production agent

Your Scenario

Your e-commerce customer-service agent is being expanded to handle loyalty program account inquiries, which means users will now share account numbers, email addresses, and purchase history in conversation. Your legal team has flagged GDPR compliance requirements. You need to design the safety and compliance architecture before expanding this capability.

Ask the assistant to help you design the PII handling strategy, audit logging architecture, safety filter configuration, incident response runbook, or AI system card. Try at least 3 exchanges to complete this lab.

Safety & Compliance Assistant

I'm your safety and compliance design assistant. Adding loyalty account handling introduces PII risks your current architecture may not address. Before users share account numbers in conversations, you need Cloud DLP integration, pseudonymous audit logging, and a safety incident runbook. Which aspect would you like to design first?

Module 8 Test

Deploying, Monitoring, and Improving Agents in Production · 15 questions · Pass at 80%

1. In Vertex AI Agent Engine, which feature allows you to route 5% of live traffic to a new agent version while keeping 95% on the stable version?

Correct. Vertex AI Agent Engine supports traffic percentage splits directly on agent endpoints, enabling canary releases without external load balancer configuration.

The correct answer is traffic percentage splits on agent endpoints. This feature is built into Agent Engine and allows canary releases without external infrastructure changes.
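To make the idea of a percentage split concrete, the sketch below shows one way a 5/95 split could assign conversations to versions. It is not the Agent Engine API. The function name `pick_version` is hypothetical, and keying the split on a session ID (so a conversation stays on one version) is an assumption of this example.

```python
import hashlib


def pick_version(session_id: str, canary_percent: int = 5) -> str:
    """Return 'candidate' for roughly canary_percent% of sessions, else 'stable'."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hashing the session ID gives every session a fixed bucket in 0..99,
    # so repeated turns of the same conversation land on the same version.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "candidate" if bucket < canary_percent else "stable"
```

Raising `canary_percent` step by step widens exposure, and setting it to 0 sends every session back to the stable version.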

2. What is "shadow mode" deployment in the context of Vertex AI agents?

Correct. Shadow mode duplicates traffic to the candidate agent but discards its responses — zero user exposure to unvalidated outputs while enabling real-traffic evaluation.

Shadow mode mirrors live traffic to the candidate and discards its responses, so users never see candidate outputs. It is the safest evaluation method, but it doubles the inference compute cost.
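The mirroring logic can be sketched as a thin wrapper around two agents. The names here (`handle_with_shadow`, `shadow_log`) are hypothetical, and the agents are modeled as plain callables rather than real Agent Engine clients.

```python
from typing import Callable, List, Tuple

Agent = Callable[[str], str]


def handle_with_shadow(
    query: str,
    stable: Agent,
    candidate: Agent,
    shadow_log: List[Tuple[str, str, str]],
) -> str:
    """Serve the stable response; record the candidate's output for offline comparison."""
    response = stable(query)
    try:
        # In production this call would typically run asynchronously
        # so the mirror never adds latency to the user's request.
        shadow_response = candidate(query)
    except Exception as exc:
        # A failing candidate must never affect the user.
        shadow_response = f"candidate error: {exc!r}"
    shadow_log.append((query, response, shadow_response))
    return response  # the candidate's output is never returned
```

Both agents run on every request, which is where the doubled inference cost comes from.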

3. Which Cloud Monitoring metric namespace does Vertex AI Agent Engine use for automatically emitted metrics?

Correct. Vertex AI emits metrics to Cloud Monitoring under the aiplatform.googleapis.com namespace, including latency, error count, and request count metrics.

The correct namespace is aiplatform.googleapis.com — all Vertex AI services emit Cloud Monitoring metrics under this namespace.

4. When designing distributed tracing for a multi-step agent, what two identifiers should always be propagated as trace attributes?

Correct. Session ID and turn ID allow you to reconstruct the full conversation context for any flagged trace, making individual span diagnostics actionable.

Session ID and turn ID are the critical trace attributes — without them, you can see that a span failed but cannot find the conversation context that caused the failure.

5. Vertex AI Online Evaluation samples production conversations and scores them using a judge model. Where does it write results by default?

Correct. Online Evaluation writes evaluation results to BigQuery, enabling SQL analysis, Looker Studio dashboards, and scheduled alerting via custom Cloud Monitoring metrics.

Online Evaluation results go to BigQuery by default, allowing rich analytical queries and integration with BI tools like Looker Studio.

6. What is the first recommended step in Google's agent improvement hierarchy when quality issues surface in production?

Correct. Grounding data refresh is step 1 — it's fast, cheap, and resolves a significant fraction of factual quality issues without any model changes.

The hierarchy starts with grounding data refresh, the fastest and cheapest fix. Move on to prompt iteration, then fine-tuning, then RLHF only when the previous step does not resolve the issue.

7. A drop in retrieval coverage for a Vertex AI Search-grounded agent most likely indicates what?

Correct. Coverage drops when the corpus hasn't kept up with user query evolution — update the datastore to cover emerging topics.

Retrieval coverage drops are a corpus staleness signal: users are asking about topics not indexed in your datastore. Update the corpus before investigating model-side causes.

8. When does Google's guidance suggest using RLHF over supervised fine-tuning for agent improvement?

Correct. RLHF's bottleneck is preference data volume — you need thousands of pairs, requiring either high traffic or substantial human rating capacity.

RLHF outperforms SFT only when you can generate sufficient preference pairs (thousands). This requires high traffic volume or a large dedicated rater pool.

9. Which Vertex AI safety layer handles configurable harm category thresholds like HARM_CATEGORY_HATE_SPEECH?

Correct. Harm category thresholds (BLOCK_NONE, BLOCK_LOW_AND_ABOVE, etc.) are configured via Layer 2 — the Safety API — per deployment.

Harm category thresholds are a Layer 2 control configured via the Vertex AI Safety API. Layer 1 is fixed model safety; Layer 3 is custom organizational guardrails.

10. For GDPR compliance, what is the recommended BigQuery logging architecture to support data subject deletion requests?

Correct. Pseudonymization with a joinable identity table satisfies both audit (immutable logs) and deletion (remove identity mapping) requirements simultaneously.

A pseudonymous user_id plus a separate identity table is the recommended pattern. Deleting the identity row satisfies deletion rights without rewriting immutable conversation logs.

11. What does Cloud DLP output inspection in an agent pipeline specifically protect against?

Correct. Output DLP catches the specific risk of model memorization and grounding corpus PII leakage in agent responses before they reach users.

Output DLP inspection guards against the model inadvertently including PII from its training data (memorization) or from retrieved grounding documents in its responses.

12. What is a "pre-authorized kill switch" in the context of production agent safety incident response?

Correct. A pre-authorized kill switch is an on-call engineer's ability to redirect traffic to a safe fallback immediately — without needing change management sign-off that would add minutes to the response time.

A kill switch is a pre-configured traffic redirect that on-call engineers can activate immediately without change management delays. The Gemini image incident showed why sub-hour containment capability matters.

13. What automated rollback architecture does Google's SRE documentation recommend for agents with SLO burn rate violations?

Correct. The alert → Pub/Sub → Cloud Function → Admin API chain minimizes rollback latency and keeps the entire flow within Google Cloud's managed services.

The recommended chain is: Alert fires → Pub/Sub topic → Cloud Function subscriber → Agent Engine Admin API to reset the traffic split. This keeps RTO under 5 minutes.
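The Cloud Function step of that chain might look like the sketch below. It assumes a Pub/Sub-triggered function whose event carries a base64-encoded `data` field, and a Cloud Monitoring notification payload with an `incident.state` field. The `reset_traffic_split` callback is a hypothetical placeholder for the Admin API call, which is not shown here.

```python
import base64
import json
from typing import Callable


def handle_alert_message(event: dict, reset_traffic_split: Callable[[], None]) -> bool:
    """Roll back when an alert incident opens; return True if a rollback was triggered."""
    # Pub/Sub delivers the alert notification as base64-encoded JSON.
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    state = payload.get("incident", {}).get("state")
    if state == "open":
        # Hypothetical callback: would send 100% of traffic back to the stable version.
        reset_traffic_split()
        return True
    # Ignore "closed" notifications so a recovering alert does not trigger a second reset.
    return False
```

Passing the reset action in as a callback keeps the decision logic testable without credentials or network access.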

14. Which observability technique is best suited for investigating why an agent is slow, when you need to distinguish between LLM inference latency and retrieval latency?

Correct. Distributed tracing with Cloud Trace provides span-level visibility — you can see exactly which step (LLM call, retrieval, tool invocation) consumed what time in each conversation turn.

Distributed tracing is the right tool for latency root-cause analysis. Cloud Trace span data shows you exactly which component — LLM inference vs. retrieval vs. tool call — is the bottleneck.

15. When iterating on agent prompts based on user complaint cases, what must you always include in the evaluation set to avoid regressions?

Correct. A balanced evaluation set including both failures you're fixing and healthy baselines you must not break is the safeguard against the classic "fix one, break another" regression pattern.

Always test against a balanced set: the failure cases you're fixing PLUS the successful baseline conversations. Without baselines, you'll fix complaints while unknowingly breaking what was working.
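A balanced check can be as simple as running the new prompt over both groups and reporting them separately. In this sketch, `find_regressions` and the `passes` callable are hypothetical. In practice `passes` would wrap whatever scoring method the team already uses, such as a judge model or human review.

```python
from typing import Callable, Dict, List


def find_regressions(
    passes: Callable[[str], bool],
    failures: List[str],
    baselines: List[str],
) -> Dict[str, List[str]]:
    """Evaluate a prompt change against complaint cases and healthy baselines."""
    return {
        # Complaint cases the new prompt has not fixed yet.
        "still_failing": [case for case in failures if not passes(case)],
        # Previously healthy cases the new prompt has broken.
        "regressed": [case for case in baselines if not passes(case)],
    }
```

A prompt change is ready to ship only when both lists come back empty. An empty `still_failing` list alone says nothing about what the change may have broken.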