Lesson 1 · Module 2

Google Cloud Authentication for Vertex AI

Before your agent can think, it must be trusted — understanding ADC, service accounts, and credential scopes.

Why does authentication fail silently, and how do you catch it before production?

When Google launched Vertex AI's Generative AI Studio in early 2023, dozens of enterprise pilots immediately ran into the same wall: their perfectly valid API keys were rejected. The problem was not the keys. It was that Vertex AI, unlike older Google APIs, requires OAuth 2.0 scopes tied to a Google Cloud project — not standalone API keys. Teams that had migrated from PaLM API's preview period had to rebuild their credential pipelines from scratch.

Application Default Credentials (ADC)

The Vertex AI SDK uses Application Default Credentials (ADC), a credential resolution chain that checks multiple locations in a fixed order. When your code calls vertexai.init(), the SDK does not ask you for a key. It asks the environment.

The ADC chain, in order: (1) The environment variable GOOGLE_APPLICATION_CREDENTIALS pointing to a service account JSON file. (2) The well-known file at ~/.config/gcloud/application_default_credentials.json created by gcloud auth application-default login. (3) Google Cloud metadata server — only available inside Compute Engine, Cloud Run, GKE, or Cloud Functions. (4) Failure with a clear error message.
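
The lookup order above can be mirrored in a small diagnostic helper. This is only an illustrative sketch: the real resolution is performed by the google-auth library, and describe_adc_source, its parameters, and its return strings are hypothetical names invented for this example. It also assumes the Linux/macOS location of the well-known file.

```python
import os
from pathlib import Path


def describe_adc_source(env=None, home=None, on_google_cloud=False):
    """Report which ADC source would likely be consulted first (illustrative only)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # (1) Explicit key file named by the environment variable
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_path:
        if Path(key_path).is_file():
            return f"env var -> {key_path}"
        return f"env var set, but file is missing: {key_path}"

    # (2) Well-known file written by `gcloud auth application-default login`
    well_known = home / ".config" / "gcloud" / "application_default_credentials.json"
    if well_known.is_file():
        return f"well-known file -> {well_known}"

    # (3) Metadata server; the caller tells us whether we run on Google Cloud
    if on_google_cloud:
        return "metadata server"

    # (4) Nothing found
    return "no credentials found"
```

Calling describe_adc_source() on a developer laptop is a quick way to see whether a stale GOOGLE_APPLICATION_CREDENTIALS value is shadowing the credentials created by gcloud.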

Critical Distinction

gcloud auth login authenticates you for the gcloud CLI. gcloud auth application-default login creates credentials your code can use. These are completely separate credential stores. Forgetting this difference is the #1 authentication mistake in Vertex AI development.

Service Accounts for Production

For production agents, you never use user credentials. You create a service account — a non-human Google identity your agent assumes. Service accounts need two things: a key file or Workload Identity (how the SDK authenticates to Google), and IAM roles (what the authenticated identity is allowed to do).

The minimum IAM role for Vertex AI inference is roles/aiplatform.user. For a workload that only reads models from Model Registry and never runs inference, the read-only roles/aiplatform.viewer is enough. Never grant roles/owner or roles/editor to a service account running an agent: this violates least privilege and widens the blast radius if the credential ever leaks.

Local Development

Run gcloud auth application-default login
Set project: gcloud config set project YOUR_PROJECT
SDK auto-discovers credentials via ADC chain
Never commit .json key files to git

Production / Cloud Run

Attach service account to Cloud Run service
No key files needed (Workload Identity is the GKE equivalent)
Metadata server handles token refresh automatically
Grant only roles/aiplatform.user to service account

Required API Enablement

Authentication alone is not enough. The Vertex AI API must be explicitly enabled in your Google Cloud project. Even with perfect credentials, a call to aiplatform.googleapis.com against a project where the API is disabled returns a 403 with the message "Vertex AI API has not been used in project [X] before or it is disabled."

Enable it once via: gcloud services enable aiplatform.googleapis.com — or through the Cloud Console under APIs & Services. This is a project-level setting, not per-credential.

# Step 1: Enable the API (one-time per project)
gcloud services enable aiplatform.googleapis.com

# Step 2: Set up ADC for local dev
gcloud auth application-default login
gcloud config set project my-agent-project-id

# Step 3: Verify ADC is working
gcloud auth application-default print-access-token
      

Quota & Billing

Vertex AI Gemini API calls are billed per 1,000 characters of input and output. There is no free tier for Gemini 1.5 Pro in Vertex AI (unlike the Google AI Studio / Gemini API free tier). Ensure billing is enabled on your project before your first call, or you will receive a 403 billing-not-enabled error regardless of authentication state.
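
Because a disabled API, disabled billing, and a missing IAM role all surface as a 403, it can help to triage on the error message. The sketch below is a heuristic, not an official error taxonomy: classify_403 and NEXT_STEP are hypothetical names, and the substrings checked for the billing and permission cases are assumptions about typical message wording. Only the API-disabled phrase comes from the message quoted earlier in this lesson.

```python
def classify_403(message):
    """Map a 403 error message to its most likely cause (heuristic substring checks)."""
    text = message.lower()
    # Phrase taken from the API-not-enabled message quoted in this lesson
    if "has not been used in project" in text or "it is disabled" in text:
        return "api_disabled"
    # Assumed wording: billing errors mention the word "billing"
    if "billing" in text:
        return "billing_disabled"
    # Assumed wording: IAM errors mention a denied permission
    if "permission" in text and "denied" in text:
        return "missing_iam_role"
    return "unknown"


# Suggested next step for each cause, using commands and roles named in this lesson
NEXT_STEP = {
    "api_disabled": "gcloud services enable aiplatform.googleapis.com",
    "billing_disabled": "Enable billing on the project in the Cloud Console",
    "missing_iam_role": "Grant roles/aiplatform.user to the calling identity",
    "unknown": "Verify ADC with: gcloud auth application-default print-access-token",
}
```

A typical use is print(NEXT_STEP[classify_403(str(error))]) inside the except block around your first model call.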

Key Terms

ADC: Application Default Credentials, the SDK's automatic credential resolution chain across environment variable, well-known file, and metadata server.

Service Account: A non-human Google Cloud identity with its own email, roles, and optional key files, used by production workloads.

Workload Identity: A GKE mechanism that binds a Kubernetes service account to a Google service account, eliminating the need for key files in production. Cloud Run achieves the same result by attaching a service account directly to the service.

roles/aiplatform.user: The minimum IAM role granting permission to invoke Vertex AI prediction endpoints, including Gemini models.

Quiz — Authentication & ADC

Lesson 1 · 4 questions · Select the best answer

1. What does gcloud auth application-default login create that gcloud auth login does NOT?

Correct. gcloud auth application-default login writes credentials to ~/.config/gcloud/application_default_credentials.json — the well-known file that SDKs check in the ADC resolution chain. The regular gcloud auth login only authenticates the CLI.

Not quite. These are two separate credential stores. gcloud auth login authenticates the CLI; gcloud auth application-default login creates credentials readable by SDK code via ADC.

2. What is the FIRST location the ADC chain checks for credentials?

Correct. The ADC chain first checks the environment variable GOOGLE_APPLICATION_CREDENTIALS. If it points to a valid service account JSON, the SDK uses that — no further checks needed.

The environment variable GOOGLE_APPLICATION_CREDENTIALS is checked first. The metadata server is last (only available on Google Cloud infrastructure), and the well-known file is second.

3. Which IAM role is the minimum required to call Gemini models on Vertex AI?

Correct. roles/aiplatform.user grants permission to invoke Vertex AI prediction endpoints. It is the least-privilege role for running inference — always prefer it over broader roles like editor.

The minimum is roles/aiplatform.user. Broader roles like editor violate least-privilege; roles/viewer is read-only and does not permit inference calls.

4. A Cloud Run service returns 403 despite having a valid service account with aiplatform.user. What is the most likely cause?

Correct. Even with valid credentials and correct IAM roles, if aiplatform.googleapis.com has not been enabled in the project, all calls return 403. It must be enabled once per project via gcloud services enable aiplatform.googleapis.com.

Cloud Run does support keyless authentication: an attached service account receives its tokens from the metadata server, so no key file is involved. The most common cause of this exact scenario is the Vertex AI API not being enabled in the project.

Lab 1 — Authentication Troubleshooting

Practice diagnosing Vertex AI authentication failures with the AI assistant

Scenario: Your First 403

You have just set up a new Google Cloud project and written your first Vertex AI SDK call. You run it and get a 403. Your task is to work through the authentication checklist with the assistant — identifying what checks to perform and in what order.

Ask the assistant about your specific error, what the ADC chain checks, how to verify your setup, or how to configure service accounts. Complete at least 3 exchanges to finish this lab.

Try asking: "I ran gcloud auth login but my SDK code still gets a 403 — what am I missing?" or "How do I confirm my ADC credentials are actually being picked up?"

Lesson 2 · Module 2

Installing the Vertex AI SDK and Project Initialization

Setting up your Python environment, initialising the SDK correctly, and understanding what vertexai.init() actually does.

What happens behind the scenes when vertexai.init() is called, and which parameters are truly required?

In Google's own internal migration from the Bard API to Vertex AI Gemini during Q1 2024, the primary source of developer confusion was SDK initialization order. Teams were calling GenerativeModel() before vertexai.init(), resulting in models that connected to the wrong project. The fix required adding strict init-before-model guards in their internal wrapper library — a pattern now documented in the official Vertex AI Python SDK best practices guide.

Installing the SDK

The Vertex AI Python SDK is distributed as google-cloud-aiplatform. The Gemini-specific GenerativeModel interface requires version 1.38.0 or later (released November 2023). Many tutorials install google-generativeai instead. That is a completely different package: it is the client for the Google AI Studio / Gemini API, not for Vertex AI.

# Install — use a virtual environment
pip install "google-cloud-aiplatform>=1.38.0"

# Verify installation
python -c "import vertexai; print(vertexai.__version__)"

# Common mistake: this is NOT the Vertex AI SDK
# pip install google-generativeai  ← for Google AI Studio only
      

What vertexai.init() Does

The vertexai.init() call does three things: it resolves and caches credentials via the ADC chain, sets the default project and location for all subsequent SDK calls in the process, and records the project you pass as a string. Both a project ID and a project number work, but a project ID is preferred for readability.

It does not make a network call. Credentials are not verified until the first actual model call. This means a misconfigured vertexai.init() will not fail immediately — it fails at inference time, which is why testing your auth before writing business logic matters.
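
One way to surface credential problems before any model call is to force a token refresh at startup. The sketch below uses google.auth.default() and Credentials.refresh() from the google-auth library. The verify_adc name and its two injection parameters are hypothetical and exist only so the function can be exercised without real credentials.

```python
CLOUD_PLATFORM_SCOPE = "https://www.googleapis.com/auth/cloud-platform"


def verify_adc(load_default=None, new_request=None):
    """Return (ok, detail). Forces a token refresh so bad credentials fail now."""
    if load_default is None:
        import google.auth  # imported lazily so this module loads without google-auth
        load_default = google.auth.default
    if new_request is None:
        from google.auth.transport.requests import Request
        new_request = Request

    try:
        # Resolve credentials through the ADC chain
        credentials, project_id = load_default(scopes=[CLOUD_PLATFORM_SCOPE])
        # This is the network call that vertexai.init() does not make
        credentials.refresh(new_request())
    except Exception as exc:  # broad on purpose: this is a smoke test
        return False, f"{type(exc).__name__}: {exc}"
    return True, f"token obtained; ADC project = {project_id}"
```

Running ok, detail = verify_adc() once at process start, and exiting if ok is False, moves the failure from the first inference call to boot time.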

import vertexai

# Minimum required: project and location
vertexai.init(
    project="my-agent-project-id",
    location="us-central1"
)

# With explicit credentials (optional — ADC is used if omitted)
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/key.json",
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)

vertexai.init(
    project="my-agent-project-id",
    location="us-central1",
    credentials=credentials
)
      

Choosing a Location

Gemini models on Vertex AI are available in specific regions. As of mid-2024, us-central1 has the broadest model availability and highest quota limits. europe-west4 and asia-southeast1 support Gemini 1.5 Pro and Flash but may have lower default quotas.

The location you set in vertexai.init() must match the location of any resources you reference — Vertex AI Endpoints, Vector Search indexes, and Model Registry entries are all regional. Cross-region calls are not supported.
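
A region mismatch can be caught before any request is sent by comparing the configured location with the region embedded in a full resource name. This is a minimal sketch that assumes resource names follow the projects/PROJECT/locations/REGION/... layout; resource_location and assert_same_region are hypothetical helper names.

```python
import re

# Matches the leading "projects/<project>/locations/<region>/" of a resource name
_LOCATION_RE = re.compile(r"^projects/[^/]+/locations/([^/]+)/")


def resource_location(resource_name):
    """Extract the region from a full resource name, or return None."""
    match = _LOCATION_RE.match(resource_name)
    return match.group(1) if match else None


def assert_same_region(init_location, resource_name):
    """Raise ValueError early if a resource lives outside the SDK's region."""
    found = resource_location(resource_name)
    if found is None:
        raise ValueError(f"Not a full resource name: {resource_name!r}")
    if found != init_location:
        raise ValueError(
            f"Resource is in {found!r} but the SDK is initialised for {init_location!r}"
        )
```

For example, assert_same_region("us-central1", "projects/my-agent-project-id/locations/europe-west4/indexEndpoints/123") raises immediately instead of failing later at query time.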

Required Parameters

project — your GCP project ID string
location — the region (e.g. us-central1)

Optional Parameters

credentials — explicit credential object (ADC used if omitted)
staging_bucket — GCS bucket for training artifacts
experiment — Vertex AI Experiments name

Environment Variable Pattern

Best practice for production: read project and location from environment variables rather than hardcoding. Use os.environ.get("GOOGLE_CLOUD_PROJECT") and os.environ.get("GOOGLE_CLOUD_REGION", "us-central1"). This makes your agent portable across environments without code changes.
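
The environment-variable pattern might look like the following sketch. The variable names come from the paragraph above; load_vertex_config is a hypothetical helper, and failing fast on a missing project is one design choice among several.

```python
import os


def load_vertex_config(env=None):
    """Build the keyword arguments for vertexai.init() from environment variables."""
    env = os.environ if env is None else env

    project = env.get("GOOGLE_CLOUD_PROJECT")
    if not project:
        # Fail fast with a clear message instead of a late 403 or wrong-project call
        raise RuntimeError("GOOGLE_CLOUD_PROJECT is not set; export it or configure it on the service")

    # Region falls back to us-central1 when the variable is absent
    location = env.get("GOOGLE_CLOUD_REGION", "us-central1")
    return {"project": project, "location": location}


# Usage (requires google-cloud-aiplatform):
#   import vertexai
#   vertexai.init(**load_vertex_config())
```

Accepting env as a parameter keeps the function easy to unit test with a plain dictionary.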

Key Terms

google-cloud-aiplatform: The official Python package for the Vertex AI SDK. Version ≥1.38.0 required for Gemini GenerativeModel support.

vertexai.init(): The SDK initialization function that caches credentials and sets the default project/location for all subsequent calls in the process.

location: The Google Cloud region where model inference runs and where regional resources (endpoints, indexes) must reside. Not interchangeable across calls.

Quiz — SDK Installation & Init

Lesson 2 · 4 questions

1. What is the correct pip package name for the Vertex AI Python SDK?

Correct. google-cloud-aiplatform is the Vertex AI SDK. The package google-generativeai is for the separate Google AI Studio / Gemini API — a common source of confusion.

The Vertex AI SDK package is google-cloud-aiplatform. The package google-generativeai is for the Google AI Studio API (separate product, different authentication, different quotas).

2. When does vertexai.init() actually verify that your credentials are valid?

Correct. vertexai.init() does not make a network call. It caches parameters and resolves the credential chain locally. The first actual HTTP request — triggered by a model call like generate_content() — is when credentials are truly verified.

vertexai.init() resolves credentials locally without a network call. Credential validity is only tested when the first real API request is made — typically the first generate_content() call.

3. Which two parameters are REQUIRED in a vertexai.init() call?

Correct. project (your GCP project ID) and location (the region) are the two required parameters. credentials is optional — ADC is used automatically if omitted.

The two required parameters are project and location. Credentials are optional because the SDK uses Application Default Credentials automatically when no explicit credential object is provided.

4. You set location="us-central1" in vertexai.init() but your Vector Search index is in europe-west4. What happens when you try to query it?

Correct. Vertex AI resources are strictly regional. A client initialized for us-central1 cannot access resources in europe-west4. You must either move the resource or reinitialize the SDK for the correct region.

Vertex AI does not support cross-region resource access. All resources — endpoints, indexes, model registry entries — must be in the same region as your SDK initialization. Cross-region calls fail.

Lab 2 — SDK Setup & Init Patterns

Practice writing correct vertexai.init() configurations with the AI assistant

Scenario: Configuring for Multiple Environments

Your team needs to deploy the same agent to three environments: local dev, staging (Cloud Run, europe-west4), and production (Cloud Run, us-central1). You need to write a single init pattern that works across all three without hardcoded values.

Work with the assistant to design an environment-aware initialization pattern. Ask about reading config from environment variables, handling missing variables gracefully, or how to structure your code for testability.

Try asking: "How should I structure vertexai.init() so the same code works locally and in Cloud Run?" or "What's the best way to handle a missing GOOGLE_CLOUD_PROJECT environment variable?"

Lesson 3 · Module 2

Making Your First Gemini API Call

GenerativeModel, generate_content(), and the anatomy of a Vertex AI Gemini response object.

What does a raw Gemini API response look like, and which fields matter for building a reliable agent?

At Google I/O 2024, Google demonstrated a live coding session building a Gemini 1.5 Pro agent in under 10 minutes using the Vertex AI SDK. The presenter emphasized one counter-intuitive point: the response.text shortcut property raises an exception if the model returns multiple candidates or if generation is blocked by safety filters — a gotcha that had already caused production failures in several Google Cloud customer deployments. The correct production pattern, they showed, always accesses response.candidates[0].content.parts[0].text with explicit safety check logic.

The GenerativeModel Class

After vertexai.init(), you create a model instance with GenerativeModel(model_name). The model name string format for Vertex AI differs from Google AI Studio: you use publisher model names like "gemini-1.5-pro-002" or "gemini-1.5-flash-002" — not the full resource path.

The GenerativeModel object is stateless and thread-safe. You can instantiate it once at module load time and reuse it across many requests. Creating it is free — no network call is made until generate_content() is called.

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")

# Instantiate once — stateless, reusable, thread-safe
model = GenerativeModel("gemini-1.5-flash-002")

# Simplest possible call
response = model.generate_content("What is the capital of France?")
print(response.text)  # ← convenient but can raise ValueError
      

The Response Object

The response from generate_content() is a GenerateContentResponse object. Understanding its structure prevents production failures:

# Safe production pattern for accessing response text
response = model.generate_content("Summarise the risks of AI agents.")

# A blocked prompt can come back with no candidates at all
if not response.candidates:
    print("No candidates returned; the prompt may have been blocked")
else:
    # Check finish reason before accessing text
    candidate = response.candidates[0]
    reason = candidate.finish_reason.name

    if reason == "STOP":
        text = candidate.content.parts[0].text
        print(text)
    elif reason == "SAFETY":
        print("Blocked by safety filter")
        print(candidate.safety_ratings)
    elif reason == "MAX_TOKENS":
        print("Output truncated; increase max_output_tokens")
    else:
        print(f"Unexpected finish reason: {reason}")
      

Finish Reason Values

The finish_reason field tells you why the model stopped generating. This is critical for agent reliability:

Normal Finish Reasons

STOP — model finished naturally (expected)
MAX_TOKENS — hit token limit, increase max_output_tokens

Error Finish Reasons

SAFETY — blocked by safety filter
RECITATION — blocked for potential copyright
OTHER — generic failure, check safety_ratings
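
The finish reasons above can be folded into a single helper that never raises on a blocked or empty response. This is a sketch of one possible design, not the SDK's own API: Extraction and extract_text are hypothetical names, and returning a result object rather than raising is a choice made here for illustration. The function relies only on the candidates, finish_reason.name, and content.parts attributes used earlier in this lesson.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    ok: bool      # True only when generation finished with STOP
    text: str     # text of all parts joined together; may be partial or empty
    reason: str   # finish_reason name, or "NO_CANDIDATES"


def extract_text(response):
    """Pull text out of a response without raising on blocked or empty results."""
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return Extraction(False, "", "NO_CANDIDATES")

    candidate = candidates[0]
    reason = candidate.finish_reason.name
    parts = candidate.content.parts if candidate.content else []
    # Parts without text (for example, function calls) contribute nothing
    text = "".join(getattr(part, "text", "") or "" for part in parts)
    return Extraction(reason == "STOP", text, reason)
```

With MAX_TOKENS the caller still receives the truncated text and can decide whether to use it; with SAFETY or NO_CANDIDATES the text is empty and ok is False.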

Generation Configuration

You control model behaviour through a GenerationConfig object. The most important parameters for agents are temperature (randomness, 0.0–2.0), max_output_tokens (maximum response length), and candidate_count (how many responses to generate — almost always 1 for agents).

from vertexai.generative_models import GenerativeModel, GenerationConfig

model = GenerativeModel("gemini-1.5-pro-002")

config = GenerationConfig(
    temperature=0.2,          # low = deterministic, good for agents
    max_output_tokens=1024,   # cap output length
    candidate_count=1,        # always 1 for production agents
    top_p=0.8,               # nucleus sampling
)

response = model.generate_content(
    "Draft a one-paragraph executive summary of Q3 results.",
    generation_config=config
)
      

Token Counting

Before sending large payloads, use model.count_tokens(prompt) to get the exact token count. This prevents hitting quota limits mid-conversation and helps you design effective context window strategies. A token in Gemini is approximately 4 characters of English text.
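
A budget check can combine the rough four-characters-per-token heuristic with the exact count. In this sketch estimate_tokens, exact_tokens, and fits_budget are hypothetical helpers; the only SDK call assumed is model.count_tokens(prompt), whose response exposes total_tokens.

```python
import math

CHARS_PER_TOKEN = 4  # rough heuristic for English text, as noted above


def estimate_tokens(text):
    """Cheap local estimate; no network call."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def exact_tokens(model, prompt):
    """Exact count from the service via model.count_tokens()."""
    return model.count_tokens(prompt).total_tokens


def fits_budget(prompt, max_input_tokens, model=None):
    """Use the exact count when a model is supplied, otherwise the local estimate."""
    if model is not None:
        tokens = exact_tokens(model, prompt)
    else:
        tokens = estimate_tokens(prompt)
    return tokens <= max_input_tokens
```

The local estimate is useful for quick pre-filtering, while the exact count is the safer choice close to a limit because the heuristic can be off for code or non-English text.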

Key Terms

GenerativeModel: The Vertex AI SDK class representing a Gemini model. Stateless and reusable. Created with a model name string like "gemini-1.5-flash-002".

generate_content(): The primary method for sending a prompt to the model and receiving a GenerateContentResponse object.

finish_reason: A field on each response candidate indicating why generation stopped: STOP, MAX_TOKENS, SAFETY, RECITATION, or OTHER.

GenerationConfig: A configuration object controlling temperature, max_output_tokens, candidate_count, top_p, and other sampling parameters.

Quiz — First API Call & Response Object

Lesson 3 · 4 questions

1. Why can response.text raise a ValueError in production?

Correct. response.text is a convenience property that raises ValueError when the response contains a safety block or multiple candidates. Production agents should always check candidate.finish_reason before accessing text content.

The issue is that response.text raises ValueError when generation is blocked by safety filters (finish_reason=SAFETY) or when candidate_count > 1. Always check finish_reason first in production code.

2. What finish_reason value indicates the model completed its response normally?

Correct. STOP is the finish_reason value when the model completed generation naturally. Other values — SAFETY, MAX_TOKENS, RECITATION — indicate abnormal termination that requires handling.

The normal completion value is STOP. SAFETY means blocked by content filters, MAX_TOKENS means the output limit was hit, and RECITATION means the model was stopped to avoid reproducing copyrighted content.

3. What GenerationConfig temperature value is most appropriate for a production agent that needs consistent, deterministic responses?

Correct. Low temperature (0.1–0.3) makes the model more deterministic — it consistently picks the highest-probability tokens. This is ideal for agents where consistency and reliability matter more than creative variation.

Low temperature (0.0–0.3) produces more deterministic outputs, which is what production agents need. High temperature increases randomness and variability. Temperature 0.0 IS supported by Vertex AI.

4. What is the correct safe way to access response text from a Vertex AI Gemini response?

Correct. The safe pattern is to access response.candidates[0].content.parts[0].text after verifying finish_reason == "STOP". This handles safety blocks, truncation, and multi-part responses correctly.

The safe pattern requires checking candidate.finish_reason first, then accessing response.candidates[0].content.parts[0].text. Using response.text directly can raise ValueError on safety blocks.

Lab 3 — Response Handling Patterns

Practice building robust Gemini response handlers with the AI assistant

Scenario: Production-Safe Response Extraction

Your agent is going to production next week. The team has experienced two incidents where response.text raised ValueError in staging when prompts triggered safety filters. You need to write a robust extract_text(response) helper function that handles all finish_reason cases gracefully.

Work with the assistant to design this function. Ask about specific edge cases — what to return when safety blocks occur, how to handle MAX_TOKENS truncation, whether to raise exceptions or return sentinel values.

Try asking: "What should my extract_text function return when finish_reason is SAFETY?" or "How do I handle the case where candidates list is empty?"

Lesson 4 · Module 2

System Instructions, Safety Settings, and Streaming

Giving your agent a persona, controlling content filters, and returning responses incrementally for better UX.

How do system instructions differ from user prompts, and when does streaming change your error-handling logic?

In 2024, Coda.io integrated Vertex AI Gemini into their AI Assistant product, processing millions of document queries monthly. Their engineering team published a post-mortem noting that their initial implementation passed the system prompt as the first user message, a pattern carried over from OpenAI's API. Switching to Vertex AI's native system_instruction parameter measurably reduced the model's tendency to follow contradictory instructions in user messages, which made prompt injection attempts less effective. The team attributed this to Gemini handling native system instructions differently from user-turn content.

System Instructions

System instructions define your agent's persona, constraints, and operating context. In Vertex AI, they are set at model instantiation using the system_instruction parameter — not as the first message in the conversation. This matters: Gemini models treat native system instructions differently from user-turn content at the attention level.

A well-written system instruction specifies who the agent is, what it can and cannot do, and how it should format its responses. Keep system instructions under 1,000 tokens — lengthy system prompts can crowd out actual context from the conversation history.

from vertexai.generative_models import GenerativeModel

model = GenerativeModel(
    "gemini-1.5-pro-002",
    system_instruction="""You are a financial compliance assistant for Acme Corp.
You answer questions about internal policies and regulations.
You do NOT provide investment advice.
You do NOT access external URLs or systems.
Always cite the policy document section when answering."""
)

response = model.generate_content(
    "What is our policy on gifts from vendors?"
)
      

Safety Settings

Gemini models have four built-in safety categories: HARM_CATEGORY_HARASSMENT, HARM_CATEGORY_HATE_SPEECH, HARM_CATEGORY_SEXUALLY_EXPLICIT, and HARM_CATEGORY_DANGEROUS_CONTENT. Each has a threshold you can configure. The default is BLOCK_MEDIUM_AND_ABOVE — this blocks content rated medium or higher probability of harm.

For most enterprise agents, the defaults are appropriate. For specific use cases, such as a medical information agent discussing dangerous drug interactions or a security research tool, you can relax individual thresholds (for example to BLOCK_ONLY_HIGH). Disabling a filter entirely with BLOCK_NONE may require additional approval or an enterprise agreement with Google.

from vertexai.generative_models import (
    GenerativeModel, SafetySetting, HarmCategory, HarmBlockThreshold
)

safety_settings = [
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=HarmBlockThreshold.BLOCK_ONLY_HIGH
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold=HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
    ),
]

response = model.generate_content(
    prompt,
    safety_settings=safety_settings
)
      

Streaming Responses

For user-facing agents, streaming dramatically improves perceived responsiveness. Instead of waiting for the full response, your UI can display text as it arrives. The Vertex AI SDK provides generate_content(stream=True) which returns a generator of GenerateContentResponse chunks.

Critical difference with streaming: finish_reason is only set on the last chunk. If you are checking finish_reason for safety blocks, you must check it on the final chunk, not on intermediate chunks, which carry an unset placeholder value (None or FINISH_REASON_UNSPECIFIED, depending on the SDK version) rather than the real outcome.

# Streaming with safe finish_reason handling
# Assumes `model` was created as in the earlier examples
from vertexai.generative_models import GenerationConfig

stream = model.generate_content(
    "Write a detailed analysis of Q3 revenue trends.",
    generation_config=GenerationConfig(max_output_tokens=2048),
    stream=True
)

full_text = []
last_chunk = None

for chunk in stream:
    if chunk.candidates:
        # A chunk can arrive with no parts (for example, a blocked final chunk)
        for part in chunk.candidates[0].content.parts:
            if hasattr(part, 'text'):
                full_text.append(part.text)
                print(part.text, end="", flush=True)
    last_chunk = chunk

# Check finish reason on LAST chunk only
if last_chunk and last_chunk.candidates:
    reason = last_chunk.candidates[0].finish_reason.name
    if reason != "STOP":
        print(f"\nWarning: generation ended with {reason}")
      

Streaming & Error Handling

When using streaming, API errors (network failures, quota exceeded) manifest as exceptions raised during iteration of the stream generator — not as error responses. Always wrap streaming loops in try/except blocks and implement backoff-retry logic for google.api_core.exceptions.ResourceExhausted (quota exceeded) errors.
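
The backoff-retry logic could be sketched as a small generic wrapper. call_with_backoff and its parameters are hypothetical names, and the delay values are illustrative defaults rather than recommended settings. The exception class to retry on is passed in, so in real code it would be google.api_core.exceptions.ResourceExhausted.

```python
import random
import time


def call_with_backoff(fn, retry_on, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff plus jitter on retry_on errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            # 1s, 2s, 4s, ... capped at max_delay, plus up to 50% random jitter
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))


# Usage sketch (requires google-cloud-aiplatform; `model` and `prompt` assumed to exist):
#   from google.api_core.exceptions import ResourceExhausted
#
#   def run_stream():
#       chunks = model.generate_content(prompt, stream=True)
#       return list(chunks)  # iteration is where streaming errors surface
#
#   chunks = call_with_backoff(run_stream, retry_on=ResourceExhausted)
```

Because streaming errors are raised during iteration, the retried function has to both start and consume the stream. If partial text has already been shown to the user, a retry will repeat it, so buffering or clearing the UI before retrying is worth considering.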

Key Terms

system_instruction: A GenerativeModel constructor parameter that sets the agent's persona and constraints. Receives different attention treatment than user-turn content.

SafetySetting: A per-category configuration object controlling the threshold at which Gemini blocks responses: BLOCK_NONE, BLOCK_ONLY_HIGH, BLOCK_MEDIUM_AND_ABOVE, BLOCK_LOW_AND_ABOVE.

Streaming (stream=True): A generate_content() mode that returns a generator yielding response chunks as they arrive. finish_reason is only reliable on the final chunk.

ResourceExhausted: The google.api_core exception raised when your project's Vertex AI quota is exceeded. Requires exponential backoff retry logic.

Quiz — System Instructions, Safety & Streaming

Lesson 4 · 3 questions

1. Where should system instructions be set in the Vertex AI SDK?

Correct. System instructions belong in the system_instruction parameter of GenerativeModel(). Using the first user message as a workaround (common from OpenAI patterns) does not receive the same attention-layer treatment and is more vulnerable to prompt injection.

The correct place is the system_instruction parameter in the GenerativeModel constructor. Vertex AI's Gemini models give native system instructions a different weight in their attention mechanism than user-turn content.

2. When streaming a Gemini response, on which chunk is the finish_reason reliably set?

Correct. In streaming mode, intermediate chunks show None or incomplete finish_reason values. Only the last chunk in the stream carries the definitive finish_reason. Always track the last chunk when doing safety checks in streaming scenarios.

Only the final chunk in the stream carries a reliable finish_reason. Intermediate chunks may show None. You must iterate through the full stream and check finish_reason on the last received chunk.

3. What exception type does Vertex AI raise when your project's API quota is exceeded?

Correct. google.api_core.exceptions.ResourceExhausted is raised when quota is exceeded. Implement exponential backoff retry logic catching this specific exception. Other google.api_core exceptions to handle include ServiceUnavailable and DeadlineExceeded.

The correct exception is google.api_core.exceptions.ResourceExhausted. This maps to HTTP 429. Vertex AI uses the google.api_core exception hierarchy — not custom vertexai exceptions — for transport-level errors.

Lab 4 — System Instructions & Streaming

Practice designing system instructions and streaming handlers with the AI assistant

Scenario: Building a Compliance Agent with Streaming

You are building a compliance Q&A agent for an insurance company. It must answer questions about internal policy documents, refuse to give legal advice, always cite sources, and stream responses back to the UI. You need to design both the system instruction and the streaming handler.

Work with the assistant on crafting an effective system instruction and writing the streaming loop with proper safety checks. Ask about what to include in the system instruction, how to test its effectiveness, or how to handle a safety block mid-stream.

Try asking: "What elements should my compliance agent's system instruction include?" or "How do I handle a SAFETY finish_reason when I'm streaming to a web UI?"

Module 2 — Final Test

15 questions · Score 80% or above to pass · Authentication, SDK, API Calls, Streaming

1. What is the SECOND location checked in the ADC credential resolution chain?

Correct. The ADC chain: (1) GOOGLE_APPLICATION_CREDENTIALS env var, (2) well-known file from gcloud auth application-default login, (3) metadata server on Google Cloud infrastructure.

The ADC order is: (1) GOOGLE_APPLICATION_CREDENTIALS, (2) well-known file at ~/.config/gcloud/application_default_credentials.json, (3) metadata server. The second is the well-known file.

2. Which gcloud command creates credentials that the Vertex AI SDK can use via ADC?

Correct. Only gcloud auth application-default login creates credentials in the well-known file location that SDKs use. Regular gcloud auth login only authenticates the CLI.

gcloud auth application-default login is the command. Regular login only works for the CLI, not SDK code.

3. A new Google Cloud project returns 403 on all Vertex AI calls despite valid credentials and correct IAM roles. What step was skipped?

Correct. The Vertex AI API must be explicitly enabled per project: gcloud services enable aiplatform.googleapis.com. Without this, all calls return 403 regardless of credential and IAM state.

The Vertex AI API must be enabled for each project. Run gcloud services enable aiplatform.googleapis.com once per project.

4. For a production Cloud Run agent, what is the recommended credential approach?

Correct. Attaching a service account to Cloud Run lets the metadata server handle credential provision and refresh automatically, so no key files are needed. Paired with a narrowly scoped IAM role on that service account, this is the most secure pattern.

Production Cloud Run services should attach a service account and rely on the metadata server. Never embed key files in images, and never copy the user credentials created by gcloud auth application-default login into production containers.

5. What is the pip package name for the Vertex AI Python SDK?

Correct. google-cloud-aiplatform is the Vertex AI SDK. Version ≥1.38.0 is needed for the Gemini GenerativeModel API.

The Vertex AI SDK is google-cloud-aiplatform. The package google-generativeai is for the separate Google AI Studio API.

6. What does vertexai.init() NOT do?

Correct. vertexai.init() does NOT make a network call. Credentials are resolved locally from the ADC chain but not verified until the first actual inference call.

vertexai.init() does NOT verify credentials with a network call. This is a common misconception — credential verification only happens at the first actual API request.

7. Which model name format is correct for Vertex AI SDK?

Correct. Vertex AI uses publisher model names like gemini-1.5-pro-002 without the models/ prefix (that prefix is used by the Google AI Studio API).

Vertex AI model names do not include prefixes like models/. Use the bare publisher model name: gemini-1.5-pro-002 or gemini-1.5-flash-002.

8. response.text raises ValueError. Which finish_reason most likely caused this?

Correct. A SAFETY finish_reason means the model returned no text content — it was blocked. Accessing response.text in this state raises ValueError because there is no text to return.

SAFETY is the most common cause. When the model is blocked, there is no text content, so response.text raises ValueError. Always check finish_reason first.

9. What is the safe path for accessing text from a Gemini response candidate?

Correct. The full path is response.candidates[0].content.parts[0].text. This reflects the Vertex AI response hierarchy: candidates → content → parts (a list because responses can have text + function call parts).

The correct path is response.candidates[0].content.parts[0].text. The choices[0].message.content pattern is OpenAI API syntax — not applicable to Vertex AI.
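Questions 8 and 9 combine into one defensive pattern: check finish_reason, then walk candidates, content, and parts without touching response.text. The sketch below is duck-typed so it runs without the SDK. extract_text is a hypothetical helper name, and the SimpleNamespace objects are stand-ins that only mimic the response shape described above.

```python
from types import SimpleNamespace


def extract_text(response):
    """Return (text, finish_reason_name) without touching response.text.

    `text` is None when the first candidate carries no text parts,
    for example after a SAFETY block.
    """
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return None, None
    candidate = candidates[0]
    reason = getattr(candidate, "finish_reason", None)
    # SDK finish reasons are enums with a .name; plain strings also work here.
    reason_name = getattr(reason, "name", reason)
    content = getattr(candidate, "content", None)
    parts = getattr(content, "parts", None) or []
    texts = []
    for part in parts:
        # Parts without text (e.g. function calls) are skipped, not fatal.
        text = getattr(part, "text", None)
        if text:
            texts.append(text)
    return ("".join(texts) if texts else None), reason_name


# Stand-in objects that mimic the candidates -> content -> parts hierarchy.
blocked = SimpleNamespace(candidates=[SimpleNamespace(
    finish_reason="SAFETY", content=SimpleNamespace(parts=[]))])
ok = SimpleNamespace(candidates=[SimpleNamespace(
    finish_reason="STOP",
    content=SimpleNamespace(parts=[SimpleNamespace(text="Hello")]))])

print(extract_text(blocked))  # (None, 'SAFETY')
print(extract_text(ok))       # ('Hello', 'STOP')
```

Returning the finish reason alongside the text lets the caller decide how to present a block to the user instead of catching ValueError after the fact.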

10. What temperature setting makes a production agent MOST deterministic?

Correct. Of the options listed, temperature 0.1 produces the most deterministic, consistent outputs. Lower temperature means the model strongly favours the highest-probability next token, reducing variability across calls.

Lower temperature means more deterministic outputs. Temperature 0.1 is the best choice here. Temperature 1.0+ increases randomness significantly.

11. Where in the GenerativeModel call chain do system instructions belong?

Correct. System instructions are set in the GenerativeModel() constructor via the system_instruction parameter. This ensures the model treats them as system-level guidance, separate from user-turn content.

Use the system_instruction parameter in GenerativeModel(). Placing it in the chat history as a user message is a less effective workaround that doesn't receive the same model treatment.
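As a minimal sketch of the constructor placement, assuming google-cloud-aiplatform 1.38.0 or later is installed: build_compliance_model, the project ID you pass in, the default region, and the instruction text are all illustrative placeholders. The SDK import is deferred into the function so the module itself loads anywhere.

```python
# Illustrative instruction text; a real agent needs its own wording.
SYSTEM_INSTRUCTION = (
    "You are a compliance review assistant. Answer only from the supplied "
    "policy text and say so when the policy does not cover the question."
)


def build_compliance_model(project_id, location="us-central1"):
    """Create a GenerativeModel with the system instruction set at construction.

    project_id and location are placeholders supplied by the caller.
    """
    # Imported lazily so this module loads even where the SDK is absent.
    import vertexai
    from vertexai.generative_models import GenerationConfig, GenerativeModel

    # Local configuration only; credentials are checked on the first request.
    vertexai.init(project=project_id, location=location)
    return GenerativeModel(
        "gemini-1.5-pro-002",  # bare publisher model name, no models/ prefix
        system_instruction=SYSTEM_INSTRUCTION,
        generation_config=GenerationConfig(temperature=0.1),
    )
```

Calling build_compliance_model("your-project-id") would return a model object whose every request carries the instruction, with no need to prepend it to the chat history.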

12. Which safety threshold allows content rated MEDIUM harm probability to pass through?

Correct. BLOCK_ONLY_HIGH only blocks content rated HIGH probability of harm — MEDIUM and LOW pass through. BLOCK_MEDIUM_AND_ABOVE (the default) blocks both MEDIUM and HIGH.

BLOCK_ONLY_HIGH allows MEDIUM-rated content through. BLOCK_MEDIUM_AND_ABOVE is the default that blocks MEDIUM. BLOCK_LOW_AND_ABOVE is the most restrictive.
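The threshold semantics can be modelled as a simple ordering check. This sketch uses plain strings instead of the SDK enums, is_blocked is a hypothetical helper, and BLOCK_NONE and the NEGLIGIBLE level are included as assumptions beyond the three thresholds named above.

```python
# Harm probability levels, lowest to highest.
PROBABILITY_ORDER = ["NEGLIGIBLE", "LOW", "MEDIUM", "HIGH"]

# Lowest probability level each threshold blocks; None blocks nothing.
BLOCKS_FROM = {
    "BLOCK_LOW_AND_ABOVE": "LOW",
    "BLOCK_MEDIUM_AND_ABOVE": "MEDIUM",
    "BLOCK_ONLY_HIGH": "HIGH",
    "BLOCK_NONE": None,
}


def is_blocked(threshold, probability):
    """Return True when content at `probability` is blocked by `threshold`."""
    floor = BLOCKS_FROM[threshold]
    if floor is None:
        return False
    return PROBABILITY_ORDER.index(probability) >= PROBABILITY_ORDER.index(floor)


# Show which probability levels pass through each threshold.
for threshold in BLOCKS_FROM:
    passed = [p for p in PROBABILITY_ORDER if not is_blocked(threshold, p)]
    print(threshold, "passes", passed)
```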

13. When streaming a Gemini response, where do API quota errors manifest?

Correct. In streaming mode, transport-level errors like quota exhaustion raise exceptions during the for-loop iteration. Always wrap streaming loops in try/except catching google.api_core.exceptions.ResourceExhausted.

Streaming errors raise exceptions during generator iteration, not as special response chunks. You need try/except around your streaming loop to catch google.api_core.exceptions.ResourceExhausted.
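A streaming handler along these lines might look like the sketch below. It is duck-typed so it runs without the SDK: stream_text and FakeQuotaError are hypothetical names, and in real code you would pass google.api_core.exceptions.ResourceExhausted as quota_error and the iterator returned by a streaming generate call as chunks.

```python
from types import SimpleNamespace


def stream_text(chunks, on_text, quota_error):
    """Forward streamed text to on_text; return 'ok', 'blocked' or 'quota'."""
    try:
        for chunk in chunks:  # transport errors surface here, mid-iteration
            candidates = getattr(chunk, "candidates", None) or []
            if not candidates:
                continue
            candidate = candidates[0]
            reason = getattr(candidate, "finish_reason", None)
            if getattr(reason, "name", reason) == "SAFETY":
                # Stop forwarding; the caller decides what the UI shows.
                return "blocked"
            content = getattr(candidate, "content", None)
            for part in getattr(content, "parts", None) or []:
                text = getattr(part, "text", None)
                if text:
                    on_text(text)
    except quota_error:
        return "quota"
    return "ok"


class FakeQuotaError(Exception):
    """Stand-in for google.api_core.exceptions.ResourceExhausted."""


def fake_stream():
    # One good chunk, then a simulated quota failure mid-stream.
    yield SimpleNamespace(candidates=[SimpleNamespace(
        finish_reason=None,
        content=SimpleNamespace(parts=[SimpleNamespace(text="Hi")]))])
    raise FakeQuotaError("quota exhausted")


received = []
status = stream_text(fake_stream(), received.append, FakeQuotaError)
print(status, received)  # quota ['Hi']
```

Returning a status string keeps the decision about what to show in a web UI (a retry notice, a safety message) outside the loop itself.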

14. What exception signals Vertex AI quota exhaustion?

Correct. google.api_core.exceptions.ResourceExhausted is the standard exception for quota exhaustion across all Google Cloud SDKs. Implement exponential backoff when catching this.

The correct exception is google.api_core.exceptions.ResourceExhausted. Vertex AI uses the google.api_core exception hierarchy, not custom SDK-specific exceptions for transport errors.
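The exponential backoff mentioned above can be sketched in a few lines of standard-library Python. call_with_backoff and its parameters are hypothetical. The exception type is passed in so the example runs without the SDK, and in real code retry_on would be google.api_core.exceptions.ResourceExhausted.

```python
import random
import time


def call_with_backoff(fn, retry_on, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep):
    """Call fn(), retrying on `retry_on` with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Delay doubles each attempt (1x, 2x, 4x ...), plus random jitter
            # so that many clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

A call such as call_with_backoff(lambda: model.generate_content(prompt), ResourceExhausted) would then retry only on quota errors and surface everything else immediately. Here model and prompt stand for objects defined elsewhere in your agent.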

15. The minimum IAM role for invoking Gemini models on Vertex AI is:

Correct. roles/aiplatform.user is the least-privilege role that permits Vertex AI inference. Always use the minimum required role — never grant owner or editor to a service account running an agent.

The minimum required role is roles/aiplatform.user. Granting broader roles like owner or editor to an agent's service account violates least-privilege security principles.