Module 5 · Lesson 1

Session Architecture in Vertex AI Agent Builder

How agents maintain continuity across turns — and why stateless infrastructure demands explicit design

What does a Vertex AI agent actually remember between messages?

When Google demonstrated Bard's ability to hold multi-turn conversations during its I/O keynote in May 2023, the underlying infrastructure challenge was invisible to the audience: every cloud inference endpoint is stateless. The demo worked because session state was explicitly reconstructed from storage on each turn — not because the model remembered anything inherently. This design pattern, now formalized in Vertex AI Agent Builder's Session API, is the foundation of every production agent.

The Stateless Compute Problem

Cloud LLM inference endpoints treat each API call as independent. The model has no persistent memory. When a user sends their fifth message in a conversation, the agent framework must reconstruct the entire relevant context from storage and inject it into the new inference call. This is not a limitation — it is an architectural choice that enables horizontal scaling, fault tolerance, and cost control.

Vertex AI Agent Builder's Session API manages this reconstruction. A session is a server-side record identified by a unique session ID. It stores the ordered list of turns (user messages and model responses), any tool call results, and metadata like creation time and user ID. The session persists in Google-managed storage for up to 60 minutes of inactivity by default, configurable up to 24 hours.

Session Turn Lifecycle

User Message → Session API: Load State → Build Context Window → LLM Inference (stateless) → Session API: Persist Turn → Return Response
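The lifecycle above can be sketched in a few lines of plain Python. This is an in-memory illustration only, not the Agent Builder implementation: `SESSION_STORE`, `fake_model`, and `handle_turn` are hypothetical names, and a real deployment would call the Session API and a Gemini endpoint instead.

```python
# Minimal sketch of one stateless turn. All names here are illustrative.
SESSION_STORE: dict[str, list[dict]] = {}  # stands in for server-side session storage


def fake_model(context: list[dict]) -> str:
    """Stand-in for a stateless inference call: it sees only what is passed in."""
    user_turns = sum(1 for turn in context if turn["role"] == "user")
    return f"reply to user message #{user_turns}"


def handle_turn(session_id: str, user_message: str) -> str:
    history = SESSION_STORE.get(session_id, [])                        # 1. load state
    context = history + [{"role": "user", "content": user_message}]    # 2. build context
    reply = fake_model(context)                                        # 3. stateless inference
    SESSION_STORE[session_id] = context + [
        {"role": "model", "content": reply}
    ]                                                                  # 4. persist turn
    return reply                                                       # 5. return response
```

The stand-in model can only count user messages because the history is passed in on every call. With an unknown session ID it would start again from message #1.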

Creating and Managing Sessions

In the Vertex AI SDK for Python, sessions are created via the AgentEngine client. A session ID is returned that must be stored client-side and passed with every subsequent message in that conversation.

from vertexai import agent_engines
import vertexai

vertexai.init(project="my-project", location="us-central1")

# Get reference to deployed agent
agent = agent_engines.get("projects/123/locations/us-central1/reasoningEngines/456")

# Create a new session
session = agent.create_session(user_id="user-abc-123")
session_id = session["id"]  # Store this client-side

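# Send first message. stream_query returns an iterator of events,
# so consume it to receive the response.
for event in agent.stream_query(
    message="What are the refund policies for order #7291?",
    session_id=session_id,
    user_id="user-abc-123",
):
    print(event)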

Session State Components

Each session object returned by the API contains several key components that your application code needs to understand:

Turn History

Ordered list of Content objects representing alternating user and model turns. This is the primary mechanism for multi-turn coherence.

Session Metadata

Fields including create_time, update_time, user_id, and display_name for tracking and auditing.

Tool Call Records

Function call and function response pairs are stored as part of the turn history, enabling the agent to reason about what actions it already took.

Session Events

Queryable log of all events in a session via list_events(), useful for debugging and building session replay features.
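One way to picture how these components relate is a small data model. The sketch below is an assumption-laden illustration, not the API's actual schema: the `Event` and `Session` classes, the `content` dictionary shape, and the `record` and `tool_records` helpers are hypothetical. Only the metadata field names come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    author: str      # for example "user" or "model"
    timestamp: float
    content: dict    # message text, a function call, or a function response


@dataclass
class Session:
    id: str
    user_id: str
    display_name: str = ""
    create_time: float = 0.0
    update_time: float = 0.0
    events: list[Event] = field(default_factory=list)

    def record(self, event: Event) -> None:
        """Append an event in order and bump the session's update time."""
        self.events.append(event)
        self.update_time = event.timestamp

    def tool_records(self) -> list[Event]:
        """Events that carry a function call or a function response."""
        return [
            e for e in self.events
            if "function_call" in e.content or "function_response" in e.content
        ]
```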

Listing and Retrieving Sessions

Production systems need to retrieve sessions for returning users. The Agent Builder API supports filtering by user_id so your application can surface a user's conversation history.

# List all sessions for a user
sessions = agent.list_sessions(user_id="user-abc-123")
for s in sessions:
    print(s["id"], s["create_time"], s["display_name"])

# Retrieve a specific session
session = agent.get_session(session_id="session-xyz-789")

# Inspect events in a session
events = agent.list_events(session_id="session-xyz-789")
for event in events:
    print(event["author"], event["timestamp"])

Architecture Note

The Agent Builder Session API is backed by Spanner under the hood, giving sessions strong consistency and global availability. However, session storage is not a replacement for your application database — treat it as a short-to-medium-term conversational cache, not permanent user data storage.

Session Expiry and Deletion

Sessions automatically expire after a configurable idle timeout. In production, you should also explicitly delete sessions after a conversation is definitively complete to manage costs and comply with data retention policies. Sessions can be deleted with agent.delete_session(session_id=...). Google's documentation notes that the default inactivity timeout is 60 minutes, with an absolute maximum lifetime of 24 hours per session regardless of activity.
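The two limits described above interact: a session can disappear either because it sat idle or because it reached its maximum age. A minimal sketch of that rule follows, using the 60-minute and 24-hour figures from this lesson. The function name and constants are illustrative, and the actual enforcement happens server-side.

```python
from datetime import datetime, timedelta

IDLE_TIMEOUT = timedelta(minutes=60)   # default inactivity timeout described above
MAX_LIFETIME = timedelta(hours=24)     # absolute ceiling described above


def is_expired(created: datetime, last_activity: datetime, now: datetime) -> bool:
    """A session is gone if it sat idle too long or outlived its maximum lifetime."""
    return (now - last_activity) > IDLE_TIMEOUT or (now - created) > MAX_LIFETIME
```

Under these assumptions a user who returns after 45 idle minutes still has a session, while a session that has been continuously active for just over 24 hours does not.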

Key Insight

Session IDs are the conversation "handle." Your frontend or backend must store the session ID after create_session() and attach it to every subsequent API call in that conversation. Losing the session ID means losing the conversation context — the agent will behave as if meeting the user for the first time.
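A backend often wraps this bookkeeping in a small registry that creates a session on the first message and reuses its ID afterwards. The sketch below is one possible shape, not a prescribed design. `SessionRegistry` and its in-memory dictionary are hypothetical, and the `create_session` callable is assumed to wrap something like `agent.create_session(...)`. A production system would persist the mapping rather than hold it in memory.

```python
from typing import Callable


class SessionRegistry:
    """Maps an application-level conversation key to its session ID."""

    def __init__(self, create_session: Callable[[str], str]):
        self._create_session = create_session   # returns a new session ID for a user
        self._ids: dict[tuple[str, str], str] = {}

    def session_for(self, user_id: str, conversation: str) -> str:
        key = (user_id, conversation)
        if key not in self._ids:                 # first message: create and remember
            self._ids[key] = self._create_session(user_id)
        return self._ids[key]                    # later messages: reuse the same ID
```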

session_id

The unique identifier for a conversation instance, returned by create_session() and required on every subsequent turn in that conversation.

user_id

An arbitrary string identifying the human user, used for session listing and access control. Not authenticated by Agent Builder — your application must validate it.

turn

A single exchange in a session: one user message and the corresponding agent response, including any tool calls made during that response.

Lesson 1 Quiz

Session Architecture in Vertex AI Agent Builder

1. Why does a Vertex AI agent require an explicit session API rather than relying on the LLM's built-in memory?

Correct. Inference endpoints are stateless compute; the session API reconstructs context from storage on each turn and injects it into the new inference call.

Not quite. The statelessness is architectural, not a model size limitation. Every LLM inference endpoint in the cloud works this way regardless of model size.

2. What is the default inactivity timeout for a Vertex AI Agent Builder session?

Correct. The default inactivity timeout is 60 minutes, configurable up to 24 hours — which is also the absolute maximum session lifetime.

Not quite. The default is 60 minutes of inactivity, with 24 hours being the configurable maximum and also the absolute session lifetime ceiling.

3. Which component must your frontend or backend application store after calling create_session()?

Correct. The session ID is the conversation handle. Without it, the agent treats every message as a new conversation with no prior context.

Not quite. You only need to store the session ID. The full session state is managed server-side by the Session API and retrieved automatically when you pass the session ID.

4. What does the list_events() method return for a given session?

Correct. list_events() returns the full event log for a session including user turns, model turns, and tool call/response pairs — essential for debugging complex agent behaviors.

Not quite. list_events() returns all events (not just final responses), and list_sessions() is the method for listing sessions by user.

Lab 1: Session Architecture

Explore session creation, retrieval, and lifecycle with an AI tutor

Practice: Designing Session Flows

In this lab, you'll discuss session architecture decisions with an AI tutor. Think through how you'd implement sessions in real production scenarios — a customer support chatbot, a multi-step form assistant, or a coding helper that needs to track files across turns.

Try asking: "If a user closes their browser mid-conversation and returns 45 minutes later, what happens to their session?" or "How should I store the session ID in a React frontend?"

Session Architecture Lab

Welcome to Lab 1. I'm your Vertex AI session architecture tutor. Let's explore how sessions work in production agents. What aspect of session management are you most curious about — lifecycle, storage, client-side handling, or something else?

Module 5 · Lesson 2

Memory Types — In-Context, External, and Semantic

Beyond session history: how production agents store, retrieve, and prioritize information across conversations

When session history isn't enough, what are the architectural options?

When Duolingo launched Duolingo Max in March 2023 — their GPT-4-powered "Explain My Answer" and "Roleplay" features — they faced a direct memory problem. A Spanish learner who had struggled with the subjunctive mood in fifty previous lessons should have that history inform the current lesson. Session history alone couldn't carry fifty conversations. Duolingo's architecture uses a learner profile database that is selectively retrieved and injected into the prompt, a form of what the field calls external memory with semantic retrieval. This pattern is now a standard approach documented in both LangChain and Vertex AI's own agent frameworks.

The Three Memory Layers

Production agents in any framework — including Vertex AI Agent Builder — typically operate across three distinct memory layers. Understanding which layer to use for which information is one of the most consequential design decisions you'll make.

In-Context Memory

The session turn history injected into the current context window. Fast and needs no separate retrieval step, but bounded by context window size (128K–1M tokens for Gemini models) and billed as input tokens on every call. Automatically managed by Agent Builder sessions.

External Memory

Structured data in Cloud Firestore, BigQuery, or Spanner queried via tool calls. Unbounded size, persistent across sessions, but requires explicit retrieval logic in your agent's tool definitions.

Semantic Memory

Vector embeddings in Vertex AI Vector Search or Firestore's native vector index. Enables fuzzy retrieval — finding relevant memories by meaning rather than exact match. Ideal for long-term user preferences and knowledge bases.

In-Context Memory: Session History

The Agent Builder session API automatically manages in-context memory. Every turn you add via stream_query() is appended to the session and reconstructed into the context window on the next call. For most conversations under 50 turns with moderate message length, this is all you need.

The risk is context window saturation. When a session accumulates many turns, you approach context limits — even with Gemini 1.5 Pro's 1M token window. Token costs also scale linearly with context length, so a very long session is significantly more expensive per turn than a new one.

Practical Limit

At roughly 500 tokens per turn average, a Gemini model with a 128K context window saturates after about 256 turns. For a customer support agent averaging 15 turns per conversation, this is rarely a concern. For a coding agent used daily over months, it is a genuine architectural challenge.
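The arithmetic behind that estimate is a single integer division. The helper below is a back-of-the-envelope sketch with a hypothetical name. It ignores system prompts, tool outputs, and the space reserved for the model's response, all of which bring saturation earlier.

```python
def turns_until_saturation(context_window_tokens: int, avg_tokens_per_turn: int) -> int:
    """Number of whole turns that fit before the history alone fills the window."""
    return context_window_tokens // avg_tokens_per_turn


print(turns_until_saturation(128_000, 500))    # 256
print(turns_until_saturation(1_000_000, 500))  # 2000
```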

External Memory: Persistent Storage via Tools

External memory is implemented by giving your agent tools that read and write to a persistent store. When the agent needs to remember something across sessions, it calls a save_preference or update_user_profile tool. When it needs to recall something, it calls get_user_history or load_preferences.

This pattern is explicit and auditable — you can see exactly what the agent stored and retrieved. It requires you to design the schema of what gets remembered. Common external memory stores for Vertex AI agents are Cloud Firestore (document store, low latency), BigQuery (analytics queries over conversation history), and Cloud Spanner (transactional, globally consistent).

# External memory tools: the agent calls these to save and load user preferences
from google.cloud import firestore

def save_user_preference(user_id: str, preference_key: str, preference_value: str) -> str:
    """Save a user preference to persistent storage across sessions."""
    db = firestore.Client()
    doc_ref = db.collection("user_memory").document(user_id)
    # merge=True updates this key without overwriting the user's other preferences
    doc_ref.set({preference_key: preference_value}, merge=True)
    return f"Saved: {preference_key} = {preference_value}"

def load_user_preferences(user_id: str) -> dict:
    """Retrieve all saved preferences for a user."""
    db = firestore.Client()
    doc = db.collection("user_memory").document(user_id).get()
    return doc.to_dict() if doc.exists else {}

Semantic Memory: Vector Search for Relevant Context

Semantic memory uses embeddings to retrieve the most relevant memories rather than all memories. This is critical when a user has a long history — you don't want to inject all 200 past conversations into the context window; you want the 5 most relevant ones to the current question.

Vertex AI Vector Search (formerly Matching Engine) and Firestore's native vector capabilities both support this pattern. You embed key memories (user statements, past resolutions, expressed preferences) using a Vertex AI text embedding model such as text-embedding-004 (or the older textembedding-gecko), store them in the vector index, and retrieve the top-K most relevant at query time.
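The ranking step itself can be illustrated without a vector database. The sketch below uses plain Python and toy two-dimensional vectors in place of real embeddings. The function names are hypothetical, and a managed index would replace the linear scan with approximate nearest-neighbor search.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_memories(query_vec: list[float], memories: list[tuple[str, list[float]]], k: int = 5) -> list[str]:
    """memories holds (text, embedding) pairs. Returns the k most similar texts."""
    ranked = sorted(memories, key=lambda m: cosine_similarity(query_vec, m[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


memories = [
    ("prefers metric units", [1.0, 0.0]),
    ("allergic to peanuts", [0.0, 1.0]),
    ("asked for distances in kilometres", [0.9, 0.1]),
]
print(top_k_memories([1.0, 0.0], memories, k=2))
```

With these toy vectors the two unit-related memories rank above the unrelated one, which is the behavior that retrieval by meaning is intended to provide.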

# Semantic memory retrieval (conceptual)
from google.cloud import firestore
from google.cloud.firestore_v1.base_vector_query import DistanceMeasure
from google.cloud.firestore_v1.vector import Vector
from vertexai.language_models import TextEmbeddingModel

db = firestore.Client()

def retrieve_relevant_memories(user_id: str, current_query: str, top_k: int = 5) -> list:
    model = TextEmbeddingModel.from_pretrained("text-embedding-004")
    query_embedding = model.get_embeddings([current_query])[0].values

    # Query the Firestore vector index for this user's memories.
    # Assumes a vector index already exists on the "embedding" field.
    results = (
        db.collection("semantic_memory")
        .where("user_id", "==", user_id)
        .find_nearest(
            vector_field="embedding",
            query_vector=Vector(query_embedding),
            distance_measure=DistanceMeasure.COSINE,
            limit=top_k,
        )
        .get()
    )
    return [r.to_dict()["content"] for r in results]

Design Principle

Start with in-context session memory. Add external memory when you need cross-session persistence. Add semantic memory only when external memory retrieves too much data to fit in context. Each layer adds architectural complexity — justify it with a real requirement.

Memory Type Decision Matrix

Use in-context (session) memory when: the information is only relevant within the current conversation, the conversation is bounded in length, and you don't need to access it from other sessions or services.

Use external memory when: information must persist across sessions (user preferences, account data), must be accessible to non-agent systems, or must be queryable in structured ways (e.g., "all users who prefer metric units").

Use semantic memory when: you have large historical data per user that can't all fit in context, relevance matters more than recency, and you need to surface similar past situations to inform the current response.
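The matrix above reduces to two questions asked in order. The function below is a deliberately simplified sketch of that decision, with hypothetical parameter names. Real designs often combine layers rather than picking exactly one.

```python
def choose_memory_layer(needed_across_sessions: bool, fits_in_context_when_loaded: bool) -> str:
    """Pick the simplest layer that satisfies the requirement."""
    if not needed_across_sessions:
        return "in-context"      # session history is enough
    if fits_in_context_when_loaded:
        return "external"        # persist it, then load it all via a tool call
    return "semantic"            # too much to load: retrieve only what is relevant


print(choose_memory_layer(False, True))   # in-context
print(choose_memory_layer(True, True))    # external
print(choose_memory_layer(True, False))   # semantic
```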

Lesson 2 Quiz

Memory Types — In-Context, External, and Semantic

1. A customer support agent needs to remember a user's preferred language setting across multiple separate conversations spanning weeks. Which memory type is most appropriate?

Correct. Cross-session persistence of structured preferences is the primary use case for external memory. Firestore or Spanner stores the preference and it's retrieved at session start via a tool call.

Not quite. In-context memory exists only within a single session. Semantic memory is overkill for a simple key-value preference. External memory (Firestore) is the right fit for structured, persistent user preferences.

2. What is the primary risk of allowing a session to accumulate many hundreds of turns without context management?

Correct. Each turn injects the full session history into the context window. As sessions grow, you approach context limits and pay linearly more tokens per inference call — a real cost and reliability concern.

Not quite. The core risk is context window saturation: each call injects all prior history, so very long sessions hit context limits and incur escalating token costs regardless of other factors.

3. Duolingo Max's architecture retrieves only the most relevant past lesson struggles rather than injecting a user's entire history. What memory pattern does this exemplify?

Correct. Semantic (vector) retrieval finds the most relevant memories by meaning similarity, not recency or exact match — ideal when the historical corpus is too large to inject wholesale.

Not quite. The key characteristic described — retrieving the most relevant items by meaning from a large history — is semantic/vector retrieval, not structured SQL queries or session chaining.

4. Which Vertex AI embedding model is referenced in Google's documentation for use with text-based semantic memory retrieval?

Correct. text-embedding-004 is Google's current production text embedding model, accessed via TextEmbeddingModel.from_pretrained("text-embedding-004") in the Vertex AI SDK.

Not quite. text-embedding-004 is the current Vertex AI text embedding model. textembedding-gecko is an older variant. The others listed are not real Vertex AI embedding model identifiers.

Lab 2: Memory Architecture

Design memory strategies for real agent scenarios

Practice: Choosing the Right Memory Layer

Work through memory architecture decisions with an AI tutor. You'll analyze realistic agent scenarios and decide which memory layer — in-context, external, or semantic — best fits each requirement.

Try: "I'm building a coding assistant. Users want it to remember their preferred language and style guide. What's the right memory approach?" or "What are the tradeoffs between Firestore and Vector Search for user memory?"

Memory Architecture Lab

Welcome to Lab 2. I'm ready to help you work through memory architecture decisions for Vertex AI agents. Describe a real or hypothetical agent you're designing, and we'll figure out the right memory strategy together.

Module 5 · Lesson 3

Context Window Management and Summarization

Keeping agents coherent and cost-efficient as conversations grow beyond the context horizon

How do you keep an agent effective when the conversation is longer than the window?

When GitHub released Copilot Workspace in preview in April 2024, one of its defining engineering challenges was managing context across long-running coding sessions. A developer working through a complex refactor might generate thousands of tokens of file content, tool outputs, and planning messages. GitHub's approach — partially described in their engineering blog — involved progressive summarization: older exchanges are condensed into compact summaries that preserve key decisions while older verbatim text is dropped. This is now a documented pattern in the Vertex AI Agent Development Kit (ADK) as a custom context management strategy.

Understanding the Context Window Constraint

The context window is the total amount of text an LLM can process in a single inference call — inputs plus outputs combined. For Gemini 1.5 Pro, this is 1 million tokens. For Gemini 2.0 Flash, it's 1 million tokens as well. These are large windows, but they are not infinite, and every token costs money.

In Agent Builder sessions, the full turn history is reconstructed and injected on each call. A session with 100 turns, each averaging 500 tokens, already consumes 50,000 tokens of context before the user's new message or any tool results are added. At 1,000 turns, you're at 500,000 tokens, which is half of a one-million-token window, with significant cost implications.
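Because the history is re-sent on every call, the total number of input tokens billed over a session grows roughly with the square of its length. The sketch below makes that concrete under a simplifying assumption: every turn is the same size, and the n-th call carries n turns' worth of tokens. The function names are illustrative.

```python
def context_tokens_at_turn(turn_number: int, avg_tokens_per_turn: int = 500) -> int:
    """History size injected once the session holds `turn_number` turns."""
    return turn_number * avg_tokens_per_turn


def cumulative_input_tokens(total_turns: int, avg_tokens_per_turn: int = 500) -> int:
    """Total history tokens sent across the whole session: 500 * (1 + 2 + ... + N)."""
    return avg_tokens_per_turn * total_turns * (total_turns + 1) // 2


print(context_tokens_at_turn(100))     # 50000
print(context_tokens_at_turn(1000))    # 500000
print(cumulative_input_tokens(100))    # 2525000
```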

Rolling Window

Keep only the last N turns in the injected context. Simple to implement, but older context is lost permanently. Works well when recent turns contain all relevant information.

Hierarchical Summarization

Older turns are summarized into a compact narrative. The summary plus recent verbatim turns are injected together. Preserves semantic meaning while reducing token count.

Selective Retrieval

Only turns semantically relevant to the current query are retrieved and injected. Requires a vector index over turn embeddings. Most sophisticated, highest retrieval fidelity.

Hybrid: Fixed Summary + Sliding Window

A permanent summary of the entire session history plus the last K verbatim turns. Balances context quality with token efficiency. Recommended for most production agents.
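Of the strategies above, the rolling window is the simplest to implement. The sketch below is a token-budgeted variant, which keeps as many of the newest turns as fit rather than a fixed count. It assumes each turn dictionary carries a precomputed `tokens` field, which is an illustrative convention rather than part of any API.

```python
def rolling_window(turns: list[dict], max_tokens: int) -> list[dict]:
    """Keep the newest turns whose combined token count fits within max_tokens."""
    kept, used = [], 0
    for turn in reversed(turns):                 # walk from newest to oldest
        if used + turn["tokens"] > max_tokens:
            break                                # everything older is dropped
        kept.append(turn)
        used += turn["tokens"]
    return list(reversed(kept))                  # restore chronological order
```

Anything older than the cut-off is lost, which is why the hybrid strategy pairs a window like this with a summary of what was dropped.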

Implementing Summarization in Vertex AI ADK

The Vertex AI Agent Development Kit (ADK) allows you to intercept and modify context before it's sent to the model. You implement a custom context processor that monitors session length and triggers summarization when a threshold is reached.

# Context summarization processor (Vertex AI ADK pattern)
from google.adk.agents import Agent
from google.adk.sessions import DatabaseSessionService
from vertexai.generative_models import GenerativeModel

SUMMARIZE_AFTER_TURNS = 20
KEEP_RECENT_TURNS = 5

def summarize_old_turns(old_turns: list) -> str:
    """Use Gemini Flash to summarize older conversation turns."""
    model = GenerativeModel("gemini-2.0-flash")
    history_text = "\n".join([
        f"{t['role']}: {t['content']}" for t in old_turns
    ])
    prompt = f"""Summarize this conversation history concisely, preserving:
- Key decisions made
- User preferences expressed  
- Important facts established
- Unresolved questions

History:
{history_text}

Summary:"""
    response = model.generate_content(prompt)
    return response.text

def maybe_compress_session(session_turns: list) -> list:
    """Return compressed turns if threshold exceeded."""
    if len(session_turns) <= SUMMARIZE_AFTER_TURNS:
        return session_turns
    
    old_turns = session_turns[:-KEEP_RECENT_TURNS]
    recent_turns = session_turns[-KEEP_RECENT_TURNS:]
    summary = summarize_old_turns(old_turns)
    
    summary_turn = {
        "role": "system",
        "content": f"[Conversation summary: {summary}]"
    }
    return [summary_turn] + recent_turns

Token Cost Implications

Context management is also a cost management strategy. At a Flash-tier input price of approximately $0.075 per million tokens (a figure from 2024 Gemini Flash pricing; rates vary by model version and prompt length), a session with 500K tokens of context history costs roughly $0.0375 per turn in context reconstruction alone. Multiply by thousands of daily active users and you have a significant budget item that summarization can reduce by 60–80%.

The key insight is that Gemini Flash is cheap enough to use as a summarizer. Running a Flash model to condense 50 turns into 500 tokens costs a fraction of what those 50 turns cost in context reconstruction for each subsequent Pro model call.
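The trade-off can be worked through numerically. The sketch below uses the approximate per-million-token price quoted above and, as a simplification, applies the same price to the summarizer call and to later calls. In practice the later calls may run on a more expensive model, which makes summarization pay off sooner. Output tokens are ignored, and the function names are hypothetical.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 0.075   # USD, the approximate figure quoted above


def input_cost(tokens: float, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    return tokens / 1_000_000 * price_per_million


def summarization_payoff(history_tokens: int, summary_tokens: int, later_turns: int,
                         price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Input spend saved over `later_turns` calls, minus the one-off summarizer call."""
    saved_per_turn = input_cost(history_tokens - summary_tokens, price_per_million)
    one_off = input_cost(history_tokens, price_per_million)  # summarizer reads the history once
    return saved_per_turn * later_turns - one_off


print(input_cost(500_000))   # 0.0375
# 50 turns of 500 tokens condensed to 500 tokens, then 10 more turns
print(summarization_payoff(25_000, 500, later_turns=10) > 0)   # True
```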

Production Pattern

Many production teams run summarization asynchronously — after each user turn completes, a background Cloud Run job checks if the session has exceeded the threshold and updates the stored session with a prepended summary. This keeps the synchronous inference path fast while the summarization happens out-of-band.

System Prompt vs. Conversation History Injection

When injecting summarized context, placement matters. You have two options: prepend the summary as a system instruction (before the conversation history) or insert it as the first "assistant" turn in the history. Google's ADK documentation recommends the system instruction approach — it signals to the model that the summary is authoritative background rather than a model output to be questioned or contradicted.
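The system-instruction placement can be sketched as a small request builder. The payload shape below (`system_instruction` and `contents` keys, with turns as role/content dictionaries) is a simplified illustration rather than the exact request schema of any SDK, and `build_request` is a hypothetical helper.

```python
def build_request(base_instruction: str, summary: str,
                  recent_turns: list[dict], new_message: str) -> dict:
    """Put the summary in the system instruction and keep recent turns verbatim."""
    system_instruction = base_instruction
    if summary:
        system_instruction += f"\n\nSummary of the earlier conversation:\n{summary}"
    contents = recent_turns + [{"role": "user", "content": new_message}]
    return {"system_instruction": system_instruction, "contents": contents}
```

The summary never appears as a model turn in `contents`, so the model has no reason to treat it as one of its own earlier statements.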

Key Insight

Context window management isn't just about fitting within limits; it's about what the model actually uses. Research on long-context behavior (notably the "Lost in the Middle" study by Liu et al.) shows that models use information most reliably when it appears at the beginning or end of the context, and least reliably when it sits in the middle. Placing your summary at the beginning and recent verbatim turns at the end aligns with this pattern, improving coherence.

Context saturation

When accumulated session history approaches or exceeds the model's context window limit, causing degraded performance, errors, or forced truncation.

Hierarchical summarization

A compression strategy where older conversation turns are condensed into a summary narrative while recent turns are kept verbatim, reducing token count while preserving semantic continuity.

Sliding window

A simple context strategy that retains only the last N turns, discarding earlier history. Fast and cheap but lossy for long conversations with important early context.

Lesson 3 Quiz

Context Window Management and Summarization

1. Why is it often practical to use Gemini Flash specifically to perform summarization of older conversation turns?

Correct. The economic logic is decisive: Flash's low per-token cost makes summarization affordable, and the savings from reducing context reconstruction on subsequent Pro model calls easily justify it.

Not quite. The key reason is economic: Flash's low per-token cost makes it a practical choice for summarization — the cost of the Flash call is far less than the ongoing savings from reduced context on future Pro model calls.

2. The "Lost in the Middle" research on long-context models suggests that to maximize how well a model uses injected context, you should place your session summary:

Correct. Transformers attend most strongly to the beginning and end of context. Placing the summary at the start and recent turns at the end aligns with actual attention distribution patterns.

Not quite. The "Lost in the Middle" finding is that attention is strongest at the beginning and end of context. Placing the summary at the start and recent turns at the end takes advantage of this pattern.

3. What is the advantage of running session summarization asynchronously (in a background job) rather than inline during inference?

Correct. Asynchronous summarization (e.g., via a Cloud Run job) avoids adding summarization latency to the user-facing response path, maintaining fast turn-around times.

Not quite. The primary advantage is latency — the user's response isn't delayed by the summarization step. Quality and rate limits are secondary considerations.

4. The "hybrid: fixed summary + sliding window" context strategy is described as recommended for most production agents. What does it combine?

Correct. The hybrid approach keeps a running compressed summary of all history combined with the last few verbatim turns — balancing context quality (the summary) with recent precision (verbatim turns).

Not quite. The hybrid strategy combines a compact summary of older history with recent verbatim turns — it doesn't require vector search or session chaining.

Lab 3: Context Compression

Design and reason through context management strategies

Practice: Building a Context Management Plan

In this lab, you'll work through the design of a context management system for a long-running agent. Discuss compression thresholds, summarization quality, cost tradeoffs, and async architectures with the AI tutor.

Try: "How do I decide when to trigger summarization — is 20 turns a good threshold?" or "What should a good session summary always preserve?" or "How do I test whether my summarization is losing important context?"

Context Compression Lab

Welcome to Lab 3. Let's design a context management strategy for a production agent. Tell me about the agent you're building — what domain, what kind of conversations does it handle, and how long do typical sessions get? We'll figure out the right compression approach together.

Module 5 · Lesson 4

Long-Running Agents — Checkpointing and Resumption

When agents must pause, survive failures, and continue: durable execution patterns on Vertex AI

How do you build an agent that survives infrastructure interruptions without losing its place in a complex task?

Waymo's internal engineering blog (2023) described a class of reliability requirements for their operations AI systems: tasks that might take hours to complete — route planning, fleet dispatch optimization, maintenance scheduling — had to survive network interruptions, server restarts, and model redeployments without losing state. The solution was explicit checkpointing: after each significant decision or sub-task, the agent's state was committed to Firestore before proceeding. If the process died, a supervisor could resume from the last committed checkpoint. This pattern maps directly to what Vertex AI ADK formalizes as durable agent execution.

Why Long-Running Agents Need Checkpointing

Most conversational agents complete their work within a single inference call or a short chain of calls within one HTTP request. But a new class of agents — research agents, data pipeline agents, multi-step workflow agents — may need to work across dozens of sequential steps over minutes or hours. During that time, the executing process can fail for many reasons: Cloud Run instance eviction, timeout limits (Cloud Run maximum request timeout is 60 minutes), transient network errors, or model API failures.

Without checkpointing, any failure means restarting the entire task from scratch. With checkpointing, the agent can resume from the last committed state. This transforms long-running agents from fragile experiments into reliable production systems.

The Vertex AI ADK Session as a Checkpoint Store

The Agent Development Kit's session service can function as a lightweight checkpoint store. After each completed sub-task or decision point, you commit the current session state — which includes not just conversation history but also arbitrary structured data stored in session metadata — to the session service. A Cloud Tasks queue entry or Pub/Sub message can then trigger resumption from any failure point.

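# Checkpoint-aware agent execution loop
# Note: recent ADK releases expose the session service methods as coroutines;
# if yours does, make these functions async and await each call.
import time

from google.adk.events import Event, EventActions
from google.adk.sessions import DatabaseSessionService

def commit_checkpoint(session_svc, session, state_delta: dict):
    """Persist a state change by appending an event that carries a state delta."""
    session_svc.append_event(session, Event(
        invocation_id=f"checkpoint-{time.time_ns()}",
        author="system",
        actions=EventActions(state_delta=state_delta),
    ))

def execute_with_checkpoints(task_id: str, task_spec: dict):
    session_svc = DatabaseSessionService(db_url="...")

    # Try to resume from existing checkpoint
    session = session_svc.get_session(
        app_name="research-agent",
        user_id="system",
        session_id=task_id
    )

    if session is None:
        # First execution: create the session under the task ID so a retry can find it
        session = session_svc.create_session(
            app_name="research-agent",
            user_id="system",
            session_id=task_id,
            state={"task_spec": task_spec, "completed_steps": [], "status": "running"}
        )

    completed_steps = list(session.state.get("completed_steps", []))
    all_steps = task_spec["steps"]
    remaining = [s for s in all_steps if s["id"] not in completed_steps]

    for step in remaining:
        execute_step(step)  # application-defined: performs the step's actual work
        completed_steps.append(step["id"])
        # Commit checkpoint after each step
        commit_checkpoint(session_svc, session, {"completed_steps": list(completed_steps)})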

Using Cloud Tasks for Durable Scheduling

Long-running agent workflows benefit from Cloud Tasks as the execution orchestrator. Instead of a single long HTTP request, the agent breaks its work into discrete steps, each dispatched as a Cloud Tasks entry. Cloud Tasks handles retry logic, backoff, and delivery guarantees. The agent's session store holds the inter-step state.

This architecture is explicitly recommended in Google's Vertex AI documentation for agents that exceed Cloud Run's 60-minute request timeout. It also naturally parallelizes — multiple Cloud Tasks workers can execute independent branches of a complex workflow simultaneously, each reading from and writing to the shared session checkpoint.

Durable Long-Running Agent Architecture

Task Initiation → Create Session + State → Enqueue Cloud Tasks Step 1

Cloud Tasks Worker → Execute Step + Agent Call → Checkpoint to Session

Session Store (Firestore) → Enqueue Next Step → Repeat Until Complete
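The flow above can be simulated locally to see how the pieces hand off to one another. In this sketch a `deque` stands in for the Cloud Tasks queue and a dictionary stands in for the session store. Every name is hypothetical and no Google Cloud API is called.

```python
from collections import deque

queue: deque = deque()               # stands in for a Cloud Tasks queue
sessions: dict[str, dict] = {}       # stands in for the session checkpoint store


def start_task(task_id: str, steps: list[str]) -> None:
    sessions[task_id] = {"steps": steps, "completed_steps": [], "status": "running"}
    queue.append(task_id)                            # enqueue the first step


def worker_handle(task_id: str) -> None:
    """Run exactly one step, checkpoint it, then enqueue the next one."""
    state = sessions[task_id]
    remaining = [s for s in state["steps"] if s not in state["completed_steps"]]
    if not remaining:
        state["status"] = "done"
        return
    step = remaining[0]
    # ... the step's real work (agent call, tool use) would happen here ...
    state["completed_steps"].append(step)            # checkpoint to the session
    queue.append(task_id)                            # enqueue the next step


def run_until_idle() -> None:
    while queue:
        worker_handle(queue.popleft())
```

Each invocation of `worker_handle` is short-lived and derives its position entirely from stored state, so no single request needs to outlive a request timeout.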

Human-in-the-Loop Pausing

Long-running agents often need to pause for human approval before proceeding. The Vertex AI ADK's session state makes this straightforward: when the agent reaches a decision point requiring approval, it sets a status: "awaiting_approval" flag in session state and returns. A notification is sent to the human reviewer. When approval arrives, a new Cloud Tasks message is dispatched to resume execution, and the agent reads the approval from session state before continuing.

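# Human-in-the-loop pause/resume pattern
# Assumes session_svc, publisher (Pub/Sub client), topic_path,
# task_client (Cloud Tasks client), and queue_path are initialized elsewhere.
import json

WORKER_BASE_URL = "https://worker.example.com"  # placeholder: your worker's base URL

def request_human_approval(session_id: str, action: str, context: dict):
    # Pause agent execution by recording the pending decision in session state
    session_svc.update_session(
        session_id=session_id,
        state={
            "status": "awaiting_approval",
            "pending_action": action,
            "pending_context": context
        }
    )
    # Notify reviewer (via Pub/Sub → email/Slack/etc); Pub/Sub payloads are bytes
    publisher.publish(topic_path, json.dumps({
        "session_id": session_id,
        "action": action,
        "approval_url": f"/approve/{session_id}"
    }).encode())

def resume_after_approval(session_id: str, approved: bool, reviewer_notes: str):
    # The approval web UI calls this function with the reviewer's decision
    session_svc.update_session(
        session_id=session_id,
        state={
            "status": "running",
            "approval_result": approved,
            "reviewer_notes": reviewer_notes
        }
    )
    # Re-enqueue the next step; the worker reads approval_result from session
    # state and decides how to proceed. Cloud Tasks requires an absolute URL.
    task_client.create_task(parent=queue_path, task={
        "http_request": {"url": f"{WORKER_BASE_URL}/execute-step/{session_id}"}
    })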
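
The worker side of this pattern, where the agent reads the approval from session state before continuing, could look like the sketch below. The function name next_action and its return labels are hypothetical; the state keys match the ones written by the pause/resume functions above.

```python
def next_action(state: dict) -> str:
    """Decide what a resumed worker should do, based on session state alone."""
    if state.get("status") == "awaiting_approval":
        return "wait"              # still paused: do nothing and let the task end
    if state.get("approval_result") is False:
        return "handle_rejection"  # reviewer said no: skip the pending action
    if state.get("status") == "running":
        return "execute"           # approved, or no approval was needed
    return "stop"                  # finished or unknown status


paused = {"status": "awaiting_approval", "pending_action": "send_contract"}
approved = {"status": "running", "approval_result": True}
rejected = {"status": "running", "approval_result": False}
print(next_action(paused), next_action(approved), next_action(rejected))
```

Checking the paused status first protects against a stray or retried task running the pending action before a reviewer has responded.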

Production Reality

In Q4 2023, Google's own internal document processing agents (used for Google Cloud's customer contract review pipeline, documented in a Google Next 2024 session) used exactly this pattern: agents that checkpointed to Firestore after each document section, with human review required before finalizing any amendment suggestions. The checkpoint pattern enabled reviewers to pick up tasks across shifts without losing agent context.

Idempotency in Agent Steps

When Cloud Tasks retries a failed step, your agent's step execution must be idempotent — running it twice must produce the same result as running it once. This means checking whether side effects (API calls, database writes, emails sent) already occurred before executing them. Use the session checkpoint to record that an action was completed, and always check before repeating it.
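
One way to sketch this check is shown below. The helper name run_step_once, the completed_actions key, and the idempotency_key parameter are assumptions made for this example. The key only helps if the external API deduplicates on it, which some APIs support and others do not; it covers the case where the side effect succeeds but the checkpoint write is lost.

```python
def run_step_once(state: dict, step_id: str, side_effect) -> bool:
    """Run side_effect for step_id unless the checkpoint says it already ran.

    Returns True if the side effect ran on this call, False if it was skipped.
    """
    done = state.setdefault("completed_actions", {})
    if step_id in done:
        return False  # a previous attempt already recorded this action
    # Deterministic key: a retry produces the same key, so an API that honors
    # idempotency keys can reject the duplicate even without our checkpoint.
    key = f"{state['session_id']}:{step_id}"
    result = side_effect(idempotency_key=key)
    done[step_id] = {"idempotency_key": key, "result": result}
    return True


sent = []


def send_email(idempotency_key: str) -> str:
    """Stand-in for an external API call with a visible side effect."""
    sent.append(idempotency_key)
    return "ok"


state = {"session_id": "s1"}
first = run_step_once(state, "notify", send_email)
retry = run_step_once(state, "notify", send_email)  # simulated Cloud Tasks retry
print(first, retry, sent)
```

In a real worker the updated state would be written back to the session store immediately after the side effect returns, keeping the window between action and checkpoint as small as possible.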

Key Insight

The combination of Vertex AI ADK sessions (for state), Cloud Tasks (for durable scheduling), and Pub/Sub (for event notification) gives you a complete long-running agent infrastructure entirely within Google Cloud. No third-party orchestration platform is required — these primitives are sufficient for most production long-running agent workloads.

Checkpoint

A committed record of an agent's progress state at a specific point in a multi-step task, enabling resumption from that point after any failure.

Idempotent step

An agent action that produces the same result whether executed once or multiple times, essential when Cloud Tasks may retry failed step executions.

Human-in-the-loop

An architectural pattern where an agent pauses execution at a decision point and awaits explicit human approval before proceeding, with state persisted during the wait.

Lesson 4 Quiz

Long-Running Agents — Checkpointing and Resumption

1. What is the primary reason Cloud Tasks is recommended over a single long HTTP request for long-running agent workflows on Vertex AI?

Correct. Cloud Run's 60-minute maximum request timeout is a hard constraint. Cloud Tasks breaks the work into retryable discrete steps, each within the timeout, orchestrated durably by the Tasks queue.

Not quite. The key constraint is Cloud Run's 60-minute timeout limit. Cloud Tasks enables execution beyond this limit by breaking work into discrete, individually short, retryable steps.

2. An agent step writes data to an external API. When implementing checkpointing, what property must this step have to be safe under Cloud Tasks retry?

Correct. Idempotency is essential when retry systems like Cloud Tasks may execute a step more than once. The agent must check whether the action already completed (via checkpoint state) before executing it again.

Not quite. The critical property is idempotency. Cloud Tasks may retry a step if delivery isn't confirmed. The agent must handle duplicate execution gracefully by checking whether the action already occurred.

3. In the human-in-the-loop pattern described, what state does the agent write to the session when it needs human approval?

Correct. The agent commits a structured state update — status, pending action, and context — to the session, then returns. The session persists while the human reviews, and resumption is triggered externally.

Not quite. The pattern is to write a structured state update to session storage (status, pending action, context) and return — the session persists the pause state while the human reviews asynchronously.

4. Which combination of Google Cloud services is described as sufficient for a complete long-running agent infrastructure without third-party orchestration?

Correct. ADK sessions provide state/checkpointing, Cloud Tasks provides durable step scheduling with retry, and Pub/Sub handles event notification — together they cover the full long-running agent lifecycle.

Not quite. The three primitives described are Vertex AI ADK sessions (state), Cloud Tasks (durable scheduling with retry), and Pub/Sub (event notification for things like human-in-the-loop alerts).

Lab 4: Long-Running Agent Design

Architect a durable, checkpointed agent workflow end-to-end

Practice: Designing a Checkpointed Agent Pipeline

In this capstone lab, you'll architect a complete long-running agent workflow — from task initiation through checkpointing, failure recovery, and human-in-the-loop approval. Work through your design with the AI tutor, covering state schema, step boundaries, idempotency, and resumption logic.

Try: "I want to build a research agent that gathers data from 10 sources and writes a report. How do I design the checkpoint schema?" or "How should I handle a step that calls an external API — what if it succeeds but the checkpoint write fails?"

Long-Running Agent Lab

Welcome to the capstone lab. We're going to design a production-grade long-running agent together. Tell me about the workflow you have in mind — what does it need to accomplish, approximately how many steps, and what are the failure risks you're most worried about?

Module 5 Test

Managing Context — Sessions, Memory, and Long-Running Agents · 15 questions · Pass at 80%

1. In Vertex AI Agent Builder, what object must be stored client-side after calling create_session() to enable multi-turn conversations?

Correct. The session ID is the conversation handle — required on every subsequent API call to identify and reconstruct the conversation state.

The session ID is what must be stored. The full state is managed server-side; only the identifier is needed client-side.

2. What is the absolute maximum lifetime of a Vertex AI Agent Builder session, regardless of activity?

Correct. 24 hours is the absolute maximum session lifetime: it is both the ceiling for the configurable inactivity timeout and the hard limit on session age.

24 hours is the absolute maximum. 60 minutes is the default inactivity timeout (configurable up to 24 hours).

3. Which storage backend underlies the Vertex AI Agent Builder Session API for strong consistency?

Correct. The Session API is backed by Spanner, providing strong consistency and global availability for session data.

The Session API uses Spanner under the hood, not Firestore or BigQuery, giving it strong consistency properties.

4. An e-commerce agent needs to recall a user's size preference from six months ago (stored after a previous purchase conversation). Which memory type handles this correctly?

Correct. Cross-session persistent structured data — like user preferences — is the core use case for external memory stores like Firestore.

Session memory only lasts within one conversation (up to 24h). Persistent cross-session user data requires external storage like Firestore.

5. Semantic memory with vector search is best suited for which scenario?

Correct. Semantic retrieval finds the most relevant items by meaning similarity — ideal when you have too much history to inject wholesale but want relevant context surfaced intelligently.

Semantic memory excels at relevance-based retrieval from large corpora. Structured fields go in external stores; recent turns go in session context; audit logs go in structured logging systems.

6. The "Lost in the Middle" research finding about transformer attention affects how you should position injected context. What does it recommend?

Correct. Attention concentrates at the start and end of context — aligning your most important information (summary at start, recent turns at end) with these attention peaks improves coherence.

The research shows attention peaks at context boundaries (beginning and end), so place high-priority content there — summary at start, recent turns at end.

7. Why is Gemini Flash specifically practical as a summarization model in context compression pipelines?

Correct. The economic argument is decisive: Flash's low cost makes summarization affordable, and it prevents higher per-turn costs on subsequent Pro model calls with bloated context.

The reason is economic: Flash is cheap enough that summarization calls cost less than the ongoing savings from reduced context window usage on more expensive Pro model calls.

8. What is the recommended approach for inserting a session summary into the agent context, according to Google's ADK documentation?

Correct. System instruction placement signals to the model that the summary is authoritative background rather than a model output to be questioned, and benefits from strong beginning-of-context attention.

The recommendation is system instruction placement — it's authoritative (not something the model debates) and benefits from strong attention at the start of context.

9. What is the maximum request timeout for Cloud Run, and why does this affect long-running agent architecture?

Correct. Cloud Run's 60-minute hard timeout requires that any agent workflow taking longer must be decomposed into discrete steps orchestrated by Cloud Tasks with checkpointing between steps.

Cloud Run's maximum request timeout is 60 minutes. Workflows exceeding this must be broken into steps via Cloud Tasks, with session checkpoints preserving state between steps.

10. In the human-in-the-loop pattern, what happens to session state while the agent awaits human approval?

Correct. The session persists the pause state — status, pending action, context — while the human reviews asynchronously. Resumption is triggered externally via Cloud Tasks or a similar mechanism.

The session persists with the approval status and pending action recorded. No polling needed — resumption is triggered externally (e.g., by a Cloud Tasks message) when the human approves.

11. What property must a checkpointed agent step have to be safe when Cloud Tasks automatically retries a failed step delivery?

Correct. Idempotency means the step produces the same result whether executed once or multiple times — critical when retry systems may deliver a step more than once.

Idempotency is the required property. The agent must check whether a step's action already occurred (via checkpoint state) before executing it again during a retry.

12. Which Google Cloud service combination is described as sufficient for a complete long-running agent infrastructure?

Correct. ADK sessions handle state/checkpointing, Cloud Tasks provides durable step scheduling with retry, and Pub/Sub enables event-driven notifications — together a complete primitive stack.

The three sufficient primitives are ADK sessions (state), Cloud Tasks (durable scheduling), and Pub/Sub (event notification). Other combinations lack one or more of these capabilities natively.

13. At approximately what turn count does context window saturation become a practical concern for Gemini models with a 128K token context window, assuming 500 tokens per turn?

Correct. 128,000 tokens ÷ 500 tokens/turn = 256 turns, at which point the history alone fills the context window and leaves no room for the new message, system prompt, or tool results.

128K tokens ÷ 500 tokens/turn = 256 turns. Beyond this, the history alone would exceed the window — and tool results, system prompts, and new messages need space too.

14. The list_events() method on an Agent Builder session is primarily useful for:

Correct. list_events() returns the complete event log including user turns, model turns, and tool call/response pairs — making it the primary debugging tool for agent behavior analysis.

list_events() returns the full event log for debugging and replay. list_sessions() is for listing sessions by user. Billing data is in Cloud Billing, not the session API.

15. When implementing the hybrid context strategy (fixed summary + sliding window), where should the summary be positioned in the injected context to maximize model attention?

Correct. Beginning placement (as system instruction) aligns with peak attention at context start, and recent turns at the end align with peak attention at context end — both positions where transformers attend most strongly.

The optimal placement leverages the "Lost in the Middle" finding: summaries at the beginning (system instruction) and recent turns at the end — both positions of peak transformer attention.