When Google demonstrated Bard's ability to hold multi-turn conversations during its I/O keynote in May 2023, the underlying infrastructure challenge was invisible to the audience: every cloud inference endpoint is stateless. The demo worked because session state was explicitly reconstructed from storage on each turn — not because the model remembered anything inherently. This design pattern, now formalized in Vertex AI Agent Builder's Session API, is the foundation of every production agent.
Cloud LLM inference endpoints treat each API call as independent. The model has no persistent memory. When a user sends their fifth message in a conversation, the agent framework must reconstruct the entire relevant context from storage and inject it into the new inference call. This is not a limitation — it is an architectural choice that enables horizontal scaling, fault tolerance, and cost control.
Vertex AI Agent Builder's Session API manages this reconstruction. A session is a server-side record identified by a unique session ID. It stores the ordered list of turns (user messages and model responses), any tool call results, and metadata like creation time and user ID. The session persists in Google-managed storage for up to 60 minutes of inactivity by default, configurable up to 24 hours.
In the Vertex AI SDK for Python, sessions are created via the AgentEngine client. A session ID is returned that must be stored client-side and passed with every subsequent message in that conversation.
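The create-then-pass-the-ID flow can be sketched with a purely in-memory stand-in. This is not the Vertex AI SDK: `InMemorySessionStore`, `fake_model`, and `send_message` are hypothetical names, used only to show that the model call is stateless and that the full history is rebuilt from storage on every turn.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Session:
    session_id: str
    user_id: str
    turns: list = field(default_factory=list)  # ordered (role, text) pairs


class InMemorySessionStore:
    """Hypothetical stand-in for a server-side session service."""

    def __init__(self):
        self._sessions = {}

    def create_session(self, user_id: str) -> Session:
        session = Session(session_id=uuid.uuid4().hex, user_id=user_id)
        self._sessions[session.session_id] = session
        return session

    def get_session(self, session_id: str) -> Session:
        return self._sessions[session_id]


def fake_model(context: list) -> str:
    """Stateless 'model': it only sees what is passed in on this call."""
    user_turns = [text for role, text in context if role == "user"]
    return f"I have seen {len(user_turns)} user message(s)."


def send_message(store: InMemorySessionStore, session_id: str, text: str) -> str:
    session = store.get_session(session_id)   # 1. load state from storage
    session.turns.append(("user", text))
    reply = fake_model(list(session.turns))   # 2. inject the full history
    session.turns.append(("model", reply))    # 3. persist the new turn
    return reply


store = InMemorySessionStore()
sid = store.create_session(user_id="user-123").session_id  # client keeps this ID
send_message(store, sid, "Hello")
second_reply = send_message(store, sid, "Do you remember me?")
```

The second reply reflects both user messages only because the stored turns were passed back in; the model function itself holds no state between calls.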
Each session object returned by the API contains several key components that your application code needs to understand:
Turn history: Ordered list of Content objects representing alternating user and model turns. This is the primary mechanism for multi-turn coherence.
Metadata: Fields including create_time, update_time, user_id, and display_name for tracking and auditing.
Tool call results: Function call and function response pairs are stored as part of the turn history, enabling the agent to reason about what actions it already took.
Event log: Queryable log of all events in a session via list_events(), useful for debugging and building session replay features.
Production systems need to retrieve sessions for returning users. The Agent Builder API supports filtering by user_id so your application can surface a user's conversation history.
The Agent Builder Session API is backed by Spanner under the hood, giving sessions strong consistency and global availability. However, session storage is not a replacement for your application database — treat it as a short-to-medium-term conversational cache, not permanent user data storage.
Sessions automatically expire after a configurable idle timeout. In production, you should also explicitly delete sessions after a conversation is definitively complete to manage costs and comply with data retention policies. Sessions can be deleted with agent.delete_session(session_id=...). Google's documentation notes that the default inactivity timeout is 60 minutes, with an absolute maximum lifetime of 24 hours per session regardless of activity.
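A minimal sketch of idle expiry plus explicit deletion, assuming a simple in-memory registry. `ExpiringSessions` and its methods are hypothetical and do not mirror the managed service's internals; the 60-minute value is the default quoted above.

```python
import time
from typing import Optional


class ExpiringSessions:
    """Hypothetical in-memory registry with an idle timeout (not an SDK class)."""

    def __init__(self, idle_timeout_s: float):
        self.idle_timeout_s = idle_timeout_s
        self._last_active = {}  # session_id -> timestamp of last activity

    def touch(self, session_id: str, now: Optional[float] = None) -> None:
        """Record activity; call this on every turn."""
        self._last_active[session_id] = time.time() if now is None else now

    def is_alive(self, session_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        last = self._last_active.get(session_id)
        return last is not None and (now - last) <= self.idle_timeout_s

    def delete(self, session_id: str) -> None:
        """Explicit cleanup once a conversation is definitively complete."""
        self._last_active.pop(session_id, None)

    def sweep(self, now: Optional[float] = None) -> list:
        """Remove every session that has been idle longer than the timeout."""
        now = time.time() if now is None else now
        expired = [sid for sid, last in self._last_active.items()
                   if now - last > self.idle_timeout_s]
        for sid in expired:
            del self._last_active[sid]
        return expired


sessions = ExpiringSessions(idle_timeout_s=60 * 60)
sessions.touch("session-a", now=0)
sessions.touch("session-b", now=3000)
expired = sessions.sweep(now=4000)  # session-a has been idle for 4000 s
```

Passing `now` explicitly keeps the example deterministic; production code would rely on the clock.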
Session IDs are the conversation "handle." Your frontend or backend must store the session ID after create_session() and attach it to every subsequent API call in that conversation. Losing the session ID means losing the conversation context — the agent will behave as if meeting the user for the first time.
list_events() returns the full event log for a session, including user turns, model turns, and tool call/response pairs, which makes it essential for debugging complex agent behaviors. It returns all events, not just final responses; list_sessions() is the method for listing sessions by user.
In this lab, you'll discuss session architecture decisions with an AI tutor. Think through how you'd implement sessions in real production scenarios: a customer support chatbot, a multi-step form assistant, or a coding helper that needs to track files across turns.
When Duolingo launched Duolingo Max in March 2023 — their GPT-4-powered "Explain My Answer" and "Roleplay" features — they faced a direct memory problem. A Spanish learner who had struggled with the subjunctive mood in fifty previous lessons should have that history inform the current lesson. Session history alone couldn't carry fifty conversations. Duolingo's architecture uses a learner profile database that is selectively retrieved and injected into the prompt, a form of what the field calls external memory with semantic retrieval. This pattern is now a standard approach documented in both LangChain and Vertex AI's own agent frameworks.
Production agents in any framework — including Vertex AI Agent Builder — typically operate across three distinct memory layers. Understanding which layer to use for which information is one of the most consequential design decisions you'll make.
In-context memory: The session turn history injected into the current context window. Fast, free at point of use, but bounded by context window size (128K–1M tokens for Gemini models). Automatically managed by Agent Builder sessions.
External memory: Structured data in Cloud Firestore, BigQuery, or Spanner queried via tool calls. Unbounded size, persistent across sessions, but requires explicit retrieval logic in your agent's tool definitions.
Semantic memory: Vector embeddings in Vertex AI Vector Search or Firestore's native vector index. Enables fuzzy retrieval, finding relevant memories by meaning rather than exact match. Ideal for long-term user preferences and knowledge bases.
The Agent Builder session API automatically manages in-context memory. Every turn you add via stream_query() is appended to the session and reconstructed into the context window on the next call. For most conversations under 50 turns with moderate message length, this is all you need.
The risk is context window saturation. When a session accumulates many turns, you approach context limits — even with Gemini 1.5 Pro's 1M token window. Token costs also scale linearly with context length, so a very long session is significantly more expensive per turn than a new one.
At roughly 500 tokens per turn average, a Gemini model with a 128K context window saturates after about 256 turns. For a customer support agent averaging 15 turns per conversation, this is rarely a concern. For a coding agent used daily over months, it is a genuine architectural challenge.
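The saturation estimate is simple integer division. A small sketch using the figures from the text (the function name is illustrative):

```python
def turns_until_saturation(context_window_tokens: int, avg_tokens_per_turn: int) -> int:
    """Number of whole turns that fit in the context window."""
    return context_window_tokens // avg_tokens_per_turn


turns_128k = turns_until_saturation(128_000, 500)    # 128K window
turns_1m = turns_until_saturation(1_000_000, 500)    # 1M window

# At 15 turns per support conversation, how many conversations' worth is that?
support_conversations = turns_128k // 15
```

This ignores system instructions, tool outputs, and the response itself, all of which also consume the window, so real sessions saturate somewhat earlier.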
External memory is implemented by giving your agent tools that read and write to a persistent store. When the agent needs to remember something across sessions, it calls a save_preference or update_user_profile tool. When it needs to recall something, it calls get_user_history or load_preferences.
This pattern is explicit and auditable — you can see exactly what the agent stored and retrieved. It requires you to design the schema of what gets remembered. Common external memory stores for Vertex AI agents are Cloud Firestore (document store, low latency), BigQuery (analytics queries over conversation history), and Cloud Spanner (transactional, globally consistent).
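A minimal sketch of the read/write tool pair, assuming an in-memory SQLite table as a stand-in for Firestore or Spanner. The table schema is hypothetical; `save_preference` and `load_preferences` borrow the tool names mentioned above, but their signatures here are assumptions.

```python
import json
import sqlite3

# In-memory database standing in for a persistent store such as Firestore.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE preferences ("
    "user_id TEXT, key TEXT, value TEXT, PRIMARY KEY (user_id, key))"
)


def save_preference(user_id: str, key: str, value) -> str:
    """Tool: persist one preference so later sessions can read it."""
    conn.execute(
        "INSERT OR REPLACE INTO preferences (user_id, key, value) VALUES (?, ?, ?)",
        (user_id, key, json.dumps(value)),
    )
    conn.commit()
    return "saved"


def load_preferences(user_id: str) -> dict:
    """Tool: return everything remembered about this user."""
    rows = conn.execute(
        "SELECT key, value FROM preferences WHERE user_id = ?", (user_id,)
    ).fetchall()
    return {key: json.loads(value) for key, value in rows}


save_preference("user-123", "units", "metric")
save_preference("user-123", "units", "imperial")   # overwrites the earlier value
save_preference("user-123", "language", "es")
prefs = load_preferences("user-123")
```

Because the schema is explicit, the same table can be queried by non-agent systems, which is the auditability benefit described above.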
Semantic memory uses embeddings to retrieve the most relevant memories rather than all memories. This is critical when a user has a long history — you don't want to inject all 200 past conversations into the context window; you want the 5 most relevant ones to the current question.
Vertex AI Vector Search (formerly Matching Engine) and Firestore's native vector capabilities both support this pattern. You embed key memories (user statements, past resolutions, expressed preferences) using textembedding-gecko or Gemini's embedding API, store them in the vector index, and retrieve the top-K most relevant at query time.
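Top-K retrieval can be sketched without any cloud dependency. The example below assumes a toy bag-of-words "embedding" in place of a real embedding model and a linear scan in place of a vector index; `toy_embed`, `cosine`, and `top_k` are hypothetical helpers.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, memories: list, k: int) -> list:
    """Return the k stored memories most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(memories, key=lambda m: cosine(query_vec, toy_embed(m)), reverse=True)
    return ranked[:k]


memories = [
    "user prefers metric units",
    "user struggled with the spanish subjunctive mood",
    "user asked about billing last week",
]
best = top_k("practice the subjunctive mood in spanish", memories, k=1)
```

A real deployment would embed each memory once at write time and query an approximate nearest-neighbor index rather than re-embedding and scanning on every request.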
Start with in-context session memory. Add external memory when you need cross-session persistence. Add semantic memory only when external memory retrieves too much data to fit in context. Each layer adds architectural complexity — justify it with a real requirement.
Use in-context (session) memory when: the information is only relevant within the current conversation, the conversation is bounded in length, and you don't need to access it from other sessions or services.
Use external memory when: information must persist across sessions (user preferences, account data), must be accessible to non-agent systems, or must be queryable in structured ways (e.g., "all users who prefer metric units").
Use semantic memory when: you have large historical data per user that can't all fit in context, relevance matters more than recency, and you need to surface similar past situations to inform the current response.
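The layering rules above reduce to two questions. A deliberately simplified sketch (the function and its two parameters are illustrative; real decisions weigh more factors):

```python
def choose_memory_layer(needs_cross_session: bool, history_fits_in_context: bool) -> str:
    """Pick the simplest memory layer that satisfies the requirement."""
    if not needs_cross_session:
        return "in-context"   # session history alone is enough
    if history_fits_in_context:
        return "external"     # persist it, load all of it when needed
    return "semantic"         # persist it, retrieve only the relevant part


support_bot = choose_memory_layer(needs_cross_session=False, history_fits_in_context=True)
preferences = choose_memory_layer(needs_cross_session=True, history_fits_in_context=True)
tutor = choose_memory_layer(needs_cross_session=True, history_fits_in_context=False)
```

The ordering of the checks encodes the advice to add each layer only when the previous one cannot meet a real requirement.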
text-embedding-004 is Google's current production text embedding model on Vertex AI, accessed via TextEmbeddingModel.from_pretrained("text-embedding-004") in the Vertex AI SDK. textembedding-gecko is an older variant.
Work through memory architecture decisions with an AI tutor. You'll analyze realistic agent scenarios and decide which memory layer (in-context, external, or semantic) best fits each requirement.
When GitHub released Copilot Workspace in preview in April 2024, one of its defining engineering challenges was managing context across long-running coding sessions. A developer working through a complex refactor might generate thousands of tokens of file content, tool outputs, and planning messages. GitHub's approach, partially described in their engineering blog, involved progressive summarization: older exchanges are condensed into compact summaries that preserve key decisions, and their verbatim text is then dropped. This is now a documented pattern in the Vertex AI Agent Development Kit (ADK) as a custom context management strategy.
The context window is the total amount of text an LLM can process in a single inference call — inputs plus outputs combined. For Gemini 1.5 Pro, this is 1 million tokens. For Gemini 2.0 Flash, it's 1 million tokens as well. These are large windows, but they are not infinite, and every token costs money.
In Agent Builder sessions, the full turn history is reconstructed and injected on each call. A session with 100 turns, each averaging 500 tokens, already consumes 50,000 tokens of context before the user's new message or any tool results are added. At 1,000 turns, you're at 500,000 tokens, half of a million-token window, with significant cost implications.
Truncation (last N turns): Keep only the last N turns in the injected context. Simple to implement, but older context is lost permanently. Works well when recent turns contain all relevant information.
Progressive summarization: Older turns are summarized into a compact narrative. The summary plus recent verbatim turns are injected together. Preserves semantic meaning while reducing token count.
Semantic selection: Only turns semantically relevant to the current query are retrieved and injected. Requires a vector index over turn embeddings. Most sophisticated, highest retrieval fidelity.
Summary plus recent turns: A running summary of the entire session history plus the last K verbatim turns. Balances context quality with token efficiency. Recommended for most production agents.
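Two of these strategies, keeping the last N turns and combining a summary with the last K turns, can be sketched in a few lines. The helper names and the `("system", ...)` tuple format are assumptions for illustration, not an SDK data model.

```python
def last_n_turns(turns: list, n: int) -> list:
    """Truncation: keep only the most recent n turns."""
    return turns[-n:] if n > 0 else []


def summary_plus_recent(summary: str, turns: list, k: int) -> list:
    """Hybrid: the summary goes first, followed by the last k verbatim turns."""
    context = [("system", f"Summary of earlier conversation: {summary}")]
    context.extend(last_n_turns(turns, k))
    return context


turns = [("user", f"message {i}") for i in range(1, 11)]

window = last_n_turns(turns, 3)
hybrid = summary_plus_recent("User is refactoring the auth module.", turns, 2)
```

With truncation, everything before the window is simply gone; the hybrid form keeps a compressed trace of it at the cost of one extra context entry.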
The Vertex AI Agent Development Kit (ADK) allows you to intercept and modify context before it's sent to the model. You implement a custom context processor that monitors session length and triggers summarization when a threshold is reached.
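The threshold logic itself is framework-independent. The sketch below is plain Python rather than the ADK's hook interface: `compact_if_needed`, `estimate_tokens`, and `naive_summarize` are hypothetical, the 4-characters-per-token estimate is a rough assumption, and a real system would call a model (for example a Flash-tier model) where `naive_summarize` appears.

```python
def estimate_tokens(text: str) -> int:
    """Crude size estimate (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def naive_summarize(previous_summary: str, old_turns: list) -> str:
    """Placeholder summarizer; a real one would call an LLM."""
    note = f"[{len(old_turns)} earlier turns condensed]"
    return f"{previous_summary} {note}".strip()


def compact_if_needed(turns, summary, threshold_tokens, keep_last, summarize):
    """Fold older turns into the summary once the history exceeds the threshold.

    Returns (summary, turns); both are unchanged while under the threshold.
    """
    total = sum(estimate_tokens(text) for _, text in turns)
    if total <= threshold_tokens or len(turns) <= keep_last:
        return summary, turns
    split = len(turns) - keep_last
    old, recent = turns[:split], turns[split:]
    return summarize(summary, old), recent


history = [("user", "x" * 400) for _ in range(10)]   # about 100 tokens each
new_summary, recent = compact_if_needed(
    history, "", threshold_tokens=500, keep_last=3, summarize=naive_summarize
)
```

Passing the summarizer in as a callable keeps the trigger logic testable without a model call.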
Context management is also a cost management strategy. At an input price of approximately $0.075 per million tokens (the Gemini Flash-tier rate cited here; pricing varies by model and changes over time), a session with 500K tokens of context history costs roughly $0.0375 in context reconstruction alone, per turn. Multiply by thousands of daily active users and you have a significant budget item that summarization can reduce by 60–80%.
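The per-turn figure follows directly from the quoted numbers. A short sketch of the arithmetic (the price is the one cited in the text, not a current quote):

```python
def context_cost_per_turn(context_tokens: int, usd_per_million_tokens: float) -> float:
    """Input cost of re-sending the accumulated context on one turn."""
    return context_tokens / 1_000_000 * usd_per_million_tokens


baseline = context_cost_per_turn(500_000, 0.075)   # 500K tokens of history

# Effect of reducing the context by 60% and by 80% through summarization.
after_60_percent = baseline * (1 - 0.60)
after_80_percent = baseline * (1 - 0.80)
```

Because the context is re-sent on every turn, the saving applies to each subsequent call in the session, not just once.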
The key insight is that Gemini Flash is cheap enough to use as a summarizer. Running a Flash model to condense 50 turns into 500 tokens costs a fraction of what those 50 turns cost in context reconstruction for each subsequent Pro model call.
Many production teams run summarization asynchronously — after each user turn completes, a background Cloud Run job checks if the session has exceeded the threshold and updates the stored session with a prepended summary. This keeps the synchronous inference path fast while the summarization happens out-of-band.
When injecting summarized context, placement matters. You have two options: prepend the summary as a system instruction (before the conversation history) or insert it as the first "model" turn in the history. Google's ADK documentation recommends the system instruction approach: it signals to the model that the summary is authoritative background rather than a model output to be questioned or contradicted.
Context window management isn't just about fitting within limits; it's also about what the model actually uses. Research on long-context behavior (including the "Lost in the Middle" study by Liu et al.) shows that models use information most reliably when it sits at the beginning or end of the context, and less reliably when it is buried in the middle. Placing your summary at the beginning and recent verbatim turns at the end aligns with that pattern, improving coherence.
In this lab, you'll work through the design of a context management system for a long-running agent. Discuss compression thresholds, summarization quality, cost tradeoffs, and async architectures with the AI tutor.
Waymo's internal engineering blog (2023) described a class of reliability requirements for their operations AI systems: tasks that might take hours to complete — route planning, fleet dispatch optimization, maintenance scheduling — had to survive network interruptions, server restarts, and model redeployments without losing state. The solution was explicit checkpointing: after each significant decision or sub-task, the agent's state was committed to Firestore before proceeding. If the process died, a supervisor could resume from the last committed checkpoint. This pattern maps directly to what Vertex AI ADK formalizes as durable agent execution.
Most conversational agents complete their work within a single inference call or a short chain of calls within one HTTP request. But a new class of agents — research agents, data pipeline agents, multi-step workflow agents — may need to work across dozens of sequential steps over minutes or hours. During that time, the executing process can fail for many reasons: Cloud Run instance eviction, timeout limits (Cloud Run maximum request timeout is 60 minutes), transient network errors, or model API failures.
Without checkpointing, any failure means restarting the entire task from scratch. With checkpointing, the agent can resume from the last committed state. This transforms long-running agents from fragile experiments into reliable production systems.
The Agent Development Kit's session service can function as a lightweight checkpoint store. After each completed sub-task or decision point, you commit the current session state — which includes not just conversation history but also arbitrary structured data stored in session metadata — to the session service. A Cloud Tasks queue entry or Pub/Sub message can then trigger resumption from any failure point.
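Commit-after-each-step and resume-from-last-commit can be sketched with a local JSON file standing in for the session service. `run_with_checkpoints`, `save_checkpoint`, and the `fail_at` crash simulation are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run_with_checkpoints(steps, checkpoint_path, fail_at=None):
    """Run steps in order, committing state after each; resume if a checkpoint exists."""
    state = {"next_step": 0, "results": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    for i in range(state["next_step"], len(steps)):
        if fail_at == i:
            raise RuntimeError(f"simulated crash before step {i}")
        state["results"].append(steps[i]())
        state["next_step"] = i + 1
        save_checkpoint(checkpoint_path, state)
    return state["results"]


calls = []

def make_step(name):
    def step():
        calls.append(name)   # records how often each step really ran
        return name.upper()
    return step

steps = [make_step(n) for n in ("plan", "fetch", "report")]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

try:
    run_with_checkpoints(steps, path, fail_at=2)   # dies before the third step
except RuntimeError:
    pass
results = run_with_checkpoints(steps, path)        # resumes at the third step
```

The `calls` list shows that the first two steps ran exactly once even though the workflow was started twice.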
Long-running agent workflows benefit from Cloud Tasks as the execution orchestrator. Instead of a single long HTTP request, the agent breaks its work into discrete steps, each dispatched as a Cloud Tasks entry. Cloud Tasks handles retry logic, backoff, and delivery guarantees. The agent's session store holds the inter-step state.
This architecture is explicitly recommended in Google's Vertex AI documentation for agents that exceed Cloud Run's 60-minute request timeout. It also naturally parallelizes — multiple Cloud Tasks workers can execute independent branches of a complex workflow simultaneously, each reading from and writing to the shared session checkpoint.
Long-running agents often need to pause for human approval before proceeding. The Vertex AI ADK's session state makes this straightforward: when the agent reaches a decision point requiring approval, it sets a status: "awaiting_approval" flag in session state and returns. A notification is sent to the human reviewer. When approval arrives, a new Cloud Tasks message is dispatched to resume execution, and the agent reads the approval from session state before continuing.
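The pause-and-resume flow amounts to a small state machine over the session state. In this sketch the state is a plain dict; only the `"awaiting_approval"` status value comes from the text, while the `approved` and `draft` keys and the `advance` function are assumptions.

```python
def advance(state: dict) -> dict:
    """Run the agent until it finishes or needs a human decision."""
    status = state.get("status", "new")
    if status == "new":
        state["draft"] = "proposed amendment"   # work done before the gate
        state["status"] = "awaiting_approval"   # pause here and return
    elif status == "awaiting_approval":
        decision = state.get("approved")        # written by the reviewer
        if decision is True:
            state["status"] = "completed"
        elif decision is False:
            state["status"] = "rejected"
        # decision is None: still waiting, nothing to do
    return state


state = advance({})                       # first invocation pauses at the gate
status_after_pause = state["status"]

state = advance(state)                    # resumed too early: no decision yet
status_still_waiting = state["status"]

state["approved"] = True                  # reviewer records approval
state = advance(state)                    # resumption completes the task
```

Each call to `advance` corresponds to one dispatched task; between calls nothing is running, so the wait for a reviewer costs no compute.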
In Q4 2023, Google's own internal document processing agents (used for Google Cloud's customer contract review pipeline, documented in a Google Next 2024 session) used exactly this pattern: agents that checkpointed to Firestore after each document section, with human review required before finalizing any amendment suggestions. The checkpoint pattern enabled reviewers to pick up tasks across shifts without losing agent context.
When Cloud Tasks retries a failed step, your agent's step execution must be idempotent — running it twice must produce the same result as running it once. This means checking whether side effects (API calls, database writes, emails sent) already occurred before executing them. Use the session checkpoint to record that an action was completed, and always check before repeating it.
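A minimal sketch of the check-before-repeating rule, assuming the checkpoint is a dict with a hypothetical `completed_actions` map keyed by a stable action ID:

```python
def execute_once(checkpoint: dict, action_id: str, action):
    """Run `action` only if the checkpoint has no record of it completing."""
    completed = checkpoint.setdefault("completed_actions", {})
    if action_id in completed:
        return completed[action_id]   # retry: reuse the recorded result
    result = action()                 # first attempt: perform the side effect
    completed[action_id] = result     # record completion in the checkpoint
    return result


sent = []

def send_welcome_email():
    sent.append("welcome")            # the side effect we must not repeat
    return "sent"

checkpoint = {}
first = execute_once(checkpoint, "send-welcome-email", send_welcome_email)
retry = execute_once(checkpoint, "send-welcome-email", send_welcome_email)
```

A crash between performing the action and recording it can still cause one duplicate, so where the downstream API supports idempotency keys, passing the same `action_id` there closes that gap.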
The combination of Vertex AI ADK sessions (for state), Cloud Tasks (for durable scheduling), and Pub/Sub (for event notification) gives you a complete long-running agent infrastructure entirely within Google Cloud. No third-party orchestration platform is required — these primitives are sufficient for most production long-running agent workloads.
In this capstone lab, you'll architect a complete long-running agent workflow — from task initiation through checkpointing, failure recovery, and human-in-the-loop approval. Work through your design with the AI tutor, covering state schema, step boundaries, idempotency, and resumption logic.
The list_events() method on an Agent Builder session is primarily useful for debugging and replay: it returns the complete event log, including user turns, model turns, and tool call/response pairs, making it the primary tool for analyzing agent behavior. list_sessions() is for listing sessions by user. Billing data is in Cloud Billing, not the session API.