Module 6 · Lesson 1

Conversation State and Session Management

How agents remember what was said — and why that memory has limits

What does it really mean for an agent to "remember" a conversation across multiple turns?

In 2023, Google's DeepMind published findings from its Gemini research showing that even large context windows — up to one million tokens — did not guarantee recall fidelity across all positions in a document. The "lost in the middle" phenomenon, first formally described by Stanford researchers Liu et al. in their 2023 paper, demonstrated that language models systematically underweight information placed in the center of long contexts. For production agents, this is not an academic curiosity — it is an architectural constraint that shapes how conversation history must be managed.

What Is Conversation State?

A multi-turn agent maintains conversation state — the accumulated record of what has been said, decided, and done across multiple exchanges. In Vertex AI Agent Builder, state is represented as a structured session object that travels with each request to the underlying model.

State contains more than raw message text. It includes tool call results, intermediate reasoning steps, user-provided context (account IDs, preferences), and agent-generated artifacts (summaries, computed values). Managing all of this reliably is the central challenge of multi-turn agent design.

The Three Layers of Agent Memory

In-Context Memory

The active prompt window — messages, tool results, and instructions sent with every API call. Fast and immediately available, but bounded by token limits and subject to positional attention effects.

External Memory

Persistent storage outside the model: Firestore, Cloud Spanner, or Vector Search indexes. The agent retrieves relevant facts via search rather than keeping everything in context.

Parametric Memory

Knowledge baked into model weights during training or fine-tuning. Cannot be updated at inference time. Useful for general world knowledge but unreliable for dynamic, session-specific facts.

Session Objects in Vertex AI

Vertex AI's Agent Development Kit (ADK) structures sessions as first-class objects. Each session carries a unique session_id and a state dictionary that persists across calls within the same session lifecycle.

# Vertex AI ADK — session state example
from google.cloud.aiplatform_v1beta1 import AgentEngineServiceClient

# Create a persistent session
session = client.create_session(
    parent=f"projects/{PROJECT}/locations/{REGION}/reasoningEngines/{ENGINE_ID}",
    session={
        "user_id": "user-42",
        "display_name": "Support Conversation #7891"
    }
)

# State is read/written via the session context object
# Available in tool callbacks as: tool_context.state["key"]
tool_context.state["account_verified"] = True
tool_context.state["last_order_id"] = "ORD-2024-88123"
    

Context Window Management Strategies

Rolling Window Truncation

Keep only the most recent N turns in context. Simple but loses early context. Useful when conversations are task-focused and short-term.

Summarization Compression

Periodically summarize older turns using a fast, cheap model call. Replace raw history with the summary. Google used this pattern in early Bard conversations.

Structured State Extraction

Extract key facts into a typed state dict after each turn. Only inject the structured state — not raw history — into subsequent prompts. Minimizes tokens while preserving actionable context.

Retrieval-Augmented History

Store full history in Vertex AI Vector Search. At each turn, retrieve only the most semantically relevant prior exchanges. Scales to arbitrarily long conversations.

Production Reality

Google's own customer support agents deployed via CCAI (Contact Center AI) in 2022–2023 use a hybrid approach: a structured "customer context card" injected at the start of each session, combined with a rolling 8-turn window for recent dialogue. This keeps tokens predictable while surfacing account-specific facts reliably.

Key Terms

session_idUnique identifier for a conversation session in Vertex AI ADK. Enables state persistence across multiple agent turns and API calls.

State DictA key-value store attached to a session. Written and read by tools and callbacks without consuming prompt tokens for every key.

Context CompressionAny technique that reduces token consumption of conversation history: truncation, summarization, or selective retrieval.

Lost-in-the-MiddleDocumented attention degradation for information placed in the center of long prompts. First formally measured by Liu et al., Stanford, 2023.

Quiz — Conversation State

3 questions · Lesson 1

What does the Vertex AI ADK session state dict allow agents to do?

Correct. The state dict is a session-scoped key-value store. Tools read and write it via tool_context.state, avoiding the need to inject entire histories as raw text.

Not quite. The state dict is a structured key-value store for session data — it doesn't touch model weights, system prompts, or context window limits.

The "lost-in-the-middle" phenomenon (Liu et al., 2023) describes what limitation for multi-turn agents?

Correct. Even with large context windows, models show reduced recall for content positioned in the middle of the prompt. This means critical facts should be placed at the beginning or end of context.

Incorrect. The finding is specifically about positional attention — information in the middle of long contexts receives less reliable attention than content near the edges.

Which context management strategy scales best to arbitrarily long conversations?

Correct. Retrieval-augmented history stores the full conversation externally and retrieves only semantically relevant turns at each step, allowing the approach to scale without growing the prompt linearly.

That approach doesn't scale. Rolling windows lose early context, parametric memory can't store session facts, and full transcripts hit token limits quickly.

Lab 1 — Designing Session State

Interactive practice · Conversation State & Memory

Your Task

You are designing an e-commerce support agent on Vertex AI. The agent must handle multi-turn conversations where users ask about orders, returns, and account details across several exchanges.

Chat with the AI lab assistant to explore how you would structure session state for this agent. Think about what to store in the state dict, how to handle context compression, and where to draw the line between in-context and external memory.

Starter prompt: "What keys should I put in my session state dict for an order support agent — and which facts belong in external storage instead?"

Session State Lab

VERTEX AI ADK

Welcome to the session state lab. I'm your Vertex AI ADK assistant. Let's design a robust state schema for your order support agent. What's your first question about structuring conversation state?

Module 6 · Lesson 2

Human-in-the-Loop: Patterns and Triggers

When agents should pause, escalate, and ask — rather than act alone

How do production systems decide when an AI agent should stop and wait for a human decision?

In 2022, Air Canada deployed a chatbot to handle customer service inquiries. A customer asked whether Air Canada offered bereavement fares. The chatbot fabricated a policy that did not exist, promising a retroactive refund. The customer relied on this, and when Air Canada refused to honor it, the dispute reached the British Columbia Civil Resolution Tribunal. In January 2024, the tribunal ruled Air Canada liable — finding that the chatbot's statements bound the airline. Air Canada was ordered to pay $812.02 in damages.

The case illustrates a foundational principle: agents that make commitments involving money, policy, or legal standing must route to human oversight before acting. A well-designed HITL trigger would have escalated the bereavement fare question to a human agent immediately.

What Is Human-in-the-Loop (HITL)?

Human-in-the-loop is an architectural pattern where an AI agent pauses execution at defined checkpoints and requires a human to review, approve, or redirect before proceeding. In Vertex AI, this is implemented via agent callbacks, interrupt signals, and escalation tools registered in the agent's tool set.

HITL is not a failure mode — it is a deliberate design choice. The goal is not to minimize HITL occurrences but to trigger it precisely: too rarely risks harm; too often degrades the user experience and negates the value of automation.

The HITL Interaction Model

Agent

Processes user input, runs tools, evaluates confidence and risk. Detects trigger conditions.

⇄

Human Reviewer

Receives escalation with full context. Approves, modifies, or rejects the proposed action. Response injected back into session.

HITL Trigger Categories

Confidence Threshold

Agent's internal scoring falls below a defined threshold (e.g., intent confidence < 0.7). The agent knows it is uncertain and escalates rather than guessing.

High-Stakes Action

The proposed next action exceeds a risk boundary: issuing a refund above $500, modifying account permissions, sending communications to all users. Defined as a policy rule, not an AI judgment.

Policy Ambiguity

User's request falls into an edge case not covered by training data or grounding documents. Agent detects the gap and escalates rather than extrapolating.

User-Initiated Escalation

User explicitly requests a human ("Let me speak to someone"). Must be honored unconditionally. Override all automation gates.

Anomaly Detection

External signal (fraud score, unusual transaction pattern) crosses a threshold. Agent pauses and routes to a specialist review queue.

Consecutive Failure

Agent has attempted and failed to resolve the user's issue N times in the same session. Automatic escalation prevents user frustration loops.

Implementing HITL with Vertex AI ADK

In the Vertex AI Agent Development Kit, HITL is implemented as a specialized tool that the agent can call. When the tool is invoked, it signals the orchestration layer to suspend the session and route it to a human queue.

# Define an escalation tool for HITL handoff
from google.adk.tools import FunctionTool

def escalate_to_human(
    reason: str,
    urgency: str,  # "low" | "medium" | "high"
    context_summary: str
) -> dict:
    """
    Pauses the agent session and routes to a human agent queue.
    Call this when confidence is low, stakes are high, or
    user explicitly requests human assistance.
    """
    # Write escalation data to session state
    tool_context.state["escalation_reason"] = reason
    tool_context.state["escalation_urgency"] = urgency
    tool_context.state["escalation_summary"] = context_summary
    tool_context.state["awaiting_human"] = True
    # Signal the orchestration layer
    return {"status": "escalated", "queue": "tier-2-support"}

escalation_tool = FunctionTool(func=escalate_to_human)
    

Design Principle

The Air Canada tribunal ruling established that AI-generated commitments can be legally binding. In production, escalation triggers for financial commitments and policy representations should be defined as hard policy rules in your orchestration layer — not left to the model's discretion. The agent should not be able to override them.

HITL Queue Management

Snapshot and Suspend

When escalation is triggered, capture the full session state, conversation history, and the reason for escalation. Store in a review queue (e.g., Cloud Tasks + Firestore).

Human Review Interface

Present the reviewer with a structured summary of the conversation, the agent's proposed action (if any), and relevant customer data. Provide approve/modify/reject options.

Resume or Hand Off

If the human approves, the agent session is resumed with the decision injected into state. If the human takes over, the session is converted to a live human conversation.

Audit Logging

Every HITL event — trigger reason, human reviewer ID, decision made, time to resolution — is written to an immutable audit log. Essential for compliance and model improvement.

Quiz — HITL Patterns

3 questions · Lesson 2

What was the legal outcome of the Air Canada chatbot case decided in January 2024?

Correct. The BC Civil Resolution Tribunal found Air Canada liable for $812.02, ruling that the chatbot's fabricated bereavement policy statements bound the airline — establishing that AI-generated commitments can be legally actionable.

Incorrect. The tribunal ruled against Air Canada, finding the chatbot's statements legally binding regardless of whether they were accurate.

In Vertex AI ADK, which mechanism is used to implement a HITL handoff?

Correct. HITL is implemented as a FunctionTool. When called, it writes escalation data to session state and returns a signal to the orchestration layer to suspend and route the session.

Not quite. HITL in Vertex AI ADK is a tool-based mechanism — a registered function that the agent can call, not a prompt instruction or session deletion.

According to the design principle in this lesson, where should financial commitment escalation triggers be defined?

Correct. High-stakes triggers like financial commitments must be enforced as hard rules in the orchestration layer — not left to the model's discretion — to ensure they cannot be bypassed by prompt variations.

Hard policy rules in the orchestration layer are the correct answer. System prompt suggestions can be overridden by the model; orchestration-layer rules cannot.

Lab 2 — Designing HITL Triggers

Interactive practice · Human-in-the-Loop Patterns

Your Task

You are building a financial services agent that helps users manage their investment accounts. Given the legal and financial risk, you need a robust set of HITL triggers.

Discuss with the AI assistant what trigger conditions you'd implement, how you'd classify urgency levels, and how the human review queue should be structured for this domain.

Starter prompt: "What HITL triggers should a financial account management agent have, and how do I decide which ones should be hard orchestration rules vs. model-judged escalations?"

HITL Design Lab

ESCALATION PATTERNS

Welcome to the HITL design lab. I'll help you design robust human-in-the-loop triggers for a financial services agent. What aspects of escalation policy would you like to explore first?

Module 6 · Lesson 3

Approval Workflows and Interrupt Handling

Building the mechanics of human review — pause, inject, and resume

How do you technically implement a system where an agent pauses mid-task, waits for human input, and then resumes with that decision integrated?

In 2023, Waymo's autonomous vehicle operations in San Francisco used a remote assistance system (RAS) where human operators could take over vehicle decision-making remotely during edge cases. When a vehicle encountered a scenario outside its confidence boundary — a novel traffic situation, road debris, an unusual pedestrian behavior — it would emit an interrupt signal and pause non-critical decisions while requesting operator input. The operator would review a live feed and either approve the vehicle's planned action or provide a corrected directive. This pattern — interrupt, present context, await approval, resume — is the architectural template for HITL in any high-stakes agent system.

The Interrupt-Resume Pattern

In Vertex AI ADK, the interrupt-resume pattern is implemented using session state flags and an interrupt callback. When the agent's orchestration detects an interrupt condition, it:

1. Freezes further tool execution. 2. Persists the current session state. 3. Notifies a review endpoint (via Pub/Sub or Cloud Tasks). 4. Awaits an external inject call before resuming.

# Interrupt pattern — agent side
def before_tool_callback(tool, args, tool_context):
    """Called before every tool execution."""
    # Check if human approval is required for this tool
    if tool.name in HIGH_RISK_TOOLS:
        approval = tool_context.state.get("approval_granted", False)
        if not approval:
            # Return a dict to short-circuit tool execution
            return {
                "status": "awaiting_approval",
                "tool": tool.name,
                "args": args
            }
    return None  # None = proceed normally

# Resume pattern — orchestration side
def inject_human_decision(session_id: str, approved: bool, notes: str):
    """Called by human reviewer's interface."""
    client.update_session_state(
        session_id=session_id,
        state_delta={
            "approval_granted": approved,
            "reviewer_notes": notes,
            "approved_at": datetime.utcnow().isoformat()
        }
    )
    # Resume the agent run
    client.resume_agent_run(session_id=session_id)
    

Approval Workflow Architecture

Agent Detects Interrupt Condition

A before_tool_callback or confidence check identifies a high-risk operation. The callback returns a non-None dict, short-circuiting execution and storing the pending action in state.

Publish to Review Queue

The orchestration layer publishes a Cloud Pub/Sub message containing the session_id, pending tool name, arguments, and a context summary. Review systems subscribe to this topic.

Human Reviews Context

The reviewer receives a structured card: user history, pending action, risk level, and a recommendation from the agent. They approve, modify (provide different args), or reject.

Decision Injected into Session State

The review interface calls inject_human_decision(), which writes the approval flag and any modifications to session state via the ADK session update API.

Agent Resumes

resume_agent_run() re-triggers the agent loop. The before_tool_callback now finds approval_granted = True and allows the tool to execute. The reviewer's notes are available as state context.

Timeout and Fallback Handling

Human reviewers are not always immediately available. Production approval workflows must handle timeout scenarios gracefully. A Cloud Tasks task with a deadline can automatically trigger a fallback action if no human decision arrives within the SLA window.

# Cloud Tasks timeout handler
def handle_approval_timeout(session_id: str):
    """Triggered by Cloud Tasks if SLA exceeded."""
    state = client.get_session_state(session_id)
    pending_tool = state.get("pending_tool")
    urgency = state.get("escalation_urgency", "medium")

    if urgency == "low":
        # Auto-approve low urgency after timeout
        inject_human_decision(session_id, approved=True, notes="auto-approved on timeout")
    elif urgency == "high":
        # Reject and escalate to supervisor queue
        inject_human_decision(session_id, approved=False, notes="timeout — escalated to supervisor")
        publish_to_supervisor_queue(session_id)
    else:
        # Medium: deny and inform user of delay
        inject_human_decision(session_id, approved=False, notes="timeout — request denied, please retry")
    

Waymo Parallel

Waymo's RAS team in 2023 reported average operator response times of under 45 seconds for interrupt events. Their system automatically escalated to a second operator if the primary did not respond within 90 seconds — a direct implementation of the timeout + escalation pattern described above.

User Communication During Waits

When an agent pauses for human review, users must not experience unexplained silence. Best practice is to immediately send a holding message explaining the wait, then send a follow-up when the review is complete. The holding message should be sent by the agent before the interrupt is fully processed — using a before-interrupt callback that fires a user notification first.

Quiz — Interrupt Handling

3 questions · Lesson 3

In Vertex AI ADK's interrupt pattern, what does returning a non-None dict from a before_tool_callback do?

Correct. When a before_tool_callback returns a non-None dict, the ADK framework treats it as an interrupt — the tool is not executed, and the returned dict becomes the result, which the orchestration layer uses to detect the pending approval state.

Not quite. Returning a non-None dict from a before_tool_callback short-circuits the tool — it prevents execution entirely and returns the dict as the result for the orchestration layer to handle.

What Google Cloud service is recommended for publishing interrupt events to human review systems?

Correct. Cloud Pub/Sub is the recommended mechanism for publishing interrupt events. Review systems subscribe to the topic and receive structured messages containing the session_id, pending tool, arguments, and context summary.

Cloud Pub/Sub is the correct answer — it provides a decoupled, scalable messaging layer between the agent orchestration and the human review interface.

According to the lesson, what should happen if a high-urgency approval request times out before a human reviews it?

Correct. High-urgency timeout events should result in rejection (not auto-approval) and escalation to a supervisor queue — erring on the side of caution when the stakes are highest and no human has reviewed in time.

Auto-approval on timeout would be dangerous for high-urgency actions. The correct approach is to reject and escalate to a supervisor, maintaining human oversight even at the cost of delay.

Lab 3 — Building an Approval Workflow

Interactive practice · Interrupt & Resume Patterns

Your Task

You are implementing an approval workflow for a healthcare data agent that can access and modify patient records. This is a highly regulated domain — any modification to clinical data requires physician approval before the agent's change is committed.

Work through the architecture with the assistant: what does the interrupt sequence look like, how do you handle the 30-minute physician response SLA, and what fallback applies if the SLA is missed?

Starter prompt: "Walk me through how to implement the interrupt-resume pattern for a healthcare agent that needs physician approval before modifying patient records."

Approval Workflow Lab

INTERRUPT / RESUME

Welcome to the approval workflow lab. Healthcare agent approvals involve strict timing, audit requirements, and fail-safe defaults. Let's design your interrupt-resume architecture. What's your starting question?

Module 6 · Lesson 4

Feedback Loops and Continuous Improvement

How human decisions flow back into agent training and policy refinement

If humans are reviewing agent decisions, how do those reviews become learning signals that make the agent better over time?

Between 2022 and 2024, Google DeepMind's AlphaCode 2 system demonstrated that HITL feedback at scale could dramatically accelerate capability improvement. The team built a system where human expert programmers reviewed competitive programming solutions, marking edge-case failures and providing corrected approaches. These structured reviews were used to generate fine-tuning examples, and subsequent model versions showed measurable improvements specifically on the categories where human reviewers had identified failures. The key insight was that every HITL event is a labeled training example — a case where the agent's output diverged from human expert judgment.

The Feedback Loop Architecture

In a production Vertex AI agent system, every human review interaction generates structured data: what the agent proposed, what the human decided, and why. Capturing and routing this data systematically creates a flywheel — more reviews generate better training signals, which improve the agent, which reduces the volume of reviews needed.

Implicit Feedback

Derived from behavioral signals without explicit reviewer annotations: did the user accept the agent's recommendation? Did they ask for clarification? Did they abandon the session?

Explicit Feedback

Structured annotations from human reviewers: approval/rejection, confidence rating, category tags, corrected output. Higher signal quality, lower volume.

Counterfactual Labels

Cases where the agent's proposed action was rejected and the human provided a different one. The pairing (rejected action, correct action) is the highest-value fine-tuning signal.

Aggregate Pattern Analysis

Clustering rejection reasons to identify systematic agent weaknesses. If 40% of rejections cite "policy extrapolation," that's a targeted fine-tuning or grounding problem to fix.

Capturing Structured Feedback in Vertex AI

# Feedback capture schema — written at review completion
feedback_record = {
    "session_id": session_id,
    "turn_number": turn_number,
    "agent_proposal": {
        "tool": "issue_refund",
        "args": {"amount": 750.00, "reason": "delayed shipment"}
    },
    "human_decision": "modified",  # approved / rejected / modified
    "human_action": {
        "tool": "issue_refund",
        "args": {"amount": 450.00, "reason": "partial — policy max"}
    },
    "rejection_category": "policy_limit_exceeded",
    "reviewer_id": "rev-881",
    "review_duration_seconds": 47,
    "timestamp": "2024-03-15T14:22:11Z"
}
# Write to BigQuery for analysis and fine-tuning pipeline
bq_client.insert_rows_json("project.dataset.hitl_feedback", [feedback_record])
    

From Feedback to Fine-Tuning

Accumulate and Filter

Collect feedback records in BigQuery. Filter for high-confidence review events: cases where reviewers were certain (short review time, no modification), and counterfactuals (modifications with clear category tags).

Convert to Training Format

Transform feedback records into conversation-completion pairs: the conversation history up to the interrupt as the input, the human-approved action as the target. Format as JSONL for Vertex AI supervised fine-tuning.

Fine-Tune on Vertex AI

Submit a Vertex AI tuning job using the Gemini supervised fine-tuning API. Target the specific categories with highest rejection rates. Validate on a held-out set of known-correct decisions.

A/B Test New Model Version

Deploy the fine-tuned model to a canary slice (e.g., 10% of traffic). Measure HITL trigger rate, rejection rate, and user satisfaction against the baseline. Promote if metrics improve.

Update Policy Rules

Aggregate patterns may reveal policy gaps — cases where the model was consistently right but the hard rule was wrong, or vice versa. Review and update orchestration-layer rules on a scheduled cadence.

RLHF and Preference Data

Beyond supervised fine-tuning, HITL events can generate preference data for reinforcement learning from human feedback (RLHF). When a reviewer modifies an agent proposal, you have a preference pair: the human output is preferred over the agent output. Google's RLHF research applied to Gemini models (documented in the Gemini 1.5 technical report, 2024) showed that preference tuning on domain-specific review data consistently outperformed supervised fine-tuning alone for instruction-following tasks.

Measurement Principle

Track your HITL rate as a primary agent quality metric. A high HITL rate means the agent is frequently uncertain or risky — a signal for fine-tuning or policy clarification. A declining HITL rate over time, without an increase in error rate, is a direct measure of agent improvement driven by the feedback loop.

Key Terms

Counterfactual LabelA training example where the agent's proposed action was rejected and the human provided a correct alternative. The (rejected, correct) pair is the highest-signal fine-tuning data for HITL systems.

RLHFReinforcement Learning from Human Feedback. Uses human preference comparisons (preferred vs. non-preferred outputs) to train a reward model, which then guides policy optimization.

HITL RateThe fraction of agent sessions that trigger a human review event. A key operational metric: too high suggests agent underperformance; declining over time suggests improvement.

Canary DeploymentRouting a small percentage of live traffic to a new model version to measure real-world performance before full rollout. Standard practice for fine-tuned model promotion on Vertex AI.

Quiz — Feedback Loops

3 questions · Lesson 4

What makes a "counterfactual label" particularly valuable as fine-tuning data in HITL systems?

Correct. Counterfactual labels are high-value because they provide both the incorrect proposal (what the model got wrong) and the human-provided correction (what it should have done), creating a direct supervised fine-tuning signal.

Not quite. Counterfactual labels are specifically the pairing of a rejected agent action with the human's corrected alternative — providing both the error and the correction in a single training example.

According to Google's Gemini 1.5 technical report (2024), what did preference tuning outperform for instruction-following tasks?

Correct. Google's Gemini 1.5 research showed that RLHF preference tuning on domain-specific review data consistently outperformed supervised fine-tuning alone for instruction-following tasks — motivating the collection of preference pairs from HITL events.

The Gemini 1.5 technical report specifically compared preference tuning against supervised fine-tuning alone, finding preference tuning superior for instruction-following.

What does a declining HITL rate over time (without an increase in error rate) indicate about an agent system?

Correct. A declining HITL rate paired with a stable or improving error rate is the primary signal that the feedback loop is working — the agent is learning from past reviews and handling more cases autonomously and correctly.

If error rates are stable or improving, a declining HITL rate is a positive signal: the agent is becoming more capable and requires less human intervention. The key qualifier is that error rates must also be tracked.

Lab 4 — Designing a Feedback Pipeline

Interactive practice · HITL Feedback and Continuous Improvement

Your Task

You are the ML engineer responsible for the continuous improvement pipeline for a customer service agent at a large e-commerce company. The HITL system is generating hundreds of review events per day and you need to turn them into model improvements.

Work through the feedback-to-fine-tuning pipeline with the assistant: how to filter high-quality examples, convert them to training format, measure improvement, and decide when to promote a new model version.

Starter prompt: "I have 500 HITL review events in BigQuery from last week. How do I decide which ones are high enough quality to use for fine-tuning, and what format do I need them in for Vertex AI supervised fine-tuning?"

Feedback Pipeline Lab

FINE-TUNING / RLHF

Welcome to the feedback pipeline lab. Turning raw HITL events into model improvements is a multi-step process that requires careful quality filtering, format conversion, and A/B testing. What aspect of the pipeline would you like to tackle first?

Module 6 — Module Test

15 questions · Pass at 80% (12/15)

1. Which Vertex AI ADK object stores structured key-value data that persists across multiple turns of a session?

Correct. The session state dict is the designated persistent key-value store for multi-turn agent data in Vertex AI ADK.

The session state dict is the correct answer — it persists structured data across turns without consuming prompt tokens.

2. The "lost-in-the-middle" phenomenon was formally documented by which research group in 2023?

Correct. Liu et al. at Stanford published the foundational paper on the lost-in-the-middle phenomenon in 2023.

Stanford's Liu et al. published the lost-in-the-middle paper in 2023, documenting the positional attention degradation effect.

3. Which context management approach is best suited for conversations that could run indefinitely?

Correct. Vector search-based retrieval scales to arbitrary conversation length by storing history externally and retrieving only relevant turns.

Retrieval-augmented history scales to arbitrary length — the only approach that doesn't degrade as conversations grow.

4. What legal precedent did the Air Canada chatbot case (January 2024) establish?

Correct. The BC Civil Resolution Tribunal found Air Canada liable for $812.02 — establishing that AI chatbot statements can bind the deploying organization.

The tribunal ruled that the chatbot's fabricated policy was binding on Air Canada, establishing that AI-generated commitments can create legal liability.

5. Which HITL trigger category cannot be overridden by any other system condition?

Correct. User-initiated escalation must be honored unconditionally — it overrides all other automation gates and cannot be filtered or delayed.

User-initiated escalation is the one trigger that must be honored unconditionally, regardless of other system conditions or efficiency considerations.

6. In Vertex AI ADK, what happens when a before_tool_callback returns None?

Correct. A None return from before_tool_callback signals "proceed" — the tool executes as normal. Only a non-None dict triggers an interrupt.

None means proceed. Only returning a non-None dict causes the callback to short-circuit tool execution.

7. What Google Cloud service is used to implement approval timeout handling in HITL workflows?

Correct. Cloud Tasks supports task deadlines and retry logic, making it the appropriate service for implementing SLA-based timeout handlers in HITL workflows.

Cloud Tasks is the correct answer — it supports configurable deadlines and can trigger a fallback handler when the SLA window expires.

8. Waymo's Remote Assistance System (2023) automatically escalated to a second operator if no response arrived within how many seconds?

Correct. Waymo's RAS escalated to a second operator if the primary did not respond within 90 seconds — a real-world implementation of the timeout-plus-escalation pattern.

Waymo's RAS used a 90-second escalation timeout before routing to a secondary operator.

9. What should an agent do for the user immediately when an interrupt is triggered, before the human review is complete?

Correct. A before-interrupt callback should fire a user notification immediately — users must not experience unexplained silence while a review is pending.

Silent waiting is poor UX. The agent should send a holding message via a before-interrupt callback before the review process begins.

10. Which type of feedback data is generated when a human reviewer modifies (rather than simply approves or rejects) an agent's proposed action?

Correct. A modification creates a counterfactual label — the pairing of the agent's rejected proposal with the human's correct alternative. This is the highest-value fine-tuning signal.

When a reviewer modifies an action, they create a counterfactual label: (rejected action, correct action). This is more valuable than a simple approval or rejection.

11. Google's CCAI (Contact Center AI) support agents used which combination for context management in 2022–2023?

Correct. Google's CCAI agents used a customer context card (account data injected at session start) plus a rolling 8-turn window, balancing predictable token usage with short-term dialogue recall.

CCAI used a hybrid: a structured customer context card at session start, plus a rolling window of recent turns — keeping tokens predictable.

12. In the fine-tuning pipeline, what format does Vertex AI supervised fine-tuning require for training data?

Correct. Vertex AI supervised fine-tuning requires training data in JSONL format, where each line contains a conversation history as the input and the target completion.

Vertex AI supervised fine-tuning uses JSONL — JSON Lines format with one conversation example per line.

13. What is the primary risk of setting HITL triggers that are too sensitive (triggering too frequently)?

Correct. Over-triggering HITL creates user frustration through waits and interruptions, and negates the efficiency gains of deploying an AI agent in the first place.

Too-frequent HITL triggers degrade user experience and eliminate automation value. The goal is precise triggering — not minimal or maximal.

14. What is a "canary deployment" in the context of promoting a fine-tuned agent model?

Correct. A canary deployment routes a small slice of real traffic to the new model version, allowing measurement of real-world metrics before committing to a full rollout.

Canary deployment means routing a small percentage of live production traffic to the new model — getting real-world signal while limiting blast radius if the model underperforms.

15. According to the module, which three layers of agent memory are distinguished?

Correct. The three layers are: in-context memory (the prompt window), external memory (persistent storage outside the model), and parametric memory (knowledge encoded in model weights).

The three layers defined in this module are in-context memory (active prompt), external memory (Firestore, vector search), and parametric memory (model weights).