Google DeepMind's Gemini 1.5 models, announced in 2024, support context windows of up to one million tokens, but a large window by itself does not guarantee recall fidelity across all positions in a document. The "lost in the middle" phenomenon, first formally described by Liu et al. (Stanford and collaborators) in their 2023 paper, demonstrated that language models systematically underweight information placed in the center of long contexts. For production agents, this is not an academic curiosity; it is an architectural constraint that shapes how conversation history must be managed.
A multi-turn agent maintains conversation state — the accumulated record of what has been said, decided, and done across multiple exchanges. In Vertex AI Agent Builder, state is represented as a structured session object that travels with each request to the underlying model.
State contains more than raw message text. It includes tool call results, intermediate reasoning steps, user-provided context (account IDs, preferences), and agent-generated artifacts (summaries, computed values). Managing all of this reliably is the central challenge of multi-turn agent design.
In-context memory: the active prompt window, meaning the messages, tool results, and instructions sent with every API call. Fast and immediately available, but bounded by token limits and subject to positional attention effects.
External memory: persistent storage outside the model, such as Firestore, Cloud Spanner, or Vector Search indexes. The agent retrieves relevant facts via search rather than keeping everything in context.
Parametric memory: knowledge baked into model weights during training or fine-tuning. It cannot be updated at inference time. Useful for general world knowledge but unreliable for dynamic, session-specific facts.
Vertex AI's Agent Development Kit (ADK) structures sessions as first-class objects. Each session carries a unique session_id and a state dictionary that persists across calls within the same session lifecycle.
Sliding window: keep only the most recent N turns in context. Simple, but it loses early context. Useful when conversations are task-focused and short-term.
Summarization: periodically summarize older turns using a fast, cheap model call, and replace the raw history with the summary. Google used this pattern in early Bard conversations.
Structured state extraction: extract key facts into a typed state dict after each turn. Inject only the structured state, not raw history, into subsequent prompts. This minimizes tokens while preserving actionable context.
Retrieval-augmented memory: store the full history in Vertex AI Vector Search. At each turn, retrieve only the most semantically relevant prior exchanges. This scales to arbitrarily long conversations.
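Two of the strategies above, the sliding window and structured state extraction, combine naturally. The following is a minimal sketch in plain Python, not ADK code: `ConversationMemory`, `Turn`, and `build_prompt_context` are hypothetical names introduced for illustration, and the fact extraction itself (normally a model call) is assumed to happen elsewhere and is passed in as `facts`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Turn:
    role: str  # "user" or "agent"
    text: str


@dataclass
class ConversationMemory:
    """Hypothetical container: extracted facts plus a bounded window of turns."""
    max_turns: int = 8  # window size; must be positive
    state: Dict[str, str] = field(default_factory=dict)  # extracted key facts
    turns: List[Turn] = field(default_factory=list)      # full raw history

    def add_turn(self, role: str, text: str,
                 facts: Optional[Dict[str, str]] = None) -> None:
        # Record the raw turn and merge any facts extracted from it.
        self.turns.append(Turn(role, text))
        if facts:
            self.state.update(facts)

    def build_prompt_context(self) -> str:
        # Only the structured state and the last max_turns turns are injected.
        state_lines = [f"{k}: {v}" for k, v in sorted(self.state.items())]
        recent = self.turns[-self.max_turns:]
        turn_lines = [f"{t.role}: {t.text}" for t in recent]
        return "\n".join(["[state]", *state_lines, "[recent turns]", *turn_lines])
```

Older turns fall out of the prompt, but any fact extracted from them survives in `state`, which keeps the token count roughly constant as the conversation grows.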
Google's own customer support agents deployed via CCAI (Contact Center AI) in 2022–2023 use a hybrid approach: a structured "customer context card" injected at the start of each session, combined with a rolling 8-turn window for recent dialogue. This keeps tokens predictable while surfacing account-specific facts reliably.
You are designing an e-commerce support agent on Vertex AI. The agent must handle multi-turn conversations where users ask about orders, returns, and account details across several exchanges.
Chat with the AI lab assistant to explore how you would structure session state for this agent. Think about what to store in the state dict, how to handle context compression, and where to draw the line between in-context and external memory.
In November 2022, a customer asked Air Canada's website chatbot about bereavement fares. The chatbot gave inaccurate information, telling him he could apply for the bereavement rate retroactively, which contradicted the airline's actual policy. The customer relied on this, and when Air Canada refused to honor it, the dispute reached the British Columbia Civil Resolution Tribunal. In February 2024, the tribunal found Air Canada liable for negligent misrepresentation, holding the airline responsible for the information its chatbot provided. Air Canada was ordered to pay a total of CA$812.02, covering damages, interest, and tribunal fees.
The case illustrates a foundational principle: agents that make commitments involving money, policy, or legal standing must route to human oversight before acting. A well-designed HITL trigger would have escalated the bereavement fare question to a human agent immediately.
Human-in-the-loop is an architectural pattern where an AI agent pauses execution at defined checkpoints and requires a human to review, approve, or redirect before proceeding. In Vertex AI, this is implemented via agent callbacks, interrupt signals, and escalation tools registered in the agent's tool set.
HITL is not a failure mode — it is a deliberate design choice. The goal is not to minimize HITL occurrences but to trigger it precisely: too rarely risks harm; too often degrades the user experience and negates the value of automation.
Agent: processes user input, runs tools, evaluates confidence and risk, and detects trigger conditions.
Human reviewer: receives the escalation with full context, then approves, modifies, or rejects the proposed action. The response is injected back into the session.
Low confidence: the agent's internal scoring falls below a defined threshold (e.g., intent confidence < 0.7). The agent knows it is uncertain and escalates rather than guessing.
High-risk action: the proposed next action exceeds a risk boundary, such as issuing a refund above $500, modifying account permissions, or sending communications to all users. This is defined as a policy rule, not an AI judgment.
Out-of-scope request: the user's request falls into an edge case not covered by training data or grounding documents. The agent detects the gap and escalates rather than extrapolating.
Explicit user request: the user explicitly asks for a human ("Let me speak to someone"). This must be honored unconditionally and overrides all automation gates.
Anomaly signal: an external signal (fraud score, unusual transaction pattern) crosses a threshold. The agent pauses and routes to a specialist review queue.
Repeated failure: the agent has attempted and failed to resolve the user's issue N times in the same session. Automatic escalation prevents user frustration loops.
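The trigger conditions above can be expressed as ordered, deterministic rules that run outside the model. This is a sketch under stated assumptions: `TurnSignals` and `escalation_reason` are hypothetical names, the 0.7 confidence floor and $500 refund limit come from the examples in the text, and the fraud threshold and failure count are placeholder values.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_FLOOR = 0.7     # from the example in the text
REFUND_LIMIT_USD = 500.0   # from the example in the text
FRAUD_THRESHOLD = 0.8      # assumed value, tune per deployment
MAX_FAILED_ATTEMPTS = 3    # assumed value for "N"


@dataclass
class TurnSignals:
    """Hypothetical per-turn signals gathered by the orchestration layer."""
    intent_confidence: float
    proposed_refund_usd: float = 0.0
    user_requested_human: bool = False
    fraud_score: float = 0.0
    failed_attempts: int = 0
    in_scope: bool = True


def escalation_reason(s: TurnSignals) -> Optional[str]:
    """Return the first matching trigger, or None if the agent may proceed."""
    # An explicit request for a human is checked first and cannot be overridden.
    if s.user_requested_human:
        return "explicit_user_request"
    if s.proposed_refund_usd > REFUND_LIMIT_USD:
        return "high_risk_action"
    if s.fraud_score >= FRAUD_THRESHOLD:
        return "anomaly_signal"
    if not s.in_scope:
        return "out_of_scope"
    if s.intent_confidence < CONFIDENCE_FLOOR:
        return "low_confidence"
    if s.failed_attempts >= MAX_FAILED_ATTEMPTS:
        return "repeated_failure"
    return None
```

Ordering the checks from hard policy rules down to softer confidence signals means the logged reason reflects the most serious trigger when several fire at once.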
In the Vertex AI Agent Development Kit, HITL is implemented as a specialized tool that the agent can call. When the tool is invoked, it signals the orchestration layer to suspend the session and route it to a human queue.
The Air Canada tribunal ruling showed that a company can be held liable for representations made by its AI agent. In production, escalation triggers for financial commitments and policy representations should be defined as hard policy rules in your orchestration layer, not left to the model's discretion. The agent should not be able to override them.
When escalation is triggered, capture the full session state, conversation history, and the reason for escalation. Store in a review queue (e.g., Cloud Tasks + Firestore).
Present the reviewer with a structured summary of the conversation, the agent's proposed action (if any), and relevant customer data. Provide approve/modify/reject options.
If the human approves, the agent session is resumed with the decision injected into state. If the human takes over, the session is converted to a live human conversation.
Every HITL event — trigger reason, human reviewer ID, decision made, time to resolution — is written to an immutable audit log. Essential for compliance and model improvement.
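The audit step above asks for an immutable log. One way to approximate that in application code is a hash chain, where each entry commits to the one before it. This is a swapped-in illustration, not something the text prescribes: `AuditEvent` and `AuditLog` are hypothetical names, the log lives in memory, and a production system would more likely rely on write-once storage.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AuditEvent:
    """Fields named in the text: trigger, reviewer, decision, resolution time."""
    session_id: str
    trigger_reason: str
    reviewer_id: str
    decision: str                 # "approve", "modify", or "reject"
    seconds_to_resolution: float
    recorded_at: str              # ISO 8601 timestamp supplied by the caller


class AuditLog:
    """Append-only log; each entry stores the hash of the previous entry."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, str]] = []

    def append(self, event: AuditEvent) -> str:
        prev_hash = self._entries[-1]["hash"] if self._entries else ""
        payload = json.dumps(asdict(event), sort_keys=True)
        digest = hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()
        self._entries.append({"payload": payload, "prev": prev_hash, "hash": digest})
        return digest

    def verify(self) -> bool:
        # Recompute the chain; any edited or reordered entry breaks it.
        prev = ""
        for entry in self._entries:
            expected = hashlib.sha256(
                (prev + entry["payload"]).encode("utf-8")).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True
```

The chain makes tampering detectable rather than impossible, which is usually enough to support a compliance review of HITL decisions.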
You are building a financial services agent that helps users manage their investment accounts. Given the legal and financial risk, you need a robust set of HITL triggers.
Discuss with the AI assistant what trigger conditions you'd implement, how you'd classify urgency levels, and how the human review queue should be structured for this domain.
In 2023, Waymo's autonomous vehicle operations in San Francisco used a remote assistance system (RAS) in which human operators could provide guidance to a vehicle remotely during edge cases. When a vehicle encountered a scenario outside its confidence boundary, such as a novel traffic situation, road debris, or unusual pedestrian behavior, it would emit an interrupt signal and pause non-critical decisions while requesting operator input. The operator would review a live feed and either approve the vehicle's planned action or provide a corrected directive. This pattern of interrupt, present context, await approval, and resume is the architectural template for HITL in any high-stakes agent system.
In Vertex AI ADK, the interrupt-resume pattern is implemented using session state flags and an interrupt callback. When the agent's orchestration detects an interrupt condition, it:
1. Freezes further tool execution.
2. Persists the current session state.
3. Notifies a review endpoint (via Pub/Sub or Cloud Tasks).
4. Awaits an external inject call before resuming.
A before_tool_callback or confidence check identifies a high-risk operation. The callback returns a non-None dict, short-circuiting execution and storing the pending action in state.
The orchestration layer publishes a Cloud Pub/Sub message containing the session_id, pending tool name, arguments, and a context summary. Review systems subscribe to this topic.
The reviewer receives a structured card: user history, pending action, risk level, and a recommendation from the agent. They approve, modify (provide different args), or reject.
The review interface calls inject_human_decision(), which writes the approval flag and any modifications to session state via the ADK session update API.
resume_agent_run() re-triggers the agent loop. The before_tool_callback now finds approval_granted = True and allows the tool to execute. The reviewer's notes are available as state context.
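The five steps above can be simulated in plain Python to show how the state flags interact. This is a sketch, not ADK code: the callback here takes a tool name and a plain `state` dict, whereas a real ADK callback receives tool and context objects. `inject_human_decision` and `resume_agent_run` mirror the names used in the text but are stand-in implementations, and `HIGH_RISK_TOOLS` and `run_tool` are hypothetical.

```python
from typing import Any, Callable, Dict, Optional

HIGH_RISK_TOOLS = {"issue_refund"}  # hypothetical policy list


def before_tool_callback(tool_name: str, args: Dict[str, Any],
                         state: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return None to let the tool run; return a dict to short-circuit it."""
    if tool_name not in HIGH_RISK_TOOLS:
        return None
    if state.get("approval_granted"):
        state["approval_granted"] = False  # an approval is single-use
        return None
    # No approval yet: park the action in state and skip execution.
    state["pending_action"] = {"tool": tool_name, "args": args}
    return {"status": "pending_human_review"}


def run_tool(tool_name: str, args: Dict[str, Any], state: Dict[str, Any],
             tools: Dict[str, Callable[..., Any]]) -> Any:
    blocked = before_tool_callback(tool_name, args, state)
    if blocked is not None:
        return blocked
    return tools[tool_name](**args)


def inject_human_decision(state: Dict[str, Any], approved: bool,
                          modified_args: Optional[Dict[str, Any]] = None,
                          notes: str = "") -> None:
    # Called by the review interface; writes the decision into session state.
    state["approval_granted"] = approved
    state["reviewer_notes"] = notes
    if modified_args is not None and "pending_action" in state:
        state["pending_action"]["args"] = modified_args


def resume_agent_run(state: Dict[str, Any],
                     tools: Dict[str, Callable[..., Any]]) -> Any:
    pending = state.pop("pending_action", None)
    if pending is None:
        return {"status": "nothing_pending"}
    if not state.get("approval_granted"):
        return {"status": "rejected"}
    return run_tool(pending["tool"], pending["args"], state, tools)
```

Clearing `approval_granted` after one use is a design choice in this sketch: it prevents a single approval from silently authorizing later high-risk calls in the same session.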
Human reviewers are not always immediately available. Production approval workflows must handle timeout scenarios gracefully. A Cloud Tasks task with a deadline can automatically trigger a fallback action if no human decision arrives within the SLA window.
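The timeout handling described above reduces to a small decision function that a scheduled task could call when its deadline fires. This sketch assumes a fail-closed fallback (the pending action is not executed); `resolve_review` and the outcome strings are hypothetical, and the scheduling itself is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def resolve_review(requested_at: datetime, now: datetime,
                   decision: Optional[str], sla: timedelta) -> str:
    """Decide what to do with a pending review at time `now`.

    `decision` is the reviewer's answer ("approve", "modify", "reject"),
    or None if nobody has responded yet.
    """
    if decision is not None:
        # A human decision, once recorded, always takes precedence.
        return decision
    if now - requested_at < sla:
        return "wait"
    # SLA missed with no decision: fail closed rather than act unreviewed.
    return "fallback"
```

For example, with a 30-minute SLA, a check made five minutes after the request returns `"wait"`, and a check at or past the 30-minute mark returns `"fallback"`.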
Waymo's RAS team in 2023 reported average operator response times of under 45 seconds for interrupt events. Their system automatically escalated to a second operator if the primary did not respond within 90 seconds — a direct implementation of the timeout + escalation pattern described above.
When an agent pauses for human review, users must not experience unexplained silence. Best practice is to immediately send a holding message explaining the wait, then send a follow-up when the review is complete. The holding message should be sent by the agent before the interrupt is fully processed — using a before-interrupt callback that fires a user notification first.
You are implementing an approval workflow for a healthcare data agent that can access and modify patient records. This is a highly regulated domain: any modification to clinical data requires physician approval before the agent's change is committed.
Work through the architecture with the assistant: what does the interrupt sequence look like, how do you handle the 30-minute physician response SLA, and what fallback applies if the SLA is missed?
Between 2022 and 2024, Google DeepMind's AlphaCode 2 system demonstrated that HITL feedback at scale could dramatically accelerate capability improvement. The team built a system where human expert programmers reviewed competitive programming solutions, marking edge-case failures and providing corrected approaches. These structured reviews were used to generate fine-tuning examples, and subsequent model versions showed measurable improvements specifically on the categories where human reviewers had identified failures. The key insight was that every HITL event is a labeled training example — a case where the agent's output diverged from human expert judgment.
In a production Vertex AI agent system, every human review interaction generates structured data: what the agent proposed, what the human decided, and why. Capturing and routing this data systematically creates a flywheel — more reviews generate better training signals, which improve the agent, which reduces the volume of reviews needed.
Implicit feedback: derived from behavioral signals without explicit reviewer annotations. Did the user accept the agent's recommendation? Did they ask for clarification? Did they abandon the session?
Explicit feedback: structured annotations from human reviewers, including approval/rejection, confidence rating, category tags, and corrected output. Higher signal quality, lower volume.
Counterfactual pairs: cases where the agent's proposed action was rejected and the human provided a different one. The pairing (rejected action, correct action) is the highest-value fine-tuning signal.
Pattern analysis: clustering rejection reasons to identify systematic agent weaknesses. If 40% of rejections cite "policy extrapolation," that's a targeted fine-tuning or grounding problem to fix.
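When reviewers already tag each rejection with a category, the simplest form of the pattern analysis above is a frequency count. A minimal sketch, assuming free-text reasons have been normalized to category tags beforehand; `rejection_shares` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List


def rejection_shares(reasons: List[str]) -> Dict[str, float]:
    """Map each rejection category to its share of all rejections,
    ordered from most to least frequent."""
    counts = Counter(reasons)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {reason: count / total for reason, count in counts.most_common()}
```

A category holding a large share, such as the 40% "policy extrapolation" case in the text, is a candidate for targeted grounding or fine-tuning work.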
Collect feedback records in BigQuery. Filter for high-confidence review events: cases where reviewers were certain (short review time, no modification), and counterfactuals (modifications with clear category tags).
Transform feedback records into conversation-completion pairs: the conversation history up to the interrupt as the input, the human-approved action as the target. Format as JSONL for Vertex AI supervised fine-tuning.
Submit a Vertex AI tuning job using the Gemini supervised fine-tuning API. Target the specific categories with highest rejection rates. Validate on a held-out set of known-correct decisions.
Deploy the fine-tuned model to a canary slice (e.g., 10% of traffic). Measure HITL trigger rate, rejection rate, and user satisfaction against the baseline. Promote if metrics improve.
Aggregate patterns may reveal policy gaps — cases where the model was consistently right but the hard rule was wrong, or vice versa. Review and update orchestration-layer rules on a scheduled cadence.
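The dataset-preparation step in the pipeline above can be sketched as a small converter from feedback records to JSONL. The `contents` / `role` / `parts` layout shown here is an assumption about the Gemini supervised fine-tuning format and should be checked against the current Vertex AI documentation; the record fields `history` and `approved_action` are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, List


def to_sft_example(history: List[Dict[str, str]],
                   approved_action: str) -> Dict[str, Any]:
    """Build one training example: the conversation up to the interrupt
    as input, the human-approved action as the target."""
    contents = [
        {"role": turn["role"], "parts": [{"text": turn["text"]}]}
        for turn in history
    ]
    # The target is appended as the final model turn.
    contents.append({"role": "model", "parts": [{"text": approved_action}]})
    return {"contents": contents}


def to_jsonl(records: Iterable[Dict[str, Any]]) -> str:
    """Serialize feedback records as one JSON object per line."""
    return "\n".join(
        json.dumps(to_sft_example(r["history"], r["approved_action"]))
        for r in records
    )
```

Each line of the output is an independent JSON object, so the file can be filtered or sampled line by line before it is uploaded for tuning.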
Beyond supervised fine-tuning, HITL events can generate preference data for reinforcement learning from human feedback (RLHF). When a reviewer modifies an agent proposal, you have a preference pair: the human output is preferred over the agent output. Google's RLHF research applied to Gemini models (documented in the Gemini 1.5 technical report, 2024) showed that preference tuning on domain-specific review data consistently outperformed supervised fine-tuning alone for instruction-following tasks.
Track your HITL rate as a primary agent quality metric. A high HITL rate means the agent is frequently uncertain or risky — a signal for fine-tuning or policy clarification. A declining HITL rate over time, without an increase in error rate, is a direct measure of agent improvement driven by the feedback loop.
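The improvement criterion above, a falling HITL rate with no rise in error rate, can be checked mechanically between reporting periods. A sketch with hypothetical names (`PeriodStats`, `is_improvement`); how errors are counted is left to the deployment.

```python
from dataclasses import dataclass


@dataclass
class PeriodStats:
    """Counts for one reporting period."""
    sessions: int
    escalations: int  # sessions that triggered HITL
    errors: int       # sessions with a confirmed agent error

    @property
    def hitl_rate(self) -> float:
        return self.escalations / self.sessions if self.sessions else 0.0

    @property
    def error_rate(self) -> float:
        return self.errors / self.sessions if self.sessions else 0.0


def is_improvement(previous: PeriodStats, current: PeriodStats,
                   error_tolerance: float = 0.0) -> bool:
    """True when the HITL rate fell and the error rate did not rise
    by more than the allowed tolerance."""
    return (current.hitl_rate < previous.hitl_rate
            and current.error_rate <= previous.error_rate + error_tolerance)
```

Checking both rates together guards against the failure mode where escalations drop only because the agent has started acting on cases it should have handed off.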
You are the ML engineer responsible for the continuous improvement pipeline for a customer service agent at a large e-commerce company. The HITL system is generating hundreds of review events per day and you need to turn them into model improvements.
Work through the feedback-to-fine-tuning pipeline with the assistant: how to filter high-quality examples, convert them to training format, measure improvement, and decide when to promote a new model version.