When Google DeepMind published research on multi-agent reinforcement learning frameworks, a pattern became clear across deployed systems: agents solving narrow tasks well but failing at coordination. The bottleneck wasn't intelligence — it was communication architecture.
A single LLM-based agent operates within hard limits: a fixed context window, a single thread of execution, and one model's capability profile. For simple, bounded tasks — answering a question, writing a document, executing a SQL query — this works fine. Production systems, however, face a different reality.
Consider what Salesforce encountered when building Einstein Copilot in 2023–2024. A single agent handling a full sales workflow — ingesting CRM data, drafting outreach, scheduling calls, updating records — quickly exhausted context limits and struggled with the breadth of tool integrations required. The solution wasn't a bigger model; it was decomposition into specialized agents, each owning a domain.
Three structural limits drive this decomposition in practice:
If decomposition is the answer, the next question is: how do agents coordinate without creating chaos? Early multi-agent experiments at companies like Adept AI and Cohere showed that ad-hoc coordination — agents calling each other through custom HTTP APIs or shared databases — created brittle, unmaintainable systems. Each integration was a one-off. Security review had to happen for every pair of agents. There was no standard vocabulary for task delegation.
This mirrors a problem the internet solved in the 1990s: arbitrary networked programs needed a common protocol. HTTP gave them one. The Agent2Agent (A2A) Protocol, announced by Google in April 2025 with over 50 partner organizations, attempts to do the same for AI agents.
Google's Agent2Agent Protocol launched with backing from Atlassian, Box, Cohere, Deloitte, Salesforce, SAP, ServiceNow, and dozens more. The protocol defines how agents discover each other's capabilities, delegate tasks, stream results, and handle errors — all over standard HTTPS using JSON. It is explicitly designed to complement the Model Context Protocol (MCP), which handles agent-to-tool connections.
OpenAI's function calling and Google's tool use in Gemini are agent-to-tool protocols — the agent remains in control and tools are passive executors. A2A is agent-to-agent: both sides are autonomous agents with their own reasoning, state, and decision-making. A remote agent can push back, ask clarifying questions, or fail gracefully in ways a tool cannot.
Vertex AI Agent Engine (formerly Agent Builder) provides the runtime for deploying individual agents. When those agents need to coordinate, A2A provides the protocol layer. The Vertex AI platform added native A2A support in 2025, meaning agents built with the Vertex AI SDK can be registered with Agent Cards and discovered by other agents within and across organizations.
The stack looks like this: LLM reasoning at the core, MCP for tool connections outward, A2A for agent-to-agent coordination, and Vertex AI infrastructure handling deployment, scaling, and observability across the whole system.
┌─────────────────────────────────────────────────┐
│ Orchestration Layer │
│ (Vertex AI Agent Engine / LangGraph / custom) │
└──────────────┬──────────────────────────────────┘
│ Agent2Agent Protocol (A2A)
┌─────────┴─────────┐
▼ ▼
┌─────────┐ ┌─────────┐
│ Agent A │ │ Agent B │
│ (MCP) │ │ (MCP) │
└────┬────┘ └────┬────┘
│ │
Tools/APIs Tools/APIs
In this lab you will work with an AI tutor to analyze a monolithic agent design and determine appropriate decomposition boundaries. You'll practice identifying which tasks should become separate agents, what their Agent Cards should advertise, and how A2A coordination would work.
When Atlassian began integrating A2A into Rovo — its AI work management system — the engineering team noted that the protocol's task lifecycle model solved a problem they'd struggled with: long-running async operations. An agent initiating a code review couldn't block waiting for completion. A2A's push notification and streaming mechanisms let the client agent do other work and receive results when ready.
Every A2A interaction centers on a Task object. Understanding its lifecycle is the foundation of protocol fluency. A task moves through defined states, and both client and remote agents track this state to coordinate correctly.
Task States:
submitted ──► working ──► input-required ──► working
│
├──► completed (success, artifact returned)
├──► failed (terminal failure)
└──► canceled (client or remote canceled)
Each transition is observable via polling or push notifications.
The input-required state is particularly important: it allows a remote agent to pause execution and ask the client agent for more information — a genuine back-and-forth that is impossible in simple function calling. This enables multi-turn interactions between agents, not just one-shot request-response.
A2A uses JSON-RPC 2.0 over HTTPS as its transport. This was a deliberate choice for enterprise adoption: every corporate network can route HTTPS, existing API gateways understand it, and security teams know how to audit it. There is no special protocol to whitelist.
The four core RPC methods are:
For long-running tasks, polling is inefficient. A2A supports push notifications: the client agent registers a webhook URL when submitting a task, and the remote agent POSTs status updates to that URL as they occur. This is how Atlassian's Rovo handles code review agents that may take minutes to complete their analysis.
A2A delegates authentication to standard web mechanisms — OAuth 2.0, API keys, service account JWT tokens. The Agent Card advertises which authentication schemes the remote agent accepts. The client agent is responsible for obtaining and presenting the correct credentials.
This matters enormously for enterprise deployment. When Deloitte published its A2A integration guidance for enterprise clients in 2025, authentication was highlighted as the primary concern: agents crossing organizational boundaries need cryptographic proof of identity, not just shared secrets. The protocol's reliance on standard web auth means existing identity infrastructure — Azure AD, Google Cloud IAM, Okta — can be applied directly.
A2A explicitly does not require a remote agent to expose its internal implementation. The client agent knows what the remote agent can do (via Agent Card) and what it returned (via artifacts), but not how it did it. This enables agents built on entirely different frameworks — LangGraph, AutoGen, CrewAI, or custom code — to interoperate without any shared implementation knowledge.
In this lab you will practice designing A2A Agent Cards and tracing how tasks move through the protocol lifecycle. The tutor will present scenarios and ask you to design Agent Card JSON structures or trace task state transitions.
ServiceNow's AI platform team, building multi-agent workflows for IT service management, published architectural guidance in 2025 noting two failure modes they'd encountered: orchestrator bottlenecks in hierarchical systems (every task touching a central agent created a single point of failure) and coordination storms in peer-to-peer systems (agents negotiating with each other recursively without convergence). The solution was matching topology to task structure — not defaulting to either extreme.
Multi-agent systems cluster around two structural patterns. Understanding their tradeoffs is essential before designing any production system.
HIERARCHICAL (Orchestrator-Worker) PEER-TO-PEER (Mesh)
[Orchestrator] [Agent A] ←──► [Agent B]
/ | \ ↕ ↕
[W1] [W2] [W3] [Agent C] ←──► [Agent D]
Central coordination point Agents negotiate directly
Clear authority, simple auditing Flexible, resilient
Single point of failure Complex to reason about
In a hierarchical topology, a central orchestrator agent breaks down tasks and delegates to worker agents. The orchestrator maintains the full picture of task state; workers are specialists that receive narrow, well-defined subtasks and return artifacts.
This is the pattern used in Google's own Gemini multi-agent demonstrations and in Vertex AI's Agent Builder workflow templates. The orchestrator is typically a capable frontier model (Gemini 1.5 Pro or equivalent) that reasons about task decomposition. Worker agents can be smaller, cheaper, faster models specialized for specific domains.
Advantages: Simple to audit (all task state visible at orchestrator), clear authority for conflict resolution, easy to trace failures, straightforward to add/remove workers.
Failure modes: Orchestrator becomes a bottleneck at scale; orchestrator failure halts everything; orchestrator's reasoning quality determines total system quality.
In a peer-to-peer topology, agents communicate directly with each other based on capability matching. There is no central orchestrator; agents discover each other via Agent Cards and negotiate task delegation bilaterally.
This pattern appears in research systems and some advanced enterprise deployments where resilience is paramount. If one agent fails, others can route around it. The system has no single point of failure.
Advantages: No single point of failure, naturally load-balances, agents can evolve independently, works well for loosely-coupled domains.
Failure modes: Coordination storms (agents ping-ponging tasks), difficult to audit and debug, emergent behaviors hard to predict, latency from negotiation overhead.
ServiceNow's 2025 architectural guidance recommends a hybrid: hierarchical orchestration within a bounded domain (e.g., all IT incident resolution agents report to an incident orchestrator) with peer-to-peer coordination between domain orchestrators. This limits coordination storm risk while preserving cross-domain resilience.
Vertex AI Agent Engine supports both patterns. For hierarchical orchestration, the recommended approach uses a LangGraph-based orchestrator agent that calls worker agents via A2A. The orchestrator graph defines the workflow structure; A2A handles the remote agent calls transparently.
# Hierarchical orchestration — LangGraph + A2A
# Orchestrator node calls remote worker agents
from vertexai.agent_engines import create_agent_engine
from google.adk.agents import RemoteAgent
# Worker agents registered in Vertex AI Agent Registry
research_agent = RemoteAgent(agent_card_url="https://agents.example.com/research/.well-known/agent.json")
writing_agent = RemoteAgent(agent_card_url="https://agents.example.com/writing/.well-known/agent.json")
review_agent = RemoteAgent(agent_card_url="https://agents.example.com/review/.well-known/agent.json")
# Orchestrator delegates via A2A tasks/send
async def orchestrate(task_description: str):
research = await research_agent.send_task(message=task_description)
draft = await writing_agent.send_task(message=research.artifact.text)
final = await review_agent.send_task(message=draft.artifact.text)
return final
Google's Vertex AI documentation as of 2025 recommends starting with hierarchical orchestration for new multi-agent systems — it's easier to reason about, easier to debug, and easier to expand incrementally. Peer-to-peer coordination can be introduced at domain-orchestrator level once the hierarchical foundation is stable.
Practice selecting orchestration topologies for realistic enterprise scenarios. You'll be asked to justify your choices, identify failure modes, and describe how you'd implement the coordination in Vertex AI using LangGraph and A2A.
When SAP integrated A2A into its Business AI platform in 2025, the security team identified a new attack surface they'd not anticipated: prompt injection through inter-agent communication. An attacker who could influence a worker agent's output could potentially inject instructions that the orchestrator would execute. The solution required treating all inter-agent message content as untrusted input — the same principle applied to user inputs — combined with structured output schemas that rejected freeform instruction text.
In a single-agent system, there is one trust boundary: between the user and the agent. In a multi-agent system, every agent-to-agent connection is a potential trust boundary. This creates several new security considerations that do not exist in single-agent deployments.
SAP's Business AI team adopted a "structured artifact schema" pattern: all inter-agent artifacts must conform to a JSON Schema registered in a central schema registry. Free-form text responses from worker agents that don't match the expected schema are rejected before being processed by the orchestrator. This eliminates the prompt injection vector through inter-agent communication.
Single-agent observability is relatively straightforward: one input, one trace, one output. Multi-agent observability requires tracking causality across agent boundaries — understanding that a failure in agent D was caused by a bad input from agent B which came from a misinterpretation by agent A of the original user request.
Vertex AI Agent Engine provides distributed tracing that propagates trace IDs across A2A calls. Every task sent via A2A carries the parent trace ID, creating a complete causal tree across all agents involved in processing a single user request.
Distributed Trace: User Request → Multi-Agent Processing
[User Request] trace_id: abc123
└─► [Orchestrator] span: orchestrate_task
├─► [Research Agent] span: research_task (parent: abc123)
│ └─► [Web Search Tool] span: tool_call
├─► [Analysis Agent] span: analyze_task (parent: abc123)
│ └─► [BigQuery Tool] span: tool_call
└─► [Writing Agent] span: write_task (parent: abc123)
└─► artifact: final_report
All spans linked by trace_id: abc123
Full causal chain visible in Cloud Trace / Vertex AI Monitoring
Vertex AI's 2025 multi-agent deployment guide specifies a production readiness checklist. The following are the critical items for A2A-based systems:
A2A Production Readiness Checklist
Authentication
✓ Every inter-agent call authenticated with service account credentials
✓ Credentials scoped to minimum required permissions per agent
✓ User credentials never forwarded to worker agents
✓ Agent Card served over HTTPS with valid TLS certificate
Content Security
✓ All inter-agent artifacts validated against registered JSON Schema
✓ Free-form text from worker agents treated as untrusted input
✓ No instructions extracted from artifact text and executed directly
Observability
✓ Distributed tracing enabled with trace ID propagation across A2A calls
✓ Task lifecycle events logged to Cloud Logging
✓ Alerts on error rate, latency, and queue depth per agent
Reliability
✓ Idempotency keys used on all tasks/send calls
✓ Retry logic with exponential backoff on transient failures
✓ Circuit breakers on remote agent connections
✓ Graceful degradation path when remote agents are unavailable
In 2025, teams at both SAP and Deloitte independently arrived at the same production pattern: wrapping every A2A remote agent call with a circuit breaker. If a remote agent starts failing, the circuit opens and the orchestrator routes to a fallback path — either a different agent or a degraded but functional response — rather than accumulating timeouts that cascade into orchestrator failure. Vertex AI's SDK includes a built-in circuit breaker wrapper for RemoteAgent calls.
Practice identifying security vulnerabilities in multi-agent architectures and designing observability configurations. The tutor will present agent system designs and ask you to conduct a security and observability review.