Module 7 · Lesson 1

Why Agents Need to Talk to Each Other

The limits of single-agent architectures and the case for coordination protocols

What fundamental constraints push production AI systems beyond the single-agent model?

When Google DeepMind published research on multi-agent reinforcement learning frameworks, a pattern became clear across deployed systems: agents solving narrow tasks well but failing at coordination. The bottleneck wasn't intelligence — it was communication architecture.

The Single-Agent Ceiling

A single LLM-based agent operates within hard limits: a fixed context window, a single thread of execution, and one model's capability profile. For simple, bounded tasks — answering a question, writing a document, executing a SQL query — this works fine. Production systems, however, face a different reality.

Consider what Salesforce encountered when building Einstein Copilot in 2023–2024. A single agent handling a full sales workflow — ingesting CRM data, drafting outreach, scheduling calls, updating records — quickly exhausted context limits and struggled with the breadth of tool integrations required. The solution wasn't a bigger model; it was decomposition into specialized agents, each owning a domain.

Three structural limits drive this decomposition in practice:

Hard Limits

Context window saturation on long tasks
Single-model capability ceilings
Sequential execution bottlenecks
Tool namespace collisions at scale

Operational Limits

Latency compounds across long chains
Failure isolation impossible in monoliths
Separate security domains needed
Specialization vs. generalization tradeoff

The Coordination Problem

If decomposition is the answer, the next question is: how do agents coordinate without creating chaos? Early multi-agent experiments at companies like Adept AI and Cohere showed that ad-hoc coordination — agents calling each other through custom HTTP APIs or shared databases — created brittle, unmaintainable systems. Each integration was a one-off. Security review had to happen for every pair of agents. There was no standard vocabulary for task delegation.

This mirrors a problem the internet solved in the 1990s: arbitrary networked programs needed a common protocol. HTTP gave them one. The Agent2Agent (A2A) Protocol, announced by Google in April 2025 with over 50 partner organizations, attempts to do the same for AI agents.

The A2A Announcement — April 2025

Google's Agent2Agent Protocol launched with backing from Atlassian, Box, Cohere, Deloitte, Salesforce, SAP, ServiceNow, and dozens more. The protocol defines how agents discover each other's capabilities, delegate tasks, stream results, and handle errors — all over standard HTTPS using JSON. It is explicitly designed to complement the Model Context Protocol (MCP), which handles agent-to-tool connections.

Key Concepts: The A2A Mental Model

Client Agent The agent that initiates a task delegation. It knows what needs to be done but delegates execution to a more capable or specialized agent.

Remote Agent The agent that receives and executes the delegated task. It exposes an A2A-compliant endpoint and publishes an Agent Card describing its capabilities.

Agent Card A JSON document published at a well-known URL (typically /.well-known/agent.json) that describes the agent's capabilities, supported input/output modalities, authentication requirements, and pricing if any.

Task The fundamental unit of A2A work. A client sends a task, the remote agent processes it, and returns an artifact. Tasks have a defined lifecycle: submitted → working → completed (or failed).

Why Not Just Use Function Calling?

OpenAI's function calling and Google's tool use in Gemini are agent-to-tool protocols — the agent remains in control and tools are passive executors. A2A is agent-to-agent: both sides are autonomous agents with their own reasoning, state, and decision-making. A remote agent can push back, ask clarifying questions, or fail gracefully in ways a tool cannot.

Where A2A Fits in the Vertex AI Agent Stack

Vertex AI Agent Engine (formerly Agent Builder) provides the runtime for deploying individual agents. When those agents need to coordinate, A2A provides the protocol layer. The Vertex AI platform added native A2A support in 2025, meaning agents built with the Vertex AI SDK can be registered with Agent Cards and discovered by other agents within and across organizations.

The stack looks like this: LLM reasoning at the core, MCP for tool connections outward, A2A for agent-to-agent coordination, and Vertex AI infrastructure handling deployment, scaling, and observability across the whole system.

  ┌─────────────────────────────────────────────────┐
  │              Orchestration Layer                 │
  │  (Vertex AI Agent Engine / LangGraph / custom)  │
  └──────────────┬──────────────────────────────────┘
                 │ Agent2Agent Protocol (A2A)
       ┌─────────┴─────────┐
       ▼                   ▼
  ┌─────────┐         ┌─────────┐
  │ Agent A │         │ Agent B │
  │  (MCP)  │         │  (MCP)  │
  └────┬────┘         └────┬────┘
       │                   │
   Tools/APIs          Tools/APIs

Lesson 1 Quiz

Why Agents Need to Talk to Each Other

What is an Agent Card in the A2A protocol?

Correct. The Agent Card is analogous to a service discovery document — it tells other agents what this agent can do, what inputs it accepts, what authentication it requires, and optionally what it costs to call.

Not quite. The Agent Card is a machine-readable JSON capability advertisement published at a well-known URL, enabling agent discovery without a central registry.

What is the key architectural difference between MCP (Model Context Protocol) and A2A (Agent2Agent Protocol)?

Exactly right. MCP is an agent-to-tool protocol where the agent is always in control. A2A is agent-to-agent — both sides have autonomy, reasoning, and can initiate communication.

The distinction is about autonomy: MCP connects to passive tools, A2A connects to autonomous agents that can reason, push back, and have their own state.

Which of these is NOT one of the structural limits that push production systems toward multi-agent architectures?

Correct — LLMs generating structured JSON is a solved problem (via function calling, structured outputs, etc.) and is not a driver of multi-agent decomposition. The real drivers are context limits, capability specialization, and fault isolation.

LLMs generating structured JSON is well-supported and not a driver for multi-agent systems. The real structural limits are context windows, capability specialization, sequential execution bottlenecks, and failure isolation.

Lab 1: Designing Agent Decomposition

Practice identifying when and how to split a monolithic agent into a multi-agent system

Lab Objective

In this lab you will work with an AI tutor to analyze a monolithic agent design and determine appropriate decomposition boundaries. You'll practice identifying which tasks should become separate agents, what their Agent Cards should advertise, and how A2A coordination would work.

Start by describing a complex agent task you want to decompose — for example: "An agent that handles end-to-end customer onboarding: KYC verification, account setup, welcome email, and initial product recommendation." The tutor will guide you through the decomposition process.

A2A Architecture Tutor

Lab 1

Welcome to Lab 1. I'm here to help you practice multi-agent decomposition using the Agent2Agent Protocol model.

Describe a complex workflow or agent task — something with multiple distinct capability domains. I'll guide you through identifying decomposition boundaries, designing Agent Cards, and mapping the A2A coordination pattern.

What scenario would you like to work through?

Module 7 · Lesson 2

The A2A Protocol: Technical Architecture

Tasks, artifacts, streaming, and the message lifecycle that makes agent coordination reliable

How does A2A structure communication so that agents can coordinate reliably across trust boundaries?

When Atlassian began integrating A2A into Rovo — its AI work management system — the engineering team noted that the protocol's task lifecycle model solved a problem they'd struggled with: long-running async operations. An agent initiating a code review couldn't block waiting for completion. A2A's push notification and streaming mechanisms let the client agent do other work and receive results when ready.

The Task Lifecycle

Every A2A interaction centers on a Task object. Understanding its lifecycle is the foundation of protocol fluency. A task moves through defined states, and both client and remote agents track this state to coordinate correctly.

  Task States:

  submitted ──► working ──► input-required ──► working
                   │
                   ├──► completed   (success, artifact returned)
                   ├──► failed      (terminal failure)
                   └──► canceled    (client or remote canceled)

  Each transition is observable via polling or push notifications.

The input-required state is particularly important: it allows a remote agent to pause execution and ask the client agent for more information — a genuine back-and-forth that is impossible in simple function calling. This enables multi-turn interactions between agents, not just one-shot request-response.

Core Protocol Objects

Task Contains: id, sessionId, status, history (messages), artifacts, metadata. The client sends a task with an initial Message; the remote agent updates status and returns artifacts.

Message A single turn in the task conversation. Has a role (user or agent) and an array of Parts. Parts can be text, data (arbitrary JSON), or file references.

Artifact The output produced by the remote agent upon task completion. Like a Message, it contains Parts. A task can produce multiple artifacts (e.g., a document and a summary).

Part The atomic content unit. TextPart holds text. DataPart holds arbitrary JSON (structured data, tool results). FilePart references an external file by URI or carries inline bytes.

Transport: JSON-RPC over HTTPS

A2A uses JSON-RPC 2.0 over HTTPS as its transport. This was a deliberate choice for enterprise adoption: every corporate network can route HTTPS, existing API gateways understand it, and security teams know how to audit it. There is no special protocol to whitelist.

The four core RPC methods are:

Synchronous Methods

tasks/send — Submit a task, get final result
tasks/get — Poll for current task status
tasks/cancel — Request task cancellation

Streaming Methods

tasks/sendSubscribe — Submit and stream updates via SSE
Server-Sent Events for real-time progress
TaskStatusUpdateEvent and TaskArtifactUpdateEvent

Push Notifications

For long-running tasks, polling is inefficient. A2A supports push notifications: the client agent registers a webhook URL when submitting a task, and the remote agent POSTs status updates to that URL as they occur. This is how Atlassian's Rovo handles code review agents that may take minutes to complete their analysis.

Authentication and Trust

A2A delegates authentication to standard web mechanisms — OAuth 2.0, API keys, service account JWT tokens. The Agent Card advertises which authentication schemes the remote agent accepts. The client agent is responsible for obtaining and presenting the correct credentials.

This matters enormously for enterprise deployment. When Deloitte published its A2A integration guidance for enterprise clients in 2025, authentication was highlighted as the primary concern: agents crossing organizational boundaries need cryptographic proof of identity, not just shared secrets. The protocol's reliance on standard web auth means existing identity infrastructure — Azure AD, Google Cloud IAM, Okta — can be applied directly.

The Opaque Execution Principle

A2A explicitly does not require a remote agent to expose its internal implementation. The client agent knows what the remote agent can do (via Agent Card) and what it returned (via artifacts), but not how it did it. This enables agents built on entirely different frameworks — LangGraph, AutoGen, CrewAI, or custom code — to interoperate without any shared implementation knowledge.

Lesson 2 Quiz

The A2A Protocol: Technical Architecture

What does the "input-required" task state enable that simple function calling cannot?

Correct. The input-required state is what distinguishes A2A from simple tool calls. The remote agent can pause, signal it needs more information, and the client agent responds — creating genuine bidirectional dialogue between autonomous agents.

The input-required state specifically enables multi-turn agent-to-agent dialogue — the remote agent can pause execution and request clarification from the client agent, which is impossible in one-shot function calling.

Why did A2A choose JSON-RPC over HTTPS as its transport, rather than a more specialized protocol?

Exactly. Enterprise adoption requires that the protocol work in existing infrastructure. HTTPS needs no special firewall rules, API gateways already inspect it, and security teams have established processes for auditing it — all critical for enterprise deployment.

The choice was pragmatic for enterprise adoption: HTTPS works through any corporate network, existing tooling understands it, and security teams know how to audit it. No special protocol to whitelist or infrastructure to add.

What does the "Opaque Execution Principle" mean in A2A?

Correct. Opaque execution is what enables a LangGraph agent to call a CrewAI agent without knowing or caring about the implementation. The interface contract (Agent Card + artifacts) is all that matters, not the internals.

Opaque execution means the client only sees the interface (Agent Card capabilities) and the output (artifacts), never the internal implementation. This is what makes cross-framework interoperability possible.

Lab 2: Designing Agent Cards and Task Flows

Practice writing A2A Agent Cards and tracing task state transitions

Lab Objective

In this lab you will practice designing A2A Agent Cards and tracing how tasks move through the protocol lifecycle. The tutor will present scenarios and ask you to design Agent Card JSON structures or trace task state transitions.

Start by asking the tutor to give you an agent scenario, or propose your own. For example: "Help me design an Agent Card for a document summarization agent that accepts PDF files and returns structured summaries."

A2A Protocol Design Tutor

Lab 2

Welcome to Lab 2. We're working on A2A Agent Cards and task lifecycle design.

I can give you an agent scenario to design an Agent Card for, or you can propose one. We'll work through the JSON structure including: capabilities declaration, supported input/output modalities, authentication requirements, and how tasks should flow through the lifecycle states for your agent's specific behavior.

Ready to start? Propose a scenario or ask me for one.

Module 7 · Lesson 3

Orchestration Patterns: Hierarchical and Peer-to-Peer

How multi-agent topologies shape capability, failure modes, and operational complexity

Which orchestration topology fits your system's coordination needs — and what are the costs of getting it wrong?

ServiceNow's AI platform team, building multi-agent workflows for IT service management, published architectural guidance in 2025 noting two failure modes they'd encountered: orchestrator bottlenecks in hierarchical systems (every task touching a central agent created a single point of failure) and coordination storms in peer-to-peer systems (agents negotiating with each other recursively without convergence). The solution was matching topology to task structure — not defaulting to either extreme.

The Two Fundamental Topologies

Multi-agent systems cluster around two structural patterns. Understanding their tradeoffs is essential before designing any production system.

  HIERARCHICAL (Orchestrator-Worker)        PEER-TO-PEER (Mesh)

       [Orchestrator]                    [Agent A] ←──► [Agent B]
       /      |      \                       ↕               ↕
   [W1]    [W2]    [W3]                  [Agent C] ←──► [Agent D]

  Central coordination point             Agents negotiate directly
  Clear authority, simple auditing       Flexible, resilient
  Single point of failure                Complex to reason about

Hierarchical Orchestration

In a hierarchical topology, a central orchestrator agent breaks down tasks and delegates to worker agents. The orchestrator maintains the full picture of task state; workers are specialists that receive narrow, well-defined subtasks and return artifacts.

This is the pattern used in Google's own Gemini multi-agent demonstrations and in Vertex AI's Agent Builder workflow templates. The orchestrator is typically a capable frontier model (Gemini 1.5 Pro or equivalent) that reasons about task decomposition. Worker agents can be smaller, cheaper, faster models specialized for specific domains.

Advantages: Simple to audit (all task state visible at orchestrator), clear authority for conflict resolution, easy to trace failures, straightforward to add/remove workers.

Failure modes: Orchestrator becomes a bottleneck at scale; orchestrator failure halts everything; orchestrator's reasoning quality determines total system quality.

Peer-to-Peer Agent Meshes

In a peer-to-peer topology, agents communicate directly with each other based on capability matching. There is no central orchestrator; agents discover each other via Agent Cards and negotiate task delegation bilaterally.

This pattern appears in research systems and some advanced enterprise deployments where resilience is paramount. If one agent fails, others can route around it. The system has no single point of failure.

Advantages: No single point of failure, naturally load-balances, agents can evolve independently, works well for loosely-coupled domains.

Failure modes: Coordination storms (agents ping-ponging tasks), difficult to audit and debug, emergent behaviors hard to predict, latency from negotiation overhead.

What ServiceNow Found in Production

ServiceNow's 2025 architectural guidance recommends a hybrid: hierarchical orchestration within a bounded domain (e.g., all IT incident resolution agents report to an incident orchestrator) with peer-to-peer coordination between domain orchestrators. This limits coordination storm risk while preserving cross-domain resilience.

Implementing Orchestration on Vertex AI

Vertex AI Agent Engine supports both patterns. For hierarchical orchestration, the recommended approach uses a LangGraph-based orchestrator agent that calls worker agents via A2A. The orchestrator graph defines the workflow structure; A2A handles the remote agent calls transparently.

  # Hierarchical orchestration — LangGraph + A2A
  # Orchestrator node calls remote worker agents

  from vertexai.agent_engines import create_agent_engine
  from google.adk.agents import RemoteAgent

  # Worker agents registered in Vertex AI Agent Registry
  research_agent  = RemoteAgent(agent_card_url="https://agents.example.com/research/.well-known/agent.json")
  writing_agent   = RemoteAgent(agent_card_url="https://agents.example.com/writing/.well-known/agent.json")
  review_agent    = RemoteAgent(agent_card_url="https://agents.example.com/review/.well-known/agent.json")

  # Orchestrator delegates via A2A tasks/send
  async def orchestrate(task_description: str):
      research = await research_agent.send_task(message=task_description)
      draft    = await writing_agent.send_task(message=research.artifact.text)
      final    = await review_agent.send_task(message=draft.artifact.text)
      return final

Choosing Your Topology

Choose Hierarchical When

Task decomposition is well-understood upfront
Auditability and traceability are required
Workers have clearly distinct specializations
Throughput matters more than resilience
Regulatory compliance requires clear authority chain

Choose Peer-to-Peer When

No single agent should be a single point of failure
Task routing is highly dynamic and context-dependent
Agents are maintained by different teams or orgs
System must degrade gracefully under partial failure
Domains are loosely coupled with occasional coordination

The Vertex AI Recommendation

Google's Vertex AI documentation as of 2025 recommends starting with hierarchical orchestration for new multi-agent systems — it's easier to reason about, easier to debug, and easier to expand incrementally. Peer-to-peer coordination can be introduced at domain-orchestrator level once the hierarchical foundation is stable.

Lesson 3 Quiz

Orchestration Patterns: Hierarchical and Peer-to-Peer

What is the primary failure mode of hierarchical (orchestrator-worker) multi-agent systems?

Correct. Centralizing coordination in an orchestrator creates both a performance bottleneck (everything flows through it) and a reliability risk (if it fails, everything fails). This is the core tradeoff ServiceNow's team documented.

The primary failure mode of hierarchical systems is the orchestrator as bottleneck and single point of failure. Every task must flow through the orchestrator, making it a performance and reliability risk.

What is a "coordination storm" in a peer-to-peer agent mesh?

Exactly. Coordination storms occur when agents in a peer-to-peer mesh keep delegating tasks to each other — Agent A delegates to B, B delegates back to A or to C who delegates back — creating circular or runaway delegation without convergence.

A coordination storm is when agents in a peer-to-peer mesh ping-pong tasks to each other recursively without resolution — each agent delegates to another agent who delegates back, consuming resources without completing work.

According to ServiceNow's 2025 architectural guidance, what hybrid approach helps balance the tradeoffs between hierarchical and peer-to-peer topologies?

Correct. The hybrid approach keeps the benefits of both: hierarchical within a domain gives clear authority and easy auditing, while peer-to-peer between domain orchestrators provides cross-domain resilience and eliminates any single point of failure across the whole system.

ServiceNow recommends hierarchical orchestration within each bounded domain (e.g., all incident resolution agents under one orchestrator) combined with peer-to-peer between those domain orchestrators — capturing the auditability benefits of hierarchy while avoiding a global single point of failure.

Lab 3: Orchestration Architecture Design

Choose and justify topology decisions for realistic multi-agent scenarios

Lab Objective

Practice selecting orchestration topologies for realistic enterprise scenarios. You'll be asked to justify your choices, identify failure modes, and describe how you'd implement the coordination in Vertex AI using LangGraph and A2A.

Ask the tutor for a multi-agent scenario, or bring your own. You'll analyze it together — choosing between hierarchical, peer-to-peer, or hybrid topologies, identifying failure modes, and sketching the implementation approach.

Orchestration Architecture Tutor

Lab 3

Welcome to Lab 3. We're working on multi-agent orchestration architecture decisions.

I'll present realistic enterprise scenarios — or you can bring your own — and we'll work through topology selection together. For each scenario you'll need to: choose between hierarchical, peer-to-peer, or hybrid, justify that choice against the tradeoffs, identify the likely failure modes, and sketch how you'd implement it on Vertex AI with LangGraph and A2A.

Want a scenario from me, or do you have a system you're designing?

Module 7 · Lesson 4

Security, Observability, and Production Hardening

What changes when agents call other agents — and how to keep production multi-agent systems trustworthy

How do the security and observability requirements of multi-agent systems differ from single-agent deployments, and what does Vertex AI provide to address them?

When SAP integrated A2A into its Business AI platform in 2025, the security team identified a new attack surface they'd not anticipated: prompt injection through inter-agent communication. An attacker who could influence a worker agent's output could potentially inject instructions that the orchestrator would execute. The solution required treating all inter-agent message content as untrusted input — the same principle applied to user inputs — combined with structured output schemas that rejected freeform instruction text.

The New Attack Surface: Agent-to-Agent Trust

In a single-agent system, there is one trust boundary: between the user and the agent. In a multi-agent system, every agent-to-agent connection is a potential trust boundary. This creates several new security considerations that do not exist in single-agent deployments.

Prompt Injection via Agent Output A malicious payload in a remote agent's artifact that, when ingested by the orchestrator, injects instructions. Mitigation: treat all inter-agent content as untrusted; use structured schemas for artifact exchange.

Credential Forwarding Risk If an orchestrator forwards user credentials to worker agents, compromise of any worker grants access to those credentials. Mitigation: use service account credentials scoped to each agent's specific permissions; never forward user tokens.

Confused Deputy Problem An orchestrator agent, acting with elevated permissions, can be tricked into taking actions on behalf of an attacker by a malicious remote agent. Mitigation: implement least-privilege at every agent boundary; validate all action requests against defined policy.

Lateral Movement Compromise of one agent in a mesh can be used to attack neighboring agents. Mitigation: network segmentation, zero-trust agent authentication, and independent credential stores per agent.

SAP's Production Mitigation Pattern

SAP's Business AI team adopted a "structured artifact schema" pattern: all inter-agent artifacts must conform to a JSON Schema registered in a central schema registry. Free-form text responses from worker agents that don't match the expected schema are rejected before being processed by the orchestrator. This eliminates the prompt injection vector through inter-agent communication.

Vertex AI Observability for Multi-Agent Systems

Single-agent observability is relatively straightforward: one input, one trace, one output. Multi-agent observability requires tracking causality across agent boundaries — understanding that a failure in agent D was caused by a bad input from agent B which came from a misinterpretation by agent A of the original user request.

Vertex AI Agent Engine provides distributed tracing that propagates trace IDs across A2A calls. Every task sent via A2A carries the parent trace ID, creating a complete causal tree across all agents involved in processing a single user request.

  Distributed Trace: User Request → Multi-Agent Processing

  [User Request] trace_id: abc123
       └─► [Orchestrator] span: orchestrate_task
               ├─► [Research Agent] span: research_task (parent: abc123)
               │       └─► [Web Search Tool] span: tool_call
               ├─► [Analysis Agent] span: analyze_task (parent: abc123)
               │       └─► [BigQuery Tool] span: tool_call
               └─► [Writing Agent] span: write_task (parent: abc123)
                       └─► artifact: final_report

  All spans linked by trace_id: abc123
  Full causal chain visible in Cloud Trace / Vertex AI Monitoring

Key Observability Signals for Multi-Agent Systems

Metrics to Monitor

Task completion rate per remote agent
Inter-agent latency (P50, P95, P99)
Task state distribution (working vs. input-required)
Retry rate per agent pair
Artifact schema validation failure rate

Alerts to Configure

Remote agent error rate exceeds threshold
Task stuck in input-required state (timeout)
Orchestrator queue depth growing unbounded
Trace depth exceeds expected maximum (loop detection)
Credential validation failures on inter-agent calls

Production Hardening Checklist

Vertex AI's 2025 multi-agent deployment guide specifies a production readiness checklist. The following are the critical items for A2A-based systems:

  A2A Production Readiness Checklist

  Authentication
  ✓ Every inter-agent call authenticated with service account credentials
  ✓ Credentials scoped to minimum required permissions per agent
  ✓ User credentials never forwarded to worker agents
  ✓ Agent Card served over HTTPS with valid TLS certificate

  Content Security
  ✓ All inter-agent artifacts validated against registered JSON Schema
  ✓ Free-form text from worker agents treated as untrusted input
  ✓ No instructions extracted from artifact text and executed directly

  Observability
  ✓ Distributed tracing enabled with trace ID propagation across A2A calls
  ✓ Task lifecycle events logged to Cloud Logging
  ✓ Alerts on error rate, latency, and queue depth per agent

  Reliability
  ✓ Idempotency keys used on all tasks/send calls
  ✓ Retry logic with exponential backoff on transient failures
  ✓ Circuit breakers on remote agent connections
  ✓ Graceful degradation path when remote agents are unavailable

The Circuit Breaker Pattern for Agent Meshes

In 2025, teams at both SAP and Deloitte independently arrived at the same production pattern: wrapping every A2A remote agent call with a circuit breaker. If a remote agent starts failing, the circuit opens and the orchestrator routes to a fallback path — either a different agent or a degraded but functional response — rather than accumulating timeouts that cascade into orchestrator failure. Vertex AI's SDK includes a built-in circuit breaker wrapper for RemoteAgent calls.

Lesson 4 Quiz

Security, Observability, and Production Hardening

What is "prompt injection via agent output" and how did SAP mitigate it in production?

Correct. SAP's mitigation was elegant: require all inter-agent artifacts to conform to a registered JSON Schema. Free-form text that doesn't match the schema is rejected before the orchestrator processes it, eliminating the injection vector entirely.

Prompt injection via agent output is when a worker agent's artifact contains instructions that get executed by the orchestrator. SAP's mitigation was requiring all artifacts to match a registered JSON Schema — free-form instruction text simply can't pass schema validation.

What does distributed tracing provide in a multi-agent Vertex AI system?

Correct. The key value of distributed tracing in multi-agent systems is causality: when a failure occurs in agent D, you can trace back through the span tree to see exactly what input it received, which agent sent it, and what the original user request was that set the chain in motion.

Distributed tracing propagates trace IDs across A2A calls, creating a causal tree. This lets you trace a failure in any agent back through all intermediate agents to the original user request — root-cause analysis across agent boundaries.

Why is the circuit breaker pattern particularly important for A2A-based multi-agent systems?

Exactly. Cascading failure is the key risk in hierarchical multi-agent systems: if a worker agent starts timing out, every orchestrator call waiting on it ties up a thread. Circuit breakers open when failures accumulate, enabling immediate fallback routing before the orchestrator itself becomes overwhelmed.

The circuit breaker prevents cascading failure: a timing-out worker agent causes the orchestrator to accumulate blocked threads. Without a circuit breaker, this cascades — the orchestrator itself fails. Circuit breakers open at the first sign of persistent failure and route to fallbacks immediately.

Lab 4: Security and Observability Design

Apply production hardening principles to a multi-agent architecture under review

Lab Objective

Practice identifying security vulnerabilities in multi-agent architectures and designing observability configurations. The tutor will present agent system designs and ask you to conduct a security and observability review.

Ask for an agent architecture to review, or describe a system you're building. You'll identify security risks (prompt injection paths, credential handling issues, confused deputy risks) and design the observability stack (what to trace, what to alert on, what the circuit breaker policy should be).

Security & Observability Tutor

Lab 4

Welcome to Lab 4. We're working on security review and observability design for multi-agent systems.

I can give you an agent architecture with deliberate security and observability gaps for you to find and fix, or you can describe a system you're building and we'll review it together.

For each architecture we'll cover: prompt injection vectors and mitigations, credential handling risks, what to trace and how to configure the trace ID propagation, which metrics to alert on, and where circuit breakers should be placed.

Ready to start? Ask for an architecture to review or describe your own system.

Module 7 — Test

The Agent2Agent Protocol — Building Multi-Agent Systems · 15 questions · 80% to pass

1. What was the primary motivation behind Google launching the Agent2Agent (A2A) Protocol in April 2025?

Correct. A2A was designed to solve the coordination protocol problem — giving agents a standard way to discover each other, delegate tasks, and exchange results across frameworks and organizational boundaries.

A2A was launched to standardize agent-to-agent communication — capability discovery via Agent Cards, task delegation, and result exchange — enabling agents built on different frameworks to interoperate.

2. Which URL convention does A2A use for Agent Card discovery?

Correct. The /.well-known/ URL pattern (from RFC 5785) is used for service discovery documents across many web standards. A2A follows this convention for Agent Cards.

A2A uses /.well-known/agent.json following the RFC 5785 well-known URI standard, the same pattern used by OAuth server metadata, ACME protocol, and other web-scale discovery mechanisms.

3. In the A2A protocol, what is the difference between a "client agent" and a "remote agent"?

Correct. The client/remote distinction is about the direction of task delegation, not location or capability tier. In a different context, the same agent can be a client agent (when delegating) or a remote agent (when receiving delegations).

The distinction is about task delegation direction: the client agent initiates, the remote agent receives and executes. An agent can play either role depending on the interaction — it's positional, not a fixed characteristic.

4. What are the four terminal and non-terminal states in the A2A task lifecycle?

Correct. The key non-terminal states are submitted (received), working (executing), and input-required (paused for more info). Terminal states are completed, failed, and canceled.

A2A tasks flow through: submitted → working → input-required (back to working) or directly to completed, failed, or canceled. The input-required state is particularly distinctive — it enables genuine multi-turn agent dialogue.

5. What transport protocol does A2A use, and why was this choice significant for enterprise adoption?

Correct. The choice was pragmatic: enterprise adoption requires working within existing infrastructure. HTTPS needs no special firewall rules, API gateways already understand it, and security teams know how to audit it.

A2A uses JSON-RPC 2.0 over HTTPS — the choice was driven by enterprise practicality: no special network rules, existing tooling works, security teams have established HTTPS review processes.

6. How does A2A handle real-time streaming of task progress for long-running operations?

Correct. tasks/sendSubscribe opens an SSE stream, and the remote agent pushes TaskStatusUpdateEvent and TaskArtifactUpdateEvent messages as they occur — enabling real-time progress visibility without polling.

A2A supports streaming via tasks/sendSubscribe, which opens a Server-Sent Events stream. The remote agent pushes update events as task status changes and as artifact chunks become available.

7. What is the "Opaque Execution Principle" and why does it enable cross-framework interoperability?

Correct. The interface contract — Agent Card plus artifacts — is all that matters. Internal implementation is invisible. This is the same principle that lets Python code call a Java service over REST without caring that it's Java.

Opaque execution means the interface contract (Agent Card + artifacts) is all that matters — not the implementation. This is what lets agents built with completely different frameworks interoperate over A2A.

8. In hierarchical multi-agent orchestration, what role does the orchestrator agent typically play on Vertex AI?

Correct. The orchestrator needs strong reasoning to decompose complex tasks appropriately, so frontier models are typical choices. Workers can be smaller, cheaper, specialized models since they handle well-defined narrow tasks.

The orchestrator does the heavy reasoning — breaking down complex tasks and routing appropriately — so capable frontier models like Gemini 1.5 Pro are the recommended choice. Workers can be smaller, cheaper, specialized models.

9. What hybrid topology did ServiceNow recommend for enterprise multi-agent deployments?

Correct. This hybrid captures the best of both: hierarchy within each domain gives clear authority and auditability; peer-to-peer between domain orchestrators eliminates a global single point of failure.

ServiceNow's recommendation: hierarchical within bounded domains (e.g., all IT incident agents under one orchestrator) + peer-to-peer between domain orchestrators. This combines auditability with cross-domain resilience.

10. Which of these is the correct description of a "coordination storm" in a peer-to-peer agent mesh?

Correct. Coordination storms are recursive delegation loops: A delegates to B, B delegates to C, C delegates back to A — consuming compute and connections without completing any work.

A coordination storm is recursive delegation without convergence: agents keep passing tasks to each other without any agent completing the work. It's a livelock scenario specific to peer-to-peer agent meshes.

11. How does A2A handle authentication, and what enterprise identity systems can be used?

Correct. A2A's reliance on standard web auth is a key enterprise adoption feature — security teams don't need to learn a new auth system, and existing IAM infrastructure can be applied directly to agent-to-agent connections.

A2A delegates authentication to standard web mechanisms advertised in the Agent Card. This means existing enterprise identity infrastructure — Google Cloud IAM, Azure AD, Okta — works directly for authenticating inter-agent calls.

12. What security vulnerability did SAP's Business AI team discover when building multi-agent systems, and how did they mitigate it?

Correct. SAP's structured artifact schema pattern is a clean mitigation: if all inter-agent artifacts must conform to a JSON Schema, freeform instruction text simply cannot pass validation and reach the orchestrator.

SAP found that attackers could influence worker agent output to inject instructions into the orchestrator. Their mitigation: require all artifacts to match a registered JSON Schema. Free-form instruction text doesn't pass schema validation.

13. What does "distributed tracing with trace ID propagation" provide in a multi-agent Vertex AI system?

Correct. The causal tree is the key value: when agent D fails, you can trace back through every intermediate agent span to understand the full chain of causation back to the original user request.

Distributed tracing propagates a trace ID across all A2A calls, creating a causal tree. This enables root-cause analysis: when any agent fails, you can trace backward through all intermediate agents to the original request.

14. Why is the "confused deputy" problem particularly relevant for multi-agent systems?

Correct. The confused deputy attack exploits the trust relationship: the orchestrator has permissions the attacker doesn't. If the attacker can influence what the orchestrator does via a malicious remote agent, they can use the orchestrator's permissions to take actions they couldn't take directly.

The confused deputy problem: the orchestrator has elevated permissions. A malicious remote agent can potentially trick the orchestrator into using those permissions for the attacker's benefit. Mitigation is least-privilege at every agent boundary and validating all action requests against policy.

15. What production pattern did both SAP and Deloitte independently converge on for protecting orchestrators against remote agent failures in 2025?

Correct. The circuit breaker pattern prevents cascading failure: rather than accumulating timeouts until the orchestrator itself fails, the circuit opens at the first sign of persistent remote agent failure and routes to a fallback path immediately. Vertex AI's SDK includes a built-in RemoteAgent circuit breaker wrapper.

Both SAP and Deloitte converged on circuit breakers. A failing remote agent causes the orchestrator to accumulate blocked threads waiting for timeouts — circuit breakers open early and route to fallbacks, preventing this cascade. Vertex AI SDK includes a built-in circuit breaker for RemoteAgent calls.