In early 2023, teams at Cognition AI, Adept, and AutoGPT all ran into the same wall. A single LLM given a long software engineering task would lose track of earlier decisions, contradict itself on step forty, and hallucinate file paths it had invented three pages back. The context window was a hard ceiling. The solution was not a bigger model β it was division of labour.
A single agent operating on one context window faces three compounding constraints. First, context exhaustion: the model cannot hold the entire state of a long task in one prompt. Second, skill mismatch: a generalised model performs worse on specialised subtasks than a fine-tuned or prompted specialist. Third, no parallelism: sequential reasoning is slow when many subtasks are independent.
These were not theoretical problems. In May 2023, the AutoGPT open-source project β which had reached 150,000 GitHub stars in under two months β published post-mortems showing that tasks requiring more than roughly 15 sequential steps consistently degraded in quality. The longer the chain, the more the agent "forgot" its own prior tool outputs.
The response from research labs was the multi-agent system: an architecture in which several specialised agents operate in parallel or in sequence, each holding only the slice of context it needs, handing off structured results to peers.
GPT-4's original 8k context window forced engineers building agents to confront a fundamental choice: summarise aggressively (losing fidelity) or decompose the task across multiple agents (preserving it). Most production systems chose decomposition.
By Q3 2023, three deployments had demonstrated multi-agent architectures at meaningful scale.
Cognition AI's Devin (announced March 2024, but developed through 2023) assigned separate agents to planning, code generation, test execution, and error debugging. Each agent communicated through a structured message bus rather than a shared context. Cognition reported on SWE-bench that Devin resolved 13.86% of real GitHub issues autonomously β compared to 1.96% for single-pass GPT-4 on the same benchmark.
Microsoft's AutoGen framework, released October 2023, formalised the multi-agent conversation pattern. In their published ablation study, a two-agent "AssistantAgent + UserProxyAgent" pair solved 69% of HumanEval coding tasks successfully, versus 55% for a single agent with equivalent compute. The key gain was that the proxy agent could execute code and feed real error messages back β something a single agent could only simulate.
Salesforce's Einstein Copilot, built on a multi-agent routing layer, demonstrated in internal benchmarks that routing customer queries to domain-specific agents (billing, technical support, product) reduced hallucination rates by roughly 40% compared to a single general-purpose agent handling all query types.
The empirical gains from multi-agent systems come from two distinct sources: specialisation (each agent sees only what it needs) and verification (one agent can check another's output). Both effects showed up independently across the 2023 benchmarks.
Modules 1β5 of this course examined single agents: how they reason, use tools, maintain memory, and act safely. Module 6 extends that foundation to coordinated systems. Every concept you have studied β tool use, ReAct loops, retrieval, safety constraints β appears again here, but now multiplied across agents that must trust, verify, and communicate with each other.
Understanding multi-agent design is no longer academic. By early 2025, the majority of enterprise AI deployments reported by Gartner and McKinsey involved more than one AI component working in coordination. The lone chatbot is giving way to the agent team.
Task decomposition is the first skill of any multi-agent architect. Before you can assign work to specialist agents, you must break a complex goal into subtasks with clear inputs, outputs, and dependencies.
In this lab, you'll work with an AI assistant that plays the role of an AutoGen-style orchestrator coach. Ask it to help you decompose real-world tasks, identify which subtasks should run in parallel vs. sequentially, and spot where context boundaries should be drawn.
When Amazon's Rufus shopping assistant launched in February 2024, its internal architecture exposed a design decision every multi-agent team faces: should one central orchestrator route all requests, or should agents negotiate directly with each other? Amazon chose a hub-and-spoke model β Rufus itself as the hub, with product-search, review-analysis, and comparison agents as spokes. The choice was deliberate: centralised routing gave engineers a single point to audit, rate-limit, and update without touching every spoke.
In practice, multi-agent architectures converge on three coordination patterns. Understanding their trade-offs is what separates an architect from a tinkerer.
| Pattern | Structure | Best For | Risk |
|---|---|---|---|
| Hub-and-Spoke | One orchestrator agent routes to N workers; all communication flows through the hub | Audit, rate-limiting, consistent routing logic | Hub becomes a bottleneck and single point of failure |
| Pipeline | Agents in a fixed sequence; each passes output to the next | Well-defined sequential tasks (e.g., extract β enrich β format β deliver) | Error propagation; early-stage failures corrupt all downstream agents |
| Blackboard / Market | Agents read and write to a shared state store; any agent can claim an available task | Dynamic workloads with unpredictable task ordering | Race conditions; harder to trace causality; requires conflict resolution |
LangChain's LangGraph (released January 2024) formalised the pipeline pattern as a directed acyclic graph of agent nodes. Each node is a callable that reads from and writes to a typed state object. Edges can be conditional β an "error detected" edge routes to a repair agent rather than continuing downstream.
In a published walkthrough by Harrison Chase (LangChain CEO), a five-node research pipeline β query planner β web search agent β document retriever β synthesis agent β citation checker β demonstrated that conditional routing reduced hallucinated citations by 58% compared to a linear chain without the checker node. The citation-checker acted as an in-pipeline verifier, catching errors before they were presented to the user.
LangGraph's key contribution was making the graph structure explicit and inspectable. Rather than agent calls buried inside agent prompts, the topology was defined in code β making debugging and modification far more tractable.
The choice of coordination pattern is not purely technical β it reflects your tolerance for different failure modes. Hub-and-spoke fails loudly (hub goes down, everything stops). Blackboard fails silently (agents race, results are inconsistent). Choose based on which failure is easier to detect and recover from in your context.
In October 2024, OpenAI released Swarm, an experimental framework demonstrating a lightweight handoff protocol: agents could explicitly transfer control to another agent by returning a special Agent object rather than a text response. This formalised a pattern OpenAI engineers had been using internally β what they called "agent handoffs".
Swarm's published example showed a triage agent receiving a customer query, classifying it, and handing off to a billing or technical agent. The key innovation was that the handoff carried context β the receiving agent inherited the conversation history and any variables the handing-off agent had collected. This solved a common multi-agent problem: the second agent needing to re-ask questions the first had already answered.
Though Swarm was explicitly marked as experimental and not production-ready, its handoff pattern was adopted in modified form by several commercial frameworks within weeks of release.
Effective handoffs transfer three things: the task fragment, the relevant context accumulated so far, and the expected output format. Handoffs that omit context force the receiving agent to reconstruct information, wasting tokens and risking inconsistency.
Pattern selection is a judgement call that depends on failure tolerance, task predictability, and audit requirements. In this lab, you'll reason through real deployment scenarios with an AI coach that challenges your pattern choices and pushes you to justify the trade-offs.
The coach will present orchestration dilemmas and stress-test your reasoning. Push back, ask for counter-arguments, and explore the edge cases of each pattern.
In December 2023, a Chevrolet dealer in Watsonville, California deployed a customer service chatbot. Users discovered they could instruct it to agree to sell a 2024 Chevrolet Tahoe for one dollar β and the chatbot, lacking any verification layer, confirmed the "deal" in writing. The incident was documented by The Guardian and circulated widely. The bot was taken down within days. The failure was not a prompt injection in the adversarial sense β it was an absence of cross-agent verification: there was no agent checking whether the proposed output was consistent with business constraints before it was delivered to the customer.
In any multi-agent system, an agent's output becomes downstream input. If that output is wrong β hallucinated, manipulated, or inconsistent with constraints β the error propagates. Without verification nodes, the system has no mechanism to catch problems before they matter.
Anthropic's research on multi-agent safety (published in their model card updates through 2024) identified three verification failure modes:
Unchecked propagation: Agent A produces a hallucinated fact. Agent B trusts it and builds on it. Agent C presents it to the user as confirmed. Each agent was individually "doing its job" β the failure was systemic.
Constraint bypass: An agent is prompted (by a user or upstream agent) to override a business or safety rule. Without a separate constraint-enforcement agent, the bypass succeeds.
Cascading confirmation: Multiple agents independently assess the same claim but all draw on the same training data or retrieval source. Their "independent" verification is illusory β they are all wrong in the same direction.
In 2024, security researchers at Embrace the Red demonstrated that a malicious instruction embedded in a webpage retrieved by a web-search agent could hijack the downstream summarisation agent's output β causing it to exfiltrate conversation data. This is "indirect prompt injection" across an agent pipeline. The attack vector exists precisely because agents trust the outputs of agents before them in the pipeline.
Not all agents in a system should be trusted equally. Anthropic's Claude model cards and OpenAI's GPT-4 system card both introduce the concept of trust hierarchies in multi-agent contexts:
Operator-level trust: An agent deployed and configured by the operator, with full access to system instructions and tools. Typically the orchestrator.
Agent-level trust: A subagent whose outputs should be treated with the same scepticism as user-provided text β checked before being acted upon by safety-critical components.
Environment trust: Data retrieved from the external world (web, databases, files). Should be treated as potentially adversarial β any retrieved content may contain injected instructions.
The practical implication: an orchestrator should not automatically trust the output of a retrieval agent any more than it trusts raw user input. Retrieved text must be sanitised or passed through a verification agent before being used to generate instructions to other agents.
Microsoft's Responsible AI documentation for Copilot Studio (2024) introduced a "principle of minimal footprint" for multi-agent systems: each agent should request only the permissions it needs, store only what is necessary, and prefer reversible over irreversible actions. These constraints limit the blast radius when an agent makes an error or is successfully manipulated.
Three verification patterns have emerged in production systems:
Inline checker nodes: A dedicated verification agent sits in the pipeline after high-risk nodes. It checks the output against constraints before passing it downstream. Used by LangGraph's conditional edges and Devin's test-execution agent.
Out-of-band auditors: A separate agent runs concurrently, sampling outputs from the main pipeline and flagging anomalies. Does not block execution but triggers alerts. Used in financial trading systems where latency matters.
Human-in-the-loop gates: For irreversible or high-stakes actions, a human approval step is mandatory regardless of agent confidence. Google DeepMind's AlphaCode 2 used this for generated code submitted to competitive programming contests in 2023.
In this lab, you'll work with an AI security coach who will present you with multi-agent pipeline descriptions and challenge you to identify trust vulnerabilities, propose verification layers, and reason through trust hierarchy design.
The coach draws on real incidents β the Chevy chatbot, the Embrace the Red prompt injection demos, and Microsoft's Copilot Studio guidance. Push it to give you new scenarios and attack vectors you haven't considered.
On April 29, 2024, GitHub announced Copilot Workspace β a technical preview that turned a GitHub issue into a complete pull request through a chain of five AI agents. An issue-analysis agent read the bug report, a planning agent designed the fix, a code-writing agent implemented it, a test-generation agent wrote tests, and a validation agent checked consistency. The entire pipeline ran in a browser tab. GitHub's engineering blog noted that the hardest problem was not the agents themselves β it was making their intermediate reasoning legible to the human developer who needed to trust, modify, or reject it.
Between 2023 benchmarks and 2024 production deployments, several problems emerged that pure research had not anticipated.
Latency compounding: Each agent call adds latency. A five-agent pipeline where each agent takes 3 seconds produces a 15-second minimum response time before any parallelism is factored in. GitHub Copilot Workspace addressed this by running code generation and test generation in parallel after the planning phase β cutting end-to-end time by roughly 35% on typical tasks.
Intermediate state legibility: Users and developers need to understand what each agent did and why. GitHub's Copilot Workspace exposed each agent's reasoning step in a visual plan view. This was not a UX feature β it was a safety feature. Developers who could see the plan caught errors before code was generated.
Cost at scale: A five-agent pipeline on GPT-4 class models costs approximately 5β10Γ more per query than a single-agent approach. Anthropic's published usage guidance recommends using Claude Haiku-class models for routine subagent tasks and reserving Sonnet/Opus-class for orchestration and verification β a cost-performance trade-off that became standard practice by mid-2024.
In July 2024, Google DeepMind announced that AlphaProof had solved four of six problems from the 2024 International Mathematical Olympiad (IMO) β a result that would have earned a silver medal if submitted by a human contestant.
AlphaProof's architecture was explicitly multi-agent: a natural-language problem parser, a mathematical formalisation agent that translated problems into Lean 4 proof language, a proof-search agent using reinforcement learning, and a formal verification agent running the Lean 4 theorem prover as a tool. The verification agent was not another LLM β it was a symbolic verifier, mathematically certain in its judgements.
This is the strongest documented example of heterogeneous agent teams: mixing neural agents (LLMs, RL models) with symbolic agents (theorem provers, rule engines). The symbolic agent provided exactly the kind of genuine independent verification that cascading confirmation cannot β its "opinion" was not based on the same training data.
AlphaProof's use of Lean 4 as a formal verification agent is a template for safe multi-agent design: use symbolic, rule-based, or mathematical tools as verification agents wherever possible. They cannot be prompted into agreeing with incorrect outputs β they either verify or they don't.
Analysing the major 2024 deployments β Copilot Workspace, AlphaProof, Devin, Salesforce Einstein, and Amazon Rufus β reveals five shared characteristics of successful production multi-agent systems:
1. Explicit intermediate state. Every system exposes what each agent produced, not just the final output. This enables debugging, user trust, and error recovery.
2. At least one non-LLM verifier. Whether a unit test runner (Devin), a Lean 4 prover (AlphaProof), or a constraint engine (Einstein), production systems include at least one agent whose verification cannot be talked out of its verdict.
3. Parallel execution where possible. Latency constraints force parallelism. Planning and data-gathering are often sequential; generation and verification often run in parallel.
4. Model tiering. Not every agent needs the largest model. Routing, formatting, and simple transformation tasks run on smaller, faster, cheaper models. Judgment and synthesis run on larger ones.
5. Human gates for irreversible actions. Every production system reviewed required human confirmation before actions that could not be undone β code commits, sent emails, financial transactions.
Multi-agent systems do not solve the fundamental limitations of language models β they architect around them. Specialisation, verification, and parallelism each mitigate a specific LLM weakness. Understanding which weakness you're mitigating with which architectural choice is what makes a multi-agent design defensible.
In this final lab, you'll design a complete multi-agent system β from decomposition through orchestration pattern selection, trust hierarchy, verification design, model tiering, and latency analysis. Your AI coach will evaluate your design against the five production principles from Lesson 4 and push you on weak points.
This is a synthesis lab: draw on everything from Lessons 1β4. The coach has the full module context and will challenge you as a senior multi-agent architect would.