🎯 Advanced · Lesson 1 of 4

The ReAct Pattern

How interleaving reasoning traces with tool calls transformed agent reliability — and what Google's experiments actually showed.

In 2022, researchers at Google Brain and Princeton published "ReAct: Synergizing Reasoning and Acting in Language Models." They ran GPT-3 and PaLM on HotpotQA — a multi-hop question dataset requiring two-step Wikipedia lookups — and found that models using only chain-of-thought hallucinated facts they could not verify. Models using only tool calls retrieved correct documents but constructed incoherent final answers. ReAct, which interspersed explicit Thought: traces between every tool invocation, outperformed both baselines: +6.4 points over chain-of-thought alone on HotpotQA and a 34% reduction in hallucination rate on FEVER (a fact-verification task). The key mechanism was that each Thought line let the model audit its own retrieved evidence before deciding what to search next.

How ReAct Works

ReAct — short for Reasoning + Acting — structures every agent step as a three-part cycle: Thought, Action, Observation. The model first writes a short reasoning trace explaining what it needs to know and why. It then emits a structured tool call (the Action). The tool's output is injected back into the context as an Observation. The model reads that Observation and immediately writes another Thought before deciding whether to call another tool or generate a final answer.

This loop continues until the model concludes it has enough information. Because every tool result is verbally processed before the next action is chosen, errors compound more slowly. If a Wikipedia search returns an irrelevant article, the Thought step catches the mismatch. The model can reformulate the query in the next Action rather than silently proceeding on bad data.

Core Loop

Thought → What do I need to find?
Action → search("query")
Observation → [tool output]
Repeat until the model is confident enough to answer.
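The cycle above can be sketched in a few lines of Python. Everything named here is hypothetical: `scripted_model` stands in for a real LLM call, `search` is a stub backed by a two-entry dictionary, and the `max_steps` cap is an assumption of this sketch rather than something the ReAct paper prescribes.

```python
from typing import Callable

# Hypothetical stand-in for Wikipedia: a two-entry lookup table.
FAKE_WIKI = {
    "Eiffel Tower": "The Eiffel Tower is in Paris.",
    "Paris": "Paris is the capital of France.",
}

def search(query: str) -> str:
    """Stub tool: exact-match lookup in the fake corpus."""
    return FAKE_WIKI.get(query, "No results found.")

TOOLS: dict[str, Callable[[str], str]] = {"search": search}

def scripted_model(transcript: list[str]) -> dict:
    """Stand-in for an LLM call: picks the next step from the transcript so far."""
    seen = sum(line.startswith("Observation:") for line in transcript)
    if seen == 0:
        return {"thought": "I need to find where the Eiffel Tower is.",
                "action": "search", "input": "Eiffel Tower"}
    if seen == 1:
        return {"thought": "It is in Paris; now I need the country.",
                "action": "search", "input": "Paris"}
    return {"thought": "Paris is the capital of France, so I can answer.",
            "action": "finish", "input": "France"}

def react(question: str, model: Callable[[list[str]], dict], max_steps: int = 5):
    """Run Thought -> Action -> Observation until the model finishes or the cap hits."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model(transcript)
        transcript.append(f"Thought: {step['thought']}")
        if step["action"] == "finish":
            return step["input"], transcript
        transcript.append(f"Action: {step['action']}({step['input']!r})")
        observation = TOOLS[step["action"]](step["input"])
        # The tool output goes back into the context for the next Thought.
        transcript.append(f"Observation: {observation}")
    return None, transcript  # cap reached without a final answer
```

Calling `react("In which country is the Eiffel Tower?", scripted_model)` walks through two searches before finishing, and the returned transcript alternates Thought, Action, and Observation lines in that order.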

Strengths and Failure Modes

ReAct's primary strength is interpretability. Because the Thought traces are part of the model's output, engineers can read exactly which step caused a wrong answer. This makes debugging substantially faster than black-box tool-call chains. LangChain's initial agentic pipelines in 2023 were built on ReAct precisely because teams could inspect every Thought line in their logs.

The pattern's main failure mode is repetitive looping. If the model's Thought step consistently misidentifies why a search failed, it can re-issue near-identical queries dozens of times, burning tokens without progress. The original ReAct paper noted this in its ablations: without a hard step limit, PaLM occasionally entered infinite retrieval loops on ambiguous questions. Production deployments address this with step budgets, typically 5–15 iterations, after which the agent either returns a partial answer or escalates to a human.

A secondary failure mode is Thought verbosity: models sometimes write excessively long reasoning traces that fill the context window before any useful tool calls occur. OpenAI's GPT-4 function-calling documentation (released June 2023) specifically recommended keeping Thought traces under 150 tokens per step to avoid this.

Production Reality

Most ReAct deployments set a hard step limit between 5 and 15 iterations. Without it, ambiguous tasks can cause token-expensive retrieval loops that return no better answer than stopping at step 3.
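One way such a guard might look is sketched below. `LoopGuard` and `normalize` are hypothetical names, and the case-and-whitespace normalisation is only a crude proxy for "near-identical"; a real deployment might compare queries with fuzzy or embedding-based matching instead.

```python
from collections import Counter

def normalize(query: str) -> str:
    """Crude similarity proxy: ignore case and extra whitespace."""
    return " ".join(query.lower().split())

class LoopGuard:
    """Refuses further tool calls once a step budget or repeat limit is exceeded."""

    def __init__(self, max_steps: int = 10, max_repeats: int = 2) -> None:
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = Counter()

    def allow(self, query: str) -> tuple[bool, str]:
        """Return (allowed, reason) for the next proposed query."""
        self.steps += 1
        if self.steps > self.max_steps:
            return False, "step budget exhausted"
        key = normalize(query)
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            return False, "repeated query"
        return True, "ok"
```

The agent loop would call `allow()` before each Action and, on a refusal, return a partial answer or escalate to a human, as described above.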

When ReAct Is the Right Choice

ReAct is well-suited to tasks that are single-domain, tool-heavy, and short: Q&A over a knowledge base, database lookups, API chaining where each call informs the next. It is poorly suited to tasks that require long-horizon planning across many parallel sub-problems, because its strictly sequential loop cannot parallelize work. If a task could benefit from simultaneously checking five different data sources and merging results, ReAct will process them one at a time — wasting latency and tokens.

Best for: multi-hop Q&A, tool-heavy lookups, debugging workflows
Weak for: parallel sub-task execution, long-horizon planning
Requires: step budget enforcement in production
Advantage over pure CoT: grounded, verifiable intermediate steps
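The latency cost of a strictly sequential loop can be illustrated with a small timing sketch. `fetch_source` is a hypothetical stub that simulates an I/O-bound lookup with a short sleep; nothing here is part of ReAct itself. It only contrasts one-at-a-time calls with what a design that allows concurrent calls could do.

```python
import asyncio
import time

async def fetch_source(name: str, delay: float = 0.05) -> str:
    """Hypothetical data-source lookup; the sleep simulates network latency."""
    await asyncio.sleep(delay)
    return f"{name}: ok"

async def sequential(sources: list[str]) -> list[str]:
    """One lookup at a time, as a strictly sequential loop would do."""
    return [await fetch_source(s) for s in sources]

async def concurrent(sources: list[str]) -> list[str]:
    """All lookups in flight at once; gather returns results in input order."""
    return list(await asyncio.gather(*(fetch_source(s) for s in sources)))

def timed(runner, sources: list[str]) -> tuple[list[str], float]:
    """Run one strategy and report its results and wall-clock time."""
    start = time.perf_counter()
    results = asyncio.run(runner(sources))
    return results, time.perf_counter() - start
```

With five sources, the sequential version waits for roughly five delays in a row, while the concurrent version waits for roughly one. Both return the same results.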

🎯 Advanced · Lesson 1 Quiz

ReAct Pattern — Check

1. In the original ReAct paper, which benchmark showed a 34% reduction in hallucination rate compared to chain-of-thought alone?

✓ Correct! FEVER is a fact-verification task, and ReAct's grounded retrieval reduced hallucination by 34% compared to chain-of-thought alone on that benchmark.

Not quite. The 34% hallucination reduction was measured on FEVER, a fact-verification task, not HotpotQA (where the gain was +6.4 accuracy points).

2. What is the correct order of steps in a single ReAct cycle?

✓ Correct! Thought first (reasoning trace), then Action (tool call), then Observation (tool output injected back). The cycle repeats until the model decides to answer.

Not quite. ReAct always starts with a Thought trace, then emits an Action (tool call), then receives the Observation back. Thought → Action → Observation is the correct order.

3. What is the PRIMARY production risk of deploying ReAct without a step budget?

✓ Correct! Without a step cap, a model that misdiagnoses why a search failed can re-issue near-identical queries dozens of times, burning tokens without improving the answer.

Not quite. The primary risk is infinite retrieval loops: the model repeatedly issues failed searches without realising it's stuck. A step budget (typically 5–15) prevents this.

🎯 Advanced · Lab 1

ReAct Trace Analysis

Inspect a simulated ReAct trace and identify where the loop breaks down.

Your Task

You'll receive a simulated ReAct trace from an agent that failed to answer a multi-hop question correctly. Work with the AI tutor to:

Identify the exact Thought step where the agent's reasoning went wrong.
Explain why the bad Thought led to a compounding error in subsequent Actions.
Propose a corrected Thought trace that would have led to the right answer.

Start by asking the tutor to show you a failing ReAct trace, then analyse it step by step.

🎯 Advanced · Lesson 2 of 4

Plan-and-Execute Architecture

Separating planning from execution — how LangChain's 2023 framework experiments revealed what sequential loops cannot do.

In April 2023, LangChain's blog published benchmark results comparing a standard ReAct agent against a new "Plan-and-Execute" agent on the BabyAGI-style task set. For tasks requiring more than eight sequential steps, ReAct's per-step token overhead caused it to exceed GPT-4's context window before finishing. Plan-and-Execute solved this by separating concerns: a planner LLM call produced the full ordered task list upfront, and a smaller executor LLM (or even GPT-3.5) carried out each step. Tasks that stumped ReAct at step 9 were completed by Plan-and-Execute in the same token budget because the executor calls were stateless — they didn't carry the growing Thought history. On a web-research task involving 12 sequential lookups, Plan-and-Execute was 40% cheaper in tokens while matching accuracy.

The Two-Phase Design

Plan-and-Execute splits cognition into two strictly separated phases. In Phase 1 (Planning), a capable planner LLM receives the user's goal and generates a complete, ordered list of sub-tasks. Each sub-task is a self-contained instruction — no tool calls happen yet. The planner may be a large, expensive model optimised for reasoning.

In Phase 2 (Execution), a separate executor agent (which may be a smaller, cheaper model) works through the plan step by step. Because the executor's context for each step contains only that step's instruction plus its immediate inputs, context window pressure is vastly reduced. Completed step outputs can be stored in a structured memory object and retrieved selectively rather than keeping the entire history in-context.
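A rough back-of-the-envelope model shows why stateless executor calls ease context pressure. The numbers below (a 500-token base prompt and 300 tokens per step) are purely illustrative assumptions, not figures from any benchmark, and the model ignores the planner call and any selectively retrieved memory.

```python
def growing_history_input_tokens(steps: int, base: int, per_step: int) -> int:
    """Total input tokens when call k re-reads the base prompt plus all k earlier steps."""
    return sum(base + per_step * k for k in range(steps))

def stateless_input_tokens(steps: int, base: int, per_step: int) -> int:
    """Total input tokens when each call reads the base prompt plus one step's input."""
    return steps * (base + per_step)

# Illustrative assumption: 12 steps, 500-token base prompt, 300 tokens per step.
history_total = growing_history_input_tokens(12, 500, 300)   # 25800
stateless_total = stateless_input_tokens(12, 500, 300)       # 9600
```

Under these assumptions the growing-history total scales quadratically with the number of steps, while the stateless total scales linearly, so the gap widens as tasks get longer.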

Key Insight

Plan-and-Execute lets you route planning to an expensive frontier model and execution to a cheap fast model — cutting inference cost without sacrificing plan quality.
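The two phases can be sketched as follows. `stub_planner` and `stub_executor` are hypothetical placeholders: in a real system each would wrap a model call (a larger model for the planner, a smaller one for the executor), and the `needs` field is an assumed convention for declaring which earlier outputs a step depends on.

```python
from typing import Callable

def stub_planner(goal: str) -> list[dict]:
    """Stand-in for the expensive planner model: returns the full ordered plan."""
    return [
        {"instruction": "Look up price of A", "needs": []},
        {"instruction": "Look up price of B", "needs": []},
        {"instruction": "Compare A and B", "needs": [0, 1]},
    ]

def stub_executor(instruction: str, inputs: dict) -> str:
    """Stand-in for the cheap executor model: sees one step and its inputs only."""
    return f"done({instruction}; saw {sorted(inputs)})"

def plan_and_execute(goal: str,
                     planner: Callable[[str], list[dict]],
                     executor: Callable[[str, dict], str]) -> dict:
    # Phase 1: a single planning call, no tool use yet.
    plan = planner(goal)
    # Phase 2: step outputs live in a structured store, not in the prompt.
    memory = {}
    for index, step in enumerate(plan):
        # Retrieve only the earlier outputs this step declares it needs.
        inputs = {i: memory[i] for i in step["needs"]}
        memory[index] = executor(step["instruction"], inputs)
    return memory
```

Because the executor receives only `inputs`, swapping in a cheaper model for that role does not change what the planner sees or produces.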

Re-Planning and Adaptive Execution

A naive Plan-and-Execute system generates the plan once and executes it rigidly. This fails when real-world tool outputs differ from what the planner anticipated — a common situation when browsing live web data or querying APIs whose schemas change. The solution is re-planning triggers: after each step, a lightweight classifier (or a simple heuristic on the executor's output) checks whether the result sufficiently matches the plan's expectation. If not, the planner is invoked again on the remaining steps.

Microsoft's AutoGen framework (released September 2023) implemented this as a "feedback loop to planner" pattern. In their published evals on ScienceWorld — a simulated lab environment — agents with re-planning triggers outperformed static plan agents by 22 percentage points on tasks that involved unexpected tool failures. Re-planning added roughly 15% token overhead per trial but eliminated the catastrophic failures caused by executing a stale plan on changed world state.

Design Decision

Static plans are fast and cheap. Re-planning is more robust. Most production teams implement re-planning only on explicit execution failures — not after every step — to keep costs reasonable.
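A minimal sketch of failure-triggered re-planning, under stated assumptions: `StepFailed`, the planner signature with a `failure` argument, and the `max_replans` cap are all hypothetical choices for illustration, not any framework's API.

```python
class StepFailed(Exception):
    """Raised by the executor when a step cannot be completed."""

def stub_planner(goal, completed, failure=None):
    """Stand-in for the planner model; returns the remaining steps."""
    if failure is None:
        return ["fetch v1 API", "summarise"]
    # After a failure, plan around it using what has been learned.
    return ["fetch v2 API", "summarise"]

def stub_executor(step):
    """Stand-in for the executor; the v1 endpoint is assumed to be gone."""
    if step == "fetch v1 API":
        raise StepFailed("endpoint removed")
    return f"ok: {step}"

def run_with_replanning(goal, planner, executor, max_replans=2):
    plan = list(planner(goal, completed=[]))
    completed = []
    replans = 0
    while plan:
        step = plan.pop(0)
        try:
            completed.append((step, executor(step)))
        except StepFailed as err:
            if replans >= max_replans:
                raise  # give up rather than re-plan forever
            replans += 1
            # Re-plan only the remaining work, passing along the failure details.
            plan = list(planner(goal, completed=completed,
                                failure=(step, str(err))))
    return completed, replans
```

Here the planner is invoked a second time only because the first step raised an error; a run with no failures costs exactly one planning call.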

Where Plan-and-Execute Breaks Down

The architecture assumes the planner can enumerate all necessary steps before execution begins. This assumption fails on exploratory tasks where the correct next step depends entirely on what was discovered in the previous one — for example, debugging an unknown codebase. In such cases, the planner either over-specifies (producing irrelevant steps) or under-specifies (leaving gaps the executor cannot fill). ReAct handles these tasks better because it dynamically decides the next action after reading each observation.

Plan-and-Execute also introduces coordination overhead: the planner and executor must share state through some explicit channel. In LangChain's implementation this was a shared memory dict, which became a point of failure when executor outputs contained data types that downstream steps were not prepared to parse.
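One possible mitigation is to validate outputs when they are written to shared state rather than when a later step reads them. The sketch below is not LangChain's implementation; `StepResult`, `SharedMemory`, and the two `kind` labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StepResult:
    step_id: int
    kind: str      # declared type label, e.g. "text" or "number"
    value: object

# Assumed mapping from type label to the Python types it accepts.
EXPECTED_TYPES = {"text": str, "number": (int, float)}

class SharedMemory:
    """Shared planner/executor state that rejects mistyped outputs on write."""

    def __init__(self) -> None:
        self._results = {}

    def put(self, result: StepResult) -> None:
        expected = EXPECTED_TYPES.get(result.kind)
        if expected is None:
            raise ValueError(f"unknown kind: {result.kind!r}")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(result.value, bool) or not isinstance(result.value, expected):
            raise TypeError(f"step {result.step_id}: value is not a {result.kind}")
        self._results[result.step_id] = result

    def get(self, step_id: int, kind: str):
        result = self._results[step_id]
        if result.kind != kind:
            raise TypeError(f"step {step_id} holds {result.kind}, not {kind}")
        return result.value
```

Failing at write time points the error at the step that produced the bad value instead of at whichever downstream step happens to parse it first.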

🎯 Advanced · Lesson 2 Quiz

Plan-and-Execute — Check

1. According to LangChain's April 2023 benchmarks, what was Plan-and-Execute's token cost advantage over ReAct on a 12-step web-research task?

✓ Correct! Plan-and-Execute was 40% cheaper in tokens while matching ReAct's accuracy on the 12-step task, because executor calls didn't carry the full growing Thought history.

Not quite. LangChain's benchmark showed Plan-and-Execute was 40% cheaper in tokens on that 12-step research task — because stateless executor calls don't carry accumulated Thought history.

2. What is the PRIMARY reason Plan-and-Execute handles long tasks better than ReAct inside a context window?

✓ Correct! Each executor call sees only its own step's instruction plus inputs, not the entire accumulated Thought/Observation history that ReAct carries, which is what causes context overflow on long tasks.

Not quite. The key is that executor steps are stateless — each call contains only its step's instruction and inputs, not the growing history ReAct drags along. That's what keeps tokens under control on long tasks.

3. On which type of task does Plan-and-Execute fail compared to ReAct?

✓ Correct! Plan-and-Execute assumes the planner can enumerate all steps upfront. When the correct next step depends on discovering something in the previous step (like debugging an unknown codebase), it breaks down.

Not quite. Plan-and-Execute struggles with exploratory tasks where each step's direction depends on what was found in the last step. If the planner can't foresee that, the plan becomes stale or irrelevant mid-execution.

🎯 Advanced · Lab 2

Plan Construction Workshop

Write and critique a Plan-and-Execute task plan for a real-world scenario.

Your Task

You'll design a Plan-and-Execute plan for a specific complex task, then defend and refine it with the tutor. Work through:

Draft a 6-step ordered plan for the scenario the tutor gives you.
Identify which steps could fail due to unexpected tool outputs.
Add re-planning triggers at the most vulnerable steps.

Ask the tutor for a scenario and start drafting your plan. Be specific about what each step's executor call should receive as input.

🎯 Advanced · Lesson 3 of 4

Multi-Agent and Hierarchical Patterns

Orchestrators, sub-agents, and the coordination overhead that made AutoGen's multi-agent conversations a landmark — and a warning.

Microsoft Research released AutoGen in September 2023. Its core contribution was a conversation-based multi-agent framework where agents could message each other in structured dialogues — not just receive instructions from a single orchestrator. In their evaluation on the HumanEval coding benchmark, a two-agent setup (one "Coder" agent, one "Critic" agent that reviewed and corrected output) achieved 89.9% pass@1 — compared to 67.0% for a single GPT-4 agent with no critic. The Coder wrote code; the Critic ran it, read the error, and sent back a correction request. The loop ran a maximum of ten rounds. The gain came entirely from the Critic's ability to isolate bugs the Coder's self-review missed — demonstrating that role specialisation inside a multi-agent system meaningfully outperformed a single generalist agent on well-defined tasks.

Orchestrator–Sub-Agent Architecture

The most common multi-agent pattern is hierarchical orchestration: a central orchestrator LLM decomposes a goal into sub-tasks and delegates each to a specialised sub-agent. Each sub-agent has its own tool set, system prompt, and memory scope. The orchestrator receives each sub-agent's output and decides whether to pass it to another sub-agent, revise the delegation, or synthesise a final answer.

This pattern appeared in production at Cognition AI's Devin (announced March 2024), where a planning agent maintained the high-level engineering task while separate sub-agents handled shell execution, file editing, and browser navigation. By isolating tool sets per sub-agent, the system prevented a browser-navigation error from corrupting the shell execution state — a cross-contamination risk that flat single-agent designs face.

Role Isolation Benefit

Sub-agents with scoped tool sets cannot accidentally invoke tools outside their domain. This reduces the blast radius of hallucinated tool calls — a critical safety property in agentic systems with write access to real systems.
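Tool scoping of this kind can be sketched with a per-agent allow-list. `SubAgent`, `ToolNotAllowed`, and the two stub tools are hypothetical; the point is only that a tool name outside an agent's scope is rejected before anything runs.

```python
executed_commands = []  # records what the stub shell tool actually ran

def read_page(url: str) -> str:
    """Stub browser tool."""
    return f"<contents of {url}>"

def run_shell(command: str) -> str:
    """Stub shell tool; a real one would have side effects."""
    executed_commands.append(command)
    return "exit 0"

class ToolNotAllowed(Exception):
    """Raised when an agent requests a tool outside its scope."""

class SubAgent:
    def __init__(self, name: str, tools: dict) -> None:
        self.name = name
        self._tools = dict(tools)  # private copy: scope is fixed at creation

    def call(self, tool_name: str, *args):
        if tool_name not in self._tools:
            # A hallucinated or out-of-scope call stops here, before execution.
            raise ToolNotAllowed(f"{self.name} cannot use {tool_name!r}")
        return self._tools[tool_name](*args)

browser_agent = SubAgent("browser", {"read_page": read_page})
shell_agent = SubAgent("shell", {"run_shell": run_shell})
```

If the browser agent asks for `run_shell`, the call raises `ToolNotAllowed` and `executed_commands` stays empty, which is the reduced blast radius described above.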

Peer-to-Peer Multi-Agent Conversations

AutoGen's alternative pattern is peer-to-peer dialogue: agents message each other directly without a central orchestrator. This is effective when two specialised agents need to iteratively negotiate — like the Coder/Critic loop — but it introduces a significant risk: sycophantic convergence. Two agents can agree on an incorrect answer faster than they converge on a correct one if neither has a ground-truth verification step. The original AutoGen paper documented this: when both agents were the same base model, agreement rates were high but accuracy gains over a single agent were negligible.

The fix is asymmetric agent roles: one agent must have verifiable ground truth (e.g., the Critic actually executes the code and reads error output) rather than reasoning from the same prior knowledge as the Coder. This is why AutoGen's coding demos outperformed its QA demos — code execution provides external verification that pure language tasks do not.

Critical Warning

Two LLM agents debating a factual claim will often converge on confident-sounding agreement that is wrong. Peer-to-peer debate works only when at least one agent has access to external verification — not just different reasoning about the same training data.
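The asymmetric Coder/Critic arrangement can be sketched as below. This is not AutoGen's API: `stub_coder` is a hypothetical stand-in for a model call, and the `critic` here is plain code execution rather than a second LLM, which is what gives it ground truth. The ten-round cap mirrors the limit mentioned earlier in this lesson.

```python
from typing import Callable, Optional

def critic(code: str, test: str) -> Optional[str]:
    """Execute the candidate code and its test; return an error report or None."""
    namespace = {}
    try:
        exec(code, namespace)   # NOTE: a real system must sandbox untrusted code
        exec(test, namespace)
    except Exception as err:
        return f"{type(err).__name__}: {err}"
    return None

def stub_coder(feedback: Optional[str]) -> str:
    """Stand-in for the Coder model: the first draft is buggy, the revision is fixed."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"

def coder_critic_loop(coder: Callable[[Optional[str]], str],
                      test: str, max_rounds: int = 10):
    """Alternate Coder and Critic until the test passes or rounds run out."""
    feedback = None
    for round_number in range(1, max_rounds + 1):
        code = coder(feedback)
        feedback = critic(code, test)   # external verification, not opinion
        if feedback is None:
            return code, round_number
    return None, max_rounds
```

With the test `assert add(2, 3) == 5, 'add(2, 3) should be 5'`, the first draft fails with an `AssertionError`, the error text is passed back to the Coder, and the second draft passes.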

Coordination Overhead and When to Prefer Single-Agent

Every agent boundary introduces latency (an additional LLM call), cost (tokens for context-setting in each sub-agent prompt), and failure surface (message serialisation errors, schema mismatches between agents). Anthropic's published guidance on multi-agent systems (2024) recommends starting with a single agent and adding agent boundaries only when a specific bottleneck — context length, tool set isolation, or specialisation — can be directly solved by the split. Adding agents "in case it helps" consistently increases both cost and failure rate without proportional gains.

Add a sub-agent when: tool sets must be isolated for safety or context reasons
Add a critic agent when: external verification (e.g., code execution) is possible
Stay single-agent when: the task fits in one context window and tool sets overlap
Never add agents speculatively: measure the bottleneck first

🎯 Advanced · Lesson 3 Quiz

Multi-Agent Patterns — Check

1. In AutoGen's HumanEval coding benchmark, what was the pass@1 score for the two-agent Coder/Critic setup?

✓ Correct! The Coder/Critic two-agent setup achieved 89.9% pass@1, compared to 67.0% for a single GPT-4 agent, a gain driven by the Critic's access to actual code execution results.

Not quite. 67.0% was the single-agent baseline. The two-agent Coder/Critic achieved 89.9% pass@1 — because the Critic ran the code and read real error output rather than reasoning from the same knowledge as the Coder.

2. Why do two peer-to-peer LLM agents often converge on wrong answers when debating factual claims?

✓ Correct! Two agents trained on the same data will confidently agree on shared misconceptions. Without external verification, such as running code, debate just produces faster confident-sounding wrong consensus.

Not quite. The problem is that both agents draw on identical training data. Without external verification (like code execution), their debate is just two instances of the same knowledge base arguing — and they converge on agreement, not truth.

3. According to Anthropic's 2024 multi-agent guidance, when should you add a sub-agent boundary?

✓ Correct! Anthropic's guidance says to start single-agent and add agent boundaries only when you can name the specific bottleneck the split solves. Speculative multi-agent design raises cost and failure rate without proportional gain.

Not quite. Anthropic's guidance is clear: start single-agent. Add a sub-agent only when you can identify a specific bottleneck — context length pressure, tool set isolation needs, or required specialisation — that the split directly addresses.

🎯 Advanced · Lab 3

Multi-Agent Design Critique

Evaluate a proposed multi-agent architecture and determine whether each boundary is justified.

Your Task

You'll receive a description of a proposed multi-agent system. Work with the tutor to systematically evaluate it:

List each agent boundary in the proposed system.
For each boundary, state the specific bottleneck it allegedly solves.
Identify which boundaries are justified and which should be collapsed back into a single agent — and why.

Ask the tutor to present a proposed multi-agent architecture, then work through each boundary one by one.

Building AI Agents I — Use Cases · Module 5 · Lesson 4

Lesson 4

Advanced concepts, real-world applications, and practical implications

Core Concepts

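This lesson connects the architecture patterns from Lessons 1–3 (ReAct, Plan-and-Execute, and multi-agent designs) to implementation practice, examining the key principles, real-world applications, and implications for practitioners working in this domain.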

Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.

Practical Applications

The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.

Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.

Looking Forward

As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.

Lesson 4 Quiz

Lesson 4

What is the primary focus of Lesson 4?

✓ Correct. This lesson bridges theory and practice, focusing on real-world implementation.

Review the lesson — the focus is on connecting frameworks to practical reality.

Why does real-world deployment introduce challenges that pure theory doesn't capture?

✓ Correct. Real deployment requires judgment, not just framework application.

Practice doesn't invalidate theory — it reveals complexities that require nuanced application of theoretical principles.

What separates effective practitioners from those who merely follow checklists?

✓ Correct. Critical thinking and adaptability matter more than memorized procedures.

The key differentiator is critical thinking ability, not experience or resources alone.

🎯 Advanced · Lesson 4 Lab

Lab: Apply What You've Learned

Synthesize concepts from Lesson 4 through guided AI conversation

Your Task

Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios that apply the architecture patterns from this module.

Try: "I need an agent that monitors competitor pricing across 50 websites, analyzes trends, and generates a weekly strategy memo. Which architecture pattern fits — ReAct, Plan-and-Execute, or Multi-Agent — and what are the trade-offs of each?"

Module 5 Test

Agent Architectures Overview · 15 Questions · 70% to Pass

1. What does ReAct stand for and what is its core cycle?

2. On the HotpotQA benchmark, how did ReAct compare to chain-of-thought alone?

3. What is the primary failure mode where ReAct agents get stuck?

4. Why is ReAct weak for parallel sub-task execution?

5. What production safeguard is required for ReAct deployments?

6. How does Plan-and-Execute reduce token costs compared to ReAct?

7. What is the key advantage of using a separate planner and executor model?

8. In Microsoft's AutoGen framework, how much did re-planning triggers improve performance on tasks with unexpected failures?

9. When does Plan-and-Execute fail as an architecture?

10. What token overhead did re-planning add in the AutoGen evaluations described in Lesson 2?

11. In Microsoft's AutoGen coding benchmark, what pass rate did the two-agent Coder+Critic setup achieve vs. single GPT-4?

12. What is "sycophantic convergence" in peer-to-peer multi-agent systems?

13. Why did AutoGen's coding demos outperform its QA demos?

14. According to Anthropic's 2024 multi-agent guidance, when should you add agent boundaries?

15. What safety property does the orchestrator-sub-agent architecture provide through tool set isolation?