How interleaving reasoning traces with tool calls transformed agent reliability — and what Google's experiments actually showed.
In 2022, researchers at Google Brain and Princeton published "ReAct: Synergizing Reasoning and Acting in Language Models." They evaluated PaLM, with additional GPT-3 runs, on HotpotQA (a multi-hop question-answering dataset that requires chaining Wikipedia lookups) and FEVER (a fact-verification task). Models using only chain-of-thought tended to hallucinate facts they could not verify. Models using only tool calls retrieved relevant documents but struggled to reason their way to a coherent final answer. ReAct, which interleaves explicit Thought: traces with tool invocations, beat the act-only baseline on both tasks and beat chain-of-thought on FEVER, while trailing chain-of-thought slightly on HotpotQA; the strongest reported results came from combining ReAct with chain-of-thought self-consistency. The key mechanism was that each Thought line let the model audit its retrieved evidence before deciding what to search next.
ReAct — short for Reasoning + Acting — structures every agent step as a three-part cycle: Thought, Action, Observation. The model first writes a short reasoning trace explaining what it needs to know and why. It then emits a structured tool call (the Action). The tool's output is injected back into the context as an Observation. The model reads that Observation and immediately writes another Thought before deciding whether to call another tool or generate a final answer.
This loop continues until the model concludes it has enough information. Because every tool result is verbally processed before the next action is chosen, errors compound more slowly. If a Wikipedia search returns an irrelevant article, the Thought step catches the mismatch. The model can reformulate the query in the next Action rather than silently proceeding on bad data.
Thought → What do I need to find?
Action → search("query")
Observation → [tool output]
Repeat until confident enough to answer.
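The cycle above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `scripted_model`, `search`, the toy `CORPUS`, and the `Action: search[...]` / `Final Answer:` text format are all hypothetical stand-ins so the loop runs without an LLM or network access.

```python
import re
from typing import Callable

# Hypothetical toy corpus standing in for a real search tool.
CORPUS = {
    "Eiffel Tower": "The Eiffel Tower is in Paris.",
    "Paris": "Paris is the capital of France.",
}

def search(query: str) -> str:
    """Toy tool: exact-match lookup; returns a 'no results' string on a miss."""
    return CORPUS.get(query, f"No results for '{query}'.")

def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM call: picks the next step from what has been observed."""
    if "capital of France" in transcript:
        return "Thought: I have both facts.\nFinal Answer: France"
    if "is in Paris" in transcript:
        return "Thought: The tower is in Paris; which country is that?\nAction: search[Paris]"
    return "Thought: I need to locate the tower first.\nAction: search[Eiffel Tower]"

def react(question: str, model: Callable[[str], str], max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = model(transcript)                 # Thought plus Action or Final Answer
        transcript += "\n" + step
        final = re.search(r"Final Answer:\s*(.+)", step)
        if final:
            return final.group(1).strip()
        action = re.search(r"Action: search\[(.+?)\]", step)
        if action is None:
            return "No valid action produced."
        observation = search(action.group(1))    # run the tool
        transcript += f"\nObservation: {observation}"  # feed the result back
    return "Step limit reached without an answer."

print(react("Which country is the Eiffel Tower in?", scripted_model))
```

With the scripted model this takes two searches and prints `France`. The point of the sketch is the control flow: every Observation is appended to the transcript before the model is asked for its next Thought.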
ReAct's primary strength is interpretability. Because the Thought traces are part of the model's output, engineers can read exactly which step caused a wrong answer. This makes debugging substantially faster than black-box tool-call chains. LangChain's initial agentic pipelines in 2023 were built on ReAct precisely because teams could inspect every Thought line in their logs.
The pattern's main failure mode is repetitive looping. If the model's Thought step consistently misidentifies why a search failed, it can re-issue near-identical queries dozens of times, burning tokens without progress. The original ReAct paper flagged this in its error analysis: the model sometimes fell into repeating earlier thoughts and actions instead of changing strategy. Production deployments address this with step budgets, typically 5–15 iterations, after which the agent either returns a partial answer or escalates to a human.
A secondary failure mode is Thought verbosity: models sometimes write excessively long reasoning traces that fill the context window before any useful tool calls occur. OpenAI's GPT-4 function-calling documentation (released June 2023) specifically recommended keeping Thought traces under 150 tokens per step to avoid this.
Most ReAct deployments set a hard step limit between 5 and 15 iterations. Without it, ambiguous tasks can cause token-expensive retrieval loops that return no better answer than stopping at step 3.
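A step budget is easy to combine with a check for repeated queries. The sketch below is one possible guard, not a standard API: `run_with_budget`, `next_query`, the `max_repeats` threshold, and the status strings are hypothetical names chosen for illustration.

```python
from typing import Callable, Optional

def run_with_budget(
    next_query: Callable[[list[str]], Optional[str]],
    max_steps: int = 10,
    max_repeats: int = 2,
) -> dict:
    """Drive an agent until it stops, exhausts its budget, or starts looping."""
    history: list[str] = []
    for _ in range(max_steps):
        query = next_query(history)
        if query is None:                        # agent decided it can answer
            return {"status": "answered", "steps": len(history)}
        normalized = " ".join(query.lower().split())
        if history.count(normalized) >= max_repeats:
            # Same query already issued max_repeats times: stop and escalate.
            return {"status": "escalate_loop", "steps": len(history)}
        history.append(normalized)
    return {"status": "escalate_budget", "steps": len(history)}

# A stuck agent that keeps re-issuing the same search with cosmetic variation.
def stuck(history: list[str]) -> str:
    return "Capital of  Atlantis" if len(history) % 2 else "capital of atlantis"

print(run_with_budget(stuck))
```

Normalising case and whitespace before comparing catches near-identical queries that a plain string comparison would miss. Here the guard trips after two steps, well before the ten-step budget is spent.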
ReAct is well-suited to tasks that are single-domain, tool-heavy, and short: Q&A over a knowledge base, database lookups, API chaining where each call informs the next. It is poorly suited to tasks that require long-horizon planning across many parallel sub-problems, because its strictly sequential loop cannot parallelize work. If a task could benefit from simultaneously checking five different data sources and merging results, ReAct will process them one at a time — wasting latency and tokens.
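The latency cost of a strictly sequential loop can be seen in a toy comparison. This sketch assumes five independent data sources with identical simulated latency; `fetch` and the source names are hypothetical, and the concurrent version is shown only for contrast, since it is not something a plain ReAct loop does.

```python
import asyncio
import time

async def fetch(source: str, delay: float = 0.05) -> str:
    """Hypothetical data source; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return f"{source}: ok"

async def sequential(sources: list[str]) -> list[str]:
    # One call at a time, as in a strictly sequential ReAct loop.
    return [await fetch(s) for s in sources]

async def concurrent(sources: list[str]) -> list[str]:
    # All calls in flight at once; results come back in input order.
    return list(await asyncio.gather(*(fetch(s) for s in sources)))

def timed(coro) -> tuple[list[str], float]:
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

sources = ["crm", "billing", "tickets", "docs", "logs"]
seq_result, seq_time = timed(sequential(sources))
con_result, con_time = timed(concurrent(sources))
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

Both versions return the same results, but the sequential one pays the per-call latency five times over. In a real agent each sequential step also carries an LLM call, so the gap is usually wider than this toy suggests.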
Inspect a simulated ReAct trace and identify where the loop breaks down.
You'll receive a simulated ReAct trace from an agent that failed to answer a multi-hop question correctly. Work with the AI tutor to:
Separating planning from execution — how LangChain's 2023 framework experiments revealed what sequential loops cannot do.
In April 2023, LangChain's blog published benchmark results comparing a standard ReAct agent against a new "Plan-and-Execute" agent on the BabyAGI-style task set. For tasks requiring more than eight sequential steps, ReAct's per-step token overhead caused it to exceed GPT-4's context window before finishing. Plan-and-Execute solved this by separating concerns: a planner LLM call produced the full ordered task list upfront, and a smaller executor LLM (or even GPT-3.5) carried out each step. Tasks that stumped ReAct at step 9 were completed by Plan-and-Execute in the same token budget because the executor calls were stateless — they didn't carry the growing Thought history. On a web-research task involving 12 sequential lookups, Plan-and-Execute was 40% cheaper in tokens while matching accuracy.
Plan-and-Execute splits cognition into two strictly separated phases. In Phase 1 (Planning), a capable planner LLM receives the user's goal and generates a complete, ordered list of sub-tasks. Each sub-task is a self-contained instruction — no tool calls happen yet. The planner may be a large, expensive model optimised for reasoning.
In Phase 2 (Execution), a separate executor agent (which may be a smaller, cheaper model) works through the plan step by step. Because the executor's context for each step contains only that step's instruction plus its immediate inputs, context window pressure is vastly reduced. Completed step outputs can be stored in a structured memory object and retrieved selectively rather than keeping the entire history in-context.
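The two phases can be sketched as follows. This is a minimal illustration, not LangChain's implementation: `Step`, `plan`, `execute`, and the toy population figures are hypothetical, and both "LLM calls" are replaced with fixed functions so the example runs on its own.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    id: str
    instruction: str
    needs: list[str] = field(default_factory=list)  # ids of earlier steps this one depends on

def plan(goal: str) -> list[Step]:
    """Stand-in for the planner LLM call: returns a fixed, ordered plan."""
    return [
        Step("s1", "look up the population of city A"),
        Step("s2", "look up the population of city B"),
        Step("s3", "add the two populations", needs=["s1", "s2"]),
    ]

def execute(step: Step, inputs: dict[str, int]) -> int:
    """Stand-in for the executor: sees only this step and its declared inputs."""
    toy_data = {"s1": 120_000, "s2": 80_000}     # hypothetical lookup results
    if step.needs:
        return sum(inputs.values())
    return toy_data[step.id]

def run(goal: str) -> dict[str, int]:
    memory: dict[str, int] = {}                  # structured store of completed outputs
    for step in plan(goal):
        inputs = {k: memory[k] for k in step.needs}  # selective retrieval, not full history
        memory[step.id] = execute(step, inputs)
    return memory

print(run("combined population of A and B"))
```

The executor never receives the whole history: `s3` is handed only the outputs of `s1` and `s2`. That selective retrieval is what keeps the per-step context small as the plan grows.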
Plan-and-Execute lets you route planning to an expensive frontier model and execution to a cheap fast model — cutting inference cost without sacrificing plan quality.
A naive Plan-and-Execute system generates the plan once and executes it rigidly. This fails when real-world tool outputs differ from what the planner anticipated — a common situation when browsing live web data or querying APIs whose schemas change. The solution is re-planning triggers: after each step, a lightweight classifier (or a simple heuristic on the executor's output) checks whether the result sufficiently matches the plan's expectation. If not, the planner is invoked again on the remaining steps.
Microsoft's AutoGen framework (released September 2023) implemented this as a "feedback loop to planner" pattern. In their published evals on ScienceWorld — a simulated lab environment — agents with re-planning triggers outperformed static plan agents by 22 percentage points on tasks that involved unexpected tool failures. Re-planning added roughly 15% token overhead per trial but eliminated the catastrophic failures caused by executing a stale plan on changed world state.
Static plans are fast and cheap. Re-planning is more robust. Most production teams implement re-planning only on explicit execution failures — not after every step — to keep costs reasonable.
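A failure-triggered re-planning loop might look like the sketch below. It assumes the cheaper policy described above (re-plan only on explicit execution failure); `run_with_replanning`, the `{"ok": ...}` result shape, and the retired "v1" API scenario are hypothetical and not taken from AutoGen or LangChain.

```python
from typing import Callable

Plan = list[str]

def run_with_replanning(
    plan: Plan,
    execute: Callable[[str], dict],
    replan: Callable[[str, Plan], Plan],
    max_replans: int = 2,
) -> list[str]:
    """Execute steps in order; call the planner again only when a step fails."""
    log: list[str] = []
    queue = list(plan)
    replans = 0
    while queue:
        step = queue.pop(0)
        result = execute(step)
        if result["ok"]:
            log.append(f"done: {step}")
            continue
        if replans == max_replans:
            log.append(f"gave up: {step}")
            break
        replans += 1
        log.append(f"replan after: {step}")
        queue = replan(step, queue)              # planner rewrites the remaining steps
    return log

# Hypothetical tools: the v1 API has been retired, so the original plan is stale.
def execute(step: str) -> dict:
    return {"ok": "v1" not in step}

def replan(failed: str, remaining: Plan) -> Plan:
    return [failed.replace("v1", "v2")] + remaining

print(run_with_replanning(["fetch via v1 api", "summarise"], execute, replan))
```

The `max_replans` cap matters as much as the trigger itself: without it, a planner that keeps proposing the same failing step would loop in the same way an unbudgeted ReAct agent does.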
The architecture assumes the planner can enumerate all necessary steps before execution begins. This assumption fails on exploratory tasks where the correct next step depends entirely on what was discovered in the previous one — for example, debugging an unknown codebase. In such cases, the planner either over-specifies (producing irrelevant steps) or under-specifies (leaving gaps the executor cannot fill). ReAct handles these tasks better because it dynamically decides the next action after reading each observation.
Plan-and-Execute also introduces coordination overhead: the planner and executor must share state somehow. In LangChain's implementation this was done via a shared memory dict, which became a point of failure when executor outputs contained unexpected data types that downstream steps were not prepared to parse.
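One way to reduce this class of failure is to validate values as they enter the shared store rather than when a later step reads them. The sketch below is an assumption-laden illustration, not LangChain's code: `Slot` and `SharedState` are hypothetical names, and the check is a plain `isinstance` test.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Slot:
    name: str
    type_: type

class SharedState:
    """Planner/executor hand-off store that rejects values of the wrong type."""

    def __init__(self, slots: list[Slot]):
        self._schema = {s.name: s.type_ for s in slots}
        self._values: dict[str, Any] = {}

    def put(self, name: str, value: Any) -> None:
        expected = self._schema[name]            # KeyError on an undeclared slot
        if not isinstance(value, expected):
            raise TypeError(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
        self._values[name] = value

    def get(self, name: str) -> Any:
        return self._values[name]

state = SharedState([Slot("price", float)])
state.put("price", 19.99)
try:
    state.put("price", "19.99 USD")   # executor returned a string instead of a number
except TypeError as err:
    print(err)
```

Failing at write time pins the error on the step that produced the bad value, instead of letting it surface several steps later in an unrelated parser.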
Write and critique a Plan-and-Execute task plan for a real-world scenario.
You'll design a Plan-and-Execute plan for a specific complex task, then defend and refine it with the tutor. Work through:
Orchestrators, sub-agents, and the coordination overhead that made AutoGen's multi-agent conversations a landmark — and a warning.
Microsoft Research released AutoGen in September 2023. Its core contribution was a conversation-based multi-agent framework where agents could message each other in structured dialogues — not just receive instructions from a single orchestrator. In their evaluation on the HumanEval coding benchmark, a two-agent setup (one "Coder" agent, one "Critic" agent that reviewed and corrected output) achieved 89.9% pass@1 — compared to 67.0% for a single GPT-4 agent with no critic. The Coder wrote code; the Critic ran it, read the error, and sent back a correction request. The loop ran a maximum of ten rounds. The gain came entirely from the Critic's ability to isolate bugs the Coder's self-review missed — demonstrating that role specialisation inside a multi-agent system meaningfully outperformed a single generalist agent on well-defined tasks.
The most common multi-agent pattern is hierarchical orchestration: a central orchestrator LLM decomposes a goal into sub-tasks and delegates each to a specialised sub-agent. Each sub-agent has its own tool set, system prompt, and memory scope. The orchestrator receives each sub-agent's output and decides whether to pass it to another sub-agent, revise the delegation, or synthesise a final answer.
This pattern appeared in production at Cognition AI's Devin (announced March 2024), where a planning agent maintained the high-level engineering task while separate sub-agents handled shell execution, file editing, and browser navigation. By isolating tool sets per sub-agent, the system prevented a browser-navigation error from corrupting the shell execution state — a cross-contamination risk that flat single-agent designs face.
Sub-agents with scoped tool sets cannot accidentally invoke tools outside their domain. This reduces the blast radius of hallucinated tool calls — a critical safety property in agentic systems with write access to real systems.
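Tool scoping can be enforced structurally rather than by prompt. The following is a minimal sketch of that idea and makes no claim about how Devin or any other product is built: `SubAgent`, `orchestrate`, and the toy `run` and `open` tools are hypothetical.

```python
from typing import Callable

class SubAgent:
    """A worker that can only call the tools it was constructed with."""

    def __init__(self, name: str, tools: dict[str, Callable[[str], str]]):
        self.name = name
        self._tools = tools

    def call(self, tool: str, arg: str) -> str:
        if tool not in self._tools:
            # A hallucinated or out-of-scope tool name is refused, not executed.
            raise PermissionError(f"{self.name} has no tool '{tool}'")
        return self._tools[tool](arg)

# Hypothetical tools; real ones would touch a shell or a browser.
shell = SubAgent("shell", {"run": lambda cmd: f"ran `{cmd}`"})
browser = SubAgent("browser", {"open": lambda url: f"opened {url}"})

def orchestrate(tasks: list[tuple[SubAgent, str, str]]) -> list[str]:
    """Delegate each (agent, tool, argument) task and collect the outcomes."""
    results = []
    for agent, tool, arg in tasks:
        try:
            results.append(agent.call(tool, arg))
        except PermissionError as err:
            results.append(f"refused: {err}")
    return results

print(orchestrate([
    (shell, "run", "ls"),
    (browser, "open", "https://example.com"),
    (browser, "run", "rm -rf build"),   # browser agent tries a shell tool
]))
```

The third task is refused because the browser agent's registry simply does not contain `run`. The orchestrator sees the refusal as an ordinary result and can decide whether to re-delegate.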
AutoGen's alternative pattern is peer-to-peer dialogue: agents message each other directly without a central orchestrator. This is effective when two specialised agents need to iteratively negotiate — like the Coder/Critic loop — but it introduces a significant risk: sycophantic convergence. Two agents can agree on an incorrect answer faster than they converge on a correct one if neither has a ground-truth verification step. The original AutoGen paper documented this: when both agents were the same base model, agreement rates were high but accuracy gains over a single agent were negligible.
The fix is asymmetric agent roles: one agent must have verifiable ground truth (e.g., the Critic actually executes the code and reads error output) rather than reasoning from the same prior knowledge as the Coder. This is why AutoGen's coding demos outperformed its QA demos — code execution provides external verification that pure language tasks do not.
Two LLM agents debating a factual claim will often converge on confident-sounding agreement that is wrong. Peer-to-peer debate works only when at least one agent has access to external verification — not just different reasoning about the same training data.
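The asymmetric-role fix can be illustrated with a toy Coder/Critic loop in which the Critic's verdict comes from running the code. This is a sketch, not AutoGen's API: `critic`, `scripted_coder`, and `coder_critic_loop` are hypothetical, the Coder is a scripted stand-in for an LLM, and the bare `exec` is for demonstration only.

```python
from typing import Callable, Optional

def critic(source: str) -> Optional[str]:
    """Run the candidate against a test; return an error report, or None if it passes."""
    namespace: dict = {}
    try:
        exec(source, namespace)       # toy only: never exec untrusted code unsandboxed
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
    except Exception as err:
        return f"{type(err).__name__}: {err}"
    return None

def scripted_coder(feedback: Optional[str]) -> str:
    """Stand-in for the Coder LLM: first draft is buggy, the revision is fixed."""
    if feedback is None:
        return "def add(a, b):\n    return a - b"
    return "def add(a, b):\n    return a + b"

def coder_critic_loop(coder: Callable[[Optional[str]], str], max_rounds: int = 10):
    feedback = None
    for round_no in range(1, max_rounds + 1):
        candidate = coder(feedback)
        feedback = critic(candidate)  # verdict comes from execution, not opinion
        if feedback is None:
            return candidate, round_no
    return None, max_rounds

code, rounds = coder_critic_loop(scripted_coder)
print(rounds)
```

The loop ends in round 2 because the failing assertion gives the Coder concrete evidence to act on. If `critic` were another model reasoning about the code instead of executing it, nothing would stop both agents from agreeing that the first draft looked fine.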
Every agent boundary introduces latency (an additional LLM call), cost (tokens for context-setting in each sub-agent prompt), and failure surface (message serialisation errors, schema mismatches between agents). Anthropic's published guidance on multi-agent systems (2024) recommends starting with a single agent and adding agent boundaries only when a specific bottleneck — context length, tool set isolation, or specialisation — can be directly solved by the split. Adding agents "in case it helps" consistently increases both cost and failure rate without proportional gains.
Evaluate a proposed multi-agent architecture and determine whether each boundary is justified.
You'll receive a description of a proposed multi-agent system. Work with the tutor to systematically evaluate it:
This lesson examines the key principles, real-world applications, and implications for practitioners working in this domain.
Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.
The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.
Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.
As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.