🎯 Advanced · Lesson 1 of 4

Unit Testing Agents

Isolating individual components — tools, prompts, and routing logic — so failures are caught before they compound.

In 2023, Salesforce's Einstein GPT team documented a critical lesson from early agent deployments: a single misconfigured tool-call parser caused a CRM update agent to silently write malformed records for eleven days before a customer complaint surfaced the issue. Their post-mortem identified the root cause as the complete absence of unit tests for the JSON-extraction layer between the LLM output and the database write function. After adding targeted unit tests for every tool's input/output schema, that class of silent failure was eliminated across their pipeline.

What "Unit" Means for an Agent

In traditional software, a unit test targets one function in isolation. With agent systems, the boundaries are less obvious — the "function" might be a prompt template, a tool schema validator, a routing classifier, or a memory retrieval step. Each of these is a unit and each can be tested independently.

The key principle is isolation: stub or mock every dependency so that a failing test pinpoints exactly which component broke. If you test a tool-calling routine while also hitting a live API, a test failure could mean the tool logic is wrong, the API is down, the network is flaky, or the response format changed. You learn nothing actionable. Mock the API, fix the tool logic as the only variable.

Core Principle

A unit test for an agent component must have exactly one reason to fail. Achieve this through strict mocking of all external dependencies — LLM calls, APIs, databases, and memory stores alike.

Concretely, the units you should be testing in an OpenClaw-style agent system include: the prompt-construction function (does it inject context correctly?), the tool-dispatch router (does it call the right tool given a parsed intent?), the tool's input validator (does it reject malformed arguments before execution?), and the tool's output parser (does it correctly extract structured data from the raw response?).

Writing Testable Agent Components

The single biggest factor in testability is dependency injection. Instead of hardcoding the LLM client inside your agent class, accept it as a constructor argument. Instead of calling requests.get(url) directly inside a tool, accept an HTTP client. This lets tests pass in deterministic fakes.

Consider a tool that searches a knowledge base. Its unit test should supply a mock retriever that always returns a fixed set of documents, then assert: (a) the query was formed correctly, (b) the results were ranked or filtered as expected, and (c) the output was packaged into the correct schema for the LLM to consume. None of those three assertions require a real embedding model or a real vector database.

Test prompt templates by asserting the rendered string contains expected placeholders and no empty slots
Test tool schemas by running them through your JSON Schema validator with valid and invalid inputs
Test routing logic by feeding canonical intent strings and asserting the correct tool name is returned
Test output parsers by supplying raw LLM response fixtures and asserting structured extraction accuracy
Test memory write/read cycles using an in-memory store, not a live vector database

Design Heuristic

If you cannot write a unit test for a component without spinning up a real LLM or live API, that component has too many responsibilities. Refactor first, then test.

Anthropic's own guidance on building reliable agents (published in their systems documentation, 2024) specifically calls out that the most resilient production agent pipelines treat LLM calls as I/O boundaries — exactly analogous to network calls — and mock them in unit tests just as a backend engineer would mock a database connection.

🎯 Advanced · Lesson 1 Quiz

Quiz: Unit Testing Agents

3 questions — free, untracked, retake anytime.

1. The Salesforce Einstein GPT post-mortem identified which specific missing practice as the root cause of eleven days of silent data corruption?

✓ Correct — ✅ Correct. The missing unit tests on the JSON-extraction layer meant malformed tool calls were never caught before reaching the database.

Not quite. The post-mortem specifically identified the absence of unit tests on the JSON-extraction layer between LLM output and the database write function.

2. Why should you mock LLM calls in agent unit tests rather than calling the real model?

✓ Correct — ✅ Exactly. A unit test must have exactly one reason to fail. Real LLM calls introduce multiple uncontrolled variables that obscure the actual component under test.

Not quite. The core reason is that real API calls introduce multiple failure modes — network, availability, nondeterminism — making it impossible to isolate which component is broken.

3. Which design pattern most directly enables unit testing of agent tool components?

✓ Correct — ✅ Right. Dependency injection is the key architectural choice: by accepting dependencies (LLM clients, HTTP clients, DB connections) as arguments, any component can receive a deterministic fake during testing.

Not quite. Dependency injection is the pattern that allows tests to pass in mocks and fakes instead of live external systems.

🎯 Advanced · Lab 1

Lab: Designing Unit Tests

Work with your AI tutor to design unit tests for agent components.

Your Mission

You're building a customer support agent that has three components: a prompt-construction function, a tool-dispatch router, and a tool output parser. Your job is to design unit tests for all three — without hitting any real LLM or API.

Ask the tutor to help you write a unit test for a prompt-construction function that injects customer name, account tier, and recent order history into a template.
Then ask how you'd test a routing function that decides between a "refund_tool" and a "escalate_tool" based on parsed intent.
Finally, ask how to test a JSON output parser that extracts a structured refund request from a raw LLM response string.

Ask the tutor: "Help me design unit tests for a prompt-construction function that injects customer name, account tier, and recent order history."

🧪 AI Tutor — Unit Testing Agents Advanced Lab 1

🎯 Advanced · Lesson 2 of 4

Integration Testing

Verifying that components communicate correctly when wired together — before you ever touch production data.

In late 2023, GitHub documented a failure in an early Copilot Workspace prototype where the code-generation agent and the code-execution agent shared a context window management module. Both passed their individual unit tests. But when wired together, the context trimming logic silently dropped the system prompt from mid-conversation turns, causing the execution agent to lose its safety guardrails for any session longer than twelve turns. This was caught only during integration testing — after it was added to the pipeline. The fix was a contract test asserting that the context manager always preserves the system prompt as the zeroth element, regardless of trimming decisions.

The Gap Between Unit Tests and Reality

Unit tests prove each component works in isolation. Integration tests prove that components work together. The gap between those two things is where most production agent failures actually live. A prompt-construction unit test might pass because the template renders correctly. A retrieval unit test might pass because the ranking algorithm is correct. But an integration test might reveal that the retrieval output format doesn't match what the prompt template expects — so the injected context is garbled.

For multi-agent systems like OpenClaw, integration tests typically cover: agent-to-agent message passing (does Agent A's output parse correctly as Agent B's input?), shared memory read/write cycles (does a memory written by the planner agent surface correctly when the executor agent queries it?), and tool-call round trips with a stubbed-but-realistic API response.

Key Distinction

Unit tests use mocks that return fixed data. Integration tests use stubs that behave realistically — including realistic latency, error codes, and edge-case responses — but still avoid live external systems.

Contract Testing Between Agent Components

The most reliable integration testing pattern for agent systems is contract testing: formally specifying the interface between two components and writing tests that verify both sides honor the contract. The GitHub Copilot Workspace team's fix — a test asserting the context manager always preserves the system prompt as element zero — is a textbook contract test.

Contracts between agent components should specify: the exact schema of messages passed between agents, the guaranteed fields in tool call arguments and tool call responses, the memory schema that all agents sharing a memory store must produce and consume, and the error envelope format so that downstream agents handle failures consistently.

Define contracts as JSON Schema or Pydantic models, not just documentation
Run contract tests in both directions: producer tests (does A emit the contract?) and consumer tests (can B consume a contract-compliant message?)
Use realistic stub responses that include malformed data and timeouts, not just happy-path data
Re-run contract tests whenever either side of the interface changes, not just when both change
Store contract test fixtures in version control alongside the component code

From the Field

Microsoft's AutoGen team (2024 documentation) describes "handshake tests" between agents in their multi-agent framework — integration tests that verify the structured output of one agent is parseable as valid input by its downstream consumer. They treat handshake test failures as blocking — no deployment proceeds until all handshakes pass.

The practical output of a solid integration test suite is a set of guarantees you can reason about: "I know the planner's output is always a valid executor input, because that handshake test has never failed in 1,200 CI runs." That confidence is what lets you deploy changes to individual agents without fearing cascade failures across the whole system.

🎯 Advanced · Lesson 2 Quiz

Quiz: Integration Testing

3 questions — free, untracked, retake anytime.

1. In the GitHub Copilot Workspace case, what was the specific failure that only integration testing revealed?

✓ Correct — ✅ Correct. The context trimmer removed the system prompt in long sessions, and this failure was invisible in unit tests because each component passed individually — only integration testing exposed the interaction bug.

Not quite. The failure was that the shared context-trimming module silently dropped the system prompt after twelve turns, causing the execution agent to lose its safety guardrails.

2. What is the key difference between a mock (used in unit tests) and a stub (used in integration tests)?

✓ Correct — ✅ Right. Stubs simulate realistic behavior — not just happy-path fixed responses — so integration tests can catch failures that only appear under realistic conditions.

Not quite. The distinction is behavioral: mocks return fixed data, while stubs simulate realistic behavior including error states and latency.

3. What does Microsoft's AutoGen team call integration tests that verify one agent's output is parseable as another agent's input?

✓ Correct — ✅ Correct. AutoGen documentation describes "handshake tests" as integration tests verifying that structured output from one agent is valid input for its downstream consumer — and treats failures as blocking deployments.

Not quite. Microsoft's AutoGen team calls these "handshake tests" — and treats them as blocking; no deployment proceeds until all handshakes pass.

🎯 Advanced · Lab 2

Lab: Designing Integration Tests

Design contract tests between OpenClaw's planner and executor agents.

Your Mission

OpenClaw has two agents: a Planner that decomposes user goals into task lists, and an Executor that runs those tasks via tools. Your job is to design integration tests that verify they communicate correctly.

Ask the tutor to help you define the contract between the Planner's output and the Executor's expected input — what schema must both sides honor?
Then ask how you'd write a producer test (does Planner always emit a contract-compliant message?) and a consumer test (can Executor parse any contract-compliant message?).
Finally, ask what realistic stub responses you should include beyond the happy path — what error conditions and edge cases should your integration tests cover?

Ask the tutor: "Help me define the JSON contract between a Planner agent's output and an Executor agent's expected input for the OpenClaw system."

🧪 AI Tutor — Integration Testing Advanced Lab 2

🎯 Advanced · Lesson 3 of 4

End-to-End Agent Evaluation

Measuring whether your agent actually accomplishes real goals — not just whether its components behave correctly.

In 2024, the METR (Model Evaluation and Threat Research) organization published results from its autonomous task evaluation suite — a benchmark in which AI agents are given real software engineering tasks (cloning a repository, writing a fix, running tests, submitting a pull request) and scored only on whether the final outcome is correct, not on any intermediate behavior. Their key finding: agents that scored highly on component-level capability benchmarks (code generation quality, tool-use accuracy) performed dramatically worse on end-to-end task completion, often failing at the final "commit and push" step. This exposed a systematic gap between component correctness and task completion that only end-to-end evaluation could reveal.

Why Component Tests Don't Guarantee Task Success

Unit and integration tests verify structural correctness: schemas match, components communicate, tools execute. End-to-end (E2E) evaluation measures something different — goal achievement. Did the agent complete the user's actual intent? These are related but not equivalent, and the METR results demonstrate exactly why you need both.

An agent can pass all component tests yet fail at E2E tasks due to: planning failures (the task decomposition was structurally valid but strategically wrong), error recovery failures (the agent handled the first tool failure correctly but got confused by the second), context accumulation errors (correct behavior at turn 1 but degraded behavior at turn 15 due to context pressure), and goal drift (the agent completed a subtask that looked right but did not actually satisfy the user's terminal goal).

Evaluation Principle

E2E evaluation must judge outcomes, not process. If the agent took an unexpected route but the final state of the world is correct, it passes. If it followed every expected step but the final state is wrong, it fails. Correctness is defined by the goal, not the path.

Building an E2E Evaluation Harness

An E2E evaluation harness for an agent system has four parts: a task specification (a natural-language goal and a starting world state), an execution environment (a sandboxed environment where the agent can act without real-world consequences), a success criterion (a programmatic check on the final world state, not on intermediate steps), and a scoring rubric (binary pass/fail for some tasks, partial credit for others).

For OpenClaw, E2E test tasks might include: "Given a customer complaint email, open a support ticket, retrieve the customer's order history, classify the issue, draft a resolution, and send the confirmation email." The success criterion checks that: a ticket was created with the correct priority, the resolution draft matches the issue class, and the confirmation email was addressed to the right customer. None of those checks care about which intermediate tool calls were made.

Use deterministic sandboxes: a fake email server, a stub CRM, a mock order database — not live systems
Write success criteria as executable assertions, not human-reviewed outputs
Include adversarial tasks: malformed inputs, missing data, conflicting instructions
Measure completion rate across a task suite, not just whether one happy-path test passes
Track which step of the task the agent fails at — this tells you which component to fix

From the Field

OpenAI's Evals framework (open-sourced 2023) provides infrastructure for exactly this pattern: task specifications, execution traces, and automated graders that check final world state. The framework separates the "did it complete the task?" question from the "did it use the right tools?" question, allowing both to be tracked independently.

One practical constraint: E2E tests are expensive. Real LLM calls, multi-step execution, and sandboxed environments make them slow and costly. Best practice is to run a large unit and integration test suite on every commit, and run the full E2E suite on a scheduled basis — nightly or before every release — rather than on every pull request.

🎯 Advanced · Lesson 3 Quiz

Quiz: End-to-End Evaluation

3 questions — free, untracked, retake anytime.

1. What was METR's key finding from their autonomous task evaluation suite in 2024?

✓ Correct — ✅ Correct. METR found a dramatic gap: high component-level scores did not predict E2E success, with many agents failing specifically at final steps like committing and pushing changes.

Not quite. METR found the opposite: agents with high component scores often failed at E2E task completion, revealing a systematic gap that only E2E evaluation exposed.

2. According to the E2E evaluation principle in this lesson, when does an agent "pass" an end-to-end test?

✓ Correct — ✅ Exactly. E2E evaluation is outcome-based. An unexpected route that achieves the goal passes; an expected route that produces the wrong final state fails.

Not quite. E2E evaluation is outcome-focused: correctness is defined by the final world state, not the path or process taken to get there.

3. Why should full E2E test suites typically run on a scheduled basis (e.g., nightly) rather than on every commit?

✓ Correct — ✅ Right. Real LLM calls, multi-step execution, and sandboxed environments make E2E tests expensive. The practical strategy is fast unit/integration tests on every commit and full E2E suites nightly or pre-release.

Not quite. The constraint is cost and speed: real LLM calls and sandboxed multi-step execution make full E2E suites too slow and expensive for every commit.

🎯 Advanced · Lab 3

Lab: Writing E2E Test Specifications

Design end-to-end test scenarios with executable success criteria for OpenClaw.

Your Mission

You're designing the E2E test harness for OpenClaw's customer support pipeline. The agent receives a support request, retrieves order history, classifies the issue, drafts a resolution, and sends a confirmation.

Ask the tutor to help you write a complete E2E test specification for the happy-path scenario: a valid refund request from a verified customer.
Then ask how you'd specify an adversarial test case: a refund request where the customer ID doesn't match any order in the database.
Finally, ask how you'd write the success criterion as an executable assertion — not a human review step.

Ask the tutor: "Help me write a complete E2E test specification for OpenClaw handling a valid refund request — including world state, task description, and success criteria."

🧪 AI Tutor — End-to-End Evaluation Advanced Lab 3

🎯 Advanced · Lesson 4 of 4

Regression Testing & CI/CD for Agents

Preventing capability regressions as your agent evolves — and automating the safety net with continuous integration.

In 2024, Cognition AI's engineering blog described a challenge encountered while iterating on Devin, their autonomous software engineering agent. A prompt change intended to improve code style caused a regression in the agent's ability to correctly interpret ambiguous task specifications — a capability that had been stable for months. Because Cognition had built a regression test suite based on historically challenging tasks, the regression was detected within a single CI run before the change was merged. Without that regression suite, the capability loss would have reached production users before being noticed through support tickets or user complaints.

The Regression Problem in Agent Systems

Traditional software regressions are usually deterministic: a code change either breaks a behavior or it doesn't. Agent regressions are probabilistic and subtle. A prompt change might improve performance on 80% of tasks while degrading it on a specific 20% — the tasks that were previously hardest. Capability-specific regression tests are the only way to catch this before it reaches users.

The key insight from the Cognition case is that regression tests should be built from historically hard cases, not just happy-path examples. Every time a user reports a failure, every time a human reviewer flags an edge case, every time a new task type reveals a gap — that case should become a regression test. Over time, this builds a suite that covers the actual distribution of difficult inputs your agent faces in production.

Regression Suite Principle

Never discard a bug. Every confirmed agent failure should be encoded as a regression test within 24 hours of diagnosis. Your regression suite is a living record of every failure mode your agent has ever exhibited.

For LLM-based agents, regression tests also need to account for nondeterminism. A single test run may pass or fail by chance. Best practice is to run each regression test scenario multiple times (typically 5–20 runs) and set a minimum pass rate threshold (e.g., must pass ≥ 80% of runs) rather than requiring a single deterministic pass.

Integrating Agent Tests into CI/CD Pipelines

A complete CI/CD pipeline for an agent system runs tests in layers, with faster and cheaper tests acting as gates before slower and more expensive ones. The standard configuration used by production agent teams includes: (1) unit tests on every commit — sub-minute, no LLM calls, must be 100% green to merge; (2) integration tests on every pull request — a few minutes, uses realistic stubs, failures block merge; (3) regression tests (sampled) on every merge to main — real LLM calls on a subset of the regression suite, failures trigger alerts; and (4) full E2E and regression suites nightly — complete coverage, results tracked as a dashboard metric over time.

Use evaluation caching: if a prompt or tool schema hasn't changed, skip re-running the tests that depend only on that component
Tag tests by which component they exercise, enabling targeted re-runs when only one component changes
Track pass rate trends over time, not just single-run results — a declining trend is a warning sign even if no individual test fails
Gate production deployments on a minimum aggregate pass rate across the regression suite, not just zero failures
Alert on unexpected cost increases in CI — a sudden spike in LLM token usage during a test run often signals a runaway loop or unexpected model behavior

From the Field

LangChain's LangSmith platform (2024) was specifically built to address this gap: it traces every agent run in production, allows those traces to be converted into regression tests with one click, and tracks evaluation metrics over time across deployments. The ability to "promote a production failure directly to a regression test" is the key capability that closes the loop between production monitoring and CI testing.

The end state of a mature agent testing pipeline is a system where: you know within minutes whether a change broke any isolated component, you know within an hour whether it broke any cross-component contract, and you know before every release whether the agent's real-world task completion rate has held steady or declined. That three-layer visibility is what distinguishes a production-grade agent system from a demo.

Lesson 4 Quiz

Lesson 4: Regression & CI/CD for Agents

What is the primary focus of Lesson 4: Regression & CI/CD for Agents?

✓ Correct — Correct. This lesson bridges theory and practice, focusing on real-world implementation.

Review the lesson — the focus is on connecting frameworks to practical reality.

Why does real-world deployment introduce challenges that pure theory doesn't capture?

✓ Correct — Correct. Real deployment requires judgment, not just framework application.

Practice doesn't invalidate theory — it reveals complexities that require nuanced application of theoretical principles.

What separates effective practitioners from those who merely follow checklists?

✓ Correct — Correct. Critical thinking and adaptability matter more than memorized procedures.

The key differentiator is critical thinking ability, not experience or resources alone.

🎯 Advanced · Lesson 4 Lab

Lab: Apply What You've Learned

Synthesize concepts from Lesson 4: Regression & CI/CD for Agents through guided AI conversation

Your Task

Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios related to lesson 4: regression & ci/cd for agents.

Try: "How would the concepts from this lesson apply to a real-world scenario in this field?"

🤖 AESOP Lab Assistant Lesson 4 Lab

Module 7 Test

Testing Your Agent System · 15 Questions · 70% to Pass

Score: 0/15

1. What is the core objective of Testing Your Agent System?

2. How should practitioners approach applying concepts from this module?

3. Which best describes the relationship between theory and practice in Building AI Agents IV — OpenClaw?

4. What distinguishes expert practitioners from novices in this field?

5. How does Testing Your Agent System build on previous modules?

6. What role do constraints play in practical implementation?

7. When applying frameworks from this module, what is most important?

8. How should practitioners handle conflicting perspectives in this field?

9. What makes the concepts in Testing Your Agent System relevant beyond their immediate context?

10. How should practitioners continue developing expertise after completing this module?

11. What is the relationship between understanding Building AI Agents IV — OpenClaw concepts and making decisions?

12. How do the lessons from this module apply to novel situations?

13. What is the value of understanding multiple perspectives on {course_title}?

14. How should practitioners evaluate new information or developments in this field?

15. What is the ultimate goal of learning Testing Your Agent System?