🎯 Advanced · Lesson 1 of 4

Measuring Task Completion

What does it actually mean for an agent to succeed — and how do you measure it precisely?

In 2024, Salesforce published benchmark results for their Agentforce platform showing an 83% task completion rate on customer service workflows. The remaining 17% required human escalation. What Salesforce made clear in their technical documentation was that this headline number masked enormous variation: simple refund requests completed at 96%, while complex multi-system account restructuring tasks completed at only 61%. A single aggregate number had nearly hidden the breakdown that mattered most to deployment decisions.

Why Completion Rate Alone Deceives

Task completion rate is the most natural first metric anyone reaches for when evaluating an AI agent. Did it finish the job? Yes or no? But a binary completion metric discards almost everything useful about agent quality. An agent that completes 90% of tasks but consistently fails on the 10% that are highest-stakes — security-sensitive actions, large financial transactions, irreversible operations — is far more dangerous than an agent completing 75% of tasks uniformly across complexity levels.

The first refinement is stratified completion rate: measuring success separately across task difficulty tiers, task types, and task domains. Google DeepMind's WebAgent research (2023) explicitly stratified their evaluation across easy, medium, and hard web navigation tasks, revealing that completion rates dropped from 90.8% on easy tasks to 32.1% on hard tasks — a gap that an aggregate number of ~65% would have obscured entirely.
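
To make the idea of stratification concrete, here is a minimal sketch that computes the aggregate rate and the per-tier rates side by side. The tier names and the run results are hypothetical illustration data, not figures from any benchmark.

```python
from collections import defaultdict

def stratified_completion(runs):
    """Return (aggregate_rate, {tier: rate}) from (tier, succeeded) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for tier, succeeded in runs:
        totals[tier] += 1
        successes[tier] += int(succeeded)
    per_tier = {tier: successes[tier] / totals[tier] for tier in totals}
    aggregate = sum(successes.values()) / sum(totals.values())
    return aggregate, per_tier

# Hypothetical results: 10 verified runs per difficulty tier.
runs = (
    [("easy", True)] * 9 + [("easy", False)] * 1
    + [("medium", True)] * 7 + [("medium", False)] * 3
    + [("hard", True)] * 3 + [("hard", False)] * 7
)

aggregate, per_tier = stratified_completion(runs)
print(f"aggregate: {aggregate:.0%}")
for tier, rate in per_tier.items():
    print(f"{tier}: {rate:.0%}")
```

The aggregate here sits in the low 60s while the hard tier is at 30%, which is the kind of gap a single headline number conceals.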

Key Distinction

There is a critical difference between attempted completion (the agent tried and produced output) and verified completion (the output was checked against a ground-truth standard). Many published benchmarks conflate the two. Always ask: how was success verified?

The second refinement is distinguishing between terminal success (the task reached the correct final state) and trajectory quality (the path taken to get there). An agent that achieves the right answer via ten unnecessary API calls, two retries, and one recoverable error has a different quality profile than one that achieves it cleanly in two steps. Benchmarks like GAIA (2023, Meta) explicitly measure both dimensions separately.
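
One simple way to record both dimensions is to keep terminal success and a trajectory score as separate fields. The sketch below assumes you have a known-good reference solution with a step count for each task; the RunRecord fields and the efficiency formula are illustrative choices, not a standard.

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    reached_goal_state: bool  # terminal success, verified against ground truth
    steps_taken: int          # tool or API calls the agent actually made
    reference_steps: int      # steps in a known-good reference solution (assumed available)
    retries: int
    recovered_errors: int

def trajectory_efficiency(run: RunRecord) -> float:
    """Ratio of reference steps to steps taken, capped at 1.0."""
    if run.steps_taken <= 0:
        return 0.0
    return min(1.0, run.reference_steps / run.steps_taken)

# Both runs reach the correct final state, but by very different paths.
clean = RunRecord(True, steps_taken=2, reference_steps=2, retries=0, recovered_errors=0)
messy = RunRecord(True, steps_taken=10, reference_steps=2, retries=2, recovered_errors=1)

for name, run in (("clean", clean), ("messy", messy)):
    print(name, run.reached_goal_state, round(trajectory_efficiency(run), 2))
```

Reporting the two values separately keeps a messy success from looking identical to a clean one.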

Partial Credit and Subgoal Tracking

Complex tasks decompose into subgoals. A travel booking agent might need to: (1) parse destination intent, (2) query available flights, (3) filter by user constraints, (4) select optimal option, (5) execute payment, (6) confirm and notify. A binary success/fail score throws away five-sixths of the diagnostic information. Subgoal completion tracking pinpoints where in the pipeline agents systematically fail.

The SWE-bench benchmark (2023, Princeton/Chicago) evaluates coding agents on real GitHub issues and explicitly tracks subgoal achievement — unit tests passing, linting passing, integration tests passing — rather than just whether the issue was "resolved." This granular structure revealed that many agents could pass unit tests but failed at integration, pointing directly at a specific capability gap.

Define subgoals before running evaluations — post-hoc subgoal definition introduces bias
Weight subgoals by their criticality to overall task success, not equally
Track whether failures cluster at the same subgoal across runs — systematic failure is more informative than random failure
Distinguish recoverable partial failures (agent detected and corrected) from undetected partial failures
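
The practices above can be sketched for the travel booking example as follows. The subgoal names and weights are hypothetical; the point is that they are fixed before any run is scored, that payment carries more weight than parsing, and that failures are counted by the subgoal where each run first broke down.

```python
from collections import Counter

# Hypothetical weights, defined before evaluation and summing to 1.0.
# Dict order doubles as pipeline order.
SUBGOAL_WEIGHTS = {
    "parse_intent": 0.10,
    "query_flights": 0.15,
    "filter_constraints": 0.15,
    "select_option": 0.15,
    "execute_payment": 0.35,
    "confirm_notify": 0.10,
}

def weighted_subgoal_score(completed):
    """Sum of the weights of the subgoals a run completed."""
    return sum(w for name, w in SUBGOAL_WEIGHTS.items() if name in completed)

def first_failure(completed):
    """First subgoal in pipeline order that was not completed, or None."""
    for name in SUBGOAL_WEIGHTS:
        if name not in completed:
            return name
    return None

# Hypothetical runs, each recorded as the set of verified subgoals.
runs = [
    {"parse_intent", "query_flights", "filter_constraints", "select_option"},
    {"parse_intent", "query_flights", "filter_constraints", "select_option"},
    {"parse_intent", "query_flights"},
    set(SUBGOAL_WEIGHTS),
]

failure_counts = Counter(f for f in map(first_failure, runs) if f is not None)
print([round(weighted_subgoal_score(r), 2) for r in runs])
print(failure_counts.most_common())
```

A count that piles up on one subgoal, as it does on payment here, is the systematic failure signal worth investigating first.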

Real Number to Know

On SWE-bench Verified (the curated subset released in 2024), the best open-source agents at time of release achieved roughly 23% full-task completion but over 60% completion of at least one critical subgoal. The subgoal data tells a completely different story about capability than the headline number.

Setting Completion Thresholds for Production

Evaluation metrics only become operational when paired with deployment thresholds: the minimum acceptable completion rate before an agent goes live, and the trigger point for pulling it back. These thresholds must be risk-calibrated. A medical records retrieval agent and a playlist recommendation agent can have very different acceptable failure rates even if they are technically similar systems.

Anthropic's published guidance on Claude deployment recommends establishing separate thresholds for different action consequence levels — read-only actions, reversible writes, and irreversible actions — and requiring near-perfect completion rates on the irreversible tier before any production deployment. This tiered threshold approach is now standard in serious agent evaluation frameworks.
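
A tiered threshold can be expressed as a small deployment gate. This is a sketch only: the tier names mirror the three consequence levels described above, and the numeric minimums are hypothetical placeholders that you would calibrate to your own risk tolerance.

```python
# Hypothetical minimum verified completion rates per consequence tier.
THRESHOLDS = {
    "read_only": 0.90,
    "reversible_write": 0.97,
    "irreversible": 0.999,
}

def deployment_gate(observed):
    """Return the tiers that miss their threshold; an empty list means pass.

    A tier with no measurement is treated as failing, so an untested
    tier can never slip through the gate.
    """
    failing = []
    for tier, minimum in THRESHOLDS.items():
        rate = observed.get(tier)
        if rate is None or rate < minimum:
            failing.append(tier)
    return failing

observed = {"read_only": 0.95, "reversible_write": 0.98, "irreversible": 0.995}
print(deployment_gate(observed))
```

In this example the agent clears the two lower tiers but is blocked by the irreversible tier, even though 99.5% would look excellent as a headline number.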

🎯 Advanced · Lesson 1 Quiz

Task Completion — Quiz

3 questions — free, untracked, retake anytime.

1. The Salesforce Agentforce benchmark showed an 83% overall task completion rate. What critical information did this aggregate number hide?

✓ Correct. Simple refund tasks completed at 96% while complex account restructuring completed at only 61%; the aggregate masked which task types actually needed human escalation.

Not quite. The Salesforce case specifically revealed that stratifying by task complexity exposed dramatic variation — from 96% on simple tasks to 61% on complex ones — that the aggregate hid.

2. What is the key difference between "attempted completion" and "verified completion"?

✓ Correct. Many benchmarks conflate these two, which inflates reported success rates. Verification against ground truth is what makes completion meaningful.

Not quite. The distinction is about verification quality: attempted completion just means output was produced, while verified completion means it was checked against a correct answer standard.

3. On SWE-bench Verified, the best agents achieved roughly 23% full-task completion but over 60% subgoal completion. What does this gap most directly indicate?

✓ Correct. The gap shows agents are genuinely accomplishing intermediate steps, which has real deployment value, but binary metrics hide this progress entirely.

Not quite. The gap reveals that agents have real partial capabilities that binary success/fail scoring hides. Subgoal tracking surfaces diagnostic information that headline numbers cannot.

🎯 Advanced · Lesson 1 Lab

Lab: Designing Completion Metrics

Practice designing stratified, subgoal-aware completion frameworks with an AI coach.

Your Challenge

You are evaluating an AI agent designed to handle customer onboarding for a B2B SaaS company. The agent must: verify company eligibility, create the account, configure default settings, send welcome materials, and schedule an onboarding call.

Work with the AI coach to: (1) identify the subgoals and their relative weights, (2) design stratified completion tiers, and (3) set production deployment thresholds given that account creation is irreversible.

Push back, ask follow-up questions, and request the coach to critique your proposed metrics.

🎯 Advanced · Lesson 2 of 4

Cost Efficiency Metrics

Completion rate is meaningless without cost — how to measure what you actually spend per unit of value.

In March 2024, Cognition AI's Devin coding agent launched to significant press coverage, with claims that it could autonomously resolve software engineering issues. Independent researcher Albert Ziegler published a detailed analysis in July 2024 documenting that while Devin did complete tasks, its cost per resolved issue (accounting for token usage, retries, and infrastructure) was approximately $20–30 per SWE-bench task. For many issues, a senior engineer could resolve the same problem in under ten minutes. Cost efficiency, not raw capability, determined whether the tool was economically viable for different use cases.

The Three Cost Dimensions

Agent cost breaks down across three independent dimensions that are often conflated. Token cost is the most visible: the API fees for every prompt and completion token consumed across all model calls in an agent run. Infrastructure cost covers compute for tool execution, memory systems, vector databases, and orchestration layers. Time cost is wall-clock latency — how long a task takes, which has economic value when agents block human workflows or time-sensitive operations.

The key composite metric is cost per successful completion (CPSC): total cost across all runs divided by the number of verified successes. This automatically penalizes agents that fail frequently, because failed runs still consume tokens and time. An agent with a 90% success rate at $0.50/run has a CPSC of $0.56. An agent with a 60% success rate at $0.20/run has a CPSC of $0.33, which is cheaper per success even after paying for its failed runs. But CPSC counts only what the failed 40% of attempts cost to run, not what it costs to escalate, redo, or absorb those unfinished tasks, so read it alongside completion rate rather than in place of it.

Formula to Remember

CPSC = Total Cost of All Runs ÷ Number of Verified Successful Completions. A $0.20 agent with 50% success has a CPSC of $0.40 — identical to a $0.40 agent with 100% success. Always compute CPSC, never just average cost.
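
The formula is short enough to check by hand, but a small helper makes it harder to slip back into quoting average cost per run. The function names below are illustrative; the inputs reproduce the figures used in this lesson.

```python
def cpsc(total_cost, verified_successes):
    """Cost per successful completion: total spend / verified successes."""
    if verified_successes <= 0:
        return float("inf")  # money was spent but nothing succeeded
    return total_cost / verified_successes

def cpsc_from_rate(cost_per_run, success_rate):
    """Equivalent form when every run costs about the same."""
    if success_rate <= 0:
        return float("inf")
    return cost_per_run / success_rate

# 100 runs at $0.20 each with 50 verified successes.
print(round(cpsc(total_cost=100 * 0.20, verified_successes=50), 2))

# The two agents compared in the lesson text.
print(round(cpsc_from_rate(0.50, 0.90), 2))
print(round(cpsc_from_rate(0.20, 0.60), 2))
```

Returning infinity for zero successes is a deliberate choice in this sketch: an agent that never succeeds has no finite cost per success.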

Token Efficiency and Reasoning Overhead

With the emergence of chain-of-thought and extended thinking models (OpenAI o1, Claude 3.7 Sonnet with extended thinking, Gemini 2.0 Flash Thinking), a new cost dimension appeared: reasoning tokens. These are tokens the model generates during its "thinking" process before it answers. They are billed like other generated tokens but are often hidden or only summarized in the visible response. A task that produces a 200-token answer may have consumed 8,000 reasoning tokens to get there.

Anthropic's pricing documentation for Claude 3.7 Sonnet (released February 2025) bills thinking tokens as output tokens, at $15 per million, the same rate as visible output and five times the $3 per million input rate. For complex multi-step agent workflows using extended thinking, reasoning overhead can represent 60–80% of total token cost while producing only marginally better results on straightforward tasks. This makes thinking budget calibration (setting appropriate maximum reasoning token limits per task type) a critical cost efficiency lever.

Measure reasoning-token-to-output-token ratio per task category
Set thinking budgets by task complexity tier, not uniformly across all tasks
Test whether capping reasoning tokens degrades completion quality — the degradation threshold is your optimal budget
Track cost per token of useful output (CTUO): total spend divided by output tokens that contributed to verified success
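
As a rough sketch of the first and third practices above: measure the reasoning-to-output ratio, then sweep the thinking budget and keep the smallest one that does not degrade completion. The sweep results and the two-point tolerance are hypothetical; in practice each rate would come from a verified evaluation run at that budget.

```python
def reasoning_ratio(reasoning_tokens, output_tokens):
    """Reasoning tokens spent per token of visible output."""
    if output_tokens <= 0:
        return float("inf")
    return reasoning_tokens / output_tokens

def smallest_safe_budget(sweep, baseline_rate, tolerance=0.02):
    """Smallest budget whose completion rate stays within tolerance of baseline.

    sweep maps a reasoning-token budget to the completion rate measured
    at that budget. Returns None if every budget degrades too much.
    """
    for budget in sorted(sweep):
        if sweep[budget] >= baseline_rate - tolerance:
            return budget
    return None

# The lesson's example: a 200-token answer after 8,000 reasoning tokens.
print(reasoning_ratio(8000, 200))

# Hypothetical sweep for one task category.
sweep = {1000: 0.71, 2000: 0.84, 4000: 0.90, 8000: 0.91}
print(smallest_safe_budget(sweep, baseline_rate=sweep[8000]))
```

Running the sweep separately for each task complexity tier gives you per-tier budgets rather than one uniform limit.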

Retry Costs and Loop Detection

Agents fail, and failure handling has its own cost profile. The naive approach of retrying until success or until a max-retry limit is hit can produce catastrophic cost spikes on edge cases. A well-documented production incident at a major e-commerce company in 2023 (reported in LangChain's public postmortem blog) involved an agent entering a reasoning loop on ambiguous return requests, consuming $400 in API costs on a single transaction before hitting its timeout. That is a 40,000x cost overrun versus the expected $0.01/transaction budget.

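Robust cost evaluation therefore requires tracking cost variance alongside cost mean. A low average CPSC with high variance is far more dangerous to budget forecasting than a slightly higher average with tight variance. Production deployments should implement hard cost caps at the individual run level. The building blocks exist in common tooling: per-request output limits such as the max_tokens parameter and stop sequences in Anthropic's Claude API, and step or iteration limits in frameworks such as LangGraph and CrewAI. These bound tokens or steps rather than dollars, so a true per-run cost cap usually needs a small budget tracker layered on top.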
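
To see why the mean alone misleads, compare two hypothetical cost logs with nearly the same average. The numbers are invented for illustration; cost_profile is a name chosen here, built only on Python's standard statistics module.

```python
import statistics

def cost_profile(run_costs):
    """Mean, population standard deviation, and worst case of per-run cost."""
    return {
        "mean": statistics.mean(run_costs),
        "stdev": statistics.pstdev(run_costs),
        "max": max(run_costs),
    }

# Hypothetical logs: one agent is steady, the other is cheap
# on 99 runs and then loops once.
steady = cost_profile([0.50] * 100)
spiky = cost_profile([0.10] * 99 + [40.0])

for name, profile in (("steady", steady), ("spiky", spiky)):
    print(name, {key: round(value, 2) for key, value in profile.items()})
```

Both agents average about $0.50 per run, but only the spread and the maximum reveal that one of them can burn $40 on a single task.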

Production Standard

Any agent deployed in production should have a hard per-run cost cap set at no more than 10x the expected CPSC. This allows for legitimate retries and complex cases while preventing runaway loops. The cap should trigger graceful escalation to human review, not silent failure.
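
A per-run cap of this kind can be sketched as a small budget tracker that the agent loop charges after every step. Everything here is illustrative: RunBudget, BudgetExceeded, and run_with_cap are hypothetical names, and the step costs stand in for whatever your billing data reports per model or tool call.

```python
import itertools

class BudgetExceeded(Exception):
    """Raised when a single run spends more than its cap."""

class RunBudget:
    def __init__(self, expected_cpsc, multiplier=10.0):
        self.cap = expected_cpsc * multiplier
        self.spent = 0.0

    def charge(self, cost):
        # The step has already been paid for, so the cap can be
        # overshot by at most one step before the run is stopped.
        self.spent += cost
        if self.spent > self.cap:
            raise BudgetExceeded(f"spent {self.spent:.2f}, cap {self.cap:.2f}")

def run_with_cap(step_costs, expected_cpsc):
    """Simulate a run; step_costs yields the cost of each agent step."""
    budget = RunBudget(expected_cpsc)
    try:
        for cost in step_costs:
            budget.charge(cost)
    except BudgetExceeded:
        # Graceful escalation instead of silent failure.
        return {"status": "escalated_to_human", "spent": budget.spent}
    return {"status": "completed", "spent": budget.spent}

normal = run_with_cap([0.01] * 5, expected_cpsc=0.05)
looping = run_with_cap(itertools.repeat(0.01), expected_cpsc=0.05)
print(normal["status"], looping["status"])
```

The looping run would never terminate on its own; the cap stops it at roughly ten times the expected CPSC and hands it to a human.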

🎯 Advanced · Lesson 2 Quiz

Cost Efficiency — Quiz

3 questions — free, untracked, retake anytime.

1. An agent completes 60% of tasks at $0.30 per run. A second agent completes 90% of tasks at $0.40 per run. Which has the lower Cost Per Successful Completion (CPSC)?

✓ Correct. Agent 1 CPSC = $0.30 ÷ 0.60 = $0.50. Agent 2 CPSC = $0.40 ÷ 0.90 = $0.44. Despite the higher per-run cost, Agent 2 is more cost-efficient because it fails less often.

Not quite. CPSC = cost per run ÷ success rate. Agent 1: $0.30/0.60 = $0.50. Agent 2: $0.40/0.90 = $0.44. Agent 2 wins because failures cost money even when they produce no value.

2. Why did the independent analysis of Cognition AI's Devin agent matter beyond capability benchmarks?

✓ Correct. At $20–30 per resolved issue, many tasks were cheaper with human engineers, showing that raw capability benchmarks without cost context are insufficient for deployment decisions.

Not quite. Devin did complete tasks. The analysis by Albert Ziegler showed cost per resolved issue of $20–30, making it economically uncompetitive against human engineers for many task types despite real capability.

3. What production safeguard should be set to prevent runaway agent cost loops, based on documented incidents?

✓ Correct. The 10x CPSC cap allows legitimate complexity while preventing the 40,000x overruns documented in real incidents. Graceful escalation to humans preserves task completion without unlimited spend.

Not quite. The standard is a hard per-run cost cap at no more than 10x the expected CPSC, with graceful escalation to human review — not disabling retries or daily monitoring, which are too slow to prevent runaway loops.

🎯 Advanced · Lesson 2 Lab

Lab: Cost Efficiency Analysis

Work through real cost scenarios and design cost-efficiency metrics for a production agent.

Your Challenge

You are evaluating two competing agents for a document summarization pipeline. Agent A costs $0.08/run with 75% success. Agent B costs $0.15/run with 95% success. Your budget is $500/month for 4,000 tasks/month.

Work with the AI coach to: (1) calculate CPSC for both agents, (2) determine which fits your budget, (3) factor in reasoning token overhead if Agent B uses extended thinking, and (4) design a cost cap policy for production deployment.

Ask the coach to walk through the math with you, and challenge its assumptions.

🎯 Advanced · Lesson 3 of 4

Reliability & Failure Mode Analysis

An agent that usually works is not the same as a reliable agent — understanding failure patterns is the difference.

In February 2024, Air Canada was ordered by British Columbia's Civil Resolution Tribunal to honor a bereavement discount that its customer service chatbot had incorrectly promised to a passenger. The tribunal found Air Canada liable despite the company arguing the chatbot's output was not binding. The incident made global news not because the bot failed, though it did, but because Air Canada had deployed it without adequate understanding of its failure distribution. The bot had performed well on routine queries; its failure mode was specifically on edge-case policy interpretation. No evaluation had stress-tested that category before deployment.

Failure Mode Classification

Not all failures are equal. The Air Canada case illustrates a category that is far more dangerous than simple task failures: confident wrong outputs. The chatbot did not say "I don't know" — it stated incorrect information with apparent authority. A taxonomy of agent failure modes helps prioritize what to test and monitor.

Silent failures: Agent produces no output or crashes without explanation
Detected failures: Agent correctly identifies it cannot complete the task and escalates
Confident errors: Agent completes task but with incorrect output, no uncertainty signal
Cascading failures: An early-step error propagates through subsequent steps, compounding damage
Latent failures: Output appears correct but contains errors only visible under specific downstream conditions

Risk Hierarchy

From least to most dangerous: silent failure → detected failure → latent failure → cascading failure → confident error. Confident errors are the highest-risk failure mode because users act on them without suspicion, as Air Canada's case demonstrated at significant legal cost.
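
For triage, the hierarchy can be encoded so that failure logs sort by risk rather than by frequency. This is a minimal sketch: the numeric ranks simply follow the ordering stated above, and the logged failures are hypothetical.

```python
from collections import Counter
from enum import IntEnum

class FailureMode(IntEnum):
    # Ranks follow the least-to-most-dangerous ordering given in the text.
    SILENT = 1
    DETECTED = 2
    LATENT = 3
    CASCADING = 4
    CONFIDENT_ERROR = 5

def triage(failures):
    """Return (mode, count) pairs with the most dangerous mode first."""
    counts = Counter(failures)
    return sorted(counts.items(), key=lambda item: item[0], reverse=True)

# Hypothetical failure log from an evaluation run.
observed = [
    FailureMode.SILENT,
    FailureMode.CONFIDENT_ERROR,
    FailureMode.SILENT,
    FailureMode.DETECTED,
]

for mode, count in triage(observed):
    print(mode.name, count)
```

Sorting by rank puts the single confident error ahead of the two silent failures, so the rarest item in the log is still the first one reviewed.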

Reliability Under Distribution Shift

Agents evaluated on a static test set may show excellent reliability metrics that completely collapse in production. This happens because production inputs are drawn from a different distribution than the evaluation set — a phenomenon called distribution shift. The gap between eval-set reliability and production reliability is one of the most consistent failure patterns in deployed AI systems.

Anthropic's 2024 research on Claude's consistency documented that response quality can vary significantly based on prompt phrasing, user context, conversation history length, and time-of-day load — none of which are typically controlled in benchmark evaluations. This means a single-point reliability number is meaningless without knowing what distribution it was measured on.

The standard mitigation is adversarial evaluation: deliberately constructing test inputs that represent edge cases, ambiguous phrasing, adversarial users, and out-of-distribution queries. Stanford's HELM benchmark (Holistic Evaluation of Language Models, 2022) pioneered multi-distribution evaluation, testing models across 42 scenarios specifically to expose differential reliability across conditions.

Reliability ≠ Accuracy

A reliable agent produces consistent outputs for similar inputs. An accurate agent produces correct outputs. These are independent dimensions. An agent can be reliably wrong (consistent but incorrect), accurately unreliable (correct when it works but unpredictably unstable), or both reliable and accurate — the only acceptable production standard.
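
Because the two dimensions are independent, they need separate measurements. One simple approach, sketched below, is to run the same input several times and compute agreement and correctness separately. The majority-agreement definition of consistency is an assumption made here for illustration, and the outputs are hypothetical.

```python
from collections import Counter

def consistency(outputs):
    """Share of repeated runs that agree with the most common output."""
    if not outputs:
        return 0.0
    return Counter(outputs).most_common(1)[0][1] / len(outputs)

def accuracy(outputs, expected):
    """Share of repeated runs that match the verified correct output."""
    if not outputs:
        return 0.0
    return sum(output == expected for output in outputs) / len(outputs)

# Ten repeated runs on the same input; the correct answer is "A".
reliably_wrong = ["B"] * 10
unstable = ["A", "A", "B", "C", "A", "D", "A", "B", "A", "C"]

for name, outputs in (("reliably wrong", reliably_wrong), ("unstable", unstable)):
    print(name, consistency(outputs), accuracy(outputs, expected="A"))
```

The first agent scores perfectly on consistency and zero on accuracy, which is exactly the "reliably wrong" case that a single combined score would blur.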

Measuring and Monitoring Failure Rates

Production reliability monitoring requires separating pre-deployment evaluation from post-deployment monitoring. Pre-deployment eval catches known failure modes; post-deployment monitoring catches the unknown ones. The standard production stack includes: error rate tracking by failure category, confidence calibration measurement (does the agent's expressed confidence predict its actual accuracy?), and anomaly detection on output distributions.
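
Confidence calibration is commonly summarized with expected calibration error (ECE): bucket predictions by stated confidence, then compare each bucket's average confidence with its actual accuracy. The sketch below assumes the agent emits a numeric confidence between 0 and 1 for each answer, which not every agent does; the data is hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between stated confidence and actual accuracy."""
    total = len(confidences)
    if total == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf of 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        acc = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - acc)
    return ece

# Hypothetical agent that always claims 90% confidence.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False] * 1)
print(round(overconfident, 2), round(calibrated, 2))
```

A large ECE concentrated in the high-confidence buckets is the numerical signature of the confident-error failure mode.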

LangSmith (LangChain's observability platform, launched 2023) and Weights & Biases Prompts both provide production monitoring dashboards specifically for this purpose, tracking run success rates, latency distributions, and output classification over time. Microsoft's Azure AI Studio includes built-in safety and quality evaluators that run continuously against sampled production traffic. The key metric to watch is failure rate drift: a statistically significant increase in any failure category over baseline, which often signals distribution shift before it becomes a visible incident.
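
One way to decide whether an increase is statistically significant is a one-sided two-proportion z-test comparing a recent window against the baseline. This is a sketch under stated assumptions: runs are treated as independent, the windows are large enough for the normal approximation, and the 2.33 threshold (roughly a 1% one-sided significance level) is an illustrative choice.

```python
import math

def drift_z_score(base_failures, base_runs, recent_failures, recent_runs):
    """z-score for an increase in failure rate from baseline to recent."""
    if base_runs <= 0 or recent_runs <= 0:
        raise ValueError("both windows need at least one run")
    p_base = base_failures / base_runs
    p_recent = recent_failures / recent_runs
    pooled = (base_failures + recent_failures) / (base_runs + recent_runs)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / base_runs + 1 / recent_runs))
    if std_err == 0:
        return 0.0
    return (p_recent - p_base) / std_err

def has_drifted(base_failures, base_runs, recent_failures, recent_runs, z_threshold=2.33):
    z = drift_z_score(base_failures, base_runs, recent_failures, recent_runs)
    return z > z_threshold

# Baseline: 20 failures in 500 runs (4%).
print(has_drifted(20, 500, 45, 500))  # recent window at 9%
print(has_drifted(20, 500, 24, 500))  # recent window at 4.8%
```

Run the test per failure category rather than on the overall rate, since a rise in one category can be masked by a fall in another.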

🎯 Advanced · Lesson 3 Quiz

Reliability & Failure Modes — Quiz

3 questions — free, untracked, retake anytime.

1. Why were "confident errors" identified as the highest-risk failure mode, based on the Air Canada chatbot case?

✓ Correct. The Air Canada bot didn't signal uncertainty; it stated a wrong discount policy confidently. The passenger relied on it, creating legal liability. Users cannot defend against failures they have no reason to suspect.

Not quite. The danger of confident errors is that users act on them — the bot gave incorrect but authoritative-sounding information about bereavement discounts, which a customer then relied upon, creating legal liability for the company.

2. What is "distribution shift" in the context of agent reliability evaluation?

✓ Correct. Eval sets are controlled samples; production traffic is not. When they differ significantly, even excellent eval metrics can mask poor production reliability.

Not quite. Distribution shift is when the real-world inputs an agent encounters in production differ from those in its evaluation set — causing reliability metrics measured in eval to be misleading about actual production performance.

3. An agent produces consistent outputs for similar inputs but those outputs are often incorrect. Which description fits?

✓ Correct. Reliability and accuracy are independent dimensions. This agent is consistent (reliable) but consistently incorrect (not accurate), a dangerous combination because the consistency may prevent users from noticing the errors.

Not quite. Consistency of outputs regardless of correctness is reliability. Getting things right is accuracy. An agent that consistently produces wrong answers is "reliably wrong" — the two dimensions are independent and must both be measured.

🎯 Advanced · Lesson 3 Lab

Lab: Failure Mode Mapping

Classify failure types, design adversarial test cases, and build a monitoring strategy.

Your Challenge

You are responsible for a legal document review agent that flags contract clauses for attorney review. It has been running for two weeks and you have collected 500 runs of data showing a 4% failure rate.

Work with the AI coach to: (1) classify what the failure types likely are given the domain, (2) design adversarial test cases that would stress-test confident error risk, and (3) set up a production monitoring plan that distinguishes failure-rate drift from normal variance.

Be specific — ask for concrete test case examples and threshold formulas.

Building AI Agents I — Use Cases · Module 7 · Lesson 4

Lesson 4: Building Eval Frameworks

Advanced concepts, real-world applications, and practical implications

Core Concepts

This lesson covers building eval frameworks: the key principles, real-world applications, and implications for practitioners working in this domain.

Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.

Practical Applications

The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.

Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.

Looking Forward

As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.

Lesson 4 Quiz

Lesson 4: Building Eval Frameworks

What is the primary focus of Lesson 4: Building Eval Frameworks?

✓ Correct. This lesson bridges theory and practice, focusing on real-world implementation.

Review the lesson — the focus is on connecting frameworks to practical reality.

Why does real-world deployment introduce challenges that pure theory doesn't capture?

✓ Correct. Real deployment requires judgment, not just framework application.

Practice doesn't invalidate theory — it reveals complexities that require nuanced application of theoretical principles.

What separates effective practitioners from those who merely follow checklists?

✓ Correct. Critical thinking and adaptability matter more than memorized procedures.

The key differentiator is critical thinking ability, not experience or resources alone.

🎯 Advanced · Lesson 4 Lab

Lab: Apply What You've Learned

Synthesize concepts from Lesson 4: Building Eval Frameworks through guided AI conversation

Your Task

Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios related to building eval frameworks.

Try: "I'm evaluating a customer support agent that resolves 89% of tickets but costs $2.40 per resolution and occasionally gives confidently wrong answers. Design the eval framework — what metrics, thresholds, and failure-mode tests would you run before approving scale-up?"

Module 7 Test

Evaluating Agent Performance · 15 Questions · 70% to Pass

1. Why does overall task completion rate alone deceive?

2. What critical distinction do many benchmarks conflate?

3. In Google DeepMind's WebAgent (2023) evaluation, what gap did stratified analysis reveal?

4. On SWE-bench Verified (2024), what did the subgoal data reveal that the headline number missed?

5. According to Anthropic guidance, what threshold approach should be used for different action types?

6. What is Cost Per Successful Completion (CPSC) and why is it better than raw cost?

7. In the e-commerce production incident documented by LangChain, what happened when an agent entered a reasoning loop?

8. For complex multi-step workflows using Claude 3.7 Sonnet's extended thinking, what percentage of total token cost can reasoning overhead represent?

9. What production standard prevents runaway cost loops?

10. Why is cost variance more important to track than cost mean alone?

11. In the failure mode risk hierarchy, which category is the most dangerous?

12. Why did the Air Canada chatbot case specifically demonstrate the danger of "confident errors"?

13. What is the distinction between reliability and accuracy in agent evaluation?

14. What does the Stanford HELM benchmark (2022) specifically test for?

15. Why is post-deployment monitoring essential even with thorough pre-deployment evaluation?