Task completion rate is the most natural first metric anyone reaches for when evaluating an AI agent. Did it finish the job? Yes or no? But a binary completion metric discards almost everything useful about agent quality. An agent that completes 90% of tasks but consistently fails on the 10% that are highest-stakes — security-sensitive actions, large financial transactions, irreversible operations — is far more dangerous than an agent completing 75% of tasks uniformly across complexity levels.
The first refinement is stratified completion rate: measuring success separately across task difficulty tiers, task types, and task domains. Google DeepMind's WebAgent research (2023) explicitly stratified their evaluation across easy, medium, and hard web navigation tasks, revealing that completion rates dropped from 90.8% on easy tasks to 32.1% on hard tasks — a gap that an aggregate number of ~65% would have obscured entirely.
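To make the stratification concrete, here is a minimal sketch that computes completion rate per tier next to the aggregate. The run log, the tier names, and the function name `stratified_completion` are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def stratified_completion(runs):
    """Return {tier: completion_rate} from (tier, succeeded) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for tier, succeeded in runs:
        totals[tier] += 1
        successes[tier] += int(succeeded)
    return {tier: successes[tier] / totals[tier] for tier in totals}

# Hypothetical run log: 10 easy tasks (9 succeed) and 10 hard tasks (3 succeed).
runs = (
    [("easy", True)] * 9 + [("easy", False)]
    + [("hard", True)] * 3 + [("hard", False)] * 7
)

rates = stratified_completion(runs)
aggregate = sum(ok for _, ok in runs) / len(runs)

print(rates)       # per-tier rates expose the gap between easy and hard
print(aggregate)   # a single number that hides that gap
```

In this made-up log the aggregate sits at 60% while the hard tier sits at 30%, which is the kind of gap a single headline number conceals.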
There is a critical difference between attempted completion (the agent tried and produced output) and verified completion (the output was checked against a ground-truth standard). Many published benchmarks conflate the two. Always ask: how was success verified?
The second refinement is distinguishing between terminal success (the task reached the correct final state) and trajectory quality (the path taken to get there). An agent that achieves the right answer via ten unnecessary API calls, two retries, and one recoverable error has a different quality profile than one that achieves it cleanly in two steps. Many benchmarks score only the first dimension: GAIA (2023, from Meta AI, Hugging Face, and AutoGPT), for example, grades the final answer, so trajectory quality usually has to be logged and scored separately.
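One way to keep the two dimensions apart is to report them as separate fields rather than folding them into one score. The sketch below is illustrative only: the `Trajectory` record, the `step_efficiency` ratio, and the assumption that a minimum number of calls is known for each task are not taken from any particular benchmark.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    reached_goal: bool  # terminal success, verified against ground truth
    tool_calls: int     # total API or tool calls made during the run
    retries: int
    errors: int         # recoverable errors encountered along the way

def trajectory_report(t: Trajectory, min_calls: int) -> dict:
    """Report terminal success and path quality as separate fields."""
    # Efficiency is the ratio of the assumed minimum calls to calls actually made.
    efficiency = min(1.0, min_calls / t.tool_calls) if t.tool_calls else 0.0
    return {
        "terminal_success": t.reached_goal,
        "step_efficiency": efficiency,
        "retries": t.retries,
        "errors": t.errors,
    }

# The two agents described above: same final answer, very different paths.
messy = trajectory_report(Trajectory(True, tool_calls=10, retries=2, errors=1), min_calls=2)
clean = trajectory_report(Trajectory(True, tool_calls=2, retries=0, errors=0), min_calls=2)
print(messy)
print(clean)
```

Both runs report terminal success, and only the efficiency, retry, and error fields tell them apart.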
Complex tasks decompose into subgoals. A travel booking agent might need to: (1) parse destination intent, (2) query available flights, (3) filter by user constraints, (4) select optimal option, (5) execute payment, (6) confirm and notify. A binary success/fail score throws away five-sixths of the diagnostic information. Subgoal completion tracking pinpoints where in the pipeline agents systematically fail.
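A small sketch of subgoal tracking for the travel booking example: each run records which of the six subgoals succeeded, and the tally shows where runs first break. The subgoal identifiers, the run data, and the function name are hypothetical.

```python
from collections import Counter

# Identifiers for the six subgoals of the travel booking example.
SUBGOALS = [
    "parse_intent", "query_flights", "filter_constraints",
    "select_option", "execute_payment", "confirm_notify",
]

def first_failure_counts(runs):
    """Count, per subgoal, how many runs first failed there.

    Each run is a list of booleans aligned with SUBGOALS.
    Fully successful runs are counted under "none".
    """
    counts = Counter()
    for results in runs:
        failed = next(
            (name for name, ok in zip(SUBGOALS, results) if not ok), "none"
        )
        counts[failed] += 1
    return counts

# Hypothetical results for four runs.
runs = [
    [True, True, True, True, True, True],
    [True, True, True, True, False, False],
    [True, True, True, True, False, False],
    [True, True, False, False, False, False],
]

counts = first_failure_counts(runs)
print(counts)
```

A binary score would report one success in four for these runs. The tally adds that half of them broke at the payment step, which is where debugging effort should go first.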
The SWE-bench benchmark (2023, Princeton/Chicago) evaluates coding agents on real GitHub issues. Its headline metric is binary: an issue counts as resolved only when the tests that the reference fix makes pass now pass and the tests that passed before still pass. Reporting those two test groups separately, rather than only the resolved flag, shows whether an agent failed to fix the bug or fixed it while breaking existing behavior, which points directly at a specific capability gap.
On SWE-bench Verified (the curated subset released in 2024), the best open-source agents at time of release achieved roughly 23% full-task completion but over 60% completion of at least one critical subgoal. The subgoal data tells a completely different story about capability than the headline number.
Evaluation metrics only become operational when paired with deployment thresholds: the minimum acceptable completion rate before an agent goes live, and the trigger point for pulling it back. These thresholds must be risk-calibrated. A medical records retrieval agent and a playlist recommendation agent can have very different acceptable failure rates even if they are technically similar systems.
Anthropic's published guidance on Claude deployment recommends establishing separate thresholds for different action consequence levels — read-only actions, reversible writes, and irreversible actions — and requiring near-perfect completion rates on the irreversible tier before any production deployment. This tiered threshold approach is now standard in serious agent evaluation frameworks.
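A tiered threshold check can be sketched as a simple gate. The three tier names follow the consequence levels described above. The numeric thresholds are placeholders chosen for illustration, not recommended values, and `deployment_gate` is a hypothetical name.

```python
# Placeholder minimum verified completion rates per consequence tier.
# Real values should come from a risk assessment for the specific deployment.
THRESHOLDS = {
    "read_only": 0.90,
    "reversible_write": 0.97,
    "irreversible": 0.999,
}

def deployment_gate(observed):
    """Return the tiers that block deployment (an empty list means go).

    A tier with no measured rate is treated as failing, so an
    unevaluated tier can never pass the gate by omission.
    """
    return [
        tier for tier, minimum in THRESHOLDS.items()
        if observed.get(tier, 0.0) < minimum
    ]

observed = {"read_only": 0.95, "reversible_write": 0.98, "irreversible": 0.99}
blocking = deployment_gate(observed)
print(blocking)
```

Here a 99% rate on irreversible actions still blocks deployment, even though the other two tiers clear their thresholds comfortably.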
You are evaluating an AI agent designed to handle customer onboarding for a B2B SaaS company. The agent must: verify company eligibility, create the account, configure default settings, send welcome materials, and schedule an onboarding call.
Push back, ask follow-up questions, and ask the coach to critique your proposed metrics.
Agent cost breaks down across three independent dimensions that are often conflated. Token cost is the most visible: the API fees for every prompt and completion token consumed across all model calls in an agent run. Infrastructure cost covers compute for tool execution, memory systems, vector databases, and orchestration layers. Time cost is wall-clock latency — how long a task takes, which has economic value when agents block human workflows or time-sensitive operations.
The key composite metric is cost per successful completion (CPSC): total cost across all runs divided by the number of verified successes. This automatically penalizes agents that fail frequently, because failed runs still consume tokens and time. An agent with a 90% success rate at $0.50/run has a CPSC of about $0.56 ($0.50 ÷ 0.90). An agent with a 60% success rate at $0.20/run has a CPSC of about $0.33 ($0.20 ÷ 0.60). The second agent is cheaper per success, but CPSC counts only what the failed runs cost to execute. It does not count what happens to the 40% of tasks that failed: human rework, escalation, or lost customers. Those downstream costs have to be added before the comparison is complete.
CPSC = Total Cost of All Runs ÷ Number of Verified Successful Completions. A $0.20 agent with 50% success has a CPSC of $0.40 — identical to a $0.40 agent with 100% success. Always compute CPSC, never just average cost.
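The formula translates directly into a few lines of code. This is a minimal sketch in which the function name `cpsc` and the 100-run batches are illustrative; the prices and success rates are the ones used in the examples above.

```python
def cpsc(total_cost: float, verified_successes: int) -> float:
    """Cost per successful completion: total cost of all runs / verified successes."""
    if verified_successes == 0:
        return float("inf")  # money was spent and nothing verified as successful
    return total_cost / verified_successes

# 100 runs each; failed runs still count toward total cost.
print(round(cpsc(100 * 0.50, 90), 2))   # $0.50/run at 90% success
print(round(cpsc(100 * 0.20, 60), 2))   # $0.20/run at 60% success
print(round(cpsc(100 * 0.20, 50), 2))   # $0.20/run at 50% success
print(round(cpsc(100 * 0.40, 100), 2))  # $0.40/run at 100% success
```

The last two cases come out equal, which is the point of the metric: a cheap agent that succeeds half the time costs as much per verified success as one that costs twice as much and always succeeds.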
With the emergence of chain-of-thought and extended thinking models (OpenAI o1, Claude 3.7 Sonnet with extended thinking, Gemini 2.0 Flash Thinking), a new cost dimension appeared: reasoning tokens. These are tokens consumed internally during the model's "thinking" process, charged at full rate but invisible in the output. A task that produces a 200-token answer may have consumed 8,000 reasoning tokens to get there.
Anthropic's pricing documentation for Claude 3.7 Sonnet (released February 2025) bills thinking tokens as output tokens, at $15.00 per million, compared with $3.00 per million for input tokens. For complex multi-step agent workflows using extended thinking, reasoning overhead can represent 60–80% of total token cost while producing only marginally better results on straightforward tasks. This makes thinking budget calibration, meaning an appropriate maximum reasoning token limit per task type, a critical cost efficiency lever.
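A rough way to see how much of a call's cost goes to reasoning is to price each token group separately. The sketch below assumes thinking tokens are billed at the output rate, with default prices of $3.00 per million input tokens and $15.00 per million output tokens. Those prices and the 10,000-token prompt are assumptions for illustration and should be checked against current pricing.

```python
def reasoning_share(input_tokens, thinking_tokens, answer_tokens,
                    input_price_per_m=3.00, output_price_per_m=15.00):
    """Fraction of one call's token cost attributable to thinking tokens.

    Assumes thinking tokens are billed at the output-token rate.
    """
    input_cost = input_tokens * input_price_per_m / 1_000_000
    thinking_cost = thinking_tokens * output_price_per_m / 1_000_000
    answer_cost = answer_tokens * output_price_per_m / 1_000_000
    total = input_cost + thinking_cost + answer_cost
    return thinking_cost / total if total else 0.0

# The 200-token answer with 8,000 reasoning tokens described above,
# with an assumed 10,000-token prompt.
share = reasoning_share(input_tokens=10_000, thinking_tokens=8_000, answer_tokens=200)
print(f"{share:.0%} of this call's token cost is reasoning")
```

Running the same function per task type, with real token counts from logs, gives a starting point for deciding where a lower thinking budget is worth testing.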
Agents fail, and failure handling has its own cost profile. The naive approach, retrying until success or until a max-retry limit is hit, can produce catastrophic cost spikes on edge cases. A well-documented production incident at a major e-commerce company in 2023 (reported in LangChain's public postmortem blog) involved an agent entering a reasoning loop on ambiguous return requests, consuming $400 in API costs on a single transaction before hitting its timeout. That is a 40,000x cost overrun versus the expected $0.01/transaction budget.
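Robust cost evaluation therefore requires tracking cost variance alongside mean cost. A low average CPSC with high variance is far more dangerous to budget forecasting than a slightly higher average with tight variance. Production deployments should implement hard cost caps at the individual run level. Frameworks and APIs supply only the building blocks: LangGraph has a recursion limit on graph steps, CrewAI has a per-agent iteration limit, and the max_tokens parameter in Anthropic's Claude API bounds the output of a single call. None of these is a dollar cap on a whole run, and stop sequences do not limit cost at all, so the per-run budget still has to be tracked and enforced in the orchestration code.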
Any agent deployed in production should have a hard per-run cost cap set at no more than 10x the expected CPSC. This allows for legitimate retries and complex cases while preventing runaway loops. The cap should trigger graceful escalation to human review, not silent failure.
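The cap-and-escalate rule can be sketched in plain Python, independent of any framework. `RunBudget`, `BudgetExceeded`, and `run_with_cap` are hypothetical names, and the list of step costs stands in for whatever per-call cost accounting the orchestration layer provides.

```python
class BudgetExceeded(Exception):
    """Raised when a run's accumulated cost passes its hard cap."""

class RunBudget:
    def __init__(self, expected_cpsc: float, multiplier: float = 10.0):
        self.cap = expected_cpsc * multiplier  # hard cap, 10x expected CPSC by default
        self.spent = 0.0

    def charge(self, cost: float) -> None:
        # Costs are recorded after a step has run, so the cap can be
        # overshot by at most one step's cost.
        self.spent += cost
        if self.spent > self.cap:
            raise BudgetExceeded(f"spent ${self.spent:.2f} of ${self.cap:.2f} cap")

def run_with_cap(step_costs, budget: RunBudget) -> str:
    """Consume step costs until done or the cap trips; return an outcome label."""
    try:
        for cost in step_costs:
            budget.charge(cost)
    except BudgetExceeded:
        return "escalate_to_human"  # graceful handoff instead of silent failure
    return "completed"

normal = run_with_cap([0.002] * 5, RunBudget(expected_cpsc=0.01))
runaway_budget = RunBudget(expected_cpsc=0.01)
runaway = run_with_cap([0.01] * 40_000, runaway_budget)
print(normal, runaway, round(runaway_budget.spent, 2))
```

With a $0.01 expected CPSC the cap is $0.10, so a looping run stops after roughly eleven cents instead of running until a timeout.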
You are evaluating two competing agents for a document summarization pipeline. Agent A costs $0.08/run with 75% success. Agent B costs $0.15/run with 95% success. Your budget is $500/month for 4,000 tasks/month.
Ask the coach to walk through the math with you, and challenge its assumptions.
Not all failures are equal. The Air Canada case illustrates a category that is far more dangerous than simple task failures: confident wrong outputs. The chatbot did not say "I don't know" — it stated incorrect information with apparent authority. A taxonomy of agent failure modes helps prioritize what to test and monitor.
From least to most dangerous: silent failure → detected failure → latent failure → cascading failure → confident error. Confident errors are the highest-risk failure mode because users act on them without suspicion, as Air Canada's case demonstrated at significant legal cost.
Agents evaluated on a static test set may show excellent reliability metrics that completely collapse in production. This happens because production inputs are drawn from a different distribution than the evaluation set — a phenomenon called distribution shift. The gap between eval-set reliability and production reliability is one of the most consistent failure patterns in deployed AI systems.
Anthropic's 2024 research on Claude's consistency documented that response quality can vary significantly based on prompt phrasing, user context, conversation history length, and time-of-day load — none of which are typically controlled in benchmark evaluations. This means a single-point reliability number is meaningless without knowing what distribution it was measured on.
The standard mitigation is adversarial evaluation: deliberately constructing test inputs that represent edge cases, ambiguous phrasing, adversarial users, and out-of-distribution queries. Stanford's HELM benchmark (Holistic Evaluation of Language Models, 2022) pioneered multi-distribution evaluation, testing models across 42 scenarios specifically to expose differential reliability across conditions.
A reliable agent produces consistent outputs for similar inputs. An accurate agent produces correct outputs. These are independent dimensions. An agent can be reliably wrong (consistent but incorrect), accurately unreliable (correct when it works but unpredictably unstable), or both reliable and accurate — the only acceptable production standard.
Production reliability monitoring requires separating pre-deployment evaluation from post-deployment monitoring. Pre-deployment eval catches known failure modes; post-deployment monitoring catches the unknown ones. The standard production stack includes: error rate tracking by failure category, confidence calibration measurement (does the agent's expressed confidence predict its actual accuracy?), and anomaly detection on output distributions.
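Confidence calibration is commonly summarized with expected calibration error (ECE): group runs by stated confidence, then compare each group's average confidence with its observed accuracy. The sketch below assumes the agent emits a confidence between 0 and 1 for each output and that correctness has been verified separately. The sample data is made up.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between stated confidence and observed accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(mean_conf - accuracy)
    return ece

# An agent that says "90% sure" and is right 9 times out of 10 is well calibrated.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
# The same stated confidence with only 5 correct answers is overconfident.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(round(calibrated, 3), round(overconfident, 3))
```

A rising ECE on sampled production traffic is an early sign that the agent is drifting toward confident errors.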
LangSmith (LangChain's observability platform, launched 2023) and Weights & Biases Prompts both provide production monitoring dashboards specifically for this purpose, tracking run success rates, latency distributions, and output classification over time. Microsoft's Azure AI Studio includes built-in safety and quality evaluators that run continuously against sampled production traffic. The key metric to watch is failure rate drift: a statistically significant increase in any failure category over baseline, which often signals distribution shift before it becomes a visible incident.
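One simple way to test for failure rate drift is a one-sided two-proportion z-test comparing a recent window against the baseline. This is a sketch under stated assumptions rather than the method any of the tools above use: it relies on a normal approximation, which needs reasonably large counts, and the function name, the counts, and the 0.05 significance level are illustrative.

```python
import math

def failure_rate_drift(base_failures, base_runs, new_failures, new_runs, alpha=0.05):
    """One-sided two-proportion z-test: has the failure rate increased?"""
    p_base = base_failures / base_runs
    p_new = new_failures / new_runs
    pooled = (base_failures + new_failures) / (base_runs + new_runs)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_runs + 1 / new_runs))
    if se == 0:
        return {"z": 0.0, "p_value": 1.0, "drift": False}
    z = (p_new - p_base) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return {"z": z, "p_value": p_value, "drift": p_value < alpha}

# Baseline: 20 failures in 500 runs (4%). Recent window: 45 failures in 500 runs (9%).
jump = failure_rate_drift(20, 500, 45, 500)
# Recent window barely above baseline: 22 failures in 500 runs.
noise = failure_rate_drift(20, 500, 22, 500)
print(jump["drift"], noise["drift"])
```

The test should be run per failure category. When many categories are checked at once, a stricter significance level helps avoid false alarms.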
You are responsible for a legal document review agent that flags contract clauses for attorney review. It has been running for two weeks and you have collected 500 runs of data showing a 4% failure rate.
Be specific — ask for concrete test case examples and threshold formulas.
Lesson 4 covers building eval frameworks: the key principles, real-world applications, and implications for practitioners working in this domain.
Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.
The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.
Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.
As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.
Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios related to building eval frameworks.