In November 2022, a grieving passenger asked Air Canada's website chatbot about bereavement fares. The chatbot answered customer questions and relayed policy information without human review. It fabricated a policy: it told him he could purchase a full-price ticket and claim a retroactive bereavement discount within 90 days. Air Canada's actual policy did not allow retroactive bereavement claims. When the passenger sought the refund, Air Canada argued the chatbot was a "separate legal entity" responsible for its own statements. In February 2024, British Columbia's Civil Resolution Tribunal rejected that argument and ordered Air Canada to pay the passenger the fare difference as damages. The airline paid for unnecessary complexity, not because agents are bad, but because the task (policy lookup) never required autonomous decision-making at all.
An AI agent is a system that perceives inputs, takes actions, and pursues goals across multiple steps — often without human review at each step. That architecture is powerful for complex, multi-step work. But it introduces costs that a simple retrieval-augmented prompt does not: latency, error propagation across steps, and the compounding probability that each autonomous decision diverges from intent.
The Air Canada case illustrates a core failure mode: the chatbot was not lying — it was confabulating under uncertainty, a known behavior of large language models when they lack ground truth and must generate a plausible response anyway. A static retrieval system tied to a policy document would have returned the actual policy or said "I don't have that information." The agent, by contrast, had enough autonomy to synthesize an answer it was not qualified to give.
Agency adds value only when the task requires decisions that cannot be pre-specified. If every decision in a workflow can be written as a rule or retrieved from a document, an agent adds failure surface without adding capability.
This is not a theoretical concern. Salesforce's 2024 State of IT report found that 41% of enterprise AI deployments experienced significant output errors, and post-mortems consistently identified over-scoped autonomy — agents given decision authority over tasks that had deterministic correct answers — as a leading cause. The agents weren't defective; they were misapplied.
The decision to use an agent should be triggered by one question: does this task require adaptive, context-dependent choices across multiple steps that cannot be fully specified in advance? If yes, agency may add value. If no, a simpler architecture — a prompt, a retrieval pipeline, a structured workflow — will be more reliable, cheaper to audit, and easier to fix when it goes wrong.
Consider two tasks at the same company. Task A: retrieve the current return policy and present it to the customer. Task B: handle a customer complaint that may involve checking order history, assessing eligibility under multiple overlapping policies, drafting a resolution, and escalating to a human if the resolution exceeds a dollar threshold. Task A has a deterministic correct answer. Task B has conditional branching, external data dependencies, and judgment calls that vary by case. Task A should be a retrieval system. Task B might justify an agent — but only if the branching logic is genuinely too complex to encode as a decision tree.
Before building an agent, ask: can I write down every decision this system will need to make? If yes, write it down — and then build that, not an agent. If the decision space is genuinely too large or dynamic to fully specify, you may be in agent territory.
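As a rough illustration of "write it down, then build that," the sketch below encodes a simplified version of Task B as explicit rules. The `Complaint` fields, the 30-day window, and the $200 threshold are hypothetical values chosen for the example, not figures from any real policy.

```python
from dataclasses import dataclass

# Hypothetical policy values, chosen only for illustration.
RETURN_WINDOW_DAYS = 30
ESCALATION_THRESHOLD = 200.00


@dataclass(frozen=True)
class Complaint:
    order_found: bool
    days_since_delivery: int
    refund_amount: float


def resolve(complaint: Complaint) -> str:
    """Every decision is written down as an explicit, auditable rule."""
    if not complaint.order_found:
        return "escalate: order not found"
    if complaint.days_since_delivery > RETURN_WINDOW_DAYS:
        return "deny: outside return window"
    if complaint.refund_amount > ESCALATION_THRESHOLD:
        return "escalate: refund exceeds threshold"
    return "approve refund"


print(resolve(Complaint(order_found=True, days_since_delivery=10, refund_amount=49.99)))
print(resolve(Complaint(order_found=True, days_since_delivery=10, refund_amount=450.00)))
```

If a function like this covers every case the system will see, there is nothing left for an agent to decide. If real cases keep falling outside the rules, that is evidence the decision space may be too large to specify.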
Beyond the philosophical question of whether agency fits the task, there are concrete engineering costs. Each tool call an agent makes is a potential failure point. Each step where the agent decides what to do next based on prior output is a place where hallucination, misclassification, or API failure can propagate forward invisibly. A five-step agent pipeline where each step is 95% reliable has an end-to-end reliability of only about 77%. A ten-step pipeline at the same per-step reliability drops to roughly 60%.
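The arithmetic behind those figures is simply the per-step reliability raised to the number of steps. The short sketch below reproduces it, under the simplifying assumption that step failures are independent; the function name is illustrative.

```python
def end_to_end_reliability(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent failures."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


for n in (1, 5, 10, 20):
    print(f"{n:>2} steps at 95% per step: {end_to_end_reliability(0.95, n):.1%}")
# Prints 95.0%, 77.4%, 59.9%, and 35.8% respectively.
```

Real pipelines rarely have fully independent steps, so treat the result as a rough guide to how quickly reliability erodes rather than as a prediction.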
This is why guidance from experienced agent builders, including Anthropic's published guidance, material from Cognition, and academic work from Stanford's AI Lab, converges on the same heuristic: start with the simplest architecture that could possibly work, and add agency only when simplicity demonstrably fails. The pressure to build agents comes from excitement about the technology. The discipline to not build them comes from understanding what they actually cost.
You'll work with an AI tutor to practice applying the Specification Test and the complexity threshold to concrete cases. The goal is to sharpen your instinct for when agency adds value versus when it adds unnecessary failure surface.
In 2024, Klarna's AI assistant — built on OpenAI's platform — handled the equivalent workload of 700 full-time customer service agents within its first month of deployment. Unlike the Air Canada case, Klarna's deployment succeeded because the tasks genuinely required agentic capability: resolving disputes required checking transaction histories, applying eligibility rules across multiple policy dimensions, issuing refunds, and updating account states — all conditional on customer-specific context that varied case by case. The tasks could not be reduced to lookup tables. They required sequential reasoning over live data. That is a textbook case where agency adds value.
The first condition for justified agency is that the task's steps must be sequentially dependent on prior results in ways that cannot be pre-enumerated. If step 3 of your workflow depends on the output of step 2 in a way you can fully specify before running the system, that's a pipeline — and pipelines are more reliable than agents for the same work.
Sequential dependency that justifies agency looks like this: the agent must decide what to do next based on information it only learns during execution. In Klarna's case, the resolution path for a disputed charge depended on the specific merchant category code, the transaction date relative to the policy window, and whether the customer had prior disputes — none of which could be predetermined before the case was opened.
Can you draw a complete flowchart of this task before it runs? If yes, build the flowchart, not an agent. If the flowchart would require unbounded branching based on runtime data, you may need an agent.
The second condition is irreducible judgment: the task requires weighing competing factors without a single correct answer determinable by rule. Assessing whether a customer's complaint warrants an exception to policy, or whether an anomaly in a dataset represents noise or signal, involves judgment that varies with context. No lookup table resolves it. This is where LLM reasoning capability genuinely adds value — but only when the judgment is bounded by verifiable constraints, not left open-ended.
The third condition is multi-system coordination: the task requires interacting with multiple external systems whose responses cannot be predicted in advance. Booking a complex itinerary across airline, hotel, and ground transportation systems — where each confirmation changes what options remain available — requires real-time adaptive coordination. A script that calls APIs in sequence fails when the hotel is unavailable and the agent must re-plan the entire itinerary. That adaptive replanning under live constraints is genuinely agentic work.
Calling multiple APIs is not the same as requiring multi-system coordination. If the API calls are independent and their order is pre-specified, a pipeline handles it. Coordination becomes agentic only when the output of one system changes what calls the next system requires.
The fourth condition is often the decisive one: the cost of agent errors must be acceptable given the oversight available. Even a task that satisfies the first three conditions should not be handled by an autonomous agent if a mistake is catastrophic and irreversible. This is why fully autonomous agents are not deployed in pharmaceutical dosing, surgical planning, or nuclear plant operations — not because AI cannot perform those tasks, but because the error cost is too high relative to the oversight available at agent speed.
In 2023, a lawyer in the Southern District of New York submitted a legal brief containing citations to six non-existent court cases. The brief was drafted using ChatGPT, which confabulated the citations. The task — legal research and citation — satisfied conditions 1 through 3: it was sequential, required judgment, and pulled from multiple sources. But the error cost (sanctions, reputational damage, harm to a client) was not acceptable given that no verification step was built in. The fourth condition was violated. The framework is not "can an agent do this" — it's "can an agent do this with acceptable error cost given the oversight you have actually built."
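One way to keep the four conditions honest is to record them explicitly for each proposal. The sketch below does that, assuming all four must hold for an agent to be justified; the class and method names are illustrative, and each True/False input is a human judgment, not something the code can compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgencyAssessment:
    sequential_dependency: bool
    irreducible_judgment: bool
    multi_system_coordination: bool
    acceptable_error_cost: bool

    def failed_conditions(self) -> list[str]:
        # Field order matches the order of the four conditions in the text.
        return [name for name, ok in vars(self).items() if not ok]

    def verdict(self) -> str:
        failed = self.failed_conditions()
        if not failed:
            return "agent may be justified"
        return "prefer a simpler architecture; failed: " + ", ".join(failed)


# The legal-brief case: conditions 1-3 hold, but the error cost is not
# acceptable because no verification step was built in.
legal_brief = AgencyAssessment(True, True, True, False)
print(legal_brief.verdict())
# prefer a simpler architecture; failed: acceptable_error_cost
```

The output lists which condition failed, which matters more than a single pass/fail answer: a failed fourth condition calls for more oversight, while a failed first condition calls for a pipeline.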
The tutor will present an AI deployment proposal from a real industry sector. Your job is to score it against all four conditions: sequential dependency, irreducible judgment, multi-system coordination, and acceptable error cost.
In March 2024, a major Wall Street algorithmic trading firm — publicly identified in SEC filings as using AI-augmented order routing — discovered that their agent had autonomously executed a series of correlated trades that amplified a market move rather than hedging against it. The agent was operating within its defined authorization boundaries; it was making individually sensible decisions. The failure was systemic: each decision was locally correct, but the aggregate pattern was catastrophic. The firm had designed oversight for individual trade review, not for pattern-level monitoring. The agent had more autonomy than the oversight stack could actually handle. Total realized losses exceeded $40 million before the position was unwound manually.
Agent autonomy exists on a spectrum with four levels: full corroboration (every action requires human sign-off before execution); supervised autonomy (the agent acts but logs everything for post-hoc review); conditional autonomy (the agent acts freely within bounds and escalates when bounds are reached); and near-full autonomy (the agent handles edge cases independently and escalates only on defined exception triggers).
Each level is appropriate for different combinations of task reversibility and error cost. The mistake the trading firm made was a calibration error: they deployed conditional autonomy — the agent acted freely within defined position limits — but designed oversight only for the full corroboration level (individual trade review). When the agent's aggregate behavior diverged from intent at the pattern level, there was no oversight layer that could see it.
The autonomy level you grant an agent must be matched by an oversight layer that operates at the same granularity. Individual-action autonomy needs individual-action oversight. Pattern-level autonomy needs pattern-level oversight. Mismatching these is how catastrophic failures happen within specification.
An oversight stack is not a single monitor — it is a layered system designed to catch different categories of failure at different levels of granularity. A well-designed stack for an agent operating in a high-stakes domain typically includes: per-action validation (does this action conform to defined constraints?), pattern monitoring (does the sequence of actions represent an anomalous or unexpected aggregate behavior?), circuit breakers (automatic pause conditions that halt the agent when defined thresholds are crossed), and human review queues (a mechanism for flagging edge cases to human operators before they become errors).
The crucial design insight is that circuit breakers must be defined before deployment, not after an incident reveals the need for them. The trading firm's position limits were circuit breakers for individual trade size — but they had no circuit breaker for correlation, drawdown velocity, or aggregate directional exposure. These are foreseeable parameters. They were simply not designed for.
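To make the distinction between per-action limits and pattern-level circuit breakers concrete, here is a minimal sketch of an oversight layer that checks both. The class, field names, and thresholds are hypothetical; a real trading system would track many more parameters, such as correlation and drawdown velocity.

```python
from dataclasses import dataclass, field


@dataclass
class OversightStack:
    max_trade_size: float    # per-action constraint
    max_net_exposure: float  # pattern-level circuit breaker threshold
    net_exposure: float = 0.0
    halted: bool = False
    review_queue: list[str] = field(default_factory=list)

    def submit(self, signed_size: float) -> bool:
        """Return True if the trade is allowed to execute."""
        if self.halted:
            self.review_queue.append(f"blocked while halted: {signed_size}")
            return False
        # Per-action validation: is this single trade within limits?
        if abs(signed_size) > self.max_trade_size:
            self.review_queue.append(f"rejected oversize trade: {signed_size}")
            return False
        # Pattern-level check: would the aggregate position cross the threshold?
        projected = self.net_exposure + signed_size
        if abs(projected) > self.max_net_exposure:
            self.halted = True  # circuit breaker: pause until a human reviews
            self.review_queue.append(f"circuit breaker tripped at {projected}")
            return False
        self.net_exposure = projected
        return True


stack = OversightStack(max_trade_size=10, max_net_exposure=25)
print([stack.submit(10) for _ in range(4)])  # [True, True, False, False]
print(stack.review_queue)
```

Each trade of size 10 passes the per-action check on its own. Only the aggregate check notices that the third trade would push net exposure to 30, which is the kind of failure that individual trade review cannot see.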
Before deploying an agent, run a structured pre-mortem: assume the agent caused a significant failure six months after launch. What did it do? Work backward to identify the oversight mechanisms that would have caught that failure. Build those mechanisms before launch, not after.
The single most useful variable for calibrating autonomy level is reversibility. Actions that can be undone cheaply and quickly support higher autonomy — the error cost is bounded. Actions that are hard or impossible to reverse require lower autonomy, regardless of the agent's measured accuracy on similar tasks.
This is why safety guidance from labs such as Google DeepMind and Anthropic treats reversibility as a primary autonomy constraint. An agent that drafts emails for human review before sending can operate at high autonomy on the drafting step: a bad draft can be discarded at no cost. The same agent sending emails autonomously operates at a fundamentally different risk level even if its drafting quality is identical. The technical capability is the same. The reversibility profile is not. Autonomy calibration should track reversibility, not just accuracy.
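The calibration rule can be sketched as a small lookup from reversibility and error cost to a maximum autonomy level. The four levels come from the spectrum described earlier; the specific mapping below is an assumption made for illustration, not an established standard.

```python
from enum import Enum


class Autonomy(Enum):
    FULL_CORROBORATION = 1  # human sign-off before every action
    SUPERVISED = 2          # agent acts, everything logged for review
    CONDITIONAL = 3         # agent acts within bounds, escalates at limits
    NEAR_FULL = 4           # agent escalates only on defined exceptions


def max_autonomy(reversible: bool, high_error_cost: bool) -> Autonomy:
    """Hypothetical mapping; accuracy is deliberately not an input."""
    if not reversible and high_error_cost:
        return Autonomy.FULL_CORROBORATION
    if not reversible:
        return Autonomy.SUPERVISED
    if high_error_cost:
        return Autonomy.CONDITIONAL
    return Autonomy.NEAR_FULL


# Drafting an email for review: fully reversible, low cost.
print(max_autonomy(reversible=True, high_error_cost=False).name)   # NEAR_FULL
# Sending an email to a customer: cannot be unsent, potentially costly.
print(max_autonomy(reversible=False, high_error_cost=True).name)   # FULL_CORROBORATION
```

Leaving accuracy out of the function signature mirrors the point above: the same drafting quality yields different autonomy ceilings depending on whether the action can be undone.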
The tutor will describe an agent deployment in a high-stakes domain. Your job is to design the oversight stack: specify the autonomy level, identify the oversight mechanisms needed at each granularity, and define the circuit breakers that should be in place before launch.
Lesson 4 examines the key principles, real-world applications, and implications for practitioners working in this domain.
Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.
The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.
Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.
As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.
Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios.