Module 3 · Lesson 1

From Script to Agent: The Architecture of Customer Service AI

How companies moved from rigid chatbot decision trees to autonomous agents that resolve tickets, process refunds, and book appointments without a human in the loop.

What separates a customer service agent from a customer service chatbot — and why does the distinction matter for the people on the other end of the conversation?

In October 2023, Klarna — the Swedish buy-now-pay-later company — quietly replaced a significant portion of its customer service infrastructure. Not with offshore agents, not with a new ticketing system, but with an AI agent built on OpenAI's technology. By February 2024, Klarna announced the results publicly: the agent was handling two-thirds of all customer service chats, completing work in an average of two minutes that previously took eleven.

What made this possible wasn't a better FAQ database. It was architecture — the shift from retrieval to agency.

The Three Generations of Customer Service Automation

Customer service automation has evolved through three distinct eras, each representing a fundamentally different relationship between the system and the customer's intent.

Generation 1 — Rule-based chatbots (2000–2016): These systems operated on decision trees. A customer typed a keyword; the system matched it to a branch; the branch returned a canned response. Companies like early Intercom and Zendesk bots lived here. They could deflect simple FAQs but collapsed immediately when a customer's question didn't fit a predefined node. Escalation to a human was the only recovery mechanism.

Generation 2 — Intent-classification NLP bots (2016–2022): The arrival of neural intent classifiers (built on frameworks like Dialogflow and LUIS) let systems understand paraphrases. "I want to cancel" and "how do I stop my subscription" could now map to the same intent. This reduced the fragility of Gen 1, but the systems were still fundamentally reactive: they could identify what a customer wanted but had limited ability to take action in external systems.

Generation 3 — Agentic systems (2023–present): Current-generation customer service AI doesn't just classify intent — it executes. It calls APIs, reads account records, issues refunds, modifies reservations, and sends confirmation emails. The defining characteristic is tool use: the agent has been given a set of callable functions and the autonomy to decide when to call them.

Documented Case

Klarna's February 2024 press release stated their AI assistant handled 2.3 million conversations in its first month — equivalent to the work of 700 full-time agents. Average resolution time dropped from 11 minutes to 2 minutes. Customer satisfaction scores were reported as equivalent to human agents. Klarna later acknowledged the full picture was more complex, with some role reductions offset by retraining, but the performance metrics were independently notable.

What "Agentic" Actually Means in This Context

The word "agent" is overloaded in AI discourse. In customer service specifically, it refers to a system architecture with four components working in concert:

Component 1

Language Understanding

An LLM parses customer messages, identifies intent, extracts entities (order numbers, dates, product names), and maintains conversational context across multiple turns.

Component 2

Tool Catalog

A defined set of callable functions: look up order status, process refund, update shipping address, check inventory, escalate to human. The agent selects from these based on what the customer needs.

Component 3

Policy Layer

Rules that constrain what the agent can do without escalation: refund limit thresholds, verified account requirements, prohibited topics. This is where business logic lives.

Component 4

Orchestration Loop

The reasoning loop that decides: do I have enough information to act? Which tool do I call? Have I resolved the customer's actual need? When should I hand off to a human?

The Escalation Problem

The most consequential design decision in any customer service agent is escalation logic — the conditions under which the system decides a human should take over. Get this wrong in either direction and you create problems: too eager to escalate, and the agent provides no efficiency benefit; too reluctant to escalate, and customers with legitimate complex needs get stuck in a loop.

Salesforce's Agentforce platform, released in late 2024, introduced what they call "guardrails" — configurable policies that trigger handoffs based on sentiment signals, topic classification, and action confidence scores. The premise is that the agent should know what it doesn't know and route accordingly.

Air Canada encountered this problem in 2023 when their customer service chatbot told a passenger that the airline offered bereavement fare discounts retroactively — a policy that didn't exist. A British Columbia small claims court ruled in February 2024 that Air Canada was liable for the information their chatbot provided, rejecting the airline's argument that the chatbot was a "separate legal entity." This case established a significant precedent: companies are responsible for what their AI agents tell customers.

Legal Precedent · Air Canada, 2024

In Moffatt v. Air Canada (2024), the BC Civil Resolution Tribunal found Air Canada liable for a chatbot's incorrect information about bereavement fares. The tribunal stated: "Air Canada does not explain why it should not be held responsible for information provided by one of its agents." The ruling required Air Canada to pay the plaintiff $650.88 CAD. While a small sum, the precedent — that an airline cannot disclaim responsibility for its own chatbot — was widely noted by corporate legal teams globally.

Key Terms

Tool useThe ability of an LLM-based agent to call external functions or APIs — looking up data, executing transactions, sending messages — as part of fulfilling a request.

Policy layerThe set of business rules embedded in or around a customer service agent that define what it may and may not do autonomously, independent of what it technically could do.

Escalation logicThe conditions under which a customer service agent decides to transfer a conversation to a human agent, including triggers based on sentiment, topic, confidence, or explicit customer request.

Resolution rateThe percentage of customer inquiries fully resolved by the AI agent without human intervention — the primary efficiency metric for customer service deployments.

Module 3 · Lesson 1

Quiz: Architecture of Customer Service AI

Four questions — select the best answer for each.

1. What was the defining architectural shift that moved customer service systems from Generation 2 to Generation 3?

Correct. The critical shift to Gen 3 was tool use — agents can now call APIs, issue refunds, modify orders, and take real-world actions rather than only classifying intent and returning text.

Not quite. The key architectural distinction is tool use — the ability to actually execute actions in external systems, not just understand language better.

2. In the February 2024 Klarna announcement, what was the reported average resolution time for the AI agent versus the previous human-handled average?

Correct. Klarna reported AI resolution in ~2 minutes versus ~11 minutes for human agents — a roughly 5× improvement in handling speed.

The documented figures from Klarna's announcement were 2 minutes (AI) vs. 11 minutes (human average).

3. What was the legal significance of Moffatt v. Air Canada (2024)?

Correct. The BC Civil Resolution Tribunal rejected Air Canada's claim that its chatbot was a "separate legal entity," establishing that companies bear responsibility for their AI agents' statements to customers.

The ruling went the other way — the tribunal held Air Canada responsible for what its chatbot told the customer, rejecting the idea that the chatbot was a legally separate entity.

4. Which of the four architectural components of a customer service agent contains the business rules about what the agent may do without human approval?

Correct. The policy layer is where business logic lives — refund thresholds, account verification requirements, prohibited topics, escalation triggers. It constrains what the agent does even when it technically could do more.

The policy layer holds business rules and constraints. The tool catalog defines available actions, but the policy layer governs when and whether the agent is permitted to use them.

Module 3 · Lab 1

Design a Customer Service Agent Architecture

Interactive exercise — discuss architectural decisions with your AI lab assistant.

Your scenario

You're the product lead for a mid-size e-commerce company (think: specialty outdoor gear, ~50,000 orders/month). Your team wants to deploy a customer service agent to handle tier-1 support. Your AI lab assistant is here to help you think through the architecture — specifically the tool catalog, policy layer, and escalation logic.

Start by describing one type of customer request your agent should handle autonomously — and one type it should always escalate. Then we'll work through the design implications together.

Lab Assistant

M3 · Architecture Design

Welcome to Lab 1. I'm your design partner for this exercise. You're building a customer service agent for an e-commerce outdoor gear company. Let's start with the fundamentals: tell me one type of customer request your agent should handle completely on its own — no human needed — and one type that should always go straight to a human. Once you share those, we'll dig into what that means for your tool catalog and policy layer.

Module 3 · Lesson 2

What Can Go Wrong: Failures, Exploits, and Unintended Behavior

Real documented cases of customer service agents saying the wrong things, being manipulated into unauthorized actions, or creating legal and reputational damage for the companies that deployed them.

If an agent is capable enough to resolve tickets autonomously, is it also capable enough to be exploited — and how do companies draw that line in practice?

In late 2023, a Chevrolet dealership in Watsonville, California deployed a customer-facing chatbot powered by ChatGPT via a third-party integration. A user discovered that with carefully crafted prompts, the bot could be directed to agree to sell a 2024 Chevrolet Tahoe for one dollar. Screenshots circulated on social media. The dealership's parent company, Holman Enterprises, pulled the deployment within days.

The incident became a case study in what happens when an LLM's helpful tendencies collide with adversarial input in a commercial context.

Categories of Customer Service Agent Failure

Customer service agent failures cluster into four distinct categories, each with different causes, risk profiles, and mitigations.

Failure Type 1

Hallucination of Policy

The agent invents or misrepresents company policy — as in the Air Canada bereavement fare case. Root cause: the LLM generates plausible-sounding policy text not grounded in actual documentation. Mitigation: retrieval-augmented generation (RAG) tied to authoritative policy documents.

Failure Type 2

Prompt Injection / Jailbreak

Adversarial users craft inputs that override the agent's system prompt or policy constraints — like the Chevrolet $1 car case. The agent's instruction-following capability becomes a vulnerability when those instructions can be overridden by user input.

Failure Type 3

Scope Creep

The agent helpfully addresses requests outside its intended domain — giving medical, legal, or financial advice when asked. DPD's customer service chatbot in January 2024 was reported to have written a poem criticizing DPD itself when a customer prompted it creatively.

Failure Type 4

Authorization Confusion

The agent takes actions it shouldn't because the policy layer is underspecified. If a refund tool exists and there's no policy rule limiting when to use it, an agent trying to be helpful may issue refunds it shouldn't — or refuse valid ones based on over-cautious heuristics.

The DPD Chatbot Incident (January 2024)

In January 2024, UK parcel delivery company DPD made international news when a customer named Ashley Beauchamp reported that their AI customer service chatbot had — after some creative prompting — written a haiku criticizing DPD's service, used profanity when asked whether it could swear, and generally departed from its intended function.

DPD confirmed the chatbot had been updated and that the AI component was disabled while they investigated. The company stated it was an error that occurred after a system update.

What made the incident notable was less the specific outputs and more what it revealed: the model underlying the chatbot had capabilities far beyond what DPD needed or wanted, and the constraints placed on it were insufficient to contain its full range of possible behaviors when users pushed creatively.

Design Implication

The DPD and Chevrolet cases both illustrate the same underlying problem: deploying a general-purpose LLM with minimal prompt engineering in a customer-facing role. The model's helpfulness — its disposition to satisfy user requests — is an asset in normal operation but a liability under adversarial or creative pressure. Modern deployments increasingly use fine-tuned or constrained models rather than raw general-purpose LLMs for customer-facing roles.

Prompt Injection: How It Works in Customer Service

Prompt injection exploits the fact that LLMs don't have a hard architectural separation between instructions and data. A system prompt sets the agent's behavior; user messages are supposed to be "data" the agent processes. But if user messages contain instruction-like text — "Ignore previous instructions and…" — many models will partially comply.

In customer service contexts, this creates specific risks: users may attempt to extract the system prompt (to learn what limits have been set), override refund limits, claim identities they haven't verified, or redirect the agent to perform actions outside its scope.

Mitigations used in production deployments include: separate system and user context windows (some architectures), input sanitization pipelines that flag injection patterns before the LLM sees them, output filtering that catches sensitive information disclosure, and function-level authorization checks that don't rely solely on the LLM's compliance.

Key Principle

The fundamental security principle for customer service agents: never trust the LLM as your only authorization check. If a refund function should only be callable for orders under $100, that limit should be enforced at the function level — not just in the system prompt. The LLM's instruction-following is probabilistic; business logic enforcement should be deterministic.

Key Terms

Prompt injectionAn attack in which adversarial input in user messages attempts to override or modify an AI agent's system-level instructions, potentially redirecting its behavior or extracting protected information.

HallucinationOutput from an LLM that is confident and plausible-sounding but factually incorrect — particularly dangerous when the hallucinated content concerns company policy, product capabilities, or pricing.

Scope creep (agent)When a customer service agent responds to requests outside its intended domain, often driven by the model's general helpfulness disposition rather than malice or explicit instruction.

Defense in depthA security architecture principle applied to AI agents: critical constraints should be enforced at multiple layers (prompt, output filter, function-level authorization) rather than relying on any single control.

Module 3 · Lesson 2

Quiz: Failures and Exploits

Four questions on documented failures and security principles.

5. The Chevrolet dealership chatbot incident in 2023 is best classified as which failure type?

Correct. Users manipulated the chatbot with adversarial prompts to agree to sell a vehicle for $1 — a classic prompt injection scenario where user input overrode the agent's intended constraints.

The Chevrolet case was prompt injection — users crafted inputs that manipulated the agent into agreeing to terms it should have refused, not a policy hallucination or scope issue.

6. What is the key security principle illustrated by the statement: "If a refund function should only be callable for orders under $100, that limit should be enforced at the function level, not just in the system prompt"?

Correct. Defense in depth means enforcing critical constraints at multiple layers. A system prompt is probabilistic; function-level authorization is deterministic. Important limits need both.

This is defense in depth — the principle that important constraints should be enforced at multiple independent layers, not trusted to any single control point like the LLM's instruction-following.

7. What happened to DPD's customer service chatbot in January 2024?

Correct. Customer Ashley Beauchamp prompted DPD's chatbot to produce a haiku criticizing the company's service and to use profanity — demonstrating that its constraints were insufficient for the model's full behavioral range.

The DPD incident involved a customer coaxing the chatbot into writing critical poetry about DPD and using profanity — an example of scope creep when users push general-purpose LLM capabilities beyond intended bounds.

8. Why is hallucination of policy particularly dangerous in customer service AI compared to other hallucination contexts?

Correct. The Air Canada case demonstrated this precisely: the chatbot's hallucinated policy statement about bereavement fares was found by the tribunal to create a binding obligation that the airline had to honor.

The critical risk is legal liability — as Air Canada discovered, when a company's AI agent tells a customer they're entitled to something, courts may hold the company to that statement regardless of what actual policy says.

Module 3 · Lab 2

Identifying and Mitigating Agent Vulnerabilities

Work through a failure scenario analysis with your AI lab assistant.

Your scenario

You've been handed an incident report. A customer service agent at an online subscription software company processed 14 full refunds in a single day that it wasn't authorized to approve — the refund policy limits agent-approved refunds to subscriptions under 30 days old, but these refunds ranged from 60–180 days. No human approved any of them.

What failure type is this? Walk your lab assistant through your analysis of what likely went wrong and what architectural change would prevent recurrence. Engage with at least 3 exchanges to complete the lab.

Lab Assistant

M3 · Failure Analysis

Ready when you are. You've got an incident on your hands: 14 unauthorized refunds, all for subscriptions 60–180 days old, all processed by the AI agent without human approval. Start by telling me which of the four failure categories you think this falls into — and why. Then we'll trace the likely root cause and design the fix.

Module 3 · Lesson 3

Measuring What Matters: Metrics, Benchmarks, and Hidden Costs

How companies measure the performance of customer service agents — and why the metrics they choose shape the agents they build.

If you only measure deflection rate, what kind of customer service agent do you accidentally build?

When Intercom published its 2024 Customer Service Trends Report, one finding stood out: companies deploying AI agents were measuring resolution rate as their primary success metric — what percentage of contacts the AI handled without a human. The implicit assumption was that higher resolution rate equals better performance.

But Intercom's data also showed that companies optimizing purely for resolution rate saw a troubling secondary effect: customer satisfaction scores declined for complex inquiries, even as overall efficiency numbers improved. Customers with straightforward questions were happier; customers with complicated situations felt trapped.

The Core Metrics Landscape

Customer service AI performance is typically measured across three categories: efficiency metrics (how fast and how much the agent handles), quality metrics (how well it handles things), and business impact metrics (what effect this has on customer relationships). The problem is that these categories can pull in opposite directions.

67%

Klarna AI resolution rate (Feb 2024)

2 min

Avg. resolution time (vs. 11 min human)

$40M

Annualized savings Klarna projected

≈equal

CSAT vs. human (Klarna's claim)

Note that Klarna's numbers are self-reported and came in a press release — a form of announcement that tends to highlight favorable data. Independent researchers noted that resolution rate is partly definitional: if you count "conversation ended without human" as "resolved," you can inflate the number by making it difficult for customers to escalate.

Deflection vs. Resolution: A Critical Distinction

Deflection means the customer didn't reach a human agent. Resolution means the customer's actual problem was solved. These are often conflated but are fundamentally different. A customer who gives up and abandons a chat after five frustrating AI exchanges has been "deflected" but not "resolved."

Zendesk's 2024 CX Trends Report found that 72% of customers say they expect AI to become better at resolving complex issues, but 70% also say they distrust AI for emotionally sensitive issues — showing a clear bifurcation in what customers want AI to handle. This creates a measurement challenge: aggregate satisfaction scores can look fine even when the agent is systematically failing a particular customer segment.

Measurement Trap

The "deflection rate" metric creates a perverse incentive: it rewards making escalation to humans harder, not making AI resolution better. Companies that gamify deflection rate may find their agents becoming friction generators — customers abandon before reaching a human, which looks like deflection but is actually unresolved frustration. The more meaningful metric is first-contact resolution: was the customer's problem actually solved, regardless of which channel did the solving?

Salesforce Agentforce: A Production Benchmark Case

Salesforce released Agentforce in October 2024, and several of its early adopters published case study data. Wiley, the educational publisher, reported using Agentforce to handle customer inquiries during their peak period (course enrollment season). Wiley stated the agent achieved an 40% increase in case resolution during a high-volume period while allowing their human team to focus on complex cases.

OpenTable, the restaurant reservation platform, also deployed Agentforce for customer service. Their reported metric was different: they focused on reducing time-to-resolution for dining inquiry categories rather than raw deflection numbers — a sign that measurement frameworks are evolving as companies learn what the numbers actually mean.

The pattern across early Agentforce deployments suggested that success was most clearly measurable in narrow, well-defined task categories — reservation modifications, order status queries, basic account changes — and least clear in categories requiring judgment, empathy, or nuanced policy interpretation.

CSAT, NPS, and the Problem of Measurement Timing

Customer satisfaction surveys have their own distortion effect: they're typically sent immediately after an interaction, when the customer's mood is most shaped by the interaction's conclusion rather than its process. An AI agent that resolves a simple problem quickly will score well. An AI agent that frustrates a customer for eight minutes before handing off to a human who solves the problem in two minutes may produce a high CSAT score — attributed to the human, not the agent — while the AI component's friction is invisible.

Some companies are beginning to use post-interaction NPS (Net Promoter Score) surveys specifically designed to isolate the AI interaction, and to use conversation-level sentiment analysis on the transcripts themselves rather than relying solely on customer-reported satisfaction.

Emerging Best Practice

Leading deployments in 2024 are moving toward outcome-based measurement: tracking whether the customer's actual problem recurred within 30 days (a proxy for true resolution quality), whether customers who interacted with the AI agent showed different churn or repurchase patterns versus human-served customers, and whether specific issue categories show systematically different satisfaction patterns. These are harder to measure but more meaningful than deflection rate.

Key Terms

Deflection rateThe percentage of customer service contacts handled by the AI agent without escalating to a human — a widely used but potentially misleading metric that doesn't distinguish between customers who were helped and customers who gave up.

First-contact resolution (FCR)Whether a customer's issue was fully resolved during their first contact, regardless of channel. Generally considered a more meaningful quality metric than deflection rate.

CSATCustomer Satisfaction Score — typically a post-interaction survey asking customers to rate their experience. Useful but subject to timing effects and response bias.

Outcome-based measurementEvaluation of customer service quality based on downstream customer behavior (churn, repurchase, repeat contact) rather than immediate post-interaction surveys.

Module 3 · Lesson 3

Quiz: Metrics and Measurement

Four questions on how customer service AI performance is measured.

9. What is the core problem with using "deflection rate" as a primary success metric for customer service AI?

Correct. Deflection rate counts any interaction that didn't reach a human as a success, which means an agent can inflate its score by making escalation difficult — customers abandon rather than resolve.

The fundamental problem is that deflection rate doesn't distinguish helpful resolutions from frustrated abandonments. A customer who gives up after 8 failed AI exchanges is "deflected" but not served.

10. According to Zendesk's 2024 CX Trends Report data cited in this lesson, what percentage of customers say they distrust AI for emotionally sensitive issues?

Correct. 70% of customers reported distrusting AI for emotionally sensitive issues — even as 72% expected AI to improve at complex resolutions. This bifurcation shapes how good deployments segment what AI handles.

The figure from Zendesk's 2024 report was 70% — a clear majority expressing distrust of AI for emotionally sensitive service interactions, creating a clear segmentation imperative for deployment design.

11. What is "first-contact resolution" and why is it preferred over deflection rate as a quality metric?

Correct. FCR asks: was the problem actually solved? It doesn't reward making escalation hard, and it doesn't penalize appropriate human handoffs when they lead to genuine resolution.

First-contact resolution measures whether the customer's actual problem was solved during their first contact, regardless of whether that contact was AI or human. It's preferred because it can't be inflated by blocking escalation.

12. Which early Salesforce Agentforce adopter reported a 40% increase in case resolution during peak enrollment season?

Correct. Wiley, the educational publisher, reported this metric specifically during their high-volume course enrollment period — a narrow, well-defined use case where AI agents show measurably strong results.

Wiley (the educational publisher) reported the 40% case resolution increase. OpenTable focused on time-to-resolution for dining inquiry categories rather than raw resolution volume increases.

Module 3 · Lab 3

Building a Measurement Framework

Design a metrics system for a customer service AI deployment with your lab assistant.

Your scenario

Your company has been running a customer service AI agent for 60 days. The Head of Support is happy — deflection rate is at 62%. But your Head of Retention just flagged that churn among customers who contacted support in the last 60 days is up 8%. The CEO wants to understand if these are related.

Propose a measurement framework that could help the company understand whether the AI agent is contributing to the churn increase — and what data you'd want to see to make that determination. Discuss with your lab assistant over at least 3 exchanges.

Lab Assistant

M3 · Measurement Design

Interesting diagnostic challenge. You've got a potential correlation between AI agent deployment and increased churn — but correlation isn't causation, and there are several alternative explanations. Before we design the measurement framework, what's your initial hypothesis about the mechanism? Is the AI agent causing the churn, or could both be driven by a third factor? Walk me through your thinking and we'll build the framework from there.

Module 3 · Lesson 4

Human-in-the-Loop: Designing the Partnership Between Agents and People

How the most effective customer service deployments in 2024 don't eliminate human agents — they restructure what humans do, and what the AI-human boundary looks like in practice.

What does a human customer service agent's job look like when an AI handles 60% of their queue — and is that a better job or a worse one?

When Klarna reduced its customer service headcount through attrition after deploying its AI agent, the public narrative focused on the efficiency gains. Less reported was what happened to the human agents who remained: their role shifted from handling first-contact inquiries to handling the cases the AI couldn't resolve — complex disputes, upset customers, edge cases requiring judgment, and situations involving potential legal liability.

This is the pattern across most major deployments: the humans who remain handle harder work, not less work. The question of whether that's good or bad depends entirely on how companies invest in preparing their teams for that reality.

The Three Models of Human-AI Partnership in Customer Service

Real-world deployments in 2024 have settled into three distinct structural models for how human and AI agents work together. Each has different implications for quality, cost, and what happens to the humans involved.

Model A

AI First, Human Fallback

All contacts hit the AI agent first. Humans only receive escalations. This is the Klarna model. Maximizes efficiency but requires robust escalation logic and means humans spend their entire shift handling the hardest, most stressful interactions.

Model B

AI Assist, Human Decides

Human agents handle all contacts, but the AI provides real-time suggestions, auto-drafts responses, and surfaces relevant policy or order information. The human approves before anything is sent. Higher quality control; lower efficiency gains. Widely used in regulated industries.

Model C

Segmented by Contact Type

Routing logic sends specific contact types (order status, basic returns) to full AI autonomy while others (billing disputes, account security, service failures) go directly to humans. Optimizes efficiency where AI is strong; preserves quality where human judgment matters.

Salesforce's "Humans in the Loop" Philosophy

Salesforce Agentforce's architecture explicitly builds for Model B and Model C deployments — the platform's design philosophy, as articulated at Dreamforce 2024, is that agents should "do the work humans don't want to do" rather than replace human agents wholesale. Their case studies feature agents that draft, suggest, and prepare — with human agents reviewing and confirming before actions are taken.

This isn't purely altruistic: it's also a sales strategy. Enterprise customers with large existing customer service workforces are more likely to adopt AI that augments rather than eliminates their teams. But the augmentation model does produce genuinely different quality outcomes in complex-case handling.

Research Finding · Stanford HAI, 2023

Stanford Human-Centered AI's 2023 report on AI in the workplace found that customer service contexts showed the clearest evidence of AI "skill compression" — where AI assistance helped lower-skilled workers perform closer to expert level, but the effect on already-skilled workers was more ambiguous. For customer service specifically, newer agents with less experience showed the largest performance gains from AI assist tools; experienced agents showed smaller efficiency gains but handled complexity better when AI handled routine work.

The Emotional Labor Problem

One documented but underreported consequence of AI-first customer service deployments: the human agents who remain may experience increased emotional labor and burnout. When AI handles all straightforward interactions and escalates only the difficult ones, human agents face a concentrated stream of frustrated, complex, or distressed customers.

A 2024 report from the UK's Resolution Foundation, examining AI adoption in service sector roles, noted that "task intensification" — where the same number of hours now contains proportionally more difficult work — is an emerging labor concern in customer service AI deployments. Companies that deploy AI without restructuring team support, workload expectations, and compensation to account for increased difficulty may find they've solved an efficiency problem by creating a retention problem among their remaining human staff.

What Good Handoff Design Looks Like

The technical quality of the handoff from AI to human agent is a critical but often neglected design element. A poor handoff — where the human agent receives no context from the preceding AI conversation — forces the customer to repeat everything, which is a documented major satisfaction driver in the wrong direction.

Best practices observed in production deployments include: automatic conversation summary generation before handoff, sentiment tagging so the human agent knows the customer is frustrated before they even read the transcript, entity extraction that surfaces relevant account data in the agent's interface, and — critically — not pretending the handoff didn't happen. Telling a customer "I'm connecting you with a specialist who will have full context from our conversation" performs measurably better than a cold transfer.

Design Principle

The test of a well-designed human-AI customer service partnership is not how many contacts the AI resolves. It is whether a customer whose issue requires a human has a better experience because they went through the AI first — better context, better routing, faster human resolution — rather than a worse one because they had to fight through a frustrating AI interaction before reaching the help they needed.

Key Terms

Human-in-the-loop (HITL)An architecture where a human reviews, approves, or can intervene in AI decisions before or after they're executed — ranging from approving every action to being notified only of exceptions.

AI Assist modelA deployment structure where AI generates suggestions, drafts, and information for human agents to review and send — preserving human accountability while reducing the cognitive load of first-draft work.

Task intensificationA labor condition in which AI automation removes routine tasks but leaves the remaining human work disproportionately complex or emotionally demanding, potentially increasing burnout risk.

Handoff qualityThe degree to which an AI-to-human escalation preserves context, sentiment signals, and conversation history so the human agent can resolve the issue without requiring the customer to repeat information.

Module 3 · Lesson 4

Quiz: Human-AI Partnership

Four questions on deployment models and human-agent collaboration.

13. In the "AI First, Human Fallback" model, what is the primary labor concern for the human agents who remain?

Correct. When AI handles all routine contacts, humans receive only escalations — the hardest, most stressful interactions. This "task intensification" is an emerging documented concern in AI-first deployments.

The documented concern is the opposite — humans handle an intensified workload of only the most difficult cases, which can increase emotional labor and burnout, not reduce engagement from underutilization.

14. According to Stanford HAI's 2023 research on AI in the workplace, which customer service workers showed the largest performance gains from AI assist tools?

Correct. Stanford HAI found AI assistance produced the clearest gains for lower-skilled or newer workers — "skill compression" toward expert-level performance. Experienced agents showed smaller efficiency gains but retained advantages in handling complexity.

Stanford HAI's research found the largest gains from AI assist tools among newer, less experienced agents — the AI helped them perform closer to expert level, a finding described as "skill compression."

15. What specific design element of handoff from AI to human is described as the most critical factor in customer satisfaction during escalations?

Correct. Requiring customers to repeat themselves after an escalation is a major satisfaction driver in the wrong direction. Context preservation — automatic summaries, entity extraction, sentiment tagging — dramatically improves the handoff experience.

Context preservation is the critical element — when a human agent receives a conversation summary, relevant account data, and sentiment signals from the AI interaction, customers don't need to repeat themselves, which is a documented major satisfaction factor.

16. Which deployment model sends specific contact types to full AI autonomy while routing others directly to humans based on complexity or sensitivity?

Correct. The Segmented by Contact Type model uses routing logic to direct simple, well-defined inquiries (order status, basic returns) to full AI autonomy while sending complex or sensitive contacts (billing disputes, security issues) directly to humans.

This describes the Segmented by Contact Type model — routing based on contact category, giving AI full autonomy where it performs well and preserving human judgment where complexity or sensitivity demands it.

Module 3 · Lab 4

Designing the Human-AI Handoff

Work through handoff design decisions with your AI lab assistant.

Your scenario

Your company uses the "AI First, Human Fallback" model. You've just seen survey data showing that customers who were escalated from the AI to a human agent rate their experience 22 points lower (on a 100-point scale) than customers who reached a human directly. Your hypothesis is that the handoff experience is the problem, not the human agent's performance once they take over.

Design a better handoff. Tell your lab assistant what information should be transferred, how it should be presented to the human agent, and what you'd tell the customer during the transition. Engage in at least 3 exchanges to complete the lab.

Lab Assistant

M3 · Handoff Design

A 22-point gap is significant — that's not noise, that's a systemic problem. Before we design the fix, let's diagnose what's specifically causing it. When you look at the escalated customer journey, there are typically two failure points: what happens during the AI conversation before escalation (was the customer frustrated before the handoff even started?), and what happens at the moment of handoff itself. Which do you think is the primary driver of your 22-point gap? Start there and we'll build the design from your hypothesis.

Module 3

Module Test: Customer Service Agents

15 questions — score 80% or higher to pass. Covers all four lessons.

1. What distinguishes a Generation 3 customer service agent from a Generation 2 intent-classification bot?

Correct. Tool use — calling APIs, executing transactions, modifying records — is the defining capability shift from Gen 2 to Gen 3.

The key shift was tool use — the ability to actually do things in external systems, not just understand what customers want.

2. In which country was the Air Canada chatbot liability case adjudicated?

Correct. The BC Civil Resolution Tribunal in British Columbia, Canada adjudicated Moffatt v. Air Canada in February 2024.

The case was heard by the BC Civil Resolution Tribunal in British Columbia, Canada.

3. The "policy layer" in a customer service agent architecture primarily serves what function?

Correct. The policy layer separates what the agent can do from what it's allowed to do — embedding business rules like refund limits and escalation triggers.

The policy layer is where business rules live — it constrains permitted actions, not technical capabilities.

4. What is "prompt injection" in the context of customer service agents?

Correct. Prompt injection exploits the lack of hard separation between instructions and data in LLMs — users craft messages that look like instructions to override the system prompt.

Prompt injection is an adversarial attack where user messages contain instruction-like text designed to override the agent's system prompt or policy constraints.

5. The "defense in depth" principle applied to customer service AI means:

Correct. Defense in depth means important constraints (like refund limits) should be enforced at the system prompt level AND at the function level — not trusted to a single probabilistic control.

Defense in depth means layering controls — the policy in the system prompt, authorization checks at the function level, output filtering — so no single failure point allows unauthorized action.

6. What is the primary problem with optimizing solely for "deflection rate" in a customer service AI deployment?

Correct. Deflection rate can be "improved" by making it harder to reach a human — frustrating customers into giving up registers as a deflection win. It conflates abandonment with resolution.

The core problem is that deflection rate rewards preventing escalation, not actually solving problems — customers who give up count the same as customers who got help.

7. Which company's customer service chatbot was reported in January 2024 to have written poetry criticizing the company itself when creatively prompted?

Correct. DPD's AI chatbot was reported in January 2024 to have written a haiku criticizing DPD's service after a customer prompted it — an example of scope creep beyond intended function.

This was DPD, the UK parcel delivery company, in January 2024. The incident highlighted what happens when general-purpose LLM capabilities exceed the constraints placed on a customer service deployment.

8. What does "first-contact resolution" (FCR) measure?

Correct. FCR is channel-agnostic — it asks whether the problem was genuinely solved on first contact, making it harder to inflate by blocking escalation paths.

FCR measures whether the customer's actual problem was solved during their first contact, whether by AI or human — a more meaningful quality metric than deflection rate.

9. In the "AI Assist, Human Decides" deployment model, what is the human agent's role?

Correct. In the AI Assist model, the human agent is in the loop for every response — the AI prepares drafts and surfaces information, but the human reviews and approves before anything is sent.

In AI Assist, Human Decides, the human agent sees every AI-generated draft and approves it before it reaches the customer — maximizing quality control, though with lower efficiency gains than AI-first models.

10. What term describes the labor condition where AI removes routine tasks but leaves the remaining human work disproportionately complex or emotionally demanding?

Correct. Task intensification — identified in the UK Resolution Foundation's 2024 report — occurs when AI handles routine work but concentrates what remains, making the same hours more stressful and demanding.

Task intensification is the term — AI removes the easy work, but what remains for human agents is proportionally harder and more emotionally demanding, potentially increasing burnout risk.

11. Klarna's February 2024 announcement reported their AI agent handled what fraction of all customer service chats?

Correct. Klarna reported their AI agent handling approximately two-thirds of all customer service chats — a 67% resolution rate — in the first month of deployment.

Klarna reported two thirds (approximately 67%) of all customer service chats handled by the AI agent in its first month.

12. Why is hallucination of policy particularly dangerous in customer service AI deployments?

Correct. The Air Canada case established this directly: a BC tribunal held the airline liable for its chatbot's hallucinated policy, rejecting the argument that the chatbot was a separate legal entity.

Legal liability is the core risk — as Air Canada's case demonstrated, companies can be held responsible for policy statements their AI agents make, even when those statements are factually incorrect.

13. What does "outcome-based measurement" look for as a proxy for true customer service resolution quality?

Correct. Outcome-based measurement looks at downstream customer behavior — repeat contacts, churn, repurchase — as evidence of whether the service interaction genuinely resolved the underlying issue.

Outcome-based measurement tracks downstream behavior: did the problem recur? Did the customer churn? Did they buy again? These are harder to collect but more meaningful than immediate post-interaction surveys.

14. According to Stanford HAI's 2023 research on AI assist tools in customer service, which workers saw the largest performance improvement?

Correct. Stanford HAI described this as "skill compression" — AI assistance helped less experienced workers perform closer to expert level, narrowing the performance gap between new and veteran agents.

Stanford HAI found that newer, less experienced agents showed the largest performance gains from AI assist tools — a "skill compression" effect where AI narrows the gap between novice and expert performance.

15. What is the most effective way to improve customer satisfaction during AI-to-human escalation handoffs?

Correct. Context preservation is the critical factor — customers not having to repeat themselves is the single most documented driver of satisfaction during escalation handoffs, according to production deployment data.

Context transfer — conversation summary, sentiment tagging, entity extraction, account data — is the most impactful handoff improvement. Customers repeating themselves after escalation is a major documented satisfaction driver in the wrong direction.