Module 3 · Lesson 1

The Centaur Model: Dividing Cognitive Labor

How humans and AI systems split tasks to outperform either working alone

What does it actually look like when a human and an AI each do what they do best?

When organizers of the 2005 PAL/CSS Freestyle Chess Tournament allowed human-computer teams to compete together, the results upended everyone's assumptions. The strongest players were not grandmasters. They were not supercomputers. They were pairs of average-skilled humans working in tight coordination with consumer chess software. The humans decided when to trust the engine, when to override it, and how to structure the search. The combination beat both unaided grandmasters and autonomous computers. Garry Kasparov called these teams "centaurs" — half human, half machine.

The lesson echoed far beyond chess: structured division of cognitive labor consistently produces outcomes neither partner achieves alone.

What the Centaur Model Means in Practice

The centaur metaphor captures a specific workflow architecture: the human sets the goal, evaluates options at high-stakes decision points, and provides contextual judgment; the AI generates options, checks consistency, processes large data sets, and flags anomalies at speed the human cannot match. Neither partner tries to do the other's job.

This is different from "human oversight of AI." Oversight implies the AI does the work and the human checks it afterward. In the centaur model, the division happens before and during the task, not only at the end.

Real Case — Radiologists + AI, 2019 NHS Pilot

A 2019 NHS trial by DeepMind (now Google Health) at Moorfields Eye Hospital showed AI-assisted diagnosis of eye diseases matched or exceeded the accuracy of senior ophthalmologists on 50+ conditions. Critically, the system was deployed not to replace clinicians but to triage the scan queue — flagging urgent cases so specialists focused attention where it mattered most. The human made the final recommendation; the AI restructured which cases the human saw first. That single task division reduced average referral time from 41 days to under 7.

The Four Zones of Cognitive Labor

Researchers at MIT's Work of the Future task force (2021) proposed a practical four-zone map for assigning tasks in human-AI teams:

Zone 1 — AI-Led

High-volume, rule-consistent tasks with clear success criteria. Data classification, spell-check, scheduling conflicts, fraud pattern detection. Human reviews by exception.

Zone 2 — AI-Assisted Human

Complex tasks where AI surfaces options and flags risks, but human judgment drives decisions. Medical diagnosis, legal research, architectural design review.

Zone 3 — Human-Led, AI-Informed

High-stakes or novel situations where AI provides data but the human owns the reasoning. Negotiation, crisis management, ethical policy decisions.

Zone 4 — Human-Only

Relationship-critical, legally accountable, or morally weighted actions. Terminating an employee, consent conversations with patients, courtroom advocacy.

Why Division Fails Without Explicit Design

Studies of air-traffic control automation failures (FAA Human Factors Research, 2018) found the most common error pattern was mode confusion — controllers assumed the automation was handling a task that the system believed the controller was handling. No explicit handoff had been agreed. Task overlap and task gaps both create risk.

The centaur model requires an explicit interface: who owns each decision, how the human signals override, and what happens when the AI's confidence is low. Without these agreements, research by Harvard Business School's Tsedal Neeley (2022) found AI collaboration productivity gains dropped by more than 40% within three months as teams defaulted back to sequential review rather than genuine parallel labor.

Key Insight

The chess centaur's edge came not from using the AI more — it came from knowing precisely when not to follow it. That metacognitive skill — recognizing the boundary of the AI's reliability — is the core human contribution in any centaur arrangement.

Key Terms

Centaur ModelA human-AI team structure in which each partner handles distinct cognitive tasks aligned to their comparative advantage, developed from Kasparov's analysis of Freestyle Chess results.

Mode ConfusionAn automation error where human and system both assume the other is managing a task, resulting in neither managing it.

Cognitive Labor DivisionExplicit pre-assignment of task types to human vs. AI agents before work begins, rather than ad-hoc delegation.

Lesson 1 Quiz

The Centaur Model · 4 questions

1. In the 2005 Freestyle Chess Tournament, which team type produced the strongest results?

Correct. Kasparov noted that the human-computer teams ("centaurs") outperformed both unaided grandmasters and autonomous machines — the key was structured task division, not raw power.

Not quite. The tournament's surprise result was that average humans paired with consumer software beat both grandmasters and supercomputers, because the humans knew when to trust and when to override the engine.

2. In the 2019 NHS DeepMind eye-disease trial at Moorfields, what was the primary role of the AI system?

Correct. The AI restructured which cases the human specialist saw first — a task-division move that cut average referral time from 41 days to under 7 without removing the human from final decisions.

Review the NHS case. The AI's job was queue triage, not diagnosis replacement. Humans retained all final clinical decisions.

3. According to the MIT Work of the Future framework, which zone would best describe a negotiation where an AI provides market-rate data but a human leads all strategic choices?

Correct. Zone 3 applies when humans own the reasoning and stakes are high, but AI still informs the process with data. Negotiation fits this description.

Review the four zones. Zone 3 is "Human-Led, AI-Informed" — the AI contributes data but the human drives all strategic judgment, which describes a negotiation scenario accurately.

4. What is "mode confusion" in the context of human-AI collaboration?

Correct. Mode confusion — documented in FAA air-traffic control research — occurs when task ownership is ambiguous and both parties assume the other is responsible, creating dangerous gaps.

Mode confusion specifically refers to the dangerous gap that appears when task ownership hasn't been explicitly agreed upon, leaving both human and system assuming the other is in control.

Lab 1 — Mapping the Centaur

Practice applying the four-zone cognitive labor model to real scenarios

Your Task

You'll work with the AI assistant to analyze real or realistic work scenarios and assign them to the correct cognitive labor zone (1–4). Practice explaining why a task belongs in a zone, and what the human-AI handoff should look like.

Starter prompt: "A hospital billing department wants to use AI to review insurance claims. Some claims are straightforward; others involve appeals requiring patient context. How should they divide this work across the four zones?"

AI Lab Assistant

Centaur Model · Zone Analysis

Welcome to Lab 1. We're practicing the four-zone cognitive labor framework from the MIT Work of the Future research. Give me any work scenario — real or hypothetical — and we'll figure out together how to divide it between human and AI. You can use the starter prompt above or bring your own example. What scenario would you like to analyze?

Module 3 · Lesson 2

Prompt Engineering as a Professional Skill

How the way you frame a request to an AI system changes the quality of everything that follows

Is writing good prompts a technical skill, a communication skill, or something else entirely?

In early 2023, Wharton professor Ethan Mollick ran a controlled experiment with MBA students. Two groups were given the same business strategy task using GPT-4. One group used the model with no instruction. The other used a structured prompting framework Mollick had developed: specify role, context, constraints, and output format explicitly. The structured group's outputs were rated significantly higher by blind evaluators — not because they had better AI access, but because they had better prompting discipline. Mollick published the results and the framework, and it spread rapidly through corporate training programs at Deloitte, BCG, and Microsoft.

Why Prompting Matters More Than Most People Think

A large language model doesn't "know" what you want — it predicts what should come next given your input. Vague input produces vague output. Specific, structured input produces specific, useful output. This isn't a limitation to work around; it's a fundamental property that skilled users exploit.

Anthropic's internal analysis of Claude usage patterns (released in summary form in their 2023 model card) found that the quality difference between the top quartile and bottom quartile of user outputs was explained more by prompt structure than by any difference in the underlying model version being used.

The RCTF Framework

Derived from Mollick's Wharton work and subsequent refinements by Microsoft's AI productivity team (published in their 2023 Copilot usage guidelines), the RCTF structure gives you four components for any significant prompt:

Role. Tell the AI what expert or perspective to adopt. "Act as a senior employment lawyer reviewing this contract" outperforms "review this contract" every time.

Context. Provide the situation, constraints, and audience. What does the AI need to know about your specific case that it cannot infer from the task alone?

Task. State the action verb precisely. "Summarize" vs. "identify three risks" vs. "write a rebuttal" — each produces entirely different outputs.

Format. Specify length, structure, and audience. "In bullet points, under 200 words, for a non-technical board member" shapes the response dramatically.

Real Case — GitHub Copilot Productivity Study, 2022

Microsoft Research published a randomized controlled trial in September 2022 showing GitHub Copilot users completed coding tasks 55% faster than the control group. But a follow-up analysis by the GitHub team found that the fastest users weren't using autocomplete passively — they were writing detailed comment-prompts before each function, essentially using comments as structured task specifications that Copilot then fulfilled. The "prompt-first" coders outperformed passive users by an additional 23%.

Chain-of-Thought and Iterative Prompting

A single prompt is rarely the end of the interaction. Research by Wei et al. at Google Brain (2022, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models") showed that asking a model to reason step-by-step before answering dramatically improved accuracy on multi-step problems — from 18% to 57% on a set of grade-school math problems. The technique transferred to workplace reasoning tasks.

Iterative prompting — treating the AI as a collaborator you can ask follow-up questions and give corrective feedback to — compounds the benefit. BCG's 2023 AI productivity study found consultants who engaged in multi-turn iterative sessions produced work rated higher by senior partners than those who accepted first-round outputs.

55%Faster task completion — GitHub Copilot RCT 2022

+23%Prompt-first coders vs. passive Copilot users

57%Chain-of-thought accuracy vs. 18% baseline — Google Brain 2022

Practical Principle

The prompt is your half of the collaboration. A poorly specified prompt doesn't reveal the AI's limitations — it reveals yours. Treating prompt construction as a professional discipline, not a casual act of typing, is the single highest-leverage behavior change most knowledge workers can make in their AI adoption.

Key Terms

RCTF FrameworkRole, Context, Task, Format — a four-component prompt structure developed from Wharton and Microsoft research to reliably improve AI output quality.

Chain-of-Thought PromptingInstructing an AI to reason step-by-step before answering, proven to improve multi-step accuracy significantly (Wei et al., Google Brain, 2022).

Iterative PromptingMulti-turn dialogue with an AI in which each exchange refines the task, corrects errors, or deepens the output, rather than treating a single prompt as final.

Lesson 2 Quiz

Prompt Engineering · 4 questions

1. In Ethan Mollick's 2023 Wharton experiment, why did the structured prompting group produce higher-rated outputs?

Correct. Both groups had the same AI access. The difference was entirely in prompt structure — role, context, constraints, and output format were specified explicitly by the structured group.

The experiment controlled for AI access — both groups used the same model. The structured group's advantage came entirely from their prompting discipline, not from tool access or editing time.

2. What does the "F" in the RCTF framework stand for, and why does it matter?

Correct. Format specifies output structure, length, and audience — "bullet points, under 200 words, for a non-technical board member" dramatically shapes what you receive.

F is Format — the explicit specification of how the output should be structured, how long it should be, and who it's for. This component alone can transform the usability of AI responses.

3. According to the 2022 GitHub Copilot Microsoft Research RCT, what did the fastest users do differently from passive users?

Correct. "Prompt-first" coders wrote specification comments before functions, which Copilot then fulfilled — outperforming passive users by an additional 23% beyond the baseline 55% speed gain.

The key differentiator was writing detailed comment-prompts before each function — essentially treating comments as structured task specifications. This "prompt-first" approach added another 23% speed advantage.

4. What did Wei et al.'s 2022 Google Brain research demonstrate about chain-of-thought prompting on multi-step problems?

Correct. Asking the model to reason step-by-step before answering raised accuracy on multi-step problems from 18% to 57% in the Google Brain study — a finding that transferred to workplace reasoning tasks.

Chain-of-thought prompting dramatically improved accuracy — from 18% to 57% — by instructing the model to show its reasoning steps before giving a final answer. The benefit generalized beyond math.

Lab 2 — RCTF Prompt Workshop

Build and refine prompts using the Role-Context-Task-Format framework

Your Task

You'll practice writing RCTF-structured prompts and getting feedback on their quality. Start with a weak prompt, then rebuild it using all four components. The assistant will evaluate your prompts and suggest improvements.

Starter prompt: "Here's a weak prompt: 'Tell me about marketing.' Help me rebuild it using RCTF — I work in B2B SaaS and need to present to our CEO next week."

AI Lab Assistant

RCTF Framework · Prompt Workshop

Welcome to Lab 2. We're going to build your prompt engineering skills using the RCTF framework: Role, Context, Task, Format. Bring me any weak or vague prompt you've used recently, and we'll rebuild it together. Or use the starter prompt above. What prompt do you want to work on?

Module 3 · Lesson 3

Calibrated Trust: When to Follow, When to Override

The science of knowing when AI output is reliable — and when your own judgment should take precedence

How do you avoid both the trap of over-trusting AI and the trap of dismissing it too quickly?

In 2018, Reuters reported that Amazon had quietly scrapped a machine-learning recruiting tool it had built in 2014. The system was trained on a decade of hiring data and rated candidates with one to five stars. Recruiters used its scores heavily. The problem: the model had learned to penalize résumés that included the word "women's" — as in "women's chess club" — and downgraded graduates of two all-women's colleges. It had absorbed the bias of past human hiring decisions and amplified it at scale.

Amazon's recruiters had trusted the system's numeric scores without adequately modeling when that output should be questioned. The lesson wasn't "don't use AI in hiring." It was: calibrate your trust to the specific domain, the training data, and the type of error that matters most.

The Automation Bias Problem

Automation bias — the tendency to over-rely on automated recommendations — was documented extensively in aviation research by Mosier & Skitka (1996) and has been replicated in medical, legal, and financial contexts. A 2020 study in JAMA Network Open found that radiologists who received AI-flagged scans first showed a measurable increase in confirmation bias: they were less likely to find errors the AI missed and more likely to note issues the AI had flagged, even when the AI was deliberately wrong.

The mirror problem is algorithm aversion — documented by Berkeley Dietvorst (2015) — where people distrust algorithmic outputs even when they are more accurate than human judgment, simply because the algorithm made an error they witnessed. Both biases degrade human-AI team performance.

Real Case — ProPublica COMPAS Analysis, 2016

ProPublica's 2016 investigation into the COMPAS recidivism algorithm used by US courts found that it incorrectly flagged Black defendants as future criminals at nearly twice the rate of white defendants. Judges who used COMPAS scores were shown to override them less than 20% of the time, even when presenting contextual information strongly contradicted the score. The case became a foundational reference in AI governance research, illustrating how uncalibrated trust can amplify systemic bias at institutional scale.

A Framework for Calibrating Trust

Research by MIT Sloan Management Review (Raja Chatila, 2021) and subsequent work by the Alan Turing Institute (2022) converges on four trust calibration questions every professional should ask when receiving AI output:

What was the training distribution? Is the case in front of you similar to what the model was trained on? AI confidence does not transfer across distribution shifts.

What type of error matters more? False positives vs. false negatives have asymmetric costs. Medical screening, fraud detection, and hiring each have different error tolerance profiles.

Is the AI confident for the right reason? High confidence scores can reflect pattern-matching to superficial features rather than genuine signal. Ask: what would make this output wrong?

What is my independent evidence? Do you have contextual information the model cannot access? If so, weight it. Humans consistently underweight their own legitimate informational advantage.

Productive Override: How to Disagree With AI Effectively

Overriding an AI system without documentation or reasoning defeats the purpose of the collaboration. A 2023 study by researchers at Stanford HAI found that teams who logged their reasons for overriding AI recommendations improved their override accuracy by 31% over six months — because the logging forced explicit reasoning and created feedback loops that revealed which override instincts were reliable and which were noise.

The practice also serves organizational learning. When a human's override proves correct, the documented reasoning becomes evidence for improving the model's training or scope boundaries. When the override proves wrong, it teaches the human to trust the model more in that domain.

The Calibration Mindset

Calibrated trust is not a fixed setting — it's a dynamic skill that improves with deliberate practice. The goal is not to trust AI more or trust it less. The goal is to trust it accurately: at the right level, in the right domains, with the right error model in mind.

Key Terms

Automation BiasThe tendency to over-rely on automated recommendations, even when human contextual evidence contradicts them (Mosier & Skitka, 1996).

Algorithm AversionDistrust of algorithmic outputs even when they are statistically more accurate than human judgment, typically triggered by witnessing a single algorithm error (Dietvorst, 2015).

Calibrated TrustTrust in AI output that is dynamically adjusted based on domain, training distribution, error type, and available independent evidence.

Lesson 3 Quiz

Calibrated Trust · 4 questions

1. What was the core failure in Amazon's 2014–2018 ML recruiting tool, as reported by Reuters in 2018?

Correct. The model was trained on historical hiring decisions that reflected past gender bias, which it then learned to replicate and amplify — penalizing candidates who mentioned women's organizations or graduated from women's colleges.

The core failure was bias absorption and amplification. The model learned to penalize résumés mentioning "women's" activities because it trained on historical data reflecting past discriminatory hiring patterns.

2. A 2020 study in JAMA Network Open found that radiologists who received AI-flagged scans first were more likely to:

Correct. The study documented automation bias in radiology — radiologists anchored to the AI's flags, making them less likely to find issues the AI missed and more likely to confirm AI errors.

The JAMA study found automation bias: radiologists anchored to the AI's assessments, missing what the AI missed and confirming what the AI incorrectly flagged. This is a classic automation bias pattern.

3. What is "algorithm aversion" as documented by Berkeley Dietvorst in 2015?

Correct. Dietvorst found that people would abandon algorithms that were statistically more accurate than human judgment after seeing just one error — a disproportionate response that degrades team performance.

Algorithm aversion is the documented tendency to reject algorithmic tools after seeing a single mistake, even when the algorithm outperforms human judgment in aggregate. It's the mirror problem to automation bias.

4. According to the 2023 Stanford HAI study, what happened when teams logged their reasons for overriding AI recommendations?

Correct. Logging override reasoning forced explicit thinking and created feedback loops — teams learned which override instincts were reliable, improving their override accuracy by 31% over six months.

The Stanford HAI finding was that documented overrides improved accuracy by 31% — the logging process created feedback loops showing which human override instincts were well-calibrated and which were noise.

Lab 3 — Trust Calibration Scenarios

Practice deciding when to follow, question, or override AI output

Your Task

Work through trust calibration scenarios with the assistant. For each scenario, apply the four calibration questions: training distribution, error type asymmetry, confidence validity, and your independent evidence.

Starter prompt: "An AI loan-approval system flags a small business owner as high-risk based on credit score patterns. The loan officer knows the owner personally and believes the business is fundamentally sound. Walk me through the four calibration questions for this situation."

AI Lab Assistant

Calibrated Trust · Override Analysis

Welcome to Lab 3. We're practicing trust calibration — the skill of deciding when to follow AI recommendations, when to question them, and when to override them. I'll walk you through real scenarios using the four calibration questions from the Alan Turing Institute framework. Use the starter prompt above or bring your own scenario. What situation would you like to work through?

Module 3 · Lesson 4

Building Human-AI Workflows That Actually Stick

How organizations design, implement, and sustain effective human-AI collaboration — and why most early deployments fail

Why do AI tools that work brilliantly in pilots so often disappoint at scale?

IBM's Watson for Oncology was announced with extraordinary ambition: AI-assisted cancer treatment recommendations trained on Memorial Sloan Kettering data. Hospitals in India, South Korea, and the US adopted it at scale. By 2018, internal documents obtained by STAT News revealed that Watson was generating recommendations described by oncologists as "unsafe and incorrect" in a significant share of cases — including recommending treatments contraindicated for patients with specific conditions.

The failure wasn't purely technical. It was workflow design. The system had been trained on hypothetical patient cases, not real clinical records. The feedback loops between clinician judgment and model updates were never built. Clinicians weren't asked what decisions they needed help with — they were given a system designed around what IBM assumed oncology decisions looked like.

The Three Failure Modes of AI Deployment

Research from Carnegie Mellon's Software Engineering Institute (2022) and McKinsey's AI adoption survey (2023, n=1,492 executives) converges on three dominant reasons enterprise AI deployments underperform after initial pilots:

Failure Mode 1 — Task Mismatch

The AI is deployed for a task that is adjacent to but not identical to what workers actually need. Workers route around it rather than use it, reverting to manual processes within weeks. McKinsey found this in 41% of underperforming deployments.

Failure Mode 2 — No Feedback Loop

The model doesn't learn from real usage. Errors accumulate without correction. Workers lose trust when they see the same mistakes repeat. Found in 38% of cases. Watson for Oncology was a textbook example.

Failure Mode 3 — Skill Atrophy

Workers delegate tasks to AI and lose the skill to recognize when the AI is wrong. This creates brittleness: when the AI fails on an edge case, the human can no longer compensate. Documented extensively in aviation automation research (FAA, 2021).

What Successful Deployments Share

Co-design with end users, explicit task boundaries, structured error reporting, maintained human skill via deliberate practice, and iterative model improvement with real operational data.

Real Case — Klarna AI Customer Service, 2024

Swedish fintech Klarna announced in February 2024 that its AI assistant was handling the equivalent of 700 human agents' workload, resolving 2.3 million customer conversations in its first month. Klarna's CEO credited the success to two design choices: the AI handled only well-defined query types where it had high accuracy, and every conversation had a one-click human escalation path that was extensively used in the first weeks. The feedback from escalated conversations was fed back into the model weekly. The workflow was designed with explicit task limits, not unlimited scope.

The PDCA Cycle for Human-AI Workflow Design

Adapted from lean manufacturing, the Plan-Do-Check-Adjust cycle is increasingly used by AI deployment teams (Google's People + AI Research group, PAIR, published a version of this in their 2023 AI deployment playbook):

Plan. Define the specific decision or task type the AI will assist with. Set explicit scope limits. Identify which errors are acceptable and which are not. Co-design the workflow with the people who will use it.

Do. Deploy in a bounded context with a small team. Maintain human alternatives in parallel. Build in deliberate friction at the highest-stakes decision points to prevent uncritical acceptance.

Check. Measure outcomes — not just efficiency metrics, but error rates, override rates, user trust calibration, and downstream result quality. Track skill maintenance in the human team.

Adjust. Update the model, the task scope, or the workflow based on real operational data. Never treat a deployment as finished. Scheduled quarterly reviews are standard practice in mature deployments.

Maintaining Human Skill Deliberately

The skill atrophy problem has a straightforward solution that most organizations skip: deliberate practice outside the AI-assisted workflow. Air traffic control agencies that maintained simulator training where controllers handled situations without automation assistance preserved manual competency that proved critical when systems failed (FAA, 2021). The same principle applies in any domain where AI does the heavy lifting — if the skill matters in failure scenarios, it must be practiced without AI support regularly.

Google PAIR's 2023 playbook recommends a minimum of 10% of task volume handled manually for any workflow where skill maintenance matters, specifically to preserve the human's ability to recognize model errors.

The Design Principle

Build human-AI workflows for the long run, not the demo. The workflows that last are those designed with explicit limits, real feedback loops, maintained human skill, and scheduled review — not those optimized to show maximum AI capability in a proof-of-concept setting.

Key Terms

Task MismatchA deployment failure mode where AI addresses a task adjacent to but not identical to what users actually need, causing them to route around the system.

Skill AtrophyThe gradual loss of human competency in tasks delegated to AI, reducing the human's ability to detect AI errors or compensate when the system fails.

PDCA CyclePlan-Do-Check-Adjust: an iterative workflow design framework adapted for AI deployments by Google PAIR and others, emphasizing continuous improvement over fixed-scope launches.

Lesson 4 Quiz

Building Lasting Workflows · 4 questions

1. What was the primary reason IBM Watson for Oncology failed in clinical use, according to STAT News documents obtained in 2018?

Correct. Watson was trained on hypothetical patient cases, not real records, and had no mechanism to incorporate clinician corrections. The workflow design lacked any feedback loop — a textbook Failure Mode 2.

The core problem was workflow design: training on hypothetical rather than real cases, and no feedback loop for clinician corrections. The model accumulated errors with no mechanism for improvement.

2. Klarna's 2024 AI customer service success was credited to two specific design choices. Which pair best describes them?

Correct. Klarna's success came from explicit scope limitation (well-defined query types only) and a tight feedback loop (escalated conversations fed back weekly). These are canonical good deployment practices.

Klarna's CEO specifically credited two choices: limiting the AI to well-defined query types where accuracy was high, and continuously feeding escalated conversation feedback back into the model. Scope limits and feedback loops.

3. What does McKinsey's 2023 AI adoption survey (n=1,492) identify as the most common reason AI deployments underperform after pilots?

Correct. Task mismatch appeared in 41% of underperforming deployments — the AI solves a problem near the real one but not the real one, so workers route around it within weeks.

McKinsey found task mismatch in 41% of underperforming cases — the single largest failure category. Workers route around tools that don't address their actual needs, reverting to manual processes.

4. Google PAIR's 2023 AI deployment playbook recommends handling at least 10% of task volume manually in AI-assisted workflows. What is the primary reason?

Correct. The 10% manual threshold is specifically about preventing skill atrophy — preserving the human's ability to detect AI errors and compensate when the system fails, as demonstrated in aviation automation research.

The recommendation is about skill maintenance — preventing the atrophy that occurs when humans fully delegate tasks to AI. If the skill matters in failure scenarios, it must be practiced without AI support. This is the lesson from aviation automation research.

Lab 4 — Workflow Design Workshop

Design a human-AI workflow using the PDCA framework for a real work context

Your Task

Work with the assistant to design a complete human-AI workflow for a task from your own work context — or use the provided scenario. You'll identify failure mode risks, set task scope limits, design the feedback loop, and plan skill maintenance.

Starter prompt: "I want to design a workflow where AI helps a team of three content writers at a mid-size company. They need help with research, first drafts, and SEO analysis. Help me apply the PDCA framework and flag the failure mode risks."

AI Lab Assistant

PDCA Workflow Design · Failure Mode Analysis

Welcome to Lab 4. We're designing human-AI workflows that are built to last — not just to impress in a demo. We'll use the PDCA cycle (Plan-Do-Check-Adjust) and deliberately check against the three failure modes: task mismatch, missing feedback loops, and skill atrophy. Bring me a real work scenario, or use the starter prompt above. What workflow do you want to design?

Module 3 Test

Human-AI Collaboration Strategies · 15 questions · 80% to pass

1. Kasparov's term "centaur" described human-computer chess teams in which the primary human contribution was:

Correct. The human's edge was metacognitive — knowing the boundary of the AI's reliability, not outperforming it computationally.

The human contribution in centaur chess was metacognitive: deciding when to trust the engine and when to override it. That judgment was the key differentiator.

2. In the MIT Work of the Future four-zone model, "AI-Assisted Human" (Zone 2) is best described as:

Correct. Zone 2 is the classic "AI in the copilot seat" configuration — the human drives, the AI surfaces information and risks.

Zone 2 is AI-Assisted Human: the AI provides options and flags risks, but the human owns all decisions. The AI informs without leading.

3. FAA Human Factors Research (2018) found "mode confusion" in air-traffic control was primarily caused by:

Correct. Mode confusion emerges from ambiguous task ownership — when no explicit agreement exists about who handles what, both parties may assume the other is responsible.

Mode confusion is caused by absent task ownership agreements — neither human nor automation has been explicitly assigned the task, so both assume the other is handling it.

4. The "R" in the RCTF prompt framework stands for Role. What does specifying a role accomplish?

Correct. Specifying a role — "as a senior employment lawyer" vs. "as a junior analyst" — shapes the perspective, vocabulary, and depth of the AI's response significantly.

Specifying a role shapes the AI's response framing and perspective — telling it to think from a particular expert's standpoint dramatically changes the output quality and relevance.

5. Chain-of-thought prompting (Wei et al., Google Brain, 2022) improved multi-step accuracy from roughly 18% to:

Correct. The improvement from 18% to 57% on grade-school math problems — simply by asking the model to reason step-by-step — was one of the most striking prompt technique findings of 2022.

Chain-of-thought prompting raised accuracy from 18% to 57% in the Google Brain study — asking the model to show its reasoning steps dramatically improved multi-step problem performance.

6. Automation bias, as documented by Mosier & Skitka (1996), refers to:

Correct. Automation bias is a human cognitive tendency — not a model property — to anchor on automated outputs and discount contradicting evidence.

Automation bias describes a human tendency: over-relying on automated recommendations even when available evidence should prompt doubt or override.

7. The ProPublica 2016 COMPAS analysis found that judges using the recidivism algorithm overrode its scores less than what percentage of the time?

Correct. Judges overrode COMPAS less than 20% of the time even when contextual information strongly contradicted the score — a textbook automation bias pattern with serious equity consequences.

Judges overrode COMPAS in fewer than 20% of cases despite strong contradicting contextual evidence — demonstrating automation bias at institutional scale.

8. According to the Stanford HAI 2023 study, logging override reasons improved override accuracy by what percentage over six months?

Correct. 31% improvement over six months — the logging created feedback loops that helped teams distinguish reliable override instincts from noise.

The Stanford HAI finding was a 31% improvement in override accuracy when teams logged their reasoning — the documentation process created the feedback loops needed for calibration.

9. The JAMA Network Open 2020 radiology study found that AI-flagged scan presentation caused radiologists to:

Correct. The study documented a confirmation bias amplification: radiologists anchored to AI flags, finding what the AI found and missing what the AI missed — even when the AI was deliberately set to be wrong.

The radiology study found automation bias in action: radiologists who saw AI-flagged scans first were less likely to find what the AI missed and more likely to "find" what the AI incorrectly flagged.

10. What was the fundamental design error in IBM Watson for Oncology's training approach?

Correct. Training on hypothetical cases meant Watson's recommendations didn't reflect real clinical complexity — and without a feedback loop from actual clinicians, errors compounded uncorrected.

Watson was trained on hypothetical cases, not real clinical records — making its recommendations unreliable in actual clinical practice, with no feedback mechanism to correct errors.

11. In McKinsey's 2023 AI adoption survey, which failure mode appeared in the highest percentage (41%) of underperforming deployments?

Correct. Task mismatch was the most common failure at 41% — workers route around AI that doesn't solve their actual problem, reverting to manual processes within weeks.

Task mismatch (41%) was the most common failure — followed by no feedback loop (38%). Both are design problems, not technology problems.

12. Klarna's 2024 AI assistant success depended in part on handling only "well-defined query types." Which deployment principle does this exemplify?

Correct. Klarna's explicit scope limitation — only well-defined query types — is a direct implementation of the task boundary principle that prevents task mismatch failure.

Limiting AI to well-defined query types is explicit scope limitation — the design practice that directly prevents task mismatch, the most common AI deployment failure mode.

13. Google PAIR's 2023 deployment playbook recommends that at least 10% of task volume be handled manually in AI-assisted workflows primarily to:

Correct. The 10% manual threshold is a skill atrophy countermeasure — ensuring humans maintain the competency needed to detect AI errors and compensate when the system fails.

The recommendation targets skill atrophy: maintaining a minimum level of manual practice preserves human competency to detect AI errors and handle failure scenarios.

14. The "Check" phase of the PDCA cycle for AI workflow design should measure which of the following in addition to efficiency metrics?

Correct. The Check phase requires outcome quality metrics — not just speed or cost. Error rates, override rates, and trust calibration all reveal whether the workflow is actually working.

The Check phase needs substantive outcome metrics: error rates, override rates, trust calibration, and result quality. Efficiency alone doesn't reveal whether the human-AI collaboration is sound.

15. Which of the following best summarizes the core principle of the centaur model as applied to modern knowledge work?

Correct. The centaur model is fundamentally about explicit, pre-designed task division — not ad-hoc delegation or passive oversight. Clear handoff protocols prevent mode confusion and unlock the collaborative advantage.

The centaur model is about explicit task assignment aligned to comparative advantage, with clear handoff protocols. Without that deliberate design, neither the efficiency gains nor the quality improvements materialize reliably.