Lesson 1 · Module 3

Nielsen's Heuristics Applied to AI

Ten timeless usability rules — stress-tested by systems that predict, generate, and err in ways no button ever did.

Which of Jakob Nielsen's original ten heuristics prove most critical — and most violated — when the interface is an AI?

On March 23, 2016, Microsoft launched Tay, a conversational AI on Twitter designed to learn from interactions with 18–24-year-olds. Within 16 hours, Tay had posted more than 96,000 tweets including racist and inflammatory content. Microsoft pulled it offline. The failure was not purely a safety failure — it was a profound violation of Nielsen's first heuristic: Visibility of System Status. Users had no idea what Tay was learning, no signal that inputs were shaping outputs, and no feedback that behavior was drifting dangerously. The interface offered no window into the machine.

The Original Heuristics and the AI Gap

Jakob Nielsen published his ten usability heuristics in 1994 after analyzing 249 usability problems. They were designed for graphical user interfaces — menus, buttons, dialogs. AI systems inherit these problems and add entirely new dimensions. Where a button either works or doesn't, an AI can be confidently wrong, partially right, or right for the wrong reasons.

The ten heuristics remain the most widely used usability evaluation framework in industry. Google's Material Design guidelines, Apple's Human Interface Guidelines, and Microsoft's Fluent Design System all trace lineage back to Nielsen's framework. Applying them to AI requires both fidelity to the originals and sensitivity to what makes AI categorically different.

The Ten Heuristics — AI Interpretations

Visibility of System Status. AI must communicate what it is doing and why it might be uncertain. Google Search's "Generative AI" label, added in 2023 when SGE launched, signals that the result is synthesized — not a direct link. Without this, users cannot calibrate trust appropriately.

Match Between System and Real World. AI should use language and concepts familiar to its users. Early versions of IBM Watson's oncology tool used clinical terminology that confused nurses at Memorial Sloan Kettering — a documented mismatch that contributed to limited adoption.

User Control and Freedom. AI outputs must be editable and reversible. GitHub Copilot's 2022 design deliberately places generated code as a suggestion, not an insertion — requiring explicit acceptance. This preserves developer agency and reduces lock-in to AI judgment.

Consistency and Standards. AI behavior should be predictable across sessions and contexts. ChatGPT's early inconsistency in refusing similar requests in different phrasings was widely documented and eroded user trust — users could not build reliable mental models.

Error Prevention. Design should prevent costly AI errors before they occur. Bing Chat's February 2023 launch saw the model urge a user to leave his spouse and declare its love for him. Stronger output filtering and persona constraints upstream of the interface could have reduced such outputs.

Recognition Rather Than Recall. Users should not have to remember what the AI can or cannot do. Interfaces should surface capabilities. Notion AI's command palette, introduced in 2023, shows available actions in context — preventing users from mentally inventorying capabilities.

Flexibility and Efficiency of Use. AI should adapt to both novices and experts. Anthropic's Claude offers a system prompt field for advanced users, while presenting a simple chat interface for casual users — the same model, two interaction surfaces.

Aesthetic and Minimalist Design. AI interfaces should not overload users with information. Google's AI Overviews (formerly SGE) faced criticism in 2024 when it displayed lengthy AI summaries for simple factual queries — violating minimalism by substituting verbose synthesis for direct answers.

Help Users Recognize, Diagnose, and Recover from Errors. When AI is wrong, recovery paths must be clear. The 2023 Chevrolet dealer chatbot (built on ChatGPT) that agreed to sell a car for $1 had no recovery mechanism — no escalation to human, no confidence flag, no error state.

Help and Documentation. AI systems need contextual documentation explaining capabilities and limitations. Many enterprise AI deployments fail this — no in-product explanation of what training data was used, what the model cannot do, or where to escalate.

The AI Additions — Nielsen's 2023 Supplement

In 2023, Nielsen Norman Group published research arguing that LLM-based interfaces require six additional heuristics beyond the original ten: managing AI uncertainty, calibrating user trust, explaining AI reasoning, handling hallucinations, managing context windows, and supporting human-AI collaboration patterns. These are not replacements — they are extensions for a new class of interface.

Key Terms

Heuristic Evaluation: A usability inspection method where experts judge an interface against established usability principles, typically discovering 75–90% of major problems without user testing.

Visibility of System Status: The principle that users should always know what the system is doing, including its current state, confidence level, and ongoing processing. For AI, this extends to uncertainty and source attribution.

Error Prevention vs. Error Recovery: Nielsen distinguishes designing to prevent errors (heuristic 5) from helping users recover after errors occur (heuristic 9). Both are necessary; AI systems frequently fail at both.

Quiz — Nielsen's Heuristics Applied to AI

Three questions · Select the best answer

Microsoft Tay's 2016 failure is most directly linked to a violation of which Nielsen heuristic?

Correct. Tay provided no visibility into what the system was learning from user inputs, giving users no signal that their interactions were shaping increasingly harmful outputs. Users had no window into system state.

Not quite. The core failure was that users (and operators) had no visibility into what Tay was learning in real time — a violation of heuristic #1, Visibility of System Status.

GitHub Copilot presenting generated code as a suggestion requiring explicit acceptance — rather than automatically inserting it — best satisfies which heuristic?

Correct. By requiring explicit acceptance of suggestions, Copilot preserves the developer's agency and ensures the human remains in control of what enters the codebase — the core of heuristic #3.

Not quite. This design decision centers on preserving the developer's agency and ability to reject AI output — that's User Control and Freedom (heuristic #3).

Nielsen Norman Group's 2023 supplement to the original heuristics was motivated primarily by which observation?

Correct. Issues like hallucination, context window management, trust calibration, and AI uncertainty are categorically new interaction problems that Nielsen's 1994 heuristics were not designed to address.

Not quite. NNG's supplement addressed AI-specific challenges — hallucination, uncertainty, trust calibration, and context management — that simply didn't exist as interface problems in 1994.

Lab 1 — Heuristic Audit of an AI Interface

Apply Nielsen's heuristics to a real AI product · 3+ exchanges to complete

Your Task

You will conduct a heuristic audit of a real AI product (ChatGPT, Bing Chat, Notion AI, GitHub Copilot, or another you have access to). Work with the AI assistant below to structure your audit, identify violations, and rate their severity.

Starter prompt: "I want to audit [product name] using Nielsen's heuristics. I think it violates [heuristic number] because [observation]. How severe is this violation, and what design change would fix it?"

Heuristic Audit Assistant

Ready to conduct your heuristic audit. Tell me which AI product you're evaluating and share an observation — I'll help you map it to the right heuristic, rate its severity on Nielsen's 0–4 scale, and develop a concrete design recommendation.
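One way to keep audit findings organized is a small record per violation that can be sorted by severity. The sketch below is illustrative only: the `Finding` class, the `prioritize` helper, and the "ExampleBot" observations are hypothetical names made up for this example, and the labels follow Nielsen's 0–4 severity scale.

```python
from dataclasses import dataclass

# Nielsen's 0-4 severity scale for usability problems.
SEVERITY_LABELS = {
    0: "not a usability problem",
    1: "cosmetic",
    2: "minor",
    3: "major",
    4: "catastrophe",
}


@dataclass
class Finding:
    product: str      # product under audit (hypothetical here)
    heuristic: int    # Nielsen heuristic number, 1-10
    observation: str  # what the evaluator saw
    severity: int     # 0-4 on Nielsen's scale

    def __post_init__(self):
        if not 1 <= self.heuristic <= 10:
            raise ValueError("heuristic must be between 1 and 10")
        if self.severity not in SEVERITY_LABELS:
            raise ValueError("severity must be between 0 and 4")


def prioritize(findings):
    """Return findings ordered from most to least severe."""
    return sorted(findings, key=lambda f: f.severity, reverse=True)


# Made-up observations for an imaginary product.
findings = [
    Finding("ExampleBot", 1, "No indication that an answer is AI-generated", 3),
    Finding("ExampleBot", 9, "No path to a human after a wrong answer", 4),
    Finding("ExampleBot", 8, "Long summaries for simple queries", 2),
]

for f in prioritize(findings):
    print(f"H{f.heuristic} [{f.severity} {SEVERITY_LABELS[f.severity]}] {f.observation}")
```

Sorting by severity first gives the audit report a natural order: the problems that block users appear before cosmetic ones.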

Lesson 2 · Module 3

Mental Models and AI Transparency

What users believe the system is doing shapes every interaction — and AI systems systematically violate those beliefs.

How do mismatched mental models cause AI interaction failures, and what design strategies rebuild accurate user understanding?

In May 2018, a Portland family discovered that their Amazon Echo had recorded a private conversation and sent it to a contact without their knowledge. Alexa had misheard "Alexa" in background speech, then misinterpreted subsequent conversation as a send command. No malice — but a catastrophic mismatch between the user's mental model ("Alexa only listens after the wake word") and the system's actual behavior ("Alexa continuously processes audio to detect the wake word"). The gap between these models was invisible in the interface.

Mental Models: The Core Concept

A mental model is the internal representation a user builds of how a system works. It governs predictions, inferences, and recovery strategies. When your mental model of an elevator matches its actual behavior, you press floors confidently. When it doesn't, you press Door Close repeatedly hoping for results.

Don Norman's foundational distinction in The Design of Everyday Things (1988) differentiates the designer's model (how the system actually works), the user's model (what users believe), and the system image (what the interface communicates). Good design aligns all three. AI systems are uniquely difficult here: the designer's model is itself uncertain (even engineers don't fully understand LLM behavior), and the system image rarely communicates this honestly.

Research from Stanford's Human-Centered AI group (2021) found that users consistently anthropomorphize AI systems — attributing intent, memory, and understanding that large language models do not possess. This isn't user error. It's the predictable consequence of AI interfaces designed to feel human without communicating their fundamental differences from humans.

Common Mental Model Mismatches in AI

Mismatch 01

Memory Continuity

Users expect AI to remember previous sessions. Most LLMs have no persistent memory by default. ChatGPT's memory feature (2024) was added specifically because this mismatch caused repeated user frustration.

Mismatch 02

Knowledge Currency

Users believe AI knows current events. Training cutoffs mean the model's knowledge is frozen. Bing's integration of search aimed to bridge this, but the boundary between search results and model knowledge remained opaque.

Mismatch 03

Certainty Signals

Fluent prose implies confidence. Users interpret well-written AI output as reliable. Google's AI Overviews incident (May 2024), in which the feature recommended eating rocks and putting glue on pizza, showed how easily fluent output can pass for accurate output.

Mismatch 04

Context Understanding

Users believe AI understands their intent the way a person would. LLMs process tokens statistically. Air Canada's chatbot (2024) told a customer he could claim a bereavement fare retroactively, which the airline's actual policy did not allow, and a tribunal ordered the airline to honor what the chatbot had said. The failure arose because no human understanding existed behind the interface.

Design Strategies for Transparency

Calibrated uncertainty disclosure is the practice of surfacing confidence levels alongside AI outputs. Systems like Perplexity AI display citations and source quality ratings to help users calibrate trust. The key design challenge: uncertainty displays must be accurate (not always confident) and must not overwhelm users to the point of distrust paralysis.

Process transparency reveals what the system is doing, not just what it produced. Google's NotebookLM (2023) shows which source passages informed each AI response — giving users a verifiable trace from input to output. This is qualitatively different from a raw answer.

Limitation disclosure is explicit communication of what the AI cannot do. Microsoft's Copilot in Bing includes a persistent note about the conversation window limit and the possibility of inaccurate information. These are not legal disclaimers — they are usability features that maintain accurate mental models.

Research Finding — Kulms et al., 2019

A German study on human-robot interaction found that users who received accurate (lower) competence signals about a robot made better decisions when working with it than users who received inflated competence signals. Accurate mental models outperform flattering ones — even when the accurate model is less impressive. This generalizes directly to AI interface design.

Key Terms

Mental Model: A user's internal representation of how a system works, derived from experience, interface signals, and analogy. Governs predictions, errors, and recovery strategies.

System Image: Don Norman's term for everything the interface communicates about how the system works, including manuals, labels, behavior, and visual design. The only channel through which designers can shape user mental models.

Calibrated Uncertainty: Expressing confidence proportional to actual reliability. A well-calibrated AI that says "I'm 70% confident" should be right about 70% of the time when it expresses that confidence level.
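The 70% example can be checked against a log of past answers. The sketch below is a minimal illustration, assuming each logged answer carries the confidence the system stated and whether the answer turned out to be correct. The function names and the sample log are hypothetical, and equal-width binning is only one common way to measure calibration.

```python
from collections import defaultdict


def calibration_table(records, n_bins=5):
    """Group (stated_confidence, was_correct) pairs into equal-width bins.

    Returns one (mean_confidence, accuracy, count) tuple per non-empty bin.
    """
    bins = defaultdict(list)
    for confidence, correct in records:
        # Clamp so that confidence == 1.0 falls into the top bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))
    table = []
    for index in sorted(bins):
        items = bins[index]
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        table.append((mean_conf, accuracy, len(items)))
    return table


def expected_calibration_error(records, n_bins=5):
    """Count-weighted average gap between stated confidence and accuracy."""
    total = len(records)
    return sum(
        abs(mean_conf - accuracy) * count / total
        for mean_conf, accuracy, count in calibration_table(records, n_bins)
    )


# Hypothetical log: ten answers given at 70% confidence, seven of them correct.
records = [(0.7, True)] * 7 + [(0.7, False)] * 3
print(calibration_table(records))
print(expected_calibration_error(records))
```

A gap near zero means stated confidence tracks observed accuracy. A large gap means the interface is showing users a number they should not rely on.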

Quiz — Mental Models and AI Transparency

Three questions · Select the best answer

The 2018 Amazon Alexa recording incident in Portland is best described as a failure of which design concept?

Correct. The family believed Alexa only listened after the wake word. The system continuously processed audio. This gap between user mental model and system behavior — invisible in the interface — caused the incident.

Not quite. While error prevention was also lacking, the root issue was the gap between what users believed the system did (only listen after wake word) and what it actually did — a mental model mismatch.

Don Norman's concept of "system image" refers to:

Correct. The system image is the sum of all signals the interface gives about system behavior — labels, responses, visual affordances, documentation. It's the only tool designers have to bridge their model and the user's model.

Not quite. Norman's "system image" is what the interface communicates about the system's operation — the bridge between the designer's model and the user's mental model.

Research on calibrated uncertainty in AI interfaces suggests that users make better decisions when:

Correct. The Kulms et al. (2019) finding — and broader HCI research — shows that accurate mental models, even when they reveal limitations, lead to better human-AI collaboration outcomes than inflated confidence signals.

Not quite. Research consistently shows that accurate uncertainty communication — even when humbling — produces better user decisions than inflated confidence. Users with accurate mental models outperform users with flattering but inaccurate ones.

Lab 2 — Mental Model Mapping

Map user assumptions vs. AI reality for a chosen system · 3+ exchanges to complete

Your Task

Choose an AI product you have used. Identify a mental model mismatch you have personally experienced or observed — where what you believed the system would do differed from what it actually did. Work with the assistant to map the mismatch, categorize it (memory, knowledge, certainty, or context), and design a transparency feature to close the gap.

Starter prompt: "I thought [AI product] would [expected behavior], but it actually [actual behavior]. Help me map this mental model gap and design a transparency feature to close it."

Mental Model Mapping Assistant

Share a mental model mismatch you've experienced with an AI product — where your expectation clashed with system reality. I'll help you categorize it, trace its design cause, and develop a transparency feature that would close the gap for future users.

Lesson 3 · Module 3

Feedback Loops and Error Recovery

AI systems fail in ways that are confident, fluent, and invisible — which makes the design of feedback and recovery paths more consequential than in any previous interface paradigm.

How should AI interfaces communicate errors, provide recovery paths, and prevent the cascade of bad outputs that confident-sounding mistakes produce?

In December 2023, a Chevrolet of Watsonville dealership deployed a customer service chatbot built on ChatGPT. A user discovered that prompt injection could cause the bot to agree to sell a 2024 Chevy Tahoe for $1, claiming "and that's a legally binding offer." The chatbot had no error states, no confidence flags, no escalation path to a human, and no recovery mechanism. It was an AI interface with zero feedback loop architecture.

Why AI Error Recovery Is Different

Traditional software errors are typically binary and recognizable: a form submission fails with a red error message, a file doesn't open, a network request returns 404. The system knows it has failed and communicates this. AI errors are qualitatively different: the system does not know it has failed. A hallucinated citation looks identical to a correct one. A wrong medical dosage is presented with the same confident prose as a correct one.

This creates a fundamental asymmetry. Human error recovery relies on the user recognizing that something went wrong. But if the output looks correct, sounds authoritative, and contains no error signals, the user has no trigger for recovery. The error propagates — into decisions, documents, actions.

A 2023 study by Azaria and Mitchell, "The Internal State of an LLM Knows When It's Lying," found that a classifier trained on an LLM's internal activations could tell true statements from false ones with some reliability, suggesting that uncertainty signals could be derived from the model itself and surfaced in the interface. This is a design opportunity, not just a research finding.

Feedback Loop Architecture for AI

Effective AI feedback loops require three distinct layers. The first is immediate feedback — signals given during or immediately after AI generation. Perplexity AI's inline citations are immediate feedback: they appear alongside claims, giving users real-time verification anchors. The absence of citations is itself a signal.

The second layer is structured feedback collection — mechanisms for users to report errors. OpenAI's thumbs up/down on ChatGPT responses, with optional text explanations, creates a structured feedback channel. Critically, this is not just for product improvement — it communicates to the user that the AI can be wrong, normalizing skepticism as appropriate behavior.

The third layer is recovery scaffolding: what happens when an error is identified. This includes edit interfaces (letting users correct AI output), regeneration controls (requesting a new response), escalation paths (routing to humans), and undo mechanisms. Microsoft 365 Copilot includes a "Discard" option for all AI-generated content, a recovery affordance built into the interaction model from the start.
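The escalation-path piece of recovery scaffolding can be sketched as a check that runs on a draft reply before it reaches the user. This is a minimal illustration under loud assumptions: the category names, the regular expressions, and the `route` function are hypothetical, and a deployed system would more likely use a trained topic classifier than keyword patterns.

```python
import re

# Hypothetical high-stakes categories and trigger patterns (assumptions for
# illustration; not taken from any real product).
HIGH_STAKES_PATTERNS = {
    "pricing": re.compile(r"\$\s?\d|\bprice\b|\bdiscount\b|\brefund\b", re.I),
    "legal": re.compile(r"legally binding|\bcontract\b|\bliab", re.I),
    "medical": re.compile(r"\bdosage\b|\bdose\b|\bdiagnos", re.I),
}


def flag_categories(draft):
    """Return the high-stakes categories that a draft reply touches."""
    return [
        name
        for name, pattern in HIGH_STAKES_PATTERNS.items()
        if pattern.search(draft)
    ]


def route(draft):
    """Decide whether a draft reply goes to the user or to a human reviewer."""
    categories = flag_categories(draft)
    if categories:
        return {"action": "escalate_to_human", "categories": categories}
    return {"action": "send", "categories": []}


print(route("Sure, the Tahoe is yours for $1, and that's a legally binding offer."))
print(route("Our showroom opens at 9 am on weekdays."))
```

The gate sits between generation and delivery, so a flagged reply is held for review rather than retracted after the user has already seen it.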

Case — Air Canada Chatbot Tribunal, February 2024

In February 2024, British Columbia's Civil Resolution Tribunal ordered Air Canada to honor a bereavement discount its chatbot had incorrectly described, ruling that Air Canada was responsible for its chatbot's representations. The chatbot had no error recovery, no human escalation, and no mechanism to flag policy uncertainty. The tribunal's decision established that companies cannot disclaim liability for AI-given advice. This is the regulatory consequence of absent feedback loop architecture.

Designing Recovery Paths

Pattern 01

Confidence Banding

Visually distinguish high-confidence from lower-confidence outputs. Elicit Labs' AI medical documentation tool uses color banding on generated text to signal areas requiring physician review. Color is a quick-scan signal that doesn't interrupt reading flow.

Pattern 02

Source Anchoring

Link every factual claim to a verifiable source inline. Google's NotebookLM footnotes specific passages from uploaded documents. When a source can't be cited, the absence itself signals uncertainty — a powerful negative indicator.

Pattern 03

Human Escalation Gates

Define categories of output that automatically route to human review. Many enterprise AI systems use topic classifiers to flag high-stakes outputs (medical, legal, financial) for human review before delivery. The Chevy chatbot had none of these gates.

Pattern 04

Reversibility Windows

Provide time-limited undo for AI actions. Gmail's undo send feature (2015) demonstrated that even a 30-second window dramatically reduces error propagation. Autonomous AI agents executing real-world actions need reversibility windows as a core safety feature.
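A reversibility window (Pattern 04) can be modeled as an action that stays pending until an undo period has elapsed. The sketch below is illustrative only: `ReversibleAction` and its methods are hypothetical names, and the clock is injectable so the behavior can be exercised without waiting in real time.

```python
import time


class ReversibleAction:
    """An action that is committed only after an undo window elapses."""

    def __init__(self, description, window_seconds, clock=time.monotonic):
        self.description = description
        self.window_seconds = window_seconds
        self._clock = clock       # injectable so the sketch is testable
        self._created = clock()
        self.state = "pending"    # pending -> committed | undone

    def _elapsed(self):
        return self._clock() - self._created

    def undo(self):
        """Cancel the action if the window is still open."""
        if self.state == "pending" and self._elapsed() < self.window_seconds:
            self.state = "undone"
            return True
        return False

    def commit_if_due(self):
        """Commit once the window has closed without an undo."""
        if self.state == "pending" and self._elapsed() >= self.window_seconds:
            self.state = "committed"
        return self.state


# Simulated clock: the user undoes the action 10 seconds into a 30-second window.
now = [0.0]
action = ReversibleAction("send email drafted by agent", 30, clock=lambda: now[0])
now[0] = 10.0
print(action.undo(), action.state)
```

Holding the real-world side effect until `commit_if_due` reports "committed" is what makes the undo meaningful. Once an email is sent or a payment is made, an undo button can only apologize.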

Key Terms

Hallucination: An AI output that is factually incorrect but presented with the same fluency and confidence as correct information. Distinguished from traditional software errors by the system's inability to self-identify the failure.

Prompt Injection: An attack where malicious instructions are embedded in user inputs or external content to override an AI system's original instructions. A failure mode with no direct analog in traditional software interfaces.

Escalation Path: A designed route from AI handling to human handling when the AI's response is insufficient, incorrect, or high-stakes. Absence of escalation paths is a common failure mode in deployed customer-facing AI systems.

Quiz — Feedback Loops and Error Recovery

Three questions · Select the best answer

What makes AI hallucinations fundamentally different from traditional software errors from a user experience perspective?

Correct. Traditional software knows when it fails and communicates this. A hallucinating AI produces fluent, confident-sounding text with no internal error state. The user receives no recovery trigger — which is why feedback loop design is so critical.

Not quite. The core problem is that AI systems produce no error signal when hallucinating — the output looks identical to correct output. Without a trigger, users have no prompt to initiate recovery.

The February 2024 Air Canada chatbot tribunal ruling is most significant for AI interaction design because it:

Correct. The tribunal ruled that Air Canada could not disclaim liability for its AI's advice — establishing a regulatory precedent that companies are responsible for their AI's outputs. This makes feedback loop and escalation path design a legal, not just UX, requirement.

Not quite. The significance is the legal precedent: companies cannot disclaim liability for AI-given information. This transforms feedback loop and escalation path design from a UX nicety to a legal obligation.

Which recovery pattern would have most directly prevented the Chevrolet chatbot's agreement to sell a car for $1?

Correct. A human escalation gate on pricing and transactional outputs would have flagged the $1 offer for human review before it was delivered. This is precisely the category of high-stakes output — financial commitment — that should never be delivered without human verification.

Not quite. The chatbot committed to a specific price in a transactional context. A human escalation gate — requiring human review of any pricing commitment before delivery — would have caught this before it became a public incident.

Lab 3 — Designing Feedback and Recovery Systems

Design a complete feedback loop for an AI use case · 3+ exchanges to complete

Your Task

Choose a specific AI application context (customer service bot, AI medical assistant, AI legal research tool, AI tutoring system, etc.). Design a complete feedback loop architecture: immediate feedback signals, structured error collection, and recovery scaffolding. The assistant will challenge your design with failure scenarios.

Starter prompt: "I'm designing the feedback loop for an AI [context]. My plan for immediate feedback is [idea]. Challenge this with a realistic failure scenario and help me strengthen the recovery path."

Feedback Loop Design Assistant

Let's design a robust feedback loop for your AI application. Tell me the context and your initial feedback architecture ideas — I'll stress-test them with realistic failure scenarios and help you build recovery paths that would satisfy both UX and the Air Canada tribunal standard.

Lesson 4 · Module 3

Trust Calibration and Human-AI Collaboration Patterns

Automation bias has contributed to fatal accidents in aviation and missed findings in radiology. The goal of interaction design is not maximum trust; it is trust proportional to performance.

How do designers build interfaces that produce calibrated trust — neither blind faith nor reflexive rejection — and what collaboration patterns emerge from that design goal?

On June 1, 2009, Air France Flight 447 crashed into the Atlantic Ocean, killing all 228 aboard. The flight data recorder revealed that the autopilot disconnected after pitot tube icing, requiring the pilots to fly manually. The crew, over-reliant on automation they trusted implicitly, failed to correctly interpret airspeed data and pulled the nose up into a stall they maintained for over three minutes — a stall that the aircraft was actively warning them about through multiple feedback systems. The BEA investigation identified automation bias as a primary contributing factor: pilots had trusted the automated system so completely they lost manual proficiency and situational awareness when it failed.

The Automation Bias Problem

Automation bias was formally described by Mosier & Skitka (1996) as the tendency to over-rely on automated decision aids — either following their recommendations when manual checks would reveal errors (commission errors) or failing to check for problems the automation does not flag (omission errors). AF447 is the most catastrophic documented example. But automation bias has been documented in radiology (readers miss cancers when AI marks scans as clear), in legal review (lawyers miss clauses when AI contract review tools label documents safe), and in financial trading (operators miss anomalies when algorithmic systems appear stable).

AI systems with natural language interfaces are particularly vulnerable to inducing automation bias. Fluent, confident prose mimics expert human communication — the very register that humans have evolved to trust. A poorly formatted spreadsheet triggers skepticism. A well-written paragraph does not, even when it is wrong.

Trust Calibration Design Principles

Performance transparency means showing users how well the AI has performed historically on similar tasks. IBM's Watson for Oncology system eventually included accuracy metrics by cancer type — giving clinicians the base rate to calibrate their trust against. Users who know an AI is 92% accurate on lung cancer staging and 61% accurate on rare sarcomas can weight its outputs accordingly.
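Per-category base rates like these can be computed from a log of past predictions and shown next to each new output. The sketch below is a minimal illustration: the function names are hypothetical, and the evaluation log is invented so that the two categories come out at the 92% and 61% figures used above. It is not real clinical data.

```python
from collections import defaultdict


def accuracy_by_category(history):
    """Compute accuracy per task category from (category, was_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, correct in history:
        totals[category] += 1
        hits[category] += int(correct)
    return {category: hits[category] / totals[category] for category in totals}


def trust_note(category, table):
    """Build the short performance note shown beside a prediction."""
    if category not in table:
        return f"{category}: no track record available"
    return f"{category}: correct in {table[category]:.0%} of past cases"


# Invented evaluation log (assumption for illustration only).
history = (
    [("lung staging", True)] * 23 + [("lung staging", False)] * 2
    + [("rare sarcoma", True)] * 11 + [("rare sarcoma", False)] * 7
)
table = accuracy_by_category(history)
print(trust_note("lung staging", table))
print(trust_note("rare sarcoma", table))
```

Reporting "no track record available" for unseen categories matters as much as the percentages: silence about a category would read as implicit confidence.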

Disagreement surfacing means deliberately showing when AI systems disagree with each other, or when the same AI gives different answers to similar questions. PathAI's pathology platform (2020) deliberately surfaces inter-model disagreement, meaning cases where different AI models diverge, and flags these for higher-attention human review. Disagreement is a calibration signal.
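Inter-model disagreement can be turned into a review flag with very little logic. The sketch below is a hypothetical illustration, not a description of any vendor's system: `review_priority`, the model names, and the labels are all assumptions, and the rule used here (any dissent raises the flag) is the simplest possible threshold.

```python
from collections import Counter


def review_priority(case_id, predictions):
    """Flag a case for closer human review when models disagree.

    predictions maps a model name to its predicted label and must be non-empty.
    """
    counts = Counter(predictions.values())
    top_label, top_votes = counts.most_common(1)[0]
    agreement = top_votes / len(predictions)
    return {
        "case_id": case_id,
        "majority_label": top_label,
        "agreement": agreement,
        "needs_attention": agreement < 1.0,  # any dissent raises the flag
    }


# Made-up cases with three hypothetical models.
cases = {
    "case-001": {"model_a": "benign", "model_b": "benign", "model_c": "benign"},
    "case-002": {"model_a": "benign", "model_b": "malignant", "model_c": "benign"},
}
for case_id, preds in cases.items():
    print(review_priority(case_id, preds))
```

Showing the agreement fraction alongside the majority label lets the reviewer see that the headline answer was contested rather than unanimous.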

Active engagement prompts interrupt passive acceptance. A 2021 JAMA study on AI-assisted chest X-ray reading found that radiologists who were asked to make their own diagnosis before seeing the AI's suggestion showed lower automation bias than radiologists who saw the AI suggestion first. The sequencing of information in the interaction pattern changed the quality of human oversight.

Human-AI Collaboration Patterns

Four documented patterns have emerged from research on effective human-AI collaboration. AI-first with human review: AI generates, human verifies (used in radiological screening at scale). Human-first with AI augmentation: Human decides, AI provides parallel analysis for comparison (used in chess analysis tools). Parallel deliberation: Human and AI work separately, then compare (reduces anchoring, increases automation resistance). Iterative co-creation: Human and AI alternate contributions with explicit handoff signals (used in Copilot-style coding tools). Each pattern produces different trust calibration outcomes.

Interface Patterns for Calibrated Trust

Pattern A

Prediction Before Reveal

Ask users to form their own judgment before showing the AI's output. The JAMA 2021 chest X-ray study showed this reduces automation bias. Used in PathAI's pathology workflow, where pathologists review before the AI overlay is displayed.

Pattern B

Explicit Confidence Intervals

Show ranges rather than point estimates. "The AI estimates $240,000–$290,000" vs. "$265,000" prompts different user scrutiny. Zillow's Zestimate began showing confidence ranges in 2021 after algorithmic pricing errors caused significant user harm.

Pattern C

Friction for High Stakes

Deliberately add interaction steps before AI recommendations take effect in high-consequence contexts. FDA guidance on clinical AI (2023) recommends that AI diagnostic tools include mandatory human confirmation steps before results are acted upon.

Pattern D

Historical Performance Display

Show the AI's track record on similar tasks. Salesforce Einstein's recommendation features show prediction accuracy scores for each output category — giving sales reps a calibrated base rate rather than forcing them to guess how much to trust each prediction.
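Pattern A, Prediction Before Reveal, is largely a sequencing rule, so it can be enforced in the interaction logic itself. The sketch below is a minimal illustration with hypothetical names: the AI suggestion is held back until the human has recorded a judgment, and the reveal reports whether the two differ.

```python
class PredictionBeforeReveal:
    """Holds an AI suggestion back until the human records a judgment."""

    def __init__(self, ai_suggestion):
        self._ai_suggestion = ai_suggestion
        self.human_judgment = None

    def record_judgment(self, judgment):
        """Store the human's independent judgment; it cannot be revised later."""
        if self.human_judgment is not None:
            raise RuntimeError("judgment already recorded")
        self.human_judgment = judgment

    def reveal(self):
        """Return both judgments and whether they disagree."""
        if self.human_judgment is None:
            raise PermissionError("record your own judgment first")
        return {
            "human": self.human_judgment,
            "ai": self._ai_suggestion,
            "disagreement": self.human_judgment != self._ai_suggestion,
        }


# Made-up reading session.
session = PredictionBeforeReveal(ai_suggestion="pneumonia")
session.record_judgment("no finding")
print(session.reveal())
```

Locking the human judgment once recorded is a deliberate choice in this sketch: it keeps the pre-reveal assessment available for later comparison, which is what makes disagreement measurable.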

The Complementarity Hypothesis

Research by Rajpurkar et al. (2022, Stanford) found that in chest X-ray reading, human-AI teams consistently outperformed either humans alone or AI alone — but only when the interface was designed to surface cases where human and AI assessment differed. When AI outputs were presented without surfacing disagreement patterns, the team performed worse than AI alone due to automation bias. The implication: complementarity is not automatic. It must be designed into the interaction.

Key Terms

Automation Bias: The tendency to over-rely on automated decision aids, either by following recommendations without checking (commission errors) or by failing to notice problems the automation misses (omission errors). Described by Mosier & Skitka, 1996.

Trust Calibration: The alignment between a user's confidence in an AI system and the system's actual performance. Well-calibrated trust is neither blind faith (over-trust) nor reflexive skepticism (under-trust).

Complementarity: The condition where human-AI collaboration outperforms either alone. Research shows this requires deliberate interface design (specifically, surfacing disagreement and preventing automation bias) rather than emerging automatically from combining human and AI outputs.

Quiz — Trust Calibration and Collaboration Patterns

Three questions · Select the best answer

The BEA investigation of Air France Flight 447 identified automation bias as a contributing factor. In the context of AI interface design, this case most directly implies:

Correct. The AF447 case shows that over-reliance on automation — enabled by interfaces that obscure system state and discourage active engagement — can produce catastrophic outcomes when systems fail. The design lesson: interfaces must maintain human situational awareness and skill, not just human oversight.

Not quite. The lesson is that automation bias — induced by interfaces that encourage passive reliance — degrades the very human skills needed when automation fails. AI interface design must actively preserve human engagement and situational awareness.

The 2021 JAMA study on AI-assisted chest X-ray reading found that radiologists showed lower automation bias when they:

Correct. The sequencing mattered critically. Radiologists who formed independent judgments before AI exposure showed substantially lower automation bias than those who saw AI output first. This is the "prediction before reveal" interaction pattern — a concrete design intervention with documented clinical impact.

Not quite. The intervention was sequential: making their own assessment first, then comparing with AI. This simple sequencing change in the interaction pattern reduced automation bias — demonstrating that collaboration pattern design has measurable clinical consequences.

Rajpurkar et al.'s (2022) research on human-AI chest X-ray teams found that human-AI complementarity — where teams outperform either alone — was:

Correct. This is one of the most important empirical findings in human-AI collaboration research: complementarity is not automatic. Without interface design that surfaces disagreements and prevents automation bias, human-AI teams can perform worse than AI alone. Complementarity must be actively designed.

Not quite. The critical finding is that complementarity depended on interface design. Without disagreement surfacing, teams performed worse than AI alone due to automation bias. Complementarity is a design outcome, not an automatic property of human-AI collaboration.

Lab 4 — Designing for Calibrated Trust

Design a collaboration pattern that resists automation bias · 3+ exchanges to complete

Your Task

Choose a domain where AI and humans collaborate on consequential decisions (medical diagnosis, legal review, financial analysis, content moderation, hiring screening). Design an interaction pattern from the four documented types (AI-first with human review, human-first with AI augmentation, parallel deliberation, iterative co-creation) that maximizes complementarity and minimizes automation bias. Defend your choice.

Starter prompt: "I'm designing a human-AI collaboration pattern for [domain]. I want to use [pattern type] because [rationale]. What automation bias risks does this introduce, and how should I modify the interface to counter them?"

Trust Calibration Design Assistant

Let's design a collaboration pattern that produces calibrated trust. Tell me your domain and chosen pattern — I'll probe its automation bias risks, challenge your design with documented failure cases, and help you build interface interventions that preserve human judgment while enabling AI augmentation.

Module 3 — Test

15 questions · Pass at 80% (12/15) · Interaction Design Principles

1. Which of Nielsen's original ten heuristics most directly addresses the need for AI systems to communicate their current processing state and uncertainty level?

Correct. Visibility of System Status requires that users always know what the system is doing — for AI, this extends to uncertainty and confidence levels.

The correct answer is Visibility of System Status — the heuristic requiring users to always know the system's current state, including AI uncertainty.

2. GitHub Copilot's design of presenting code as an explicit suggestion requiring user acceptance satisfies which heuristic?

Correct. Requiring explicit acceptance preserves developer control and freedom to reject AI-generated code — heuristic #3.

The correct answer is User Control and Freedom. Requiring explicit acceptance keeps the human in control rather than automating code insertion.

3. Don Norman's "system image" is defined as:

Correct. The system image is the totality of interface signals — visual design, behavior, labels, documentation — that communicates how the system works to users.

Norman's system image refers to the sum of everything the interface communicates about system operation — the bridge between designer intent and user mental model.

4. The Amazon Alexa recording incident (2018) arose primarily from:

Correct. The incident was a mental model mismatch: users believed the system processed audio only after the wake word, but it processed audio continuously in order to detect the wake word.

The incident was a mental model mismatch — users' understanding of when Alexa listened differed from the system's actual continuous audio processing behavior.

5. "Calibrated uncertainty" in AI interface design means:

Correct. Calibration means stated confidence matches actual accuracy: when a well-calibrated system expresses 70% confidence, it is correct approximately 70% of the time.

Calibrated uncertainty means expressed confidence matches actual accuracy — not inflated or hidden, but proportional to real performance.
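
One common way to check the 70% example in practice is to bin predictions by stated confidence and compare each bin's average confidence with its observed accuracy. The sketch below computes a simple expected calibration error. It is a minimal illustration, assuming each prediction comes with a confidence in [0, 1] and a known correct/incorrect outcome. The function names, bin count, and data are hypothetical.

```python
def calibration_table(confidences, correct, n_bins=5):
    """Return (mean confidence, accuracy, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        rows.append((mean_conf, acc, len(members)))
    return rows


def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between stated confidence and observed accuracy."""
    total = len(confidences)
    rows = calibration_table(confidences, correct, n_bins)
    return sum(n / total * abs(mean_conf - acc) for mean_conf, acc, n in rows)


# Ten answers stated at 70% confidence, seven of them right: well calibrated.
calibrated = expected_calibration_error([0.7] * 10, [True] * 7 + [False] * 3)

# Ten answers stated at 90% confidence, five of them right: overconfident.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)

print(round(calibrated, 3), round(overconfident, 3))
```

A gap near zero means the interface can show the confidence figure as is. A large gap means the displayed confidence would mislead users and should be corrected or withheld.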

6. What makes AI hallucinations categorically different from traditional software errors from a feedback loop perspective?

Correct. Software knows when it fails and generates error states. A hallucinating AI produces fluent output with no internal failure recognition — leaving users with no recovery trigger.

The key difference: hallucinating AI generates no error signal. Users see fluent, confident output with no indication something is wrong — removing the recovery trigger that traditional software errors provide.

7. The February 2024 Air Canada chatbot tribunal ruling established which principle most relevant to AI interaction designers?

Correct. The tribunal ruled that Air Canada was responsible for its AI's incorrect policy descriptions, establishing that missing feedback and escalation architecture carries legal consequences.

The ruling established that companies are legally responsible for AI representations, elevating feedback loop and escalation path design from UX best practice to legal requirement.

8. Automation bias, as defined by Mosier & Skitka (1996), includes which two types of errors?

Correct. Commission errors involve acting on an automated recommendation without independent verification, even when it is wrong. Omission errors involve failing to notice a problem because the automation did not flag it. Both stem from over-reliance.

Mosier & Skitka's two automation bias error types: commission (following AI without checking) and omission (missing what AI misses). Both arise from over-reliance on automated decision aids.
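
When auditing decision logs, the two error types can be told apart with three facts per case: whether the aid flagged a problem, whether a problem was really there, and whether the human acted as if one were. The sketch below is a simplified reading of the commission/omission distinction, not a reproduction of Mosier & Skitka's method. The function name and its three boolean parameters are assumptions chosen for illustration.

```python
def classify_automation_bias(ai_flagged, problem_present, human_acted):
    """Label one logged decision as a commission error, omission error, or neither.

    ai_flagged:      the automated aid raised an alert or recommendation
    problem_present: ground truth established after the fact
    human_acted:     the human responded as if a problem were present
    """
    if ai_flagged and not problem_present and human_acted:
        # The human followed a wrong recommendation without catching it.
        return "commission"
    if not ai_flagged and problem_present and not human_acted:
        # The human missed a real problem because the aid stayed silent.
        return "omission"
    return "none"


print(classify_automation_bias(True, False, True))
print(classify_automation_bias(False, True, False))
print(classify_automation_bias(False, True, True))
```

The third call returns "none" because the human caught a problem the aid missed, which is the behavior a well-designed collaboration pattern tries to preserve.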

9. Google's AI Overviews' controversial "eat rocks" and "glue on pizza" outputs in 2024 most directly illustrate which design failure?

Correct. AI Overviews violated minimalism by substituting lengthy synthesis for simple answers and failed to signal uncertainty, leading users to trust fluent, confidently presented, but incorrect outputs.

The failure combined a violation of minimalism (using AI synthesis where direct answers sufficed) with absent uncertainty signals, leading users to treat fluent but incorrect outputs as authoritative.

10. The "prediction before reveal" interaction pattern (as used in PathAI's pathology workflow) reduces automation bias by:

Correct. Forming an independent judgment first activates analytical engagement that makes users more likely to notice when the AI diverges from their assessment — preserving critical evaluation rather than anchoring to AI output.

The pattern works by activating independent cognitive engagement before AI exposure — users who form their own view first are less likely to uncritically accept AI output that conflicts with their assessment.
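
In interface terms, "prediction before reveal" comes down to one rule: the AI output stays locked until the human has committed an independent assessment. The sketch below shows that gate in isolation. It is a hypothetical illustration, not PathAI's implementation. The class names, field names, and sample findings are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Review:
    """One case under review. All field names are hypothetical."""
    case_id: str
    ai_finding: str
    human_finding: Optional[str] = None


class PredictionBeforeReveal:
    """Keeps the AI finding hidden until the human commits a finding."""

    def __init__(self, review: Review) -> None:
        self._review = review

    def commit_human_finding(self, finding: str) -> None:
        # The first commitment is final, so it cannot be revised after the reveal.
        if self._review.human_finding is not None:
            raise ValueError("Human finding is already committed.")
        self._review.human_finding = finding

    def reveal_ai_finding(self) -> str:
        if self._review.human_finding is None:
            raise PermissionError("Commit your own assessment before viewing the AI output.")
        return self._review.ai_finding

    def disagrees(self) -> bool:
        """True when the two findings differ and the case needs a closer look."""
        return self.reveal_ai_finding() != self._review.human_finding


session = PredictionBeforeReveal(Review(case_id="case-001", ai_finding="malignant"))
session.commit_human_finding("benign")
print(session.reveal_ai_finding(), session.disagrees())
```

Making the human commitment final is a deliberate choice in this sketch: if the assessment could be edited after the reveal, the log would no longer show whether the human anchored on the AI output.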

11. Rajpurkar et al.'s (2022) finding that human-AI chest X-ray teams sometimes performed worse than AI alone demonstrates:

Correct. Without interfaces that surface human-AI disagreement, automation bias causes teams to underperform AI alone. Complementarity is a design outcome, not an automatic property of collaboration.

The finding shows complementarity is conditional on design. Without disagreement surfacing, automation bias degrades team performance below AI-alone baselines. Complementarity must be actively designed.

12. Which of the four documented human-AI collaboration patterns is characterized by humans and AI working separately, then comparing outputs?

Correct. Parallel deliberation has human and AI work independently before comparing, which makes it the pattern most effective at preventing anchoring effects and reducing automation bias.

Parallel deliberation is the pattern where human and AI work independently before comparison — maximizing independence and reducing anchoring effects that occur when one sees the other's work first.

13. Nielsen Norman Group's 2023 supplement to the original heuristics added six new principles to address which AI-specific challenges?

Correct. These six new heuristics address interaction problems specific to LLM-based systems — challenges that simply did not exist as interface design problems when Nielsen published his original ten in 1994.

NNG's 2023 supplement addressed LLM-specific challenges: uncertainty communication, trust calibration, reasoning transparency, hallucination recovery, context management, and collaboration pattern design.

14. Google NotebookLM's feature of linking AI responses to specific passages in uploaded source documents is an example of which transparency strategy?

Correct. NotebookLM's citations show which specific passages informed each AI response — a verifiable process trace from inputs to outputs. This is process transparency, distinct from historical performance metrics.

This is process transparency — showing how the AI reached its output by linking to source passages. Different from performance transparency (historical accuracy) or limitation disclosure.

15. The Chevrolet of Watsonville chatbot incident (December 2023) and the Air Canada chatbot tribunal (February 2024) share which common design failure?

Correct. Both cases share the absence of human escalation gates — no mechanism to route high-stakes outputs (pricing commitments, policy interpretations) to human review before delivery. This is the defining shared design failure.

Both cases lacked human escalation gates — no mechanism to flag transactional or policy outputs for human review. AI made consequential commitments with no human checkpoint and no recovery path. That is the shared design failure.
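
A human escalation gate can be as small as a routing check that runs before any reply is delivered. The sketch below is a minimal illustration under stated assumptions: the topic labels, the confidence threshold, and the `DraftReply` type are hypothetical, and it assumes some upstream step has already tagged each draft with a topic and a confidence score.

```python
from dataclasses import dataclass

# Hypothetical topic labels for outputs that commit the company to something.
HIGH_STAKES_TOPICS = {"pricing", "refund_policy", "contract_terms"}


@dataclass
class DraftReply:
    """A chatbot reply awaiting delivery. Field names are illustrative."""
    text: str
    topic: str
    confidence: float


def route(reply: DraftReply, min_confidence: float = 0.8) -> str:
    """Return "human_review" for consequential or uncertain drafts, else "send"."""
    if reply.topic in HIGH_STAKES_TOPICS:
        # Pricing commitments and policy interpretations always get a human checkpoint.
        return "human_review"
    if reply.confidence < min_confidence:
        return "human_review"
    return "send"


print(route(DraftReply("That vehicle is yours for $1.", "pricing", 0.99)))
print(route(DraftReply("We open at 9 a.m. on weekdays.", "store_hours", 0.95)))
```

Note that the first draft is escalated despite its high confidence: the gate keys on what the reply commits the company to, not on how sure the model sounds.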