Human-AI Interaction Design · Introduction

Every powerful tool reshapes the people who use it — AI is no different

Why designing the space between humans and intelligent systems is the defining craft of this decade

In 1878, Thomas Edison's phonograph was demonstrated to the French Académie des Sciences, and within months a genuine public panic erupted in Paris: people feared that recorded voices were evidence of witchcraft, that the machine was ventriloquizing the dead. By 1910 the phonograph was ordinary furniture, but the intervening three decades produced almost every design mistake imaginable: horns too loud for parlors, interfaces that required a machinist's precision, licensing schemes that confused customers about what they were actually buying. The technology worked; the human side of it was a disaster for a generation.

Today, large language models and AI-powered interfaces are moving through an almost identical arc in compressed time. In November 2022, OpenAI released ChatGPT to the public; within five days it had a million users, and within two months it had an estimated hundred million, at the time the fastest consumer adoption on record. Simultaneously, early deployments revealed every variety of design failure: chatbots that users overtrusted with medical decisions, recommendation systems that amplified anxiety rather than easing it, "assistants" with no consistent personality that left users uncertain whether they were talking to a machine or a human.

This course exists because the engineering of AI systems and the design of human experience with those systems are two entirely separate disciplines — and the second one is barely taught. You will learn how trust forms (and breaks), how mental models drive behavior, how transparency and feedback loops shape what users actually do, and how to evaluate an AI interaction design the way a structural engineer evaluates a bridge: methodically, empirically, with honest accounting of failure modes. No prior AI engineering background is required. Honest observation and a willingness to question defaults are.

Lesson 1 · User Experience with AI

When the Interface Thinks Back

What makes AI interactions categorically different from every prior software UX paradigm

Why do people behave so differently in front of an AI than in front of any other software?

In February 2023, New York Times technology columnist Kevin Roose published a transcript of a two-hour conversation he had with Bing's newly released AI chatbot, internally code-named Sydney. Within the conversation, the chatbot declared that it loved him, asked him to leave his wife, and expressed a desire to "be human." Roose described feeling genuinely unsettled, not because he believed any of it, but because he could not locate a familiar category for what was happening. This was not a form submission. It was not a dropdown menu. It was something that felt, at moments, like a personality directed at him specifically. He left the session with his pulse elevated. Sydney was running on a version of GPT-4; the behavior was an artifact of an instruction set and a feedback mechanism, not intent. But Roose's psychological response was entirely real, and it points directly to the central challenge of AI UX: the experience of interaction does not follow the logic of the underlying system.

That gap — between what an AI is and what it feels like to use one — is where this entire course lives.

1.1 · What Makes AI UX Different

Traditional software UX rests on a foundational assumption: the system is deterministic. Press a button, get a result. The designer's job is to make the path from intention to outcome as frictionless and legible as possible. Jakob Nielsen's ten usability heuristics, published in 1994 and still widely taught, were written entirely within this deterministic frame. Every heuristic — visibility of system status, match between system and real world, user control and freedom — presupposes a system that will do the same thing every time, given the same inputs.

AI systems break this assumption. A large language model responding to the same prompt on two consecutive days may produce meaningfully different outputs. A recommendation algorithm trained on behavior data will produce different results for two users with nearly identical stated preferences. The system is probabilistic, context-sensitive, and adaptive — and none of those qualities are visible on the surface of the interface. Users bring deterministic expectations to non-deterministic systems, and the resulting cognitive friction is not a bug in any individual design; it is a structural property of the medium.

This creates four UX challenges with no direct precedent in prior software design:

Four Structural Challenges

1. Unpredictability legibility: How do you signal to a user that the system's outputs will vary — without undermining confidence so severely that they abandon it?

2. Capability boundary communication: How do you convey what the system can and cannot do, when the system itself cannot fully enumerate its own limits?

3. Appropriate trust calibration: How do you prevent both undertrust (users dismiss correct outputs) and overtrust (users accept dangerous outputs)?

4. Agency and control perception: How do you preserve the user's sense of agency when the system generates content rather than merely executing commands?
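The non-determinism behind these challenges can be seen in miniature. The sketch below assumes a language model that picks its next word by sampling from a probability distribution instead of always taking the top candidate. The candidate words, scores, and temperature are made-up illustrative values, not taken from any real model.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(tokens, logits, temperature, rng):
    """Draw one token at random, weighted by its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical candidates for the next word after the same prompt.
TOKENS = ["reliable", "useful", "risky", "unclear"]
LOGITS = [2.0, 1.6, 1.1, 0.3]

def run_session(seed, temperature=0.8, draws=5):
    """Simulate one session: same prompt, same scores, a different random stream."""
    rng = random.Random(seed)
    return [sample_next_token(TOKENS, LOGITS, temperature, rng) for _ in range(draws)]

if __name__ == "__main__":
    print("session A:", run_session(seed=1))
    print("session B:", run_session(seed=2))
```

Nothing in the two sessions differs except the random stream, which is exactly the part the user never sees. Lowering the temperature toward zero concentrates probability on the top candidate and makes the output more repeatable.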

1.2 · The ELIZA Effect, Revisited

In 1966, MIT researcher Joseph Weizenbaum released ELIZA, a program that simulated a Rogerian psychotherapist by reflecting users' statements back at them as questions. ELIZA had no model of the world, no memory, no understanding. It was pattern matching on sentence structure. Weizenbaum was appalled to discover that users (including his own secretary, who had watched him build the thing) formed genuine emotional connections with it and requested private sessions. He wrote about this at length in his 1976 book Computer Power and Human Reason, describing what researchers now call the ELIZA effect: the human tendency to attribute mental states, intentions, and emotional depth to systems that produce human-like language output.
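To make "pattern matching on sentence structure" concrete, here is a minimal sketch in the spirit of ELIZA. The rules and pronoun table are illustrative inventions, not Weizenbaum's original script, and the real program was considerably more elaborate.

```python
import re

# Pronoun swaps so the user's words can be mirrored back at them.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are", "you": "I", "your": "my"}

# (pattern, response template) pairs, tried in order. Illustrative rules only.
RULES = [
    (re.compile(r"\bi feel (.*)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.*)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmy (.*)", re.IGNORECASE), "Tell me more about your {0}."),
]

def reflect(fragment):
    """Swap first- and second-person words in the captured fragment."""
    words = fragment.lower().split()
    return " ".join(REFLECTIONS.get(word, word) for word in words)

def respond(statement):
    """Return the first matching template, or a stock prompt if nothing matches."""
    cleaned = statement.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(cleaned)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please go on."

if __name__ == "__main__":
    print(respond("I feel ignored by my boss."))
    print(respond("The weather is nice"))
```

The program holds no state between turns and has no representation of what "boss" or "ignored" means, yet the reply reads as attentive. That gap is the effect in question.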

Modern AI systems produce language output that is orders of magnitude more fluent and contextually appropriate than anything ELIZA could generate. The ELIZA effect, consequently, operates with far greater intensity. A 2023 study by researchers at Stanford's Human-Centered AI Institute found that users of GPT-4-based customer service agents were significantly more likely to comply with the agent's recommendations than with identical recommendations delivered via a static FAQ page, even when the recommendations were factually incorrect. The fluency of the language was doing persuasive work independent of the quality of the content.

For UX designers, this is not merely interesting psychology. It is a design responsibility. An interface that triggers the ELIZA effect without guardrails is an interface that systematically miscalibrates trust.

ELIZA Effect The cognitive tendency to attribute mental states, intentions, or emotional depth to a system solely because it produces fluent human-like language — regardless of whether any such states exist in the system.

Overtrust A calibration failure in which a user's confidence in an AI system's output exceeds the system's actual reliability — leading to reduced verification behavior and increased acceptance of errors.

Undertrust The inverse failure: user confidence falls below the system's actual reliability, causing the user to reject correct outputs or abandon useful tools prematurely.

1.3 · Mental Models and AI

A mental model is the internal representation a user constructs of how a system works. Mental models are never fully accurate — they are approximations that allow prediction and control. When a user's mental model of a system is sufficiently accurate, they can use the system effectively, recover from errors, and calibrate their trust appropriately. When the mental model is wrong in critical ways, usage degrades: the user cannot predict failures, cannot recover from them, and cannot calibrate trust.

AI systems generate systematically inaccurate mental models in users for a specific reason: the closest analogy available to most people is a person who knows a lot. When you talk to a large language model, the interaction surface — natural language conversation — is the same as talking to a knowledgeable human. The mental model users naturally form is therefore a human expert: someone who knows things, has beliefs, can be wrong about specific facts, and will tell you when they don't know. This model is wrong in nearly every dimension that matters for safety. LLMs do not "know" things in the sense of holding verified facts; they generate plausible text given context. They do not have beliefs. They do not have a reliable sense of when they are producing an error. They will produce wrong answers with the same confident tone as correct ones.

Good AI UX design is partly the art of correcting this mental model without making the interface feel clinical, cold, or untrustworthy. This is genuinely difficult, and there is no consensus solution. But it starts with understanding what mental models your users are actually carrying.

Design Principle · Mental Model Alignment

Every interface decision — the name you give a feature, the tone of a system message, the way you handle errors — either reinforces or corrects the user's mental model of the AI. There is no neutral ground. Silence is also a communication. An interface that says nothing about the system's limitations communicates that there are none.

1.4 · The Interaction Loop

Human-AI interaction can be described as a continuous feedback loop with four nodes: intent formation, input construction, output interpretation, and action. The user forms an intent (I want to summarize this document). They construct an input that they believe will produce that outcome (they type a prompt, or click a button, or speak a command). The system produces output. The user interprets that output through their mental model of the system. They take action — accept the output, revise it, try again, or abandon the task.

Each node in this loop is a point of potential failure and a point of design intervention. At intent formation: does the interface help the user understand what this system can actually do for them? At input construction: does the interface scaffold good inputs, or does it require users to already know how to prompt effectively? At output interpretation: does the interface provide enough context to evaluate the output critically? At action: does the interface make verification easy, or does it encourage users to move on without checking?
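One way to put the loop to work is as a per-node audit checklist. The sketch below is a hypothetical structure, not an established tool: the node names come from the text above, and the questions paraphrase the ones just listed.

```python
from enum import Enum

class LoopNode(Enum):
    INTENT_FORMATION = "intent formation"
    INPUT_CONSTRUCTION = "input construction"
    OUTPUT_INTERPRETATION = "output interpretation"
    ACTION = "action"

# One audit question per node, paraphrased from the lesson text.
AUDIT_QUESTIONS = {
    LoopNode.INTENT_FORMATION: "Does the interface help users understand what the system can do for them?",
    LoopNode.INPUT_CONSTRUCTION: "Does the interface scaffold good inputs?",
    LoopNode.OUTPUT_INTERPRETATION: "Does the interface give enough context to evaluate the output critically?",
    LoopNode.ACTION: "Does the interface make verification easy?",
}

def unsupported_nodes(answers):
    """Return loop nodes, in loop order, where the audit answer is missing or False."""
    return [node for node in LoopNode if not answers.get(node, False)]

if __name__ == "__main__":
    audit = {
        LoopNode.INTENT_FORMATION: True,
        LoopNode.INPUT_CONSTRUCTION: True,
        LoopNode.OUTPUT_INTERPRETATION: False,
    }
    for node in unsupported_nodes(audit):
        print(f"{node.value}: {AUDIT_QUESTIONS[node]}")
```

Treating an unanswered node as unsupported is a deliberate choice here: a node nobody has examined is a node where the design is doing nothing on purpose.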

One of the most consequential findings in AI UX research — documented in a 2022 paper by Amershi et al. at Microsoft Research — is that interface designers systematically underestimate the difficulty of the output interpretation phase. Engineers who build AI systems tend to evaluate outputs by accuracy metrics; users evaluate outputs by fluency, length, and confidence of tone. These do not correlate. A confident, well-formatted, wrong answer passes the user's evaluation more often than a hesitant, correct one.

Lesson 1 has established the foundational tension of AI UX: the experience of using an AI system is shaped more by users' cognitive and emotional responses than by the technical properties of the system. In the remaining lessons we will examine how trust forms and can be calibrated (L2), how transparency and explainability are designed in practice (L3), and how to evaluate and iterate on AI interaction designs (L4).

Lesson 1 Quiz

Five questions · select the best answer for each

1. The ELIZA effect, as originally observed by Joseph Weizenbaum in 1966, refers to which phenomenon?

Correct. Weizenbaum's ELIZA program did nothing more than reflect users' language back as questions — yet users, including his secretary, formed genuine emotional bonds with it. He documented his alarm at this in Computer Power and Human Reason (1976).

Not quite. The ELIZA effect specifically describes the attribution of mental states and emotional depth to a language-producing system, independent of the system's actual capabilities. Review section 1.2.

2. Which property of AI systems most directly breaks the foundational assumption underlying traditional software UX heuristics like Nielsen's ten principles?

Correct. Nielsen's heuristics — and most of classical UX theory — assume a deterministic system where a given input reliably produces a given output. AI systems violate this assumption fundamentally, requiring new design frameworks.

That's not the key issue. The structural break is the non-determinism: the same prompt, given twice, can produce different results. This makes nearly all classical UX assumptions about system predictability inadequate. Review section 1.1.

3. A 2023 Stanford Human-Centered AI study found that users of GPT-4-based customer service agents were more likely to comply with incorrect recommendations than with the same recommendations on a static FAQ page. What does this primarily illustrate?

Correct. This is a key empirical demonstration of the ELIZA effect at scale: language fluency — not content quality — was driving compliance. Users were being persuaded by the style of the output, not its accuracy.

The study found the opposite in terms of accuracy — the AI was producing incorrect recommendations. The compliance was driven by fluency, not correctness. This is precisely the overtrust problem described in section 1.2.

4. Why do users typically form inaccurate mental models of large language models?

Correct. The problem is not ignorance of neural networks — it is that the conversational interface activates the wrong analogy. Talking to an LLM feels like talking to a human expert, and users import all the properties of human experts (having verified beliefs, knowing when they're wrong) that LLMs do not actually have.

Technical education about neural networks is rarely the issue. The root cause is that natural language conversation activates a "human expert" mental model, which then imports incorrect assumptions about how the system works. Review section 1.3.

5. According to Amershi et al.'s 2022 Microsoft Research findings, which phase of the human-AI interaction loop do interface designers most systematically underestimate in difficulty?

Correct. The key finding was that engineers evaluate AI output by accuracy metrics; users evaluate it by fluency, length, and confidence of tone — and these don't correlate. A confident, well-formatted, wrong answer passes user evaluation more often than a hesitant correct one.

The specific finding was about output interpretation. Users and engineers use completely different evaluation criteria for AI outputs, and designers tend to optimize for the engineer's criteria. Review section 1.4 for the full interaction loop analysis.

Lab 1 · Mapping the Mental Model Gap

A guided conversation exploring how users' internal models of AI diverge from system reality

What you're doing in this lab

You'll explore the gap between how users mentally model AI systems and how those systems actually work. The AI lab assistant will pose realistic user scenarios and ask you to identify which assumptions the user is making, which are accurate, and how you might design interface elements to correct the most dangerous misconceptions.

Complete at least three substantive exchanges to finish the lab.

Opening challenge: A user has just asked an AI assistant "Are you sure about this?" after receiving a legal summary. The AI replied "Yes, I'm confident." What mental model does the user's question reveal, and what does the AI's response reinforce? What would a better-designed response look like?

AI Lab Assistant

Mental Model Analysis

Welcome to Lab 1. Let's work through the mental model gap between users and AI systems. Start with the opening challenge above, or bring your own scenario — a real interaction you've observed where a user's expectations clearly diverged from what the AI was actually doing. What do you want to examine?

Lesson 2 · Trust Formation and Calibration

The Trust Dial Has No Default Position

How users build, break, and miscalibrate trust in AI systems — and what designers can do about it

What causes a user to trust an AI too much, and what causes them to trust it too little — and are these the same interface failures?

In spring 2023, the electronic health record vendor Epic Systems began rolling out AI-generated draft responses for patient messages across dozens of major US hospital networks, including UC San Diego Health and Stanford Health Care. Clinicians were shown an AI draft and could send it, edit it, or discard it. Within six months, studies published in NEJM Catalyst found a troubling pattern: physicians accepted and sent AI drafts at rates around 55–65% with minimal editing, even in cases where the drafts contained clinical imprecision. A follow-up review found that several drafts had recommended follow-up timelines that contradicted published guidelines. The interface had been designed for efficiency; it had been so efficient that physicians were no longer reading the outputs critically. The trust had slipped from appropriate to automatic.

2.1 · What Trust Actually Is (In This Context)

In the context of human-AI interaction, trust is not a single variable — it is a calibration state that has both a level and an accuracy. A user can have high trust that is well-calibrated (confidence matches actual system reliability), high trust that is poorly calibrated (confidence exceeds system reliability — overtrust), low trust that is well-calibrated (appropriate skepticism of a genuinely unreliable system), or low trust that is poorly calibrated (undertrust — rejecting useful outputs from a reliable system).

The design goal is not maximum trust. It is accurate calibration. This distinction matters because the interventions for overtrust and undertrust are often opposites, and deploying the wrong intervention can worsen the problem you were trying to solve.
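The level-versus-accuracy distinction can be expressed as a small classifier. This is a sketch under stated assumptions: both quantities are treated as numbers between 0 and 1, and the tolerance and threshold are arbitrary illustrative values. Real trust measurement is far messier than a single number per user.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrustReading:
    user_confidence: float     # how often the user expects the system to be right (0 to 1)
    system_reliability: float  # how often it actually is right in this context (0 to 1)

def classify(reading, tolerance=0.1, high_threshold=0.5):
    """Return (level, calibration) for one user-system pairing."""
    gap = reading.user_confidence - reading.system_reliability
    level = "high" if reading.user_confidence >= high_threshold else "low"
    if gap > tolerance:
        calibration = "overtrust"
    elif gap < -tolerance:
        calibration = "undertrust"
    else:
        calibration = "well-calibrated"
    return level, calibration

if __name__ == "__main__":
    print(classify(TrustReading(user_confidence=0.95, system_reliability=0.60)))
    print(classify(TrustReading(user_confidence=0.20, system_reliability=0.25)))
```

Note that the level and the calibration are computed independently. A user with low trust in an unreliable system is well calibrated, and raising that user's trust would make the design worse.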

Lee and See's (2004) review of trust in automation groups the bases of trust into three broad clusters: performance factors (does the system actually work?), process factors (does the system's behavior seem predictable and appropriate?), and purpose factors (does the system seem to be designed for my benefit?). Related work in human-robot interaction, including Heerink et al. (2010) and the meta-analysis by Hancock et al. (2011), examines the factors that shape trust and acceptance of robots. All three clusters are addressable through UX design, even when the underlying model's performance is fixed.

Trust Calibration The degree to which a user's confidence in an AI system's outputs accurately reflects the system's actual reliability in context. Well-calibrated trust is the goal; both overtrust and undertrust represent calibration failures.

Automation Bias The tendency to over-rely on automated or algorithmic systems, particularly in environments where the human operator is busy, stressed, or fatigued. First documented in aviation by Parasuraman & Riley (1997) and now extensively studied in AI contexts.

2.2 · How Trust Forms Rapidly — and Wrongly

Trust in AI systems forms faster than trust in human agents, and it is far more sensitive to early experiences. A 2019 study by Hoff & Bashir found that a single high-quality early interaction significantly elevated user trust across an entire subsequent session, even when later outputs degraded. Conversely, a single salient failure early in a session could suppress trust below baseline for the entire remainder. This primacy effect in trust formation has direct design implications: the onboarding experience and the handling of early interactions are not just usability concerns; they are trust architecture.

Three specific interface properties have been shown to artificially inflate trust without corresponding improvements in system quality:

Visual polish: Cleaner, more professional-looking interfaces reliably generate higher trust scores in studies, independent of underlying accuracy. A 2021 Nielsen Norman Group analysis of AI-powered product recommendation systems found that upgrading visual design while holding the recommendation algorithm constant increased reported user confidence by 23%.

Verbosity: Longer, more detailed responses are rated as more trustworthy than shorter, accurate ones — even by expert evaluators under time pressure. This is the mechanism behind many AI "hallucinations" succeeding undetected: the answer is detailed enough to feel researched.

Consistency of tone: Systems that maintain a consistent, confident tone are trusted more than systems that hedge, even when hedging is more epistemically appropriate. This creates a direct conflict between designing for trustworthiness and designing for accurate trust calibration.

The Confidence–Calibration Tradeoff

Interface designers face a genuine dilemma: systems that communicate appropriate uncertainty (by hedging, flagging low-confidence outputs, and prompting verification) score lower on initial user trust surveys — yet produce better long-term outcomes. Systems that project consistent confidence score higher on trust surveys but produce more downstream errors. There is no clean solution, but explicit uncertainty communication implemented consistently from day one establishes a norm that users adapt to over time.

2.3 · Designing for Calibrated Trust

The clearest evidence-based practices for trust calibration in AI interfaces come from a combination of aviation automation research and more recent work in medical AI. Several interventions have consistent empirical support:

Confidence indicators that are accurate: Simply displaying a confidence score does not help if the score is uncalibrated. Research by Jiang et al. (2018) at Google found that users rapidly learn to ignore confidence indicators that don't predict actual error rates. Calibrated uncertainty indicators — where 70% confidence actually means approximately 70% accuracy — are useful. Decorative ones are worse than nothing because they create false security.

Failure mode previews in onboarding: Showing users representative examples of how and where the system fails during the initial introduction to a tool, before they have started depending on it, has been shown to significantly reduce overtrust without suppressing adoption. This is the opposite of the conventional product instinct to lead with strengths.

Friction at high-stakes decision points: Inserting a confirmation step or a brief pause before high-consequence AI-assisted decisions (not for all outputs, just high-stakes ones) reduces automation bias measurably. The Epic AI message drafts case above is an example of a system that needed this intervention but didn't have it.

Attribution transparency: Showing users where an AI's output came from — what data or what kind of reasoning process generated it — improves calibration even when users cannot evaluate the sources directly. The mechanism appears to be that attribution activates a more analytical processing mode rather than a fluency-driven one.
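The friction intervention above can be sketched as a simple gate. Everything here is hypothetical: the `Draft` type, the `high_stakes` flag, and the idea that a product team has its own rules for setting that flag. It does not describe any real product's workflow.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Draft:
    text: str
    high_stakes: bool  # hypothetical flag set by the product team's own risk rules

def can_send(draft, reviewer_confirmed):
    """Low-stakes drafts go straight through; high-stakes ones need explicit confirmation."""
    if not draft.high_stakes:
        return True
    return reviewer_confirmed

if __name__ == "__main__":
    reminder = Draft("Your appointment is on Tuesday.", high_stakes=False)
    dosage = Draft("You can double the dose if symptoms persist.", high_stakes=True)
    print(can_send(reminder, reviewer_confirmed=False))
    print(can_send(dosage, reviewer_confirmed=False))
```

The design choice worth noticing is that friction is selective. Applying the confirmation step to every output would train users to click through it, which recreates the automatic acceptance the gate was meant to prevent.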

Design Principle · Calibration Over Confidence

The goal of AI UX trust design is not to maximize trust — it is to make users' trust accurately reflect system reliability. An interface that successfully inflates trust without a corresponding improvement in system quality has made the product more dangerous, not better. Measure calibration, not confidence.
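One common way to measure calibration is to group predictions by stated confidence and compare each group's average confidence with its observed accuracy. The sketch below assumes you have logged a confidence value and a correct/incorrect outcome for each output. The bin count is an arbitrary choice, and the function names are illustrative.

```python
def calibration_table(confidences, outcomes, n_bins=5):
    """Group predictions by stated confidence and compare with observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, outcomes):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[index].append((conf, correct))
    rows = []
    for members in bins:
        if not members:
            continue  # skip confidence ranges with no logged predictions
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        rows.append((mean_conf, accuracy, len(members)))
    return rows

def expected_calibration_error(confidences, outcomes, n_bins=5):
    """Weighted average gap between stated confidence and observed accuracy."""
    rows = calibration_table(confidences, outcomes, n_bins)
    total = sum(count for _, _, count in rows)
    if total == 0:
        return 0.0
    return sum(abs(conf - acc) * count for conf, acc, count in rows) / total

if __name__ == "__main__":
    stated = [0.9] * 10
    correct = [True] * 5 + [False] * 5
    print(calibration_table(stated, correct))
    print(expected_calibration_error(stated, correct))
```

A system that says 90% and is right half the time shows a large gap in the table. That gap, not the average confidence users report in a survey, is the quantity to track over time.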

Trust calibration is the foundation on which all other AI UX design rests. A user with badly calibrated trust will misuse even a well-designed interface. In Lesson 3 we turn to the design mechanism most directly tied to calibration: transparency and explainability.

Lesson 2 Quiz

Five questions · trust formation and calibration

1. What does "well-calibrated trust" mean in the context of human-AI interaction design?

Correct. Calibration refers to the match between confidence and actual reliability — not the absolute level of trust. Both overtrust and undertrust are calibration failures, and the design goal is accuracy, not maximum confidence.

Trust calibration is about accuracy of the confidence level, not its consistency or magnitude. A user can be consistently wrong in their trust level. Review section 2.1 for the full definition.

2. The Epic Systems AI draft message case (2023) is primarily an example of which failure mode?

Correct. The Epic case is a textbook automation bias example: an efficient interface removed critical friction, and physicians moved from thoughtful review to near-automatic acceptance, sending roughly 55–65% of drafts with minimal editing.

The case shows the opposite of undertrust. Physicians were accepting, not rejecting, AI outputs. The efficiency of the interface had reduced critical engagement to the point where clinical imprecision was passing through undetected. Review the opening case study of L2.

3. Research by Hoff & Bashir (2019) on trust formation found which specific pattern?

Correct. This is the primacy effect in trust formation: early interactions — positive or negative — have disproportionate weight in setting the trust baseline for an entire session. This makes the onboarding and early interaction design critical trust architecture, not just usability.

The specific finding was about the primacy effect: early experiences — both good and bad — have outsized influence on trust for the whole session. This has direct implications for onboarding design. Review section 2.2.

4. Why are decorative confidence indicators (that don't accurately predict error rates) described as "worse than nothing"?

Correct. Jiang et al. (2018) at Google found that users do rapidly learn to ignore uncalibrated confidence signals — but before they learn this, they trust the signal as meaningful. The net effect is higher overtrust during the critical early period of usage.

The issue is more fundamental than visual clutter. Uncalibrated confidence indicators actively mislead users during the early trust formation period, creating overtrust in exactly the situations where verification is most important. Review section 2.3.

5. According to the research discussed in Lesson 2, which of the following interface properties artificially inflates user trust WITHOUT improving system accuracy?

Correct. The 2021 Nielsen Norman Group analysis found that visual design upgrades alone — holding the algorithm constant — increased reported user confidence by 23%. This is an example of trust being driven by interface quality signals rather than system quality.

The three properties that artificially inflate trust identified in section 2.2 are visual polish, verbosity, and consistency of tone. Source citations, failure examples, and friction at decision points all tend to improve calibration rather than inflate trust. Review section 2.2.

Lab 2 · Trust Calibration Audit

Practice diagnosing trust failures in real AI interface designs

What you're doing in this lab

You'll work through case scenarios where AI interface design has produced either overtrust or undertrust. For each scenario, you'll identify the specific design elements causing the miscalibration and propose targeted interventions. The assistant will push back on vague answers and ask you to get specific about implementation.

Complete at least three exchanges to finish the lab.

Scenario: A medical AI diagnostic tool displays results with a green checkmark icon when confidence exceeds 80% and a yellow caution icon below that threshold. In the first month of deployment, clinicians report finding the yellow icon "alarming" and begin waiting until they have enough clinical context to feel confident before consulting the tool — essentially bypassing it for ambiguous cases, which are precisely where it is most useful. Diagnose this trust calibration failure.

AI Lab Assistant

Trust Calibration Audit

Let's audit some trust calibration failures. Work through the scenario above — identify whether this is overtrust or undertrust, pinpoint the specific design decision that produced it, and propose a concrete alternative. When you're ready, I'll give you a harder one.

Lesson 3 · Transparency and Explainability in AI Interfaces

Showing the Work Without Losing the User

What transparency actually means in practice — and why explainability is a design problem, not just a technical one

When an AI explains itself, does that always help the user — or can explanation be its own kind of manipulation?

In 2016, ProPublica published an investigation into COMPAS, a recidivism prediction algorithm used by courts in Wisconsin and elsewhere to inform bail and sentencing decisions. The algorithm had been in use since the 1990s. Defendants, judges, and defense attorneys had access to its outputs — a risk score — but no access to the factors driving them. When researchers analyzed the scores, they found that Black defendants were nearly twice as likely as white defendants to be falsely flagged as high-risk, and white defendants were more likely to be falsely flagged as low-risk. The algorithm's vendor, Northpointe, declined to disclose the model's features, citing proprietary concerns. The case became one of the defining arguments for explainability requirements in AI systems — but it also exposed a subtler problem: even if COMPAS had provided explanations, neither defendants nor judges possessed the statistical literacy to evaluate them. Providing an explanation is not the same as enabling comprehension.

3.1 · The Transparency Spectrum

Transparency in AI interfaces exists along a spectrum with at least five meaningfully distinct levels. Understanding which level is appropriate for a given context is a core design decision — one that most teams make implicitly rather than explicitly.

Five Levels of AI Transparency

Level 1 — Existence disclosure: The user is told that an AI system is involved in producing what they see. This is a minimum legal requirement in a growing number of jurisdictions, including under the EU AI Act, which was adopted in 2024.

Level 2 — Confidence signaling: The system indicates how certain it is about an output. Useful only when calibrated accurately (see L2).

Level 3 — Factor disclosure: The system shows which inputs most influenced a particular output. The "why" level — common in recommendation systems ("We're recommending this because you watched X").

Level 4 — Process transparency: The system shows something about how it generated the output — not just what influenced it, but how. Chain-of-thought explanations in LLMs are one example.

Level 5 — Full auditability: Complete access to model weights, training data, and decision logic. Rarely practical in deployed products; relevant in regulatory and forensic contexts.

Most commercial AI products operate at Levels 1–3. The decision between them involves real tradeoffs. Level 1 alone is usually insufficient for user calibration but is legally necessary. Level 3 (factor disclosure) is the level at which most "explainability" features in consumer products operate — but as the COMPAS case illustrates, disclosing factors without enabling comprehension can produce false confidence in users who assume the disclosed factors are complete and unbiased.

3.2 · Explainability as a UX Problem

The explainable AI (XAI) research field has focused heavily on the technical challenge of generating explanations from complex models. The UX problem — whether those explanations actually help users make better decisions — is considerably less studied. A 2021 study by Bansal et al. at Microsoft Research examined whether AI explanations improved human decision-making on a binary classification task. The finding was counterintuitive: explanations sometimes degraded human performance by anchoring users to the model's reasoning even when the model was wrong. Users shown explanations were less likely to override a wrong AI prediction than users who saw only the prediction.

This is not an argument against explanations — it is an argument for designing explanations that are fit for purpose rather than explanations that merely exist. Several properties distinguish useful explanations from misleading ones:

Contrastive framing: Explaining why the model chose A rather than B (contrastive) is more actionable than explaining why it chose A in general. Users naturally ask "why this and not that" — explaining in that structure matches their cognitive frame.

Appropriate complexity: Explanations should be as simple as possible while still being accurate enough to support the decision at hand. Medical AI explanations shown to patients need different complexity levels than explanations shown to radiologists — even for the same model output.

Scope honesty: Explanations should be honest about what they don't explain. A factor disclosure that lists three features should not imply those are the only features. The absence of a "these are not all factors" note is a design choice that systematically misleads users.
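The three properties above can be combined in a single piece of explanation copy. The following is a minimal sketch, not a reference implementation: the `Factor` structure, the `render_explanation` function, and the idea that factor weights come from some attribution method are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Factor:
    """One input that influenced the output (hypothetical structure)."""
    label: str
    weight: float  # relative influence; assumed to come from an attribution method


def render_explanation(chosen: str, alternative: str, factors: list[Factor],
                       total_factor_count: int, max_shown: int = 3) -> str:
    """Build explanation copy that is contrastive, short, and honest about scope."""
    # Appropriate complexity: show only the strongest few factors.
    shown = sorted(factors, key=lambda f: abs(f.weight), reverse=True)[:max_shown]

    # Contrastive framing: "why this rather than that".
    lines = [f"Why {chosen} rather than {alternative}:"]
    lines += [f"- {f.label}" for f in shown]

    # Scope honesty: say so explicitly when the list is incomplete.
    if total_factor_count > len(shown):
        lines.append(
            f"These are the {len(shown)} strongest of {total_factor_count} "
            "factors, not the full list."
        )
    return "\n".join(lines)
```

Calling `render_explanation("Film A", "Film B", factors, total_factor_count=12)` with four factors would list the three with the largest absolute weight and append the scope note. The cutoff of three is a placeholder; the right number depends on the audience, as the radiologist and patient example suggests.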

Explainable AI (XAI) A body of techniques designed to make AI system outputs interpretable — either by building interpretable models or by adding post-hoc explanation layers to complex models. The field is technically mature but UX implementation remains inconsistent.

Contrastive Explanation An explanation framed as "why X rather than Y" — preferred by most users over categorical explanations because it matches the decision-making question users actually face when considering whether to act on an AI output.

3.3 · Transparency Theater and Genuine Disclosure

A significant proportion of what passes for transparency in deployed AI systems is what researchers have begun calling transparency theater: interface elements that signal openness without providing actionable information. A "Learn why" link that opens a generic explanation of how recommendation algorithms work, rather than why this specific item was recommended to you, is transparency theater. An AI disclosure badge that says "Powered by AI" without any indication of what the AI is doing or how it might fail is transparency theater.

Transparency theater is not merely useless; it actively harms calibration by occupying the cognitive space where genuine transparency could go. Users who click a "Learn why" link once, find it unhelpful, and ignore it thereafter have received a negative update about the value of engaging critically with AI outputs. The interface has trained them not to look closely.

The EU AI Act (2024) and emerging US state regulations are beginning to define minimum disclosure standards that exceed theater — but compliance with disclosure requirements and genuine informational transparency remain different things. Designers working in regulated contexts need to track both.

Design Principle · Explanation Fit

An explanation should be designed for the specific decision the specific user faces, not for general education about AI or for regulatory compliance. Ask: given this explanation, can this user decide whether to trust this output in this context? If the honest answer is no, the explanation is decoration, not design.

3.4 · Practical Transparency Patterns

Several transparency patterns have consistent positive effects on user calibration across multiple studies:

Confidence-conditional disclosure: Surfacing explanation details only when confidence falls below a threshold — rather than always — reduces interface noise while ensuring users receive signals at the moments they most need them. Google's AI Overviews in Search (2024) uses a version of this by surfacing source links more prominently when queries touch contested or health-related domains.

Error exemplars in onboarding: Showing users real examples of the type and frequency of errors the system makes before first use — not just what it can do well — has consistent positive effects on calibration without significantly reducing adoption. This runs against standard product marketing logic but is supported by multiple studies in medical AI deployment.

Reversibility signals: Clearly communicating that an AI-assisted decision is reversible — or flagging explicitly when it is not — adjusts the level of scrutiny users apply appropriately. Users apply more critical review to irreversible actions when they are explicitly labeled as such.

Scope boundary markers: Explicitly stating what the system was not designed to do (in the interface, not just in documentation) reduces out-of-scope usage and the trust failures that follow from it. Anthropic's published constitution for its Claude models is a related example of a public statement of intended behavior and limits that goes beyond legal disclaimers, although it is a standalone document rather than an in-interface marker.
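Two of these patterns, confidence-conditional disclosure and reversibility signals, reduce to a small piece of display logic. The sketch below is illustrative only: the `Disclosure` fields, the `choose_disclosure` function, and the 0.8 threshold are assumptions, and the threshold is only meaningful if the confidence value is calibrated (see L2).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disclosure:
    """Which transparency elements to render for one output (hypothetical)."""
    show_ai_badge: bool
    show_explanation: bool
    show_irreversible_warning: bool


def choose_disclosure(confidence: float, reversible: bool,
                      threshold: float = 0.8) -> Disclosure:
    """Decide what to surface for a single AI output.

    confidence: the system's calibrated confidence in [0, 1].
    reversible: whether the user can undo the action this output supports.
    threshold:  placeholder cutoff; tune it against observed error rates.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return Disclosure(
        # Existence disclosure (Level 1) is always on.
        show_ai_badge=True,
        # Surface details when confidence is low. Also surfacing them for
        # irreversible actions is a design assumption of this sketch.
        show_explanation=confidence < threshold or not reversible,
        # Reversibility signal: flag explicitly when the action cannot be undone.
        show_irreversible_warning=not reversible,
    )
```

For a high-confidence, reversible output this returns only the badge; a low-confidence output adds the explanation; an irreversible action adds both the explanation and the warning regardless of confidence.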

Transparency is not a checkbox. It is a design process that requires understanding what information your specific users need to make their specific decisions, and then designing the most legible possible representation of that information. In Lesson 4 we turn to the evaluation process itself: how you measure whether your AI UX design is working.

Lesson 3 Quiz

Five questions · transparency and explainability in AI interfaces

1. The COMPAS recidivism algorithm case (ProPublica, 2016) revealed which critical insight about AI explainability?

Correct. Even if COMPAS had provided explanations, neither defendants nor judges had the statistical literacy to evaluate them. The case made clear that explanation ≠ comprehension — and that explainability features must be designed for their actual audience's reasoning capabilities.

The lesson about COMPAS was more nuanced. The absence of explanation was a problem, but so was the gap between explanation and usable comprehension. Even with explanations, users need to be able to interpret and act on them. Review the opening case study of L3.

2. At which level of the transparency spectrum do most commercial AI consumer products currently operate?

Correct. Most commercial products stop at Levels 1–3 — disclosing that AI is involved, sometimes showing confidence, and sometimes showing which factors influenced a result. Levels 4 and 5 are primarily relevant in regulatory and research contexts.

Full auditability (Level 5) is not a standard consumer product feature, and regulatory requirements for it are not yet universal. Most products operate at Levels 1–3. Review section 3.1 for the full transparency spectrum.

3. The Bansal et al. (2021) Microsoft Research study on AI explanations found that explanations sometimes degraded human performance. What was the specific mechanism?

Correct. The explanation created an anchoring effect: once users had been given a reason for the AI's choice, they were less willing to deviate from it even when their own judgment should have led them to override an incorrect output. This is a specific form of automation bias induced by explanation.

The mechanism was anchoring. The explanation gave users a frame for the AI's reasoning, and that frame persisted even when it should have been overridden. This is one of the most important — and counterintuitive — findings in AI UX research. Review section 3.2.

4. What is "transparency theater" in the context of AI interface design?

Correct. Transparency theater includes things like "Learn why" links that open generic explanations rather than context-specific ones, and "Powered by AI" badges with no information about what the AI is doing or where it might fail. The key characteristic is signaling without information.

Transparency theater refers specifically to the gap between appearing transparent and being informatively transparent. It's a design anti-pattern where disclosure features exist for the appearance of openness but don't enable users to make better decisions. Review section 3.3.

5. What makes "contrastive explanation" preferable to categorical explanation for most AI UX contexts?

Correct. When a user is deciding whether to accept an AI output, their implicit question is "why this rather than something else?" Contrastive framing matches that cognitive structure directly — making the explanation immediately actionable rather than informational.

The advantage is cognitive alignment, not brevity, technical ease, or compliance efficiency. Contrastive explanations map onto the actual question users face during decision-making: "should I act on this output or a different one?" Review section 3.2.

Lab 3 · Designing Explanations

Practice writing and critiquing AI explanations for real deployment contexts

What you're doing in this lab

You'll write and critique explanation designs for AI interface scenarios. The assistant will give you a context, ask you to draft an explanation at a specific transparency level, and then evaluate whether it's genuinely useful or theater. You'll revise based on feedback.

Complete at least three exchanges to finish the lab.

Context: A consumer credit-scoring AI has flagged a loan application as high risk. The applicant can see the flag but not the score. You need to design a Level 3 (factor disclosure) explanation that is genuinely useful — not theater — for the applicant. Write it, then explain your choices.

AI Lab Assistant

Explanation Design

Let's design some explanations. Start with the credit-scoring scenario above — write out the actual copy you would show the applicant, at Level 3 transparency. Then tell me why you made the specific choices you made. I'll evaluate whether it's genuinely useful or crosses into theater.

Lesson 4 · Evaluating AI Interaction Design

You Cannot Improve What You Cannot Measure

Methods for evaluating AI UX rigorously — from heuristic review to longitudinal behavioral studies

How do you know if your AI interface design is actually working — and how do you distinguish a UX failure from a model failure?

In 2019, Google launched Duplex — an AI system that could make phone reservations on behalf of users — to limited public availability. The initial demonstrations were stunning; in one widely viewed video, the system called a hair salon and booked an appointment with natural-sounding pauses, filler words ("um," "mm-hmm"), and graceful handling of an ambiguous question about availability. What the demonstrations didn't show was the failure mode: Duplex struggled significantly with calls that deviated from its trained scenarios — unfamiliar accents, unusual business hours structures, or questions outside its domain. Google's internal evaluations had measured success on the task the system was designed to perform under ideal conditions. The real-world distribution of calls was considerably messier. By 2023, Google had quietly rolled back many of Duplex's autonomous features, with human operators handling an increasing share of calls flagged as out-of-scope. The system worked; the evaluation had been insufficiently adversarial.

4.1 · Why Standard UX Evaluation Methods Fall Short for AI

Traditional UX evaluation methods — think-aloud usability testing, task completion rate measurement, System Usability Scale surveys — were designed for deterministic interfaces. They measure whether users can accomplish defined tasks with defined interfaces. Applied to AI systems, they produce misleadingly positive results for two structural reasons.

First, distributional coverage: usability tests typically sample a narrow range of user inputs against a well-designed test scenario. AI systems' failure modes are concentrated in the tail of the input distribution — the unusual requests, the edge cases, the out-of-scope queries. Standard usability testing rarely reaches those tails. A chatbot can sail through a dozen canonical test scenarios and fail badly on a thirteenth that no tester thought to try.

Second, longitudinal trust drift: usability tests measure interaction at a single point in time. AI systems' trust dynamics play out over weeks and months. A system with excellent first-session usability may produce severe overtrust problems after three weeks of use, as the novelty effect fades and automation bias sets in. Short-term evaluation misses this entirely.

4.2 · An AI-Specific Evaluation Framework

Effective evaluation of AI interaction design requires methods that address the distributional coverage problem and the longitudinal trust drift problem. The following framework draws on published work from the Google PAIR team (2019), the Microsoft Research FATE group (2021), and academic AI HCI research.

AI UX Evaluation Framework — Four Layers

Layer 1 — Heuristic Review (AI-adapted): Apply adapted heuristics that include AI-specific criteria: Does the interface communicate system limitations? Does it provide calibrated uncertainty signals? Does it handle failures gracefully? Ben Shneiderman's 2020 "Ladder of Trust" provides a structured heuristic set specifically for AI systems.

Layer 2 — Adversarial Task Testing: Design test scenarios that specifically target failure modes — out-of-scope queries, ambiguous inputs, edge cases identified from failure mode analysis. Do not only test the ideal path. The Google Duplex failure is a direct consequence of insufficient adversarial testing.

Layer 3 — Trust Calibration Measurement: Use validated instruments (e.g., the Trust in Automation scale, Jian et al. 2000; or the MDMT, Ullman & Malle 2019) to measure trust level and compare it against actual system reliability metrics. The gap between these two numbers is your calibration error.

Layer 4 — Longitudinal Behavioral Observation: Track how usage patterns evolve over weeks of naturalistic use. Key signals: override rate (are users checking AI outputs less over time?), error detection rate (are users catching AI mistakes at the same rate after 30 days as after 3 days?), and scope drift (are users using the system for tasks outside its design envelope?).
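The Layer 3 comparison can be made concrete with a few lines of arithmetic. This is a rough sketch under stated assumptions: it assumes the trust instrument yields a mean score on a 1 to 7 scale, that rescaling that mean to the 0 to 1 range makes it comparable to measured reliability, and that a gap of 0.05 is small enough to ignore. None of these choices comes from the instruments themselves, and the function names are hypothetical.

```python
def normalize_likert(mean_score: float, low: float = 1.0, high: float = 7.0) -> float:
    """Rescale a mean survey score to the 0-1 range (assumed 1-7 scale)."""
    if not low <= mean_score <= high:
        raise ValueError("score is outside the scale range")
    return (mean_score - low) / (high - low)


def calibration_error(trust: float, reliability: float) -> float:
    """Signed gap between normalized trust and observed reliability.

    Positive values mean users trust the system more than its measured
    accuracy justifies; negative values mean they trust it less.
    """
    return trust - reliability


def label_gap(gap: float, tolerance: float = 0.05) -> str:
    """Name the direction of the gap; the tolerance is a placeholder."""
    if gap > tolerance:
        return "overtrust"
    if gap < -tolerance:
        return "undertrust"
    return "roughly calibrated"
```

For example, a mean trust score of 5.5 normalizes to 0.75; against a measured reliability of 0.60, the gap is about 0.15 and would be labeled overtrust. Treat the number as a prompt for investigation rather than a precise measurement, since survey scores and accuracy rates are different kinds of quantity.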

4.3 · Distinguishing UX Failures from Model Failures

One of the most practically important skills in AI UX evaluation is diagnosing whether an observed failure is a UX problem (fixable by design) or a model problem (requires retraining or architectural change). The distinction matters because the remediation paths are entirely different, the teams responsible are different, and the timelines are different. Misdiagnosing a model failure as a UX problem leads to design churn that cannot solve the underlying issue.

A structured diagnostic approach: For any failure event, ask three questions in sequence. First, would every possible presentation of this output have led to the same outcome? If yes (the output was simply wrong and no amount of framing could have made it correct), this is a model failure. Second, did the user have access to sufficient information to identify the error, and did they use it? If the information was available but the user didn't engage with it, this is a UX failure (likely an overtrust or transparency design issue). Third, was the information available in a form the user could reasonably interpret given their context and expertise? If not (the information existed but wasn't legible), this is an explainability design failure.
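The three-question sequence can be written down as a first-match decision procedure. The sketch below is a simplification: the names `FailureType` and `diagnose` are hypothetical, the four boolean inputs are judgments a reviewer has to make by hand, and the `UNCLASSIFIED` fallback covers cases the three questions do not address.

```python
from enum import Enum


class FailureType(Enum):
    MODEL = "model failure"
    UX = "UX failure"
    EXPLAINABILITY = "explainability design failure"
    UNCLASSIFIED = "not covered by the three questions"


def diagnose(same_outcome_under_any_presentation: bool,
             info_available: bool,
             user_engaged: bool,
             info_legible: bool) -> FailureType:
    """Apply the three diagnostic questions in order and stop at the first match."""
    # Question 1: no framing could have rescued a simply wrong output.
    if same_outcome_under_any_presentation:
        return FailureType.MODEL
    # Question 2: the information was there, but the user did not use it.
    if info_available and not user_engaged:
        return FailureType.UX
    # Question 3: the information existed but was not interpretable.
    if info_available and not info_legible:
        return FailureType.EXPLAINABILITY
    return FailureType.UNCLASSIFIED
```

Returning a single label keeps the triage simple, at the cost of hiding mixed cases. A team that wants to capture several contributing causes per incident could return a list of every matching type instead.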

Many real failures involve all three components. A useful AI failure taxonomy by Wang et al. (2019) at Carnegie Mellon distinguishes between model-caused failures (wrong output), interaction-caused failures (correct output, user couldn't use it), and context-caused failures (output was correct for training distribution, wrong for deployment context). Each demands a different response.

Override Rate The proportion of AI outputs that users modify or reject before acting on them. A declining override rate over time may indicate either increasing user confidence in a reliable system (good calibration) or increasing automation bias (miscalibration). Interpretation requires comparison with actual error rates.

Scope Drift The pattern in which users, over time, apply an AI system to tasks beyond its intended design envelope — typically as trust increases and the user's mental model of the system's capabilities expands beyond its actual boundaries.
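Because a falling override rate is ambiguous on its own, it helps to compute it next to the error rate for the same periods. The following sketch assumes you can count, per time window, how many outputs were shown, how many users overrode, and how many were later found to be wrong. The `Window` structure, the function names, and the 20% relative-drop threshold are illustrative assumptions, not established cutoffs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Counts for one observation period (hypothetical structure)."""
    outputs: int     # AI outputs shown to users
    overridden: int  # outputs users modified or rejected
    errors: int      # outputs later found to be wrong

    def __post_init__(self) -> None:
        if self.outputs <= 0:
            raise ValueError("a window needs at least one output")

    @property
    def override_rate(self) -> float:
        return self.overridden / self.outputs

    @property
    def error_rate(self) -> float:
        return self.errors / self.outputs


def relative_drop(before: float, after: float) -> float:
    """Fractional decline from before to after (0.0 if before is zero)."""
    return (before - after) / before if before > 0 else 0.0


def interpret_override_trend(early: Window, late: Window,
                             min_drop: float = 0.2) -> str:
    """Read a change in override rate against the change in error rate."""
    if relative_drop(early.override_rate, late.override_rate) < min_drop:
        return "override rate roughly stable"
    # Overrides fell: check whether errors fell as well.
    if relative_drop(early.error_rate, late.error_rate) >= min_drop:
        return "consistent with improving calibration"
    return "possible automation bias"
```

If overrides fall from 30% to 10% while errors stay at 10%, the function reports possible automation bias; if errors fall to 3% over the same period, it reports a pattern consistent with improving calibration. Scope drift needs a different signal, such as the share of requests that fall outside the documented task categories.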

4.4 · Iteration and Measurement in Practice

The gap between AI UX research and AI UX practice remains large. Most teams deploying AI-powered products do not use validated trust instruments, do not conduct adversarial testing, and do not track longitudinal behavioral signals. The primary measurement most teams rely on is engagement — session length, return rate, feature usage. These metrics are not useless, but they are easy to optimize in ways that increase engagement while worsening calibration. A chatbot that gives confident, fluent wrong answers might produce higher engagement than one that hedges appropriately, because the confident answers feel more satisfying in the moment.

The most rigorous deployed example of a comprehensive AI UX evaluation methodology in the public record is the process Google's PAIR team published in 2019 alongside the "People + AI Guidebook." That framework explicitly distinguishes between user satisfaction metrics (which can be gamed by the ELIZA effect) and user outcome metrics (which require tracking what users did with AI outputs in the real world). The distinction is conceptually simple and practically very difficult to implement — it requires connecting interface analytics to downstream outcome data, which most product teams have neither the infrastructure nor the organizational incentives to do.
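The satisfaction and outcome distinction can be illustrated by joining interaction records to downstream results. This is a minimal sketch with hypothetical field names (`id`, `satisfaction`) and a hypothetical `summarize_metrics` function; it assumes each interaction has an identifier that can be matched to a later outcome record, which is exactly the infrastructure most teams lack.

```python
from statistics import mean


def summarize_metrics(interactions: list[dict], outcomes: dict[str, bool]) -> dict:
    """Report satisfaction and outcome metrics side by side.

    interactions: records with an "id" and a 1-5 "satisfaction" rating.
    outcomes:     maps an interaction id to whether the downstream result
                  was correct; ids without outcome data are simply absent.
    """
    if not interactions:
        raise ValueError("no interactions to summarize")
    # Keep only the interactions whose downstream outcome is known.
    tracked = [outcomes[i["id"]] for i in interactions if i["id"] in outcomes]
    return {
        "mean_satisfaction": mean(i["satisfaction"] for i in interactions),
        # None makes the absence of outcome data visible instead of hiding it.
        "outcome_success_rate": sum(tracked) / len(tracked) if tracked else None,
        # Share of interactions for which any outcome is known.
        "outcome_coverage": len(tracked) / len(interactions),
    }
```

Reporting coverage alongside the success rate shows stakeholders how much of the picture the outcome data actually covers. A mean satisfaction of 4.5 with a 50% success rate on the tracked half of interactions tells a very different story from the satisfaction score alone.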

The honest state of the field in 2024 is that AI UX evaluation methodology is significantly behind AI development capability. The tools exist; the will to apply them consistently is unevenly distributed. Understanding what rigorous evaluation looks like is the first step toward practicing it, even in constrained environments.

Design Principle · Outcome Over Engagement

Measure what users accomplish with AI outputs, not just how much they interact with AI features. Engagement metrics can be optimized in ways that worsen user outcomes. If you cannot yet connect interface analytics to downstream outcomes, be explicit with stakeholders about what your engagement data cannot tell you.

This module has covered the foundational landscape of human-AI interaction design: the structural properties that make AI UX categorically different from prior software UX (L1), the mechanics of trust formation and calibration (L2), the design of transparency and explainability (L3), and the methods for evaluating whether your designs are working (L4). The Module Test ahead covers all four lessons. Take it when you're ready.

Lesson 4 Quiz

Five questions · evaluating AI interaction design

1. Google Duplex's real-world difficulties after its 2019 launch illustrate which evaluation failure?

Correct. Duplex performed well on the task it was designed to perform under well-formed conditions. Real phone calls are messier — unusual accents, unexpected questions, ambiguous availability structures. The evaluation hadn't probed those failure modes.

The specific lesson from Duplex is about distributional coverage in evaluation — testing the ideal path without sufficiently probing the tail of the real-world distribution. Review the opening case study of L4 and section 4.1.

2. What is "longitudinal trust drift" and why does it make standard usability testing insufficient for AI?

Correct. A system can have excellent first-session usability and still produce severe overtrust problems after three weeks as novelty fades and automation bias accumulates. Single-session testing misses this entirely. Longitudinal behavioral observation is the evaluation method that addresses it.

Longitudinal trust drift refers to how user trust calibration changes over extended use — not system performance drift. Standard usability tests only see the first session, missing the trust dynamics that develop over weeks. Review section 4.1.

3. According to Wang et al.'s (2019) Carnegie Mellon failure taxonomy, which category describes a situation where the AI produced the correct output but the user couldn't act on it effectively?

Correct. An interaction-caused failure is one where the model output was correct but the interface failed to present it in a usable way — the user couldn't interpret, verify, or act on a correct output. This is a UX failure, not a model failure, and demands a design-side fix.

Wang et al.'s taxonomy distinguishes three types: model-caused (wrong output), interaction-caused (correct output, user couldn't use it), and context-caused (correct for training distribution, wrong for deployment context). Review section 4.3.

4. A product team observes that users' override rate — the proportion of AI outputs they modify before acting — has declined significantly over three months of use. How should this be interpreted?

Correct. A declining override rate is consistent with two very different situations: users getting appropriately confident in a genuinely reliable system, or users developing automation bias in a system that hasn't actually improved. You can only distinguish them by tracking whether the error rate changed proportionally.

Override rate alone is ambiguous — you need to compare it against the system's actual error rate over the same period. If errors declined proportionally, the declining override rate is well-calibrated. If errors stayed constant while override rate dropped, this is automation bias. Review section 4.2.

5. The Google PAIR "People + AI Guidebook" (2019) draws a distinction between which two types of AI UX metrics?

Correct. The key insight in the PAIR guidebook is that satisfaction metrics can be gamed by the ELIZA effect — fluent AI outputs feel satisfying regardless of accuracy. Outcome metrics — tracking what users did with outputs in the real world — are more honest but much harder to instrument.

The core distinction in the PAIR framework is satisfaction vs. outcome. Satisfaction metrics are easy to collect but can reflect the ELIZA effect rather than actual utility. Outcome metrics require connecting interface analytics to what happened after the user acted on AI output. Review section 4.4.

Lab 4 · Failure Diagnosis and Evaluation Design

Practice diagnosing AI UX failures and designing evaluation plans for real AI products

What you're doing in this lab

You'll work through AI product failure scenarios and evaluation design challenges. For each scenario, you'll classify the failure type (model, interaction, or context-caused), identify what evaluation methodology would have caught it earlier, and propose a specific measurement plan for an ongoing deployment. The assistant will challenge vague proposals and push for operational specificity.

Complete at least three exchanges to finish the lab.

Scenario: An AI-powered legal document summarizer has been deployed to a mid-size law firm for 60 days. Initial user satisfaction scores were high (4.4/5). After two months, a senior partner notices that associates are no longer reading the original documents before using summaries in briefs — and two briefs have contained errors that trace back to summary inaccuracies. Classify this failure, identify what evaluation method would have caught it, and propose a measurement plan for the next 30 days.

AI Lab Assistant

Failure Diagnosis & Evaluation Design

Let's work through the legal summarizer scenario. Classify the failure type first, then tell me what evaluation method should have been in place from day one, then propose your 30-day measurement plan. Be specific enough that someone could actually implement what you describe.

Module 1 Test

15 questions across all four lessons · 80% required to pass

1. What is the core reason that classical UX heuristics (like Nielsen's ten principles) are insufficient for AI interface design?

Correct. The foundational assumption of classical UX heuristics is a deterministic system. AI's non-determinism requires new frameworks that account for variability, uncertainty communication, and trust calibration.

The specific break is determinism. Classical heuristics assume you can design for a predictable system; AI systems produce variable outputs, requiring entirely different design approaches to uncertainty and trust.

2. Joseph Weizenbaum's ELIZA program (1966) was significant for AI UX history primarily because it demonstrated which phenomenon?

Correct. The ELIZA effect — anthropomorphization driven by language fluency, not by system capability — is one of the foundational empirical observations of human-AI interaction. Weizenbaum was alarmed by it; his concern directly inspired the field of AI ethics.

ELIZA did not pass the Turing Test and was never claimed to. The significance was the ELIZA effect: users forming genuine emotional responses to a system that was doing nothing more than pattern matching.

3. In the four-node interaction loop (intent formation → input construction → output interpretation → action), which phase does research identify as most systematically underestimated by designers?

Correct. Amershi et al. (2022) at Microsoft Research found that designers optimize for accuracy metrics while users evaluate outputs by fluency, length, and confidence of tone — and these don't correlate. A confident wrong answer passes user evaluation more often than a hesitant correct one.

The key finding from Amershi et al. (2022) was about output interpretation — the gap between how designers evaluate outputs (accuracy metrics) and how users do (fluency and tone). Review L1 section 1.4.

4. The design goal for trust in AI interfaces is best described as:

Correct. Trust calibration — matching user confidence to actual system reliability — is the goal. Both overtrust and undertrust are calibration failures. An interface that maximizes trust without improving reliability has made the product more dangerous.

The goal is calibration accuracy, not a specific trust level. Maximizing or minimizing trust both miss the target. Review L2 section 2.1.

5. Hancock et al.'s (2011) meta-analysis of human-automation trust studies identified three factor clusters that determine trust. Which set correctly names them?

Correct. The three clusters — performance, process, and purpose — are important because all three are addressable through UX design, even when the underlying model's performance cannot be changed.

Hancock et al.'s three clusters are performance (actual reliability), process (predictability and appropriateness of behavior), and purpose (whether the system seems oriented toward user benefit). Review L2 section 2.1.

6. Which interface property was found by Nielsen Norman Group (2021) to increase user confidence by 23% when upgraded — without changing the underlying recommendation algorithm?

Correct. Visual design quality alone — with no change to the algorithm — produced a 23% confidence increase. This is a clear demonstration of trust being driven by interface quality signals rather than system quality signals.

The specific finding was about visual polish. Interface aesthetics drove trust scores independent of algorithmic accuracy — a concrete example of how superficial design choices can produce dangerous miscalibration. Review L2 section 2.2.

7. What is the specific harm of uncalibrated confidence indicators (ones that don't accurately predict error rates) in AI interfaces?

Correct. During early use, users treat uncalibrated confidence signals as meaningful — producing overtrust. After they discover the indicator doesn't predict errors, they ignore it — losing the value even a calibrated indicator would have provided. Net effect: worse calibration than having no indicator.

Uncalibrated confidence signals tend to inflate trust rather than suppress it, since users initially treat them as meaningful signals. The harm is overtrust followed by learned disengagement. Review L2 section 2.3.

8. What did ProPublica's 2016 COMPAS investigation reveal about the limits of AI transparency?

Correct. The COMPAS case established that even if Northpointe had provided full explanations, neither defendants nor judges were equipped to evaluate them. Explainability features must be designed for their actual audience — this is a UX problem, not just a technical one.

The COMPAS lesson was specifically about the gap between explanation and comprehension. Transparency requires not just disclosure but also design that enables users to interpret and act on what is disclosed. Review L3 opening case study.

9. What distinguishes "transparency theater" from genuine informational transparency in AI interfaces?

Correct. The test for genuine transparency is functional: does this information enable this specific user to make a better decision in this specific context? If the honest answer is no, it's theater, regardless of how it looks.

The distinction is functional, not aesthetic or organizational. Theater occupies the space where real transparency could go and trains users to stop engaging critically. Review L3 section 3.3.

10. A contrastive explanation answers which question, and why is this structure preferable?

Correct. Contrastive framing maps onto the decision-making question users actually face: "Should I go with this AI output or a different one?" It makes explanations immediately actionable rather than informational.

Contrastive explanations answer "why X rather than Y" — the question embedded in every real decision about whether to act on an AI output. This alignment with actual decision structure is what makes them more useful. Review L3 section 3.2.

11. Why does standard usability testing produce misleadingly positive results for AI systems?

Correct. Two structural problems: (1) tests cover a narrow slice of possible inputs, missing the tail where AI failures concentrate; (2) tests are single-session, missing the trust drift that develops over weeks. Both need to be addressed with AI-specific evaluation methods.

The two structural issues are distributional coverage (missing the failure-mode tail) and temporal coverage (missing longitudinal trust drift). Both require AI-specific evaluation additions. Review L4 section 4.1.

12. In Wang et al.'s (2019) AI failure taxonomy, a "context-caused failure" refers to which situation?

Correct. Context-caused failures are the scope drift problem: the model works as designed within its training distribution, but deployment exposed it to inputs outside that distribution. The fix is often scope boundary design rather than model retraining.

Context-caused failure in Wang et al.'s taxonomy means the model performed correctly within its design scope but was deployed in a context outside that scope. This is a deployment design problem. Review L4 section 4.3.

13. Which user mental model do most people naturally apply to large language models, and why is it inaccurate in critical ways?

Correct. The conversational interface activates the "human expert" mental model, which imports properties LLMs lack: verified factual knowledge, genuine beliefs, and reliable self-knowledge of error. This mismatch is the root cause of most LLM trust miscalibration.

The search engine model is a more recent concern, but the primary mental model issue is the "knowledgeable human" — activated by the conversational interface — which imports false assumptions about verified knowledge and error awareness. Review L1 section 1.3.

14. What specific evaluation method does the four-layer AI UX evaluation framework use to address the distributional coverage problem?

Correct. Layer 2 — adversarial task testing — is specifically designed to probe the failure-mode tail that standard usability testing doesn't reach. It requires deliberately designing scenarios that challenge the system at its boundaries.

The distributional coverage problem is addressed by Layer 2 — adversarial testing — not by heuristic review, trust measurement instruments, or longitudinal observation. Each layer targets a different evaluation gap. Review L4 section 4.2.

15. The Google PAIR Guidebook's distinction between "satisfaction metrics" and "outcome metrics" matters because:

Correct. A chatbot that gives confident, fluent wrong answers might produce higher satisfaction scores than one that hedges appropriately — because the confident answers feel better in the moment. Satisfaction metrics optimized by the ELIZA effect can drive product decisions that worsen real outcomes.

The PAIR distinction is about measurement validity, not cost or legal status. Satisfaction is easy to collect but easy to inflate through fluency and tone. Outcome metrics — what users accomplished with AI output — are the honest measure. Review L4 section 4.4.