On March 23, 2016, Microsoft launched Tay, a conversational AI on Twitter designed to learn from interactions with 18- to 24-year-olds. Within 16 hours, Tay had posted more than 96,000 tweets, including racist and inflammatory content. Microsoft pulled it offline. The failure was not purely a safety failure: it was also a profound violation of Nielsen's first heuristic, Visibility of System Status. Users had no idea what Tay was learning, no signal that inputs were shaping outputs, and no feedback that behavior was drifting dangerously. The interface offered no window into the machine.
Jakob Nielsen published his ten usability heuristics in 1994 after analyzing 249 usability problems. They were designed for graphical user interfaces: menus, buttons, dialogs. AI systems inherit these problems and add entirely new dimensions. Where a button either works or doesn't, an AI can be confidently wrong, partially right, or right for the wrong reasons.
The ten heuristics remain the most widely used usability evaluation framework in industry. Google's Material Design guidelines, Apple's Human Interface Guidelines, and Microsoft's Fluent Design System all trace lineage back to Nielsen's framework. Applying them to AI requires both fidelity to the originals and sensitivity to what makes AI categorically different.
In 2023, Nielsen Norman Group published research arguing that LLM-based interfaces require six additional heuristics beyond the original ten: managing AI uncertainty, calibrating user trust, explaining AI reasoning, handling hallucinations, managing context windows, and supporting human-AI collaboration patterns. These are not replacements; they are extensions for a new class of interface.
You will conduct a heuristic audit of a real AI product (ChatGPT, Bing Chat, Notion AI, GitHub Copilot, or another you have access to). Work with the AI assistant below to structure your audit, identify violations, and rate their severity.
In May 2018, a Portland family discovered that their Amazon Echo had recorded a private conversation and sent it to a contact without their knowledge. Alexa had misheard "Alexa" in background speech, then misinterpreted subsequent conversation as a send command. There was no malice, only a catastrophic mismatch between the user's mental model ("Alexa only listens after the wake word") and the system's actual behavior ("Alexa continuously processes audio to detect the wake word"). The gap between these models was invisible in the interface.
A mental model is the internal representation a user builds of how a system works. It governs predictions, inferences, and recovery strategies. When your mental model of an elevator matches its actual behavior, you press floors confidently. When it doesn't, you press Door Close repeatedly hoping for results.
Don Norman's foundational distinction in The Design of Everyday Things (1988) differentiates the designer's model (how the system actually works), the user's model (what users believe), and the system image (what the interface communicates). Good design aligns all three. AI systems are uniquely difficult here: the designer's model is itself uncertain (even engineers don't fully understand LLM behavior), and the system image rarely communicates this honestly.
Research from Stanford's Human-Centered AI group (2021) found that users consistently anthropomorphize AI systems, attributing intent, memory, and understanding that large language models do not possess. This isn't user error. It's the predictable consequence of AI interfaces designed to feel human without communicating their fundamental differences from humans.
Calibrated uncertainty disclosure is the practice of surfacing confidence levels alongside AI outputs. Systems like Perplexity AI display citations and source quality ratings to help users calibrate trust. The key design challenge: uncertainty displays must be accurate (not always confident) and must not overwhelm users to the point of distrust paralysis.
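One way to picture calibrated uncertainty disclosure is as a mapping from a model's confidence score to a small set of user-facing labels. The sketch below is illustrative only: the function name `trust_label`, the thresholds (0.9 and 0.6), and the label wording are all assumptions, and it presumes the system can supply a reasonably calibrated score between 0 and 1.

```python
def trust_label(confidence: float) -> str:
    """Map a confidence score in [0, 1] to a short user-facing label.

    The thresholds and label wording are hypothetical; a real product
    would tune them against measured accuracy and user testing.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= 0.9:
        return "High confidence"
    if confidence >= 0.6:
        return "Moderate confidence: verify key facts"
    return "Low confidence: treat as a starting point"


if __name__ == "__main__":
    for score in (0.95, 0.7, 0.3):
        print(score, "->", trust_label(score))
```

Using three coarse bands rather than a raw percentage is a deliberate choice here: it keeps the display honest about uncertainty without flooding users with numbers they cannot act on.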
Process transparency reveals what the system is doing, not just what it produced. Google's NotebookLM (2023) shows which source passages informed each AI response, giving users a verifiable trace from input to output. This is qualitatively different from a raw answer.
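A source trace of this kind can be modeled as a list of claims, each carrying the identifiers of the passages that support it. The following is a minimal sketch, not a description of how NotebookLM or any other product is built; the `Claim` class, its fields, and the `unsupported_claims` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One sentence of AI output plus the source passages behind it."""
    text: str
    source_ids: list[str] = field(default_factory=list)  # hypothetical passage IDs


def unsupported_claims(claims: list[Claim]) -> list[str]:
    """Return the text of every claim that has no supporting source."""
    return [c.text for c in claims if not c.source_ids]


if __name__ == "__main__":
    response = [
        Claim("The policy changed in 2021.", ["doc1:p4"]),
        Claim("Most customers preferred the new policy."),
    ]
    # An interface could mark these claims visually as unverified.
    print(unsupported_claims(response))
```

Keeping the trace at the level of individual claims lets the interface flag exactly which sentences lack support instead of labeling the whole response.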
Limitation disclosure is explicit communication of what the AI cannot do. Microsoft's Copilot in Bing includes a persistent note about the conversation window limit and the possibility of inaccurate information. These are not legal disclaimers; they are usability features that maintain accurate mental models.
A German study on human-robot interaction found that users who received accurate (lower) competence signals about a robot made better decisions when working with it than users who received inflated competence signals. Accurate mental models outperform flattering ones, even when the accurate model is less impressive. This generalizes directly to AI interface design.
Choose an AI product you have used. Identify a mental model mismatch you have personally experienced or observed, one where what you believed the system would do differed from what it actually did. Work with the assistant to map the mismatch, categorize it (memory, knowledge, certainty, or context), and design a transparency feature to close the gap.
In December 2023, a Chevrolet of Watsonville dealership deployed a customer service chatbot built on ChatGPT. A user discovered that prompt injection could cause the bot to agree to sell a 2024 Chevy Tahoe for $1, claiming "and that's a legally binding offer." The chatbot had no error states, no confidence flags, no escalation path to a human, and no recovery mechanism. It was an AI interface with zero feedback loop architecture.
Traditional software errors are typically binary and recognizable: a form submission fails with a red error message, a file doesn't open, a network request returns 404. The system knows it has failed and communicates this. AI errors are qualitatively different: the system does not know it has failed. A hallucinated citation looks identical to a correct one. A wrong medical dosage is presented with the same confident prose as a correct one.
This creates a fundamental asymmetry. Human error recovery relies on the user recognizing that something went wrong. But if the output looks correct, sounds authoritative, and contains no error signals, the user has no trigger for recovery. The error propagates into decisions, documents, and actions.
The 2023 study "Do Large Language Models Know When They're Hallucinating?" (Azaria & Mitchell, 2023) found that LLMs can be prompted to assess their own factual accuracy with some reliability, suggesting that uncertainty signals could be generated internally and surfaced in the interface. This is a design opportunity, not just a research finding.
Effective AI feedback loops require three distinct layers. The first is immediate feedback: signals given during or immediately after AI generation. Perplexity AI's inline citations are immediate feedback: they appear alongside claims, giving users real-time verification anchors. The absence of citations is itself a signal.
The second layer is structured feedback collection: mechanisms for users to report errors. OpenAI's thumbs up/down on ChatGPT responses, with optional text explanations, creates a structured feedback channel. Critically, this is not just for product improvement. It also communicates to the user that the AI can be wrong, normalizing skepticism as appropriate behavior.
The third layer is recovery scaffolding: what happens when an error is identified. This includes edit interfaces (letting users correct AI output), regeneration controls (requesting a new response), escalation paths (routing to humans), and undo mechanisms. Microsoft's Office Copilot includes a "Discard" option for all AI-generated content, a recovery affordance built into the interaction model from the start.
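The recovery layer can be thought of as a routing decision: given what the system knows about an answer, which recovery affordance should the interface offer? The sketch below is one possible way to express that decision. The `Recovery` enum, the `choose_recovery` function, and both thresholds are assumptions made for illustration, and the confidence score is presumed to come from some upstream uncertainty estimate.

```python
from enum import Enum


class Recovery(Enum):
    """Hypothetical recovery affordances an interface might offer."""
    NONE = "show answer as-is"
    REGENERATE = "offer regenerate and edit controls"
    ESCALATE = "route to a human agent"


def choose_recovery(confidence: float, user_flagged: bool, high_stakes: bool) -> Recovery:
    """Pick a recovery path from three signals.

    confidence   -- estimated probability the answer is correct (0 to 1)
    user_flagged -- the user reported a problem (structured feedback layer)
    high_stakes  -- the topic involves commitments such as prices or policy
    The 0.8 and 0.5 thresholds are illustrative assumptions.
    """
    if high_stakes and (user_flagged or confidence < 0.8):
        return Recovery.ESCALATE
    if user_flagged or confidence < 0.5:
        return Recovery.REGENERATE
    return Recovery.NONE


if __name__ == "__main__":
    # A pricing question with middling confidence goes to a human.
    print(choose_recovery(confidence=0.7, user_flagged=False, high_stakes=True))
```

The high-stakes branch is checked first so that a user flag on a consequential answer always reaches a human rather than triggering another automated attempt.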
In February 2024, a British Columbia Civil Resolution Tribunal ordered Air Canada to honor a bereavement discount its chatbot had incorrectly described, ruling that Air Canada was responsible for its chatbot's representations. The chatbot had no error recovery, no human escalation, and no mechanism to flag policy uncertainty. The tribunal's decision established that companies cannot disclaim liability for AI-given advice. This is the regulatory consequence of absent feedback loop architecture.
Choose a specific AI application context (customer service bot, AI medical assistant, AI legal research tool, AI tutoring system, etc.). Design a complete feedback loop architecture: immediate feedback signals, structured error collection, and recovery scaffolding. The assistant will challenge your design with failure scenarios.
On June 1, 2009, Air France Flight 447 crashed into the Atlantic Ocean, killing all 228 aboard. The flight data recorder revealed that the autopilot disconnected after pitot tube icing, requiring the pilots to fly manually. The crew, over-reliant on automation they trusted implicitly, failed to correctly interpret airspeed data and pulled the nose up into a stall they maintained for over three minutes, a stall that the aircraft was actively warning them about through multiple feedback systems. The BEA investigation identified automation bias as a primary contributing factor: pilots had trusted the automated system so completely they lost manual proficiency and situational awareness when it failed.
Automation bias was formally described by Mosier & Skitka (1996) as the tendency to over-rely on automated decision aids, either following their recommendations when manual checks would reveal errors (commission errors) or failing to check for problems the automation does not flag (omission errors). AF447 is the most catastrophic documented example. But automation bias has been documented in radiology (readers miss cancers when AI marks scans as clear), in legal review (lawyers miss clauses when AI contract review tools label documents safe), and in financial trading (operators miss anomalies when algorithmic systems appear stable).
AI systems with natural language interfaces are particularly vulnerable to inducing automation bias. Fluent, confident prose mimics expert human communication, the very register that humans have evolved to trust. A poorly formatted spreadsheet triggers skepticism. A well-written paragraph does not, even when it is wrong.
Performance transparency means showing users how well the AI has performed historically on similar tasks. IBM's Watson for Oncology system eventually included accuracy metrics by cancer type, giving clinicians the base rate to calibrate their trust against. Users who know an AI is 92% accurate on lung cancer staging and 61% accurate on rare sarcomas can weight its outputs accordingly.
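In interface terms, performance transparency amounts to attaching a per-task base rate to each output. The sketch below reuses the two illustrative figures from the paragraph above (92% and 61%). The lookup table, the `accuracy_note` function, and the message wording are assumptions, not a description of any vendor's system.

```python
# Illustrative base rates taken from the example figures in the text.
HISTORICAL_ACCURACY = {
    "lung cancer staging": 0.92,
    "rare sarcomas": 0.61,
}


def accuracy_note(task: str) -> str:
    """Return a short note stating the historical accuracy for a task type."""
    accuracy = HISTORICAL_ACCURACY.get(task)
    if accuracy is None:
        # Saying that no data exists is itself a calibration signal.
        return f"No historical accuracy data for {task}."
    return f"Historically {accuracy:.0%} accurate on {task}."


if __name__ == "__main__":
    print(accuracy_note("lung cancer staging"))
    print(accuracy_note("rare sarcomas"))
    print(accuracy_note("melanoma"))
```

Returning an explicit "no data" message for unknown task types avoids implying that the system performs equally well everywhere.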
Disagreement surfacing means deliberately showing when AI systems disagree with each other, or when the same AI gives different answers to similar questions. PathAI's pathology platform (2020) deliberately surfaces inter-model disagreement (cases where different AI models diverge), flagging these for higher-attention human review. Disagreement is a calibration signal.
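The core of disagreement surfacing is simple to state in code: collect each model's answer for a case and flag the case when the answers are not unanimous. The sketch below assumes categorical labels and treats any difference as disagreement. The function names and the sample data are hypothetical and do not reflect how any particular platform is implemented.

```python
def models_disagree(predictions: dict[str, str]) -> bool:
    """True when the models did not all return the same label.

    `predictions` maps a model name to that model's label for one case.
    """
    return len(set(predictions.values())) > 1


def cases_for_review(cases: dict[str, dict[str, str]]) -> list[str]:
    """Return the IDs of cases where models disagree, in sorted order."""
    return sorted(cid for cid, preds in cases.items() if models_disagree(preds))


if __name__ == "__main__":
    sample = {
        "case-001": {"model_a": "benign", "model_b": "benign"},
        "case-002": {"model_a": "benign", "model_b": "malignant"},
    }
    # Only case-002 is flagged for higher-attention human review.
    print(cases_for_review(sample))
```

A production system would likely compare probability scores with a tolerance rather than exact labels, but the principle of routing divergent cases to closer human attention is the same.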
Active engagement prompts interrupt passive acceptance. A 2021 JAMA study on AI-assisted chest X-ray reading found that radiologists who were asked to make their own diagnosis before seeing the AI's suggestion showed lower automation bias than radiologists who saw the AI suggestion first. The sequencing of information in the interaction pattern changed the quality of human oversight.
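The sequencing effect described above can be enforced by the interface itself: the AI suggestion stays hidden until the human has committed to an independent judgment. The following is a minimal sketch of that gate. The class name `BlindFirstCase` and its methods are hypothetical, and the choice of `PermissionError` is just one way to signal a blocked reveal.

```python
from typing import Optional


class BlindFirstCase:
    """Holds an AI suggestion that unlocks only after a human diagnosis is recorded."""

    def __init__(self, ai_suggestion: str) -> None:
        self._ai_suggestion = ai_suggestion
        self.human_diagnosis: Optional[str] = None

    def record_human_diagnosis(self, diagnosis: str) -> None:
        """Store the human's independent judgment; blank input is rejected."""
        if not diagnosis.strip():
            raise ValueError("diagnosis must not be empty")
        self.human_diagnosis = diagnosis.strip()

    def reveal_ai_suggestion(self) -> str:
        """Return the AI suggestion, but only after the human has committed."""
        if self.human_diagnosis is None:
            raise PermissionError(
                "Record your own diagnosis before viewing the AI suggestion."
            )
        return self._ai_suggestion


if __name__ == "__main__":
    case = BlindFirstCase(ai_suggestion="pneumonia")
    case.record_human_diagnosis("no acute findings")
    print("Human:", case.human_diagnosis, "| AI:", case.reveal_ai_suggestion())
```

Because both judgments are stored, the same object can later feed a comparison view that highlights where the human and the AI differ.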
Four documented patterns have emerged from research on effective human-AI collaboration. AI-first with human review: AI generates, human verifies (used in radiological screening at scale). Human-first with AI augmentation: Human decides, AI provides parallel analysis for comparison (used in chess analysis tools). Parallel deliberation: Human and AI work separately, then compare (reduces anchoring, increases automation resistance). Iterative co-creation: Human and AI alternate contributions with explicit handoff signals (used in Copilot-style coding tools). Each pattern produces different trust calibration outcomes.
Research by Rajpurkar et al. (2022, Stanford) found that in chest X-ray reading, human-AI teams consistently outperformed either humans alone or AI alone, but only when the interface was designed to surface cases where human and AI assessment differed. When AI outputs were presented without surfacing disagreement patterns, the team performed worse than AI alone due to automation bias. The implication: complementarity is not automatic. It must be designed into the interaction.
Choose a domain where AI and humans collaborate on consequential decisions (medical diagnosis, legal review, financial analysis, content moderation, hiring screening). Design an interaction pattern from the four documented types (AI-first with human review, human-first with AI augmentation, parallel deliberation, iterative co-creation) that maximizes complementarity and minimizes automation bias. Defend your choice.