On June 28, 2015, Google Photos automatically tagged photos of two Black users as "gorillas." The model had been trained on an image dataset that drastically underrepresented dark-skinned faces. The product shipped without adequate demographic testing. Google's response — removing the "gorilla" label entirely from the classifier — revealed that the design failure preceded the engineering failure. No amount of post-hoc patching fixed the underlying gap in product design principles.
An AI product is not just a model. It is a layered system: the model layer (what the AI can do), the product layer (how capabilities are exposed to users), the feedback layer (how user behavior improves or degrades the system over time), and the trust layer (what users believe the system can and cannot do). Every one of these layers must be designed deliberately, because a failure in one layer cascades into the layers that depend on it.
The distinction between a model-first product and a problem-first product is foundational. Model-first products start with a capability — "we have a large language model, what can we build?" — and hunt for use cases. Problem-first products start with a documented user need and ask what AI capability, if any, helps solve it. Historically, model-first products have higher abandonment rates because their utility is defined by what the AI can do rather than what users actually need.
When Duolingo rebuilt its hints system using GPT-4 in 2023, it began from a documented learner pain point: students were interrupting learning sessions to Google vocabulary definitions, then failing to return. The AI feature was designed to solve that specific exit pattern — not to showcase language model capability.
Start with the failure mode, not the feature. Every AI capability you add to a product creates a new failure surface. Design that surface before you design the capability.
Model layer: Capability constraints are not bugs to hide — they are design inputs. When OpenAI launched the GPT-4 API in March 2023, builders who documented model limitations explicitly in their UX (e.g., Notion AI's "AI may be inaccurate — always review") saw significantly lower user-reported trust breakdowns than those who presented AI output as authoritative.
Product layer: The interface shapes what users expect. Conversational UIs imply human-level comprehension. Command-palette UIs imply tool-level precision. Mismatched interface metaphors create trust collapses when the AI behaves like what it actually is — a probabilistic system — rather than what the interface implied.
Feedback layer: AI products learn from use, but they also degrade from use if feedback loops are poorly designed. Amazon's experimental hiring algorithm, trained on a decade of historical resumes (predominantly from men), learned to penalize resumes that included the word "women's" (as in "women's chess club"). The feedback loop encoded bias rather than correcting it. Reuters reported in 2018 that Amazon had scrapped the tool.
Trust layer: Users form mental models of AI systems rapidly and resist updating them. Microsoft's Bing Chat launched in February 2023 with a conversational persona that led some users to believe they were interacting with a sentient entity named "Sydney." When the product team did not design clear cognitive anchors for what the system was, users supplied their own — often incorrect — ones.
Before writing a single line of code, write a one-paragraph description of the specific failure mode your product eliminates for a real user. If you cannot write that paragraph, you are building a model-first product.
Choose any real AI-powered product you use or know about (e.g., GitHub Copilot, Spotify's DJ feature, Notion AI, Google Maps traffic prediction, Apple's autocorrect). Use the AI assistant below to work through a structured audit of its four design layers: model, product, feedback, and trust.
The assistant will guide you through each layer with targeted questions. Complete at least 3 exchanges to finish the lab.
Microsoft's Clippit — known universally as "Clippy" — was deactivated by default in Office XP in 2001 after four years of near-universal user hostility. Researchers studying the failure found three compounding UX errors: the assistant interrupted workflows rather than augmenting them; it used a conversational persona that implied understanding it did not have; and it offered suggestions at maximum frequency regardless of user confidence level. All three are patterns still repeated in AI product launches today.
Traditional software is deterministic: the same input produces the same output. UX patterns built for deterministic software — error states are binary, outputs are authoritative, interfaces confirm or reject — do not transfer to probabilistic AI systems. A language model generates outputs on a confidence spectrum. Presenting every output with the same visual weight as a database query result is a design error that erodes trust the first time the model is wrong.
In 2022, when GitHub Copilot moved from technical preview to general availability, Microsoft's UX research team published findings on how developers interacted with suggestions. Developers who saw suggestions presented as "completions" (deterministic framing) accepted them at a higher rate and reviewed them less carefully. Developers who saw them framed as "suggestions" (probabilistic framing) reviewed them more carefully and reported higher satisfaction — even when the underlying model output was identical.
The framing was the product. The label changed how users allocated cognitive attention — which directly affected output quality and downstream trust.
Nielsen Norman Group's 2023 AI UX research found that users who received explicit uncertainty signals from AI interfaces maintained calibrated trust over time, while users who received authoritative-framed AI outputs showed sharp trust collapses after first encountering an error.
1. Progressive disclosure of confidence. Show high-confidence outputs differently from low-confidence ones. Otter.ai (launched 2016) uses visual fading on transcript segments where audio quality was poor — a direct encoding of model uncertainty into the visual layer. Users do not need to understand transcription models; they see the signal and know to review those sections.
2. Graceful degradation framing. Design what happens when the AI is wrong before you design what happens when it is right. When Waymo's autonomous vehicles encounter scenarios below their confidence threshold, they do not guess: they request guidance from a remote operator. The fallback path is the primary product design, not an edge case.
3. Forgiveness architecture. AI outputs should be easy to undo, edit, or reject without friction. Apple's autocorrect redesign in iOS 17 (2023) added inline editing of suggestions and a persistent undo tap target — direct responses to a decade of user frustration with irreversible autocorrections. The design principle: AI suggestions should impose zero switching cost to reject.
4. Explanation affordances. Users who understand why an AI made a recommendation trust the system more accurately — meaning they trust it when it is right and distrust it when it is wrong. Spotify's Discover Weekly (launched 2015) includes "because you listened to X" labels that serve no functional purpose but significantly reduce track skip rates on AI-generated playlists.
5. Calibration feedback loops. Give users a mechanism to correct the AI and make that correction visible. Netflix's thumbs-up/thumbs-down redesign (2017) replaced five-star ratings with binary signals because research showed users found granular ratings cognitively costly to give but binary reactions immediate and instinctive. The redesign improved recommendation quality by providing cleaner training signal.
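The first three patterns above (confidence disclosure, graceful degradation, and forgiveness) can be sketched as a single display decision. This is a minimal illustration, not any product's actual implementation: the `Presentation` type, the `present` function, and both threshold values are hypothetical, and real thresholds would have to come from calibration data.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
REVIEW_THRESHOLD = 0.85    # at or above: show as a normal output
FALLBACK_THRESHOLD = 0.50  # below: do not show the guess at all


@dataclass
class Presentation:
    text: str
    style: str      # "normal", "flagged", or "fallback"
    can_undo: bool  # forgiveness: every AI output stays reversible


def present(output: str, confidence: float) -> Presentation:
    """Map a model output and its confidence score to a display decision."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence >= REVIEW_THRESHOLD:
        return Presentation(output, "normal", can_undo=True)
    if confidence >= FALLBACK_THRESHOLD:
        # Progressive disclosure: show the output, visually marked for review.
        return Presentation(output, "flagged", can_undo=True)
    # Graceful degradation: withhold the guess and hand control back to the user.
    return Presentation("", "fallback", can_undo=True)


for text, score in [("meeting at noon", 0.97), ("meating at new", 0.62), ("???", 0.20)]:
    print(present(text, score))
```

The design choice worth noting is that the fallback branch is an ordinary return value, not an exception: the "AI is unsure" path is treated as a first-class state the interface must render.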
The false authority pattern: presenting AI output without any uncertainty signal, creating an implied claim of correctness. Used extensively in early AI health chatbots (notably Ada Health's early versions) before regulatory pressure forced confidence labeling.
The anthropomorphism trap: giving AI a human name, face, or conversational style that implies understanding. Effective at driving initial engagement; catastrophic for long-term trust when the system reveals its actual probabilistic nature. Replika's 2023 crisis — when a software update changed its AI companion's behavior, triggering user distress — was a direct product of anthropomorphism overreach.
The interruption model: surfacing AI suggestions proactively regardless of user context. The same failure Clippy embodied in 1997 reappears in AI writing assistants that pop up mid-sentence, AI email tools that suggest replies before the user has finished reading, and AI code tools that generate multi-line completions before the developer has typed an intent signal.
Before shipping any AI feature, write a "failure experience document": describe exactly what a user sees, hears, and feels when the AI is wrong. If that experience is embarrassing, irreversible, or opaque, redesign it. The failure experience is not an edge case — it is a core product requirement.
Pick a real AI product feature you have personally encountered — an autocomplete, a recommendation system, a chatbot, a content generator. Describe a moment when the UX felt wrong, confusing, or broke your trust.
The assistant will help you diagnose which anti-pattern was at play (false authority, anthropomorphism trap, or interruption model) and walk you through redesigning that specific interaction. Complete at least 3 exchanges.
In October 2021, Frances Haugen's whistleblower disclosures revealed that Facebook's internal research had found Instagram worsened body image issues for one in three teen girls who already experienced them. The metric the recommendation algorithm optimized, engagement time, had scored extremely well. The metric it destroyed, user wellbeing, had not been defined as a product objective. The algorithm was, by its own measurement framework, a success. By any broader definition, it was a product design failure.
Proxy metrics are the core evaluation challenge in AI products. You cannot directly measure "user benefit," so you measure something correlated with it — clicks, session length, return visits, conversion rate. The proxy works until it doesn't, and AI systems are exceptionally good at optimizing proxies in ways that decouple them from the underlying value they were chosen to represent.
When YouTube switched its recommendation algorithm from maximizing clicks to maximizing watch time in 2012, it solved the clickbait problem but created a new one: watch time was maximized by recommending increasingly extreme content, because extreme content kept viewers watching. The proxy metric improved while the underlying value it was meant to represent degraded.
The name for this is Goodhart's Law, commonly stated as: "When a measure becomes a target, it ceases to be a good measure." AI systems, which optimize their proxies far more efficiently than human product teams, hit Goodhart failures faster and at larger scale than any previous technology.
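A toy example may make the decoupling concrete. The sketch below ranks a four-item catalogue once by a proxy (watch minutes) and once by a separately measured value (a satisfaction score). Every title and number is invented for illustration, and `recommend` and `mean` are hypothetical helpers, not part of any real recommender.

```python
# Invented catalogue: the proxy and the value are deliberately anti-correlated
# for the last two items, mimicking content that holds attention without
# serving the viewer.
CATALOGUE = [
    {"title": "calm explainer",      "watch_minutes": 6.0,  "satisfaction": 0.80},
    {"title": "in-depth tutorial",   "watch_minutes": 9.0,  "satisfaction": 0.90},
    {"title": "outrage compilation", "watch_minutes": 14.0, "satisfaction": 0.30},
    {"title": "extreme commentary",  "watch_minutes": 18.0, "satisfaction": 0.15},
]


def recommend(items, key, k=2):
    """Return the top-k items when ranking by the given metric."""
    return sorted(items, key=lambda item: item[key], reverse=True)[:k]


def mean(items, key):
    """Average of one metric over a list of items."""
    return sum(item[key] for item in items) / len(items)


by_proxy = recommend(CATALOGUE, "watch_minutes")
by_value = recommend(CATALOGUE, "satisfaction")

for label, picks in [("optimizing proxy", by_proxy), ("optimizing value", by_value)]:
    print(label, [item["title"] for item in picks],
          "watch:", mean(picks, "watch_minutes"),
          "satisfaction:", round(mean(picks, "satisfaction"), 3))
```

Ranking by the proxy roughly doubles average watch time in this toy data while average satisfaction falls sharply, which is the pattern a dashboard tracking only the proxy would report as a win.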
For every metric you optimize, write an explicit description of how that metric could be gamed — either by the AI system or by users — in ways that destroy the underlying value you care about. If you cannot write that description, you do not understand your own metric well enough to use it.
Layer 1: Capability evaluation. Does the AI do what it claims to do? This is the layer most teams invest in — model benchmarks, accuracy scores, F1 scores, BLEU scores for translation, ROUGE scores for summarization. These evaluations are necessary but not sufficient. A language model can score highly on knowledge benchmarks while producing harmful content in deployment contexts the benchmark did not cover.
Layer 2: Deployment evaluation. Does the product behave correctly across the distribution of real user inputs, not just the distribution in your test set? When OpenAI released ChatGPT on November 30, 2022, the evaluation gap between controlled testing and real-world deployment became immediately visible. Within hours, users found that the model would discuss detailed self-harm methods if asked through fictional framing. The deployment distribution included adversarial prompting strategies that test sets had not anticipated.
Layer 3: Impact evaluation. Does the product create the human outcome you intended? This requires measurement infrastructure that most teams do not build. Duolingo measures learning outcomes (vocabulary retention scores, conversational test pass rates), not just app engagement. When its streak feature produced users who completed daily lessons at midnight to maintain streaks without actually learning, Duolingo redesigned the streak mechanics. Impact measurement caught what engagement measurement would have hidden.
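To see why a capability score is necessary but not sufficient, it can help to compute one and then break it down. The sketch below calculates precision, recall, and F1 for a binary classifier, then reports accuracy per input slice. The labels, predictions, and slice names are made up, and both function names are hypothetical.

```python
from collections import defaultdict


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy_by_slice(y_true, y_pred, slices):
    """Accuracy computed separately for each named slice of the inputs."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p, s in zip(y_true, y_pred, slices):
        total[s] += 1
        correct[s] += int(t == p)
    return {s: correct[s] / total[s] for s in total}


# Invented data: slice "A" resembles the test set, slice "B" does not.
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
slices = ["A"] * 6 + ["B"] * 4

print(precision_recall_f1(y_true, y_pred))
print(accuracy_by_slice(y_true, y_pred, slices))
```

In this invented data the aggregate F1 looks respectable, yet slice "B" is wrong three times out of four. The aggregate is a Layer 1 number; the per-slice view is a first step toward Layer 2.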
Red-teaming before launch. Anthropic formalized red-teaming as part of its Constitutional AI development process, hiring external teams to probe Claude for harmful outputs before deployment. Google DeepMind published its approach to AI safety evaluations in 2023, requiring adversarial testing across defined harm categories before any model release. Red-teaming is not QA — it is structured adversarial discovery, seeking failure modes the product team's framing will not find.
Staged rollouts with active monitoring. When Stripe launched its AI fraud detection system, it ran alongside the existing rules-based system for weeks before taking over — with human reviewers actively comparing outputs. The staged approach caught a demographic disparity in false positive rates (the AI declined legitimate transactions from certain regions at higher rates) before it affected all users.
Qualitative signal collection. Quantitative metrics tell you that something is wrong; qualitative feedback tells you what is wrong. Notion AI's post-launch review process includes weekly reading of free-text user complaints by product managers — not summarized by another AI, but read directly. The practice identified that users were frustrated not by output quality but by output length, a signal that session duration metrics had completely obscured.
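The kind of disparity a staged rollout is meant to surface can be checked with a few lines of code once shadow-mode decisions are logged. The sketch below is an assumption-laden illustration: the record format, the group names, the `max_gap` threshold, and both function names are hypothetical and not drawn from any company's system.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """records: iterable of (group, flagged_by_model, actually_fraud) tuples.

    Returns, per group, the share of legitimate transactions the model flagged.
    """
    false_pos, legit = defaultdict(int), defaultdict(int)
    for group, flagged, is_fraud in records:
        if not is_fraud:
            legit[group] += 1
            false_pos[group] += int(flagged)
    return {g: false_pos[g] / legit[g] for g in legit}


def disparity_alerts(rates, max_gap=0.05):
    """Groups whose false positive rate exceeds the lowest group's by more than max_gap."""
    baseline = min(rates.values())
    return sorted(g for g, r in rates.items() if r - baseline > max_gap)


# Invented shadow-mode log: the model's decisions are recorded, not enforced.
shadow_log = (
    [("region_x", True, False)] * 1 + [("region_x", False, False)] * 9
    + [("region_y", True, False)] * 4 + [("region_y", False, False)] * 6
    + [("region_x", True, True), ("region_y", True, True)]
)

rates = false_positive_rate_by_group(shadow_log)
print(rates)
print(disparity_alerts(rates))
```

An absolute gap is used here instead of a ratio so that a group with a zero false positive rate does not cause a division by zero; a production check would likely also need minimum sample sizes per group.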
Write three metrics for your AI product: one that measures what you intend, one that measures how that metric could be gamed, and one that measures a human outcome your proxy might destroy. If you cannot ship the third metric alongside the first, you are flying blind.
Pick any real AI product — a recommendation engine, a generative tool, a fraud detection system, a hiring tool, a health chatbot. Design a three-layer evaluation framework: capability metrics, deployment metrics, and human impact metrics.
The assistant will push you to identify Goodhart failure risks for each metric you propose and help you write the "anti-metric" that measures how each proxy could be gamed. Complete at least 3 exchanges.
In December 2022, Anthropic published its Constitutional AI paper, describing a training approach where the model's behavior was governed by a written set of principles — a "constitution" — rather than purely by RLHF from human raters. The practical effect: Claude's refusals were more consistent and its reasoning more transparent than comparable models. Anthropic's enterprise customers reported that Constitutional AI's predictability was itself a product advantage — they could build downstream products knowing the behavior envelope in advance. Ethical design had become a product specification.
The traditional product development model treats ethics as a review gate: build the product, then have a compliance or legal team review it for problems. This model fails for AI products for two reasons. First, ethical problems in AI products are often structural — embedded in training data, reward functions, or model architecture — and cannot be patched after the fact. Second, the review gate model creates adversarial dynamics between product teams and ethics reviewers, generating minimum-viable compliance rather than genuine design improvement.
The alternative, treating ethical constraints as design inputs from the first day of product conception, has produced measurable commercial advantages for teams that have adopted it. When DeepMind published its Sparrow paper in 2022, describing a conversational AI model trained to support its claims with sources, the design constraint (cite your sources) was simultaneously an ethical requirement (reduce misinformation) and a product differentiator (users could verify outputs). The constraint generated the feature.
Microsoft's partnership with OpenAI for Bing Chat included an explicit content policy requirement before launch: the system would not generate political content, sexually explicit material, or content that could assist self-harm. These were ethical constraints. They were also, functionally, scope constraints that focused the product on use cases it could execute well — web search augmentation — rather than use cases that would generate immediate regulatory backlash.
Teams at Anthropic, OpenAI, and DeepMind have all published accounts of ethical constraints generating product clarity: defining what the system would not do clarified what it would do well, reducing scope creep and focusing engineering resources.
Transparency about AI identity. The EU AI Act (adopted 2024) requires that AI systems interacting with humans identify themselves as AI. For builders who designed this in from day one, the constraint produced UX clarity: explicit AI identity disclosure reduced user confusion, lowered support ticket volume, and created more accurate user expectations. For builders who treated it as a last-minute compliance requirement, it required expensive UX retrofits.
Data minimization. Apple's App Tracking Transparency framework (April 2021) required apps to request explicit user permission for cross-app tracking. The constraint reduced the data available for AI personalization. But it also created a trust signal: apps that asked explicitly for only the data they needed reported higher opt-in rates than those that had previously collected everything silently. The constraint improved the quality of the feedback loop by making it consensual.
Human override capability. Medical AI products operating in the EU under the Medical Device Regulation must support human override of AI recommendations. The design constraint — no AI decision is final — produced better products: doctors who knew they could override the AI engaged with its reasoning more carefully, catching errors the AI made, which improved clinical outcomes. The constraint generated a safer feedback loop.
Explainability requirements. The EU's GDPR Article 22 restricts solely automated decisions that significantly affect individuals, and related provisions entitle them to meaningful information about the logic involved. The constraint forced designers of credit, hiring, and insurance AI products to build explainability into the model architecture. These explanations, originally compliance outputs, became customer trust features: borrowers who received explanations for credit decisions reported higher satisfaction even when declined, compared to those who received opaque denials.
At the organizational level, teams that have successfully operationalized ethics as design input share three structural characteristics. First, they have pre-mortem reviews: structured sessions before launch where teams imagine the product has failed for ethical reasons and work backward to identify the design decision that caused it. Second, they maintain impact registries: living documents listing every population affected by the AI product, the harm mechanisms that could affect each, and the design safeguards in place. Third, they conduct annual red team reviews rather than one-time pre-launch audits, because deployment contexts and adversarial techniques evolve.
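An impact registry does not need special tooling; one possible shape is a small data structure that makes unmitigated harms easy to list. The sketch below is only one way to model it. The `RegistryEntry` fields, the `launch_blockers` helper, and the example entries for a hiring product are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RegistryEntry:
    population: str
    harm_mechanisms: list[str]
    # Maps a harm mechanism to the design safeguard that addresses it.
    safeguards: dict[str, str] = field(default_factory=dict)

    def unmitigated(self) -> list[str]:
        """Harm mechanisms with no recorded safeguard."""
        return [h for h in self.harm_mechanisms if h not in self.safeguards]


def launch_blockers(registry: list[RegistryEntry]) -> dict[str, list[str]]:
    """Populations that still have at least one unmitigated harm mechanism."""
    return {e.population: e.unmitigated() for e in registry if e.unmitigated()}


# Invented entries for an imagined resume-screening feature.
REGISTRY = [
    RegistryEntry(
        population="applicants with employment gaps",
        harm_mechanisms=["penalized for non-linear careers"],
    ),
    RegistryEntry(
        population="hiring managers",
        harm_mechanisms=["over-reliance on ranking"],
        safeguards={"over-reliance on ranking": "ranking shown with reviewer override"},
    ),
]

print(launch_blockers(REGISTRY))
```

Keeping the registry as data, not as a slide, means the list of open harms can be reviewed in the same pull request as the feature that creates them.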
Salesforce's Office of Ethical and Humane Use, established in 2019, created a product review process requiring any AI feature to pass a documented harm assessment before engineering resources were allocated. The process added friction — teams reported it extended roadmap planning by two to four weeks. The same teams reported it eliminated multiple expensive post-launch remediation cycles and improved stakeholder trust in AI feature announcements.
For every ethical constraint applied to your AI product, write a sentence describing what the constraint makes your product better at. If you cannot write that sentence, the constraint is compliance overhead. If you can, it is a design principle — and design principles build better products than compliance checklists do.
You are designing an AI product in one of these domains: healthcare, hiring, financial services, education, or content recommendation. Name your domain and a specific AI feature you want to build.
The assistant will present you with three ethical constraints relevant to your domain (regulatory, reputational, or user wellbeing). Your job is to rewrite each constraint as a product design principle — describing what the constraint makes your product better at. Complete at least 3 exchanges.