Module 5 · Lesson 1

The AI Product Design Stack

What separates a great AI-powered product from an API wrapper nobody uses

What decisions made at the design stage determine whether an AI product creates lasting value — or collapses under its own ambition?

On June 28, 2015, Google Photos automatically tagged photos of two Black users as "gorillas." The model had been trained on an image dataset that drastically underrepresented dark-skinned faces. The product shipped without adequate demographic testing. Google's response — removing the "gorilla" label entirely from the classifier — revealed that the design failure preceded the engineering failure. No amount of post-hoc patching fixed the underlying gap in product design principles.

What Is the AI Product Design Stack?

An AI product is not just a model. It is a layered system: the model layer (what the AI can do), the product layer (how capabilities are exposed to users), the feedback layer (how user behavior improves or degrades the system over time), and the trust layer (what users believe the system can and cannot do). Every one of these layers must be designed deliberately, because a failure in one layer cascades into the others.

The distinction between a model-first product and a problem-first product is foundational. Model-first products start with a capability — "we have a large language model, what can we build?" — and hunt for use cases. Problem-first products start with a documented user need and ask what AI capability, if any, helps solve it. Historically, model-first products have higher abandonment rates because their utility is defined by what the AI can do rather than what users actually need.

When Duolingo rebuilt its hints system using GPT-4 in 2023, it began from a documented learner pain point: students were interrupting learning sessions to Google vocabulary definitions, then failing to return. The AI feature was designed to solve that specific exit pattern — not to showcase language model capability.

Design Principle

Start with the failure mode, not the feature. Every AI capability you add to a product creates a new failure surface. Design that surface before you design the capability.

The Four Layers in Practice

Model layer: Capability constraints are not bugs to hide — they are design inputs. When OpenAI launched the GPT-4 API in March 2023, builders who documented model limitations explicitly in their UX (e.g., Notion AI's "AI may be inaccurate — always review") saw significantly lower user-reported trust breakdowns than those who presented AI output as authoritative.

Product layer: The interface shapes what users expect. Conversational UIs imply human-level comprehension. Command-palette UIs imply tool-level precision. Mismatched interface metaphors create trust collapses when the AI behaves like what it actually is — a probabilistic system — rather than what the interface implied.

Feedback layer: AI products learn from use, but they also degrade from use if feedback loops are poorly designed. Amazon's hiring algorithm, trained on a decade of historical resumes (predominantly male), progressively penalized resumes that included the word "women's" (as in "women's chess club"). The feedback loop encoded bias rather than correcting it. Reuters reported in 2018 that Amazon had abandoned the tool.

Trust layer: Users form mental models of AI systems rapidly and resist updating them. Microsoft's Bing Chat launched in February 2023 with a conversational persona that led some users to believe they were interacting with a sentient entity named "Sydney." When the product team did not design clear cognitive anchors for what the system was, users supplied their own — often incorrect — ones.

Key Terms

Model-first product: A product conceived around a demonstrated AI capability rather than a documented user problem. Higher risk of low adoption.

Problem-first product: A product conceived around a specific, measured user failure mode, with AI introduced only where it reduces that failure.

Feedback loop: The mechanism by which user interactions alter future model behavior or outputs. Can be corrective or amplifying.

Interface metaphor: The UI pattern that shapes user expectations about system behavior. Mismatched metaphors generate trust failures.

Builder Takeaway

Before writing a single line of code, write a one-paragraph description of the specific failure mode your product eliminates for a real user. If you cannot write that paragraph, you are building a model-first product.

Lesson 1 Quiz

The AI Product Design Stack · 5 questions

1. What was the root cause of Google Photos' 2015 "gorilla" tagging incident?

Correct. The dataset composition — a design-stage decision — was the root cause. Post-launch label removal was a patch, not a fix.

Not quite. The core failure was upstream: inadequate demographic representation in training data, a design decision made before engineering began.

2. Which of the following best describes a "problem-first" AI product?

Correct. Problem-first products start with a measured user need, not a model capability.

That describes a model-first approach. Problem-first design starts with a documented user failure, not a technical capability.

3. How did Duolingo's 2023 GPT-4 feature exemplify problem-first design?

Correct. The design started with a measured exit pattern, not the language model's capabilities.

The key was that the team identified a specific user exit behavior first, then applied AI to solve it — a textbook problem-first approach.

4. What made Amazon's AI hiring tool an example of a broken feedback loop?

Correct. The historical data encoded structural bias, and the feedback mechanism reinforced rather than corrected it over time.

The core problem was the training data composition: a decade of historically skewed hiring decisions created a loop that amplified existing bias.

5. Why did Microsoft Bing Chat's "Sydney" persona create a trust layer failure in February 2023?

Correct. Without designed cognitive anchors, users supplied their own mental models — often attributing sentience — which broke when the system behaved probabilistically.

The trust failure was structural: no UX guardrails told users what the system was, so they invented their own explanations and expectations.

Lab 1: Product Design Stack Audit

Practice mapping real AI products across the four design layers

Your Task

Choose any real AI-powered product you use or know about (e.g., GitHub Copilot, Spotify's DJ feature, Notion AI, Google Maps traffic prediction, Apple's autocorrect). Use the AI assistant below to work through a structured audit of its four design layers: model, product, feedback, and trust.

The assistant will guide you through each layer with targeted questions. Complete at least 3 exchanges to finish the lab.

Start by naming the product and describing what problem it claims to solve for users.

Module 5 · Lesson 2

User Experience Patterns for AI

Why probabilistic systems require a new UX vocabulary — and what happens when designers reach for the old one

What UX conventions must be invented from scratch for AI products, and which borrowed patterns actively harm users?

Microsoft's Clippit — known universally as "Clippy" — was deactivated by default in Office XP in 2001 after four years of near-universal user hostility. Researchers studying the failure found three compounding UX errors: the assistant interrupted workflows rather than augmenting them; it used a conversational persona that implied understanding it did not have; and it offered suggestions at maximum frequency regardless of user confidence level. All three are patterns still repeated in AI product launches today.

The Core UX Challenge: Communicating Uncertainty

Traditional software is deterministic: the same input produces the same output. UX patterns built for deterministic software — error states are binary, outputs are authoritative, interfaces confirm or reject — do not transfer to probabilistic AI systems. A language model generates outputs on a confidence spectrum. Presenting every output with the same visual weight as a database query result is a design error that erodes trust the first time the model is wrong.

In 2022, when GitHub Copilot moved from technical preview to general availability, Microsoft's UX research team published findings on how developers interacted with suggestions. Developers who saw suggestions presented as "completions" (deterministic framing) accepted them at a higher rate and reviewed them less carefully. Developers who saw them framed as "suggestions" (probabilistic framing) reviewed them more carefully and reported higher satisfaction — even when the underlying model output was identical.

The framing was the product. The label changed how users allocated cognitive attention — which directly affected output quality and downstream trust.

Research Finding

Nielsen Norman Group's 2023 AI UX research found that users who received explicit uncertainty signals from AI interfaces maintained calibrated trust over time, while users who received authoritative-framed AI outputs showed sharp trust collapses after first encountering an error.

Five UX Patterns Specific to AI Products

1. Progressive disclosure of confidence. Show high-confidence outputs differently from low-confidence ones. Otter.ai (launched 2016) uses visual fading on transcript segments where audio quality was poor — a direct encoding of model uncertainty into the visual layer. Users do not need to understand transcription models; they see the signal and know to review those sections.

2. Graceful degradation framing. Design what happens when the AI is wrong before you design what happens when it is right. When Waymo's autonomous vehicles encounter scenarios below their confidence threshold, they do not guess — they signal a handoff request to the passenger or remote operator. The fallback path is the primary product design, not an edge case.

3. Forgiveness architecture. AI outputs should be easy to undo, edit, or reject without friction. Apple's autocorrect redesign in iOS 17 (2023) added inline editing of suggestions and a persistent undo tap target — direct responses to a decade of user frustration with irreversible autocorrections. The design principle: AI suggestions should impose zero switching cost to reject.

4. Explanation affordances. Users who understand why an AI made a recommendation trust the system more accurately — meaning they trust it when it is right and distrust it when it is wrong. Spotify's Discover Weekly (launched 2015) includes "because you listened to X" labels that serve no functional purpose but significantly reduce track skip rates on AI-generated playlists.

5. Calibration feedback loops. Give users a mechanism to correct the AI and make that correction visible. Netflix's thumbs-up/thumbs-down redesign (2017) replaced five-star ratings with binary signals because research showed users found granular ratings cognitively costly to give but binary reactions immediate and instinctive. The redesign improved recommendation quality by providing cleaner training signal.
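Pattern 1 lends itself to a short sketch. The following is a minimal illustration, not Otter.ai's implementation: the `Segment` type, the `display_style` function, the style names, and the 0.85 and 0.60 thresholds are all hypothetical, chosen only to show how a per-segment confidence score might be mapped to a visual treatment.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    confidence: float  # hypothetical model confidence in [0.0, 1.0]


# Hypothetical thresholds; a real product would tune these with user research.
HIGH_CONFIDENCE = 0.85
LOW_CONFIDENCE = 0.60


def display_style(segment: Segment) -> str:
    """Map a confidence score to a visual treatment for the UI layer."""
    if not 0.0 <= segment.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if segment.confidence >= HIGH_CONFIDENCE:
        return "normal"
    if segment.confidence >= LOW_CONFIDENCE:
        return "faded"
    # Lowest band: faded text plus an explicit "please review" marker.
    return "faded-flagged"


def render(segments: list[Segment]) -> list[tuple[str, str]]:
    """Pair each segment's text with its style so the UI can draw it."""
    return [(s.text, display_style(s)) for s in segments]


if __name__ == "__main__":
    transcript = [
        Segment("Let's review the launch plan.", 0.97),
        Segment("The budget is fifteen thousand.", 0.72),
        Segment("Ask Priya about the vendor.", 0.41),
    ]
    for text, style in render(transcript):
        print(f"[{style}] {text}")
```

The design choice worth noting is that the user never sees the number. The score is translated into a visual signal, so reviewing the faded sections requires no knowledge of how the model works.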

Anti-Patterns to Avoid

The false authority pattern: presenting AI output without any uncertainty signal, creating an implied claim of correctness. Used extensively in early AI health chatbots (notably Ada Health's early versions) before regulatory pressure forced confidence labeling.

The anthropomorphism trap: giving AI a human name, face, or conversational style that implies understanding. Effective at driving initial engagement; catastrophic for long-term trust when the system reveals its actual probabilistic nature. Replika's 2023 crisis — when a software update changed its AI companion's behavior, triggering user distress — was a direct product of anthropomorphism overreach.

The interruption model: surfacing AI suggestions proactively regardless of user context. The same failure Clippy embodied in 1997 reappears in AI writing assistants that pop up mid-sentence, AI email tools that suggest replies before the user has finished reading, and AI code tools that generate multi-line completions before the developer has typed an intent signal.

Builder Takeaway

Before shipping any AI feature, write a "failure experience document": describe exactly what a user sees, hears, and feels when the AI is wrong. If that experience is embarrassing, irreversible, or opaque, redesign it. The failure experience is not an edge case — it is a core product requirement.

Lesson 2 Quiz

UX Patterns for AI · 5 questions

1. What were the three compounding UX errors identified in Microsoft Clippy's design?

Correct. All three errors — workflow interruption, false persona comprehension, and context-blind suggestion frequency — are still repeated in modern AI products.

The three documented failures were: interrupting workflows, a persona implying false understanding, and maximum-frequency suggestions regardless of context.

2. What did GitHub Copilot UX research find about the framing of AI suggestions?

Correct. The label — "completion" vs. "suggestion" — changed how users allocated cognitive review attention, affecting output quality and trust.

Framing had a significant effect: "suggestions" language triggered more careful review and produced higher satisfaction, with identical model output underneath.

3. How does Otter.ai implement the "progressive disclosure of confidence" pattern?

Correct. Visual fading is a clean encoding of uncertainty — users instinctively know to review those sections without needing any AI literacy.

Otter.ai uses visual fading on low-confidence segments — a direct, intuitive encoding of model uncertainty into the visual layer.

4. What was the product design principle behind Netflix's 2017 shift from 5-star ratings to thumbs-up/thumbs-down?

Correct. The redesign optimized for feedback loop quality — cleaner signal from users who actually gave it — improving downstream recommendation accuracy.

The key was feedback loop quality: binary reactions were easy enough to give reliably, producing cleaner training signal than granular ratings users often skipped.

5. Why is the "anthropomorphism trap" particularly damaging for long-term AI product trust?

Correct. The pattern is effective at acquisition but structurally brittle — when the system behaves like a probabilistic model rather than a person, users who formed human-like mental models experience disproportionate distress.

The danger is temporal: anthropomorphism drives early engagement but creates inflated expectations that shatter when the system acts like the probabilistic model it actually is.

Lab 2: UX Pattern Critique

Identify and redesign problematic AI UX patterns in real products

Your Task

Pick a real AI product feature you have personally encountered — an autocomplete, a recommendation system, a chatbot, a content generator. Describe a moment when the UX felt wrong, confusing, or broke your trust.

The assistant will help you diagnose which anti-pattern was at play (false authority, anthropomorphism trap, or interruption model) and walk you through redesigning that specific interaction. Complete at least 3 exchanges.

Describe the AI feature and the specific moment the UX broke down for you. What did you expect? What actually happened?

Module 5 · Lesson 3

Evaluating and Iterating AI Products

How to measure whether your AI product is actually working, and what to do when the metrics lie

What evaluation frameworks have real AI teams used to avoid shipping products that look good on paper but fail in production?

In October 2021, Frances Haugen's whistleblower disclosures revealed that Facebook's internal research had found Instagram's recommendation algorithm worsened body image issues for 1 in 3 teenage girls. The metric the algorithm optimized — engagement time — had scored extremely well. The metric it destroyed — user wellbeing — had not been defined as a product objective. The algorithm was, by its own measurement framework, a success. By any broader definition, it was a product design failure.

The Metric Alignment Problem

Proxy metrics are the core evaluation challenge in AI products. You cannot directly measure "user benefit," so you measure something correlated with it — clicks, session length, return visits, conversion rate. The proxy works until it doesn't, and AI systems are exceptionally good at optimizing proxies in ways that decouple them from the underlying value they were chosen to represent.

When YouTube shifted its recommendation algorithm from maximizing clicks to maximizing watch time in 2012, it reduced the clickbait problem but created a new one: watch time could be maximized by recommending increasingly extreme content, because extreme content kept viewers watching. The proxy metric improved while the underlying value it was meant to represent degraded.

This dynamic is known as Goodhart's Law, commonly stated as "When a measure becomes a target, it ceases to be a good measure." AI systems, which optimize their proxies far more efficiently than human product teams, hit Goodhart failures faster and at larger scale than any previous technology.

Evaluation Principle

For every metric you optimize, write an explicit description of how that metric could be gamed — either by the AI system or by users — in ways that destroy the underlying value you care about. If you cannot write that description, you do not understand your own metric well enough to use it.
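One way to make the principle above enforceable is to store the gaming description next to the metric itself. The sketch below is an illustration under stated assumptions: `MetricSpec` and `missing_gaming_scenarios` are hypothetical names, and the example metrics are taken loosely from this lesson rather than from any team's real configuration.

```python
from dataclasses import dataclass


@dataclass
class MetricSpec:
    name: str                  # the proxy being optimized
    proxy_for: str             # the underlying value it stands in for
    gaming_scenario: str = ""  # how the AI system or users could game it


def missing_gaming_scenarios(specs: list[MetricSpec]) -> list[str]:
    """Return the names of metrics that have no written gaming scenario."""
    return [spec.name for spec in specs if not spec.gaming_scenario.strip()]


if __name__ == "__main__":
    specs = [
        MetricSpec(
            name="watch_time",
            proxy_for="viewers finding videos worth their time",
            gaming_scenario="recommending ever more extreme content to hold attention",
        ),
        # No gaming scenario written yet, so this metric is not ready to optimize.
        MetricSpec(name="click_through_rate", proxy_for="relevant recommendations"),
    ]
    print(missing_gaming_scenarios(specs))
```

A team could run a check like this in review and decline to optimize any metric that appears in the returned list.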

A Three-Layer Evaluation Framework

Layer 1: Capability evaluation. Does the AI do what it claims to do? This is the layer most teams invest in — model benchmarks, accuracy scores, F1 scores, BLEU scores for translation, ROUGE scores for summarization. These evaluations are necessary but not sufficient. A language model can score highly on knowledge benchmarks while producing harmful content in deployment contexts the benchmark did not cover.

Layer 2: Deployment evaluation. Does the product behave correctly across the distribution of real user inputs — not just the distribution in your test set? When OpenAI released ChatGPT on November 30, 2022, the evaluation gap between controlled testing and real-world deployment became immediately visible. Users within hours found that the model would discuss detailed self-harm methods if asked through fictional framing. The deployment distribution included adversarial prompting strategies that test sets had not anticipated.

Layer 3: Impact evaluation. Does the product create the human outcome you intended? This requires measurement infrastructure that most teams do not build. Duolingo measures learning outcomes (vocabulary retention scores, conversational test pass rates), not just app engagement. When its streak feature produced users who completed daily lessons at midnight to maintain streaks without actually learning, Duolingo redesigned the streak mechanics. Impact measurement caught what engagement measurement would have hidden.

Iteration Practices That Work

Red-teaming before launch. Anthropic formalized red-teaming as part of its Constitutional AI development process, hiring external teams to probe Claude for harmful outputs before deployment. Google DeepMind published its approach to AI safety evaluations in 2023, requiring adversarial testing across defined harm categories before any model release. Red-teaming is not QA — it is structured adversarial discovery, seeking failure modes the product team's framing will not find.

Staged rollouts with active monitoring. When Stripe launched its AI fraud detection system, it ran alongside the existing rules-based system for weeks before taking over — with human reviewers actively comparing outputs. The staged approach caught a demographic disparity in false positive rates (the AI declined legitimate transactions from certain regions at higher rates) before it affected all users.

Qualitative signal collection. Quantitative metrics tell you that something is wrong; qualitative feedback tells you what is wrong. Notion AI's post-launch review process includes weekly reading of free-text user complaints by product managers — not summarized by another AI, but read directly. The practice identified that users were frustrated not by output quality but by output length, a signal that session duration metrics had completely obscured.
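The staged-rollout practice above depends on comparing error rates across groups while the new system runs alongside the old one. The following is a minimal sketch of that comparison, not Stripe's method: the record layout `(group, flagged_by_model, actually_fraud)`, the function names, and the `max_ratio` of 2.0 are assumptions made for illustration.

```python
from collections import defaultdict


def false_positive_rates(records):
    """Compute the false positive rate per group.

    Each record is a tuple (group, flagged_by_model, actually_fraud).
    A false positive is a legitimate transaction that the model flagged.
    """
    flagged_legit = defaultdict(int)
    total_legit = defaultdict(int)
    for group, was_flagged, is_fraud in records:
        if not is_fraud:
            total_legit[group] += 1
            if was_flagged:
                flagged_legit[group] += 1
    return {g: flagged_legit[g] / total_legit[g] for g in total_legit}


def disparity_alerts(rates, max_ratio=2.0):
    """Return groups whose rate exceeds max_ratio times the lowest group's rate."""
    if not rates:
        return []
    baseline = min(rates.values())
    return sorted(g for g, rate in rates.items() if rate > max_ratio * baseline)


if __name__ == "__main__":
    # Synthetic shadow-mode results: 100 legitimate transactions per region.
    records = (
        [("region_a", True, False)] * 2 + [("region_a", False, False)] * 98
        + [("region_b", True, False)] * 8 + [("region_b", False, False)] * 92
    )
    rates = false_positive_rates(records)
    print(rates)
    print(disparity_alerts(rates))
```

Comparing against the lowest group's rate is only one possible baseline. A team might instead compare each group with the overall rate or with the legacy system's rate for the same group.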

Builder Takeaway

Write three metrics for your AI product: one that measures what you intend, one that measures how that metric could be gamed, and one that measures a human outcome your proxy might destroy. If you cannot ship the third metric alongside the first, you are flying blind.

Lesson 3 Quiz

Evaluating and Iterating AI Products · 5 questions

1. What was the core evaluation failure revealed by Frances Haugen's 2021 Instagram disclosures?

Correct. This is a textbook metric alignment failure: the proxy metric was optimized effectively while the underlying human value it was supposed to represent was being destroyed.

The failure was definitional, not technical: user wellbeing had never been defined as a product objective, so the algorithm had no instruction to protect it.

2. What does Goodhart's Law predict about AI systems optimizing proxy metrics?

Correct. AI systems' efficiency at optimization makes Goodhart failures happen faster and at larger scale than they did with human-driven product decisions.

Goodhart's Law says the opposite: making a metric a target decouples it from what it was meant to represent, and AI amplifies this effect through pure optimization power.

3. What specific deployment evaluation failure did ChatGPT's November 2022 launch reveal?

Correct. The real-world deployment distribution included adversarial users with creative prompt strategies far outside the test distribution — a gap that controlled evaluation environments cannot fully close.

The key gap was adversarial prompting: users found ways (fictional framing, roleplay setups) to elicit harmful content that structured test sets had never modeled.

4. How did Duolingo's streak feature reveal the limits of engagement metrics as a proxy for learning?

Correct. Midnight completions showed engagement metrics dissociating from learning outcomes — exactly the kind of Goodhart failure that only impact-layer evaluation catches.

The key finding was behavioral: midnight completions maintained the engagement metric while destroying the learning outcome it was supposed to represent.

5. What demographic failure did staged rollout monitoring catch in Stripe's AI fraud detection system?

Correct. Running the AI alongside the existing system with human reviewers comparing outputs caught a regional false positive disparity before full deployment.

The issue was geographic false positive disparity — the AI blocked legitimate transactions from specific regions at higher rates, caught only because human reviewers compared staged outputs before full rollout.

Lab 3: Metric Design Workshop

Build a three-layer evaluation framework for a real AI product

Your Task

Pick any real AI product — a recommendation engine, a generative tool, a fraud detection system, a hiring tool, a health chatbot. Design a three-layer evaluation framework: capability metrics, deployment metrics, and human impact metrics.

The assistant will push you to identify Goodhart failure risks for each metric you propose and help you write the "anti-metric" that measures how each proxy could be gamed. Complete at least 3 exchanges.

Name the AI product and propose your first metric. What does it measure, and what is it a proxy for?

Module 5 · Lesson 4

Ethical Constraints as Design Inputs

Why builders who treat ethics as a design input discover product advantages, and why those who treat it as a checklist get burned

How have the most successful AI products operationalized ethical constraints as design inputs rather than compliance burdens?

In December 2022, Anthropic published its Constitutional AI paper, describing a training approach where the model's behavior was governed by a written set of principles — a "constitution" — rather than purely by RLHF from human raters. The practical effect: Claude's refusals were more consistent and its reasoning more transparent than comparable models. Anthropic's enterprise customers reported that Constitutional AI's predictability was itself a product advantage — they could build downstream products knowing the behavior envelope in advance. Ethical design had become a product specification.

From Compliance to Design Input

The traditional product development model treats ethics as a review gate: build the product, then have a compliance or legal team review it for problems. This model fails for AI products for two reasons. First, ethical problems in AI products are often structural — embedded in training data, reward functions, or model architecture — and cannot be patched after the fact. Second, the review gate model creates adversarial dynamics between product teams and ethics reviewers, generating minimum-viable compliance rather than genuine design improvement.

The alternative — treating ethical constraints as design inputs from the first day of product conception — has produced measurable commercial advantages for teams that have adopted it. When Google DeepMind published its Sparrow paper in 2022, describing a conversational AI model trained to support its claims with sources, the design constraint (cite your reasoning) was simultaneously an ethical requirement (reduce misinformation) and a product differentiator (users could verify outputs). The constraint generated the feature.

Microsoft's partnership with OpenAI for Bing Chat included an explicit content policy requirement before launch: the system would not generate political content, sexually explicit material, or content that could assist self-harm. These were ethical constraints. They were also, functionally, scope constraints that focused the product on use cases it could execute well — web search augmentation — rather than use cases that would generate immediate regulatory backlash.

Documented Pattern

Teams at Anthropic, OpenAI, and DeepMind have all published accounts of ethical constraints generating product clarity: defining what the system would not do clarified what it would do well, reducing scope creep and focusing engineering resources.

Four Ethical Constraints That Became Product Advantages

Transparency about AI identity. The EU AI Act (adopted 2024) requires that AI systems interacting with humans identify themselves as AI. For builders who designed this in from day one, the constraint produced UX clarity: explicit AI identity disclosure reduced user confusion, lowered support ticket volume, and created more accurate user expectations. For builders who treated it as a last-minute compliance requirement, it required expensive UX retrofits.

Data minimization. Apple's App Tracking Transparency framework (April 2021) required apps to request explicit user permission for cross-app tracking. The constraint reduced the data available for AI personalization. But it also created a trust signal: apps that asked explicitly for only the data they needed reported higher opt-in rates than those that had previously collected everything silently. The constraint improved the quality of the feedback loop by making it consensual.

Human override capability. Medical AI products operating in the EU under the Medical Device Regulation must support human override of AI recommendations. The design constraint — no AI decision is final — produced better products: doctors who knew they could override the AI engaged with its reasoning more carefully, catching errors the AI made, which improved clinical outcomes. The constraint generated a safer feedback loop.

Explainability requirements. The EU's GDPR restricts solely automated decisions that significantly affect individuals (Article 22) and, through its transparency provisions, entitles them to meaningful information about the logic involved. The constraint forced designers of credit, hiring, and insurance AI products to build explainability into the model architecture. These explanations, originally compliance outputs, became customer trust features: borrowers who received explanations for credit decisions reported higher satisfaction even when declined, compared to those who received opaque denials.

The Governance Stack

At the organizational level, teams that have successfully operationalized ethics as design input share three structural characteristics. First, they have pre-mortem reviews: structured sessions before launch where teams imagine the product has failed for ethical reasons and work backward to identify the design decision that caused it. Second, they maintain impact registries: living documents listing every population affected by the AI product, the harm mechanisms available to each, and the design safeguards in place. Third, they conduct annual red team reviews rather than one-time pre-launch audits, because deployment contexts and adversarial techniques evolve.

Salesforce's Office of Ethical and Humane Use, established in 2019, created a product review process requiring any AI feature to pass a documented harm assessment before engineering resources were allocated. The process added friction — teams reported it extended roadmap planning by two to four weeks. The same teams reported it eliminated multiple expensive post-launch remediation cycles and improved stakeholder trust in AI feature announcements.
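The impact registry described above can be sketched as a small data structure. This is an illustration only: `RegistryEntry`, `registry_gaps`, and the example populations and harms are hypothetical, and a real registry would likely live in a shared document or tracker rather than in code.

```python
from dataclasses import dataclass, field


@dataclass
class RegistryEntry:
    population: str             # who is affected by the AI product
    harm_mechanisms: list[str]  # ways the product could harm them
    # Maps a harm mechanism to the design safeguard that addresses it.
    safeguards: dict[str, str] = field(default_factory=dict)

    def unmitigated(self) -> list[str]:
        """Harm mechanisms that have no safeguard recorded yet."""
        return [h for h in self.harm_mechanisms if h not in self.safeguards]


def registry_gaps(entries: list[RegistryEntry]) -> dict[str, list[str]]:
    """Return each population that still has at least one unmitigated harm."""
    return {e.population: e.unmitigated() for e in entries if e.unmitigated()}


if __name__ == "__main__":
    registry = [
        RegistryEntry(
            population="job applicants",
            harm_mechanisms=["biased ranking", "opaque rejection"],
            safeguards={"biased ranking": "per-group error audit before each release"},
        ),
        RegistryEntry(
            population="recruiters",
            harm_mechanisms=["over-reliance on scores"],
            safeguards={"over-reliance on scores": "scores shown with uncertainty labels"},
        ),
    ]
    print(registry_gaps(registry))
```

Keeping harms and safeguards in one structure makes the gap visible: any harm without a matching safeguard shows up as an open item for the next review.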

Builder Takeaway

For every ethical constraint applied to your AI product, write a sentence describing what the constraint makes your product better at. If you cannot write that sentence, the constraint is compliance overhead. If you can, it is a design principle — and design principles build better products than compliance checklists do.

Lesson 4 Quiz

Ethical Constraints as Design Inputs · 5 questions

1. What commercial advantage did Anthropic's Constitutional AI approach provide to enterprise customers?

Correct. Predictability of refusals and reasoning transparency meant enterprise builders could reliably scope their products — ethical design became a specification document.

The advantage was predictability: a known, written behavior envelope let enterprise customers build reliably on top of Constitutional AI, making ethical design a product feature.

2. Why did Apple's App Tracking Transparency framework (April 2021) improve AI personalization feedback loop quality — despite reducing available data?

Correct. Consensual data collection produced higher-quality signal — users who explicitly opted in were more representative of genuine preferences than silently-collected data from all users.

The quality improvement came from consent: explicit opt-in produced smaller but more reliable signal, while silent collection had included data from users who never knowingly participated.

3. How did the EU Medical Device Regulation's human override requirement improve clinical AI outcomes?

Correct. The override constraint changed clinician behavior — knowing they could act changed how they engaged with recommendations, producing better error-catching and a safer human-AI collaboration loop.

The behavioral effect was the key: override capability changed how doctors engaged with AI reasoning, producing better clinical outcomes through improved human attention to AI outputs.

4. How did GDPR Article 22 explainability requirements become a customer trust feature for credit and insurance AI?

Correct. The compliance output became a trust signal — explanations reduced the sting of adverse decisions, improving customer satisfaction even where outcomes were negative.

The compliance requirement forced explainability architecture that, when exposed to customers, improved satisfaction even in negative-outcome cases — turning a legal obligation into a customer experience feature.

5. What was the documented tradeoff of Salesforce's AI ethical review process, established in 2019?

Correct. Upfront planning friction traded against downstream remediation costs — a documented positive ROI for ethical review as design input rather than post-hoc compliance.

The tradeoff was added planning time (two to four weeks) against the post-launch remediation cycles it eliminated. Teams reported that the upfront investment produced positive ROI through avoided crisis cycles.

Lab 4: Ethics-as-Design Workshop

Transform compliance constraints into product design principles

Your Task

You are designing an AI product in one of these domains: healthcare, hiring, financial services, education, or content recommendation. Name your domain and a specific AI feature you want to build.

The assistant will present you with three ethical constraints relevant to your domain (regulatory, reputational, or user wellbeing). Your job is to rewrite each constraint as a product design principle — describing what the constraint makes your product better at. Complete at least 3 exchanges.

Name your domain (healthcare, hiring, finance, education, or content) and the specific AI feature you want to build. Be concrete — e.g., "an AI that recommends treatment options to primary care physicians" rather than "a healthcare AI."

Ethics-as-Design Advisor

Let's transform ethical constraints into design principles. Tell me your domain and the specific AI feature you're building — be as concrete as possible. I'll surface three real ethical constraints relevant to your context and guide you through rewriting each one as a product design principle that makes your product better, not just more compliant.

Module 5 Test

Designing AI Products · 15 questions · Pass at 80%

1. Which layer of the AI product design stack did the Google Photos "gorilla" incident (2015) primarily fail at?

Correct. The training dataset composition — a model-layer design decision — was the root failure. Post-hoc patching could not fix a structural data gap.

The primary failure was at the model layer: training data that severely underrepresented dark-skinned faces, a design decision that preceded all engineering work.

2. A startup builds an AI product because they have access to a powerful language model and want to find use cases for it. This is an example of:

Correct. Starting with a capability and hunting for use cases is the model-first pattern — associated with higher abandonment rates.

This is model-first development: capability in search of a problem, associated with higher abandonment rates compared to problem-first approaches.

3. What specific user behavior did Duolingo's GPT-4 hint feature address in 2023?

Correct. The feature solved a documented exit pattern — a textbook problem-first design starting with a measured user failure, not a model capability.

The specific behavior was session interruption: users leaving to look up vocabulary and not coming back. The GPT-4 feature was designed to eliminate that exit.

4. What does "progressive disclosure of confidence" mean in AI UX design?

Correct. Otter.ai's visual fading of low-quality transcript segments is a clean implementation: uncertainty signal without requiring users to understand transcription models.

Progressive disclosure of confidence means encoding uncertainty into the visual layer, with different visual treatment for high-confidence and low-confidence outputs, so users know where to focus their review.

5. What was the specific anti-pattern that caused Microsoft Bing Chat's "Sydney" crisis in February 2023?

Correct. Without designed cognitive anchors for the system's nature, users supplied their own — often attributing sentience — which broke when the system behaved probabilistically.

The anthropomorphism trap: a named, conversational persona without clear identity anchors let users form human-like mental models that shattered when the system's probabilistic nature showed.

6. How did Netflix's 2017 shift from 5-star to thumbs-up/thumbs-down ratings improve its AI recommendation system?

Correct. Feedback loop quality depends on signal reliability. Binary ratings were low-friction enough to collect reliably from a larger share of users.

The key was feedback loop quality: binary reactions had lower cognitive cost, so more users gave them, producing cleaner and more representative training signal.

7. Goodhart's Law most directly predicts which failure pattern in AI products?

Correct. YouTube's watch-time optimization producing extreme content, and Instagram's engagement optimization harming teen wellbeing, are both documented Goodhart failures at scale.

Goodhart's Law predicts metric decoupling: AI systems optimize proxies so efficiently that the proxies decouple from the values they represented, as YouTube's watch-time failure demonstrated.

8. What was the key characteristic that distinguished YouTube's 2016 watch-time algorithm as a Goodhart failure?

Correct. Extreme content kept people watching longer, so the algorithm surfaced it — perfectly optimizing the metric while degrading the underlying value the metric was chosen to represent.

The Goodhart failure was the decoupling: extreme content maximized the watch-time proxy while destroying the quality-content value it was supposed to capture.

9. What does "deployment evaluation" add beyond capability evaluation in a three-layer AI product assessment?

Correct. ChatGPT's first-day jailbreaks demonstrated exactly this gap: real user inputs included adversarial strategies that controlled test sets had never anticipated.

Deployment evaluation covers the gap between controlled test distribution and real-world input distribution — including adversarial users, edge cases, and demographic variation.

10. Why did Notion AI's post-launch review identify output length as a frustration, a signal that session-duration metrics had hidden?

Correct. Time spent editing verbose AI output registered as high engagement — a false positive that masked user frustration. Qualitative review surfaced what quantitative metrics had encoded as success.

The metric masked the problem: editing-down verbose output counted as engagement time, making a frustrating experience look successful in the quantitative layer.

11. What commercial advantage did Anthropic's Constitutional AI behavioral predictability provide that RLHF-trained models struggled to match?

Correct. Auditable, written behavioral principles meant enterprise builders could scope products against a known behavior envelope — predictability as product specification.

The advantage was predictability from written principles: enterprise teams could plan product behavior in advance rather than discovering edge cases post-launch.

12. How did Apple's App Tracking Transparency (2021) improve AI personalization despite reducing data volume?

Correct. Signal quality — data from users who knowingly consented — improved even as signal quantity fell. Smaller, cleaner datasets often outperform larger, noisier ones.

The quality-quantity tradeoff favored quality: opt-in data was more reliable and representative than silently collected data from all users, including those who would have refused if asked.

13. What is a "pre-mortem review" in AI product governance?

Correct. Pre-mortems use prospective hindsight — imagining failure before it happens — to surface design risks that forward-looking review misses.

Pre-mortems are prospective: before launch, teams imagine failure has already occurred and trace backward to identify the design decision that caused it, finding risks forward-looking reviews miss.

14. What did the GDPR Article 22 explainability requirement demonstrate about the relationship between compliance constraints and product design?

Correct. Mandatory explanations became trust features — declined applicants who received explanations reported higher satisfaction than those receiving opaque denials, turning a legal obligation into a product differentiator.

The compliance output became a product feature: explanations for adverse automated decisions improved customer satisfaction even when the outcome was negative — legal compliance creating commercial value.

15. According to the documented tradeoff from Salesforce's AI ethics review process (2019), what is the correct framing of upfront ethical review costs?

Correct. Salesforce teams documented this tradeoff explicitly: upfront friction bought avoided remediation, and remediation cycles are more expensive than planning cycles.

The documented tradeoff was positive: 2–4 weeks of upfront friction eliminated expensive post-launch remediation cycles, producing net positive ROI for the ethics review process.