Evaluation & Testing for AI · Introduction

The test passed. The system failed. That gap is what this course is about.

Traditional software either does what it's told or it doesn't. AI does something plausible — and plausible is the hardest thing to catch.

In November 2022, Meta released Galactica — a language model trained on 48 million scientific papers. Researchers called it extraordinary. Three days later, it was gone. The model had been generating citations to papers that didn't exist, summarizing studies with invented results, producing chemistry that sounded right but wasn't. Every sentence was fluent. No test had caught any of it, because no test had been written to ask: is this true?

This is the central challenge of AI evaluation. Conventional software testing is built on specifications — a function receives inputs, produces outputs, and you check whether those outputs match what you defined. AI systems don't work that way. They produce outputs that are grammatically correct, contextually coherent, and frequently wrong in ways that are invisible to any automated check that doesn't already know what "wrong" looks like.

This course teaches you to close that gap. You'll learn why AI breaks the assumptions baked into software testing, how benchmark scores mislead teams into shipping systems that fail in the wild, where failures actually concentrate (and why those are rarely true edge cases), and how to build monitoring that catches degradation before your users do. The goal isn't to be skeptical of AI — it's to be rigorous about it.

Lesson 1 · Module 1

The Specification Problem

Software passes or fails. AI systems drift, hallucinate, and confabulate — and none of that shows up in a unit test.

Why can't we just run the same tests we use for ordinary software?

When Meta released Galactica — a large language model trained on 48 million scientific papers — researchers initially praised its fluency. Three days later, they pulled it offline. The model had been confidently generating citations to papers that did not exist, summarizing studies with fabricated results, and producing plausible-sounding but factually wrong chemistry. Every output was grammatically flawless. No conventional software test had caught any of it, because the model had no specification to violate — it produced what looked like valid text in every case.

What Makes Software Testing Work (and Why It Breaks Here)

Classical software testing rests on a simple assumption: for a given input, there is a correct output you can specify in advance. A function that computes sales tax either returns 8.875% of $40.00 or it does not. You write the assertion. You run the test. You get a binary result.
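
As a minimal sketch of what that binary check looks like, the snippet below encodes the sales-tax example as a single assertion. The function name `sales_tax`, the constant `RATE`, and the round-half-up-to-the-cent policy are assumptions made for illustration; they are not taken from any particular tax system.

```python
from decimal import Decimal, ROUND_HALF_UP

RATE = Decimal("0.08875")  # 8.875%, the rate used in the example above


def sales_tax(amount: Decimal) -> Decimal:
    """Return the tax on `amount`, rounded to the nearest cent (assumed policy)."""
    return (amount * RATE).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# The specification is one exact value, so the test result is binary.
assert sales_tax(Decimal("40.00")) == Decimal("3.55")
```

The assertion is possible only because the expected value can be written down before the code runs.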

AI systems — particularly large language models and machine-learning pipelines — shatter this assumption. The output space is effectively infinite. A question like "Summarize this contract" has thousands of acceptable responses and thousands of unacceptable ones, and the boundary between them is neither sharp nor stable across contexts. You cannot write an assertion for "good summary."

This is called the oracle problem: in AI testing, there is often no reliable oracle — no ground-truth function you can call to verify whether the output is correct.

Oracle Problem The absence of a reliable reference function that can definitively classify any AI output as correct or incorrect without significant human judgment.

Specification Gap The inability to fully describe, in advance, all the behaviors an AI system should and should not exhibit across its entire input space.

Three Structural Differences

1. Non-determinism. The same prompt, sent twice, can produce different outputs. Sampling temperature and other sources of randomness mean that traditional regression testing (re-running the same inputs and expecting identical outputs) fails as a reliability signal. A model like Galactica could produce a different fabricated citation on each run.

2. Emergent failure modes. AI failures are often not crashes or exceptions. They are subtle: a medical chatbot that gives mostly correct advice but systematically under-refers women for cardiac events; a code-completion model that introduces SQL injection vulnerabilities at a 3% rate; a translation model that subtly shifts the sentiment of diplomatic documents. These failures are invisible unless you specifically look for them.

3. Distribution shift. A model trained on data from one period or population can degrade invisibly when deployed on a different population. The model still "runs" — it still produces confident outputs — but its reliability has collapsed. Traditional software does not have this property: a sorting algorithm does not become wrong because the cultural context of the data changed.
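
To make the first of these differences concrete, here is a small sketch under stated assumptions. `toy_model` is a hypothetical stand-in for a sampled model call (it simply picks one of three equally valid phrasings), and the two helper functions are illustrative names, not part of any testing library. It compares an exact-match regression check with a property check repeated over many samples.

```python
import random

# Three phrasings that are all acceptable answers to the same question.
CANNED = [
    "The contract runs for 12 months.",
    "This agreement lasts 12 months.",
    "Term of the contract: 12 months.",
]


def toy_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a sampled LLM call."""
    return rng.choice(CANNED)


def exact_match_rate(prompt, expected, runs, seed=0):
    """Share of runs whose output equals one fixed expected string."""
    rng = random.Random(seed)
    return sum(toy_model(prompt, rng) == expected for _ in range(runs)) / runs


def property_pass_rate(prompt, check, runs, seed=0):
    """Share of runs whose output satisfies a property, whatever the wording."""
    rng = random.Random(seed)
    return sum(check(toy_model(prompt, rng)) for _ in range(runs)) / runs


prompt = "How long does the contract run?"
exact = exact_match_rate(prompt, CANNED[0], runs=200)
prop = property_pass_rate(prompt, lambda out: "12 months" in out, runs=200)
print(f"exact-match pass rate: {exact:.2f}")
print(f"property pass rate:    {prop:.2f}")
```

The exact-match check fails on most runs even though every output is acceptable, while the property check passes on all of them. A property check is still only as good as the property chosen; it does not solve the oracle problem.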

Documented Case · Amazon Recruiting Tool · 2018

Amazon's machine-learning hiring tool, trained on a decade of resumes, learned to penalize resumes containing the word "women's" (as in "women's chess club"). The system returned ranked candidate lists without errors, so a conventional functional test had nothing to flag. The discriminatory pattern only surfaced during an internal audit comparing outcomes by gender. Reuters reported in 2018 that Amazon had scrapped the tool. No crash. No exception. No failed assertion.

What "Correct" Even Means for AI

When we test software, correctness is defined by a specification written by humans before the code was built. When we train AI systems, the "specification" is implicit — embedded in the training data, the loss function, and the choices made during fine-tuning. Nobody writes down "do not hallucinate." Nobody writes "do not encode historical hiring bias." The system learns proxies for human intent, and those proxies are imperfect.

This means AI testing must itself define what correctness looks like, a task that is simultaneously philosophical, empirical, and domain-specific. A response that is correct for a general-purpose assistant may be dangerously incomplete for a clinical decision-support tool. Testing strategy must begin by asking: what does this system need to do, and what must it never do?

Core Principle

AI evaluation is not a phase that happens after development. It is a continuous discipline that begins before a model is trained — at the moment you decide what it is for — and continues indefinitely after deployment.

Lesson 1 Quiz

The Specification Problem · 3 questions

What is the "oracle problem" in AI testing?

Correct. The oracle problem refers to the absence of a ground-truth verification function — you can't always know whether an AI output is "correct" without substantial human evaluation.

Not quite. The oracle problem specifically concerns the inability to verify outputs automatically — there's no reliable ground-truth function to call.

Amazon's 2018 recruiting tool failed in a way that illustrates which core AI testing challenge?

Correct. The tool returned ranked lists without errors — it "passed" any functional test. The discriminatory pattern only surfaced through an audit of outcomes, illustrating how AI failures can be subtle and invisible to conventional testing.

Not quite. Amazon's tool produced outputs without crashing. Its failure was an emergent bias baked into the model's learned patterns — invisible to unit tests, visible only in outcome analysis.

Which property of AI systems means that traditional regression testing — expecting identical outputs for identical inputs — is insufficient?

Correct. Non-determinism means the same input can produce different outputs across runs, so a test that checks for exact output matching can fail even when the model is working correctly.

Not quite. AI models don't update weights during inference. The relevant property is non-determinism: stochastic sampling means identical prompts can produce different outputs, breaking exact-match regression testing.

Lab 1 · The Specification Problem

Conversation lab — complete 3 exchanges to unlock next lesson

Your Task

You're going to stress-test the idea that AI outputs can't be verified with simple pass/fail checks. Ask the AI assistant about real scenarios where standard software tests would miss AI failures — or challenge it with cases where you think traditional testing might actually work.

Try: "Give me a concrete example where an AI medical system could fail without raising any software error." — or challenge it with — "Couldn't we just use output hashing to detect hallucinations?"

AI Testing Lab

Specification Problem

Welcome to Lab 1. I'm here to explore why AI testing is structurally different from software testing. Ask me about the oracle problem, non-determinism, specification gaps, or bring a scenario you want to stress-test. What's on your mind?

Lesson 2 · Module 1

Benchmarks Are Not Reality

A model that tops every leaderboard can fail catastrophically the moment it leaves the benchmark's controlled conditions.

How do AI teams end up shipping systems that score perfectly on tests but fail badly in the wild?

In 2023, researchers at the University of California, Berkeley demonstrated that GPT-4's performance on the bar exam, often cited as evidence of legal reasoning, relied heavily on the exam's multiple-choice format. When the same legal scenarios were presented in open-ended form, performance dropped substantially. The benchmark score was real. The underlying capability it was taken to imply was not. A closely related problem has a name: benchmark contamination, in which training data includes the test set's questions or answers and scores end up measuring memorization, not generalization.

The Benchmark Lifecycle Problem

Benchmarks are created to measure something specific: reading comprehension, mathematical reasoning, commonsense inference. Within months of release, competitive model developers train on data that overlaps with the benchmark — sometimes inadvertently, because the internet contains discussions of the benchmark's questions; sometimes deliberately. The benchmark becomes a target rather than a measurement.

This is Goodhart's Law applied to AI: when a measure becomes a target, it ceases to be a good measure. By the time MMLU (Massive Multitask Language Understanding) was saturated — with top models scoring above 90% — researchers had already documented that high-scoring models failed on trivially rephrased versions of the same questions.

Benchmark Contamination When training data includes benchmark questions or answers, causing test scores to reflect memorization rather than generalizable capability.

Goodhart's Law When a measure becomes a target, it ceases to be a good measure — because optimization pressure corrupts the metric's relationship with the underlying construct.
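
One common first check for contamination is verbatim n-gram overlap between benchmark items and training text. The sketch below assumes you have access to (a sample of) the training corpus. The function names and the tiny example strings are invented for illustration. It only detects verbatim leakage, so paraphrased or translated copies of a question would pass unnoticed.

```python
import re


def ngrams(text: str, n: int) -> set:
    """Lowercased word n-grams, ignoring punctuation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_docs, n: int = 5) -> float:
    """Fraction of the item's n-grams that appear verbatim in any corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Invented example: a benchmark question quoted in a forum post.
item = "Which planet in the solar system has the largest number of known moons?"
leaked = ["Stuck on this one: which planet in the solar system has the "
          "largest number of known moons? I guessed wrong."]
clean = ["A recipe for sourdough bread that needs only flour, water, and salt."]

print(overlap_fraction(item, leaked))  # 1.0
print(overlap_fraction(item, clean))   # 0.0
```

A high overlap fraction is a reason to investigate, not proof that a score is inflated. A low fraction does not rule contamination out.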

The Gap Between Benchmark and Deployment

Even a genuinely uncontaminated benchmark only measures what its creators thought to measure. Medical AI benchmarks typically test diagnosis given a formatted clinical vignette. Real clinical use involves unstructured notes, abbreviations, physician shorthand, multi-morbidity, and social context. The benchmark measures a proxy for the real task, not the task itself.

A 2021 study published in Nature Medicine found that AI diagnostic models evaluated on curated benchmark datasets showed dramatic performance drops when evaluated on data from a different hospital — even within the same country, using the same imaging modality. The benchmark performance was real and reproducible. The deployment performance was not. The gap was caused by differences in scanner hardware, imaging protocols, and patient demographics that the benchmark did not represent.

Documented Case · Stanford Chest X-Ray Model · 2019

Stanford's CheXNet model, trained and evaluated on the ChestX-ray14 dataset, achieved radiologist-level performance on that benchmark. Independent evaluation by external researchers, published in PLOS Medicine, found that the model's advantage disappeared when tested on different patient populations and X-ray equipment. The benchmark was valid within its own distribution. It predicted almost nothing about real-world generalization across institutions.

What Valid Evaluation Looks Like Instead

Effective AI evaluation uses benchmarks as one signal among many — not as a verdict. The complementary methods include: held-out test sets from the deployment distribution (data that matches the actual population and context where the model will be used); adversarial probing (deliberately crafted inputs designed to find failure modes); behavioral testing (checking model behavior across systematically varied inputs, not just a fixed test set); and ongoing monitoring in production (treating deployment as a continuation of evaluation, not its end).

The 2023 HELM (Holistic Evaluation of Language Models) framework from Stanford's Center for Research on Foundation Models was a direct response to single-benchmark gaming — it evaluated models across 42 scenarios and 7 metrics simultaneously, making it much harder to optimize for the evaluation itself without genuinely improving across all dimensions.

Core Principle

Benchmark scores answer a narrow question: how does this model perform on these specific items under these conditions? The question that matters for deployment is completely different: how does this model perform on real tasks, for real users, in real conditions? Closing that gap requires evaluation work that benchmarks cannot do alone.

Lesson 2 Quiz

Benchmarks Are Not Reality · 3 questions

What is "benchmark contamination"?

Correct. Contamination occurs when training data overlaps with benchmark test sets — models then score high by recalling answers rather than demonstrating genuine capability.

Not quite. Contamination isn't about malicious tampering — it refers to training data that includes benchmark content, making high scores reflect memorization rather than generalization.

The Stanford CheXNet case illustrates that high benchmark performance can fail to generalize because:

Correct. CheXNet's benchmark was valid within its own distribution, but that distribution — specific hardware, imaging protocols, patient population — did not predict performance at other institutions with different equipment and demographics.

Not quite. The failure was a distribution mismatch: the benchmark's controlled conditions (scanner type, protocol, population) didn't match the varied conditions found at other hospitals.

Goodhart's Law, as applied to AI benchmarks, means:

Correct. Goodhart's Law: when a measure becomes a target, optimization pressure corrupts its relationship with the underlying construct. Models learn to score well on the benchmark rather than genuinely improving on the capability the benchmark was designed to measure.

Not quite. Goodhart's Law is about optimization corrupting measurement — once teams target the benchmark score directly, the score stops reflecting genuine capability.

Lab 2 · Benchmarks Are Not Reality

Conversation lab — complete 3 exchanges to unlock next lesson

Your Task

Probe the limits of benchmarks. Ask the assistant about specific benchmarks, how contamination could happen in practice, or what a genuinely deployment-predictive evaluation would look like for a use case you care about.

Try: "How would I detect whether a model's high MMLU score is contaminated?" — or — "Design an evaluation for a customer-service AI that would actually predict real-world performance."

AI Testing Lab

Benchmarks & Evaluation

Welcome to Lab 2. Let's dig into benchmark validity. Ask me about contamination detection, Goodhart's Law in practice, the HELM framework, or how to design evaluations that actually predict deployment performance. What would you like to explore?

Lesson 3 · Module 1

Failure at the Edges

AI systems don't fail uniformly. They fail precisely on the users you forgot to test for — and that invisibility is the danger.

Why do AI failures concentrate in edge cases, and why are those cases often not edge cases at all for the people experiencing them?

In June 2015, shortly after Google Photos launched with automatic image labeling, a Black software developer named Jacky Alciné noticed that the system had labeled photos of him and his girlfriend as "gorillas." Google's image classifier had achieved high average accuracy across its test set, but that test set had not adequately represented darker-skinned faces. The "high average accuracy" masked catastrophic failure on a specific demographic. Google's response, disabling the "gorilla" label entirely rather than fixing the underlying classification, was still the workaround years later, as Wired documented in 2018 and The New York Times reported in 2023.

Why Averages Hide the Problem

Aggregate accuracy metrics are a weighted average across all test cases. If your test set is 90% majority-group examples, a model that is right on every majority-group example and wrong on every minority-group example still achieves 90% overall accuracy, and looks excellent on every standard report. This is not a theoretical concern. It is the mechanism behind many documented cases of AI demographic disparity.
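
The arithmetic is easy to reproduce. The sketch below uses a synthetic test set built to match the 90/10 split described above. The function name `accuracy_by_group` and the group labels are illustrative.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, is_correct) pairs. Returns (overall, per_group)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        hits[group] += int(is_correct)
    overall = sum(hits.values()) / sum(totals.values())
    per_group = {g: hits[g] / totals[g] for g in totals}
    return overall, per_group


# Synthetic data: 90 majority-group cases, all correct,
# and 10 minority-group cases, all wrong.
records = [("majority", True)] * 90 + [("minority", False)] * 10
overall, per_group = accuracy_by_group(records)
print(overall)    # 0.9
print(per_group)  # {'majority': 1.0, 'minority': 0.0}
```

The single headline number and the per-group breakdown come from the same predictions. Only the second one shows the failure.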

A 2018 study by Joy Buolamwini of the MIT Media Lab and Timnit Gebru, the "Gender Shades" project, evaluated commercial facial analysis systems from IBM, Microsoft, and Face++ against a benchmark they constructed to include darker-skinned faces. Error rates for darker-skinned women were up to 34 percentage points higher than for lighter-skinned men. The vendors' published accuracy figures had used test sets that underrepresented darker skin tones, making their averages unreliable as predictors of real-world performance across the full population.

Aggregate Accuracy Masking When high average performance on a test set hides severe underperformance on specific subgroups that are underrepresented in the test set.

Subgroup Analysis Evaluating model performance separately for distinct demographic or behavioral subgroups, rather than only reporting aggregate metrics.

The "Edge Case" That Isn't

Testing teams often frame subgroup failures as "edge cases" — implying rarity and acceptable risk. But a failure that affects 13% of the U.S. population is not an edge case for the people in that 13%. The framing of edge cases reflects who was in the room when the test set was designed, not the actual frequency of the scenario in deployment.

The COMPAS recidivism algorithm, used by courts in several U.S. states to predict the likelihood of reoffending, was shown by ProPublica in 2016 to produce systematically different false positive rates by race — Black defendants were nearly twice as likely to be incorrectly flagged as high-risk. The algorithm's designers measured aggregate predictive accuracy. They did not measure subgroup-specific false positive rates. The failure was invisible in their evaluation framework.
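
The distinction between aggregate accuracy and subgroup false positive rates can be shown with a few lines of arithmetic. The counts below are invented for illustration and are not the COMPAS data. They are chosen so that both groups have identical accuracy while one group's false positive rate is twice the other's. All function and group names are hypothetical.

```python
def make_rows(tp, fn, fp, tn):
    """Build (predicted_high_risk, reoffended) pairs from confusion-matrix counts."""
    return ([(True, True)] * tp + [(False, True)] * fn
            + [(True, False)] * fp + [(False, False)] * tn)


def accuracy(rows):
    return sum(pred == actual for pred, actual in rows) / len(rows)


def false_positive_rate(rows):
    """Flagged-but-did-not-reoffend divided by all who did not reoffend."""
    negatives = [pred for pred, actual in rows if not actual]
    if not negatives:
        return float("nan")
    return sum(negatives) / len(negatives)


# Invented toy counts, not real data.
groups = {
    "group_a": make_rows(tp=40, fn=10, fp=20, tn=30),
    "group_b": make_rows(tp=30, fn=20, fp=10, tn=40),
}

for name, rows in groups.items():
    print(name, accuracy(rows), false_positive_rate(rows))
# group_a 0.7 0.4
# group_b 0.7 0.2
```

An evaluation that reports only accuracy would treat these two groups as equivalent. Which error rate matters most is a domain decision that has to be made before the evaluation is designed.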

Documented Case · Dermatology AI · 2020

A study published in Nature Medicine evaluated a deep-learning skin cancer detection system against 58 dermatologists. The AI matched or exceeded dermatologist performance on the benchmark dataset. A follow-up analysis found that the training and test data contained overwhelmingly light-skinned examples. The AI's performance on darker skin tones was substantially lower — a failure that the original evaluation, using aggregate metrics on a non-representative dataset, could not detect.

Testing for Failure, Not Just Success

Effective AI testing explicitly designs for subgroup analysis from the start. This means stratifying test sets by relevant demographic variables (age, gender, race, dialect, geography, device type, literacy level) and reporting disaggregated metrics — not just aggregate accuracy. It means conducting adversarial audits: actively constructing inputs designed to surface failures, not just inputs that represent average usage.

The practice of behavioral testing, popularized by Marco Tulio Ribeiro and colleagues with the CheckList framework in 2020, applies software testing principles to NLP models through minimum functionality tests, invariance tests, and directional expectation tests. The key insight: you need to check that the model behaves as expected under deliberately perturbed and adversarial inputs, not just that it succeeds on typical ones.
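
The three test types can be sketched in plain Python without the CheckList library itself. Everything below is an illustration under assumptions: `toy_sentiment` is a deliberately naive word-counting scorer standing in for a real model, and the test functions are hypothetical names, not CheckList's API.

```python
POSITIVE = {"great", "good", "friendly", "excellent"}
NEGATIVE = {"bad", "rude", "terrible", "late"}


def toy_sentiment(text: str) -> int:
    """Hypothetical stand-in model: positive-word count minus negative-word count."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def minimum_functionality_test(model) -> bool:
    # Simple cases with an unambiguous expected polarity.
    return model("The crew was friendly.") > 0 and model("The crew was rude.") < 0


def invariance_test(model, template: str, cities) -> bool:
    # Swapping an irrelevant detail (a city name) should not change the score.
    return len({model(template.format(city=c)) for c in cities}) == 1


def directional_test(model, base: str, addition: str) -> bool:
    # Appending a clearly negative sentence should not raise the score.
    return model(base + " " + addition) <= model(base)


print(minimum_functionality_test(toy_sentiment))                      # True
print(invariance_test(toy_sentiment, "The flight to {city} was great.",
                      ["Chicago", "Lagos", "Osaka"]))                 # True
print(directional_test(toy_sentiment, "The flight was good.",
                       "The crew was rude and the bags were late."))  # True

# A capability the toy model lacks: negation.
negation_ok = toy_sentiment("The flight was not good.") < 0
print(negation_ok)                                                    # False
```

The negation case is the useful one: the toy model passes three tests and still fails a behavior that a fixed test set of typical sentences might never exercise.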

Core Principle

The people most likely to experience AI failure are often the people least represented in the test data. Rigorous AI testing demands disaggregated evaluation: you must measure performance separately for every user group that matters, not just report the average and move on.

Lesson 3 Quiz

Failure at the Edges · 3 questions

The Gender Shades project found error rate gaps of up to 34 percentage points between darker-skinned women and lighter-skinned men. What evaluation practice would have detected this before deployment?

Correct. Disaggregated analysis — measuring performance for each subgroup separately — would have revealed that aggregate accuracy masked severe underperformance on darker-skinned women.

Not quite. More training or a larger but similarly composed test set wouldn't surface the disparity. The fix is disaggregated evaluation: measuring and reporting accuracy separately for each demographic group.

Why is framing AI subgroup failures as "edge cases" potentially misleading?

Correct. "Edge case" implies rarity and acceptable risk, but it often reflects the composition of the testing team, not the actual frequency of the scenario for the affected population.

Not quite. The problem with "edge case" framing is that it encodes the test designer's perspective. A failure affecting millions of people isn't an edge case for those people — the label creates false permission to ignore it.

The CheckList behavioral testing framework tests AI models using which approach?

Correct. CheckList uses structured behavioral tests modeled on software testing principles — checking not just that a model succeeds typically, but that it fails in expected ways on adversarial inputs, and maintains consistent behavior under input variations.

Not quite. CheckList applies software testing concepts to NLP: minimum functionality tests, invariance tests (outputs shouldn't change when irrelevant words change), and directional tests (output should change predictably when relevant inputs change).

Lab 3 · Failure at the Edges

Conversation lab — complete 3 exchanges to unlock next lesson

Your Task

Design a subgroup testing strategy. Pick any AI application — hiring, medical diagnosis, content moderation, translation, credit scoring — and work with the assistant to figure out which subgroups need separate evaluation and what metrics matter for each.

Try: "I'm testing a resume-screening AI. Walk me through which subgroups I need to evaluate separately and what failure modes I should look for." — or challenge it with — "Is subgroup analysis always necessary, or only in high-stakes domains?"

AI Testing Lab

Subgroup & Edge Case Analysis

Welcome to Lab 3. We're focusing on the users and scenarios that standard evaluations miss. Tell me about an AI system you're thinking about — or pick one from the lesson — and we'll work through a disaggregated evaluation design together. Where would you like to start?

Lesson 4 · Module 1

Evaluation as Continuous Practice

Deployment is not the end of evaluation. It is the beginning of the hardest evaluation you will ever run.

What happens to AI system reliability after deployment, and how do teams build monitoring that catches degradation before users do?

In early 2020, Epic Systems deployed a sepsis prediction model, trained before the COVID-19 pandemic, across hospital systems nationwide. A study published in JAMA Internal Medicine in 2021 found that after deployment, the model's performance deteriorated substantially in hospitals where COVID-19 patients were present. The model had been trained on pre-pandemic patients; pandemic-era patients had different vital sign patterns, different lab result distributions, and different treatment histories. The model still ran. It still produced predictions. Nothing in the system flagged the degradation. Clinicians relying on the score were receiving predictions of unknown validity. The failure was invisible until researchers specifically went looking for it.

Distribution Drift: The Silent Degrader

AI models are trained on a snapshot of the world. The world does not stay still. When the population using the system, the data it receives, or the context it operates in changes — even without any code change — the model's reliability can degrade. This is called distribution drift, and it is one of the most significant and underappreciated risks in deployed AI.

There are two types: covariate drift (the inputs change, but the relationship between inputs and outputs stays the same — for example, image quality improves over time) and concept drift (the relationship between inputs and outputs actually changes — for example, new fraud patterns emerge that differ from historical fraud). Both require monitoring, but concept drift is more dangerous because the model can appear to be running normally while being systematically wrong.
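
The difference between the two types can be illustrated with a deliberately tiny example. The frozen "model", both labeling rules, and the transaction amounts below are all invented. In this toy the model happens to match the original rule exactly, so covariate drift leaves accuracy untouched; with a real, imperfect model, shifted inputs often hurt accuracy too.

```python
from statistics import mean


def frozen_model(amount: float) -> bool:
    """Toy fraud model fixed at training time: flag amounts above 1000."""
    return amount > 1000


def rule_at_training(amount: float) -> bool:
    """The true input-to-label relationship when the model was trained."""
    return amount > 1000


def rule_after_concept_drift(amount: float) -> bool:
    """Hypothetical new reality: tiny test charges are now fraudulent too."""
    return amount > 1000 or amount < 5


def accuracy(amounts, true_rule) -> float:
    return sum(frozen_model(a) == true_rule(a) for a in amounts) / len(amounts)


baseline = [2, 50, 200, 800, 1500, 3000]
shifted = [20, 500, 2000, 4000, 6000, 9000]  # covariate drift: inputs move

print("baseline accuracy:       ", accuracy(baseline, rule_at_training))
print("covariate drift accuracy:", accuracy(shifted, rule_at_training))
print("concept drift accuracy:  ", round(accuracy(baseline, rule_after_concept_drift), 2))
print("input means:", round(mean(baseline), 1), "vs", round(mean(shifted), 1))
```

Under concept drift the inputs are identical to the baseline, so an input-distribution monitor would see nothing unusual while accuracy falls. That is why label-based or proxy-based checks are needed alongside input monitoring.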

Distribution Drift The change over time in the statistical properties of the data a model receives in deployment, relative to the data it was trained on — causing reliability to degrade without any change to the model itself.

Concept Drift A change in the true relationship between inputs and outputs in the real world, making a model's learned mapping progressively incorrect even if the inputs look similar to training data.

What Production Monitoring Actually Requires

Monitoring a deployed AI system is not the same as application performance monitoring. CPU utilization and response latency can be fine while the model is producing systematically wrong outputs. Effective AI monitoring requires tracking signals that indicate model behavior quality, not just system health.

These signals include: input distribution monitoring (are the inputs the model is receiving similar in statistical character to its training data?); output distribution monitoring (are the outputs the model is producing similar in distribution to its calibration period?); proxy metric tracking (in cases where true labels are delayed — like predicting loan default — are early-signal proxies trending correctly?); and slice-level performance sampling (periodic human evaluation of model outputs on subgroup samples to catch degradation that aggregate metrics miss).
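
For the first signal, input distribution monitoring, one widely used statistic is the Population Stability Index (PSI), which compares binned proportions of a feature between a reference sample and live traffic. The sketch below is a plain-Python version under assumptions: the cut points and sample values are invented, both samples are assumed to be non-empty, and empty bins are floored at a small `eps` to keep the logarithm defined.

```python
import bisect
import math


def bin_proportions(values, cut_points, eps=1e-6):
    """Share of values in each bin defined by sorted cut points (open-ended at both ends)."""
    counts = [0] * (len(cut_points) + 1)
    for v in values:
        counts[bisect.bisect_right(cut_points, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(reference, live, cut_points):
    """Population Stability Index: sum over bins of (q - p) * ln(q / p)."""
    p = bin_proportions(reference, cut_points)
    q = bin_proportions(live, cut_points)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Invented values for a single numeric input feature.
reference = [60, 65, 70, 72, 75, 78, 80, 85, 90, 95]
live_same = list(reference)
live_shifted = [v + 20 for v in reference]
cuts = [70, 80, 90]

print(psi(reference, live_same, cuts))     # 0.0
print(psi(reference, live_shifted, cuts))  # large positive value
```

A common rule of thumb treats PSI above roughly 0.25 as a significant shift, but that threshold is a convention rather than a guarantee and should be tuned per feature. PSI only sees the inputs, so it says nothing about whether the labels still mean the same thing.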

Documented Case · Twitter Image Cropping Algorithm · 2020

Twitter's saliency model, which automatically cropped images to highlight the "most interesting" part, was deployed without ongoing behavioral monitoring. In September 2020, users observed that the algorithm consistently cropped away Black faces when images contained both Black and white faces. Twitter's internal investigation, published in 2021, confirmed systematic bias. The model had been deployed for over a year without the bias being caught by any monitoring system. It was caught by user reports. Users, not an evaluation process, caught what pre-deployment testing had missed.

Building the Evaluation Loop

Leading AI teams treat evaluation as a closed loop, not a checkpoint. The loop has four components: pre-deployment evaluation (rigorous testing before launch, using representative data and subgroup analysis); staged rollout (releasing to a small fraction of users first, monitoring closely before expanding); continuous production monitoring (automated signals plus scheduled human review); and retraining triggers (explicit criteria that cause the model to be retrained or retired when performance signals fall below threshold).
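
The last component, retraining triggers, is the easiest to leave implicit, so it helps to write the criteria down as code. The sketch below is one hypothetical shape for such a check. The threshold values, the field names, and the idea of combining slice accuracy with a single input drift score are all assumptions for illustration, not a recommended policy.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    min_slice_accuracy: float = 0.90  # hypothetical value
    max_input_drift: float = 0.25     # hypothetical value


def triggered_reasons(slice_accuracy: dict, input_drift: float, t: Thresholds) -> list:
    """Return human-readable reasons to retrain or review; empty means no action."""
    reasons = []
    for name, acc in sorted(slice_accuracy.items()):
        if acc < t.min_slice_accuracy:
            reasons.append(
                f"slice '{name}' accuracy {acc:.2f} < {t.min_slice_accuracy:.2f}"
            )
    if input_drift > t.max_input_drift:
        reasons.append(
            f"input drift {input_drift:.2f} > {t.max_input_drift:.2f}"
        )
    return reasons


# Invented monitoring snapshot from a periodic human-labeled sample.
snapshot = {"overall": 0.95, "new_region": 0.81}
print(triggered_reasons(snapshot, input_drift=0.10, t=Thresholds()))
# ["slice 'new_region' accuracy 0.81 < 0.90"]
```

Checking each slice separately matters here: the overall figure in the snapshot is above threshold, and only the slice-level check fires.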

Google's model cards framework, introduced in 2019, formalized the practice of documenting evaluation results at launch — including intended use, performance across subgroups, and known limitations — creating accountability for post-deployment monitoring against those baseline claims. Model cards do not guarantee good monitoring, but they create a documented baseline against which drift can be measured.

Core Principle

AI evaluation is a continuous practice, not a one-time certification. Every deployed AI system is an ongoing experiment whose results you are obligated to observe. The infrastructure for observing those results — monitoring, sampling, auditing — is as important as the model itself.

Lesson 4 Quiz

Evaluation as Continuous Practice · 3 questions

In the Epic sepsis model case, what caused the model's performance to degrade after COVID-19 patients entered the hospitals?

Correct. The model trained on pre-pandemic data encountered pandemic patients with systematically different physiological patterns — a textbook distribution drift case that degraded reliability without triggering any system error.

Not quite. No code changed. The model degraded because the patients it was predicting on — with COVID-19-related vital sign patterns — were statistically different from the population it was trained on. That's distribution drift.

What distinguishes concept drift from covariate drift?

Correct. Concept drift is the more dangerous of the two: the world changes in a way that makes the model's learned patterns wrong, even if the inputs look familiar. New fraud patterns look like old legitimate transactions at first.

Not quite. Covariate drift: inputs change, but the input→output relationship stays the same. Concept drift: the underlying relationship between inputs and outputs changes — what used to predict fraud no longer does, because fraud has evolved.

Twitter's image cropping algorithm bias was discovered primarily through:

Correct. The algorithm was live for over a year without any monitoring system catching the bias. Users discovered it themselves and reported it publicly — demonstrating what happens when post-deployment evaluation infrastructure is absent.

Not quite. No monitoring system caught it — users did, after more than a year of deployment. This is a real-world demonstration of why continuous post-deployment evaluation matters: the failure was invisible to Twitter until users made it visible.

Lab 4 · Continuous Evaluation

Conversation lab — complete 3 exchanges to unlock the module test

Your Task

Design a post-deployment monitoring plan for a real AI system. Pick a context — a recommendation algorithm, a medical prediction model, a fraud detection system — and work through what signals you'd monitor, how often you'd sample outputs for human review, and what thresholds would trigger retraining.

Try: "Help me design a monitoring plan for a loan approval AI that uses a model trained two years ago." — or — "What are the practical limits of automated drift detection, and when does human review become necessary?"

AI Testing Lab

Production Monitoring

Welcome to Lab 4. We're designing evaluation systems that work in production — not just at launch. Tell me about an AI system you want to monitor, or start with a question about drift detection, monitoring signals, or when to retrain. What would you like to work through?

Module 1 Test

Why AI Testing Is Different · 15 questions · Pass at 80%

1. What is the "oracle problem" in AI testing?

Correct.

The oracle problem is about verifying outputs — there's no reliable ground-truth function that tells you whether an AI output is correct without human judgment.

2. Meta's Galactica model was taken offline three days after launch primarily because:

Correct.

Galactica produced fluent, grammatically correct text that was factually fabricated — including fake citations. No test caught it because there was no specification against "plausible-sounding confabulation."

3. Why does non-determinism in AI systems break traditional regression testing?

Correct.

Non-determinism means identical prompts can produce different outputs, so a test that checks for one specific output can fail even when the model is behaving correctly and pass even when it isn't.

4. Amazon's recruiting tool, scrapped in 2018, demonstrated which type of AI failure?

Correct.

The tool produced outputs without errors. Its failure — penalizing resumes mentioning "women's" organizations — was an emergent bias caught only by auditing outcomes, not by any functional test.

5. Goodhart's Law, applied to AI benchmarks, predicts that:

Correct.

Goodhart's Law: when a measure becomes a target, optimization pressure corrupts its relationship to the underlying construct. The benchmark score rises; the capability it was supposed to measure may not.

6. Benchmark contamination occurs when:

Correct.

Contamination means the test set leaked into training — scores reflect the model having seen the answers, not genuine reasoning capability.

7. The Stanford CheXNet chest X-ray model achieved high benchmark accuracy but failed to generalize across institutions because:

Correct.

The failure was distribution mismatch — the benchmark's controlled conditions (specific scanner, protocol, patient population) didn't predict performance across hospitals with different equipment and demographics.

8. The Gender Shades project found error rate gaps of up to 34 percentage points. What caused vendors' published accuracy figures to miss this?

Correct. Aggregate accuracy on a skewed test set hid catastrophic subgroup failure — the classic aggregate accuracy masking problem.

The issue was aggregate accuracy masking: test sets that were not representative of darker skin tones made aggregate scores meaningless as predictors of performance on those groups.

9. Why is framing AI failures as "edge cases" potentially problematic?

Correct.

The problem is perspective: "edge" implies rarity from the majority's viewpoint, but the same failure can be routine from an affected minority group's viewpoint. The label creates false permission to deprioritize real failures.

10. The COMPAS recidivism algorithm case illustrates which evaluation failure?

Correct.

COMPAS's designers measured aggregate accuracy. They did not evaluate subgroup-specific false positive rates — which revealed that Black defendants were flagged as high-risk incorrectly at nearly twice the rate of white defendants.

11. What is the CheckList framework's key contribution to AI behavioral testing?

Correct.

CheckList brought software testing discipline to NLP: structured behavioral tests that check minimum functionality, invariance (outputs stable under irrelevant changes), and directional expectations (outputs change predictably under relevant changes).

12. Distribution drift refers to:

Correct.

Distribution drift: the world changes, the model doesn't, and reliability degrades without any system error being raised. The Epic sepsis model is the canonical example.

13. The Epic sepsis model's pandemic-era degradation was most dangerous because:

Correct. Silent degradation is the most dangerous failure mode: the system appears operational while producing unreliable outputs clinicians may trust.

The model kept running and producing scores. Nothing flagged the degradation. Clinicians received predictions of unknown validity with no warning — the canonical danger of AI systems that fail silently.

14. Twitter's image cropping algorithm bias operated undetected for over a year because:

Correct.

The absence of post-deployment behavioral monitoring meant users discovered the bias before Twitter did — a direct consequence of treating deployment as the end of evaluation rather than the beginning of ongoing monitoring.

15. Google's model cards framework, introduced in 2019, primarily contributes to evaluation by:

Correct.

Model cards create documented accountability: they record what evaluation found at launch, creating a baseline that makes subsequent degradation or newly discovered failures measurable against a known reference point.