In November 2022, Meta released Galactica — a language model trained on 48 million scientific papers. Researchers called it extraordinary. Three days later, it was gone. The model had been generating citations to papers that didn't exist, summarizing studies with invented results, producing chemistry that sounded right but wasn't. Every sentence was fluent. No test had caught any of it, because no test had been written to ask: is this true?
This is the central challenge of AI evaluation. Conventional software testing is built on specifications — a function receives inputs, produces outputs, and you check whether those outputs match what you defined. AI systems don't work that way. They produce outputs that are grammatically correct, contextually coherent, and frequently wrong in ways that are invisible to any automated check that doesn't already know what "wrong" looks like.
This course teaches you to close that gap. You'll learn why AI breaks the assumptions baked into software testing, how benchmark scores mislead teams into shipping systems that fail in the wild, where failures actually concentrate (and why those are rarely true edge cases), and how to build monitoring that catches degradation before your users do. The goal isn't to be skeptical of AI — it's to be rigorous about it.
When Meta released Galactica, a large language model trained on 48 million scientific papers, researchers initially praised its fluency. Three days later, Meta pulled it offline. The model had been confidently generating citations to papers that did not exist, summarizing studies with fabricated results, and producing plausible-sounding but factually wrong chemistry. Every output was grammatically flawless. No conventional software test had caught any of it, because the model had no specification to violate: it produced what looked like valid text in every case.
Classical software testing rests on a simple assumption: for a given input, there is a correct output you can specify in advance. A function that computes sales tax either returns 8.875% of $40.00 or it does not. You write the assertion. You run the test. You get a binary result.
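The sales-tax case can be written out as a minimal sketch. The function name `sales_tax` and the rounding rule (half-up to the nearest cent) are assumptions made for illustration; the point is only that the expected value is known before the code runs.

```python
from decimal import Decimal, ROUND_HALF_UP


def sales_tax(amount: Decimal, rate: Decimal = Decimal("0.08875")) -> Decimal:
    """Return the tax owed on `amount`, rounded half-up to the nearest cent."""
    return (amount * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# The specification exists in advance, so the whole test is one assertion.
assert sales_tax(Decimal("40.00")) == Decimal("3.55")
```

Nothing comparable can be written for "good summary": there is no single expected value to put on the right-hand side of the assertion.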
AI systems — particularly large language models and machine-learning pipelines — shatter this assumption. The output space is effectively infinite. A question like "Summarize this contract" has thousands of acceptable responses and thousands of unacceptable ones, and the boundary between them is neither sharp nor stable across contexts. You cannot write an assertion for "good summary."
This is called the oracle problem: in AI testing, there is often no reliable oracle — no ground-truth function you can call to verify whether the output is correct.
1. Non-determinism. The same prompt, sent twice, can produce different outputs. Sampling settings such as temperature introduce randomness, so traditional regression testing (re-running the same inputs and expecting identical outputs) fails as a reliability signal. Galactica would produce a different fake citation each time.
2. Emergent failure modes. AI failures are often not crashes or exceptions. They are subtle: a medical chatbot that gives mostly correct advice but systematically under-refers women for cardiac events; a code-completion model that introduces SQL injection vulnerabilities at a 3% rate; a translation model that subtly shifts the sentiment of diplomatic documents. These failures are invisible unless you specifically look for them.
3. Distribution shift. A model trained on data from one period or population can degrade invisibly when deployed on a different population. The model still "runs" — it still produces confident outputs — but its reliability has collapsed. Traditional software does not have this property: a sorting algorithm does not become wrong because the cultural context of the data changed.
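One common response to non-determinism (item 1 above) is to stop asserting on exact output and instead measure how often a property holds across repeated samples. The sketch below assumes a hypothetical `fake_model` standing in for a real sampled model and a deliberately crude property check; neither comes from any particular library.

```python
import random
import re
from typing import Callable


def pass_rate(generate: Callable[[str], str], prompt: str,
              check: Callable[[str], bool], n: int = 50) -> float:
    """Call a non-deterministic generator n times and return the
    fraction of outputs that satisfy `check`."""
    return sum(check(generate(prompt)) for _ in range(n)) / n


# Hypothetical stand-in for a sampled model (seeded so runs are repeatable).
_rng = random.Random(0)


def fake_model(prompt: str) -> str:
    return _rng.choice([
        "See Smith et al. (2019).",
        "See Lee and Park (2021).",
        "No supporting citation found.",
    ])


def cites_a_year(text: str) -> bool:
    """Crude property: the output contains a parenthesized four-digit year."""
    return re.search(r"\(\d{4}\)", text) is not None


rate = pass_rate(fake_model, "Cite a source.", cites_a_year, n=200)
```

The result is a rate rather than a pass/fail verdict, so the team still has to decide what rate is acceptable for the use case. A property like `cites_a_year` also says nothing about whether the cited paper exists, which is exactly the oracle problem.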
Amazon's machine-learning hiring tool, trained on a decade of resumes, learned to penalize resumes containing the word "women's" — as in "women's chess club." The system passed all its unit tests. It returned ranked candidate lists without errors. The discriminatory pattern only surfaced during an internal audit comparing outcomes by gender. Amazon scrapped the tool in 2018 after Reuters reported the findings. No crash. No exception. No failed assertion.
When we test software, correctness is defined by a specification written by humans before the code was built. When we train AI systems, the "specification" is implicit — embedded in the training data, the loss function, and the choices made during fine-tuning. Nobody writes down "do not hallucinate." Nobody writes "do not encode historical hiring bias." The system learns proxies for human intent, and those proxies are imperfect.
This means AI testing must itself define what correctness looks like, a task that is simultaneously philosophical, empirical, and domain-specific. A response that is correct for a general-purpose assistant may be dangerously incomplete for a clinical decision-support tool. Testing strategy must begin by asking: what does this system need to do, and what must it absolutely never do?
AI evaluation is not a phase that happens after development. It is a continuous discipline that begins before a model is trained — at the moment you decide what it is for — and continues indefinitely after deployment.
You're going to stress-test the idea that AI outputs can't be verified with simple pass/fail checks. Ask the AI assistant about real scenarios where standard software tests would miss AI failures — or challenge it with cases where you think traditional testing might actually work.
In 2023, researchers at the University of California, Berkeley demonstrated that GPT-4's performance on the bar exam, often cited as evidence of legal reasoning, relied heavily on the exam's multiple-choice format. When the same legal scenarios were presented in open-ended form, performance dropped substantially. The benchmark score was real. The underlying capability it was taken to imply was not. Format dependence is one way a score can overstate capability. Another has its own name: benchmark contamination, in which training data includes the test set's questions or answers, producing scores that measure memorization, not generalization.
Benchmarks are created to measure something specific: reading comprehension, mathematical reasoning, commonsense inference. Within months of release, competitive model developers train on data that overlaps with the benchmark — sometimes inadvertently, because the internet contains discussions of the benchmark's questions; sometimes deliberately. The benchmark becomes a target rather than a measurement.
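A rough way to look for the overlap described above is to check how many of a benchmark item's word n-grams also appear in a sample of training text. This is a minimal sketch, assuming simple lowercase word tokenization and a hypothetical in-memory list of documents; real contamination checks run against far larger corpora and also have to handle paraphrase, which this does not.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercase word n-grams, ignoring punctuation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur somewhere in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Illustrative strings only; not drawn from any real benchmark.
question = "What is the capital of Australia and when was it founded"
leaked = "Forum post: what is the capital of Australia and when was it founded? Answer below."
clean = "A recipe for cooking pasta at home with fresh basil and garlic."

leaked_score = overlap_fraction(question, [leaked], n=5)
clean_score = overlap_fraction(question, [clean], n=5)
```

A high overlap fraction is a reason to investigate, not proof of memorization; a low one does not rule out contamination through rephrased or translated copies.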
This is Goodhart's Law applied to AI: when a measure becomes a target, it ceases to be a good measure. By the time MMLU (Massive Multitask Language Understanding) was saturated — with top models scoring above 90% — researchers had already documented that high-scoring models failed on trivially rephrased versions of the same questions.
Even a genuinely uncontaminated benchmark only measures what its creators thought to measure. Medical AI benchmarks typically test diagnosis given a formatted clinical vignette. Real clinical use involves unstructured notes, abbreviations, physician shorthand, multi-morbidity, and social context. The benchmark measures a proxy for the real task, not the task itself.
A 2021 study published in Nature Medicine found that AI diagnostic models evaluated on curated benchmark datasets showed dramatic performance drops when evaluated on data from a different hospital — even within the same country, using the same imaging modality. The benchmark performance was real and reproducible. The deployment performance was not. The gap was caused by differences in scanner hardware, imaging protocols, and patient demographics that the benchmark did not represent.
Stanford's CheXNet model, trained and evaluated on the ChestX-ray14 dataset, achieved radiologist-level performance on that benchmark. Independent evaluation by external researchers, published in PLOS Medicine, found that the model's advantage disappeared when tested on different patient populations and X-ray equipment. The benchmark was valid within its own distribution. It predicted almost nothing about real-world generalization across institutions.
Effective AI evaluation uses benchmarks as one signal among many — not as a verdict. The complementary methods include: held-out test sets from the deployment distribution (data that matches the actual population and context where the model will be used); adversarial probing (deliberately crafted inputs designed to find failure modes); behavioral testing (checking model behavior across systematically varied inputs, not just a fixed test set); and ongoing monitoring in production (treating deployment as a continuation of evaluation, not its end).
The HELM (Holistic Evaluation of Language Models) framework, released in 2022 by Stanford's Center for Research on Foundation Models, was a direct response to single-benchmark gaming. It evaluated models across 42 scenarios and 7 metrics simultaneously, making it much harder to optimize for the evaluation itself without genuinely improving across all dimensions.
Benchmark scores answer a narrow question: how does this model perform on these specific items under these conditions? The question that matters for deployment is completely different: how does this model perform on real tasks, for real users, in real conditions? Closing that gap requires evaluation work that benchmarks cannot do alone.
Probe the limits of benchmarks. Ask the assistant about specific benchmarks, how contamination could happen in practice, or what a genuinely deployment-predictive evaluation would look like for a use case you care about.
In June 2015, Google Photos launched its auto-labeling feature. Within days, a Black software developer named Jacky Alciné noticed that the system had labeled photos of him and a friend as "gorillas." Google's image classifier had achieved high average accuracy across its test set, but its test set had not adequately represented darker-skinned faces. The "high average accuracy" masked catastrophic failure on a specific demographic. Google's response was to disable the "gorilla" label entirely rather than fix the underlying classification, and that remained the workaround eight years later, as The New York Times reported in 2023.
Aggregate accuracy metrics are a weighted average across all test cases. If your test set is 90% majority-group examples, a model that fails completely on the minority 10% can still achieve 90% overall accuracy — and look excellent on every standard report. This is not a theoretical concern. It is the mechanism behind nearly every documented case of AI demographic disparity.
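The arithmetic in the paragraph above can be reproduced directly. The sketch below uses synthetic records that mirror the 90/10 split; the group names and the helper `accuracy_by_group` are illustrative, not taken from any evaluation library.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, is_correct) pairs.
    Returns (overall_accuracy, {group: accuracy})."""
    totals, correct = defaultdict(int), defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        correct[group] += int(ok)
    overall = sum(correct.values()) / sum(totals.values())
    per_group = {g: correct[g] / totals[g] for g in totals}
    return overall, per_group


# Synthetic test set: 900 majority-group cases (all correct),
# 100 minority-group cases (all wrong).
records = [("majority", True)] * 900 + [("minority", False)] * 100
overall, per_group = accuracy_by_group(records)
# overall is 0.9; per_group is {"majority": 1.0, "minority": 0.0}
```

The headline number and the per-group numbers come from the same predictions. Only the second view shows that one group receives no correct outputs at all.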
A 2018 study by MIT Media Lab researcher Joy Buolamwini and Timnit Gebru, the "Gender Shades" project, evaluated commercial facial analysis systems from IBM, Microsoft, and Face++ against a benchmark they constructed to include darker-skinned faces. Error rates for darker-skinned women were up to 34 percentage points higher than for lighter-skinned men. The vendors' published accuracy figures had used test sets that underrepresented darker skin tones, making their averages meaningless as a predictor of real-world performance across the full population.
Testing teams often frame subgroup failures as "edge cases" — implying rarity and acceptable risk. But a failure that affects 13% of the U.S. population is not an edge case for the people in that 13%. The framing of edge cases reflects who was in the room when the test set was designed, not the actual frequency of the scenario in deployment.
The COMPAS recidivism algorithm, used by courts in several U.S. states to predict the likelihood of reoffending, was shown by ProPublica in 2016 to produce systematically different false positive rates by race — Black defendants were nearly twice as likely to be incorrectly flagged as high-risk. The algorithm's designers measured aggregate predictive accuracy. They did not measure subgroup-specific false positive rates. The failure was invisible in their evaluation framework.
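The metric that was missing from that evaluation, the false positive rate computed separately per group, takes only a few lines. The data below is synthetic and chosen for illustration; it does not reproduce ProPublica's figures, and the group labels "A" and "B" are placeholders.

```python
def false_positive_rate_by_group(rows):
    """rows: iterable of (group, predicted_high_risk, reoffended) tuples.
    Returns {group: false positive rate} among people who did not reoffend."""
    false_pos, negatives = {}, {}
    for group, predicted, actual in rows:
        if not actual:  # only non-reoffenders can be false positives
            negatives[group] = negatives.get(group, 0) + 1
            false_pos[group] = false_pos.get(group, 0) + int(predicted)
    return {g: false_pos[g] / negatives[g] for g in negatives}


# Synthetic rows: each group has 100 people who did not reoffend.
rows = (
    [("A", True, False)] * 40 + [("A", False, False)] * 60 +
    [("B", True, False)] * 20 + [("B", False, False)] * 80 +
    [("A", True, True)] * 50 + [("B", True, True)] * 50  # true positives, ignored by FPR
)
fpr = false_positive_rate_by_group(rows)
# fpr is {"A": 0.4, "B": 0.2}: group A is wrongly flagged twice as often.
```

Which error rate to disaggregate (false positives, false negatives, calibration) depends on who bears the cost of each kind of mistake, so the choice belongs in the evaluation design rather than being left to a default.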
A study published in Annals of Oncology evaluated a deep-learning skin cancer detection system against 58 dermatologists. The AI matched or exceeded dermatologist performance on the benchmark dataset. A follow-up analysis found that the training and test data contained overwhelmingly light-skinned examples. The AI's performance on darker skin tones was substantially lower, a failure that the original evaluation, using aggregate metrics on a non-representative dataset, could not detect.
Effective AI testing explicitly designs for subgroup analysis from the start. This means stratifying test sets by relevant demographic variables (age, gender, race, dialect, geography, device type, literacy level) and reporting disaggregated metrics — not just aggregate accuracy. It means conducting adversarial audits: actively constructing inputs designed to surface failures, not just inputs that represent average usage.
The practice of behavioral testing, popularized by Marco Tulio Ribeiro and colleagues with the CheckList framework in 2020, applies software testing principles to NLP models through minimum functionality tests, invariance tests, and directional expectation tests. The key insight: you need to check that the model behaves correctly on targeted, deliberately constructed cases, not just that it scores well on typical ones.
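An invariance test can be sketched without any framework: fill one template with values that should not affect the prediction, then report every case where the prediction changes. This is an illustration of the idea only, not the CheckList library's API; `toy_sentiment` is a hypothetical model with a deliberately planted spurious rule.

```python
from typing import Callable


def invariance_failures(predict: Callable[[str], str], template: str,
                        fillers: list[str]) -> list[tuple[str, str]]:
    """Fill `template` (which contains a `{slot}` placeholder) with each filler.
    Return (input, prediction) pairs whose prediction differs from the
    prediction for the first filler."""
    inputs = [template.format(slot=f) for f in fillers]
    predictions = [predict(text) for text in inputs]
    baseline = predictions[0]
    return [(t, p) for t, p in zip(inputs[1:], predictions[1:]) if p != baseline]


def toy_sentiment(text: str) -> str:
    # Hypothetical model with a planted spurious correlation on one city name.
    if "Lagos" in text:
        return "negative"
    return "positive" if "comfortable" in text else "negative"


failures = invariance_failures(
    toy_sentiment,
    "The flight to {slot} was comfortable.",
    ["Paris", "Lagos", "Tokyo"],
)
# failures is [("The flight to Lagos was comfortable.", "negative")]
```

The toy model would score well on a typical sentiment test set that rarely mentions the affected city. The invariance test finds the problem because it varies exactly one thing and holds everything else fixed.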
The people most likely to experience AI failure are often the people least represented in the test data. Rigorous AI testing demands disaggregated evaluation: you must measure performance separately for every user group that matters, not just report the average and move on.
Design a subgroup testing strategy. Pick any AI application — hiring, medical diagnosis, content moderation, translation, credit scoring — and work with the assistant to figure out which subgroups need separate evaluation and what metrics matter for each.
In early 2020, Epic Systems deployed a sepsis prediction model — trained before the COVID-19 pandemic — across hospital systems nationwide. A study published in JAMA Internal Medicine in 2021 found that after deployment, the model's performance deteriorated substantially in hospitals where COVID-19 patients were present. The model had been trained on pre-pandemic patients; pandemic-era patients had different vital sign patterns, different lab result distributions, and different treatment histories. The model's score still ran. It still produced predictions. Nothing in the system flagged the degradation. Clinicians relying on the score were receiving predictions of unknown validity. The failure was invisible until researchers specifically went looking for it.
AI models are trained on a snapshot of the world. The world does not stay still. When the population using the system, the data it receives, or the context it operates in changes — even without any code change — the model's reliability can degrade. This is called distribution drift, and it is one of the most significant and underappreciated risks in deployed AI.
There are two types: covariate drift (the inputs change, but the relationship between inputs and outputs stays the same — for example, image quality improves over time) and concept drift (the relationship between inputs and outputs actually changes — for example, new fraud patterns emerge that differ from historical fraud). Both require monitoring, but concept drift is more dangerous because the model can appear to be running normally while being systematically wrong.
Monitoring a deployed AI system is not the same as application performance monitoring. CPU utilization and response latency can be fine while the model is producing systematically wrong outputs. Effective AI monitoring requires tracking signals that indicate model behavior quality, not just system health.
These signals include: input distribution monitoring (are the inputs the model is receiving similar in statistical character to its training data?); output distribution monitoring (are the outputs the model is producing similar in distribution to its calibration period?); proxy metric tracking (in cases where true labels are delayed — like predicting loan default — are early-signal proxies trending correctly?); and slice-level performance sampling (periodic human evaluation of model outputs on subgroup samples to catch degradation that aggregate metrics miss).
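Input distribution monitoring is often implemented with a summary statistic comparing a reference sample to recent production inputs. The sketch below uses the population stability index (PSI) on a single numeric feature, which is one option among several and not something the text above prescribes. The equal-width binning, the smoothing constant, and the sample data are all assumptions made for illustration.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population stability index between a reference sample (`expected`)
    and a live sample (`actual`), using equal-width bins over the
    reference range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width if all values are equal

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Add-0.5 smoothing so empty bins do not produce log(0).
        return [(c + 0.5) / (len(values) + 0.5 * bins) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]        # spread across 0.00 to 0.99
shifted = [0.5 + i / 200 for i in range(100)]    # concentrated in the upper half

no_drift = psi(reference, reference)
drift = psi(reference, shifted)
```

A frequently quoted rule of thumb treats PSI above roughly 0.25 as a substantial shift, but that cutoff is a convention rather than a guarantee, and a score like this only detects changes in the inputs. Concept drift can leave the input distribution untouched, which is why the proxy metrics and human-reviewed samples listed above are still needed.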
Twitter's saliency model, which automatically cropped images to highlight the "most interesting" part, was deployed without ongoing behavioral monitoring. In September 2020, users observed that the algorithm consistently cropped away Black faces when images contained both Black and white faces. Twitter's internal investigation, published in 2021, confirmed systematic bias. The model had been deployed for over a year without the bias being caught by any monitoring system. It was caught by user reports: the people using the product found what both pre-deployment testing and production monitoring had missed.
Leading AI teams treat evaluation as a closed loop, not a checkpoint. The loop has four components: pre-deployment evaluation (rigorous testing before launch, using representative data and subgroup analysis); staged rollout (releasing to a small fraction of users first, monitoring closely before expanding); continuous production monitoring (automated signals plus scheduled human review); and retraining triggers (explicit criteria that cause the model to be retrained or retired when performance signals fall below threshold).
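The last component, explicit retraining triggers, works best when the criteria are written down as data rather than left to judgment in the moment. A minimal sketch follows; the metric names and threshold values are hypothetical placeholders that each team would set from its own pre-deployment baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    metric: str
    threshold: float
    higher_is_better: bool = True

    def breached(self, value: float) -> bool:
        """True when the observed value is on the wrong side of the threshold."""
        return value < self.threshold if self.higher_is_better else value > self.threshold


def breached_triggers(triggers: list[Trigger], metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that violate their trigger.
    A metric missing from the report counts as breached, so that a
    silent monitoring gap is never read as 'healthy'."""
    out = []
    for t in triggers:
        value = metrics.get(t.metric)
        if value is None or t.breached(value):
            out.append(t.metric)
    return out


# Hypothetical thresholds and a hypothetical weekly monitoring report.
triggers = [
    Trigger("overall_accuracy", 0.90),
    Trigger("minority_slice_accuracy", 0.85),
    Trigger("input_psi", 0.25, higher_is_better=False),
]
weekly = {"overall_accuracy": 0.93, "minority_slice_accuracy": 0.81, "input_psi": 0.12}
alerts = breached_triggers(triggers, weekly)
# alerts is ["minority_slice_accuracy"]
```

In this example the aggregate metric looks healthy while a slice-level metric has crossed its threshold, which is the situation the closed loop exists to catch. Treating a missing metric as a breach is a design choice; some teams may prefer a separate alert for monitoring gaps.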
Google's model cards framework, introduced in 2019, formalized the practice of documenting evaluation results at launch — including intended use, performance across subgroups, and known limitations — creating accountability for post-deployment monitoring against those baseline claims. Model cards do not guarantee good monitoring, but they create a documented baseline against which drift can be measured.
AI evaluation is a continuous practice, not a one-time certification. Every deployed AI system is an ongoing experiment whose results you are obligated to observe. The infrastructure for observing those results — monitoring, sampling, auditing — is as important as the model itself.
Design a post-deployment monitoring plan for a real AI system. Pick a context — a recommendation algorithm, a medical prediction model, a fraud detection system — and work through what signals you'd monitor, how often you'd sample outputs for human review, and what thresholds would trigger retraining.