Module 4 · Lesson 1

Hallucination: When AI Invents Reality

Confident, fluent, and completely made up

Why does an AI that "knows" so much fabricate facts it couldn't possibly have looked up?

In 2023, a New York attorney named Steven Schwartz filed a legal brief in federal court. He had used ChatGPT to research case citations. The brief cited six cases (Varghese v. China Southern Airlines, Martinez v. Delta Air Lines, and four others), each with a specific court name, date, and ruling summary. All six were fabricated; the cases had never existed. When the court demanded copies, Schwartz asked ChatGPT whether the citations were real. The AI said yes. The court sanctioned Schwartz, his co-counsel, and their firm with a $5,000 fine.

What Hallucination Actually Is

The word "hallucination" in AI doesn't mean the model is confused or malfunctioning. It means the model generated text that sounds authoritative but has no grounding in real fact. This happens because of how language models work at a fundamental level.

A language model doesn't store facts like a database. It learns statistical patterns: which words tend to follow which other words, in what combinations, and with what probabilities. When you ask it a question, it generates a response that is statistically likely to follow your prompt, not one that has been checked for factual accuracy. Truth is not the training objective. Plausible text completion is.

When the model encounters a topic where its training data was thin, contradictory, or absent, it often does not say "I don't know." Instead it generates the most statistically coherent continuation, which can look exactly like a real citation, a real person's biography, or a real scientific study.
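A minimal sketch of this behaviour, assuming a toy next-token table. Every name, reporter number, and probability below is invented for illustration; a real model learns far richer patterns, but the point carries over: the generator only follows likelihoods and never checks whether the result exists.

```python
import random

# Hypothetical next-token table (all entries invented for illustration).
# Each token maps to its possible continuations and their probabilities.
NEXT_TOKEN = {
    "<start>": [("Smith", 0.5), ("Garcia", 0.5)],
    "Smith": [("v.", 1.0)],
    "Garcia": [("v.", 1.0)],
    "v.": [("Acme Airlines,", 0.6), ("Northern Rail,", 0.4)],
    "Acme Airlines,": [("512 F.3d 88", 1.0)],
    "Northern Rail,": [("512 F.3d 88", 1.0)],
    "512 F.3d 88": [("(2d Cir.", 1.0)],
    "(2d Cir.": [("2011)", 0.5), ("2014)", 0.5)],
}


def generate(seed: int) -> str:
    """Sample a citation-shaped string by following next-token probabilities."""
    rng = random.Random(seed)
    token, output = "<start>", []
    # Keep sampling until we reach a token with no listed continuation.
    while token in NEXT_TOKEN:
        choices, weights = zip(*NEXT_TOKEN[token])
        token = rng.choices(choices, weights=weights, k=1)[0]
        output.append(token)
    return " ".join(output)


# Nothing in generate() consults a list of real cases: the output is
# well-formed because the format is learned, not because the case exists.
print(generate(seed=0))
```

Every run prints something shaped like a citation. Whether it corresponds to a real case is a question the generator has no way to ask.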

Why It Sounds So Confident

The confident tone of language model output is not a signal of accuracy. The same mechanism that makes accurate text fluent also makes fabricated text fluent. There's no internal "uncertainty meter" attached to the output. A hallucinated legal case sounds exactly as polished as a real one.

Documented Hallucination Cases

Law — 2023

Mata v. Avianca

ChatGPT fabricated six case citations that were filed in the Southern District of New York (SDNY). Judge Castel fined the attorneys $5,000 and required them to notify the judges who had been falsely named as authors of the non-existent opinions cited in their brief. Source: SDNY Order, June 2023.

Academia — 2023

AI-Generated References in Published Papers

A 2023 Nature analysis found hundreds of peer-reviewed papers containing references that don't exist — generated by AI tools and inserted without verification by researchers under publication pressure. Source: Nature, 2023.

Journalism — 2023

CNET's AI Article Errors

CNET published dozens of AI-written financial explainers. A review by Futurism found that more than half contained factual errors — incorrect interest calculations, wrong historical dates, fabricated regulatory details. CNET had to correct or retract many articles. Source: Futurism, January 2023.

The Training Data Gap Problem

Models hallucinate most often in three situations: when asked about very recent events after their training cutoff, when asked about niche topics with little training data, and when asked to produce specific formatted outputs like citations, references, or code where structure matters more than truth.

The legal citations failure is a perfect example of the third type. The model had seen thousands of legal briefs. It knew what a citation looks like, what a case name sounds like, what ruling language sounds like. When asked to generate citations, it generated statistically plausible legal citation text — complete with credible-sounding names, years, and courts.

Confabulation: Filling gaps in knowledge with plausible-sounding but invented information. The term comes from neurology, where the same phenomenon is seen in certain conditions in humans, and it is often used as a more precise name for what a model does when it hallucinates.

Grounding: Connecting model outputs to verifiable external sources. Retrieval-Augmented Generation (RAG) is one technique to improve grounding by fetching real documents at query time.
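One simple form of grounding is to check every model-produced citation against a trusted source before using it. The sketch below assumes a hypothetical in-memory index (`TRUSTED_INDEX`) standing in for a real court database or document store. It is a lookup check, not a full RAG pipeline.

```python
# Hypothetical trusted index. In practice this would be a court database
# or document store queried at request time.
TRUSTED_INDEX = {
    "mata v. avianca": "SDNY, 2023",
}


def check_citations(citations):
    """Split model-produced citations into verified and unverified lists."""
    verified, unverified = [], []
    for name in citations:
        # Normalise case and whitespace before looking the name up.
        key = " ".join(name.lower().split())
        if key in TRUSTED_INDEX:
            verified.append(name)
        else:
            unverified.append(name)
    return verified, unverified


verified, unverified = check_citations(
    ["Mata v. Avianca", "Varghese v. China Southern Airlines"]
)
print("verified:", verified)
print("needs human review:", unverified)
```

Anything that lands in the unverified list is treated as unconfirmed until a person finds the source, which is the step that was skipped in the Mata brief.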

The Core Insight

Hallucination is not a bug that can simply be patched out. It is a structural property of how language models generate text. Understanding this changes how you use AI: you verify claims, you do not rely on AI for high-stakes citation work without external verification, and you treat fluency as completely separate from accuracy.

Lesson 1 Quiz

Hallucination — test your understanding

In the Mata v. Avianca case, what was the primary failure?

Correct. ChatGPT fabricated six entirely fictional case citations — names, courts, dates — none of which existed. The attorney filed them in federal court without verification.

Not quite. The AI didn't search the web — it generated statistically plausible-sounding legal citations from its training data patterns, and every one was invented.

Why does a language model produce hallucinated content that sounds confident?

Exactly right. The mechanism that makes text fluent and confident is the same one that makes hallucinated text fluent and confident. Truth verification is not part of the generation process.

That's a common misconception. The model has no internal "uncertainty flag" that makes text sound hesitant when it is unreliable. Confidence is simply what fluent, grammatically coherent text looks like, whether it's true or not.

Which situation makes a language model most likely to hallucinate?

Right. Specific structured outputs (citations, code, statistics) on thin-data topics are the highest-risk combination. The model knows what the format should look like and fills in plausible content.

General, well-documented questions with lots of training data are actually lower-risk. The danger is specific, structured outputs where the model fills format-shaped gaps with invented content.

Lab 1: Catch a Hallucination

Use the AI assistant to explore how and why hallucinations occur

Your Mission

Your lab partner has read about AI hallucinations and wants to understand the mechanics. Ask it questions about how hallucinations happen, why the Mata v. Avianca case matters, what grounding means, or how you could detect hallucinated content in practice. Have at least 3 exchanges to complete the lab.

Try asking: "What's the difference between a hallucination and a lie?" — or — "Why did ChatGPT confirm its own fake citations when the lawyer asked again?" — or — "How would RAG help prevent what happened in the Mata case?"

Welcome to Lab 1. I'm here to help you understand AI hallucination — what causes it, famous cases where it went wrong, and how to detect or mitigate it. What would you like to explore?

Module 4 · Lesson 2

Bias Baked In: What Training Data Teaches

The prejudices in the internet became the prejudices of the model

If an AI learned from human-generated data, did it also learn human prejudices — and how badly does that matter?

In 2015, Google Photos' image recognition tagged photos of Black people as "gorillas." Google apologized and issued a fix — but the fix was to remove the gorilla category entirely from the classifier, rather than actually solve the underlying bias. In 2018, a Wired investigation confirmed the labels "gorilla," "chimp," and "chimpanzee" were still blocked in Google Photos. The bias had not been corrected. It had been hidden.

Where AI Bias Comes From

Machine learning models learn from data. If that data reflects historical inequalities, underrepresentation, or stereotyped associations, the model will reproduce them — sometimes amplifying them. This is not a values failure by the engineers; it is a mathematical consequence of learning from biased distributions.

There are three primary bias entry points: training data bias (the dataset overrepresents certain groups), label bias (humans who labelled training examples applied stereotyped judgments), and measurement bias (the metric used to evaluate the model favours certain groups over others).

Hiring — 2018

Amazon's Recruiting AI

Starting in 2014, Amazon built a machine learning tool to screen resumes. It was trained on 10 years of resumes submitted to the company, most of them from men. It systematically downgraded resumes containing the word "women's" (as in "women's chess club") and penalised graduates of all-women's colleges. Amazon scrapped the tool. Source: Reuters, October 2018.

Healthcare — 2019

Racial Bias in Patient Prioritization

A widely used healthcare algorithm used historical healthcare cost as a proxy for health need. Because Black patients historically received less care, and so generated lower costs, the algorithm consistently ranked them as needing less intervention than equally sick white patients. Algorithms of this kind are applied to roughly 200 million people in the United States each year. Source: Obermeyer et al., Science, 2019.

Criminal Justice — 2016

COMPAS Recidivism Scores

ProPublica analysed the COMPAS algorithm used by courts to predict reoffending risk. Black defendants were nearly twice as likely to be falsely flagged as high risk compared to white defendants. White defendants were more often incorrectly labelled low risk. Source: ProPublica, May 2016.

Amplification, Not Just Reflection

Research led by the University of Virginia (Zhao et al., 2017) showed that visual recognition models trained on datasets such as imSitu and MS-COCO didn't just reflect gender stereotypes from the data; they amplified them. In the training data, 33% of cooking images showed men, yet the trained model identified the cook as a woman 84% of the time, up from 67% in the data. Models can become more biased than their training data through the optimization process itself.

This happens because gradient descent finds the path of least prediction error. Stereotypes are statistically reliable shortcuts. The model learns to use them because, mathematically, they reduce average error — even while causing catastrophic errors for individuals who don't fit the stereotype.
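The arithmetic behind amplification can be sketched with a toy dataset. The 33% and 84% figures come from the lesson; the "always guess the majority" predictor is a deliberately extreme, hypothetical shortcut used only to show why a stereotype can lower average error while being wrong for everyone who doesn't fit it.

```python
def share(items, value):
    """Fraction of items equal to value."""
    return sum(1 for item in items if item == value) / len(items)


# Toy training set matching the lesson's figure: 33% of cooks are men.
labels = ["woman"] * 67 + ["man"] * 33

# Hypothetical shortcut predictor: always guess the majority label.
predictions = ["woman"] * len(labels)

train_share = share(labels, "woman")            # skew in the data
predicted_share = share(predictions, "woman")   # skew in the predictions
accuracy = share([p == y for p, y in zip(predictions, labels)], True)

# The shortcut is right 67% of the time overall, yet wrong for every man.
print(f"data skew: {train_share:.2f}, prediction skew: {predicted_share:.2f}")
print(f"overall accuracy: {accuracy:.2f}")

# Amplification reported in the lesson: 84% predicted vs 67% in the data.
reported_gap = 0.84 - train_share
print(f"reported amplification: {reported_gap:.2f}")
```

A real model sits somewhere between faithful reflection and this extreme, which is consistent with the reported move from 67% to 84%.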

Proxy Variable: A feature the model uses as a stand-in for something it shouldn't measure, such as zip code as a proxy for race or healthcare cost as a proxy for health need. The Amazon and healthcare algorithm cases both involved proxy bias.

Disparate Impact: When a facially neutral algorithm produces outcomes that systematically disadvantage a protected group, even without explicit intent to discriminate. The COMPAS case is a canonical example.
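A small sketch of how a proxy variable produces disparate impact, using synthetic patients. All numbers are invented: both groups have identical health needs, but group B's recorded costs are set to 70% of group A's to stand in for reduced access to care. The ranking function never sees the group label.

```python
# Synthetic patients: (group, health_need, past_cost). All values invented.
# Needs are identical across groups; group B's costs are lower.
patients = [
    ("A", 9, 9000), ("A", 7, 7000), ("A", 5, 5000), ("A", 3, 3000),
    ("B", 9, 6300), ("B", 7, 4900), ("B", 5, 3500), ("B", 3, 2100),
]


def top_k(rows, key_index, k):
    """Select the k rows with the highest value in the given column."""
    return sorted(rows, key=lambda row: row[key_index], reverse=True)[:k]


def group_counts(rows):
    """Count how many selected patients belong to each group."""
    counts = {"A": 0, "B": 0}
    for group, _need, _cost in rows:
        counts[group] += 1
    return counts


# Rank by the proxy (cost) versus the thing we actually care about (need).
by_cost = group_counts(top_k(patients, 2, 4))
by_need = group_counts(top_k(patients, 1, 4))

print("selected by cost:", by_cost)
print("selected by need:", by_need)
```

Ranking by need selects the two groups equally, while ranking by cost favours group A even though group membership is never an input. Removing demographic fields does not remove the bias when another column carries the same information.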

The Fix Is Hard

Google's response to the gorilla tagging — removing the category rather than solving the bias — illustrates why bias is hard to fix: the problem is in the distribution of training data and the pattern-matching nature of learning, not in a single parameter you can edit. Real mitigation requires diverse data, adversarial testing, and ongoing auditing — not a one-time patch.

Lesson 2 Quiz

AI Bias — test your understanding

What was wrong with Amazon's recruiting AI tool, and why was it scrapped?

Correct. The tool trained on historical hiring data that skewed heavily male. It reproduced and codified that bias — penalising "women's chess club," women's colleges — forcing Amazon to abandon it.

The problem was gender bias embedded from historical training data. The model learned that past successful hires were mostly men and used that pattern to rank women lower, regardless of qualifications.

In the 2019 healthcare algorithm study (Obermeyer et al.), what was the proxy variable that caused racial bias?

Exactly. The algorithm used cost as a proxy for need. Since Black patients historically had less access to care, they had lower costs — the algorithm interpreted this as lower need, systematically under-prioritising them.

The subtle danger of proxy variables: the model didn't use race directly, but used spending, which correlated with race due to historical inequalities in healthcare access. The bias was indirect but just as harmful.

According to the University of Virginia research on visual recognition models, what happened to gender stereotypes when models were trained on biased data?

Correct. This amplification effect is important: a model can emerge more biased than its inputs because optimization finds stereotypes are statistically reliable shortcuts that reduce average error.

Models don't neutrally mirror training data — they can amplify biases because statistical shortcuts (stereotypes) reduce prediction error during training. The output was more biased than the input data.

Lab 2: Unpacking AI Bias

Explore the mechanics of how training data produces biased models

Your Mission

Discuss AI bias with your lab partner. Ask about the cases from the lesson, about proxy variables, about why amplification happens, or about what real mitigation looks like. Push into the uncomfortable specifics — bias is a topic people often keep vague. Have at least 3 exchanges to complete the lab.

Try asking: "Why is it so hard to fix bias once it's in a model?" — or — "Could a model be biased even if you removed all demographic information from the training data?" — or — "What's the difference between the Amazon hiring bias and the COMPAS recidivism bias?"

Welcome to Lab 2. I'm here to dig into AI bias with you — where it comes from, how it amplifies, and what real cases look like. What do you want to explore?

Module 4 · Lesson 3

Brittleness: Why AI Breaks on Small Changes

A tiny sticker. A few random pixels. An unbeatable adversary.

If an AI can identify thousands of objects flawlessly, how can a single sticker on a stop sign make it think the sign says 45 mph?

Researchers at the University of Washington, the University of Michigan, Stony Brook University, and UC Berkeley published a paper in 2017 showing that physical-world adversarial examples could fool the kind of road sign classifier an autonomous vehicle relies on. They placed small, carefully designed stickers on a stop sign. To a human, the stop sign was obviously still a stop sign. The neural network classifier consistently identified it as a 45 mph speed limit sign at multiple distances, angles, and lighting conditions. The attack was robust and repeatable in the real world.

What Brittleness Means in Neural Networks

Human visual recognition is robust to perturbation. You recognise a coffee cup whether it's upside down, partially hidden, photographed in bad lighting, or drawn in a cartoon style. Neural networks achieve superhuman accuracy on standardised benchmarks — but they learn very different features than humans do.

Instead of learning "round rim + cylindrical body + handle = cup," a convolutional neural network often learns which specific pixel patterns in training images are statistically associated with the label "cup." These patterns are not interpretable to humans. They are often texture features, not shape features.

This means small, targeted changes to an image — imperceptible to a human — can completely flip the network's prediction. These are called adversarial examples.
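A minimal sketch of why tiny changes can add up, in the spirit of the fast gradient sign method, applied to a toy linear classifier. The weights, pixel values, and class names are hypothetical. For a linear score the gradient with respect to each pixel is just its weight, so nudging every pixel slightly against the sign of its weight moves the score as far as possible for a given per-pixel budget.

```python
def score(weights, bias, pixels):
    """Linear score: positive means 'stop sign' in this toy model."""
    return sum(w * x for w, x in zip(weights, pixels)) + bias


def predict(weights, bias, pixels):
    return "stop sign" if score(weights, bias, pixels) > 0 else "speed limit"


def sign(value):
    return (value > 0) - (value < 0)


def perturb(weights, pixels, epsilon):
    """Shift each pixel by at most epsilon in the direction that lowers the score."""
    return [x - epsilon * sign(w) for w, x in zip(weights, pixels)]


# Hypothetical 100-pixel image and classifier weights.
weights = [0.1 if i % 2 == 0 else -0.1 for i in range(100)]
pixels = [0.52 if i % 2 == 0 else 0.48 for i in range(100)]
bias = 0.0

adversarial = perturb(weights, pixels, epsilon=0.03)
largest_change = max(abs(a - p) for a, p in zip(adversarial, pixels))

print(predict(weights, bias, pixels))        # original prediction
print(predict(weights, bias, adversarial))   # prediction after perturbation
print(f"largest per-pixel change: {largest_change:.2f}")
```

No single pixel moves by more than 0.03 on a 0 to 1 scale, but the effect accumulates across all 100 pixels and pushes the score across the decision boundary. Real attacks on deep networks use the gradient of the loss instead of fixed weights, but the accumulation effect is the same.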

Computer Vision — 2014

Adversarial Examples (Goodfellow et al.)

Ian Goodfellow and colleagues at Google showed that adding imperceptible noise to an image of a panda caused GoogLeNet to classify it as a gibbon with 99.3% confidence. No human could see the difference. The result became one of the best-known demonstrations in adversarial machine learning. Source: Goodfellow et al., ICLR 2015.

Medical AI — 2018

Skin Cancer Classifier Fooled by Ruler

A skin cancer detection AI trained on skin lesion images was found to associate the presence of a ruler (used to show scale) with malignancy, because dermatologists tend to photograph suspicious lesions with rulers. The model learned the ruler, not the lesion. Source: Narla et al., Journal of Investigative Dermatology, 2018.

NLP — 2020

BERT Fooled by Negation

Allyson Ettinger's 2020 diagnostic study at the University of Chicago, along with related work, showed that state-of-the-art NLP models like BERT were largely insensitive to negation: predictions for a negated sentence often matched those for the affirmative version. In a clinical NLP task, that would mean treating "The patient does NOT have diabetes" much like "The patient has diabetes," because the model learned co-occurrence patterns, not logical structure. Source: Ettinger, 2020.

Distribution Shift: The Real-World Gap

Beyond adversarial attacks, AI systems routinely fail when deployed on data that differs from their training distribution. This is called distribution shift. The system was never adversarially attacked — reality just looked different from training data.

In 2020, multiple COVID-19 chest X-ray AI systems trained during the pandemic were found to be classifying X-rays based on image artifacts rather than disease. Certain hospital sites used specific X-ray equipment that produced particular visual signatures, and those signatures correlated with which images came from COVID-19 patients. The models learned the scanner fingerprint, not the COVID-19 pathology.

When deployed on new hospital data, accuracy collapsed. The models weren't brittle to adversarial attack — they were brittle to the simple change of using different equipment.
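The scanner-fingerprint failure can be reproduced in miniature. The sketch below uses synthetic records with two hypothetical binary features, `scanner` and `pathology`, and a deliberately simple learner that keeps whichever single feature best matches the label on training data. It is an illustration of the mechanism, not a model of any real system.

```python
def accuracy(rows, feature):
    """Fraction of rows where the feature value equals the label."""
    return sum(1 for row in rows if row[feature] == row["label"]) / len(rows)


def fit_one_feature(rows, features):
    """Pick the single feature that best matches the label on training data."""
    return max(features, key=lambda name: accuracy(rows, name))


# Training hospital: the scanner marker tracks the label perfectly,
# while the true pathology signal is right only 75% of the time.
train = (
    [{"scanner": 1, "pathology": 1, "label": 1}] * 3
    + [{"scanner": 1, "pathology": 0, "label": 1}] * 1
    + [{"scanner": 0, "pathology": 0, "label": 0}] * 3
    + [{"scanner": 0, "pathology": 1, "label": 0}] * 1
)

# New hospital: one scanner for everyone, so the marker carries no signal.
deploy = (
    [{"scanner": 0, "pathology": 1, "label": 1}] * 3
    + [{"scanner": 0, "pathology": 0, "label": 1}] * 1
    + [{"scanner": 0, "pathology": 0, "label": 0}] * 3
    + [{"scanner": 0, "pathology": 1, "label": 0}] * 1
)

chosen = fit_one_feature(train, ["scanner", "pathology"])
print("feature learned:", chosen)
print("training accuracy:", accuracy(train, chosen))
print("deployment accuracy:", accuracy(deploy, chosen))
print("pathology feature at deployment:", accuracy(deploy, "pathology"))
```

The learner prefers the scanner marker because it scores perfectly in training, then drops to chance at the new site. The noisier pathology feature would have held steady at 75%. Nobody attacked the model; the deployment data simply stopped matching the training data.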

Adversarial Example: An input deliberately modified, often imperceptibly, to cause a model to misclassify it. Adversarial examples reveal that neural networks learn different features than human intuition suggests.

Distribution Shift: When the statistical properties of real-world deployment data differ from the training data. Even without adversarial intent, this can cause catastrophic failure as the model applies learned patterns that no longer hold.

Why Benchmark Accuracy Isn't Enough

A model can achieve 99% accuracy on an ImageNet benchmark and still be profoundly brittle to real-world variation. The stop sign sticker attack, the skin cancer ruler artifact, the COVID-19 scanner fingerprint — all exposed systems that performed well in testing but failed on the structured gap between training and reality. Robustness testing is a separate and essential discipline from accuracy evaluation.

Lesson 3 Quiz

Brittleness & Adversarial Examples — test your understanding

In the 2017 stop sign adversarial attack by University of Washington researchers, what made the attack remarkable?

Correct. The physical-world robustness was the key finding. The attack wasn't a digital manipulation of image files — actual printed stickers on a real sign produced reliable misclassification under real-world conditions.

The attack specifically demonstrated physical-world robustness — real printed stickers, not digital manipulation, fooled the classifier at multiple angles, distances, and lighting conditions relevant to autonomous vehicle deployment.

Why did the skin cancer AI systematically associate rulers with malignancy?

Exactly. The ruler was a dataset artifact — a feature that correlated with the label in training data due to clinical photography practice, not because rulers cause cancer. The model learned the correlation, not the pathology.

This is a classic spurious correlation case. The model found a statistical shortcut: rulers appeared more often with malignant lesions because of how clinicians document them — and learned that relationship instead of the actual visual pathology.

What is "distribution shift" and why does it cause AI failure?

Correct. Distribution shift doesn't require adversarial intent — just a gap between training and deployment reality. The COVID-19 X-ray case is a clear example: models learned to recognise hospital equipment signatures, not the disease.

Distribution shift is more fundamental than adversarial attack. It happens whenever deployment data differs statistically from training data — different equipment, different populations, different time periods. The model applies patterns that no longer hold.

Lab 3: Probing Brittleness

Investigate adversarial examples and distribution shift with your AI lab partner

Your Mission

Explore AI brittleness and adversarial examples. Ask why neural networks are vulnerable to small perturbations, what makes the stop sign sticker attack so alarming for self-driving cars, or what organisations should do before deploying AI systems in safety-critical settings. Have at least 3 exchanges to complete the lab.

Try asking: "Why would a human never be fooled by the stop sign sticker but an AI consistently is?" — or — "What does it mean that neural networks learn 'texture features' instead of 'shape features'?" — or — "How should hospitals test AI before deploying it on real patients?"

Welcome to Lab 3. I'm your guide to AI brittleness — adversarial examples, distribution shift, and the gap between benchmark performance and real-world reliability. What would you like to dig into?

Module 4 · Lesson 4

Overconfidence and the Calibration Problem

When 95% confidence means something very different from what you think

Why does an AI say it's 99% sure about something it's completely wrong about — and what does that mean for every high-stakes decision it touches?

During IBM Watson for Oncology's deployment at several major cancer centers, including MD Anderson Cancer Center in Texas and hospitals in India, Watson recommended cancer treatments that oncologists described as unsafe and incorrect. Internal IBM documents obtained by STAT News in 2018 showed that Watson had been trained primarily on a small number of hypothetical cases from Memorial Sloan Kettering rather than real patient data. Watson nonetheless recommended treatments with high confidence scores. MD Anderson spent $62 million on the project before cancelling it.

What Calibration Means

A well-calibrated model is one whose stated confidence matches its actual accuracy. If a model says "I'm 90% confident" about 100 predictions, roughly 90 of them should be correct. If only 60 are correct, the model is overconfident: its confidence scores are higher than its actual accuracy.
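The 90-versus-60 example can be checked numerically. The sketch below computes a simple binned calibration error in the style of expected calibration error (ECE): predictions are grouped by stated confidence, and within each group the average confidence is compared with the observed accuracy. The function name and the equal-width binning scheme are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        # Place each prediction in an equal-width confidence bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))

    total = len(confidences)
    error = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(c for c, _ in members) / len(members)
        bin_accuracy = sum(1 for _, ok in members if ok) / len(members)
        error += (len(members) / total) * abs(avg_confidence - bin_accuracy)
    return error


confidences = [0.9] * 100                        # "I'm 90% confident" x 100
well_calibrated = [True] * 90 + [False] * 10     # 90 of 100 correct
overconfident = [True] * 60 + [False] * 40       # only 60 of 100 correct

print(f"{expected_calibration_error(confidences, well_calibrated):.2f}")
print(f"{expected_calibration_error(confidences, overconfident):.2f}")
```

The first case yields an error of essentially zero and the second a gap of about 0.30, which is the 30-point difference between the stated 90% and the observed 60%.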

Most large neural networks are overconfident. A 2017 paper by Guo et al. at Cornell (one of the most cited papers in the field) documented that modern deep neural networks are significantly overconfident compared to older, shallower models. The improvement in accuracy with depth came with a degradation in calibration.

This matters enormously in high-stakes settings. A doctor who trusts a 95% confidence score from an AI diagnostic tool is making decisions based on a number that may not reflect reality at all.

Oncology — 2017

Watson for Oncology — Unsafe Recommendations

IBM Watson recommended treatments that oncologists called dangerous. It had been trained on hypothetical scenarios, not real cases, yet produced high-confidence outputs. MD Anderson cancelled after $62M spent. Source: STAT News, 2017; The Verge, 2017.

Autonomous Vehicles — 2016

Tesla Autopilot Fatal Crash

In May 2016, a Tesla Model S operating on Autopilot struck a white tractor-trailer against a bright sky. The camera-based object detection failed to identify the trailer. The NTSB investigation noted the system's limitations in novel scenarios not represented in training data. The driver was killed. Source: NTSB Report HWY16FH018, 2017.

NLP — Ongoing

Large Language Model Confidence

Studies consistently show that LLMs will state incorrect facts with fluency and apparent confidence indistinguishable from correct facts. The Mata v. Avianca case is a direct consequence: when re-asked to confirm fabricated citations, ChatGPT confirmed them confidently. Source: Multiple NLP calibration studies, 2022–2024.

Why Deep Networks Are Overconfident

Standard neural network training uses cross-entropy loss with softmax output. Softmax converts raw network scores into probabilities — but those probabilities are not inherently calibrated to real-world accuracy. The optimization process pushes the network toward high confidence to reduce training loss, but this confidence is not epistemically grounded.
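A short sketch of why softmax probabilities are not inherently calibrated. The logits below are hypothetical. Scaling them up leaves the ranking of classes unchanged but inflates the reported probability of the top class, so the "confidence" depends on the magnitude of the raw scores rather than on any measured accuracy.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


modest = softmax([2.0, 1.0, 0.0])
inflated = softmax([6.0, 3.0, 0.0])  # same ranking, logits scaled by 3

print(f"top-class probability, modest logits:   {modest[0]:.2f}")
print(f"top-class probability, inflated logits: {inflated[0]:.2f}")
```

Both calls pick the same class, yet the reported confidence rises from roughly 0.67 to roughly 0.95. Training that keeps pushing logits apart to reduce loss produces exactly this kind of inflation.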

Techniques like temperature scaling and Platt scaling can post-hoc recalibrate model outputs. Bayesian neural networks and ensembles offer architectural approaches. But calibration is rarely a default property of deployed systems — it requires deliberate engineering.

Calibration: The match between a model's stated confidence and its empirical accuracy. A perfectly calibrated model's 70% confidence predictions are correct 70% of the time.

Temperature Scaling: A post-processing technique that divides a model's raw scores (logits) by a single learned scalar before softmax, softening or sharpening its probability outputs. It improves calibration without changing which class the model predicts.
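A minimal sketch of temperature scaling under stated assumptions: the validation set is hypothetical, and the temperature is found by a plain grid search over candidate values where a real implementation would typically use a numerical optimiser. The objective, negative log-likelihood on held-out data, is the usual one.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(dataset, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [
        -math.log(softmax(logits, temperature)[label])
        for logits, label in dataset
    ]
    return sum(losses) / len(losses)


def fit_temperature(dataset, candidates):
    """Grid search: keep the temperature with the lowest validation loss."""
    return min(candidates, key=lambda t: mean_nll(dataset, t))


# Hypothetical validation set: very confident logits, but the top guess
# (class 0) is right in only 3 of 5 cases.
validation = [
    ([6.0, 0.0], 0), ([6.0, 0.0], 0), ([6.0, 0.0], 0),
    ([6.0, 0.0], 1), ([6.0, 0.0], 1),
]
candidates = [0.5 + 0.5 * i for i in range(40)]  # 0.5, 1.0, ..., 20.0

best_t = fit_temperature(validation, candidates)
before = softmax([6.0, 0.0])[0]
after = softmax([6.0, 0.0], best_t)[0]

print(f"fitted temperature: {best_t}")
print(f"confidence before: {before:.3f}, after: {after:.3f}")
```

Before scaling, the model reports near-total confidence in class 0. After scaling, the confidence lands close to the 60% accuracy actually observed, and class 0 remains the top prediction because dividing all logits by the same positive number cannot change their order.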

The Watson Lesson

Watson for Oncology was trained on a small set of hypothetical cases created by a few experts at one institution — not on the messy diversity of real patient data at scale. Its confident outputs in deployment were an artefact of a narrow training distribution. High stated confidence from AI in medical settings requires not just good accuracy but verified calibration on relevant patient populations.

Connecting All Four Failure Modes

Hallucination, bias, brittleness, and overconfidence are not separate bugs — they're four expressions of the same underlying reality: AI models learn statistical patterns from training data, and those patterns break down in structured ways when reality departs from that distribution. Understanding these failure modes isn't about distrust of AI — it's the foundation of using AI well.

Lesson 4 Quiz

Overconfidence & Calibration — test your understanding

What was the fundamental data problem with IBM Watson for Oncology?

Correct. Watson was trained on a narrow set of hypothetical cases authored by experts at Memorial Sloan Kettering — not on diverse real patient outcomes. Its confident outputs were not grounded in representative data.

The opposite: Watson was trained on too narrow a base — hypothetical cases from one institution. The overconfidence problem was that its confidence scores didn't reflect this narrow, unrepresentative foundation.

What does it mean for a model to be "well-calibrated"?

Exactly. Calibration is the alignment between confidence and accuracy. Most deep neural networks are overconfident — their probability outputs are systematically higher than their actual accuracy rates.

Calibration is specifically about whether confidence scores are meaningful. A model that reports 95% confidence but is correct only 60% of the time is dangerously overconfident, regardless of overall accuracy.

According to the Guo et al. (2017) study on neural network calibration, what happened to calibration as models got deeper and more accurate?

Correct. This is a critical finding: accuracy and calibration are not the same thing, and they can move in opposite directions. The very architectural changes that improved accuracy also degraded calibration in modern deep networks.

The counterintuitive finding: deeper, more accurate models were less well-calibrated than older, shallower ones. Better accuracy came with worse confidence scores — a trade-off that has major implications for deployment in high-stakes settings.

Lab 4: Interrogating AI Confidence

Probe what overconfidence means for real-world AI deployment

Your Mission

Explore AI overconfidence and calibration. Ask about why Watson for Oncology failed, what calibration means in practice, or how you would design an AI system for a hospital that properly communicates its uncertainty. Connect this lesson's ideas back to earlier lessons — how does overconfidence relate to hallucination or brittleness? Have at least 3 exchanges to complete the lab.

Try asking: "If a doctor can't trust a model's confidence score, how should they use AI at all?" — or — "How does temperature scaling actually fix overconfidence?" — or — "Is overconfidence in AI fundamentally different from overconfidence in humans?"

Welcome to Lab 4. I'm here to discuss AI overconfidence and calibration — what the Watson failure teaches us, what calibration actually means, and what well-designed AI uncertainty communication looks like. What do you want to explore?

Module 4 Test

15 questions across all four lessons — pass at 80% to complete

1. What happened in the Mata v. Avianca case of 2023?

Correct. ChatGPT invented six completely fictional case citations — names, courts, dates, rulings — that attorney Steven Schwartz filed in SDNY. None of the cases had ever existed.

In Mata v. Avianca, ChatGPT fabricated six entirely fictional legal case citations that attorney Steven Schwartz filed in federal court without verification.

2. Why does a language model produce hallucinated content that sounds confident?

Correct. Language models generate statistically plausible text completions. The fluency of hallucinated text is identical to real text because both emerge from the same process — there is no truth-checking layer.

There is no separate accuracy verification step. The same statistical text completion mechanism that produces correct fluent text also produces hallucinated fluent text — confidence is not a signal of accuracy.

3. What did the CNET AI article investigation reveal in January 2023?

Correct. A Futurism investigation found that more than half of the AI-written financial articles contained factual errors including incorrect interest calculations, wrong dates, and fabricated regulatory details.

Futurism found that the majority of CNET's AI-written financial articles contained factual errors — demonstrating that AI fluency in financial content does not guarantee accuracy.

4. Amazon's recruiting AI downgraded candidates for what reason?

Correct. Training on 10 years of predominantly male successful hires caused the model to learn that male-associated features predicted success — leading it to penalise specifically female-associated language.

The model learned from historical hiring data that skewed heavily male and applied that learned pattern to new candidates — systematically penalising women's organisations and all-women's colleges.

5. In the 2019 healthcare algorithm study, what proxy variable caused racial bias in patient prioritisation?

Correct. Cost was used as a proxy for need. Because Black patients had historically less access to care, they had lower costs — the model misread this as lower need, affecting prioritisation for 200+ million patients.

Healthcare spending was the proxy — it correlated with race because of historical inequalities in access to care, not because Black patients had lower health needs. The proxy introduced structural racial bias.

6. What does the University of Virginia visual recognition research (2017) show about bias amplification?

Correct. The model was more biased than its training data because statistical stereotypes are reliable shortcuts that reduce average training error — the optimization process finds and reinforces them.

The key finding was amplification: the model emerged from training more biased than its input data. Stereotypes are efficient statistical shortcuts — optimization tends to find and exploit them.

7. What did the 2017 physical-world adversarial attack research show about stop signs?

Correct. Physical stickers — not digital modification — caused reliable real-world misclassification. The attack was robust to angle, distance, and lighting changes relevant to real autonomous vehicle deployment.

The attack used real physical stickers on a real sign and worked in the real world across multiple conditions. This was the key concern — it wasn't just a lab artefact.

8. Why did the skin cancer detection AI associate rulers with malignancy?

Correct. This is a dataset artifact: the model found a real statistical correlation in training data (rulers appeared more with malignant lesions) and learned it as a feature, bypassing the actual pathological signal.

Clinical photography practice created the correlation — suspicious lesions were photographed with scale rulers. The model learned the instrument of documentation as a proxy for the clinical judgment, not the pathology itself.

9. What is "distribution shift" in the context of AI failure?

Correct. Distribution shift doesn't require adversarial intent. The COVID-19 X-ray case showed how models learning scanner fingerprints rather than pathology failed completely when deployed on different hospital equipment.

Distribution shift is a natural deployment hazard: when the real world differs from training data in systematic ways, even well-functioning models can fail catastrophically because their learned patterns no longer apply.

10. What did Guo et al. (2017) find about calibration in modern deep neural networks?

Correct. This is a fundamental finding: accuracy and calibration are separate properties that can move in opposite directions. The changes that improved accuracy, such as greater depth and width and batch normalisation, degraded calibration.

The counterintuitive result: accuracy improved with depth but calibration got worse. The factors associated with more accurate models (greater depth and width, batch normalisation, training with less weight decay) also made their confidence scores less reliable.

11. What was the fundamental problem with IBM Watson for Oncology's training data?

Correct. Watson trained on a narrow set of hypothetical cases from Memorial Sloan Kettering — not on the diverse, messy reality of actual patient outcomes across populations. Its high-confidence outputs lacked this grounding.

Watson's narrow training base — hypothetical cases from one expert institution — meant its confident recommendations were not grounded in representative patient data, leading to unsafe treatment suggestions.

12. What does a "well-calibrated" AI model mean in practice?

Correct. Calibration is the alignment between stated confidence and empirical accuracy. Most deployed deep networks are overconfident — their 90% confidence is not 90% accurate in practice.

Calibration specifically means the match between confidence scores and actual accuracy rates. A model that states 90% confidence but is correct only 60% of the time is dangerously overconfident.

13. How does Google Photos' response to the 2015 gorilla tagging incident illustrate the difficulty of fixing bias?

Correct. The "fix" was deletion of the category, not correction of the bias. Three years later the term was still blocked — revealing that addressing the symptom (the harmful label) is far easier than addressing the cause (biased training distributions).

The response was to delete the category entirely — not fix the underlying bias. This illustrates that bias correction requires changing training data and distributions, which is far harder than removing specific offensive outputs.

14. What do adversarial examples reveal about what neural networks actually learn?

Correct. Adversarial examples exploit the gap between human vision (shape/semantic understanding) and neural network vision (texture/statistical patterns). The perturbations that fool networks are imperceptible to humans because humans use different features.

Adversarial examples expose that models learn different features than humans — often texture statistics rather than shapes. A panda image perturbed by noise imperceptible to humans fools a network because the network was using different visual features.

15. Which statement best describes the relationship between the four failure modes covered in this module?

Correct. Hallucination, bias, brittleness, and overconfidence are all consequences of the same fundamental architecture: pattern learning from finite training distributions. Understanding this unity is the key insight of the module.

These failure modes share a common root: models learn statistical patterns from training data, and those patterns fail in predictable ways at the boundaries of the training distribution. They cannot be fully "patched" — they require systemic mitigation.