In 2023, a New York attorney named Steven Schwartz filed a legal brief in federal court in the case Mata v. Avianca. He had used ChatGPT to research case citations. The brief referenced six cases, including Varghese v. China Southern Airlines and Martinez v. Delta Air Lines, each with specific court names, dates, and ruling summaries. Every one of them was fabricated. The cases had never existed. When the judge demanded copies, Schwartz asked ChatGPT to confirm the citations were real. The AI said yes. He faced sanctions and a $5,000 fine.
The word "hallucination" in AI doesn't mean the model is confused or malfunctioning. It means the model generated text that sounds authoritative but has no grounding in real fact. This happens because of how language models work at a fundamental level.
A language model doesn't store facts like a database. It learns statistical patterns: which words follow which other words, in what combinations, at what probabilities. When you ask it a question, it generates the response that is most statistically likely to follow your prompt — not the response that is most factually accurate. Truth is not the objective. Plausible text completion is.
When the model encounters a topic where its training data was thin, contradictory, or absent, it doesn't say "I don't know." It generates the most statistically coherent continuation — which can look exactly like a real citation, a real person's biography, or a real scientific study.
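The idea can be made concrete with a toy model. The sketch below is a minimal illustration, not how any production language model is built: it assumes a tiny bigram model (each word predicted only from the previous word) trained on three invented, citation-shaped strings. The corpus, the case names, and the function names are all hypothetical. The point is that the sampler has no notion of whether a case exists; it only knows which token tends to follow which.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy corpus: citation-shaped strings invented for illustration.
CORPUS = [
    "smith v. acme airlines , 2d cir. 2008",
    "jones v. delta freight , 9th cir. 2011",
    "lee v. acme freight , 2d cir. 2015",
]

def train_bigrams(lines):
    """Count which token follows which; this is all the 'model' knows."""
    counts = defaultdict(Counter)
    for line in lines:
        tokens = ["<s>"] + line.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, rng, max_tokens=20):
    """Sample a continuation token by token, weighted by observed frequency."""
    token, out = "<s>", []
    for _ in range(max_tokens):
        followers = counts[token]
        choices, weights = zip(*followers.items())
        token = rng.choices(choices, weights=weights)[0]
        if token == "</s>":
            break
        out.append(token)
    return " ".join(out)

model = train_bigrams(CORPUS)
rng = random.Random(0)
samples = [generate(model, rng) for _ in range(200)]

# Outputs that look like citations but never appeared in the training data.
novel = sorted(set(s for s in samples if s not in CORPUS))
print(len(novel), "fabricated citation-shaped strings, for example:", novel[0])
```

Every sample has the shape of a citation, and many of them recombine names, courts, and years into cases that were never in the corpus. Nothing in the sampling loop can return "I don't know"; it can only return the next likely token.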
The confident tone of language model output tells you nothing about its accuracy. The same mechanism that makes accurate text fluent also makes fabricated text fluent. There's no internal "uncertainty meter" attached to the output. A hallucinated legal case sounds exactly as polished as a real one.
Models hallucinate most often in three situations: when asked about very recent events after their training cutoff, when asked about niche topics with little training data, and when asked to produce specific formatted outputs like citations, references, or code where structure matters more than truth.
The fabricated legal citations are a clear example of the third type. The model had seen thousands of legal briefs. It had learned what a citation looks like, what a case name sounds like, and what ruling language sounds like. When asked to generate citations, it generated statistically plausible legal citation text, complete with credible-sounding names, years, and courts.
Hallucination is not a bug that can simply be patched out. It is a structural property of how language models generate text. Understanding this changes how you use AI: you verify claims, you don't use AI for high-stakes citation work without external verification, and you treat fluency as completely separate from accuracy.
Your lab partner has read about AI hallucinations and wants to understand the mechanics. Ask it questions about how hallucinations happen, why the Mata v. Avianca case matters, what grounding means, or how you could detect hallucinated content in practice. Have at least 3 exchanges to complete the lab.
In 2015, Google Photos' image recognition tagged photos of Black people as "gorillas." Google apologized and issued a fix — but the fix was to remove the gorilla category entirely from the classifier, rather than actually solve the underlying bias. In 2018, a Wired investigation confirmed the labels "gorilla," "chimp," and "chimpanzee" were still blocked in Google Photos. The bias had not been corrected. It had been hidden.
Machine learning models learn from data. If that data reflects historical inequalities, underrepresentation, or stereotyped associations, the model will reproduce them — sometimes amplifying them. This is not a values failure by the engineers; it is a mathematical consequence of learning from biased distributions.
There are three primary bias entry points: training data bias (the dataset overrepresents certain groups), label bias (humans who labelled training examples applied stereotyped judgments), and measurement bias (the metric used to evaluate the model favours certain groups over others).
Research led by the University of Virginia (2017) showed that image-labelling models trained on datasets such as imSitu and MS-COCO didn't just reflect gender stereotypes from the data; they amplified them. In one example, 33% of cooking images in the training data showed a man, yet the trained model labelled the cook as a man only 16% of the time, attributing cooking to women at a rate of 84%. Models can become more biased than their training data through the optimization process itself.
This happens because training minimises average prediction error, and stereotypes are statistically reliable shortcuts. The model learns to use them because, mathematically, they reduce average error, even while causing catastrophic errors for individuals who don't fit the stereotype.
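A small piece of arithmetic shows why minimising error can push a model past the bias in its data. The sketch below is a deliberately simplified assumption, not the method of the cited study: it imagines a predictor that has no usable evidence about who is in a cooking image and must guess. The 67% figure is the complement of the 33% quoted above, the 84% figure is the model rate quoted above, and the variable names are hypothetical.

```python
# Share of cooking images in the training data that show a woman (100% - 33%).
p_woman = 0.67

# Strategy A: mirror the data, guessing "woman" at random 67% of the time.
# It is right when guess and truth agree: both "woman" or both "man".
acc_match = p_woman * p_woman + (1 - p_woman) * (1 - p_woman)

# Strategy B: always guess the majority label.
acc_majority = max(p_woman, 1 - p_woman)

# How often each strategy says "woman", and how far that exceeds the data.
predicted_woman_rate = {"match_data": p_woman, "always_majority": 1.0}
amplification = {name: rate - p_woman for name, rate in predicted_woman_rate.items()}

# Amplification implied by the figures quoted in the lesson (84% vs 67%).
observed_amplification = 0.84 - p_woman

print(f"mirror the data:  accuracy {acc_match:.4f}, amplification {amplification['match_data']:.2f}")
print(f"always majority:  accuracy {acc_majority:.4f}, amplification {amplification['always_majority']:.2f}")
print(f"reported model:   amplification {observed_amplification:.2f}")
```

Under this assumption the strategy that mirrors the data scores lower accuracy than the strategy that always predicts the stereotype, so an error-minimising learner is rewarded for exaggerating the skew. A real model has partial visual evidence, which is one plausible reason its behaviour lands between the two extremes.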
Google's response to the gorilla tagging — removing the category rather than solving the bias — illustrates why bias is hard to fix: the problem is in the distribution of training data and the pattern-matching nature of learning, not in a single parameter you can edit. Real mitigation requires diverse data, adversarial testing, and ongoing auditing — not a one-time patch.
Discuss AI bias with your lab partner. Ask about the cases from the lesson, about proxy variables, about why amplification happens, or about what real mitigation looks like. Push into the uncomfortable specifics — bias is a topic people often keep vague. Have at least 3 exchanges to complete the lab.
Researchers at the University of Washington, the University of Michigan, Stony Brook University, and UC Berkeley published a paper in 2017 showing that physical-world adversarial examples could fool the kind of road-sign classifier an autonomous vehicle relies on. They placed small, carefully designed stickers on a stop sign. To a human, the stop sign was obviously still a stop sign. The neural network classifier consistently identified it as a 45 mph speed limit sign at multiple distances, angles, and lighting conditions. The attack was robust and repeatable in the real world.
Human visual recognition is robust to perturbation. You recognise a coffee cup whether it's upside down, partially hidden, photographed in bad lighting, or drawn in a cartoon style. Neural networks achieve superhuman accuracy on standardised benchmarks — but they learn very different features than humans do.
Instead of learning "round rim + cylindrical body + handle = cup," a convolutional neural network often learns which specific pixel patterns in training images are statistically associated with the label "cup." These patterns are not interpretable to humans. They are often texture features, not shape features.
This means small, targeted changes to an image — imperceptible to a human — can completely flip the network's prediction. These are called adversarial examples.
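The mechanism can be sketched on a model far simpler than a neural network. The example below assumes a hypothetical linear classifier over 1,000 "pixels" with invented random weights, and applies a gradient-sign step in the style of the fast gradient sign method: every pixel moves by at most 0.02 on a 0 to 1 scale, in the direction that lowers the score fastest. All values and names here are illustrative assumptions, not taken from the stop sign study.

```python
import random

rng = random.Random(0)
DIM = 1000   # hypothetical number of pixels
EPS = 0.02   # largest allowed change per pixel, on a 0..1 intensity scale

# Hypothetical linear classifier and input image; all values are invented.
w = [rng.uniform(-1, 1) for _ in range(DIM)]
x = [rng.uniform(0.2, 0.8) for _ in range(DIM)]

def score(pixels, bias):
    """Positive score means 'stop sign', negative means 'speed limit'."""
    return sum(wi * xi for wi, xi in zip(w, pixels)) + bias

# Set the bias so the clean image is classified as a stop sign with score 2.0.
b = 2.0 - score(x, 0.0)

def sign(v):
    return (v > 0) - (v < 0)

# Targeted change: the gradient of the score with respect to each pixel is its
# weight, so stepping every pixel by EPS against sign(w) lowers the score fastest.
x_adv = [xi - EPS * sign(wi) for wi, xi in zip(w, x)]

# Control: random changes of exactly the same size per pixel.
x_noise = [xi + EPS * rng.choice((-1, 1)) for xi in x]

clean_score = score(x, b)
adv_score = score(x_adv, b)
noise_score = score(x_noise, b)
largest_change = max(abs(a - c) for a, c in zip(x_adv, x))

print(f"clean {clean_score:+.2f}  random noise {noise_score:+.2f}  adversarial {adv_score:+.2f}")
```

Random noise of the same size barely moves the score because its effects cancel out across pixels. The targeted change flips the label because a thousand tiny nudges all push the same way. Deep networks are not linear, but the same accumulation across many input dimensions is one commonly cited explanation for their sensitivity.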
Beyond adversarial attacks, AI systems routinely fail when deployed on data that differs from their training distribution. This is called distribution shift. The system was never adversarially attacked — reality just looked different from training data.
In 2020, multiple COVID-19 chest X-ray AI systems trained during the pandemic were found to be classifying X-rays based on image acquisition artifacts rather than disease. Certain hospital sites used specific X-ray equipment that produced particular visual signatures, and those signatures correlated with which images in the early pandemic datasets came from COVID-19 patients. The models learned the scanner fingerprint, not the COVID-19 pathology.
When deployed on new hospital data, accuracy collapsed. The models weren't brittle to adversarial attack — they were brittle to the simple change of using different equipment.
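A toy simulation shows how this kind of collapse can arise. The sketch below is an illustrative assumption, not a reconstruction of any real X-ray system: each synthetic case has two binary features, a hypothetical scanner tag that agrees with the diagnosis 95% of the time at the training hospital, and a weaker pathology signal that agrees 80% of the time everywhere. The "model" is a one-feature rule that keeps whichever feature scores best on the training data.

```python
import random

def make_data(rng, n, scanner_match, pathology_match=0.80):
    """Each row is (scanner_tag, pathology_signal, label), all 0 or 1.
    Each feature equals the label with the given probability."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        scanner = label if rng.random() < scanner_match else 1 - label
        pathology = label if rng.random() < pathology_match else 1 - label
        rows.append((scanner, pathology, label))
    return rows

def accuracy(rows, feature_index):
    """Accuracy of predicting the label directly from one feature."""
    return sum(row[feature_index] == row[2] for row in rows) / len(rows)

def fit_stump(rows):
    """'Training': keep whichever single feature best predicts the label."""
    return max((0, 1), key=lambda i: accuracy(rows, i))

rng = random.Random(0)
train = make_data(rng, 2000, scanner_match=0.95)     # scanner type tracks diagnosis
new_site = make_data(rng, 2000, scanner_match=0.50)  # different equipment: no link

chosen = fit_stump(train)                 # 0 = scanner tag, 1 = pathology signal
train_acc = accuracy(train, chosen)
deploy_acc = accuracy(new_site, chosen)
pathology_acc = accuracy(new_site, 1)     # what the weaker but real signal would give

print(f"chosen feature {chosen}: train {train_acc:.2f}, new site {deploy_acc:.2f}")
print(f"pathology signal at new site: {pathology_acc:.2f}")
```

The rule picks the scanner tag because it is the better predictor on the training data, and nothing in the training accuracy reveals that the choice is wrong. Only evaluation on data from a different site exposes the shortcut.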
A model can achieve 99% accuracy on a benchmark test set and still be profoundly brittle to real-world variation. The stop sign sticker attack and the COVID-19 scanner fingerprint, like the widely reported skin cancer classifiers that keyed on rulers photographed beside malignant lesions, all exposed systems that performed well in testing but failed on the systematic gap between training and reality. Robustness testing is a separate and essential discipline from accuracy evaluation.
Explore AI brittleness and adversarial examples. Ask why neural networks are vulnerable to small perturbations, what makes the stop sign sticker attack so alarming for self-driving cars, or what organisations should do before deploying AI systems in safety-critical settings. Have at least 3 exchanges to complete the lab.
IBM's Watson for Oncology was deployed at several major cancer centers, including MD Anderson Cancer Center in Texas and hospitals in India. It recommended cancer treatments that oncologists described as unsafe and incorrect. Internal IBM documents reported by STAT News in 2018 showed that Watson had been trained primarily on a small number of hypothetical cases from Memorial Sloan Kettering rather than real patient data. Watson nonetheless recommended treatments with high confidence scores. MD Anderson spent $62 million on the project before cancelling it.
A well-calibrated model is one where its stated confidence matches its actual accuracy. If a model says "I'm 90% confident" about 100 predictions, roughly 90 of them should be correct. If only 60 are correct, the model is overconfident — its confidence scores are higher than its actual accuracy.
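The 90-versus-60 example can be checked with a few lines of code. The sketch below computes expected calibration error, a common summary of the gap between stated confidence and actual accuracy, using equal-width confidence bins. The function name and the choice of ten bins are assumptions for illustration; the data are simply the two scenarios described above.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece

# The example from the text: 100 predictions, each at 90% stated confidence.
confidences = [0.9] * 100
well_calibrated = [True] * 90 + [False] * 10   # 90 of 100 correct
overconfident = [True] * 60 + [False] * 40     # only 60 of 100 correct

ece_good = expected_calibration_error(confidences, well_calibrated)
ece_bad = expected_calibration_error(confidences, overconfident)

print(f"well calibrated: {ece_good:.2f}   overconfident: {ece_bad:.2f}")
```

The calibrated case scores essentially zero, and the overconfident case scores 0.30, the 30-point gap between 90% stated confidence and 60% actual accuracy. Measuring this requires labelled held-out data from the population the system will actually serve.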
Many large neural networks are overconfident. A widely cited 2017 paper by Guo et al. at Cornell documented that modern deep neural networks are significantly more overconfident than older, shallower models. The improvement in accuracy with depth came with a degradation in calibration.
This matters enormously in high-stakes settings. A doctor who trusts a 95% confidence score from an AI diagnostic tool is making decisions based on a number that may not reflect reality at all.
Standard neural network training uses cross-entropy loss with softmax output. Softmax converts raw network scores into probabilities — but those probabilities are not inherently calibrated to real-world accuracy. The optimization process pushes the network toward high confidence to reduce training loss, but this confidence is not epistemically grounded.
Techniques like temperature scaling and Platt scaling can post-hoc recalibrate model outputs. Bayesian neural networks and ensembles offer architectural approaches. But calibration is rarely a default property of deployed systems — it requires deliberate engineering.
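Temperature scaling is simple enough to sketch in full. It divides the network's raw scores by a single scalar T before the softmax, and chooses T to minimise negative log-likelihood on held-out data. The example below is a minimal sketch under stated assumptions: the logits and labels are invented, and a coarse grid search stands in for the numerical optimiser a real implementation would more likely use.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; temperature > 1 flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels."""
    return -sum(math.log(softmax(row, temperature)[y])
                for row, y in zip(logit_rows, labels)) / len(labels)

# Hypothetical held-out logits: the model is very sure of class 0 every time,
# but the true label is class 0 in only 3 of the 5 cases.
val_logits = [[6.0, 0.0, 0.0]] * 5
val_labels = [0, 0, 0, 1, 2]

# Temperature scaling: pick the single scalar T that minimises validation NLL.
candidates = [0.5 + 0.1 * i for i in range(96)]     # grid over 0.5 .. 10.0
best_t = min(candidates, key=lambda t: nll(val_logits, val_labels, t))

before = max(softmax(val_logits[0]))
after = max(softmax(val_logits[0], best_t))

print(f"T = {best_t:.1f}: top confidence {before:.3f} -> {after:.3f} (accuracy is 0.600)")
```

Dividing by T never changes which class scores highest, so accuracy is untouched; only the stated confidence moves, here from above 99% down to roughly the 60% the model actually achieves. The fitted T is only as trustworthy as the held-out data it was tuned on.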
Watson for Oncology was trained on a small set of hypothetical cases created by a few experts at one institution — not on the messy diversity of real patient data at scale. Its confident outputs in deployment were an artefact of a narrow training distribution. High stated confidence from AI in medical settings requires not just good accuracy but verified calibration on relevant patient populations.
Hallucination, bias, brittleness, and overconfidence are not separate bugs — they're four expressions of the same underlying reality: AI models learn statistical patterns from training data, and those patterns break down in structured ways when reality departs from that distribution. Understanding these failure modes isn't about distrust of AI — it's the foundation of using AI well.
Explore AI overconfidence and calibration. Ask about why Watson for Oncology failed, what calibration means in practice, or how you would design an AI system for a hospital that properly communicates its uncertainty. Connect this lesson's ideas back to earlier lessons — how does overconfidence relate to hallucination or brittleness? Have at least 3 exchanges to complete the lab.