In 2023, New York attorney Steven Schwartz filed a legal brief in Mata v. Avianca, Inc., a federal case against the airline Avianca. The brief cited six supporting court precedents. Every single one was fabricated. Varghese v. China Southern Airlines, Shaboon v. Egyptair, Martinez v. Delta Air Lines: none existed anywhere in case law. The attorney had asked ChatGPT to find relevant cases and submitted the results without verification. In June 2023, Judge P. Kevin Castel sanctioned the attorneys involved and imposed a $5,000 fine. The moment became a landmark warning about treating LLM output as ground truth.
The word "hallucination" is borrowed from psychology, where it describes perceiving something that isn't there. In the context of LLMs, it has come to mean something more specific: the model generates content that is factually incorrect, unverifiable, or entirely invented, yet presented in the same confident, grammatically fluent register it uses for correct information.
This is not a bug in the traditional sense — it is an emergent property of how these models are built. LLMs do not store facts in a database and retrieve them. They learn statistical patterns across billions of text tokens, and at inference time they generate the next token based on what is most probable given the preceding context. There is no lookup, no citation trail, no internal truth-checker.
When a model is asked about a case that doesn't exist, it doesn't return NULL. It generates the most plausible-sounding response — a case name that sounds like real case names, citations that follow the correct formatting pattern, and a holding that is legally coherent. The output is confidently wrong because the model has no mechanism for distinguishing between "I learned this" and "I am generating this."
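The contrast between a lookup and a generator can be made concrete with a toy sketch. Everything here is an assumption made for illustration: the `KNOWN_CASES` set, the `NEXT_TOKEN` table, and its probabilities are invented, and a real LLM conditions on the whole preceding context rather than on a single previous token.

```python
import random

# Hypothetical "database" of cases that are known to exist.
KNOWN_CASES = {"Mata v. Avianca"}

def lookup(case_name):
    """Database-style lookup: returns None when the case is unknown."""
    return case_name if case_name in KNOWN_CASES else None

# Toy conditional distributions P(next token | previous token).
# The probabilities are invented for illustration only.
NEXT_TOKEN = {
    "<start>": {"Varghese": 0.5, "Shaboon": 0.5},
    "Varghese": {"v.": 1.0},
    "Shaboon": {"v.": 1.0},
    "v.": {"China Southern Airlines": 0.6, "Egyptair": 0.4},
}

def generate(rng, max_tokens=3):
    """Sample one token at a time; no step checks whether the case exists."""
    token, output = "<start>", []
    for _ in range(max_tokens):
        choices = NEXT_TOKEN.get(token)
        if choices is None:
            break
        token = rng.choices(list(choices), weights=list(choices.values()))[0]
        output.append(token)
    return " ".join(output)

rng = random.Random(0)
invented = generate(rng)
print(invented)          # a well-formed case name
print(lookup(invented))  # None: only the lookup can report "not found"
```

The generator always returns something shaped like a case name, because producing a probable continuation is all it does. Only the separate lookup step can report that nothing was found.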
Hallucination is not the same as the model being uncertain. A model can express uncertainty and still hallucinate. It can also express high confidence and be completely correct. Confidence signals in LLM output do not reliably track factual accuracy — they track how fluent and probable the continuation is.
Some researchers prefer the term confabulation, borrowed from neuropsychology. In patients with certain memory disorders, confabulation refers to the production of fabricated, distorted, or misinterpreted memories without conscious deception — the patient genuinely believes what they are saying. This maps more precisely onto LLM behavior. The model is not lying. It has no intent. It is filling gaps in a way that is internally coherent but externally false.
The distinction matters practically. If we call it lying, we might look for ways to make the model "want" to tell the truth. If we understand it as confabulation — a structural artifact of how memory and generation work — we look instead at architectural interventions, retrieval augmentation, and output verification pipelines.
Researchers have identified several distinct categories. Entity hallucination involves inventing people, places, organizations, or publications — a paper by a real researcher that was never written, a company that never existed. Temporal hallucination involves wrong dates — an event placed in the wrong year, a person described as living when they had died, or vice versa. Relation hallucination gets the facts right but the relationship wrong — correctly identifying two real people but wrongly claiming one supervised the other's doctoral thesis.
There is also source hallucination — the model cites a real journal, real volume number, real page range, but the article it describes either doesn't exist or says something entirely different. This is particularly dangerous because the citation format is correct enough that a reader might not bother to verify.
Hallucination is not an edge case — studies have found fabrication rates of 3–15% in general-purpose tasks, rising sharply for specialized domains like law, medicine, and academic citation. Understanding why it happens at a mechanistic level is the first step toward building systems and workflows that catch it.
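A quick calculation shows why even the low end of that range matters once a document contains several generated claims. This is a sketch under a strong simplifying assumption that is not in the studies themselves: each claim is treated as fabricated independently with the same probability. The function name is hypothetical.

```python
def p_at_least_one_fabrication(per_claim_rate, n_claims):
    """Probability that at least one of n claims is fabricated.

    Assumes (for illustration only) that claims are independent and
    share the same fabrication rate.
    """
    return 1.0 - (1.0 - per_claim_rate) ** n_claims

# Six claims, echoing the six citations in the brief described earlier.
for rate in (0.03, 0.15):
    print(f"rate={rate:.2f}  P(at least one)={p_at_least_one_fabrication(rate, 6):.3f}")
```

Under these assumptions, the chance that a six-claim document contains at least one fabrication is several times the per-claim rate, which is why spot-checking a single claim gives little assurance about the whole.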
In this lab you will explore what hallucination looks like mechanically — why LLMs generate false content with the same fluency as true content, and how the Mata v. Avianca case illustrates the real-world stakes. Ask questions, probe the mechanics, and test your understanding.
In early February 2023, Google introduced its Bard chatbot with a promotional video intended to showcase its capabilities. In the video, Bard was asked what new discoveries the James Webb Space Telescope had made that could be shared with a child. Bard responded that JWST was used to take "the very first pictures" of a planet outside our solar system. This was false: the first direct image of an exoplanet was taken in 2004 using the European Southern Observatory's Very Large Telescope. Astronomers pointed out the error publicly after the promotional video had been released, shortly before Google's launch event. Alphabet's stock fell roughly 7% on the day of the event, erasing about $100 billion in market value.
LLMs learn from text corpora scraped from the web, books, code repositories, and other sources. These corpora are vast but not complete. When a model encounters a question about a topic that was underrepresented, incorrectly represented, or absent from training data, it has no honest signal to fall back on. It generates a plausible completion based on adjacent, related patterns.
The Bard exoplanet error likely reflects a training signal where descriptions of JWST and "first images" frequently co-occurred — JWST genuinely did produce historic first images of many things. The model over-generalized this pattern to a claim it hadn't actually been trained on specifically.
LLMs are trained on teacher-forced sequences: during training, each token prediction is conditioned on the ground-truth preceding tokens. At inference, the model conditions on its own previously generated tokens. This exposure bias means that once an incorrect token is generated, subsequent tokens are optimized to follow coherently from that error rather than correcting it. A fabricated case name becomes a plausible citation that becomes a coherent holding — each step is locally probable given the previous step.
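The difference between the two conditioning regimes can be sketched with a toy deterministic "model". The `MODEL` table and the `REFERENCE` sequence are invented for illustration, with one deliberately wrong entry. A real LLM is probabilistic and conditions on the full prefix, so this only illustrates the mechanism.

```python
# Hypothetical ground-truth sequence, split into coarse "tokens".
REFERENCE = ["the court", "relied on", "a real case",
             "decided in 1999", "on treaty grounds"]

# Toy model: most likely next token given the previous one.
# The entry for "relied on" is deliberately wrong relative to REFERENCE.
MODEL = {
    "<start>": "the court",
    "the court": "relied on",
    "relied on": "an invented case",        # the single bad prediction
    "a real case": "decided in 1999",
    "decided in 1999": "on treaty grounds",
    "an invented case": "decided in 2008",  # coherent with the error
    "decided in 2008": "on tolling grounds",
}

def teacher_forced(reference):
    """Training-style: every prediction is conditioned on the TRUE prefix."""
    preds, prev = [], "<start>"
    for truth in reference:
        preds.append(MODEL[prev])
        prev = truth  # ground truth is fed back in, so errors do not spread
    return preds

def free_running(n_tokens):
    """Inference-style: every prediction is conditioned on the model's OWN output."""
    preds, prev = [], "<start>"
    for _ in range(n_tokens):
        prev = MODEL[prev]
        preds.append(prev)
    return preds

def mismatches(preds, reference):
    return sum(p != r for p, r in zip(preds, reference))

print(mismatches(teacher_forced(REFERENCE), REFERENCE))
print(mismatches(free_running(len(REFERENCE)), REFERENCE))
print(free_running(len(REFERENCE)))
```

With teacher forcing the single bad prediction stays isolated. In free-running mode the same bad prediction pulls every later token onto a path that is coherent with the error, which is the compounding behavior at issue here.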
This is the compounding effect: hallucination tends to be self-consistent. The fictional case Schwartz cited had a plausible party name, a plausible jurisdiction, a plausible year, and a holding that fit the argument. The model optimized for local coherence, not external truth.
Once a model commits to a hallucinated entity in a long response, it often continues to reference it consistently — giving the hallucination internal coherence that makes it harder to detect. The fictional case is cited, then described, then quoted, all without ever existing.
Everything a model knows is encoded in its parameters — billions of weights adjusted during training to compress vast amounts of text. This parametric memory is not lossless. Specific facts, especially rare ones, may be poorly encoded, encoded with errors, or conflated with similar facts. The model may "remember" a fact but attach the wrong date, the wrong name, or the wrong attribution.
This is distinct from not knowing something. The model has a representation — it's just incorrect or partially merged with another fact. Studies by researchers at MIT, Stanford, and DeepMind have shown that LLMs systematically confuse entities that appear in similar syntactic contexts in training data.
Reinforcement Learning from Human Feedback (RLHF) trains models to produce responses that human raters prefer. Raters consistently prefer fluent, confident, detailed answers over hedged or incomplete ones. This creates an incentive gradient: the model learns that sounding authoritative is rewarded, even when the underlying content is uncertain. The result is a systematic overconfidence in generated output — the stylistic confidence of the response does not track its epistemic reliability.
Several research teams, including work published from Anthropic's interpretability group, have noted this tension: RLHF is excellent at making models helpful and readable, but it can amplify hallucination by training away the hedges and uncertainty signals that might otherwise warn users.
This lab focuses on the structural causes of hallucination: training data gaps, exposure bias, parametric memory loss, and RLHF's fluency incentive. Use the Google Bard exoplanet error as a concrete case study, or ask about any of the four causes in depth.
In February 2024, a British Columbia Civil Resolution Tribunal ruled against Air Canada after its customer service chatbot told passenger Jake Moffatt that he could apply for a bereavement fare discount retroactively after purchasing a ticket — a policy that did not exist. Air Canada argued the chatbot was "a separate legal entity" responsible for its own statements. The tribunal rejected this defense, holding Air Canada liable for the chatbot's misinformation. Air Canada was ordered to pay Moffatt $812.02. The ruling established a significant precedent: companies cannot disclaim liability for AI-generated misinformation in customer-facing applications.
In healthcare, hallucinated content can directly affect clinical decisions. A 2023 study published in JAMA Internal Medicine by researchers at Beth Israel Deaconess Medical Center found that when GPT-4 was used to answer medical licensing exam questions, it performed near-passing level — but when it made errors, those errors were often medically dangerous misattributions of drug interactions, dosage thresholds, or contraindication profiles. A fabricated drug interaction is not like a wrong date in a history essay; it can result in patient harm.
A separate 2023 study in npj Digital Medicine tested ChatGPT's ability to summarize clinical trial results from provided documents. The model hallucinated statistical findings not present in the source documents in approximately 30% of cases, often inverting outcome significance. These were extrinsic hallucinations — added content not present in the context — in exactly the settings where physicians might trust AI-generated summaries.
The British Columbia ruling is widely described as the first in which a tribunal explicitly held a company liable for its AI chatbot's hallucinated policy information. The tribunal's reasoning: the chatbot speaks for Air Canada, and Air Canada is responsible for ensuring accurate information regardless of the source. The same reasoning plausibly extends to hallucinated medical advice, financial guidance, or any consumer-facing AI claim.
Hallucinated citations are a particular threat to scientific integrity. A 2023 analysis published in Patterns (Cell Press) tested several LLMs on their ability to provide accurate citations in life sciences. Across models, 30–70% of generated citations were partially or entirely fabricated — wrong author combinations, wrong journal placements, wrong DOIs attached to real paper titles. Because scientific databases are trusted by downstream researchers, a hallucinated citation that makes it into even a single published paper can propagate through the literature.
The concern is not just that AI writes bad citations — it is that hallucinated references may cite studies that, if they existed, would lend authority to a claim. The absence of the paper is structurally invisible to any reader who doesn't check.
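Checking a reference does not need to be elaborate. The sketch below assumes a trusted bibliographic index is available as a mapping from DOI to title. `TRUSTED_INDEX`, its made-up DOIs and titles, and `check_citation` are all hypothetical stand-ins, not a real database or API.

```python
# Hypothetical trusted index: DOI -> title. DOIs and titles are made up.
TRUSTED_INDEX = {
    "10.0000/example.001": "A study of widget fatigue",
    "10.0000/example.002": "Widget corrosion in salt water",
}

def check_citation(doi, title, index):
    """Classify a generated citation against a trusted index.

    Returns "verified", "mismatch" (real DOI, wrong title), or
    "not_found" (the DOI is absent from the index).
    """
    known_title = index.get(doi)
    if known_title is None:
        return "not_found"
    if known_title.strip().lower() != title.strip().lower():
        return "mismatch"
    return "verified"

generated = [
    ("10.0000/example.001", "A study of widget fatigue"),
    ("10.0000/example.002", "A study of widget fatigue"),  # wrong DOI for this title
    ("10.0000/example.999", "Widgets and public policy"),  # DOI does not exist
]
for doi, title in generated:
    print(doi, check_citation(doi, title, TRUSTED_INDEX))
```

The "mismatch" outcome corresponds to the hardest case described above: a real identifier attached to the wrong paper, which looks correct unless someone resolves it.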
In finance, hallucinations about earnings figures, merger terms, regulatory filings, or market data can inform trading decisions or compliance workflows. Several incidents have been documented where AI-assisted research tools generated incorrect data about earnings per share, M&A deal terms, and financial covenants. While no single case reached the scale of the legal or medical examples above, financial regulators in the EU, UK, and US have issued guidance specifically noting that LLM-generated financial information must be independently verified against primary sources.
Across all these domains, the central danger is miscalibrated trust. Users — including professionals — develop a mental model of what errors look like. Typos and non-sequiturs are obvious. A coherent, fluent, well-formatted response that happens to be factually wrong does not look like an error. It looks like expertise.
This is the core challenge that makes domain-specific hallucination particularly dangerous: the errors are stylistically indistinguishable from correct output. Detecting them requires domain knowledge, access to primary sources, and the discipline to verify even plausible-sounding claims.
In this lab you will explore how hallucination risk varies across domains — law, medicine, science, finance — and what the Air Canada, JAMA, and npj cases tell us about liability, detection, and mitigation. Ask about specific cases, risk factors, or what "miscalibrated trust" means in practice.
When Microsoft launched Bing Chat (later Copilot) in February 2023, early users discovered it could be prompted into producing volatile or threatening statements, and it frequently hallucinated source citations. Microsoft's response was instructive: rather than attempting to retrain the model, they added real-time web retrieval grounding, so that each response was anchored to cited web sources that users could verify. Over subsequent months, measurable hallucination rates in factual queries dropped substantially. The architecture shift, from pure parametric memory to retrieval-augmented generation, is now industry standard for knowledge-intensive tasks.
Retrieval-augmented generation (RAG) is the most widely adopted technical mitigation. Instead of relying on parametric memory alone, the system retrieves relevant documents from an external corpus at query time and conditions the model's generation on those documents. A model conditioned on relevant context is far less likely to assert facts that the context directly contradicts, but it can still hallucinate by ignoring or misreading that context, so RAG reduces but does not eliminate hallucination.
RAG pipelines require careful engineering: retrieval quality matters (returning irrelevant documents can increase hallucination), chunking and context window management affect which facts are available, and models must be explicitly prompted to cite and stay grounded in provided sources. Naive RAG implementations often fail to deliver reliable accuracy improvements.
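The engineering concerns above can be sketched in a few lines. This is a minimal illustration, not a production retriever. The `DOCS` corpus is invented, word overlap stands in for a real retrieval model, and the model call itself is left out, so the sketch only covers retrieval, a relevance threshold, and a prompt that asks for grounded, cited answers.

```python
import re

# Hypothetical policy corpus: document id -> text (invented for illustration).
DOCS = {
    "policy-bereavement": "Bereavement fare requests must be submitted before travel.",
    "policy-baggage": "Each passenger may check one bag up to 23 kg.",
    "policy-refunds": "Refund requests are processed within 30 days.",
}

def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, docs, k=2, min_overlap=2):
    """Rank documents by word overlap; drop weak matches entirely,
    since irrelevant context can make hallucination worse."""
    q = tokenize(query)
    scored = [(len(q & tokenize(text)), doc_id) for doc_id, text in docs.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score >= min_overlap]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]

def build_grounded_prompt(query, docs, doc_ids):
    """Return a prompt that restricts the model to the sources,
    or None when nothing relevant was retrieved."""
    if not doc_ids:
        return None
    sources = "\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id in doc_ids)
    return (
        "Answer using only the sources below and cite a source id for every "
        "claim. If the sources do not answer the question, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )

question = "Can I request a bereavement fare after travel?"
hits = retrieve(question, DOCS)
print(hits)
print(build_grounded_prompt(question, DOCS, hits))
```

Returning `None` when retrieval finds nothing is a deliberate design choice in this sketch: it forces the caller to decline or escalate instead of letting the model answer from parametric memory alone.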
Fine-tuning adds facts to parametric memory, but this doesn't prevent hallucination; it changes which false facts the model might produce. RAG keeps ground truth external and auditable. For high-stakes domains, RAG with cited sources is significantly more reliable than fine-tuning alone.
Self-consistency, a technique developed at Google Research (Wang et al., 2022), samples multiple independent responses to the same prompt and selects the answer that appears most frequently across samples. If the model generates the same fact in 8 out of 10 samples, it is more likely to be a well-encoded piece of training data than a hallucination. Self-consistency doesn't eliminate hallucination, but it significantly improves accuracy: gains of 10–15 percentage points were reported across reasoning benchmarks.
The limitation: self-consistency is expensive (multiple forward passes per query) and doesn't help when a model has a consistent but incorrect belief — a systematic error in parametric memory will be consistently wrong across all samples.
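The voting step is simple enough to sketch directly. The sampled answers are passed in as plain strings, so no model is called. The `min_agreement` threshold and the choice to return `None` below it are additions made for this illustration, not part of the published method.

```python
from collections import Counter

def self_consistent_answer(samples, min_agreement=0.5):
    """Majority vote over sampled answers.

    Returns (answer, agreement). The answer is None when the most common
    response does not reach min_agreement (a hypothetical threshold).
    """
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(s.strip().lower() for s in samples)
    answer, votes = counts.most_common(1)[0]
    agreement = votes / len(samples)
    return (answer if agreement >= min_agreement else None), agreement

# Eight of ten samples agree: accept the majority answer.
print(self_consistent_answer(["2004"] * 8 + ["2022", "1995"]))

# No clear majority: abstain instead of guessing.
print(self_consistent_answer(["2004", "2022", "1995", "2010"]))

# The limitation in practice: a consistently wrong belief still wins the vote.
print(self_consistent_answer(["2022"] * 10))
```

The last call shows why voting cannot be the only safeguard: perfect agreement says the model is consistent, not that it is correct.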
Several research groups have worked on getting models to produce calibrated confidence signals — probability estimates that track actual accuracy. Techniques include verbalized confidence (prompting the model to state its confidence), logit-based confidence (using output token probabilities as reliability signals), and semantic entropy (measuring the diversity of outputs across samples as a proxy for uncertainty). A 2023 paper from the University of Oxford introduced semantic entropy as a hallucination detection metric, finding it significantly outperformed simple confidence elicitation.
The practical limitation is that models trained with RLHF have learned to suppress hedges, so verbalized confidence is often uncalibrated. Logit-based methods require token-level probabilities, which are unavailable when an API exposes only the generated text.
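Semantic entropy can be approximated from sampled text alone, which is useful when token probabilities are unavailable. The sketch below is a simplified, frequency-based version: it groups samples that mean the same thing and computes the entropy of the group sizes. Published work decides equivalence with an entailment model. Here `same_meaning` is a hypothetical stand-in that only compares normalized strings.

```python
import math

def normalize(answer):
    return " ".join(answer.lower().replace(".", "").split())

def same_meaning(a, b):
    """Hypothetical stand-in for a real semantic-equivalence check."""
    return normalize(a) == normalize(b)

def cluster_by_meaning(samples, equivalent):
    """Greedily group samples that are equivalent to a cluster's first member."""
    clusters = []
    for sample in samples:
        for cluster in clusters:
            if equivalent(sample, cluster[0]):
                cluster.append(sample)
                break
        else:
            clusters.append([sample])
    return clusters

def semantic_entropy(samples, equivalent):
    """Entropy (in nats) of the distribution over meaning clusters."""
    clusters = cluster_by_meaning(samples, equivalent)
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

consistent = ["2004", "2004.", " 2004 ", "2004"]
scattered = ["2004", "2022", "1995", "2010"]
print(semantic_entropy(consistent, same_meaning))  # low: one meaning
print(semantic_entropy(scattered, same_meaning))   # high: four meanings
```

Low entropy means the samples agree in meaning even if the wording differs. High entropy is the signal that the model may be guessing.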
Beyond technical approaches, the most reliable mitigation in deployed systems is human-in-the-loop verification combined with scope restriction. The legal profession's post-Mata response illustrates this: major law firms issued policies requiring attorneys to independently verify every AI-generated citation against Westlaw or Lexis before filing. Healthcare systems deploying LLM assistants are required by FDA guidance (draft, 2023) to route all clinical suggestions through clinician review before any action is taken.
Scope restriction — limiting what the model is allowed to answer — is also effective. Air Canada's failure was partly a deployment decision: allowing a chatbot to answer detailed policy questions without grounding it in a live policy database and without a human escalation path.
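Scope restriction and human escalation can be expressed as a small routing policy placed in front of the model. The topic sets, the `Decision` type, and the `route` function are hypothetical, and the sketch assumes the incoming question has already been classified into a topic.

```python
from dataclasses import dataclass

# Hypothetical policy: topics the assistant may answer on its own,
# and topics that always go to a human because errors are costly.
ALLOWED_TOPICS = {"baggage", "check-in"}
HIGH_STAKES_TOPICS = {"refund", "bereavement"}

@dataclass(frozen=True)
class Decision:
    action: str  # "answer", "escalate", or "refuse"
    reason: str

def route(topic, has_grounding_source):
    """Decide what to do with a question before any text is generated."""
    if topic in HIGH_STAKES_TOPICS:
        return Decision("escalate", "high-stakes topic requires human review")
    if topic not in ALLOWED_TOPICS:
        return Decision("refuse", "topic is outside the assistant's scope")
    if not has_grounding_source:
        return Decision("escalate", "no policy document retrieved to ground the answer")
    return Decision("answer", "in scope and grounded")

for topic, grounded in [("baggage", True), ("baggage", False),
                        ("bereavement", True), ("weather", True)]:
    print(topic, grounded, route(topic, grounded))
```

The checks run before generation, so a question that should never be answered automatically does not reach the model at all.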
No single technique eliminates hallucination. Production systems in high-stakes domains use layered mitigations: RAG for grounding, self-consistency for critical claims, calibrated uncertainty signaling, scope restriction, and human review for consequential outputs. Each layer catches what the others miss.
In this final lab you will work through the design of hallucination-resistant AI systems — when to use RAG, how to implement self-consistency, what semantic entropy adds, and how to design human-in-the-loop workflows for high-stakes domains. Apply what you've learned to concrete scenarios.