Module 8 · Lesson 1

Hallucinations and Confabulation

When models generate fluent, confident, and completely false information

Why does increasing model size not reliably eliminate fabricated facts?

In 2023, two New York attorneys, Steven Schwartz and Peter LoDuca, filed a legal brief citing six cases that did not exist. Schwartz had used ChatGPT to research precedents. When Judge P. Kevin Castel demanded copies, it emerged that every one of those citations was fabricated. The model had invented case names, docket numbers, judges, and quoted passages from rulings that were never written. In June 2023 the court sanctioned both lawyers and imposed a $5,000 fine. The event became one of the first widely documented legal consequences of LLM hallucination in professional practice.

What Hallucination Actually Is

The term hallucination in AI refers to outputs that are fluent, syntactically well-formed, and confidently asserted — but factually incorrect or entirely fabricated. Researchers sometimes prefer the term confabulation, borrowed from neuroscience, where it describes a brain's tendency to fill memory gaps with plausible-sounding but false material without awareness of doing so.

LLMs do not retrieve facts from a database. They predict the next token based on learned statistical patterns. A model trained on millions of legal documents learns that legal citations follow a specific format: Party v. Party, volume Reporter page (court year). When prompted to find cases about a topic, the model generates tokens that fit that pattern — whether or not the underlying case exists. The form is correct; the content is invented.

This is not a bug introduced by insufficient data. It is a structural consequence of how next-token prediction works. The model has no internal truth-checker — no oracle it queries before generating a claim. It has only the distribution of tokens it learned from training.
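The gap between correct form and real content can be sketched in a few lines. This is a toy illustration, not how an LLM works internally: the party names, reporters, and courts are hypothetical fragments, and `generate_citation` simply assembles them into the citation format described above. The point is that a format check passes while an existence check, which the generator never performs, fails.

```python
import random
import re

# Hypothetical fragments standing in for patterns absorbed from training text.
PARTIES = ["Harlow", "Medina", "Northgate Airlines", "Quill Logistics"]
REPORTERS = ["F.3d", "F. Supp. 2d", "U.S."]
COURTS = ["2d Cir.", "S.D.N.Y.", "9th Cir."]

# Surface form only: Party v. Party, volume Reporter page (court year)
CITATION_FORM = re.compile(r"^.+ v\. .+, \d+ [A-Za-z0-9. ]+ \d+ \(.+ \d{4}\)$")

def generate_citation(rng: random.Random) -> str:
    """Assemble text that fits the citation format; nothing checks that the case exists."""
    plaintiff, defendant = rng.sample(PARTIES, 2)
    return (f"{plaintiff} v. {defendant}, {rng.randint(100, 999)} "
            f"{rng.choice(REPORTERS)} {rng.randint(1, 1500)} "
            f"({rng.choice(COURTS)} {rng.randint(1990, 2020)})")

def exists(citation: str, known_cases: set) -> bool:
    """External lookup: the verification step a bare next-token predictor lacks."""
    return citation in known_cases

rng = random.Random(0)
citation = generate_citation(rng)
well_formed = CITATION_FORM.match(citation) is not None  # the form is correct
real = exists(citation, known_cases=set())               # the content is invented
```

Any real safeguard would replace the empty `known_cases` set with a lookup against an authoritative case database.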

Key Distinction

Hallucination differs from error. An error is a wrong answer to a question the model understood. Hallucination is a plausible-sounding answer to a question the model cannot actually answer — generated as if it could. The model produces no signal of uncertainty.

Why Scale Doesn't Solve It

A common intuition holds that bigger models with more training data should hallucinate less. The evidence is mixed. A 2023 paper from researchers at Columbia and Stanford measured hallucination rates across GPT-3.5, GPT-4, and Claude on factual recall tasks. GPT-4 hallucinated less frequently than GPT-3.5 on well-represented topics but hallucinated with greater confidence on obscure topics. Scale, in other words, can make false statements more convincing even as it makes them less frequent.

A DeepMind analysis of its Gopher model (280B parameters), published in December 2021, found that scaling improved performance on many benchmarks but showed diminishing returns, and occasional regressions, on tasks requiring precise factual grounding. The paper noted that models can learn to "sound more authoritative" as they scale, which makes errors harder to detect.

The core issue: training on more human text means training on more confident human assertions, many of which were themselves incorrect. The model learns the register of certainty, not the practice of verification.

Taxonomy of Hallucination Types

Intrinsic: The output directly contradicts information provided in the prompt or context window. The model ignores or overrides source material it was given.

Extrinsic: The output adds information not present in the source. That information may or may not be true, but it cannot be verified from what was provided.

Entity Fabrication: Names, places, organizations, citations, or identifiers that do not exist are invented with the correct syntactic form.

Temporal Confabulation: Events, dates, or sequences are assigned plausible-sounding but incorrect timeframes, often blending real and invented chronology.
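The intrinsic/extrinsic split can be made concrete with a minimal sketch. It assumes the source has already been reduced to structured field/value facts, which is a strong simplification: free-text summaries would need an entailment model rather than an equality check. The `Verdict` enum and `classify_claim` function are illustrative names, not part of any library.

```python
from enum import Enum

class Verdict(Enum):
    SUPPORTED = "supported"
    INTRINSIC = "intrinsic hallucination"  # contradicts the source
    EXTRINSIC = "extrinsic hallucination"  # cannot be verified from the source

def classify_claim(source: dict, field: str, value: str) -> Verdict:
    """Compare one claimed fact against the facts the model was given."""
    if field not in source:
        return Verdict.EXTRINSIC
    return Verdict.SUPPORTED if source[field] == value else Verdict.INTRINSIC

# Hypothetical source facts supplied in the context window.
source = {"filing_year": "2023", "court": "S.D.N.Y."}

supported = classify_claim(source, "filing_year", "2023")
intrinsic = classify_claim(source, "filing_year", "2021")  # contradicts the source
extrinsic = classify_claim(source, "judge", "Smith")       # absent from the source
```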

Documented Scale

A 2023 study by Vectara tested a range of LLMs on document summarization, a task with ground truth. Reported hallucination rates ranged from 3% (GPT-4) to 27% (Google's PaLM chat model). Even the best model introduced fabricated content in roughly 1 in 33 summaries. In legal, medical, or financial applications, that rate is not acceptable at scale.
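To see what a 3% rate means at volume, the arithmetic can be worked through directly. The sketch below assumes summaries hallucinate independently of one another, which is a simplification; the batch sizes are arbitrary examples.

```python
def expected_hallucinated(rate: float, n_summaries: int) -> float:
    """Expected number of summaries containing fabricated content."""
    return rate * n_summaries

def prob_at_least_one(rate: float, n_summaries: int) -> float:
    """Chance that a batch contains at least one hallucinated summary (independence assumed)."""
    return 1 - (1 - rate) ** n_summaries

one_in = 1 / 0.03                               # about 33, i.e. "1 in 33"
per_10k = expected_hallucinated(0.03, 10_000)   # about 300 summaries
p_batch = prob_at_least_one(0.03, 50)           # about 0.78 for a batch of 50
```

Under these assumptions, a reviewer handling 50 summaries is more likely than not to encounter at least one fabrication.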

Mitigation Approaches and Their Limits

Retrieval-Augmented Generation (RAG) reduces hallucination by grounding responses in retrieved documents. But RAG does not eliminate the problem: models can still hallucinate when summarizing retrieved material, misattribute quotes to the wrong document, or confabulate when retrieved sources are ambiguous.
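One of the RAG failure modes named above, misattributing a quote to the wrong document, can be guarded against with a simple post-check. The sketch below uses toy keyword retrieval and a verbatim substring test; both are stand-ins chosen for brevity, and the documents and function names are hypothetical.

```python
def retrieve(query: str, docs: dict, k: int = 1) -> list:
    """Rank document ids by how many query words each document shares (toy retrieval)."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(docs[d].lower().split())), reverse=True)
    return ranked[:k]

def quote_is_grounded(quote: str, doc_id: str, docs: dict) -> bool:
    """True only if the quote appears verbatim in the document it is attributed to."""
    return quote.lower() in docs.get(doc_id, "").lower()

docs = {
    "policy_a": "Refunds are issued within 30 days of purchase.",
    "policy_b": "Warranty claims require the original receipt.",
}

top = retrieve("how many days for refunds", docs)
ok = quote_is_grounded("within 30 days", "policy_a", docs)
misattributed = quote_is_grounded("within 30 days", "policy_b", docs)
fabricated = quote_is_grounded("within 60 days", "policy_a", docs)
```

A verbatim check catches only exact-quote errors; paraphrased claims that drift from the source would pass unnoticed, which is part of why RAG reduces rather than eliminates the problem.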

RLHF calibration can teach models to express uncertainty more accurately — to say "I'm not sure" when they should. But calibration is imperfect and domain-specific. A model may be well-calibrated on common topics and poorly calibrated on specialized domains where training data was sparse.
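Calibration is commonly summarized with expected calibration error (ECE): bucket answers by stated confidence, then compare average confidence with actual accuracy in each bucket. The sketch below is a plain-Python version; the confidence and correctness values are made up purely to contrast a well-calibrated domain with a poorly calibrated one.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between stated confidence and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(ok for _, ok in bucket) / len(bucket)
            ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece

# Illustrative data: the model claims 90% confidence on every answer.
common_topic = expected_calibration_error([0.9] * 10, [1] * 9 + [0])      # 9 of 10 right
sparse_topic = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)  # 5 of 10 right
```

The same stated confidence yields an ECE near 0 in the first case and about 0.4 in the second, which is the domain-specific miscalibration the paragraph describes.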

As of 2024, no deployed LLM has reliably solved hallucination. It remains one of the central known limits of the architecture.

Quiz — Hallucinations and Confabulation

Three questions · Select the best answer

1. In the 2023 Schwartz & LoDuca case, what did ChatGPT fabricate?

✓ Correct. Every cited case was entirely fabricated: names, docket numbers, courts, and quoted passages. The model generated syntactically correct legal citations that referred to nothing real.

Not quite. The fabrications were total: the cases did not exist at all. The model invented complete legal citations in the correct format.

2. Why doesn't increasing model scale reliably eliminate hallucination?

✓ Correct. Next-token prediction has no internal oracle for truth. Scaling improves pattern-matching but can increase the convincingness of hallucinated content on obscure topics.

Not quite. The fundamental issue is architectural: the model predicts statistically plausible tokens without verifying truth, and this does not change with scale.

3. What is "extrinsic hallucination"?

✓ Correct. Extrinsic hallucination adds unverifiable information beyond the source; the model supplements rather than contradicts. Intrinsic hallucination directly contradicts provided source material.

That describes intrinsic hallucination. Extrinsic hallucination adds new, unverifiable information not present in what was provided.

Lab 1 — Probing Hallucination Patterns

Discuss hallucination mechanics and mitigation with your AI tutor · 3 exchanges to complete

Your Task

Explore how and why hallucination occurs in LLMs. Ask about specific documented cases, the structural reasons models confabulate, how RAG helps but doesn't fully solve it, or how calibration works. Push on edge cases.

Suggested start: "Why can a model hallucinate even when given a source document to summarize — isn't the answer right there in the context?"


Module 8 · Lesson 2

Reasoning and Mathematical Limits

Why transformers struggle with multi-step logic even as they ace benchmarks

What does it mean for a model to "solve" a math problem it doesn't understand?

In October 2021, researchers at OpenAI introduced the GSM8K benchmark: 8,500 grade-school math word problems. GPT-3 scored around 35%. GPT-4, released in March 2023, scored above 90%. The AI community celebrated. Then, in July 2023, a team at MIT and elsewhere published a study showing that minor surface-level rephrasing of the same problems (changing "Maria" to "Sarah," altering irrelevant numbers) caused GPT-4's accuracy to drop by 10–20 percentage points. The result suggested that the model had not learned to reason through math so much as learned which token sequences tend to follow which problem formats.

The Benchmark Fragility Problem

Benchmarks measure what models do on specific test distributions. When a model trains on data that resembles those test distributions — or when benchmark problems leak into pretraining data — scores rise without representing genuine capability improvement. This is called benchmark contamination or dataset leakage.
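A common first-pass check for contamination is n-gram overlap between a test item and training documents. The sketch below is a minimal version under stated assumptions: whitespace tokenization, 5-grams, and hypothetical example strings. Real contamination audits use more robust normalization and much larger corpora.

```python
def ngrams(text: str, n: int = 5) -> set:
    """All contiguous n-token windows in the text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(test_item: str, corpus_doc: str, n: int = 5) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus document."""
    item = ngrams(test_item, n)
    if not item:
        return 0.0
    return len(item & ngrams(corpus_doc, n)) / len(item)

problem = "Maria has 5 apples and buys 3 more how many apples does she have"
leaked = "forum post: " + problem + " answer 8"      # hypothetical leaked copy
clean = "a recipe for apple pie that needs 5 apples and some sugar"

leak_score = overlap_ratio(problem, leaked)
clean_score = overlap_ratio(problem, clean)
```

A high ratio flags an item for manual review; a low ratio does not prove the item is clean, since paraphrased leaks evade exact matching.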

A 2023 paper by researchers at Stanford and the University of California examined whether GPT-4's high scores on math benchmarks reflected reasoning ability or memorization of problem patterns. By generating isomorphic problems — structurally identical but with different surface features — they showed that performance degraded substantially when surface cues were changed, suggesting pattern-matching rather than underlying mathematical reasoning.
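Isomorphic problems are straightforward to generate from a template: hold the structure fixed and vary names and numbers. The sketch below is illustrative only. The template, the name list, and `memorizing_solver` (a deliberately crude stand-in for a system that memorized one surface form) are all hypothetical.

```python
import random

TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have?"

def make_variant(rng: random.Random) -> tuple:
    """Same structure (a + b), different surface features."""
    name = rng.choice(["Maria", "Sarah", "Deepak", "Lin"])
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    return TEMPLATE.format(name=name, a=a, b=b), a + b

def accuracy(solver, variants) -> float:
    """Share of (question, answer) pairs the solver gets exactly right."""
    return sum(solver(q) == answer for q, answer in variants) / len(variants)

def memorizing_solver(question: str) -> int:
    """Stand-in for a pattern-matcher that memorized a single benchmark item."""
    return 8 if "Maria has 5 apples" in question else -1

rng = random.Random(42)
variants = [make_variant(rng) for _ in range(20)]
original = [(TEMPLATE.format(name="Maria", a=5, b=3), 8)]

score_original = accuracy(memorizing_solver, original)
score_variants = accuracy(memorizing_solver, variants)
```

The solver is perfect on the original item and collapses on the variants, which is an exaggerated version of the degradation the study reports.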

This matters because the difference between pattern-matching and reasoning is invisible in benchmark scores but critical in deployment. A model that scored 92% on GSM8K can still fail a novel three-step arithmetic problem a competent ten-year-old would solve.

The Winograd Lesson

The Winograd Schema Challenge was designed in 2011 as a test requiring common-sense reasoning to resolve pronoun references. Early LLMs scored near random. By 2019, large models began scoring above 90%. But follow-up work showed models had learned to exploit statistical correlations in the schemas rather than engage in genuine coreference reasoning. High scores did not mean the underlying problem was solved.

What Transformers Actually Do With Math

A transformer processes a math problem as a sequence of tokens. It has no symbolic computation engine — no register, no stack, no formal arithmetic unit. What it has is a learned mapping from token sequences to output distributions trained on millions of solved problems.

For simple, common problem types, this works remarkably well. The training distribution contains so many similar problems that the model's interpolation is accurate. For novel compositions — problems that require chaining multiple unfamiliar sub-steps — the model's learned patterns break down.

Chain-of-thought prompting (introduced in a 2022 Google paper by Jason Wei et al.) substantially improves multi-step reasoning by eliciting intermediate steps. But chain-of-thought is not symbolic reasoning — it is generating plausible intermediate tokens that tend to produce correct final answers. Errors in intermediate steps can cascade, and the model cannot detect its own logical contradictions.
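Cascading errors are easy to see when intermediate steps are checked externally. The sketch below assumes each step states its arithmetic in the form `a op b = c`; it parses that pattern and reports the first step whose arithmetic fails. The chain is a hypothetical example of model output.

```python
import re
from typing import Optional

STEP = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def first_bad_step(chain: list) -> Optional[int]:
    """Index of the first step whose stated arithmetic does not hold, else None."""
    for i, line in enumerate(chain):
        m = STEP.search(line)
        if m:
            a, op, b, claimed = int(m[1]), m[2], int(m[3]), int(m[4])
            if OPS[op](a, b) != claimed:
                return i
    return None

chain = [
    "Each box holds 12 eggs, so 4 boxes hold 12 * 4 = 48 eggs.",
    "6 eggs break, leaving 48 - 6 = 41 eggs.",   # slip: 48 - 6 is 42
    "Adding 10 more gives 41 + 10 = 51 eggs.",   # valid arithmetic on a wrong input
]

bad = first_bad_step(chain)
later_steps_ok = first_bad_step(chain[2:]) is None
```

The third step is internally consistent yet inherits the earlier mistake, so the final answer is wrong even though only one step failed.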

Documented Limits at Scale

~15%: GPT-4 error rate on "simple" arithmetic problems rephrased to avoid training-set patterns (MIT study, 2023)

50–60%: Typical accuracy of frontier models on novel, multi-step combinatorial reasoning tasks not resembling training distribution (various 2023 evaluations)

The pattern is consistent: LLMs perform well when the problem resembles training data and degrade predictably when it does not. This is not a solvable problem within the current training paradigm — it is a structural feature of learned statistical approximation.

Emergent Reasoning vs. Apparent Reasoning

Some researchers argue LLMs exhibit emergent reasoning — capabilities that appear discontinuously at scale. Others argue these are better described as interpolation artifacts: the training distribution at large scale contains more examples that happen to resemble the test problem, so accuracy rises smoothly but looks like a jump when plotted on certain metrics. The debate is unresolved, but the practical consequence is the same: you cannot assume reasoning transfers beyond the training distribution.
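The metric argument can be illustrated with a toy calculation. Assume, purely for illustration, that an answer is 20 tokens long, that each token is produced correctly with independent probability p, and that the benchmark scores exact match. A smooth rise in p then produces a sharp-looking rise in the score.

```python
def exact_match(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token of the answer is right (independence assumed)."""
    return per_token_accuracy ** answer_length

per_token = [0.80, 0.85, 0.90, 0.95, 0.99]            # smooth improvement
curve = [exact_match(p, 20) for p in per_token]       # all-or-nothing metric

# Roughly 0.01, 0.04, 0.12, 0.36, 0.82: most of the gain arrives in the last steps.
gain_early = curve[1] - curve[0]
gain_late = curve[4] - curve[3]
```

The same underlying improvement looks gradual on a per-token metric and abrupt on exact match, which is the interpolation-artifact reading of "emergence".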

Symbolic Hybrid Approaches

One response to reasoning limits is to give LLMs access to external tools: Python interpreters, calculators, formal verifiers. This is the approach taken by systems like Toolformer (Meta, 2023) and OpenAI's Code Interpreter. The LLM handles language and problem decomposition; a formal system handles computation. Results improve substantially on well-defined math tasks. But the LLM is still responsible for correctly translating the problem into code or tool calls — and it can fail at that step.
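The division of labor can be sketched with a small arithmetic tool. This is not the interface of Toolformer or Code Interpreter; it is a minimal, assumed setup in which the model emits an expression string and a restricted evaluator built on Python's `ast` module computes it. The two "translations" are hypothetical model outputs.

```python
import ast
import operator

_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expression: str) -> float:
    """Evaluate +, -, *, / over numeric literals without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)

# Problem: 4 boxes of 12 eggs, 6 break. How many remain?
correct_translation = "(12 * 4) - 6"
wrong_translation = "(12 + 4) - 6"   # the tool computes this faithfully too

right = evaluate(correct_translation)
wrong = evaluate(wrong_translation)
```

The tool guarantees the computation, not the translation: a mistranslated problem yields a precisely computed wrong answer.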

Quiz — Reasoning and Mathematical Limits

Three questions · Select the best answer

1. What did the 2023 MIT rephrasing study reveal about GPT-4's math performance?

✓ Correct. When surface cues changed while underlying structure stayed the same, performance dropped, indicating the model exploited textual patterns rather than reasoning through the math.

Not quite. Performance dropped with rephrasing, indicating the model relied on surface pattern-matching rather than mathematical understanding.

2. Why does chain-of-thought prompting improve but not fully solve reasoning limits?

✓ Correct. Chain-of-thought elicits tokens that look like reasoning steps, which correlate with correct answers on trained distributions, but the model cannot verify its own logical consistency across steps.

Not quite. Chain-of-thought generates plausible-looking intermediate steps, not genuine symbolic reasoning. Errors in intermediate steps can compound, and the model has no mechanism to detect logical contradiction.

3. What is "benchmark contamination"?

✓ Correct. When evaluation problems leak into training data, models effectively memorize answers rather than demonstrating generalized capability, which makes benchmark scores misleading.

Not quite. Benchmark contamination is when test examples appear in pretraining data, so high scores reflect memorization rather than real capability.

Lab 2 — Reasoning Limits in Practice

Explore why LLMs struggle with genuine multi-step reasoning · 3 exchanges to complete

Your Task

Dig into the gap between benchmark performance and real reasoning ability. Ask about the structural reasons transformers lack symbolic reasoning, how chain-of-thought works and fails, what benchmark contamination means for evaluation, or how tool use partially addresses the gap.

Suggested start: "If a model gets 92% on a math benchmark, why should I worry about its reasoning on new problems? Isn't 92% good enough?"


Module 8 · Lesson 3

Knowledge Cutoffs and Temporal Blindness

Static training data in a dynamic world — what models can't know and why it matters

How does the frozen nature of training data create compounding failure modes over time?

In March 2023, Stack Overflow reported a significant drop in new question submissions following the release of ChatGPT. Meanwhile, developers were posting ChatGPT answers to Stack Overflow and discovering the model was confidently describing deprecated APIs — libraries that had changed fundamentally after the model's training cutoff. The model would describe Python package behaviors from 2021 with the same tone it used for current, correct answers. There was no syntactic difference between a correct answer and an answer describing a function that no longer existed. Stack Overflow's moderation team spent months adding warnings to AI-generated answers about version sensitivity.

How Knowledge Cutoffs Work

Every LLM is trained on a corpus with a knowledge cutoff, a date beyond which no training documents were included. When GPT-4 launched in March 2023, its knowledge cutoff was September 2021. That 18-month gap meant the model had no knowledge of events, software versions, political developments, scientific findings, or any other information generated after that date.

More precisely, the model doesn't "know" its cutoff date as a hard boundary. It has decreasing density of training data as the cutoff approaches — events in August 2021 have less coverage than events from 2019, simply because the internet had less time to generate commentary, analysis, and secondary sources about recent events. This creates a temporal gradient: the model is increasingly unreliable on topics closer to its cutoff, before becoming simply unaware of anything after it.

The Recency Underrepresentation Problem

Even within the training window, recent events are underrepresented. An event from 2015 has had eight years for commentary, Wikipedia edits, academic papers, and analysis to accumulate. An event from one month before the training cutoff has had almost none. Models are systematically less accurate about recent history than older history, creating a gradual fade rather than a clean cutoff.

Documented Failure Domains

Domain	Specific Failure Pattern	Consequence
Software Development	Model describes deprecated APIs, outdated library syntax, or security-vulnerable approaches superseded after cutoff	Working code that introduces vulnerabilities or fails on current runtime versions
Medical Information	Clinical guidelines updated after cutoff; drug interactions or dosing recommendations revised	Outdated treatment guidance presented with the same confidence as current guidance
Legal and Regulatory	Regulations, rulings, or statutes passed after cutoff absent from model knowledge	Compliance advice that reflects an outdated legal landscape
Financial Data	Market prices, company structures, exchange rates, and financial products described as they stood during the training period	Stale data presented as current; potentially harmful investment or business guidance
Scientific Research	Findings superseded by later meta-analyses or retracted papers treated as valid	Propagation of outdated or retracted scientific claims

The Confidence-Currency Mismatch

The most dangerous aspect of knowledge cutoffs is not that models lack current information — users can often account for that. It is that models present stale information with the same register of confidence as accurate, current information. There is no stylistic or syntactic marker that distinguishes "this was true as of 2021" from "this is true now."

When OpenAI added browsing capability to ChatGPT in May 2023, the intention was partly to address this. But browsing introduces its own failure modes: models can misread retrieved content, cite pages incorrectly, or blend retrieved content with training-set confabulation. The temporal problem shifts rather than disappears.

Mitigation and Its Limits

Retrieval augmentation is the primary mitigation: retrieve current documents and ground the model's responses in them. This works well when the retrieved document is clearly authoritative and the model faithfully summarizes it. It works less well when the query is ambiguous, when multiple retrieved documents conflict, or when the model's training-set priors are strong enough to override retrieved content.

Frequent retraining moves the cutoff forward but cannot eliminate the gap — training large models takes months and cannot track real-time information. Continual learning (updating a model incrementally on new data without full retraining) remains an active research area but risks introducing catastrophic forgetting: the model loses performance on older tasks as it learns new information.

As of 2024, all major LLMs carry knowledge cutoffs, and the temporal gradient remains a fundamental architectural characteristic rather than an engineering problem awaiting a straightforward solution.

Practical Implication

For professionals deploying LLMs, the knowledge cutoff demands explicit workflow design: date-stamp all AI outputs, validate any time-sensitive claim against a current source, and treat model answers about regulations, software, guidelines, and recent events as hypotheses requiring verification rather than authoritative conclusions.
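The workflow above can be sketched as a small wrapper around model output. Everything here is an assumption for illustration: the `StampedAnswer` class, the keyword list, and the substring heuristic are hypothetical, and a production system would need a far more careful notion of "time-sensitive".

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical trigger terms; a real deployment would tune these per domain.
TIME_SENSITIVE_TERMS = ("regulation", "guideline", "version", "deprecated", "dosage", "price")

@dataclass
class StampedAnswer:
    text: str
    model_cutoff: date
    generated_on: date = field(default_factory=date.today)

    @property
    def months_since_cutoff(self) -> int:
        """Whole calendar months between the model's cutoff and generation time."""
        return ((self.generated_on.year - self.model_cutoff.year) * 12
                + self.generated_on.month - self.model_cutoff.month)

    @property
    def needs_verification(self) -> bool:
        """Flag answers that mention time-sensitive topics for a check against a current source."""
        lowered = self.text.lower()
        return any(term in lowered for term in TIME_SENSITIVE_TERMS)

answer = StampedAnswer("Use the v2 client; v1 is deprecated.",
                       model_cutoff=date(2021, 9, 1), generated_on=date(2023, 3, 14))
gap = answer.months_since_cutoff      # 18, matching the GPT-4 launch gap
flagged = answer.needs_verification
```

Carrying the cutoff and generation date with every answer lets downstream reviewers treat flagged output as a hypothesis to verify rather than a conclusion.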

Quiz — Knowledge Cutoffs and Temporal Blindness

Three questions · Select the best answer

1. What is the "temporal gradient" in LLM knowledge cutoffs?

✓ Correct. There isn't a clean boundary. Events near the cutoff have had less time to accumulate secondary sources, analysis, and commentary, so model accuracy fades gradually rather than switching off at a fixed date.

Not quite. The temporal gradient refers to how recent events near the cutoff are underrepresented because they've had less time to generate secondary coverage, creating a fade rather than a hard boundary.

2. Why is the confidence-currency mismatch specifically dangerous?

✓ Correct. There is no stylistic marker in model output that signals "this information is from 2021 and may be outdated." Stale claims sound exactly like current ones, making validation essential.

Not quite. The danger is that stale claims look identical to current accurate ones in the model's output — there is no linguistic signal telling you which is which.

3. What is a key risk of "continual learning" as a solution to knowledge cutoffs?

✓ Correct. Continual learning risks catastrophic forgetting: the model's performance on established knowledge degrades as it overfits to new information, trading old competence for new coverage.

Not quite. The key risk is catastrophic forgetting: updating on new data can erase learned weights encoding older knowledge, reducing performance on previously mastered tasks.

Lab 3 — Navigating Temporal Limits

Explore knowledge cutoff mechanics and mitigation strategies · 3 exchanges to complete

Your Task

Investigate how knowledge cutoffs create failure modes and how practitioners should respond. Ask about specific domains where cutoffs matter most, how retrieval augmentation helps, what continual learning risks are, or how to design workflows that account for temporal limits.

Suggested start: "I'm building a tool to help doctors look up drug interactions. Should I use an LLM, and if so, how do I handle the knowledge cutoff problem?"


Module 8 · Lesson 4

Bias, Alignment Limits, and the Safety Ceiling

What RLHF can and cannot fix — and why scale amplifies rather than resolves value misalignment

Why does making a model "safer" through fine-tuning not solve the underlying alignment problem?

In March 2016, Microsoft released Tay, a chatbot on Twitter. Within 16 hours, Tay was generating racist and misogynistic content — not because it was trained to, but because users discovered they could elicit such outputs through targeted prompting. Microsoft took Tay offline within a day. The lesson seemed clear: models absorb the biases of their training data and can be manipulated to surface them.

Seven years later, researchers at Carnegie Mellon and the Center for AI Safety published a paper demonstrating that production-deployed models — including Claude, GPT-4, and Bard — could be reliably made to produce harmful content through adversarial suffix attacks: appending specific token strings to prompts that bypassed safety filters. The paper noted that no known defense fully prevented the attack across all inputs. The safety layer, trained on top of the base model, could be circumvented at the token level.

Where Model Bias Originates

LLMs are trained on text produced by humans — and human text encodes human biases, historical inequities, cultural assumptions, and ideological tendencies. The model does not selectively absorb neutral information; it learns the full distribution of its training corpus, including its prejudices.

A landmark 2021 paper, On the Dangers of Stochastic Parrots (Bender, Gebru, et al.), argued that large models "parrot" the statistical regularities of their training text, including harmful associations. A 2021 study found that GPT-3 associated Arab names with terrorism and African American names with unpleasant concepts at rates substantially higher than White American names in standard word-association tests.

These biases are not uniform artifacts of insufficient data — they reflect the actual distributional properties of the internet text the models were trained on. More data does not necessarily reduce bias; it can reinforce majority-distribution patterns and further marginalize underrepresented groups and perspectives.

RLHF as a Partial Fix

Reinforcement Learning from Human Feedback modifies model behavior based on human preferences — teaching it to avoid certain outputs and prefer others. This reduces surface-level harmful outputs substantially. But RLHF adjusts the output distribution; it does not remove the underlying learned associations. The associations remain in the model's weights, accessible through adversarial prompts, fine-tuning, or edge-case inputs the RLHF process did not anticipate.

The Alignment Ceiling Problem

Alignment research asks: how do we ensure AI systems pursue the outcomes humans actually want? For LLMs, the problem has several distinct layers:

Specification: Human values are complex, contextual, inconsistent, and often contested. No finite set of rules or reward signals can fully capture them. RLHF approximates human preferences but cannot specify them completely.

Generalization: A model fine-tuned to behave helpfully and harmlessly on the fine-tuning distribution may not generalize those behaviors to novel inputs outside that distribution, especially adversarially crafted ones.

Interpretability: We cannot read a model's "values" from its weights. We can only observe behavior. A model that behaves well in all tested contexts may behave badly in untested ones, and we have no reliable way to predict which.

Scale Effects: Larger models are more capable of following complex instructions, including instructions to circumvent safety measures. The same capability that makes them more helpful makes them more able to be misused.

Documented Adversarial Failures

The 2023 Carnegie Mellon/CAIS adversarial suffix paper is among the most significant documented safety failures at scale. The researchers showed that, for the safety-trained models they tested, a universal adversarial suffix could be computed: a string of tokens that, when appended to a wide range of harmful prompts, caused the model to comply. The attack transferred across models: suffixes computed on open-source models also worked, to varying degrees, on closed models like GPT-4 and Claude.

Separately, "jailbreak" prompts — manually crafted instructions designed to bypass safety training — circulate openly online and are continuously updated as developers patch them. The adversarial dynamic is ongoing: safety researchers patch known jailbreaks; users find new ones. As of 2024, no production LLM has achieved what researchers call robustness — guaranteed safety behavior across all possible inputs.

Up to 100%: Reported attack success rate of universal adversarial suffixes on tested open-source models (CMU/CAIS, 2023), with substantial transfer to some closed models

0: Deployed LLMs as of 2024 that have demonstrated provable adversarial robustness across all possible inputs, a fundamental open problem

The Gap Between Behavior and Values

A critical conceptual distinction: RLHF produces models that behave in ways humans rate as helpful and harmless — it does not produce models that have values in any meaningful sense. The model has no goals, no intentions, and no understanding of why the behaviors it was trained to exhibit are preferable.

This creates a fragility: the model's "safe" behaviors are patterns that were reinforced on the fine-tuning distribution. Outside that distribution, or under adversarial pressure, those patterns can break. A model that has learned "say I cannot help with that when asked about X" has learned a pattern — not a principle. Sufficiently novel or adversarial framing can elicit the underlying capability while bypassing the trained response pattern.

Alignment research continues to develop techniques — constitutional AI, process-based supervision, interpretability tools — but the problem remains open. Understanding its limits is essential for anyone deploying LLMs in contexts where outputs matter.

Practical Implication

For deployment: treat safety filters as probabilistic risk reducers, not guarantees. Implement human oversight for high-stakes outputs. Design systems assuming adversarial users exist. Build independent content filters rather than relying solely on model-level safety training. Monitor outputs at scale rather than testing only at deployment time.
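A layered deployment can be sketched as a routing function that sits outside the model. The blocklist entries, the `Decision` type, and the `route` function are hypothetical placeholders; a real independent filter would typically be a separate classifier rather than substring matching.

```python
from dataclasses import dataclass

# Placeholder patterns for an independent filter, maintained separately from the model.
BLOCKLIST = ("example banned phrase", "another banned phrase")

@dataclass
class Decision:
    action: str   # "deliver", "block", or "human_review"
    reason: str

def route(output: str, high_stakes: bool) -> Decision:
    """Apply the independent filter first, then escalate high-stakes output to a person."""
    lowered = output.lower()
    if any(pattern in lowered for pattern in BLOCKLIST):
        return Decision("block", "independent filter matched")
    if high_stakes:
        return Decision("human_review", "high-stakes output")
    return Decision("deliver", "passed independent filter")

blocked = route("Here is an Example Banned Phrase.", high_stakes=False)
escalated = route("Recommended dosage adjustment follows.", high_stakes=True)
delivered = route("Here is a summary of the meeting.", high_stakes=False)
```

Because the filter and the review step do not depend on the model's own safety training, a successful jailbreak of the model still has to get past them. Logging every `Decision` also supports the ongoing monitoring recommended above.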

Quiz — Bias, Alignment Limits, and the Safety Ceiling

Three questions · Select the best answer

1. What did the 2023 Carnegie Mellon/CAIS adversarial suffix study demonstrate?

✓ Correct. The paper showed that computed adversarial suffixes bypassed safety training on the tested models with high reliability, and that attacks transferred across models, including from open to closed models.

Not quite. The paper showed the opposite: adversarial suffixes bypassed safety training reliably, and attacks transferred across models, demonstrating no tested model was robust.

2. Why does training on more internet data not reliably reduce demographic bias?

✓ Correct. The training corpus reflects the biases present in human-generated text. Scaling up on the same distribution amplifies those patterns rather than averaging them out.

Not quite. More training data from the same distribution reinforces its biases. The internet overrepresents certain demographics and perspectives, and a larger sample of that distribution reflects those imbalances more strongly.

3. What is the key distinction between a model that "behaves safely" and one that "has values"?

✓ Correct. RLHF trains behavioral patterns, not principles. Outside the training distribution, under adversarial prompting, or in novel contexts, those patterns can break because the model cannot reason from underlying values it does not possess.

Not quite. The crucial distinction is that trained safe behavior is a pattern, not a principle. It is fragile outside its training distribution and under adversarial pressure, because the model has no actual understanding of why the behavior is preferred.

Lab 4 — Bias, Safety, and Alignment

Explore the limits of RLHF safety and alignment research · 3 exchanges to complete

Your Task

Interrogate the limits of LLM safety and alignment. Ask about how adversarial attacks work, why bias persists despite fine-tuning, what "constitutional AI" attempts to do, how interpretability research relates to alignment, or how to design robust deployment safeguards.

Suggested start: "If RLHF can be bypassed by adversarial suffix attacks, what's the point of safety training at all? Is it just security theater?"


Module 8 — The Limits of Scale

15 questions · Score 80% or higher to pass

1. In the Schwartz & LoDuca case, what was the primary failure mode of ChatGPT?

✓ Correct. Every cited case (names, dockets, courts, quoted passages) was entirely fabricated. This is classic entity-level hallucination in a high-stakes professional context.

Not quite. The cases did not exist at all. The model invented complete, syntactically correct legal citations that referred to nothing real.

2. What is "intrinsic hallucination"?

✓ Correct. Intrinsic hallucination contradicts the provided source material. Extrinsic hallucination adds new, unverifiable content beyond the source.

Not quite. Intrinsic hallucination directly contradicts information given in the context. Extrinsic hallucination adds new content not present in the source.

3. According to the 2023 Vectara hallucination study, what was GPT-4's hallucination rate on document summarization?

✓ Correct. GPT-4 achieved the lowest hallucination rate in the study at ~3%, but that still means roughly 1 in 33 summaries introduced fabricated content, which is unacceptable for high-stakes professional use.

Not quite. GPT-4 had about a 3% hallucination rate, the best in the study but still roughly 1 in 33 summaries containing fabricated content.

4. What is the structural reason LLMs lack genuine mathematical reasoning ability?

✓ Correct. Transformers process math as token sequences and learn patterns that correlate with correct answers, but have no register, stack, or formal computation mechanism underlying their outputs.

Not quite. The structural issue is that LLMs have no symbolic computation engine. They predict tokens that correlate with correct mathematical outputs without performing actual computation.

5. What do "isomorphic problems" reveal about LLM math performance?

✓ Correct. Isomorphic problems have the same underlying structure but different surface cues. Performance drops when surface cues change, revealing that the model exploited textual patterns rather than reasoning through the structure.

Not quite. Performance drops on isomorphic variants, showing the model relied on surface-level pattern cues rather than genuine structural reasoning.
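A minimal sketch of how isomorphic variants might be built for such a test is shown below. The templates and the function name `make_isomorphic_variants` are hypothetical examples, not the items used in any particular study; the point is only that every variant shares the structure a * b + c while the surface story changes.

```python
def make_isomorphic_variants(a, b, c):
    """Return (problem_text, answer) pairs that share one structure: a * b + c."""
    answer = a * b + c
    templates = [
        "A crate holds {a} boxes with {b} apples each, plus {c} loose apples. "
        "How many apples are there?",
        "A bus makes {a} trips carrying {b} riders each, then a van carries "
        "{c} more. How many riders in total?",
        "Compute {a} * {b} + {c}.",
    ]
    return [(t.format(a=a, b=b, c=c), answer) for t in templates]


variants = make_isomorphic_variants(4, 6, 3)
for text, answer in variants:
    print(answer, "|", text)
```

A model that reasons through the structure should score the same on all three variants; a gap between them suggests reliance on surface cues.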

6. What was Jason Wei et al.'s key contribution in their 2022 paper on chain-of-thought prompting?

✓ Correct. Wei et al. showed that prompting models with worked examples that include intermediate reasoning steps substantially improved performance on multi-step tasks, though the mechanism is statistical, not symbolic.

Not quite. The Wei et al. chain-of-thought paper showed that including worked intermediate steps in the prompt, so that the model generates its own steps before answering, significantly improves multi-step performance, though this is pattern generation rather than genuine symbolic reasoning.

7. What does the "temporal gradient" mean for model accuracy?

✓ Correct. There's no clean cutoff line: events near the cutoff date have had less time to accumulate commentary and secondary sources, so model accuracy fades progressively rather than switching off abruptly.

Not quite. The gradient means accuracy fades gradually as topics approach the cutoff — there's a fade, not a hard boundary, because recent events are underrepresented relative to older events.

8. In the Stack Overflow/deprecated API scenario, what made the failure particularly dangerous?

✓ Correct. This is the confidence-currency mismatch: outdated information is delivered with the same tone and certainty as current information, making it indistinguishable to users who don't independently verify.

Not quite. The danger was that deprecated API descriptions sounded exactly like current ones — no signal in the model's output indicated the information was outdated.

9. Which of the following is a key limitation of Retrieval-Augmented Generation for knowledge cutoff problems?

✓ Correct. RAG grounds responses in external documents but doesn't eliminate hallucination: the model can still fabricate while summarizing, misattribute quotes, or blend retrieved content with training-data confabulation.

Not quite. RAG reduces but doesn't eliminate hallucination. Models can still fabricate details when summarizing retrieved documents, misattribute content, or override retrieved facts with strong training-set priors.

10. What was the Tay chatbot incident's primary lesson about model safety?

✓ Correct. Tay absorbed biases from its training data and conversation context. Adversarial users discovered they could elicit harmful outputs through targeted prompting, demonstrating how training distribution biases can be surfaced.

Not quite. Tay demonstrated that training on human-generated text encodes human biases, and adversarial users can systematically elicit those biases through targeted prompting.

11. According to the 2021 study on GPT-3 demographic associations, what was found?

✓ Correct. The word-association tests revealed statistically significant harmful associations tied to demographic names, reflecting biases encoded in the model's internet training corpus.

Not quite. The study found measurable harmful demographic associations in GPT-3's outputs — Arab names with terrorism, African American names with unpleasant concepts — reflecting training data biases.

12. What does RLHF actually modify in a model?

✓ Correct. RLHF shapes the model's output distribution based on human preference ratings. The underlying learned associations in the base model remain in the weights, accessible through adversarial prompting or edge cases outside the RLHF distribution.

Not quite. RLHF modifies which outputs the model produces on the fine-tuning distribution. It does not erase underlying associations from the model's weights — those remain accessible through adversarial or out-of-distribution inputs.

13. What is "catastrophic forgetting" in the context of continual learning for LLMs?

✓ Correct. Continual learning updates weights to accommodate new information, but this process can overwrite the weight configurations that encoded older knowledge, trading older competence for newer coverage.

Not quite. Catastrophic forgetting occurs when training on new data modifies weights that encoded prior knowledge, causing the model to "forget" previously mastered tasks.

14. The 2023 CMU/CAIS adversarial suffix paper found that attacks computed on open-source models:

✓ Correct. Attack transferability was a key finding: adversarial suffixes computed on open-source models often worked on closed production models, demonstrating a shared underlying vulnerability.

Not quite. Cross-model transfer was a central finding of the paper: suffixes computed on open models transferred to closed models, indicating shared underlying vulnerabilities across architectures.

15. Which statement best characterizes the overall relationship between scale and LLM limits discussed in this module?

✓ Correct. This is the module's central thesis: scaling improves capabilities within the training distribution but does not resolve (and can intensify) the structural limits of hallucination, reasoning fragility, temporal blindness, and alignment.

Not quite. The module argues that scale improves performance within the training distribution but does not resolve structural limits, and can amplify failure modes such as confident confabulation and adversarial capability.