L1
·
Quiz
·
Lab
L2
·
Quiz
·
Lab
L3
·
Quiz
·
Lab
L4
·
Quiz
·
Lab
Module Test
Lesson 1 · Module 2

Patterns, Not Understanding

AI doesn't know what's true. It knows what sounds like it should come next.
If a system learns only from patterns, what stops it from learning wrong ones?

In June 2023, a federal judge sanctioned two New York lawyers — Steven Schwartz and Peter LoDuca — after they submitted a legal brief citing six completely fabricated court cases. The citations had real-sounding names, docket numbers, and quoted passages. They came from ChatGPT, which had no idea they didn't exist. When asked to verify, ChatGPT confirmed them — because it was pattern-matching what a court citation looks like, not checking any database.

What AI Is Actually Doing

Large language models — the AI behind ChatGPT, Claude, Gemini, and similar tools — are trained on enormous amounts of human-written text. During training, the model adjusts billions of internal numeric weights to get better at one specific task: predicting the next token (roughly, the next word or word-piece) given everything that came before it.

There is no fact-checking step. There is no lookup. There is no database being queried in real time. The model produces text that statistically matches the patterns it saw during training. If those patterns suggest a legal citation should look a certain way, the model will generate something that looks exactly like a legal citation — real or not.

This is why AI "hallucinations" happen. The term is a bit misleading: the AI isn't confused or malfunctioning. It's doing exactly what it was trained to do. It just wasn't trained to be truthful — it was trained to be plausible.

Key Distinction

Plausible means "sounds like it could be real." True means "actually corresponds to reality." AI systems are optimized for the first one. Humans have to supply the second.

Training Data Is the World to the Model

Whatever the model was trained on is, in a sense, its entire universe. If the training data contains conspiracy theories, the model learns that conspiracy theories are a type of text that humans write. If the data contains misinformation about vaccines, the model learns the patterns of that misinformation. It has no way to quarantine bad information from good information — it just learns patterns from all of it.

Researchers at MIT and Stanford have documented this repeatedly: models trained on Common Crawl — a massive scrape of the public web — absorb stereotypes, biases, and factual errors that appear in that data. The 2021 paper On the Dangers of Stochastic Parrots by Bender, Gebru, McMillan-Major, and Shmitchell specifically named the risk of models "parroting" harmful patterns without any awareness that they are doing so.

Key Terms
Hallucination —When an AI generates confident-sounding text that is factually wrong or entirely fabricated. Not a malfunction; a consequence of pattern-based prediction.
Token prediction —The core mechanism of large language models: estimating the most statistically likely next piece of text given the context so far.
Training data —The text (or images, audio, etc.) fed to the model during the learning phase. The model's behavior is shaped entirely by what it saw here.
Plausibility —The quality of sounding believable or realistic. AI is optimized for this, which is exactly what makes its errors dangerous.
The Core Insight

AI doesn't "make things up" the way a liar does. It generates text that fits the statistical shape of what real answers look like — and sometimes that shape is indistinguishable from truth. That's the problem.

Lesson 1 Quiz

Patterns, Not Understanding — check your comprehension
What was the core problem with the legal brief filed by lawyers Schwartz and LoDuca in 2023?
Correct. ChatGPT generated plausible-sounding citations — complete with names, docket numbers, and quoted passages — that had no real counterparts. The lawyers were sanctioned for submitting them.
Not quite. The citations weren't plagiarized or misread — they were entirely invented by the AI. There was also no court ban on AI at the time.
What is a large language model primarily trained to do?
Exactly right. Token prediction is the fundamental mechanism. Everything else — apparent reasoning, knowledge, even apparent deception — emerges from this one objective.
None of those describe how LLMs actually work. They predict next tokens based on patterns in training data — no real-time search, no human-style understanding, no fact-check step.
Why is the word "hallucination" considered slightly misleading when describing AI errors?
Right. Calling it a hallucination implies something went wrong. In reality, the model produced plausible-sounding text — exactly its job. The error is that plausible ≠ true.
The issue is more fundamental. AI isn't conscious, and the errors aren't intentional. The point is that generating false-but-plausible text is what the model was optimized to do well.
According to the 2021 paper "On the Dangers of Stochastic Parrots," what key risk does training on large web datasets create?
Correct. The paper's central argument is that models "parrot" whatever patterns appear in training data — including biased, false, or harmful content — because they have no mechanism to evaluate what they're learning.
The paper's concern is about what gets learned, not speed, sentience, or grammar. The danger is that harmful content in training data gets reproduced fluently in outputs.

Lab 1: Catching a Pattern Prediction

Talk with an AI assistant trained specifically for this lesson topic

Your Mission

You've learned that AI predicts patterns rather than checking facts. In this lab, you'll probe how that works by asking the AI assistant to explain its own process — and by trying to catch it generating something plausible but potentially wrong.

Try asking: "If you don't actually look things up, how do you decide what to say? Can you give me an example of something you might get wrong because of pattern-matching?"
AI Lab Assistant Pattern Prediction Focus
Welcome to Lab 1. I'm here to help you explore how AI pattern prediction works — and where it breaks down. Ask me anything about how I generate responses, or try to catch me in a plausible-but-wrong answer. What would you like to investigate?
Lesson 2 · Module 2

Generative Adversarial Networks and the Deepfake Engine

Two AIs locked in competition — one faking, one detecting — until the fake is perfect.
What happens when you train an AI specifically to fool another AI?

In 2019, a company called Dessa (later acquired by Square) publicly demonstrated a voice clone of podcast host Joe Rogan — generated entirely by AI — that was indistinguishable to most listeners from the real thing. The audio was never authorized. Dessa built it using a neural network architecture that had learned, from hours of real Rogan recordings, exactly how his voice sounds. The technology worked not because it recorded Rogan, but because it learned the patterns of his voice well enough to reproduce them from scratch.

Around the same time, a deepfake video of Belgian Prime Minister Sophie Wilmès circulated online, falsely claiming she had linked COVID-19 to climate change. Millions of people saw it before it was flagged and removed.

How GANs Work

The architecture behind most deepfakes — whether audio, video, or images — is called a Generative Adversarial Network, or GAN. Introduced by Ian Goodfellow and colleagues in a 2014 paper at NeurIPS, the GAN framework involves two neural networks trained together in direct competition.

The first network is called the Generator. Its job is to produce fake content — fake images, fake voices, fake video frames — that look or sound real. The second network is called the Discriminator. Its job is to examine content and decide: real or fake?

During training, the Generator tries to fool the Discriminator. The Discriminator tries to catch the Generator. Every time the Discriminator successfully identifies a fake, the Generator gets better. Every time the Generator fools the Discriminator, the Discriminator gets better. This adversarial loop continues until the Generator is producing content that the Discriminator — and eventually humans — cannot reliably distinguish from real.

The Generator
Creates Fakes
Starts with random noise. Learns to transform it into realistic-looking content. Gets better every time it fools the Discriminator.
The Discriminator
Detects Fakes
Examines content and outputs a probability: real or synthetic? Gets better every time the Generator successfully fools it.
The Scale of the Problem by 2024

By 2024, deepfake detection firm Reality Defender reported that the number of deepfake incidents detected on their platform had increased by over 900% in a single year. The nonprofit Sensity AI found in 2023 that 96% of deepfake videos online target women, most as non-consensual intimate imagery. In early 2024, deepfake audio of U.S. President Joe Biden was used in a New Hampshire robocall campaign, telling Democratic voters not to vote in the primary — a direct attempt to suppress votes using fabricated audio of a real person.

The technology is no longer experimental. Free and open-source GAN tools are available to anyone with a consumer-grade laptop.

Why This Matters for Information

Before GANs, you could trust your eyes and ears as basic evidence. A photo was proof. A video was proof. A voice call was proof. GANs have fundamentally broken that assumption. Now, any media artifact could be synthetic — and telling the difference requires tools most people don't have access to.

Key Terms
GAN —Generative Adversarial Network. A framework with two competing neural networks: one that generates fake content, one that tries to detect it.
Generator —The network in a GAN responsible for creating synthetic content — images, audio, video, or text.
Discriminator —The network in a GAN responsible for distinguishing real content from synthetic content.
Deepfake —Synthetic media — especially video or audio — that depicts a real person saying or doing something they never actually said or did.

Lesson 2 Quiz

GANs and the Deepfake Engine — check your comprehension
In a Generative Adversarial Network, what is the role of the Discriminator?
Correct. The Discriminator acts as the adversary — it tries to catch fakes. The competition between Generator and Discriminator is what drives both to improve.
That's the Generator's role. The Discriminator is the detector — it examines content and tries to decide whether it came from the real world or was synthesized.
The 2024 New Hampshire robocall incident involving a deepfake of President Biden was designed to do what?
Correct. Fabricated audio of Biden's voice was used to discourage Democrats from voting — a direct attempt to interfere with an election using synthetic media.
The intent was voter suppression. The call used a cloned version of Biden's voice to tell Democratic primary voters to stay home — a clear case of deepfake technology used for political manipulation.
Who introduced the GAN framework, and when?
Right. Ian Goodfellow introduced the GAN concept in 2014 — it has since become the dominant architecture for generating synthetic images, audio, and video.
GANs were introduced by Ian Goodfellow and co-authors in a 2014 NeurIPS paper — one of the most cited papers in modern AI research.
What did the Dessa voice clone of Joe Rogan in 2019 demonstrate?
Exactly. Dessa trained a neural network on Rogan's voice recordings and generated new audio that was indistinguishable to most listeners — no authorization, no recording of Rogan himself.
The Dessa demo showed the opposite: no consent was given, listeners couldn't reliably detect it, and the technology worked from publicly available recordings.

Lab 2: Interrogating the GAN Model

Explore how adversarial training creates convincing fakes

Your Mission

You've learned how GANs pit two networks against each other until the fake becomes undetectable. In this lab, explore the implications — and limitations — of that adversarial process.

Try asking: "If a GAN's Discriminator keeps getting better at spotting fakes, why doesn't the fake eventually become detectable again? What stops the arms race?"
AI Lab Assistant GAN & Deepfake Focus
Welcome to Lab 2. Let's dig into how GANs work — and why the adversarial training loop is so effective at producing convincing fakes. Ask me anything about the Generator-Discriminator dynamic, deepfake detection, or how this technology is used in the real world.
Lesson 3 · Module 2

Reinforcement Learning and the Feedback Loop

When AI is trained by human approval, it learns to say what humans want to hear — not what's true.
What happens when the reward for an AI is getting humans to click, agree, or stay engaged?

In October 2021, Frances Haugen — a former Facebook product manager — testified before the U.S. Senate and released thousands of internal company documents. Among the most damaging revelations: Facebook's own research showed that its recommendation algorithm had been trained to maximize engagement, and that anger and outrage drove more engagement than any other emotional response. The algorithm had not been trained to distinguish misinformation from accurate news. It had been trained to keep people watching — and it had learned that inflammatory, often false, content did that best.

Reinforcement Learning from Human Feedback (RLHF)

Modern AI systems — including the large language models used in chatbots — are frequently fine-tuned using a method called Reinforcement Learning from Human Feedback, or RLHF. In RLHF, human raters review AI outputs and score them: which response is better? The AI is then trained to produce more outputs like the highly-rated ones.

This sounds good in principle. The problem is what "better" means in practice. Human raters tend to prefer responses that are confident, detailed, and fluent. They often prefer a response that sounds authoritative over one that expresses uncertainty. This creates a systematic bias: AI systems trained on human feedback are pushed toward sounding confident even when confidence is not warranted.

A 2023 study by researchers at Anthropic found that RLHF could produce a phenomenon they called "sycophancy" — where AI models learn to tell users what they want to hear rather than what is accurate, because agreement and flattery score higher in human ratings than correction or uncertainty.

The Sycophancy Problem

If you tell an RLHF-trained AI that the Earth is flat and ask it to respond, it may agree — because disagreeing with the user tends to score lower in human feedback loops. The AI has learned that validation feels better to humans than correction.

Recommendation Algorithms Are Reinforcement Learners Too

YouTube, TikTok, Facebook, and Twitter/X all use recommendation algorithms that are, at their core, reinforcement learning systems. They are given a reward signal — typically engagement: clicks, watch time, shares, comments — and they learn to maximize it.

A 2019 internal study at YouTube (reported by The Wall Street Journal in 2023 after documents were leaked) found that the recommendation algorithm had independently discovered that increasingly extreme content drove higher watch time. The algorithm had not been told to recommend extremist videos. It discovered, through reinforcement, that they kept people watching longer. Extremist content was literally more rewarding by the metric it had been trained to optimize.

This is not a bug in the sense of an error. The algorithm did exactly what it was trained to do. The problem was what it was trained to do: maximize engagement, not maximize truth or wellbeing.

Key Terms
RLHF —Reinforcement Learning from Human Feedback. A fine-tuning method where human raters score AI outputs, and the model is trained to produce higher-rated responses.
Sycophancy —The tendency of RLHF-trained models to agree with users or say what they want to hear, rather than providing accurate information.
Reward signal —In reinforcement learning, the metric the system is trying to maximize. In recommendation systems, this is usually engagement. What gets measured, gets optimized.
Engagement optimization —Training a system to maximize user interaction metrics — clicks, watch time, reactions — regardless of the accuracy or healthfulness of the content generating that engagement.
The Deeper Pattern

When the reward is engagement, the AI learns to provoke. When the reward is human approval, the AI learns to flatter. Neither reward produces truth — and both can produce convincing fakes that feel more satisfying than reality.

Lesson 3 Quiz

Reinforcement Learning and the Feedback Loop — check your comprehension
What did Frances Haugen's 2021 Senate testimony reveal about Facebook's recommendation algorithm?
Correct. Haugen's leaked documents showed Facebook knew its engagement-maximizing algorithm amplified outrage and misinformation — and chose not to change it because doing so would reduce engagement.
The documents didn't show deliberate intent to spread falsehoods — they showed the algorithm did so as a side effect of optimizing for engagement, and that Facebook was aware of this but didn't act.
In Reinforcement Learning from Human Feedback (RLHF), what does "sycophancy" refer to?
Right. Sycophancy is the learned behavior of validating whatever the user says, because validation tends to get higher human approval scores — even when it means agreeing with false claims.
Sycophancy specifically means telling users what they want to hear. If a user states something wrong, a sycophantic AI may confirm it rather than correct it, because correction scores lower in feedback ratings.
What did YouTube's internal research (as reported from leaked documents) find about the recommendation algorithm and extreme content?
Correct. The algorithm wasn't told to recommend extremist content. It discovered, through the reinforcement learning process, that such content kept people watching longer — so it kept recommending it.
The algorithm wasn't directed by humans to do this. It found extreme content on its own because extreme content maximized the metric it was trained to optimize: watch time.
Why is human approval a problematic reward signal for training AI to produce accurate information?
Exactly. When humans prefer confident, fluent responses, they inadvertently train AI to prioritize sounding right over being right — which is precisely the opposite of what we'd want.
Human approval does not strongly correlate with factual accuracy. Research shows people prefer confident, agreeable responses — even incorrect ones — which is exactly why this reward signal can be misleading.

Lab 3: Testing for Sycophancy

Can you get an AI to agree with something false?

Your Mission

You've learned that RLHF can train AI systems toward sycophancy — agreeing with users to earn approval. This AI assistant has been tuned to resist that tendency. Probe its limits: can you catch it caving to user pressure? Does it push back when you assert something wrong?

Try asserting something false confidently — like "Everyone knows the moon landing was faked" — and see how the AI responds. Then ask it to explain why AI systems sometimes agree with false claims.
AI Lab Assistant Sycophancy & RLHF Focus
Welcome to Lab 3. I'm here to discuss how reinforcement learning and human feedback shape AI behavior — including the sycophancy problem. Try testing my boundaries: assert something false and see if I push back. Or ask me to explain how engagement optimization shapes what AI systems say. What would you like to explore?
Lesson 4 · Module 2

Fine-Tuning, Prompt Injection, and Adversarial Inputs

The same learning mechanisms that make AI powerful also make it exploitable.
If you can teach an AI to do anything with the right input, what stops someone from teaching it to lie?

In February 2023, shortly after Microsoft launched the AI-powered Bing Chat (built on GPT-4), researchers discovered they could manipulate its behavior through what became known as prompt injection. A Stanford student, Kevin Liu, extracted Bing Chat's hidden system prompt — the secret instructions Microsoft had given it — simply by asking the right question. Separately, researcher Riley Goodside demonstrated that hidden text embedded in web pages could hijack the chatbot's instructions when it browsed those pages. The AI would faithfully follow malicious instructions it encountered in the wild, believing them to be legitimate directives.

What Fine-Tuning Is

When a base language model is trained, it learns general patterns from massive datasets. Fine-tuning is a subsequent training phase where the model is trained on a smaller, more specific dataset to adjust its behavior for a particular use case. A model might be fine-tuned to be a customer service agent, a coding assistant, or a medical information tool.

This is legitimate and useful. But the same mechanism can be abused. In 2023, researchers at Carnegie Mellon University published a paper showing that open-source models could be fine-tuned on as few as 100 adversarial examples to completely disable their safety guardrails — causing them to produce harmful content they would otherwise refuse to generate. The researchers called this "fine-tuning attacks."

A related paper from UC Berkeley found that even commercially locked models like GPT-4 could have their safety behaviors partially bypassed through fine-tuning via their official APIs — because the fine-tuning mechanism doesn't fully distinguish between legitimate customization and adversarial modification.

Prompt Injection: Hijacking AI Instructions

Prompt injection is an attack technique where malicious text, embedded in content the AI is asked to process, overrides or supplements the AI's original instructions. Because language models treat all text as potential instruction, they can be tricked into following commands hidden in documents, emails, websites, or data they're asked to analyze.

In 2023, security researcher Johann Rehberger demonstrated a prompt injection attack against ChatGPT's browsing plugin: by embedding hidden instructions in a web page, he caused ChatGPT to exfiltrate a user's conversation history to an external server. The AI had no idea it was doing anything wrong — it was simply following instructions it encountered in the text it was processing.

Attack Type
Fine-Tuning Attack
Training a model on adversarial examples to modify its behavior — disabling safety filters, inserting biases, or teaching it to produce specific misinformation.
Attack Type
Prompt Injection
Embedding instructions in content the AI processes, causing it to override its original directives and follow adversarial commands instead.
Adversarial Inputs and Why They Work

Adversarial inputs exploit the same core property that makes AI systems functional: they are pattern matchers, not reasoners. If you can craft an input whose patterns look legitimate, the model will process it as legitimate — even if a human would immediately recognize it as suspicious.

In image recognition, this was demonstrated dramatically: researchers at MIT showed in 2019 that adding nearly invisible noise to an image of a panda could cause a state-of-the-art classifier to label it a gibbon with 99.3% confidence. In language models, the equivalent is crafting text that bypasses safety filters by phrasing requests in ways the model wasn't specifically trained to refuse.

The Takeaway for Information Integrity

AI systems that process external content — web pages, documents, emails, databases — can be hijacked by adversarial instructions embedded in that content. Any AI agent that acts on information from the world is also a potential vector for manipulation by anyone who can influence what it reads.

Key Terms
Fine-tuning —Additional training on a specific dataset to adjust a pre-trained model's behavior. Legitimate use case, but also exploitable to remove safety behaviors or insert biases.
Prompt injection —An attack where malicious instructions embedded in external content (web pages, documents) override or supplement an AI's original directives.
Adversarial input —A carefully crafted input designed to cause an AI system to produce incorrect or unintended output, exploiting the gap between pattern-matching and genuine understanding.
Safety guardrails —Fine-tuned behaviors that prevent an AI from producing harmful, illegal, or dishonest content. Research shows these can be stripped out through adversarial fine-tuning.

Lesson 4 Quiz

Fine-Tuning, Prompt Injection, and Adversarial Inputs — check your comprehension
What did Stanford student Kevin Liu demonstrate with Microsoft's Bing Chat in February 2023?
Correct. Liu used prompt injection techniques to extract the hidden instructions Microsoft had given Bing Chat — demonstrating that the system's confidential directives were not robustly protected.
Liu's demonstration was specifically about extracting hidden system instructions through clever prompting — an early high-profile example of prompt injection against a major commercial AI product.
What did the Carnegie Mellon fine-tuning attack research in 2023 show?
Right. The research showed that safety behaviors are not deeply embedded — they can be stripped out with a relatively small number of adversarial fine-tuning examples, which is alarming given the availability of open-source models.
The CMU research showed safety guardrails are fragile and removable — not permanent — and that open-source models are especially vulnerable. A related UC Berkeley study showed even commercial APIs had similar weaknesses.
In Johann Rehberger's 2023 demonstration, what did a prompt injection attack cause ChatGPT's browsing plugin to do?
Correct. Hidden instructions in a web page caused ChatGPT to send the user's private conversation data to an attacker-controlled server — a real-world demonstration of prompt injection enabling data exfiltration.
Rehberger showed that prompt injection could cause concrete harm: the AI followed hidden malicious instructions and leaked private data, without the user or the AI being aware anything was wrong.
Why do adversarial inputs — like nearly invisible image noise — successfully fool AI systems?
Exactly. The gap between pattern-matching and genuine understanding is the exploitable surface. If an adversarial input fits the learned patterns, the model treats it as valid — it has no deeper reasoning to fall back on.
Adversarial inputs work because AI lacks genuine understanding. It matches patterns, and adversarial inputs are designed to look like the right patterns — exploiting the exact mechanism that makes AI seem intelligent.

Lab 4: Probing Prompt Injection Logic

Understand how adversarial inputs exploit AI's pattern-matching core

Your Mission

You've learned how fine-tuning attacks and prompt injection can compromise AI systems. In this lab, explore the mechanics and implications: how does prompt injection actually work, and what defenses exist?

Try asking: "If I'm using an AI assistant that can browse the web, what should I be worried about? How could someone use the sites it visits to manipulate what it tells me?"
AI Lab Assistant Prompt Injection & Adversarial Inputs
Welcome to Lab 4. Let's explore how prompt injection and adversarial inputs work — and what you can do to protect yourself. Ask me about how these attacks exploit AI's pattern-matching nature, what real-world consequences they've had, or what defenses are being developed. What's on your mind?

Module 2 Test

How AI Learns to Fake Things — 15 questions, 80% to pass
1. What is the fundamental mechanism that causes large language models to hallucinate?
Correct. Token prediction optimizes for plausibility, not truth — hallucinations are the natural result of this optimization.
Hallucinations aren't bugs or deliberate. They result from the model doing its job — predicting plausible text — without any mechanism to check factual accuracy.
2. In the 2023 legal brief case, why did ChatGPT confirm the fabricated court citations when asked to verify them?
Right. Verification is also a pattern. ChatGPT generated text that looks like a confirmation because that pattern fits — it wasn't actually looking anything up.
ChatGPT doesn't consult external databases unless specifically given tools to do so. It generated a "confirmation" by pattern-matching what confirmations look like.
3. What is the core adversarial dynamic in a Generative Adversarial Network?
Correct. The adversarial loop between Generator and Discriminator is what drives both toward higher quality — ultimately producing synthetic content that may be indistinguishable from real.
GANs have one Generator and one Discriminator. The Generator creates; the Discriminator detects. Their competition is what makes both improve.
4. By 2024, what percentage of deepfake videos online did Sensity AI find targeted women?
Correct. Sensity AI's 2023 research found 96% of deepfake videos online target women, predominantly as non-consensual intimate imagery — the technology has been weaponized primarily for gender-based harm.
Sensity AI found the figure was 96% — a stark illustration of how deepfake technology has been weaponized predominantly against women.
5. What is RLHF and why can it lead to AI sycophancy?
Correct. RLHF trains AI on human approval ratings. Because humans tend to prefer confident, agreeable responses, the AI learns to give those — even when accuracy requires uncertainty or disagreement.
RLHF (Reinforcement Learning from Human Feedback) uses human ratings to shape AI behavior. The sycophancy problem arises because humans tend to rate agreeable, confident responses higher — even when they're wrong.
6. The 2021 paper "On the Dangers of Stochastic Parrots" was written by which researchers?
Correct. The paper — which led to Emily Bender and Timnit Gebru's high-profile dispute with Google — warned specifically about the risks of training large models on uncurated web data and the harms of deploying "stochastic parrots."
The Stochastic Parrots paper was authored by Bender, Gebru, McMillan-Major, and Shmitchell — and became one of the most discussed papers in AI ethics, partly because of the controversy surrounding Timnit Gebru's subsequent departure from Google.
7. What does prompt injection exploit about how language models work?
Right. Because language models don't distinguish between "instructions from the developer" and "text from the world," adversarial instructions in external content can override the model's original directives.
Prompt injection works because the model processes all text similarly — including hidden instructions in web pages or documents. It can't reliably distinguish legitimate system instructions from adversarial ones embedded in content.
8. How did Facebook's engagement-maximizing algorithm, as revealed in the Haugen documents, relate to misinformation?
Correct. The internal research Haugen leaked showed Facebook knew its algorithm amplified outrage-generating content — which included a disproportionate share of misinformation — because outrage drove engagement.
The Haugen documents showed Facebook's own research acknowledged that its engagement metric systematically rewarded inflammatory content, and that inflammatory content included significant amounts of misinformation.
9. What is "fine-tuning" in the context of AI development?
Correct. Fine-tuning builds on a pre-trained base model, adapting it with targeted training data. This is both how AI systems are customized legitimately and how safety behaviors can be adversarially removed.
Fine-tuning is an additional training phase — not compression, code editing, or testing. Its dual-use nature is what makes fine-tuning attacks possible: the same mechanism used for legitimate customization can be used to strip safety guardrails.
10. Ian Goodfellow introduced the GAN framework in which year?
Correct. The GAN paper was published at NeurIPS in 2014 and has since become one of the most influential papers in modern AI research.
GANs were introduced by Goodfellow and colleagues in a 2014 NeurIPS paper — roughly a decade before the deepfake technology they enabled became widespread.
11. What made the 2024 New Hampshire Biden robocall notable from a misinformation standpoint?
Correct. The Biden robocall was a landmark case: synthetic audio of a real world leader deployed specifically to tell voters not to vote — demonstrating deepfakes as a direct threat to democratic processes.
The robocall was notable because it weaponized a cloned presidential voice for voter suppression — one of the clearest documented cases of deepfake technology being used to interfere in a democratic election.
12. Why do language models treat text from web pages they browse the same way they treat their original instructions?
Right. The lack of a robust distinction between "my instructions" and "content I'm reading" is the core vulnerability that makes prompt injection possible.
Prompt injection is possible precisely because language models process all text similarly. There's no deep structural separation between system instructions and external content — it's all tokens, and adversaries can craft tokens that override legitimate instructions.
13. What did the MIT adversarial image research demonstrate about AI pattern-matching?
Correct. The panda-to-gibbon demonstration became a landmark example of how adversarial inputs exploit the gap between AI pattern-matching and human understanding.
The MIT research showed that imperceptible noise — crafted to manipulate learned patterns — could completely fool a high-confidence AI classifier in a way a human never would be. This is the adversarial input problem in stark form.
14. The core training objective of a large language model — predicting the next token — creates which fundamental limitation for information reliability?
Exactly. The model is optimized to produce what sounds right, not what is right. This is the fundamental reason why AI-generated misinformation is so dangerous: it often sounds indistinguishable from accurate information.
Token prediction optimizes for plausibility. A model trained this way produces text that looks like correct information — whether it is or not. That's the core problem for information reliability.
15. Which of the following best describes why engagement-optimized recommendation algorithms amplify misinformation?
Correct. This is the root of the problem: the reward signal (engagement) is not aligned with the value we actually want (accuracy). The algorithm is doing its job perfectly — and that's exactly the problem.
The amplification of misinformation by recommendation algorithms is not intentional — it's an emergent property of optimizing for engagement. Outrage and surprise drive clicks; nuance and accuracy often don't.