Module 2 · Lesson 1

The Ghost in the Training Data

How Amazon built a hiring tool that quietly penalized women — and why the machine wasn't doing anything wrong.

If a model learns from the past, what happens when the past was unfair?

In 2014, Amazon's engineers had a genuinely exciting idea. They were hiring thousands of software engineers and product managers every year, and the process was slow and inconsistent. Different recruiters noticed different things. So the team built an AI system to do what AI does best: find patterns in large amounts of data. They fed it ten years of Amazon's own hiring records — résumés, interview outcomes, who got offers, who got promoted.

The system trained on that data. It learned what "successful Amazon employee" looked like, based on real examples. Then it started scoring new applicants on a scale of one to five stars.

By 2015, something was wrong. The model had taught itself to penalize résumés that included the word "women's" — as in "women's chess club" or "women's college." It downgraded graduates of all-women's colleges. It had also learned to favor certain verbs — "executed," "captured" — that appeared more often in men's résumés. No engineer programmed it to do any of this. The model figured it out on its own.

Amazon tried to fix it. Engineers edited the model so that it ignored those specific terms. But they could not guarantee it would stop finding other proxies: subtle signals that correlate with gender without directly naming it. By early 2017, Amazon had quietly disbanded the team, and the story became public in a 2018 Reuters report. Amazon said its recruiters never used the tool to evaluate candidates, but the lesson was already written: a model trained on biased history will reproduce that bias, and it won't tell you it's doing it.

Why the Machine Wasn't Lying

Here's the part most people miss: the algorithm did exactly what it was told to do. It found the patterns that predicted success at Amazon. And for ten years, the people promoted at Amazon had been — overwhelmingly — men. Not because women were less capable, but because of countless human decisions made before the model existed: who got interviewed, who got hired, who got the second chance after a slow start.

The model looked at history and said: "Okay, I see what a successful hire looks like." It was right about the pattern. The pattern was wrong.

This is called bias in training data. The data isn't neutral. It's a recording of the past, and the past contains all the unfairness, prejudice, and unexamined assumptions of the humans who created it.

Training Bias: When the data a model learns from reflects historical unfairness, the model learns to replicate that unfairness, even without being told to.

Proxy Variable: A signal the model uses as a stand-in for something it was never supposed to consider. The word "women's" wasn't a gender marker until the model made it one.

The Pattern-Finding Machine Has No Conscience

Imagine you're trying to guess who's likely to score high on a test. You look at thousands of past students and you notice that students who own a desk lamp score higher. So you start predicting: desk lamp = smart. But what you've actually found is a correlation with wealth — students who could afford a lamp had quieter study spaces, more stable homes, better access to materials. The lamp didn't help them. It just went along for the ride.

A machine learning model does this all the time, at massive scale, with hundreds of variables simultaneously. It finds signals that predict the outcome in the training data. It doesn't ask whether those signals are fair, or what they actually mean, or whether they'd apply to a different population.
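The desk lamp story can be sketched as a tiny simulation. This is an illustration under made-up assumptions, not real student data: the ownership rates, the average scores, and the names `students` and `lamp_gap` are all invented for the example. In this simulated world, wealth drives both lamp ownership and scores, and the lamp itself does nothing.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Simulated students: (wealthy, has_lamp, score). All rates are assumptions.
students = []
for _ in range(10_000):
    wealthy = random.random() < 0.5
    # Wealthier households are more likely to own a desk lamp.
    has_lamp = random.random() < (0.8 if wealthy else 0.2)
    # The score depends on wealth only. The lamp has no effect at all.
    score = random.gauss(75 if wealthy else 65, 10)
    students.append((wealthy, has_lamp, score))

def lamp_gap(rows):
    """Average score of lamp owners minus average score of non-owners."""
    with_lamp = mean(score for _, lamp, score in rows if lamp)
    without_lamp = mean(score for _, lamp, score in rows if not lamp)
    return with_lamp - without_lamp

overall_gap = lamp_gap(students)
gap_wealthy = lamp_gap([s for s in students if s[0]])
gap_not_wealthy = lamp_gap([s for s in students if not s[0]])

print(f"Lamp gap, all students:      {overall_gap:+.1f} points")
print(f"Lamp gap, wealthy only:      {gap_wealthy:+.1f} points")
print(f"Lamp gap, not wealthy only:  {gap_not_wealthy:+.1f} points")
```

Across all students, lamp owners score clearly higher. Compare only students with the same wealth, and the gap nearly vanishes. A model that sees only the lamp column has no way to tell these two stories apart.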

For younger readers: think of it like this. If your teacher graded every test with red pen, and then someone built a model to predict grades — and the model learned that "red ink on the page = low grade" — it would be right about the correlation but completely wrong about the cause. The ink didn't cause the low grade. The model can't tell the difference without more context.

The Core Problem

Models find patterns. They don't find meaning. The pattern "women's college → lower rating" was real in the training data. What it meant was: we've been undervaluing women's applications for years. The model learned the symptom without anyone explaining the disease.

Where Else This Happens

Amazon's case became famous because journalists reported it in detail. But the same dynamic plays out in systems that affect far more people's lives, often invisibly:

Criminal sentencing. The COMPAS algorithm, used in courts across the United States since the 2000s, predicted likelihood of reoffending. A 2016 investigation by ProPublica found it rated Black defendants as higher risk than white defendants with the same criminal history — not because race was a direct input, but because the training data reflected decades of racially unequal policing and sentencing.

Medical diagnosis. In 2019, researchers published a study in Science showing a widely-used healthcare algorithm was directing less care toward Black patients than equally sick white patients. The algorithm had been trained on healthcare costs — but cost is a proxy for access, and access is unequal.

Facial recognition. In 2018, MIT researcher Joy Buolamwini and her co-author Timnit Gebru found that commercial face-analysis systems from IBM, Microsoft, and Face++ misclassified the gender of dark-skinned women at error rates of up to about 35%, compared to under 1% for light-skinned men. The training datasets had far more light-skinned faces.

In every case, no engineer woke up and decided to be unfair. The data carried the unfairness in silence.

The Ethical Tension — and Why It's Hard

Now here's where it gets genuinely difficult. Suppose you're trying to fix a biased model. You could remove race as an input variable. But the model might still use zip code, which correlates with race. You remove zip code. It uses school attended, which correlates with zip code. You remove that. The correlations run so deep that removing every proxy would also remove much of the model's ability to make predictions at all.

So you face a real tradeoff: a less accurate model that's more fair, or a more accurate model that perpetuates historical inequality. There's no clean mathematical answer to that. It's a value judgment — and value judgments belong to people, not machines.
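The zip code problem can be seen in a small simulation. This is a sketch under loud assumptions: the two groups, the two zip codes, the 90% segregation rate, and the approval rates are all invented, and the "model" is deliberately the simplest one possible (it memorizes the historical approval rate for each zip code). Group membership is never given to the model.

```python
import random
from collections import defaultdict

random.seed(1)  # fixed seed so the run is repeatable

# Synthetic applicants: (group, zip_code, approved). Every rate is made up.
applicants = []
for _ in range(20_000):
    group = random.choice(["A", "B"])
    # Segregation: 90% of group A lives in zip 1, 90% of group B in zip 2.
    if group == "A":
        zip_code = 1 if random.random() < 0.9 else 2
    else:
        zip_code = 2 if random.random() < 0.9 else 1
    # Biased history: the two groups were approved at different rates.
    approved = random.random() < (0.7 if group == "A" else 0.4)
    applicants.append((group, zip_code, approved))

# "Train" a model that never sees group: it learns one score per zip code.
counts = defaultdict(lambda: [0, 0])  # zip_code -> [approvals, total]
for _, zip_code, approved in applicants:
    counts[zip_code][0] += approved
    counts[zip_code][1] += 1
score_by_zip = {z: a / n for z, (a, n) in counts.items()}

def mean_score(group):
    """Average model score received by members of one group."""
    scores = [score_by_zip[z] for g, z, _ in applicants if g == group]
    return sum(scores) / len(scores)

score_a = mean_score("A")
score_b = mean_score("B")
print(f"Average score, group A: {score_a:.2f}")
print(f"Average score, group B: {score_b:.2f}")
```

Even though the group column was removed, group A still receives noticeably higher scores on average, because zip code carries most of the same information. Dropping zip code as well would close the gap here, but only because this toy model has nothing else to learn from.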

Ethical Question — No Clean Answer

If a company uses a hiring AI that produces fairer outcomes on average, but occasionally discriminates against individuals in a way no human would catch — is that better or worse than using a human recruiter who is consciously biased but accountable? Who should decide?

You Can Now See What Most People Miss

When you read a headline saying "AI discriminates," you now understand the real mechanism: the model didn't develop hatred — it learned history. That distinction matters enormously for how we fix it. Knowing this, you're equipped to ask the right question: not "is the AI evil?" but "what did the training data contain, and whose decisions made it that way?"

Lesson 1 Quiz

The Ghost in the Training Data

5 questions · Apply what you learned — not just recall it.

1. Amazon's hiring AI downgraded résumés containing the word "women's" — but no engineer programmed it to do this. What actually caused this behavior?

Correct. The model wasn't given gender as an input — it inferred it from words correlated with gender in historical data. This is training bias at work: the data carried the unfairness, and the model absorbed it.

Not quite. Think about what the model was actually doing: it was learning from historical hiring records. The bias came from patterns in that history, not from deliberate code or malicious intent.

2. A city builds a model to predict which neighborhoods will have the most traffic accidents. The model heavily weights "number of police stops per block." Why might this produce a biased result?

Exactly right. This is the proxy variable problem applied to a new scenario. "Police stops" doesn't measure danger — it measures police presence, which is unequal across neighborhoods. The model would mistake policing inequality for accident risk.

Think about what "police stops" actually measures. Is it purely a count of dangerous events, or does it also reflect decisions about where police patrol? That distinction is the core issue here.

3. What is a "proxy variable" in the context of a biased model?

Correct. A proxy variable carries the signal of something the model wasn't supposed to use — like the word "women's" standing in for gender, or zip code standing in for race.

Re-read the definition from the lesson. A proxy isn't about backup data or fairness — it's about what the model learns to use as a substitute for something it should ignore.

4. Researchers found that a healthcare algorithm directed less care to Black patients than equally sick white patients. The algorithm didn't use race as an input. It used healthcare costs. Why does this still produce a racially biased result?

Exactly. Cost looks neutral but it's shaped by access — and access is shaped by race in the U.S. healthcare system. Using cost as a measure of health need smuggles in historical inequality without anyone noticing.

Think about what "cost" actually measures in healthcare. Does spending less money mean being less sick? Or could it mean having less ability to access care? That's the key to this question.

5. An engineer suggests removing every variable that correlates with race to make a model fairer. What's the problem with this approach?

Right. This is the genuine difficulty described at the end of the lesson. Fairness and accuracy exist in tension when historical inequality runs deep through the data. There's no purely technical fix — it requires value judgments about what we're optimizing for.

Think about why the lesson called this "genuinely difficult." If it were easy to fix bias just by removing variables, Amazon would have done it. What happens to the model's usefulness when you strip out signals that happen to correlate with protected characteristics?

Lesson 1 Lab

Bias Auditor

You're investigating a real model. Your job is to find what it learned from — and what it learned wrong.

Your Role: AI Auditor

A hospital has deployed an algorithm that recommends whether patients should receive a specialist referral. The model was trained on five years of hospital referral records. You've been hired to audit it before it goes into wider use.

Your lab partner is an AI investigator who will challenge your thinking and push you to be more precise. Don't expect easy answers — this is a real investigation.

Start by telling your partner: what's the first thing you'd want to know about the training data? And why does it matter?

Investigation Partner

Lab 1 · Bias Auditor

You've been hired to audit a hospital referral algorithm before it scales. I've read the basic documentation. Before we dig into the data — what do you think we should look at first, and why? Don't just say "the training data." Get specific. What about it?

Module 2 · Lesson 2

The Model That Memorized the Test

How a skin cancer detection AI that was 95% accurate on its training data turned out to be partly detecting rulers — and what that reveals about the difference between memorizing and learning.

What's the difference between a model that learned something and a model that just memorized the answers?

In 2017, researchers at Stanford University published a paper in Nature announcing that a deep learning model could detect skin cancer from photographs with the accuracy of a board-certified dermatologist. The model had been trained on nearly 130,000 clinical images. It was called a breakthrough. News articles declared AI was better than doctors at one of medicine's hardest visual tasks.

Shortly after, a German research team decided to dig deeper. They ran their own tests on similar image-recognition models trained to detect melanoma — the deadliest kind of skin cancer. They noticed something strange. Many of the malignant (cancerous) lesions in the training datasets had been photographed with a ruler in the frame. Dermatologists use rulers to document lesion size when something looks suspicious enough to measure. The benign (non-cancerous) spots typically weren't measured.

The model had — quietly, without anyone realizing — learned to associate rulers with cancer. It wasn't just looking at the lesion. It was looking at whether a ruler was present. In the training data, this was a nearly perfect signal. In the real world, it was completely meaningless.

When deployed on images without rulers, or on clinical setups where dermatologists routinely photographed even benign spots with rulers, the model's accuracy dropped. The team published their findings in 2018. The problem they'd identified had a name researchers had known about for decades — but that the hype around deep learning had quietly pushed aside.

What Overfitting Means — and Why It Happens

The skin cancer model had made a classic error called overfitting. Here's what that means in plain terms: the model had fit itself so tightly to the specific examples it trained on that it stopped learning the general rule — and started memorizing the quirks of its dataset instead.

Imagine studying for a history test by memorizing only the exact practice questions your teacher handed out. On test day, if the teacher asks those exact questions word-for-word, you ace it. But if she asks the same historical concepts in different wording, or applies them to new events — you're lost. You memorized the test. You didn't learn history.

That's overfitting. The model learned "ruler in photo → cancer" not because rulers cause cancer, but because in its training examples that correlation happened to be true. It had no way to know the correlation was an artifact of how dermatologists practice — a shortcut, not a signal.

Overfitting: When a model learns the training data too precisely, including its noise, accidents, and quirks, and fails to generalize to new examples outside that data.

Spurious Correlation: A statistical relationship between two things that isn't caused by any real connection. Rulers and cancer are correlated in certain datasets, not because they're related, but because of how the data was collected.

The Model That Identified Wolves by Snow

The skin cancer case isn't a fluke. In 2016, AI researcher Marco Tulio Ribeiro and his colleagues examined an image classification model trained to tell wolves from huskies. The model looked accurate. Then they used a technique they had developed, called LIME, to visualize which pixels the model was actually using to make its decisions.

The result: the model was separating "wolf" from "husky" almost entirely based on whether there was snow in the background of the photo. The wolf photos in the training data had snowy backgrounds. The husky photos did not. The model never actually looked at the animal. It looked at the scenery.

This illustrates a critical idea: a model can appear accurate while doing completely the wrong thing. If your test set has the same background patterns as your training set, the model scores well. Deploy it in the real world — say, at a dog shelter where all photos are taken against a plain wall — and it falls apart.

Why This Matters at Scale

When thousands of doctors use a model to triage patients, or when thousands of parole decisions are influenced by a risk score, an overfitted model isn't just wrong on a test set. It's wrong in ways no one anticipated, on patients and people who had no say in how the training data was collected.

Detecting Overfitting — and Why It's Not Always Caught

The standard defense against overfitting is called a validation set — a portion of data held back from training, used only to test whether the model generalizes. If your model scores 97% on training data but only 68% on the validation set, that gap is a red flag for overfitting.
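The gap between training and validation scores can be reproduced with a few lines of code. This is a minimal sketch on invented data, not any real medical model: the true rule is "x is greater than 0.5", 20% of labels are deliberately flipped to act as noise, and the names `memorizer` and `simple_rule` are made up for the example. The memorizer copies the label of the closest training example, so it also memorizes every noisy label.

```python
import random

random.seed(2)  # fixed seed so the run is repeatable

def make_data(n, noise=0.2):
    """x in [0, 1]; the true rule is x > 0.5, but some labels are flipped."""
    data = []
    for _ in range(n):
        x = random.random()
        label = x > 0.5
        if random.random() < noise:
            label = not label
        data.append((x, label))
    return data

train = make_data(500)
validation = make_data(500)  # held back: never used to build either model

def memorizer(x):
    """1-nearest-neighbour: copy the label of the closest training example."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

def simple_rule(x):
    """A simple model that ignores the noise in individual examples."""
    return x > 0.5

def accuracy(model, data):
    return sum(model(x) == label for x, label in data) / len(data)

for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    train_acc = accuracy(model, train)
    val_acc = accuracy(model, validation)
    print(f"{name:12s} train {train_acc:.0%}  validation {val_acc:.0%}")
```

The memorizer is perfect on the examples it has seen and much worse on the held-back ones. The simple rule scores about the same on both. The size of that gap, not the training score alone, is the warning sign.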

But this only works if the validation set is genuinely different from the training data. In the melanoma case, both the training and validation images came from the same clinical datasets — which all had the same ruler artifact. The model scored well on both because it was using the same spurious shortcut in both cases.
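Here is a sketch of how a shared artifact can fool a validation set. Everything in it is an assumption made for illustration: the "images" are just three numbers, the ruler matches the diagnosis 95% of the time in the original source, and the learner simply keeps whichever of two candidate rules scores best on the training data. It is not a reconstruction of any real melanoma model.

```python
import random

random.seed(3)  # fixed seed so the run is repeatable

def make_images(n, ruler_follows_label):
    """Each fake 'image' is (has_ruler, lesion_signal, is_malignant)."""
    rows = []
    for _ in range(n):
        malignant = random.random() < 0.5
        # A weak genuine signal: malignant lesions score higher on average.
        lesion_signal = random.gauss(1.0 if malignant else 0.0, 1.0)
        if ruler_follows_label:
            # Same-source artifact: the ruler matches the diagnosis 95% of the time.
            has_ruler = malignant if random.random() < 0.95 else not malignant
        else:
            has_ruler = random.random() < 0.5  # ruler unrelated to diagnosis
        rows.append((has_ruler, lesion_signal, malignant))
    return rows

same_source = make_images(4000, ruler_follows_label=True)
train, validation = same_source[:3000], same_source[3000:]
external = make_images(1000, ruler_follows_label=False)

def ruler_rule(has_ruler, lesion_signal):
    return has_ruler

def lesion_rule(has_ruler, lesion_signal):
    return lesion_signal > 0.5

def accuracy(rule, rows):
    return sum(rule(r, s) == y for r, s, y in rows) / len(rows)

# "Training": keep whichever candidate rule scores best on the training data.
candidates = {"ruler present": ruler_rule, "lesion signal": lesion_rule}
best_name = max(candidates, key=lambda name: accuracy(candidates[name], train))
model = candidates[best_name]

print("Rule chosen in training:", best_name)
print(f"Validation (same source): {accuracy(model, validation):.0%}")
print(f"External (no artifact):   {accuracy(model, external):.0%}")
```

The learner picks the ruler, and the validation score gives no hint that anything is wrong, because the validation images carry the same artifact. Only data collected a different way exposes the shortcut.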

The deeper problem: you can't test for failure modes you haven't thought to look for. Nobody thought to check for rulers. They were checking for cancer — and the model was getting the right answers, by the wrong method, in a way no one had a reason to inspect.

For older readers: this is why AI safety researchers invoke Goodhart's law: once a measure becomes a target, it stops being a good measure. A model trained to maximize accuracy on a test set will find whatever shortcut produces accuracy, whether or not that shortcut reflects genuine understanding.

Ethical Question — No Clean Answer

If an overfitted model is 95% accurate — genuinely better than the average human for most cases — is it acceptable to deploy it in hospitals, even if its accuracy relies partly on shortcuts that could fail unpredictably? Who bears responsibility when it does fail?

You Can Now See What Most People Miss

When you see a headline claiming "AI outperforms doctors," you now know the question to ask first: where did they test it, and was that test environment the same as the training environment? Accuracy on a test set is not the same as accuracy in the real world. That distinction is something most people reading that headline will never think to ask.

Lesson 2 Quiz

The Model That Memorized the Test

5 questions · Does the model generalize, or does it just remember?

1. The melanoma detection model associated rulers in photos with cancer. This is an example of:

Correct. The model learned the training data too precisely — including artifacts of how dermatologists happen to photograph suspicious lesions. That's overfitting: it memorized the quirks of the dataset, not the real pattern.

Think about what "overfitting" means: learning too specifically. The ruler pattern was a quirk of this particular dataset, not a real medical signal. What does it mean when a model latches onto that kind of artifact?

2. A student studies only the exact practice problems from last year's test. They score 100% on the practice set. On a new test covering the same topics with different numbers, they score 52%. This is an analogy for:

Exactly. Perfect performance on the training examples (the practice set) with poor performance on new examples (the real test) is the definition of overfitting. The student memorized, didn't learn.

The key detail is: great on the specific examples they studied, poor on new examples of the same concepts. That gap — training performance vs. generalization — is the signature of overfitting.

3. Ribeiro's wolf/husky model was classifying animals based on background scenery, not the animal itself. What does this reveal about high accuracy scores?

Right. Accuracy is a measure of correct answers — it says nothing about whether the model learned the right thing. A model can be right for the wrong reasons, and that's dangerous because it will fail unpredictably when the shortcut stops working.

Think about the husky case. The model got high accuracy — but was it because it understood what a husky looks like, or because it noticed something irrelevant that happened to correlate in the training set? What does that tell you about what accuracy actually measures?

4. A researcher uses a validation set to test for overfitting. But the validation set comes from the same hospitals as the training set, and all those hospitals use the same ruler-photographing practice. What is the problem?

Exactly. A validation set can only expose a shortcut that stops working on the validation data. If the training and validation data share the same artifacts, the model can exploit those artifacts across both sets and appear to be generalizing when it isn't.

Think about what a validation set is for: it's supposed to represent data the model hasn't seen. But if validation and training data come from the same source with the same artifact, what does scoring well on validation actually prove?

5. An AI company claims their new loan approval model has 94% accuracy on their internal test set, based on 100,000 loans made by their bank over the past eight years. What question should you ask before trusting this claim?

Exactly the right question. An 8-year internal test set might share all the same institutional quirks and data collection artifacts as the training data. High accuracy in that environment doesn't tell you how the model will perform on new applicants from different banks, different years, or different economic conditions.

Think about what we learned about validation sets. The real question about accuracy is not "how high is the number" but "what was the test set, and does it represent the environment where the model will actually be deployed?"

Lesson 2 Lab

Shortcut Detective

A model scores 96% on its test set. Your job is to figure out whether it actually learned anything — or just found a shortcut.

Your Role: Model Evaluator

A startup has built a model to detect fraudulent bank transactions. It was trained on two years of transaction data from a single major bank. It scores 96% on the held-out test set from that bank. They want to license it to 50 other banks and claim it's ready.

Your lab partner will challenge your thinking. You need to make a case — with specific reasoning — about whether this model is ready to deploy.

Start by explaining: what would you want to know about the training data and test setup before you trusted that 96% number? Be specific.

Model Evaluator Partner

Lab 2 · Shortcut Detective

96% accuracy — that's a strong number. The startup CEO just sent me a message saying "the data speaks for itself." I'm not so sure. What's your first concern about this evaluation setup, and why would that number potentially be misleading?

Module 2 · Lesson 3

The World Moved and the Model Didn't

In March 2020, medical AI systems trained on years of patient data ran headlong into a disease that had never existed in their training sets. What happened next was predictable — and preventable.

A model trained on old data gets deployed in a world that has changed. Who is responsible for what goes wrong?

By late March 2020, hospitals across Europe and North America were being overwhelmed by COVID-19 patients. Administrators and doctors turned to AI triage tools — systems trained over years on millions of patient records — to help prioritize care and predict which patients would deteriorate fastest.

Many of these tools had passed extensive testing before the pandemic. They had good accuracy on historical data. And then they were asked to assess patients with a disease that did not exist anywhere in their training data.

A review published in the journal Nature Machine Intelligence in March 2021, by a team including Michael Roberts and Derek Driggs at the University of Cambridge, screened more than 2,000 papers describing machine-learning models for detecting COVID-19 or predicting its outcomes from chest X-rays and CT scans. The verdict was devastating: none of the models it examined in detail was judged fit for clinical use, because of methodological flaws, underlying biases, or both. Some models had been trained on datasets where all the COVID-positive patients had been scanned in one position and all the COVID-negative patients in another, and the model had learned the position, not the disease.

Others had trained on data from early pandemic datasets that were too small, too skewed, or sourced from single hospitals where local practices created artificial patterns. When deployed at different hospitals in different countries, with different patient populations and different imaging equipment, the models failed quietly — often without alerting anyone that they were operating outside the conditions they were designed for.

What Distributional Shift Means

Every model is trained on a distribution — a collection of data that represents some slice of the world at some moment in time. When a model is deployed, it encounters new data. If that new data looks similar to the training distribution, the model tends to perform well. But if the new data is systematically different — because time has passed, circumstances have changed, or the model is being used somewhere new — performance can collapse without warning.

This is called distributional shift (or distribution shift). The world that the model encounters no longer matches the world it was trained on.

For younger readers: imagine you've practiced basketball only on an indoor court. You've trained your muscle memory for the lighting, the echo, the feel of that specific floor. Now someone asks you to play outdoors — different light, different surface, wind affecting the ball. Your skills don't disappear, but your very specific learned responses stop working as well. The environment shifted. You didn't.

Distributional Shift: When the data a model encounters after deployment is systematically different from the data it was trained on, causing performance to degrade, sometimes without any visible warning.

Silent Failure: When a model's performance deteriorates but no error message appears. The system keeps running and outputting predictions that seem confident while being increasingly wrong.

The Mortgage Model That Worked Until 2008

Distributional shift isn't only a healthcare problem. In the years before 2008, banks and hedge funds deployed sophisticated mathematical models to price complex financial instruments — particularly mortgage-backed securities. These models had been trained on mortgage data going back decades. They had been validated. They had been tested. They performed extremely well on historical data.

There was one problem: all that historical data came from a period when U.S. housing prices had never fallen significantly at a national level. The models had never encountered a scenario where they did. When housing prices began falling across the entire country simultaneously in 2007 and 2008, the models weren't just wrong — they were confidently wrong. They kept outputting low-risk assessments on instruments that were catastrophically failing. The firms trading on those assessments lost billions.

The 2008 financial crisis had many causes. But the systematic failure of quantitative models trained on non-representative historical data, and deployed without adequate monitoring for distributional shift, was among them. Critics including the former trader and author Nassim Nicholas Taleb had warned about this vulnerability before 2008. Taleb called this kind of excluded scenario a "Black Swan": an event outside the historical distribution that models had no way to anticipate.

For older readers, this has direct policy implications: regulators now require financial institutions to perform "stress tests" — deliberately running models through scenarios outside their training distribution to see what happens. The COVID AI review had a similar implication: models need monitoring systems that detect when the incoming data is drifting away from training conditions.

The Institutional Lesson

After 2008, financial regulators mandated stress testing precisely to address distributional shift. The same logic is now being applied to medical AI — models should come with documentation of their training distribution, and deployment systems should flag when incoming data looks significantly different from what the model was trained on.
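One way a deployment system might flag data that "looks significantly different" is the Population Stability Index (PSI), a drift score long used in credit risk modeling. The sketch below is one possible version, not a standard implementation: the function names, the simulated feature values, and the 0.1 and 0.25 cut-offs (a common rule of thumb, not a law) are all assumptions.

```python
import math
import random

def psi(expected, actual, bins=10, floor=1e-4):
    """Population Stability Index between two samples of one numeric feature."""
    ordered = sorted(expected)
    # Bin edges at the training sample's quantiles, so each bin holds
    # roughly the same share of the training data.
    edges = [ordered[len(ordered) * i // bins] for i in range(1, bins)]

    def proportions(values):
        counts = [0] * bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1  # index of the bin v falls in
        # The floor avoids log(0) when a bin is empty.
        return [max(c / len(values), floor) for c in counts]

    exp_p, act_p = proportions(expected), proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_p, act_p))

def drift_status(value):
    """Rule-of-thumb labels; the thresholds are a convention, not a guarantee."""
    if value < 0.1:
        return "stable"
    if value < 0.25:
        return "watch"
    return "shifted"

random.seed(4)  # fixed seed so the run is repeatable
training = [random.gauss(50, 10) for _ in range(5000)]       # what the model saw
same_world = [random.gauss(50, 10) for _ in range(2000)]     # new data, same conditions
changed_world = [random.gauss(65, 15) for _ in range(2000)]  # new data after a change

psi_same = psi(training, same_world)
psi_changed = psi(training, changed_world)
print(f"Same conditions:    PSI {psi_same:.3f} ({drift_status(psi_same)})")
print(f"Changed conditions: PSI {psi_changed:.3f} ({drift_status(psi_changed)})")
```

A check like this only watches one input at a time and says nothing about whether the model's answers are still right. It is an early warning that the world may have moved, not a verdict.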

Why Silent Failure Is the Worst Kind

A tool that fails loudly has one great virtue: it tells you it's broken. A thermometer that displays an error code is much safer than one that gives you a reading of 98.6°F when you have a 104°F fever, with no error code in sight.

Most deployed AI systems don't fail loudly. They produce an output — a score, a recommendation, a classification — regardless of whether the input data is within the model's reliable operating range. The COVID triage tools didn't refuse to give predictions when patients had a new disease. They gave predictions. The predictions happened to be based on patterns that didn't apply.

This is why knowing your model's distribution — understanding exactly what kind of data it was trained on and where its boundaries are — isn't optional. It's a safety requirement. And right now, most deployed models don't come with this documentation.
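Making a model fail loudly can start with something very simple: remember the range of each input seen during training, and refuse to answer outside it. The sketch below is hypothetical throughout. `GuardedModel`, `OutOfRangeError`, the toy risk score, and the patient values are invented for illustration, and a range check is only one crude way to mark a model's boundaries.

```python
class OutOfRangeError(ValueError):
    """Raised when an input falls outside the range seen during training."""

class GuardedModel:
    """Wraps a prediction function and refuses inputs it was never trained on."""

    def __init__(self, predict, training_rows):
        self._predict = predict
        columns = list(zip(*training_rows))
        self._low = [min(c) for c in columns]   # smallest value seen per feature
        self._high = [max(c) for c in columns]  # largest value seen per feature

    def predict(self, row):
        for i, value in enumerate(row):
            if not self._low[i] <= value <= self._high[i]:
                raise OutOfRangeError(
                    f"feature {i} = {value} is outside the training range "
                    f"[{self._low[i]}, {self._high[i]}]"
                )
        return self._predict(row)

# Hypothetical training rows: (age in years, body temperature in Celsius).
training_rows = [(34, 36.6), (51, 37.1), (68, 38.2), (45, 36.9), (72, 39.0)]

def toy_risk(row):
    """Stand-in for a real model: a made-up score between 0 and 1."""
    age, temperature = row
    return min(1.0, max(0.0, 0.01 * age + 0.1 * (temperature - 36.5)))

guarded = GuardedModel(toy_risk, training_rows)

print(guarded.predict((60, 37.5)))   # inside the training range
try:
    guarded.predict((60, 41.5))      # hotter than anything seen in training
except OutOfRangeError as error:
    print("Refused:", error)
```

A guard like this would not have caught every COVID-era failure, since a new disease can produce readings that sit inside familiar ranges. It does show the design choice that matters: an explicit refusal is something a person can notice and act on, while a confident wrong number is not.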

Ethical Question — No Clean Answer

During COVID, hospitals were overwhelmed and doctors were desperate for any tool that could help triage patients. If an AI triage system was operating outside its training distribution and no one knew it — but doctors believed it was helping — should it have been deployed at all? What would you need to know to answer that question honestly?

Knowing This Changes How You Read Every Headline About AI

Whenever you see a claim that an AI system "works" — now you'll automatically ask: works on what data, in what conditions, from what time period? A model that worked perfectly last year might be failing today because the world changed. That question — "has the world drifted away from the training data?" — is one most headlines never bother to ask. You will.

Lesson 3 Quiz

The World Moved and the Model Didn't

5 questions · When does a model's past training become a liability?

1. A COVID-19 triage AI was trained on pre-pandemic patient data and deployed in March 2020. Why would distributional shift be a fundamental problem here — even if the model had been perfectly built?

Exactly right. Distribution shift here is absolute — COVID-19 didn't exist in any prior dataset. The model wasn't making slightly inaccurate predictions; it was generating predictions with no valid basis whatsoever for a completely novel disease.

Think about what "training distribution" means. If a disease never appeared in the training data, what patterns could the model possibly be using when it encounters a patient with that disease? That's the core of the distributional shift problem here.

2. What made the failure of financial models in 2008 an example of distributional shift?

Right. The historical data was accurate for the world it described — but it excluded a scenario that eventually happened. When the excluded scenario arrived, the models had no learned pattern to draw on and kept outputting confident wrong answers.

The models weren't technically broken or underfitted. They were trained correctly on historical data. The problem was what that historical data didn't contain — a specific scenario the past had never produced. What does that mean for how reliable past data is?

3. Why is "silent failure" particularly dangerous for deployed AI systems?

Correct. A system that fails loudly tells you to stop trusting it. A system that fails silently keeps producing plausible outputs — and users continue to act on them without knowing the underlying basis has collapsed.

Think about the thermometer analogy from the lesson. What's the difference between a thermometer that shows an error code and one that shows 98.6°F when you actually have a 104°F fever? That's the essence of silent failure.

4. After 2008, financial regulators mandated "stress tests" for risk models. In the context of distributional shift, what is the purpose of a stress test?

Exactly. A stress test probes the edges of a model's training distribution — asking "what if housing prices fall nationally?" or "what if a new disease appears?" It's a deliberate attempt to find the failure modes before deployment rather than after.

Think about what distributional shift means. A stress test is specifically designed to address that problem — to see what happens when the world moves beyond the training distribution. What would you have to do to a model to test that?

5. A school builds a model to predict which students are at risk of dropping out, trained on five years of student data. In 2020, the school switches entirely to remote learning. What should the school do before continuing to use the model?

Right. A fundamental change in how school operates — from in-person to remote — changes almost every signal the model was trained on. Before using predictions to make decisions about students' futures, you'd want to verify the model is still tracking the same underlying reality.

Think about what "distribution" means in this context. Five years of in-person school creates very specific patterns of attendance, engagement, and behavior. What happens to all those patterns when everything moves online? Does the old model still know what it's looking at?

Lesson 3 Lab

The Shift Monitor

A model was built before the world changed. You're deciding whether it should keep running.

Your Role: AI Deployment Reviewer

A city's social services department uses an AI model to predict which families are most at risk of housing instability and should be prioritized for outreach. The model was trained on 2015–2019 data. It's now 2021, after two years of pandemic-era economic disruption, eviction moratoriums, and changed assistance programs.

City leadership wants to keep using the model because it's "already built and tested." You've been asked whether it should continue operating as-is, be monitored carefully, or be suspended pending retraining.

Your partner will challenge your recommendation. Start by stating your position clearly and your single strongest reason for it.

Deployment Review Partner

Lab 3 · Shift Monitor

The city's budget office is pushing hard to keep the model running: retraining it costs money they don't have, and families need help now. The argument is: "even an imperfect model is better than no model." What's your position, and what's your strongest argument for it? Don't hedge. Take a side.

Module 2 · Lesson 4

Breaking the Model on Purpose

In 2017, researchers added a few small stickers to a stop sign. A road-sign classifier, the kind of model a self-driving car relies on, saw a speed limit sign instead. The stickers were designed using mathematics, and in lab tests they fooled the classifier every time.

If someone knows how your model makes decisions, can they make it fail whenever they want?

In mid-2017, a team of researchers from the University of Washington, the University of Michigan, Stony Brook University, and UC Berkeley published a paper with a striking demonstration. They had taken a standard stop sign, the kind at any intersection, and added small black-and-white stickers to it. The stickers looked, to a human, like random graffiti or tape residue. Nothing concerning.

But the stickers had been calculated with precision. Using knowledge of how neural networks process images, the team had crafted the sticker pattern so that it would push the network's internal calculations in a specific direction. The sign would be classified not as a stop sign but as a 45 mph speed limit sign. In lab tests, from multiple viewing angles and at different distances, that happened 100% of the time. In a drive-by test from a moving car, it still happened in roughly 85% of video frames.

This wasn't a magic trick. It was a demonstration of a category of attack that researchers had first identified in 2013, when scientists at Google, New York University, and the University of Montreal (Ian Goodfellow among them) showed that you could add carefully calculated noise to an image, invisible to human eyes, and cause a neural network to confidently misclassify it. Their paper called these "adversarial examples."

The 2017 stop sign paper was alarming for a specific reason: the attack worked in the physical world. Not just on digital images fed directly into a computer, but on a real sign, photographed by a real camera, processed by a real model. The researchers titled the published version of their paper "Robust Physical-World Attacks on Deep Learning Visual Classification." No one who read it thought self-driving cars were quite as safe as they'd seemed the week before.

How Adversarial Attacks Work

To understand why this is possible, you need to remember how neural networks make decisions. A network classifying an image isn't looking at the image the way you do — seeing shapes, objects, context. It's applying millions of learned numerical weights to the pixels in the image, combining them through layers of calculations, and producing a confidence score for each possible label.
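To make "weights in, confidence scores out" concrete, here is a minimal sketch of a single-layer classifier. Everything in it is an illustrative assumption: the four-pixel "image", the weight values, and the two labels are made up, and a real network stacks many such layers with millions of weights.

```python
import math

def softmax(scores):
    """Turn raw scores into confidences that are positive and sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(pixels, weights, biases, labels):
    """One layer: a weighted sum of the pixels for each label, then softmax."""
    scores = [sum(w * p for w, p in zip(row, pixels)) + b
              for row, b in zip(weights, biases)]
    return dict(zip(labels, softmax(scores)))

# Hypothetical values, chosen only for illustration.
LABELS = ["stop sign", "speed limit"]
WEIGHTS = [[2.0, -1.0, 0.5, 1.0],    # weights that vote for "stop sign"
           [-1.5, 1.0, 0.5, -0.5]]   # weights that vote for "speed limit"
BIASES = [0.0, 0.0]
image = [0.9, 0.1, 0.5, 0.8]         # four pixel brightness values in [0, 1]

confidences = classify(image, WEIGHTS, BIASES, LABELS)
for label, confidence in confidences.items():
    print(f"{label}: {confidence:.1%}")
```

Notice that nothing in this code knows what a stop sign is. The label is just whichever row of numbers produces the biggest weighted sum.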

An adversarial attack works by asking: which small changes to the pixel values would most push the model's calculation toward a wrong answer? If you know the model's weights and architecture, you can calculate this precisely using gradients, the same mathematics used to train the model in the first place. Training uses gradients to adjust the weights so the error shrinks; an attack holds the weights fixed and uses gradients to adjust the input pixels so the error grows.

The result is a perturbation — a tiny modification to the image — that is mathematically optimized to deceive the specific model. To human eyes, a photo of a panda with an adversarial perturbation added still looks exactly like a panda. To the model: 99.3% confident it's a gibbon.
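The idea can be sketched on a toy model. The code below uses one simple, well-known recipe, the fast gradient sign method, on a tiny two-class classifier. It is an illustration under stated assumptions, not the procedure used in any specific attack from this lesson: the 1,000-pixel "image", the weights, and the step size `EPSILON` are all invented for the demonstration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(pixels, weights, bias):
    """Confidence that the image is class 1 (call it "panda" in this toy)."""
    return sigmoid(sum(w * p for w, p in zip(weights, pixels)) + bias)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(pixels, weights, bias, true_label, epsilon):
    """Nudge every pixel by at most epsilon in the direction that raises the error."""
    p = predict(pixels, weights, bias)
    # Gradient of the cross-entropy loss with respect to each input pixel.
    grad = [(p - true_label) * w for w in weights]
    # Step along the sign of the gradient, keeping pixels in the valid range [0, 1].
    return [min(1.0, max(0.0, x + epsilon * sign(g)))
            for x, g in zip(pixels, grad)]

# Hypothetical model and image, chosen only for illustration.
N = 1000
WEIGHTS = [0.5 if i % 2 == 0 else -0.5 for i in range(N)]
BIAS = 0.0
image = [0.51 if i % 2 == 0 else 0.49 for i in range(N)]
EPSILON = 0.02  # each pixel may change by at most 2% of its range

adversarial = fgsm(image, WEIGHTS, BIAS, true_label=1, epsilon=EPSILON)
before = predict(image, WEIGHTS, BIAS)
after = predict(adversarial, WEIGHTS, BIAS)
largest_change = max(abs(a - x) for a, x in zip(adversarial, image))
print(f"panda confidence before: {before:.1%}, after: {after:.1%}")
print(f"largest change to any pixel: {largest_change:.3f}")
```

No single pixel moves much, but every pixel moves in exactly the direction that hurts the model most, and a thousand tiny pushes add up. That is why the change can be invisible to a person and decisive for the model.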

Adversarial Example: An input that has been deliberately modified, often imperceptibly, to cause a model to make a specific wrong prediction with high confidence.

Adversarial Attack: The deliberate use of adversarial examples to cause a deployed model to fail, either for research purposes or malicious ones.

Beyond Stop Signs — Where This Gets Real

The stop sign demonstration was alarming because autonomous vehicles were already on public roads. But adversarial attacks extend well beyond self-driving cars:

Face recognition systems. In 2016, researchers at Carnegie Mellon showed that wearing specially printed glasses could cause face recognition systems, including a commercial one, to consistently misidentify the wearer as a different person, or to fail to detect a face at all. The glasses weren't hiding the face. They were mathematically confusing the model about whose face it was seeing.

Malware detection. Security researchers have demonstrated that malicious code can be modified — without changing its function — in ways that cause antivirus AI systems to classify it as safe. The code still does the harmful thing. The model just can't see it anymore.

Medical imaging. A 2019 paper in Science showed that adversarial attacks on medical imaging AI could flip a model's reading of an image, causing it to miss a tumor or report a non-existent one, again using pixel-level modifications invisible to a radiologist reviewing the same image.

None of these is just a research curiosity. Each represents a class of attacker: someone who knows how a model works and uses that knowledge to make it fail exactly when and how they want.

The Institutional Stakes

Adversarial attacks change the security model for any system that relies on AI. Traditional computer security asks: can someone break into the system? Adversarial security asks: can someone, from the outside, without breaking into anything, cause the AI to reach any conclusion they want? That's a fundamentally different threat — and most deployed systems have no defense against it.

Why This Is Hard to Fix — and the Ethical Dimension

The direct defense against adversarial attacks is called adversarial training: you generate adversarial examples during training and teach the model to classify them correctly too. This helps — but researchers have consistently shown that models hardened against known attacks remain vulnerable to new attacks that approach from different directions. It becomes an arms race: fix the known vulnerability, someone finds a new one.
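The training loop behind that defense can be sketched in miniature. This is a toy under stated assumptions, not how production systems are hardened: the six two-number "images", the learning rate, and the attack size `epsilon` are invented, and the attack used is the simple fast gradient sign method.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(x, y, w, b, epsilon):
    """Worst-case nudge of size epsilon against the current model."""
    p = predict(x, w, b)
    return [xi + epsilon * sign((p - y) * wi) for xi, wi in zip(x, w)]

def sgd_step(x, y, w, b, lr):
    """One gradient-descent update of the weights on a single example."""
    err = predict(x, w, b) - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)], b - lr * err

def adversarial_train(data, epsilon=0.5, lr=0.1, epochs=100):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            w, b = sgd_step(x, y, w, b, lr)      # learn from the clean example
            x_adv = fgsm(x, y, w, b, epsilon)    # attack the current model
            w, b = sgd_step(x_adv, y, w, b, lr)  # learn from the attacked example
    return w, b

def accuracy(data, w, b):
    return sum((predict(x, w, b) > 0.5) == (y == 1) for x, y in data) / len(data)

# Hypothetical data: class 1 sits in the upper right, class 0 in the lower left.
DATA = [([1.5, 2.0], 1), ([2.0, 1.5], 1), ([2.5, 2.0], 1),
        ([-1.5, -2.0], 0), ([-2.0, -1.5], 0), ([-2.5, -2.0], 0)]

w, b = adversarial_train(DATA)
attacked = [(fgsm(x, y, w, b, 0.5), y) for x, y in DATA]
clean_acc = accuracy(DATA, w, b)
attacked_acc = accuracy(attacked, w, b)
print(f"accuracy on clean inputs: {clean_acc:.0%}, on attacked inputs: {attacked_acc:.0%}")
```

The toy problem is easy on purpose: the two classes are far apart, so the model can absorb this one attack at this one size. The loop only ever sees the attack it was trained against, which is exactly the limitation described above.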

There's also a deeper reason this is hard: adversarial examples exist because the model is not perceiving the world the way humans do. It found patterns that work statistically but aren't robust in the way human perception is. No amount of training data fully closes that gap, because the gap is structural — the model is doing something fundamentally different from seeing.

For older readers, this raises a policy question that real institutions are grappling with right now: should AI systems in high-stakes domains (traffic, medicine, security) be required to demonstrate adversarial robustness before deployment? Currently, most are not. The EU's AI Act, passed in 2024, begins to address this — requiring risk documentation for high-stakes AI systems — but robustness requirements are still evolving.

Ethical Question — No Clean Answer

If an adversarial attack on a self-driving car's perception system causes an accident — and the attacker knew this was possible — who bears legal and moral responsibility? The attacker? The company that deployed a vulnerable model? The regulators who didn't require adversarial testing? All three?

You Can Now See What Most People Miss

Most people think AI security means protecting the server from hackers. You now understand a completely different threat: someone who never touches the server but changes a sticker on a sign — and the car does whatever they want. That reframes what "AI safety" means from the ground up. When you hear about self-driving cars, face recognition at airports, or AI in medical devices, you're now equipped to ask: what happens when someone who knows how this model works tries to break it?

Lesson 4 Quiz

Breaking the Model on Purpose

5 questions · Who can break an AI, how, and what does it mean?

1. The stop sign attack used small stickers to make a self-driving car see a speed limit sign. How was the sticker pattern determined?

Correct. The attack wasn't random — it was calculated. Knowing the model's architecture and weights, researchers could compute precisely which changes to the image would cause the misclassification. This is what makes adversarial attacks structurally different from ordinary errors.

Think about what "adversarial" means in this context. The attack wasn't guessing — it was calculated using mathematical knowledge of how the specific model makes decisions. That's what makes it different from simply damaging a sign.

2. A researcher adds invisible perturbations to a chest X-ray image. A radiologist sees nothing wrong, but the AI model says the patient has no abnormalities — when there is actually a tumor present. What type of attack is this?

Exactly. The modification is imperceptible to a human but mathematically optimized to deceive the model's specific perception mechanism. This is the defining feature of an adversarial attack applied to a real medical context.

The key here is intentionality and method. The image has been deliberately modified using knowledge of the model. That's what distinguishes this from a distribution shift (world changes on its own) or overfitting (model's training problem). What does "adversarial" actually mean?

3. Adversarial examples work because neural networks process images differently from how humans see them. What does this reveal about the nature of what neural networks learn?

Correct. The model learned patterns that worked statistically on training data — but those patterns aren't grounded in the same conceptual understanding of shapes and objects that makes human perception robust. That gap is structural, not just a training deficiency.

Think about the central theme of this module: models learn patterns from data, not underlying meaning. What does that imply about how the model processes an image — and why a carefully crafted perturbation could fool it while leaving a human unaffected?

4. "Adversarial training" is one defense against adversarial attacks. What is its core limitation?

Right. Adversarial training is a patch against the specific attacks you've already seen — but adversarial attacks can always approach from new directions. The fundamental vulnerability — that models learn patterns, not meanings — doesn't disappear with more training.

Think about what adversarial training actually does. It shows the model examples of known attacks. But what happens when an attacker comes up with a new type of attack that the model hasn't been trained against? Does knowing the answer to the last test help you on a new one?

5. A government wants to deploy a face recognition system at all major airports. A security researcher demonstrates that printed glasses can fool the system. The government's response is: "We'll make photos of the glasses classified so no one knows how to make them." Why is this an insufficient security response?

Exactly. The attack isn't a secret recipe — it's a mathematical consequence of how the model works. Anyone with enough knowledge of the model's structure can derive their own adversarial patterns. Hiding one particular example doesn't fix the underlying vulnerability that makes adversarial examples possible.

Think about how the sticker pattern was created — through mathematical calculation, not through discovering someone else's trick. If the vulnerability is structural (the model processes images in an exploitable way), does hiding one specific exploit actually fix it? What would a real fix require?

Lesson 4 Lab

Red Team

You're not building the model. You're trying to break it — before someone dangerous does.

Your Role: Adversarial Security Analyst

A hospital has just deployed an AI system that reads CT scans to detect internal bleeding. The system is 93% accurate on the validation set. Hospital administration is proud of it. Your job is to "red team" it — to think like an attacker and find scenarios where it fails before those failures cause harm.

Your lab partner is a fellow red-teamer who will push you to be more specific, more creative, and more rigorous. You need to make the case for at least three distinct failure scenarios — not just "it could be wrong sometimes."

Start with your most serious adversarial scenario. Describe who the attacker would be, what they would need to know, and what they would actually do to cause the model to fail.

Red Team Partner

Lab 4 · Adversarial Analyst

The hospital CTO just told me "this system has been tested extensively and no one would even know how to attack it." I think that's dangerously wrong. What's your first attack scenario — and more importantly, what does the attacker actually need to know or have access to in order to pull it off?

Module 2 · Final Test

When the Model Gets It Dangerously Wrong

15 questions · All four lessons · Pass at 80% or above to complete the module.

1. Amazon's hiring AI penalized résumés containing the word "women's." No engineer programmed this. What caused it?

Correct. Training bias: the data carried historical unfairness, and the model learned it.

Review Lesson 1: the model learned from biased historical data, not from deliberate programming.

2. A healthcare algorithm uses "total healthcare spending" to measure patient health need. Why does this produce racially biased outcomes?

Correct. Cost is a proxy for access — not health — and access is historically unequal.

Review Lesson 1: proxy variables carry hidden inequalities from the world into the model.

3. What is a proxy variable?

Correct. A proxy encodes what the model wasn't supposed to see — via correlation, not direct inclusion.

Review Lesson 1: a proxy variable indirectly substitutes for a protected characteristic through correlation.

4. The melanoma detection model associated rulers in photos with cancer because dermatologists measure suspicious lesions. This is an example of:

Correct. Overfitting: the model latched onto an artifact specific to this dataset instead of the real signal.

Review Lesson 2: overfitting is when a model memorizes quirks of the training data rather than the underlying pattern.

5. A model scores 97% on training data and 61% on data from a different hospital. What does this gap most likely indicate?

Correct. A large gap between training performance and performance on genuinely new data is the signature of overfitting.

Review Lesson 2: a large gap between training accuracy and accuracy on new data means the model memorized the specific training examples rather than learning a general rule.

6. Marco Ribeiro's wolf/husky classifier used snow in the background as its primary signal. What is the most important implication of this finding?

Correct. Accuracy is a measure of correct answers — not correct reasoning. A model can be right for the wrong reasons.

Review Lesson 2: the key takeaway from the husky case is that accuracy doesn't tell you what the model actually learned. It only tells you whether it got the right answer on the test set.

7. What is distributional shift?

Correct. The training world and the deployment world no longer match — and performance degrades as a result.

Review Lesson 3: distributional shift happens when the world the model encounters no longer looks like the world it was trained on.

8. COVID-19 triage AI systems failed partly because they were trained on pre-pandemic data. But why did some systems also fail even when trained on early COVID data?

Correct. Even COVID-trained models overfit to the specific hospitals and practices where data was collected — another layer of the same distributional shift problem.

Review Lesson 3: the Cambridge review found that even COVID-era models overfit to the specific conditions of their training hospitals, failing to generalize to other settings.

9. Financial models failed in 2008 because they had never been trained on data where U.S. housing prices fell nationally. What concept does this illustrate?

Correct. The models were accurate within their training distribution. The world moved outside that distribution, and they had no way to handle what they'd never seen.

Review Lesson 3: the financial crisis case is a distributional shift story — the training data didn't include the scenario that actually occurred.

10. A school deploys a dropout-risk model in 2019 and doesn't update it after switching to remote learning in 2020. A student who is genuinely struggling gets a "low risk" score and receives no outreach. What failure mode is this?

Correct. The model's training world (in-person school) no longer matches deployment reality (remote learning), and it fails silently — giving no signal that its predictions have become unreliable.

Review Lesson 3: when the world changes and the model doesn't update, it continues producing outputs without flagging that those outputs may no longer be valid. That's the combination of distributional shift and silent failure.

11. What is an adversarial example?

Correct. The key features: deliberate, targeted, often imperceptible, causes confident misclassification.

Review Lesson 4: adversarial examples are deliberately crafted inputs designed to fool a specific model, not just hard or unusual examples that happen to cause errors.

12. A security researcher wearing specially designed glasses consistently fools an airport face recognition system into identifying them as a different person. What type of attack is this, and what does the attacker need to execute it?

Correct. The glasses are adversarially crafted — calculated, not random — to force a specific misidentification. The attacker needs knowledge of the model's mechanism, not access to its servers.

Review Lesson 4: the glasses case is a physical adversarial attack. The key is that the pattern was mathematically computed to exploit the model's specific decision boundaries.

13. Why does "hiding the specific adversarial pattern used in a demonstrated attack" not solve the adversarial vulnerability?

Correct. The attack is a mathematical consequence of how the model processes inputs, not a secret recipe. Any sufficiently knowledgeable attacker can derive their own version independently.

Review Lesson 4: adversarial patterns are calculated from the model's own structure. Hiding one example doesn't fix the structural property that makes adversarial examples possible at all.

14. Which of the following best describes what all four failure modes in this module have in common?

Exactly right. Training bias, overfitting, distributional shift, and adversarial attacks are all different ways that the boundary between "what the model learned" and "what the world actually is" can break down.

Think about the through-line across all four lessons. What do training bias, overfitting, distributional shift, and adversarial attacks all share? Each involves a gap between the training data and some version of reality. The nature of that gap — and who or what caused it — differs, but the structural issue is the same.

15. A company releases an AI hiring tool with 91% accuracy on their internal test set. You are a policymaker deciding whether to require disclosure before companies can use it. Based on this module, what would you most need to know?

Correct. All four questions map directly onto the four failure modes: training bias (what's in the data?), overfitting (is the test set independent?), distributional shift (does deployment match training?), and adversarial robustness (has anyone tried to break it?). This is what informed AI policy oversight looks like.

Think about all four lessons together. Each one identified a different thing you'd need to know about a model before trusting it. Accuracy on an internal test set alone tells you almost nothing about any of those four concerns. What would a truly informed policymaker ask?