In 2014, Amazon's engineers had a genuinely exciting idea. They were hiring thousands of software engineers and product managers every year, and the process was slow and inconsistent. Different recruiters noticed different things. So the team built an AI system to do what AI does best: find patterns in large amounts of data. They fed it ten years of Amazon's own hiring records: résumés, interview outcomes, who got offers, who got promoted.
The system trained on that data. It learned what "successful Amazon employee" looked like, based on real examples. Then it started scoring new applicants on a scale of one to five stars.
By 2015, something was wrong. The model had taught itself to penalize résumés that included the word "women's," as in "women's chess club" or "women's college." It downgraded graduates of two all-women's colleges. It had also learned to favor certain verbs, such as "executed" and "captured," that appeared more often in men's résumés. No engineer programmed it to do any of this. The model figured it out on its own.
Amazon tried to fix it. Engineers removed the gender-related penalties, but they could not rule out that the model was still finding other proxies: subtle signals that correlated with gender without directly naming it. Amazon quietly scrapped the whole project, and the story became public in 2018. The company said the tool was never used to make real hiring decisions, but the lesson was already written: a model trained on biased history will reproduce that bias, and it won't tell you it's doing it.
Here's the part most people miss: the algorithm did exactly what it was told to do. It found the patterns that predicted success at Amazon. And for ten years, the people promoted at Amazon had been, overwhelmingly, men. Not because women were less capable, but because of countless human decisions made before the model existed: who got interviewed, who got hired, who got the second chance after a slow start.
The model looked at history and said: "Okay, I see what a successful hire looks like." It was right about the pattern. The pattern was wrong.
This is called bias in training data. The data isn't neutral. It's a recording of the past, and the past contains all the unfairness, prejudice, and unexamined assumptions of the humans who created it.
Imagine you're trying to guess who's likely to score high on a test. You look at thousands of past students and you notice that students who own a desk lamp score higher. So you start predicting: desk lamp = smart. But what you've actually found is a correlation with wealth: students who could afford a lamp had quieter study spaces, more stable homes, better access to materials. The lamp didn't help them. It just went along for the ride.
A machine learning model does this all the time, at massive scale, with hundreds of variables simultaneously. It finds signals that predict the outcome in the training data. It doesn't ask whether those signals are fair, or what they actually mean, or whether they'd apply to a different population.
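The desk lamp example can be made concrete in a few lines of Python. This is a minimal sketch with invented numbers (the `students` records and the `lamp_gap` helper are illustrative assumptions, not real data). It compares lamp owners with non-owners across everyone, then again within each wealth group.

```python
from statistics import mean

# Hypothetical records: (family_is_wealthy, owns_desk_lamp, test_score).
# The numbers are invented purely to illustrate a confounded correlation.
students = [
    (True, True, 88), (True, True, 92), (True, True, 90), (True, False, 90),
    (False, True, 70), (False, False, 68), (False, False, 72), (False, False, 70),
]

def lamp_gap(rows):
    """Average score of lamp owners minus average score of non-owners."""
    with_lamp = [score for _, lamp, score in rows if lamp]
    without_lamp = [score for _, lamp, score in rows if not lamp]
    return mean(with_lamp) - mean(without_lamp)

# Looking at everyone together, lamp owners appear to score higher.
overall_gap = lamp_gap(students)

# Comparing only students with the same wealth level removes the confounder.
gap_wealthy = lamp_gap([s for s in students if s[0]])
gap_not_wealthy = lamp_gap([s for s in students if not s[0]])

print("Gap across all students:", overall_gap)
print("Gap among wealthy students:", gap_wealthy)
print("Gap among less wealthy students:", gap_not_wealthy)
```

In this toy data the gap between lamp owners and everyone else disappears once students are compared within the same wealth group. A model that only sees the lamp column has no way to make that comparison.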
For younger readers: think of it like this. If your teacher marked every mistake in red pen, and then someone built a model to predict grades, the model might learn that "lots of red ink on the page = low grade." It would be right about the correlation but completely wrong about the cause. The ink didn't cause the low grade; the mistakes did. The model can't tell the difference without more context.
Models find patterns. They don't find meaning. The pattern "women's college → lower rating" was real in the training data. What it meant was: we've been undervaluing women's applications for years. The model learned the symptom without anyone explaining the disease.
Amazon's case became famous because they admitted it. But the same dynamic plays out in systems that affect far more people's lives, often invisibly:
Criminal sentencing. The COMPAS algorithm, used in courts across the United States since the 2000s, predicted likelihood of reoffending. A 2016 investigation by ProPublica found it rated Black defendants as higher risk than white defendants with the same criminal history, not because race was a direct input, but because the training data reflected decades of racially unequal policing and sentencing.
Medical diagnosis. In 2019, researchers published a study in Science showing a widely used healthcare algorithm was directing less care toward Black patients than equally sick white patients. The algorithm had been trained on healthcare costs, but cost is a proxy for access, and access is unequal.
Facial recognition. MIT researcher Joy Buolamwini found in 2018 that commercial facial recognition systems from IBM, Microsoft, and Face++ had error rates of up to 34% for dark-skinned women, compared to under 1% for light-skinned men. The training datasets had far more light-skinned faces.
In every case, no engineer woke up and decided to be unfair. The data carried the unfairness in silence.
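One way auditors surface this kind of hidden unfairness is to stop looking at a single overall score and break the error rate down by group. The sketch below uses invented audit records (the group names, counts, and helper functions are assumptions for illustration only) to show how a respectable overall number can hide a much worse result for a smaller group.

```python
from collections import defaultdict

# Hypothetical audit records: (group, true_label, predicted_label).
# Group names and counts are invented for illustration.
records = (
    [("group_a", 1, 1)] * 19 + [("group_a", 1, 0)] * 1
    + [("group_b", 1, 1)] * 3 + [("group_b", 1, 0)] * 2
)

def error_rate(rows):
    """Fraction of rows where the prediction differs from the true label."""
    wrong = sum(1 for _, truth, predicted in rows if truth != predicted)
    return wrong / len(rows)

def error_rate_by_group(rows):
    """Split the rows by group, then compute each group's error rate."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[0]].append(row)
    return {group: error_rate(group_rows) for group, group_rows in grouped.items()}

overall = error_rate(records)
per_group = error_rate_by_group(records)

print(f"Overall error rate: {overall:.0%}")
for group, rate in sorted(per_group.items()):
    print(f"{group}: {rate:.0%}")
```

Because the smaller group contributes few records, its much higher error rate barely moves the overall figure. That is why a single headline accuracy number says little about who the errors fall on.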
Now here's where it gets genuinely difficult. Suppose you're trying to fix a biased model. You could remove race as an input variable. But the model might still use zip code, which correlates with race. You remove zip code. It uses school attended, which correlates with zip code. You remove that. The correlations run so deep that removing every proxy would also remove much of the model's ability to make predictions at all.
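The proxy problem can be sketched with a toy example. Everything here is hypothetical: the hiring history, the zip code labels, and the very simple "model," which scores each zip code by its past hire rate and never sees gender at all. The point is only to show that dropping the sensitive column does not, by itself, remove the pattern.

```python
from collections import defaultdict

def rows(gender, zip_code, n_people, n_hired):
    """Build n_people historical records, the first n_hired of them marked hired."""
    return [(gender, zip_code, 1 if i < n_hired else 0) for i in range(n_people)]

# Hypothetical hiring history in which zip code is correlated with gender
# and past decisions favored men. All counts are invented.
history = (
    rows("F", "zip_A", 8, 2) + rows("F", "zip_B", 2, 1)
    + rows("M", "zip_A", 2, 1) + rows("M", "zip_B", 8, 6)
)

def fit_zip_model(records):
    """'Train' without gender: score each zip code by its past hire rate."""
    totals, hires = defaultdict(int), defaultdict(int)
    for _gender, zip_code, hired in records:
        totals[zip_code] += 1
        hires[zip_code] += hired
    return {z: hires[z] / totals[z] for z in totals}

def mean_score_by_gender(records, zip_scores):
    """Average model score per gender (gender is used only for the audit)."""
    scores = defaultdict(list)
    for gender, zip_code, _hired in records:
        scores[gender].append(zip_scores[zip_code])
    return {g: sum(v) / len(v) for g, v in scores.items()}

zip_scores = fit_zip_model(history)
audit = mean_score_by_gender(history, zip_scores)

print("Score per zip code:", zip_scores)
print("Average score by gender:", audit)
```

Even though gender was never an input, the audit shows a lower average score for women, because zip code carries the same information. Removing zip code as well would close that route, but in real data there is usually another correlated column waiting behind it.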
So you face a real tradeoff: a less accurate model that's more fair, or a more accurate model that perpetuates historical inequality. There's no clean mathematical answer to that. It's a value judgment, and value judgments belong to people, not machines.
If a company uses a hiring AI that produces fairer outcomes on average, but occasionally discriminates against individuals in a way no human would catch, is that better or worse than using a human recruiter who is consciously biased but accountable? Who should decide?
When you read a headline saying "AI discriminates," you now understand the real mechanism: the model didn't develop hatred; it learned history. That distinction matters enormously for how we fix it. Knowing this, you're equipped to ask the right question: not "is the AI evil?" but "what did the training data contain, and whose decisions made it that way?"
A hospital has deployed an algorithm that recommends whether patients should receive a specialist referral. The model was trained on five years of hospital referral records. You've been hired to audit it before it goes into wider use.
Your lab partner is an AI investigator who will challenge your thinking and push you to be more precise. Don't expect easy answers. This is a real investigation.
In 2017, researchers at Stanford University published a paper in Nature announcing that a deep learning model could detect skin cancer from photographs with the accuracy of a board-certified dermatologist. The model had been trained on nearly 130,000 clinical images. It was called a breakthrough. News articles declared AI was better than doctors at one of medicine's hardest visual tasks.
Shortly after, a German research team decided to dig deeper. They ran their own tests on similar image-recognition models trained to detect melanoma, the deadliest kind of skin cancer. They noticed something strange. Many of the malignant (cancerous) lesions in the training datasets had been photographed with a ruler in the frame. Dermatologists use rulers to document lesion size when something looks suspicious enough to measure. The benign (non-cancerous) spots typically weren't measured.
The model had quietly, without anyone realizing, learned to associate rulers with cancer. It wasn't just looking at the lesion. It was looking at whether a ruler was present. In the training data, this was a nearly perfect signal. In the real world, it was completely meaningless.
When deployed on images without rulers, or on clinical setups where dermatologists routinely photographed even benign spots with rulers, the model's accuracy dropped. The team published their findings in 2018. The problem they'd identified had a name researchers had known about for decades, but one that the hype around deep learning had quietly pushed aside.
The skin cancer model had made a classic error called overfitting. Here's what that means in plain terms: the model had fit itself so tightly to the specific examples it trained on that it stopped learning the general rule and started memorizing the quirks of its dataset instead.
Imagine studying for a history test by memorizing only the exact practice questions your teacher handed out. On test day, if the teacher asks those exact questions word-for-word, you ace it. But if she asks about the same historical concepts in different wording, or applies them to new events, you're lost. You memorized the test. You didn't learn history.
That's overfitting. The model learned "ruler in photo → cancer" not because rulers cause cancer, but because in its training examples that correlation happened to be true. It had no way to know the correlation was an artifact of how dermatologists practice: a shortcut, not a signal.
The skin cancer case isn't a fluke. In 2016, AI researcher Marco Ribeiro was testing an image classification model's ability to tell wolves from huskies. The model performed brilliantly on the test set. Then Ribeiro used a technique he'd developed, called LIME, to visualize which pixels the model was actually using to make its decisions.
The result: the model was classifying "husky" vs "wolf" almost entirely based on whether there was snow in the background of the photo. Wolf photos in the training data had snowy backgrounds. Husky photos did not. The model never actually looked at the animal. It looked at the scenery.
This illustrates a critical idea: a model can appear accurate while doing completely the wrong thing. If your test set has the same background patterns as your training set, the model scores well. Deploy it in the real world (say, at a dog shelter where all photos are taken against a plain wall) and it falls apart.
When thousands of doctors use a model to triage patients, or when thousands of parole decisions are influenced by a risk score, an overfitted model isn't just wrong on a test set. It's wrong in ways no one anticipated, on patients and people who had no say in how the training data was collected.
The standard defense against overfitting is called a validation set: a portion of data held back from training, used only to test whether the model generalizes. If your model scores 97% on training data but only 68% on the validation set, that gap is a red flag for overfitting.
But this only works if the validation set is genuinely different from the training data. In the melanoma case, both the training and validation images came from the same clinical datasets, which all had the same ruler artifact. The model scored well on both because it was using the same spurious shortcut in both cases.
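Here is a minimal sketch of why a validation set drawn from the same source can miss a shortcut. The images are reduced to two invented facts each, and `shortcut_model` is a stand-in for a model that has learned only "ruler means cancer." The 0.10 gap threshold is an assumption chosen for illustration, not an accepted standard.

```python
# Hypothetical images reduced to two facts: (has_ruler, is_malignant).
# In the source clinic, rulers appear only next to malignant lesions.
same_source_validation = [(True, True)] * 5 + [(False, False)] * 5
# In a different clinic, every lesion is photographed with a ruler.
new_clinic = [(True, True)] * 2 + [(True, False)] * 8

def shortcut_model(has_ruler):
    """A stand-in for a model that learned 'ruler means cancer'."""
    return has_ruler

def accuracy(images):
    """Fraction of images where the shortcut's guess matches the diagnosis."""
    correct = sum(1 for has_ruler, malignant in images
                  if shortcut_model(has_ruler) == malignant)
    return correct / len(images)

def looks_overfit(train_accuracy, validation_accuracy, max_gap=0.10):
    """Flag a large train/validation gap. The 0.10 threshold is an assumption."""
    return (train_accuracy - validation_accuracy) > max_gap

validation_accuracy = accuracy(same_source_validation)
new_clinic_accuracy = accuracy(new_clinic)

print("Accuracy on same-source validation set:", validation_accuracy)
print("Accuracy at the new clinic:", new_clinic_accuracy)
print("Gap check on 97% vs 68%:", looks_overfit(0.97, 0.68))
print("Gap check on same-source data:", looks_overfit(1.0, validation_accuracy))
```

The gap check catches the 97% versus 68% case, but it raises no alarm for the shortcut model, because the validation data shares the artifact. The failure only shows up on data collected under different habits.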
The deeper problem: you can't test for failure modes you haven't thought to look for. Nobody thought to check for rulers. They were checking for cancer, and the model was getting the right answers, by the wrong method, in a way no one had a reason to inspect.
For older readers: this is why AI safety researchers invoke Goodhart's law, the observation that once a measure becomes a target, it stops being a good measure. A model trained to maximize accuracy on a test set will find whatever shortcut produces accuracy, whether or not that shortcut reflects genuine understanding.
If an overfitted model is 95% accurate (genuinely better than the average human for most cases), is it acceptable to deploy it in hospitals, even if its accuracy relies partly on shortcuts that could fail unpredictably? Who bears responsibility when it does fail?
When you see a headline claiming "AI outperforms doctors," you now know the question to ask first: where did they test it, and was that test environment the same as the training environment? Accuracy on a test set is not the same as accuracy in the real world. That distinction is something most people reading that headline will never think to ask.
A startup has built a model to detect fraudulent bank transactions. It was trained on two years of transaction data from a single major bank. It scores 96% on the held-out test set from that bank. They want to license it to 50 other banks and claim it's ready.
Your lab partner will challenge your thinking. You need to make a case, with specific reasoning, about whether this model is ready to deploy.
By late March 2020, hospitals across Europe and North America were being overwhelmed by COVID-19 patients. Administrators and doctors turned to AI triage tools, systems trained over years on millions of patient records, to help prioritize care and predict which patients would deteriorate fastest.
Many of these tools had passed extensive testing before the pandemic. They had good accuracy on historical data. And then they were asked to assess patients with a disease that did not exist anywhere in their training data.
A review published in the journal Nature Machine Intelligence in March 2021, led by Cambridge researchers including Michael Roberts and Derek Driggs, evaluated 62 published studies of machine learning models for diagnosing or predicting outcomes in COVID-19 from chest scans. The verdict was devastating: none of the models was fit for clinical use, because of methodological flaws or underlying biases. Some models had been trained on datasets where all the COVID-positive patients had been scanned in one position and all the COVID-negative patients in another, and the model had learned the position, not the disease.
Others had trained on data from early pandemic datasets that were too small, too skewed, or sourced from single hospitals where local practices created artificial patterns. When deployed at different hospitals in different countries, with different patient populations and different imaging equipment, the models failed quietly, often without alerting anyone that they were operating outside the conditions they were designed for.
Every model is trained on a distribution: a collection of data that represents some slice of the world at some moment in time. When a model is deployed, it encounters new data. If that new data looks similar to the training distribution, the model tends to perform well. But if the new data is systematically different (because time has passed, circumstances have changed, or the model is being used somewhere new), performance can collapse without warning.
This is called distributional shift (or distribution shift). The world that the model encounters no longer matches the world it was trained on.
For younger readers: imagine you've practiced basketball only on an indoor court. You've trained your muscle memory for the lighting, the echo, the feel of that specific floor. Now someone asks you to play outdoors: different light, different surface, wind affecting the ball. Your skills don't disappear, but your very specific learned responses stop working as well. The environment shifted. You didn't.
Distributional shift isn't only a healthcare problem. In the years before 2008, banks and hedge funds deployed sophisticated mathematical models to price complex financial instruments, particularly mortgage-backed securities. These models had been trained on mortgage data going back decades. They had been validated. They had been tested. They performed extremely well on historical data.
There was one problem: all that historical data came from a period when U.S. housing prices had never fallen significantly at a national level. The models had never encountered a scenario where they did. When housing prices began falling across the entire country simultaneously in 2007 and 2008, the models weren't just wrong; they were confidently wrong. They kept outputting low-risk assessments on instruments that were catastrophically failing. The firms trading on those assessments lost billions.
The 2008 financial crisis had many causes. But the systematic failure of quantitative models trained on non-representative historical data, and deployed without adequate monitoring for distributional shift, was among them. Critics including Nassim Nicholas Taleb had warned about this exact vulnerability for years before 2008, calling the excluded scenario a "Black Swan": an event outside the historical distribution that models had no way to anticipate.
For older readers, this has direct policy implications: regulators now require financial institutions to perform "stress tests," deliberately running models through scenarios outside their training distribution to see what happens. The COVID AI review had a similar implication: models need monitoring systems that detect when the incoming data is drifting away from training conditions.
After 2008, financial regulators mandated stress testing precisely to address distributional shift. The same logic is now being applied to medical AI: models should come with documentation of their training distribution, and deployment systems should flag when incoming data looks significantly different from what the model was trained on.
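A drift monitor can start out very simple. The sketch below is one crude heuristic, not an established standard: it measures how far the average of an incoming batch sits from the training average, in units of the training standard deviation. The feature (patient age), the sample values, and the threshold of 2.0 are all assumptions made up for the example.

```python
from statistics import mean, stdev

def drift_score(training_values, incoming_values):
    """How far the incoming average sits from the training average,
    measured in training standard deviations (a crude heuristic)."""
    shift = abs(mean(incoming_values) - mean(training_values))
    return shift / stdev(training_values)

def has_drifted(training_values, incoming_values, threshold=2.0):
    """The threshold of 2.0 is an illustrative assumption, not a standard."""
    return drift_score(training_values, incoming_values) > threshold

# Hypothetical feature: patient age in the training data vs. two new batches.
training_ages = [30, 35, 40, 45, 50]
similar_batch = [38, 41, 43]
shifted_batch = [70, 75, 80]

print("Similar batch flagged:", has_drifted(training_ages, similar_batch))
print("Shifted batch flagged:", has_drifted(training_ages, shifted_batch))
```

A check like this only watches one feature at a time and would miss changes in how features relate to each other, so it is a starting point for monitoring rather than a guarantee.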
There's a simple defense against a tool that fails loudly: it tells you it's broken. A thermometer that displays an error code is much safer than one that silently reads 98.6°F when you have a 104°F fever.
Most deployed AI systems don't fail loudly. They produce an output (a score, a recommendation, a classification) regardless of whether the input data is within the model's reliable operating range. The COVID triage tools didn't refuse to give predictions when patients had a new disease. They gave predictions. The predictions happened to be based on patterns that didn't apply.
This is why knowing your model's distribution (understanding exactly what kind of data it was trained on and where its boundaries are) isn't optional. It's a safety requirement. And right now, most deployed models don't come with this documentation.
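To make "failing loudly" concrete, here is a minimal sketch of a wrapper that records the range of each feature seen in training and refuses to predict outside it. The class names, the toy risk rule, and the training rows are all hypothetical, and a per-feature minimum and maximum is a deliberately simple check. Real out-of-distribution detection is considerably harder.

```python
class OutOfRangeError(ValueError):
    """Raised when an input falls outside what the model saw in training."""

class GuardedModel:
    """Wraps a prediction function and refuses inputs outside the training range."""

    def __init__(self, predict_fn, training_rows):
        self.predict_fn = predict_fn
        columns = list(zip(*training_rows))
        self.lows = [min(col) for col in columns]
        self.highs = [max(col) for col in columns]

    def predict(self, row):
        for i, value in enumerate(row):
            if not self.lows[i] <= value <= self.highs[i]:
                raise OutOfRangeError(
                    f"feature {i} = {value} is outside the training range "
                    f"[{self.lows[i]}, {self.highs[i]}]"
                )
        return self.predict_fn(row)

def toy_risk(row):
    """Invented scoring rule: flag a temperature of 100.4 F or above."""
    return "high" if row[1] >= 100.4 else "low"

# Hypothetical training rows: (age, temperature_f).
training_rows = [(25, 97.9), (40, 98.6), (63, 101.2)]
guarded = GuardedModel(toy_risk, training_rows)

print(guarded.predict((50, 99.0)))    # inside the training range
try:
    guarded.predict((50, 106.0))      # temperature never seen in training
except OutOfRangeError as err:
    print("Refused:", err)
```

The design choice is that a refusal is treated as useful information: a person gets told the model is out of its depth instead of receiving a confident-looking answer.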
During COVID, hospitals were overwhelmed and doctors were desperate for any tool that could help triage patients. If an AI triage system was operating outside its training distribution and no one knew it, but doctors believed it was helping, should it have been deployed at all? What would you need to know to answer that question honestly?
Whenever you see a claim that an AI system "works," you'll now automatically ask: works on what data, in what conditions, from what time period? A model that worked perfectly last year might be failing today because the world changed. That question ("has the world drifted away from the training data?") is one most headlines never bother to ask. You will.
A city's social services department uses an AI model to predict which families are most at risk of housing instability and should be prioritized for outreach. The model was trained on 2015-2019 data. It's now 2021, after two years of pandemic-era economic disruption, eviction moratoriums, and changed assistance programs.
City leadership wants to keep using the model because it's "already built and tested." You've been asked whether it should continue operating as-is, be monitored carefully, or be suspended pending retraining.
In the summer of 2017, a team of researchers from the University of Washington, the University of Michigan, Stony Brook University, and UC Berkeley published a paper with a striking demonstration. They had taken a standard stop sign, the kind at any intersection, and added small black-and-white stickers to it. The stickers looked, to a human, like random graffiti or tape residue. Nothing concerning.
But the stickers had been calculated with precision. Using knowledge of how neural networks process images, the team had crafted the sticker pattern so that it would push the network's internal calculations in a specific direction. The sign would be classified not as a stop sign but as a 45 mph speed limit sign. In the team's stationary tests, from multiple viewing angles and at different distances, the classifier they targeted was fooled 100% of the time.
This wasn't a magic trick. It was a demonstration of a category of attack that researchers had first identified in 2013, when scientists at Google and New York University, including Ian Goodfellow, showed that you could add carefully calculated noise to an image, invisible to human eyes, and cause a neural network to confidently misclassify it. They called these "adversarial examples."
The 2017 stop sign paper was alarming for a specific reason: the attack worked in the physical world. Not just on digital images fed directly into a computer, but on a real sign, photographed by a real camera, processed by a real model. The researchers titled their paper: "Robust Physical-World Attacks on Deep Learning Models." No one who read it thought self-driving cars were quite as safe as they'd seemed the week before.
To understand why this is possible, you need to remember how neural networks make decisions. A network classifying an image isn't looking at the image the way you do, seeing shapes, objects, context. It's applying millions of learned numerical weights to the pixels in the image, combining them through layers of calculations, and producing a confidence score for each possible label.
An adversarial attack works by asking: which small changes to the pixel values would most push the model's calculation toward a wrong answer? If you know the model's weights and architecture, you can calculate this precisely using gradients, the same mathematics used to train the model in the first place. Training uses gradients to adjust the weights so the error shrinks; the attack keeps the weights fixed and uses gradients to adjust the input pixels so the error grows.
The result is a perturbation, a tiny modification to the image, that is mathematically optimized to deceive the specific model. To human eyes, a photo of a panda with an adversarial perturbation added still looks exactly like a panda. To the model: 99.3% confident it's a gibbon.
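The idea can be sketched on a model small enough to follow by hand. The code below applies the fast gradient sign method, one well-known way to build such perturbations, to a toy logistic-regression classifier. It is not the procedure used in the stop sign paper, and the 100-pixel "image," the weights, and the epsilon value are all invented for the demonstration.

```python
import math

def predict(weights, bias, pixels):
    """Probability that the image belongs to class 1 (say, 'panda')."""
    logit = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def loss_gradient_wrt_pixels(weights, bias, pixels, true_label):
    """Gradient of the cross-entropy loss with respect to each input pixel.
    For logistic regression this is (prediction - label) * weight."""
    error = predict(weights, bias, pixels) - true_label
    return [error * w for w in weights]

def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)

def fgsm_attack(weights, bias, pixels, true_label, epsilon):
    """Nudge every pixel by at most epsilon in the direction that raises the loss."""
    gradient = loss_gradient_wrt_pixels(weights, bias, pixels, true_label)
    return [p + epsilon * sign(g) for p, g in zip(pixels, gradient)]

# Hypothetical 100-pixel 'image' and fixed model weights, invented for the demo.
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(100)]
bias = 0.0
image = [0.52 if i % 2 == 0 else 0.48 for i in range(100)]

adversarial = fgsm_attack(weights, bias, image, true_label=1, epsilon=0.05)
largest_change = max(abs(a - b) for a, b in zip(adversarial, image))

print(f"Original:    P(class 1) = {predict(weights, bias, image):.2f}")
print(f"Adversarial: P(class 1) = {predict(weights, bias, adversarial):.2f}")
print(f"Largest change to any pixel: {largest_change:.2f}")
```

No single pixel moves by more than 0.05, yet the model's answer flips, because one hundred tiny nudges all push the calculation in the same direction. Real images have far more pixels, so the per-pixel change needed can be smaller still.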
The stop sign demonstration was alarming because autonomous vehicles were already on public roads. But adversarial attacks extend well beyond self-driving cars:
Face recognition systems. In 2016, researchers at Carnegie Mellon showed that wearing specially printed glasses could cause commercial face recognition systems to consistently misidentify the wearer as a different person, or fail to detect a face at all. The glasses weren't hiding the face. They were mathematically confusing the model about what face it was seeing.
Malware detection. Security researchers have demonstrated that malicious code can be modified, without changing its function, in ways that cause antivirus AI systems to classify it as safe. The code still does the harmful thing. The model just can't see it anymore.
Medical imaging. A 2019 paper in Nature showed that adversarial attacks on radiology AI could cause a model to miss a tumor in an image or add a non-existent one, again using pixel-level modifications invisible to a radiologist reviewing the same image.
None of these is just a research curiosity. Each represents a class of attacker: someone who knows how a model works and uses that knowledge to make it fail exactly when and how they want.
Adversarial attacks change the security model for any system that relies on AI. Traditional computer security asks: can someone break into the system? Adversarial security asks: can someone, from the outside, without breaking into anything, cause the AI to reach any conclusion they want? That's a fundamentally different threat, and most deployed systems have no defense against it.
The direct defense against adversarial attacks is called adversarial training: you generate adversarial examples during training and teach the model to classify them correctly too. This helps, but researchers have consistently shown that models hardened against known attacks remain vulnerable to new attacks that approach from different directions. It becomes an arms race: fix the known vulnerability, someone finds a new one.
There's also a deeper reason this is hard: adversarial examples exist because the model is not perceiving the world the way humans do. It found patterns that work statistically but aren't robust in the way human perception is. No amount of training data fully closes that gap, because the gap is structural: the model is doing something fundamentally different from seeing.
For older readers, this raises a policy question that real institutions are grappling with right now: should AI systems in high-stakes domains (traffic, medicine, security) be required to demonstrate adversarial robustness before deployment? Currently, most are not. The EU's AI Act, passed in 2024, begins to address this by requiring risk documentation for high-stakes AI systems, but robustness requirements are still evolving.
If an adversarial attack on a self-driving car's perception system causes an accident, and the attacker knew this was possible, who bears legal and moral responsibility? The attacker? The company that deployed a vulnerable model? The regulators who didn't require adversarial testing? All three?
Most people think AI security means protecting the server from hackers. You now understand a completely different threat: someone who never touches the server but changes a sticker on a sign, and the car does whatever they want. That reframes what "AI safety" means from the ground up. When you hear about self-driving cars, face recognition at airports, or AI in medical devices, you're now equipped to ask: what happens when someone who knows how this model works tries to break it?
A hospital has just deployed an AI system that reads CT scans to detect internal bleeding. The system is 93% accurate on the validation set. Hospital administration is proud of it. Your job is to "red team" it: to think like an attacker and find scenarios where it fails before those failures cause harm.
Your lab partner is a fellow red-teamer who will push you to be more specific, more creative, and more rigorous. You need to make the case for at least three distinct failure scenarios, not just "it could be wrong sometimes."