In 2013, a team of researchers at Memorial Sloan Kettering Cancer Center in New York partnered with IBM to build an AI system called Watson for Oncology. The goal was serious: give oncologists (cancer doctors) a tool that could read medical literature and patient records and recommend chemotherapy treatments.
When Watson was tested on the cases it had been trained on, it performed remarkably. It matched what expert doctors recommended at very high rates. The headlines wrote themselves: AI beats doctors at cancer treatment planning.
But something quietly wrong was building underneath those numbers. Watson had been trained almost entirely on patient cases from Memorial Sloan Kettering, one of the most elite, specialized cancer hospitals in the world. When other hospitals tried to use it on their own patients, patients who didn't look quite like the original training examples, the recommendations started going sideways. Internal IBM documents reported by STAT News in 2018 described Watson suggesting treatments that were, in some cases, medically unsafe.
Watson hadn't learned oncology. It had memorized one particular hospital's approach to oncology. When the world didn't match the training data, the model broke.
What happened to Watson has a name in machine learning: overfitting. It's one of the most important concepts you'll ever encounter when thinking about AI, and it's surprisingly easy to understand once you see it.
Imagine you're studying for a history test by memorizing every single answer from last year's practice exam: not understanding why events happened, just memorizing the exact questions and exact answers. You'd ace that practice test. But when the real test showed up with slightly different questions about the same events, you'd struggle. You learned the answers, not the subject.
A machine learning model overfits when it learns the training data too specifically. Instead of finding the general pattern (the real underlying rule), it finds every tiny quirk, every coincidence, every specific detail of the examples it was shown. It gets excellent at reproducing those exact examples. But the real world is full of new cases that weren't in the training data.
There's a technical way to spot overfitting: the gap between training accuracy (how well the model does on data it's already seen) and test accuracy (how well it does on fresh examples it's never seen). A model that scores 98% on training data but only 72% on test data is almost certainly overfitting. Watson's internal scores looked great because it was being evaluated mostly on data similar to what it trained on.
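To make the gap concrete, here is a minimal Python sketch. The function names and the 10-point threshold are assumptions chosen for illustration, not an industry standard; how large a gap should worry you depends on the task.

```python
def generalization_gap(train_accuracy, test_accuracy):
    """Return how much better the model does on seen data than on unseen data."""
    return train_accuracy - test_accuracy


def looks_overfit(train_accuracy, test_accuracy, max_gap=0.10):
    """Flag a model whose train/test gap exceeds max_gap (an illustrative threshold)."""
    return generalization_gap(train_accuracy, test_accuracy) > max_gap


# The example from the text: 98% on training data, 72% on test data.
gap = generalization_gap(0.98, 0.72)
print(f"gap: {gap:.2f}, looks overfit: {looks_overfit(0.98, 0.72)}")
```

A small gap does not prove a model is safe. It only means the model does about as well on the test data as on the training data, which says little if the two sets come from the same narrow source.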
Here's what makes overfitting tricky to avoid: the more powerful and complex you make a model, the more capable it is of memorizing. A model with millions of parameters (adjustable settings) can find extremely subtle patterns. Some of those patterns are real and useful. But some are noise: coincidences that existed in the training data but don't exist in the real world.
Think about a medical study with 200 patients. Maybe in those 200 patients, everyone who got better also happened to be left-handed. That's almost certainly a coincidence: there are simply not enough patients for this to mean anything. But an overfitting model might latch onto left-handedness as a predictor of recovery. It found a pattern. The pattern just isn't real.
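You can watch a coincidence like this appear in a small simulation. The sketch below is a toy under stated assumptions: recovery is a coin flip, every "feature" is also a coin flip, and the function name, patient count, and feature count are made up for illustration. Nothing here can truly predict recovery, yet if you check enough random features against 200 patients, one of them will line up with the outcome noticeably more often than the 50% you'd expect.

```python
import random


def best_chance_agreement(n_patients=200, n_features=1000, seed=0):
    """Return the highest agreement between any random feature and the outcome."""
    rng = random.Random(seed)
    # Outcome (recovered or not) is a coin flip: nothing can truly predict it.
    outcome = [rng.randint(0, 1) for _ in range(n_patients)]
    best = 0.0
    for _ in range(n_features):
        # Each feature is also a coin flip, unrelated to the outcome.
        feature = [rng.randint(0, 1) for _ in range(n_patients)]
        matches = sum(f == o for f, o in zip(feature, outcome))
        best = max(best, matches / n_patients)
    return best


print(f"best agreement found by chance: {best_chance_agreement():.2f}")
```

A model with enough freedom does something similar: it searches through many possible patterns and keeps whichever fits the training data best, whether or not the pattern is real.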
The size and variety of your training dataset matter too. With Watson, the problem was partly that the training cases all came from one hospital. That hospital treats a specific kind of patient: often wealthier, often living in the northeastern United States, often already at an advanced stage of cancer because they came there seeking specialized care. A model trained on those patients learned that population, not cancer patients in general.
If you could only study 100 people to understand all of humanity, and those 100 people all happened to live in the same city and have the same job, what would your model of "humans" get wrong? Overfitting is that same problem, at scale, in software.
The machine learning community has developed several tools to catch and reduce overfitting. The most important one is something called a validation split: deliberately holding back some data so the model never sees it during training, then using that hidden data to test whether the model actually generalizes.
Imagine training a model on 80% of your data and keeping 20% locked away. After training, you run the locked-away data through the model. If performance drops dramatically, you know overfitting has occurred. This is now considered a basic standard in responsible ML development.
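The 80/20 idea can be sketched in a few lines of Python. This is a minimal illustration, not a production recipe: the function name, the fixed random seed, and the list of numbers standing in for patient records are all assumptions made for the example.

```python
import random


def train_validation_split(examples, validation_fraction=0.2, seed=0):
    """Shuffle the examples, then hold back a fraction the model never trains on."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    validation = shuffled[:n_validation]
    train = shuffled[n_validation:]
    return train, validation


patients = list(range(100))  # stand-ins for 100 patient records
train, validation = train_validation_split(patients)
print(len(train), "training examples,", len(validation), "held back")
```

Notice what this does not fix. If all 100 records come from one hospital, the held-back 20 come from that hospital too, so a random split can still miss the kind of failure Watson ran into.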
Another technique is regularization, a mathematical penalty that discourages the model from getting too complicated. Regularization basically tells the model: "If you can explain this data with a simpler pattern, prefer that over a complicated one." It's like telling a student: don't write a ten-page essay if three paragraphs would make the same point just as well.
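One common form of this penalty is called L2 regularization: add the squared size of the model's weights to the score the model is trying to minimize. The sketch below assumes that form; the function names, the penalty strength of 0.1, and the two example weight lists are hypothetical.

```python
def l2_penalty(weights, strength):
    """Penalty that grows with the squared size of the model's weights."""
    return strength * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, strength=0.1):
    """Total loss = how badly the model fits the data + a penalty for complexity."""
    return data_loss + l2_penalty(weights, strength)


# Two hypothetical models that fit the training data equally well.
simple_model = regularized_loss(data_loss=1.0, weights=[0.5, 0.5])
complex_model = regularized_loss(data_loss=1.0, weights=[3.0, 4.0])
print(simple_model < complex_model)
```

Both models explain the training data equally well, but the one with smaller weights gets the better (lower) total score, so training is nudged toward the simpler explanation.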
A third approach is early stopping. Training a neural network is an iterative process: the model improves in rounds. Early stopping watches for the point where training accuracy keeps going up but validation accuracy starts to level off or decline. That's the signal: stop training now, the model is starting to memorize instead of learn.
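Here is a minimal sketch of that rule, assuming you already have a validation accuracy for each training round. The function name, the "patience" setting (how many rounds without improvement to tolerate), and the accuracy numbers are made up for illustration.

```python
def early_stopping_round(validation_accuracies, patience=2):
    """Return the index of the best round seen before training was stopped."""
    best_accuracy = float("-inf")
    best_round = -1
    for round_index, accuracy in enumerate(validation_accuracies):
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_round = round_index
        elif round_index - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop training.
            break
    return best_round


# Validation accuracy rises, then slips for two rounds in a row.
history = [0.60, 0.70, 0.75, 0.74, 0.73, 0.80]
print(early_stopping_round(history, patience=2))
```

With a patience of 2, training stops after the two slipping rounds and the model from round index 2 is kept. The design trade-off is visible in the last number: stopping early means never finding out that a later round would have done better, which is why patience is a judgment call.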
Watson for Oncology cost hundreds of millions of dollars to develop. It was marketed to hospitals around the world, including in countries like India and South Korea, where hospitals purchased access to it. Patients' care in those hospitals was being guided by a system that was, at some level, overfit to a wealthy American hospital in New York.
This is where the technical becomes ethical. Nobody making the original system set out to hurt anyone. They ran tests. The tests looked good. But the tests were run on the same kind of data the model was trained on, which meant they weren't testing whether the model could handle new situations at all. They were testing whether the model had memorized well.
If a company tests a medical AI thoroughly and all their tests pass, but those tests happen to use the same population the model was trained on, is the company responsible for harm when the model fails elsewhere? They weren't negligent in the obvious sense. But they also chose which tests to run. At what point does an oversight become a choice?
You now understand something most people reading headlines about AI don't. When an AI system is announced as achieving "state-of-the-art accuracy," the first question worth asking is: accuracy on what data? Is that data representative of the real world the system will face? Or is it the same data it trained on? Knowing the difference between those two kinds of data is how you read AI news critically, and it's a skill most adults haven't developed.
Every time you read that an AI scored 95% accuracy (on medical diagnosis, on detecting fraud, on anything), you're equipped to ask the question that separates a real result from an overfit one: was that 95% on data the model already saw, or data it hadn't? Most headlines don't tell you. Most readers don't know to ask.
You've been hired as a junior AI auditor for a hospital board that's deciding whether to purchase three different AI diagnostic systems. Each vendor has given you their accuracy numbers. Your job is to figure out which numbers are trustworthy, and which might be hiding overfitting.
Your lab partner, an AI systems analyst, will challenge your reasoning. Don't just state conclusions. Defend them.
In 2014, Amazon began building an AI tool to automate the first stage of job hiring. The company received hundreds of thousands of applications every year. The goal was to build a system that could read a résumé and give it a score from one to five stars, sorting good candidates from weak ones before a human recruiter ever got involved.
The engineers trained the model on ten years of Amazon's own hiring data: résumés that had been submitted and decisions that had been made about those résumés. It seemed like a clean, logical starting point. Real data. Real decisions. Real outcomes.
By 2015, researchers inside Amazon started noticing something troubling. The model was systematically giving lower scores to résumés that included the word "women's" (as in "women's chess club" or "women's college"). It was downgrading candidates who had attended all-women's colleges. It was, in some detectable way, penalizing applicants for being women.
Nobody programmed that in. The engineers hadn't written a rule that said "penalize women." But the training data (ten years of Amazon hiring decisions) reflected a tech industry that had hired far more men than women. The AI had found the pattern in the data and amplified it. It learned: people who got hired here looked like this. People who didn't get hired looked like that. It simply reproduced the biases embedded in a decade of human choices.
Amazon had quietly disbanded the team behind the project by early 2017, and Reuters reported on it in October 2018. According to Amazon, its recruiters never used the tool to evaluate candidates. But its existence raised a question that the entire AI field is still wrestling with today.
The word "bias" gets used in a lot of different ways. In everyday language it means having unfair opinions. In statistics it has a precise technical definition. In machine learning it can mean both, and the distinction matters.
When we say a model has training data bias, we mean: the dataset used to train the model doesn't accurately represent the real world the model will operate in. This can happen in a few different ways.
Amazon's hiring tool suffered from historical bias. The dataset was made of real decisions, but those real decisions reflected years of a hiring culture that had favored men. The AI didn't invent that bias. It inherited it, packaged it in math, and automated it at scale.
In 2019, a study published in the journal Nature Medicine examined dozens of AI systems trained to detect skin cancer from photographs. These systems had been celebrated for matching or exceeding dermatologist performance in clinical trials. The catch was buried in the details.
The training datasets were overwhelmingly composed of photographs of light-skinned patients. When researchers tested the models on darker skin tones, performance dropped, sometimes dramatically. One system that achieved 91% accuracy overall performed at 65% on darker skin. The disease looks different on different skin. The model had barely seen the darker-skin version.
Nobody in those studies was trying to build a tool that worked better for white patients. But the photographs that were available (collected over decades at hospitals serving predominantly white populations) reflected those demographics. The data shaped what the model knew. And what the model didn't know, it couldn't detect reliably.
If a skin cancer AI is deployed to 10 million users and performs 20% worse on darker skin tones, the patients most likely to have a cancer missed are already among the most medically underserved populations. The AI doesn't create the disparity from nothing. But it can automate it and expand it.
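A single overall accuracy number can hide a gap like the 91% versus 65% one described earlier. The sketch below shows how breaking accuracy out by group exposes it. The function name is hypothetical and the counts are made up, chosen only so the arithmetic mirrors those two figures; they are not data from any study.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, was_prediction_correct) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, is_correct in records:
        total[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / total[group] for group in total}


# Made-up counts: 180 lighter-skin cases and only 20 darker-skin cases.
records = (
    [("lighter skin", True)] * 169 + [("lighter skin", False)] * 11
    + [("darker skin", True)] * 13 + [("darker skin", False)] * 7
)
overall = sum(is_correct for _, is_correct in records) / len(records)
per_group = accuracy_by_group(records)
print(f"overall: {overall:.2f}")
print({group: round(accuracy, 2) for group, accuracy in per_group.items()})
```

Because the smaller group contributes so few cases, its poor result barely moves the overall number. That is why auditors ask for per-group results, not just the headline figure.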
Fixing biased training data is harder than it sounds, and there's genuine disagreement in the machine learning community about what "fair" even means mathematically. It turns out there are multiple different definitions of fairness that are mathematically incompatible with each other. Satisfying one definition can make it impossible to satisfy another at the same time.
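A tiny example shows how two reasonable-sounding definitions can pull apart. The sketch assumes two commonly discussed definitions: "demographic parity" (each group is approved at the same rate) and "equal opportunity" (qualified people in each group are approved at the same rate). The groups, their numbers, and the function names are hypothetical.

```python
def selection_rate(predictions):
    """Fraction of people the model approves (predicts 1 for)."""
    return sum(predictions) / len(predictions)


def true_positive_rate(predictions, actuals):
    """Of the people who truly qualify, the fraction the model approves."""
    approved_among_qualified = [p for p, a in zip(predictions, actuals) if a == 1]
    return sum(approved_among_qualified) / len(approved_among_qualified)


# Hypothetical groups in which different shares of people truly qualify.
group_a_actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 5 of 10 qualify
group_b_actual = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # 2 of 10 qualify

# A model that is right about everyone predicts exactly the actual labels.
group_a_pred = list(group_a_actual)
group_b_pred = list(group_b_actual)

print("qualified people approved:",
      true_positive_rate(group_a_pred, group_a_actual),
      true_positive_rate(group_b_pred, group_b_actual))
print("overall approval rate:",
      selection_rate(group_a_pred), selection_rate(group_b_pred))
```

Even a model that makes no mistakes treats qualified people identically in both groups while approving the groups at different overall rates. To equalize the overall rates, it would have to start making errors for someone. Which of the two to prioritize is a value judgment, not something the math settles.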
One approach is data rebalancing: deliberately collecting more training examples from underrepresented groups, or artificially reweighting the existing examples so the model treats all groups as equally important during training. Amazon could have, in theory, rebalanced its hiring data to include equal numbers of successful male and female hires.
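Reweighting can be sketched as follows. This is one simple scheme among several, assumed here for illustration: give each example a weight inversely proportional to how common its group is, so every group carries the same total weight during training. The function name and the 8-to-2 split are hypothetical.

```python
from collections import Counter


def balancing_weights(group_labels):
    """Weight each example so that every group has the same total weight."""
    counts = Counter(group_labels)
    n_groups = len(counts)
    n_examples = len(group_labels)
    return [n_examples / (n_groups * counts[group]) for group in group_labels]


# Hypothetical training set: 8 past hires from one group, 2 from another.
groups = ["men"] * 8 + ["women"] * 2
weights = balancing_weights(groups)

total_by_group = Counter()
for group, weight in zip(groups, weights):
    total_by_group[group] += weight
print(dict(total_by_group))
```

Each example from the smaller group now counts four times as much as one from the larger group. The weights change how much the model listens to each example, but they cannot add information the data never contained.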
But there's a deeper problem. If the real historical data shows more men being promoted, and you rebalance to pretend it was equal, you've changed the data to reflect a more just world that didn't actually exist. That's not capturing reality; it's manufacturing it. The model trained on that adjusted data might make different predictions. Whether those predictions are more fair or less accurate or both is genuinely contested.
If you train an AI on real historical data, it learns and perpetuates past discrimination. If you deliberately modify the data to be more equitable, you're making a value judgment about what the world should look like, not what it did look like. Who decides which is the right approach? Should that be an engineer's decision? A government's? The people affected?
There is no neutral choice here. Choosing to use historical data uncritically is a decision. Choosing to modify it is a decision. Choosing which definition of fairness to use is a decision. The decisions just happen to be made by engineers, usually without much public input.
When someone says an AI system is objective (that it removes human bias from a decision), you know exactly why that claim is suspect. Every AI was trained on data produced by humans in a world shaped by human history. "Objective" doesn't mean "unbiased." It often just means "the bias is harder to trace."
A major bank wants to deploy an AI loan approval system. They say it was trained on "10 years of successful loan data" from their existing customer base. Before the regulator signs off, you, the bias auditor, need to ask the hard questions.
Your lab partner is a senior data scientist at the bank. They built the system and believe in it. Push them to reveal what the training data actually shows, and argue your case for what should be tested before deployment.
In 2019, the investigative outlet TIME Magazine published a report on where a significant portion of AI training data labels come from. The answer was surprising to people outside the industry: a global network of contract workers, often in Venezuela, Kenya, the Philippines, and India, earning wages as low as one to three dollars per hour to label training data.
These workers, called data annotators, looked at images, listened to audio clips, and read text. They drew boxes around pedestrians in street scenes so self-driving car systems could learn what a pedestrian looks like. They listened to recordings and transcribed speech. They read sentences and decided whether the emotion was "positive," "negative," or "neutral."
One Venezuelan annotator described labeling 200 to 400 images per hour to make her quota. She was deciding, at a rate of seconds per image, what was in each photo, and those decisions would become the ground truth that an AI model learned from. She wasn't a domain expert. She was working on a deadline, making judgment calls, and moving on.
The AI learned what she decided was true. The model had no way to question it. Whatever she labeled "cat," the model learned was a cat. Whatever she labeled "dangerous," the model learned to call dangerous. The model's entire understanding of the world was built on millions of individual human decisions made under time pressure, for very little pay, by people who often didn't speak the language of the content they were labeling.
Supervised learning, the most common form of machine learning, requires labeled data. Every training example needs a tag that tells the model: this image is a cat, this review is negative, this tumor scan is malignant. The model learns by matching patterns in the input to the labels assigned to it.
The problem is that labels are not facts. They're judgments. And judgments depend on who is making them, under what circumstances, using what definitions.
Label noise (plain mistakes in the labels) is one problem. But label subjectivity is often a deeper one. Consider the task of labeling text as "toxic" or "not toxic" for a content moderation system. What counts as toxic? One annotator's "heated debate" is another's "harassment." Researchers studying the major toxicity datasets have found that annotators from different demographic backgrounds label the same text differently at measurable rates. The model trained on those labels learns the average judgment of whoever happened to be hired to label the data, not some objective definition of harm.
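A common way to turn several annotators' opinions into one training label is a majority vote. The sketch below assumes that approach; the function name and the three example annotations are hypothetical. It also returns how much the annotators agreed, because that number is usually thrown away before training.

```python
from collections import Counter


def majority_label(annotations):
    """Return the most common label and the share of annotators who chose it."""
    # On an exact tie, Counter.most_common keeps the label that appeared first.
    label, votes = Counter(annotations).most_common(1)[0]
    return label, votes / len(annotations)


# Three hypothetical annotators judging the same post.
annotations = ["toxic", "not toxic", "toxic"]
label, agreement = majority_label(annotations)
print(label, round(agreement, 2))
```

The model only ever sees the winning label. The fact that one annotator in three disagreed, and who that annotator was, disappears from the dataset.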
In 2020, researchers at the University of Washington published a study examining whether major toxic speech datasets had racial bias in their labels. What they found was striking: text written in African American English (a distinct dialect with its own grammar and patterns) was labeled "toxic" at significantly higher rates than text expressing the same sentiment in standard American English.
The implication was serious. If you train a content moderation AI on those labeled datasets, the AI learns to flag African American English at higher rates. Users who write in that dialect would have their posts removed at higher rates, not because of what they were saying, but because of how they were saying it. An AI trained to remove "toxic" content would, in practice, systematically remove the speech of a particular community.
No one who labeled that data necessarily intended this. Many were probably trying to make the judgment they were asked to make. But when you aggregate millions of those individual calls, patterns emerge that nobody explicitly chose.
A content moderation system decides what speech gets removed from a social platform used by billions of people. The training labels for that system were made by paid workers on a deadline. The workers' demographic background influences their labeling patterns. Nobody voted on what "harmful" means. Does this bother you? It should, at least a little.
Here's what makes this consequential beyond just data quality: every labeled dataset encodes someone's definition of what's true, good, harmful, normal, or correct. That definition gets baked into the model's weights. The model then goes out into the world and applies those definitions at scale, to billions of people who were never consulted about what the labels meant.
In 2021, a team of researchers examined the ImageNet dataset, one of the most influential training datasets in AI history, containing 14 million labeled images. They found hundreds of labels for human beings that were derogatory, clinical in pathologizing ways, or simply demeaning. Those labels had been used to train AI systems that millions of people used. Some were eventually removed, but only after years of those definitions being baked into models already in deployment.
The label problem is ultimately a question of power: who gets to define the world that the AI learns? Right now, that power is largely held by whoever commissions the dataset and whoever is hired (often cheaply) to annotate it. That is a very small number of people making very large decisions about how machines will perceive reality, for everyone.
If toxic speech datasets systematically label certain dialects as more harmful, and AI systems trained on those datasets remove more content from speakers of those dialects, is that a technical failure, a social failure, or both? And who is responsible for fixing it: the annotators, the researchers who built the dataset, the companies that deployed the models, or the platforms that used the outputs?
The next time you encounter an AI system that classifies something (whether a post is harmful, whether a person is a risk, whether an email is spam), you can ask the question most people never think to ask: who decided what the labels meant, under what conditions, and who reviewed them? The label is not the truth. It's someone's decision about truth. Those are very different things.
A major social media company has hired you to write the labeling guidelines for their new content safety AI. Annotators across the world will use your guidelines to decide what gets flagged as "harmful." Your decisions will shape what the model considers dangerous, for everyone on the platform.
Your lab partner is a researcher who has studied annotation bias. They're skeptical that any guidelines can avoid value judgments. Defend your approach, or change it when they make a good point.
In the first weeks of March 2020, as COVID-19 began spreading rapidly across the United States, hospitals and health systems started deploying AI tools they had been developing for years: tools that analyzed patient data to predict which patients were most at risk and how to allocate limited resources.
These tools had been trained on years of patient records from before the pandemic. They had learned patterns from a world where most respiratory illness came from the flu, pneumonia, and other familiar conditions. When COVID-19 patients started arriving, they didn't fit those patterns. The symptoms overlapped with what the model knew, but the underlying biology and progression were different. COVID affected blood oxygen, the heart, and blood clotting in ways the model had never been trained to associate with respiratory illness.
A sweeping 2021 review published in The BMJ, one of the world's oldest and most respected medical journals, analyzed 232 separate AI tools developed for COVID-19 detection and triage. The conclusion was damning: "We identified major methodological flaws and high risks of bias in most of the models." Among the most cited problems was that tools trained before or during the early pandemic performed poorly as the pandemic evolved. The virus changed. New variants arrived. The population affected changed. The interventions available changed. The models had been trained on a world that no longer existed.
Some hospitals quietly stopped using their AI triage tools. Others kept using them, unaware of how much drift had accumulated between the world the model learned from and the world it was operating in.
The technical term for what happened during COVID is distribution shift. Related terms you may see are data drift and covariate shift, which usually refer specifically to changes in the inputs a model receives. Distribution shift describes what happens when the statistical patterns in the real world change after a model has been trained, but the model itself doesn't change.
Think of it this way: you learn to recognize cats by looking at thousands of photos of domestic cats: mostly sitting, mostly indoors, mostly in well-lit rooms. Then someone asks you to identify cats in dark outdoor environments with unusual angles. Your recognition skills might fail, not because you're bad at recognizing cats, but because the conditions don't match what you learned from.
A model doesn't know when it's operating outside its training distribution. It doesn't raise a flag and say "warning: conditions have changed." It keeps generating outputs with the same confidence as always. The predictions get worse; the confidence stays the same.
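A toy example hints at why. Many models compute their confidence purely from the input in front of them, with no built-in check on whether that input resembles anything from training. The sketch below uses a one-feature logistic model as a stand-in; the weight, the inputs, and the assumed training range are all invented for illustration.

```python
import math


def predicted_confidence(x, weight=2.0, bias=0.0):
    """Confidence of a toy logistic model in whichever class it picks."""
    probability = 1.0 / (1.0 + math.exp(-(weight * x + bias)))
    return max(probability, 1.0 - probability)


typical_input = 1.0      # assume training inputs fell roughly between -2 and 2
unfamiliar_input = 10.0  # far outside anything the model saw in training
print(round(predicted_confidence(typical_input), 3))
print(round(predicted_confidence(unfamiliar_input), 3))
```

In this toy, the unfamiliar input does not make the model hesitate. It makes the model more confident, because nothing in the formula measures unfamiliarity. Real systems are more complicated, but the same gap is why confidence alone is a poor alarm.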
COVID is an extreme case because the shift happened rapidly and visibly. But distribution shift is actually a constant, quiet problem in deployed AI systems, not a rare event.
In 2016, a widely cited paper examined the behavior of fraud detection systems at financial institutions. These systems are trained to flag unusual patterns that suggest a transaction is fraudulent. But consumer behavior changes: new apps become popular, people travel differently, shopping patterns shift with the economy. A system trained in 2016 on typical transaction patterns would, without updates, start mislabeling legitimate transactions as fraudulent, while approving increasingly sophisticated fraud that uses patterns not in the original training data.
Credit scoring models face the same issue. A model trained to predict loan repayment behavior in 2018 used patterns from an economy that looked very different from 2020's pandemic economy or 2022's high-inflation economy. People's financial behavior changed. The model's definition of "risky borrower" didn't.
Language models trained on internet text from a specific period can become subtly outdated as language evolves: slang changes, cultural references shift, events alter the meaning of words. A model trained before 2020 primarily associated the word "mask" with Halloween and surgery. By 2021, "mask" had become one of the most politically loaded words in American discourse.
Most large-scale AI systems are not retrained constantly. Retraining is expensive, technically complex, and requires new labeled data. The practical result is that many real-world AI systems are running on training data that is months or years old. Every day they operate, the gap between their learned world and the real world widens. Nobody sends you a notification saying "this model is now 18 months out of date."
Responsible ML engineering includes monitoring deployed models for signs that their inputs or outputs have started to drift from the training distribution. This is sometimes called model monitoring or drift detection.
One approach is to track the statistical properties of the inputs the model receives over time. If the average value of certain features starts to move significantly from where it was during training, that's a signal worth investigating. Another approach is to watch for drops in the model's confidence scores: when a model that used to predict with 90% confidence is now at 70%, something in the world has changed. Steady confidence is not proof that nothing has changed, though, since a model can stay confident while its predictions get worse.
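The first approach can be sketched with a very simple check: measure how far the average of a feature has moved since training, in units of the training data's standard deviation. This is a rough heuristic, not a full statistical test. The function names, the threshold of 2, and the blood-oxygen readings are assumptions made up for the example, and the sketch assumes the training values are not all identical.

```python
import statistics


def mean_shift_in_std_units(training_values, live_values):
    """How far the live average has moved, measured in training standard deviations."""
    training_mean = statistics.mean(training_values)
    training_std = statistics.stdev(training_values)
    return abs(statistics.mean(live_values) - training_mean) / training_std


def drift_detected(training_values, live_values, threshold=2.0):
    """Flag the feature when its average has moved more than `threshold` std units."""
    return mean_shift_in_std_units(training_values, live_values) > threshold


# Hypothetical blood-oxygen readings (%) seen in training vs. arriving now.
training_oxygen = [94, 95, 96, 97, 98]
live_oxygen = [88, 89, 90, 91, 92]
print(drift_detected(training_oxygen, live_oxygen))
```

A check like this only raises a flag. Someone still has to investigate whether the world changed, the patients changed, or a sensor broke, and decide what to do about the model.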
Some high-stakes deployments include automatic retraining pipelines: systems that gather new labeled data continuously and periodically update the model to reflect the current world. But this requires ongoing human oversight to make sure the new labels are correct, and it introduces its own risks (what if the new data contains new biases?).
At the institutional level (the level of hospitals, banks, governments, and courts), the question of how often to retrain models and who is responsible for monitoring drift is a genuine policy question that is currently being worked out in real time. Many regulatory frameworks for AI in healthcare and finance now explicitly require documentation of when a model was trained and regular performance audits.
A hospital deployed an AI diagnostic tool in 2019. By 2022, the medical literature had significantly advanced, new treatment protocols existed, and disease prevalence patterns had shifted. The model was never retrained. A patient receives a suboptimal diagnosis as a result. Who is responsible: the hospital that failed to update the model, the company that sold the tool without a clear update policy, the regulators who didn't require one, or the physician who trusted the output without questioning it?
This question is actively being debated in medical AI policy today. The FDA's framework for regulating AI-based medical devices was designed primarily for static software โ software that doesn't change. An AI model that retrains itself is harder to regulate using the same tools. The institutions that govern medicine are still catching up to the pace of AI deployment.
When you encounter an AI tool making decisions about you (a credit score, a medical screening tool, a content recommendation system), you can ask: when was this trained, and on what era's data? A model making decisions about your world in 2025 that was trained on 2020 data is not just a technical footnote. It's a meaningful fact about the quality of that decision. Very few people know to ask this. You do now.
A federal agency uses an AI system to assess benefit eligibility, deciding who qualifies for housing assistance. The model was trained in 2019 on pre-pandemic economic data. It's now three years later. Economic conditions, housing markets, and the demographics of people seeking assistance have all shifted significantly.
Your lab partner is the agency's AI policy director, who is under pressure to keep the system running because replacing it would take 18 months and cost millions. You need to make the case, or counter it, based on what you know about distribution shift.