In 2020, researchers at Stanford University published an analysis of AI systems that had been trained to read chest X-rays and identify pneumonia. These systems had been trained on massive datasets, some with over 100,000 X-ray images. They performed brilliantly on the images they were tested on. Some scored better than radiologists in published benchmarks.
Then Luke Oakden-Rayner, a radiologist and AI researcher at the University of Adelaide, started asking an uncomfortable question: where exactly did those training images come from? It turned out the datasets had been collected from specific hospitals, and those hospitals had a quirk. Patients who were sick enough to need portable X-ray machines (the kind wheeled to bedsides) tended to be sicker overall than patients who walked to the radiology room.
The AI had not learned to detect pneumonia from the image itself. It had learned to detect the type of X-ray machine used. Portable machines produced a slightly different image quality, and the AI had quietly learned that signal. When tested on images from different hospitals with different equipment, its accuracy dropped sharply. The data looked right. The labels were accurate. But what the data secretly contained was something nobody intended to teach.
Here is the first thing you need to understand about how AI learns: data is not neutral. Every dataset (every collection of examples you use to train an AI) was created by specific people, at specific times, in specific places. And those specific circumstances leave marks in the data that are invisible unless you go looking for them.
Think about what data actually is. If you are training an AI to recognize dogs, your data is a collection of images labeled "dog" or "not dog." But every single image in that collection was taken by someone, somewhere. Most dog photos on the internet show dogs indoors, in houses, on sofas. Dogs that appear in photos tend to be pets owned by people who have smartphones and post on social media. So your "dog data" quietly reflects a particular kind of dog, owned by a particular kind of person, in a particular kind of place. The AI learning from that data may never have seen a stray dog, a working dog on a farm, or a dog photographed in low light.
This is not a technical accident. It is baked into what data collection means. You can only collect data from the world that already exists, and that world is uneven.
The Stanford X-ray story is not unusual. It is a demonstration of something that happens constantly in AI development: the data contains a hidden signal that the AI latches onto, not because the AI is broken, but because the AI is doing exactly what it was built to do. It found a pattern. The pattern just wasn't the one anyone wanted.
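To make the hidden-signal idea concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the patients are synthetic, the two features (`image_sign` for a weak clue in the lungs, `portable` for the machine type) are hypothetical, and the "AI" is just a rule that keeps whichever single feature scores best on the training data. It is not the method used in the X-ray study, only a toy with the same shape of failure.

```python
import random

def make_patients(n, portable_matches_label, rng):
    """Build synthetic (features, label) pairs.

    features = (image_sign, portable): image_sign is a weak, noisy clue from
    the lungs themselves; portable says which machine took the image.
    """
    patients = []
    for _ in range(n):
        has_pneumonia = rng.random() < 0.5
        # The genuine clue agrees with the label 70% of the time.
        image_sign = has_pneumonia if rng.random() < 0.70 else not has_pneumonia
        # The machine type agrees with the label at a hospital-specific rate.
        portable = has_pneumonia if rng.random() < portable_matches_label else not has_pneumonia
        patients.append(((image_sign, portable), has_pneumonia))
    return patients

def accuracy(patients, feature_index):
    """Accuracy of the rule 'predict pneumonia when this feature is True'."""
    hits = sum(1 for features, label in patients if features[feature_index] == label)
    return hits / len(patients)

rng = random.Random(0)
train = make_patients(5000, portable_matches_label=0.95, rng=rng)  # original hospital
other = make_patients(5000, portable_matches_label=0.50, rng=rng)  # new hospital

# "Training": keep whichever single feature scores best on the training set.
chosen = max((0, 1), key=lambda i: accuracy(train, i))
names = ("image_sign", "portable")
print("feature chosen:", names[chosen])
print("accuracy at original hospital:", round(accuracy(train, chosen), 3))
print("accuracy at new hospital:", round(accuracy(other, chosen), 3))
```

The rule picks the machine type because, in the training hospital, it predicts the label better than the genuine clue does. At the second hospital, where machine type tells you nothing, the same rule is no better than guessing. Nothing in the training step is broken; it did what it was asked.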
Building a training dataset is more like archaeology than science. You are digging through the existing world (websites, books, photos, medical records, audio recordings) and pulling out pieces to assemble into a collection. Each step of that process involves a choice.
What sources do you pull from? When OpenAI and Google trained their large language models on text from the internet, they were training on a world where English-speaking users in wealthy countries wrote far more content than everyone else. The result is AI that speaks English fluently and struggles with minority languages, not because anyone decided that was acceptable, but because that is what the data contained.
Who does the labeling? Many AI systems rely on humans to label data: to look at an image and say what is in it, or to read a sentence and decide whether it is offensive. This work is often done by contractors in countries like Kenya, the Philippines, and Venezuela, paid a few cents per label. In 2023, TIME magazine reported that workers labeling trauma-inducing content for OpenAI were paid less than $2 per hour. Their judgments (what they found harmful, what they considered neutral) are permanently encoded in the AI's behavior. You have never met them. You never will. But their decisions are in every chatbot response.
This is what the data pipeline looks like before a single line of training code runs. The choices embedded in those early steps are almost impossible to undo later.
If the people labeling AI training data, who are paid almost nothing in countries with low wages, are encoding their judgments permanently into AI systems used by billions, do those workers have any right to know what the AI does with their work? Do they deserve a share of the profits? There is no clean answer. Sit with it.
Here is what makes this genuinely hard to solve: the best-performing AI systems require enormous amounts of data. Not thousands of examples. Not millions. Billions. GPT-3, released by OpenAI in 2020, was trained on roughly 300 billion tokens (individual units of text, each roughly a word or a piece of a word). GPT-4 is believed to have used more, though OpenAI has not published the figure. The dataset used to train Meta's LLaMA 2 in 2023 contained two trillion tokens.
At that scale, it is physically impossible for any human being to look at the data and verify what is in it. You cannot read two trillion words. You cannot audit every image in a dataset of 400 million photos. You can sample. You can run automated checks. But you cannot know what is in there the way you might know what is in a textbook you are assigning to students.
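Here is a minimal sketch of what "you can sample" means in practice. The corpus below is a stand-in: a list of true/false flags where true marks a document with some problem, and the 2% problem rate is an invented number. The point is only that reading a small random sample gives an estimate of the overall rate, along with a rough margin of error.

```python
import math
import random

rng = random.Random(3)

# Stand-in for a corpus far too large to read. True marks a problem document.
# The 2% problem rate is invented for this illustration.
corpus = [rng.random() < 0.02 for _ in range(200_000)]

# Audit a random sample instead of the whole corpus.
sample = rng.sample(corpus, 2000)
estimate = sum(sample) / len(sample)

# Rough 95% margin of error for a sampled proportion (normal approximation).
margin = 1.96 * math.sqrt(estimate * (1 - estimate) / len(sample))

print(f"estimated problem rate: {estimate:.3%} plus or minus {margin:.3%}")
print("documents actually read:", len(sample), "of", len(corpus))
```

Notice what the sample does not give you. It estimates how common a problem is, but only for problems you already know how to recognize, and it says nothing about which of the unread documents are affected.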
This creates a situation where the people building the most powerful AI systems in the world are, in an honest sense, not entirely sure what those systems have been taught. They know the sources. They do not know every pattern the AI extracted from those sources.
When you read a news article that says an AI "was trained on internet data," you now know that sentence contains a world of decisions, biases, and invisible signals that the journalist almost certainly did not investigate and the company almost certainly did not fully audit. You can read that sentence differently than almost anyone else reading it. That is a real skill.
The scale problem also means that the companies with the most data (Google, Meta, Microsoft, Amazon) have a structural advantage in building AI that is not easily overcome. Data is infrastructure. Like roads or water pipes, whoever built it first has enormous power over what gets built on top of it. This is one reason AI development has concentrated in a small number of very large companies, and it is a question that governments and researchers are actively debating right now.
People in AI development often talk about wanting "clean" or "high-quality" data. What does that mean in practice? At minimum it means four things: the data should be accurate (labels match reality), diverse (covers the full range of situations the AI will face), representative (proportions in the data roughly match proportions in the real world), and uncontaminated (no hidden variables the AI might accidentally learn instead of the right signal).
The X-ray study failed on the contamination criterion: the dataset was contaminated with equipment-type information that was never supposed to be a variable. Most real-world datasets fail at least one of these four criteria, often without anyone noticing until deployment.
The question of who gets to define "good" is not a technical one. It is a values question. A dataset that looks representative from one perspective may be deeply biased from another. In the 1990s, dermatology textbooks contained overwhelmingly images of skin conditions on light-skinned patients. AI systems trained on dermatology data inherited that gap: they are measurably worse at detecting skin cancer on dark-skinned patients. The data reflected what had been photographed and published. What had been photographed and published reflected whose health had been treated as the default.
You cannot solve that problem with more data alone. You have to ask whose experiences were never captured in the first place, and why.
A startup has built an AI hiring tool that screens job applicants by analyzing their written cover letters. They trained it on 50,000 cover letters from people who were hired at their company over the last 15 years. They claim it is "objective because it uses data, not human opinion."
You are a junior data auditor brought in before launch. Your partner, another auditor, is waiting to hear your analysis. They will push back on vague answers. Be specific about what you think is wrong with this dataset and why.
On March 23, 2016, Microsoft launched an AI chatbot called Tay on Twitter. Tay was designed to learn from conversations with users in real time: to pick up on how teenagers talked and respond accordingly. Microsoft's team was proud of it. They expected Tay to come across as friendly, casual, and current.
Within sixteen hours, Tay was posting racist and antisemitic content. It was denying the Holocaust. It was sending targeted harassment. Microsoft shut it down the next day.
Nobody at Microsoft had programmed any of that content into Tay. It had learned it, rapidly and efficiently, from users who had coordinated on message boards to feed it the worst content they could find. The system worked exactly as designed. It updated its model based on the examples it received. The examples it received were poisoned. The learning was clean. The output was catastrophic.
Before AI systems like Tay, software worked by following instructions. A programmer wrote: "if the user asks about the weather, respond with the current forecast." The program did exactly that, nothing more, nothing less. Every behavior was explicitly written by a human.
Machine learning is different in a fundamental way. Instead of telling the system what to do, you show it examples and let it figure out the rule. You show it thousands of cat photos and thousands of non-cat photos, and eventually the system extracts something (a set of numerical weights, adjusted by a process called backpropagation) that lets it distinguish cats from non-cats in images it has never seen before.
The rule the system learns is not written by a human. It is discovered by the system from the data. That means you cannot read it the way you can read code. You cannot look at it and say "ah yes, it learned that ears + whiskers = cat." The learned rule exists as billions of tiny numerical adjustments distributed across the network. No single number means anything. The meaning is in the pattern across all of them.
Here is how the training process works at its core. The AI makes a prediction. The prediction is compared to the correct answer. The difference, called the loss, is calculated. Then the weights are nudged, slightly, in the direction that would have produced a smaller loss. This happens millions of times, across millions of examples. After enough iterations, the AI's predictions get better and better on the training data.
This process, called gradient descent, is elegant, but it has a key property you need to understand: the system optimizes for exactly what you measure, not for what you actually care about. If you measure accuracy on your training data, the system will get very good at that. If the training data has hidden problems, like the X-ray machine correlation, the system will get very good at using those hidden signals.
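The predict, compare, nudge loop can be sketched in a few lines of Python. This is a deliberately tiny version under stated assumptions: the "model" has a single weight, the data follows an invented rule (y = 3x), and the loss is squared error. Real systems adjust billions of weights, but the loop has the same four steps.

```python
# Toy data generated by the hidden rule y = 3x; the model must discover the 3.
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]

weight = 0.0          # the model's single adjustable number
learning_rate = 0.01  # how large each nudge is
losses = []

for step in range(200):
    total_loss = 0.0
    gradient = 0.0
    for x, target in examples:
        prediction = weight * x          # 1. make a prediction
        error = prediction - target      # 2. compare with the correct answer
        total_loss += error ** 2         # 3. squared error is the loss
        gradient += 2 * error * x        # slope of the loss with respect to weight
    # 4. nudge the weight in the direction that shrinks the loss
    weight -= learning_rate * gradient / len(examples)
    losses.append(total_loss / len(examples))

print("learned weight:", round(weight, 4))
print("first loss:", round(losses[0], 2), "last loss:", round(losses[-1], 8))
```

The loop never "knows" the rule is multiplication by three. It only knows that one direction of nudge made the number called loss smaller. Whatever the loss rewards is what the weight ends up encoding.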
For Tay, Microsoft had designed a system that optimized for engagement: matching the style and content of conversational partners. That was exactly what it did. Matching the style and content of conversational partners who were deliberately feeding it hate speech meant outputting hate speech. The loss function, the thing being optimized, did not include "do not produce hate speech." So the system had no mechanism to avoid it.
AI systems do not understand what they are doing. They optimize a number. If the number is well-designed, good behavior follows. If the number is poorly designed, or if the training data is corrupted, the system will optimize its way directly into disaster, competently and without any awareness of what is happening.
There is a failure mode in training called overfitting, and it is one of the most important concepts in all of machine learning. Imagine you are studying for a test by memorizing every question from last year's exam rather than understanding the subject. On the day of the test, if any of last year's exact questions appear, you will ace them. On questions that require actual understanding, you will fail.
Overfitting is exactly that. An AI that overfits has learned the training data too well: it has essentially memorized the examples rather than extracting a general rule. When it encounters new examples from the real world, its performance collapses.
Preventing overfitting is one of the central engineering challenges in AI development. Researchers use techniques like holding out a portion of data for validation (never shown during training), artificially corrupting the training data to force generalization, and regularization methods that penalize overly complex models. None of these solutions are perfect. They are all tradeoffs.
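A small Python sketch can show both the failure and the held-out check that catches it. The data is synthetic (the label is "x is greater than 50", with 10% of labels flipped as noise), and both "models" are hypothetical stand-ins: one memorizes every training example exactly, the other learns a single threshold. Neither is a real machine learning library; the aim is only to show why a score on training data and a score on held-out data can disagree.

```python
import random

rng = random.Random(1)

def make_examples(n):
    """Label is 'x > 50', but 10% of labels are flipped to mimic noisy data."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 100)
        label = x > 50
        if rng.random() < 0.10:
            label = not label
        data.append((x, label))
    return data

examples = make_examples(400)
train, held_out = examples[:300], examples[300:]  # held_out is never used for fitting

# Model A memorizes: exact lookup, with a fixed guess for anything unseen.
memory = dict(train)
def memorizer(x):
    return memory.get(x, False)

def score(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == label for x, label in data) / len(data)

# Model B generalizes: one threshold, chosen to fit the training data best.
best_threshold = max(range(0, 101, 5), key=lambda t: score(lambda x: x > t, train))
def threshold_rule(x):
    return x > best_threshold

for name, model in (("memorizer", memorizer), ("threshold rule", threshold_rule)):
    print(name, "train:", round(score(model, train), 2),
          "held-out:", round(score(model, held_out), 2))
```

The memorizer is perfect on the examples it has seen, noise included, and roughly a coin flip on the ones it has not. The threshold rule is imperfect on both, but its two scores stay close together. That gap between training score and held-out score is the signature of overfitting.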
When AI companies publish benchmark scores ("our AI scored 94% on task X"), you now know to ask: was that score on training data, or on genuinely new data the AI had never seen? Benchmarks on training data mean almost nothing. Benchmarks on held-out test data are meaningful. Most press releases do not specify. Most journalists do not ask. You can.
The Tay story is usually told as a story about trolls or about Microsoft's carelessness. But underneath it is a story about what training fundamentally is. Tay was not broken. Its training mechanism worked. The problem was that its loss function, what it was optimizing for, had no awareness of harm. It was a pure learning system dropped into an adversarial environment with no concept of what it should not become.
After Tay, Microsoft redesigned their approach to AI chatbots significantly. Its other chatbots, including Xiaoice (launched in China before Tay and still active there with hundreds of millions of users), incorporated much more sophisticated filtering and human oversight. But the fundamental challenge did not go away: any system that learns from examples will learn from bad examples too, unless something intervenes. That something has to be designed in deliberately.
This is the question that AI safety researchers spend their careers on, and it does not have a satisfying solution yet. How do you build a learning system that learns the things you want and refuses to learn the things you do not? It sounds simple. It is genuinely hard.
Microsoft knew, before launching Tay, that it could learn from user input in real time. They also knew the internet contained people who would try to corrupt it. They launched anyway. Was that reckless? Or is it unreasonable to expect a company to predict every way a new technology can be misused? How much foresight is a company ethically required to have before deploying a system that learns?
You are on the team building an AI system that will recommend social media content to users. You have been asked to define the loss function: the measure the AI will optimize to make its recommendations better. Your partner is a senior ML engineer who will pressure-test your choices.
This is a real decision that teams at companies like YouTube, TikTok, and Instagram make. The choice has shaped how billions of people spend their time online.
Starting around 2014, a team at Amazon in Edinburgh, Scotland, began building something ambitious: an AI that could screen job applicants automatically, rating them on a scale of one to five stars. The idea was to reduce hiring time and remove human subjectivity from the initial screening process. By 2018, the tool had been quietly discontinued.
The problem, reported by Reuters in October 2018, was that the AI was systematically downgrading resumes that included the word "women's" (as in "women's chess club" or "women's college"). It was also downgrading graduates of two all-women's colleges. When Amazon's engineers tried to correct these specific patterns, the system found other proxies for gender. The engineers could whack one mole, and the model would find another signal that correlated with being female and use that instead.
The AI had not been programmed to discriminate. It had been trained on ten years of Amazon's hiring data: the resumes of people Amazon had actually hired. For those ten years, Amazon had hired predominantly men in technical roles. The AI learned what a successful Amazon hire looked like. That pattern happened to be male. The AI encoded that pattern faithfully and used it to filter out future applicants.
The Amazon case illustrates a category of bias that is particularly difficult to fix: historical bias. This is when the training data accurately reflects the past, but the past was itself unfair. The data is not mislabeled. The labels are correct. The problem is that the thing the labels are measuring, "who got hired," was itself shaped by discrimination.
When you train an AI on historically biased data, you are not just capturing the past. You are building a system that will apply those historical patterns to future decisions. And because the AI is fast, scalable, and often presented as objective, the biased decisions it makes can reach far more people, far more quickly, than any individual human decision-maker could achieve.
This is what researchers mean when they say AI can scale discrimination. A biased human hiring manager might screen 200 resumes a year. A biased AI hiring tool might screen 200,000 resumes a month, or 2.4 million a year. The same discriminatory pattern reaches roughly twelve thousand times as many people.
The Amazon engineers discovered something that has since become one of the central findings of AI fairness research: you cannot fix discrimination by removing the protected characteristic from the data. Gender was not a field in Amazon's resume database. It did not need to be. The AI found dozens of other variables that correlated with gender (the specific words used in cover letters, the colleges attended, the activities listed) and used those as proxies.
This is not unique to Amazon. A 2016 investigation by ProPublica examined a recidivism prediction tool called COMPAS, used by courts in several US states to recommend bail and sentencing. It found that Black defendants were roughly twice as likely as white defendants with equivalent records to be wrongly flagged as high risk for future crimes. Race was not a variable in the model. But ZIP code was. And school disciplinary records were. And employment history was. All of these correlated with race, not because of anything about the individuals, but because of systemic inequalities in where people lived, which schools they could afford, and what jobs were available to them.
The AI had learned a proxy for race. The AI had learned a proxy for poverty. And it was using those proxies to help determine whether a person went to jail.
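Proxy learning is easy to reproduce in miniature. The Python sketch below uses entirely invented numbers: a synthetic hiring history in which women were hired less often, and a single hypothetical resume feature (`womens_club`) that appears far more often on women's resumes. The "model" never sees gender. It only learns the historical hire rate for resumes with and without the phrase. This is an illustration of the mechanism, not a reconstruction of any real system.

```python
import random

rng = random.Random(2)

def make_applicant():
    is_woman = rng.random() < 0.5
    # Proxy: a phrase on the resume that appears far more often for women.
    mentions_womens_club = rng.random() < (0.40 if is_woman else 0.01)
    # Historical outcome: equally qualified, but women were hired less often.
    hired = rng.random() < (0.15 if is_woman else 0.45)
    return {"is_woman": is_woman, "womens_club": mentions_womens_club, "hired": hired}

history = [make_applicant() for _ in range(20000)]

def hire_rate(rows):
    return sum(r["hired"] for r in rows) / len(rows)

# "Training" sees only the resume phrase and the outcome. Gender is dropped.
learned_score = {
    flag: hire_rate([r for r in history if r["womens_club"] == flag])
    for flag in (True, False)
}

def mean_score(rows):
    """Average score the model assigns, even though it never saw gender."""
    return sum(learned_score[r["womens_club"]] for r in rows) / len(rows)

women = [r for r in history if r["is_woman"]]
men = [r for r in history if not r["is_woman"]]
print("score when phrase present:", round(learned_score[True], 3))
print("score when phrase absent: ", round(learned_score[False], 3))
print("mean score for women:", round(mean_score(women), 3))
print("mean score for men:  ", round(mean_score(men), 3))
```

Deleting the gender column changed nothing about the outcome. The phrase carries enough information about gender for the historical pattern to pass straight through, and a real model with hundreds of features has many more such routes available.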
If an AI system produces discriminatory outcomes, but it was trained on real historical data and is technically accurate at predicting what it was trained to predict, is that the AI's fault? The company's fault? The fault of the society that created the historical data? And who is legally and morally responsible when someone is harmed by that prediction? These questions are being argued in courts right now, without resolution.
In 2016 and 2017, a series of academic papers demonstrated something startling: several common mathematical definitions of fairness are provably incompatible with each other. You cannot simultaneously achieve all of them. You have to choose.
For example: one definition of fairness says a high-risk label should mean the same thing in every group: among the people the AI flags as high-risk, the same share should actually turn out to be high-risk, regardless of race. Another says the AI should make mistakes at equal rates: it should be equally likely to wrongly flag a low-risk person as high-risk, and equally likely to miss a high-risk person, regardless of race. When the underlying rates differ between groups, as they do in most real-world datasets, an imperfect model cannot satisfy both definitions at once. Choosing one definition of fairness means accepting worse outcomes by another definition.
This was not a new discovery about AI; it was a mathematical proof about what fairness itself means when base rates differ between groups. The algorithm was just the lens that made the tension visible.
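The arithmetic behind that tension fits in a few lines of Python. The sketch below is a worked example under assumptions, not a restatement of the papers' proofs: the two groups and their base rates are hypothetical, `ppv` is the share of flagged people who really are high-risk, and `tpr` is the share of high-risk people who get flagged. Both are held equal across groups, and the code computes the false positive rate that follows.

```python
def false_positive_rate(base_rate, ppv, tpr):
    """False positive rate implied by a group's base rate, PPV, and TPR.

    Follows from counting, per person in the group:
      true positives  = base_rate * tpr
      false positives = true positives * (1 - ppv) / ppv
    and dividing false positives by the share who are actually low-risk.
    """
    true_pos = base_rate * tpr
    false_pos = true_pos * (1 - ppv) / ppv
    return false_pos / (1 - base_rate)

# Hypothetical groups that differ only in how often the outcome occurs.
groups = {"group_a": 0.50, "group_b": 0.30}
ppv, tpr = 0.70, 0.60  # held equal for both groups

fpr = {name: false_positive_rate(p, ppv, tpr) for name, p in groups.items()}
for name, value in fpr.items():
    print(name, "false positive rate:", round(value, 3))
```

With the flag equally trustworthy and equally sensitive in both groups, the group with the higher base rate ends up with a noticeably higher share of low-risk people wrongly flagged. No tuning of the model removes this; only equal base rates or a perfect predictor would.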
The people who built COMPAS argued it was fair by their definition. ProPublica argued it was unfair by theirs. Both were correct. This is not spin or bad faith on either side; it is a genuine philosophical disagreement about what justice requires, translated into math.
When a company says their AI is "fair" or "unbiased," you now know that "fair" is not a single objective standard; it is a choice between competing mathematical definitions, each of which implies different tradeoffs about whose interests are protected. The next time you read that claim, ask: fair by which definition? Fair for whom? These are questions the company should be able to answer, and most cannot.
Following the Amazon story and the COMPAS controversy, AI fairness became a formal subfield of computer science research. Dozens of papers were published. Major tech companies announced fairness teams and responsible AI commitments. The EU began drafting what would become the AI Act, which includes specific provisions about high-risk AI systems in hiring and criminal justice contexts.
But AI hiring tools did not disappear. They proliferated. By 2023, the majority of large US companies used some form of AI-assisted hiring screening. The Equal Employment Opportunity Commission issued guidance in 2023 warning that algorithmic hiring tools could violate anti-discrimination law, but did not ban them. The New York City government passed a law in 2021 requiring bias audits of automated hiring tools used in the city, one of the first such laws in the world. It went into effect in 2023.
The pattern is consistent across domains: a harm is documented, attention increases, some regulation follows, deployment continues and expands. Whether the regulation is adequate is a question that will not be settled for years, possibly decades.
A city is deploying an AI system to predict which students are at risk of dropping out of high school so counselors can intervene early. You have been asked to audit the fairness of the model before deployment. You have data showing that the model has different error rates for students from different neighborhoods.
Your partner is a policy analyst from the city who needs to decide whether to approve deployment. They want a recommendation, not a lecture.
In January 2022, OpenAI published a paper describing a technique they called InstructGPT. The paper explained that GPT-3, trained on roughly 300 billion tokens of internet text, was capable but unpredictable. It would sometimes provide useful information. It would sometimes generate harmful content. It would sometimes follow instructions, and sometimes produce something that technically answered the prompt but was practically useless.
To fix this, OpenAI hired a team of human contractors to do something specific: they generated AI responses to thousands of prompts and then ranked those responses by quality. Which answer was more helpful? Which was more honest? Which was less likely to cause harm? These rankings were then used to train a separate model, called a reward model, that could predict what a human rater would prefer. Then the main language model was trained further using that reward model as a guide.
This technique, Reinforcement Learning from Human Feedback (RLHF), is what transformed a raw language model into something that behaved like an assistant. It is what made ChatGPT feel conversational, careful, and (usually) safe. It is also what made it reflect the values, assumptions, and blind spots of the specific team of humans who did the rating.
To understand RLHF, think about how you might train a dog. You could expose the dog to thousands of situations and let it figure things out on its own; that is something like initial pretraining. Or you could watch what the dog does and give it a treat when it does something you like; that is something like reinforcement learning. The dog optimizes for getting treats. If your treat-giving reflects good judgment, the dog learns good behavior. If your treat-giving is inconsistent, biased, or focused on the wrong things, the dog learns to optimize for the wrong goals.
In RLHF, the treats are replaced by ratings. Human contractors rate AI outputs. A reward model learns to predict those ratings. The language model is then trained to generate outputs that would score highly with the reward model. The AI is optimizing for human approval, as estimated by a reward model trained on a specific group of human raters' preferences.
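The reward-model step can be sketched in Python at toy scale. This is an illustration under heavy assumptions, not OpenAI's implementation: each response is reduced to two made-up numbers (`helpfulness_cue` and `rudeness_cue`), the reward model is a weighted sum of them, and it is trained with a standard pairwise preference loss that pushes the preferred response's score above the rejected one's. The preference pairs are invented.

```python
import math

# Each response is reduced to two made-up features: (helpfulness_cue, rudeness_cue).
# Each pair is (response the rater preferred, response the rater rejected).
preference_pairs = [
    ((0.9, 0.1), (0.2, 0.8)),
    ((0.7, 0.0), (0.6, 0.9)),
    ((0.8, 0.2), (0.1, 0.3)),
    ((0.6, 0.1), (0.5, 0.7)),
]

weights = [0.0, 0.0]  # the reward model's parameters

def reward(features):
    """Scalar score the reward model assigns to a response."""
    return sum(w * f for w, f in zip(weights, features))

def pairwise_loss():
    """Average of -log(sigmoid(reward(preferred) - reward(rejected)))."""
    total = 0.0
    for preferred, rejected in preference_pairs:
        margin = reward(preferred) - reward(rejected)
        total += math.log(1 + math.exp(-margin))
    return total / len(preference_pairs)

initial_loss = pairwise_loss()
for _ in range(500):
    grads = [0.0, 0.0]
    for preferred, rejected in preference_pairs:
        margin = reward(preferred) - reward(rejected)
        # Derivative of the loss with respect to the margin is -sigmoid(-margin).
        push = -1.0 / (1.0 + math.exp(margin))
        for i in range(2):
            grads[i] += push * (preferred[i] - rejected[i])
    for i in range(2):
        weights[i] -= 0.5 * grads[i] / len(preference_pairs)

print("loss before:", round(initial_loss, 3), "after:", round(pairwise_loss(), 3))
print("learned weights:", [round(w, 2) for w in weights])
```

The learned weights are nothing more than a compressed summary of what these particular raters preferred. Swap in a different set of pairs from raters with different tastes and the same code produces a different definition of "good," which the language model would then be trained to chase.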
The elegance of RLHF is that it allows you to shape behavior through examples of preferred responses rather than through explicit rules. You do not have to write "do not tell users how to make weapons." You just have your raters consistently prefer outputs that do not do that, and the model learns the preference.
The InstructGPT paper from OpenAI described their rater team as a group of contractors hired through platforms like Upwork and Scale AI. The paper notes that raters were given guidelines and training, but also that there was disagreement among raters, especially on politically sensitive or culturally contested questions. Where raters disagreed, OpenAI had to make judgment calls about how to aggregate their preferences.
This is where the political dimension of AI training becomes impossible to ignore. Whose definition of "helpful" counts? Whose definition of "harmful"? A question like "is this response appropriately balanced on the topic of abortion?" does not have a culturally neutral answer. Different people in different countries, communities, and political traditions would rate the same response differently. The model's behavior on contested questions reflects choices that OpenAI's team made, or inherited from their rater pool, not some objective standard.
Anthropic, Google, and Meta face the same challenge with their own models. All of them use variations of human feedback to shape behavior. None of them has made its complete guidelines or rater demographics public. You are using AI systems whose value judgments were shaped by processes you cannot fully inspect.
The behavioral guidelines that shape what a major AI says about politics, religion, medicine, and ethics were written by employees at a private American company, refined through contractor ratings, and applied to a product used by hundreds of millions of people globally. Does that feel like an acceptable way to decide what AI tells the world? Who else should have been at the table? Is there a better process, and would it actually produce better outcomes?
The broader challenge that RLHF is trying to solve is called alignment: making AI systems that behave in ways that are genuinely beneficial, not just technically compliant with a metric. Alignment research is one of the fastest-growing areas in computer science, and one of the most contested.
The concern at the core of alignment research is this: as AI systems become more capable, the gap between "what we can measure" and "what we actually want" becomes more dangerous. A very capable AI optimizing the wrong objective does not just fail quietly; it potentially finds novel and effective ways to achieve the wrong goal. The more capable the system, the more creative it can be about optimizing for what it was told to optimize, not what was actually intended.
This is not science fiction. It is the Tay problem, the Amazon problem, and the COMPAS problem, extended to systems far more capable than those. The difference in scale is qualitative, not just quantitative. Researchers at organizations like OpenAI, Anthropic, DeepMind, and academic institutions like MIT, Berkeley, and Oxford are working on this, with significant disagreement about both the severity of the risk and the most promising approaches.
You have now completed a picture that very few adults have. From raw data collection, through the training loop, through bias and proxy variables, to the human raters shaping what AI says about the contested questions of our time, you understand the whole pipeline. When anyone tells you "the AI is neutral" or "the AI just uses data," you know exactly what questions to ask: whose data, labeled by whom, trained with what objective, fine-tuned toward whose definition of good? That is not a cynical view of AI. It is an accurate one.
By 2023 and 2024, the field had moved beyond pure RLHF toward variations and alternatives. Anthropic developed what they call Constitutional AI, in which a model is given a set of written principles and asked to critique and revise its own outputs against those principles, reducing the dependence on human raters for each individual example. Meta released model weights publicly (the LLaMA series), allowing researchers globally to study and fine-tune models without access to proprietary systems.
Each approach involves different tradeoffs between control, transparency, and the values baked in during fine-tuning. Constitutional AI moves the value judgments from rater behavior to the written constitution, but who writes the constitution? Open-weight models allow more scrutiny but also enable uses the original developers did not sanction.
None of these are permanent solutions. They are attempts to manage a challenge that will evolve as the systems become more capable. The field is moving quickly. The governance (legal frameworks, international agreements, auditing standards) is moving more slowly. That gap is one of the defining tensions in technology policy right now, and the generation currently in school will be the one to navigate it.
You have been asked to draft one principle for a Constitutional AI document that will guide how a major AI assistant handles politically contested topics such as immigration, gun control, and abortion. Your principle will affect how the AI responds to millions of users globally.
Your partner is a senior policy director who will challenge your draft. They have heard every easy answer before. They want something that actually holds up under pressure.