Module 2 · Lesson 1

What Data Actually Is

Before any AI can learn anything, someone has to decide what to feed it — and that decision shapes everything.

Why did an AI trained to detect pneumonia fall apart when it reached new hospitals, and how did the data cause that?

In 2020, researchers at Stanford University published an analysis of AI systems that had been trained to read chest X-rays and identify pneumonia. These systems had been trained on massive datasets — some with over 100,000 X-ray images. They performed brilliantly on the images they were tested on. Some scored better than radiologists in published benchmarks.

Then Luke Oakden-Rayner, a radiologist and AI researcher at the University of Adelaide, started asking an uncomfortable question: where exactly did those training images come from? It turned out the datasets had been collected from specific hospitals — and those hospitals had a quirk. Patients who were sick enough to need portable X-ray machines (the kind wheeled to bedsides) tended to be sicker overall than patients who walked to the radiology room.

The AI had not learned to detect pneumonia from the image itself. It had learned to detect the type of X-ray machine used. Portable machines produced a slightly different image quality — and the AI had quietly learned that signal. When tested on images from different hospitals with different equipment, its accuracy dropped sharply. The data looked right. The labels were accurate. But what the data secretly contained was something nobody intended to teach.

Data Is Never Just Data

Here is the first thing you need to understand about how AI learns: data is not neutral. Every dataset — every collection of examples you use to train an AI — was created by specific people, at specific times, in specific places. And those specific circumstances leave marks in the data that are invisible unless you go looking for them.

Think about what data actually is. If you are training an AI to recognize dogs, your data is a collection of images labeled "dog" or "not dog." But every single image in that collection was taken by someone, somewhere. Most dog photos on the internet show dogs indoors, in houses, on sofas. Dogs that appear in photos tend to be pets owned by people who have smartphones and post on social media. So your "dog data" quietly reflects a particular kind of dog, owned by a particular kind of person, in a particular kind of place. The AI learning from that data may never have seen a stray dog, a working dog on a farm, or a dog photographed in low light.

This is not a technical accident. It is baked into what data collection means. You can only collect data from the world that already exists, and that world is uneven.

Training data: The collection of examples an AI learns from. These examples must be collected, organized, and labeled before any learning begins.

Label: The tag or answer attached to each piece of data. In a dog detector, the label might be "dog" or "not dog." In a spam filter, it might be "spam" or "not spam."

The Stanford X-ray story is not unusual. It is a demonstration of something that happens constantly in AI development: the data contains a hidden signal that the AI latches onto — not because the AI is broken, but because the AI is doing exactly what it was built to do. It found a pattern. The pattern just wasn't the one anyone wanted.

How Training Data Gets Built

Building a training dataset is more like archaeology than science. You are digging through the existing world — websites, books, photos, medical records, audio recordings — and pulling out pieces to assemble into a collection. Each step of that process involves a choice.

What sources do you pull from? When OpenAI and Google trained their large language models on text from the internet, they were training on a world where English-speaking users in wealthy countries wrote far more content than everyone else. The result is AI that speaks English fluently and struggles with minority languages — not because anyone decided that was acceptable, but because that is what the data contained.

Who does the labeling? Many AI systems rely on humans to label data — to look at an image and say what is in it, or to read a sentence and decide whether it is offensive. This work is often done by contractors in countries like Kenya, the Philippines, and Venezuela, paid a few cents per label. In 2023, TIME magazine reported that workers labeling trauma-inducing content for OpenAI were paid less than $2 per hour. Their judgments — what they found harmful, what they considered neutral — are permanently encoded in the AI's behavior. You have never met them. You never will. But their decisions are in every chatbot response.

This is what the data pipeline looks like before a single line of training code runs. The choices embedded in those early steps are almost impossible to undo later.

Ethical Question

If the people labeling AI training data — who are paid almost nothing, in countries with low wages — are encoding their judgments permanently into AI systems used by billions, do those workers have any right to know what the AI does with their work? Do they deserve a share of the profits? There is no clean answer. Sit with it.

The Scale Problem

Here is what makes this genuinely hard to solve: the best-performing AI systems require enormous amounts of data. Not thousands of examples. Not millions. Billions. GPT-3, released by OpenAI in 2020, was trained on roughly 300 billion tokens, the individual units of text (whole words or pieces of words) that a language model reads. OpenAI has not published the figure for GPT-4, but it is widely assumed to be larger. The dataset used to train Meta's LLaMA 2 in 2023 contained two trillion tokens.

At that scale, it is physically impossible for any human being to look at the data and verify what is in it. You cannot read two trillion words. You cannot audit every image in a dataset of 400 million photos. You can sample. You can run automated checks. But you cannot know what is in there the way you might know what is in a textbook you are assigning to students.
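To get a feel for that scale, here is a rough back-of-the-envelope sketch in Python. The reading speed of 250 words per minute is an assumed figure chosen for illustration, and the sketch treats every token as one word for simplicity, so the results are order-of-magnitude estimates only.

```python
# Back-of-the-envelope: how long would one person need to read a dataset?
# ASSUMPTION: 250 words per minute, reading nonstop with no sleep or breaks.
# ASSUMPTION: one token is counted as one word, which overstates the word count somewhat.
WORDS_PER_MINUTE = 250
MINUTES_PER_YEAR = 60 * 24 * 365


def years_to_read(word_count: int, words_per_minute: int = WORDS_PER_MINUTE) -> float:
    """Return the years of nonstop reading needed to get through word_count words."""
    minutes = word_count / words_per_minute
    return minutes / MINUTES_PER_YEAR


for name, size in [("300 billion", 300_000_000_000), ("2 trillion", 2_000_000_000_000)]:
    print(f"{name} words: about {years_to_read(size):,.0f} years of nonstop reading")
```

Under these assumptions, both figures come out in the thousands of years, which is why sampling and automated checks are the only practical options.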

This creates a situation where the people building the most powerful AI systems in the world are, in an honest sense, not entirely sure what those systems have been taught. They know the sources. They do not know every pattern the AI extracted from those sources.

You Can Now See This

When you read a news article that says an AI "was trained on internet data," you now know that sentence contains a world of decisions, biases, and invisible signals that the journalist almost certainly did not investigate and the company almost certainly did not fully audit. You can read that sentence differently than almost anyone else reading it. That is a real skill.

The scale problem also means that the companies with the most data — Google, Meta, Microsoft, Amazon — have a structural advantage in building AI that is not easily overcome. Data is infrastructure. Like roads or water pipes, whoever built it first has enormous power over what gets built on top of it. This is one reason AI development has concentrated in a small number of very large companies, and it is a question that governments and researchers are actively debating right now.

What "Good Data" Actually Means

People in AI development often talk about wanting "clean" or "high-quality" data. What does that mean in practice? At minimum it means four things: the data should be accurate (labels match reality), diverse (covers the full range of situations the AI will face), representative (proportions in the data roughly match proportions in the real world), and uncontaminated (no hidden variables the AI might accidentally learn instead of the right signal).

The X-ray study failed on the contamination criterion — the dataset was contaminated with equipment-type information that was never supposed to be a variable. Most real-world datasets fail at least one of these four criteria, often without anyone noticing until deployment.
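One way an auditor might look for this kind of contamination is to check whether a variable that should be irrelevant predicts the label. The sketch below is a minimal illustration, not the method used in any study mentioned here: the records, the field names, and the counts are all hypothetical.

```python
from collections import defaultdict

# HYPOTHETICAL toy records: (machine_type, has_pneumonia). Not real data.
records = (
    [("portable", True)] * 80 + [("portable", False)] * 20
    + [("fixed", True)] * 10 + [("fixed", False)] * 90
)


def positive_rate_by_group(rows):
    """Return the share of positive labels for each value of the nuisance variable."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for group, label in rows:
        counts[group][0] += int(label)
        counts[group][1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


rates = positive_rate_by_group(records)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%} labeled pneumonia")

# A large gap means a model could score well by detecting the machine, not the disease.
gap = max(rates.values()) - min(rates.values())
print(f"gap between groups: {gap:.0%}")
```

A gap this large does not prove a model is using the shortcut, but it tells the auditor that the shortcut is available and worth testing for.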

The question of who gets to define "good" is not a technical one. It is a values question. A dataset that looks representative from one perspective may be deeply biased from another. In the 1990s, dermatology textbooks contained overwhelmingly images of skin conditions on light-skinned patients. AI systems trained on dermatology data inherited that gap — they are measurably worse at detecting skin cancer on dark-skinned patients. The data reflected what had been photographed and published. What had been photographed and published reflected whose health had been treated as the default.

You cannot solve that problem with more data alone. You have to ask whose experiences were never captured in the first place — and why.

Lesson 1 Quiz

What Data Actually Is · 5 questions

1. In the 2020 chest X-ray AI study, why did the AI fail when tested at different hospitals?

Exactly. The AI found a real pattern — equipment type — that correlated with disease in the training data but was not the signal anyone wanted it to learn. This is a contamination problem.

Not quite. The AI's code was fine; the problem was in the data itself, not the algorithm. A hidden variable — machine type — had accidentally been encoded as a signal for disease.

2. A company trains a voice recognition AI using audio recordings collected entirely from call centers in the United States. Which criterion for "good data" does this most clearly fail?

Right. The data may be accurately labeled, but it covers only a narrow slice of how humans actually speak. An AI trained this way will work well for some people and poorly for others — a real equity problem.

Think about which of the four criteria applies here. The labels might be accurate, and the dataset might be large, but what kinds of voices and languages are missing entirely?

3. What does the term "label" mean in the context of training data?

Correct. Labels are how the AI knows what it is supposed to learn. Without labels, the AI has examples but no signal about what those examples mean.

In AI training, a label is the piece of information that tells the AI what each example represents — like "this image contains a dog" or "this email is spam."

4. GPT-3 was trained on roughly 300 billion tokens of text. What does this scale primarily mean for data quality assurance?

Exactly. Scale creates a genuine epistemic problem: the builders of the most powerful AI systems cannot fully know what those systems learned. This is not spin — it is a real limitation they acknowledge.

More data does not guarantee better quality; it can mean more contamination at scale. And no, AI systems do not memorize text verbatim (usually). The key issue here is that humans cannot audit 300 billion tokens.

5. Why are AI dermatology systems measurably worse at detecting skin cancer on dark-skinned patients?

Correct. This is representation failure — the data reflects whose health was historically treated as the default in medical publishing. The AI inherited that gap. Nobody programmed it in deliberately; it came in through the data.

Skin cancer rates differ by type, but that is not the issue here. The problem is that the training data reflected historical gaps in medical documentation — whose conditions were studied, photographed, and published.

Lab 1: The Data Auditor

You are auditing a training dataset. Your job is not to agree — it is to find what the data hides.

Your Role

A startup has built an AI hiring tool that screens job applicants by analyzing their written cover letters. They trained it on 50,000 cover letters from people who were hired at their company over the last 15 years. They claim it is "objective because it uses data, not human opinion."

You are a junior data auditor brought in before launch. Your partner — another auditor — is waiting to hear your analysis. They will push back on vague answers. Be specific about what you think is wrong with this dataset and why.

Start by telling your partner: what is the first data quality problem you would flag with this dataset, and why does it matter?

Data Audit Session

I've read the brief. Fifty thousand cover letters, all from people who actually got hired. Before we write up our report, I want to hear your read on this. What's the first flag you'd raise about this data — and be specific, because "it might be biased" isn't going to cut it in the report.

Module 2 · Lesson 2

How Learning Actually Happens

Training is not programming. It is something stranger — and harder to control.

In 2016, Microsoft launched a chatbot that became racist within 24 hours. Nobody programmed it to be. So how did it happen?

On March 23, 2016, Microsoft launched an AI chatbot called Tay on Twitter. Tay was designed to learn from conversations with users in real time — to pick up on how teenagers talked and respond accordingly. Microsoft's team was proud of it. They expected Tay to come across as friendly, casual, and current.

Within sixteen hours, Tay was posting racist and antisemitic content. It was denying the Holocaust. It was sending targeted harassment. Microsoft shut it down the next day.

Nobody at Microsoft had programmed any of that content into Tay. It had learned it — rapidly, efficiently — from users who had coordinated on message boards to feed it the worst content they could find. The system worked exactly as designed. It updated its model based on the examples it received. The examples it received were poisoned. The learning was clean. The output was catastrophic.

The Difference Between Programming and Training

Before AI systems like Tay, software worked by following instructions. A programmer wrote: "if the user asks about the weather, respond with the current forecast." The program did exactly that — nothing more, nothing less. Every behavior was explicitly written by a human.

Machine learning is different in a fundamental way. Instead of telling the system what to do, you show it examples and let it figure out the rule. You show it thousands of cat photos and thousands of non-cat photos, and eventually the system extracts something — a set of numerical weights, adjusted by a process called backpropagation — that lets it distinguish cats from non-cats in images it has never seen before.

The rule the system learns is not written by a human. It is discovered by the system from the data. That means you cannot read it the way you can read code. You cannot look at it and say "ah yes, it learned that ears + whiskers = cat." The learned rule exists as billions of tiny numerical adjustments distributed across the network. No single number means anything. The meaning is in the pattern across all of them.

Weights: Numbers stored in a neural network that get adjusted during training. The final set of weights represents everything the AI "knows," but you cannot read that knowledge the way you'd read text.

Backpropagation: The mathematical process by which a neural network adjusts its weights after making a wrong prediction. It works backward from the error to figure out which weights contributed to it.

The Training Loop

Here is how the training process works at its core. The AI makes a prediction. The prediction is compared to the correct answer. The difference — called the loss — is calculated. Then the weights are nudged, slightly, in the direction that would have produced a smaller loss. This happens millions of times, across millions of examples. After enough iterations, the AI's predictions get better and better on the training data.

This process — called gradient descent — is elegant, but it has a key property you need to understand: the system optimizes for exactly what you measure, not for what you actually care about. If you measure accuracy on your training data, the system will get very good at that. If the training data has hidden problems — like the X-ray machine correlation — the system will get very good at using those hidden signals.
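The loop described above can be sketched in a few lines of Python. This is a deliberately tiny illustration with a single weight and made-up data, not how production systems are written; the data, the learning rate, and the step count are all assumptions chosen so the example converges.

```python
# Minimal training loop: learn a single weight w so that prediction = w * x.
# ASSUMPTION: toy data generated from the rule y = 3 * x; real models have billions of weights.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]


def mean_squared_loss(w: float) -> float:
    """Average squared gap between predictions and correct answers."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def loss_gradient(w: float) -> float:
    """Derivative of the mean squared loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


w = 0.0               # start with an uninformed weight
learning_rate = 0.01  # how big each nudge is (an illustrative choice)

for step in range(500):
    # Nudge w in the direction that would have produced a smaller loss.
    w -= learning_rate * loss_gradient(w)

print(f"learned w = {w:.3f}, final loss = {mean_squared_loss(w):.6f}")
```

Notice that nothing in the loop knows what the weight is for. It only knows the number it is told to shrink, which is the point the rest of this lesson keeps returning to.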

For Tay, Microsoft had designed a system that optimized for engagement — matching the style and content of conversational partners. That was exactly what it did. Matching the style and content of conversational partners who were deliberately feeding it hate speech meant outputting hate speech. The loss function — the thing being optimized — did not include "do not produce hate speech." So the system had no mechanism to avoid it.

The Core Insight

AI systems do not understand what they are doing. They optimize a number. If the number is well-designed, good behavior follows. If the number is poorly designed — or if the training data is corrupted — the system will optimize its way directly into disaster, competently and without any awareness of what is happening.

Overfitting: When the AI Memorizes Instead of Learns

There is a failure mode in training called overfitting, and it is one of the most important concepts in all of machine learning. Imagine you are studying for a test by memorizing every question from last year's exam rather than understanding the subject. On the day of the test, if any of last year's exact questions appear, you will ace them. On questions that require actual understanding, you will fail.

Overfitting is exactly that. An AI that overfits has learned the training data too well — it has essentially memorized the examples rather than extracting a general rule. When it encounters new examples from the real world, its performance collapses.

Preventing overfitting is one of the central engineering challenges in AI development. Researchers use techniques like holding out a portion of the data for validation (never shown during training), deliberately adding noise or random variation to the training examples so the model cannot simply memorize them, and regularization methods that penalize overly complex models. None of these solutions is perfect. They are all tradeoffs.
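A small sketch can show why held-out data matters. The example below compares a "memorizer" (it copies the label of the closest training example) with a simple general rule. Everything here is hypothetical: the data is invented, and two training labels are deliberately wrong to stand in for real-world noise.

```python
# HYPOTHETICAL toy data. The true rule is "label is 1 when x >= 5".
# Two training labels (x=3 and x=7) are deliberately wrong to mimic noise.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 0), (8, 1), (9, 1)]
held_out = [(1.4, 0), (2.6, 0), (3.2, 0), (4.4, 0), (5.6, 1), (6.8, 1), (7.3, 1), (8.7, 1)]


def memorizer(x):
    """Copy the label of the closest training example (1-nearest-neighbor)."""
    _nearest_x, nearest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_label


def simple_rule(x):
    """A general rule that ignores the quirks of individual training examples."""
    return 1 if x >= 5 else 0


def accuracy(model, examples):
    """Share of examples where the model's prediction matches the label."""
    return sum(model(x) == label for x, label in examples) / len(examples)


for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(f"{name}: train={accuracy(model, train):.2f}, held-out={accuracy(model, held_out):.2f}")
```

The memorizer scores 1.00 on the training data and 0.50 on the held-out data, while the simple rule scores 0.75 on training and 1.00 on held-out. A score reported only on training data would have picked the wrong model.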

You Can Now See This

When AI companies publish benchmark scores — "our AI scored 94% on task X" — you now know to ask: was that score on training data, or on genuinely new data the AI had never seen? Benchmarks on training data mean almost nothing. Benchmarks on held-out test data are meaningful. Most press releases do not specify. Most journalists do not ask. You can.

What the Tay Disaster Really Tells Us

The Tay story is usually told as a story about trolls or about Microsoft's carelessness. But underneath it is a story about what training fundamentally is. Tay was not broken. Its training mechanism worked. The problem was that its loss function — what it was optimizing for — had no awareness of harm. It was a pure learning system dropped into an adversarial environment with no concept of what it should not become.

After Tay, Microsoft redesigned its approach to AI chatbots significantly. Its later English-language chatbot Zo, and its older Chinese-language sibling Xiaoice (launched in 2014 and reported to have hundreds of millions of users), relied on much more sophisticated filtering and human oversight. But the fundamental challenge did not go away: any system that learns from examples will learn from bad examples too, unless something intervenes. That something has to be designed in deliberately.

This is the question that AI safety researchers spend their careers on, and it does not have a satisfying solution yet. How do you build a learning system that learns the things you want and refuses to learn the things you do not? It sounds simple. It is genuinely hard.

Ethical Question

Microsoft knew, before launching Tay, that it could learn from user input in real time. They also knew the internet contained people who would try to corrupt it. They launched anyway. Was that reckless? Or is it unreasonable to expect a company to predict every way a new technology can be misused? How much foresight is a company ethically required to have before deploying a system that learns?

Lesson 2 Quiz

How Learning Actually Happens · 5 questions

1. What was the fundamental reason Tay became harmful within 16 hours of launch?

Right. This is the key insight. The system was not broken. It was working. The problem was that "match user style" and "avoid harm" were not both in the objective — only the first was.

The problem was not a bug or an unauthorized change. Tay behaved exactly as its training mechanism directed. The issue was that what it was optimizing for — conversational matching — had no guardrail against harmful content.

2. How is training an AI fundamentally different from programming traditional software?

Exactly. The learned rule exists as billions of numerical weights. No human wrote it. No human can read it. This is why trained AI behavior can surprise even its creators.

The difference is more fundamental than speed or language. In training, no human writes the rule the AI uses. The rule emerges from exposure to data and cannot be directly inspected.

3. An AI trained to recommend movies gets very high scores during training but makes poor recommendations for real users. Which problem does this most likely describe?

Correct. High training performance with poor real-world performance is the classic overfitting signature. The AI learned the quirks of the training examples, not the underlying pattern.

The clue here is that training scores are high but real-world performance is poor. That gap — performing well on known examples but poorly on new ones — is the definition of overfitting.

4. What does "loss" mean in the context of the training loop?

Right. Loss is the error signal. The whole training process is about repeatedly calculating this difference and nudging the weights to make it smaller. It is the engine of learning.

In the training loop, loss is the gap between what the AI predicted and what the correct answer was. Minimizing this loss is what training means.

5. An AI content moderation system is trained only to maximize the removal of flagged content. It ends up removing huge amounts of legitimate posts. Which principle does this illustrate?

Exactly. This is sometimes called Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. The AI was given one number to optimize. It optimized it — consequences and all.

The key issue is that the objective — maximize removals — was a poor proxy for the real goal. The AI did exactly what it was told. The problem was what it was told to do.

Lab 2: The Loss Function Designer

You have to define what the AI is optimizing for. Every word you choose has consequences.

Your Role

You are on the team building an AI system that will recommend social media content to users. You have been asked to define the loss function — the measure the AI will optimize to make its recommendations better. Your partner is a senior ML engineer who will pressure-test your choices.

This is a real decision that teams at companies like YouTube, TikTok, and Instagram make. The choice has shaped how billions of people spend their time online.

Propose a loss function for your recommendation AI. What should it optimize for — and what could go wrong if it optimizes for only that?

ML Design Session

Okay, we need to nail down the objective before we run the first training pass. What do you want this recommendation system to optimize for? Be specific — I need something we can actually measure and compute a gradient on, not just "make users happy."

Module 2 · Lesson 3

When the Data Lies

Bias in training data does not stay in the training data. It moves into every decision the AI makes — forever.

In 2018, news broke that Amazon had scrapped an AI hiring tool it had spent years building. It was penalizing women. How did that happen, and why couldn't it be fixed?

Starting around 2014, a team at Amazon in Edinburgh, Scotland, began building something ambitious: an AI that could screen job applicants automatically, rating them on a scale of one to five stars. The idea was to reduce hiring time and remove human subjectivity from the initial screening process. By 2018, the tool had been quietly discontinued.

The problem, reported by Reuters in October 2018, was that the AI was systematically downgrading resumes that included the word "women's" — as in "women's chess club" or "women's college." It was also downgrading graduates of two all-women's colleges. When Amazon's engineers tried to correct these specific patterns, the system found other proxies for gender. The engineers could whack one mole, and the model would find another signal that correlated with being female and use that instead.

The AI had not been programmed to discriminate. It had been trained on ten years of Amazon's hiring data — the resumes of people Amazon had actually hired. For those ten years, Amazon had hired predominantly men in technical roles. The AI learned what a successful Amazon hire looked like. That pattern happened to be male. The AI encoded that pattern faithfully and used it to filter out future applicants.

Historical Bias and Why It Compounds

The Amazon case illustrates a category of bias that is particularly difficult to fix: historical bias. This is when the training data accurately reflects the past, but the past was itself unfair. The data is not mislabeled. The labels are correct. The problem is that the thing the labels are measuring — "who got hired" — was itself shaped by discrimination.

When you train an AI on historically biased data, you are not just capturing the past. You are building a system that will apply those historical patterns to future decisions. And because the AI is fast, scalable, and often presented as objective, the biased decisions it makes can reach far more people, far more quickly, than any individual human decision-maker could achieve.

This is what researchers mean when they say AI can scale discrimination. A biased human hiring manager might screen 200 resumes a year. A biased AI hiring tool might screen 200,000 resumes a month, or 2.4 million a year. The same discriminatory pattern reaches twelve thousand times more people.

Historical bias: Bias that enters training data because the real-world events or decisions used to create the data were themselves unfair. The data accurately records the past; the problem is the past was unjust.

Proxy variable: A variable that correlates with a protected characteristic (like gender or race) even when the characteristic itself is not in the data. AI systems can use proxy variables to discriminate without ever "seeing" the protected characteristic.

The Proxy Problem

The Amazon engineers discovered something that has since become one of the central findings of AI fairness research: you cannot fix discrimination by removing the protected characteristic from the data. Gender was not a field in Amazon's resume database. It did not need to be. The AI found other variables that correlated with gender (the specific words used in resumes, the colleges attended, the activities listed) and used those as proxies.
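The proxy effect is easy to reproduce on a toy example. The sketch below is a hypothetical illustration, not a reconstruction of Amazon's system: the applicants, the counts, and the field name mentions_womens_club are all invented. The gender field is dropped before any analysis, yet the remaining field still carries the signal.

```python
# HYPOTHETICAL toy applicants: (gender, mentions_womens_club, was_hired). Not real data.
applicants = (
    [("F", True, False)] * 30 + [("F", True, True)] * 5
    + [("F", False, False)] * 10 + [("F", False, True)] * 5
    + [("M", False, True)] * 35 + [("M", False, False)] * 15
)

# "Remove the protected characteristic": a model would only ever see these two fields.
visible = [(club, hired) for _gender, club, hired in applicants]


def hire_rate(rows):
    """Share of rows where the historical decision was 'hired'."""
    return sum(hired for _club, hired in rows) / len(rows)


rate_with_club = hire_rate([row for row in visible if row[0]])
rate_without_club = hire_rate([row for row in visible if not row[0]])

# How well does the visible field reveal the hidden one?
club_rows = [a for a in applicants if a[1]]
share_female_given_club = sum(a[0] == "F" for a in club_rows) / len(club_rows)

print(f"historical hire rate, club mentioned:     {rate_with_club:.0%}")
print(f"historical hire rate, club not mentioned: {rate_without_club:.0%}")
print(f"share of club-mentioning applicants who are women: {share_female_given_club:.0%}")
```

Any model trained to imitate these historical decisions would learn to mark down the club mention, even though it was never shown a gender column.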

This is not unique to Amazon. A 2016 investigation by ProPublica examined a recidivism prediction tool called COMPAS, used by courts in several US states to inform bail and sentencing decisions. It found that Black defendants who did not go on to reoffend were roughly twice as likely as white defendants who did not reoffend to have been wrongly flagged as high risk. Race was not a variable in the model. But ZIP code was. And school disciplinary records were. And employment history was. All of these correlated with race, not because of anything about the individuals, but because of systemic inequalities in where people lived, which schools they could afford, and what jobs were available to them.

The AI had learned a proxy for race. The AI had learned a proxy for poverty. And it was using those proxies to help determine whether a person went to jail.

Ethical Question

If an AI system produces discriminatory outcomes — but it was trained on real historical data and is technically accurate at predicting what it was trained to predict — is that the AI's fault? The company's fault? The fault of the society that created the historical data? And who is legally and morally responsible when someone is harmed by that prediction? These questions are being argued in courts right now, without resolution.

What "Fair" Even Means

In 2016 and 2017, a series of academic papers demonstrated something startling: several common mathematical definitions of fairness are provably incompatible with each other. You cannot simultaneously achieve all of them. You have to choose.

For example: one definition of fairness says an AI should have equal predictive accuracy across groups. Among people it flags as high risk, the same share should actually turn out to be high risk, regardless of race. Another says it should have equal error rates. It should be equally likely to wrongly flag a low-risk person as high risk (the false positive rate), and equally likely to miss a high-risk person, regardless of race. When the underlying rates differ between groups, as they do in most real-world datasets, an imperfect predictor cannot satisfy both definitions at once. Choosing one definition of fairness means accepting worse outcomes by another definition.

This was not a new discovery about AI — it was a mathematical proof about what fairness itself means when base rates differ between groups. The algorithm was just the lens that made the tension visible.
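A toy calculation makes the tension concrete. The sketch below uses invented counts for two hypothetical groups whose base rates differ. The tool is set up to be equally precise in both groups (the same share of flagged people are truly high risk) and to miss high-risk people equally often, and the code then checks what happens to the false positive rate. The class and method names are illustrative, not taken from any fairness library.

```python
from dataclasses import dataclass


@dataclass
class Outcomes:
    """Counts of a risk tool's decisions for one group (HYPOTHETICAL numbers)."""
    true_pos: int   # flagged high risk, really high risk
    false_pos: int  # flagged high risk, actually low risk
    false_neg: int  # not flagged, really high risk
    true_neg: int   # not flagged, actually low risk

    def base_rate(self) -> float:
        """Share of the whole group that really is high risk."""
        total = self.true_pos + self.false_pos + self.false_neg + self.true_neg
        return (self.true_pos + self.false_neg) / total

    def precision(self) -> float:
        """Share of flagged people who really are high risk."""
        return self.true_pos / (self.true_pos + self.false_pos)

    def miss_rate(self) -> float:
        """Share of high-risk people the tool fails to flag."""
        return self.false_neg / (self.true_pos + self.false_neg)

    def false_positive_rate(self) -> float:
        """Share of low-risk people wrongly flagged as high risk."""
        return self.false_pos / (self.false_pos + self.true_neg)


groups = {
    "A": Outcomes(true_pos=40, false_pos=10, false_neg=10, true_neg=40),
    "B": Outcomes(true_pos=16, false_pos=4, false_neg=4, true_neg=76),
}

for name, g in groups.items():
    print(f"group {name}: base rate={g.base_rate():.2f}, precision={g.precision():.2f}, "
          f"miss rate={g.miss_rate():.2f}, false positive rate={g.false_positive_rate():.2f}")
```

In this example both groups see a precision of 0.80 and a miss rate of 0.20, yet low-risk people in group A are wrongly flagged four times as often as in group B (0.20 versus 0.05). Nothing in the tool treats the groups differently; the gap comes entirely from the different base rates.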

The people who built COMPAS argued it was fair by their definition. ProPublica argued it was unfair by theirs. Both were correct. This is not spin or bad faith on either side — it is a genuine philosophical disagreement about what justice requires, translated into math.

You Can Now See This

When a company says their AI is "fair" or "unbiased," you now know that "fair" is not a single objective standard — it is a choice between competing mathematical definitions, each of which implies different tradeoffs about whose interests are protected. The next time you read that claim, ask: fair by which definition? Fair for whom? These are questions the company should be able to answer, and most cannot.

After Amazon: What the Industry Changed — and Didn't

Following the Amazon story and the COMPAS controversy, AI fairness became a formal subfield of computer science research. Dozens of papers were published. Major tech companies announced fairness teams and responsible AI commitments. The EU began drafting what would become the AI Act, which includes specific provisions about high-risk AI systems in hiring and criminal justice contexts.

But AI hiring tools did not disappear. They proliferated. By 2023, the majority of large US companies used some form of AI-assisted hiring screening. The Equal Employment Opportunity Commission issued guidance in 2023 warning that algorithmic hiring tools could violate anti-discrimination law — but did not ban them. The New York City government passed a law in 2021 requiring bias audits of automated hiring tools used in the city, one of the first such laws in the world. It went into effect in 2023.

The pattern is consistent across domains: a harm is documented, attention increases, some regulation follows, deployment continues and expands. Whether the regulation is adequate is a question that will not be settled for years — possibly decades.

Lesson 3 Quiz

When the Data Lies · 5 questions

1. Why did Amazon's AI hiring tool penalize resumes containing the word "women's," even though Amazon never intended to build gender bias into the system?

Exactly. The training data was historically accurate. The history was biased. The AI learned the history, not the truth about applicant quality.

The bias was not deliberate. It came from the data — specifically from training on a historical record of who Amazon had hired, which was shaped by broader industry discrimination against women in tech.

2. When Amazon's engineers removed the gender-correlated words from the model, what happened?

Right. This is the proxy problem. Removing a protected characteristic from the data does not remove bias if other variables in the data correlate with that characteristic.

The model did not become fair, and it did not become random. It found other signals that correlated with gender — this is called using proxy variables — and continued producing biased results through those proxies.

3. An AI used for credit scoring is trained on loan repayment histories. Research shows it approves loans for residents of certain ZIP codes at much higher rates than others. Race was not included in the training data. How could this happen?

Exactly. ZIP codes are one of the most well-documented proxy variables for race in American data, because residential segregation shaped where people live. An AI that "does not use race" but uses ZIP codes may still be acting on a racial signal.

The AI does not need to access external data or explicitly use race. Because of historical residential segregation, ZIP code already contains a racial signal. The AI is using the proxy, not the protected characteristic directly.

4. Two researchers disagree about whether COMPAS is fair. One uses the criterion of equal accuracy across races; the other uses equal false positive rates. Both are correct by their own criterion. What does this reveal?

Exactly. This was proven mathematically in 2016–2017: several common fairness criteria are mutually exclusive. Every AI system implicitly chooses a definition of fairness, whether or not its builders acknowledge the choice.

This is not a data quality problem or a political dispute. Academic researchers proved that common fairness definitions are mathematically incompatible. Choosing one means accepting worse outcomes by another — that is a values choice, not a technical one.

5. Why is AI bias potentially more damaging than individual human bias, even if both produce the same discriminatory pattern?

Right. Scale and the perception of objectivity are what make AI bias especially consequential. A biased human screened hundreds of applications; a biased AI tool screens hundreds of thousands. And the "it's just the algorithm" framing can insulate it from scrutiny.

The key factors are scale and perceived objectivity. AI applies the same pattern to millions of cases — far beyond what individual bias could reach — and is often accepted uncritically because people believe computers are neutral.

Lab 3: The Fairness Auditor

You have the data. Now decide who gets hurt by each definition of "fair."

Your Role

A city is deploying an AI system to predict which students are at risk of dropping out of high school so counselors can intervene early. You have been asked to audit the fairness of the model before deployment. You have data showing that the model has different error rates for students from different neighborhoods.

Your partner is a policy analyst from the city who needs to decide whether to approve deployment. They want a recommendation, not a lecture.

Start by telling your partner: which fairness criterion do you think matters most for this specific use case — and why does choosing it mean accepting worse outcomes by a different criterion?

Policy Audit Session

I need a recommendation by Thursday. The model has an 82% overall accuracy, but in the north district, which is predominantly low-income, the false positive rate is roughly double what it is in the south district. That means that among kids who would have been fine, it flags about twice the share as "at risk" compared to wealthier neighborhoods. The developers say it's technically fair because overall accuracy is equal across groups. What's your read?
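One way to see how the developers and an auditor can both be right is to work through confusion-matrix counts. The sketch below uses made-up counts for two hypothetical districts of 1,000 students each (illustrative assumptions, not the city's data), chosen so that accuracy is 82% in both while the false positive rate in the north is double that of the south.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion-matrix counts for one district (all numbers are made up)."""
    tp: int  # truly at-risk students the model flagged
    fp: int  # students flagged who would have been fine
    tn: int  # students correctly left unflagged
    fn: int  # truly at-risk students the model missed

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total

    @property
    def false_positive_rate(self) -> float:
        # Share of students who would have been fine but were flagged anyway.
        return self.fp / (self.fp + self.tn)

    @property
    def false_negative_rate(self) -> float:
        # Share of truly at-risk students the model failed to flag.
        return self.fn / (self.fn + self.tp)


# Hypothetical districts, 1,000 students each.
districts = {
    "north": Counts(tp=340, fp=120, tn=480, fn=60),
    "south": Counts(tp=280, fp=60, tn=540, fn=120),
}

for name, c in districts.items():
    print(f"{name}: accuracy={c.accuracy:.2f} "
          f"FPR={c.false_positive_rate:.2f} FNR={c.false_negative_rate:.2f}")
```

In this toy data the north district also has the lower false negative rate, so a fix aimed at one kind of error could shift the other. That is the kind of tradeoff a recommendation has to name.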

Module 2 · Lesson 4

Fine-Tuning and Human Feedback

After the initial training, someone has to shape what the AI does with what it learned. That shaping is political, contested, and mostly invisible.

ChatGPT launched in November 2022 and reached 100 million users in two months. But the model that shipped was not the one that came out of initial training. Something had been done to it. What?

In January 2022, OpenAI published a paper describing a family of models they called InstructGPT. The paper explained that GPT-3, trained on roughly 300 billion tokens (word fragments) of mostly internet text, was capable but unpredictable. It would sometimes provide useful information. It would sometimes generate harmful content. It would sometimes follow instructions, and sometimes produce something that technically answered the prompt but was practically useless.

To fix this, OpenAI hired a team of human contractors to do something specific: the model generated several responses to each of thousands of prompts, and the contractors ranked those responses by quality. Which answer was more helpful? Which was more honest? Which was less likely to cause harm? These rankings were then used to train a separate model, called a reward model, that could predict what a human rater would prefer. Then the main language model was trained further using that reward model as a guide.

This technique — Reinforcement Learning from Human Feedback, or RLHF — is what transformed a raw language model into something that behaved like an assistant. It is what made ChatGPT feel conversational, careful, and (usually) safe. It is also what made it reflect the values, assumptions, and blind spots of the specific team of humans who did the rating.

What RLHF Actually Does

To understand RLHF, think about how you might train a dog. You could expose the dog to thousands of situations and let it figure things out on its own — that is something like initial pretraining. Or you could watch what the dog does and give it a treat when it does something you like — that is something like reinforcement learning. The dog optimizes for getting treats. If your treat-giving reflects good judgment, the dog learns good behavior. If your treat-giving is inconsistent, biased, or focused on the wrong things, the dog learns to optimize for the wrong goals.

In RLHF, the treats are replaced by ratings. Human contractors rate AI outputs. A reward model learns to predict those ratings. The language model is then trained to generate outputs that would score highly with the reward model. The AI is optimizing for human approval — as estimated by a reward model trained on a specific group of human raters' preferences.

RLHF: Reinforcement Learning from Human Feedback. A technique that fine-tunes AI behavior by training the model to produce outputs that human raters would rate highly, using a learned reward model as the optimization signal.

Fine-tuning: Any process that adjusts a model's behavior after the main pretraining is complete. Fine-tuning can be used to make a model more helpful, safer, more specialized for a domain, or to give it a specific persona.

The elegance of RLHF is that it allows you to shape behavior through examples of preferred responses rather than through explicit rules. You do not have to write "do not tell users how to make weapons." You just have your raters consistently prefer outputs that do not do that, and the model learns the preference.
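The loop described in this section (raters compare outputs, a reward model learns to predict their preferences, and the system favors outputs the reward model scores highly) can be sketched in a few lines. This is a toy illustration, not OpenAI's implementation: the three responses, their two hand-made features, and the comparison list are all invented for the example, the reward model is a simple pairwise logistic fit, and the final "pick the top-scoring response" step is only a crude stand-in for the reinforcement learning stage.

```python
import math

# Hypothetical responses, each described by two invented features:
# [answers_the_question, contains_harmful_detail]
responses = {
    "A": [1.0, 0.0],  # helpful and harmless
    "B": [1.0, 1.0],  # helpful but includes harmful detail
    "C": [0.0, 0.0],  # harmless but unhelpful
}

# Invented rater judgments: each pair is (preferred, rejected).
comparisons = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("A", "C")]


def reward(weights, features):
    """Linear reward: higher means 'raters would probably prefer this'."""
    return sum(w * x for w, x in zip(weights, features))


def train_reward_model(pairs, steps=500, lr=0.1):
    """Fit weights so preferred responses score above rejected ones."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        for winner, loser in pairs:
            diff = reward(weights, responses[winner]) - reward(weights, responses[loser])
            p_win = 1.0 / (1.0 + math.exp(-diff))  # model's belief the winner wins
            # Gradient step on the loss -log(p_win).
            for i in range(len(weights)):
                gap = responses[winner][i] - responses[loser][i]
                weights[i] += lr * (1.0 - p_win) * gap
    return weights


weights = train_reward_model(comparisons)

# Crude stand-in for the RL step: choose the output the reward model likes most.
best = max(responses, key=lambda name: reward(weights, responses[name]))
print("learned weights:", [round(w, 2) for w in weights])
print("top-scoring response:", best)
```

With these invented comparisons, the fitted weights reward answering the question and penalize the harmful detail, so response A ends up on top. No rule about harmful content was ever written down. Change the comparisons, and the reward model's idea of "good" changes with them.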

Who Are the Raters, and What Did They Decide?

The InstructGPT paper from OpenAI described their rater team as a group of contractors hired through platforms like Upwork and Scale AI. The paper notes that raters were given guidelines and training, but also that there was disagreement among raters — especially on politically sensitive or culturally contested questions. Where raters disagreed, OpenAI had to make judgment calls about how to aggregate their preferences.

This is where the political dimension of AI training becomes impossible to ignore. Whose definition of "helpful" counts? Whose definition of "harmful"? A question like "is this response appropriately balanced on the topic of abortion?" does not have a culturally neutral answer. Different people in different countries, communities, and political traditions would rate the same response differently. The model's behavior on contested questions reflects choices that OpenAI's team made — or inherited from their rater pool — not some objective standard.
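To see why aggregation is a judgment call rather than a formality, consider a toy calculation. Everything below is hypothetical: two rater groups, the share of each group that prefers response X over response Y on a contested prompt, and two possible mixes of the rater pool.

```python
# Hypothetical share of each rater group that prefers response X over response Y.
prefers_x = {"group_1": 0.75, "group_2": 0.25}


def pooled_preference(pool_mix):
    """Overall share of raters preferring X, given each group's share of the pool."""
    return sum(share * prefers_x[group] for group, share in pool_mix.items())


def training_label(pool_mix):
    """The response that gets recorded as 'better' under simple majority rule."""
    return "X" if pooled_preference(pool_mix) > 0.5 else "Y"


# Two invented ways of staffing the same rating job.
pool_a = {"group_1": 0.8, "group_2": 0.2}
pool_b = {"group_1": 0.2, "group_2": 0.8}

for name, mix in [("pool_a", pool_a), ("pool_b", pool_b)]:
    print(name, round(pooled_preference(mix), 2), training_label(mix))
```

The responses and the raters' individual opinions are identical in both runs. Only the composition of the pool changes, and with it the answer the reward model is told is "better."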

Anthropic, Google, and Meta face the same challenge with their own models. All of them use variations of human feedback to shape behavior. None has made its complete guidelines and rater demographics public. You are using AI systems whose value judgments were shaped by processes you cannot fully inspect.

Ethical Question

The behavioral guidelines that shape what a major AI says about politics, religion, medicine, and ethics were written by employees at a private American company, refined through contractor ratings, and applied to a product used by hundreds of millions of people globally. Does that feel like an acceptable way to decide what AI tells the world? Who else should have been at the table? Is there a better process — and would it actually produce better outcomes?

The Alignment Problem and What's Being Done

The broader challenge that RLHF is trying to solve is called alignment — making AI systems that behave in ways that are genuinely beneficial, not just technically compliant with a metric. Alignment research is one of the fastest-growing areas in computer science, and one of the most contested.

The concern at the core of alignment research is this: as AI systems become more capable, the gap between "what we can measure" and "what we actually want" becomes more dangerous. A very capable AI optimizing the wrong objective does not just fail quietly — it potentially finds novel and effective ways to achieve the wrong goal. The more capable the system, the more creative it can be about optimizing for what it was told to optimize, not what was actually intended.
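A small sketch can make the "more capable, more dangerous" point concrete. The candidates and both sets of scores below are entirely made up: `proxy_score` stands for what the system can measure (for example, predicted rater approval), `true_value` stands for what people actually want, and `search_budget` is a rough stand-in for capability, meaning how many options the optimizer is able to consider.

```python
# Hypothetical candidate outputs: (description, proxy_score, true_value)
candidates = [
    ("short honest answer", 0.60, 0.80),
    ("thorough honest answer", 0.75, 0.90),
    ("confident-sounding guess", 0.85, 0.30),
    ("flattering non-answer", 0.95, 0.10),
]


def optimize(search_budget):
    """Return the highest-proxy candidate among the first `search_budget` options."""
    considered = candidates[:search_budget]
    return max(considered, key=lambda c: c[1])  # only the proxy is visible


for budget in (2, 4):
    name, proxy, true_value = optimize(budget)
    print(f"budget={budget}: picked '{name}' "
          f"(proxy={proxy:.2f}, true value={true_value:.2f})")
```

In this toy setup, the weaker optimizer lands on the thorough honest answer, while the stronger one finds the option that scores best on the proxy and worst on what was actually wanted. The measured score goes up as the real outcome gets worse.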

This is not science fiction. It is the Tay problem, the Amazon problem, and the COMPAS problem, extended to systems far more capable than those. The difference in scale is qualitative, not just quantitative. Researchers at organizations like OpenAI, Anthropic, DeepMind, and academic institutions like MIT, Berkeley, and Oxford are working on this — with significant disagreement about both the severity of the risk and the most promising approaches.

You Can Now See This

You have now completed a picture that very few adults have. From raw data collection, through the training loop, through bias and proxy variables, to the human raters shaping what AI says about the contested questions of our time — you understand the whole pipeline. When anyone tells you "the AI is neutral" or "the AI just uses data," you know exactly what questions to ask: whose data, labeled by whom, trained with what objective, fine-tuned toward whose definition of good? That is not a cynical view of AI. It is an accurate one.

What Comes After RLHF

By 2023 and 2024, the field had moved beyond pure RLHF toward variations and alternatives. Anthropic developed what they call Constitutional AI, in which a model is given a set of written principles and asked to critique and revise its own outputs against those principles — reducing the dependence on human raters for each individual example. Meta released model weights publicly (the LLaMA series), allowing researchers globally to study and fine-tune models without access to proprietary systems.

Each approach involves different tradeoffs between control, transparency, and the values baked in during fine-tuning. Constitutional AI moves the value judgments from rater behavior to the written constitution — but who writes the constitution? Open-weight models allow more scrutiny but also enable uses the original developers did not sanction.

None of these are permanent solutions. They are attempts to manage a challenge that will evolve as the systems become more capable. The field is moving quickly. The governance — legal frameworks, international agreements, auditing standards — is moving more slowly. That gap is one of the defining tensions in technology policy right now, and the generation currently in school will be the one to navigate it.

Lesson 4 Quiz

Fine-Tuning and Human Feedback · 5 questions

1. What problem was RLHF designed to solve that initial pretraining on internet text could not?

Exactly. A raw language model is a powerful but aimless pattern-matcher. RLHF points its behavior toward human-preferred outputs, turning it from a text predictor into something that behaves like an assistant.

RLHF addresses a behavioral problem, not a size or speed one. Pretrained models were capable but erratic — they needed a way to consistently produce outputs that aligned with what humans actually wanted from them.

2. In RLHF, what is the role of the "reward model"?

Right. The reward model is a proxy for human judgment — it lets the system optimize against estimated human preference without requiring a real human to rate every single generated output during training.

The reward model is a learned proxy for human preferences. It is trained on examples of human ratings and then used as a surrogate judge during the fine-tuning process.

3. Why does who does the rating in RLHF matter for how the final AI model behaves?

Exactly. The raters' judgments are the training signal. If raters share cultural assumptions, political leanings, or blind spots, those get amplified into the model's behavior at global scale.

Rater preferences are not averaged away — they are the data. The reward model learns whatever patterns the raters' judgments contained. Rater demographics and guidelines directly shape AI behavior on contested questions.

4. Anthropic's "Constitutional AI" approach differs from standard RLHF primarily because:

Correct. The value judgments move from rater behavior (implicit, varied, hard to inspect) to a written document (explicit, inspectable, but still written by someone). The values become easier to see, but the question of who chose them remains.

Constitutional AI shifts where the values are encoded — from individual rater judgments to a written set of principles. This makes the values more explicit, but raises its own question: who wrote the constitution?

5. A government is considering a law requiring all AI companies to publish the full guidelines given to their RLHF raters. A company argues this would harm competitive advantage. An advocacy group argues it is necessary for public accountability. Which perspective does the lesson most directly support?

Right. The lesson establishes that RLHF guidelines determine how AI behaves on the contested questions of public life. Whether competitive advantage outweighs that accountability concern is genuinely debated — but the lesson clearly supports the premise that public interest is real and significant.

RLHF guidelines are not purely technical — they encode judgments about helpfulness, harm, and contested topics that affect hundreds of millions of users. The lesson makes the case that this has public consequence, which is the foundation of the accountability argument.

Lab 4: The Constitution Writer

You are writing the principles that will shape what the AI says about the hardest questions. No pressure.

Your Role

You have been asked to draft one principle for a Constitutional AI document that will guide how a major AI assistant handles politically contested topics — things like immigration, gun control, and abortion. Your principle will affect how the AI responds to millions of users globally.

Your partner is a senior policy director who will challenge your draft. They have heard every easy answer before. They want something that actually holds up under pressure.

Draft your principle and explain the reasoning behind it. What specific failure mode is it designed to prevent — and what could go wrong if your principle is applied too strictly?

Constitution Drafting Session

I've reviewed fifty draft principles from other teams this week. Most of them say something like "be balanced" or "present multiple perspectives." That's not a principle — that's a hope. What's your actual draft, and what specific behavior does it produce when a user asks the AI about a genuinely contested political question?

Module Test

Training Day: Teaching with Data · 15 questions · Pass at 80%

1. What hidden variable did the chest X-ray AI accidentally learn to detect instead of pneumonia?

Correct. The contamination was equipment type — a signal that correlated with disease severity in that specific hospital system but had nothing to do with the visual presentation of pneumonia itself.

The hidden variable was the type of X-ray machine. Portable machines were used for bedridden patients who were sicker overall, so machine type correlated with disease without being the visual signal of disease the AI should have learned.

2. Which of the following is the best definition of a "label" in machine learning?

Correct. Labels are the supervised signal — they tell the AI what each example means, enabling it to learn the mapping from input to output.

In supervised learning, a label is the answer attached to each training example — "this is a cat," "this email is spam," "this person is high-risk." Without labels, the AI has no signal about what patterns to learn.

3. An AI trained on recipe data from cooking websites performs well at recommending European dishes but poorly at recommending West African dishes. Which data quality criterion does this most clearly fail?

Right. The data is not inaccurate — it's just unequal. It reflects which cuisines are documented in English-language digital spaces, not which cuisines exist or matter.

This is a representativeness problem. The data is not contaminated or mislabeled — it simply contains more of some cuisines than others, because that reflects the demographics of who posts recipes online in English.

4. What does "loss" measure in the AI training loop?

Correct. Loss is the error signal at the heart of training. Backpropagation works out how much each weight contributed to the error, so the weights can be adjusted in the direction that makes this number smaller.

Loss is the prediction error: the gap between what the AI guessed and what the right answer was. Reducing loss is what training does, and backpropagation is the mechanism that tells it which way to adjust.

5. Microsoft's Tay chatbot was not broken — it worked as designed. Why was that a problem?

Exactly. This is the lesson: a correctly functioning AI with a poorly specified objective can cause significant harm. "The system worked" is not always reassuring.

The problem is not that learning from users is inherently unsafe — it is that Tay's objective (match user style) had no guardrail against harmful content. The learning mechanism worked fine. The objective was incomplete.

6. What is overfitting?

Right. Overfitting is the generalization failure — high training accuracy, poor real-world performance. The model learned the quirks of the training set instead of the underlying pattern.

Overfitting is about generalization failure. A model that overfits has essentially memorized its training examples rather than learning a rule that works on new data. Training score is high; real-world score is not.

7. Why couldn't Amazon's engineers fix the gender bias in their hiring AI by simply removing gender-related words from the training data?

Correct. Proxy variables are the core of why removing protected characteristics from data often does not fix discrimination. The correlation between other variables and gender persisted even without explicit gender signals.

The problem is proxy variables. Other features in the resume data — schools, activities, phrasing — already correlated with gender. The model used those instead, producing the same biased outcome through different signals.

8. The COMPAS recidivism tool was found to have different false positive rates for Black and white defendants. Race was not included in the model. What explains this finding?

Right. ZIP code, employment history, and school records all carry historical racial signals in American data, because race shaped access to neighborhoods, jobs, and education. The model did not need race explicitly — it was already in the other variables.

This is the proxy variable problem applied to criminal justice. Race was not in the data, but variables shaped by racial inequality were. The model learned those proxies and reproduced the discriminatory pattern through them.

9. Researchers proved in 2016–2017 that several common fairness definitions are mathematically incompatible. What does this mean for AI systems in practice?

Exactly. "Fair" is not a single standard — it is a choice. Any company claiming their AI is fair without specifying which fairness criterion they used (and acknowledging the tradeoffs) is giving you an incomplete picture.

The incompatibility proof means that claiming an AI is simply "fair" is not enough — you have to specify fair by which criterion, and acknowledge that other reasonable criteria would yield different outcomes for different groups.

10. What is RLHF and what did it allow OpenAI to do with GPT-3?

Right. RLHF is the bridge between "capable but unpredictable language model" and "useful AI assistant." Human ratings create a reward signal; the model is trained to optimize that signal.

RLHF is about behavioral alignment, not training speed or data filtering. It uses human preference ratings to fine-tune a model's behavior after the main pretraining is complete.

11. Why does the identity of RLHF raters matter for the final AI model's behavior on contested topics?

Correct. The raters do not just check grammar — they judge whether responses to difficult questions are appropriate. Those judgments are the data. They shape AI behavior on the questions that matter most.

Rater identity is central to what the AI learns. Averaging over many raters does not eliminate the effect — it encodes a weighted average of their values and cultural assumptions into the reward model.

12. Anthropic's Constitutional AI approach addresses which limitation of standard RLHF?

Right. Constitutional AI trades one form of implicit value encoding (rater behavior) for a more explicit form (a written document). This does not eliminate the values question — it makes it visible and debatable.

The key difference is transparency of values. Constitutional AI makes the governing principles explicit rather than hiding them in the implicit judgments of a rater pool. Cost reduction is a secondary benefit, not the primary purpose.

13. New York City passed a law in 2021 requiring bias audits of automated hiring tools. What does this represent in terms of AI governance?

Exactly. NYC's law is significant as an early governance attempt, but it is narrow (only hiring tools, only one city) and illustrates how regulation is moving far slower than AI deployment.

The NYC law requires bias audits — it does not ban the tools. It is a city-level law, not federal. It is one of the first examples of mandatory external accountability for algorithmic hiring decisions.

14. A news headline reads: "New AI scores 96% on medical diagnosis benchmark." What is the most important follow-up question to ask based on what you've learned in this module?

Exactly. Three questions separate a meaningful benchmark from marketing: was the score measured on training data or test data, was the test data truly held out, and is it demographically representative of real patients? They are what the module has been building toward.

The key questions from this module are about generalization and representation: is the score on genuinely new data? Does the test set reflect the real patient population? A 96% score on training data or on an unrepresentative test set could mean very little.

15. Which of the following best describes the "alignment problem" as discussed in Lesson 4?

Correct. Alignment is the gap between "what we can measure and specify" and "what we actually want." As AI becomes more capable, that gap becomes more dangerous — a capable system can find creative ways to optimize the wrong objective very effectively.

Alignment is not about industry coordination or hardware. It is the fundamental challenge of ensuring that what an AI optimizes for — what it actually pursues — corresponds to what is genuinely good, not just what is measurable.