Coded Unfair: AI Bias Exposed · Introduction

The Machine That Inherits Our Mistakes

Every powerful technology arrives promising neutrality — and delivers the prejudices of its makers.

When the U.S. Army began mass intelligence testing in 1917, giving the so-called Alpha and Beta tests to some 1.75 million recruits, the psychologists who designed them believed they were measuring innate intelligence with scientific objectivity. Within a decade, those scores were being used to argue for immigration quotas and forced sterilization laws. The instruments were mathematically rigorous. The inputs, the assumptions about what intelligence meant and who possessed it, were drenched in the social hierarchies of their era. The math laundered the prejudice. It made discrimination look like measurement.

That pattern is repeating right now, at a scale the Army's psychologists could not have imagined. Algorithms trained on decades of hiring records, lending decisions, criminal sentencing data, and medical diagnoses are being deployed by Fortune 500 companies, federal courts, insurance underwriters, and hospital systems. The inputs carry the sediment of every discriminatory choice made by every human who produced them. The outputs inherit that sediment. And because they arrive as a number or a score rather than a human opinion, they acquire a legitimacy that human opinions never could.

This course exists to make that invisible mechanism visible. We will examine specific documented cases — Amazon's scrapped hiring tool, the COMPAS recidivism algorithm, Google's photo classifier, dermatology AI trained almost entirely on light skin — and extract from each one a concrete, transferable lesson about how bias enters automated systems and what, if anything, can be done about it. We will not pretend there are easy fixes. But understanding the problem precisely is the first form of power available to anyone who encounters these systems — which, in 2025, means nearly everyone.

Coded Unfair: AI Bias Exposed · Lesson 1

Training Data Is Not Neutral Ground

What the algorithm learns depends entirely on what it was taught — and who produced the lessons.
How does historical human bias get mathematically encoded into a machine that has never met a human?

In 2014, an Amazon machine learning team based in Edinburgh began building a tool to automate the first stage of recruiting. The system would scan résumés and score candidates from one to five stars, much as shoppers rate products on Amazon. Engineers trained it on ten years of historical hiring data: résumés submitted to the company over the preceding decade. According to Reuters, which broke the story in October 2018, the team realized by 2015 that the system was not rating candidates for technical roles in a gender-neutral way. The model had learned to penalize résumés that contained the word "women's" (as in "captain of the women's chess club") and to downgrade graduates of two all-women's colleges. It had done this without being told to. It inferred the preference from the data: the people Amazon had historically hired in technical roles were overwhelmingly male, so the model treated maleness as a proxy for quality. Amazon disbanded the team by early 2017, and the project became public only in 2018.

No Amazon engineer wrote a rule saying "prefer men." The discrimination emerged from the data itself. This is the central mechanism of training-data bias, and understanding it changes how you read every claim that a particular algorithm is objective.

What Training Data Actually Is

A supervised machine learning model learns by finding patterns in labeled examples. You show it thousands of inputs — résumés, loan applications, X-ray images, sentences — paired with the "correct" answer someone attached to each one. The model adjusts its internal parameters until it can reliably reproduce those answers. Then it applies the pattern to new inputs it has never seen.

The critical word is labeled. Every label in a training dataset was produced by a human or a human institution. Hiring decisions. Loan approvals. Criminal sentences. Medical diagnoses. Each label reflects the knowledge, assumptions, capacity, and biases of whoever made that decision, in the social and legal context of whenever they made it. When the model learns from those labels, it is not learning some abstract ground truth. It is learning to reproduce human judgment — including the parts of that judgment that were systematically wrong.
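To make the mechanism concrete, here is a minimal sketch of a scorer that learns only from past decisions. Everything in it is hypothetical: the six records, the token names, and the averaging rule are illustrative assumptions, not a description of any real hiring system.

```python
from collections import defaultdict

# Hypothetical historical records: (resume tokens, past human decision).
# 1 means the applicant was hired, 0 means rejected.
HISTORY = [
    ({"python", "chess"}, 1),
    ({"python", "robotics"}, 1),
    ({"java", "chess"}, 1),
    ({"python", "women's", "chess"}, 0),
    ({"java", "women's", "robotics"}, 0),
    ({"java", "robotics"}, 1),
]

def learn_token_hire_rates(history):
    """For each token, the fraction of past resumes containing it that were hired."""
    seen = defaultdict(int)
    hired = defaultdict(int)
    for tokens, label in history:
        for token in tokens:
            seen[token] += 1
            hired[token] += label
    return {token: hired[token] / seen[token] for token in seen}

def score(resume_tokens, rates):
    """Average the learned hire rates of the tokens the model has seen before."""
    known = [rates[t] for t in resume_tokens if t in rates]
    return sum(known) / len(known) if known else 0.5

rates = learn_token_hire_rates(HISTORY)
print(score({"python", "chess"}, rates))
print(score({"python", "chess", "women's"}, rates))
```

No rule in this code mentions gender. The second résumé scores lower purely because, in the labels it learned from, every résumé containing that token was rejected.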

Why This Is Hard to See

The mathematical process of training a model obscures the human origin of the labels. By the time a model is deployed, its billions of parameters look nothing like the original data. The bias is no longer visible as a human opinion. It has been transformed into a weight, a threshold, a correlation — something that looks like an empirical fact about the world rather than a record of past human choices.

Three Pathways Bias Enters Training Data

Historical bias occurs when the world the data was collected from was already discriminatory. Amazon's résumé tool is a clean example: the underlying reality — that tech companies hired fewer women — was real. The model learned a real pattern. The problem is that a real pattern of discrimination becomes, when automated, a mechanism for perpetuating it indefinitely.

Representation bias occurs when some groups are simply missing from or underrepresented in the dataset. A landmark 2018 study by MIT researcher Joy Buolamwini and Timnit Gebru, published as "Gender Shades," found that three major commercial face analysis systems (from IBM, Microsoft, and Megvii, maker of Face++) misclassified the gender of light-skinned men less than 1% of the time but had error rates as high as 34.7% on dark-skinned women. The datasets used to build and benchmark systems like these contained far more images of light-skinned faces. The models performed worst on the people they had seen least.

Measurement bias occurs when the variable being measured is itself a flawed proxy for what you actually care about. The COMPAS recidivism algorithm, used in bail, sentencing, and parole decisions in a number of U.S. jurisdictions, predicted "risk of re-offense" using a 137-question survey. A 2016 ProPublica investigation found that Black defendants who did not go on to re-offend were labeled higher risk at roughly twice the rate of white defendants who did not re-offend. One of COMPAS's inputs was prior arrests, not convictions. Prior arrests reflect policing patterns, not actual crime rates. Neighborhoods that receive heavier policing generate more arrests. The measurement was biased before it ever reached the algorithm.

Key Terms

Training data: The labeled dataset of examples used to teach a supervised machine learning model. The model learns to reproduce the patterns, including errors and biases, present in this data.
Historical bias: Bias that enters a model because the world from which training data was collected was already characterized by discriminatory practices or unequal outcomes.
Representation bias: Bias caused by systematic underrepresentation of certain groups in training data, causing the model to perform worse on those groups.
Measurement bias: Bias introduced when the features used to train a model are flawed proxies for the outcome of interest, for example using arrest rates as a proxy for criminal behavior.
Proxy discrimination: When a model uses a variable that correlates with a protected characteristic (race, gender, etc.) to produce discriminatory outcomes, even without directly using that characteristic.
The Core Insight of Lesson 1

An algorithm trained on biased data will reproduce that bias with mathematical precision. The model is not malfunctioning — it is doing exactly what it was designed to do. This means the problem is not a bug to be patched but a structural feature of how these systems are built, requiring structural responses.

Why It Compounds

Training-data bias is not a static problem. It tends to compound over time through a feedback loop. A biased hiring algorithm screens out qualified candidates from underrepresented groups. Those people are not hired. Future training data — collected from subsequent hiring decisions — contains even fewer examples of those candidates being hired successfully. The next version of the model, trained on that data, learns an even stronger version of the original bias. Researchers call this runaway feedback or performative prediction: the algorithm's decisions shape the reality that will be used to evaluate and retrain it.
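The loop can be sketched as a toy simulation. The update rule here is an assumption made purely for illustration: each retrained model is taken to favor a group in proportion to its share of past hires raised to a power (`sharpen`), so any value above 1 exaggerates whatever gap the previous round produced. Real systems are far messier, and the function name and parameters are hypothetical.

```python
def simulate_retraining(initial_share, rounds, sharpen=2.0):
    """Track an underrepresented group's share of hires across retraining rounds.

    Assumption: each retrained model favors groups in proportion to
    (share of past hires) ** sharpen, so sharpen > 1 widens the gap.
    """
    share = initial_share
    history = [share]
    for _ in range(rounds):
        favored = (1.0 - share) ** sharpen
        disfavored = share ** sharpen
        # The next round's hires become the next model's training data.
        share = disfavored / (favored + disfavored)
        history.append(share)
    return history

for round_number, share in enumerate(simulate_retraining(0.30, rounds=4)):
    print(f"round {round_number}: {share:.1%} of hires")
```

Starting from a 30% share, the group's share of hires falls in every round. Starting from an even 50/50 split, the same rule leaves the split unchanged, which is the point: the loop amplifies an imbalance that is already in the data.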

Skewed delivery of opportunities has been documented in deployed systems. A 2019 study by Anja Lambrecht of London Business School and Catherine Tucker of MIT found that an online ad for STEM careers, explicitly intended to be gender-neutral, was shown to men more often than to women, because the ad-delivery algorithm optimized for cost and younger women were a more expensive audience to reach. Outcomes like these become part of the record that later systems learn from.

Lesson 1 Quiz

Training Data Is Not Neutral Ground · 4 questions
Amazon's résumé-screening AI penalized candidates who attended all-female colleges. What was the primary cause of this behavior?
Correct. No engineer wrote a discriminatory rule. The bias emerged automatically from the pattern the model found in a decade of historically male-dominated hiring decisions — a textbook example of historical bias in training data.
Not quite. The bias was not intentional and did not come from an explicit rule. It emerged from the training data itself: ten years of hiring decisions in which men were overwhelmingly selected for technical roles. The model learned that pattern and reproduced it.
Joy Buolamwini's 2018 "Gender Shades" study found that IBM and Microsoft face recognition systems had error rates as high as 34.7% on which group?
Correct. The systems performed nearly perfectly on light-skinned men (under 1% error) and worst on dark-skinned women (up to 34.7% error). The root cause was representation bias: training datasets contained far more images of light-skinned individuals.
Incorrect. The highest error rates — up to 34.7% — were found for dark-skinned women. Light-skinned men had error rates below 1%. This disparity is a clear example of representation bias: the model had seen far fewer dark-skinned faces during training.
The COMPAS recidivism algorithm used prior arrests as one of its inputs. Why does this introduce measurement bias?
Correct. Arrests are not convictions, and areas with heavier policing produce more arrests regardless of actual crime levels. When COMPAS used arrest records as a signal, it was inadvertently encoding policing disparities into its risk scores — a classic case of measurement bias through a flawed proxy.
That is not the right framing. The issue is not legality but accuracy: arrest records measure how much police attention a neighborhood receives, not how much crime occurs there. Using arrests as a proxy for criminal behavior embeds existing policing disparities directly into the algorithm's outputs.
What is "runaway feedback" in the context of biased AI systems?
Correct. Runaway feedback (also called performative prediction) means the algorithm's biased outputs alter the reality that produces its next training dataset. A hiring tool that screens out women results in fewer women being hired, which produces future training data with even fewer successful female hires, which trains an even more biased model.
Incorrect. Runaway feedback refers to the self-reinforcing cycle in which a biased model's decisions shape the future data used to retrain it. Because the model determines outcomes, and those outcomes become training labels, an initial bias can grow stronger with each successive version of the model.

Lab 1 — The Data Audit

Investigate training data bias with an AI research assistant · 3 exchanges to complete

Your Task

You are reviewing a fictional hiring algorithm before it goes into production. The system was trained on five years of hiring decisions at a large consulting firm. Your job is to interrogate the AI assistant about the training data and identify which types of bias may be present.

Suggested starting points: Ask about the composition of the training dataset, what labels were used and who created them, or how the system handles applicants from groups underrepresented in the historical data. Push back on answers that seem to dismiss concerns.
Bias Audit Assistant
Lab 1 · Training Data
I'm your bias audit assistant for Lab 1. I represent the data team that built the consulting firm's hiring algorithm. You're here to audit the training data before this system goes live. Ask me anything about how the data was collected, labeled, and used — I'll answer as the data team would. What would you like to know first?
Coded Unfair: AI Bias Exposed · Lesson 2

When the Label Is the Lie

Ground truth is only as trustworthy as the humans who produced it — and the institutions that shaped those humans.
If the correct answer in your training data was wrong all along, how would you know?

In 2017, Stanford researchers published a landmark paper in Nature describing an AI system that matched board-certified dermatologists at classifying skin lesions from photographs. The system was trained on 129,450 clinical images. The media coverage was electric: here was proof that AI could democratize expert diagnosis, bringing dermatology-quality care to patients who had no access to specialists. What the coverage largely omitted was the makeup of the dataset: the images overwhelmingly depicted light skin. When independent researchers later tested similar systems on darker skin tones, accuracy dropped substantially. The labels on the training images were provided by expert dermatologists. They were not wrong in any simple sense. But the data the dermatologists had generated over decades of clinical practice reflected the demographics of who had historically sought and received dermatological care in the United States.

What "Ground Truth" Means in Practice

In machine learning, the labels attached to training examples are called ground truth — the definitive correct answer the model is trying to learn. The term implies objectivity. It suggests the labels reflect reality rather than interpretation. This framing is often wrong.

Every label in a training dataset was produced by someone. That someone operated within institutional constraints, applied professional standards developed at a particular historical moment, and brought both explicit and implicit assumptions to the labeling task. When we train a model on those labels, we are not training it on reality. We are training it on a particular human community's interpretation of reality — an interpretation shaped by who was in that community, what resources they had, and whose problems they were trying to solve.

The St. George's Hospital Case

In 1988, the UK Commission for Racial Equality found that St. George's Hospital Medical School in London had been discriminating through a computer program it had used to shortlist applicants since 1979. The program had been built to reproduce the school's historical admissions decisions. The inquiry found it had been systematically marking down women and applicants with non-European names, not because anyone set out to build a discriminatory tool, but because the historical decisions it was modeled on reflected the biases of the selection committees of the 1970s. This case predates modern machine learning by decades, yet the mechanism is identical.

Label Bias in Criminal Justice

The labeling problem is especially visible in criminal justice AI. Systems that predict recidivism are trained on labels like "re-arrested within two years." But re-arrest is not the same as re-offense. People who are on parole, who live in heavily policed neighborhoods, or who are Black are re-arrested at higher rates than others who commit identical acts but face less police surveillance. The label — re-arrested — reflects system behavior, not underlying behavior. A model trained on this label learns to predict who will be caught, which is heavily correlated with race and geography, rather than who will commit a crime.
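The gap between the label and the behavior can be shown with simple arithmetic. The sketch below assumes two hypothetical populations with the same true re-offense rate but different chances that an offense leads to an arrest. The rates and the function name are illustrative assumptions, not estimates from any dataset.

```python
def rearrest_label_rate(reoffense_rate, detection_prob):
    """Expected share of people who end up labeled 're-arrested' in the training data."""
    return reoffense_rate * detection_prob

# Hypothetical: identical behavior, different levels of police surveillance.
heavily_policed = rearrest_label_rate(reoffense_rate=0.30, detection_prob=0.60)
lightly_policed = rearrest_label_rate(reoffense_rate=0.30, detection_prob=0.30)

print(f"heavily policed: {heavily_policed:.2f}")
print(f"lightly policed: {lightly_policed:.2f}")
```

Under these assumptions the "re-arrested" label appears twice as often in the heavily policed population even though behavior is identical, so a model trained on the label would learn the surveillance gap as if it were a behavioral one.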

This distinction was central to the ongoing academic debate sparked by the 2016 ProPublica investigation into COMPAS. Northpointe (now Equivant), the company that built COMPAS, argued that their system was "fair" because it had equal predictive accuracy across racial groups. ProPublica argued it was unfair because it had dramatically different false positive rates — labeling Black defendants as high-risk when they would not re-offend at roughly twice the rate it did for white defendants. Both claims are mathematically true. They represent different definitions of fairness applied to the same numbers.
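A small worked example shows how both claims can hold at once. The counts below are hypothetical and chosen for illustration only (they are not COMPAS data). The two groups have different base rates of re-offense, the share of "high risk" labels that turn out to be correct is identical, and the false positive rates differ sharply.

```python
from dataclasses import dataclass

@dataclass
class Confusion:
    tp: int  # labeled high risk, did re-offend
    fp: int  # labeled high risk, did not re-offend
    fn: int  # labeled low risk, did re-offend
    tn: int  # labeled low risk, did not re-offend

    def base_rate(self):
        return (self.tp + self.fn) / (self.tp + self.fp + self.fn + self.tn)

    def precision(self):
        """Of those labeled high risk, the share who did re-offend."""
        return self.tp / (self.tp + self.fp)

    def false_positive_rate(self):
        """Of those who did not re-offend, the share labeled high risk."""
        return self.fp / (self.fp + self.tn)

# Hypothetical counts for two groups of 100 people each.
group_1 = Confusion(tp=40, fp=20, fn=10, tn=30)
group_2 = Confusion(tp=20, fp=10, fn=10, tn=60)

for name, g in [("group 1", group_1), ("group 2", group_2)]:
    print(f"{name}: base rate {g.base_rate():.2f}, "
          f"precision {g.precision():.2f}, "
          f"false positive rate {g.false_positive_rate():.2f}")
```

A vendor reporting precision would call this system even-handed. An auditor reporting false positive rates would not. Neither has made an arithmetic mistake.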

Key Terms

Ground truth: The labels attached to training data, treated as the definitive correct answer. In practice, ground truth reflects the judgments of whoever created the labels and is subject to the same biases as any human judgment.
Label bias: Systematic error introduced into training data when the labels themselves were produced through biased processes, for example diagnoses made by practitioners who rarely treated certain patient populations.
Construct validity: The degree to which a measured variable actually captures the underlying concept it is supposed to represent. Re-arrest, for instance, has low construct validity as a measure of re-offense.
Fairness impossibility theorem: A mathematical result showing that several common definitions of algorithmic fairness cannot all be satisfied simultaneously when base rates differ between groups, forcing developers to make an explicit value choice about which form of fairness to prioritize.
The Deeper Problem

Label bias is harder to detect than representation bias because the data looks complete. You have examples from every group. You have expert labels. The error is not that data is missing — it is that the data faithfully records a biased past. Auditing for this requires not just looking at the dataset but understanding the history of the institution that produced it.

The Annotation Problem at Scale

Modern large-scale AI systems require massive quantities of labeled data, which means the labeling work is typically distributed across thousands of low-paid workers, often employed through platforms like Amazon Mechanical Turk or Appen. Research by scholars including Lilly Irani and Mary Gray has documented the working conditions in this "ghost work" economy — the invisible human labor behind AI's apparent automation. These annotation workers bring their own cultural contexts, language assumptions, and blind spots to labeling tasks. A 2021 study found that crowd-sourced content moderation labels for what constituted "offensive language" varied significantly by annotator demographics, with white annotators being substantially more likely to label African American Vernacular English as offensive than Black annotators were. The training data for toxicity classifiers — used by major platforms to moderate billions of posts — carries this annotation artifact forward.

Lesson 2 Quiz

When the Label Is the Lie · 4 questions
The Stanford dermatology AI published in Nature in 2017 performed well overall but poorly on darker skin tones. What best explains this?
Correct. The dataset was a faithful record of the clinical images dermatologists had generated over decades — but those images reflected which patients had historically sought and received specialist care. The AI performed worst on patients least represented in its training data.
Incorrect. The issue was representation bias in the training data. The 129,450 images overwhelmingly depicted light skin because that reflected decades of real clinical practice. The algorithm's poor performance on darker skin tones was a predictable consequence of who was and wasn't in the training set.
St. George's Hospital Medical School used a computer admissions program from 1979 that was found in 1988 to discriminate against women and non-European applicants. What mechanism drove this?
Correct. The St. George's case is one of the earliest documented examples of what we now call historical bias in automated decision-making. The computer did not invent the discrimination — it reproduced and systematized discriminatory decisions that humans had already been making for years.
Incorrect. The discrimination was not intentionally coded or caused by a bug. The program learned to replicate the decisions made by human selection committees in the 1970s — decisions that reflected the era's biases against women and non-European applicants. This is a pre-modern-AI example of historical bias.
ProPublica found that COMPAS had higher false positive rates for Black defendants than for white defendants. Northpointe argued the system was fair because it had equal predictive accuracy across groups. How should we understand this disagreement?
Correct. This is one of the most important lessons in algorithmic fairness: multiple mathematical definitions of fairness exist, they can conflict with each other, and when base rates differ between groups (which they do in criminal justice due to prior systemic inequities), satisfying all definitions simultaneously is mathematically impossible. The choice between them is a policy and ethical decision, not a purely technical one.
Incorrect. Both analyses are mathematically valid — they simply measure different things. The fairness impossibility theorem establishes that when base rates differ between groups, you cannot simultaneously satisfy all common mathematical definitions of algorithmic fairness. The COMPAS debate illustrates this: equal predictive accuracy and equal false positive rates are genuinely incompatible under those conditions.
A 2021 study found that white crowd-sourced annotators labeled African American Vernacular English as "offensive" at higher rates than Black annotators did. What risk does this create for AI systems trained on such data?
Correct. If the training labels encode one demographic group's perception of what counts as offensive, the resulting model will apply that standard universally — effectively treating the linguistic norms of the majority-annotator group as the neutral baseline and penalizing departures from it.
Incorrect. The risk runs in the opposite direction. If annotations over-flagged African American Vernacular English, a toxicity classifier trained on those labels would likely remove more legitimate speech from Black users rather than simply catching more harmful content. This is a form of label bias with real consequences for platform moderation at scale.

Lab 2 — Ground Truth on Trial

Interrogate label quality in a medical AI system · 3 exchanges to complete

Your Task

A hospital network is deploying a wound-classification AI to help nurses triage skin wounds. The system was trained on images labeled by dermatologists at three academic medical centers in the U.S. Northeast. You are a health equity consultant brought in to evaluate the label quality before deployment.

Consider asking: Who were the patients whose images were used? Did all patient demographics appear equally in the training data? How were labeling disagreements between dermatologists resolved? What does "accuracy" mean for this system, and for whom?
Medical AI Label Auditor
Lab 2 · Label Bias
Welcome to Lab 2. I'm representing the clinical informatics team that built the wound-classification AI. You're auditing our label quality before we deploy at six community hospitals, three of which serve predominantly communities of color. What would you like to examine first?
Coded Unfair: AI Bias Exposed · Lesson 3

The Architecture of Discrimination

Bias is not only in what you feed a model — it is in what you ask it to optimize for, and how you measure success.
Can a model be technically correct and systematically unjust at the same time?

In June 2015, software developer Jacky Alciné discovered that Google Photos had automatically labeled photos of himself and a friend, both Black, as "gorillas." Google apologized immediately and within days removed the label category entirely. The technical failure is widely attributed to an image classification model trained on datasets that dramatically underrepresented dark-skinned faces. But the case also reveals something beyond representation bias. A model like this is tuned for overall classification accuracy across all images in the training set. Because dark-skinned individuals were rare in that set, misclassifying them had almost no effect on the aggregate accuracy metric the engineers were watching. The objective (maximize accuracy) was met. The outcome was catastrophic. The problem was not just the data; it was the metric the developers chose to define success.

Objective Functions and What They Optimize For

Every machine learning model is built around an objective function: a mathematical formula that defines what the model is trying to maximize or minimize. In classification tasks, the training objective is usually a loss averaged over every example, and success is usually reported as accuracy: the percentage of examples the model gets right across the entire dataset. This sounds reasonable until you consider what happens when the dataset is imbalanced.

If 95% of your training examples belong to Group A and 5% belong to Group B, a model that ignores Group B entirely and just predicts the majority-group outcome for everyone will achieve 95% accuracy. It has technically achieved the objective while completely failing Group B. This is known as the accuracy paradox, and it is not a theoretical edge case. It describes the actual situation in many deployed AI systems in healthcare, criminal justice, and credit scoring.
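The arithmetic is easy to check. This sketch builds a hypothetical 100-example dataset with the 95/5 split described above and, as a simplifying assumption, gives every Group A example the outcome 0 and every Group B example the outcome 1. The "model" is a stand-in that ignores its input.

```python
# Hypothetical dataset: (group, true outcome). The clean split of outcomes
# by group is a simplifying assumption for illustration.
examples = [("A", 0)] * 95 + [("B", 1)] * 5

def majority_model(group):
    """Ignore the input and always predict the majority outcome."""
    return 0

def accuracy(rows):
    """Share of rows where the model's prediction matches the true outcome."""
    return sum(majority_model(group) == outcome for group, outcome in rows) / len(rows)

overall = accuracy(examples)
group_b = accuracy([row for row in examples if row[0] == "B"])

print(f"overall accuracy: {overall:.2f}")
print(f"Group B accuracy: {group_b:.2f}")
```

The headline number is 0.95 while every single Group B example is wrong.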

The Illinois DCFS Child Welfare Case, 2022

A child welfare algorithm used by Illinois's Department of Children and Family Services to predict child abuse risk was found by researchers at the University of Michigan in 2022 to flag Black families at substantially higher rates than white families with comparable risk profiles. The system was optimized for predictive accuracy on its full dataset. Because Black families had historically been investigated at higher rates — due to disparate reporting and surveillance — they were overrepresented in the "high risk" labeled examples. The model learned that pattern and reproduced it, achieving strong overall accuracy while systematically over-surveilling one group.

What Gets Measured, What Gets Managed

Developers track the metrics they choose to track. In the early deployment of many AI systems, those metrics were aggregate ones: overall accuracy, area under the curve, mean average precision. These metrics report performance averaged across all examples and all groups. They are entirely insensitive to the distribution of errors across subgroups — they cannot tell you whether a model that is 92% accurate overall is 99% accurate for one group and 70% accurate for another.
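Breaking a metric down by subgroup takes only a few lines. The helper below is a minimal sketch with hypothetical names, and the evaluation set is fabricated for illustration: its group sizes were chosen so the results match the 92% / 99% / 70% figures in the paragraph above.

```python
from collections import defaultdict

def accuracy_by_group(rows):
    """rows: iterable of (group, y_true, y_pred). Returns {group: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in rows:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}

# Hypothetical evaluation set: 2,200 examples from group A (2,178 correct)
# and 700 examples from group B (490 correct).
rows = (
    [("A", 1, 1)] * 2178 + [("A", 1, 0)] * 22
    + [("B", 1, 1)] * 490 + [("B", 1, 0)] * 210
)

overall = sum(y_true == y_pred for _, y_true, y_pred in rows) / len(rows)
per_group = accuracy_by_group(rows)

print(f"overall: {overall:.2f}")
for group, acc in sorted(per_group.items()):
    print(f"group {group}: {acc:.2f}")
```

The single overall figure and the per-group figures come from exactly the same predictions. Only the second view shows who is bearing the errors.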

In 2019, researchers at Google published a framework called Model Cards, a standardized approach to documenting AI systems that includes disaggregated performance metrics: accuracy broken down by gender, age, race, and other relevant subgroups. The year before, IBM had released AI Fairness 360, an open-source toolkit providing over 70 fairness metrics along with bias mitigation algorithms. These tools exist precisely because aggregate metrics were insufficient.

But measurement reform alone is insufficient. Choosing which fairness metrics to report is itself a value-laden decision. Reporting disaggregated accuracy tells you something, but not which threshold of disparity is acceptable, or what trade-off between group accuracy and overall accuracy is justified.

Key Terms

Objective function: The mathematical formula a model is trained to optimize, often overall accuracy, log loss, or AUC. The choice of objective function determines what the model values and which errors it is willing to make.
Accuracy paradox: The counterintuitive situation where a model achieves high overall accuracy by performing well on the majority class while failing almost entirely on a minority class, because minority-class errors barely affect the aggregate metric.
Disaggregated metrics: Performance statistics reported separately for different demographic subgroups rather than averaged across all examples. Disaggregated metrics reveal disparities that aggregate metrics hide.
Model Cards: A documentation standard proposed by Google researchers in 2019 that recommends AI systems report disaggregated performance metrics across relevant subgroups, intended to make bias visible at deployment time.
Subgroup validity: The extent to which a model's performance on the overall dataset accurately reflects its performance on specific subgroups. Low subgroup validity means the model may be broadly accurate while failing specific populations.
The Design Choice Nobody Labels as a Design Choice

Optimizing for overall accuracy is presented as a default, a neutral technical starting point. It is not. It is a choice to treat errors as equal regardless of who bears them. When a model fails on a minority group, those errors are numerically small and get averaged away. Making that choice visible — and asking whether it is defensible — is one of the most important interventions a human can make in the AI development process.

The Threshold Problem

Classification models typically output a probability — a number between 0 and 1 — and then apply a threshold to convert that to a binary decision. At what probability score does a loan applicant get denied? At what recidivism score does a defendant get held without bail? The threshold is chosen by humans, and where you set it determines the balance between false positives and false negatives.
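The trade-off can be seen by sweeping the threshold over a handful of scored examples. The eight scores and outcomes below are hypothetical, and the convention that a score at or above the threshold counts as a positive decision is an assumption of this sketch.

```python
# Hypothetical model outputs: (predicted probability, true outcome).
scored = [
    (0.10, 0), (0.25, 0), (0.35, 0), (0.40, 1),
    (0.55, 0), (0.60, 1), (0.80, 1), (0.90, 1),
]

def errors_at(threshold, scored):
    """Return (false positives, false negatives) when scores >= threshold are flagged."""
    false_positives = sum(1 for score, outcome in scored
                          if score >= threshold and outcome == 0)
    false_negatives = sum(1 for score, outcome in scored
                          if score < threshold and outcome == 1)
    return false_positives, false_negatives

for threshold in (0.3, 0.5, 0.7):
    fp, fn = errors_at(threshold, scored)
    print(f"threshold {threshold}: {fp} false positives, {fn} false negatives")
```

Raising the threshold trades false positives for false negatives. The model is identical in all three cases; only the human-chosen cutoff changes who is wrongly flagged and who is wrongly missed.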

A 2019 study published in Science by Ziad Obermeyer and colleagues examined a commercial algorithm used by U.S. health systems and insurers to identify patients for care management programs. The algorithm used healthcare spending as a proxy for medical need. Because Black patients historically had less access to care, they spent less on average than white patients with equivalent levels of illness. The model therefore assigned lower risk scores to Black patients, systematically underenrolling them in the care management programs. Setting a single cost-based threshold for enrollment, without auditing its effect by race, produced a system that was algorithmically coherent and medically inequitable. Correcting the bias required not just changing the threshold but replacing the proxy variable entirely.
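The effect of the proxy can be illustrated with a toy enrollment rule. The eight patients below are hypothetical, as are the column choices: the number of chronic conditions stands in for medical need, and annual spending is set lower for Black patients at the same level of illness, mirroring the pattern the study describes. None of the figures come from the study itself.

```python
# Hypothetical patients: (group, chronic_conditions, annual_spending_usd).
patients = [
    ("white", 5, 10000), ("white", 3, 7000), ("white", 2, 5000), ("white", 1, 3000),
    ("Black", 5, 6000), ("Black", 4, 5500), ("Black", 2, 2500), ("Black", 1, 1500),
]

def enroll_top(patients, key_index, slots):
    """Enroll the `slots` patients with the highest value in the chosen column."""
    ranked = sorted(patients, key=lambda patient: patient[key_index], reverse=True)
    return ranked[:slots]

def count_group(enrolled, group):
    return sum(1 for patient in enrolled if patient[0] == group)

by_spending = enroll_top(patients, key_index=2, slots=3)  # proxy: cost
by_need = enroll_top(patients, key_index=1, slots=3)      # direct measure of illness

print("Black patients enrolled when ranking by spending:", count_group(by_spending, "Black"))
print("Black patients enrolled when ranking by conditions:", count_group(by_need, "Black"))
```

Neither ranking uses race as an input. Swapping the ranking variable from spending to a more direct measure of illness is what changes who gets enrolled.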

Lesson 3 Quiz

The Architecture of Discrimination · 4 questions
In the Google Photos incident of 2015, why did optimizing for overall accuracy fail to prevent the misclassification of Black users' photos?
Correct. This is the accuracy paradox in action. When a subgroup is small, even severe misclassification of that group barely moves the aggregate accuracy number. Engineers watching the aggregate metric would see nothing wrong. The errors only become visible when you disaggregate the metric by subgroup.
Incorrect. The failure was structural, not a matter of negligence. Because Black individuals were underrepresented in the training data, errors on that group had a trivially small effect on the overall accuracy number. The engineers were watching a metric that was genuinely incapable of revealing this problem.
What is the accuracy paradox?
Correct. The accuracy paradox is why aggregate accuracy is an insufficient metric for evaluating AI fairness. A model can be technically achieving its optimization target — high accuracy — while completely failing a subgroup whose errors, being few in absolute number, are averaged away in the aggregate statistic.
Incorrect. The accuracy paradox refers specifically to the situation where high aggregate accuracy masks severe failure on a minority group. Because the minority group represents a small fraction of the dataset, even 100% error rates for that group may move overall accuracy only marginally — making the problem invisible to aggregate metrics.
The Obermeyer et al. 2019 study found that a healthcare algorithm underenrolled Black patients in care management programs. What was the root cause?
Correct. This is a sophisticated example of measurement bias compounded by threshold setting. Healthcare spending is a reasonable-sounding proxy for medical need — sicker people spend more. But that relationship breaks down when access to care is unequal. Black patients with equivalent illness severity spent less, so the model rated them as healthier. The bias was in the proxy variable, not the algorithm's math.
Incorrect. Race was not an explicit input. The problem was that healthcare spending — the proxy variable — was itself racially confounded: Black patients with equivalent illness spent less due to historical barriers to care access. The algorithm was technically race-blind but produced racially disparate outcomes because the variable it used correlated with race in ways the developers had not examined.
What are disaggregated metrics and why do they matter for AI fairness evaluation?
Correct. Disaggregated metrics — advocated through tools like Google's Model Cards and IBM's AI Fairness 360 — report how a model performs for men vs. women, for different racial groups, for different age brackets, etc. Only by breaking down performance this way can you identify situations where strong aggregate performance hides severe subgroup failure.
Incorrect. Disaggregated metrics report model performance separately for different demographic subgroups rather than as a single averaged number. They are essential for fairness auditing precisely because they reveal what aggregate metrics hide: that a model with 92% overall accuracy might be 99% accurate for one group and 70% accurate for another.

Lab 3 — Metric Design Workshop

Challenge an AI team's success metrics before deployment · 3 exchanges to complete

Your Task

A large bank is deploying an AI loan underwriting system. The engineering team reports 94% overall accuracy on their test set and is ready to go live. You are a fairness auditor. Your job is to probe what that 94% means, which errors are included in it, and whether the metric adequately captures the system's performance across applicant demographics.

Try asking: What does the 94% figure represent — which metric, on which test set? How does accuracy break down for applicants from different income levels, races, or zip codes? What is the false positive rate versus the false negative rate, and for whom? Would you be comfortable if the 6% error rate fell entirely on one demographic group?
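The breakdown these questions ask for can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the 100 records are invented, groups "A" and "B" are placeholders, and `rates_by_group` is a hypothetical helper, not part of any bank's system. In this sketch, an actual value of 1 means the applicant would have repaid and a predicted value of 1 means the model approved the loan.

```python
from collections import defaultdict

def rates_by_group(records):
    """Compute accuracy, false positive rate, and false negative rate per group.

    records: iterable of (group, actual, predicted) with 0/1 labels.
    """
    tallies = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, actual, predicted in records:
        if predicted == 1:
            outcome = "tp" if actual == 1 else "fp"
        else:
            outcome = "tn" if actual == 0 else "fn"
        tallies[group][outcome] += 1

    report = {}
    for group, t in tallies.items():
        total = sum(t.values())
        negatives = t["fp"] + t["tn"]
        positives = t["tp"] + t["fn"]
        report[group] = {
            "accuracy": (t["tp"] + t["tn"]) / total,
            "fpr": t["fp"] / negatives if negatives else None,
            "fnr": t["fn"] / positives if positives else None,
        }
    return report

# Invented test set: 90 applicants in group A, 10 in group B.
records = (
    [("A", 1, 1)] * 50 + [("A", 0, 0)] * 38 + [("A", 0, 1)] * 1 + [("A", 1, 0)] * 1
    + [("B", 1, 1)] * 3 + [("B", 0, 0)] * 3 + [("B", 1, 0)] * 4
)

overall = sum(actual == predicted for _, actual, predicted in records) / len(records)
report = rates_by_group(records)
print("overall accuracy:", overall)
for group, metrics in sorted(report.items()):
    print(group, metrics)
```

With these invented numbers the overall accuracy is 94%, yet group B's accuracy is 60%, and more than half of its creditworthy applicants are wrongly denied. The aggregate figure alone would not reveal that.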
Loan AI Fairness Auditor
Lab 3 · Metrics & Optimization
Welcome to Lab 3. I'm the lead ML engineer on the loan underwriting system. We've hit 94% accuracy on our held-out test set of 50,000 applications, and the business team wants to go live next quarter. You're here to sign off from a fairness perspective. What questions do you have about our evaluation methodology?
Coded Unfair: AI Bias Exposed · Lesson 4

Who Gets to Fix It — and How

Technical mitigations exist, but they are insufficient without accountability structures, legal frameworks, and human oversight.
If bias is structural, can it be fixed with a technical patch — or does fixing it require changing the structure?

HireVue's AI-powered video interview system analyzed candidates' facial expressions, vocal tone, word choice, and body language to generate a "hirability" score. By 2019, the system had been used to screen more than 10 million job seekers for companies including Goldman Sachs, Unilever, and Hilton. In January 2021, following pressure from the Electronic Privacy Information Center — which had filed an FTC complaint in 2019 — HireVue announced it was dropping facial expression analysis from its assessments. The company acknowledged that the facial analysis component had not been validated for bias and that scientific support for inferring personality from facial movement was weak. The pivot was partially a technical fix — remove the unreliable input — but it was primarily a legal and reputational response. The technical problem had existed since deployment. What changed was external accountability pressure, not internal technical discovery.

Technical Mitigations and Their Limits

Researchers have developed a range of technical approaches to reducing algorithmic bias. These fall into three general categories: pre-processing (modifying training data before training to remove or balance bias), in-processing (modifying the training algorithm itself to incorporate fairness constraints), and post-processing (adjusting the model's outputs after training to equalize outcomes across groups).

Pre-processing techniques include re-sampling to balance underrepresented groups and re-weighting examples so that minority-group errors are penalized more heavily. A related technique, adversarial debiasing, trains a secondary model to detect demographic signals in the primary model's representations and penalizes the primary model for relying on them; because it operates during training, it is usually classified as an in-processing method. IBM's AI Fairness 360 toolkit implements a number of these approaches.
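As a rough illustration of re-weighting, the sketch below gives every example a weight inversely proportional to the size of its group, so that each group contributes the same total weight to training. This is one simple weighting scheme chosen for illustration, not the specific method of any toolkit named above; the function name and the ten-example dataset are hypothetical.

```python
from collections import Counter

def balanced_group_weights(groups):
    """Weight each example by n / (k * size of its group).

    n is the number of examples and k the number of distinct groups,
    so every group ends up with the same total weight.
    """
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]

# Invented dataset: 8 majority-group examples, 2 minority-group examples.
groups = ["majority"] * 8 + ["minority"] * 2
weights = balanced_group_weights(groups)

majority_total = sum(w for g, w in zip(groups, weights) if g == "majority")
minority_total = sum(w for g, w in zip(groups, weights) if g == "minority")
print("per-example weights:", weights[0], weights[-1])
print("total weight per group:", majority_total, minority_total)
```

Each minority-group example here counts four times as much as a majority-group example, so an error on it costs the model four times as much during training. The weights fix only the imbalance in counts; they do nothing about labels that are themselves biased.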

In-processing approaches include adding fairness constraints directly to the objective function — for example, requiring that the model's false positive rate differs by no more than a specified percentage between groups. This directly trades off against overall accuracy; making the model fairer in this sense makes it less accurate in aggregate, and how much of that trade-off is acceptable is a policy question, not a technical one.
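A minimal sketch of what "adding a fairness constraint to the objective function" can look like: the ordinary loss plus a penalty that grows when the gap in false positive rates between groups exceeds a tolerance. The function names, the penalty weight `lam`, the tolerance, and the data are all assumptions made for illustration. Because a thresholded false positive rate is not differentiable, practical training methods generally rely on smoother surrogates; this sketch only evaluates the objective and does not train anything.

```python
import math

def log_loss(y_true, scores):
    """Average binary cross-entropy of predicted scores against 0/1 labels."""
    eps = 1e-12
    total = 0.0
    for y, s in zip(y_true, scores):
        s = min(max(s, eps), 1 - eps)  # keep log() away from 0
        total += -(y * math.log(s) + (1 - y) * math.log(1 - s))
    return total / len(y_true)

def false_positive_rate(y_true, scores, threshold=0.5):
    """Share of true negatives whose score is at or above the threshold."""
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not negatives:
        return 0.0
    return sum(s >= threshold for s in negatives) / len(negatives)

def fairness_penalized_loss(y_true, scores, groups, lam=1.0, tolerance=0.05):
    """Log loss plus lam times the FPR gap in excess of the tolerance."""
    by_group = {}
    for y, s, g in zip(y_true, scores, groups):
        ys, ss = by_group.setdefault(g, ([], []))
        ys.append(y)
        ss.append(s)
    fprs = [false_positive_rate(ys, ss) for ys, ss in by_group.values()]
    gap = max(fprs) - min(fprs)
    return log_loss(y_true, scores) + lam * max(0.0, gap - tolerance)

# Invented scores: the model is harsher on true negatives in group "b".
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
scores = [0.2, 0.3, 0.8, 0.9, 0.6, 0.7, 0.8, 0.9]
groups = ["a"] * 4 + ["b"] * 4

base = log_loss(y_true, scores)
penalized = fairness_penalized_loss(y_true, scores, groups, lam=1.0, tolerance=0.05)
print("plain loss:", round(base, 4))
print("penalized loss:", round(penalized, 4))
```

The value chosen for `lam` is where the policy question enters: a larger value tells the optimizer to give up more accuracy in exchange for a smaller gap between groups.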

Post-processing approaches include adjusting decision thresholds separately for different groups so that error rates are equalized. This approach was applied in research on COMPAS: researchers showed that by adjusting the risk threshold for Black and white defendants separately, you could equalize false positive rates — but only by also reducing overall predictive accuracy. Again, the trade-off is real and unavoidable.
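Group-specific thresholds can be sketched as follows. The data is invented and is not COMPAS data: scores run from 1 to 10, a label of 1 means the person re-offended, and anyone scoring at or above the threshold is flagged as high risk. The function names and the 25% target are hypothetical, and the sketch assumes every group contains at least one person who did not re-offend.

```python
def fpr_at(threshold, y_true, scores):
    """False positive rate: share of non-re-offenders flagged as high risk."""
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    return sum(s >= threshold for s in negatives) / len(negatives)

def threshold_for_target_fpr(y_true, scores, target_fpr):
    """Lowest candidate threshold whose false positive rate is at most the target."""
    candidates = sorted(set(scores)) + [float("inf")]  # inf flags no one
    for t in candidates:
        if fpr_at(t, y_true, scores) <= target_fpr:
            return t

# Invented data for two groups; group "b" receives higher scores overall.
data = {
    "a": {"y": [0, 0, 0, 0, 1, 1], "scores": [2, 3, 4, 6, 7, 9]},
    "b": {"y": [0, 0, 0, 0, 1, 1], "scores": [4, 5, 7, 8, 8, 10]},
}

# One shared threshold produces unequal false positive rates.
shared = {g: fpr_at(6, d["y"], d["scores"]) for g, d in data.items()}

# Separate thresholds chosen so each group's rate is at most 25%.
thresholds = {
    g: threshold_for_target_fpr(d["y"], d["scores"], target_fpr=0.25)
    for g, d in data.items()
}
adjusted = {g: fpr_at(thresholds[g], d["y"], d["scores"]) for g, d in data.items()}

print("FPR with shared threshold 6:", shared)
print("group thresholds:", thresholds)
print("FPR with group thresholds:", adjusted)
```

In this toy data the shared threshold flags half of group "b"'s non-re-offenders but only a quarter of group "a"'s. Separate thresholds bring both groups to a quarter, at the price of treating the same score differently depending on group membership.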

The Resampling Trap

A common instinct is to fix representation bias by simply adding more examples from underrepresented groups. This helps but does not fully solve the problem if the labels on those new examples are themselves biased. Adding more images of dark-skinned patients to a dermatology dataset is valuable only if those images are labeled with the same quality and consistency as light-skinned images. If the annotation process itself is biased — different standards applied to different groups — more data can amplify the problem.

Legal Frameworks: What Exists, What Is Inadequate

In the United States, algorithmic discrimination is addressed primarily through existing civil rights laws rather than AI-specific regulation. Title VII of the Civil Rights Act prohibits employment discrimination; the Fair Housing Act prohibits discrimination in housing; the Equal Credit Opportunity Act prohibits discrimination in lending. These laws apply to algorithmic decisions as they apply to human decisions — but enforcement is hampered by the opacity of proprietary systems, the difficulty of proving intent, and the absence of mandatory disclosure requirements.

In 2023, the Equal Employment Opportunity Commission issued guidance stating that AI hiring tools are subject to Title VII and that employers cannot use "business necessity" to justify a tool that produces disparate impact without validating that it is job-related. In guidance issued in 2022 and 2023, the Consumer Financial Protection Bureau stated that algorithmic credit decisions must comply with the Equal Credit Opportunity Act's adverse action notice requirements: lenders must explain denials in terms a consumer can understand, even when the decision was made by a model.

The European Union's AI Act, which entered into force in 2024, takes a more structural approach. It classifies AI systems used in employment, credit, and criminal justice as "high-risk" and requires conformity assessments, bias audits, transparency documentation, and human oversight before deployment. The long-term effect on practice remains to be seen.

Key Terms

Pre-processing debiasing: Techniques applied to training data before model training to reduce bias, including re-sampling, re-weighting, and adversarial data augmentation.
In-processing debiasing: Techniques that modify the training algorithm itself, typically by adding fairness constraints to the objective function that the model must satisfy alongside accuracy targets.
Post-processing debiasing: Techniques applied after model training to adjust outputs, for example by applying group-specific decision thresholds to equalize error rates across demographic subgroups.
Disparate impact: A legal doctrine holding that a facially neutral practice that produces substantially different outcomes for protected groups is discriminatory, regardless of intent. Established by the U.S. Supreme Court in Griggs v. Duke Power Co. (1971).
Algorithmic accountability: The set of mechanisms (technical audits, legal requirements, organizational practices) by which the developers and deployers of AI systems can be held responsible for discriminatory outcomes.

The Accountability Gap

The HireVue case illustrates a recurring pattern: technical bias problems that exist from deployment are only addressed when external accountability pressure — legal complaints, investigative journalism, regulatory scrutiny — becomes commercially costly. Internal technical teams often identify problems early. The organizational incentives to disclose and address them are frequently weak. Technical tools for measuring and mitigating bias are now widely available. The constraint is not technical capacity but accountability structure.

What Effective Oversight Requires

Effective oversight of deployed AI systems requires several things that are currently rare in practice. First, mandatory pre-deployment bias audits conducted by parties with no financial stake in approval. The dermatology AI described in Lesson 1 was published in Nature without any audit of its performance on darker skin tones, and the journal's peer reviewers did not require one. This was not malice — it was the absence of a norm requiring it.

Second, ongoing monitoring after deployment, because distribution shift — changes in the population of users, their demographics, and the context of use — can introduce new biases even when a system was reasonably fair at launch. The 2020 paper "Underdiagnosis Bias of Artificial Intelligence Algorithms Applied to Chest Radiographs in Under-Served Patient Populations" found that AI radiology tools, deployed in new clinical settings serving different patient demographics than those used for validation, exhibited substantially higher error rates. No one had checked.
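Ongoing monitoring of this kind can be as simple as comparing each subgroup's error rate in deployment against its rate at validation. The sketch below is an assumption-laden illustration: the groups, the record counts, the 5-point tolerance, and both function names are invented, and a real monitoring system would also need to account for sample size and statistical noise.

```python
def subgroup_error_rates(records):
    """records: iterable of (group, was_error) pairs; returns error rate per group."""
    totals, errors = {}, {}
    for group, was_error in records:
        totals[group] = totals.get(group, 0) + 1
        errors[group] = errors.get(group, 0) + int(was_error)
    return {g: errors[g] / totals[g] for g in totals}

def drift_alerts(baseline, current, max_increase=0.05):
    """Flag groups whose error rate rose by more than max_increase since
    validation, and groups that were never validated at all."""
    alerts = {}
    for group, rate in current.items():
        base = baseline.get(group)
        if base is None:
            alerts[group] = "no validation baseline for this group"
        elif rate - base > max_increase:
            alerts[group] = f"error rate rose from {base:.2f} to {rate:.2f}"
    return alerts

# Invented validation data: groups A and B only.
validation = (
    [("A", True)] * 5 + [("A", False)] * 95
    + [("B", True)] * 2 + [("B", False)] * 18
)
# Invented deployment data: more of group B, plus a group never seen in validation.
deployment = (
    [("A", True)] * 3 + [("A", False)] * 47
    + [("B", True)] * 12 + [("B", False)] * 38
    + [("C", True)] * 1 + [("C", False)] * 9
)

baseline = subgroup_error_rates(validation)
current = subgroup_error_rates(deployment)
alerts = drift_alerts(baseline, current)
for group, message in sorted(alerts.items()):
    print(group, "->", message)
```

The check for groups with no baseline matters as much as the comparison itself: a population that was absent from validation has no fairness assessment to drift away from.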

Third, meaningful recourse for individuals harmed by algorithmic decisions. Currently, most systems that deny loans, flag welfare fraud, or assign criminal risk scores offer no mechanism for individuals to challenge the decision, understand the basis for it, or identify whether bias played a role. The EU AI Act's right to human review of high-risk AI decisions is one model. Whether it will function in practice depends entirely on enforcement capacity that is not yet in place.

Lesson 4 Quiz

Who Gets to Fix It — and How · 4 questions
HireVue removed facial expression analysis from its hiring AI in January 2021. What primarily triggered this change?
Correct. The facial analysis component's scientific weakness and bias risks had existed since deployment. The change came not from internal discovery but from external pressure — a regulatory complaint, investigative coverage, and the legal exposure that followed. This illustrates the accountability gap: technical problems exist, but organizational incentives to fix them often require external forcing functions.
Incorrect. The component was not removed due to internal technical discovery or proactive action. It was the FTC complaint filed by the Electronic Privacy Information Center in 2019, and the legal and reputational pressure that followed, that made the feature commercially unsustainable. The technical problems predated the fix by years.
In-processing debiasing techniques address bias by:
Correct. In-processing approaches intervene during training, typically by adding a fairness penalty to the objective function. This means the model must balance accuracy with a fairness constraint simultaneously — a direct, mathematically enforced trade-off that forces developers to make the fairness-accuracy balance explicit rather than ignoring it.
Incorrect. That describes pre-processing (removing examples before training) or post-processing (adjusting thresholds after training). In-processing debiasing modifies the training algorithm itself — most commonly by adding fairness constraints directly to the loss function the model optimizes during training.
The legal doctrine of "disparate impact" — established in Griggs v. Duke Power Co. (1971) — holds that:
Correct. Griggs established that employment practices producing disparate outcomes for protected groups can be unlawful even absent discriminatory intent — and that the burden falls on employers to demonstrate business necessity and job-relatedness. The EEOC's 2023 guidance explicitly extends this doctrine to AI hiring tools that produce disparate impact.
Incorrect. Disparate impact doctrine, established in Griggs v. Duke Power Co. (1971), holds that intent is not required. A practice that produces substantially different outcomes for protected groups is presumptively discriminatory, and the employer bears the burden of proving it is genuinely necessary and job-related. The EEOC in 2023 confirmed this applies to AI hiring systems.
What problem does "distribution shift" create for AI systems that were assessed as reasonably fair at deployment?
Correct. A model validated on one patient population and then deployed in a hospital serving a different demographic may exhibit substantially higher error rates for the new population. The 2020 chest radiography study found exactly this. Fair at launch does not mean fair in deployment — continuous monitoring is required, not just a one-time pre-deployment audit.
Incorrect. Distribution shift refers to the mismatch between the population used for validation and the population encountered in deployment. If a model is validated on one demographic and then used on a different one, performance can degrade significantly — and often disproportionately for minority groups. A one-time fairness audit does not protect against this.

Lab 4 — Accountability by Design

Design an oversight framework for a high-stakes AI deployment · 3 exchanges to complete

Your Task

A city government is deploying a predictive AI system to flag welfare benefits applications for fraud review. Before it goes live, you have been appointed to design the accountability and oversight framework. The AI assistant will help you think through the requirements — push it to be specific and honest about gaps.

Consider: What pre-deployment audit requirements should exist? Who conducts the audit and who funds it? What ongoing monitoring is required post-deployment? What recourse do individuals have if they believe they were wrongly flagged? Which legal frameworks apply, and are they adequate? What happens when bias is discovered after launch?
AI Accountability Framework Advisor
Lab 4 · Oversight & Accountability
Welcome to Lab 4. I'm your policy and technical advisor for designing the oversight framework for this fraud-detection AI. This system will automatically flag benefit applications for human review — affecting thousands of applicants per month across demographics that include elderly residents, people with disabilities, and non-English speakers. Where would you like to start: pre-deployment auditing, ongoing monitoring, individual recourse, or legal compliance?

Module 1 Test

The Algorithm That Chose Wrong · 15 questions · 80% to pass
1. What did Amazon's scrapped résumé-screening AI learn to do with the word "women's" (as in "women's chess club")?
Correct. The model penalized résumés containing "women's" because its ten years of training data showed that successfully hired candidates were predominantly male. No rule was written — the bias emerged from the pattern.
Incorrect. The system penalized these résumés. It inferred from ten years of historical male-dominated hiring that female-coded signals were negatively correlated with being hired. This is historical bias reproduced automatically through training data.
2. Representation bias in training data refers to:
Correct. Representation bias occurs when some groups appear rarely or not at all in training data. The model has few examples to learn from for those groups, and its performance on them is consequently poor — as demonstrated in the Gender Shades study on face recognition.
Incorrect. Representation bias is not about intent or overweighting. It refers to underrepresentation: certain groups appear so rarely in training data that the model learns too little about them, producing worse performance on precisely those groups.
3. Joy Buolamwini's 2018 "Gender Shades" study found maximum error rates of up to 34.7% in commercial face recognition systems. On which combination of characteristics was this error rate observed?
Correct. The intersection of darker skin tone and female gender produced the highest error rates — up to 34.7% — while light-skinned men had error rates below 1%. This was driven by representation bias: training datasets contained far more light-skinned, male faces.
Incorrect. The highest error rates were observed for dark-skinned women specifically — up to 34.7% compared to under 1% for light-skinned men. The compounding of two underrepresented characteristics produced the largest performance gap.
4. COMPAS's use of prior arrests (rather than convictions) as an input variable is an example of which type of bias?
Correct. Measurement bias occurs when a variable used in a model is a flawed proxy for the underlying construct of interest. Prior arrests measure policing intensity, not actual criminal behavior — making them a biased proxy for the recidivism risk the model claims to predict.
Incorrect. This is measurement bias: the variable being used (arrests) is a systematically flawed proxy for the thing the model is supposed to measure (likelihood of re-offending). Arrests are heavily influenced by policing patterns, not just actual criminal behavior.
5. St. George's Hospital Medical School in London used an automated admissions program from 1979 onward. What did an internal review discover in 1988?
Correct. The St. George's case is an early documented example of historical bias in automated decision-making. The program did not invent discrimination — it automated and systematized the discriminatory decisions already being made by human selection committees, making the bias faster and more consistent.
Incorrect. The program had been discriminating against women and non-European applicants since 1979 — not randomly but systematically, by reproducing the biased patterns in historical admissions decisions made by human committees. The 1988 review confirmed this had been happening for nearly a decade.
6. The term "ground truth" in machine learning refers to:
Correct. "Ground truth" implies objectivity but is actually shorthand for "the labels someone attached to this data." Those labels are human products, subject to human biases, institutional constraints, and historical circumstances — as illustrated by the dermatology AI case.
Incorrect. Ground truth is the label data used for training — not an independent or objective standard. The term implies more certainty than is warranted: labels are produced by humans in specific institutional and historical contexts and can be systematically biased.
7. The "fairness impossibility theorem" establishes that:
Correct. The fairness impossibility theorem, proved by researchers including Chouldechova (2017) and Kleinberg et al. (2017), shows that satisfying all common fairness criteria simultaneously is impossible when base rates differ between groups. This means fairness choices are inherently value choices — they cannot be resolved purely technically.
Incorrect. The fairness impossibility theorem is a specific mathematical result: when base rates differ between groups (which they typically do in domains shaped by historical inequity), common fairness criteria such as calibration (a given score meaning the same thing in every group), equal false positive rates, and equal false negative rates cannot all be satisfied at once. You must choose, and that choice is a value judgment.
8. Crowd-sourced content moderation annotators were found to label African American Vernacular English as offensive at higher rates than Black annotators did. What does this illustrate?
Correct. When the annotator pool is demographically homogeneous, the resulting labels reflect that group's norms. A toxicity classifier trained on those labels treats one demographic group's linguistic standards as neutral and may disproportionately suppress speech that departs from those standards.
Incorrect. This is label bias. The labels were not produced by an algorithm but by humans whose cultural and linguistic assumptions shaped their judgments. Those assumptions then get encoded in the trained model, which applies the majority annotator group's norms as though they were universal.
9. What is the accuracy paradox in the context of AI model evaluation?
Correct. The accuracy paradox explains why aggregate accuracy is an insufficient fairness metric. A model can hit 94% overall accuracy while performing at 60% for a minority group — because that group's errors are averaged away by the majority group's good performance.
Incorrect. The accuracy paradox specifically refers to how high aggregate accuracy can mask severe subgroup failure. Because minority groups represent small fractions of the dataset, even near-total failure on them barely moves the overall accuracy figure — making aggregate accuracy a misleading success metric for fairness evaluation.
10. The Obermeyer et al. 2019 Science study found that a health insurance algorithm underenrolled Black patients in care management programs. Which type of bias does the use of healthcare spending as a proxy for medical need exemplify?
Correct. Measurement bias occurs when the variable used to predict an outcome is a flawed proxy for what you actually care about. Healthcare spending looks like a reasonable proxy for medical need — but when access to care is unequal, spending reflects access patterns as much as illness severity, confounding the measurement.
Incorrect. While historical discrimination in healthcare access is part of the context, the specific mechanism here is measurement bias: the proxy variable (spending) does not accurately measure the underlying construct (medical need) when access to care is unequal. People who cannot access care spend less regardless of how sick they are.
11. Post-processing debiasing of a recidivism algorithm — adjusting thresholds separately for different racial groups — was shown to equalize false positive rates across groups. What was the necessary trade-off?
Correct. The fairness impossibility theorem predicts exactly this: satisfying one fairness criterion (equal false positive rates) while base rates differ between groups requires sacrificing another criterion (overall accuracy). Technical interventions cannot make the trade-off disappear; they make it visible and explicit.
Incorrect. Post-processing to equalize false positive rates did reduce overall predictive accuracy — this is not a workaround failure but a mathematical inevitability. When base rates differ between groups, you cannot simultaneously maximize overall accuracy and equalize error rates across groups. The choice between them is a value decision.
12. Google's Model Cards framework, proposed in 2019, addresses AI bias primarily by:
Correct. Model Cards are a documentation standard, not a technical debiasing tool. Their contribution is transparency: by requiring disaggregated performance metrics, they make subgroup failures visible to users and deployers who would otherwise only see aggregate accuracy figures.
Incorrect. Model Cards are not an automated debiasing tool or certification requirement. They are a documentation standard that requires reporting performance metrics disaggregated by relevant subgroups — so that someone deploying or using the model can see how it performs for different populations, not just on average.
13. What does "runaway feedback" (also called performative prediction) mean in the context of biased AI systems?
Correct. Runaway feedback is one reason bias tends to compound rather than remain stable. A biased model produces biased outcomes; those outcomes become training data for the next model; the next model learns an amplified version of the original bias. The system's own decisions change the world in ways that reinforce its errors.
Incorrect. Runaway feedback refers to the self-amplifying cycle of bias: a biased model's outputs shape the real-world outcomes that become its future training data. A hiring algorithm that screens out qualified minority candidates produces a future where fewer minority candidates appear in "successful hire" training labels — making the next model even more biased.
14. HireVue's removal of facial expression analysis in 2021 and the EEOC's 2023 guidance on AI hiring tools both reflect which broader mechanism for addressing algorithmic bias?
Correct. Both cases illustrate the accountability gap: technical problems that exist at deployment are often only addressed when external pressure — regulatory action, legal complaints, investigative journalism — makes inaction commercially or legally untenable. Technical tools for measuring bias are available; organizational incentives to act on them are often insufficient without external accountability.
Incorrect. Neither change was primarily driven by internal engineering action or voluntary ethics commitments. HireVue responded to an FTC complaint by EPIC; the EEOC issued its guidance in response to documented disparate impact from AI hiring tools. External legal and regulatory pressure was the forcing mechanism in both cases.
15. Which of the following best describes why the EU AI Act's approach to high-risk AI differs from the U.S. approach under existing civil rights law?
Correct. The EU AI Act is structural and preventive: it requires demonstrating conformity before deployment for high-risk systems. U.S. civil rights law is reactive: it allows challenges after discriminatory outcomes occur, but does not require pre-deployment audits. Both are imperfect — the EU Act's effectiveness depends on enforcement capacity that is still being established.
Incorrect. The key distinction is proactive versus reactive. The EU AI Act classifies AI in hiring, credit, and criminal justice as high-risk and requires mandatory bias audits and conformity assessments before deployment. U.S. civil rights law applies anti-discrimination doctrine after the fact — it can be used to challenge discriminatory AI systems, but it does not require anyone to audit for bias before going live.