When the first standardized IQ tests were administered by the U.S. Army in 1917 — the so-called Alpha and Beta tests given to 1.75 million recruits — their designers at Princeton believed they were measuring innate intelligence with scientific objectivity. Within a decade, those scores were being used to argue for immigration quotas and forced sterilization laws. The instruments were mathematically rigorous. The inputs, the assumptions about what intelligence meant and who possessed it, were drenched in the social hierarchies of their era. The math laundered the prejudice. It made discrimination look like measurement.
That pattern is repeating right now, at a scale the Princeton psychologists could not have imagined. Algorithms trained on decades of hiring records, lending decisions, criminal sentencing data, and medical diagnoses are being deployed by Fortune 500 companies, federal courts, insurance underwriters, and hospital systems. The inputs carry the sediment of every discriminatory choice made by every human who produced them. The outputs inherit that sediment — and then, because they arrive as a number or a score rather than a human opinion, they acquire a legitimacy that human opinions never could.
This course exists to make that invisible mechanism visible. We will examine specific documented cases — Amazon's scrapped hiring tool, the COMPAS recidivism algorithm, Google's photo classifier, dermatology AI trained almost entirely on light skin — and extract from each one a concrete, transferable lesson about how bias enters automated systems and what, if anything, can be done about it. We will not pretend there are easy fixes. But understanding the problem precisely is the first form of power available to anyone who encounters these systems — which, in 2025, means nearly everyone.
In 2014, Amazon's machine learning team in Edinburgh began building a tool to automate the first stage of recruiting. The system would scan résumés and score candidates from one to five stars, the same way Amazon rated products. Engineers trained it on roughly ten years of résumés submitted to the company, tagged with whether those applicants were eventually hired. By 2015 the team had discovered a serious problem. The model had learned to penalize résumés that contained the word "women's" (as in "captain of the women's chess club") and to downgrade graduates of all-women's colleges. It had done this without being told to. It inferred the preference from the data: the people Amazon had historically hired in technical roles were overwhelmingly male, so the model treated maleness as a proxy for quality. According to Reuters, which broke the story in October 2018, Amazon had disbanded the team by early 2017.
No Amazon engineer wrote a rule saying "prefer men." The discrimination emerged from the data itself. This is the central mechanism of training-data bias, and understanding it changes how you read every claim that a particular algorithm is objective.
A supervised machine learning model learns by finding patterns in labeled examples. You show it thousands of inputs — résumés, loan applications, X-ray images, sentences — paired with the "correct" answer someone attached to each one. The model adjusts its internal parameters until it can reliably reproduce those answers. Then it applies the pattern to new inputs it has never seen.
The critical word is labeled. Every label in a training dataset was produced by a human or a human institution. Hiring decisions. Loan approvals. Criminal sentences. Medical diagnoses. Each label reflects the knowledge, assumptions, capacity, and biases of whoever made that decision, in the social and legal context of whenever they made it. When the model learns from those labels, it is not learning some abstract ground truth. It is learning to reproduce human judgment — including the parts of that judgment that were systematically wrong.
The mathematical process of training a model obscures the human origin of the labels. By the time a model is deployed, its billions of parameters look nothing like the original data. The bias is no longer visible as a human opinion. It has been transformed into a weight, a threshold, a correlation — something that looks like an empirical fact about the world rather than a record of past human choices.
Historical bias occurs when the world the data was collected from was already discriminatory. Amazon's résumé tool is a clean example: the underlying reality — that tech companies hired fewer women — was real. The model learned a real pattern. The problem is that a real pattern of discrimination becomes, when automated, a mechanism for perpetuating it indefinitely.
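The mechanism can be made concrete with a deliberately tiny sketch. The records below are invented for illustration, and the scorer (smoothed per-token log-odds of being hired) is an assumption chosen for readability; it is not a description of how Amazon's system worked. No rule mentions gender, yet the learned weight for "women's" comes out negative because of how the historical labels are distributed.

```python
import math
from collections import Counter

# Invented historical records: (résumé text, 1 if hired else 0).
history = [
    ("captain chess club python", 1),
    ("python java backend", 1),
    ("captain women's chess club python", 0),
    ("women's college java backend", 0),
    ("java backend lead", 1),
    ("women's coding society python", 0),
    ("python data pipelines", 1),
    ("women's chess club java", 1),
    ("data pipelines lead", 0),
]


def learn_token_weights(records):
    """Smoothed log-odds of 'hired' for each token, estimated from past decisions."""
    hired, seen = Counter(), Counter()
    for text, label in records:
        for token in set(text.split()):
            seen[token] += 1
            hired[token] += label
    return {t: math.log((hired[t] + 1) / (seen[t] - hired[t] + 1)) for t in seen}


def score(text, weights):
    """Sum of learned token weights; unseen tokens contribute nothing."""
    return sum(weights.get(token, 0.0) for token in set(text.split()))


weights = learn_token_weights(history)

# Two résumés that differ only in one word.
with_term = score("captain women's chess club python", weights)
without_term = score("captain chess club python", weights)

print("weight for \"women's\":", round(weights["women's"], 3))
print("résumé with the term scores lower:", with_term < without_term)
```

The two test résumés are identical apart from one word, so the gap between their scores is exactly the weight the scorer learned for that word from past decisions.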
Representation bias occurs when some groups are simply missing from or underrepresented in the dataset. A landmark 2018 study by MIT researcher Joy Buolamwini and Timnit Gebru, published as "Gender Shades," found that three commercial gender classification systems (from IBM, Microsoft, and Megvii's Face++) achieved error rates of less than 1% on lighter-skinned men but as high as 34.7% on darker-skinned women. The benchmark datasets commonly used to build and evaluate such systems were overwhelmingly composed of lighter-skinned faces. The models performed worst on the people they had seen least.
Measurement bias occurs when the variable being measured is itself a flawed proxy for what you actually care about. The COMPAS recidivism algorithm, used by courts in a number of U.S. states, predicts "risk of re-offense" using a 137-question survey. A 2016 ProPublica investigation found that Black defendants who did not go on to re-offend were labeled higher risk at roughly twice the rate of white defendants who did not re-offend. One of COMPAS's inputs was prior arrests, not convictions. Prior arrests reflect policing patterns as well as underlying behavior. Neighborhoods that receive heavier policing generate more arrests. The measurement was biased before it ever reached the algorithm.
An algorithm trained on biased data will reproduce that bias with mathematical precision. The model is not malfunctioning — it is doing exactly what it was designed to do. This means the problem is not a bug to be patched but a structural feature of how these systems are built, requiring structural responses.
Training-data bias is not a static problem. It tends to compound over time through a feedback loop. A biased hiring algorithm screens out qualified candidates from underrepresented groups. Those people are not hired. Future training data — collected from subsequent hiring decisions — contains even fewer examples of those candidates being hired successfully. The next version of the model, trained on that data, learns an even stronger version of the original bias. Researchers call this runaway feedback or performative prediction: the algorithm's decisions shape the reality that will be used to evaluate and retrain it.
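A toy simulation can show the shape of this loop. Everything here is an assumption made for illustration: two equally sized groups with identical, uniformly distributed qualifications; a model that scores each candidate as qualification plus `alpha` times their group's past hire rate; and a fixed overall hiring rate. The function name `retrain_and_hire` and all numbers are hypothetical and not drawn from any real system.

```python
def retrain_and_hire(prior_rates, alpha, target_rate):
    """Simulate one retraining cycle and return the new per-group hire rates.

    Qualification q is uniform on [0, 1] in every group. A candidate's score
    is q + alpha * (their group's hire rate in the training data). Everyone
    above a cutoff is hired, and the cutoff is found by bisection so that the
    overall hire rate equals target_rate (groups are assumed equal in size).
    """
    lo, hi = 0.0, 1.0 + alpha
    for _ in range(60):
        cutoff = (lo + hi) / 2
        # Share of each group whose score clears the cutoff.
        rates = [min(1.0, max(0.0, 1.0 - (cutoff - alpha * r))) for r in prior_rates]
        if sum(rates) / len(rates) > target_rate:
            lo = cutoff  # hiring too many people: raise the cutoff
        else:
            hi = cutoff
    return rates


rates = [0.25, 0.15]  # starting hire rates for group A and group B
history = [rates]
for _ in range(5):
    rates = retrain_and_hire(rates, alpha=1.5, target_rate=0.20)
    history.append(rates)

for cycle, (rate_a, rate_b) in enumerate(history):
    print(f"cycle {cycle}: group A {rate_a:.3f}, group B {rate_b:.3f}")
```

In this sketch the gap between groups is multiplied by `alpha` on each cycle until one group's hire rate reaches zero. With `alpha = 1.0` the original gap is preserved rather than widened, so the compounding shown here depends on the assumption that the model leans on the group signal more strongly than the historical data alone would justify.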
This is not hypothetical. A 2019 study by researchers at Cornell and MIT found that recommender systems for job postings on a major platform were showing higher-paying job ads to men significantly more often than to identically qualified women — and that this gap widened over multiple cycles of data collection and retraining.
You are reviewing a fictional hiring algorithm before it goes into production. The system was trained on five years of hiring decisions at a large consulting firm. Your job is to interrogate the AI assistant about the training data and identify which types of bias may be present.
In 2017, Stanford researchers published a landmark paper in Nature describing an AI system that matched board-certified dermatologists at classifying skin lesions from photographs. The system was trained on 129,450 clinical images. The media coverage was electric: here was proof that AI could democratize expert diagnosis, bringing dermatology-quality care to patients who had no access to specialists. What the coverage largely omitted was a single sentence buried in the supplementary materials: of the 129,450 images, the vast majority depicted light skin. When independent researchers tested similar systems on darker skin tones in 2019 and 2021, accuracy dropped substantially — in some cases making the AI less useful than a general practitioner's visual inspection. The labels on the training images were provided by expert dermatologists. They were not wrong in any simple sense. But the data the dermatologists had generated over decades of clinical practice reflected the demographics of who had historically sought and received dermatological care in the United States.
In machine learning, the labels attached to training examples are called ground truth — the definitive correct answer the model is trying to learn. The term implies objectivity. It suggests the labels reflect reality rather than interpretation. This framing is often wrong.
Every label in a training dataset was produced by someone. That someone operated within institutional constraints, applied professional standards developed at a particular historical moment, and brought both explicit and implicit assumptions to the labeling task. When we train a model on those labels, we are not training it on reality. We are training it on a particular human community's interpretation of reality — an interpretation shaped by who was in that community, what resources they had, and whose problems they were trying to solve.
In 1988, researchers discovered that St. George's Hospital Medical School in London had been using a computer program to shortlist applicants since 1979. The program had been built to reproduce historical admissions decisions. An inquiry by the UK's Commission for Racial Equality found it had been systematically discriminating against women and applicants with non-European names, not because anyone set out to program discrimination, but because the historical decisions it was modeled on reflected the biases of the selection committees of the 1970s. This case predates modern machine learning by decades, yet the mechanism is identical.
The labeling problem is especially visible in criminal justice AI. Systems that predict recidivism are trained on labels like "re-arrested within two years." But re-arrest is not the same as re-offense. People who are on parole, who live in heavily policed neighborhoods, or who are Black are re-arrested at higher rates than others who commit identical acts but face less police surveillance. The label — re-arrested — reflects system behavior, not underlying behavior. A model trained on this label learns to predict who will be caught, which is heavily correlated with race and geography, rather than who will commit a crime.
This distinction was central to the academic debate sparked by the 2016 ProPublica investigation into COMPAS. Northpointe (now Equivant), the company that built COMPAS, argued that its system was "fair" because a given risk score was about equally predictive of re-offense across racial groups. ProPublica argued it was unfair because its false positive rates differed dramatically: among defendants who did not go on to re-offend, Black defendants were labeled high-risk at roughly twice the rate of white defendants. Both claims are mathematically true. They represent different definitions of fairness applied to the same numbers, and later work showed that when two groups have different underlying re-arrest rates, an imperfect predictor cannot satisfy both definitions at once.
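A short worked example shows how both sides can be right about the same numbers. The confusion-matrix counts below are invented for illustration and are not COMPAS data; the group names and the `rates` helper are likewise hypothetical. The two groups differ only in their base rate of the recorded outcome.

```python
def rates(tp, fp, fn, tn):
    """Return (precision of a high-risk label, false positive rate)."""
    ppv = tp / (tp + fp)  # of those labeled high-risk, the share re-arrested
    fpr = fp / (fp + tn)  # of those NOT re-arrested, the share labeled high-risk
    return ppv, fpr


# Invented counts for 1,000 people per group.
group_x = dict(tp=300, fp=200, fn=200, tn=300)  # base rate 50%
group_y = dict(tp=150, fp=100, fn=150, tn=600)  # base rate 30%

ppv_x, fpr_x = rates(**group_x)
ppv_y, fpr_y = rates(**group_y)

print(f"group X: PPV {ppv_x:.2f}, FPR {fpr_x:.2f}")
print(f"group Y: PPV {ppv_y:.2f}, FPR {fpr_y:.2f}")
```

A high-risk label is correct 60% of the time in both groups, which is the vendor-style argument, while people in group X who are never re-arrested are flagged far more often than their counterparts in group Y, which is the ProPublica-style argument.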
Label bias is harder to detect than representation bias because the data looks complete. You have examples from every group. You have expert labels. The error is not that data is missing — it is that the data faithfully records a biased past. Auditing for this requires not just looking at the dataset but understanding the history of the institution that produced it.
Modern large-scale AI systems require massive quantities of labeled data, which means the labeling work is typically distributed across thousands of low-paid workers, often employed through platforms like Amazon Mechanical Turk or Appen. Research by scholars including Lilly Irani and Mary Gray has documented the working conditions in this "ghost work" economy — the invisible human labor behind AI's apparent automation. These annotation workers bring their own cultural contexts, language assumptions, and blind spots to labeling tasks. A 2021 study found that crowd-sourced content moderation labels for what constituted "offensive language" varied significantly by annotator demographics, with white annotators being substantially more likely to label African American Vernacular English as offensive than Black annotators were. The training data for toxicity classifiers — used by major platforms to moderate billions of posts — carries this annotation artifact forward.
A hospital network is deploying a wound-classification AI to help nurses triage skin wounds. The system was trained on images labeled by dermatologists at three academic medical centers in the U.S. Northeast. You are a health equity consultant brought in to evaluate the label quality before deployment.
In June 2015, software developer Jacky Alciné discovered that Google Photos had automatically labeled photos of himself and a friend, both Black, as "gorillas." Google apologized immediately and within days removed the label category entirely. The technical failure was an image classification model trained on datasets that dramatically underrepresented dark-skinned faces. But the case also reveals something beyond representation bias: the model was optimizing for overall classification accuracy across all images in the training set. Because dark-skinned individuals were rare in that set, misclassifying them had almost no effect on the aggregate accuracy metric the engineers were watching. The objective function (maximize accuracy) was met. The outcome was catastrophic. The problem was not just the data; it was the metric the developers chose to define success.
Every machine learning model is built around an objective function: a mathematical formula that defines what the model is trying to maximize or minimize. In classification tasks, this is usually accuracy — the percentage of examples the model gets right across the entire dataset. This sounds reasonable until you consider what happens when the dataset is imbalanced.
If 95% of your training examples belong to Group A and 5% belong to Group B, a model that is right about every Group A example and wrong about every Group B example will still achieve 95% accuracy. It has technically achieved the objective while completely failing Group B. This is closely related to what is known as the accuracy paradox, in which a classifier that always predicts the majority class scores well on an imbalanced dataset, and it is not a theoretical edge case. It describes the actual situation in many deployed AI systems in healthcare, criminal justice, and credit scoring.
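The 95%/5% arithmetic is easy to check in its class-imbalance form. This is a minimal sketch with made-up labels: a "model" that always outputs the majority answer, scored with overall accuracy and with recall on the minority class.

```python
# 950 majority-class examples (label 0) and 50 minority-class examples (label 1).
labels = [0] * 950 + [1] * 50

# A degenerate "model" that ignores its input and always predicts the majority class.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the minority class: of the true 1s, how many were predicted as 1?
minority_total = sum(y == 1 for y in labels)
minority_found = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
minority_recall = minority_found / minority_total

print(f"overall accuracy: {accuracy:.2f}")
print(f"minority recall:  {minority_recall:.2f}")
```

The headline metric looks strong while the model never identifies a single minority-class example.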
A child welfare algorithm used by Illinois's Department of Children and Family Services to predict child abuse risk was found by researchers at the University of Michigan in 2022 to flag Black families at substantially higher rates than white families with comparable risk profiles. The system was optimized for predictive accuracy on its full dataset. Because Black families had historically been investigated at higher rates — due to disparate reporting and surveillance — they were overrepresented in the "high risk" labeled examples. The model learned that pattern and reproduced it, achieving strong overall accuracy while systematically over-surveilling one group.
Developers track the metrics they choose to track. In the early deployment of many AI systems, those metrics were aggregate ones: overall accuracy, area under the curve, mean average precision. These metrics report performance averaged across all examples and all groups. They are entirely insensitive to the distribution of errors across subgroups — they cannot tell you whether a model that is 92% accurate overall is 99% accurate for one group and 70% accurate for another.
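The 92% example can be reproduced with a small disaggregation helper. The function name `accuracy_by_group` and the group sizes are assumptions chosen so the numbers work out exactly: 2,200 examples in group A and 700 in group B.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return (overall accuracy, {group: accuracy}) for parallel lists."""
    correct, total = defaultdict(int), defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group


# Group A: 2,178 of 2,200 correct. Group B: 490 of 700 correct.
groups = ["A"] * 2200 + ["B"] * 700
y_true = [1] * 2900
y_pred = [1] * 2178 + [0] * 22 + [1] * 490 + [0] * 210

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
print(f"overall: {overall:.2f}")
for group, acc in per_group.items():
    print(f"group {group}: {acc:.2f}")
```

A dashboard that reports only the first number would show 92% and give no hint that one group experiences a 30% error rate.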
In 2019, researchers at Google published a framework called Model Cards, a standardized approach to documenting AI systems that includes disaggregated performance metrics: accuracy broken down by gender, age, race, and other relevant subgroups. IBM had released AI Fairness 360 the year before, an open-source toolkit providing over 70 fairness metrics along with a set of bias mitigation algorithms. These tools exist precisely because aggregate metrics were insufficient.
But measurement reform alone is insufficient. Choosing which fairness metrics to report is itself a value-laden decision. Reporting disaggregated accuracy tells you something, but not which threshold of disparity is acceptable, or what trade-off between group accuracy and overall accuracy is justified.
Optimizing for overall accuracy is presented as a default, a neutral technical starting point. It is not. It is a choice to treat errors as equal regardless of who bears them. When a model fails on a minority group, those errors are numerically small and get averaged away. Making that choice visible — and asking whether it is defensible — is one of the most important interventions a human can make in the AI development process.
Classification models typically output a probability — a number between 0 and 1 — and then apply a threshold to convert that to a binary decision. At what probability score does a loan applicant get denied? At what recidivism score does a defendant get held without bail? The threshold is chosen by humans, and where you set it determines the balance between false positives and false negatives.
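The effect of moving a threshold can be seen on a handful of made-up scores. In this sketch a case is flagged when its score is at or above the threshold; the helper name `error_rates` and the data are hypothetical.

```python
def error_rates(scores, labels, threshold):
    """Return (false positive rate, false negative rate) at a given threshold.

    A case is flagged (predicted positive) when score >= threshold.
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    positives = [s for s, y in zip(scores, labels) if y == 1]
    fpr = sum(s >= threshold for s in negatives) / len(negatives)
    fnr = sum(s < threshold for s in positives) / len(positives)
    return fpr, fnr


# Invented model scores and true outcomes (1 = the event actually happened).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1]

results = {t: error_rates(scores, labels, t) for t in (0.35, 0.55, 0.75)}
for threshold, (fpr, fnr) in results.items():
    print(f"threshold {threshold:.2f}: FPR {fpr:.2f}, FNR {fnr:.2f}")
```

Raising the threshold flags fewer people who did not need to be flagged and misses more people who did. Which end of that trade-off to favor is the human decision the paragraph above describes.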
A 2019 study published in Science by Ziad Obermeyer and colleagues examined a commercial algorithm used by U.S. health systems and insurers to identify patients for care management programs. The algorithm used healthcare spending as a proxy for medical need. Because Black patients historically had less access to care, they spent less on average than white patients with equivalent levels of illness. The model therefore assigned lower risk scores to Black patients, systematically underenrolling them in the care management programs. Setting a single cost-based threshold for enrollment, without auditing its effect by race, produced a system that was algorithmically coherent and medically inequitable. Correcting the bias required not just changing the threshold but replacing the proxy variable entirely.
A large bank is deploying an AI loan underwriting system. The engineering team reports 94% overall accuracy on their test set and is ready to go live. You are a fairness auditor. Your job is to probe what that 94% means, which errors are included in it, and whether the metric adequately captures the system's performance across applicant demographics.
HireVue's AI-powered video interview system analyzed candidates' facial expressions, vocal tone, word choice, and body language to generate a "hirability" score. By 2019, the system had been used to screen more than 10 million job seekers for companies including Goldman Sachs, Unilever, and Hilton. In January 2021, following pressure from the Electronic Privacy Information Center — which had filed an FTC complaint in 2019 — HireVue announced it was dropping facial expression analysis from its assessments. The company acknowledged that the facial analysis component had not been validated for bias and that scientific support for inferring personality from facial movement was weak. The pivot was partially a technical fix — remove the unreliable input — but it was primarily a legal and reputational response. The technical problem had existed since deployment. What changed was external accountability pressure, not internal technical discovery.
Researchers have developed a range of technical approaches to reducing algorithmic bias. These fall into three general categories: pre-processing (modifying training data before training to remove or balance bias), in-processing (modifying the training algorithm itself to incorporate fairness constraints), and post-processing (adjusting the model's outputs after training to equalize outcomes across groups).
Pre-processing techniques include re-sampling to balance underrepresented groups and re-weighting examples so that minority-group errors are penalized more heavily. Adversarial debiasing, in which a secondary model is trained to detect demographic signals in the primary model's representations and the primary model is penalized for relying on them, is often mentioned alongside these, but because it changes the training procedure itself it is more accurately an in-processing method. IBM's AI Fairness 360 toolkit implements several of these approaches.
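The re-weighting idea can be sketched with one specific scheme, usually called reweighing and attributed to Kamiran and Calders: each example gets the weight P(group) × P(label) / P(group, label), so that group and label look statistically independent in the weighted data. The implementation below is a plain-Python illustration with invented counts, not code from any particular toolkit.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Per-example weights that make group and label independent when applied."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        group_counts[g] * label_counts[y] / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


def weighted_positive_rate(target_group, groups, labels, weights):
    """Share of a group's total weight that sits on positive labels."""
    rows = [(y, w) for g, y, w in zip(groups, labels, weights) if g == target_group]
    return sum(w for y, w in rows if y == 1) / sum(w for _, w in rows)


# Invented history: group A was approved 60% of the time, group B 20%.
groups = ["A"] * 100 + ["B"] * 100
labels = [1] * 60 + [0] * 40 + [1] * 20 + [0] * 80

weights = reweighing_weights(groups, labels)
rate_a = weighted_positive_rate("A", groups, labels, weights)
rate_b = weighted_positive_rate("B", groups, labels, weights)
print(f"weighted positive rate: A {rate_a:.2f}, B {rate_b:.2f}")
```

After weighting, both groups show the same positive rate, so a learner that respects example weights no longer sees group membership as predictive of the label. This addresses the correlation in the labels only; it does nothing about labels that were wrong to begin with.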
In-processing approaches include adding fairness constraints directly to the objective function, for example requiring that the model's false positive rate differ by no more than a specified amount between groups. This usually trades off against overall accuracy: making the model fairer in this sense tends to make it less accurate in aggregate, and how much of that trade-off is acceptable is a policy question, not a technical one.
Post-processing approaches include adjusting decision thresholds separately for different groups so that error rates are equalized. This approach was applied in research on COMPAS: researchers showed that by adjusting the risk threshold for Black and white defendants separately, you could equalize false positive rates — but only by also reducing overall predictive accuracy. Again, the trade-off is real and unavoidable.
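Group-specific thresholds can be illustrated on a few invented scores. This is a sketch of the general idea only, not a reconstruction of the COMPAS research: for each group it picks the lowest threshold whose false positive rate stays within a shared cap, then reports what that costs in recall. All names and numbers are hypothetical.

```python
def false_positive_rate(scores, labels, threshold):
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    return sum(s >= threshold for s in negatives) / len(negatives)


def recall(scores, labels, threshold):
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)


def lowest_threshold_within(scores, labels, max_fpr):
    """Smallest observed score usable as a threshold with FPR <= max_fpr."""
    for candidate in sorted(set(scores)):
        if false_positive_rate(scores, labels, candidate) <= max_fpr:
            return candidate
    return max(scores) + 1e-9  # fall back to flagging no one


# Invented data: group B's scores run higher for people with the same outcome.
data = {
    "A": ([0.1, 0.2, 0.3, 0.4, 0.6, 0.5, 0.7, 0.9], [0, 0, 0, 0, 0, 1, 1, 1]),
    "B": ([0.3, 0.4, 0.55, 0.65, 0.7, 0.6, 0.8, 0.9], [0, 0, 0, 0, 0, 1, 1, 1]),
}

shared_fpr = {g: false_positive_rate(s, y, 0.5) for g, (s, y) in data.items()}
thresholds = {g: lowest_threshold_within(s, y, max_fpr=0.2) for g, (s, y) in data.items()}
adjusted_fpr = {g: false_positive_rate(*data[g], thresholds[g]) for g in data}
adjusted_recall = {g: recall(*data[g], thresholds[g]) for g in data}

print("FPR with one shared threshold of 0.5:", shared_fpr)
print("per-group thresholds:", thresholds)
print("FPR after adjustment:", adjusted_fpr)
print("recall after adjustment:", adjusted_recall)
```

In this toy data the false positive rates become equal only because group B's threshold rises, and that higher threshold now misses one of group B's true positives. The cost shows up somewhere, which is the trade-off the paragraph above describes.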
A common instinct is to fix representation bias by simply adding more examples from underrepresented groups. This helps but does not fully solve the problem if the labels on those new examples are themselves biased. Adding more images of dark-skinned patients to a dermatology dataset is valuable only if those images are labeled with the same quality and consistency as light-skinned images. If the annotation process itself is biased — different standards applied to different groups — more data can amplify the problem.
In the United States, algorithmic discrimination is addressed primarily through existing civil rights laws rather than AI-specific regulation. Title VII of the Civil Rights Act prohibits employment discrimination; the Fair Housing Act prohibits discrimination in housing; the Equal Credit Opportunity Act prohibits discrimination in lending. These laws apply to algorithmic decisions as they apply to human decisions — but enforcement is hampered by the opacity of proprietary systems, the difficulty of proving intent, and the absence of mandatory disclosure requirements.
In 2023, the Equal Employment Opportunity Commission issued guidance stating that AI hiring tools are subject to Title VII and that employers cannot use "business necessity" to justify a tool that produces disparate impact without validating that it is job-related. In circulars issued in 2022 and 2023, the Consumer Financial Protection Bureau stated that algorithmic credit decisions must comply with the Equal Credit Opportunity Act's adverse action notice requirements: lenders must give the specific reasons for a denial in terms a consumer can understand, even when the decision was made by a model.
The European Union's AI Act, which entered into force in August 2024, takes a more structural approach. It classifies AI systems used in employment, credit, and criminal justice as "high-risk" and requires conformity assessments, bias audits, transparency documentation, and human oversight before deployment. The long-term effect on practice remains to be seen.
The HireVue case illustrates a recurring pattern: technical bias problems that exist from deployment are only addressed when external accountability pressure — legal complaints, investigative journalism, regulatory scrutiny — becomes commercially costly. Internal technical teams often identify problems early. The organizational incentives to disclose and address them are frequently weak. Technical tools for measuring and mitigating bias are now widely available. The constraint is not technical capacity but accountability structure.
Effective oversight of deployed AI systems requires several things that are currently rare in practice. First, mandatory pre-deployment bias audits conducted by parties with no financial stake in approval. The dermatology AI described earlier in this course was published in Nature without any audit of its performance on darker skin tones, and the journal's peer reviewers did not require one. This was not malice; it was the absence of a norm requiring it.
Second, ongoing monitoring after deployment, because distribution shift (changes in the population of users, their demographics, and the context of use) can introduce new biases even when a system was reasonably fair at launch. The 2021 Nature Medicine paper "Underdiagnosis Bias of Artificial Intelligence Algorithms Applied to Chest Radiographs in Under-Served Patient Populations" found that chest X-ray classifiers trained on large public datasets missed disease more often in under-served groups, including female, Black, and Hispanic patients and patients insured through Medicaid. Gaps of this kind stay invisible after launch unless someone keeps measuring performance by subgroup on the population actually being served.
Third, meaningful recourse for individuals harmed by algorithmic decisions. Currently, most systems that deny loans, flag welfare fraud, or assign criminal risk scores offer no mechanism for individuals to challenge the decision, understand the basis for it, or identify whether bias played a role. The EU AI Act's right to human review of high-risk AI decisions is one model. Whether it will function in practice depends entirely on enforcement capacity that is not yet in place.
A city government is deploying a predictive AI system to flag welfare benefits applications for fraud review. Before it goes live, you have been appointed to design the accountability and oversight framework. The AI assistant will help you think through the requirements — push it to be specific and honest about gaps.