Amazon built a machine-learning recruiting tool intended to automate the screening of job applicants. The system was trained on résumés submitted to Amazon over a ten-year period, the vast majority of which came from men, reflecting the long-standing gender imbalance in the technology industry. The model had no concept that the word "women" was a problem. It simply learned that the patterns on successful past résumés, which were overwhelmingly male, were the patterns to reward. By 2015 the company had realised that the system was penalising résumés containing the word "women's", as in "women's chess club captain." Amazon disbanded the project in 2017 after internal audits confirmed the bias could not be reliably corrected.
Training data is a compressed record of past human decisions. Those decisions were made in contexts shaped by systemic inequalities: hiring managers who favoured certain names, loan officers who avoided certain zip codes, researchers who enrolled fewer women and minority patients in clinical trials. The data doesn't label these moments as biased. It presents them as facts.
An AI system treats the statistical regularities in its training data as ground truth. If historically Black neighbourhoods were denied mortgages at higher rates, a model trained on those approval records will learn that those zip codes are "high risk." The model has no way to know — unless explicitly told — that the pattern was the product of illegal discrimination, not genuine financial risk.
ProPublica's 2016 investigation into Northpointe's COMPAS tool found that Black defendants who did not go on to reoffend were significantly more likely to have been scored as high risk than white defendants who did not reoffend, while white defendants who did reoffend were more likely to have been scored as low risk. The model was trained on criminal history data produced by a justice system with documented racial disparities in arrest, prosecution, and sentencing. The historical pattern (more Black defendants in the system) was treated as predictive signal rather than as evidence of systemic bias.
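The kind of disparity ProPublica reported can be expressed as a difference in error rates between groups. The following is a minimal sketch of that comparison, not ProPublica's actual methodology: the function name, the record layout, and the toy records are all assumptions made for illustration, and the numbers are invented.

```python
from collections import defaultdict

def error_rates_by_group(records):
    """records: iterable of (group, predicted_high_risk, reoffended) tuples."""
    counts = defaultdict(lambda: {"fp": 0, "tn": 0, "fn": 0, "tp": 0})
    for group, predicted_high, reoffended in records:
        if reoffended:
            key = "tp" if predicted_high else "fn"
        else:
            key = "fp" if predicted_high else "tn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        did_not_reoffend = c["fp"] + c["tn"]
        did_reoffend = c["fn"] + c["tp"]
        rates[group] = {
            # Share of people who did NOT reoffend but were scored high risk.
            "false_positive_rate": c["fp"] / did_not_reoffend if did_not_reoffend else None,
            # Share of people who DID reoffend but were scored low risk.
            "false_negative_rate": c["fn"] / did_reoffend if did_reoffend else None,
        }
    return rates

# Invented toy data: (group, predicted_high_risk, reoffended).
toy_records = (
    [("A", True, False)] * 4 + [("A", False, False)] * 6
    + [("A", True, True)] * 7 + [("A", False, True)] * 3
    + [("B", True, False)] * 2 + [("B", False, False)] * 8
    + [("B", True, True)] * 5 + [("B", False, True)] * 5
)

rates = error_rates_by_group(toy_records)
for group, r in sorted(rates.items()):
    print(group, r)
```

In this toy data, group A has twice the false positive rate of group B (0.4 against 0.2) and a lower false negative rate (0.3 against 0.5). A model can show this pattern even when its overall accuracy looks similar across groups, which is why the two error types are reported separately.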
1. Label Bias. Labels — the "correct answers" in supervised learning — are often human decisions. If doctors historically under-diagnosed depression in Black patients due to documented implicit bias, training a diagnostic model on those patient records means the model learns to under-diagnose. The label bias is invisible inside the data.
2. Selection Bias. Who got recorded? Clinical drug trial data from before the 1990s significantly under-represents women because the NIH did not require their inclusion until 1993. AI trained on that data develops predictions optimised for men.
3. Proxy Variables. Zip code, name, school attended — these variables don't explicitly encode race or gender, but in a historically segregated society they correlate strongly with both. A model that "ignores" protected characteristics but retains their proxies is still discriminating.
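One rough way to check whether a supposedly neutral variable acts as a proxy is to ask how well it lets you guess the protected attribute. The sketch below is an illustration under stated assumptions: `proxy_strength`, the field names, the zip codes, and the group labels are hypothetical, and a real audit would use a proper statistical measure and held-out data.

```python
from collections import Counter, defaultdict

def proxy_strength(rows, feature, protected):
    """Compare two ways of guessing the protected attribute.

    baseline:     always guess the most common group overall.
    with_feature: guess the most common group within each feature value.
    A large gap suggests the feature carries information about the group.
    """
    by_value = defaultdict(Counter)
    overall = Counter()
    for row in rows:
        by_value[row[feature]][row[protected]] += 1
        overall[row[protected]] += 1

    total = sum(overall.values())
    baseline = max(overall.values()) / total
    with_feature = sum(max(c.values()) for c in by_value.values()) / total
    return baseline, with_feature

# Invented toy data: two hypothetical zip codes in a segregated city.
rows = (
    [{"zip": "00001", "group": "X"}] * 9 + [{"zip": "00001", "group": "Y"}] * 1
    + [{"zip": "00002", "group": "X"}] * 2 + [{"zip": "00002", "group": "Y"}] * 8
)

baseline, with_zip = proxy_strength(rows, feature="zip", protected="group")
print(f"guess without zip: {baseline:.2f}, guess with zip: {with_zip:.2f}")
```

Here the zip code alone lifts the guess from 0.55 to 0.85, so dropping the group column while keeping the zip code would leave most of the group information in the data.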
Historical data is not neutral. Every dataset is a photograph of a particular moment in a particular society with particular power structures. When that photograph becomes training data, its distortions become the model's worldview.
Recognising historical bias is the first step toward addressing it — but recognition alone doesn't fix anything. The next lessons explore how bias enters through the collection process itself, through how variables are labelled, and through the human decisions embedded in what we choose to measure.
You are a data auditor examining three datasets proposed for training AI systems. For each dataset described, identify the type of historical bias present and explain how it would likely manifest in model behaviour.
The AI tutor will guide you through the analysis. Have at least three substantive exchanges to complete this lab.
In October 2019, Ziad Obermeyer and colleagues published research in Science showing that a widely-used commercial healthcare algorithm — used by major US health systems to identify high-risk patients for care management — was exhibiting significant racial bias. The algorithm predicted future healthcare cost as its proxy for health need. But Black patients with the same disease burden as white patients consistently generated lower healthcare costs, because decades of systemic barriers had reduced their access to care. The algorithm interpreted lower cost as lower need. Correcting for this reduced the disparity in algorithm scores, but the core problem — using cost as a stand-in for need — had persisted undetected in a system affecting 200 million people annually.
Machine learning models cannot measure abstract concepts directly — they measure operationalisations of those concepts. "Creditworthiness" becomes credit score. "Recidivism risk" becomes prior arrests. "Health need" becomes healthcare cost. Each operationalisation is a choice, and each choice reflects assumptions about the world.
The healthcare algorithm studied by Obermeyer and colleagues, widely reported to be an Optum product, illustrates construct validity failure: the measured variable (cost) does not actually capture the intended construct (need). This failure was not random. It was systematically worse for Black patients, whose access to care had been structurally constrained. The measurement encoded an inequality that it then reproduced.
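The mechanism can be made concrete with a toy population. This is a minimal sketch with invented numbers, not a reconstruction of the published analysis: the group names, the cost-per-condition values, and the helper `share_in_top_k` are assumptions chosen only to show how ranking by cost differs from ranking by need.

```python
# Hypothetical toy population: two groups with identical distributions of need
# (1 to 10 chronic conditions). The assumption being illustrated is that the
# reduced-access group generates less cost per condition.
COST_PER_CONDITION = {"full_access": 1000, "reduced_access": 650}  # made-up values

patients = [
    {"group": group, "need": need, "cost": need * COST_PER_CONDITION[group]}
    for need in range(1, 11)
    for group in COST_PER_CONDITION
]

def share_in_top_k(patients, rank_by, k, group):
    """Fraction of the k highest-ranked patients who belong to `group`."""
    top = sorted(patients, key=lambda p: p[rank_by], reverse=True)[:k]
    return sum(p["group"] == group for p in top) / k

# Select the 6 "highest-risk" patients for extra care under each ranking.
share_by_need = share_in_top_k(patients, "need", 6, "reduced_access")
share_by_cost = share_in_top_k(patients, "cost", 6, "reduced_access")

print(f"reduced-access share when ranking by need: {share_by_need:.2f}")
print(f"reduced-access share when ranking by cost: {share_by_cost:.2f}")
```

Both groups are equally sick by construction, so ranking by need selects them equally (a share of 0.50). Ranking by cost cuts the reduced-access group's share to about 0.17, even though nothing in the ranking rule mentions group membership.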
A 2020 study in the New England Journal of Medicine found that pulse oximeters — which measure blood oxygen saturation — overestimated oxygen levels in Black patients at a rate roughly three times higher than in white patients due to differences in skin pigmentation affecting light absorption. When hospital AI systems used pulse oximeter readings as inputs for COVID-19 severity prediction, Black patients' true hypoxemia was systematically under-detected. The problem wasn't the AI — it was the sensor data the AI was trained and evaluated on. Measurement bias upstream propagated throughout every downstream system.
Criminal justice. "Risk of reoffending" operationalised as prior arrests conflates policing intensity with criminal behaviour. Communities that are over-policed accumulate more arrest records, producing higher risk scores regardless of actual behaviour.
Education. "Teacher quality" operationalised as student test score gains ignores that test performance correlates strongly with socioeconomic status. AI-driven teacher evaluation systems trained on these metrics systematically rate teachers in low-income schools as lower quality.
Hiring. "Job performance" operationalised as manager ratings inherits whatever biases managers hold. Amazon's résumé engine shows the same problem with a different label: it learned from past hiring outcomes in a workforce that was already predominantly male.
In the Obermeyer study, researchers estimated that at the same algorithmic risk score, Black patients were considerably sicker than white patients, with 26.3% more chronic conditions. In other words, a Black patient had to be sicker to receive the same level of care management. The algorithm hadn't been programmed to discriminate. It had simply been given the wrong thing to measure.
| Domain | Intended Construct | Actual Measurement | Bias Introduced |
|---|---|---|---|
| Healthcare | Medical need | Future cost | Under-serves low-access populations |
| Criminal justice | Recidivism risk | Prior arrests | Penalises over-policed communities |
| Credit | Creditworthiness | Credit history length | Disadvantages new immigrants, young adults |
| Education | Teacher quality | Standardised test gains | Under-rates teachers in high-poverty schools |
Measurement bias is particularly dangerous because it hides behind the appearance of objectivity. Numbers feel neutral. But every number is the output of a measurement process, and that process encodes choices — about what to measure, how, and whose reality is treated as the baseline.
You are advising a city government that wants to use AI to allocate social services more efficiently. The proposed system will score neighbourhood need using available data. Your job: audit the proposed variables for measurement bias before the system is deployed.
The AI tutor will present proposed measurement variables. Your task is to identify construct validity problems and suggest better alternatives.
MIT researcher Joy Buolamwini noticed that facial analysis systems couldn't reliably detect her face until she put on a white mask. Her subsequent research, conducted with Timnit Gebru and published as Gender Shades in 2018, systematically audited commercial gender classification systems from IBM, Microsoft, and Face++. The results were stark: error rates on darker-skinned women reached 34.7%, while error rates on lighter-skinned men stayed below 1%. IBM's system correctly classified lighter-skinned males 99.7% of the time, and darker-skinned females only 65.3% of the time, a gap of 34.4 percentage points. Widely used face datasets of the period were heavily skewed: Labeled Faces in the Wild, for example, was estimated to be 77.5% male and 83.5% white. Models trained on data like this learn to be excellent at the faces they see most.
Common intuition holds that larger datasets are better datasets. The Gender Shades finding challenges this: a dataset of a million faces, 83% white, produces a model that is systematically worse for dark-skinned faces regardless of its size. Representativeness matters more than raw volume.
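Representativeness can be checked before any model is trained by comparing the dataset's composition with the population the system is meant to serve. The sketch below assumes you have a group label per example and a reference share per group. The function name, the group labels, and the 50/50 reference split are hypothetical placeholders, not figures from any real dataset.

```python
from collections import Counter

def representation_gaps(sample_groups, reference_shares):
    """Return dataset share minus reference share for each group.

    Negative values mean the group is under-represented in the dataset
    relative to the reference population.
    """
    counts = Counter(sample_groups)
    total = sum(counts.values())
    return {
        group: counts.get(group, 0) / total - share
        for group, share in reference_shares.items()
    }

# Invented toy dataset: 83 examples from one group, 17 from another.
sample_groups = ["group_a"] * 83 + ["group_b"] * 17
# Assumed reference population, chosen purely for illustration.
reference_shares = {"group_a": 0.5, "group_b": 0.5}

gaps = representation_gaps(sample_groups, reference_shares)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
```

Multiplying the dataset by ten would leave these gaps unchanged, which is the point of the paragraph above: the skew is a property of the proportions, not of the volume.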
Sampling bias emerges from several distinct mechanisms: convenience sampling (using whatever data is easy to collect, like internet-available images); survivorship bias (only recording outcomes for people who completed a process); and participation bias (certain groups being less likely or willing to appear in datasets due to historical reasons including mistrust of institutions).
A 2019 study in Nature Medicine found that AI systems trained to detect skin cancer from dermatology images performed significantly worse on dark skin tones, because the standard dermatology image databases — including ISIC (International Skin Imaging Collaboration) — were overwhelmingly composed of images from light-skinned patients. Skin cancer is both diagnosable and treatable — but the most capable AI diagnostic tools were least reliable for the populations with historically the least access to dermatological care. Sampling bias amplified an existing healthcare disparity.
AI systems are evaluated against benchmarks — standard test datasets used to compare models. If those benchmark datasets have the same demographic skews as the training data, a model can score extremely well on the benchmark while performing poorly in the real world for under-represented groups. This creates a dangerous illusion of quality.
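The illusion is easy to reproduce: when a benchmark is dominated by one group, the headline accuracy mostly reflects that group. Here is a minimal sketch of a disaggregated evaluation, using invented predictions. The function names and the 90/10 split are assumptions for illustration only.

```python
def accuracy(pairs):
    """Fraction of (prediction, truth) pairs that match."""
    pairs = list(pairs)
    return sum(pred == truth for pred, truth in pairs) / len(pairs)

def disaggregated_accuracy(examples):
    """examples: list of (group, prediction, truth) tuples.

    Returns the overall accuracy and a per-group breakdown.
    """
    per_group = {}
    for group, pred, truth in examples:
        per_group.setdefault(group, []).append((pred, truth))
    overall = accuracy((pred, truth) for _, pred, truth in examples)
    return overall, {g: accuracy(pairs) for g, pairs in per_group.items()}

# Invented benchmark: 90 majority-group examples (89 correct) and
# 10 minority-group examples (6 correct).
examples = (
    [("majority", 1, 1)] * 89 + [("majority", 0, 1)] * 1
    + [("minority", 1, 1)] * 6 + [("minority", 0, 1)] * 4
)

overall, by_group = disaggregated_accuracy(examples)
print(f"overall: {overall:.2f}")
for group, acc in by_group.items():
    print(f"{group}: {acc:.2f}")
```

The overall score is 0.95, yet the minority group sits at 0.60. Reporting only the first number would hide a failure rate of four in ten for the smaller group.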
ImageNet, the benchmark dataset that drove the deep learning revolution in computer vision, was assembled primarily from English-language internet sources. Studies found that object categories in ImageNet are systematically skewed toward Western contexts — "wedding" images reflect Western wedding aesthetics; "home" images reflect Western housing. Models trained on ImageNet learn a particular vision of the world and perform worse when deployed elsewhere.
Facial recognition built on skewed training data doesn't just fail in theory. It fails in ways that have led to wrongful arrests. Robert Williams, Michael Oliver, and Nijeer Parks are among the documented cases of Black men wrongly arrested after facial recognition misidentification. The sampling bias in training data became a real-world harm.
A hospital system wants to deploy an AI tool that reads chest X-rays to flag potential pneumonia cases. You've been asked to audit their proposed training dataset before it goes live.
The dataset contains 80,000 chest X-rays collected from a single large urban hospital over 15 years. Work with the AI tutor to identify sampling bias risks and develop an audit checklist.
PredPol (now Geolitica) and similar predictive policing tools were deployed across dozens of US cities from approximately 2012. These systems used historical crime data to predict where crime was likely to occur. Police were then directed to those areas — predominantly communities of colour — in greater numbers. More police presence generated more arrests, which generated more crime data, which reinforced the model's predictions that those areas required more policing. A 2020 study by the Stop LAPD Spying Coalition found that LAPD's PredPol deployments concentrated police activity in Black and Latino neighbourhoods while under-policing equivalent-crime-rate white neighbourhoods. The feedback loop was self-sealing: the model couldn't learn that its predictions were generating, rather than reflecting, the pattern it observed.
A feedback loop in AI bias occurs when a model's outputs influence the real world in ways that then become inputs to the model's future training. The loop has four stages:
1. Biased prediction. A model trained on historically skewed data makes a prediction reflecting that skew (e.g., "this neighbourhood has high crime risk").
2. Biased action. The prediction influences a real-world decision (more police deployed, loan denied, job application rejected).
3. Biased outcome. The action changes observed outcomes (more arrests recorded in that neighbourhood, applicant doesn't get job).
4. Biased retraining. The new outcome data is added to the training set, reinforcing the original pattern for the next model version.
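The four stages above can be sketched as a small deterministic simulation. This is a toy model under strong assumptions, not a description of how PredPol or any real system works: two districts have the same true incident rate, the district with more recorded incidents is treated as the hotspot, and patrols find more incidents wherever they are concentrated. The function name and all numbers are invented.

```python
def simulate_patrol_feedback(recorded, rounds, found_in_hot=70, found_elsewhere=30):
    """Toy feedback loop over districts with identical true incident rates.

    recorded: dict mapping district name to incidents recorded so far.
    Each round, the district with the most recorded incidents is flagged
    as the hotspot and receives most of the patrols, so more of its
    incidents are observed and added to the record.
    """
    recorded = dict(recorded)          # copy so the caller's data is untouched
    history = [dict(recorded)]
    for _ in range(rounds):
        # Stage 1, biased prediction: the hotspot is chosen from past records.
        hot = max(recorded, key=recorded.get)
        for district in recorded:
            # Stages 2 and 3, biased action and outcome: more patrols in the
            # hotspot means more incidents observed there.
            recorded[district] += found_in_hot if district == hot else found_elsewhere
        # Stage 4, biased retraining: the new records drive the next round.
        history.append(dict(recorded))
    return history

# A small historical skew: district A starts with one extra recorded incident.
history = simulate_patrol_feedback({"A": 11, "B": 10}, rounds=10)
first, last = history[0], history[-1]
share_a_start = first["A"] / (first["A"] + first["B"])
share_a_end = last["A"] / (last["A"] + last["B"])
print(f"A's share of recorded incidents: {share_a_start:.2f} -> {share_a_end:.2f}")
```

A one-incident head start grows into roughly 70% of all recorded incidents after ten rounds, although the two districts were defined to be identical. The recorded data ends up describing where patrols were sent rather than where incidents occurred.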
A 2019 study by Ali et al. (published in the Proceedings of the ACM on Human-Computer Interaction, CSCW 2019) found that Facebook's ad delivery algorithm, even when advertisers set no demographic targeting, automatically skewed delivery along racial and gender lines based on who had historically engaged with similar ads. An ad for lumber industry jobs was delivered predominantly to white men; nursing job ads went predominantly to women. The algorithm had learned from past engagement data who "should" see which ads, and those past engagement patterns reflected occupational segregation. Facebook's own optimisation system was amplifying labour market discrimination through its feedback loop between user engagement history and future ad delivery.
Standard model evaluation compares predictions against outcomes. But if the model's predictions caused the outcomes, this validation process measures the model's own influence, not its accuracy. A predictive policing model that sends police to neighbourhood X will find crime in neighbourhood X, which makes the model look accurate. The counterfactual — what would have happened in neighbourhood X without the extra policing — is invisible.
This problem is compounded by the fact that feedback loops appear to improve model performance by conventional metrics. The model's predictions keep getting confirmed by reality. But the model is not learning about crime — it is learning about where police are deployed.
Feedback loops are fundamentally a counterfactual problem: to know whether a model is accurate, you need to know what would have happened without it. In systems affecting human lives — policing, credit, hiring — running a control group is often ethically and practically impossible. This makes feedback loop bias among the most difficult forms of data bias to detect, measure, and correct.
| System | Biased Prediction | Action Taken | Outcome Becomes Training Data |
|---|---|---|---|
| Predictive policing | High crime area | More officers deployed | More arrests confirm prediction |
| Credit scoring | High default risk | Loan denied | No loan means no repayment history, confirming "risk" |
| Content recommendation | User likes misinformation | More misinformation served | More engagement trains system to serve more |
| Hiring AI | Candidate won't succeed | Not hired | Can never generate disconfirming performance data |
The four lessons of this module (historical data, measurement bias, sampling bias, and feedback loops) describe four distinct but interconnected ways that bias hides in data. Each can operate independently, but they frequently interact: a historically biased dataset (Lesson 1) uses flawed operationalisations (Lesson 2) collected from an unrepresentative sample (Lesson 3) whose outputs feed back into future training (Lesson 4). Understanding all four is necessary for any meaningful approach to AI fairness.
A parole board in a US state has been using a risk-scoring algorithm for eight years. The algorithm scores defendants on likelihood of reoffending and influences parole decisions. You've been hired to audit whether a feedback loop is operating — and if so, to design a protocol to break it.
The AI tutor will walk you through the audit and help you design an intervention. Push for specifics — what data would you need, what experiment would you run, what policy change would interrupt the loop?