Module 2 · Lesson 1

Historical Data: When the Past Poisons the Future

Data collected under biased conditions carries that bias forward — silently, permanently.
If a dataset was built in a world that discriminated, what does the AI trained on it actually learn?

Amazon built a machine-learning recruiting tool intended to automate the screening of job applicants. The system was trained on résumés submitted to Amazon over a ten-year period, the vast majority of which came from men, reflecting the long-standing gender imbalance in the technology industry. The model did not know the word "women" was a problem. It simply learned that the patterns on successful past résumés, which came overwhelmingly from men, were the patterns to reward. By 2015 Amazon had found that it was penalising résumés that contained the word "women's", as in "women's chess club captain." Amazon disbanded the project in 2017 after concluding that the bias could not be reliably corrected.

What Makes Historical Data Dangerous

Training data is a compressed record of past human decisions. Those decisions were made in contexts shaped by systemic inequalities — hiring managers who favoured certain names, loan officers who avoided certain zip codes, doctors who undertook fewer clinical trials on women or minority populations. The data doesn't label these moments as biased. It presents them as facts.

An AI system treats the statistical regularities in its training data as ground truth. If historically Black neighbourhoods were denied mortgages at higher rates, a model trained on those approval records will learn that those zip codes are "high risk." The model has no way to know — unless explicitly told — that the pattern was the product of illegal discrimination, not genuine financial risk.

Documented Case — COMPAS Recidivism Algorithm

ProPublica's 2016 investigation into Northpointe's COMPAS tool found that it assigned significantly higher recidivism risk scores to Black defendants who did not go on to reoffend, and lower scores to white defendants who did. The model was trained on criminal history data produced by a justice system with documented racial disparities in arrest, prosecution, and sentencing. The historical pattern — more Black defendants in the system — was treated as predictive signal rather than as evidence of systemic bias.
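The disparity described above is a gap in false positive rates: among people who did not reoffend, how many were flagged as high risk in each group. A minimal sketch of that calculation follows. The records are hypothetical toy data, not ProPublica's figures, and the function name and tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

def false_positive_rate_by_group(records):
    """records: iterable of (group, flagged_high_risk, reoffended) tuples.

    Returns, per group, the share of non-reoffenders who were flagged high risk.
    """
    flagged = defaultdict(int)  # non-reoffenders flagged high risk
    total = defaultdict(int)    # all non-reoffenders
    for group, flagged_high_risk, reoffended in records:
        if not reoffended:
            total[group] += 1
            if flagged_high_risk:
                flagged[group] += 1
    return {g: flagged[g] / total[g] for g in total}

# Hypothetical toy records (NOT real COMPAS data)
records = (
    [("A", True, False)] * 4 + [("A", False, False)] * 6    # group A, did not reoffend
    + [("B", True, False)] * 2 + [("B", False, False)] * 8  # group B, did not reoffend
    + [("A", True, True)] * 5 + [("B", True, True)] * 5     # reoffenders are ignored here
)

rates = false_positive_rate_by_group(records)
print(rates)  # {'A': 0.4, 'B': 0.2}
```

In this toy data both groups contain the same number of reoffenders, yet group A's non-reoffenders are wrongly flagged twice as often. An audit that only reports overall accuracy would not surface that gap.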

The Three Mechanisms of Historical Bias Transmission

1. Label Bias. Labels — the "correct answers" in supervised learning — are often human decisions. If doctors historically under-diagnosed depression in Black patients due to documented implicit bias, training a diagnostic model on those patient records means the model learns to under-diagnose. The label bias is invisible inside the data.

2. Selection Bias. Who got recorded? Clinical drug trial data from before the 1990s significantly under-represents women because the NIH did not require their inclusion until 1993. AI trained on that data develops predictions optimised for men.

3. Proxy Variables. Zip code, name, school attended — these variables don't explicitly encode race or gender, but in a historically segregated society they correlate strongly with both. A model that "ignores" protected characteristics but retains their proxies is still discriminating.
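The proxy mechanism can be made concrete with a small sketch. It assumes hypothetical loan records in which group membership is concentrated by zip code, and a deliberately simple "group-blind" model that scores each applicant by the historical approval rate of their zip code. All names and numbers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical history: (zip_code, group, historically_approved)
history = (
    [("10001", "X", True)] * 80 + [("10001", "X", False)] * 10
    + [("10001", "Y", True)] * 9 + [("10001", "Y", False)] * 1
    + [("20002", "Y", True)] * 36 + [("20002", "Y", False)] * 54
    + [("20002", "X", True)] * 4 + [("20002", "X", False)] * 6
)

def fit_zip_model(rows):
    """'Group-blind' model: the score for a zip code is its past approval rate."""
    approved, seen = defaultdict(int), defaultdict(int)
    for zip_code, _group, ok in rows:  # the group column is deliberately ignored
        seen[zip_code] += 1
        approved[zip_code] += ok
    return {z: approved[z] / seen[z] for z in seen}

def mean_score_by_group(rows, model):
    """Average model score received by each group, for auditing only."""
    total, count = defaultdict(float), defaultdict(int)
    for zip_code, group, _ok in rows:
        total[group] += model[zip_code]
        count[group] += 1
    return {g: total[g] / count[g] for g in count}

model = fit_zip_model(history)
scores = mean_score_by_group(history, model)
print(model)   # zip 10001 scores 0.89, zip 20002 scores 0.40
print(scores)  # group X averages about 0.84, group Y about 0.45
```

The model never reads the group column, but because zip code tracks group membership, the average score still differs sharply between groups. Removing the protected attribute changed nothing about the outcome.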

Label Bias: When the target variable (the "answer") in training data reflects human decisions made under biased conditions rather than objective ground truth.
Proxy Variable: A variable that does not explicitly identify a protected characteristic but is statistically correlated with it strongly enough to reproduce discriminatory outcomes.
Selection Bias: Systematic under- or over-representation of certain groups in a dataset, causing models trained on it to perform poorly for under-represented populations.
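10 years: span of the Amazon résumé data, which came overwhelmingly from men
77%: how much more likely COMPAS was to flag Black defendants than white defendants as higher risk of future violent crime, after controlling for criminal history, age, and gender (ProPublica 2016)
1993: year the NIH first required the inclusion of women in clinical trials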
Core Insight

Historical data is not neutral. Every dataset is a photograph of a particular moment in a particular society with particular power structures. When that photograph becomes training data, its distortions become the model's worldview.

Recognising historical bias is the first step toward addressing it — but recognition alone doesn't fix anything. The next lessons explore how bias enters through the collection process itself, through how variables are labelled, and through the human decisions embedded in what we choose to measure.

Lesson 1 Quiz

Historical Data & Bias Transmission
Amazon's recruiting AI penalised résumés containing the word "women's" primarily because:
Correct. The model learned statistical patterns from historical data skewed toward male applicants. No one programmed discrimination — the data itself encoded it.
Not quite. The bias emerged from the data, not deliberate programming. The model simply replicated patterns present in ten years of predominantly male successful résumés.
A "proxy variable" in the context of AI bias refers to:
Correct. Zip code, surname, and school attended can all function as race or socioeconomic proxies in historically segregated societies, allowing discrimination to persist even when protected variables are removed.
That's not the right definition. A proxy variable is one that doesn't explicitly encode a protected characteristic but correlates strongly enough with it to reproduce discriminatory effects.
The NIH's failure to require women in clinical trials until 1993 is an example of which bias type in AI training data?
Correct. Selection bias occurs when certain groups are systematically excluded from the data — in this case, women were largely excluded from clinical trial datasets, making AI trained on that data worse at diagnosing and treating women.
This is selection bias — the systematic under-representation of women in clinical trial data. Label bias involves biased target variables; proxy bias involves correlated stand-ins for protected characteristics.

Lab 1 — The Data Archaeology Lab

Investigate how historical bias embeds itself in training datasets

Your Mission

You are a data auditor examining three datasets proposed for training AI systems. For each dataset described, identify the type of historical bias present and explain how it would likely manifest in model behaviour.

The AI tutor will guide you through the analysis. Have at least three substantive exchanges to complete this lab.

Try asking: "Walk me through the COMPAS dataset — what specific historical patterns in it would cause racial bias in recidivism prediction?" or "If I have a credit scoring model trained on 1990s loan approval data, what proxy variables should I audit first?"
Data Archaeology Lab
AI Tutor
Welcome to the Data Archaeology Lab. I'm here to help you develop a systematic approach to identifying historical bias in training datasets.

Let's start with a core question: when we say data "carries" historical bias, what are we actually claiming? Are we saying the numbers are wrong, or that the numbers accurately record a biased reality? Think through that, then let's dig into some specific cases — COMPAS, Amazon's résumé engine, or medical datasets. Where would you like to begin?
Module 2 · Lesson 2

Measurement Bias: What You Choose to Count

Every dataset encodes decisions about what matters and what doesn't. Those decisions are never neutral.
When a healthcare algorithm measures "cost" instead of "illness," whose health gets optimised — and whose gets ignored?

In October 2019, Ziad Obermeyer and colleagues published research in Science showing that a widely used commercial healthcare algorithm, which major US health systems used to identify high-risk patients for care management, exhibited significant racial bias. The algorithm predicted future healthcare cost as its proxy for health need. But Black patients with the same disease burden as white patients consistently generated lower healthcare costs, because decades of systemic barriers had reduced their access to care. The algorithm interpreted lower cost as lower need. Changing the prediction target reduced the disparity in algorithm scores, but the core problem, using cost as a stand-in for need, had persisted undetected in a class of algorithms applied to roughly 200 million people in the US each year.

The Measurement Problem

Machine learning models cannot measure abstract concepts directly — they measure operationalisations of those concepts. "Creditworthiness" becomes credit score. "Recidivism risk" becomes prior arrests. "Health need" becomes healthcare cost. Each operationalisation is a choice, and each choice reflects assumptions about the world.

The algorithm, widely reported to be an Optum product, illustrates construct validity failure: the measured variable (cost) does not actually capture the intended construct (need). This failure was not random. It was systematically worse for Black patients, whose access to care had been structurally constrained. The measurement encoded an inequality that it then reproduced.
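To see how ranking on cost instead of need changes who is selected, consider a minimal sketch with six hypothetical patients. The patient records, group names, and cost figures are invented for illustration. The only assumption carried over from the text is that patients with constrained access generate lower cost for the same or greater illness burden.

```python
# Hypothetical patients: (patient_id, group, chronic_conditions, observed_annual_cost)
patients = [
    ("p1", "high_access", 3, 3000),
    ("p2", "high_access", 4, 4000),
    ("p3", "high_access", 5, 5000),
    ("p4", "low_access", 6, 2900),
    ("p5", "low_access", 7, 3400),
    ("p6", "low_access", 8, 3900),
]

NEED, COST = 2, 3  # column indexes in the tuples above

def select_top(rows, column, k):
    """Return the ids of the k patients with the highest value in `column`."""
    ranked = sorted(rows, key=lambda r: r[column], reverse=True)
    return [r[0] for r in ranked[:k]]

def low_access_count(ids):
    """How many of the selected patients belong to the low-access group."""
    groups = {p[0]: p[1] for p in patients}
    return sum(groups[i] == "low_access" for i in ids)

by_cost = select_top(patients, COST, 3)  # what a cost-trained score rewards
by_need = select_top(patients, NEED, 3)  # what the programme intended

print(by_cost, low_access_count(by_cost))  # ['p3', 'p2', 'p6'] 1
print(by_need, low_access_count(by_need))  # ['p6', 'p5', 'p4'] 3
```

With three care-management places available, ranking by cost admits one low-access patient, while ranking by the number of chronic conditions admits three. The data and the selection rule are identical; only the operationalisation differs.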

Documented Case — Pulse Oximetry and COVID-19

A 2020 study in the New England Journal of Medicine found that pulse oximeters, which estimate blood oxygen saturation, missed low oxygen levels in Black patients nearly three times as often as in white patients, an error attributed to skin pigmentation affecting light absorption. When hospital AI systems used pulse oximeter readings as inputs for COVID-19 severity prediction, Black patients' true hypoxemia was systematically under-detected. The problem wasn't the AI; it was the sensor data the AI was trained and evaluated on. Measurement bias upstream propagated into every downstream system.

Operationalisation Failures in Common AI Applications

Criminal justice. "Risk of reoffending" operationalised as prior arrests conflates policing intensity with criminal behaviour. Communities that are over-policed accumulate more arrest records, producing higher risk scores regardless of actual behaviour.

Education. "Teacher quality" operationalised as student test score gains ignores that test performance correlates strongly with socioeconomic status. AI-driven teacher evaluation systems trained on these metrics systematically rate teachers in low-income schools as lower quality.

Hiring. "Job performance" operationalised as manager ratings inherits whatever biases managers hold. Amazon's résumé engine from Lesson 1 shows the same pattern with a different target: "good candidate" was operationalised as resemblance to past applicants, who were mostly men.

Operationalisation: The process of translating an abstract concept into a measurable variable. Every operationalisation involves assumptions that may not hold equally across all populations.
Construct Validity Failure: When the variable being measured does not actually capture the underlying concept it is intended to represent.
Measurement Bias: Systematic error introduced when a measurement tool or operationalisation performs differently across demographic groups.
The Obermeyer Finding

Researchers estimated that to receive the same level of care management as white patients, Black patients would need to be sicker — equivalent to having 26.3% more chronic conditions. The algorithm hadn't been programmed to discriminate. It had simply been given the wrong thing to measure.

Domain | Intended Construct | Actual Measurement | Bias Introduced
Healthcare | Medical need | Future cost | Under-serves low-access populations
Criminal justice | Recidivism risk | Prior arrests | Penalises over-policed communities
Credit | Creditworthiness | Credit history length | Disadvantages new immigrants, young adults
Education | Teacher quality | Standardised test gains | Under-rates teachers in high-poverty schools

Measurement bias is particularly dangerous because it hides behind the appearance of objectivity. Numbers feel neutral. But every number is the output of a measurement process, and that process encodes choices — about what to measure, how, and whose reality is treated as the baseline.

Lesson 2 Quiz

Measurement Bias & Operationalisation
The Optum healthcare algorithm exhibited racial bias primarily because it used healthcare cost as a proxy for health need. This is an example of:
Correct. Cost doesn't equal need — especially when access to care has been historically constrained for certain populations. The operationalisation of "need" as "cost" encoded existing structural inequality.
This is a construct validity failure. Cost and need are different things, and they diverge systematically along racial lines due to historical barriers to healthcare access.
Using "prior arrests" to measure recidivism risk produces bias because:
Correct. Arrests are a product of both criminal behaviour and police deployment patterns. Using arrests as a recidivism proxy conflates these two very different things, disadvantaging over-policed communities.
The issue is that arrest rates reflect policing intensity as much as actual crime. Communities that receive more police surveillance accumulate more arrest records — making this a flawed proxy for actual criminal behaviour.
Pulse oximeters overestimating blood oxygen in Black patients represents bias at which stage?
Correct. The bias was in the sensor, not the AI. But any AI trained on or evaluated using pulse oximeter readings inherited that measurement error — demonstrating how upstream data quality problems propagate through entire systems.
The bias originated in the measurement device — the pulse oximeter performed differently on different skin tones. Any AI using those readings as inputs would inherit and potentially amplify this error.

Lab 2 — The Measurement Audit

Challenge the variables being used — ask what they actually measure

Your Mission

You are advising a city government that wants to use AI to allocate social services more efficiently. The proposed system will score neighbourhood need using available data. Your job: audit the proposed variables for measurement bias before the system is deployed.

The AI tutor will present proposed measurement variables. Your task is to identify construct validity problems and suggest better alternatives.

Start by asking: "What variables are being proposed for measuring neighbourhood need?" or "How does healthcare cost fail as a proxy for healthcare need in a city context?"
Measurement Audit Lab
AI Tutor
Welcome to the Measurement Audit. The city has proposed the following variables to score neighbourhood need for social service allocation:

• Emergency room visits per capita
• Reported crime incidents per capita
• Median household income
• School attendance rates
• Property tax revenue

Before we go further — can you spot any immediate construct validity problems? Which of these variables might not mean what the city thinks they mean?
Module 2 · Lesson 3

Sampling Bias and the Gaps That Shape Models

A model is only as good as who was in the room when the data was collected.
When a facial recognition system is trained almost entirely on light-skinned male faces, what happens when it meets everyone else?

MIT researcher Joy Buolamwini noticed that facial analysis systems couldn't reliably detect her face until she put on a white mask. Her subsequent research, conducted with Timnit Gebru and published as Gender Shades in 2018, systematically audited commercial gender-classification systems from IBM, Microsoft, and Face++. The results were stark: error rates on darker-skinned women reached 34.7%, compared with under 1% on lighter-skinned men. IBM's system correctly classified lighter-skinned males 99.7% of the time and darker-skinned females only 65.3% of the time, a gap of 34.4 percentage points. Widely used face datasets of the period were heavily skewed: Labeled Faces in the Wild was estimated to be 77.5% male and 83.5% white. Models built and tested on data like this learn to be excellent at the faces they see most.

Why Sampling Matters More Than Sample Size

Common intuition holds that larger datasets are better datasets. The Gender Shades finding challenges this: a dataset of a million faces, 83% white, produces a model that is systematically worse for dark-skinned faces regardless of its size. Representativeness matters more than raw volume.
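One simple way to check representativeness is to compare each group's share of the dataset with its share of the population the model will serve. The sketch below assumes a hypothetical sample that is 83% lighter-skinned and a hypothetical deployment population that is 60% lighter-skinned. The function name, group labels, and population shares are all assumptions.

```python
from collections import Counter

def representation_ratio(sample_labels, population_shares):
    """Ratio of each group's share in the sample to its share in the population.

    1.0 means proportional representation; below 1.0 means under-represented.
    """
    counts = Counter(sample_labels)
    n = len(sample_labels)
    return {
        group: (counts.get(group, 0) / n) / share
        for group, share in population_shares.items()
    }

# Hypothetical dataset composition and assumed deployment population
sample = ["lighter"] * 830 + ["darker"] * 170
population = {"lighter": 0.60, "darker": 0.40}

ratios = representation_ratio(sample, population)
print(ratios)  # lighter is about 1.38, darker is about 0.43
```

Multiplying the sample size by a thousand leaves both ratios unchanged, which is the point of the paragraph above: adding more data with the same composition does not close the gap.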

Sampling bias emerges from several distinct mechanisms: convenience sampling (using whatever data is easy to collect, like internet-available images); survivorship bias (only recording outcomes for people who completed a process); and participation bias (certain groups being less likely or willing to appear in datasets due to historical reasons including mistrust of institutions).

Documented Case — Dermatology AI and Skin Tone

A 2019 study in Nature Medicine found that AI systems trained to detect skin cancer from dermatology images performed significantly worse on dark skin tones, because the standard dermatology image databases — including ISIC (International Skin Imaging Collaboration) — were overwhelmingly composed of images from light-skinned patients. Skin cancer is both diagnosable and treatable — but the most capable AI diagnostic tools were least reliable for the populations with historically the least access to dermatological care. Sampling bias amplified an existing healthcare disparity.

The Benchmark Trap

AI systems are evaluated against benchmarks — standard test datasets used to compare models. If those benchmark datasets have the same demographic skews as the training data, a model can score extremely well on the benchmark while performing poorly in the real world for under-represented groups. This creates a dangerous illusion of quality.
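The illusion of quality comes from averaging. A short sketch with a hypothetical benchmark of 1,000 examples, 90% from a majority group, shows how a single headline accuracy can hide a large subgroup gap. The group names and counts are invented for illustration.

```python
def accuracy(pairs):
    """Share of (prediction, label) pairs that match."""
    return sum(pred == label for pred, label in pairs) / len(pairs)

def accuracy_by_group(rows):
    """rows: (group, prediction, label). Returns accuracy per group."""
    grouped = {}
    for group, pred, label in rows:
        grouped.setdefault(group, []).append((pred, label))
    return {g: accuracy(pairs) for g, pairs in grouped.items()}

# Hypothetical benchmark: 900 majority examples, 100 minority examples
rows = (
    [("majority", 1, 1)] * 891 + [("majority", 0, 1)] * 9    # 99% correct
    + [("minority", 1, 1)] * 70 + [("minority", 0, 1)] * 30  # 70% correct
)

overall = accuracy([(pred, label) for _group, pred, label in rows])
per_group = accuracy_by_group(rows)

print(overall)    # 0.961
print(per_group)  # {'majority': 0.99, 'minority': 0.7}
```

The headline figure of 96.1% is dominated by the majority group. Reporting accuracy per subgroup, as Gender Shades did, is what exposes the 29-point difference in this toy benchmark.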

ImageNet, the benchmark dataset that drove the deep learning revolution in computer vision, was assembled primarily from English-language internet sources. Studies found that object categories in ImageNet are systematically skewed toward Western contexts — "wedding" images reflect Western wedding aesthetics; "home" images reflect Western housing. Models trained on ImageNet learn a particular vision of the world and perform worse when deployed elsewhere.

Sampling Bias: Systematic over- or under-representation of certain groups in a dataset relative to their presence in the real-world population the model will be applied to.
Benchmark Trap: When a model's strong performance on a standard evaluation dataset masks poor performance on under-represented subgroups, because the benchmark has the same sampling biases as the training data.
Representativeness: The degree to which a dataset accurately reflects the distribution of the population the model will ultimately serve.
34.7%: maximum error rate for darker-skinned women, versus under 1% for lighter-skinned men (Gender Shades)
83.5%: proportion of white faces in Labeled Faces in the Wild
77.5%: proportion of male faces in Labeled Faces in the Wild
2015: Google Photos image classifier labels Black users' photos as "gorillas", an error traced to severe under-representation in training data.
2018: Gender Shades published. IBM, Microsoft, and Face++ all show dramatic performance gaps by skin tone and gender.
2019: NIST evaluates 189 facial recognition algorithms and finds that most exhibit differential performance by race, with false positive rates for Black and Asian faces 10–100× higher than for white faces.
2020: Detroit police wrongly arrest Robert Williams, a Black man, after a facial recognition misidentification.
The Policy Consequence

Facial recognition built on skewed training data doesn't just fail theoretically — it fails in ways that have led to wrongful arrests. Robert Williams, Michael Oliver, and Nijeer Parks are among documented cases of Black men wrongly arrested after facial recognition misidentification. The sampling bias in training data became a real-world harm.

Lesson 3 Quiz

Sampling Bias & Representation
The Gender Shades study found that commercial facial recognition systems performed worst on which group?
Correct. The intersection of darker skin tone and female gender produced the worst performance, directly reflecting that training datasets were skewed toward light-skinned male faces.
The Gender Shades study found the worst performance for darker-skinned females, with error rates reaching 34.7%, against under 1% for lighter-skinned males.
Why can a very large dataset still produce a biased model?
Correct. Representativeness matters more than raw size. A million biased examples produce a more confidently biased model, not a less biased one.
Size doesn't fix composition. A large dataset that over-represents one demographic group will train a model to be good at that group and systematically worse at under-represented ones.
The "benchmark trap" describes a situation where:
Correct. When benchmarks inherit training data's biases, high benchmark scores are misleading — the model hasn't learned to generalise, it's learned to perform on skewed data that matches its skewed training.
The benchmark trap is when evaluation datasets carry the same biases as training data, making a biased model appear high-performing because it's tested on the same kinds of faces it was trained on.

Lab 3 — The Dataset Auditor

Map the gaps in a proposed computer vision training dataset

Your Mission

A hospital system wants to deploy an AI tool that reads chest X-rays to flag potential pneumonia cases. You've been asked to audit their proposed training dataset before it goes live.

The dataset contains 80,000 chest X-rays collected from a single large urban hospital over 15 years. Work with the AI tutor to identify sampling bias risks and develop an audit checklist.

Start here: "What demographic information do I need to request to audit this X-ray dataset for sampling bias?" or "How would I test whether this pneumonia detector performs equally across age groups and ethnicities?"
Dataset Auditor Lab
AI Tutor
Good — you've been handed a dataset of 80,000 chest X-rays from one hospital, 15 years of data, and a mandate to audit it before a pneumonia-detection AI goes live.

Here's your starting point: the hospital is in an affluent urban area. It serves primarily insured patients. The radiology department was historically under-resourced and under-staffed on weekends.

Before you even look at the images — what are three demographic or structural factors about this dataset that would concern you? What populations might be missing, and why does that matter for a diagnostic AI?
Module 2 · Lesson 4

Feedback Loops: When Bias Compounds Itself

Biased predictions generate biased outcomes, which feed back as training data — making the next model more biased than the last.
If predictive policing sends more officers to flagged neighbourhoods, and those officers make more arrests there, what does the next version of the model learn?

PredPol (now Geolitica) and similar predictive policing tools were deployed across dozens of US cities from approximately 2012. These systems used historical crime data to predict where crime was likely to occur. Police were then directed to those areas — predominantly communities of colour — in greater numbers. More police presence generated more arrests, which generated more crime data, which reinforced the model's predictions that those areas required more policing. A 2020 study by the Stop LAPD Spying Coalition found that LAPD's PredPol deployments concentrated police activity in Black and Latino neighbourhoods while under-policing equivalent-crime-rate white neighbourhoods. The feedback loop was self-sealing: the model couldn't learn that its predictions were generating, rather than reflecting, the pattern it observed.

The Anatomy of a Feedback Loop

A feedback loop in AI bias occurs when a model's outputs influence the real world in ways that then become inputs to the model's future training. The loop has four stages:

1. Biased prediction. A model trained on historically skewed data makes a prediction reflecting that skew (e.g., "this neighbourhood has high crime risk").

2. Biased action. The prediction influences a real-world decision (more police deployed, loan denied, job application rejected).

3. Biased outcome. The action changes observed outcomes (more arrests recorded in that neighbourhood, applicant doesn't get job).

4. Biased retraining. The new outcome data is added to the training set, reinforcing the original pattern for the next model version.
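The four stages above can be simulated in a few lines. This is a deliberately simplified sketch, not a model of any real policing product. It assumes two hypothetical areas with identical underlying crime, a historical record that starts 60/40 because one area was patrolled more heavily, and a "model" that simply flags whichever area has more recorded arrests. The function name, the `focus` parameter, and all numbers are assumptions.

```python
def simulate_patrol_loop(recorded, true_rate, rounds, focus=0.8, patrols=100):
    """Simulate retraining on arrest data that the model's own output produced.

    recorded:  area -> arrests recorded so far (the training data)
    true_rate: area -> arrests found per patrol (equal rates = equal crime)
    Each round the area with the most recorded arrests is flagged (stage 1),
    a `focus` share of patrols is sent there (stage 2), patrols record
    arrests where they are sent (stage 3), and the records grow (stage 4).
    Returns each area's share of all recorded arrests after every round.
    """
    recorded = dict(recorded)  # do not mutate the caller's data
    history = []
    for _ in range(rounds):
        flagged = max(recorded, key=recorded.get)
        others = [area for area in recorded if area != flagged]
        for area in recorded:
            share = focus if area == flagged else (1 - focus) / len(others)
            recorded[area] += patrols * share * true_rate[area]
        total = sum(recorded.values())
        history.append({area: recorded[area] / total for area in recorded})
    return history

start = {"north": 60, "south": 40}   # skewed historical record
rates = {"north": 0.1, "south": 0.1}  # identical underlying crime

history = simulate_patrol_loop(start, rates, rounds=20)
print(round(history[0]["north"], 3))   # 0.618
print(round(history[-1]["north"], 3))  # 0.733
```

Although both areas have the same true rate, the northern area's share of recorded arrests climbs from 60% toward the 80% patrol share. The record increasingly describes where patrols were sent, not where crime occurred.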

Documented Case — Facebook Ad Delivery Algorithms

A 2019 study by Ali et al. (published at CSCW 2019) found that Facebook's ad delivery algorithm, even when advertisers set no demographic targeting, automatically skewed delivery along racial and gender lines based on who had historically engaged with similar ads. An ad for lumber industry jobs was delivered predominantly to white men; nursing job ads went predominantly to women. The algorithm had learned from past engagement data who "should" see which ads, and those past engagement patterns reflected occupational segregation. Facebook's own optimisation system was amplifying labour market discrimination through its feedback loop between user engagement history and future ad delivery.

Why Feedback Loops Are Hard to Detect

Standard model evaluation compares predictions against outcomes. But if the model's predictions caused the outcomes, this validation process measures the model's own influence, not its accuracy. A predictive policing model that sends police to neighbourhood X will find crime in neighbourhood X, which makes the model look accurate. The counterfactual — what would have happened in neighbourhood X without the extra policing — is invisible.

This problem is compounded by the fact that feedback loops appear to improve model performance by conventional metrics. The model's predictions keep getting confirmed by reality. But the model is not learning about crime — it is learning about where police are deployed.

Feedback Loop: A cycle in which a model's predictions influence real-world outcomes, which then become training data for future models, causing the original bias to compound over successive iterations.
Performative Prediction: A prediction that changes the very outcome it is trying to predict, making it impossible to evaluate model accuracy by comparing predictions to outcomes.
Self-Sealing Loop: A feedback loop that systematically eliminates the evidence needed to detect its own bias, because the model's actions prevent the generation of disconfirming data.
The Counterfactual Problem

Feedback loops are fundamentally a counterfactual problem: to know whether a model is accurate, you need to know what would have happened without it. In systems affecting human lives — policing, credit, hiring — running a control group is often ethically and practically impossible. This makes feedback loop bias among the most difficult forms of data bias to detect, measure, and correct.
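A first, modest audit step is to measure how much outcome data exists on each side of the decision threshold. The sketch below uses a hypothetical hiring log in which an outcome is only ever recorded for people who were hired. The function name, threshold, and records are assumptions, and the check only reveals missing evidence; it does not supply the counterfactual.

```python
def outcome_coverage(cases, hire_threshold):
    """Share of cases with an observed outcome, split by the decision taken.

    cases: (score, outcome) pairs, where outcome is None if never observed.
    """
    seen = {"hired": 0, "rejected": 0}
    total = {"hired": 0, "rejected": 0}
    for score, outcome in cases:
        decision = "hired" if score >= hire_threshold else "rejected"
        total[decision] += 1
        if outcome is not None:
            seen[decision] += 1
    return {d: seen[d] / total[d] for d in total if total[d]}

# Hypothetical log: (model score, did the person perform well?)
cases = [
    (0.9, True), (0.8, True), (0.7, False),  # hired, so performance was observed
    (0.4, None), (0.3, None), (0.2, None),   # rejected, so nothing was observed
]

coverage = outcome_coverage(cases, hire_threshold=0.5)
print(coverage)  # {'hired': 1.0, 'rejected': 0.0}
```

Zero coverage below the threshold means the model's negative predictions have never been tested, so any accuracy figure computed from this log describes only the people the model already approved of.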

System | Biased Prediction | Action Taken | Outcome Becomes Training Data
Predictive policing | High crime area | More officers deployed | More arrests confirm prediction
Credit scoring | High default risk | Loan denied | No loan means no repayment history, confirming "risk"
Content recommendation | User likes misinformation | More served | More engagement trains system to serve more
Hiring AI | Candidate won't succeed | Not hired | Can never generate disconfirming performance data

The four lessons of this module — historical data, measurement bias, sampling bias, and feedback loops — describe four distinct but interconnected ways that bias hides in data. Each can operate independently, but they frequently interact: a historically biased dataset (L1) uses flawed operationalisations (L2) collected from an unrepresentative sample (L3) whose outputs feed back into future training (L4). Understanding all four is necessary for any meaningful approach to AI fairness.

Lesson 4 Quiz

Feedback Loops & Compounding Bias
Why does a predictive policing algorithm's apparent accuracy validate nothing about its fairness?
Correct. This is the performative prediction problem. The model's outputs shape the reality it is then evaluated against. High accuracy tells you the loop is closed, not that the model is fair or correct.
The core problem is circularity: predictions determine deployment, deployment generates arrests, arrests confirm predictions. The model appears accurate because it caused the outcome it was predicting.
Facebook's ad delivery algorithm skewing job ads by race and gender, even without advertiser targeting, is a feedback loop because:
Correct. Historical segregation shaped who engaged with job ads historically. That engagement data trained the delivery algorithm, which then reinforced segregation by showing the same ads to the same demographics in the future.
The loop runs: historical occupational segregation → skewed past engagement data → algorithm learns skewed delivery patterns → future ads reinforce segregation → next generation of engagement data is equally skewed.
A "self-sealing" feedback loop is particularly dangerous because:
Correct. When a hiring model never hires a candidate, it never generates performance data for that candidate. When a policing model avoids certain areas, crime there goes unrecorded. The loop destroys its own disconfirming evidence.
Self-sealing means the loop eliminates the data that would expose its own bias. A hiring AI that never hires from a certain group can never be proven wrong about that group — the disconfirming performance data simply never gets generated.

Lab 4 — Breaking the Loop

Design an intervention to detect and interrupt a feedback loop in a live AI system

Your Mission

A parole board in a US state has been using a risk-scoring algorithm for eight years. The algorithm scores defendants on likelihood of reoffending and influences parole decisions. You've been hired to audit whether a feedback loop is operating — and if so, to design a protocol to break it.

The AI tutor will walk you through the audit and help you design an intervention. Push for specifics — what data would you need, what experiment would you run, what policy change would interrupt the loop?

Start with: "What evidence would tell me a feedback loop is operating in this parole risk score?" or "How do I design a randomised audit to detect whether the algorithm is generating circular validation?"
Breaking the Loop Lab
AI Tutor
You're auditing a parole risk algorithm that's been in use for eight years. Here's what you know:

• The algorithm scores defendants 1–10 on reoffending risk
• Scores of 7 or above typically result in parole denial
• The system is retrained annually on new parole outcomes
• Black defendants receive scores of 7+ at roughly twice the rate of white defendants with similar offence histories

Your hypothesis: the algorithm is in a feedback loop. Parole denial → continued incarceration → no opportunity to show the score was wrong → the high-risk label is never disconfirmed → the next model learns to deny parole at the same rate.

What's your first step to test whether this feedback loop is actually operating?

Module 2 Test

Where Bias Hides in Data — 15 questions · Pass mark 80%
1. Amazon discontinued its AI recruiting tool in 2017 primarily because:
Correct. The model learned from ten years of predominantly male résumé data that male patterns equated to success, penalising anything coded female.
The system penalised female-coded language because its training data was overwhelmingly from men. This is historical bias embedded through label bias.
2. ProPublica's 2016 investigation found the COMPAS algorithm assigned higher recidivism risk to Black defendants who did not reoffend. The root cause was:
Correct. Historical disparities in who gets arrested and prosecuted — not actual differences in criminal behaviour — produced the skewed patterns the model learned from.
The bias emerged from historical data, not intent. The justice system's racial disparities in arrest and prosecution rates were embedded in the training data as if they were neutral facts.
3. "Label bias" in supervised learning refers to:
Correct. If doctors historically under-diagnosed depression in certain populations, that under-diagnosis becomes the label — and the model learns to replicate it.
Label bias means the target variable embeds human bias. The model learns to predict the biased human decision, not the underlying truth the decision was supposed to reflect.
4. The NIH's exclusion of women from clinical trials before 1993 primarily introduces which type of data bias into AI systems trained on that research?
Correct. Selection bias means certain groups are systematically absent from the data, causing models trained on it to perform poorly for those groups.
This is selection bias — women were not selected for inclusion in the data. Their systematic absence means models trained on that data are optimised for men.
5. The Optum healthcare algorithm's racial bias arose because it used healthcare cost as a proxy for health need. Which bias type best describes this?
Correct. Cost and need diverge systematically by race when access to care has been historically constrained. The operationalisation was fundamentally flawed — a construct validity failure.
This is measurement bias. The variable being measured (cost) didn't capture the intended concept (need) equally across racial groups, because historical access barriers meant lower cost ≠ lower need for Black patients.
6. Using zip code as an input variable in a credit scoring model is an example of bias through:
Correct. In a racially segregated country, zip code is a proxy for race. A model can discriminate by race while claiming not to use racial data, simply by including zip code.
Zip code functions as a proxy variable for race in historically segregated areas. Removing protected characteristics from a model while retaining their proxies does not eliminate discriminatory outcomes.
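The proxy effect in question 6 can be made concrete with a small sketch. Everything here is hypothetical: the two groups, the two zip codes, the `segregation` parameter, and the `approve` rule, which stands in for a model that learned from historical approvals to favour one zip code. The rule never sees group membership, yet approval rates split by group as soon as residence correlates with group.

```python
import random


def make_applicants(n, segregation, seed=0):
    """Build (group, zip_code) pairs.

    `segregation` is the probability that a person lives in the zip code
    historically associated with their group (0.5 means no association).
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        group = rng.choice(["X", "Y"])
        home = "zip_1" if group == "X" else "zip_2"
        other = "zip_2" if group == "X" else "zip_1"
        zip_code = home if rng.random() < segregation else other
        people.append((group, zip_code))
    return people


def approve(zip_code):
    """A 'group-blind' rule: it looks only at zip code."""
    return zip_code == "zip_1"


def approval_rate_by_group(people):
    totals, approved = {}, {}
    for group, zip_code in people:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + int(approve(zip_code))
    return {g: approved[g] / totals[g] for g in totals}


print(approval_rate_by_group(make_applicants(10000, segregation=0.9)))
print(approval_rate_by_group(make_applicants(10000, segregation=0.5)))
```

With strong segregation the two groups receive very different approval rates even though group is not an input. With `segregation=0.5` the rates converge, which is why the disparity depends on the proxy relationship and not on the rule itself.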
7. The Gender Shades study found error rates for darker-skinned women in commercial facial recognition systems were up to how much higher than for lighter-skinned men?
Correct. Error rates reached 34.7% for darker-skinned females, against under 1% for lighter-skinned males, a gap of about 34 percentage points. Some systems were over 99% accurate for lighter-skinned males and below 66% for darker-skinned females. The gap directly reflected training data composition.
The Gender Shades study found error rates of up to 34.7% for darker-skinned females, compared with at most 0.8% for lighter-skinned males, across commercial systems from IBM, Microsoft, and Face++. That is a gap of about 34 percentage points.
8. The "benchmark trap" in AI evaluation means:
Correct. If a benchmark dataset is 83% white faces, a model trained on 83% white faces will score well on it — not because it's fair, but because it's tested on the same skew it learned from.
The benchmark trap: evaluation datasets mirror training data biases, so a biased model scores high on benchmarks while failing in the real world for under-represented groups. High scores become misleading validation.
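The arithmetic behind the benchmark trap is a weighted average, sketched below. The per-group accuracies (0.99 and 0.66) and the group names are illustrative assumptions, not measurements of any particular model. Only the 83/17 split echoes the skew mentioned in the feedback above.

```python
def overall_accuracy(share_by_group, accuracy_by_group):
    """Benchmark score = sum of (group share x accuracy on that group)."""
    return sum(share_by_group[g] * accuracy_by_group[g] for g in share_by_group)


# Hypothetical per-group accuracy of one fixed model.
accuracy = {"majority": 0.99, "minority": 0.66}

skewed_benchmark = {"majority": 0.83, "minority": 0.17}
balanced_benchmark = {"majority": 0.50, "minority": 0.50}

print(round(overall_accuracy(skewed_benchmark, accuracy), 3))    # 0.934
print(round(overall_accuracy(balanced_benchmark, accuracy), 3))  # 0.825
```

The model is identical in both lines. Only the composition of the test set changes, which is why a single headline score says little without per-group results.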
9. Why do AI diagnostic tools trained on standard dermatology datasets perform worse on dark skin tones?
Correct. The ISIC database and similar collections were built from clinical populations that skewed light-skinned, so models trained on them developed expertise on lighter skin and relative blindness to darker tones.
The bias is in the training data — standard dermatology image databases overwhelmingly contain light-skinned patient images. This is classic sampling bias with real diagnostic consequences.
10. In a predictive policing feedback loop, what makes the loop "self-sealing"?
Correct. The loop is self-sealing because the model's deployment pattern generates exactly the data pattern it predicts — making it look accurate while preventing detection of the circular validation.
Self-sealing means the loop destroys its own disconfirming evidence. Predicting crime → deploying police → recording arrests → looking accurate. The counterfactual (what if we hadn't deployed there?) is never observed.
11. Facebook's ad delivery algorithm showing lumberjack ads to white men and nursing ads to women, without explicit advertiser targeting, is an example of:
Correct. Historical segregation → skewed past engagement → algorithm learns delivery patterns → reinforces segregation → next round of data equally skewed. Classic feedback loop amplifying historical bias.
This is a feedback loop: historical occupational segregation shaped who engaged with which job ads in the past. The algorithm learned those engagement patterns and reproduced them, reinforcing the original segregation.
12. Using "prior arrests" as an operationalisation of "recidivism risk" introduces bias because:
Correct. This is measurement bias — the operationalisation conflates policing activity with criminal behaviour. Communities that receive more police surveillance generate more arrest data regardless of actual crime rates.
Arrests measure both criminal behaviour and policing intensity. In communities that are over-policed, more criminal behaviour is observed and recorded — but this reflects police deployment, not just crime. The operationalisation is flawed.
13. Labeled Faces in the Wild, a standard facial recognition training dataset, is approximately:
Correct. 77.5% male and 83.5% white — these skews directly explain why models trained on it exhibit severe performance disparities by gender and skin tone.
Labeled Faces in the Wild is approximately 77.5% male and 83.5% white. These composition statistics directly predict the performance disparities observed in the Gender Shades study.
14. Pulse oximeter overestimation of blood oxygen levels in Black patients demonstrates that AI bias can originate:
Correct. The bias was in the sensor. Any AI trained on or evaluated with pulse oximeter readings inherited that error — demonstrating that upstream measurement failures corrupt entire pipelines.
AI bias can enter at any point in the data pipeline, including in the physical instruments used to collect data. Sensor bias is measurement bias that propagates into every downstream analysis.
15. Which combination of bias types most accurately describes the COMPAS recidivism algorithm's documented problems?
Correct. COMPAS illustrates how multiple bias types co-occur and interact: historical justice system disparities feed into a flawed operationalisation that then creates a self-sealing feedback dynamic. Real-world AI bias is rarely a single-factor problem.
COMPAS combines multiple bias types: historical data from a racially disparate justice system, measurement bias in using arrests as a risk proxy, and feedback loop potential because incarcerated individuals can't generate disconfirming outcome data. These types compound.