Lesson 1 · Real World Bias Examples

COMPAS and the Recidivism Algorithm

When a risk-scoring tool used in courtrooms was shown to falsely flag Black defendants who did not reoffend as high-risk at nearly twice the rate of white defendants who did not reoffend.

How did a widely deployed criminal justice algorithm embed racial disparity into judicial decisions?

In May 2016, the investigative outlet ProPublica published an analysis titled "Machine Bias." Reporters Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner had obtained risk scores assigned by the COMPAS algorithm — Correctional Offender Management Profiling for Alternative Sanctions — to more than 7,000 defendants in Broward County, Florida. What they found reshaped the national debate on AI in the justice system.

What COMPAS Was

COMPAS, developed by Northpointe (now Equivant), was a proprietary risk-assessment tool sold to courts and corrections departments across the United States. It produced a score from 1 to 10 estimating the likelihood that a defendant would reoffend. Judges used these scores — often without understanding how they were generated — to inform bail, sentencing, and parole decisions.

The algorithm did not directly use race as an input. Instead, it used proxy variables: neighborhood, employment history, prior arrests, family criminal history, and answers to questionnaires about peers and attitudes. In the United States, these factors are structurally correlated with race because of decades of discriminatory policing, housing segregation, and unequal economic opportunity.

The ProPublica Findings

ProPublica tracked the defendants for two years after their COMPAS scores were assigned and compared predicted risk to actual reoffending. The results were striking.

45%

Black defendants labeled higher risk who did not reoffend

23%

White defendants labeled higher risk who did not reoffend

28%

Black defendants labeled lower risk who did reoffend

48%

White defendants labeled lower risk who did reoffend

Black defendants who did not go on to commit new crimes were nearly twice as likely to be falsely flagged as future criminals compared to white defendants. White defendants who did reoffend were more frequently mislabeled as low-risk. The algorithm was making different types of errors for different racial groups — consistently in the same direction.
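The error rates above come from a confusion matrix computed separately for each group. The following is a minimal sketch of that calculation, not ProPublica's code. The counts are invented so that the false positive rates land on the 45% and 23% figures, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GroupCounts:
    """Confusion-matrix counts for one group (illustrative numbers only)."""
    true_pos: int   # labeled higher risk, did reoffend
    false_pos: int  # labeled higher risk, did not reoffend
    true_neg: int   # labeled lower risk, did not reoffend
    false_neg: int  # labeled lower risk, did reoffend


def false_positive_rate(c: GroupCounts) -> float:
    # Share of people who did NOT reoffend but were labeled higher risk.
    return c.false_pos / (c.false_pos + c.true_neg)


def false_negative_rate(c: GroupCounts) -> float:
    # Share of people who DID reoffend but were labeled lower risk.
    return c.false_neg / (c.false_neg + c.true_pos)


# Hypothetical counts, NOT the Broward County data.
group_a = GroupCounts(true_pos=720, false_pos=450, true_neg=550, false_neg=280)
group_b = GroupCounts(true_pos=520, false_pos=230, true_neg=770, false_neg=480)

for name, counts in [("group_a", group_a), ("group_b", group_b)]:
    print(name,
          f"FPR={false_positive_rate(counts):.2f}",
          f"FNR={false_negative_rate(counts):.2f}")
```

Reporting both rates per group, rather than a single accuracy number, is what makes the direction of each group's errors visible.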

Documented Case

ProPublica cited the case of Vernon Prater, a white man with prior armed robbery convictions, who was given a low COMPAS risk score of 3. He was later arrested for breaking into a warehouse. They contrasted him with Brisha Borden, an 18-year-old Black woman with a juvenile record for minor offenses, who scored an 8 — high risk. She was not rearrested. The scores pointed in opposite directions from what subsequently happened.

Why Proxies Encode Discrimination

The core mechanism of the bias was not intentional racism in the algorithm's design. The issue was historical encoding. Prior arrests, for example, reflect not just individual behavior but policing intensity — neighborhoods with heavier police presence accumulate more arrests per crime committed. When a machine learning model trains on arrest data, it learns patterns from a justice system that was itself unevenly applied.

This is sometimes called feedback loop bias: the model learns from outputs of a biased system, and its predictions then influence future decisions in that same system, which generate more biased data, which re-trains the next version of the model. The bias compounds over time.
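A toy simulation can make the loop concrete. This sketch assumes two districts with the same true offense rate, a fixed patrol budget split in proportion to past recorded arrests, and one new recorded arrest per patrol unit. All of these are simplifying assumptions, and the function name is hypothetical.

```python
def simulate_feedback(initial_records, rounds, patrol_budget=100.0):
    """Allocate patrols in proportion to recorded arrests, then record new arrests.

    Both districts are ASSUMED to have the same true offense rate, so each
    patrol unit produces the same number of new recorded arrests everywhere.
    """
    records = list(initial_records)
    history = [tuple(records)]
    for _ in range(rounds):
        total = sum(records)
        # Patrol share follows past records, not true offense rates.
        patrols = [patrol_budget * r / total for r in records]
        # One recorded arrest per patrol unit in both districts.
        records = [r + p for r, p in zip(records, patrols)]
        history.append(tuple(records))
    return history


# Start from an uneven historical record (60 vs. 40), invented for illustration.
history = simulate_feedback(initial_records=[60.0, 40.0], rounds=5)
for step, (a, b) in enumerate(history):
    print(f"round {step}: district A={a:.0f}, district B={b:.0f}, gap={a - b:.0f}")
```

In this simplified model the 60/40 split never corrects itself, and the absolute gap in recorded arrests widens every round even though the underlying behavior is identical. Real systems are messier, but the direction of the effect is the point.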

Proxy Variable: A variable that is not race itself but is statistically correlated with race due to structural inequality — used as an input to a model that then produces racially disparate outputs.

False Positive Rate Disparity: When a model incorrectly flags members of one group at a higher rate than another — in COMPAS, Black defendants were falsely labeled high-risk at nearly double the rate of white defendants.

The Aftermath and Ongoing Use

Northpointe disputed ProPublica's methodology, arguing that the algorithm was equally accurate across racial groups when measured by calibration: at each score level, roughly the same proportion of defendants in each group went on to reoffend. That claim holds on its own terms. The controversy exposed a deeper problem: different statistical definitions of fairness are mutually incompatible when base rates differ between groups. Short of a perfect predictor, you cannot simultaneously achieve equal calibration, equal false positive rates, and equal false negative rates when the underlying reoffending rates differ across groups.
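The incompatibility follows directly from the definitions of the metrics. The sketch below uses positive predictive value (PPV) as a binary stand-in for calibration, which is a simplification, and all numbers are illustrative assumptions. It shows that once PPV and the false negative rate are held equal across two groups, a difference in base rate forces a difference in false positive rate.

```python
def implied_fpr(base_rate: float, ppv: float, fnr: float) -> float:
    """False positive rate forced by a given base rate, PPV, and FNR.

    Derived from the definitions (as fractions of the whole group):
        TP  = base_rate * (1 - FNR)
        FP  = (1 - base_rate) * FPR
        PPV = TP / (TP + FP)
    Solving for FPR gives the expression below.
    """
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * (1 - fnr)


# Same PPV and same FNR for both groups; only the base rate differs.
fpr_high_base = implied_fpr(base_rate=0.5, ppv=0.6, fnr=0.4)
fpr_low_base = implied_fpr(base_rate=0.4, ppv=0.6, fnr=0.4)

print(f"base rate 0.5 -> FPR {fpr_high_base:.3f}")
print(f"base rate 0.4 -> FPR {fpr_low_base:.3f}")
```

The group with the higher base rate ends up with the higher false positive rate. Equalizing the false positive rates instead would require giving up equal PPV or equal false negative rates.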

Despite the controversy, COMPAS and similar tools continued to be used. A 2018 study in Science Advances found that untrained humans who were given only a brief description of a defendant's crime performed as accurately as COMPAS at predicting recidivism, suggesting the tool's predictive power was modest at best, while its discriminatory impact was substantial.

Core Insight

COMPAS illustrates that a model can be race-neutral in its inputs and statistically calibrated in aggregate while still producing systematically unfair outcomes for specific groups. Algorithmic fairness is not a single number — it is a set of competing definitions whose trade-offs must be made visible and debated publicly.

Lesson 1 Quiz

COMPAS and Criminal Justice AI

1. What did the ProPublica 2016 investigation find about COMPAS scores for Black defendants who did not reoffend?

Correct. ProPublica found that 45% of Black defendants who did not reoffend were labeled high-risk, compared to 23% of white defendants — nearly double the false positive rate.

Not quite. The key finding was a stark disparity in false positive rates — Black defendants who did not reoffend were flagged at nearly twice the rate of white defendants in the same situation.

2. Why did COMPAS produce racially disparate results even though it did not directly use race as an input variable?

Correct. Proxy variables — prior arrests, employment history, neighborhood — carry racial correlations because of systemic inequality. A model trained on these inputs learns those correlations.

Incorrect. The disparity arose from proxy variables that encode structural inequality, not from intentional racial coding or manual adjustment.

3. What statistical fairness tension did the COMPAS controversy reveal?

Correct. When base rates of reoffending differ between groups, it is mathematically impossible to simultaneously achieve equal calibration, equal false positive rates, and equal false negative rates.

Not correct. The key insight is that different mathematical definitions of fairness are mutually incompatible when base rates differ between groups — forcing a choice about which definition to prioritize.

Lab 1 · Interrogating COMPAS

Discuss the fairness definitions conflict with your AI tutor — 3 exchanges to complete

Your Task

You've learned that COMPAS can be "calibrated" and still racially disparate. Explore the competing fairness definitions with your AI tutor. Ask about real-world implications, what courts should prioritize, or why proxy variables are so difficult to remove.

Suggested starter: "If COMPAS is equally calibrated for both groups, why does ProPublica say it's biased? Who is right?"

AI Tutor — COMPAS & Fairness Definitions Lab 1

Welcome to Lab 1. We're examining the COMPAS controversy — specifically how a tool can satisfy one definition of statistical fairness while violating another. Ask me anything about the competing fairness metrics, the proxy variable problem, or what courts should do when algorithms conflict with human judgment.

Lesson 2 · Real World Bias Examples

Amazon's Hiring Algorithm and Gender Bias

A machine learning résumé screening tool trained on a decade of hiring data taught itself that women were less desirable candidates — and had to be abandoned.

How did Amazon's AI recruiter learn to penalize words like "women's" and downgrade graduates of all-women's colleges?

In October 2018, Reuters reported that Amazon had quietly shelved an AI recruiting tool it had been developing since 2014. The system was designed to automate the screening of résumés — rating candidates on a scale of one to five stars, similar to the way Amazon rates products. The team that built it discovered by 2015 that the model was systematically downgrading résumés from women. Amazon disbanded the team working on it by 2017, and the tool was never used operationally in hiring.

How the Bias Emerged

Amazon trained the model on ten years of résumés submitted to the company. The tech industry, and Amazon's own workforce, was — and remains — male-dominated. The model learned from patterns in which résumés had historically led to hires. Because most successful hires over that decade were men, the model learned to associate male-associated language and institutions with hiring success.

The consequences were concrete and specific. The system penalized résumés that included the word "women's" — as in "women's chess club" or "women's leadership forum." It downgraded graduates of the two all-women's colleges it had identified in the data. It learned to favor verbs more commonly found in male-coded language, such as "executed" and "captured," over verbs more associated with female candidates, such as "collaborated."

Mechanism

Training Data as a Mirror of Past Discrimination

The model was not given gender as a feature — it inferred gender from correlated signals. The technical term is disparate impact through proxies. The model found the same pattern that historical discrimination produced and reproduced it at scale, automatically, on every résumé submitted. Had the tool been deployed, Amazon's bias would have been laundered through the appearance of data-driven objectivity.

Attempted Fixes and Why They Failed

Amazon engineers tried to fix the problem by editing the model to be neutral on gender-specific terms. But the problem ran deeper: the model had learned hundreds of other proxies — school names, phrasing patterns, extracurricular signals — that it used to make gender-correlated predictions. Each fix addressed a surface symptom while the underlying learned representation remained encoded in the model's weights.

This illustrates a fundamental challenge in bias mitigation called whack-a-mole debiasing: when you remove one biased feature, the model redistributes its predictive weight onto other correlated features. Unless you address the training data itself — or change the outcome variable — the bias tends to persist in recombined form.
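A small, hand-made example shows why deleting one feature does not remove the gap. This is a sketch under loud assumptions: the rows are invented, the "model" simply memorizes the historical hire rate for each feature combination, and nothing here reflects Amazon's actual system. Gender is kept only to audit the scores and is never used as an input.

```python
from collections import defaultdict

# Toy rows: (gender, has_womens_term, womens_college, hired).
# Gender is kept ONLY to audit the scores; the "model" never sees it.
ROWS = (
    [("M", 0, 0, 1)] * 6 + [("M", 0, 0, 0)] * 2
    + [("F", 1, 1, 1)] * 1 + [("F", 1, 1, 0)] * 3
    + [("F", 1, 0, 1)] * 1 + [("F", 1, 0, 0)] * 1
    + [("F", 0, 1, 0)] * 2
)


def fit_rate_model(rows, feature_idx):
    """'Train' by memorizing the historical hire rate per feature combination."""
    hired, seen = defaultdict(int), defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in feature_idx)
        seen[key] += 1
        hired[key] += row[3]
    return {key: hired[key] / seen[key] for key in seen}


def score_gap(rows, feature_idx):
    """Mean model score for men minus mean model score for women."""
    model = fit_rate_model(rows, feature_idx)
    totals, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        totals[row[0]] += model[tuple(row[i] for i in feature_idx)]
        counts[row[0]] += 1
    return totals["M"] / counts["M"] - totals["F"] / counts["F"]


gap_both = score_gap(ROWS, feature_idx=(1, 2))        # term + college
gap_college_only = score_gap(ROWS, feature_idx=(2,))  # "women's" term removed
print(f"score gap with both features:      {gap_both:.2f}")
print(f"score gap after removing the term: {gap_college_only:.2f}")
```

Removing the term shrinks the gap only slightly, because the remaining feature is correlated with the same group and the historical hire labels are unchanged.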

Broader Context

Amazon was not alone. A 2019 study by the National Bureau of Economic Research audited multiple commercial AI hiring tools and found that résumés with distinctively Black names received fewer callbacks than identical résumés with distinctively white names — replicating in AI the same "callback audit" discrimination researchers had documented in human hiring since the 1990s. The AI did not solve historical bias; it automated it.

What This Case Teaches

The Amazon case has become a landmark example because the company had substantial resources, sophisticated engineers, and clear awareness of the problem — and still could not fix it while preserving the tool's usefulness. The lesson is not merely technical. It is institutional: when an organization uses historical hiring patterns as ground truth for what a good hire looks like, it encodes all of its historical biases into the definition of success.

The case also highlights the opacity problem. Automated screening tools used at hiring scale can reject thousands of qualified candidates before any human reviews them. The affected candidates typically receive no explanation and have no mechanism for appeal.

Disparate Impact: When a facially neutral policy or algorithm produces outcomes that disproportionately disadvantage a protected group, regardless of intent.

Whack-a-Mole Debiasing: The phenomenon where removing one biased feature causes the model to compensate by weighting other correlated features, perpetuating the underlying bias through different pathways.

Core Insight

Defining "qualified candidate" using historical hires means defining it using historical discrimination. Any model trained to replicate past hiring success will replicate past hiring bias. Fairness requires changing what the model is optimizing for — not just which features it is allowed to see.

Lesson 2 Quiz

Amazon's Hiring Algorithm and Gender Bias

1. Why did Amazon's résumé screening AI penalize words like "women's chess club" even though gender was not an explicit input feature?

Correct. Without using gender directly, the model learned that male-associated language patterns correlated with historical hires, and penalized female-associated language accordingly.

Not correct. The bias arose because the model learned from ten years of historically male-dominated hiring outcomes and found gender-correlated proxies in the text.

2. What is "whack-a-mole debiasing" as illustrated by the Amazon case?

Correct. Amazon's engineers found that removing gender-associated terms caused the model to redistribute bias onto other features, making surface-level fixes insufficient.

Incorrect. Whack-a-mole debiasing refers to the frustrating cycle where fixing one biased signal causes the model to lean on other correlated signals. It names a problem, not a strategy.

3. What is the fundamental problem with defining "a good hire" using historical hiring outcomes?

Correct. Using historical hires as ground truth means the model learns to replicate the biases — conscious or structural — that shaped those historical decisions.

Not correct. The core problem is that historical hiring patterns embed historical discrimination, so training a model to replicate those patterns trains it to replicate that discrimination.

Lab 2 · Fixing Hiring Bias

Explore alternative approaches to debiasing résumé screening AI — 3 exchanges to complete

Your Task

Amazon couldn't fix its hiring algorithm by patching individual biased features. Discuss with your AI tutor what alternative approaches might actually work, from redefining success metrics to structural changes in training data collection.

Suggested starter: "If removing biased features doesn't work, what should Amazon have done differently from the start?"

AI Tutor — Debiasing Hiring AI Lab 2

Welcome to Lab 2. We're thinking through how to actually fix algorithmic hiring bias — not just patch symptoms. Ask me about alternative training objectives, counterfactual fairness, blind screening, or what regulators have proposed for AI in hiring.

Lesson 3 · Real World Bias Examples

Pulse Oximeters, Dermatology AI, and Skin Tone Gaps

Medical devices and algorithms trained predominantly on lighter-skinned patients have been shown to systematically misread or misdiagnose patients with darker skin — with life-threatening consequences.

How did decades of designing medical tools around a narrow patient population create a bias that AI inherited and amplified?

In December 2020, researchers published a landmark study in the New England Journal of Medicine examining pulse oximeter accuracy by race. Pulse oximeters — the finger-clip devices that measure blood oxygen — were assumed to work identically for all patients. The study, led by researchers at the University of Michigan, analyzed data from more than 10,000 patients across 178 U.S. hospitals. They found that Black patients were nearly three times more likely to have occult hypoxemia — dangerously low oxygen levels that the pulse oximeter failed to detect — compared to white patients.

The Mechanism: Optical Bias in Medical Hardware

Pulse oximeters work by shining light through the fingertip and measuring how much is absorbed by oxygenated versus deoxygenated hemoglobin. The devices were calibrated in the 1970s and 1980s using volunteer populations that were predominantly white. Melanin — the pigment that creates darker skin — absorbs some of the light frequencies the oximeter uses, interfering with accurate readings. The result: higher melanin concentrations cause the device to systematically overestimate blood oxygen saturation.

This is a design-stage bias. The hardware was built and certified using data that did not represent the diversity of patients it would serve. When AI-assisted clinical decision tools were later trained on oximeter data recorded in electronic health records, they inherited that upstream measurement error.
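The measurement error becomes visible when pulse oximeter readings (SpO2) are paired with arterial blood gas readings (SaO2) taken at about the same time. The sketch below flags occult hypoxemia using thresholds that are assumptions for this example (SpO2 of 92 to 96% with SaO2 below 88%), and the paired readings are synthetic.

```python
def is_occult_hypoxemia(spo2: float, sao2: float) -> bool:
    """Pulse oximeter looks acceptable while the arterial reading is low.

    Thresholds are ASSUMPTIONS for this sketch: SpO2 in 92-96%, SaO2 below 88%.
    """
    return 92 <= spo2 <= 96 and sao2 < 88


def occult_rate_by_group(readings):
    """readings: iterable of (group, spo2, sao2).

    The rate is computed among readings whose SpO2 falls in the 92-96% window.
    """
    flagged, eligible = {}, {}
    for group, spo2, sao2 in readings:
        if 92 <= spo2 <= 96:
            eligible[group] = eligible.get(group, 0) + 1
            flagged[group] = flagged.get(group, 0) + int(is_occult_hypoxemia(spo2, sao2))
    return {g: flagged[g] / eligible[g] for g in eligible}


# Synthetic paired readings, invented for illustration.
readings = [
    ("group_1", 94, 93), ("group_1", 95, 87), ("group_1", 93, 92), ("group_1", 96, 95),
    ("group_2", 94, 86), ("group_2", 95, 87), ("group_2", 93, 91), ("group_2", 96, 85),
    ("group_2", 99, 97),  # outside the 92-96% window, so not counted
]
rates = occult_rate_by_group(readings)
print(rates)
```

Any downstream model trained only on the SpO2 column would never see the arterial values, which is how the upstream error gets inherited silently.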

3×

Higher rate of undetected hypoxemia in Black patients (NEJM 2020)

1970s

Decade when oximeter calibration data was primarily collected

COVID-19

Pandemic that made the disparity clinically critical at mass scale

Dermatology AI: The Imaging Dataset Problem

A parallel problem exists in AI-powered skin condition diagnosis. In 2017, researchers at Stanford published a widely cited AI system for classifying skin lesions that matched dermatologist accuracy on a held-out test set. However, a subsequent audit found that the training dataset, drawn primarily from images submitted by U.S. and European clinical institutions, was overwhelmingly composed of lighter-skinned patients.

A 2021 review in JAMA Dermatology analyzed 70 studies on AI-assisted dermatology tools and found that only 18% of training images included a skin tone classification at all. Of those that did, images of darker skin tones were dramatically underrepresented. The clinical consequence: melanoma and other skin cancers are harder for the algorithms to detect on darker skin, in precisely the populations already underserved by dermatology access.
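A first step in auditing an imaging dataset is simply counting what the metadata says. The sketch below assumes each image carries an optional Fitzpatrick skin type label ("I" through "VI"), which is an assumption about the metadata format, and the label list is invented.

```python
from collections import Counter


def skin_tone_coverage(labels):
    """Summarize how many images carry a skin tone label and how labels are spread.

    labels: one entry per image, a Fitzpatrick type "I".."VI" or None if unlabeled.
    """
    total = len(labels)
    labeled = [x for x in labels if x is not None]
    counts = Counter(labeled)
    share_by_type = (
        {t: counts[t] / len(labeled) for t in sorted(counts)} if labeled else {}
    )
    return {"labeled_share": len(labeled) / total, "share_by_type": share_by_type}


# Invented metadata for 20 images; not drawn from any real dataset.
labels = [None] * 12 + ["I", "II", "II", "II", "III", "III", "IV", "V"]
report = skin_tone_coverage(labels)
print(report)
```

In this made-up example most images have no label at all, and type VI never appears, so a model's performance on the darkest skin tones could not even be measured from this data.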

Compounding Factor

Underrepresentation in Clinical Datasets

Clinical image datasets reflect historical patterns of healthcare access. Patients from higher-income, predominantly white populations have historically had better access to dermatologists who generate high-quality labeled images. When AI systems train on available clinical data, they train on the inequity of who received care — and then produce tools that work best for the populations that already had the best care.

Regulatory Response and Open Problems

In February 2021, the FDA issued a safety communication about pulse oximeter limitations and accuracy differences by skin tone — acknowledging a problem that had existed in cleared medical devices for decades. The FDA began developing new guidance requiring manufacturers to test devices across a broader range of skin tones before certification. As of 2024, updated standards remained under development.

For AI diagnostic tools, the challenge is compounded by regulatory frameworks that are still evolving. The FDA's Software as a Medical Device pathway requires clinical validation but does not yet mandate stratified performance reporting by race, skin tone, or other demographic variables in all categories. This means a tool can be cleared for clinical use even if its accuracy varies substantially across patient populations.

Design-Stage Bias: Bias introduced when a device or algorithm is built and calibrated using data that does not represent the full diversity of people who will use it — the bias is structural, not an error in deployment.

Occult Hypoxemia: Dangerously low blood oxygen that is not detected by pulse oximetry — found to occur at nearly three times the rate in Black patients due to device calibration bias.

Core Insight

Medical AI bias often originates upstream, in the hardware, datasets, and clinical systems that AI learns from. A model that trains on biased measurements or underrepresented imaging data will produce biased outputs regardless of how carefully the model itself is constructed. Fairness in medical AI requires auditing the entire data pipeline, from the sensor forward.

Lesson 3 Quiz

Medical AI and Skin Tone Bias

1. What did the 2020 New England Journal of Medicine study find about pulse oximeters and Black patients?

Correct. The study found that Black patients experienced occult hypoxemia — undetected dangerously low oxygen — at nearly three times the rate of white patients, due to melanin interfering with the device's optical readings.

Not correct. The study found Black patients were nearly three times more likely to have dangerously low oxygen levels that the oximeter failed to detect — the device systematically overestimated their blood oxygen levels.

2. Why did dermatology AI trained to detect skin cancer perform worse on darker skin tones?

Correct. The training images came primarily from clinical institutions serving predominantly lighter-skinned populations, so the model had little exposure to the visual patterns of skin conditions on darker skin.

Not quite. The performance gap arose because training images disproportionately represented lighter-skinned patients — a reflection of who historically had access to the specialist care that generated labeled clinical images.

3. What does "design-stage bias" mean in the context of medical AI?

Correct. Design-stage bias means the problem is baked in from the start — like pulse oximeters calibrated in the 1970s on predominantly white volunteers — not something that emerges later from misuse.

Incorrect. Design-stage bias refers to bias introduced during the creation of a device or algorithm, because the calibration or training data didn't represent the people who would ultimately use or be assessed by it.

Lab 3 · Medical AI and Representation

Explore solutions to dataset underrepresentation in clinical AI — 3 exchanges to complete

Your Task

Medical AI inherits the biases of the clinical data it trains on, including who historically received care. Discuss with your AI tutor how the field might address underrepresentation in medical training datasets — and who bears responsibility for fixing it.

Suggested starter: "Should the FDA require stratified performance data by race and skin tone before approving medical AI tools? What would that look like in practice?"

AI Tutor — Medical AI Representation Lab 3

Welcome to Lab 3. We're examining how to build more equitable medical AI — from dataset diversity requirements to regulatory frameworks. Ask me about federated learning across diverse health systems, synthetic data augmentation, FDA Software as a Medical Device guidance, or what "algorithmic auditing" looks like in clinical contexts.

Lesson 4 · Real World Bias Examples

Facial Recognition and the False Match Problem

Multiple documented cases of Black men being falsely identified by facial recognition systems and wrongly arrested have exposed the human cost of deploying inaccurate AI in high-stakes law enforcement contexts.

When Robert Williams was wrongly arrested in 2020 because a facial recognition system misidentified him, what did it reveal about how these tools are deployed — and on whom they fail?

On January 9, 2020, Detroit police officers arrested Robert Williams in his driveway in front of his wife and daughters. He was taken into custody on suspicion of shoplifting watches from a Shinola store in 2018. The evidence against him: a still image from store surveillance footage run through Michigan State Police's facial recognition database, which returned his name. Williams spent 30 hours in jail before a detective acknowledged — reportedly after Williams held a photo of himself next to the surveillance image — that the match was wrong. The charges were eventually dropped.

The Accuracy Gap in Facial Recognition

The Williams case was not an isolated failure. It followed a landmark 2018 study, "Gender Shades," by Joy Buolamwini of the MIT Media Lab and Timnit Gebru, which tested three commercial facial analysis systems (from Microsoft, IBM, and Face++) on a dataset of 1,270 parliamentary figures from Africa and northern Europe. The findings were stark.

<1%

Error rate for lighter-skinned males (Gender Shades, 2018)

35%

Maximum error rate for darker-skinned females in the same systems

2019

NIST study confirming false positive disparities across 189 algorithms

In December 2019, the National Institute of Standards and Technology (NIST) published a comprehensive evaluation of 189 facial recognition algorithms from 99 developers. The report found that most algorithms showed higher false positive rates for African American and Asian faces compared to Caucasian faces — by a factor of 10 to 100 times in the worst cases. False positives in a criminal identification context mean wrongly identifying a person as a suspect.
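A false positive rate of this kind is measured by scoring pairs of images of different people and counting how many clear the match threshold. The sketch below uses invented similarity scores and a single global threshold. It is not NIST's methodology, only an illustration of the metric.

```python
def false_match_rate(impostor_scores, threshold):
    """Share of different-person pairs scored at or above the match threshold."""
    return sum(s >= threshold for s in impostor_scores) / len(impostor_scores)


# Invented similarity scores for pairs of DIFFERENT people, by demographic group.
impostor_scores = {
    "group_1": [0.10, 0.22, 0.35, 0.41, 0.18, 0.29, 0.55, 0.30, 0.12, 0.81],
    "group_2": [0.45, 0.62, 0.83, 0.71, 0.38, 0.90, 0.57, 0.79, 0.66, 0.85],
}
THRESHOLD = 0.80  # one global threshold applied to everyone (an assumption)

fmr = {group: false_match_rate(scores, THRESHOLD)
       for group, scores in impostor_scores.items()}
for group, rate in fmr.items():
    print(f"{group}: false match rate = {rate:.2f}")
```

Because the threshold is shared, a group whose impostor scores run higher absorbs more of the false matches, even though the system's settings are nominally identical for everyone.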

Additional Documented Wrongful Arrests

Williams's case is not the only one on record. Michael Oliver of Detroit was arrested in 2019 after facial recognition software misidentified him as the man who had grabbed and damaged a teacher's phone; the case against him was later dismissed. Also in 2019, Nijeer Parks of New Jersey spent ten days in jail after being misidentified by facial recognition in connection with a shoplifting and assault incident; he was eventually cleared. All three men are Black. The American Civil Liberties Union documented the cases as part of a broader pattern in which law enforcement use of facial recognition produces false matches that disproportionately harm Black people.

The Deployment Gap

In each documented wrongful arrest case, the facial recognition output was treated as evidence strong enough to justify an arrest, despite official guidance — including from algorithm vendors themselves — that the technology should only be used as an investigative lead requiring human corroboration. The error was not only in the algorithm; it was in how law enforcement trusted and acted on algorithmic output without sufficient verification.

Why the Accuracy Gap Exists

The accuracy disparity in facial recognition traces to training data composition. Benchmark datasets used to develop and validate facial recognition algorithms — including Labeled Faces in the Wild, which became a standard benchmark — were assembled from internet images that skewed toward lighter-skinned, male, and celebrity-adjacent subjects. Models trained and benchmarked on these datasets optimized for the populations they saw most often.

Additionally, lower-resolution or lower-quality surveillance images compound the problem. Police departments often use footage from store cameras that produce compressed, low-resolution images. These images are harder to match accurately in general — but the degradation in performance is not uniform; algorithms fail earlier and more severely on faces with features less represented in training data.

False Positive (in identification): When a facial recognition system incorrectly identifies a person as matching a target image — in law enforcement, this means wrongly flagging an innocent person as a suspect.

Benchmark Dataset Bias: When the dataset used to evaluate an algorithm's performance does not represent the full population it will be deployed on, causing the benchmark to overstate real-world accuracy for underrepresented groups.
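The arithmetic behind benchmark dataset bias is a weighted average. In the sketch below, the per-group accuracies echo the under-1% and 35% error rates quoted earlier, while the 90/10 and 50/50 benchmark compositions are assumptions chosen for illustration.

```python
def overall_accuracy(groups):
    """Benchmark accuracy as the composition-weighted mean of per-group accuracy.

    groups: {name: (share_of_benchmark, accuracy_on_group)}; shares must sum to 1.
    """
    total_share = sum(share for share, _ in groups.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("group shares must sum to 1")
    return sum(share * acc for share, acc in groups.values())


# Illustrative compositions; the splits are assumptions, not real benchmark data.
skewed = {"lighter_male": (0.90, 0.99), "darker_female": (0.10, 0.65)}
balanced = {"lighter_male": (0.50, 0.99), "darker_female": (0.50, 0.65)}

print(f"skewed benchmark:   {overall_accuracy(skewed):.3f}")
print(f"balanced benchmark: {overall_accuracy(balanced):.3f}")
```

The same model scores far higher on the skewed benchmark simply because the group it fails on barely appears, which is why a single headline accuracy figure can hide a large subgroup gap.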

Core Insight

Facial recognition bias is not a theoretical problem — it has sent innocent people to jail. The harm is concentrated in the communities that were already underrepresented in training data: predominantly Black and darker-skinned individuals. High-stakes deployment of technology with documented performance disparities across demographic groups is not a technical decision alone; it is an ethical and policy decision with real civil rights implications.

Lesson 4 Quiz

Facial Recognition and the False Match Problem

1. What did the 2018 "Gender Shades" study by Buolamwini and Gebru find about commercial facial analysis systems?

Correct. Gender Shades revealed a massive performance gap — near-perfect accuracy for lighter-skinned males versus error rates as high as 35% for darker-skinned females — across systems from Microsoft, IBM, and Face++.

Not correct. Gender Shades found dramatic disparities — below 1% error for lighter-skinned males and up to 35% error for darker-skinned females, revealing that commercially deployed systems had huge accuracy gaps by demographic group.

2. In the Robert Williams case, what was the fundamental error in how facial recognition was used?

Correct. Even algorithm vendors state their outputs should be investigative leads, not standalone evidence. Treating the match as sufficient for arrest — without robust corroboration — was a critical failure in how law enforcement deployed the technology.

Not correct. The core error was treating a facial recognition match as sufficient justification for arrest, bypassing the verification steps that even the algorithm vendors recommend as necessary before acting on a match.

3. Why do standard facial recognition benchmark datasets contribute to performance disparities across skin tones?

Correct. Benchmark composition shapes what algorithms optimize for. When training and evaluation datasets skew toward lighter-skinned subjects, the resulting models perform better on that demographic and the benchmark scores don't reveal real-world disparities.

Not correct. The issue is dataset composition: benchmarks like Labeled Faces in the Wild contained predominantly lighter-skinned, male images sourced from the internet, so models trained and evaluated on them are neither exposed to nor optimized for darker skin tones.

Lab 4 · Facial Recognition Policy

Debate the governance questions around facial recognition in law enforcement — 3 exchanges to complete

Your Task

Innocent people have been jailed due to facial recognition errors. Some cities have banned the technology for law enforcement use. Others argue improved accuracy will solve the problem. Debate the governance and ethical questions with your AI tutor.

Suggested starter: "Should cities ban facial recognition in law enforcement entirely, or is better regulation and accuracy improvement enough? What does the evidence say?"

AI Tutor — Facial Recognition Governance Lab 4

Welcome to Lab 4. We're examining the governance questions around facial recognition in law enforcement — from municipal bans in San Francisco and Boston to NIST accuracy standards to civil rights frameworks. Ask me about the ban debate, what "good enough" accuracy means when false positives mean jail, or how other countries regulate this technology.

Module 2 Test

15 questions · 80% to pass · Real World Bias Examples

1. ProPublica's 2016 analysis of COMPAS found that Black defendants who did not reoffend were falsely labeled high-risk at what approximate rate compared to white defendants in the same situation?

Correct. 45% of Black defendants who didn't reoffend were labeled high-risk, vs. 23% of white defendants — nearly double.

Incorrect. ProPublica found Black defendants who didn't reoffend were labeled high-risk at nearly twice the rate: 45% vs. 23%.

2. COMPAS did not use race as a direct input. What mechanism produced racially disparate outputs?

Correct. Variables like arrest history reflect unequal policing by neighborhood and race, encoding structural inequality as a proxy for race.

Incorrect. Proxy variables — arrest history, employment, neighborhood — carry racial correlations due to systemic inequality, producing disparate outcomes without using race directly.

3. Northpointe argued COMPAS was fair because it was equally "calibrated" across racial groups. Why does this not resolve the fairness concern raised by ProPublica?

Correct. When base rates differ between groups, it is mathematically impossible to satisfy calibration, equal false positive rates, and equal false negative rates simultaneously. Each definition of fairness produces different outcomes.

Incorrect. The key point is that multiple definitions of fairness are mathematically incompatible when base rates differ — a model can satisfy calibration while still having disparate false positive rates.

4. Amazon abandoned its AI résumé screening tool because it exhibited which specific behavior?

Correct. The tool penalized women's-affiliated terms and institutions because it trained on a decade of male-dominated successful hires.

Incorrect. The tool was found to systematically downgrade women's résumés, penalizing terms like "women's" and degrees from all-women's colleges, because it learned from historically male-dominated hiring outcomes.

5. When Amazon's engineers removed gender-associated terms from the model's features, what happened?

Correct. This is the "whack-a-mole" problem — removing one biased feature causes the model to find and use other correlated signals to achieve the same biased outcome.

Incorrect. The bias persisted through other features — this is called whack-a-mole debiasing, where fixing one signal causes the model to compensate with others.
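A toy illustration of why dropping the sensitive feature is not enough. The records, the proxy feature (attendance at an all-women's college), and the "model" (a simple historical hire rate per proxy value) are all hypothetical and are not a description of Amazon's system. The model never sees gender, yet its average score still differs by gender because the proxy carries the same signal.

```python
from collections import defaultdict

# Hypothetical toy records: (gender, attended_womens_college, hired).
records = [
    ("M", 0, 1), ("M", 0, 1), ("M", 0, 1), ("M", 0, 0),
    ("F", 1, 0), ("F", 1, 0), ("F", 1, 1), ("F", 0, 1),
]

def hire_rate_by_proxy(rows):
    """'Train' a gender-blind model: score = historical hire rate per proxy value."""
    hired, seen = defaultdict(int), defaultdict(int)
    for _gender, proxy, was_hired in rows:
        hired[proxy] += was_hired
        seen[proxy] += 1
    return {proxy: hired[proxy] / seen[proxy] for proxy in seen}

def mean_score_by_gender(rows, model):
    """Average the gender-blind scores separately for each gender."""
    totals, counts = defaultdict(float), defaultdict(int)
    for gender, proxy, _was_hired in rows:
        totals[gender] += model[proxy]
        counts[gender] += 1
    return {g: totals[g] / counts[g] for g in counts}

model = hire_rate_by_proxy(records)            # gender is never used here
scores = mean_score_by_gender(records, model)  # gender is used only to audit the result
print(scores)  # men average about 0.80, women about 0.45
```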

6. The Amazon hiring AI was trained on ten years of résumés that led to successful hires. What was fundamentally wrong with using historical hires as the definition of "a good candidate"?

Correct. Past hiring reflected bias — conscious or structural — so "success" in the training data was defined within a discriminatory context that the model learned to reproduce.

Incorrect. The core problem is that historical hires were shaped by historical discrimination, so training a model to replicate those outcomes trains it to replicate those biases.

7. What did the 2020 New England Journal of Medicine study find about pulse oximeter accuracy by race?

Correct. The study found Black patients experienced occult hypoxemia — undetected dangerously low oxygen — at nearly triple the rate of white patients.

Incorrect. The study found Black patients were nearly three times more likely to have dangerously low oxygen undetected by pulse oximeters due to melanin interference with the device's optical sensors.

8. Why were pulse oximeters inaccurate for patients with darker skin tones?

Correct. Calibration data was collected decades ago from predominantly white volunteers. Melanin absorbs some of the light frequencies the device relies on, causing systematic overestimation of blood oxygen in darker-skinned patients.

Incorrect. The problem was that calibration data used to design the devices was collected primarily from white volunteers in the 1970s–80s. Melanin interferes with the optical wavelengths used, causing inaccurate readings on darker skin.

9. A 2021 JAMA Dermatology review of 70 AI dermatology studies found what about skin tone representation in training data?

Correct. The review found only 18% of studies classified skin tone at all, and among those, darker skin tones were dramatically underrepresented — a systematic gap in medical AI development.

Incorrect. Only 18% of the 70 studies even classified skin tone, and darker tones were dramatically underrepresented — reflecting historical inequities in who received specialist dermatology care.

10. What does "design-stage bias" mean?

Correct. Design-stage bias is baked in from creation — like pulse oximeters calibrated only on white volunteers — rather than appearing through later misuse or deployment errors.

Incorrect. Design-stage bias refers to bias introduced during the device's or algorithm's creation, because the calibration or training data didn't represent the full range of people it would ultimately serve.

11. The 2018 "Gender Shades" study found that one commercial facial analysis system had an error rate below 1% for lighter-skinned males but up to what percentage for darker-skinned females?

Correct. Gender Shades found error rates as high as 35% for darker-skinned females — compared to below 1% for lighter-skinned males — across commercial systems from Microsoft, IBM, and Face++.

Incorrect. The maximum error rate found was 35% for darker-skinned females, compared to below 1% for lighter-skinned males — a gap that exposed dramatic demographic performance disparities in commercial AI tools.

12. What did the 2019 NIST study of 189 facial recognition algorithms find about false positive rates?

Correct. The NIST study found that most of the 189 algorithms had dramatically higher false positive rates for African American and Asian faces — in the worst cases, 10 to 100 times higher than for Caucasian faces.

Incorrect. NIST found that most of the 189 algorithms tested produced higher false positive rates for African American and Asian faces than for Caucasian faces, often by a factor of 10 to 100.

13. What happened to Robert Williams in January 2020?

Correct. Williams was arrested by Detroit police and spent 30 hours in jail after a Michigan State Police facial recognition system misidentified him from surveillance footage. The error was acknowledged only after he held the surveillance image up beside his own face in front of a detective.

Incorrect. Robert Williams was wrongfully arrested and held for 30 hours after a facial recognition system misidentified him as a shoplifting suspect — one of several documented wrongful arrests tied to facial recognition misidentification.

14. Why do benchmark datasets like "Labeled Faces in the Wild" contribute to facial recognition performance disparities by skin tone?

Correct. Dataset composition from internet scraping skewed toward lighter-skinned, male, and celebrity faces — causing models to be trained on and optimized for those groups, with benchmark accuracy scores that don't reveal disparities on underrepresented groups.

Incorrect. These benchmarks were assembled from internet images dominated by lighter-skinned and male subjects. Models trained and evaluated on them perform well on those groups but have inflated benchmark scores that hide poor performance on underrepresented groups.
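The way a single benchmark score can hide a subgroup gap is a weighted average. In the sketch below, the group names, dataset shares, and per-group accuracies are hypothetical and are not measurements from Labeled Faces in the Wild or any real system.

```python
def overall_accuracy(groups):
    """Headline accuracy = per-group accuracy weighted by each group's share of the benchmark."""
    total_share = sum(share for share, _accuracy in groups.values())
    return sum(share * accuracy for share, accuracy in groups.values()) / total_share

# Hypothetical benchmark: (share of test images, accuracy on that group).
benchmark = {
    "well_represented_group": (0.90, 0.99),
    "underrepresented_group": (0.10, 0.70),
}

headline = overall_accuracy(benchmark)
print(f"{headline:.3f}")  # 0.961
```

A headline figure above 96% looks strong, yet in this hypothetical the underrepresented group sees errors on 30% of images. Reporting accuracy per group, as Gender Shades did, is what exposes the gap.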

15. Which common thread best describes the root cause of bias across all four cases in this module — COMPAS, Amazon's hiring tool, medical AI, and facial recognition?

Correct. Whether it was arrest records shaped by unequal policing, male-dominated hiring outcomes, clinical datasets reflecting unequal healthcare access, or benchmark images scraped from a non-representative internet — each case shows AI inheriting and scaling structural inequality from its training data.

Incorrect. The common thread across all four cases is that AI systems trained on outputs of historically unequal systems learned those inequalities and reproduced them at scale — concentrating harm on populations that were already disadvantaged in the systems that generated the training data.