In May 2016, the investigative outlet ProPublica published an analysis titled "Machine Bias." Reporters Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner had obtained risk scores assigned by the COMPAS algorithm — Correctional Offender Management Profiling for Alternative Sanctions — to more than 7,000 defendants in Broward County, Florida. What they found reshaped the national debate on AI in the justice system.
COMPAS, developed by Northpointe (now Equivant), was a proprietary risk-assessment tool sold to courts and corrections departments across the United States. It produced a score from 1 to 10 estimating the likelihood that a defendant would reoffend. Judges used these scores — often without understanding how they were generated — to inform bail, sentencing, and parole decisions.
The algorithm did not directly use race as an input. Instead, it used proxy variables: neighborhood, employment history, prior arrests, family criminal history, and answers to questionnaires about peers and attitudes. In the United States, these factors are structurally correlated with race because of decades of discriminatory policing, housing segregation, and unequal economic opportunity.
ProPublica tracked the defendants for two years after their COMPAS scores were assigned and compared predicted risk to actual reoffending. The results were striking.
Black defendants who did not go on to reoffend were nearly twice as likely as white defendants to be falsely flagged as high risk (roughly 45% versus 23%). White defendants who did reoffend were more often mislabeled as low risk (roughly 48% versus 28% for Black defendants). The algorithm was making different types of errors for different racial groups, and the errors ran consistently in the same direction.
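The two error types described above can be measured separately for each group. The following is a minimal sketch, assuming each record is a (group, flagged high risk, reoffended) tuple. The function name and the toy data are hypothetical and are not ProPublica's code or the Broward County figures.

```python
from collections import defaultdict


def error_rates_by_group(records):
    """Return false positive and false negative rates per group.

    records: iterable of (group, flagged_high_risk, reoffended) tuples.
    """
    counts = defaultdict(lambda: {"fp": 0, "tn": 0, "fn": 0, "tp": 0})
    for group, flagged, reoffended in records:
        if reoffended:
            key = "tp" if flagged else "fn"
        else:
            key = "fp" if flagged else "tn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]  # people who did not reoffend
        positives = c["fn"] + c["tp"]  # people who did reoffend
        rates[group] = {
            "false_positive_rate": c["fp"] / negatives if negatives else None,
            "false_negative_rate": c["fn"] / positives if positives else None,
        }
    return rates


# Made-up toy data for illustration only.
toy = (
    [("A", True, False)] * 4 + [("A", False, False)] * 6
    + [("A", True, True)] * 7 + [("A", False, True)] * 3
    + [("B", True, False)] * 2 + [("B", False, False)] * 8
    + [("B", True, True)] * 5 + [("B", False, True)] * 5
)
rates = error_rates_by_group(toy)
```

In this toy data, group A is wrongly flagged twice as often as group B, while group B is wrongly cleared more often. A single overall accuracy figure would hide both patterns.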
ProPublica cited the case of Vernon Prater, a white man with prior armed robbery convictions, who was given a low COMPAS risk score of 3. He was later arrested for breaking into a warehouse. They contrasted him with Brisha Borden, an 18-year-old Black woman with a juvenile record for minor offenses, who scored an 8 — high risk. She was not rearrested. The scores pointed in opposite directions from what subsequently happened.
The core mechanism of the bias was not intentional racism in the algorithm's design. The issue was historical encoding. Prior arrests, for example, reflect not just individual behavior but policing intensity — neighborhoods with heavier police presence accumulate more arrests per crime committed. When a machine learning model trains on arrest data, it learns patterns from a justice system that was itself unevenly applied.
This is sometimes called feedback loop bias: the model learns from outputs of a biased system, and its predictions then influence future decisions in that same system, which generate more biased data, which re-trains the next version of the model. The bias compounds over time.
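The loop can be illustrated with a deliberately simplified simulation. Everything here is an assumption made for illustration: two districts with identical true offense rates, a fixed number of patrols, and a hypothetical rule that sends most patrols to whichever district has more recorded arrests. It is not a model of any real police department or of COMPAS itself.

```python
def simulate_feedback(rounds, recorded=(11.0, 10.0), offense_rate=0.1,
                      patrols=100, hot_share=0.7):
    """Track district A's share of recorded arrests over time.

    Both districts have the SAME true offense rate. The only asymmetry is
    a small head start in recorded arrests for district A.
    """
    a, b = recorded
    history = [a / (a + b)]
    for _ in range(rounds):
        # Hypothetical allocation rule: the district with more recorded
        # arrests is treated as the "hot spot" and gets most patrols.
        if a >= b:
            patrol_a, patrol_b = patrols * hot_share, patrols * (1 - hot_share)
        else:
            patrol_a, patrol_b = patrols * (1 - hot_share), patrols * hot_share
        # Recorded arrests scale with patrol presence, not with any
        # difference in underlying behavior.
        a += patrol_a * offense_rate
        b += patrol_b * offense_rate
        history.append(a / (a + b))
    return history


history = simulate_feedback(rounds=10)
```

District A's share of recorded arrests climbs every round even though residents of both districts behave identically. A model trained on those records would read the gap as a real difference in risk.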
Northpointe disputed ProPublica's methodology, arguing that the algorithm was equally accurate across racial groups when measured by calibration: among defendants given the same score, roughly the same proportion went on to reoffend regardless of race. On that measure, the claim holds up. The controversy exposed a deeper problem: different statistical definitions of fairness are mutually incompatible when base rates differ between groups. Unless a model predicts perfectly, it cannot simultaneously achieve equal calibration, equal false positive rates, and equal false negative rates when the underlying reoffending rates differ across groups.
Despite the controversy, COMPAS and similar tools continued to be used. A 2018 study in Science Advances found that people with no criminal justice expertise, given only a short description of a defendant (age, sex, and criminal history), predicted recidivism about as accurately as COMPAS. The result suggests the tool's predictive power was modest at best, while its discriminatory impact was substantial.
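The incompatibility can be checked with arithmetic. The sketch below uses a standard identity linking a binary classifier's false positive rate to the group's base rate, its positive predictive value (PPV), and its false negative rate. PPV stands in here for calibration at a single high-risk cutoff, which is a simplifying assumption, and the numbers are hypothetical rather than COMPAS figures.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate implied by base rate, PPV, and false negative rate.

    Identity (follows from the definitions of the confusion-matrix rates):
        FPR = base_rate / (1 - base_rate) * (1 - PPV) / PPV * (1 - FNR)
    """
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * (1 - fnr)


# Two hypothetical groups with the same PPV and the same false negative
# rate, differing only in their underlying reoffending (base) rate.
fpr_higher_base = implied_fpr(base_rate=0.5, ppv=0.6, fnr=0.3)
fpr_lower_base = implied_fpr(base_rate=0.3, ppv=0.6, fnr=0.3)
```

With PPV and false negative rate held equal, the group with the higher base rate ends up with a false positive rate of about 0.47 against 0.20 for the other group. Equalizing the false positive rates would require giving up equality on one of the other two measures.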
COMPAS illustrates that a model can be race-neutral in its inputs and statistically calibrated in aggregate while still producing systematically unfair outcomes for specific groups. Algorithmic fairness is not a single number — it is a set of competing definitions whose trade-offs must be made visible and debated publicly.
You've learned that COMPAS can be simultaneously "calibrated" yet racially disparate. Explore the competing fairness definitions with your AI tutor. Ask about real-world implications, what courts should prioritize, or why proxy variables are so difficult to remove.
In October 2018, Reuters reported that Amazon had quietly shelved an AI recruiting tool it had been developing since 2014. The system was designed to automate the screening of résumés — rating candidates on a scale of one to five stars, similar to the way Amazon rates products. The team that built it discovered by 2015 that the model was systematically downgrading résumés from women. Amazon disbanded the team working on it by 2017, and the tool was never used operationally in hiring.
Amazon trained the model on ten years of résumés submitted to the company. The tech industry, and Amazon's own workforce, was — and remains — male-dominated. The model learned from patterns in which résumés had historically led to hires. Because most successful hires over that decade were men, the model learned to associate male-associated language and institutions with hiring success.
The consequences were concrete and specific. The system penalized résumés that included the word "women's" — as in "women's chess club" or "women's leadership forum." It downgraded graduates of the two all-women's colleges it had identified in the data. It learned to favor verbs more commonly found in male-coded language, such as "executed" and "captured," over verbs more associated with female candidates, such as "collaborated."
Amazon engineers tried to fix the problem by editing the model to be neutral on gender-specific terms. But the problem ran deeper: the model had learned hundreds of other proxies — school names, phrasing patterns, extracurricular signals — that it used to make gender-correlated predictions. Each fix addressed a surface symptom while the underlying learned representation remained encoded in the model's weights.
This illustrates a fundamental challenge in bias mitigation called whack-a-mole debiasing: when you remove one biased feature, the model redistributes its predictive weight onto other correlated features. Unless you address the training data itself — or change the outcome variable — the bias tends to persist in recombined form.
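A small synthetic example shows why removing one feature changes so little. This is a sketch under stated assumptions: the rows, the feature names explicit_term and proxy_term, and the "model" (which scores a résumé by the historical hire rate of résumés with the same feature values) are all hypothetical and are not Amazon's data or system.

```python
from collections import defaultdict


def make_rows():
    """Build synthetic résumés: (gender, explicit_term, proxy_term, count, hired)."""
    spec = [
        ("F", 1, 1, 20, 4),
        ("F", 0, 1, 10, 2),
        ("F", 0, 0, 10, 4),
        ("M", 0, 1, 10, 3),
        ("M", 0, 0, 50, 27),
    ]
    rows = []
    for gender, explicit, proxy, count, hired in spec:
        for i in range(count):
            rows.append({"gender": gender, "explicit_term": explicit,
                         "proxy_term": proxy, "hired": 1 if i < hired else 0})
    return rows


def fit_rate_model(rows, features):
    """Score = historical hire rate for each combination of allowed features."""
    totals = defaultdict(lambda: [0, 0])  # key -> [hires, résumés]
    for row in rows:
        key = tuple(row[f] for f in features)
        totals[key][0] += row["hired"]
        totals[key][1] += 1
    return {key: hires / n for key, (hires, n) in totals.items()}


def score_gap(rows, features):
    """Mean score for men minus mean score for women. Gender is never an input."""
    model = fit_rate_model(rows, features)
    scores = defaultdict(list)
    for row in rows:
        scores[row["gender"]].append(model[tuple(row[f] for f in features)])
    return sum(scores["M"]) / len(scores["M"]) - sum(scores["F"]) / len(scores["F"])


rows = make_rows()
gap_before = score_gap(rows, ["explicit_term", "proxy_term"])
gap_after = score_gap(rows, ["proxy_term"])  # explicit term removed
```

Dropping the explicit term shrinks the gap only slightly, from about 0.18 to about 0.17. The proxy feature still separates the groups, and the historical labels still reward one group over the other.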
Amazon was not alone. A 2019 study by the National Bureau of Economic Research audited multiple commercial AI hiring tools and found that résumés with distinctively Black names received fewer callbacks than identical résumés with distinctively white names — replicating in AI the same "callback audit" discrimination researchers had documented in human hiring since the 1990s. The AI did not solve historical bias; it automated it.
The Amazon case has become a landmark example because the company had substantial resources, sophisticated engineers, and clear awareness of the problem — and still could not fix it while preserving the tool's usefulness. The lesson is not merely technical. It is institutional: when an organization uses historical hiring patterns as ground truth for what a good hire looks like, it encodes all of its historical biases into the definition of success.
The case also highlights the opacity problem. Automated screening tools used at hiring scale can reject thousands of qualified candidates before any human reviews them. The affected candidates typically receive no explanation and have no mechanism for appeal.
Defining "qualified candidate" using historical hires means defining it using historical discrimination. Any model trained to replicate past hiring success will replicate past hiring bias. Fairness requires changing what the model is optimizing for — not just which features it is allowed to see.
Amazon couldn't fix their hiring algorithm by patching individual biased features. Discuss with your AI tutor what alternative approaches might actually work — from redefining success metrics to structural changes in training data collection.
In December 2020, researchers published a landmark study in the New England Journal of Medicine examining pulse oximeter accuracy by race. Pulse oximeters — the finger-clip devices that measure blood oxygen — were assumed to work identically for all patients. The study, led by researchers at the University of Michigan, analyzed data from more than 10,000 patients across 178 U.S. hospitals. They found that Black patients were nearly three times more likely to have occult hypoxemia — dangerously low oxygen levels that the pulse oximeter failed to detect — compared to white patients.
Pulse oximeters work by shining light through the fingertip and measuring how much is absorbed by oxygenated versus deoxygenated hemoglobin. The devices were calibrated in the 1970s and 1980s using volunteer populations that were predominantly white. Melanin — the pigment that creates darker skin — absorbs some of the light frequencies the oximeter uses, interfering with accurate readings. The result: higher melanin concentrations cause the device to systematically overestimate blood oxygen saturation.
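Occult hypoxemia can be counted directly when a pulse oximeter reading is paired with an arterial blood gas measurement taken at about the same time. The following is a minimal sketch. The thresholds (an oximeter reading of 92 to 96 percent alongside an arterial saturation below 88 percent) are stated here as assumptions and exposed as parameters, and the readings are invented for illustration.

```python
def occult_hypoxemia_rate(pairs, spo2_range=(92, 96), sao2_cutoff=88):
    """Share of reassuring oximeter readings that hide low arterial oxygen.

    pairs: iterable of (spo2, sao2) percentages, where spo2 is the pulse
    oximeter reading and sao2 is the arterial blood gas measurement.
    """
    low, high = spo2_range
    # Only readings that look acceptable on the oximeter are considered.
    eligible = [(spo2, sao2) for spo2, sao2 in pairs if low <= spo2 <= high]
    if not eligible:
        return None
    missed = sum(1 for _, sao2 in eligible if sao2 < sao2_cutoff)
    return missed / len(eligible)


# Invented paired readings for two hypothetical patient groups.
group_x = [(94, 93), (95, 86), (93, 90), (96, 87), (90, 85)]
group_y = [(94, 94), (95, 92), (92, 87), (96, 95)]

rate_x = occult_hypoxemia_rate(group_x)
rate_y = occult_hypoxemia_rate(group_y)
```

The comparison that matters is between groups: a device that overestimates saturation for one group will show a higher rate for that group even when the oximeter readings themselves look identical.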
This is a design-stage bias. The hardware was built and certified using data that did not represent the diversity of patients it would serve. When AI-assisted clinical decision tools were later trained on oximeter data recorded in electronic health records, they inherited that upstream measurement error.
A parallel problem exists in AI-powered skin condition diagnosis. In 2017, researchers at Stanford published in Nature a widely cited AI system for classifying skin lesions that matched dermatologist accuracy on a held-out test set. However, a subsequent audit found that the training dataset, drawn primarily from images submitted by U.S. and European clinical institutions, was overwhelmingly composed of lighter-skinned patients.
A 2021 review in JAMA Dermatology analyzed 70 studies on AI-assisted dermatology tools and found that only 18% of training images included a skin tone classification at all. Of those that did, images of darker skin tones were dramatically underrepresented. The clinical consequence: melanoma and other skin cancers are harder for the algorithms to detect on darker skin, in precisely the populations already underserved by dermatology access.
In February 2021, the FDA issued a safety communication about pulse oximeter limitations and accuracy differences by skin tone — acknowledging a problem that had existed in cleared medical devices for decades. The FDA began developing new guidance requiring manufacturers to test devices across a broader range of skin tones before certification. As of 2024, updated standards remained under development.
For AI diagnostic tools, the challenge is compounding. Regulatory frameworks for medical AI are still evolving. The FDA's Software as a Medical Device pathway requires clinical validation but does not yet mandate stratified performance reporting by race, skin tone, or other demographic variables in all categories. This means a tool can be cleared for clinical use even if its accuracy varies substantially across patient populations.
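Stratified reporting of the kind described above is simple to compute once subgroup labels exist. This is a sketch, assuming each validation case is a (subgroup, has condition, model flagged) tuple. The function name, the min_cases cutoff, and the toy counts are hypothetical and do not reflect any FDA requirement.

```python
from collections import defaultdict


def stratified_sensitivity(cases, min_cases=5):
    """Report sensitivity per subgroup instead of one pooled number.

    cases: iterable of (subgroup, has_condition, model_flagged) tuples.
    Subgroups with fewer than min_cases positive cases report None, so
    that thin data is shown as missing rather than as a misleading rate.
    """
    tally = defaultdict(lambda: [0, 0])  # subgroup -> [detected, positives]
    for subgroup, has_condition, model_flagged in cases:
        if has_condition:
            tally[subgroup][1] += 1
            if model_flagged:
                tally[subgroup][0] += 1
    return {
        subgroup: {
            "n_positive": positives,
            "sensitivity": detected / positives if positives >= min_cases else None,
        }
        for subgroup, (detected, positives) in tally.items()
    }


# Toy validation set for illustration only.
cases = (
    [("lighter", True, True)] * 18 + [("lighter", True, False)] * 2
    + [("darker", True, True)] * 6 + [("darker", True, False)] * 4
    + [("unrecorded", True, True)] * 2
    + [("lighter", False, False)] * 30
)
report = stratified_sensitivity(cases)
pooled = sum(1 for _, cond, flag in cases if cond and flag) / sum(1 for _, cond, _ in cases if cond)
```

The pooled sensitivity here is about 0.81, which looks acceptable, while the per-group figures are 0.90 and 0.60. The "unrecorded" group shows why missing skin tone labels are a problem in their own right: with too few cases, no rate can be reported at all.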
Medical AI bias often originates upstream, in the hardware, datasets, and clinical systems that AI learns from. A model that trains on biased measurements or underrepresented imaging data will produce biased outputs regardless of how carefully the model itself is constructed. Fairness in medical AI requires auditing the entire data pipeline, from the sensor forward.
Medical AI inherits the biases of the clinical data it trains on, including who historically received care. Discuss with your AI tutor how the field might address underrepresentation in medical training datasets — and who bears responsibility for fixing it.
On January 9, 2020, Detroit police officers arrested Robert Williams in his driveway in front of his wife and daughters. He was taken into custody on suspicion of shoplifting watches from a Shinola store in 2018. The evidence against him: a still image from store surveillance footage run through Michigan State Police's facial recognition database, which returned his name. Williams spent 30 hours in jail before a detective acknowledged — reportedly after Williams held a photo of himself next to the surveillance image — that the match was wrong. The charges were eventually dropped.
The Williams case was not an isolated failure. It followed "Gender Shades," a landmark 2018 study by MIT Media Lab researcher Joy Buolamwini and Timnit Gebru, which tested three commercial facial analysis systems (from Microsoft, IBM, and Face++) on a dataset of 1,270 parliamentarians from three African and three European countries. The findings were stark: gender classification error rates reached 34.7% for darker-skinned women, compared with at most 0.8% for lighter-skinned men.
In December 2019, the National Institute of Standards and Technology (NIST) published a comprehensive evaluation of 189 facial recognition algorithms from 99 developers. The report found that most algorithms showed higher false positive rates for African American and Asian faces than for Caucasian faces, by a factor of 10 to 100 in the worst cases. False positives in a criminal identification context mean wrongly identifying a person as a suspect.
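A false positive in face matching occurs when two different people receive a similarity score above the system's match threshold. The sketch below shows how one shared threshold can produce very different false match rates for different groups. The scores, group names, and threshold are invented for illustration and are not NIST data.

```python
def false_match_rate(impostor_scores, threshold):
    """Share of different-person comparisons scored at or above the threshold."""
    matches = sum(1 for score in impostor_scores if score >= threshold)
    return matches / len(impostor_scores)


# Similarity scores for pairs of DIFFERENT people (impostor pairs).
# Hypothetical assumption: the model separates faces less well for
# group_b, so its impostor scores sit higher.
impostor_scores = {
    "group_a": [0.10, 0.35, 0.52, 0.61, 0.78, 0.81, 0.30, 0.45, 0.66, 0.20],
    "group_b": [0.40, 0.55, 0.83, 0.79, 0.91, 0.62, 0.85, 0.70, 0.88, 0.50],
}

THRESHOLD = 0.80  # one global cutoff applied to everyone
fmr = {group: false_match_rate(scores, THRESHOLD)
       for group, scores in impostor_scores.items()}
ratio = fmr["group_b"] / fmr["group_a"]
```

In this toy data the same cutoff wrongly matches group_b four times as often as group_a. In a one-to-many search against a large database, even a small per-comparison rate translates into many wrong candidates, so a gap like this falls directly on the people being searched.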
Williams's case was not the only one. In 2019, Michael Oliver of Detroit was wrongfully arrested after facial recognition software misidentified him as the man who grabbed and broke a teacher's cellphone; the case against him was later dismissed. That same year, Nijeer Parks of New Jersey spent ten days in jail after being misidentified by facial recognition in a shoplifting and assault incident; he was eventually cleared. All three men were Black. The American Civil Liberties Union documented the cases as part of a broader pattern of law enforcement facial recognition use producing false matches that disproportionately harm Black suspects.
In each documented wrongful arrest case, the facial recognition output was treated as evidence strong enough to justify an arrest, despite official guidance — including from algorithm vendors themselves — that the technology should only be used as an investigative lead requiring human corroboration. The error was not only in the algorithm; it was in how law enforcement trusted and acted on algorithmic output without sufficient verification.
The accuracy disparity in facial recognition traces to training data composition. Benchmark datasets used to develop and validate facial recognition algorithms — including Labeled Faces in the Wild, which became a standard benchmark — were assembled from internet images that skewed toward lighter-skinned, male, and celebrity-adjacent subjects. Models trained and benchmarked on these datasets optimized for the populations they saw most often.
Additionally, lower-resolution or lower-quality surveillance images compound the problem. Police departments often use footage from store cameras that produce compressed, low-resolution images. These images are harder to match accurately in general — but the degradation in performance is not uniform; algorithms fail earlier and more severely on faces with features less represented in training data.
Facial recognition bias is not a theoretical problem — it has sent innocent people to jail. The harm is concentrated in the communities that were already underrepresented in training data: predominantly Black and darker-skinned individuals. High-stakes deployment of technology with documented performance disparities across demographic groups is not a technical decision alone; it is an ethical and policy decision with real civil rights implications.
Innocent people have been jailed due to facial recognition errors. Some cities have banned the technology for law enforcement use. Others argue improved accuracy will solve the problem. Debate the governance and ethical questions with your AI tutor.