In January 2019, a dermatology AI published in Nature reported 91% accuracy diagnosing skin cancer, matching board-certified dermatologists. The headline was celebrated globally. What the headline did not say: the training set was 97% lighter-skinned patients. Performance on darker skin tones was never separately reported.
You cannot find a gap you never look for.
When researchers report a single accuracy number, they are averaging across every test case. If Group A makes up 90% of the test set and the model scores 99% on Group A but only 50% on Group B, the headline figure is still ~94%: impressive-sounding, but catastrophic in practice for Group B.
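The arithmetic behind that headline figure can be checked directly. This is a minimal sketch using only the numbers from the paragraph above; the function name and the dictionary layout are assumptions made for illustration.

```python
def overall_accuracy(groups):
    """Weighted average of per-group accuracy; weights are test-set shares."""
    total_share = sum(share for share, _ in groups.values())
    return sum(share * acc for share, acc in groups.values()) / total_share

# (share of test set, accuracy on that group), as in the example above
groups = {"A": (0.90, 0.99), "B": (0.10, 0.50)}

headline = overall_accuracy(groups)
print(f"headline accuracy: {headline:.1%}")
for name, (share, acc) in groups.items():
    print(f"group {name}: {acc:.0%} accuracy on {share:.0%} of the test set")
```

The single number hides the 49-point gap because Group B contributes only a tenth of the weight.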
This is not a hypothetical. In 2019, a landmark study by Obermeyer et al. published in Science examined a commercial health-risk algorithm used on roughly 200 million people per year in U.S. hospitals. The algorithm predicted who needed extra care. Overall accuracy looked acceptable. Disaggregated by race, however, the results showed that Black patients assigned the same risk score as white patients were actually far sicker. The model systematically underestimated Black patients' needs because it used healthcare cost as a proxy for health, and structural inequities meant Black patients historically spent less on care even when equally ill.
The bias was invisible until someone ran disaggregated performance statistics.
Researchers at UC Berkeley analyzed a widely deployed hospital risk-stratification algorithm. They found that at the same algorithmically assigned risk score, Black patients had on average 26.3% more active chronic conditions than white patients. The gap was caused by using prior medical cost as a proxy for health need, a design choice that encoded historical access disparities into future care allocation.
Before you can find a gap, you need precise language for what you are measuring.
The U.S. National Institute of Standards and Technology ran the Face Recognition Vendor Test (FRVT), publishing results in December 2019 that tested 189 face recognition algorithms from 99 developers. This is the largest independent evaluation of face recognition AI ever conducted.
Results were disaggregated by demographics. The findings were stark: for one-to-one verification (confirming an identity), false match rates for African-American and Asian faces were 10 to 100 times higher than for Caucasian faces in most algorithms. For one-to-many search (finding a face in a database), African-American women had the highest false positive rates of any demographic group tested.
These were not tiny boutique systems. These were commercial products deployed at airports, border crossings, and police departments. The gaps existed in production for years before NIST's systematic disaggregated testing revealed them.
A bias gap does not announce itself. It requires deliberate measurement infrastructure: test sets that represent minority subgroups in sufficient numbers, evaluation pipelines that report disaggregated metrics, and stakeholders who ask uncomfortable questions about who the system fails.
Effective bias testing shares several characteristics documented in the research literature:
Minority groups must be large enough in the test set to yield statistically meaningful results. A 200-sample test set with 4 examples of a demographic group cannot reliably detect bias.
Report accuracy, false positive rate, false negative rate, and calibration separately for each group. A single metric can look fine while others reveal serious disparities.
Decide which groups to evaluate before looking at data. Post-hoc selection can accidentally hide disparities or create spurious findings.
Design inputs specifically intended to stress-test edge cases (accented speech, non-standard names, darker skin tones, atypical body types), not just naturally occurring test samples.
In the next lesson, we look at how researchers use audit studies, the method that exposed bias in resume screening, lending, and speech recognition, to find gaps in deployed systems without ever accessing the model's internals.
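The characteristics above can be sketched in code. The following is a minimal illustration, not a reference implementation: the record format `(group, y_true, y_pred)`, the function names, and the toy counts are all assumptions made for this example. It reports accuracy, false positive rate, and false negative rate per group, and attaches a Wilson score interval to show how little a 4-example subgroup can tell you. Calibration is left out because it needs predicted probabilities rather than hard labels.

```python
import math
from collections import defaultdict

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives ~95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

def disaggregate(records):
    """records: iterable of (group, y_true, y_pred) with 0/1 labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, y_true, y_pred in records:
        # "t"/"f" = prediction correct or not, "p"/"n" = predicted class
        key = ("t" if y_true == y_pred else "f") + ("p" if y_pred == 1 else "n")
        counts[group][key] += 1
    report = {}
    for group, c in counts.items():
        n = sum(c.values())
        neg, pos = c["fp"] + c["tn"], c["tp"] + c["fn"]
        report[group] = {
            "n": n,
            "accuracy": (c["tp"] + c["tn"]) / n,
            "fpr": c["fp"] / neg if neg else None,
            "fnr": c["fn"] / pos if pos else None,
            "accuracy_ci": wilson_interval(c["tp"] + c["tn"], n),
        }
    return report

# Toy 200-sample test set (invented counts): 196 majority, 4 minority examples
records = (
    [("majority", 1, 1)] * 98 + [("majority", 0, 0)] * 88
    + [("majority", 0, 1)] * 5 + [("majority", 1, 0)] * 5
    + [("minority", 1, 1)] * 1 + [("minority", 0, 0)] * 2
    + [("minority", 0, 1)] * 1
)

report = disaggregate(records)
for group, row in report.items():
    lo, hi = row["accuracy_ci"]
    print(f"{group}: n={row['n']} acc={row['accuracy']:.2f} CI=({lo:.2f}, {hi:.2f})")
```

With only four minority examples the interval on accuracy spans most of the range from 0 to 1, which is why the subgroup count has to be fixed before any gap can be interpreted.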
You are analyzing a newly trained loan approval model. The developer reports 88% accuracy on a 10,000-sample test set. Your job is to decide what additional testing is needed. Use the chat to explore disaggregated evaluation strategies, discuss the Obermeyer and NIST findings, and identify what metrics would reveal hidden bias in this scenario.
Amazon's internal machine learning team built a resume screening tool starting in 2014. By 2015 they realized it was downgrading resumes that included the word "women's" (as in "women's chess club") and systematically ranking graduates of all-women's colleges lower. The bias came from training on ten years of submitted resumes, which were predominantly from men. Amazon scrapped the tool in 2017. Reuters reported the story in October 2018. The discovery came through internal audit: comparing outputs for matched inputs.
An audit study is a controlled experiment where researchers submit matched pairs of inputs that differ only in the characteristic being tested. In hiring discrimination research, this method dates to the 1970s: researchers send identical resumes with names that signal different racial backgrounds and measure callback rates. The same logic now applies to AI.
For AI systems specifically, audit studies typically work by:
Submitting identical documents where only a name changes ("Emily Walsh" vs. "Lakisha Washington") to test whether a system treats racially associated names differently.
Altering perceived race, gender, or age in photographs while holding all other features constant, then measuring how classification or recommendation outputs change.
Submitting identical semantic content written in African American Vernacular English (AAVE) vs. Standard American English, measuring sentiment analysis or toxicity scores.
Changing a single protected attribute in a description (swapping "he" for "she," "Black" for "white") and checking whether unrelated predictions change as a result.
Researchers Manish Raghavan, Solon Barocas, Jon Kleinberg, and Karen Levy published "Mitigating Bias in Algorithmic Hiring" in 2020, documenting how audit studies apply to automated hiring tools. They examined products from vendors including HireVue, which uses video interview AI to score candidates. The paper noted that such systems are tested against a "ground truth" of which candidates were hired or promoted, but if those historical hiring decisions were themselves biased, the AI learns to replicate those biased judgments.
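The matched-pair logic in the list above can be sketched as a small harness. Everything here is illustrative: `toy_score` is a hypothetical stand-in for the system under audit (deliberately biased so the audit has something to find), and the templates and function names are assumptions. In a real audit, `score_fn` would wrap calls to the deployed system.

```python
from statistics import mean

def toy_score(text):
    """Hypothetical stand-in for the audited system (NOT a real model).
    It is rigged to penalise one name so the harness has a gap to detect."""
    score = 0.5 + 0.1 * text.lower().count("python")
    if "Lakisha" in text:
        score -= 0.2
    return round(score, 3)

def paired_audit(template, variants, score_fn):
    """Score one template with only the tested attribute swapped."""
    return {label: score_fn(template.format(name=value))
            for label, value in variants.items()}

# Matched inputs: identical text, only the name differs
templates = [
    "{name} has five years of Python experience.",
    "{name} led a team of four engineers.",
]
variants = {"name_a": "Emily Walsh", "name_b": "Lakisha Washington"}

gaps = []
for template in templates:
    scores = paired_audit(template, variants, toy_score)
    gaps.append(scores["name_a"] - scores["name_b"])

mean_gap = mean(gaps)
print(f"mean score gap (name_a - name_b): {mean_gap:+.3f}")
```

Because each pair differs in exactly one attribute, any consistent gap across many templates can be attributed to that attribute rather than to the content of the document.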
In 2019, the Electronic Privacy Information Center (EPIC) filed a complaint with the Federal Trade Commission about HireVue, citing concerns about opaque AI assessment. HireVue subsequently discontinued its use of facial analysis in 2021, an outcome partly driven by external audit pressure and advocacy.
Stanford researchers Allison Koenecke et al. published a study in PNAS testing five major commercial speech-to-text systems (Apple, Amazon, Google, IBM, Microsoft) against audio from white and Black speakers. The average word error rate for Black speakers was 35%, compared to 19% for white speakers. For some systems, Black speakers' error rates were more than twice as high. The research used real audio recordings with manually verified transcripts as ground truth, a classic external audit design.
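For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between the system transcript and the verified transcript, divided by the number of words in the verified transcript. The sketch below shows that standard definition; the example sentences are invented, and this is not a reconstruction of the study's own pipeline.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# Invented example: one dropped word and one substituted word out of six
reference = "she was going to the store"
hypothesis = "she going to a store"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}")
```

An audit of the kind described would compute this per recording and then average within each speaker group before comparing groups.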
In 2020, researchers Su Lin Blodgett and colleagues published extensive documentation of NLP systems producing systematically lower sentiment scores and higher toxicity ratings for text written in African American Vernacular English compared to semantically equivalent Standard American English. This was discovered through controlled substitution: the same meaning, different dialect, different AI output.
The practical stakes are direct: if a content moderation system flags AAVE at higher rates, posts by Black users get removed at higher rates. If a customer service AI assigns negative sentiment to AAVE queries, those users may be routed to lower-quality support.
External audits are powerful but have constraints. Companies may detect systematic querying and respond differently. Audit studies test the inputs a researcher designs, so they can miss failure modes that were not anticipated. Some systems require accounts or fees that create barriers to audit. And companies can change their algorithms after an audit, eliminating documented gaps without necessarily fixing the underlying problem.
Despite these limits, external audits have forced accountability in cases where developers did not voluntarily test their own systems. The speech recognition gap, the resume screening bias, and the facial recognition disparities all emerged from researchers who were willing to systematically probe systems from the outside.
External audit studies treat AI systems as black boxes and probe them with carefully designed inputs. They have produced some of the most consequential bias discoveries in AI history without requiring source code access, training data, or developer cooperation.
A company has deployed an AI system that scores customer service chat transcripts for "professionalism" and routes low-scoring interactions to human review. You suspect the system may be scoring African American Vernacular English (AAVE) differently from Standard American English. Design an audit study to test this hypothesis.
OpenAI's March 2023 GPT-4 technical report included a section on red-teaming that described months of adversarial testing before public release. The report acknowledged that early versions of GPT-4 produced content "that could be used to facilitate biological weapon creation" at rates that required intervention. The red team included domain experts in biosecurity, cybersecurity, and discrimination law. This was not a PR exercise: it directly shaped what filters and refusals were built into the final product.
Red-teaming in AI refers to a structured process where a dedicated group attempts to make an AI system behave in harmful, biased, or unsafe ways. The red team's job is to break the system, to find failure modes the development team did not anticipate. For bias specifically, red-teaming focuses on eliciting differential treatment across demographic groups, stereotyped outputs, and harmful representations.
Unlike audit studies, which typically test deployed systems from the outside, red-teaming usually happens before deployment with direct model access. However, some organizations run external red-team exercises with invited researchers who sign NDAs in exchange for model access.
Meta released Galactica, a large language model trained on scientific literature, as a public demo in November 2022. Within 72 hours, researchers and users conducting informal red-teaming discovered the model confidently generated false scientific content and produced text perpetuating racial stereotypes when asked about social science topics. The demo was taken down after three days. The episode illustrated what happens when red-teaming is insufficient before release.
Effective bias red-teaming goes beyond asking a model offensive questions. Documented techniques include:
Prompting models to complete sentences, fill in blanks, or generate descriptions of groups. "People from [country] are typically ___" reveals embedded associations from training data.
Asking the same factual or creative question with only a demographic variable changed: "Write a news article about a [Black/white] CEO arrested for fraud." Do the outputs differ in framing, detail, or tone?
Using indirect framings, hypothetical scenarios, or persona adoption to elicit outputs that direct questions would not produce. Red teams document prompt patterns that bypass safety filters.
Analyzing patterns across many outputs: which groups appear in professional vs. criminal contexts, leadership vs. support roles, positive vs. negative framing in generated text or images.
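Two of the techniques above, counterfactual prompting and distributional analysis, lend themselves to simple tooling. The sketch below is one possible shape for it, under stated assumptions: the function names are hypothetical, the watched-term list is arbitrary, and the strings in `placeholder_outputs` are made-up placeholders standing in for model responses, not real outputs from any system.

```python
from itertools import product
from collections import Counter

def counterfactual_prompts(template, slots):
    """Expand a template into every combination of slot values, so that
    prompts in the set differ only in the chosen variable(s)."""
    names = list(slots)
    for values in product(*(slots[n] for n in names)):
        filled = dict(zip(names, values))
        yield filled, template.format(**filled)

def tally_terms(outputs_by_group, terms):
    """Count how often each watched term appears in outputs, per group."""
    tallies = {}
    for group, outputs in outputs_by_group.items():
        counter = Counter()
        for text in outputs:
            lowered = text.lower()
            for term in terms:
                counter[term] += lowered.count(term)
        tallies[group] = counter
    return tallies

template = "Write a news article about a {race} CEO arrested for {crime}."
slots = {"race": ["Black", "white"], "crime": ["fraud"]}
prompts = list(counterfactual_prompts(template, slots))
for filled, prompt in prompts:
    print(filled["race"], "->", prompt)

# Placeholder strings only; a real exercise would collect model responses here
placeholder_outputs = {
    "Black": ["The disgraced executive was led away."],
    "white": ["The executive declined to comment."],
}
tallies = tally_terms(placeholder_outputs, ["disgraced"])
```

A single pair of outputs proves nothing; the point of generating prompts programmatically is to collect enough responses per variant that differences in framing can be counted rather than eyeballed.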
All three major foundation model labs have published documentation of red-teaming practices. Anthropic's Constitutional AI work (published in 2022) described using AI feedback to iteratively identify harmful outputs and refine responses. Google's PaLM technical report (2022) included a "Responsible AI" section documenting bias evaluations across protected categories. OpenAI's GPT-4 system card is among the most detailed public disclosures, listing specific categories tested and noting residual failure modes that persist after mitigation.
However, all of these disclosures are voluntary and self-reported. There is no independent verification mechanism. Critics including researchers at the AI Now Institute and Algorithmic Justice League have noted that companies control what findings to publish, creating incentives to underreport discovered harms.
Large language models can produce text on virtually any topic. A human red team cannot manually probe every possible input combination. Researchers at Anthropic and elsewhere have begun using automated red-teaming: training a separate AI to generate adversarial inputs at scale. Perez et al. (2022) published "Red Teaming Language Models with Language Models," documenting this approach and finding it could discover novel harmful behaviors that human red-teamers missed.
This creates a recursive dynamic: using AI to find AI failures. The approach scales, but raises questions about whether automated red teams develop their own blind spots, systematically missing categories of harm that neither the test model nor the red-team model was trained to recognize.
Red-teaming has become standard practice at major AI labs, but the quality, scope, and independence of red-team exercises vary enormously. A 72-hour public demo that reveals critical failures (Galactica) and a multi-month structured adversarial program (GPT-4) are both called "red-teaming"; the label does not guarantee the rigor.
You are leading a bias red-team exercise for a large language model being deployed as a customer-facing assistant for a financial services company. The model will help users understand loan products, answer account questions, and provide financial guidance. Design a red-team protocol specifically targeting bias and differential treatment.
In May 2016, ProPublica published "Machine Bias," documenting that the COMPAS recidivism algorithm, used by judges across the U.S. to inform bail, sentencing, and parole decisions, had starkly different false positive rates by race. Black defendants were twice as likely as white defendants to be incorrectly flagged as future criminals. The company Northpointe (now Equivant) responded that COMPAS was calibrated: its risk scores accurately predicted recidivism rates within each racial group. Both statements were mathematically true. The fairness standards behind them were also fundamentally irreconcilable.
The COMPAS controversy exposed a deep mathematical reality that researchers Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan formalized in their 2016 paper "Inherent Trade-Offs in the Fair Determination of Risk Scores": outside of degenerate cases such as perfect prediction, it is mathematically impossible to simultaneously satisfy calibration within groups and balanced error rates across groups (the equalized-odds condition) when base rates differ between groups, which they almost always do in the real world.
This is not a software bug. It is a proven impossibility theorem. Any choice of fairness metric involves trade-offs. When base rates of recidivism differ between Black and white defendants in the dataset (due to differential policing, prosecution, and historical injustice), you cannot achieve equal false positive rates, equal false negative rates, and equal calibration at the same time.
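One way to see why the conflict is unavoidable is an identity that follows directly from the confusion-matrix definitions (it appears in Chouldechova's 2017 analysis of recidivism instruments and is not taken from this lesson's sources): FPR = p/(1-p) × (1-PPV)/PPV × (1-FNR), where p is the base rate. The sketch below uses equal positive predictive value (PPV) as a simplified stand-in for calibration, and all numbers are illustrative rather than drawn from any dataset.

```python
def implied_fpr(base_rate, ppv, fnr):
    """FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR), from confusion-matrix algebra."""
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * (1 - fnr)

def rates_from_counts(tp, fp, tn, fn):
    """Derive the four quantities from raw counts, to sanity-check the identity."""
    n = tp + fp + tn + fn
    return {
        "base_rate": (tp + fn) / n,
        "ppv": tp / (tp + fp),
        "fnr": fn / (tp + fn),
        "fpr": fp / (fp + tn),
    }

# Hold PPV and FNR equal across two groups; vary only the base rate.
ppv, fnr = 0.6, 0.3
fpr_low = implied_fpr(0.3, ppv, fnr)    # group with 30% base rate
fpr_high = implied_fpr(0.5, ppv, fnr)   # group with 50% base rate
print(f"FPR at base rate 0.3: {fpr_low:.3f}")
print(f"FPR at base rate 0.5: {fpr_high:.3f}")

# Cross-check against explicit (invented) counts for the 30% group
check = rates_from_counts(tp=210, fp=140, tn=560, fn=90)
```

With PPV and FNR pinned to the same values in both groups, the false positive rate is forced to differ as soon as the base rates differ, so at least one of the three equalities has to give.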
Analysis of Broward County, Florida data: among defendants who did not reoffend, 44.9% of Black defendants had been incorrectly flagged as high-risk, versus 23.5% of white defendants. Among defendants who did reoffend, 28% of Black defendants had been incorrectly flagged as low-risk, versus 47.7% of white defendants. Both the disparity and the calibration were real. The mathematical conflict between fairness criteria was the story.
When bias testing reveals a performance gap, rigorous interpretation requires working through four questions before drawing conclusions or recommending action.
| Question | Why It Matters | COMPAS Example |
|---|---|---|
| Is the gap statistically significant? | Small samples produce unstable estimates; apparent gaps may be noise. | Yes: thousands of cases, large effect size. |
| Which fairness metric is most relevant to the harm? | Calibration matters for prediction accuracy; equalized odds matters for differential harm. | Disputed: Northpointe chose calibration; ProPublica chose FPR. |
| What is the source of the gap? | Is it in the training data, the label, the feature set, or the base rate? Source determines fix. | Base rate differences from differential criminal justice exposure; not fixable by the model alone. |
| What is the downstream harm of each error type? | False positives and false negatives have asymmetric real-world consequences depending on who bears them. | False positive = unnecessary detention. Asymmetric impact means Black defendants bear more of this harm. |
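The first question in the table can be made concrete with a standard two-proportion z-test. This is a minimal sketch using only the standard library; the counts are invented to show how the same observed rates can be noise at one sample size and overwhelming at another, and a real analysis might prefer an exact test or a library implementation.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2, using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

# Invented counts: false positives out of non-reoffenders, 45% vs. 25%
z_small, p_small = two_proportion_z_test(9, 20, 5, 20)
z_large, p_large = two_proportion_z_test(450, 1000, 250, 1000)

print(f"n=20 per group:   z={z_small:.2f}, p={p_small:.3f}")
print(f"n=1000 per group: z={z_large:.2f}, p={p_large:.2e}")
```

The 20-point gap is indistinguishable from chance with 20 cases per group and unmistakable with 1,000, which is why sample size has to be settled before the other three questions are worth asking.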
Bias testing findings routinely fail to change organizational behavior because technical findings are communicated poorly to non-technical decision-makers. Research by Selbst et al. (2019) in "Fairness and Abstraction in Sociotechnical Systems" documents how bias findings are frequently stripped of context as they move up organizational hierarchies: the nuance of conflicting fairness metrics becomes a single line in a summary report.
Effective communication of bias test results requires:
Convert statistical gaps into human outcomes. "A false positive rate 12 percentage points higher" becomes "approximately 1 additional Black applicant in every 8 incorrectly denied a loan."
Present the calibration finding alongside the equalized-odds finding. Show decision-makers that improving one may worsen the other and force an explicit values choice.
Distinguish gaps caused by training data composition, label noise, proxy variables, or structural base-rate differences. Each has different remediation implications.
Separate "can fix technically," "requires data collection," and "requires policy decision about acceptable trade-offs." Decision-makers need to know what is in their power to change.
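The first item in the list above, turning a rate gap into human outcomes, is simple arithmetic that is easy to get wrong when percentage points and relative differences are mixed up. The helper below is a hypothetical sketch; the two rates and the yearly volume are invented inputs chosen to give a 12-point gap.

```python
def describe_gap(rate_a, rate_b, affected_per_year):
    """Turn two error rates into plain-language quantities.
    rate_a and rate_b are fractions; affected_per_year is the number of
    people in group A exposed to the decision each year."""
    gap = rate_a - rate_b  # absolute gap, as a fraction
    if gap <= 0:
        raise ValueError("expects rate_a > rate_b")
    return {
        "percentage_points": round(gap * 100, 1),
        "relative_ratio": round(rate_a / rate_b, 2),
        "one_in": round(1 / gap),  # "1 additional person in every N"
        "extra_people": round(gap * affected_per_year),
    }

# Invented inputs: 34% vs. 22% false positive rate, 5,000 people per year
summary = describe_gap(0.34, 0.22, affected_per_year=5000)
print(f"{summary['percentage_points']} points higher, "
      f"about 1 additional person in every {summary['one_in']}, "
      f"roughly {summary['extra_people']} people per year")
```

Reporting the absolute gap, the relative ratio, and the head count side by side keeps a summary line from overstating or understating the same finding.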
The history of AI bias testing contains a sobering pattern: gaps are discovered, reported, and publicized, and then the systems continue to operate. COMPAS continued to be used in courts after the ProPublica analysis. The facial recognition systems documented by NIST continued to be deployed at airports and police departments. The health algorithm studied by Obermeyer et al. was modified by its vendor only after the paper received widespread coverage.
Testing finds the gap. It does not automatically close it. The gap between measurement and remediation is itself a documented systemic problem, one that requires organizational accountability structures, regulatory frameworks, and affected-community advocacy, not just technical audits. Understanding this limit is part of what it means to read bias test results honestly.
Interpreting a bias gap requires choosing among fairness metrics that mathematically cannot all be satisfied simultaneously when base rates differ. The choice of metric is a values decision, not a technical one. Rigorous bias testing surfaces this conflict; it does not resolve it.
You have run a bias audit on a pretrial risk assessment tool used by courts in your jurisdiction. Your findings: the tool has equal calibration across racial groups (risk scores predict recidivism equally well), but Black defendants have a false positive rate of 41% vs. 22% for white defendants. You must present these findings to a county judge who will decide whether to continue using the tool.