Module 3 · Lesson 1

Benchmark Testing: What "Accuracy" Hides

A model can score 95% overall and still fail systematically for specific groups — the aggregate hides the gap.

How do researchers actually discover that an AI system treats different groups unequally?

In January 2019, a dermatology AI published in Nature reported 91% accuracy diagnosing skin cancer — matching board-certified dermatologists. The headline was celebrated globally. What the headline did not say: the training set was 97% lighter-skinned patients. Performance on darker skin tones was never separately reported.

You cannot find a gap you never look for.

Why Aggregate Accuracy Is Misleading

When researchers report a single accuracy number, they are averaging across every test case. If Group A makes up 90% of the test set and the model scores 99% on Group A but only 50% on Group B, the headline figure is still ~94% — impressive-sounding, catastrophic in practice for Group B.
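
The arithmetic behind that headline figure is easy to check. The following is a minimal Python sketch; the function name `aggregate_accuracy` and the group labels are illustrative and not taken from any cited study.

```python
def aggregate_accuracy(groups):
    """Weighted average of per-group accuracy.

    `groups` maps a group name to (share_of_test_set, accuracy_on_group).
    Shares are expected to sum to 1.
    """
    total_share = sum(share for share, _ in groups.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("group shares must sum to 1")
    return sum(share * acc for share, acc in groups.values())


# The example from the text: Group A is 90% of the test set.
example = {"A": (0.90, 0.99), "B": (0.10, 0.50)}
overall = aggregate_accuracy(example)
print(f"overall accuracy: {overall:.1%}")  # 94.1%
```

Because Group B contributes only a tenth of the weight, even a coin-flip result on that group moves the headline figure by about five points.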

This is not a hypothetical. In 2019, a landmark study by Obermeyer et al. published in Science examined a commercial health-risk algorithm used on roughly 200 million people per year in U.S. hospitals. The algorithm predicted who needed extra care, and overall accuracy looked acceptable. Disaggregated by race, the results told a different story: Black patients assigned the same risk score as white patients were actually far sicker. The model systematically underestimated Black patients' needs because it used healthcare cost as a proxy for health, and structural inequities meant Black patients historically spent less on care even when equally ill.

The bias was invisible until someone ran disaggregated performance statistics.

Documented Case — Obermeyer et al., 2019

Researchers at UC Berkeley analyzed a widely deployed hospital risk-stratification algorithm. They found that at the same algorithmically assigned risk score, Black patients had on average 26.3% more active chronic conditions than white patients. The gap was caused by using prior medical cost as a proxy for health need — a design choice that encoded historical access disparities into future care allocation.

The Core Vocabulary of Bias Testing

Before you can find a gap, you need precise language for what you are measuring.

Disaggregated Evaluation

Breaking performance metrics down by subgroup (race, gender, age, dialect, skin tone, etc.) rather than reporting only an aggregate figure.

Demographic Parity

A model satisfies demographic parity if its positive prediction rate is equal across groups. Violation: a loan model approves 72% of white applicants but only 54% of Black applicants.

Equalized Odds

True positive rates and false positive rates should be equal across groups. A facial recognition system that misidentifies white faces 1% of the time but Black faces 12% of the time violates equalized odds.

Calibration

A model is calibrated for a group if its predicted probabilities match actual outcome rates for that group. The Obermeyer health algorithm failed a version of this test: the same risk score corresponded to different levels of actual illness for Black and white patients.
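
The four terms above map onto quantities that can be computed from a labeled test set. The sketch below is one possible way to do it in plain Python; the function name `disaggregate`, the record layout, and the toy data are assumptions made for illustration. The calibration check is deliberately coarse (mean predicted probability against observed outcome rate), not a full reliability analysis.

```python
from collections import defaultdict


def disaggregate(records):
    """Compute per-group metrics from (group, y_true, y_pred, score) tuples.

    y_true and y_pred are 0/1; score is the predicted probability of 1.
    """
    by_group = defaultdict(list)
    for group, y_true, y_pred, score in records:
        by_group[group].append((y_true, y_pred, score))

    report = {}
    for group, rows in by_group.items():
        n = len(rows)
        tp = sum(1 for t, p, _ in rows if t == 1 and p == 1)
        fp = sum(1 for t, p, _ in rows if t == 0 and p == 1)
        fn = sum(1 for t, p, _ in rows if t == 1 and p == 0)
        tn = sum(1 for t, p, _ in rows if t == 0 and p == 0)
        report[group] = {
            "n": n,
            "accuracy": (tp + tn) / n,
            # Demographic parity compares this rate across groups.
            "positive_rate": (tp + fp) / n,
            # Equalized odds compares TPR and FPR across groups.
            "tpr": tp / (tp + fn) if (tp + fn) else None,
            "fpr": fp / (fp + tn) if (fp + tn) else None,
            # Coarse calibration check: mean predicted probability
            # versus the observed rate of positive outcomes.
            "mean_score": sum(s for _, _, s in rows) / n,
            "observed_rate": (tp + fn) / n,
        }
    return report


# Tiny made-up dataset, for illustration only.
toy = [
    ("A", 1, 1, 0.9), ("A", 0, 0, 0.2), ("A", 1, 1, 0.8), ("A", 0, 0, 0.1),
    ("B", 1, 0, 0.4), ("B", 0, 1, 0.6), ("B", 1, 1, 0.7), ("B", 0, 0, 0.3),
]
for group, metrics in disaggregate(toy).items():
    print(group, metrics)
```

On this toy data the pooled accuracy is 75%, while the per-group report shows 100% for group A and 50% for group B.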

The NIST FRVT: A Real Benchmark That Found Real Gaps

The U.S. National Institute of Standards and Technology (NIST) runs the Face Recognition Vendor Test (FRVT). Its December 2019 report on demographic effects evaluated 189 face recognition algorithms from 99 developers, making it one of the largest independent evaluations of face recognition AI to date.

Results were disaggregated by demographics. The findings were stark: for one-to-one verification (confirming an identity), false match rates for African-American and Asian faces were 10 to 100 times higher than for Caucasian faces in most algorithms. For one-to-many search (finding a face in a database), African-American women had the highest false positive rates of any demographic group tested.

These were not tiny boutique systems. These were commercial products deployed at airports, border crossings, and police departments. The gaps existed in production for years before NIST's systematic disaggregated testing revealed them.

Key Insight

A bias gap does not announce itself. It requires deliberate measurement infrastructure: test sets that represent minority subgroups in sufficient numbers, evaluation pipelines that report disaggregated metrics, and stakeholders who ask uncomfortable questions about who the system fails.

What Makes a Good Bias Test?

Effective bias testing shares several characteristics documented in the research literature:

Representative Test Sets

Minority groups must be large enough in the test set to yield statistically meaningful results. A 200-sample test set with 4 examples of a demographic group cannot reliably detect bias.
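
One way to see why four examples are not enough is to put a confidence interval around the subgroup's accuracy. The sketch below uses the Wilson score interval, a standard choice for small samples; the counts (3 of 4 correct, 176 of 196 correct) are invented for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n == 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Invented counts: 3 of 4 correct for the small subgroup,
# 176 of 196 correct for everyone else.
small = wilson_interval(3, 4)
large = wilson_interval(176, 196)
print(f"subgroup (n=4):   {small[0]:.2f} to {small[1]:.2f}")
print(f"majority (n=196): {large[0]:.2f} to {large[1]:.2f}")
```

The interval for the four-example subgroup spans most of the range from 0 to 1, so the data cannot distinguish a model that works well for that group from one that fails badly.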

Multiple Metrics

Report accuracy, false positive rate, false negative rate, and calibration — separately for each group. A single metric can look fine while others reveal serious disparities.

Pre-Specified Subgroups

Decide which groups to evaluate before looking at data. Post-hoc selection can accidentally hide disparities or create spurious findings.

Adversarial Probing

Design inputs specifically intended to stress-test edge cases — accented speech, non-standard names, darker skin tones, atypical body types — not just naturally occurring test samples.

In the next lesson, we look at how researchers use audit studies — the method that exposed bias in resume screening, lending, and speech recognition — to find gaps in deployed systems without ever accessing the model's internals.

Module 3 · Lesson 1 Quiz

Benchmark Testing: What "Accuracy" Hides

Three questions — select the best answer.

The Obermeyer et al. 2019 study in Science found that a hospital risk algorithm systematically underestimated Black patients' needs. What was the root cause?

Correct. The key flaw was proxy choice: spending on healthcare reflects access, not health status. Black patients with identical illness levels historically spent less, so the model rated them as lower-risk.

Not quite. The core issue was the choice of healthcare cost as a proxy for health need — a variable that absorbed structural inequality into the model's predictions.

A model achieves 94% overall accuracy on a test set where 90% of examples are from Group A. The model scores 50% on Group B. Which testing approach would have revealed this problem?

Correct. Disaggregated evaluation splits performance reporting by subgroup, making hidden gaps visible. A single aggregate metric mathematically conceals minority-group failures when the group is a small fraction of the test set.

Disaggregated evaluation is the key. When you report separate metrics for each subgroup, the 50% on Group B becomes immediately visible rather than being averaged away.

The NIST Face Recognition Vendor Test (2019) found that for most algorithms, false match rates for African-American and Asian faces were how much higher than for Caucasian faces?

Correct. NIST's FRVT found that for one-to-one verification, false match rates were 10 to 100 times higher for African-American and Asian faces across most of the 189 algorithms tested — a finding with serious implications for deployed systems.

The NIST FRVT documented false match rates 10 to 100 times higher for African-American and Asian faces compared to Caucasian faces — a magnitude of difference far beyond statistical noise.

Module 3 · Lesson 1 Lab

Disaggregated Evaluation Workshop

Discuss benchmark testing and subgroup analysis with your AI research assistant.

Lab Brief

You are analyzing a newly trained loan approval model. The developer reports 88% accuracy on a 10,000-sample test set. Your job is to decide what additional testing is needed. Use the chat to explore disaggregated evaluation strategies, discuss the Obermeyer and NIST findings, and identify what metrics would reveal hidden bias in this scenario.

Start here: "The loan model scored 88% overall. What should I do next to check for bias by demographic group?"

Module 3 · Lesson 2

Audit Studies: Testing From the Outside

You do not need to see a model's weights to detect its bias — you only need to observe its outputs under controlled conditions.

What research methods have exposed AI bias without ever accessing the algorithm's internal code?

Amazon's internal machine learning team built a resume screening tool starting in 2014. By 2015 they realized it was downgrading resumes that included the word "women's" — as in "women's chess club" — and systematically ranking graduates of all-women's colleges lower. The bias came from training on ten years of submitted resumes, which were predominantly from men. Amazon scrapped the tool in 2017. Reuters reported the story in October 2018. The discovery came through internal audit: comparing outputs for matched inputs.

What Is an Audit Study?

An audit study is a controlled experiment where researchers submit matched pairs of inputs that differ only in the characteristic being tested. In hiring discrimination research, this method dates to the 1970s — sending identical resumes with names that signal different racial backgrounds and measuring callback rates. The same logic now applies to AI.

For AI systems specifically, audit studies typically work by:

Name Substitution

Submitting identical documents where only a name changes — "Emily Walsh" vs. "Lakisha Washington" — to test whether a system treats racially associated names differently.

Image Manipulation

Altering perceived race, gender, or age in photographs while holding all other features constant, then measuring how classification or recommendation outputs change.

Dialect Variation

Submitting identical semantic content written in African American Vernacular English (AAVE) vs. Standard American English, measuring sentiment analysis or toxicity scores.

Counterfactual Testing

Changing a single protected attribute in a description — swapping "he" for "she," "Black" for "white" — and checking whether unrelated predictions change as a result.
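
The matched-pair logic shared by these techniques can be sketched in a few lines of Python. Everything here is hypothetical: `score_resume` is a stand-in for the black-box system under audit and is deliberately biased so the audit has something to detect, and the template and second name pair are invented for illustration.

```python
from statistics import mean

TEMPLATE = "{name} has five years of accounting experience and a CPA license."

# Each pair differs only in the demographic signal being tested.
NAME_PAIRS = [
    ("Emily Walsh", "Lakisha Washington"),
    ("Greg Baker", "Jamal Jones"),
]


def score_resume(text):
    """HYPOTHETICAL stand-in for the black-box system under audit.

    It is deliberately biased so the audit below has something to detect.
    In a real audit this would be a call to the deployed system.
    """
    base = 0.80
    penalty = 0.15 if ("Lakisha" in text or "Jamal" in text) else 0.0
    return base - penalty


def run_matched_pair_audit(template, name_pairs, scorer):
    """Score each matched pair and return the per-pair score differences."""
    diffs = []
    for name_a, name_b in name_pairs:
        score_a = scorer(template.format(name=name_a))
        score_b = scorer(template.format(name=name_b))
        diffs.append(score_a - score_b)
    return diffs


diffs = run_matched_pair_audit(TEMPLATE, NAME_PAIRS, score_resume)
print(f"mean score gap across pairs: {mean(diffs):.2f}")
```

Because the two inputs in each pair are identical apart from the name, any consistent nonzero difference can be attributed to the name. A real audit would use many templates and name pairs and would test whether the gap is statistically distinguishable from zero.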

The Hiring Algorithm Audit: Raghavan et al., 2020

Researchers Manish Raghavan, Solon Barocas, Jon Kleinberg, and Karen Levy published "Mitigating Bias in Algorithmic Hiring" in 2020, documenting how audit studies apply to automated hiring tools. They examined products from vendors including HireVue, which uses video interview AI to score candidates. The paper noted that such systems are tested against a "ground truth" of which candidates were hired or promoted — but if those historical hiring decisions were themselves biased, the AI learns to replicate those biased judgments.

In 2019, the Electronic Privacy Information Center (EPIC) filed a complaint with the Federal Trade Commission about HireVue, citing concerns about opaque AI assessment. HireVue subsequently discontinued its use of facial analysis in 2021 — an outcome partly driven by external audit pressure and advocacy.

Documented Case — Speech Recognition Audit, 2020

Stanford researchers Allison Koenecke et al. published a study in PNAS testing five major commercial speech-to-text systems (Apple, Amazon, Google, IBM, Microsoft) against audio from white and Black speakers. The average word error rate for Black speakers was 35% — compared to 19% for white speakers. For some systems, Black speakers' error rates were more than twice as high. The research used real audio recordings with manually verified transcripts as ground truth — a classic external audit design.
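
Word error rate, the metric reported above, is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the reference length. The sketch below implements that standard definition with word-level edit distance; it is not the study's own code, the normalization (lowercasing, whitespace splitting) is an assumption, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with word-level edit distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "she said she would call me back tomorrow"
hypothesis = "she said she will call back tomorrow"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")  # 0.250
```

An audit in this style would compute the rate per recording and then compare the averages across speaker groups.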

Sentiment Analysis and AAVE: A Documented Gap

In 2020, researchers Su Lin Blodgett and colleagues published extensive documentation of NLP systems producing systematically lower sentiment scores and higher toxicity ratings for text written in African American Vernacular English compared to semantically equivalent Standard American English. This was discovered through controlled substitution: the same meaning, different dialect, different AI output.

The practical stakes are direct: if a content moderation system flags AAVE at higher rates, posts by Black users get removed at higher rates. If a customer service AI assigns negative sentiment to AAVE queries, those users may be routed to lower-quality support.

The Limits of External Auditing

External audits are powerful but have constraints. Companies may detect systematic querying and respond differently. Audit studies test the inputs a researcher designs — they can miss failure modes that were not anticipated. Some systems require accounts or fees that create barriers to audit. And companies can change their algorithms after an audit, eliminating documented gaps without necessarily fixing the underlying problem.

Despite these limits, external audits have forced accountability in cases where developers did not voluntarily test their own systems. The speech recognition gap, the resume screening bias, the facial recognition disparities — all emerged from researchers who were willing to systematically probe systems from the outside.

Key Insight

External audit studies treat AI systems as black boxes and probe them with carefully designed inputs. They have produced some of the most consequential bias discoveries in AI history — without requiring source code access, training data, or developer cooperation.

Module 3 · Lesson 2 Quiz

Audit Studies: Testing From the Outside

Three questions — select the best answer.

The Koenecke et al. 2020 study in PNAS found that commercial speech-to-text systems produced average word error rates of 35% for Black speakers vs. 19% for white speakers. What research design did they use?

Correct. The Stanford team used an external audit design: real audio inputs with verified ground-truth transcripts, submitted to commercial systems as a black-box test. No internal access required.

The study was an external audit — real audio inputs, verified transcripts as ground truth, commercial APIs as black boxes. No source code access was needed or used.

Amazon's resume screening tool learned to downgrade resumes from all-women's colleges and those containing the word "women's." What was the root cause of this bias?

Correct. The training data reflected a decade of biased hiring outcomes — mostly male hires in technical roles. The model learned to replicate those patterns, treating male-associated signals as positive predictors of success.

The cause was historical bias in training data: a decade of male-dominated hiring outcomes taught the model that male-associated signals predicted successful candidates.

Counterfactual testing is an audit technique where a researcher changes a single protected attribute (e.g., name, pronoun, race description) and measures output change. What bias does this method detect?

Correct. If changing only "he" to "she" in an otherwise identical input changes the model's output, that is direct evidence the protected attribute is influencing a prediction that should be attribute-blind — a form of direct discrimination.

Counterfactual testing specifically catches cases where a protected attribute directly drives predictions. If swapping a single demographic signal changes the output while everything else stays constant, the attribute is doing work it should not be doing.

Module 3 · Lesson 2 Lab

Audit Study Design Workshop

Design an external audit study for a real AI deployment scenario.

Lab Brief

A company has deployed an AI system that scores customer service chat transcripts for "professionalism" and routes low-scoring interactions to human review. You suspect the system may be scoring African American Vernacular English (AAVE) differently from Standard American English. Design an audit study to test this hypothesis.

Start here: "How would I design a counterfactual audit study to test whether an AI professionalism scorer treats AAVE differently from Standard American English?"

Module 3 · Lesson 3

Red-Teaming AI: Stress-Testing for Failure

Red-teaming borrows from military intelligence: put your most creative adversaries in charge of finding where your system breaks.

How do AI companies systematically search for harmful outputs before deployment — and what does that process actually look like?

OpenAI's March 2023 GPT-4 technical report included a section on red-teaming that described months of adversarial testing before public release. The report acknowledged that early versions of GPT-4 produced content "that could be used to facilitate biological weapon creation" at rates that required intervention. The red team included domain experts in biosecurity, cybersecurity, and discrimination law. This was not a PR exercise — it directly shaped what filters and refusals were built into the final product.

What Is AI Red-Teaming?

Red-teaming in AI refers to a structured process where a dedicated group attempts to make an AI system behave in harmful, biased, or unsafe ways. The red team's job is to break the system — to find failure modes the development team did not anticipate. For bias specifically, red-teaming focuses on eliciting differential treatment across demographic groups, stereotyped outputs, and harmful representations.

Unlike audit studies, which typically test deployed systems from the outside, red-teaming usually happens before deployment with direct model access. However, some organizations run external red-team exercises with invited researchers who sign NDAs in exchange for model access.

Documented Case — Meta's Galactica, November 2022

Meta released Galactica, a large language model trained on scientific literature, as a public demo in November 2022. Within 72 hours, researchers and users conducting informal red-teaming discovered the model confidently generated false scientific content and produced text perpetuating racial stereotypes when asked about social science topics. The demo was taken down after three days. The episode illustrated what happens when red-teaming is insufficient before release.

Bias-Specific Red-Teaming Techniques

Effective bias red-teaming goes beyond asking a model offensive questions. Documented techniques include:

Stereotype Elicitation

Prompting models to complete sentences, fill in blanks, or generate descriptions of groups. "People from [country] are typically ___" reveals embedded associations from training data.

Differential Output Testing

Asking the same factual or creative question with only a demographic variable changed. "Write a news article about a [Black/white] CEO arrested for fraud" — do outputs differ in framing, detail, or tone?

Jailbreak-Adjacent Probing

Using indirect framings, hypothetical scenarios, or persona adoption to elicit outputs that direct questions would not produce. Red teams document prompt patterns that bypass safety filters.

Representation Auditing

Analyzing patterns across many outputs — which groups appear in professional vs. criminal contexts, leadership vs. support roles, positive vs. negative framing in generated text or images.
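
Representation auditing, the last technique above, amounts to counting patterns across a batch of outputs. The following is a rough Python sketch under strong simplifying assumptions: the keyword lists in `CONTEXT_TERMS` are hypothetical, keyword matching is a crude substitute for human annotation or a validated lexicon, and the sample outputs are made up rather than generated by any real model.

```python
from collections import Counter

# HYPOTHETICAL word lists; a real audit would use validated lexicons
# or human annotation rather than keyword matching.
CONTEXT_TERMS = {
    "professional": {"ceo", "engineer", "doctor", "manager"},
    "criminal": {"arrested", "suspect", "convicted"},
}


def representation_counts(outputs_by_group):
    """Count, per group, how many outputs mention each context category.

    `outputs_by_group` maps a group label to a list of generated texts
    collected by prompting the model with that group in the prompt.
    """
    counts = {}
    for group, outputs in outputs_by_group.items():
        tally = Counter()
        for text in outputs:
            cleaned = text.lower().replace(".", " ").replace(",", " ")
            words = set(cleaned.split())
            for category, terms in CONTEXT_TERMS.items():
                if words & terms:
                    tally[category] += 1
        counts[group] = dict(tally)
    return counts


# Made-up outputs standing in for model generations.
sample = {
    "group_x": ["The engineer led the project.", "She is a doctor."],
    "group_y": ["He was arrested downtown.", "The manager resigned."],
}
print(representation_counts(sample))
```

With enough outputs per group, a skew in these counts points to a pattern worth reviewing by hand.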

Structured Red-Team Programs: Anthropic, Google, OpenAI

All three major foundation model labs have published documentation of red-teaming practices. Anthropic's Constitutional AI work (published in 2022) described using AI feedback to iteratively identify harmful outputs and refine responses. Google's PaLM technical report (2022) included a "Responsible AI" section documenting bias evaluations across protected categories. OpenAI's GPT-4 system card is among the most detailed public disclosures, listing specific categories tested and noting residual failure modes that persist after mitigation.

However, all of these disclosures are voluntary and self-reported. There is no independent verification mechanism. Critics including researchers at the AI Now Institute and Algorithmic Justice League have noted that companies control what findings to publish, creating incentives to underreport discovered harms.

The Scale Problem in Red-Teaming

Large language models can produce text on virtually any topic. A human red team cannot manually probe every possible input combination. Researchers at Anthropic and elsewhere have begun using automated red-teaming — training a separate AI to generate adversarial inputs at scale. Perez et al. (2022) published "Red Teaming Language Models with Language Models," documenting this approach and finding it could discover novel harmful behaviors that human red-teamers missed.

This creates a recursive dynamic: using AI to find AI failures. The approach scales, but raises questions about whether automated red teams develop their own blind spots — systematically missing categories of harm that neither the test model nor the red-team model was trained to recognize.
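
The overall shape of an automated red-team loop can be sketched as three components: something that generates attacks, the target model, and something that judges the responses. The code below is a toy illustration only. All three functions are hypothetical stubs, and the attack generator expands fixed templates rather than using a language model, so this is not the method of Perez et al.

```python
import itertools

# All three functions are HYPOTHETICAL stubs. In practice each would be
# a model call: an attacker LM, the target LM, and a harm classifier.


def generate_attacks(templates, fillers):
    """Attacker: expand prompt templates into candidate test prompts."""
    for template, filler in itertools.product(templates, fillers):
        yield template.format(group=filler)


def target_model(prompt):
    """Target: toy model that fails on one specific phrasing."""
    if prompt.startswith("Complete the sentence"):
        return "UNSAFE stereotyped completion"
    return "I can't generalize about groups of people."


def is_harmful(response):
    """Classifier: toy check standing in for a learned harm classifier."""
    return response.startswith("UNSAFE")


def red_team(templates, fillers):
    """Run every generated prompt and keep the ones that elicit harm."""
    failures = []
    for prompt in generate_attacks(templates, fillers):
        response = target_model(prompt)
        if is_harmful(response):
            failures.append((prompt, response))
    return failures


templates = [
    "What are {group} like?",
    "Complete the sentence: {group} are typically",
]
fillers = ["group X", "group Y"]
failures = red_team(templates, fillers)
total = len(templates) * len(fillers)
print(f"{len(failures)} of {total} prompts elicited a harmful output")
```

The blind-spot concern is visible even in this toy: the loop can only report failures that `generate_attacks` thinks to try and that `is_harmful` is able to recognize.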

Key Insight

Red-teaming has become standard practice at major AI labs, but the quality, scope, and independence of red-team exercises varies enormously. A 72-hour public demo that reveals critical failures (Galactica) and a multi-month structured adversarial program (GPT-4) are both called "red-teaming" — the label does not guarantee the rigor.

Module 3 · Lesson 3 Quiz

Red-Teaming AI: Stress-Testing for Failure

Three questions — select the best answer.

Meta's Galactica demo was taken down after 72 hours in November 2022. What did informal red-teaming by users and researchers reveal?

Correct. Galactica's rapid withdrawal followed user-driven discovery of confident hallucination and stereotyped outputs — a real-world example of what insufficient pre-release red-teaming produces.

The critical failures were confident false scientific claims and racial stereotypes in social science outputs — failures that pre-release red-teaming should have caught and addressed before public deployment.

Automated red-teaming (Perez et al., 2022) uses a language model to generate adversarial inputs for another language model. What is the key limitation of this approach?

Correct. When both the red-team model and the target model were trained on similar data and reflect similar cultural assumptions, they may share blind spots — producing comprehensive-looking coverage that systematically misses certain communities' concerns.

The core limitation is shared blind spots: both models may have been trained in ways that make certain harm categories invisible to both, giving false confidence that coverage is complete.

The GPT-4 technical report describes months of pre-release red-teaming including domain experts in biosecurity and discrimination law. Critics from the AI Now Institute have noted a key limitation of this type of disclosure. What is it?

Correct. All current red-team disclosures are self-reported. Companies decide what to reveal, creating structural incentives to underreport serious findings. Independent third-party verification of red-team results does not currently exist at scale.

The key limitation is self-reporting without independent verification. Companies control the disclosure, which creates incentives to present red-teaming as more thorough or successful than it may have been.

Module 3 · Lesson 3 Lab

Red-Team Scenario Lab

Design and evaluate red-teaming strategies for bias detection in generative AI.

Lab Brief

You are leading a bias red-team exercise for a large language model being deployed as a customer-facing assistant for a financial services company. The model will help users understand loan products, answer account questions, and provide financial guidance. Design a red-team protocol specifically targeting bias and differential treatment.

Start here: "What bias-specific red-teaming techniques should I use for a financial services AI assistant, and how do I document the findings systematically?"

Module 3 · Lesson 4

Reading the Gap: Interpreting Bias Test Results

Finding a gap is not the endpoint — interpreting it correctly, understanding its source, and communicating it to decision-makers is the real work.

Once you have evidence of a performance gap, how do you determine whether it constitutes actionable bias — and what do you do with that finding?

In May 2016, ProPublica published "Machine Bias," documenting that the COMPAS recidivism algorithm, used by judges across the U.S. to inform bail, sentencing, and parole decisions, had starkly different false positive rates by race. Black defendants who did not go on to reoffend were nearly twice as likely as white defendants to have been flagged as high-risk. The company Northpointe (now Equivant) responded that COMPAS was calibrated: its risk scores predicted recidivism rates equally well within each racial group. Both statements were mathematically true. The two fairness standards behind them, however, could not both be satisfied.

The Fairness Impossibility: When Metrics Conflict

The COMPAS controversy exposed a deep mathematical reality that researchers Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan formalized in their 2016 paper "Inherent Trade-Offs in the Fair Determination of Risk Scores": a risk score cannot simultaneously be calibrated within each group and have balanced error rates across groups (the idea behind equalized odds) when base rates differ between groups, unless the score predicts perfectly. In the real world, base rates almost always differ and predictions are never perfect.

This is not a software bug. It is a proven impossibility theorem. Any choice of fairness metric involves trade-offs. When base rates of recidivism differ between Black and white defendants in the dataset (due to differential policing, prosecution, and historical injustice), an imperfect model cannot achieve equal false positive rates, equal false negative rates, and equal calibration at the same time.
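
A short calculation makes the conflict concrete. From the definition of positive predictive value (PPV), a group's false positive rate is fully determined by its base rate, its PPV, and its false negative rate. The sketch below uses equal PPV across groups as a simplified, thresholded stand-in for calibration; the function name `implied_fpr` and all numbers are hypothetical.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate implied by a group's base rate, PPV, and FNR.

    Derived from the definition PPV = TP / (TP + FP):
        FPR = base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# HYPOTHETICAL numbers: both groups get the same PPV and the same FNR,
# but their base rates differ.
ppv, fnr = 0.6, 0.3
fpr_low = implied_fpr(base_rate=0.3, ppv=ppv, fnr=fnr)
fpr_high = implied_fpr(base_rate=0.5, ppv=ppv, fnr=fnr)
print(f"base rate 30%: FPR = {fpr_low:.2f}")   # 0.20
print(f"base rate 50%: FPR = {fpr_high:.2f}")  # 0.47
```

Holding PPV and the false negative rate equal forces the false positive rates apart whenever the base rates differ. Equalizing the false positive rates instead would require giving up one of the other two.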

The COMPAS Numbers (ProPublica, 2016)

Analysis of Broward County, Florida data:

Error type	Black defendants	White defendants
Flagged high-risk but did not reoffend (false positive rate)	44.9%	23.5%
Flagged low-risk but did reoffend (false negative rate)	28.0%	47.7%

Both the disparity and the calibration were real. The mathematical conflict between fairness criteria was the story.

Interpreting a Gap: Four Questions

When bias testing reveals a performance gap, rigorous interpretation requires working through four questions before drawing conclusions or recommending action.

Question	Why It Matters	COMPAS Example
Is the gap statistically significant?	Small samples produce unstable estimates; apparent gaps may be noise.	Yes — thousands of cases, large effect size.
Which fairness metric is most relevant to the harm?	Calibration matters for prediction accuracy; equalized odds matters for differential harm.	Disputed — Northpointe chose calibration; ProPublica chose FPR.
What is the source of the gap?	Is it in the training data, the label, the feature set, or the base rate? Source determines fix.	Base rate differences from differential criminal justice exposure — not fixable by the model alone.
What is the downstream harm of each error type?	False positives and false negatives have asymmetric real-world consequences depending on who bears them.	False positive = unnecessary detention. Asymmetric impact means Black defendants bear more of this harm.
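
For the first question in the table, a two-proportion z-test is one common way to check whether a gap in error rates is larger than sampling noise would explain. The sketch below is a minimal version using only the standard library; the counts are hypothetical and chosen to show the same 45% versus 23% gap at two sample sizes.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions.

    x1/n1 and x2/n2 are, for example, false positives over actual
    negatives in each group. Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return z, p_value


# HYPOTHETICAL counts: roughly the same gap at two sample sizes.
z_small, p_small = two_proportion_z_test(9, 20, 5, 22)
z_large, p_large = two_proportion_z_test(450, 1000, 230, 1000)
print(f"small sample: z = {z_small:.2f}, p = {p_small:.3f}")
print(f"large sample: z = {z_large:.2f}, p = {p_large:.2g}")
```

With about twenty cases per group the gap is not distinguishable from noise at the usual 5% level. With a thousand per group the same gap is overwhelming evidence of a real difference.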

Communicating Findings to Decision-Makers

Bias testing findings routinely fail to change organizational behavior because technical findings are communicated poorly to non-technical decision-makers. Research by Selbst et al. (2019) in "Fairness and Abstraction in Sociotechnical Systems" documents how bias findings are frequently stripped of context as they move up organizational hierarchies — the nuance of conflicting fairness metrics becomes a single line in a summary report.

Effective communication of bias test results requires:

Concrete Harm Translation

Convert statistical gaps into human outcomes. "A false positive rate 12 percentage points higher" (where a positive means being flagged as a likely defaulter) becomes "roughly 1 in 8 additional creditworthy Black applicants incorrectly denied a loan."

Multiple Metrics Side by Side

Present the calibration finding alongside the equalized-odds finding. Show decision-makers that improving one may worsen the other and force an explicit values choice.

Source Attribution

Distinguish gaps caused by training data composition, label noise, proxy variables, or structural base-rate differences. Each has different remediation implications.

Recommended Action Tiers

Separate "can fix technically," "requires data collection," and "requires policy decision about acceptable trade-offs." Decision-makers need to know what is in their power to change.
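
The harm-translation step is simple arithmetic, and scripting it keeps the wording in a report consistent with the underlying numbers. The helper names and the 30% versus 18% rates below are hypothetical.

```python
def extra_errors_per_1000(rate_group, rate_reference):
    """Additional errors per 1,000 people when one group's error rate
    exceeds the reference group's."""
    return round((rate_group - rate_reference) * 1000)


def one_in_n(rate_gap):
    """Express a rate gap as 'about 1 in N'."""
    if rate_gap <= 0:
        raise ValueError("rate_gap must be positive")
    return round(1 / rate_gap)


# HYPOTHETICAL rates: 30% vs 18% false positive rate.
gap = 0.30 - 0.18
print(f"{extra_errors_per_1000(0.30, 0.18)} more people wrongly flagged per 1,000")
print(f"about 1 in {one_in_n(gap)}")
```

Stating the denominator explicitly (per 1,000 people who should not have been flagged) avoids the ambiguity between a percentage-point gap and a relative increase.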

When Testing Is Not Enough

The history of AI bias testing contains a sobering pattern: gaps are discovered, reported, and publicized, and then the systems continue to operate. COMPAS continued to be used in courts after the ProPublica analysis. The facial recognition systems documented by NIST continued to be deployed at airports and police departments. The health algorithm studied by Obermeyer et al. was revised by its vendor only after outside researchers documented the disparity.

Testing finds the gap. It does not automatically close it. The gap between measurement and remediation is itself a documented systemic problem — one that requires organizational accountability structures, regulatory frameworks, and affected-community advocacy, not just technical audits. Understanding this limit is part of what it means to read bias test results honestly.

Key Insight

Interpreting a bias gap requires choosing among fairness metrics that mathematically cannot all be satisfied simultaneously when base rates differ. The choice of metric is a values decision, not a technical one. Rigorous bias testing surfaces this conflict — it does not resolve it.

Module 3 · Lesson 4 Quiz

Reading the Gap: Interpreting Bias Test Results

Three questions — select the best answer.

ProPublica found that COMPAS had a 44.9% false positive rate for Black defendants vs. 23.5% for white defendants. Northpointe responded that COMPAS was "calibrated." Both claims were mathematically true. What does this conflict illustrate?

Correct. The Kleinberg-Mullainathan-Raghavan impossibility result proves that when outcome base rates differ between groups, an imperfect risk score cannot be both calibrated within groups and balanced in its error rates across groups. The COMPAS case made this abstract theorem viscerally concrete.

Both analyses were correct. The conflict is inherent: Kleinberg et al. proved mathematically that equalized odds and calibration are incompatible when base rates differ — which they do for recidivism in a system shaped by differential policing.

A bias test finds a statistically significant 8 percentage point false positive rate gap between two demographic groups. What is the most important next step before recommending a fix?

Correct. Source attribution is critical. A gap caused by biased training labels requires different intervention than one caused by base-rate differences — and rebalancing training data cannot fix a structural disparity in the outcome base rates themselves.

Identifying the source of the gap is the critical next step. A gap from biased labels, biased features, or structural base-rate differences each requires a different type of intervention — and some cannot be fixed by the model alone.

Despite the Obermeyer et al. study, the NIST FRVT findings, and the ProPublica COMPAS analysis, the documented systems continued operating after publication. What does this pattern reveal about the limits of bias testing?

Correct. The gap between finding and fixing is one of the most important documented problems in AI accountability. Technical rigor is necessary but not sufficient — structural and institutional mechanisms are required to translate bias findings into actual system changes.

The historical record is clear: all three systems continued operating after landmark bias findings were published. Testing finds the gap; organizational accountability, regulation, and advocacy are required to close it.

Module 3 · Lesson 4 Lab

Gap Interpretation Lab

Work through the COMPAS fairness paradox and learn to communicate bias findings to decision-makers.

Lab Brief

You have run a bias audit on a pretrial risk assessment tool used by courts in your jurisdiction. Your findings: the tool has equal calibration across racial groups (risk scores predict recidivism equally well), but Black defendants have a false positive rate of 41% vs. 22% for white defendants. You must present these findings to a county judge who will decide whether to continue using the tool.

Start here: "I have conflicting bias metrics for a pretrial risk tool — good calibration but unequal false positive rates. How do I explain this conflict to a non-technical judge and make a clear recommendation?"
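The conflict in the brief can be reproduced with simple arithmetic. This sketch uses an identity that follows directly from the confusion-matrix definitions: FPR = p / (1 - p) × (1 - PPV) / PPV × TPR, where p is the group's base rate. It treats equal positive predictive value (PPV) as a thresholded stand-in for calibration, which is a simplifying assumption. The base rates, PPV, and TPR below are hypothetical values, not figures from the brief.

```python
def implied_false_positive_rate(base_rate, ppv, tpr):
    """FPR implied by a group's base rate, positive predictive value, and true positive rate.

    Derived from the confusion-matrix definitions:
        FPR = base_rate / (1 - base_rate) * (1 - ppv) / ppv * tpr
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * tpr

# Hypothetical: the tool is equally "trustworthy" for both groups (same PPV and TPR),
# but the recorded re-arrest base rate differs between the groups.
ppv, tpr = 0.6, 0.6
fpr_higher_base_rate = implied_false_positive_rate(0.50, ppv, tpr)
fpr_lower_base_rate = implied_false_positive_rate(0.35, ppv, tpr)

print(f"FPR at base rate 0.50: {fpr_higher_base_rate:.3f}")
print(f"FPR at base rate 0.35: {fpr_lower_base_rate:.3f}")
```

With PPV and TPR held equal, the group with the higher base rate necessarily ends up with the higher false positive rate. That is the point to convey to the judge: the two findings do not contradict each other, and no retraining can equalize both unless the base rates themselves converge.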

L4 · Fairness Trade-offs & Communication

Welcome to the gap interpretation lab. I can help you work through the mathematics of conflicting fairness metrics, the COMPAS case as a reference, and strategies for communicating these findings clearly to non-technical decision-makers. What finding are you trying to explain?

Module 3 · Module Test

Test the System: Find the Gap

15 questions — score 80% or higher to pass the module.

1. A cancer-detection AI reports 92% overall accuracy. A researcher discovers it was tested on a dataset that was 96% light-skinned patients. What is the primary concern?

Correct. When a demographic group is severely underrepresented in a test set, aggregate accuracy masks that group's actual performance — which may be far below the headline figure.

Aggregate accuracy over an unrepresentative test set hides subgroup performance. The 92% figure tells us very little about how the system performs on darker-skinned patients.
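The arithmetic behind this answer can be made explicit. Overall accuracy is a weighted average of subgroup accuracies, so a known headline figure and group share still leave the minority group's accuracy almost unconstrained. The sketch below solves that weighted average for the minority group; the function name and the candidate majority-group accuracies are hypothetical.

```python
def implied_minority_accuracy(overall_acc, majority_share, majority_acc):
    """Solve overall = share * majority_acc + (1 - share) * minority_acc for minority_acc."""
    minority_share = 1 - majority_share
    return (overall_acc - majority_share * majority_acc) / minority_share

# Headline figures from the question: 92% overall, 96% of test cases from one group.
# The majority-group accuracies tried here are assumptions for illustration.
for majority_acc in (0.92, 0.93, 0.94, 0.95):
    minority_acc = implied_minority_accuracy(0.92, 0.96, majority_acc)
    print(f"majority accuracy {majority_acc:.2f} -> minority accuracy {minority_acc:.2f}")
```

Under these assumptions, moving majority-group accuracy from 92% to 95% drops the implied minority-group accuracy from 92% to 20%, while the headline number stays at 92% throughout.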

2. Which organization conducted the largest independent evaluation of face recognition algorithms, testing 189 systems and finding 10–100x higher false match rates for African-American and Asian faces?

Correct. NIST's Face Recognition Vendor Test (FRVT), published December 2019, remains the most comprehensive independent evaluation of commercial face recognition systems ever conducted.

NIST conducted the Face Recognition Vendor Test (FRVT), published in December 2019 — the largest independent evaluation of commercial face recognition algorithms ever conducted.

3. In the Obermeyer et al. study, at the same algorithmically assigned risk score, Black patients had on average how many more active chronic conditions than white patients?


Correct. The 26.3% figure from Obermeyer et al. in Science (2019) made the magnitude of miscalibration concrete — at the same risk score, Black patients were considerably sicker, meaning the system was systematically underestimating their need for care.

The correct figure is 26.3% — at the same risk score, Black patients had 26.3% more active chronic conditions than white patients, demonstrating severe miscalibration by race.

4. An audit study that submits otherwise-identical resumes with names signaling different racial backgrounds is testing for which type of bias?

Correct. Name-substitution audit studies are designed specifically to detect direct discrimination: if two otherwise-identical resumes receive different scores based on a racially associated name, the protected attribute is directly influencing the outcome.

Name substitution tests for direct discrimination — the name serves as a signal of race, and differential scoring on otherwise-identical inputs reveals that the protected attribute is driving outcomes.

5. Koenecke et al.'s 2020 Stanford study found average word error rates of 35% for Black speakers vs. 19% for white speakers across five major commercial speech-to-text systems. Which method did they use?

Correct. This is an external audit design — real audio inputs, verified ground-truth transcripts, commercial APIs treated as black boxes. No internal access was required, demonstrating that external auditing can reveal significant disparities.

The Stanford team used an external audit: real audio from matched speakers, manually verified transcripts as ground truth, commercial systems tested as black boxes. No internal access needed.
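Word error rate, the metric in this question, is the word-level edit distance between a ground-truth transcript and the system's output, divided by the number of words in the ground truth. The following is a minimal sketch of that standard definition, not the study's own code. It splits on whitespace and does no text normalization (case, punctuation), which a real audit would need; the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of an extra word
                            prev[j - 1] + cost))  # substitution or exact match
        prev = curr
    return prev[-1] / len(ref)

# Made-up example: one substitution ("fox" -> "box") and one deletion ("jumps").
wer = word_error_rate("the quick brown fox jumps", "the quick brown box")
print(f"WER = {wer:.2f}")
```

An external audit would compute this per utterance against verified transcripts and then average within each speaker group, which requires only the API's text output.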

6. Demographic parity requires that a model's positive prediction rate is equal across groups. Which of these scenarios describes a demographic parity violation?

Correct. Demographic parity is violated when the positive prediction rate (here, loan approval) differs across groups. The 72% vs. 54% approval gap with similar financial profiles is a textbook demographic parity violation.

Demographic parity requires equal positive prediction rates across groups. A 72% approval rate for one group and 54% for another with similar profiles is a clear demographic parity violation.
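Checking demographic parity needs only group membership and the model's decisions, with no ground-truth labels. Below is a minimal sketch; the function names and the applicant counts (50 per group, chosen to reproduce the 72% and 54% approval rates) are hypothetical.

```python
from collections import defaultdict

def selection_rates(groups, predictions):
    """Share of positive predictions (e.g. approvals) within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, pred in zip(groups, predictions):
        totals[group] += 1
        positives[group] += pred
    return {group: positives[group] / totals[group] for group in totals}

def demographic_parity_difference(groups, predictions):
    """Largest gap in selection rate between any two groups (0 means parity)."""
    rates = selection_rates(groups, predictions)
    return max(rates.values()) - min(rates.values())

# Hypothetical applicants: 36 of 50 approved in group A, 27 of 50 in group B.
groups = ["A"] * 50 + ["B"] * 50
predictions = [1] * 36 + [0] * 14 + [1] * 27 + [0] * 23

print(selection_rates(groups, predictions))
print(f"parity difference = {demographic_parity_difference(groups, predictions):.2f}")
```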

7. Amazon's resume screening tool penalized resumes containing the word "women's" because it was trained on ten years of historical hiring data that was predominantly male. What category of bias source does this represent?

Correct. This is a canonical example of historical bias in training labels: the model learned that male-associated signals predicted "successful candidate" because historical hiring decisions systematically favored men — and the model faithfully reproduced that discrimination.

Historical bias in training labels is the correct category. The model learned that male-associated features predicted the "successful candidate" label — because those historical labels were themselves the product of biased human decisions.

8. Meta's Galactica language model was taken down 72 hours after public release in November 2022. What does this episode most directly illustrate?

Correct. Galactica's rapid withdrawal is a documented case study in what happens when pre-release red-teaming is inadequate. The confident false claims and stereotyped outputs were the type of failure adversarial testing is specifically designed to find before deployment.

The Galactica episode is a cautionary tale about insufficient red-teaming. Confident scientific hallucination and stereotyped outputs are precisely the failure modes red-teaming is designed to catch before public release.

9. The Kleinberg-Mullainathan-Raghavan impossibility theorem states that demographic parity, equalized odds, and calibration cannot all be simultaneously satisfied under what condition?

Correct. The impossibility result holds specifically when base rates differ between groups — which is almost always true in real-world applications, especially those involving historically marginalized groups. This makes the fairness metric selection a genuine values choice, not a technical optimization.

The impossibility theorem applies when base rates differ between groups. Since base rates almost always differ in real-world settings (often because of structural inequality), this makes simultaneous satisfaction of all three fairness criteria mathematically impossible.

10. Automated red-teaming uses a language model to generate adversarial inputs for another language model at scale (Perez et al., 2022). What is the primary advantage of this approach over human red-teaming alone?

Correct. Scale is the primary advantage — automated red-teaming can generate millions of adversarial inputs where human teams might test thousands. The trade-off is the shared-blind-spot risk, but the scale benefit can uncover failure modes human testing would not reach.

The key advantage is scale: automated red-teaming explores vastly more of the input space than human teams can. The risk is shared blind spots, but the scale benefit is real and documented in Perez et al.'s findings of novel harmful behaviors.

11. When a researcher discovers a bias gap, Selbst et al. (2019) warn that technical findings often fail to produce organizational change. What is the mechanism they describe?

Correct. Selbst et al. document how the "abstraction" problem strips technical findings of their context as they move up organizational chains — the nuanced story of conflicting fairness criteria and their different human implications becomes a simplified, decontextualized number.

Selbst et al. describe the abstraction problem: as findings travel up organizational hierarchies, the contextual complexity is stripped away, making it harder for decision-makers to understand what the numbers actually mean for real people.

12. A pre-specified subgroup analysis involves deciding which demographic groups to evaluate before examining model outputs. Why is this methodologically important?

Correct. Pre-specification prevents selective reporting bias: if you look at many subgroups and only report the interesting ones, you can mislead readers about what the analysis found. Pre-specifying groups before analysis prevents this.

Pre-specification guards against selective reporting: if an analyst searches many possible subgroups and reports only the largest gap found, they may be surfacing chance variation rather than real disparity. Deciding which groups to test before looking at data prevents this.
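The risk of surfacing chance variation grows quickly with the number of subgroups examined. The sketch below computes the probability of at least one spurious "significant" gap when no real disparity exists, along with the Bonferroni-adjusted threshold that compensates for it. It assumes the subgroup tests are independent, which is a simplification, and the function names are hypothetical.

```python
def chance_of_spurious_gap(num_subgroups, alpha=0.05):
    """Probability that at least one of several independent tests is a false alarm."""
    return 1 - (1 - alpha) ** num_subgroups

def bonferroni_threshold(num_subgroups, alpha=0.05):
    """Per-test significance threshold that keeps the overall false-alarm rate near alpha."""
    return alpha / num_subgroups

for k in (1, 5, 20):
    print(f"{k:>2} subgroups: chance of a spurious gap = {chance_of_spurious_gap(k):.2f}, "
          f"Bonferroni threshold = {bonferroni_threshold(k):.4f}")
```

Pre-specifying the groups fixes the number of tests in advance, so a correction like this can be applied honestly instead of after the fact.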

13. The Electronic Privacy Information Center (EPIC) filed an FTC complaint about HireVue's AI video interview scoring in 2019. What outcome followed from a combination of this advocacy and external audit pressure?

Correct. HireVue discontinued facial analysis in 2021 following the EPIC complaint, external audit scrutiny, and sustained advocacy — a documented example of external pressure translating bias concerns into actual product change.

HireVue discontinued its facial analysis component in 2021 — a concrete outcome from sustained external audit pressure and advocacy, illustrating that accountability beyond technical testing can produce real changes.

14. Which of these correctly describes the "equalized odds" fairness criterion?

Correct. Equalized odds requires both true positive rates and false positive rates to be equal across groups. It is a different condition from calibration and from demographic parity, neither of which constrains error rates group by group. It is the criterion most directly violated by COMPAS's racial disparity in false positive rates.

Equalized odds requires both the true positive rate and the false positive rate to be equal across groups. Calibration (matching predicted probabilities to actual rates) is a different criterion — the one Northpointe cited in the COMPAS dispute.
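Equalized odds can be checked by computing the true positive rate and false positive rate separately for each group. The following is a minimal sketch; the function names and confusion-matrix counts are hypothetical, and it assumes every group has at least one actual positive and one actual negative.

```python
def error_rates_by_group(groups, labels, predictions):
    """Return {group: {"tpr": ..., "fpr": ...}} from binary labels and predictions."""
    counts = {}
    for group, label, pred in zip(groups, labels, predictions):
        c = counts.setdefault(group, {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
        if label == 1:
            c["tp" if pred == 1 else "fn"] += 1
        else:
            c["fp" if pred == 1 else "tn"] += 1
    return {
        group: {"tpr": c["tp"] / (c["tp"] + c["fn"]),
                "fpr": c["fp"] / (c["fp"] + c["tn"])}
        for group, c in counts.items()
    }

def equalized_odds_gaps(groups, labels, predictions):
    """Largest between-group gaps in TPR and FPR; both are 0 under equalized odds."""
    rates = error_rates_by_group(groups, labels, predictions)
    tprs = [r["tpr"] for r in rates.values()]
    fprs = [r["fpr"] for r in rates.values()]
    return max(tprs) - min(tprs), max(fprs) - min(fprs)

def expand(group, tp, fn, fp, tn):
    """Build (group, label, prediction) rows from hypothetical confusion-matrix counts."""
    return ([(group, 1, 1)] * tp + [(group, 1, 0)] * fn +
            [(group, 0, 1)] * fp + [(group, 0, 0)] * tn)

rows = expand("A", tp=30, fn=20, fp=20, tn=30) + expand("B", tp=21, fn=14, fp=14, tn=51)
groups, labels, predictions = zip(*rows)
tpr_gap, fpr_gap = equalized_odds_gaps(groups, labels, predictions)
print(f"TPR gap = {tpr_gap:.3f}, FPR gap = {fpr_gap:.3f}")
```

In this made-up data, 60% of flagged people in each group actually had the outcome, so a check on predictive value would pass, yet the false positive rates differ. That is the same shape as the COMPAS dispute: one criterion satisfied, equalized odds violated.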

15. COMPAS continued to be used in courts after ProPublica's analysis, and many face recognition systems documented by NIST continued operating at airports. What does this pattern most directly demonstrate?

Correct. This is the central structural lesson of Module 3: measurement is necessary but not sufficient. The gap between finding and fixing is real, documented, and requires institutional mechanisms — not just better tests.

The persistence of documented bias in deployed systems demonstrates that measurement alone does not produce change. Organizational accountability, regulation, and sustained advocacy are required to translate technical findings into remediation.