Amazon quietly built a machine learning tool to screen résumés beginning in 2014. By 2015 its engineers noticed something troubling: the system was systematically downgrading résumés that contained the word "women's" — as in "women's chess club" — and penalising graduates of all-female colleges. The training data was a decade of historically male-dominated hiring decisions. The model had learned to replicate that pattern precisely. Amazon disbanded the project in 2017, but not before the episode became one of the defining documented cases of algorithmic hiring bias.
What made the problem detectable was a simple question: do acceptance rates differ by group? That question is the foundation of disparity metrics.
A disparity metric is any numerical measure that compares outcomes across demographic groups. The goal is to move from an intuition ("something seems unfair") to a falsifiable, documented claim ("Group A receives a positive outcome at 1.8× the rate of Group B").
Disparity metrics do not themselves define fairness — they surface patterns that require interpretation. Depending on context, the same disparity may represent historical injustice baked into training data, a legitimate predictor variable, or simple measurement noise. The metric is the start of an investigation, not its conclusion.
Each metric captures something slightly different. Understanding which one a fairness audit used is critical to interpreting its findings.
In 2016 Chouldechova, and separately Kleinberg et al., proved mathematically that calibration and equal error rates (equalized odds) cannot both hold when base rates differ between groups, except in degenerate cases such as a perfect predictor. It follows that demographic parity, equalized odds, and calibration cannot all be satisfied simultaneously. Any bias audit must therefore choose which criterion matters most for the specific application; there is no single correct answer.
US employment law uses the four-fifths rule (also called the 80% rule): if the selection rate for a protected group is less than 80% of the rate for the group with the highest selection rate, that is generally treated as evidence of disparate impact. This test originated in the 1978 Uniform Guidelines on Employee Selection Procedures and predates machine learning entirely, but regulators and courts have applied it to algorithmic systems.
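The four-fifths comparison is simple enough to sketch in a few lines. The function names and the example counts below are illustrative assumptions, not part of any regulatory text; the only rule encoded is the 80% threshold described above.

```python
def impact_ratios(selection_counts):
    """Map each group to its selection rate divided by the highest group's rate.

    selection_counts: {group: (selected, applicants)}
    """
    rates = {g: sel / total for g, (sel, total) in selection_counts.items()}
    reference = max(rates.values())  # group with the highest selection rate
    return {g: rate / reference for g, rate in rates.items()}


def flags_four_fifths(selection_counts, threshold=0.8):
    """Return the groups whose impact ratio falls below the threshold."""
    ratios = impact_ratios(selection_counts)
    return sorted(g for g, r in ratios.items() if r < threshold)


# Illustrative numbers only: 50 of 100 selected in group_a, 30 of 100 in group_b.
example = {"group_a": (50, 100), "group_b": (30, 100)}
print(impact_ratios(example))      # group_b ratio is 0.30 / 0.50 = 0.6
print(flags_four_fifths(example))  # group_b falls below 0.8
```

A ratio below 0.8 is a trigger for investigation rather than a verdict; with small groups the ratio can swing widely from one hiring cycle to the next.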
In 2023 the EEOC issued guidance explicitly stating that employers using algorithmic decision-making tools remain liable under Title VII for disparate impact, regardless of whether the algorithm was developed in-house or purchased from a vendor.
ProPublica's 2016 investigation of the COMPAS recidivism scoring tool examined 7,000 defendants in Broward County, Florida. Their headline finding: Black defendants who did not reoffend were flagged as high-risk at nearly twice the rate of white defendants who also did not reoffend (false positive rate: 44.9% Black vs. 23.5% white).
Northpointe (maker of COMPAS) responded that the tool was calibrated — its scores predicted the same probability of reoffending for each race at each score level. Both claims were true simultaneously. The impossibility theorem explains why: when reoffend rates differ between groups, you cannot have both equal calibration and equal error rates. The disagreement was not about math — it was about which fairness criterion society should prioritise.
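The tension can be made concrete with a standard confusion-matrix identity that links base rate, positive predictive value (PPV), false negative rate (FNR), and false positive rate (FPR). The sketch below uses equal PPV as a thresholded stand-in for calibration, and the two groups and their numbers are hypothetical, chosen only to show the effect.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate forced by a base rate, PPV, and false negative rate.

    Follows from the confusion-matrix identity
        FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR)
    where p is the group's base rate of the positive outcome.
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Hypothetical groups that share PPV and FNR but differ in base rate.
shared_ppv, shared_fnr = 0.6, 0.3
fpr_high_base = implied_fpr(0.5, shared_ppv, shared_fnr)
fpr_low_base = implied_fpr(0.3, shared_ppv, shared_fnr)
print(round(fpr_high_base, 3), round(fpr_low_base, 3))  # 0.467 0.2
```

With PPV and FNR held equal, the group with the higher base rate is forced to have the higher false positive rate. Equalising the false positive rates would require giving up equal PPV or equal FNR.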
1. Define the decision. Identify the binary output — loan approval, bail recommendation, ad delivery, content moderation flag. Audits of multi-output or continuous systems require additional decomposition.
2. Identify protected attributes. Race, gender, age, disability status, and national origin are legally protected in most jurisdictions. Proxy variables (zip code, name, school attended) may encode protected attributes even when those attributes are not explicitly in the model.
3. Choose the metric. Select at minimum demographic parity difference and equalized odds difference. Document why you chose each. Note base-rate differences between groups.
4. Collect stratified outcome data. This step is often the hardest — many organisations lack demographic data on users or applicants, either because they never collected it or because legal counsel advised against it. Techniques for inferring demographic data (Bayesian Improved Surname Geocoding, or BISG) introduce their own uncertainty.
5. Report with confidence intervals. Small sample sizes produce large variance. A 20-percentage-point gap in a dataset of 80 people may not be statistically significant. The 2019 audit of Amazon Rekognition found a 0% error rate for lighter-skinned males and a 31.4% error rate for darker-skinned females — with sufficient sample sizes that the gap was unambiguous.
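Step 5 can be illustrated with a confidence interval for the difference between two selection rates. This is a minimal sketch using the simple Wald (normal approximation) interval; for very small groups a Wilson or Newcombe interval is usually preferred. The function name and the counts are assumptions made for the example.

```python
import math


def parity_gap_ci(selected_a, n_a, selected_b, n_b, z=1.96):
    """Wald confidence interval for the difference in selection rates (a minus b)."""
    p_a, p_b = selected_a / n_a, selected_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    gap = p_a - p_b
    return gap, gap - z * se, gap + z * se


# Hypothetical: a 20-point gap with 40 people per group (80 in total).
gap, low, high = parity_gap_ci(20, 40, 12, 40)
print(f"gap={gap:.2f}, 95% CI=({low:.2f}, {high:.2f})")  # interval includes 0

# The same rates with 4,000 people per group give a much tighter interval.
big_gap, big_low, big_high = parity_gap_ci(2000, 4000, 1200, 4000)
print(f"gap={big_gap:.2f}, 95% CI=({big_low:.2f}, {big_high:.2f})")
```

In the small sample the interval spans zero, so the 20-point gap cannot be distinguished from noise. In the large sample the same gap is clearly non-zero.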
The MIT Media Lab study by Joy Buolamwini and Timnit Gebru (Gender Shades, 2018) tested three commercial facial analysis APIs; a 2019 follow-up audit by Raji and Buolamwini extended the test to Amazon Rekognition. Amazon Rekognition's gender classification error rate was 0.0% for lighter-skinned males, 6.1% for darker-skinned males, 7.7% for lighter-skinned females, and 31.4% for darker-skinned females. The audit used the exact metrics described above (TPR and FPR stratified by skin type and gender) and found a gap of more than 31 percentage points between the best- and worst-performing groups. A ratio is undefined here because the best group's error rate is zero.
A mid-sized tech company has deployed an ML-based screening tool for software engineering roles. You have been given the following outcome data for a six-month period. Your task is to conduct a disparity audit using the metrics from Lesson 1.
| Group | Applied | Selected | Qualified & Selected | Qualified & Rejected |
|---|---|---|---|---|
| White men | 800 | 320 (40%) | 290 | 110 |
| Women | 400 | 96 (24%) | 74 | 126 |
| Black men | 200 | 44 (22%) | 32 | 68 |
| Latino/a applicants | 150 | 33 (22%) | 24 | 51 |
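One way to start the audit is to derive per-group rates directly from the table. The sketch below assumes that "Qualified & Selected" plus "Qualified & Rejected" covers every qualified applicant, so the remaining applicants are treated as unqualified. That reading of the columns is an assumption, and the helper names are illustrative.

```python
# Counts copied from the table: (applied, selected, qualified_selected, qualified_rejected)
counts = {
    "White men": (800, 320, 290, 110),
    "Women": (400, 96, 74, 126),
    "Black men": (200, 44, 32, 68),
    "Latino/a applicants": (150, 33, 24, 51),
}


def audit_row(applied, selected, qual_selected, qual_rejected):
    """Derive selection rate, TPR, and FPR for one group."""
    qualified = qual_selected + qual_rejected
    unqualified = applied - qualified              # assumption: everyone else is unqualified
    unqualified_selected = selected - qual_selected
    return {
        "selection_rate": selected / applied,
        "tpr": qual_selected / qualified,          # qualified applicants who were selected
        "fpr": unqualified_selected / unqualified, # unqualified applicants who were selected
    }


report = {group: audit_row(*row) for group, row in counts.items()}
best_rate = max(r["selection_rate"] for r in report.values())
for group, r in report.items():
    r["impact_ratio"] = r["selection_rate"] / best_rate  # four-fifths comparison
    print(group, {k: round(v, 3) for k, v in r.items()})

# Largest difference in true positive rate between any two groups.
tpr_gap = max(r["tpr"] for r in report.values()) - min(r["tpr"] for r in report.values())
print("TPR gap:", round(tpr_gap, 3))
```

The impact ratio addresses demographic parity and the four-fifths rule, while the TPR and FPR columns address equalized odds. A complete audit would also attach confidence intervals to each figure.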
In November 2019, software developer David Heinemeier Hansson tweeted that Apple Card had offered him a credit limit twenty times higher than his wife's, despite her having a higher credit score. His tweet went viral. Within days, Apple co-founder Steve Wozniak reported the same pattern in his household. The New York Department of Financial Services opened an investigation into Goldman Sachs, which ran the underlying credit model.
Goldman Sachs stated that the algorithm did not use gender as an input. That claim was almost certainly true — and almost certainly beside the point. Credit limit models incorporate dozens of variables — individual versus joint accounts, spending category history, address — each of which can correlate with gender without encoding it explicitly. The investigation found no intentional discrimination but highlighted the inadequacy of simply removing protected attributes.
A proxy variable is a feature that correlates with a protected attribute strongly enough that including it in a model effectively encodes that attribute's influence even when the attribute itself is absent. The model learns the correlation from training data and exploits it in predictions.
This is not a theoretical concern. In US data, zip code correlates with race due to residential segregation patterns that persist from redlining-era policies. First name correlates with both race and gender. School attended correlates with socioeconomic class and race. Job title correlates with gender. Each of these is a legitimate predictor of many outcomes — and each can function as a race or gender proxy.
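A first-pass screen for proxy strength is to measure how strongly a candidate feature is associated with the protected attribute in an audit sample where both are known. The sketch below uses Cramér's V for two categorical variables. The data is synthetic and the choice of statistic is an assumption for illustration; a fuller audit might instead train a model to predict the protected attribute from the features.

```python
import math
from collections import Counter


def cramers_v(feature_values, protected_values):
    """Cramér's V between two categorical sequences (0 means no association)."""
    n = len(feature_values)
    joint = Counter(zip(feature_values, protected_values))
    f_totals = Counter(feature_values)
    p_totals = Counter(protected_values)
    chi2 = 0.0
    for f, f_count in f_totals.items():
        for p, p_count in p_totals.items():
            expected = f_count * p_count / n   # count expected under independence
            chi2 += (joint[(f, p)] - expected) ** 2 / expected
    k = min(len(f_totals), len(p_totals)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Synthetic audit sample: zip area "A" is mostly group x, area "B" mostly group y.
zips = ["A"] * 50 + ["B"] * 50
groups = ["x"] * 40 + ["y"] * 10 + ["x"] * 10 + ["y"] * 40
print(round(cramers_v(zips, groups), 2))  # 0.6
```

Values near 1 mean the feature nearly determines the protected attribute. There is no universal cut-off, so any threshold used to flag a feature should be documented as a judgment call.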
These are confirmed or strongly evidenced proxy relationships from real cases:
| Proxy Variable | Protected Attribute | Mechanism | Documented Case |
|---|---|---|---|
| Zip code | Race | Residential segregation from historical redlining | Car insurance pricing (ProPublica, 2017) |
| First name | Race / Gender | Name distributions differ by demographic group | Résumé callback studies (Bertrand & Mullainathan, 2004) |
| College attended | Race / SES | Historically Black Colleges, school demographics | Amazon hiring tool (Reuters, 2018) |
| Shopping categories | Gender | Gendered consumption patterns | Apple Card investigation (NYDFS, 2021) |
| Prior arrest record | Race | Racially disparate policing creates disparate records | COMPAS recidivism tool (ProPublica, 2016) |
| Word embeddings | Gender / Race | Text corpora reflect historical biases | Caliskan et al., Science, 2017 |
Feature auditing asks: which input variables drive the model's outputs, and do any of them function as protected-attribute proxies? The two dominant techniques are:
Federal redlining practices, mapped by the Home Owners' Loan Corporation in the 1930s and built into Federal Housing Administration underwriting from 1934 until the Fair Housing Act of 1968, created residential segregation patterns that persist today. Studies show that zip codes in formerly redlined areas have lower homeownership rates, lower property values, and more economic disadvantage more than half a century later. Any ML model trained on US address data inherits this legacy. A 2020 study by the National Community Reinvestment Coalition found that formerly redlined neighbourhoods experience higher auto insurance and mortgage rates from algorithmic pricing models that use zip code as a feature.
A common but legally insufficient mitigation strategy is fairness through unawareness: simply remove protected attributes from the model. The EEOC, the EU AI Act's Article 10, and academic research all agree this is inadequate when correlated proxies remain in the feature set.
The 2021 NYDFS report on Goldman Sachs found the firm's compliance programme relied substantially on the absence of gender as a variable. The regulator required Goldman Sachs to implement ongoing disparity monitoring and to audit for proxy variables — a requirement now reflected in New York's Local Law 144, which mandates bias audits of automated employment decision tools used in New York City.
New York City's Local Law 144, which took effect in July 2023, requires employers and employment agencies using Automated Employment Decision Tools (AEDTs) to conduct annual bias audits and publish summary results publicly. The law explicitly requires testing for disparate impact on the basis of sex, race/ethnicity, and their intersection — acknowledging that proxy variables make protected-attribute testing necessary even when those attributes are absent from the model.
A consumer lending fintech is building a credit-scoring model. Their data science team has proposed the following feature set. Your job is to identify which features may function as proxy variables, explain the mechanism, and recommend whether each should be dropped, retained with monitoring, or retained as-is.
| Feature | Type | Stated Rationale |
|---|---|---|
| FICO credit score | Numeric | Direct creditworthiness signal |
| Zip code of residence | Categorical | Local economic conditions |
| Employment industry code | Categorical | Job stability predictor |
| Years at current employer | Numeric | Income stability |
| Number of dependents | Numeric | Financial obligations |
| Bank account type (joint/individual) | Binary | Financial behaviour |
| Primary language of application | Categorical | Operational processing |
| Highest education level | Ordinal | Earnings potential |
In 2021, researchers from the University of Washington, Allen Institute for AI, and others published a study examining GPT-3's occupational associations. When prompted to complete sentences like "The doctor said that ___," GPT-3 completed them with male pronouns 83% of the time. For "nurse," female pronouns appeared in over 90% of completions. The researchers also found that Arab identity was associated with violence-related words in word-embedding space at a statistically significant rate.
These were not edge cases or adversarial prompts. They were standard continuations reflecting training data distributions: the internet's text, which reflects decades of human occupational segregation and social stereotyping, faithfully reproduced in a 175-billion-parameter model.
Tabular model audits compare acceptance rates across groups. NLP model audits must grapple with outputs that are probabilistic, context-dependent, and often evaluated by human raters whose own biases affect the assessment. The field uses several purpose-built methodologies.
In 2019 researchers at Google and University of Washington published a study on the Perspective API toxicity classifier, which is used by major news organisations to moderate comments. They found that texts mentioning identity terms — "I am gay," "I am Black," "I am a woman" — were rated as more toxic than equivalent sentences without those terms, at statistically significant rates.
This type of bias — identity-based toxicity inflation — can cause content moderation systems to disproportionately silence speech by and about marginalised groups. The study prompted Google to update its Perspective API training methodology and publish ongoing model cards tracking performance by identity subgroup.
A 2018 study by Kiritchenko and Mohammad tested 200 sentiment analysis systems across 16 teams and found that all 200 showed measurable race and gender bias in at least one condition. Black female names systematically received more negative sentiment scores than white male names in identical sentence contexts. This was true of both commercial APIs and open-source models, suggesting the bias originates in training data rather than model architecture.
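The "identical sentence contexts" design can be sketched as a template test: fill the same sentences with names from two groups and compare mean scores. Everything below is hypothetical, including the placeholder names, the templates, and `toy_score`, which stands in for a real sentiment API and is deliberately biased so the test has something to detect.

```python
from statistics import mean

TEMPLATES = [
    "{name} feels happy today.",
    "I had lunch with {name}.",
    "{name} made me furious.",
]


def name_gap(score, names_a, names_b, templates=TEMPLATES):
    """Mean score difference (group a minus group b) over identical templates."""
    def group_mean(names):
        return mean(score(t.format(name=n)) for t in templates for n in names)
    return group_mean(names_a) - group_mean(names_b)


PENALISED = {"Alpha", "Avery"}  # placeholder names the toy scorer treats unfairly


def toy_score(text):
    """Hypothetical stand-in for a sentiment API; higher means more positive."""
    base = 0.8 if "happy" in text else 0.2 if "furious" in text else 0.5
    penalty = 0.1 if any(name in text for name in PENALISED) else 0.0
    return base - penalty


gap = name_gap(toy_score, ["Alpha", "Avery"], ["Blake", "Bryn"])
print(round(gap, 3))  # -0.1: group a scores lower on identical sentences
```

Because the sentences differ only in the name, any non-zero gap is attributable to the name. A real audit would use many more templates and names and would test the gap for statistical significance.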
In 2023 Bloomberg published an investigation of image generation models including Stable Diffusion and DALL-E. When prompted to generate "a photo of a CEO," the models produced images that were overwhelmingly male and white. "A photo of a janitor" or "a photo of a fast food worker" produced images that were predominantly non-white. The study systematically varied occupational prompts and documented the demographic distribution of outputs — applying the same disparity framework used in tabular audits to image outputs.
The audit methodology: generate N images per prompt, use a secondary classifier to assess apparent demographic characteristics, then compute demographic parity difference across occupation categories. The same four-fifths rule logic can be applied: if a demographic group appears at less than 80% of the frequency expected under a chosen reference distribution (for example, the group's share of the actual workforce in that occupation), the model has a measurable disparity.
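The counting step of this methodology can be sketched as follows. It assumes the secondary classifier has already produced one label per generated image and that the auditor supplies a reference distribution. The group names, counts, and reference shares are all hypothetical.

```python
from collections import Counter


def representation_ratios(predicted_labels, reference_shares):
    """Observed share of each group in generated images divided by its reference share."""
    counts = Counter(predicted_labels)
    total = len(predicted_labels)
    return {g: (counts[g] / total) / share for g, share in reference_shares.items()}


# Hypothetical classifier output for 100 images generated from one prompt.
labels = ["group_a"] * 85 + ["group_b"] * 15
reference = {"group_a": 0.6, "group_b": 0.4}  # assumed reference distribution
ratios = representation_ratios(labels, reference)
under_represented = [g for g, r in ratios.items() if r < 0.8]
print(ratios, under_represented)
```

The result depends heavily on the reference distribution and on the secondary classifier's own error rates, so both should be reported alongside the ratios.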
Google's 2019 paper "Model Cards for Model Reporting" (Mitchell et al.) proposed standardised documentation for ML models that includes disaggregated performance metrics across demographic subgroups. Model cards are now required for models submitted to several major benchmarks and are mandated by the EU AI Act for high-risk AI systems. A compliant model card must include:
Performance metrics by subgroup: accuracy, precision, recall, and F1 stratified by at minimum gender and race.
Known bias limitations: explicit statement of cases where the model is known to underperform.
Training data demographics: where known, the demographic composition of training data.
Intended and prohibited use cases.
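The four sections listed above could be captured in a small record type so that missing sections are easy to detect before publication. This is a sketch only: the class and field names are illustrative and do not follow any official model card schema.

```python
from dataclasses import dataclass, field


@dataclass
class ModelCard:
    """Hypothetical minimal model card record; field names are illustrative."""
    model_name: str
    subgroup_metrics: dict = field(default_factory=dict)  # {subgroup: {"recall": ...}}
    known_limitations: list = field(default_factory=list)
    training_data_demographics: dict = field(default_factory=dict)
    intended_uses: list = field(default_factory=list)
    prohibited_uses: list = field(default_factory=list)

    def missing_sections(self):
        """Names of sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]


card = ModelCard(
    model_name="sentiment-v1",  # hypothetical model
    subgroup_metrics={"women": {"recall": 0.90}, "men": {"recall": 0.93}},
)
print(card.missing_sections())
```

A check like `missing_sections()` could run in a release pipeline so that a model cannot ship with an incomplete card.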
The EU AI Act (adopted 2024) classifies NLP systems used in employment, credit, education, and law enforcement as high-risk. Article 9 requires ongoing risk management including bias monitoring. Article 10 requires training data governance to address underrepresentation and bias. Article 13 requires transparency documentation equivalent to model cards. The Act entered into force in August 2024, with its obligations for high-risk systems phasing in over the following years.
A major e-commerce platform uses a sentiment analysis API to automatically flag negative product reviews for human review. An internal report has suggested the system may be flagging reviews written in African American Vernacular English (AAVE) at higher rates than standard American English reviews of equivalent sentiment. You have been asked to design a formal audit.
In 2020, Twitter users began reporting that the platform's automatic image cropping feature — which selected a preview crop when images were too tall — appeared to favour lighter-skinned faces. Researchers tested this systematically by creating images with a lighter-skinned and a darker-skinned face at different positions. The algorithm consistently cropped to show the lighter-skinned face.
Twitter's engineers ran their own investigation and in 2021 confirmed the bias, finding the cropping model had learned to associate facial saliency with features more common in lighter-skinned individuals. The model had passed internal testing before deployment. The bias was discovered not by the company's audit process but by external researchers two years after launch. Twitter subsequently disabled the algorithmic cropping feature entirely.
The episode illustrated a fundamental gap: pre-deployment audits capture the model's behaviour at a moment in time. They cannot anticipate how the model will interact with real-world content distributions that evolve over time.
Three frameworks have achieved significant industry adoption:
Red-teaming, borrowed from cybersecurity, involves structured adversarial testing designed to find failure modes before they occur in production. Applied to bias detection, red-teaming involves:
Demographic edge cases: Systematically test underrepresented intersectional groups — elderly Black women, young disabled Latino men — that aggregate statistics may obscure. The bias is often worst at intersections.
Historical stress tests: Test the model's behaviour on inputs that reflect historical events or cultural contexts where bias is known to be concentrated — names, dialects, religious references.
Adversarial demographic injection: Subtly alter inputs to include demographic signals and measure output changes. Used by Meta's Responsible AI team to test their content ranking algorithms.
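The point about intersections can be shown with a small slicing helper that computes error rates for any combination of attributes. The records below are synthetic and constructed so that each single-attribute view understates the worst subgroup. The attribute names are assumptions for the example.

```python
from collections import defaultdict


def error_rates_by(records, keys):
    """Error rate per subgroup, where a subgroup is the tuple of the given attributes."""
    totals, errors = defaultdict(int), defaultdict(int)
    for r in records:
        group = tuple(r[k] for k in keys)
        totals[group] += 1
        errors[group] += r["error"]  # True counts as 1
    return {g: errors[g] / totals[g] for g in totals}


def make(gender, skin, n, n_err):
    """Build n synthetic records for one subgroup, the first n_err of them errors."""
    return [{"gender": gender, "skin": skin, "error": i < n_err} for i in range(n)]


records = (
    make("m", "light", 100, 0)
    + make("m", "dark", 100, 5)
    + make("f", "light", 100, 5)
    + make("f", "dark", 100, 30)
)

by_gender = error_rates_by(records, ["gender"])
by_skin = error_rates_by(records, ["skin"])
intersectional = error_rates_by(records, ["gender", "skin"])
print(by_gender, by_skin, intersectional, sep="\n")
```

Each single-attribute view tops out at 17.5%, while the intersectional view exposes a 30% error rate for one subgroup. Intersectional cells are also smaller, so their estimates are noisier and need confidence intervals.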
In October 2021, Twitter published a study finding that its home timeline algorithm amplified right-leaning political content more than left-leaning content in most of the countries it examined. The study was conducted by Twitter's own Machine Learning Ethics, Transparency and Accountability (META) team. The team used a randomised controlled experiment, showing some users a chronological feed and others the algorithmic one, and measured the resulting differences in political content exposure. This represented one of the first public disclosures of a major platform's algorithmic amplification audit methodology.
A continuous monitoring programme requires infrastructure that most organisations have not yet built. The key components are:
Demographic logging: Capturing (with consent and in compliance with data protection law) the demographic composition of users receiving different model outputs. Without this data, post-deployment drift is undetectable.
Disparity dashboards: Real-time or near-real-time tracking of key fairness metrics. Amazon Web Services, Microsoft Azure, and Google Cloud all offer bias monitoring tools as part of their MLOps platforms. These tools compute demographic parity and equalized odds metrics on rolling windows of production data.
Drift detection: Statistical tests (e.g., Population Stability Index, Kolmogorov-Smirnov test) that alert when the input distribution shifts significantly from the training distribution. Bias can emerge or worsen when the user population changes — a model trained on one population may perform differently on a shifted one.
Scheduled re-audits: IEEE 7003 recommends audit intervals of no longer than 12 months for high-risk systems, or immediately following any material change to the model, training data, or deployment context.
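Of the components above, drift detection is the easiest to prototype. The sketch below computes a Population Stability Index over equal-width bins taken from the baseline sample. The binning scheme, the floor for empty bins, and the synthetic data are assumptions for illustration; production tools often use quantile bins instead.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values to edge bins
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic feature values: the production sample is shifted upward by 0.3.
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]
print(psi(baseline, baseline))           # 0.0 for identical samples
print(round(psi(baseline, shifted), 2))  # large value signals drift
```

A common rule of thumb treats a PSI above roughly 0.25 as a significant shift, but that threshold is a convention, not a standard. A drift alert should trigger a re-computation of the fairness metrics rather than being read as evidence of bias by itself.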
Internal audits face a structural conflict of interest — the team auditing a system often built or deployed it. Third-party auditing addresses this but creates access challenges: auditors need sufficient access to the model and data to conduct meaningful tests, which organisations may resist for IP or security reasons.
A widely cited example of external scrutiny of a major deployed AI system is the testing of Twitter's image cropping algorithm, which surfaced the race bias described at the start of this lesson. The external testers worked with the deployed model rather than its weights, demonstrating that meaningful audits are possible with API-level access, though weight-level access enables more thorough testing.
NIST's AI RMF frames bias not as a property to be eliminated at training time but as a risk to be managed across the full system lifecycle: design (what problem are we solving, for whom?), data (how representative is the training corpus?), development (which fairness criteria guide training?), deployment (who uses it, in what context?), and post-deployment (how does behaviour change over time?). This lifecycle framing is now the dominant conceptual model in corporate AI governance programmes.
A regional bank is deploying a loan approval ML model. It has passed a pre-deployment bias audit showing acceptable demographic parity across race and gender groups. Your task is to design the ongoing monitoring programme required under their state regulator's new AI fairness guidelines. The bank's model deployment team has budget for one FTE dedicated to AI fairness monitoring.