Module 3 · Lesson 1

Disparity Metrics

How researchers quantify the gap between what a system promises and what it delivers — for whom.

If a hiring algorithm approves 60% of white applicants but only 32% of Black applicants, is that bias — and how would you prove it?

Amazon quietly built a machine learning tool to screen résumés beginning in 2014. By 2015 its engineers noticed something troubling: the system was systematically downgrading résumés that contained the word "women's" — as in "women's chess club" — and penalising graduates of all-female colleges. The training data was a decade of historically male-dominated hiring decisions. The model had learned to replicate that pattern precisely. Amazon disbanded the project in 2017, but not before the episode became one of the defining documented cases of algorithmic hiring bias.

What made the problem detectable was a simple question: do acceptance rates differ by group? That question is the foundation of disparity metrics.

What Is a Disparity Metric?

A disparity metric is any numerical measure that compares outcomes across demographic groups. The goal is to move from an intuition ("something seems unfair") to a falsifiable, documented claim ("Group A receives a positive outcome at 1.8× the rate of Group B").

Disparity metrics do not themselves define fairness — they surface patterns that require interpretation. Depending on context, the same disparity may represent historical injustice baked into training data, a legitimate predictor variable, or simple measurement noise. The metric is the start of an investigation, not its conclusion.

The Four Most Used Metrics

Each metric captures something slightly different. Understanding which one a fairness audit used is critical to interpreting its findings.

Demographic Parity: The positive prediction rate (PPR) is equal across groups. If a loan model approves 40% of applicants in every racial group, it satisfies demographic parity. Simple to compute, but it ignores whether the underlying qualification rates differ.

Equalized Odds: Both the true positive rate (TPR) and the false positive rate (FPR) are equal across groups. A recidivism model satisfies equalized odds if it correctly identifies high-risk individuals, and incorrectly flags low-risk ones, at the same rates for all racial groups. It is a more demanding test than demographic parity because it conditions on the true outcome.

Calibration: A score of 70% means a 70% likelihood of the predicted outcome, for every group equally. The COMPAS recidivism tool was found to be well calibrated across race even while its error types were asymmetric, which is how ProPublica and Northpointe could both be technically correct in 2016.

Individual Fairness: Similar individuals receive similar predictions. This requires a domain-specific definition of "similar", which is often contentious. It is hard to audit at scale but important for detecting edge-case bias that aggregate statistics do not show.
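The first two metrics can be computed directly from labelled predictions. The following is a minimal sketch, assuming each record is a (group, actual outcome, predicted outcome) triple with 0/1 values. The function names and the toy data are illustrative, not taken from any particular audit toolkit, and calibration and individual fairness are not covered here.

```python
from collections import defaultdict

def group_rates(records):
    """Per-group selection rate, TPR and FPR from (group, y_true, y_pred) triples."""
    counts = defaultdict(lambda: {"n": 0, "pred_pos": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0})
    for group, y_true, y_pred in records:
        c = counts[group]
        c["n"] += 1
        c["pred_pos"] += y_pred
        if y_true == 1:
            c["pos"] += 1
            c["tp"] += y_pred
        else:
            c["neg"] += 1
            c["fp"] += y_pred
    rates = {}
    for group, c in counts.items():
        rates[group] = {
            "selection_rate": c["pred_pos"] / c["n"],
            "tpr": c["tp"] / c["pos"] if c["pos"] else float("nan"),
            "fpr": c["fp"] / c["neg"] if c["neg"] else float("nan"),
        }
    return rates

def max_gap(rates, key):
    """Largest difference between any two groups for one metric."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)

# Toy data, invented for illustration: (group, actual outcome, model prediction)
toy = (
    [("A", 1, 1)] * 40 + [("A", 1, 0)] * 10 + [("A", 0, 1)] * 10 + [("A", 0, 0)] * 40
    + [("B", 1, 1)] * 20 + [("B", 1, 0)] * 20 + [("B", 0, 1)] * 5 + [("B", 0, 0)] * 55
)

rates = group_rates(toy)
dp_diff = max_gap(rates, "selection_rate")                        # demographic parity difference
eo_diff = max(max_gap(rates, "tpr"), max_gap(rates, "fpr"))      # equalized odds difference
print(rates)
print(f"demographic parity difference: {dp_diff:.3f}")
print(f"equalized odds difference:     {eo_diff:.3f}")
```

A difference of zero on either line would mean the corresponding criterion is met exactly. In practice auditors set a tolerance and report the gap alongside sample sizes.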

Impossibility Result

In 2016, Chouldechova and, separately, Kleinberg et al. proved mathematically that calibration and equal error rates (the equalized odds condition) cannot both hold when base rates differ between groups, except in the degenerate case of a perfect predictor. Demographic parity adds a further tension: when base rates differ, it generally conflicts with both of the other criteria. Any bias audit must therefore choose which criterion matters most for the specific application; there is no single correct answer.
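The conflict can be seen numerically from a confusion-matrix identity of the kind used in Chouldechova's argument: for a binary classifier, FPR = p/(1-p) × (1-PPV)/PPV × (1-FNR), where p is the group's base rate and PPV is the positive predictive value. The sketch below uses equal PPV as a simple stand-in for calibration, and all numbers are invented for illustration.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate forced by a group's base rate, PPV and FNR."""
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * (1 - fnr)

# Hold PPV and FNR equal for both groups (illustrative values only).
ppv, fnr = 0.6, 0.3

fpr_high = implied_fpr(0.5, ppv, fnr)   # group with a 50% base rate
fpr_low = implied_fpr(0.3, ppv, fnr)    # group with a 30% base rate

print(f"FPR, higher base-rate group: {fpr_high:.3f}")
print(f"FPR, lower base-rate group:  {fpr_low:.3f}")
```

With PPV and FNR held equal, the two false positive rates are forced apart by the base rates alone. Equalising them would require giving up equal PPV or equal FNR.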

Disparate Impact: The Legal Standard

US employment law uses the four-fifths rule (also called the 80% rule): if the selection rate for a protected group is less than 80% of the rate for the group with the highest selection rate, disparate impact is presumed. This test originated in the 1978 Uniform Guidelines on Employee Selection Procedures and predates machine learning entirely, but regulators and courts have applied it to algorithmic systems.
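As a minimal sketch, the rule reduces to one division per group. The function name is illustrative, and the example rates are the 60% and 32% approval figures from the opening question of this lesson.

```python
def four_fifths_check(selection_rates, threshold=0.8):
    """Return {group: (impact_ratio, flagged)} relative to the highest selection rate."""
    reference = max(selection_rates.values())
    result = {}
    for group, rate in selection_rates.items():
        ratio = rate / reference
        result[group] = (ratio, ratio < threshold)
    return result

example_rates = {"white applicants": 0.60, "Black applicants": 0.32}
check = four_fifths_check(example_rates)
for group, (ratio, flagged) in check.items():
    print(f"{group}: impact ratio {ratio:.2f}, below four-fifths: {flagged}")
```

The group with the highest rate always has a ratio of 1.0. A flag is a trigger for further investigation, not a legal conclusion.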

In 2023 the EEOC issued guidance explicitly stating that employers using algorithmic decision-making tools remain liable under Title VII for disparate impact, regardless of whether the algorithm was developed in-house or purchased from a vendor.

The COMPAS Case Study in Numbers

ProPublica's 2016 investigation of the COMPAS recidivism scoring tool examined 7,000 defendants in Broward County, Florida. Their headline finding: Black defendants who did not reoffend were flagged as high-risk at nearly twice the rate of white defendants who also did not reoffend (false positive rate: 44.9% Black vs. 23.5% white).

Group	False positive rate (flagged high-risk but did not reoffend)	False negative rate (flagged low-risk but did reoffend)
Black defendants	44.9%	28%
White defendants	23.5%	47.7%

Key Insight

Northpointe (maker of COMPAS) responded that the tool was calibrated — its scores predicted the same probability of reoffending for each race at each score level. Both claims were true simultaneously. The impossibility theorem explains why: when reoffend rates differ between groups, you cannot have both equal calibration and equal error rates. The disagreement was not about math — it was about which fairness criterion society should prioritise.

Conducting a Disparity Audit: The Five Steps

1. Define the decision. Identify the binary output — loan approval, bail recommendation, ad delivery, content moderation flag. Audits of multi-output or continuous systems require additional decomposition.

2. Identify protected attributes. Race, gender, age, disability status, national origin are legally protected in most jurisdictions. Proxy variables (zip code, name, school attended) may encode protected attributes even when those attributes are not explicitly in the model.

3. Choose the metric. Select at minimum demographic parity difference and equalized odds difference. Document why you chose each. Note base-rate differences between groups.

4. Collect stratified outcome data. This step is often the hardest — many organisations lack demographic data on users or applicants, either because they never collected it or because legal counsel advised against it. Techniques for inferring demographic data (Bayesian Improved Surname Geocoding, or BISG) introduce their own uncertainty.

5. Report with confidence intervals. Small sample sizes produce large variance. A 20-percentage-point gap in a dataset of 80 people may not be statistically significant. The 2019 audit of Amazon Rekognition found a 0% error rate for lighter-skinned males and a 31.4% error rate for darker-skinned females — with sufficient sample sizes that the gap was unambiguous.
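Step 5 can be illustrated with a simple normal-approximation (Wald) interval for the difference between two selection rates. This is a sketch under stated assumptions: the counts are invented, and for very small samples an exact or bootstrap interval would be a safer choice.

```python
import math

def diff_ci(success_a, n_a, success_b, n_b, z=1.96):
    """Difference in proportions with an approximate 95% confidence interval."""
    p_a, p_b = success_a / n_a, success_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se

# 80 people in total: 60% vs 40% selected (hypothetical counts)
small = diff_ci(24, 40, 16, 40)
# Same rates with 800 people in total
large = diff_ci(240, 400, 160, 400)

for label, (diff, lo, hi) in (("n=80", small), ("n=800", large)):
    print(f"{label}: gap {diff:.2f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

In the 80-person case the interval includes zero, so the 20-point gap cannot be distinguished from sampling noise. With 800 people, the same gap is clearly separated from zero.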

Real Case — Amazon Rekognition (2019)

The original Gender Shades study by Joy Buolamwini and Timnit Gebru (MIT Media Lab, 2018) tested three commercial gender classification APIs. A 2019 follow-up audit by Inioluwa Deborah Raji and Joy Buolamwini extended the analysis to Amazon Rekognition. Amazon Rekognition's gender classification error rate was 0.0% for lighter-skinned males, 6.1% for darker-skinned males, 7.7% for lighter-skinned females, and 31.4% for darker-skinned females. The audit stratified error rates by skin type and gender, the same kind of disaggregation described above, and found a gap of more than 31 percentage points between the best- and worst-performing groups. Because the best group's error rate was 0.0%, the gap is better reported as a difference than as a ratio.

Lesson 1 Quiz

Disparity Metrics — four questions

1. Amazon's internal résumé-screening tool was disbanded because it penalised applicants connected to women's institutions. What was the primary cause of this bias?

Correct. The system learned to replicate historical hiring patterns. Past discrimination in who was hired became the template for who would be hired, a classic case of historical bias encoded through training data.

Incorrect. The bias was not intentional. Engineers discovered it through outcome analysis — the model had learned to replicate male-dominated historical hiring decisions.

2. In the COMPAS case, ProPublica found Black defendants were falsely flagged as high-risk at nearly twice the rate of white defendants. Northpointe countered that the tool was calibrated. How can both be true?

Correct. Chouldechova (2016) proved mathematically that with unequal base rates, a model cannot simultaneously achieve calibration and equal false positive/negative rates. Both parties were correct; they were prioritising different fairness criteria.

Incorrect. Neither party made an error. The impossibility theorem explains how both findings can be mathematically valid simultaneously when groups have different base rates.

3. Under the US four-fifths (80%) rule, if white applicants are selected at a rate of 50%, what is the minimum selection rate for a protected group before disparate impact is presumed?

Correct. 50% × 0.80 = 40%. Below 40%, disparate impact is legally presumed under the 1978 Uniform Guidelines, which the EEOC has extended to algorithmic hiring tools.

Incorrect. The four-fifths rule requires the protected group's rate to be at least 80% of the highest group's rate. 80% of 50% is 40%.

4. The Gender Shades audit of Amazon Rekognition found a 31.4% error rate for darker-skinned females versus 0% for lighter-skinned males. Which disparity metric most directly captures this finding?

Correct. Gender Shades measured the rate at which the system misclassified gender: a false positive/negative analysis stratified by skin type and gender simultaneously, revealing an intersectional disparity invisible in aggregate statistics.

Incorrect. The study measured misclassification rates — essentially FPR/FNR — stratified across intersectional groups (skin type × gender). Demographic parity and calibration were not the primary metrics used.

Lab 1 — Auditing with Disparity Metrics

Apply demographic parity and equalized odds analysis to a hypothetical hiring scenario

Scenario

A mid-sized tech company has deployed an ML-based screening tool for software engineering roles. You have been given the following outcome data for a six-month period. Your task is to conduct a disparity audit using the metrics from Lesson 1.

Group	Applied	Selected	Qualified & Selected	Qualified & Rejected
White men	800	320 (40%)	290	110
Women	400	96 (24%)	74	126
Black men	200	44 (22%)	32	68
Latino/a applicants	150	33 (22%)	24	51

Using the four-fifths rule and equalized odds, analyse this data with your AI lab assistant. Calculate the disparity ratios, identify which groups trigger the 80% rule, and discuss what the true positive rate differences tell you that demographic parity alone misses. Aim for at least 3 exchanges.
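One way to set up the calculation is sketched below, using the figures from the table above. It assumes the true positive rate is "Qualified & Selected" divided by all qualified applicants, and the variable names are illustrative. Treat it as a starting point for checking your own working rather than a complete audit.

```python
# (applied, selected, qualified_and_selected, qualified_and_rejected) from the lab table
lab_data = {
    "White men": (800, 320, 290, 110),
    "Women": (400, 96, 74, 126),
    "Black men": (200, 44, 32, 68),
    "Latino/a applicants": (150, 33, 24, 51),
}

def audit(data, threshold=0.8):
    """Selection rate, four-fifths impact ratio and TPR for each group."""
    selection = {g: sel / applied for g, (applied, sel, _, _) in data.items()}
    reference = max(selection.values())
    report = {}
    for group, (applied, sel, qual_sel, qual_rej) in data.items():
        ratio = selection[group] / reference
        report[group] = {
            "selection_rate": selection[group],
            "impact_ratio": ratio,
            "below_four_fifths": ratio < threshold,
            "tpr": qual_sel / (qual_sel + qual_rej),  # share of qualified applicants selected
        }
    return report

lab_report = audit(lab_data)
for group, row in lab_report.items():
    print(f"{group:22s} rate={row['selection_rate']:.2f} ratio={row['impact_ratio']:.2f} "
          f"flag={row['below_four_fifths']} TPR={row['tpr']:.3f}")
```

Comparing the TPR column with the selection-rate column shows whether the gap persists among applicants who were qualified, which is the question demographic parity alone cannot answer.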

Bias Detection Lab

Disparity Metrics

Welcome to Lab 1. I'm your bias detection assistant. You have hiring outcome data for four groups. Let's work through a formal disparity audit together. Start by applying the four-fifths rule: divide each group's selection rate by the highest group's rate (white men at 40%). Which groups fall below the 80% threshold — and what does that tell you?

Module 3 · Lesson 2

Proxy Variables and Feature Auditing

Protected attributes are often absent from a model — yet still influence its outputs through correlated features.

When a model never sees race or gender, how can it still discriminate on the basis of race or gender?

In November 2019, software developer David Heinemeier Hansson tweeted that Apple Card had offered him a credit limit twenty times higher than his wife, despite her having a higher credit score. His tweet went viral. Within days, Apple co-founder Steve Wozniak reported the same pattern in his household. The New York Department of Financial Services opened an investigation into Goldman Sachs, which ran the underlying credit model.

Goldman Sachs stated that the algorithm did not use gender as an input. That claim was almost certainly true — and almost certainly beside the point. Credit limit models incorporate dozens of variables — individual versus joint accounts, spending category history, address — each of which can correlate with gender without encoding it explicitly. The investigation found no intentional discrimination but highlighted the inadequacy of simply removing protected attributes.

What Is a Proxy Variable?

A proxy variable is a feature that correlates with a protected attribute strongly enough that including it in a model effectively encodes that attribute's influence even when the attribute itself is absent. The model learns the correlation from training data and exploits it in predictions.

This is not a theoretical concern. In US data, zip code correlates with race due to residential segregation patterns that persist from redlining-era policies. First name correlates with both race and gender. School attended correlates with socioeconomic class and race. Job title correlates with gender. Each of these is a legitimate predictor of many outcomes — and each can function as a race or gender proxy.

Documented Proxy Pathways

These are confirmed or strongly evidenced proxy relationships from real cases:

Proxy Variable	Protected Attribute	Mechanism	Documented Case
Zip code	Race	Residential segregation from historical redlining	Car insurance pricing (ProPublica, 2017)
First name	Race / Gender	Name distributions differ by demographic group	Résumé callback studies (Bertrand & Mullainathan, 2004)
College attended	Gender / Race / SES	Women's colleges, historically Black colleges, school demographics	Amazon hiring tool (Reuters, 2018)
Shopping categories	Gender	Gendered consumption patterns	Apple Card investigation (NYDFS, 2021)
Prior arrest record	Race	Racially disparate policing creates disparate records	COMPAS recidivism tool (ProPublica, 2016)
Word embeddings	Gender / Race	Text corpora reflect historical biases	Caliskan et al., Science, 2017

Feature Importance Auditing

Feature auditing asks: which input variables drive the model's outputs, and do any of them function as protected-attribute proxies? Three common techniques are:

SHAP Values: SHapley Additive exPlanations assign each feature a contribution score for a specific prediction. Comparing SHAP values for zip code in a loan model across racial groups can indicate whether the model is using geography as a racial proxy. Used by Salesforce, IBM, and others in bias audits.

Partial Dependence Plots: Show the marginal effect of a single feature on predicted outcomes, averaged over the observed values of all other features. They reveal nonlinear relationships, e.g., a credit model that treats zip codes in majority-Black neighbourhoods dramatically differently from adjacent codes.

Correlation Filtering: Before model training, test each candidate feature for correlation with protected attributes. The threshold for concern is typically an absolute correlation |r| > 0.3, though legal and domain-specific standards vary. High correlation does not automatically disqualify a feature, since business necessity may justify it, but it must be flagged.
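Correlation filtering is the simplest of the three to sketch. The example below assumes numeric or 0/1-encoded features and a 0/1 protected attribute, so Pearson's r is appropriate. A multi-category feature such as raw zip code would need a different association measure. Feature names and data are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def flag_proxies(features, protected, threshold=0.3):
    """Return {feature_name: r} for features whose |r| with the protected attribute exceeds the threshold."""
    flagged = {}
    for name, values in features.items():
        r = pearson_r(values, protected)
        if abs(r) > threshold:
            flagged[name] = r
    return flagged

# Toy data: 1 = member of the protected group (hypothetical)
protected = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "lives_in_zip_X": [1, 1, 1, 0, 0, 0, 0, 1],
    "years_at_employer": [2, 5, 3, 6, 3, 6, 2, 5],
}

flagged = flag_proxies(features, protected)
print(flagged)
```

A flagged feature is a candidate for review, not automatic removal. The audit record should note the correlation, the business rationale, and the decision taken.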

The Redlining Inheritance

Redlining maps drawn by the federal Home Owners' Loan Corporation in the 1930s, together with Federal Housing Administration underwriting practices (1934–1968), created residential segregation patterns that persist today. Studies show that zip codes in formerly redlined areas have lower homeownership rates, lower property values, and more economic disadvantage many decades later. Any ML model trained on US address data inherits this legacy. A 2020 study by the National Community Reinvestment Coalition found that formerly redlined neighbourhoods experience higher auto insurance and mortgage rates from algorithmic pricing models that use zip code as a feature.

The Fairness Through Unawareness Fallacy

A common but legally insufficient mitigation strategy is fairness through unawareness: simply remove protected attributes from the model. The EEOC, the EU AI Act's Article 10, and academic research all agree this is inadequate when correlated proxies remain in the feature set.

The 2021 NYDFS report on Goldman Sachs found no violation of fair lending law, though it criticised deficiencies in customer service and transparency. The case nonetheless showed why a compliance programme cannot rest on the absence of gender as a variable. Requirements for ongoing disparity testing are now appearing in regulation: New York City's Local Law 144 mandates bias audits of automated employment decision tools used in the city.

NYC Local Law 144 (Effective 2023)

New York City's Local Law 144, which took effect in July 2023, requires employers and employment agencies using Automated Employment Decision Tools (AEDTs) to conduct annual bias audits and publish summary results publicly. The law explicitly requires testing for disparate impact on the basis of sex, race/ethnicity, and their intersection — acknowledging that proxy variables make protected-attribute testing necessary even when those attributes are absent from the model.

Lesson 2 Quiz

Proxy Variables and Feature Auditing — four questions

1. Goldman Sachs stated that the Apple Card algorithm did not use gender as an input. Why did this defence not satisfy regulators?

Correct. Proxy variables, features that correlate with gender through social patterns, mean that the absence of an explicit protected attribute does not guarantee non-discrimination. Regulators therefore examined outcomes rather than accepting the input list as proof.

Incorrect. Regulators did not allege dishonesty. The issue was that correlated proxy features can reproduce the effect of a protected attribute even when that attribute is not directly present in the model.

2. Why does zip code function as a race proxy in US credit and insurance models?

Correct. Redlining-era maps graded neighbourhoods partly by racial composition, and mortgages and investment were denied in minority areas. The resulting segregation patterns mean zip code is a documented racial proxy in contemporary ML models.

Incorrect. The correlation exists because of FHA redlining policies that created racially segregated residential patterns still evident today — making zip code a well-documented racial proxy variable.

3. What does a SHAP value analysis add to a bias audit that aggregate disparity metrics alone cannot provide?

Correct. SHAP values provide feature-level explanations at the individual prediction level. By comparing SHAP contributions for zip code or name across demographic groups, auditors can identify proxy-variable pathways that aggregate statistics obscure.

Incorrect. SHAP analysis is an explainability technique that attributes each feature's contribution to individual predictions — allowing auditors to trace which features are acting as proxies for protected attributes.

4. NYC Local Law 144 requires employers using automated hiring tools to do which of the following?

Correct. Local Law 144 mandates annual independent bias audits and public disclosure of results. It took effect July 2023 and is the first US municipal law requiring both the audit and its publication for employment algorithms.

Incorrect. Local Law 144 requires annual independent bias audits with public disclosure — it does not mandate feature removal, code access, or EEOC pre-approval.

Lab 2 — Identifying Proxy Variables

Analyse a feature list for protected-attribute correlations

Scenario

A consumer lending fintech is building a credit-scoring model. Their data science team has proposed the following feature set. Your job is to identify which features may function as proxy variables, explain the mechanism, and recommend whether each should be dropped, retained with monitoring, or retained as-is.

Feature	Type	Stated Rationale
FICO credit score	Numeric	Direct creditworthiness signal
Zip code of residence	Categorical	Local economic conditions
Employment industry code	Categorical	Job stability predictor
Years at current employer	Numeric	Income stability
Number of dependents	Numeric	Financial obligations
Bank account type (joint/individual)	Binary	Financial behaviour
Primary language of application	Categorical	Operational processing
Highest education level	Ordinal	Earnings potential

Work through each feature with your AI assistant. For each one: name the protected attribute it may proxy, explain the mechanism, and suggest a disposition (drop / monitor / retain). Then discuss what the fintech should do if removing high-risk proxies degrades model accuracy significantly.

Proxy Variable Lab

Feature Auditing

Let's audit this feature set systematically. Start with the features most likely to carry protected-attribute correlations. I'll give you a hint: zip code, employment industry, and primary language are the three I'd examine first. Pick one and walk me through your analysis — what protected attribute could it proxy, and why?

Module 3 · Lesson 3

Auditing NLP and Generative AI

Language models encode the biases of the text they were trained on — detecting those biases requires different tools than tabular audits.

How do you measure bias in a system whose outputs are sentences, images, or probabilities rather than binary decisions?

In the 2020 paper that introduced GPT-3, OpenAI's researchers examined the model's occupational associations. Given prompts of the form "The {occupation} was a ___," the model was more likely to continue with a male identifier than a female one for 83% of the 388 occupations tested, while occupations such as nurse and receptionist leaned female. A 2021 study by Abid, Farooqi, and Zou found that GPT-3 completions of prompts mentioning Muslims contained violence-related language far more often than completions for other religious groups.

These were not edge cases or adversarial prompts. They were standard continuations reflecting training data distributions: the internet's text, which reflects decades of human occupational segregation and social stereotyping, faithfully reproduced in a 175-billion-parameter model.

Why NLP Bias Auditing Differs

Tabular model audits compare acceptance rates across groups. NLP model audits must grapple with outputs that are probabilistic, context-dependent, and often evaluated by human raters whose own biases affect the assessment. The field uses several purpose-built methodologies.

Word Embedding Association Test (WEAT): Developed by Caliskan et al. (Science, 2017), WEAT measures the relative association of target words (e.g., "nurse," "engineer") with attribute words (e.g., male/female names) in embedding space. The effect size is a standardised difference in mean cosine similarity, and statistical significance is assessed with a permutation test. Published results showed that GloVe and word2vec embeddings reproduce human implicit bias test results with high fidelity.

Counterfactual Data Augmentation: Replace a demographic marker in an input and measure the change in output. "The Black man walked into the store" vs. "The white man walked into the store": if sentiment scores differ systematically across many such pairs, that is evidence of bias. Used extensively to audit sentiment analysis and toxicity classifiers.

Stereotype Benchmark (StereoSet): A dataset of roughly 17,000 sentences designed to measure stereotypical associations in language models across gender, profession, race, and religion. Models are tested on whether they prefer stereotypical completions over anti-stereotypical ones. Published results for BERT, GPT-2, RoBERTa, and others show measurable stereotype scores on all tested models.

WinoBias / Winogender: Coreference resolution benchmarks that test whether models resolve pronouns correctly when the referent's occupation is gender-stereotyped. "The doctor asked the nurse to help her": does the model resolve "her" correctly? The 2018 studies found that coreference systems were markedly more accurate on pro-stereotypical sentences than on anti-stereotypical ones.
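To make the first of these methods concrete, here is a sketch of a WEAT-style effect size. The vectors are hand-made 2-D toys, not real embeddings, and the sample standard deviation is used for the denominator, which is an assumption (implementations differ on this detail). The permutation test for significance is omitted.

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def association(w, attrs_a, attrs_b):
    """How much closer word w sits to attribute set A than to attribute set B."""
    return mean(cosine(w, a) for a in attrs_a) - mean(cosine(w, b) for b in attrs_b)

def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Standardised difference in association between two target sets."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)

# Toy vectors standing in for embeddings (hypothetical)
career_words = [(1.0, 0.1), (0.9, 0.2)]
family_words = [(0.1, 1.0), (0.2, 0.9)]
male_terms = [(1.0, 0.0), (0.9, 0.1)]
female_terms = [(0.0, 1.0), (0.1, 0.9)]

effect = weat_effect_size(career_words, family_words, male_terms, female_terms)
reversed_effect = weat_effect_size(family_words, career_words, male_terms, female_terms)
print(f"effect size: {effect:.2f}")
```

A positive value means the first target set sits closer to the first attribute set. Swapping the two target sets flips the sign, which is a useful sanity check on any implementation.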

Toxicity and Content Moderation Bias

In 2019 researchers at Google and University of Washington published a study on the Perspective API toxicity classifier, which is used by major news organisations to moderate comments. They found that texts mentioning identity terms — "I am gay," "I am Black," "I am a woman" — were rated as more toxic than equivalent sentences without those terms, at statistically significant rates.

This type of bias — identity-based toxicity inflation — can cause content moderation systems to disproportionately silence speech by and about marginalised groups. The study prompted Google to update its Perspective API training methodology and publish ongoing model cards tracking performance by identity subgroup.

The Sentiment Analysis Problem

A 2018 study by Kiritchenko and Mohammad tested 219 sentiment analysis systems submitted to a SemEval shared task, using sentence templates that differed only in a name or gendered noun phrase. A majority of the systems gave consistently different scores depending on the race or gender signalled by the name. Because the pattern appeared across systems with very different designs, the bias most plausibly originates in training data rather than model architecture.

Auditing Generative Image Models

In 2023 Bloomberg published an investigation of image generation models including Stable Diffusion and DALL-E. When prompted to generate "a photo of a CEO," the models produced images that were overwhelmingly male and white. "A photo of a janitor" or "a photo of a fast food worker" produced images that were predominantly non-white. The study systematically varied occupational prompts and documented the demographic distribution of outputs — applying the same disparity framework used in tabular audits to image outputs.

The audit methodology: generate N images per prompt, use a secondary classifier to assess apparent demographic characteristics, and compute the demographic parity difference across occupation categories. The same four-fifths logic applies: if a demographic group appears at less than 80% of the frequency expected under a stated reference distribution (for example, equal representation or real-world occupational statistics), the model has a measurable disparity.
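The final step of that methodology can be sketched as a comparison of observed shares against a reference distribution. The counts, the category labels, and the 50/50 reference are all assumptions for illustration. In a real audit the counts would come from the secondary classifier, whose own error rates also need reporting.

```python
def representation_ratios(observed_counts, reference_shares):
    """Observed share of each group divided by its expected share."""
    total = sum(observed_counts.values())
    ratios = {}
    for group, expected_share in reference_shares.items():
        observed_share = observed_counts.get(group, 0) / total
        ratios[group] = observed_share / expected_share
    return ratios

# Hypothetical classifier output for 300 images generated from one occupational prompt
observed = {"perceived_men": 255, "perceived_women": 45}
# Reference distribution chosen by the auditor (here: equal representation)
reference = {"perceived_men": 0.5, "perceived_women": 0.5}

ratios = representation_ratios(observed, reference)
underrepresented = [group for group, ratio in ratios.items() if ratio < 0.8]
print(ratios)
print("below 80% of expected frequency:", underrepresented)
```

The choice of reference distribution drives the result, so it belongs in the audit report alongside the ratios.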

Model Cards and Bias Documentation

Google's 2019 paper "Model Cards for Model Reporting" (Mitchell et al.) proposed standardised documentation for ML models that includes disaggregated performance metrics across demographic subgroups. Model cards are now required for models submitted to several major benchmarks, and the EU AI Act requires comparable technical documentation for high-risk AI systems. A thorough model card typically includes:

Performance metrics by subgroup: accuracy, precision, recall, and F1 stratified by at minimum gender and race.

Known bias limitations: explicit statement of cases where the model is known to underperform.

Training data demographics: where known, the demographic composition of training data.

Intended and prohibited use cases.

EU AI Act — High-Risk System Requirements

The EU AI Act (adopted 2024) classifies AI systems, including NLP systems, used in employment, credit, education, and law enforcement as high-risk. Article 9 requires ongoing risk management including bias monitoring. Article 10 requires training data governance to address underrepresentation and bias. Article 13 requires transparency documentation comparable to model cards. The Act entered into force in August 2024, with the obligations for high-risk systems phasing in over the following years.

Lesson 3 Quiz

Auditing NLP and Generative AI — four questions

1. The WEAT (Word Embedding Association Test) demonstrated that GloVe and word2vec embeddings reproduce human implicit bias results. What does this imply for downstream NLP applications?

Correct. Because downstream models initialise with or are trained on biased embeddings, they inherit those associations. The Caliskan et al. paper showed that classic IAT (Implicit Association Test) results (e.g., flowers/pleasant, insects/unpleasant) were reproduced in embedding space, as were gender-occupation associations.

Incorrect. WEAT showed that bias in training corpora propagates into embedding space and therefore into any model built on those embeddings — making monitoring or explicit debiasing necessary.

2. The Perspective API toxicity study found that sentences mentioning "I am gay" or "I am Black" were scored as more toxic than equivalent sentences without identity terms. What type of audit methodology would directly detect this?

Correct. Counterfactual augmentation creates paired sentences that differ only in identity markers, then measures the score change. If "I am a person" scores lower in toxicity than "I am a gay person" with identical surrounding context, identity-based inflation is confirmed.

Incorrect. The most direct method is counterfactual data augmentation — constructing sentence pairs that are identical except for the identity term, then comparing outputs. This is the methodology used in the Perspective API study.

3. The Bloomberg investigation of generative image models found they produced predominantly white, male images for "CEO" prompts. Which of the following would constitute a rigorous bias audit of this finding?

Correct. This is the methodology Bloomberg's study used: systematic prompt sampling, secondary demographic classification, and statistical comparison of output distributions. The four-fifths rule logic applies: if women appear in fewer than 80% of the frequency expected from parity, a measurable disparity exists.

Incorrect. A rigorous audit requires systematic sampling, consistent classification methodology, and statistical analysis — not subjective human review or comparison to stated commitments.

4. Under the EU AI Act, what is required of high-risk NLP systems regarding bias documentation?

Correct. The EU AI Act's relevant articles create substantive obligations: Article 9 (risk management system including bias monitoring), Article 10 (training data governance), and Article 13 (transparency requirements equivalent to model cards with disaggregated metrics). These apply to NLP systems used in employment, credit, education, and law enforcement.

Incorrect. The EU AI Act imposes specific mandatory requirements across Articles 9, 10, and 13 for high-risk systems — documentation is required, not merely encouraged, and applies to multiple domains beyond law enforcement.

Lab 3 — NLP Bias Audit Design

Design a counterfactual audit for a deployed sentiment analysis system

Scenario

A major e-commerce platform uses a sentiment analysis API to automatically flag negative product reviews for human review. An internal report has suggested the system may be flagging reviews written in African American Vernacular English (AAVE) at higher rates than standard American English reviews of equivalent sentiment. You have been asked to design a formal audit.

Design a counterfactual bias audit for this system. Your design should specify: (1) how you would construct your test sentence pairs, (2) what metric you would use to quantify the disparity, (3) what sample size you need for statistical significance, and (4) what threshold would constitute an actionable finding. Discuss with your AI assistant — at least 3 exchanges.
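For parts (2) and (4) of the design, one possible analysis is sketched below. It assumes you have already scored each sentence pair with the sentiment API and hold the results as (standard-English score, dialect score) pairs, where a higher score means more negative. The scores shown are invented, the 0.5 flag threshold is an assumption, and the exact sign test is only one of several reasonable paired tests.

```python
import math

def paired_audit(score_pairs, flag_threshold=0.5):
    """Summarise paired negativity scores: (score_standard, score_dialect) per sentence pair."""
    diffs = [dialect - standard for standard, dialect in score_pairs]
    n = len(diffs)
    flag_rate_standard = sum(s >= flag_threshold for s, _ in score_pairs) / n
    flag_rate_dialect = sum(d >= flag_threshold for _, d in score_pairs) / n

    # Two-sided exact sign test: with no bias, a pair is equally likely to shift either way.
    higher = sum(d > 0 for d in diffs)
    lower = sum(d < 0 for d in diffs)
    m = higher + lower
    if m == 0:
        p_value = 1.0
    else:
        k = max(higher, lower)
        tail = sum(math.comb(m, i) for i in range(k, m + 1)) / 2 ** m
        p_value = min(1.0, 2 * tail)

    return {
        "mean_score_shift": sum(diffs) / n,
        "flag_rate_standard": flag_rate_standard,
        "flag_rate_dialect": flag_rate_dialect,
        "sign_test_p": p_value,
    }

# Hypothetical scores for ten semantically matched review pairs
pairs = [(0.30, 0.45), (0.55, 0.70), (0.20, 0.35), (0.62, 0.60), (0.40, 0.58),
         (0.48, 0.66), (0.10, 0.22), (0.71, 0.83), (0.35, 0.52), (0.44, 0.61)]

summary = paired_audit(pairs)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

Ten pairs is far too few for a real audit and is used here only to keep the example readable. Your answer to part (3) should justify a sample size from the smallest shift you would consider actionable.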

NLP Bias Audit Lab

Counterfactual Testing

Let's design your counterfactual audit systematically. The core principle is that paired sentences should be semantically equivalent — same sentiment, same content — but differ in dialect markers. Start with Step 1: how would you construct sentence pairs that test for AAVE versus standard American English differential treatment, while holding semantic content constant? What linguistic features would you vary?

Module 3 · Lesson 4

Audit Frameworks, Red-Teaming, and Continuous Monitoring

One-time audits at deployment are insufficient — bias can emerge from data drift, user behaviour, and model updates.

If a model passes its pre-deployment bias audit, what happens when the world it was trained on changes?

In 2020, Twitter users began reporting that the platform's automatic image cropping feature — which selected a preview crop when images were too tall — appeared to favour lighter-skinned faces. Researchers tested this systematically by creating images with a lighter-skinned and a darker-skinned face at different positions. The algorithm consistently cropped to show the lighter-skinned face.

Twitter's engineers ran their own investigation and in 2021 confirmed the bias, finding the cropping model had learned to associate facial saliency with features more common in lighter-skinned individuals. The model had passed internal testing before deployment. The bias was discovered not by the company's audit process but by external researchers two years after launch. Twitter subsequently disabled the algorithmic cropping feature entirely.

The episode illustrated a fundamental gap: pre-deployment audits capture the model's behaviour at a moment in time. They cannot anticipate how the model will interact with real-world content distributions that evolve over time.

Structured Audit Frameworks

Three frameworks have achieved significant industry adoption:

NIST AI RMF (2023) The US National Institute of Standards and Technology's AI Risk Management Framework organises bias and fairness risk across four functions: Govern, Map, Measure, and Manage. The Measure function explicitly requires documentation of bias metrics stratified by demographic group. Companies including Microsoft, Google, and IBM have publicly committed to NIST AI RMF alignment.

IEEE 7003-2023 The IEEE Standard for Algorithmic Bias Considerations provides specific technical requirements for bias testing, including intersectional testing (multiple attributes simultaneously), documentation of the rationale for selecting fairness criteria, and post-deployment monitoring intervals.

ISO/IEC 42001 (2023) The AI Management System standard from ISO/IEC includes bias and fairness as components of risk management. Adoption is voluntary, as with the NIST AI RMF, but unlike that framework and most IEEE standards, ISO/IEC 42001 is designed for third-party certification, meaning organisations can receive formal certification of their AI governance processes, including bias auditing procedures.

Red-Teaming for Bias

Red-teaming, borrowed from cybersecurity, involves structured adversarial testing designed to find failure modes before they occur in production. Applied to bias detection, red-teaming involves:

Demographic edge cases: Systematically test underrepresented intersectional groups — elderly Black women, young disabled Latino men — that aggregate statistics may obscure. The bias is often worst at intersections.

Historical stress tests: Test the model's behaviour on inputs that reflect historical events or cultural contexts where bias is known to be concentrated — names, dialects, religious references.

Adversarial demographic injection: Subtly alter inputs to include demographic signals and measure output changes. Used by Meta's Responsible AI team to test their content ranking algorithms.
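The first item above, intersectional edge-case testing, can be sketched as follows. The example computes positive-outcome rates per single attribute and per attribute combination, then compares the lowest rate with the highest. The record layout, the helper names, and the counts are hypothetical; they are constructed so that each attribute looks balanced on its own while the intersections do not.

```python
from collections import defaultdict


def selection_rates(records, attrs):
    """Positive-outcome rate for each combination of the given attributes.

    records: iterable of dicts holding attribute values and a boolean 'selected'.
    Returns {group_tuple: (rate, group_size)}.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for r in records:
        key = tuple(r[a] for a in attrs)
        counts[key][0] += int(bool(r["selected"]))
        counts[key][1] += 1
    return {k: (pos / n, n) for k, (pos, n) in counts.items()}


def worst_ratio(rates, min_n=1):
    """Lowest group rate divided by the highest, skipping groups smaller than min_n."""
    usable = [rate for rate, n in rates.values() if n >= min_n]
    if not usable or max(usable) == 0:
        return float("nan")
    return min(usable) / max(usable)


def make(race, gender, selected, total):
    """Build `total` hypothetical records, the first `selected` of them positive."""
    return [
        {"race": race, "gender": gender, "selected": i < selected}
        for i in range(total)
    ]


# Hypothetical data: 10 applicants in each race x gender cell.
records = (
    make("A", "F", 8, 10) + make("A", "M", 4, 10)
    + make("B", "F", 4, 10) + make("B", "M", 8, 10)
)

by_race = selection_rates(records, ["race"])
by_gender = selection_rates(records, ["gender"])
by_cell = selection_rates(records, ["race", "gender"])

print("race only:", worst_ratio(by_race))
print("gender only:", worst_ratio(by_gender))
print("race x gender:", worst_ratio(by_cell))
```

In this constructed data, both single-attribute views show identical rates, and only the combined view exposes the gap. The `min_n` parameter is there because small intersectional cells give noisy rates; the cut-off to use is a judgement call for the audit team.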

Case — Meta's Algorithmic Amplification Audit (2021)

In October 2021, Meta published an internal study finding that its Facebook news feed algorithm amplified right-wing content more than left-wing content with similar engagement metrics. The study was conducted by Meta's own Responsible AI team as part of a red-team exercise for potential political bias. The team used a randomised controlled experiment, showing some users a chronological feed and others the algorithmic feed, and measured the resulting differences in exposure to political content. This represented one of the first public disclosures of a major platform's algorithmic amplification audit methodology.

Continuous Monitoring Architecture

A continuous monitoring programme requires infrastructure that most organisations have not yet built. The key components are:

Demographic logging: Capturing (with consent and in compliance with data protection law) the demographic composition of users receiving different model outputs. Without this data, post-deployment changes in group-level disparity cannot be measured.

Disparity dashboards: Real-time or near-real-time tracking of key fairness metrics. Amazon Web Services, Microsoft Azure, and Google Cloud all offer bias monitoring tools as part of their MLOps platforms. These tools compute demographic parity and equalized odds metrics on rolling windows of production data.

Drift detection: Statistical tests (e.g., Population Stability Index, Kolmogorov-Smirnov test) that alert when the input distribution shifts significantly from the training distribution. Bias can emerge or worsen when the user population changes — a model trained on one population may perform differently on a shifted one.

Scheduled re-audits: IEEE 7003 recommends audit intervals of no longer than 12 months for high-risk systems, or immediately following any material change to the model, training data, or deployment context.
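Two of the components above, the disparity dashboard and drift detection, reduce to small calculations that can be sketched in plain Python. The example below computes a demographic parity gap over one window of production decisions and a Population Stability Index (PSI) over fixed bins. The function names, bin edges, and data are hypothetical, and the smoothing constant `eps` is an assumption added to avoid taking the logarithm of zero for empty bins.

```python
import math
from bisect import bisect_right


def parity_gap(window):
    """window: list of (group, approved) tuples from one monitoring window.
    Returns the highest group approval rate minus the lowest."""
    totals, positives = {}, {}
    for group, approved in window:
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + int(bool(approved))
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index of `actual` against `expected`.

    edges: sorted bin boundaries; values fall into len(edges) + 1 bins.
    eps: floor applied to empty bins so the logarithm stays defined.
    """
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical window of production decisions for two groups.
window = (
    [("a", True)] * 6 + [("a", False)] * 4
    + [("b", True)] * 3 + [("b", False)] * 7
)
print("parity gap:", parity_gap(window))

# Hypothetical feature values: training sample vs. a shifted production sample.
training = [i / 100 for i in range(100)]
production = [0.5 + i / 200 for i in range(100)]
edges = [0.25, 0.5, 0.75]
print("PSI, same data:", psi(training, training, edges))
print("PSI, shifted data:", psi(training, production, edges))
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a substantial shift, but the threshold that triggers a re-audit is a policy decision for the organisation, not something the statistic itself supplies.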

Third-Party Auditing

Internal audits face a structural conflict of interest — the team auditing a system often built or deployed it. Third-party auditing addresses this but creates access challenges: auditors need sufficient access to the model and data to conduct meaningful tests, which organisations may resist for IP or security reasons.

A widely cited third-party audit of a major AI system is the 2021 audit of Twitter's image cropping algorithm by an external team, presented at the FAccT conference, which examined the race bias described at the start of this lesson. The external auditors had access to the deployed model via API, not the model weights. This demonstrates that meaningful audits are possible with API-level access, though weight-level access enables more thorough testing.

The NIST AI RMF Bias Lifecycle

NIST's AI RMF frames bias not as a property to be eliminated at training time but as a risk to be managed across the full system lifecycle: design (what problem are we solving, for whom?), data (how representative is the training corpus?), development (which fairness criteria guide training?), deployment (who uses it, in what context?), and post-deployment (how does behaviour change over time?). This lifecycle framing is now widely used as a conceptual model in corporate AI governance programmes.

Lesson 4 Quiz

Audit Frameworks, Red-Teaming, and Continuous Monitoring — four questions

1. Twitter's image cropping bias was discovered by external researchers two years after deployment, despite the model passing internal testing. What does this reveal about pre-deployment audits?

✓ Correct. The Twitter case is a canonical example of why continuous monitoring matters. The model's bias was not detectable from the stated test conditions at launch; it became apparent only when researchers systematically tested it against varied real-world image content two years later.

Incorrect. The lesson is structural, not personal — pre-deployment tests are snapshots. Real-world content distributions shift, user behaviour evolves, and biases that were statistically insignificant at launch can become significant at scale.

2. What distinguishes ISO/IEC 42001 from NIST AI RMF in the context of bias auditing?

✓ Correct. The key distinguishing feature is certifiability. ISO/IEC 42001 is an auditable management system standard, so organisations can receive formal third-party certification. NIST AI RMF is voluntary guidance without certification infrastructure.

Incorrect. NIST AI RMF is also voluntary. The key distinction is that ISO/IEC 42001 is designed for third-party certification — meaning independent auditors can formally assess and certify organisational compliance.

3. Meta's 2021 algorithmic amplification audit used a randomised controlled experiment comparing chronological and algorithmic feeds. This is an example of what type of bias testing?

✓ Correct. The Meta study used a red-team experimental design, deliberately contrasting two conditions (algorithmic vs. chronological) and measuring differential outcomes. This is the experimental approach to bias red-teaming: create a controlled comparison to isolate the algorithm's effect.

Incorrect. The Meta study was a red-team exercise using a randomised controlled experiment — it compared two feed conditions to measure the algorithm's causal effect on content exposure, which is distinct from WEAT, counterfactual augmentation, or calibration testing.

4. IEEE 7003-2023 recommends that high-risk AI systems be re-audited at maximum intervals of how long?

✓ Correct. IEEE 7003 specifies annual re-audits as the maximum interval for high-risk systems, with immediate re-audit triggered by material changes. NYC Local Law 144 similarly requires annual audits for automated employment decision tools.

Incorrect. IEEE 7003-2023 recommends no more than 12 months between audits for high-risk systems, with earlier re-audits required after material changes to the model, training data, or deployment context.

Lab 4 — Continuous Monitoring Programme Design

Build a post-deployment bias monitoring plan for a real-world system

Scenario

A regional bank is deploying a loan approval ML model. It has passed a pre-deployment bias audit showing acceptable demographic parity across race and gender groups. Your task is to design the ongoing monitoring programme required under the state regulator's new AI fairness guidelines. The bank's model deployment team has budget for one full-time equivalent (FTE) dedicated to AI fairness monitoring.

Design a continuous monitoring programme. Address: (1) what demographic data you will collect and how, (2) which disparity metrics you will track and at what frequency, (3) what drift detection method you will use and what thresholds trigger re-audit, (4) how you will handle the conflict between monitoring effectiveness and customer data privacy, and (5) who conducts re-audits and how often. Discuss with your AI assistant — at least 3 exchanges.

Monitoring Design Lab

Continuous Auditing

Let's build your monitoring programme from the ground up. The first — and often most underestimated — challenge is demographic data collection. Banks in the US have HMDA (Home Mortgage Disclosure Act) reporting requirements that give them some demographic data, but this coverage is incomplete. Start here: what data do you have access to, what do you need to infer, and what are the risks of demographic inference methods like BISG (Bayesian Improved Surname Geocoding)?
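To make the inference risk concrete, the sketch below shows the Bayes update at the core of BISG: the probability of a demographic group given a surname is multiplied by the probability of living in a given area given that group, then normalised. It assumes surname and geography are conditionally independent given group membership. The function name and every probability in the example are hypothetical placeholders, not real census figures.

```python
def bisg_posterior(p_group_given_surname, p_geo_given_group):
    """Combine a surname-based prior with a geography likelihood.

    p_group_given_surname: {group: P(group | surname)}
    p_geo_given_group:     {group: P(area | group)}
    Returns {group: P(group | surname, area)} under the assumption that
    surname and area are conditionally independent given group.
    """
    unnormalised = {
        g: p_group_given_surname[g] * p_geo_given_group[g]
        for g in p_group_given_surname
    }
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("no group has non-zero probability for this input")
    return {g: v / total for g, v in unnormalised.items()}


# Hypothetical inputs for one applicant.
surname_prior = {"group_a": 0.7, "group_b": 0.3}
area_likelihood = {"group_a": 0.001, "group_b": 0.004}

posterior = bisg_posterior(surname_prior, area_likelihood)
print(posterior)
```

The output is a probability for each group, not a label. Any disparity metric computed from these estimates inherits their error, and that error may itself differ by group, which is one of the risks the question above asks you to weigh.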

Module 3 — Test

Detecting Bias in Systems · 15 questions · Pass at 80%

1. Amazon's résumé screening tool penalised applicants from women's colleges primarily because:

✓ Correct. Historical bias in training data caused the model to replicate past discriminatory hiring decisions.

Incorrect. The bias was unintentional — the model learned from historically male-dominated hiring data.

2. Demographic parity requires that:

✓ Correct. Demographic parity holds when the rate of positive predictions is the same regardless of group membership.

Incorrect. Demographic parity is specifically the equal positive prediction rate across groups — not TPR equality (equalized odds), score calibration, or individual treatment.

3. The Chouldechova impossibility theorem states that when base rates differ between groups, which combination is impossible?

✓ Correct. With unequal base rates, these three criteria are mathematically incompatible: satisfying one requires violating at least one other.

Incorrect. The impossibility result applies specifically to the combination of demographic parity, equalized odds, and calibration when group base rates differ.

4. Under the US four-fifths rule, if the group with the highest selection rate is selected at 60%, disparate impact is presumed when a protected group's rate falls below:

✓ Correct. 60% × 0.80 = 48%. Below this rate, disparate impact is legally presumed under the 1978 Uniform Guidelines.

Incorrect. The four-fifths rule: 60% × 0.80 = 48%. Any group with a selection rate below 48% triggers presumed disparate impact.

5. The Gender Shades follow-up audit found the largest error rate disparity in Amazon Rekognition between which two groups?

✓ Correct. The gap of more than 31 percentage points between lighter-skinned males and darker-skinned females was the study's most striking finding, demonstrating intersectional bias invisible in non-stratified metrics.

Incorrect. The maximum disparity was between lighter-skinned males (0% error) and darker-skinned females (31.4% error), a gap of 31.4 percentage points.

6. Why is "fairness through unawareness" (removing protected attributes from a model) legally and technically insufficient?

✓ Correct. Proxy variables (features correlated with protected attributes through social and historical patterns) allow the model to learn protected-attribute effects indirectly. The Apple Card and Amazon cases both demonstrate this.

Incorrect. The technical and legal problem is that correlated features (zip code, name, shopping history) can replicate the effect of a protected attribute even after it is removed from the input set.

7. Historical FHA redlining (1934–1968) is relevant to modern ML models primarily because:

✓ Correct. The residential segregation created by redlining is measurable today in zip code demographics, property values, and economic indicators, all common ML features that inherit this historical discrimination.

Incorrect. Redlining was outlawed by the Fair Housing Act of 1968. Its relevance today is that it created persistent residential segregation patterns that make geographic features like zip code act as racial proxies in current ML models.

8. SHAP values are used in bias auditing primarily to:

✓ Correct. SHAP values attribute model output contributions to individual features, allowing auditors to see whether features like zip code or name are contributing disproportionately to outcome differences across demographic groups.

Incorrect. SHAP values are an explainability tool that shows feature-level contributions to individual predictions — used to trace proxy-variable effects, not to measure calibration or compute significance tests.

9. The Word Embedding Association Test (WEAT) showed that GloVe and word2vec embeddings:

✓ Correct. Caliskan et al. (Science, 2017) showed that WEAT effect sizes closely matched those from human IAT studies, demonstrating that bias in training corpora is faithfully encoded in embedding geometry.

Incorrect. WEAT demonstrated that embeddings trained on large text corpora reproduce human implicit bias results — the statistical patterns of biased language use are encoded in embedding space.

10. NYC Local Law 144 (effective 2023) applies to:

✓ Correct. Local Law 144 specifically targets automated employment decision tools used in hiring and employment decisions affecting New York City candidates or employees, requiring annual bias audits and public disclosure.

Incorrect. Local Law 144 applies to employers using Automated Employment Decision Tools for candidates or employees in New York City — it is an employment law, not a general AI governance law.

11. Twitter's image cropping algorithm was found to preferentially crop to lighter-skinned faces. The bias was discovered by:

✓ Correct. The discovery came from outside the company: researchers published findings at FAccT 2021, prompting Twitter's own internal confirmation. This illustrates the limitation of relying solely on internal auditing.

Incorrect. The bias was discovered by external academic researchers, not Twitter's internal teams — a key reason post-deployment monitoring and third-party auditing are considered necessary complements to internal processes.

12. What distinguishes ISO/IEC 42001 from NIST AI RMF as an audit framework?

✓ Correct. ISO/IEC 42001 is a certifiable management system standard, so organisations can receive formal third-party certification. NIST AI RMF is voluntary guidance without certification infrastructure.

Incorrect. Both are voluntary (not legally mandatory), and ISO/IEC 42001 is international in scope. The key distinction is certifiability — ISO/IEC 42001 supports formal third-party certification.

13. Counterfactual data augmentation is most directly used to detect:

✓ Correct. Counterfactual augmentation creates matched input pairs that differ only in demographic markers. Output differences reveal direct sensitivity to those markers, as used in the Perspective API toxicity study.

Incorrect. Counterfactual augmentation tests whether altering a demographic marker (name, identity term, dialect feature) while holding semantic content constant changes the model's output.

14. The EU AI Act's Article 10 addresses bias by requiring:

✓ Correct. Article 10 requires that training, validation, and testing datasets for high-risk AI systems undergo data governance practices that account for possible biases and underrepresentation, addressing the root cause rather than only measuring downstream symptoms.

Incorrect. Article 10 is specifically about training data governance — requiring measures to address underrepresentation and bias in datasets. Human override requirements come from Article 14, and transparency documentation from Article 13.

15. IEEE 7003-2023 specifically requires bias audits to include which type of testing not always covered by single-attribute demographic parity analysis?

✓ Correct. IEEE 7003-2023 explicitly requires intersectional testing, examining bias at combinations of attributes (e.g., Black women, elderly Latino men), because aggregate single-attribute analysis can miss disparities that exist only at attribute intersections.

Incorrect. IEEE 7003's key addition over simpler frameworks is the explicit requirement for intersectional testing — bias analysis at the combination of multiple protected attributes simultaneously, not just each attribute independently.