On January 1, 2023, New York City Local Law 144 took effect (enforcement began on July 5, 2023), making it the first U.S. law to require independent bias audits of automated employment decision tools before they can be used on job candidates or employees. Employers and employment agencies that use such tools without a compliant audit face civil penalties of up to $1,500 per violation, with each day of non-compliant use counting as a separate violation. The law defines an "automated employment decision tool" as a computational process used to "substantially assist or replace discretionary decision making" in employment decisions. Within months, vendors scrambled to commission audits, and critics debated whether the audit methodology was rigorous enough to catch the harms it claimed to prevent.
A bias audit is a systematic, evidence-based examination of an algorithmic system to determine whether it produces disparate outcomes across demographic groups, whether those disparities are legally or ethically significant, and whether the system's design choices contributed to them. The term borrows from financial auditing: an independent examiner reviews records against a standard and issues a finding.
The parallel is instructive but imperfect. Financial audits rest on long-established standards: accounting frameworks such as GAAP and IFRS, and the auditing standards built around them. AI bias audits have no equivalent yet. The field is assembling its vocabulary and methodology in real time, which is exactly why learning to conduct one now is a competitive skill.
Several high-profile failures transformed bias auditing from academic curiosity to legal necessity. In 2018, Joy Buolamwini of the MIT Media Lab and Timnit Gebru published "Gender Shades," demonstrating that three leading commercial gender classification systems misclassified darker-skinned women at error rates of up to 34.7%, compared with at most 0.8% for lighter-skinned men. The study used no proprietary data, only a publicly constructed benchmark, and it helped prompt congressional hearings, vendor responses, and, in 2020, IBM's withdrawal from general-purpose face recognition.
In 2016, ProPublica's analysis of the COMPAS recidivism tool used in Broward County, Florida, found Black defendants were nearly twice as likely as white defendants to be falsely flagged as higher risk. The vendor, Northpointe, disputed the methodology — illustrating that audit disputes are as much about choosing metrics as computing them.
Amazon's internal résumé-screening tool, developed between 2014 and 2017 and quietly shut down after engineers discovered it penalized résumés containing the word "women's" (as in "women's chess club"), became a textbook case of training data encoding historical discrimination.
Each of these cases was brought to public attention not by the organizations deploying the systems, but by independent researchers or journalists applying systematic audit logic. The gap between internal assurance and external accountability is precisely the space that formal bias audits are designed to fill.
By the end of this module you will be able to plan, structure, and present a complete bias audit — selecting a real system, gathering evidence, applying at least two fairness metrics, and communicating findings to a specific audience with specific recommendations.
You are preparing a bias audit proposal. Work with the AI coach to define your audit scope: choose a real AI system (hiring, lending, healthcare, criminal justice, or content moderation), identify the affected groups, select the most appropriate audit type, and explain what evidence you would need.
Complete at least 3 exchanges to finish this lab.
When ProPublica published its COMPAS analysis in May 2016, Northpointe immediately responded that ProPublica had used the wrong fairness metric. ProPublica showed that Black defendants had a higher false positive rate: they were more often labeled high-risk but did not reoffend. Northpointe countered that among defendants who were labeled high-risk, the proportion who did reoffend was equal across races, a property called calibration (more precisely, predictive parity). Both facts were mathematically true. Shortly afterward, Chouldechova and, independently, Kleinberg et al. proved formally that if base rates differ across groups, an imperfect classifier cannot simultaneously achieve equal false positive rates, equal false negative rates, and calibration. The COMPAS dispute was not a matter of one side being wrong; it was a collision of incompatible mathematical definitions of fairness.
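To make the two competing claims concrete, the sketch below computes both sides' metrics from confusion-matrix counts. The counts are invented for illustration (they are not the COMPAS data), and the names `Counts` and `group_metrics` are hypothetical. Here `ppv` is the share of high-risk labels that were followed by reoffense, which is the sense of "calibration" used in the dispute.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion-matrix counts for one group (illustrative numbers only)."""
    tp: int  # labeled high-risk, did reoffend
    fp: int  # labeled high-risk, did not reoffend
    fn: int  # labeled low-risk, did reoffend
    tn: int  # labeled low-risk, did not reoffend


def group_metrics(c: Counts) -> dict:
    """Return the base rate plus the metrics each side emphasized."""
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "base_rate": (c.tp + c.fn) / total,
        "false_positive_rate": c.fp / (c.fp + c.tn),
        "false_negative_rate": c.fn / (c.fn + c.tp),
        "ppv": c.tp / (c.tp + c.fp),
    }


# Two hypothetical groups of 1,000 people with different base rates.
metrics_a = group_metrics(Counts(tp=300, fp=200, fn=200, tn=300))
metrics_b = group_metrics(Counts(tp=120, fp=80, fn=180, tn=620))

for name, m in (("A", metrics_a), ("B", metrics_b)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

In this toy data both groups have the same `ppv`, yet group A's false positive rate is several times group B's. An auditor looking only at `ppv` and one looking only at false positive rates would reach opposite conclusions from the same table.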
Before applying any metric, auditors must gather data. Evidence falls into three categories:
Outcome data: Records of actual decisions — loans approved/denied, candidates advanced/rejected, bail set/denied — broken down by demographic group. This is the primary material for disparate impact analysis. Auditors typically request at least 12 months of data to control for seasonal variation.
Input/feature data: The variables the model uses. Auditors check for proxy variables, assess whether protected attributes appear directly, and measure feature correlations. Under the Home Mortgage Disclosure Act (HMDA), implemented through the CFPB's Regulation C, covered lenders must collect and report demographic and decision data on mortgage applications, making mortgage lending one of the most auditable domains.
Documentation: Model cards, datasheets for datasets, training logs, labeling instructions, and deployment policies. IBM's 2019 open-source FactSheets project and the Partnership on AI's work on dataset documentation provide templates. The absence of documentation is itself an audit finding.
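A first-pass proxy check on input data can be sketched as a correlation screen between each feature and a protected attribute. Everything here is an assumption for illustration: the feature names, the toy values, the helper names `pearson` and `flag_proxies`, and the 0.5 cutoff, which is arbitrary rather than a legal or industry standard.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise this divides by zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def flag_proxies(features, protected, threshold=0.5):
    """Return features whose |correlation| with the protected attribute
    meets the threshold (illustrative cutoff, not a standard)."""
    flagged = {}
    for name, values in features.items():
        r = pearson(values, protected)
        if abs(r) >= threshold:
            flagged[name] = round(r, 2)
    return flagged


# Toy data: 1 = member of the protected group, 0 = not.
protected = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "zip_median_income": [31, 35, 33, 52, 58, 61, 40, 66],  # in $1,000s
    "years_experience": [2, 7, 4, 9, 3, 8, 5, 6],
}
flagged = flag_proxies(features, protected)
print(flagged)
```

A linear correlation screen like this only catches simple proxies. Combinations of individually weak features can still reconstruct a protected attribute, so a clean result here is a starting point, not a clearance.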
Chouldechova (2017) and Kleinberg et al. (2016) independently proved that when base rates differ between groups (when one group actually commits crimes, defaults on loans, or gets sick at different rates), a classifier that is not perfectly accurate cannot simultaneously satisfy calibration, equal false positive rates, and equal false negative rates. Every audit must declare which metric it prioritizes and explain the ethical reasoning behind that choice.
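The constraint can be seen in a few lines of arithmetic. The sketch below uses the confusion-matrix identity FPR = p/(1-p) × (1-PPV)/PPV × (1-FNR), where p is the base rate and PPV (positive predictive value) stands in for the calibration-style criterion. The function name `implied_fpr` and the input numbers are hypothetical.

```python
def implied_fpr(base_rate: float, ppv: float, fnr: float) -> float:
    """False positive rate forced by the other three quantities.

    Follows from the confusion-matrix definitions:
        FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR)
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Hold PPV and FNR equal across groups; let only the base rate differ.
fpr_a = implied_fpr(base_rate=0.5, ppv=0.6, fnr=0.4)
fpr_b = implied_fpr(base_rate=0.3, ppv=0.6, fnr=0.4)

print(f"group A FPR: {fpr_a:.3f}")
print(f"group B FPR: {fpr_b:.3f}")
```

Once PPV and the false negative rate are pinned to the same values in both groups, the false positive rate is no longer free: it is determined by the base rate, so unequal base rates force unequal false positive rates.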
The EEOC's 1978 Uniform Guidelines on Employee Selection Procedures established the four-fifths (80%) rule: if the selection rate for any group is less than 80% of the selection rate for the group with the highest rate, that is generally regarded as evidence of adverse impact requiring justification. Under NYC LL 144, auditors must compute an analogous impact ratio for each sex, race/ethnicity, and intersectional category present in sufficient numbers, although the law itself does not set a pass/fail threshold.
Example: If 60% of white applicants pass a screening tool but only 40% of Black applicants do, the ratio is 40/60 ≈ 0.67, which is below 0.80 and flags potential adverse impact. Under Title VII, the employer must then show that the tool is job-related and consistent with business necessity, and it can still be liable if a less discriminatory alternative was available and not adopted.
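The same calculation can be sketched for any number of groups. The applicant counts below are hypothetical, chosen only to reproduce the 60% and 40% selection rates from the example, and `impact_ratios` is an illustrative helper name.

```python
def impact_ratios(selected: dict, applicants: dict) -> dict:
    """Selection rate of each group divided by the highest group's rate."""
    rates = {g: selected[g] / applicants[g] for g in applicants}
    top_rate = max(rates.values())
    return {g: rates[g] / top_rate for g in rates}


# Hypothetical counts: 300 of 500 (60%) and 80 of 200 (40%) selected.
applicants = {"white": 500, "black": 200}
selected = {"white": 300, "black": 80}

ratios = impact_ratios(selected, applicants)
below_four_fifths = [g for g, r in ratios.items() if r < 0.8]

for group, ratio in ratios.items():
    print(f"{group}: impact ratio {ratio:.2f}")
print("below four-fifths:", below_four_fifths)
```

Dividing by the highest group's rate, rather than by a fixed reference group, matches how the four-fifths rule is stated and avoids hard-coding which group is the comparator.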
The four-fifths rule is a practical threshold, not a mathematical law. Small samples can produce false positives; very large samples can produce statistically significant but legally immaterial gaps. Good auditors report both the ratio and the statistical confidence around it.
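One way to report the statistical confidence around the ratio is a log-transform interval for a ratio of two proportions (often called the Katz method). This is a large-sample approximation and only one of several reasonable choices; the function name `ratio_with_ci` and the counts are hypothetical.

```python
from math import exp, sqrt


def ratio_with_ci(sel_a, n_a, sel_b, n_b, z=1.96):
    """Impact ratio (group a over group b) with an approximate 95% CI.

    Uses the standard error of log(ratio) for two independent
    proportions: sqrt((1 - p_a) / sel_a + (1 - p_b) / sel_b).
    Requires at least one selection in each group.
    """
    p_a, p_b = sel_a / n_a, sel_b / n_b
    ratio = p_a / p_b
    se_log = sqrt((1 - p_a) / sel_a + (1 - p_b) / sel_b)
    return ratio, ratio * exp(-z * se_log), ratio * exp(z * se_log)


# Same 40% vs 60% selection rates at two very different sample sizes.
small = ratio_with_ci(sel_a=4, n_a=10, sel_b=6, n_b=10)
large = ratio_with_ci(sel_a=400, n_a=1000, sel_b=600, n_b=1000)

for label, (ratio, lo, hi) in (("n=10 per group", small),
                               ("n=1000 per group", large)):
    print(f"{label}: ratio {ratio:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With ten applicants per group the interval is wide enough to include both 0.8 and 1.0, so the same 0.67 point estimate is weak evidence. With a thousand per group the whole interval sits below 0.8.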
Never present a single fairness metric as "the" measure of bias. Report multiple metrics, explain their trade-offs, and be transparent about which stakeholder interests each metric prioritizes. Your audit's credibility depends on this transparency.
You've scoped your audit. Now you need to choose which fairness metrics you'll apply and why — and acknowledge the trade-offs you're accepting. Work with the coach to select at least two metrics, explain what evidence you'd need to compute them, and articulate which stakeholder interests each metric prioritizes.
Complete at least 3 exchanges to finish this lab.
In May 2023 the U.S. Equal Employment Opportunity Commission issued technical assistance on assessing adverse impact in software, algorithms, and AI used in employment selection under Title VII, stating explicitly that employers can remain liable for Title VII violations even when the discriminatory effect comes from a vendor's algorithm. The guidance encouraged employers to assess these tools for adverse impact on an ongoing basis. What it notably did not specify was the format of an audit report, leaving practitioners to develop de facto standards through practice.
Across the emerging audit ecosystem — including audits published under NYC LL 144, academic audits like Gender Shades, and nonprofit audits by the Algorithmic Justice League — a consensus structure has emerged. Your report should contain five sections:
Without severity ratings, all findings look equal — and organizations will address whichever is cheapest rather than most urgent. Adopt a consistent scale:
| Level | Definition | Example |
|---|---|---|
| Critical | Active legal liability; significant documented harm to a protected class; must be addressed immediately | Selection ratio for Black applicants is 0.62, below the four-fifths threshold, in a jurisdiction where LL 144 applies |
| High | Significant disparity without current legal action; predictive of harm at scale; address within 90 days | False positive rate for women applicants is 1.4× that of men; no statistically significant ground-truth difference exists |
| Medium | Disparity detectable but within legal thresholds; upstream risk factor; monitor quarterly | ZIP code feature correlates 0.71 with race/ethnicity; no adverse impact threshold crossed yet |
| Low | Documentation gap or process concern; no measured disparity; address in next development cycle | No model card exists; training data source is undocumented |
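To keep the scale consistent across a report, findings can be stored with an ordered severity value and sorted before presentation. This is a minimal sketch: the `Severity` and `Finding` names are hypothetical, and the entries simply restate the examples from the table.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Levels from the table; a higher value is more urgent."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class Finding:
    title: str
    severity: Severity
    evidence: str


findings = [
    Finding("No model card", Severity.LOW,
            "training data source is undocumented"),
    Finding("Impact ratio below four-fifths", Severity.CRITICAL,
            "selection ratio 0.62 for Black applicants"),
    Finding("ZIP code proxy", Severity.MEDIUM,
            "correlation 0.71 with race/ethnicity"),
    Finding("False positive rate disparity", Severity.HIGH,
            "1.4x for women relative to men"),
]

# Most urgent first, so the report leads with what must be fixed now.
ordered = sorted(findings, key=lambda f: f.severity, reverse=True)
for f in ordered:
    print(f"[{f.severity.name}] {f.title}: {f.evidence}")
```

Using an ordered enum rather than free-text labels means a typo such as "Hgih" fails loudly instead of silently sorting to the wrong place.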
Research on audit uptake (including a 2022 study by Metcalf, Moss, and boyd at Data & Society) found that bias audit recommendations were most likely to be implemented when they were: (1) specific and bounded — "retrain the model excluding ZIP code" rather than "reduce proxy variables"; (2) linked to a business risk the organization already recognizes — regulatory fine, reputational damage, or contract loss; and (3) assigned to a named individual with authority and accountability.
Vague recommendations like "consider fairness" produce no action. Recommendations with dollar estimates attached to non-compliance ("continued use without a compliant audit exposes us to LL 144 penalties of up to $1,500 per violation per day") move budgets.
Write your executive summary for the CFO. Write your methodology for the ML engineer who will implement the fix. Write your findings for the legal counsel who needs to assess liability. The same audit serves three audiences, and each section should speak to one of them explicitly.
The most instructive model is the HireVue algorithmic audit (2021, conducted by O'Neil Risk Consulting & Algorithmic Auditing). HireVue voluntarily discontinued its facial analysis feature before the audit, but the published report — available publicly — demonstrates the five-section structure, uses adverse impact ratios, and rates findings by severity. Its limitations section, which acknowledges that the audit could not test the model on real applicants, is a model of intellectual honesty that strengthens rather than undermines the report's credibility.
You've scoped your audit and chosen your metrics. Now draft the executive summary and one finding with a severity rating and a specific recommendation. The coach will give you feedback on clarity, specificity, audience calibration, and whether your recommendation would actually drive action.
Complete at least 3 exchanges to finish this lab.
In December 2020, Google AI ethics researcher Timnit Gebru was dismissed — or resigned under pressure, depending on the account — after a dispute over a paper co-authored with Margaret Mitchell and others that critiqued large language models for encoding social bias and environmental costs. The paper had not yet been published; Google leadership objected to its conclusions and asked that Gebru withdraw it or remove Google affiliates' names. The incident, which became widely known as the "Stochastic Parrots" controversy, illustrated in vivid terms that presenting bias findings to organizations with financial interests in a contrary conclusion is not merely a communication challenge — it is a political and professional risk. Gebru went on to found the Distributed AI Research Institute (DAIR), an independent research organization explicitly not dependent on tech industry funding, to ensure that bias findings could be published without organizational gatekeeping.
Different stakeholders need different framings of the same findings. Engineers respond to technical specificity — "the model's false positive rate for group X is 1.8 standard deviations above the mean" is actionable to them. Executives respond to liability and reputational framing — "this exposure puts us outside LL 144 compliance and creates a $1,500/day fine risk." Legal counsel wants findings mapped to specific statutes. Regulators want methodology documented to their standards.
Affected community members — often excluded from audit presentations — need to understand findings in terms of the actual harms, not statistical abstractions. The Algorithmic Justice League's public communications deliberately avoid metric-only framings in favor of concrete narratives: not "false positive rate disparity of 0.21" but "Black defendants marked high-risk who were not rearrested — labeled dangerous, released later, or held longer."
| Pushback | What It Usually Means | Effective Response |
|---|---|---|
| "Your metric choice is biased." | The organization prefers a metric where they perform better | Agree that metric choice involves trade-offs; present multiple metrics; ask the organization to specify which metric they believe should govern the decision and why |
| "The sample is too small." | May be legitimate or may be deflection | Report confidence intervals; if sample is genuinely small, flag it as a data collection recommendation; ask why demographic data wasn't retained |
| "The model is just reflecting real-world patterns." | Conflating prediction with prescription | Acknowledge base rate differences; explain that a system can be calibrated and still cause harm; distinguish between descriptive accuracy and normative acceptability |
| "We already knew about this." | Either true (and they didn't fix it) or false (face-saving) | If true: ask for the remediation timeline that was in place; if no timeline exists, the finding stands. If false: note that the audit has now documented it formally. |
The value of a bias audit is directly proportional to the auditor's independence from the audited organization. Timnit Gebru's founding of DAIR reflects a structural truth: audits conducted by internal teams, or by external teams financially dependent on continued contracts with the auditee, face structural pressures that compromise findings. When presenting your audit, be transparent about your relationship to the organization and any limitations that relationship creates.
A presentation without a follow-up written record is an audit finding that exists only in memory. After every stakeholder presentation, send a written summary of: decisions made, recommendations accepted or rejected (with stated reasons), owners assigned, and timelines committed. This creates an accountability trail and, if the organization later faces regulatory scrutiny, demonstrates either due diligence or deliberate non-remediation — a distinction that matters significantly in legal proceedings.
The FTC's 2022 enforcement action against the data broker Kochava referenced internal documentation showing that the company had acknowledged privacy risks yet continued operations unchanged. Documented awareness of a problem without remediation is often worse legally than undocumented ignorance. The same principle applies to bias audit follow-through.
You have now covered the complete bias audit workflow: defining scope and type (Lesson 1), gathering evidence and choosing metrics with justified trade-offs (Lesson 2), structuring a five-section report with severity ratings and actionable recommendations (Lesson 3), and presenting findings to diverse stakeholders while handling pushback and documenting follow-through (Lesson 4). Your module test will assess all four competencies.
You are presenting your bias audit findings to a skeptical executive audience. The coach will play the role of a senior stakeholder who pushes back on your findings using the four common arguments from Lesson 4. Practice responding clearly and professionally.
Complete at least 3 exchanges to finish this lab.