Module 6 · Lesson 1

What Is a Bias Audit?

Defining scope, purpose, and the real-world precedents that made auditing a formal discipline
What exactly are auditors measuring — and who decided that measurement matters?

On January 1, 2023, New York City Local Law 144 took effect, with enforcement beginning on July 5, 2023. It was the first U.S. law requiring independent bias audits of automated employment decision tools before they can be used on job candidates or employees. Employers and employment agencies that use such tools without a compliant audit face civil penalties of up to $1,500 per violation, and each day of non-compliant use counts as a separate violation. The law defines an "automated employment decision tool" as any computational process used to "substantially assist or replace discretionary decision making" in employment decisions. Within months, vendors scrambled to commission audits, and critics debated whether the audit methodology was rigorous enough to catch the harms it claimed to prevent.

Defining the Audit

A bias audit is a systematic, evidence-based examination of an algorithmic system to determine whether it produces disparate outcomes across demographic groups, whether those disparities are legally or ethically significant, and whether the system's design choices contributed to them. The term borrows from financial auditing: an independent examiner reviews records against a standard and issues a finding.

The parallel is instructive but imperfect. Financial audits rest on long-established standards: accounting frameworks such as GAAP and IFRS, and auditing standards such as GAAS. AI bias audits have no equivalent yet. The field is assembling its vocabulary and methodology in real time, which is exactly why learning to conduct one now is a competitive skill.

Three Audit Types

Type 01
Disparate Impact Audit
Measures whether selection rates, scores, or outcomes differ across demographic groups. Uses the 80% (four-fifths) rule from U.S. employment law or statistical significance tests.
Type 02
Algorithmic Fairness Audit
Tests mathematical fairness metrics — equalized odds, calibration, demographic parity — to identify which fairness definition the system satisfies and which it violates.
Type 03
Process / Documentation Audit
Reviews training data provenance, labeling decisions, model cards, and deployment policies to identify upstream choices that could introduce or amplify bias.

The Landmark Cases That Created Demand

Several high-profile failures transformed bias auditing from academic curiosity to legal necessity. In 2018, Joy Buolamwini of the MIT Media Lab and Timnit Gebru published "Gender Shades," demonstrating that three leading commercial gender classification systems misclassified darker-skinned women at error rates of up to 34.7%, compared with at most 0.8% for lighter-skinned men. The study used no proprietary data, only a benchmark the authors built from publicly available photos of parliamentarians. It was followed by congressional scrutiny of face recognition, vendor responses and model updates, and IBM's 2020 decision to stop offering general-purpose face recognition products.

In 2016, ProPublica's analysis of the COMPAS recidivism tool used in Broward County, Florida, found Black defendants were nearly twice as likely as white defendants to be falsely flagged as higher risk. The vendor, Northpointe, disputed the methodology — illustrating that audit disputes are as much about choosing metrics as computing them.

Amazon's internal résumé-screening tool, developed between 2014 and 2017 and quietly shut down after engineers discovered it penalized résumés containing the word "women's" (as in "women's chess club"), became a textbook case of training data encoding historical discrimination.

Why It Matters

Each of these cases was exposed not by the organizations deploying the systems, but by independent researchers or journalists applying systematic audit logic. The gap between internal assurance and external accountability is precisely the space that formal bias audits are designed to fill.

Key Terms

Disparate Impact: Significantly different outcomes across protected groups produced by a neutral-seeming policy or tool, regardless of intent.
Protected Attribute: A characteristic (race, sex, age, disability, national origin, etc.) that anti-discrimination law prohibits using as a basis for adverse decisions.
Proxy Variable: A seemingly neutral data feature (ZIP code, name, vocabulary) that correlates strongly enough with a protected attribute to reproduce its discriminatory effect.
Selection Rate: The proportion of applicants in a group who are selected, hired, or advanced. The ratio of selection rates across groups is the core metric in disparate impact analysis.
Module 6 Goal

By the end of this module you will be able to plan, structure, and present a complete bias audit — selecting a real system, gathering evidence, applying at least two fairness metrics, and communicating findings to a specific audience with specific recommendations.

Lesson 1 Quiz

What Is a Bias Audit? — 4 questions
New York City Local Law 144 (effective January 2023) requires what before an automated employment decision tool can be used on candidates?
Correct. LL 144 mandates an independent bias audit whose summary results must be publicly posted before the tool is deployed on candidates or employees.
Not quite. LL 144's core requirement is an independent bias audit, with results posted publicly, before the tool is used.
The "Gender Shades" study (Buolamwini & Gebru, 2018) found that commercial face analysis systems performed worst on which demographic group?
Correct. Darker-skinned women had the highest error rates in all three commercial systems tested, reaching 34.7% in the worst case, versus at most 0.8% for lighter-skinned men.
The study found darker-skinned women experienced the highest error rates: up to 34.7%, versus at most 0.8% for lighter-skinned men.
A "proxy variable" in bias auditing refers to:
Correct. Proxy variables like ZIP code or name appear neutral but can encode race, national origin, or other protected attributes, effectively reintroducing discrimination through the back door.
A proxy variable is a seemingly neutral feature — like ZIP code — that correlates strongly with a protected attribute, allowing a model to discriminate without explicitly using that attribute.
Amazon's internal résumé-screening tool was shut down primarily because it:
Correct. The tool was trained on a decade of Amazon's own hiring data — data that reflected a historically male-dominated tech workforce — and learned to downgrade résumés with female-coded language.
Amazon's tool learned from historical hiring data skewed toward male candidates and penalized résumés with the word "women's," demonstrating how training data encodes past discrimination.

Lab 1: Audit Framing Workshop

Practice scoping a bias audit — system selection, stakeholders, and audit type

Your Task

You are preparing a bias audit proposal. Work with the AI coach to define your audit scope: choose a real AI system (hiring, lending, healthcare, criminal justice, or content moderation), identify the affected groups, select the most appropriate audit type, and explain what evidence you would need.

Complete at least 3 exchanges to finish this lab.

Start by telling the coach which real-world AI system you want to audit and why you chose it. The coach will help you refine your framing.
Audit Framing Coach
Welcome to the Audit Framing Workshop. I'm here to help you scope a real bias audit proposal. Tell me: which AI system do you want to audit — something in hiring, lending, healthcare, criminal justice, or content moderation? And what initially drew your attention to it?
Module 6 · Lesson 2

Gathering Evidence & Choosing Metrics

What data to collect, which fairness metrics apply, and why every metric involves a trade-off
If there are dozens of fairness definitions and they mathematically conflict — which one do you use, and how do you justify that choice?

When ProPublica published its COMPAS analysis in May 2016, Northpointe immediately responded that ProPublica had used the wrong fairness metric. ProPublica showed that Black defendants had a higher false positive rate: among those who did not go on to reoffend, a larger share had been labeled high-risk. Northpointe countered that among defendants who were labeled high-risk, the proportion who did reoffend was equal across races, a property called calibration. Both facts were mathematically true. Later that year, Chouldechova and, independently, Kleinberg et al. proved formally that if base rates differ across groups and prediction is imperfect, you cannot simultaneously achieve equal false positive rates, equal false negative rates, and calibration. The COMPAS dispute was not a matter of one side being wrong. It was a collision of incompatible mathematical definitions of fairness.

Evidence Sources for a Bias Audit

Before applying any metric, auditors must gather data. Evidence falls into three categories:

Outcome data: Records of actual decisions — loans approved/denied, candidates advanced/rejected, bail set/denied — broken down by demographic group. This is the primary material for disparate impact analysis. Auditors typically request at least 12 months of data to control for seasonal variation.

Input/feature data: The variables the model uses. Auditors check for proxy variables, assess whether protected attributes appear directly, and measure feature correlations. Under the Home Mortgage Disclosure Act (HMDA), which the CFPB implements through Regulation C, covered lenders must collect and report applicant demographics alongside loan decisions, making mortgage lending one of the most auditable domains.

Documentation: Model cards, datasheets for datasets, training logs, labeling instructions, and deployment policies. IBM's 2019 open-source FactSheets project and the Partnership on AI's work on dataset documentation provide templates. The absence of documentation is itself an audit finding.
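
The proxy check described under input/feature data can be sketched in a few lines. This is a minimal illustration, not a standard procedure: the records, the variable names, and the 0.5 flagging threshold are all hypothetical, and a real audit would pick an association measure suited to the feature type.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical applicant records:
# 1 = lives in a given ZIP cluster, 1 = member of a protected group.
in_zip_cluster = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
in_group       = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

r = pearson(in_zip_cluster, in_group)

PROXY_THRESHOLD = 0.5  # illustrative cutoff, not a legal or industry standard
is_possible_proxy = abs(r) >= PROXY_THRESHOLD
print(f"correlation = {r:.2f}, possible proxy: {is_possible_proxy}")
```

A high correlation does not by itself show that the model relies on the feature; it only marks the feature for closer review.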

The Core Fairness Metrics

Metric 01
Demographic Parity
Selection rate is equal across groups. Simple to compute; ignores whether the model is accurate for each group. The 80% rule (four-fifths) is a legal threshold version.
Metric 02
Equalized Odds
Both true positive rate and false positive rate are equal across groups. Requires ground-truth labels. Captures both benefit and harm dimensions of error.
Metric 03
Calibration
Predicted probability means the same thing across groups — a 70% risk score means 70% of those flagged actually reoffend, equally across races. The metric Northpointe used to defend COMPAS.
Metric 04
Individual Fairness
Similar individuals receive similar scores. Requires defining "similar" — a task that itself embeds value judgments. Used in credit scoring contexts.
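
As a rough sketch of how the first three metrics are computed, the snippet below derives them from per-group confusion-matrix counts. The counts and group names are invented for illustration, and calibration is approximated here by positive predictive value at a single threshold (often called predictive parity). Individual fairness is omitted because it needs a similarity measure that the audit would have to define.

```python
from dataclasses import dataclass

@dataclass
class GroupCounts:
    """Confusion-matrix counts for one demographic group."""
    tp: int  # flagged positive, outcome positive
    fp: int  # flagged positive, outcome negative
    fn: int  # flagged negative, outcome positive
    tn: int  # flagged negative, outcome negative

def group_metrics(c: GroupCounts) -> dict:
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "selection_rate": (c.tp + c.fp) / total,  # compared for demographic parity
        "tpr": c.tp / (c.tp + c.fn),              # compared for equalized odds
        "fpr": c.fp / (c.fp + c.tn),              # compared for equalized odds
        "ppv": c.tp / (c.tp + c.fp),              # compared for predictive parity
    }

# Hypothetical counts; base rates are 50% for group_a and 40% for group_b.
counts = {
    "group_a": GroupCounts(tp=40, fp=10, fn=10, tn=40),
    "group_b": GroupCounts(tp=24, fp=6, fn=16, tn=54),
}

metrics = {g: group_metrics(c) for g, c in counts.items()}
for g, m in metrics.items():
    print(g, {k: round(v, 2) for k, v in m.items()})
```

In this made-up data the two groups have the same PPV (0.80) but different selection rates, true positive rates, and false positive rates, so the system would pass one fairness definition while failing the others.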
The Impossibility Result

Chouldechova (2017) and Kleinberg et al. (2016) independently proved that when base rates differ between groups — when one group actually commits crimes, defaults on loans, or gets sick at different rates — you cannot simultaneously satisfy calibration, equal false positive rates, and equal false negative rates. Every audit must declare which metric it prioritizes and explain the ethical reasoning behind that choice.
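
A small numeric sketch can make the result concrete. It relies on the identity from Chouldechova (2017), FPR = p/(1-p) × (1-PPV)/PPV × (1-FNR), where p is the group's base rate, and it treats calibration as equal PPV across groups. The base rates and metric values are hypothetical.

```python
def implied_fpr(base_rate: float, ppv: float, fnr: float) -> float:
    """False positive rate forced by a group's base rate, PPV, and FNR."""
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * (1 - fnr)

# Hypothetical setup: both groups get the same PPV and the same FNR,
# but their base rates differ.
PPV, FNR = 0.8, 0.2
fpr_a = implied_fpr(base_rate=0.5, ppv=PPV, fnr=FNR)
fpr_b = implied_fpr(base_rate=0.4, ppv=PPV, fnr=FNR)

print(f"group A FPR = {fpr_a:.3f}, group B FPR = {fpr_b:.3f}")
```

Holding PPV and FNR equal forces the false positive rates apart (about 0.200 versus 0.133 here), which is the trade-off an audit has to name and justify.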

The Four-Fifths Rule in Practice

The EEOC's 1978 Uniform Guidelines on Employee Selection Procedures established the four-fifths (80%) rule: if the selection rate for any group is less than 80% of the selection rate for the group with the highest rate, that is evidence of adverse impact requiring justification. Under NYC LL 144, auditors must compute this ratio for each race/ethnicity and sex category present in sufficient numbers.

Example: If 60% of white applicants pass a screening tool but only 40% of Black applicants do, the ratio is 40/60 = 0.67, which is below 0.80 and flags potential adverse impact. Under Title VII, the employer must then show that the tool is job-related and consistent with business necessity; even then, it can still be liable if a less discriminatory alternative was available and not adopted.
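
The arithmetic in this example can be sketched as follows. The applicant counts are hypothetical, chosen to reproduce the 60% and 40% selection rates, and the function name is illustrative.

```python
def impact_ratios(selected: dict, applicants: dict) -> dict:
    """Each group's selection rate divided by the highest group's selection rate."""
    rates = {g: selected[g] / applicants[g] for g in applicants}
    top_rate = max(rates.values())
    return {g: rate / top_rate for g, rate in rates.items()}

# Hypothetical counts giving 60% and 40% selection rates.
applicants = {"white": 200, "black": 150}
selected = {"white": 120, "black": 60}

ratios = impact_ratios(selected, applicants)
flagged = [g for g, ratio in ratios.items() if ratio < 0.8]  # four-fifths threshold

print({g: round(ratio, 2) for g, ratio in ratios.items()})
print("below four-fifths:", flagged)
```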

The four-fifths rule is a practical threshold, not a mathematical law. Small samples can produce false positives; very large samples can produce statistically significant but legally immaterial gaps. Good auditors report both the ratio and the statistical confidence around it.
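
One way to report confidence alongside the ratio is an approximate interval on the log of the ratio (the Katz method for a ratio of two proportions). The sketch below is one reasonable choice among several, with hypothetical counts; it shows how the same 0.67 ratio behaves at two sample sizes.

```python
from math import exp, sqrt

def impact_ratio_ci(sel_low, n_low, sel_high, n_high, z=1.96):
    """Impact ratio with an approximate 95% interval (log-ratio method)."""
    ratio = (sel_low / n_low) / (sel_high / n_high)
    # Standard error of log(ratio) for two independent proportions.
    se = sqrt(1 / sel_low - 1 / n_low + 1 / sel_high - 1 / n_high)
    return ratio, ratio * exp(-z * se), ratio * exp(z * se)

# Same 40% vs 60% selection rates, very different sample sizes (hypothetical).
small = impact_ratio_ci(4, 10, 6, 10)
large = impact_ratio_ci(400, 1000, 600, 1000)

for label, (ratio, lo, hi) in (("n=10 per group", small), ("n=1000 per group", large)):
    print(f"{label}: ratio={ratio:.2f}, interval=({lo:.2f}, {hi:.2f})")
```

With ten applicants per group the interval stretches from well below 0.80 to above 1.0, so the data cannot distinguish adverse impact from parity. With a thousand per group the whole interval sits below 0.80.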

Audit Principle

Never present a single fairness metric as "the" measure of bias. Report multiple metrics, explain their trade-offs, and be transparent about which stakeholder interests each metric prioritizes. Your audit's credibility depends on this transparency.

Lesson 2 Quiz

Evidence & Metrics — 4 questions
Northpointe defended COMPAS using which fairness metric?
Correct. Northpointe argued that among defendants assigned a given risk score, the reoffending rate was equal across races — the calibration property — even as ProPublica showed higher false positive rates for Black defendants.
Northpointe used calibration: among defendants assigned a given score, the actual reoffending rate was equal across races.
The four-fifths (80%) rule states that adverse impact is indicated when a group's selection rate falls below what fraction of the highest-rate group's selection rate?
Correct. The EEOC's 1978 Uniform Guidelines set 80% (four-fifths) as the threshold — if the lowest-group rate is less than 80% of the highest-group rate, adverse impact is indicated.
The four-fifths (80%) rule: if any group's selection rate is less than 80% of the highest-rate group's, adverse impact is flagged.
The mathematical impossibility result proved by Chouldechova (2017) shows that when base rates differ across groups, you cannot simultaneously achieve:
Correct. This is the core impossibility result: differing base rates make it mathematically impossible to simultaneously satisfy all three error-rate constraints, forcing auditors to choose which trade-off is ethically acceptable.
Chouldechova's impossibility result specifically concerns calibration, equal false positive rates, and equal false negative rates — all three cannot hold simultaneously when base rates differ.
HMDA (Home Mortgage Disclosure Act) data is particularly useful for bias audits because it:
Correct. HMDA mandates retention of applicant demographics and loan outcomes, giving auditors rare access to the joint distribution of inputs and decisions needed for disparate impact analysis.
HMDA requires lenders to collect and retain demographic data alongside loan decisions — this transparency makes mortgage lending one of the most auditable AI application domains.

Lab 2: Metric Selection Workshop

Practice choosing and justifying fairness metrics for your audit system

Your Task

You've scoped your audit. Now you need to choose which fairness metrics you'll apply and why — and acknowledge the trade-offs you're accepting. Work with the coach to select at least two metrics, explain what evidence you'd need to compute them, and articulate which stakeholder interests each metric prioritizes.

Complete at least 3 exchanges to finish this lab.

Tell the coach which AI system you're auditing (from Lab 1 or a new choice), then propose which fairness metric you'd start with and why. The coach will challenge you to consider the trade-offs.
Metrics Selection Coach
Welcome to the Metric Selection Workshop. Choosing fairness metrics is never neutral — every metric favors some stakeholders over others. Tell me: which system are you auditing, and which fairness metric would you start with? I'll push you to think through what that choice actually means for the people affected.
Module 6 · Lesson 3

Structuring Your Audit Report

How professional auditors organize findings, rate severity, and write recommendations that actually get implemented
What separates a bias audit that changes organizational behavior from one that gets filed and forgotten?

In May 2023 the U.S. Equal Employment Opportunity Commission issued technical assistance titled "Select Issues: Assessing Adverse Impact in Software, Algorithms, and Artificial Intelligence Used in Employment Selection Procedures Under Title VII of the Civil Rights Act of 1964," stating explicitly that employers can remain liable for Title VII violations even when the discriminatory effect comes from a vendor's algorithm. The guidance encouraged employers to assess these tools for adverse impact on an ongoing basis. What it notably did not specify was the format of an audit report, leaving practitioners to develop de facto standards through practice.

The Five-Section Audit Report

Across the emerging audit ecosystem — including audits published under NYC LL 144, academic audits like Gender Shades, and nonprofit audits by the Algorithmic Justice League — a consensus structure has emerged. Your report should contain five sections:

Section 01
Executive Summary
One page. System name, audit type, key findings, severity ratings, and top 3 recommendations. Written for decision-makers who won't read further.
Section 02
System Description
What the system does, who built it, how it is deployed, which decisions it influences, and which populations it affects. Includes the "intended use" versus "actual use" gap if one exists.
Section 03
Methodology
Data sources, demographic categories examined, fairness metrics selected (with justification), statistical methods, and scope limitations. Must be replicable.
Section 04
Findings
Metric-by-metric results, disaggregated by group. Severity ratings (Critical / High / Medium / Low). Tables, visualizations, and confidence intervals where applicable.
Section 05
Recommendations
Specific, actionable, prioritized. Each recommendation tied to a finding, with a named responsible party, a proposed timeline, and a success metric.
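
The fields required of each recommendation can be captured in a small record that reports what is missing. This is only a sketch: the class, field names, and sample values are hypothetical, not a format prescribed by any audit standard.

```python
from dataclasses import dataclass, fields

@dataclass
class Recommendation:
    finding_id: str      # which finding this remedies
    action: str          # specific, bounded change
    owner: str           # named responsible party
    due: str             # proposed timeline
    success_metric: str  # how completion will be judged

    def missing_fields(self) -> list:
        """Names of any fields left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

complete = Recommendation(
    finding_id="F-01",
    action="Retrain the model excluding ZIP code",
    owner="Head of ML Engineering",
    due="within 90 days",
    success_metric="impact ratio of at least 0.80 for every group",
)
vague = Recommendation("F-02", "Consider fairness", "", "", "")

print(complete.missing_fields())  # []
print(vague.missing_fields())     # ['owner', 'due', 'success_metric']
```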

Severity Ratings

Without severity ratings, all findings look equal — and organizations will address whichever is cheapest rather than most urgent. Adopt a consistent scale:

Level | Definition | Example
Critical | Active legal liability; significant documented harm to a protected class; must be addressed immediately | Selection ratio for Black applicants is 0.62, below the four-fifths threshold, in a jurisdiction where LL 144 applies
High | Significant disparity without current legal action; predictive of harm at scale; address within 90 days | False positive rate for women applicants is 1.4× that of men; no statistically significant ground-truth difference exists
Medium | Disparity detectable but within legal thresholds; upstream risk factor; monitor quarterly | ZIP code feature correlates 0.71 with race/ethnicity; no adverse impact threshold crossed yet
Low | Documentation gap or process concern; no measured disparity; address in next development cycle | No model card exists; training data source is undocumented

What Makes Recommendations Stick

Research on audit uptake (including a 2022 study by Metcalf, Moss, and boyd at Data & Society) found that bias audit recommendations were most likely to be implemented when they were: (1) specific and bounded — "retrain the model excluding ZIP code" rather than "reduce proxy variables"; (2) linked to a business risk the organization already recognizes — regulatory fine, reputational damage, or contract loss; and (3) assigned to a named individual with authority and accountability.

Vague recommendations like "consider fairness" produce no action. Recommendations with dollar estimates attached to non-compliance ("each day of non-compliant use under LL 144 risks a fine of up to $1,500") move budgets.

Audience Calibration

Write your executive summary for the CFO. Write your methodology for the ML engineer who will implement the fix. Write your findings for the legal counsel who needs to assess liability. The same audit serves three audiences, and each section should speak to one of them explicitly.

Published Audit Examples

The most instructive model is the HireVue algorithmic audit (2021, conducted by O'Neil Risk Consulting & Algorithmic Auditing). HireVue voluntarily discontinued its facial analysis feature before the audit, but the published report — available publicly — demonstrates the five-section structure, uses adverse impact ratios, and rates findings by severity. Its limitations section, which acknowledges that the audit could not test the model on real applicants, is a model of intellectual honesty that strengthens rather than undermines the report's credibility.

Lesson 3 Quiz

Structuring Your Audit Report — 4 questions
The EEOC's 2023 AI guidance clarified that employers are liable for Title VII violations caused by an algorithmic tool even when:
Correct. The EEOC guidance explicitly stated that outsourcing discrimination to a vendor's algorithm does not transfer liability — the employer remains responsible under Title VII.
The EEOC stated that employers remain liable even when the discriminatory effect originates in a vendor's algorithm they did not build or control.
A "Critical" severity finding in a bias audit report means:
Correct. Critical findings indicate active legal exposure or documented harm to a protected class — these require immediate remediation, not merely monitoring.
Critical severity means active legal liability or significant documented harm — immediate action required, not a monitoring recommendation.
According to Data & Society research on audit uptake, which type of recommendation is most likely to actually be implemented?
Correct. Specificity, business-risk linkage, and named accountability are the three factors Metcalf, Moss, and boyd identified as driving audit recommendation uptake.
Research found that specificity (not vagueness), connection to recognized business risk, and named individual accountability drive implementation.
Why did acknowledging limitations in the HireVue 2021 audit report strengthen rather than undermine its credibility?
Correct. Readers — especially technically sophisticated ones — trust audits more when auditors are candid about what they could not test. Claiming more than you measured destroys credibility; honest scope-setting builds it.
Transparent limitations signal rigor and honesty. Readers trust claims within scope more when auditors clearly demarcate what they could not test.

Lab 3: Report Drafting Workshop

Draft your executive summary and a severity-rated finding with a specific recommendation

Your Task

You've scoped your audit and chosen your metrics. Now draft the executive summary and one finding with a severity rating and a specific recommendation. The coach will give you feedback on clarity, specificity, audience calibration, and whether your recommendation would actually drive action.

Complete at least 3 exchanges to finish this lab.

Draft a 3–5 sentence executive summary for your bias audit. Include: the system name, the highest-severity finding, and your top recommendation. Then the coach will help you refine it.
Report Drafting Coach
Welcome to the Report Drafting Workshop. A strong executive summary can move an organization to act; a weak one gets filed. Start by drafting 3–5 sentences: name the system you audited, state your most significant finding with a severity label (Critical / High / Medium / Low), and give your top recommendation. I'll give you concrete feedback on how to make it sharper.
Module 6 · Lesson 4

Presenting to Stakeholders

Communicating bias findings to executives, engineers, regulators, and affected communities — and handling pushback
When the people who built the system are in the room, how do you present findings that challenge their work — and keep the conversation productive?

In December 2020, Google AI ethics researcher Timnit Gebru was dismissed (or resigned under pressure, depending on the account) after a dispute over a paper, co-authored with Margaret Mitchell and others, that critiqued large language models for encoding social bias and incurring heavy environmental costs. The paper had not yet been published; Google leadership objected to its conclusions and asked that Gebru withdraw it or remove Google affiliates' names. The incident, which became widely known as the "Stochastic Parrots" controversy, showed that presenting bias findings to an organization with a financial interest in a contrary conclusion is not merely a communication challenge. It is a political and professional risk. Gebru went on to found the Distributed AI Research Institute (DAIR), an independent research organization explicitly not dependent on tech industry funding, so that bias findings could be published without organizational gatekeeping.

Know Your Audience

Different stakeholders need different framings of the same findings. Engineers respond to technical specificity — "the model's false positive rate for group X is 1.8 standard deviations above the mean" is actionable to them. Executives respond to liability and reputational framing — "this exposure puts us outside LL 144 compliance and creates a $1,500/day fine risk." Legal counsel wants findings mapped to specific statutes. Regulators want methodology documented to their standards.

Affected community members — often excluded from audit presentations — need to understand findings in terms of the actual harms, not statistical abstractions. The Algorithmic Justice League's public communications deliberately avoid metric-only framings in favor of concrete narratives: not "false positive rate disparity of 0.21" but "Black defendants marked high-risk who were not rearrested — labeled dangerous, released later, or held longer."

Handling Pushback: Four Common Arguments

Pushback | What It Usually Means | Effective Response
"Your metric choice is biased." | The organization prefers a metric where they perform better | Agree that metric choice involves trade-offs; present multiple metrics; ask the organization to specify which metric they believe should govern the decision and why
"The sample is too small." | May be legitimate or may be deflection | Report confidence intervals; if sample is genuinely small, flag it as a data collection recommendation; ask why demographic data wasn't retained
"The model is just reflecting real-world patterns." | Conflating prediction with prescription | Acknowledge base rate differences; explain that a system can be calibrated and still cause harm; distinguish between descriptive accuracy and normative acceptability
"We already knew about this." | Either true (and they didn't fix it) or false (face-saving) | If true: ask for the remediation timeline that was in place; if no timeline exists, the finding stands. If false: note that the audit has now documented it formally.

The Presentation Structure

  • Open with the business or regulatory context — why this audit was conducted and what standard it was measured against. Never open with methodology.
  • State the top finding in plain language in the first two minutes. Executives leave meetings early; front-load the critical information.
  • Present metrics with visualizations: disparity ratio bar charts, false positive rate comparisons, calibration curves. Tables bury findings; charts surface them.
  • For each finding, immediately follow with the corresponding recommendation. Never separate findings from their remedies.
  • Acknowledge limitations explicitly before the audience raises them. Anticipating objections demonstrates rigor.
  • Close with a proposed timeline and named owners for each recommendation. Leave the room with documented commitments, not open-ended discussions.
Independence Matters

The value of a bias audit is directly proportional to the auditor's independence from the audited organization. Timnit Gebru's founding of DAIR reflects a structural truth: audits conducted by internal teams, or by external teams financially dependent on continued contracts with the auditee, face structural pressures that compromise findings. When presenting your audit, be transparent about your relationship to the organization and any limitations that relationship creates.

Documentation After Presentation

A presentation without a follow-up written record is an audit finding that exists only in memory. After every stakeholder presentation, send a written summary of: decisions made, recommendations accepted or rejected (with stated reasons), owners assigned, and timelines committed. This creates an accountability trail and, if the organization later faces regulatory scrutiny, demonstrates either due diligence or deliberate non-remediation — a distinction that matters significantly in legal proceedings.

The FTC's 2022 enforcement action against the data broker Kochava referenced internal documentation in which the company acknowledged privacy risks and then continued operations unchanged. Documented awareness of a problem without remediation is often worse legally than undocumented ignorance. The same principle applies to bias audit follow-through.

Module Summary

You have now covered the complete bias audit workflow: defining scope and type (L1), gathering evidence and choosing metrics with justified trade-offs (L2), structuring a five-section report with severity ratings and actionable recommendations (L3), and presenting findings to diverse stakeholders while handling pushback and documenting follow-through (L4). Your module test will assess all four competencies.

Lesson 4 Quiz

Presenting to Stakeholders — 4 questions
Timnit Gebru founded the Distributed AI Research Institute (DAIR) primarily to:
Correct. DAIR's founding purpose was structural independence from industry funding, ensuring that research findings — particularly about bias and harms — could be published without organizational gatekeeping.
DAIR was founded specifically to ensure independence from industry funding, so that bias research findings could not be suppressed by organizations with financial interests in contrary conclusions.
When a stakeholder argues "the model is just reflecting real-world patterns," the correct response is to:
Correct. A model can accurately reflect historical patterns and still cause harm — calibration does not equal fairness. The audit's job is to distinguish what the model predicts from what the organization chooses to do with those predictions.
The correct response is to acknowledge that base rates may differ while distinguishing accurate prediction from normative acceptability — a system can be calibrated and still perpetuate harm.
Why should the top finding be stated in plain language within the first two minutes of a stakeholder presentation?
Correct. Front-loading the critical finding is a practical presentation strategy: if the executive leaves at minute five, they've heard the most important thing. Burying findings after 20 minutes of methodology guarantees key decision-makers miss them.
The practical reason is that executives leave early — front-loading the critical finding ensures the most important information reaches decision-makers regardless of how long they stay.
The FTC's 2022 enforcement action against data broker Kochava illustrates which principle relevant to bias audit follow-through?
Correct. When regulators can show that an organization knew about a problem — through its own internal documents — and continued operations without remediation, that documented awareness becomes evidence of willful non-compliance, which carries heavier penalties.
The Kochava case illustrated that documented awareness of harm without remediation is legally damaging — regulators treat known-and-ignored problems more harshly than unknown ones.

Lab 4: Stakeholder Presentation Rehearsal

Practice delivering your audit findings and handling real pushback

Your Task

You are presenting your bias audit findings to a skeptical executive audience. The coach will play the role of a senior stakeholder who pushes back on your findings using the four common arguments from Lesson 4. Practice responding clearly and professionally.

Complete at least 3 exchanges to finish this lab.

Open your presentation: state which system you audited, your top finding with severity rating, and your #1 recommendation. The coach (as skeptical executive) will respond with pushback for you to handle.
Stakeholder Presentation Coach
I'll be playing a skeptical senior executive during your audit presentation. Go ahead and open: tell me which system you audited, your most significant finding, and your top recommendation. I'll respond the way a real executive might — and we'll practice handling the pushback professionally.

Module 6 Test

Present Your Bias Audit — 15 questions · 80% to pass
1. NYC Local Law 144 applies to automated employment decision tools used in which jurisdiction?
Correct. LL 144 is a New York City local law, applying to employers and employment agencies using automated tools on candidates or employees in NYC.
LL 144 is a New York City local law — it applies to employers using automated hiring tools on candidates or employees within New York City.
2. A "process / documentation audit" primarily examines:
Correct. A process/documentation audit focuses on upstream decisions — where did the training data come from, how was it labeled, what policies govern deployment — rather than outcome statistics.
Process/documentation audits review training data provenance, labeling instructions, model cards, and deployment policies — upstream design choices that can introduce bias before any outcome data exists.
3. Amazon's résumé-screening tool penalized résumés containing the word "women's" because:
Correct. The tool learned from historical "successful" hire data that was male-dominated, effectively learning to replicate past discrimination as a predictive pattern.
The tool learned to replicate Amazon's historical hiring patterns from its training data — data that reflected a male-dominated tech workforce — without any deliberate engineering of that bias.
4. HMDA data is valuable for mortgage lending bias audits because it:
Correct. HMDA requires lenders to collect and report loan-level records that pair applicant demographics with loan outcomes, which is the data structure auditors need for disparate impact analysis.
HMDA mandates retention of applicant demographics alongside loan decisions, giving auditors the joint data distribution needed for disparate impact analysis.
5. The "Gender Shades" study measured error rate disparities in which type of AI system?
Correct. Buolamwini and Gebru tested three commercial face recognition products (from IBM, Microsoft, and Face++) and found error rates as high as 34.7% for darker-skinned women, compared with at most 0.8% for lighter-skinned men.
Gender Shades tested commercial face recognition systems from multiple vendors, finding dramatic error rate disparities across gender and skin tone combinations.
6. Which fairness metric was at the center of the ProPublica vs. Northpointe COMPAS dispute?
Correct. ProPublica showed that Black defendants had higher false positive rates; Northpointe countered that the tool was equally calibrated across races. Both were mathematically correct — they were measuring different properties.
The dispute was precisely between ProPublica's false positive rate analysis and Northpointe's calibration defense — both mathematically valid but measuring different things.
7. The Chouldechova (2017) impossibility result proves that when group base rates differ, you cannot simultaneously achieve:
Correct. This is the core impossibility theorem: differing base rates force a three-way trade-off among calibration, equal false positive rates, and equal false negative rates.
The impossibility result concerns calibration, equal false positive rates, and equal false negative rates — all three cannot be simultaneously satisfied when base rates differ across groups.
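The trade-off can be checked numerically. The sketch below is illustrative only: it uses the standard confusion-matrix identity relating false positive rate to base rate, positive predictive value (PPV), and false negative rate, and it treats equal PPV as the calibration condition, which is an assumption of this example. The function name and all input values are hypothetical.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate forced by a given base rate, PPV, and FNR.

    Follows from the confusion matrix:
    FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR)
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Hypothetical groups: identical PPV and FNR, different base rates.
fpr_a = implied_fpr(base_rate=0.5, ppv=0.6, fnr=0.3)
fpr_b = implied_fpr(base_rate=0.3, ppv=0.6, fnr=0.3)

print(f"Group A false positive rate: {fpr_a:.3f}")
print(f"Group B false positive rate: {fpr_b:.3f}")
```

Holding PPV and the false negative rate equal across the two groups, the differing base rates alone force different false positive rates, which is the pattern at the heart of the COMPAS dispute.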
8. Under the four-fifths (80%) rule, adverse impact is indicated when a group's selection rate is below what threshold relative to the highest-rate group?
Correct. The EEOC's 1978 Uniform Guidelines established 80% (four-fifths) as the threshold for flagging adverse impact in employee selection.
The four-fifths rule: adverse impact is flagged when a group's selection rate is below 80% of the highest-rate group's selection rate.
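As a minimal sketch of the four-fifths calculation, the example below divides each group's selection rate by the highest group's rate and flags ratios under 0.8. The function name, group labels, and counts are hypothetical and chosen only for illustration; a real audit would also consider sample size and statistical significance.

```python
def impact_ratios(selected, applicants):
    """Return each group's selection rate divided by the highest group's rate."""
    rates = {group: selected[group] / applicants[group] for group in applicants}
    top_rate = max(rates.values())
    return {group: rate / top_rate for group, rate in rates.items()}


# Hypothetical counts, for illustration only.
applicants = {"group_a": 200, "group_b": 150}
selected = {"group_a": 100, "group_b": 45}

ratios = impact_ratios(selected, applicants)

# Groups whose impact ratio falls below the four-fifths (0.8) threshold.
flagged = [group for group, ratio in ratios.items() if ratio < 0.8]

for group, ratio in ratios.items():
    print(f"{group}: impact ratio {ratio:.2f}")
print("Flagged for adverse impact:", flagged)
```

Here group_a is selected at 50% and group_b at 30%, so group_b's impact ratio is 0.60 and it falls below the threshold.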
9. The five-section audit report structure includes (in order): Executive Summary, System Description, Methodology, Findings, and:
Correct. The fifth and final section is Recommendations — specific, actionable, prioritized, and tied to findings with named owners and timelines.
The five sections are: Executive Summary, System Description, Methodology, Findings, and Recommendations.
10. A "Critical" severity finding in a bias audit requires:
Correct. Critical findings indicate active legal exposure or documented harm — they cannot wait for a development cycle or quarterly review.
Critical severity means immediate action is required because there is active legal liability or significant documented harm to a protected class.
11. The EEOC's 2023 AI guidance stated that employers can be liable for Title VII violations caused by a vendor's algorithm because:
Correct. Title VII focuses on discriminatory outcomes affecting employees and applicants — the employer's decision to use a discriminatory tool makes the employer liable for the resulting harm.
The EEOC's position is that Title VII prohibits discriminatory outcomes — using a vendor's discriminatory tool produces those outcomes, and the employer is responsible for the tools they deploy.
12. According to Data & Society research, bias audit recommendations most likely to be implemented share which three characteristics?
Correct. Metcalf, Moss, and boyd found that specific, bounded recommendations linked to business risk and assigned to named individuals were far more likely to drive organizational action than vague or unassigned ones.
The three factors are: specific and bounded scope, connection to a business risk the organization already recognizes, and a named individual with authority and accountability.
13. Timnit Gebru founded DAIR (Distributed AI Research Institute) primarily to ensure:
Correct. DAIR's structural independence from industry funding is its core feature — it ensures that findings critical of powerful tech organizations can be published and disseminated.
DAIR was founded specifically so that bias and ethics research could be conducted and published free from the financial pressures and organizational gatekeeping that constrain industry-funded researchers.
14. When presenting a bias audit, findings should be separated from recommendations in the presentation structure because:
Correct — this was a trick question. Best practice is to pair each finding immediately with its recommendation. Separating them makes it easier for audiences to mentally dismiss findings before hearing their remedies.
This was a trick question. Best practice is to immediately follow each finding with its recommendation — separation allows audiences to disengage from findings before hearing what to do about them.
15. The FTC enforcement principle illustrated by the Kochava case most directly warns audit practitioners to:
Correct. Once an organization has a written record showing awareness of a bias problem, failure to remediate becomes willful non-compliance — significantly increasing regulatory and legal exposure.
The Kochava principle: documented awareness of harm without remediation is legally worse than ignorance. Audit practitioners must document follow-through commitments to create accountability trails.