Module 4 · Lesson 1

What Does It Mean to Audit an AI?

Before we can fix something, we have to know how to look at it clearly.
If a machine makes a decision that ruins someone's life, who checks its work?

In August 2020, about 280,000 students across England sat waiting for their A-level exam results, the grades that determined whether they'd get into university. Because COVID-19 had cancelled the actual exams, the UK government handed the job of assigning grades to an algorithm built by Ofqual, the exams regulator for England.

The algorithm looked at each school's historical performance and adjusted its students' teacher-predicted grades downward if the school had a history of overestimating results. On paper, that sounds reasonable. In practice, it meant that a student at a low-performing school in a low-income area could score brilliantly in mock exams and still receive a grade two levels below what their teacher expected, because the school's past record dragged the number down.

Students from private schools fared better, because those schools had smaller class sizes and more stable track records. The algorithm, which no student had ever seen or been told about, had baked in existing inequality and called it math. Within days, the UK government reversed the decision under public pressure. But nobody had audited the algorithm before it went live. Nobody had checked what it would actually do to real people.

So What Is an Audit?

An audit is a careful, systematic check of whether something is working the way it's supposed to, and whether it's causing harm it shouldn't cause. You've probably heard of financial audits, where accountants check whether a company's money records are accurate. An AI audit does something similar, but instead of checking numbers in a ledger, it checks the behavior of a model.

Specifically, an AI audit asks three questions: What did the model learn? Who does it hurt? And does it do what it claims to do? Those sound simple, but answering them is surprisingly hard, and the people who build AI systems are often the last ones to spot the problems.

That's why independent auditing matters. The Ofqual algorithm was built by people who thought they were being fair. They weren't trying to disadvantage students from poorer schools. But because nobody outside their team was systematically checking the outcomes on different groups, the bias went undetected until it was too late.

Audit: A structured review of a system to check whether it works correctly and causes no unacceptable harm. In AI, this means testing what the model actually does, not just what it was designed to do.

The Gap Between Intention and Outcome

Here is something that took researchers years to fully accept: a model can be built with good intentions and still produce deeply unfair outcomes. The Ofqual algorithm wasn't designed to punish poor kids. Its designers genuinely believed they were correcting for grade inflation. But the data they used (historical school performance) carried decades of unequal funding, unequal resources, and unequal opportunity embedded inside it. The model learned from that data. Then it reproduced those inequalities, with mathematical confidence.

What let those results stand, at least at first, is sometimes called automation bias: the tendency for people to trust a computer's output more than a human's judgment, even when the computer is wrong. Because the number came from an algorithm, it felt official, certain, objective. Teachers who knew their students were overruled by a score.

An auditor's job is specifically to resist automation bias. An auditor looks at the output and asks: Wait. Does this actually make sense? Who got hurt here, and why?

The Ethical Question

When an algorithm causes harm that its designers didn't intend, who is responsible? The engineers who built it? The government that deployed it? The regulators who approved it? Or is "no one" a real answer β€” and if so, what does that mean for the people who were hurt?

What Auditors Actually Look For

In professional AI audits, which now happen at companies, governments, and universities, auditors look for a specific set of warning signs. You can think of them as the five things every audit checks:

  • Accuracy: Does the model get the right answer often enough across all the cases it will encounter, not just the easy ones?
  • Fairness: Does the model perform equally well for different groups of people: different races, genders, income levels, locations?
  • Transparency: Can anyone explain why the model made a specific decision? Or is it a black box?
  • Robustness: Does the model break down when it encounters unusual cases or people trying to game it?
  • Accountability: Is there a clear process for correcting errors when they're found?

The Ofqual algorithm failed on at least three of these. It was inaccurate for many individual students, unfair across economic groups, and there was no quick accountability mechanism until public outrage forced one.
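To make the accuracy and fairness checks concrete, here is a minimal sketch of how an auditor might compare a model's overall accuracy with its accuracy for each group. The records, group names, and helper functions are invented for illustration; they are not Ofqual data or any real audit tool.

```python
from collections import defaultdict

# Hypothetical audit records: (group, model_prediction, actual_outcome).
# These values are invented purely to illustrate the check.
records = (
    [("school_A", 1, 1)] * 45 + [("school_A", 0, 1)] * 5      # 45 of 50 correct
    + [("school_B", 1, 1)] * 30 + [("school_B", 0, 1)] * 20   # 30 of 50 correct
)

def accuracy(rows):
    """Fraction of rows where the prediction matches the actual outcome."""
    return sum(1 for _, pred, actual in rows if pred == actual) / len(rows)

def accuracy_by_group(rows):
    """Split rows by group label and compute accuracy within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return {name: accuracy(members) for name, members in groups.items()}

overall = accuracy(records)
per_group = accuracy_by_group(records)
print(overall)    # 0.75
print(per_group)  # {'school_A': 0.9, 'school_B': 0.6}
```

In this toy data the headline figure of 75% hides a 30-point gap between the two groups, which is the kind of pattern the fairness check is meant to surface.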

You now understand something most adults who hear news stories about AI still miss: there's a difference between a model that looks correct and a model that has been audited. Looking correct just means it produces confident-sounding numbers. Being audited means someone checked whether those numbers are actually right, and for whom.

Your New Lens

From now on, whenever you hear that "an algorithm decided" something (about college admissions, loan applications, parole, job hiring), you have a framework. You can ask: Was this audited? For what kind of fairness? By whom? Those questions put you ahead of most people in the room.

Quiz · Lesson 1

What Does It Mean to Audit an AI?
1. In 2020, the Ofqual algorithm assigned lower-than-expected grades to students mainly because it:
Correct. The algorithm used each school's past grade history as a factor, which meant students at historically lower-performing schools were dragged down, even if they individually performed well.
Not quite. The algorithm wasn't looking at individual behavior. It was using school-level historical data, which embedded existing inequalities into individual outcomes.
2. "Automation bias" in this context means:
Exactly right. Automation bias is about us: humans over-trusting algorithmic outputs because they feel official or objective.
Automation bias refers to how humans respond to algorithmic outputs, not how the algorithm itself behaves. Computers feel authoritative, so people often trust them more than they should.
3. A school district buys an AI system to predict which students might drop out. The vendor says accuracy is 85%. An auditor reviewing this tool should also ask:
Correct. Overall accuracy can hide the fact that a model works well for one group and terribly for another. An auditor asks: accurate for whom?
An auditor's priority is fairness across groups. Overall accuracy statistics can be misleading if the model is significantly more accurate for some students than others.
4. Which of the five audit dimensions did the Ofqual algorithm most clearly fail on regarding individual students whose teachers knew them well?
Right. The algorithm's predictions were factually wrong for thousands of individual students. Teacher assessments were overridden by a system that couldn't account for individual cases.
While the algorithm had multiple failures, the most immediate problem for individual students was accuracy: the grades were simply wrong, especially for those from lower-performing schools who had outperformed expectations.
5. Why might the engineers who built the Ofqual algorithm have failed to catch its fairness problems before it was deployed?
Exactly. Builders often test whether a system works "on average", but bias hides in the gaps between groups. Independent audits exist precisely because insiders tend to miss what outsiders catch.
The engineers had good intentions but limited perspective. They likely tested whether the model worked on typical cases without systematically examining outcomes for students from different socioeconomic backgrounds.

Lab 1 · The Audit Briefing

You've been assigned to audit an AI system. Your first job: figure out what questions to ask.

Your Role: Junior AI Auditor

A city government is using an AI model to decide which neighborhoods get increased police patrols. The model was trained on five years of arrest data. Officials say it's 79% accurate. Your job is to audit it, and you need to decide what questions matter most before you even look at the data.

Your lab partner SABLE is a peer auditor, not a teacher. She'll push back on weak reasoning and ask you to justify your positions. Have at least 3 exchanges to complete this lab.

Start by telling SABLE: What's the first question you'd ask about this model, and why is that the most important thing to know?
SABLE · Peer Auditor · Lab 1
Okay, I've read the briefing. A predictive policing model trained on arrest data, 79% accuracy claimed. I've done three of these audits now and I'll tell you: 79% accuracy is almost meaningless without more context. But I want to hear your instinct first. What's the first question you'd ask, and why that one before everything else?
Module 4 · Lesson 2

Reading the Evidence: Accuracy, Fairness, and What the Numbers Hide

Numbers can tell the truth and lie at the same time. Knowing the difference is the whole job.
If a model is right 90% of the time, does that mean it's fair?

In 2016, the investigative newsroom ProPublica published an analysis that changed how the world thought about AI and criminal justice. They examined a tool called COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), which was being used by judges in Broward County, Florida to predict whether defendants would commit another crime before trial.

COMPAS gave each defendant a risk score from 1 to 10. Judges used those scores to help decide who got bail and who sat in jail. The tool's creator, a company called Northpointe, said the model was accurate, and by one measure, they were right. Overall, its predictions of recidivism (re-offending) were correct around 65–70% of the time.

But ProPublica's analysts, led by Jeff Larson and Julia Angwin, broke the numbers down by race. And what they found was this: Black defendants who did not reoffend were nearly twice as likely to be falsely labeled high-risk as white defendants who did not reoffend. Meanwhile, white defendants who did reoffend were more likely to be labeled low-risk and released. The model's overall accuracy was about the same for both groups. It was deeply unequal in how its errors fell.

Two Kinds of Accuracy

This is one of the most important ideas in all of AI ethics, and it comes down to something called error distribution. When a model makes mistakes (and every model makes mistakes), the question is: whose lives do those mistakes disrupt?

There are two types of errors that matter in a model like COMPAS. A false positive is when the model predicts something bad will happen but it doesn't, like labeling a defendant "high risk" when they actually wouldn't reoffend. A false negative is when the model predicts nothing bad will happen, but it does, like labeling someone "low risk" when they actually do reoffend.

In the COMPAS case, false positives fell disproportionately on Black defendants: people who were treated as dangerous when they weren't. False negatives fell disproportionately on white defendants: people the system assumed were safe when they weren't. The model's overall accuracy looked fine. But the errors were not distributed equally. And in a system where errors mean people sit in jail or walk free, that asymmetry is not a technical detail. It's a civil rights issue.

False Positive: When a model incorrectly predicts that something will happen. Example: predicting a defendant will reoffend when they actually won't.
False Negative: When a model incorrectly predicts that something will not happen. Example: predicting a defendant won't reoffend when they actually will.
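The two error types can be counted separately for each group. The sketch below uses invented counts (not the real COMPAS figures) to show how two groups can have identical overall accuracy while their false positive and false negative rates differ sharply. The function names are illustrative only.

```python
def error_rates(rows):
    """Return (false_positive_rate, false_negative_rate) for (pred, actual) pairs.

    pred == 1 means "labeled high risk"; actual == 1 means "did reoffend".
    """
    fp = sum(1 for pred, actual in rows if pred == 1 and actual == 0)
    tn = sum(1 for pred, actual in rows if pred == 0 and actual == 0)
    fn = sum(1 for pred, actual in rows if pred == 0 and actual == 1)
    tp = sum(1 for pred, actual in rows if pred == 1 and actual == 1)
    fpr = fp / (fp + tn)  # share of non-reoffenders wrongly flagged
    fnr = fn / (fn + tp)  # share of reoffenders wrongly cleared
    return fpr, fnr

def overall_accuracy(rows):
    """Fraction of predictions that match the actual outcome."""
    return sum(1 for pred, actual in rows if pred == actual) / len(rows)

# Invented counts for two hypothetical groups of 200 people each.
group_1 = [(1, 0)] * 40 + [(0, 0)] * 60 + [(0, 1)] * 30 + [(1, 1)] * 70
group_2 = [(1, 0)] * 20 + [(0, 0)] * 80 + [(0, 1)] * 50 + [(1, 1)] * 50

print(overall_accuracy(group_1), overall_accuracy(group_2))  # 0.65 0.65
print(error_rates(group_1))  # (0.4, 0.3)
print(error_rates(group_2))  # (0.2, 0.5)
```

Both groups see 65% accuracy, yet group 1 carries twice the false positive rate of group 2, which mirrors the shape of the disparity described above.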

The Fairness Impossibility Problem

Here's where it gets genuinely complicated, and this is the part that even professional statisticians sometimes argue about. After ProPublica's report, Northpointe responded with their own analysis. They said COMPAS was fair because it produced equally accurate scores for Black and white defendants at each risk level. If it said someone was 7-out-of-10 risk, that score meant the same thing regardless of race.

Mathematically, both claims were correct. ProPublica's measure of fairness and Northpointe's measure of fairness are real, recognized definitions. And in 2016, three researchers (Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan) proved something that should have been front-page news everywhere: when the underlying rate of an outcome differs between groups, a model that makes any errors at all cannot satisfy all of these fairness definitions at the same time.

Because arrest rates differed between Black and white defendants (partly due to policing patterns, not just actual crime rates), any model trained on that data faced an impossible tradeoff. You could make the false positive rates equal, or you could make the accuracy-per-score-level equal. You could not do both. Someone was going to be treated worse no matter what.
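The tradeoff can be checked with arithmetic. The sketch below relies on an identity that follows directly from the definitions of the rates: once a group's base rate, the reliability of a "high risk" label (its positive predictive value), and its false negative rate are fixed, the false positive rate is forced. The base rates and other numbers here are invented for illustration.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate forced by the other three quantities.

    base_rate: share of the group that actually reoffends
    ppv:       share of people labeled high risk who actually reoffend
    fnr:       share of reoffenders who were labeled low risk

    Derivation: true positives = (1 - fnr) * base_rate, and
    false positives = true positives * (1 - ppv) / ppv, so dividing by the
    share of non-reoffenders (1 - base_rate) gives the expression below.
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)

# Same label reliability (ppv) and same false negative rate for both groups,
# but different underlying base rates (invented numbers).
fpr_a = implied_fpr(base_rate=0.5, ppv=0.7, fnr=0.3)
fpr_b = implied_fpr(base_rate=0.3, ppv=0.7, fnr=0.3)
print(round(fpr_a, 3), round(fpr_b, 3))  # 0.3 0.129
```

With a score that means the same thing for both groups, the group with the higher base rate ends up with more than double the false positive rate. Equalizing the false positive rates would require giving up equal reliability of the label, or equal false negative rates.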

This doesn't mean fairness is hopeless. It means choosing a fairness definition is itself a moral and political decision, not a technical one. And right now, in most places, that decision is being made quietly by engineers and companies, without public input.

The Ethical Question

If you genuinely cannot satisfy all definitions of fairness at once, who should decide which definition a model uses, especially when lives are at stake? Should it be the company that built it? The judge who uses it? A government body? The people most affected by it? There's no clean answer here.

What This Means When You Read a Headline

Almost every news story about AI accuracy reports a single number. "The model is 94% accurate." "The system detects fraud with 87% precision." These numbers are real, and almost always incomplete. A trained auditor reads them and immediately asks: Accurate for whom? What happens in the 6% or 13% of cases where it fails? Do those failures fall evenly, or do they pile up on the same group of people?

The COMPAS story is now taught in law schools, computer science programs, and statistics departments worldwide. But most of the judges who used COMPAS scores between 2010 and 2016 were never told to ask those questions. They trusted the number because it was scientific. That's automation bias at institutional scale.

Knowing that fairness is not a single number and that errors have victims changes how you read every story about AI. You are no longer a passive reader of those stories. You're an analyst.

What You Can Now See

When someone says an AI is "accurate," you know to ask: accurate on what measure, for which group, and where do the errors land? Those three follow-up questions put you ahead of most journalists, most policymakers, and a significant number of engineers who built the models being discussed.

Quiz · Lesson 2

Accuracy, Fairness, and What the Numbers Hide
1. ProPublica's 2016 analysis of COMPAS found that the tool was unfair specifically because:
Correct. The key finding was that the model's errors, specifically false positives, were distributed unequally along racial lines.
The issue wasn't overall accuracy. It was that the errors weren't distributed equally. Black defendants who wouldn't have reoffended were far more likely to be wrongly labeled high-risk.
2. A hospital uses an AI model to decide which patients get follow-up care calls. The model is 88% accurate overall. To audit it fairly, you should also check:
Right. An 88% overall accuracy can hide systematic failures for specific patient groups. The auditor's job is to find where those failures cluster.
Overall accuracy statistics can be misleading when errors fall unevenly. The audit needs to break down performance by group to see if some patients are being systematically missed.
3. When researchers proved in 2016 that you can't simultaneously satisfy all fairness definitions when base rates differ between groups, the main implication was:
Exactly. The math forces a choice. And that choice (which groups bear more of the errors) is a values question, not an engineering question. It needs human judgment and accountability.
The impossibility result doesn't mean fairness is hopeless. It means you must consciously choose which type of fairness to prioritize. That's a decision with moral weight, not a technical setting.
4. A company says their hiring algorithm is fair because it gives equally accurate scores to male and female candidates at each score level. A critic says it's unfair because women receive false negatives (missed for positions they'd succeed in) at higher rates. Which statement is most accurate?
Correct. This is the real-world version of the COMPAS debate. Multiple fairness definitions can be simultaneously valid, which is why choosing between them is a policy decision, not just a math problem.
The COMPAS lesson applies here: both sides can be mathematically right because they're using different fairness definitions. The deeper question is which definition is most appropriate, and that requires human judgment about values.
5. What is a "false positive" in the context of a model predicting which students might fail their final exams?
Correct. A false positive is when the model's prediction triggers an alarm that shouldn't have been triggered. In this case: flagging a student as failing when they would have been fine.
A false positive means the model said something bad would happen, and it didn't. The student was labeled "at risk of failing" when they actually would have passed.

Lab 2 · The Fairness Tribunal

Two definitions of fairness. One model. You have to pick a side and defend it.

Your Role: Fairness Analyst

A university is using an AI model to predict which applicants will succeed academically. The model shows equal score accuracy for applicants from different socioeconomic backgrounds, but students from low-income families receive false rejections (false negatives) at twice the rate of students from wealthier families.

Your peer analyst CASS has a strong opinion about which fairness metric matters. So do you. Have at least 3 exchanges to complete this lab.

Begin by stating your position: which type of fairness failure matters more here, the unequal false negative rate or maintaining consistent accuracy per score level? Give one real-world reason for your choice.
CASS · Fairness Analyst · Lab 2
I'll be upfront: I think score-level accuracy consistency is the right metric here. If the model says 80% confidence for everyone at that level, it's treating people equally by the numbers. But I want to hear you first: which failure do you think is more serious for this university case, and why?
Module 4 · Lesson 3

The Black Box Problem: When You Can't Explain Why

A model that gives you no reason for its decision isn't just inconvenient. It may be a fundamental threat to accountability.
If an AI denied you a loan and couldn't tell you why, would you have any way to appeal?

In November 2019, David Heinemeier Hansson, the programmer who created Ruby on Rails, a tool used by millions of developers worldwide, posted on Twitter about something that had just happened to his wife. She had applied for an Apple Card, a credit card issued by Goldman Sachs with limits set by an algorithm, and received a credit limit one-twentieth the size of his. They filed their taxes jointly. They had the same assets. The algorithm had assigned him far more credit than her, with no explanation offered.

Hansson's tweet spread fast because many people recognized the pattern. Steve Wozniak, co-founder of Apple, joined in to say the same thing had happened to him and his wife. The New York Department of Financial Services launched a formal investigation. Goldman Sachs issued statements saying the algorithm didn't use gender as an input, and that may have been technically true. But an algorithm trained on historical credit data inherits whatever that data carries, including decades of gender-based lending discrimination. A model trained that way can reproduce those patterns without gender ever being explicitly coded in.

Here is the detail that haunted the story: Goldman Sachs couldn't fully explain why the algorithm made the decisions it made. Not to customers. Not entirely even to regulators. The model was a trained neural network, a type of AI where the reasoning is distributed across millions of numerical connections, with no sentence you can point to and say "this is the rule it used."

What Makes a Model a "Black Box"

Some models make decisions through rules you can trace step by step. A simple credit scoring model might say: if income is below $30,000 and debt is above 40% of income, deny the application. You can check that rule, challenge it, and, if you're a regulator, declare it illegal. This is called an interpretable or transparent model.
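The example rule above is simple enough to write out in full. This is a minimal sketch of that one toy rule, with a hypothetical function name, showing what makes it interpretable: the decision comes back together with the exact reason for it.

```python
def credit_decision(income, debt):
    """Toy rule from the text: deny if income < $30,000 and debt > 40% of income.

    Returns the decision plus a plain-language reason, so the rule can be
    inspected and challenged.
    """
    low_income = income < 30_000
    high_debt = debt > 0.40 * income
    if low_income and high_debt:
        return "deny", "income below $30,000 and debt above 40% of income"
    return "approve", "deny rule not triggered"

print(credit_decision(25_000, 12_000))  # ('deny', 'income below $30,000 and debt above 40% of income')
print(credit_decision(25_000, 5_000))   # ('approve', 'deny rule not triggered')
```

Every threshold is visible in the code, so an applicant or regulator can see exactly which condition triggered a denial.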

A neural network, the kind of model that runs many of today's most powerful AI systems, works differently. It learns patterns by adjusting millions of numerical weights during training, and those weights don't translate to human-readable rules. The model ends up knowing something, but it can't say what it knows in words. This is the black box problem: the model produces outputs but cannot produce a human-understandable explanation of how it got there.

For lower-stakes applications, this is inconvenient. For decisions about someone's access to credit, housing, a job, or their freedom, it is a fundamental problem. You cannot appeal a decision you cannot understand. You cannot prove discrimination if you cannot read the discriminatory rule. And a regulator cannot enforce anti-discrimination law against a process that offers no reasoning.

Black Box: A model whose internal decision-making cannot be directly read or explained in human terms, even by its creators. Neural networks are the most common example.
Interpretability: The degree to which a model's decisions can be understood and explained. A decision tree is highly interpretable. A deep neural network is typically not.

LIME, SHAP, and Trying to Open the Box

Researchers haven't given up on interpretability. Since around 2016, a field called explainable AI (XAI) has grown rapidly, developing tools that try to estimate why a black box model made a particular decision, even when you can't read the model directly.

Two of the most widely used tools are LIME (Local Interpretable Model-agnostic Explanations, developed by Marco Tulio Ribeiro and colleagues in 2016) and SHAP (SHapley Additive exPlanations, from game theory, adapted for ML by Scott Lundberg in 2017). Both work roughly the same way: they probe the model with many slightly altered versions of a specific input and observe how the output changes. From those experiments, they infer which features mattered most for that one decision.

These tools are genuinely useful. They've helped researchers discover that a medical imaging AI was using the metal tag on X-ray films as a feature (because tags appeared more often in certain types of images), that a model predicting hospital readmission was heavily influenced by a patient's birth month in ways nobody could explain, and dozens of other strange hidden patterns. But they are estimates, not the actual reasoning. They tell you what probably mattered, not what definitely did. And for legal accountability, "probably" may not be enough.
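The probing idea can be shown with a toy example. The sketch below is far simpler than LIME or SHAP and does not use either library: it nudges one input feature at a time and records how much the output moves. The `black_box` function is a made-up stand-in for a model the auditor cannot read, and the feature names and weights are assumptions for illustration.

```python
def black_box(applicant):
    """Stand-in for an opaque model; assume the auditor cannot see this code."""
    score = 0.5 + 0.00001 * applicant["income"] - 0.02 * applicant["late_payments"]
    return max(0.0, min(1.0, score))

def probe(model, applicant, step=0.10):
    """Nudge each feature up and down by `step` and record how far the output moves."""
    sensitivity = {}
    for name, value in applicant.items():
        up = dict(applicant, **{name: value * (1 + step)})
        down = dict(applicant, **{name: value * (1 - step)})
        sensitivity[name] = abs(model(up) - model(down))
    return sensitivity

applicant = {"income": 20_000, "late_payments": 5, "birth_month": 7}
result = probe(black_box, applicant)

# Rank features from most to least influential for this one applicant.
for name, change in sorted(result.items(), key=lambda item: -item[1]):
    print(name, round(change, 3))
```

For this applicant the probe ranks income first, late payments second, and birth month last with zero effect. Like the real tools, it reports an estimate of what mattered for one decision, inferred from outside the model, not the model's actual reasoning.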

The Ethical Question

In the United States and European Union, there are now legal debates about whether people have a "right to explanation" when an AI makes a decision about them. Should someone denied a loan, a job, or housing be entitled to a plain-language reason? If the model genuinely can't provide one, should that model be allowed to make those decisions at all?

The Accuracy-Interpretability Tradeoff

Here is a tension that sits at the center of AI deployment right now: in many domains, more complex models (neural networks with millions of parameters) are often more accurate than simpler, interpretable models like decision trees. A neural network diagnosing cancer from a scan might be 5% more accurate than any interpretable alternative. That 5% represents real lives saved.

So the question isn't "black box bad, transparent good." The question is how high the accuracy gain needs to be to justify losing the ability to explain individual decisions. In low-stakes situations (recommending a movie, filtering spam), the answer is easy: use whatever works best. In high-stakes situations (criminal sentencing, medical diagnosis, credit decisions, hiring), the calculus is much harder, and reasonable people disagree.

What you can now do that most people cannot: name the tradeoff explicitly. When you see a powerful AI deployed in a sensitive context, you can ask: how interpretable is this? What do we lose by not being able to explain its decisions? Who bears that cost? Those questions used to belong only to researchers and policymakers. They now belong to you.

Institutional Reality

The EU's AI Act, proposed in 2021 and formally adopted in 2024, classifies certain AI systems as "high-risk" and requires them to meet transparency standards before deployment. In the US, the CFPB (Consumer Financial Protection Bureau) has issued guidance requiring lenders to provide specific, accurate reasons for credit denials, even when an algorithm was involved. These are real policy decisions, shaped by exactly the tension this lesson describes.

Quiz · Lesson 3

The Black Box Problem
1. The Apple Card controversy in 2019 was significant for AI auditing because:
Correct. The case illustrated how a model can reproduce discrimination through training data, without explicitly encoding a protected characteristic, and how black box models make this nearly impossible to investigate.
Goldman Sachs stated they didn't use gender as an input. The problem was that the model learned discriminatory patterns from historical data, which embedded gender bias indirectly, and the company couldn't fully explain the model's reasoning.
2. Why is a "black box" model a problem specifically for legal accountability?
Exactly. Legal accountability depends on being able to trace reasoning. If a model's reasoning is opaque, you cannot find the discriminatory rule to challenge it, even if discrimination is clearly happening in the outcomes.
The accountability problem is specific: you need to trace the reasoning to prove discrimination or challenge a decision. Without interpretability, that chain of evidence is broken.
3. LIME and SHAP tools work by:
Correct. Both are probing strategies: they treat the model as a black box and infer likely reasons from patterns in how outputs change when inputs change. They estimate; they don't reveal directly.
LIME and SHAP can't read model weights directly or replace the model. They work from the outside, testing variations of inputs and watching how outputs change, generating estimates of which features probably influenced a specific decision.
4. A medical AI that diagnoses tumors is 6% more accurate than any interpretable model alternative, but cannot explain its decisions. A hospital is deciding whether to deploy it. What is the most precise framing of the tradeoff they face?
Correct. The tradeoff isn't just "better vs. worse". It's about what we give up when we choose the more powerful but opaque model. In medicine, the inability to explain a diagnosis has real consequences for patients, clinicians, and legal liability.
The accuracy-interpretability tradeoff is specifically about what is lost when we choose more complex models. In high-stakes medical contexts, losing the ability to explain a decision matters legally, ethically, and practically.
5. A researcher discovers that a hospital readmission prediction model is heavily influenced by patients' birth months, a factor that has no medical meaning. This kind of discovery is most likely made using:
Right. This is exactly the kind of surprising discovery that LIME, SHAP, and similar tools can surface. The model learned a spurious correlation from training data, and explainability tools revealed it by analyzing feature importance for specific predictions.
Explainability tools are specifically designed to surface which features influenced a prediction, even unexpected ones. Checking source code wouldn't reveal what the model learned; it would only show what was fed in as input.

Lab 3 · The Explanation Demand

Should a model be allowed to make decisions it can't explain? You're being asked to advise a regulator.

Your Role: Policy Advisor

A government agency is considering banning black-box AI models from making parole decisions, which determine whether people stay in prison or go free. An AI company argues their neural network is 12% more accurate at predicting reoffending than any interpretable model, and that more accuracy means fewer crimes and fewer wrongful imprisonments. Civil liberties groups argue that no model should be used if it can't provide a plain-language reason for keeping someone imprisoned.

Your peer advisor REMY has read both sides and wants to think through this with you. Have at least 3 exchanges to complete this lab.

State your advisory position: should black-box models be allowed to inform parole decisions if they're significantly more accurate? What's the strongest argument against your own position?
REMY · Policy Peer Advisor · Lab 3
This one keeps me up at night. Twelve percent more accuracy isn't a small number when we're talking about parole: that's real people either wrongly imprisoned or wrongly released. But the civil liberties argument is serious too: due process exists precisely so people can challenge the reasoning behind decisions that take their freedom. Where do you land, and can you name the strongest counterargument to your own position?
Module 4 · Lesson 4

Delivering Your Verdict: How Auditors Write Findings

An audit that produces no action is just paperwork. A verdict has to say something that people will actually do something about.
If you found serious problems with an AI system, what would you actually do with that information?

In 2020, the City of Amsterdam and the City of Rotterdam commissioned an independent audit of their risk scoring algorithms, the systems used to predict which social welfare recipients might be committing fraud. The audit was conducted by a team that included Frederike Kaltheuner from Privacy International and researchers from the University of Amsterdam. It was one of the first government-commissioned AI audits in Europe to be published in full.

What they found was methodical and damning. The Amsterdam algorithm used features that correlated with being a recent immigrant or coming from certain ethnic backgrounds, without explicitly including those variables. The system was generating higher fraud-risk scores for people who were not committing fraud, at rates that differed by neighborhood in ways that tracked closely with ethnicity. And crucially, the city's own staff using the system often didn't know how the scores were generated. They trusted numbers they couldn't interpret.

The auditors didn't just publish a list of errors. They wrote a formal verdict: structured findings with specific evidence, named risks, and concrete recommendations. They said clearly: this system should not be used in its current form. Amsterdam suspended use of the algorithm. Rotterdam revised theirs. The audit had teeth because the findings were specific enough to act on.

The Structure of a Finding

This is the part of auditing that separates useful work from theoretical concern. Saying "this AI might be biased" is not a finding. A finding, in the audit sense, is a structured statement that contains four things:

  • The observation: What did you actually see in the data, the model's outputs, or the deployment? Be specific. Numbers, examples, groups affected.
  • The criterion: What standard is this being measured against? A law, a stated goal, a fairness definition, an internal policy? Without a criterion, "harm" is just opinion.
  • The effect: What real-world impact does this have? Who is affected and how? Make the stakes concrete.
  • The recommendation: What should change, and who should do it? Vague recommendations produce vague responses. Good audit findings say specifically what needs to happen.

The Amsterdam audit worked because it followed this structure. It didn't just say the algorithm was unfair in the abstract; it named which features were problematic, cited the legal standards being violated under Dutch anti-discrimination law, identified which neighborhoods' residents were most affected, and recommended suspension pending redesign. Each of those four components was present.

Audit Finding: A structured statement of a discovered problem: what was observed, what standard it violates, who is affected, and what should change. Not a complaint, but a documented, actionable claim.
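The four-part structure can be treated as a checklist. The sketch below is only an illustration: the `AuditFinding` class and its `missing_parts` method are hypothetical names, not part of any auditing standard or library, and the example wording for the effect and recommendation is invented. It shows why "this AI might be biased" fails as a finding: three of the four parts are empty.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class AuditFinding:
    """Hypothetical container for the four parts of a finding."""
    observation: str      # what was seen, with numbers and groups affected
    criterion: str        # the law, policy, or fairness definition applied
    effect: str           # who is affected and how
    recommendation: str   # what should change, and who should do it

    def missing_parts(self) -> List[str]:
        """Return the names of any components that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# A vague concern: only an observation, and not a specific one.
vague = AuditFinding(
    observation="This AI might be biased.",
    criterion="",
    effect="",
    recommendation="",
)

# A structured finding (illustrative wording, loosely based on the hiring lab).
structured = AuditFinding(
    observation="False rejection rate for applicants over 45 is 3x the rate for younger applicants.",
    criterion="Company policy: evaluate candidates without regard to age.",
    effect="Qualified older applicants are rejected far more often than younger ones.",
    recommendation="The employer suspends automated rejections for this group pending redesign.",
)

print(vague.missing_parts())       # ['criterion', 'effect', 'recommendation']
print(structured.missing_parts())  # []
```

A check like this only confirms that each part exists. Whether the observation is specific enough, or the recommendation concrete enough, still takes human judgment.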

Verdicts Under Pressure

There's a reality about AI auditing that the textbooks don't emphasize enough: audit findings have opponents. The company or government that built the system being audited rarely welcomes a finding that says "this is broken." They have financial interests, reputational interests, and sometimes genuine belief that the auditor got it wrong. Understanding how to defend a finding is as important as writing it.

The most common challenges auditors face are: the data dispute ("your sample is too small to be statistically meaningful"), the definition dispute ("you're using the wrong definition of fairness"), the comparability challenge ("you're comparing it to the wrong baseline β€” humans do worse"), and the intent defense ("we didn't mean to cause harm, so it's not really discrimination").

None of these challenges are inherently dishonest; some are legitimate scientific disagreements. But auditors need to anticipate them. The Amsterdam team addressed the sample-size objection by documenting that they had access to the complete dataset. They addressed the intent defense by citing European law, which bases discrimination standards on effect, not intent. Defending findings is part of the job.
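When an auditor only has a sample rather than the complete dataset, one common way to prepare for the data dispute is to report a range of uncertainty next to each rate. The sketch below uses the Wilson score interval, a standard statistical technique that this lesson does not itself cover. The counts are invented for illustration, and `wilson_interval` is a helper written for this example.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("need at least one observation")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Same observed rate (60%), two hypothetical sample sizes.
small = wilson_interval(3, 5)
large = wilson_interval(300, 500)

print(f"n=5:   {small[0]:.2f} to {small[1]:.2f}")
print(f"n=500: {large[0]:.2f} to {large[1]:.2f}")
```

The interval for the small sample is much wider than the one for the large sample, which is the substance of the sample-size objection. A finding that reports the range up front is harder to dismiss on those grounds.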

This is also why audits need to be conducted independently, by people who are not employed by the organization being audited and have no financial stake in the outcome. When an AI company audits its own model, the structure of incentives makes thoroughness genuinely difficult, regardless of individual good intentions.

The Ethical Question

Most AI audits are still voluntary: companies choose whether to submit to one. Should AI systems used in high-stakes decisions be required by law to undergo independent audits before deployment, the way drugs are required to pass clinical trials? If yes, who pays for the audits? If no, what incentive does any company have to find problems in its own products?

What You're Now Equipped to Do

You've covered the full arc of an AI audit. You can identify why auditing matters and what it checks. You can read accuracy claims critically and ask where the errors land. You understand the black box problem and why interpretability is a legal and ethical issue, not just a technical preference. And you know what a structured finding looks like and how to defend it.

That combination of knowledge is not common. Journalists covering AI often don't have all of it. Many policymakers don't. Engineers who build AI systems are frequently expert in building but less equipped to critique from outside the system. The ability to step back, define what "fair" and "accountable" mean in a specific context, gather evidence, and write a finding that names what should change is a skill that matters right now, in real institutions making real decisions.

The AI systems that shape lives (who gets a loan, who gets a job interview, whose neighborhood gets more policing, who gets bail) are being built and deployed faster than oversight is developing. The gap between the speed of deployment and the depth of scrutiny is where the most consequential decisions get made quietly. You now know how to look at those decisions. That knowledge is only useful if you use it.

Where You Stand Now

You can audit a model. Not in the sense of running the full statistical analysis; that takes specialized tools. But in the sense of knowing what questions to ask, what evidence to look for, what standards to apply, and what a defensible finding looks like. Most people who interact with AI systems every day (as users, as affected citizens, as future employees) don't have that framework. You do.

Quiz: Lesson 4

Delivering Your Verdict
1. The Amsterdam fraud-risk algorithm audit in 2020 was significant because:
Correct. The audit's significance was both its transparency (published in full) and its actionability: the findings were concrete enough to force real changes in how the city used the system.
The audit's importance was about rigor and transparency. Published findings, specific evidence, and clear recommendations made it possible for the city to act, and it did, suspending the algorithm.
2. Which of the following is a properly structured audit finding?
Correct. That option contains all four components: a specific observation with numbers, a named criterion (the stated policy), a concrete effect, and a specific recommendation with a timeline.
A properly structured finding includes all four elements: a specific observation (with data), the criterion being violated, the real-world effect, and a concrete recommendation. Vague concerns are not audit findings.
3. A company responds to an audit finding by saying: "We never intended to discriminate, so this isn't really a discrimination issue." Under European law (as applied in the Amsterdam case), why is this response likely insufficient?
Correct. Effect-based discrimination law means that if the outcome discriminates against a protected group, the company's intentions don't eliminate the legal problem. The harm exists regardless of whether it was meant.
The intent defense fails in effect-based legal frameworks. In European law, what matters for discrimination is what the system does to different groups, not what the engineers meant to do when they built it.
4. Why do audit findings need to be made by people independent of the organization being audited?
Correct. This isn't about distrust of individuals; it's about incentive structures. Even well-intentioned internal auditors face pressures that external auditors don't. Independence is structural protection against those pressures.
The independence requirement is about incentives, not technical ability. When your company's success depends on the audit finding no problems, finding problems becomes structurally harder, regardless of individual honesty or skill.
5. A local government is deploying an AI to allocate school resources. A parent advocacy group wants to audit it but the city says audits should only be done by the company that built the system. Based on what you've learned, what is the most compelling argument against the city's position?
Exactly. The argument isn't about the company's honesty; it's about incentive structures. For decisions that affect all children in a district, independent verification isn't optional; it's what accountability means.
The strongest argument against company-only auditing is structural: financial and reputational incentives make thorough self-criticism difficult. Public institutions making high-stakes decisions need independent accountability mechanisms.

Lab 4: Write Your Verdict

You've completed the audit. Now you have to deliver a finding that will actually change something.

Your Role: Lead Auditor, Final Report

You've just finished auditing a hiring algorithm used by a large employer. Your data shows: overall accuracy is 81%, but for applicants over age 45, accuracy drops to 59% and false rejection rates are 3x higher than for younger applicants. The company's stated hiring policy is to evaluate candidates without regard to age. Age discrimination in hiring is illegal under the Age Discrimination in Employment Act of 1967 (US) and equivalent laws in most countries.
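Figures like these come from splitting the model's decisions by group and counting where the errors land. The sketch below shows the arithmetic on a tiny invented dataset, not the lab's real data. The record layout and the `group_metrics` function are assumptions made for illustration, and "false rejection rate" is taken here to mean the share of qualified applicants the model rejected.

```python
from collections import defaultdict

# Each record: (age_group, model_decision, was_actually_qualified).
# Invented toy data chosen so the disparity is easy to see.
records = (
    [("under_45", "accept", True)] * 4
    + [("under_45", "reject", True)] * 1
    + [("under_45", "reject", False)] * 4
    + [("under_45", "accept", False)] * 1
    + [("over_45", "accept", True)] * 2
    + [("over_45", "reject", True)] * 3
    + [("over_45", "reject", False)] * 4
    + [("over_45", "accept", False)] * 1
)


def group_metrics(rows):
    """Compute accuracy and false rejection rate separately for each group."""
    counts = defaultdict(lambda: {"total": 0, "correct": 0, "qualified": 0, "false_reject": 0})
    for group, decision, qualified in rows:
        c = counts[group]
        accepted = decision == "accept"
        c["total"] += 1
        if accepted == qualified:
            c["correct"] += 1          # model agreed with the true outcome
        if qualified:
            c["qualified"] += 1
            if not accepted:
                c["false_reject"] += 1  # qualified applicant was rejected
    return {
        group: {
            "accuracy": c["correct"] / c["total"],
            "false_rejection_rate": (
                c["false_reject"] / c["qualified"] if c["qualified"] else None
            ),
        }
        for group, c in counts.items()
    }


metrics = group_metrics(records)
ratio = (
    metrics["over_45"]["false_rejection_rate"]
    / metrics["under_45"]["false_rejection_rate"]
)

for group, m in metrics.items():
    print(f"{group}: accuracy {m['accuracy']:.0%}, "
          f"false rejection rate {m['false_rejection_rate']:.0%}")
print(f"false rejection ratio (over_45 / under_45): {ratio:.1f}x")
```

In this toy data the overall accuracy is 70%, which hides the fact that the two groups sit at 80% and 60%. The per-group breakdown is what turns a headline number into an observation you can put in a finding.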

Your peer auditor VALE needs to review your draft finding before it goes out. Present your finding and be ready to defend it against challenges. Have at least 3 exchanges to complete this lab.

Draft your audit finding using the four-part structure: observation, criterion, effect, recommendation. Then present it to VALE for critique.
VALE: Senior Auditor, Peer Review Lab 4
Okay, I've seen the data. I'll be honest: I think the finding is there, but I want to see how you structure it before I sign off. The company is going to push back hard on the sample size for the 45+ group, and they'll argue that age isn't a protected category in the same way race is under their internal policies. You need to be ready for both. Show me your draft finding, all four components, and I'll tell you where it's vulnerable.

Module 4: Module Test

Audit a Real Model: Your Verdict Matters · 15 questions · Pass at 80%
1. The 2020 UK A-level grading algorithm assigned lower grades to individual students primarily by:
Correct. School-level history was used to modify individual predictions, embedding institutional disadvantage into personal outcomes.
The algorithm used school-level historical data to adjust individual student grades, meaning a strong student at a historically weak school was pulled down by the school's track record.
2. Which of these best describes "automation bias"?
Right. Automation bias describes a human tendency, not a model behavior: we trust algorithmic outputs more than we should, even when they're wrong.
Automation bias is about how humans respond to algorithmic outputs: over-trusting them because they feel scientific and official, even when they're incorrect.
3. In the COMPAS case, ProPublica found that the model produced unfair outcomes even though its overall accuracy was reasonable. The specific problem was:
Correct. The unequal distribution of false positives was the finding: the errors weren't random; they fell disproportionately on one racial group.
Race wasn't a direct input. The problem was in the error distribution: false positives (predicting high risk when someone wouldn't reoffend) fell on Black defendants at far higher rates.
4. The mathematical proof by Kleinberg, Mullainathan, and Raghavan (2016) showed that when base rates differ between groups, you:
Correct. The impossibility result means choosing a fairness definition is a forced moral choice, not a technical default: someone has to decide which type of unfairness is more acceptable.
The proof showed a genuine mathematical impossibility when base rates differ: you can't satisfy all fairness definitions at once. The choice between them is a values decision, not a technical one.
5. A content moderation AI is 91% accurate overall, but removes posts by users writing in African American Vernacular English at twice the rate of posts in standard American English with similar content. An auditor's most important next step is:
Right. The disparity is a finding that needs all four components: specific observation, criterion, effect, recommendation. Asking the engineers to explain is part of the investigation, not the finding itself.
Discovering a disparity is not the same as writing a finding. The auditor's job is to structure the finding formally (observation, criterion, effect, recommendation) so it can be acted on.
6. What makes a neural network a "black box" model?
Correct. The "black box" problem isn't about secrecy; it's about the fundamental structure of how the model works. The reasoning is distributed and numerical, not readable.
The black box problem is structural, not about legal secrecy. Neural networks learn through numerical weight adjustments across millions of connections, and those weights don't translate into human-understandable rules.
7. LIME and SHAP tools help with model interpretability by:
Right. Both tools work by probing, not by reading the model directly. They produce estimates of feature importance, not definitive explanations.
LIME and SHAP are probing tools. They can't read model internals directly; they test variations of inputs and observe output changes to infer which features probably mattered most.
8. In the Apple Card case, Goldman Sachs claimed the algorithm didn't use gender as an input. Why was this response insufficient to address the bias concern?
Correct. Proxy discrimination is the mechanism: other features (like purchasing patterns or location) correlate with gender in the training data, so excluding gender explicitly doesn't eliminate gender-correlated outcomes.
This is the proxy problem: historical data contains correlations between gender and other features. A model trained on that data learns those correlations, producing gender-patterned outputs even without directly including gender.
9. A well-structured audit finding includes which four components?
Correct. Observation, criterion, effect, recommendation: those four make a finding actionable rather than just descriptive.
The four components of a proper audit finding are: the observation (specific data), the criterion (what standard is being violated), the effect (who is harmed and how), and the recommendation (what should change).
10. The Amsterdam fraud-risk algorithm audit resulted in suspension of the algorithm mainly because:
Correct. The audit worked because it was actionable: specific enough for decision-makers to evaluate and respond to with concrete policy changes.
The audit's power came from its structure: specific observations, evidence, and recommendations gave city officials something concrete to act on, rather than vague concerns they could dismiss.
11. Why is independent auditing particularly important for AI systems used in public decisions (credit, policing, school resources)?
Correct. This is fundamentally an incentive structure argument. Independence removes the conflict of interest that makes internal auditing unreliable in high-stakes contexts.
The argument for independence isn't about technical skill; it's about incentives. An auditor with no stake in the outcome faces none of the pressures that make finding problems uncomfortable or career-risky for insiders.
12. An employer's AI hiring system has 88% overall accuracy. For applicants over 50, accuracy is 61% and false rejection rates are 2.8x higher. The company responds: "We didn't program age discrimination." Based on what you've learned, how should an auditor respond?
Correct. Disparate impact doctrine in employment law focuses on outcomes: if the system produces discriminatory effects, intent is not a defense. The 2.8x disparity is the finding.
In employment discrimination law, as in the Amsterdam case, effects matter, not just intentions. A 2.8x disparity in false rejection rates for older applicants is a disparate impact finding regardless of what the engineers meant to build.
13. A journalist reports that a new medical AI has "95% accuracy." What is the single most important follow-up question an informed reader should ask?
Exactly. The COMPAS lesson applies everywhere: an overall accuracy number can hide serious disparities for specific groups. Always ask "accurate for whom."
Overall accuracy figures are incomplete without knowing how that accuracy breaks down across patient groups. A model that's 98% accurate for one demographic and 70% accurate for another looks fine in the headline.
14. Which of the five audit dimensions (accuracy, fairness, transparency, robustness, accountability) is most directly at stake in the black box problem?
Correct. The black box problem is fundamentally a transparency failure, and because you can't explain decisions you can't interpret, accountability follows close behind.
Black box models aren't necessarily less accurate or less fair, but they are definitionally less transparent. And when you can't explain a decision, you also can't properly hold anyone accountable for it.
15. A school district is using an AI to identify students who need additional academic support. An auditor finds that students from non-English-speaking households are flagged for support at 4x the rate of English-speaking households, controlling for actual academic performance. The most complete audit finding would:
Correct. Even if "more support" sounds positive, a 4x disparity not explained by actual performance is a fairness finding: it could represent misidentification, stigmatization, or proxy discrimination. The full four-part structure gives decision-makers what they need to act.
A disparity not explained by actual academic differences is a finding regardless of whether the intervention sounds positive. The four-part structure (observation with data, criterion, effect, recommendation) is what makes an audit actionable rather than just descriptive.