In August 2020, about 280,000 students across England sat waiting for their A-level exam results, the grades that determined whether they'd get into university. Because COVID-19 had cancelled the actual exams, the UK government handed the job of assigning grades to an algorithm built by Ofqual, England's exams regulator.
The algorithm looked at the historical performance of each student's school and adjusted predicted grades downward if that school had a history of overestimating results. On paper, that sounds reasonable. In practice, it meant that a student at a low-performing school in a low-income area could score brilliantly in mock exams and still receive a grade two levels below what their teacher expected, because the school's past record dragged the number down.
Students from private schools fared better, because those schools had smaller class sizes and more stable track records. The algorithm, which no student had ever seen or been told about, had baked in existing inequality and called it math. Within days, the UK government reversed the decision under public pressure. But nobody had audited the algorithm before it went live. Nobody had checked what it would actually do to real people.
An audit is a careful, systematic check of whether something is working the way it's supposed to, and whether it's causing harm it shouldn't cause. You've probably heard of financial audits, where accountants check whether a company's money records are accurate. An AI audit does something similar, but instead of checking numbers in a ledger, it checks the behavior of a model.
Specifically, an AI audit asks three questions: What did the model learn? Who does it hurt? And does it do what it claims to do? Those sound simple, but answering them is surprisingly hard, and the people who build AI systems are often the last ones to spot the problems.
That's why independent auditing matters. The Ofqual algorithm was built by people who thought they were being fair. They weren't trying to disadvantage students from poorer schools. But because nobody outside their team was systematically checking the outcomes on different groups, the bias went undetected until it was too late.
Here is something that took researchers years to fully accept: a model can be built with good intentions and still produce deeply unfair outcomes. The Ofqual algorithm wasn't designed to punish poor kids. Its designers genuinely believed they were correcting for grade inflation. But the data they used (historical school performance) carried decades of unequal funding, unequal resources, and unequal opportunity embedded inside it. The model learned from that data. Then it reproduced those inequalities, with mathematical confidence.
A related problem is automation bias: the tendency for people to trust a computer's output more than a human's judgment, even when the computer is wrong. Because the number came from an algorithm, it felt official, certain, objective. Teachers who knew their students were overruled by a score.
An auditor's job is specifically to resist automation bias. An auditor looks at the output and asks: Wait. Does this actually make sense? Who got hurt here, and why?
When an algorithm causes harm that its designers didn't intend, who is responsible? The engineers who built it? The government that deployed it? The regulators who approved it? Or is "no one" a real answer, and if so, what does that mean for the people who were hurt?
In professional AI audits, which now happen at companies, governments, and universities, auditors look for a specific set of warning signs. You can think of them as the five things every audit checks:
The Ofqual algorithm failed on at least three of these. It was inaccurate for many individual students, unfair across economic groups, and there was no quick accountability mechanism until public outrage forced one.
You now understand something most adults who hear news stories about AI still miss: there's a difference between a model that looks correct and a model that has been audited. Looking correct just means it produces confident-sounding numbers. Being audited means someone checked whether those numbers are actually right, and for whom.
From now on, whenever you hear that "an algorithm decided" something (about college admissions, loan applications, parole, or job hiring), you have a framework. You can ask: Was this audited? For what kind of fairness? By whom? Those questions put you ahead of most people in the room.
A city government is using an AI model to decide which neighborhoods get increased police patrols. The model was trained on five years of arrest data. Officials say it's 79% accurate. Your job is to audit it, and you need to decide what questions matter most before you even look at the data.
Your lab partner SABLE is a peer auditor, not a teacher. She'll push back on weak reasoning and ask you to justify your positions. Have at least 3 exchanges to complete this lab.
In 2016, the investigative newsroom ProPublica published an analysis that changed how the world thought about AI and criminal justice. They examined a tool called COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), which was being used by judges in Broward County, Florida to predict whether defendants would go on to commit another crime.
COMPAS gave each defendant a risk score from 1 to 10. Judges used those scores to help decide who got bail and who sat in jail. The tool's creator, a company called Northpointe, said the model was accurate, and by one measure, they were right. Overall, it predicted recidivism (re-offending) with around 65 to 70% accuracy.
But ProPublica's analysts, led by Jeff Larson and Julia Angwin, broke the numbers down by race. And what they found was this: Black defendants who did not reoffend were nearly twice as likely to be falsely labeled high-risk as white defendants who did not reoffend. Meanwhile, white defendants who did reoffend were more likely to be labeled low-risk and released. The model was about equally accurate for both groups overall. It was deeply unequal in how its errors fell.
This is one of the most important ideas in all of AI ethics, and it comes down to something called error distribution. When a model makes mistakes (and every model makes mistakes), the question is: whose lives do those mistakes disrupt?
There are two types of errors that matter in a model like COMPAS. A false positive is when the model predicts something bad will happen but it doesn't, like labeling a defendant "high risk" when they actually wouldn't reoffend. A false negative is when the model predicts nothing bad will happen, but it does, like labeling someone "low risk" when they actually do reoffend.
In the COMPAS case, false positives fell disproportionately on Black defendants: people who were treated as dangerous when they weren't. False negatives fell disproportionately on white defendants: people the system assumed were safe when they weren't. The model's overall accuracy looked fine. But the errors were not distributed equally. And in a system where errors mean people sit in jail or walk free, that asymmetry is not a technical detail. It's a civil rights issue.
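To see how two groups can share the same accuracy while carrying very different kinds of errors, here is a minimal Python sketch. The counts are made up for illustration and are not the COMPAS data; the function name `error_rates` and the two groups are hypothetical.

```python
from collections import Counter

def error_rates(records):
    """Return accuracy, false positive rate, and false negative rate.

    Each record is a pair (predicted_high_risk, actually_reoffended).
    """
    counts = Counter(records)
    tp = counts[(True, True)]    # labeled high risk, did reoffend
    fp = counts[(True, False)]   # labeled high risk, did NOT reoffend
    fn = counts[(False, True)]   # labeled low risk, did reoffend
    tn = counts[(False, False)]  # labeled low risk, did not reoffend
    return {
        "accuracy": (tp + tn) / len(records),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

# Invented counts for two hypothetical groups of 100 people each.
group_a = ([(True, True)] * 40 + [(True, False)] * 20
           + [(False, True)] * 10 + [(False, False)] * 30)
group_b = ([(True, True)] * 20 + [(True, False)] * 10
           + [(False, True)] * 20 + [(False, False)] * 50)

rates_a = error_rates(group_a)
rates_b = error_rates(group_b)

for name, r in (("Group A", rates_a), ("Group B", rates_b)):
    print(f"{name}: accuracy={r['accuracy']:.2f} "
          f"FPR={r['false_positive_rate']:.2f} "
          f"FNR={r['false_negative_rate']:.2f}")
# Group A: accuracy=0.70 FPR=0.40 FNR=0.20
# Group B: accuracy=0.70 FPR=0.17 FNR=0.50
```

Both groups get 70% accuracy, yet Group A absorbs most of the false positives and Group B most of the false negatives. A single accuracy figure would hide that completely.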
Here's where it gets genuinely complicated, and this is the part that even professional statisticians sometimes argue about. After ProPublica's report, Northpointe responded with their own analysis. They said COMPAS was fair because it produced equally accurate scores for Black and white defendants at each risk level. If it said someone was 7-out-of-10 risk, that score meant the same thing regardless of race.
Mathematically, both claims were correct. ProPublica's measure of fairness and Northpointe's measure of fairness are real, recognized definitions. And in 2016, three researchers (Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan) proved something that should have been front-page news everywhere: when the underlying rate of an outcome differs between groups, you mathematically cannot satisfy all of these definitions of fairness at the same time, except in unrealistic special cases such as a model that never makes a mistake.
Because arrest rates differed between Black and white defendants (partly due to policing patterns, not just actual crime rates), any model trained on that data faced an impossible tradeoff. You could make the false positive rates equal, or you could make the accuracy-per-score-level equal. You could not do both. Someone was going to be treated worse no matter what.
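The tradeoff can be checked with simple arithmetic. The sketch below is a simplified two-label illustration (high risk versus low risk), not the researchers' actual proof. It uses an identity that follows directly from the definitions of the error rates; the function name and all numbers are hypothetical.

```python
def implied_false_positive_rate(base_rate, ppv, fnr):
    """False positive rate forced by the other three quantities.

    base_rate: share of the group that actually reoffends
    ppv:       share of "high risk" labels that turn out to be right
    fnr:       share of actual reoffenders labeled "low risk"

    From the confusion-matrix definitions:
        FPR = base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)

# Suppose the score "means the same thing" in both groups:
ppv = 0.6  # a "high risk" label is right 60% of the time for everyone
fnr = 0.3  # 30% of reoffenders are missed for everyone

# Only the underlying rates differ (invented values).
fpr_high = implied_false_positive_rate(base_rate=0.5, ppv=ppv, fnr=fnr)
fpr_low = implied_false_positive_rate(base_rate=0.3, ppv=ppv, fnr=fnr)

print(f"Group with 50% base rate: FPR = {fpr_high:.2f}")  # 0.47
print(f"Group with 30% base rate: FPR = {fpr_low:.2f}")   # 0.20
```

With the score equally trustworthy for both groups, the false positive rates are forced apart by the base rates alone. To pull them back together, you would have to give up something else.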
This doesn't mean fairness is hopeless. It means choosing a fairness definition is itself a moral and political decision, not a technical one. And right now, in most places, that decision is being made quietly by engineers and companies, without public input.
If you genuinely cannot satisfy all definitions of fairness at once, who should decide which definition a model uses, especially when lives are at stake? Should it be the company that built it? The judge who uses it? A government body? The people most affected by it? There's no clean answer here.
Almost every news story about AI accuracy reports a single number. "The model is 94% accurate." "The system detects fraud with 87% precision." These numbers are real, and almost always incomplete. A trained auditor reads them and immediately asks: Accurate for whom? What happens in the 6% or 13% of cases where it fails? Do those failures fall evenly, or do they pile up on the same group of people?
The COMPAS story is now taught in law schools, computer science programs, and statistics departments worldwide. But most of the judges who used COMPAS scores between 2010 and 2016 were never told to ask those questions. They trusted the number because it was scientific. That's automation bias at institutional scale.
Knowing that fairness is not a single number and that errors have victims changes how you read every story about AI. You are no longer a passive reader of those stories. You're an analyst.
When someone says an AI is "accurate," you know to ask: accurate on what measure, for which group, and where do the errors land? Those three follow-up questions put you ahead of most journalists, most policymakers, and a significant number of engineers who built the models being discussed.
A university is using an AI model to predict which applicants will succeed academically. The model shows equal score accuracy for applicants from different socioeconomic backgrounds, but students from low-income families receive false rejections (false negatives) at twice the rate of students from wealthier families.
Your peer analyst CASS has a strong opinion about which fairness metric matters. So do you. Have at least 3 exchanges to complete this lab.
In November 2019, David Heinemeier Hansson, the programmer who created Ruby on Rails, a tool used by millions of developers worldwide, posted on Twitter about something that had just happened to his wife. She had applied for an Apple Card, a credit card issued by Goldman Sachs using an algorithm, and received a credit limit one-twentieth the size of his. They filed their taxes jointly. They had the same assets. The algorithm had assigned him far more credit than her, with no explanation offered.
Hansson's tweet spread fast because many people recognized the pattern. Steve Wozniak, co-founder of Apple, joined in to say the same thing had happened to him and his wife. The New York Department of Financial Services launched a formal investigation. Goldman Sachs issued statements saying the algorithm didn't use gender as an input, and that may have been technically true. But critics pointed out that a model trained on historical credit data, which carries decades of gender-based lending discrimination, can replicate those patterns even when gender is never explicitly coded in.
Here is the detail that haunted the story: Goldman Sachs couldn't fully explain why the algorithm made the decisions it made. Not to customers. Not entirely even to regulators. The model was a trained neural network, a type of AI where the reasoning is distributed across millions of numerical connections, with no sentence you can point to and say "this is the rule it used."
Some models make decisions through rules you can trace step by step. A simple credit scoring model might say: if income is below $30,000 and debt is above 40% of income, deny the application. You can check that rule, challenge it, and, if you're a regulator, declare it illegal. This is called an interpretable or transparent model.
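Written as code, the example rule above fits in a few lines, which is exactly what makes it checkable. This is a sketch of that one rule only; the function name `denial_reason` is hypothetical, and what happens to applications the rule does not deny is left open because the text doesn't say.

```python
def denial_reason(income, debt):
    """Return a plain-language reason if the rule denies, otherwise None."""
    if income < 30_000 and debt > 0.40 * income:
        return "income below $30,000 and debt above 40% of income"
    return None  # this rule does not deny the application

print(denial_reason(income=25_000, debt=12_000))
print(denial_reason(income=25_000, debt=8_000))
```

Anyone reading this function can see the thresholds, test them against real applicants, and argue about whether they are fair. The reason returned to the applicant is the rule itself.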
A neural network, the kind of model that runs many of today's most powerful AI systems, works differently. It learns patterns by adjusting millions of numerical weights during training, and those weights don't translate to human-readable rules. The model ends up knowing something, but it can't say what it knows in words. This is the black box problem: the model produces outputs but cannot produce a human-understandable explanation of how it got there.
For lower-stakes applications, this is inconvenient. For decisions about someone's access to credit, housing, a job, or their freedom, it is a fundamental problem. You cannot appeal a decision you cannot understand. You cannot prove discrimination if you cannot read the discriminatory rule. And a regulator cannot enforce anti-discrimination law against a process that offers no reasoning.
Researchers haven't given up on interpretability. Since around 2016, a field called explainable AI (XAI) has grown rapidly, developing tools that try to estimate why a black box model made a particular decision, even when you can't read the model directly.
Two of the most widely used tools are LIME (Local Interpretable Model-agnostic Explanations, developed by Marco Tulio Ribeiro and colleagues in 2016) and SHAP (SHapley Additive exPlanations, from game theory, adapted for ML by Scott Lundberg in 2017). Both work roughly the same way: they probe the model with many slightly altered versions of a specific input and observe how the output changes. From those experiments, they infer which features mattered most for that one decision.
These tools are genuinely useful. They've helped researchers discover that a medical imaging AI was using the metal tag on X-ray films as a feature (because tags appeared more often in certain types of images), that a model predicting hospital readmission was heavily influenced by a patient's birth month in ways nobody could explain, and dozens of other strange hidden patterns. But they are estimates, not the actual reasoning. They tell you what probably mattered, not what definitely did. And for legal accountability, "probably" may not be enough.
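The probing idea behind these tools can be sketched in a few lines. To be clear, this is not LIME or SHAP, and it does not use either library: it is a much cruder one-feature-at-a-time probe, shown only to make "nudge the input and watch the output" concrete. The stand-in model, its weights, and the applicant are all invented.

```python
def black_box_score(applicant):
    """Stand-in for an opaque model; pretend you can only call it."""
    return (0.5 * applicant["income"] / 100_000
            - 0.3 * applicant["debt"] / 50_000
            + 0.0 * applicant["age"] / 100)

def probe_sensitivity(model, applicant, nudge=0.10):
    """Estimate each feature's local influence by nudging it up and down."""
    influence = {}
    for feature, value in applicant.items():
        up = dict(applicant, **{feature: value * (1 + nudge)})
        down = dict(applicant, **{feature: value * (1 - nudge)})
        # How much does the score move when only this feature changes?
        influence[feature] = model(up) - model(down)
    return influence

applicant = {"income": 60_000, "debt": 20_000, "age": 40}
influence = probe_sensitivity(black_box_score, applicant)

# Rank features by how strongly they moved the score for this one person.
ranked = sorted(influence, key=lambda f: abs(influence[f]), reverse=True)
print(ranked)  # ['income', 'debt', 'age']
```

Notice that the probe never reads the model's internals. It reports what seemed to matter for this one applicant, which is why the result is an estimate rather than the model's actual reasoning.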
In the United States and European Union, there are now legal debates about whether people have a "right to explanation" when an AI makes a decision about them. Should someone denied a loan, a job, or housing be entitled to a plain-language reason? If the model genuinely can't provide one, should that model be allowed to make those decisions at all?
Here is a tension that sits at the center of AI deployment right now: in many domains, more complex models (neural networks with millions of parameters) are more accurate than simpler, interpretable models like decision trees. A neural network diagnosing cancer from a scan might be 5% more accurate than any interpretable alternative. That 5% could represent real lives saved.
So the question isn't "black box bad, transparent good." The question is how high the accuracy gain needs to be to justify losing the ability to explain individual decisions. In low-stakes situations (recommending a movie, filtering spam), the answer is easy: use whatever works best. In high-stakes situations (criminal sentencing, medical diagnosis, credit decisions, hiring), the calculus is much harder, and reasonable people disagree.
What you can now do that most people cannot: name the tradeoff explicitly. When you see a powerful AI deployed in a sensitive context, you can ask: how interpretable is this? What do we lose by not being able to explain its decisions? Who bears that cost? Those questions used to belong only to researchers and policymakers. They now belong to you.
The EU's AI Act, proposed in 2021 and formally adopted in 2024, classifies certain AI systems as "high-risk" and requires them to meet transparency standards before deployment. In the US, the CFPB (Consumer Financial Protection Bureau) has issued guidance requiring lenders to provide specific, accurate reasons for credit denials, even when an algorithm was involved. These are real policy decisions being made right now, shaped by exactly the tension this lesson describes.
A government agency is considering banning black-box AI models from making parole decisions, decisions about whether people stay in prison or go free. An AI company argues their neural network is 12% more accurate at predicting reoffending than any interpretable model, and that more accuracy means fewer crimes and fewer wrongful imprisonments. Civil liberties groups argue that no model should be used if it can't provide a plain-language reason for keeping someone imprisoned.
Your peer advisor REMY has read both sides and wants to think through this with you. Have at least 3 exchanges to complete this lab.
In 2020, the City of Amsterdam and the City of Rotterdam commissioned an independent audit of their risk scoring algorithms, systems used to predict which social welfare recipients might be committing fraud. The audit was conducted by a team that included Frederike Kaltheuner from Privacy International and researchers from the University of Amsterdam. It was one of the first government-commissioned AI audits in Europe to be published in full.
What they found was methodical and damning. The Amsterdam algorithm used features that correlated with being a recent immigrant or coming from certain ethnic backgrounds, without explicitly including those variables. The system was generating higher fraud-risk scores for people who were not committing fraud, at rates that differed by neighborhood in ways that tracked closely with ethnicity. And crucially, the city's own staff using the system often didn't know how the scores were generated. They trusted numbers they couldn't interpret.
The auditors didn't just publish a list of errors. They wrote a formal verdict: structured findings with specific evidence, named risks, and concrete recommendations. They said clearly: this system should not be used in its current form. Amsterdam suspended use of the algorithm. Rotterdam revised theirs. The audit had teeth because the findings were specific enough to act on.
This is the part of auditing that separates useful work from theoretical concern. Saying "this AI might be biased" is not a finding. A finding, in the audit sense, is a structured statement that contains four things:
The Amsterdam audit worked because it followed this structure. It didn't just say the algorithm was unfair in the abstract; it named which features were problematic, cited the legal standards being violated under Dutch anti-discrimination law, identified which neighborhoods' residents were most affected, and recommended suspension pending redesign. Each of those four components was present.
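One way to hold yourself to that structure is to treat a finding as a record with four required fields. The sketch below reads the four components off the Amsterdam example (specific evidence, the standard violated, who is affected, and the recommended action); the class name, field names, and wording of the sample entries are assumptions made for illustration, not an official audit template.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    evidence: str        # what specifically was observed
    standard: str        # the rule or law it is measured against
    affected_group: str  # who bears the harm
    recommendation: str  # what should change

    def is_complete(self):
        """A finding is only usable if none of the four parts is blank."""
        parts = (self.evidence, self.standard,
                 self.affected_group, self.recommendation)
        return all(part.strip() for part in parts)

# Paraphrased from the description of the Amsterdam audit above.
structured = Finding(
    evidence="Features correlated with recent immigration raised fraud-risk scores",
    standard="Dutch anti-discrimination law",
    affected_group="Residents of the most affected neighborhoods",
    recommendation="Suspend the system pending redesign",
)

# "This AI might be biased" fills in almost nothing.
vague = Finding(
    evidence="This AI might be biased",
    standard="",
    affected_group="",
    recommendation="",
)

print(structured.is_complete())  # True
print(vague.is_complete())       # False
```

The check is deliberately simple. Its point is that a worry becomes a finding only when every field has something specific in it.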
There's a reality about AI auditing that the textbooks don't emphasize enough: audit findings have opponents. The company or government that built the system being audited rarely welcomes a finding that says "this is broken." They have financial interests, reputational interests, and sometimes genuine belief that the auditor got it wrong. Understanding how to defend a finding is as important as writing it.
The most common challenges auditors face are: the data dispute ("your sample is too small to be statistically meaningful"), the definition dispute ("you're using the wrong definition of fairness"), the comparability challenge ("you're comparing it to the wrong baseline β humans do worse"), and the intent defense ("we didn't mean to cause harm, so it's not really discrimination").
None of these challenges are inherently dishonest; some are legitimate scientific disagreements. But auditors need to anticipate them. The Amsterdam team addressed the sample-size objection by documenting that they had access to the complete dataset. They addressed the intent defense by citing European law, which bases discrimination standards on effect, not intent. Defending findings is part of the job.
This is also why audits need to be conducted independently, by people who are not employed by the organization being audited and have no financial stake in the outcome. When an AI company audits its own model, the structure of incentives makes thoroughness genuinely difficult, regardless of individual good intentions.
Most AI audits are still voluntary: companies choose whether to submit to one. Should AI systems used in high-stakes decisions be required by law to undergo independent audits before deployment, the way drugs are required to pass clinical trials? If yes, who pays for the audits? If no, what incentive does any company have to find problems in its own products?
You've covered the full arc of an AI audit. You can identify why auditing matters and what it checks. You can read accuracy claims critically and ask where the errors land. You understand the black box problem and why interpretability is a legal and ethical issue, not just a technical preference. And you know what a structured finding looks like and how to defend it.
That combination of knowledge is not common. Journalists covering AI often don't have all of it. Many policymakers don't. Engineers who build AI systems are frequently expert in building but less equipped to critique from outside the system. The ability to step back, define what "fair" and "accountable" mean in a specific context, gather evidence, and write a finding that names what should change is a skill that matters right now, in real institutions making real decisions.
The AI systems that shape lives (who gets a loan, who gets a job interview, whose neighborhood gets more policing, who gets bail) are being built and deployed faster than oversight is developing. The gap between the speed of deployment and the depth of scrutiny is where the most consequential decisions get made quietly. You now know how to look at those decisions. That knowledge is only useful if you use it.
You can audit a model. Not in the sense of running the full statistical analysis; that takes specialized tools. But in the sense of knowing what questions to ask, what evidence to look for, what standards to apply, and what a defensible finding looks like. Most people who interact with AI systems every day (as users, as affected citizens, as future employees) don't have that framework. You do.
You've just finished auditing a hiring algorithm used by a large employer. Your data shows: overall accuracy is 81%, but for applicants over age 45, accuracy drops to 59% and false rejection rates are 3x higher than for younger applicants. The company's stated hiring policy is to evaluate candidates without regard to age. Age discrimination in hiring is illegal in the US under the Age Discrimination in Employment Act of 1967, which protects workers aged 40 and over, and under comparable laws in many other countries.
Your peer auditor VALE needs to review your draft finding before it goes out. Present your finding and be ready to defend it against challenges. Have at least 3 exchanges to complete this lab.