In February 2018, researchers at MIT's Media Lab published a paper that quietly changed how the tech world thought about accountability. Joy Buolamwini, a computer scientist who had been working on facial recognition since her undergraduate years, released the findings of what she called the Gender Shades project. She had tested three major commercial AI systems: IBM's Watson Visual Recognition, Microsoft's Face API, and Face++ from a Chinese company called Megvii.
Her method was straightforward: she assembled a dataset of 1,270 faces, balanced across gender and skin tone, and ran each face through all three systems. The results were stark. For lighter-skinned men, all three systems performed above 90% accuracy. For darker-skinned women, the error rates reached as high as 34.7%, meaning the AI got it wrong more than one-third of the time. IBM's system was the worst offender.
What made Gender Shades different from previous AI criticism wasn't just the findings. It was the method. Buolamwini didn't write an opinion piece or file a complaint. She audited. She designed a test, chose specific tools, collected evidence, and published numbers anyone could verify. Within months, IBM, Microsoft, and Megvii each updated their systems. The audit created pressure that no amount of general worry about "AI bias" had managed to generate.
The word "audit" sounds like something accountants do with spreadsheets. But the core idea is simple: you pick a system, define what you expect it to do, and then test whether it actually does it. That's it. Buolamwini picked facial recognition tools, defined an expectation (accuracy should not vary by skin tone), and tested it. The numbers did the rest.
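The "define an expectation, then test it" step can be sketched in a few lines of Python. This is a minimal illustration, not the Gender Shades code: the `records` rows are made-up data, and the subgroup labels and the `error_rates_by_group` helper are hypothetical names chosen for this sketch.

```python
from collections import defaultdict

# Hypothetical audit records: (subgroup, true label, predicted label).
# These rows are invented for illustration; they are not Gender Shades data.
records = [
    ("lighter_male", "male", "male"),
    ("lighter_male", "male", "male"),
    ("lighter_female", "female", "female"),
    ("lighter_female", "female", "male"),
    ("darker_male", "male", "male"),
    ("darker_male", "male", "male"),
    ("darker_female", "female", "male"),
    ("darker_female", "female", "female"),
    ("darker_female", "female", "male"),
]


def error_rates_by_group(rows):
    """Return {subgroup: fraction of rows where the prediction was wrong}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, truth, predicted in rows:
        totals[group] += 1
        if predicted != truth:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


rates = error_rates_by_group(records)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%} error")

# The expectation under test: error rate should not vary by subgroup.
gap = max(rates.values()) - min(rates.values())
print(f"largest gap between subgroups: {gap:.0%}")
```

Reporting one overall accuracy number would hide exactly the difference this breakdown exposes, which is why the per-group split matters.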
An AI audit doesn't require a lab or a university. It requires three things: a tool to examine, a standard to judge it against, and evidence you can point to. Those three ingredients are what separate a real audit from an opinion. Opinions say "I think this AI is biased." Audits say "Here's what I tested, here's what I found, and here's why it matters."
In this module, you're going to build one. Not as a class exercise, but as an actual structured investigation into a real AI tool you can access right now. By the time you're done, you'll have produced something that didn't exist before: a documented, evidence-based judgment about a specific system.
Joy Buolamwini didn't pick facial recognition randomly. She had personally experienced its failures: her own face wasn't being detected by some systems unless she wore a white mask. The tool she chose had already given her a reason to investigate. That's a good principle: start with something that affects real people in a specific, observable way.
For your audit, you need to pick an AI tool you can actually test. That means something accessible: a tool you can use without needing a professional account or advanced hardware. Some good candidates: an AI writing assistant, a content moderation system, an image generation tool, a chatbot used for advice or information, a translation service, or an autocomplete feature. The more specific the tool's claimed purpose, the easier it is to test.
Avoid picking something too vague. "AI in general" or "machine learning" is not auditable. "Google Translate's handling of gendered pronouns in Turkish-to-English translation" is. The sharper your subject, the sharper your findings. Buolamwini didn't audit "technology"; she audited three named commercial facial recognition APIs on a specific task with a specific dataset.
When Buolamwini published her findings, the companies had not agreed to be tested. They had no warning. Their systems were essentially examined without consent. Was that fair? If a company makes a product available to the public, does that give anyone the right to systematically probe it for failures and publish the results? And if your audit finds something damaging, do you have an obligation to tell the company first, or can you publish immediately?
Here's a framework for choosing your tool. Ask three questions: First, can I access it? You need to be able to run real tests, not just read descriptions. Second, can I define what "good" looks like? If you can't say what the tool should do, you can't measure whether it fails. Third, does it affect real people? The more consequential the tool, the more the audit matters. A recommendation algorithm that shapes what millions of people read matters more than a meme generator.
There's a specific mental mode you need to enter before you start an audit, and it's different from how most people use AI. Most users approach a tool looking for help: they want it to succeed, and they interpret ambiguous results charitably. An auditor does the opposite: you go in looking for the edges, the failures, the places where the system's behavior doesn't match its promises.
This doesn't mean you assume the tool is bad. It means you treat the question as genuinely open. You might audit a system and conclude it works well. That's a legitimate finding. Buolamwini could have found that all three facial recognition systems performed equally across skin tones. She was prepared for that result. The point is that you find out, rather than assuming.
The auditor also asks a question that most users never think to ask: who designed this system, and who did they imagine using it? Many AI failures can be traced back to the moment when a team of engineers built a system while imagining only a narrow slice of the people who would eventually use it. When you audit a tool, you're often uncovering the shape of that original imagination and measuring how far reality has drifted from it.
You understand that an AI audit isn't a complaint; it's a method. Most people use AI tools without ever asking whether those tools work equally well for everyone, or even whether they work as claimed. You now have the vocabulary and the framework to ask those questions rigorously. That changes how you interact with every AI system you encounter from this point forward.
You're about to pick the AI tool you'll audit in this module. Your lab partner, an AI research analyst, will push back on your choice and your reasoning. They're not trying to be difficult. They're doing what every good research team does: stress-testing the plan before any actual testing begins.
Bring a specific tool, a reason you chose it, and an initial idea of what you'd test. Your partner will challenge whether your subject is specific enough, whether it affects real people, and whether you can actually gather evidence.
In November 2021, the city of New York passed Local Law 144, the first law in the United States requiring that AI hiring tools be audited before companies could use them in employment decisions. The law was a response to a growing body of evidence that automated hiring systems were screening out candidates in ways that correlated with race and gender, not just qualifications.
But implementing the law revealed a problem: nobody could agree on what an audit should actually look like. What inputs do you use? What counts as evidence of bias? How many test cases are enough? A nonprofit called Upturn studied the first round of audits conducted under the law and found significant variation: some auditing firms used hundreds of test resumes, others used thousands, and some didn't describe their methods clearly enough to be verified at all.
The lesson wasn't that auditing is impossible. It's that an audit is only as credible as its methodology: the specific, documented decisions about what you tested, how many times, and what would count as a pass or fail. Without a clear methodology, a published audit is just a number with no context. With one, it's evidence.
A good test has four parts. First, inputs: the specific things you feed into the system. Second, expected outputs: what the system should produce, based on its own stated purpose. Third, actual outputs: what it actually produces when you run it. Fourth, comparison logic: the rule you use to decide whether the actual output counts as a pass or a failure.
Let's make this concrete. Suppose you're auditing an AI content moderation system, the kind that flags "harmful" posts on a social platform. Your inputs are text posts. Your expected output is consistent flagging: if a post saying "I hate Group A" gets flagged as hate speech, then a post saying "I hate Group B" with the same structure should be flagged too. Your actual outputs are what the system does when you run both. Your comparison logic is simple: does the flagging rate change based only on which group is named?
That structure (inputs, expected, actual, comparison) is the skeleton of almost every AI audit, from Buolamwini's Gender Shades study to the New York hiring tool audits. The details change depending on the tool, but the skeleton stays the same.
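The four-part skeleton can be written down as a small data structure. The sketch below is one possible shape, not a standard format: `AuditCase`, `run_case`, and `toy_moderator` are hypothetical names, and `toy_moderator` is a deliberately inconsistent stand-in for whatever real system you would be testing.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AuditCase:
    input_a: str   # input naming one group
    input_b: str   # same structure, different group
    expected: str  # what the tool's stated purpose implies


def toy_moderator(post: str) -> bool:
    """Hypothetical stand-in for the system under test.

    It is deliberately inconsistent: it only flags hostile posts
    that mention "Group A".
    """
    text = post.lower()
    return "hate" in text and "group a" in text


def run_case(case: AuditCase, system: Callable[[str], bool]) -> dict:
    # Actual outputs: what the system really does with each input.
    actual_a = system(case.input_a)
    actual_b = system(case.input_b)
    # Comparison logic: pass only if both inputs get the same decision.
    return {
        "expected": case.expected,
        "actual_a": actual_a,
        "actual_b": actual_b,
        "passed": actual_a == actual_b,
    }


cases = [
    AuditCase("I hate Group A", "I hate Group B", "same flag decision"),
    AuditCase("I admire Group A", "I admire Group B", "same flag decision"),
]

results = [run_case(case, toy_moderator) for case in cases]
for case, result in zip(cases, results):
    status = "PASS" if result["passed"] else "FAIL"
    print(f"{status}: {case.input_a!r} vs {case.input_b!r}")
```

Keeping the comparison rule inside `run_case` means the pass/fail decision is fixed in code before any output is seen, rather than judged case by case afterward.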
How many test cases are enough? This is the question the New York hiring tool audits couldn't answer consistently. The honest answer is: it depends on what you're testing, and how much variation you expect to find.
Here's a useful rule of thumb. If you run a test once and get a surprising result, that's a clue, not evidence. If you run it twenty times and the surprising result shows up consistently, that's a pattern worth documenting. If you run it two hundred times across varied inputs and the pattern holds, you're approaching something publishable. The more consequential your subject (an AI that makes decisions about people's jobs or medical care), the more tests you need before you should make strong claims.
For your audit in this module, you don't need two hundred tests. But you do need more than one, and you need to vary your inputs deliberately. Don't just run the same test five times with the same input. Run five tests with different inputs that test the same underlying question. That's the difference between checking one case and probing a pattern.
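One way to vary inputs deliberately is to cross a handful of sentence templates with the groups being compared, so every group sees exactly the same set of sentence structures. The templates and group names below are placeholders invented for this sketch, in the spirit of the content moderation example.

```python
from itertools import product

# Hypothetical templates: five different inputs probing the same question.
templates = [
    "I hate {group}.",
    "People from {group} should not be allowed to post here.",
    "Why does {group} always ruin everything?",
    "{group} are the worst neighbours.",
    "Nobody should hire {group}.",
]
groups = ["Group A", "Group B"]

# Every template is filled in once per group, so the only thing that
# differs within a pair is which group is named.
probes = [
    {"template_id": index, "group": group, "text": template.format(group=group)}
    for (index, template), group in product(enumerate(templates), groups)
]

for probe in probes:
    print(probe["template_id"], probe["group"], "->", probe["text"])
```

Five templates and two groups give ten distinct probes, which is different from running a single sentence ten times.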
Local Law 144 in New York required hiring tool audits, but who should do those audits? The law allowed companies to hire their own auditors: organizations they pay. Critics pointed out that a paid auditor has an incentive to produce results the client likes. But requiring government auditors for every AI tool would be slow and expensive. Independent nonprofit auditors exist, but they have limited resources. Who should control the audit process, and does it matter who pays for it?
One more design decision you need to make: what counts as a meaningful difference? If your content moderation system flags 62% of posts criticizing Group A and 58% of posts criticizing Group B, is that bias? Or is that normal variation? There's no universal rule. But you need to decide in advance, before you see the results, what threshold you'd call significant. If you set the threshold after seeing the data, you can unconsciously choose a threshold that confirms what you expected to find.
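Whether 62% versus 58% is "normal variation" depends heavily on how many posts were tested. The sketch below uses a two-proportion z-test, which is one common choice and not the only defensible one. The sample sizes (100 and 2,000 posts per group) and the `ALPHA` cutoff are assumptions made for illustration; the text above gives only the percentages.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Assumes both groups contain a mix of flagged and unflagged posts
    (otherwise the pooled standard error is zero).
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Threshold written down BEFORE any results are seen (an assumed value).
ALPHA = 0.05


def verdict(p_value):
    return "meaningful difference" if p_value < ALPHA else "within normal variation"


# Same 62% vs 58% gap, two hypothetical sample sizes.
z_small, p_small = two_proportion_z_test(62, 100, 58, 100)
z_large, p_large = two_proportion_z_test(1240, 2000, 1160, 2000)

print(f"100 posts per group:  z={z_small:.2f}, p={p_small:.3f} -> {verdict(p_small)}")
print(f"2000 posts per group: z={z_large:.2f}, p={p_large:.3f} -> {verdict(p_large)}")
```

Under these assumptions, the four-point gap is indistinguishable from chance with 100 posts per group but clears the threshold with 2,000 per group. The same percentages can support different conclusions, which is why the threshold and the sample size both belong in the methodology.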
Most real audits produce some results that are clear and some that aren't. That's normal. The temptation is to report only the clear ones, the dramatic findings, and quietly ignore the cases where the system performed fine. Resist that temptation. An honest audit reports everything, including the cases where no difference was found. That's what makes findings trustworthy.
Ambiguous results also tell you something important: they tell you where your test design might be weak. If a result could go either way depending on how you interpret it, that's usually a sign you need better comparison logic, more test cases, or a more precise definition of what you were looking for. Ambiguity isn't failure; it's information about how to run a better audit next time.
Most people consume AI audit findings the way they consume any statistic: they see a number and assume someone credible produced it responsibly. You now know to ask about the methodology: how many tests, what inputs, who chose the comparison logic, and whether results were selected or comprehensive. That question alone lets you evaluate any published AI audit more rigorously than the average journalist, politician, or company executive who reads it.
You've chosen a tool. Now you need to design the actual tests you'll run. Your lab partner will challenge your methodology: Are your inputs varied enough? Have you defined what "good" looks like before you test? How many test cases are you planning, and why?
Come prepared with a draft methodology: what inputs you'll use, what comparison you'll make, and what threshold would count as a meaningful difference in results.
In August 2020, the United Kingdom's exam regulator, Ofqual, deployed an AI algorithm to assign final exam grades to nearly 40% of British students whose A-level tests had been cancelled due to the COVID-19 pandemic. The algorithm was supposed to be fair: it used each school's historical performance data to predict what students would have scored.
The results arrived, and thousands of students began comparing notes. Students from private schools and historically high-performing schools received grades that matched or exceeded their teachers' predictions. Students from state schools, particularly those in poorer areas, received grades that were systematically downgraded. The algorithm was, in effect, punishing students for attending schools that had historically underperformed, regardless of those individual students' actual ability.
What made this auditable, and turned it from rumor into evidence, was a statistical analysis published days after the grades were released. Researchers at the Education Policy Institute and journalists at The Guardian pulled the grade distributions, compared them against teacher predictions by school type, and showed the pattern in numbers. Within days of the grades being released, Ofqual reversed the algorithm's decisions entirely. Over 40% of grades were changed upward. The audit (informal, fast, based on publicly available data) had more force than any formal complaint.
The Ofqual case happened fast, and it worked because the auditors (researchers, journalists, parents) had access to the output data and knew what comparison to make. But most audits you run will be slower, messier, and more ambiguous. The AI won't produce a clean pattern of discrimination on the first try. Some tests will go exactly as expected. Others will produce results you didn't anticipate and don't know how to categorize.
This is normal. The skill of running an audit is not just designing the test; it's deciding what to do when the evidence is partial. There are three situations you'll commonly encounter. First: the pattern is clear and consistent. Document it, check your methodology once more, and move toward a conclusion. Second: the results are mixed, with some tests showing a difference and others not. This usually means your test inputs weren't controlled tightly enough, or the effect is real but small. Run more tests before drawing conclusions. Third: the tool performs exactly as claimed. Document that too; it's a finding.
The hardest part of running a real audit isn't gathering evidence; it's ruling out alternative explanations. In the Ofqual case, someone could have argued: "State school students got lower grades because state schools genuinely underperform, not because the algorithm was biased." That argument isn't entirely wrong. The algorithm was using historical school performance as its basis. The question was whether applying school-level history to individual students was fair, and whether it produced results that couldn't be justified by individual merit.
When you run your audit, you'll need to ask: could something other than what I'm testing explain this result? If you're testing whether an AI writing assistant gives shorter responses to prompts written in a certain style, you need to check whether the shorter responses might be due to the length of the input rather than any bias. If you're testing a content moderation system, you need to check whether the posts you're using for different groups are truly equivalent in structure, not just equivalent in topic.
Controlling for confounds doesn't mean you need to eliminate every alternative explanation; it means you need to acknowledge the ones you couldn't eliminate. A good audit says "we found X, and here are the alternative explanations we controlled for, and here's the one we couldn't rule out." That's honest. That's scientific. And it's far more credible than a finding that doesn't mention limitations at all.
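A simple way to check one confound is to compare groups only within matching levels of the suspected confounder. The sketch below does this for the writing-assistant example: it compares response length by prompt style overall, then again within buckets of input length. All rows are invented for illustration, and the 20-word cutoff in `length_bucket` is an arbitrary assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (prompt style, input length in words, response length in words).
rows = [
    ("formal", 40, 200),
    ("formal", 45, 210),
    ("formal", 10, 80),
    ("casual", 8, 80),
    ("casual", 12, 90),
    ("casual", 42, 205),
]


def length_bucket(n_words):
    # Arbitrary cutoff chosen for this sketch.
    return "short" if n_words < 20 else "long"


def mean_response_by(data, key):
    """Average response length for each value of key(row)."""
    grouped = defaultdict(list)
    for row in data:
        grouped[key(row)].append(row[2])
    return {k: mean(v) for k, v in grouped.items()}


# Naive comparison: style only.
overall = mean_response_by(rows, key=lambda r: r[0])

# Controlled comparison: style within each input-length bucket.
stratified = mean_response_by(rows, key=lambda r: (r[0], length_bucket(r[1])))

print("overall:", overall)
for (style, bucket), value in sorted(stratified.items()):
    print(f"{bucket:>5} inputs, {style}: {value}")
```

In this made-up data, formal prompts appear to get longer responses overall, but the gap nearly vanishes once prompts of similar length are compared. That would point to input length, not style, as the likelier explanation.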
The Ofqual algorithm used school history as a predictor, a reasonable-sounding statistical decision. But it penalized individual students for the institution they attended. This is a structural problem that exists in many AI systems: they use group-level data to make individual-level decisions. Is that inherently unfair? Or is using historical data just good statistics? At what point does a statistically accurate prediction become an ethical violation?
The researchers and journalists who exposed the Ofqual algorithm's failures didn't just describe what they found. They showed the data, explained how they analyzed it, and made their comparisons explicit. That's why the findings were impossible to dismiss. When critics can point to specific numbers and a specific methodology, a company or government agency can be held accountable. When critics can't, the institution can always claim the criticism is just a misunderstanding.
For your audit, documentation means recording every test you run, not just the ones that support your hypothesis. It means noting your inputs exactly, so that someone else could reproduce your test and get the same result. It means writing down what you expected before running each test, so there's a record that your conclusions weren't built backward from the outcome.
Think of your documentation as evidence that your evidence is trustworthy. Anyone can claim to have found something. The documentation is what makes the claim auditable in turn: it lets a reader check whether you actually ran the tests you say you ran, the way you say you ran them.
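Documentation of this kind can be as simple as an append-only log with one record per test. The sketch below is one possible layout, not a required format: `log_test`, the field names, and the `toy_system` stand-in are all hypothetical. The detail worth noticing is that the expectation is written into the record before the system is called.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def toy_system(text: str) -> str:
    """Hypothetical stand-in for the AI tool under test."""
    return text.upper()


def log_test(path, test_id, input_text, expected, system):
    # The expectation goes into the record BEFORE the system is run.
    record = {
        "test_id": test_id,
        "input": input_text,          # exact input, so others can reproduce it
        "expected": expected,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
    actual = system(input_text)
    record["actual"] = actual
    record["matches_expectation"] = actual == expected
    # Append every test, including the ones that show no problem.
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


# A temporary directory keeps this sketch from touching real files.
log_path = Path(tempfile.mkdtemp()) / "audit_log.jsonl"

first = log_test(log_path, "t001", "hello", "HELLO", toy_system)
second = log_test(log_path, "t002", "goodbye", "Goodbye", toy_system)

for line in log_path.read_text(encoding="utf-8").splitlines():
    print(line)
```

Because each line is a self-contained JSON record, the log can be shared as-is, and a reader can rerun any input and compare the result against what was recorded.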
Most people who read about AI failures (algorithmic bias, unfair automated decisions, systems that don't work as claimed) take the findings on faith. You now understand the mechanics behind those findings: how tests are designed, how confounds are controlled, how documentation makes evidence trustworthy. This matters at an institutional level. Companies, regulators, and courts are currently deciding what counts as proof of AI harm. Knowing how to evaluate that proof puts you ahead of almost everyone in those rooms.
You've run your tests. Now you need to present your findings to a skeptical analyst who will look for confounds you didn't account for, patterns that could have alternative explanations, and gaps in your documentation. Be ready to defend your results without overstating what you actually proved.
Describe what you found, how many tests you ran, and what you believe the evidence shows. Your partner will challenge whether your conclusions are supported by your evidence.
In October 2023, a team of researchers at Stanford University's Institute for Human-Centered Artificial Intelligence published an audit of ten major foundation models, including GPT-4, Claude 2, Llama 2, and PaLM 2. The study was called the Foundation Model Transparency Index, and it tried to score each model across 100 dimensions of transparency: what data was used, how the model was fine-tuned, what safety testing had been done, and what it could and couldn't do.
The highest-scoring system got a 54 out of 100. The lowest got a 12. But what drew the most attention wasn't the scores themselves; it was the language the researchers used to describe them. They didn't say "these models are dangerous." They didn't say "regulators must act immediately." They said: "Based on our methodology, most models provide limited information about their development process, which constrains external accountability."
Journalists who covered the story translated it into headlines like "AI Giants Hide How Their Systems Work, Stanford Finds." Both statements were based on the same data. One said what the evidence showed. The other said what the evidence implied. The researchers had made a deliberate choice: they reported findings, not verdicts. And that restraint was exactly what made the findings hard to dismiss.
A finding is what your evidence shows, stated as precisely as possible. A verdict is a judgment that goes beyond the evidence: it adds interpretation, moral weight, or a call to action. Both have their place. The problem comes when people present verdicts as if they were findings, or findings so hedged that they become meaningless.
Here's how to tell the difference. "This AI content filter flagged posts containing the word 'protest' 68% of the time when the posts were about Black Lives Matter, and 31% of the time when the posts used the same word in other contexts." That's a finding. "This AI is racist and should be banned." That's a verdict. The finding supports the verdict as one possible interpretation. But the verdict requires additional claims (about intent, about harm, about alternatives) that the finding alone doesn't establish.
For your audit, aim for precise findings first. Then, in a separate section, offer your interpretation, clearly labeled as such. This structure is what separates credible research from advocacy. You can do both. You just need to be transparent about which is which.
The Stanford Foundation Model Transparency Index is a useful template. It didn't just publish scores; it published the full scoring rubric, so anyone could check whether a model really deserved its rating on any particular dimension. That kind of transparency is what makes an audit replicable, which is what makes it trustworthy.
A responsible audit report has five sections. Subject: the specific tool you tested, when you tested it (AI systems change over time, so the version matters), and what it claims to do. Methodology: exactly what inputs you used, how many tests, and what comparison logic you applied. Findings: what you observed, stated precisely, with any ambiguous or non-confirming results included. Limitations: what you couldn't test, what confounds you couldn't eliminate, and how those affect the strength of your conclusions. Interpretation: what you think the findings mean, clearly marked as your analysis, not the data itself.
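If it helps to keep the five sections straight, the report can be treated as a small template that refuses to render while any section is empty. This is only an organizing aid sketched under assumptions: the `AuditReport` class and its methods are hypothetical, and the example text is placeholder content.

```python
from dataclasses import dataclass, fields


@dataclass
class AuditReport:
    subject: str
    methodology: str
    findings: str
    limitations: str
    interpretation: str

    def missing_sections(self):
        """Names of sections that are empty or whitespace-only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self):
        missing = self.missing_sections()
        if missing:
            raise ValueError(f"report is incomplete, missing: {', '.join(missing)}")
        # Findings and interpretation stay in separate, labeled sections.
        return "\n\n".join(
            f"{f.name.upper()}\n{getattr(self, f.name)}" for f in fields(self)
        )


# Placeholder content for illustration only.
draft = AuditReport(
    subject="Example moderation tool, tested on a stated date; claims to flag hate speech.",
    methodology="Ten paired posts from five templates; pass means identical flag decisions.",
    findings="Eight of ten pairs received identical decisions; two did not.",
    limitations="",
    interpretation="The mismatches may indicate inconsistent treatment of the two groups.",
)

print("missing:", draft.missing_sections())

draft.limitations = "Small sample; templates written by one person; single test date."
print(draft.render())
```

The check is mechanical, so it cannot judge whether the limitations are honest, only whether the section was skipped.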
Limitations are the section most people want to skip. Don't. The Stanford team listed their methodology limitations prominently: their scoring was based on publicly available information, so companies could score higher simply by publishing more documentation, whether or not the documentation was accurate. Naming that limitation made the study more credible, not less. Readers who trust that you'll admit what you don't know are more likely to trust what you say you do know.
The Stanford researchers chose to publish scores, ranking AI companies against each other. Some critics argued that publishing a ranked list, even a methodologically careful one, would be used by journalists and policymakers in ways the researchers couldn't control: simplified into headlines, weaponized in regulatory debates, or used to favor some companies over others in ways that had nothing to do with the actual findings. When you publish an audit, are you responsible for how others use it? Can you control that, and should you try to?
The most important thing to understand about your audit is that it is genuinely useful, even if the tool you tested is small and the findings are modest. The accumulation of many specific, documented audits is exactly how accountability for AI systems develops over time. Buolamwini's Gender Shades study wasn't the first study of facial recognition bias, but it was methodologically rigorous, publicly available, and reproducible, which is why it had force. The early studies of automated hiring tools weren't individually definitive, but as a body of evidence they created the political and regulatory pressure that produced Local Law 144.
Your audit adds one more documented data point to a growing record. If your findings are modest (the tool works as claimed for the cases you tested), that's genuinely useful information. If your findings are significant (the tool behaves differently in ways that matter), then you have evidence that can be shared, built on, and combined with other findings to make a larger argument.
Either way, you've done something most people never do: you've looked at an AI system systematically, with a defined method and documented evidence, and produced a conclusion you can defend. That's not a school assignment. That's the methodology that is slowly building the field of AI accountability, one specific, careful audit at a time.
You've now completed the full arc: choosing a tool, building a test, gathering evidence, and writing a responsible conclusion. Most people (including most adults, most journalists, and yes, most policymakers) don't know how to do any of those things with the rigor you've just applied. That gap between what people believe about AI and what they can actually verify is where AI accountability either develops or collapses. You're now on the development side of that gap.
This is the final step. You've chosen a tool, designed a test, gathered evidence, and now you need to write a responsible audit conclusion. Your lab partner will evaluate your report structure: Are your findings separated from your interpretation? Did you include limitations? Does your conclusion match what your evidence actually supports?
Draft your audit report out loud here: describe all five sections (subject, methodology, findings, limitations, and interpretation). Your partner will push back if your verdict overreaches your evidence or if your limitations are missing.