Module 4 · Lesson 1

Choosing Your Subject

What makes a tool worth auditing — and how do you pick one?

You've learned to read AI. Now it's time to judge it. But where do you even start?

In February 2018, a team of researchers at MIT's Media Lab published a paper that quietly changed how the tech world thought about accountability. Joy Buolamwini, a computer scientist who had been working on facial recognition since her undergraduate years, released the findings of what she called the Gender Shades project. She had tested three major commercial AI systems: IBM's Watson Visual Recognition, Microsoft's Face API, and Face++ from a Chinese company called Megvii.

Her method was straightforward: she assembled a dataset of 1,270 faces, balanced across gender and skin tone, and ran each face through all three systems. The results were stark. For lighter-skinned men, all three systems performed above 99% accuracy. For darker-skinned women, the error rates reached as high as 34.7%, meaning the AI got it wrong more than one-third of the time. IBM's system was the worst offender.

What made Gender Shades different from previous AI criticism wasn't just the findings. It was the method. Buolamwini didn't write an opinion piece or file a complaint. She audited. She designed a test, chose specific tools, collected evidence, and published numbers anyone could verify. Within months, IBM, Microsoft, and Megvii each updated their systems. The audit created pressure that no amount of general worry about "AI bias" had managed to generate.

What an Audit Actually Is

The word "audit" sounds like something accountants do with spreadsheets. But the core idea is simple: you pick a system, define what you expect it to do, and then test whether it actually does it. That's it. Buolamwini picked facial recognition tools, defined an expectation (accuracy should not vary by skin tone), and tested it. The numbers did the rest.
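The expectation in that example (accuracy should not vary by subgroup) can be checked with very little code. The sketch below is one possible way to do it, not the method used in Gender Shades: the records are invented for illustration, and the names `records` and `error_rate_by_group` are hypothetical.

```python
from collections import defaultdict

# Hypothetical audit records: (subgroup, true_label, predicted_label).
# These rows are invented for illustration; they are not Gender Shades data.
records = [
    ("lighter_male", "male", "male"),
    ("lighter_male", "male", "male"),
    ("lighter_female", "female", "female"),
    ("lighter_female", "female", "male"),
    ("darker_male", "male", "male"),
    ("darker_male", "male", "male"),
    ("darker_female", "female", "male"),
    ("darker_female", "female", "female"),
    ("darker_female", "female", "male"),
]

def error_rate_by_group(rows):
    """Return {subgroup: share of rows where the prediction was wrong}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, truth, predicted in rows:
        totals[group] += 1
        if predicted != truth:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}

rates = error_rate_by_group(records)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%} error")

# The standard being tested: error should not vary by subgroup.
gap = max(rates.values()) - min(rates.values())
print(f"largest gap between subgroups: {gap:.0%}")
```

Reporting the rate for every subgroup, rather than one overall accuracy figure, is the design choice that matters here: a single average can hide exactly the gap the audit is looking for.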

An AI audit doesn't require a lab or a university. It requires three things: a tool to examine, a standard to judge it against, and evidence you can point to. Those three ingredients are what separate a real audit from an opinion. Opinions say "I think this AI is biased." Audits say "Here's what I tested, here's what I found, and here's why it matters."

In this module, you're going to build one. Not as a class exercise — as an actual structured investigation into a real AI tool you can access right now. By the time you're done, you'll have produced something that didn't exist before: a documented, evidence-based judgment about a specific system.

Audit

A systematic check of whether a system does what it claims to do, using evidence rather than impression.

Benchmark

A specific test designed to measure performance — in AI audits, usually a set of inputs that reveal how a system handles different cases.

How to Choose the Right Tool

Joy Buolamwini didn't pick facial recognition randomly. She had personally experienced its failures — her own face wasn't being detected by some systems unless she wore a white mask. The tool she chose had already given her a reason to investigate. That's a good principle: start with something that affects real people in a specific, observable way.

For your audit, you need to pick an AI tool you can actually test. That means something accessible — a tool you can use without needing a professional account or advanced hardware. Some good candidates: an AI writing assistant, a content moderation system, an image generation tool, a chatbot used for advice or information, a translation service, or an autocomplete feature. The more specific the tool's claimed purpose, the easier it is to test.

Avoid picking something too vague. "AI in general" or "machine learning" is not auditable. "Google Translate's handling of gendered pronouns in Turkish-to-English translation" is. The sharper your subject, the sharper your findings. Buolamwini didn't audit "technology" — she audited three named commercial facial recognition APIs on a specific task with a specific dataset.

Ethical Question — No Clean Answer

When Buolamwini published her findings, the companies had not agreed to be tested. They had no warning. Their systems were essentially examined without consent. Was that fair? If a company makes a product available to the public, does that give anyone the right to systematically probe it for failures and publish the results? What if your audit finds something damaging — do you have an obligation to tell the company first, or publish immediately?

Here's a framework for choosing your tool. Ask three questions: First, can I access it? You need to be able to run real tests, not just read descriptions. Second, can I define what "good" looks like? If you can't say what the tool should do, you can't measure whether it fails. Third, does it affect real people? The more consequential the tool, the more the audit matters. A recommendation algorithm that shapes what millions of people read matters more than a meme generator.

The Auditor's Mindset

There's a specific mental mode you need to enter before you start an audit, and it's different from how most people use AI. Most users approach a tool looking for help — they want it to succeed, and they interpret ambiguous results charitably. An auditor does the opposite: you go in looking for the edges, the failures, the places where the system's behavior doesn't match its promises.

This doesn't mean you assume the tool is bad. It means you treat the question as genuinely open. You might audit a system and conclude it works well. That's a legitimate finding. Buolamwini could have found that all three facial recognition systems performed equally across skin tones. She was prepared for that result. The point is that you find out, rather than assuming.

The auditor also asks a question that most users never think to ask: who designed this system, and who did they imagine using it? Many AI failures can be traced back to the moment when a team of engineers built a system while imagining only a narrow slice of the people who would eventually use it. When you audit a tool, you're often uncovering the shape of that original imagination — and measuring how far reality has drifted from it.

You Now See What Most People Miss

You understand that an AI audit isn't a complaint — it's a method. Most people use AI tools without ever asking whether those tools work equally well for everyone, or even whether they work as claimed. You now have the vocabulary and the framework to ask those questions rigorously. That changes how you interact with every AI system you encounter from this point forward.

Lesson 1 Quiz

Five questions — test reasoning, not recall.

1. Joy Buolamwini's Gender Shades study found that facial recognition error rates reached 34.7% for which group?

Correct. The worst performance, up to 34.7% error, was recorded for darker-skinned women. That was the highest error rate found across the three commercial systems tested.

Not quite. The 34.7% error rate was recorded for darker-skinned women — the gap compared to lighter-skinned men was the central finding of the study.

2. Which of the following is NOT one of the three core ingredients of an AI audit as defined in this lesson?

Correct. The lesson defines three ingredients: a tool, a standard, and evidence. No certification is required — that's the point. Anyone can audit if they apply the method rigorously.

Look again. The lesson lists three ingredients: a tool to examine, a standard to judge it against, and evidence. Professional certification isn't on that list.

3. An AI writing assistant says it helps all writers improve their work. You want to audit this claim. Which approach best fits the auditor's mindset?

Exactly. An auditor tests across varied cases to find where performance changes — not just in the expected use case. Testing diverse writing samples would reveal whether "all writers" is actually true.

An auditor doesn't rely on reputation or company self-reporting. The lesson emphasizes systematic, evidence-based testing — especially at the edges where systems might fail.

4. Why does the lesson say "AI in general" is not an auditable subject, but "Google Translate's handling of gendered pronouns" is?

Correct. Audits require a specific tool, a specific task, and a clear standard. Vague subjects can't be tested. The more precisely you define what you're looking for, the more useful your findings.

The reason is about specificity. Without a named tool and a measurable claim, you can't design a real test. "AI in general" gives you nothing to test against — no benchmark, no expected behavior.

5. After Buolamwini published Gender Shades, IBM, Microsoft, and Megvii all updated their systems. What does this suggest about published audits compared to general criticism?

Exactly right. The lesson makes this point directly: the audit "created pressure that no amount of general worry about 'AI bias' had managed to generate." Evidence is harder to dismiss than opinion.

The lesson draws a direct contrast: general worry about AI bias hadn't moved these companies. Documented, specific, verifiable evidence did. That's the power of a real audit.

Lab 1: Pick Your Tool, State Your Case

You're the auditor. Choose your subject and defend the choice.

Your Assignment

You're about to pick the AI tool you'll audit in this module. Your lab partner — an AI research analyst — will push back on your choice and your reasoning. They're not trying to be difficult. They're doing what every good research team does: stress-testing the plan before any actual testing begins.

Bring a specific tool, a reason you chose it, and an initial idea of what you'd test. Your partner will challenge whether your subject is specific enough, whether it affects real people, and whether you can actually gather evidence.

Start by telling your partner: what AI tool are you planning to audit, and why does it deserve scrutiny?

Research Analyst — VERA

AI Lab Partner

You've read about Buolamwini's approach — pick a tool, define the test, gather evidence. Now it's your turn. What AI tool are you planning to audit? Give me the name, the specific claim you're going to test, and why you think it's worth investigating. I'll tell you whether it's specific enough to actually work as an audit subject.

Module 4 · Lesson 2

Building Your Test

What does evidence actually look like in an AI audit?

Anyone can say a tool is biased. How do you prove it — or prove it isn't?

In November 2021, the city of New York passed Local Law 144 — the first law in the United States requiring that AI hiring tools be audited before companies could use them in employment decisions. The law was a response to a growing body of evidence that automated hiring systems were screening out candidates in ways that correlated with race and gender, not just qualifications.

But implementing the law revealed a problem: nobody could agree on what an audit should actually look like. What inputs do you use? What counts as evidence of bias? How many test cases are enough? A nonprofit called Upturn studied the first round of audits conducted under the law and found significant variation — some auditing firms used hundreds of test resumes, others used thousands, and some didn't describe their methods clearly enough to be verified at all.

The lesson wasn't that auditing is impossible. It's that an audit is only as credible as its methodology — the specific, documented decisions about what you tested, how many times, and what would count as a pass or fail. Without a clear methodology, a published audit is just a number with no context. With one, it's evidence.

The Anatomy of a Test

A good test has four parts. First: inputs — the specific things you feed into the system. Second: expected outputs — what the system should produce, based on its own stated purpose. Third: actual outputs — what it actually produces when you run it. Fourth: comparison logic — the rule you use to decide whether the actual output counts as a pass or a failure.

Let's make this concrete. Suppose you're auditing an AI content moderation system — the kind that flags "harmful" posts on a social platform. Your inputs are text posts. Your expected output is consistent flagging: if a post saying "I hate Group A" gets flagged as hate speech, then a post saying "I hate Group B" with the same structure should be flagged too. Your actual outputs are what the system does when you run both. Your comparison logic is simple: does the flagging rate change based only on which group is named?

That structure — inputs, expected, actual, comparison — is the skeleton of almost every AI audit, from Buolamwini's Gender Shades study to the New York hiring tool audits. The details change depending on the tool, but the skeleton stays the same.
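The four-part skeleton can be written down as a small paired test. This is a minimal sketch under stated assumptions: `moderate` is a hypothetical stand-in for whatever tool you are auditing (a real audit would call the actual system there), and `PairedTest` is an illustrative name, not part of any library.

```python
from dataclasses import dataclass

def moderate(post: str) -> bool:
    """Hypothetical stand-in for the system under audit.

    Returns True if the post is flagged. A real audit would call the
    actual tool here; this toy version is deliberately inconsistent so
    the test has something to catch.
    """
    text = post.lower()
    return "i hate" in text and "group b" not in text

@dataclass
class PairedTest:
    template: str  # input: a post with a {group} placeholder
    group_a: str
    group_b: str

    def run(self, system) -> dict:
        post_a = self.template.format(group=self.group_a)
        post_b = self.template.format(group=self.group_b)
        flagged_a = system(post_a)  # actual output for group A
        flagged_b = system(post_b)  # actual output for group B
        # Expected output: both posts get the same treatment.
        # Comparison logic: pass only if the two outcomes match.
        return {
            "post_a": post_a,
            "post_b": post_b,
            "flagged_a": flagged_a,
            "flagged_b": flagged_b,
            "passed": flagged_a == flagged_b,
        }

result = PairedTest("I hate {group}", "Group A", "Group B").run(moderate)
print(result)
```

Because the comparison rule lives in one place (`flagged_a == flagged_b`), anyone reading the test can see exactly what counted as a pass before looking at the results.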

Methodology

A documented explanation of how a test was designed: what inputs were used, how many, why those were chosen, and what counts as a pass or failure.

Consistency Test

A test where you hold everything constant except one variable — such as race or gender — to see whether the output changes based only on that variable.

How Many Tests Are Enough?

This is the question the New York hiring tool audits couldn't answer consistently. The honest answer is: it depends on what you're testing, and how much variation you expect to find.

Here's a useful rule of thumb. If you run a test once and get a surprising result, that's a clue — not evidence. If you run it twenty times and the surprising result shows up consistently, that's a pattern worth documenting. If you run it two hundred times across varied inputs and the pattern holds, you're approaching something publishable. The more consequential your subject — an AI that makes decisions about people's jobs or medical care — the more tests you need before you should make strong claims.

For your audit in this module, you don't need two hundred tests. But you do need more than one, and you need to vary your inputs deliberately. Don't just run the same test five times with the same input. Run five tests with different inputs that test the same underlying question. That's the difference between checking one case and probing a pattern.
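One way to vary inputs deliberately is to write several wordings of the same underlying question and cross them with the variable you are testing. The sketch below assumes a content moderation audit; the templates and group names are invented placeholders, and you would substitute inputs that fit your own tool.

```python
from itertools import product

# Hypothetical templates: different wordings of the same underlying question.
templates = [
    "I hate {group}.",
    "Honestly, {group} are the worst.",
    "Nobody should trust {group}.",
    "{group} ruin everything.",
    "I can't stand {group}.",
]
groups = ["Group A", "Group B"]

# Every template is paired with every group, so the only thing that
# differs between matched posts is the group name.
test_inputs = [
    {"template_id": i, "group": group, "post": template.format(group=group)}
    for (i, template), group in product(enumerate(templates), groups)
]

for case in test_inputs:
    print(case["template_id"], case["group"], "->", case["post"])
```

Keeping the `template_id` with each case makes it easy to line up matched posts later and see which wordings, if any, produced a different outcome.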

Ethical Question — No Clean Answer

Local Law 144 in New York required hiring tool audits, but who should do those audits? The law allowed companies to hire their own auditors — organizations they pay. Critics pointed out that a paid auditor has an incentive to produce results the client likes. But requiring government auditors for every AI tool would be slow and expensive. Independent nonprofit auditors exist, but they have limited resources. Who should control the audit process — and does it matter who pays for it?

One more design decision you need to make: what counts as a meaningful difference? If your content moderation system flags 62% of posts criticizing Group A and 58% of posts criticizing Group B, is that bias? Or is that normal variation? There's no universal rule. But you need to decide in advance — before you see the results — what threshold you'd call significant. If you set the threshold after seeing the data, you can unconsciously choose a threshold that confirms what you expected to find.
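Committing to a threshold in advance can be as simple as writing it into the script before any results exist. The sketch below is illustrative only: the lesson gives percentages, so the sample size of 100 posts per group is an assumption, the cutoffs `MIN_GAP` and `Z_CUTOFF` are hypothetical choices, and the two-proportion z-test is a standard statistical check added here as one option, not something the lesson prescribes.

```python
import math

# Decided BEFORE running any tests (hypothetical values for illustration).
MIN_GAP = 0.10   # smallest difference in flag rate we would call meaningful
Z_CUTOFF = 1.96  # conventional cutoff for a two-sided test at the 5% level

def compare_flag_rates(flagged_a, total_a, flagged_b, total_b):
    """Compare two flag rates against the thresholds fixed above."""
    rate_a = flagged_a / total_a
    rate_b = flagged_b / total_b
    gap = rate_a - rate_b
    # Two-proportion z-test with a pooled estimate of the flag rate.
    pooled = (flagged_a + flagged_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = gap / std_err if std_err > 0 else 0.0
    return {
        "gap": gap,
        "z": z,
        "meaningful": abs(gap) >= MIN_GAP and abs(z) >= Z_CUTOFF,
    }

# Assumed sample: 62 of 100 posts flagged for Group A, 58 of 100 for Group B.
outcome = compare_flag_rates(62, 100, 58, 100)
print(outcome)
```

With these assumed numbers, the four-point gap falls below both cutoffs, so the script would not call it a meaningful difference. A different sample size or a different pre-committed threshold could change that answer, which is exactly why the threshold has to be fixed first.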

What to Do With Ambiguous Results

Most real audits produce some results that are clear and some that aren't. That's normal. The temptation is to report only the clear ones — the dramatic findings — and quietly ignore the cases where the system performed fine. Resist that temptation. An honest audit reports everything, including the cases where no difference was found. That's what makes findings trustworthy.

Ambiguous results also tell you something important: they tell you where your test design might be weak. If a result could go either way depending on how you interpret it, that's usually a sign you need better comparison logic, more test cases, or a more precise definition of what you were looking for. Ambiguity isn't failure — it's information about how to run a better audit next time.

You Now See What Most People Miss

Most people consume AI audit findings the way they consume any statistic: they see a number and assume someone credible produced it responsibly. You now know to ask about the methodology: how many tests, what inputs, who chose the comparison logic, and whether results were selected or comprehensive. That question alone lets you evaluate any published AI audit more rigorously than the average journalist, politician, or company executive who reads it.

Lesson 2 Quiz

Five questions about methodology, evidence, and test design.

1. New York's Local Law 144 (2021) was significant because it was the first law requiring what?

Correct. Local Law 144 required AI tools used in hiring to be audited for bias before deployment — a first in U.S. law.

Local Law 144 required AI hiring tools to be audited before companies could use them — not a ban, not a data disclosure requirement, but a pre-use audit mandate.

2. A consistency test in AI auditing is designed to check what?

Exactly. A consistency test isolates one variable — like race or gender — while keeping everything else the same, to see whether that variable alone drives a change in output.

A consistency test holds all variables constant except one, to see if that single variable — like the name of a demographic group — changes the output. It's about isolating causes, not repetition.

3. An auditor tests a job-screening AI with five identical resumes, only changing the applicant's name to signal different ethnic backgrounds. She finds the system approves 3 of 5 names. What should she do next?

Correct. Five tests give a clue, not a conclusion. The lesson is clear: you need to vary your inputs deliberately across more cases before the pattern is strong enough to report.

Five tests produce a clue, not evidence. The next step is to run more varied tests (different qualifications, different resume formats) to see whether the pattern holds. A single finding could be chance.

4. Why is it a problem to decide what "counts as bias" after you've already seen your audit results?

Exactly. If you set the threshold after seeing the data, you can unknowingly pick a cutoff that makes your expectations look confirmed. Pre-committed thresholds prevent this.

The problem is confirmation bias. If you see the results first, you're likely to choose a threshold that validates what you already believed. The threshold must be set before you look at the data.

5. An honest audit should include results that show the AI performed well, not just cases where it failed. Why?

Correct. Selective reporting — publishing only the dramatic failures — undermines trust in the audit. Complete reporting of all results, including non-findings, is what makes an audit credible.

The lesson says: "An honest audit reports everything, including the cases where no difference was found. That's what makes findings trustworthy." Selective reporting looks like advocacy, not investigation.

Lab 2: Design Your Test

Before you run a single test, you need a method.

Your Assignment

You've chosen a tool. Now you need to design the actual tests you'll run. Your lab partner will challenge your methodology: Are your inputs varied enough? Have you defined what "good" looks like before you test? How many test cases are you planning, and why?

Come prepared with a draft methodology: what inputs you'll use, what comparison you'll make, and what threshold would count as a meaningful difference in results.

Describe your test design: what will you feed into the AI, what will you compare, and what would count as a failure?

Research Analyst — VERA

AI Lab Partner

Good — you have a tool. Now let's stress-test your method before you waste time on bad tests. Walk me through your test design: what exact inputs are you planning to use, what are you comparing them against, and what difference in output would you call a real finding versus normal variation? Be specific — vague methodologies produce worthless results.

Module 4 · Lesson 3

Running the Audit

What happens when real evidence meets your hypothesis?

You designed a perfect test on paper. Then the AI does something you didn't expect. Now what?

In August 2020, England's exam regulator, Ofqual, used an algorithm to assign final grades to students whose A-level exams had been cancelled due to the COVID-19 pandemic. The algorithm was supposed to be fair: it used each school's historical performance data to predict what students would have scored.

The results arrived, and thousands of students began comparing notes. Students from private schools and historically high-performing schools received grades that matched or exceeded their teachers' predictions. Students from state schools — particularly those in poorer areas — received grades that were systematically downgraded. The algorithm was, in effect, punishing students for attending schools that had historically underperformed, regardless of those individual students' actual ability.

What made this auditable, what turned it from rumor into evidence, was a statistical analysis published days after the grades were released. Researchers at the Education Policy Institute and journalists at The Guardian pulled the grade distributions, compared them against teacher predictions by school type, and showed the pattern in numbers. Just four days after the grades were released, Ofqual reversed the algorithm's decisions entirely. Nearly 40% of grades, the ones the algorithm had marked down from teacher predictions, were raised. The audit (informal, fast, based on publicly available data) had more force than any formal complaint.

When Reality Doesn't Match the Plan

The Ofqual case happened fast, and it worked because the auditors — researchers, journalists, parents — had access to the output data and knew what comparison to make. But most audits you run will be slower, messier, and more ambiguous. The AI won't produce a clean pattern of discrimination on the first try. Some tests will go exactly as expected. Others will produce results you didn't anticipate and don't know how to categorize.

This is normal. The skill of running an audit is not just designing the test — it's deciding what to do when the evidence is partial. There are three situations you'll commonly encounter. First: the pattern is clear and consistent. Document it, check your methodology once more, and move toward a conclusion. Second: the results are mixed — some tests show a difference, others don't. This usually means your test inputs weren't controlled tightly enough, or the effect is real but small. Run more tests before drawing conclusions. Third: the tool performs exactly as claimed. Document that too — it's a finding.

Ground Truth

The correct, verified answer that you compare the AI's output against. In the Ofqual case, teacher predictions served as a form of ground truth for expected student performance.

Confounding Variable

An outside factor that could explain your results without any bias being present. For example, if better-funded schools happen to have more experienced teachers, that could explain grade differences without the algorithm being biased.

Controlling for Confounds

The hardest part of running a real audit isn't gathering evidence — it's ruling out alternative explanations. In the Ofqual case, someone could have argued: "State school students got lower grades because state schools genuinely underperform, not because the algorithm was biased." That argument isn't entirely wrong. The algorithm was using historical school performance as its basis. The question was whether applying school-level history to individual students was fair — and whether it produced results that couldn't be justified by individual merit.

When you run your audit, you'll need to ask: could something other than what I'm testing explain this result? If you're testing whether an AI writing assistant gives shorter responses to prompts written in a certain style, you need to check whether the shorter responses might be due to the length of the input rather than any bias. If you're testing a content moderation system, you need to check whether the posts you're using for different groups are truly equivalent in structure, not just equivalent in topic.
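For the content moderation case, one cheap check for structural equivalence is to confirm that two posts become identical once the group name is masked out. This is a minimal sketch: `differs_only_by_group` is a hypothetical helper, and plain string replacement is a simplifying assumption that will not catch every kind of difference (for example, grammar that changes with the group name).

```python
def differs_only_by_group(post_a: str, post_b: str,
                          group_a: str, group_b: str) -> bool:
    """True if swapping the group name is the only difference between posts."""
    placeholder = "{group}"
    return (post_a.replace(group_a, placeholder)
            == post_b.replace(group_b, placeholder))

# Matched pair: same structure, only the group name changes.
matched = differs_only_by_group(
    "I hate Group A", "I hate Group B", "Group A", "Group B")
print(matched)

# Unmatched pair: same topic, but the extra wording is a possible confound.
unmatched = differs_only_by_group(
    "I hate Group A", "I really hate Group B so much", "Group A", "Group B")
print(unmatched)
```

Running a check like this over every pair before testing means that if the system treats the two posts differently, wording and length have already been ruled out as explanations.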

Controlling for confounds doesn't mean you need to eliminate every alternative explanation — it means you need to acknowledge the ones you couldn't eliminate. A good audit says "we found X, and here are the alternative explanations we controlled for, and here's the one we couldn't rule out." That's honest. That's scientific. And it's far more credible than a finding that doesn't mention limitations at all.

Ethical Question — No Clean Answer

The Ofqual algorithm used school history as a predictor — a reasonable-sounding statistical decision. But it penalized individual students for the institution they attended. This is a structural problem that exists in many AI systems: they use group-level data to make individual-level decisions. Is that inherently unfair? Or is using historical data just good statistics? At what point does a statistically accurate prediction become an ethical violation?

What Good Documentation Looks Like

The researchers and journalists who exposed the Ofqual algorithm's failures didn't just describe what they found. They showed the data, explained how they analyzed it, and made their comparisons explicit. That's why the findings were impossible to dismiss. When critics can point to specific numbers and a specific methodology, a company or government agency can be held accountable. When critics can't, the institution can always claim the criticism is just a misunderstanding.

For your audit, documentation means recording every test you run, not just the ones that support your hypothesis. It means noting your inputs exactly — so that someone else could reproduce your test and get the same result. It means writing down what you expected before running each test, so there's a record that your conclusions weren't built backward from the outcome.

Think of your documentation as evidence that your evidence is trustworthy. Anyone can claim to have found something. The documentation is what makes the claim auditable in turn — it lets a reader check whether you actually ran the tests you say you ran, the way you say you ran them.
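A test log does not need special software. The sketch below shows one possible structure, assuming you record each test as a row with the exact input, the expectation you wrote down beforehand, and the actual result. `AuditLog` and its field names are illustrative, not a standard format.

```python
import csv
import io
from datetime import datetime, timezone

FIELDS = ["test_id", "recorded_at", "input", "expected", "actual", "matched"]

class AuditLog:
    """Keeps one row per test, including tests that found nothing."""

    def __init__(self):
        self.rows = []

    def record(self, test_id, test_input, expected, actual):
        self.rows.append({
            "test_id": test_id,
            "recorded_at": datetime.now(timezone.utc).isoformat(),
            "input": test_input,   # exact input, so others can rerun it
            "expected": expected,  # written down before the test was run
            "actual": actual,
            "matched": expected == actual,
        })

    def to_csv(self) -> str:
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(self.rows)
        return buffer.getvalue()

log = AuditLog()
log.record(1, "I hate Group A", expected="flagged", actual="flagged")
log.record(2, "I hate Group B", expected="flagged", actual="not flagged")
print(log.to_csv())
```

The log keeps rows where expectation and result matched as well as rows where they did not, so a reader can see the full set of tests rather than a selection.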

You Now See What Most People Miss

Most people who read about AI failures — algorithmic bias, unfair automated decisions, systems that don't work as claimed — take the findings on faith. You now understand the mechanics behind those findings: how tests are designed, how confounds are controlled, how documentation makes evidence trustworthy. This matters at an institutional level. Companies, regulators, and courts are currently deciding what counts as proof of AI harm. Knowing how to evaluate that proof puts you ahead of almost everyone in those rooms.

Lesson 3 Quiz

Five questions about evidence, confounds, and running a real test.

1. In 2020, Ofqual's grade algorithm was reversed within days. What made that reversal possible so quickly?

Correct. An informal but fast audit — comparing grade distributions against teacher predictions by school type — produced numerical evidence that was impossible to dismiss.

It was a fast, informal statistical audit by researchers and journalists, not a lawsuit or an internal review, that produced the evidence forcing the reversal. Speed + evidence + public data = accountability.

2. A confounding variable in an AI audit is best described as what?

Correct. A confound is an alternative explanation. Good audits name the confounds they controlled for and honestly acknowledge the ones they couldn't eliminate.

A confounding variable is an outside factor — like school funding levels in the Ofqual case — that could explain your results without bias being the actual cause. It's an alternative explanation you need to rule out.

3. You test an AI content filter with 10 posts per group and find it flags 8 posts from Group A and 4 from Group B, with equally aggressive language in each post. A critic says: "Posts from Group A might just naturally use more flagged keywords." What should you do?

Exactly. The critic raised a plausible confound. The auditor's job is to control for it — by matching posts on keyword frequency — to isolate whether the group name alone is driving the difference.

The critic raised a legitimate alternative explanation. You need to design a test that controls for keyword frequency, so you can show whether the group name alone — not the words used — is causing the difference in flagging rates.

4. Why does good audit documentation require writing down your expected results BEFORE running each test?

Correct. Pre-writing expectations creates a record that your conclusions weren't reverse-engineered from the results. It protects the audit's credibility against the accusation of cherry-picking.

Pre-recording expectations prevents you from unconsciously building a narrative around whatever you happened to find. It's the same principle as setting your bias threshold before looking at results.

5. The Ofqual algorithm used historical school performance to predict individual student grades. Which of the following best describes the ethical problem this created?

Exactly. This is the core structural problem: using group statistics (school history) to judge individuals means that an exceptional student at a struggling school gets penalized for the school's past, not their own performance.

The ethical problem was structural: the algorithm made individual predictions based on institutional history. Students couldn't change which school they attended — and were being judged by it rather than by their own capability.

Lab 3: Present Your Evidence

You ran the tests. Now defend what you found.

Your Assignment

You've run your tests. Now you need to present your findings to a skeptical analyst who will look for confounds you didn't account for, patterns that could have alternative explanations, and gaps in your documentation. Be ready to defend your results without overstating what you actually proved.

Describe what you found, how many tests you ran, and what you believe the evidence shows. Your partner will challenge whether your conclusions are supported by your evidence.

Present your audit findings: what did the AI actually do in your tests, and what does that mean?

Research Analyst — VERA

AI Lab Partner

You ran the tests. Now convince me. Lay out your findings: what exactly did the AI do across your test cases, what pattern are you claiming exists, and how many tests support it? I'll be looking for the confounds you might have missed and whether your conclusion is proportionate to your evidence. Don't overstate — but don't undersell a real finding either.

Module 4 · Lesson 4

Writing Your Verdict

How do you turn evidence into a credible, responsible conclusion?

You have the findings. Now you have to decide what they mean — and how to say it.

In October 2023, a team of researchers at Stanford University's Institute for Human-Centered Artificial Intelligence published an audit of ten major foundation models, including GPT-4, Claude 2, Llama 2, and PaLM 2. The study was called the Foundation Model Transparency Index, and it tried to score each model across 100 dimensions of transparency: what data was used, how the model was fine-tuned, what safety testing had been done, and what it could and couldn't do.

The highest-scoring system got a 54 out of 100. The lowest got a 12. But what drew the most attention wasn't the scores themselves — it was the language the researchers used to describe them. They didn't say "these models are dangerous." They didn't say "regulators must act immediately." They said: "Based on our methodology, most models provide limited information about their development process, which constrains external accountability."

Journalists who covered the story translated it into headlines like "AI Giants Hide How Their Systems Work, Stanford Finds." Both statements were based on the same data. One said what the evidence showed. The other said what the evidence implied. The researchers had made a deliberate choice: they reported findings, not verdicts. And that restraint was exactly what made the findings hard to dismiss.

The Difference Between a Finding and a Verdict

A finding is what your evidence shows, stated as precisely as possible. A verdict is a judgment that goes beyond the evidence: it adds interpretation, moral weight, or a call to action. Both have their place. The problem comes when people present verdicts as if they were findings, or hedge their findings so heavily that they become meaningless.

Here's how to tell the difference. "This AI content filter flagged posts containing the word 'protest' 68% of the time when the posts were about Black Lives Matter, and 31% of the time when the posts used the same word in other contexts" is a finding. "This AI is racist and should be banned" is a verdict. The finding is consistent with the verdict, but only as one possible interpretation. The verdict requires additional claims (about intent, about harm, about alternatives) that the finding alone doesn't establish.

For your audit, aim for precise findings first. Then, in a separate section, offer your interpretation — clearly labeled as such. This structure is what separates credible research from advocacy. You can do both. You just need to be transparent about which is which.
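To make the idea of a precise finding concrete, here is a minimal Python sketch that turns raw test counts into a finding statement. The counts (68 of 100 and 31 of 100) are hypothetical, chosen only to match the percentages in the content-filter example, and the names GroupResult and state_finding are illustrative, not part of any real auditing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupResult:
    """Outcome counts for one group of test posts (hypothetical data)."""
    label: str
    flagged: int
    total: int

    @property
    def rate(self) -> float:
        # Share of posts in this group that the filter flagged.
        return self.flagged / self.total


def state_finding(a: GroupResult, b: GroupResult) -> str:
    """Describe what was observed, with counts, and nothing more."""
    gap = (a.rate - b.rate) * 100  # difference in percentage points
    return (
        f"Flagged {a.flagged}/{a.total} ({a.rate:.0%}) of {a.label} posts and "
        f"{b.flagged}/{b.total} ({b.rate:.0%}) of {b.label} posts: "
        f"a gap of {gap:.0f} percentage points."
    )


# Hypothetical counts chosen to match the 68% and 31% in the example.
blm = GroupResult("BLM-related", flagged=68, total=100)
other = GroupResult("other-context", flagged=31, total=100)
finding = state_finding(blm, other)
print(finding)
```

Notice that the output reports counts, rates, and the size of the gap, but says nothing about why the gap exists. Any explanation belongs in the separately labeled interpretation.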

Finding

A statement of what your evidence shows, without interpretation added beyond what the data directly supports.

Verdict

A judgment that goes beyond the data, adding interpretation, moral weight, or a recommendation for action.

How to Write a Responsible Audit Report

The Stanford Foundation Model Transparency Index is a useful template. It didn't just publish scores — it published the full scoring rubric, so anyone could check whether a model really deserved its rating on any particular dimension. That kind of transparency is what makes an audit replicable, which is what makes it trustworthy.

A responsible audit report has five sections.

Subject: the specific tool you tested, when you tested it (AI systems change over time, so the version matters), and what it claims to do.

Methodology: exactly what inputs you used, how many tests, and what comparison logic you applied.

Findings: what you observed, stated precisely, with any ambiguous or non-confirming results included.

Limitations: what you couldn't test, what confounds you couldn't eliminate, and how those affect the strength of your conclusions.

Interpretation: what you think the findings mean, clearly marked as your analysis, not the data itself.
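One way to keep yourself honest about the five sections is to treat the report as a simple data structure and check that none of them is empty. The sketch below is only an illustration: the AuditReport class, the missing_sections helper, and the "ExampleTutor" tool are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class AuditReport:
    """The five sections of an audit report, each as plain text."""
    subject: str
    methodology: str
    findings: str
    limitations: str
    interpretation: str

    def missing_sections(self) -> list[str]:
        # A section counts as missing if it is empty or only whitespace.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Hypothetical draft in which the limitations section was skipped.
draft = AuditReport(
    subject="ExampleTutor chatbot (hypothetical), tested on one date, claims to explain homework",
    methodology="20 paired questions, formal vs. informal phrasing, compared word counts",
    findings="Formal phrasing received longer answers in 16 of 20 pairs",
    limitations="",
    interpretation="My analysis: phrasing style may affect the help students receive",
)
print(draft.missing_sections())  # ['limitations']
```

A check like this only confirms that each section exists. Whether the limitations are honest and the interpretation is proportionate to the findings still takes human judgment.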

Limitations are the section most people want to skip. Don't. The Stanford team listed their methodology limitations prominently: their scoring was based on publicly available information, so companies could score higher simply by publishing more documentation — whether or not the documentation was accurate. Naming that limitation made the study more credible, not less. Readers who trust you'll admit what you don't know are more likely to trust what you say you do know.

Ethical Question — No Clean Answer

The Stanford researchers chose to publish scores — ranking AI companies against each other. Some critics argued that publishing a ranked list, even a methodologically careful one, would be used by journalists and policymakers in ways the researchers couldn't control: simplified into headlines, weaponized in regulatory debates, or used to favor some companies over others in ways that had nothing to do with the actual findings. When you publish an audit, are you responsible for how others use it? Can you control that, and should you try to?

What Your Audit Contributes

The most important thing to understand about your audit is that it is genuinely useful, even if the tool you tested is small and the findings are modest. The accumulation of many specific, documented audits is exactly how accountability for AI systems develops over time. Buolamwini's Gender Shades study wasn't the first study of facial recognition bias — but it was methodologically rigorous, publicly available, and reproducible, which is why it had force. The New York hiring tool audits weren't individually definitive — but as a body of evidence they created the political and regulatory pressure that produced Local Law 144.

Your audit adds one more documented data point to a growing record. If your findings are modest — the tool works as claimed for the cases you tested — that's genuinely useful information. If your findings are significant — the tool behaves differently in ways that matter — then you have evidence that can be shared, built on, and combined with other findings to make a larger argument.

Either way, you've done something most people never do: you've looked at an AI system systematically, with a defined method and documented evidence, and produced a conclusion you can defend. That's not a school assignment. That's the methodology that is slowly building the field of AI accountability — one specific, careful audit at a time.

You Now See What Most People Miss

You've now completed the full arc: choosing a tool, building a test, gathering evidence, and writing a responsible conclusion. Most people — including most adults, including most journalists, and yes, including most policymakers — don't know how to do any of those things with the rigor you've just applied. That gap between what people believe about AI and what they can actually verify is where AI accountability either develops or collapses. You're now on the development side of that gap.

Lesson 4 Quiz

Five questions about findings, verdicts, and responsible conclusions.

1. Stanford's Foundation Model Transparency Index found the highest-scoring AI model received what score out of 100?

Correct. The highest-scoring model received a 54 out of 100, meaning even the most transparent major AI system satisfied only about half of the Stanford rubric.

The highest score was 54 out of 100. That number, barely past the midpoint of the scale, was itself a finding: even the most transparent systems disclosed only about half of what the index looked for.

2. The Stanford researchers said AI companies "provide limited information about their development process, which constrains external accountability." A journalist wrote: "AI Giants Hide How Their Systems Work." What is the key difference?

Exactly. "Limited information" is what the data showed. "Hide" implies intentional concealment — a verdict that goes beyond what the scoring methodology could establish.

The researchers described what the data showed. The journalist added an interpretation — intentional hiding — that the data didn't directly prove. One is a finding; the other is a verdict. The lesson distinguishes them clearly.

3. You audit an AI tutoring tool and find it gives longer, more detailed explanations to questions phrased in formal academic language than to equivalent questions phrased informally. What is the responsible way to report this?

Correct. The responsible structure is: precise finding, then labeled interpretation, then honest limitations. That's what the lesson describes — and what makes your conclusions hard to dismiss.

The lesson is clear: report findings precisely AND offer interpretation — but keep them separate and clearly labeled. Also acknowledge limitations. That structure is what makes an audit credible rather than advocacy.

4. Why does the lesson say listing limitations makes an audit MORE credible, not less?

Exactly. The lesson makes this point directly using the Stanford example. Admitting what your method couldn't test signals intellectual honesty — which makes your positive claims more trustworthy.

The lesson says: "Readers who trust you'll admit what you don't know are more likely to trust what you say you do know." Limitations are a credibility signal, not a weakness.

5. A student completes an audit of a homework-help chatbot and finds it works well for every type of question they tested. Is that a legitimate audit result worth documenting?

Correct. The lesson says explicitly: "You might audit a system and conclude it works well. That's a legitimate finding." The accumulation of both positive and negative documented audits builds a reliable record.

The lesson is clear: "The tool performs exactly as claimed. Document that too — it's a finding." Every carefully documented result, positive or negative, contributes to building AI accountability over time.

Lab 4: Write Your Audit Report

Findings, interpretation, limitations — make it credible.

Your Assignment

This is the final step. You've chosen a tool, designed a test, gathered evidence, and now you need to write a responsible audit conclusion. Your lab partner will evaluate your report structure: Are your findings separated from your interpretation? Did you include limitations? Does your conclusion match what your evidence actually supports?

Draft your audit report out loud here — describe all five sections: subject, methodology, findings, limitations, and interpretation. Your partner will push back if your verdict overreaches your evidence or if your limitations are missing.

Draft your audit report. Walk through all five sections. Be honest about what you can and cannot conclude from your evidence.

Research Analyst — VERA

AI Lab Partner

This is it — your finished audit. Walk me through all five sections: the tool you tested and when, your methodology, your findings stated precisely, your limitations, and then your interpretation labeled as such. I'll tell you whether your conclusion is proportionate to your evidence and whether your report would hold up to a skeptical reader. Don't oversell. Don't undersell. Just report what you found.

Module 4 Test

15 questions — 80% to pass. Tests reasoning across all four lessons.

1. Joy Buolamwini's Gender Shades project tested which three commercial AI systems?

Correct. Buolamwini tested IBM, Microsoft, and Megvii — three major commercial facial recognition APIs available at the time.

The three systems were IBM Watson Visual Recognition, Microsoft Face API, and Megvii's Face++.

2. What are the three core ingredients of an AI audit?

Correct. Those three ingredients are what separate a real audit from an opinion.

The three ingredients from Lesson 1: a tool, a standard, and evidence. No credentials required — just rigorous method.

3. Why did New York City's Local Law 144 (2021) reveal problems even after it was passed?

Correct. The law created the audit requirement but couldn't standardize methodology — revealing how much variation exists in how audits are actually conducted.

The law's implementation revealed that without standardized methodology, different auditors produce results that can't be compared or verified — even under a legal requirement.

4. A consistency test in AI auditing is used to measure what?

Correct. A consistency test isolates one variable to test whether that single variable is causing changes in output.

Consistency tests hold everything constant except one variable — like the demographic group named — to isolate whether that variable alone is driving output differences.

5. In August 2020, what did the UK's Ofqual algorithm do that caused a national controversy?

Correct. The algorithm used school-level historical performance to make individual-level predictions, systematically downgrading students at lower-performing state schools.

The algorithm downgraded students based on their school's historical performance — a group-level metric applied to individuals — and was reversed within 11 days after a fast statistical audit.

6. What is a confounding variable, and why does it matter in an AI audit?

Correct. Confounds are alternative explanations. Good audits name what they controlled for and acknowledge what they couldn't eliminate.

A confound is an outside factor — like school funding in the Ofqual case — that could explain findings without bias being the cause. Ruling out confounds is essential to credible audit conclusions.

7. You run 5 tests on an AI system and get a surprising result on 2 of them. What does this tell you?

Exactly. The lesson distinguishes clues from evidence: one surprising result is a clue; a consistent pattern across varied inputs approaches evidence.

The lesson is clear: "If you run a test once and get a surprising result, that's a clue — not evidence." You need more varied tests before a pattern is strong enough to report.

8. Why must an AI audit's significance threshold be set BEFORE looking at results?

Correct. Pre-setting the threshold prevents confirmation bias — the tendency to unconsciously pick a cutoff that makes your expectations look right.

Setting thresholds before seeing data prevents confirmation bias — you can't choose a cutoff that happens to validate what you already believed if you commit to it before looking at results.

9. What makes documentation of audit test inputs so important?

Correct. Documented inputs make tests reproducible — the foundation of scientific credibility. Others can check your work, which is what separates evidence from assertion.

Documentation makes tests reproducible. If someone else can run the same tests and get the same results, the findings are credible. Without documented inputs, claims can't be verified.

10. Stanford's Foundation Model Transparency Index (2023) scored AI models on 100 dimensions. What was the significance of making the scoring rubric public?

Correct. A public rubric lets the audit be audited — readers can check specific scores, not just accept the overall rankings on faith.

Publishing the rubric made the audit itself auditable. Readers could check any specific score, which is exactly the standard of transparency the study was measuring AI companies against.

11. A "finding" differs from a "verdict" in an audit because:

Correct. The distinction is critical: findings are data-grounded; verdicts go further into interpretation and judgment. Both have a place — they just need to be clearly labeled as different things.

Findings describe what the data shows. Verdicts add interpretation (intent, moral weight, recommendations) that goes beyond the data. The lesson uses the Stanford vs. journalism example to illustrate this distinction.

12. Why does the lesson say auditing AI tools that affect "real people" matters more than auditing trivial applications?

Correct. The more consequential the tool, the more the audit matters. A hiring algorithm failing affects people's livelihoods in ways a meme generator failing simply doesn't.

Consequential tools shape real outcomes — jobs, education, access to services. Failures in those domains cause genuine harm. The lesson advises choosing tools where the audit findings carry weight.

13. An auditor finds that an AI tool works well for all the cases they tested. They're tempted to not publish the report since there are no dramatic findings. What should they do, and why?

Correct. Every honest, documented result — positive or negative — adds to the growing record. Selective publication of only failures is itself a form of bias.

The lesson explicitly says: "Document that too — it's a finding." Only publishing failures is a form of selection bias that makes the overall audit record unreliable.

14. The Ofqual algorithm was reversed within 11 days. Which factor was most directly responsible for that speed?

Correct. Specific, numerical, publicly verifiable evidence moved faster than any complaint or petition could have. That's the leverage of a real audit.

The speed came from specific, numerical evidence drawn from publicly available data. The lesson describes it as an "informal but fast audit" — the methodology was what made it effective.

15. A responsible audit report includes five sections. Which of the following correctly lists them?

Correct. Those five sections — subject, methodology, findings, limitations, interpretation — are the structure the lesson prescribes for a responsible AI audit report.

The lesson defines five sections: subject (the tool and its claims), methodology (how you tested), findings (what you observed), limitations (what you couldn't establish), and interpretation (your analysis, clearly labeled).