Module 3 · Lesson 1

What Does "Good" Even Mean?

Before you can judge an AI tool, you need to know what you're judging it on — and that's harder than it sounds.

How did a single botched evaluation cost Amazon millions — and what does that tell us about building a rubric?

In 2014, Amazon's machine-learning team built a tool they were proud of. Its job: read thousands of software-engineering résumés and score each one from one to five stars, much as shoppers rate products on Amazon. The team trained it on ten years of Amazon's own hiring data. They thought they were building an objective judge. They were wrong.

By 2015, internal testers noticed the system was quietly downgrading résumés that included the word "women's," as in "women's chess club captain." It also marked down graduates of two all-women's colleges. In October 2018, Reuters reported the full story: Amazon had scrapped the tool entirely. It had learned from historical data that Amazon, like most big tech companies, had hired more men than women in technical roles. So it treated "male-coded" language as a quality signal.

Here is the critical detail most people overlook: the tool was performing exactly as designed. Amazon asked it to predict who would get hired. It predicted correctly. The problem was that nobody had written down what "good hiring" actually meant before building the evaluation. Was the goal to reproduce past patterns? To find the highest-skill candidates? To build a more diverse team? Three different goals — three completely different rubrics — and Amazon had never chosen between them.

The Rubric Problem

A rubric is a written-down set of criteria — the things you're measuring and what counts as a good or bad score on each one. Teachers use them when grading essays. Judges use them in competitions. Scientists use them when they review each other's research papers.

When you're testing an AI tool, a rubric does the same job. It forces you to answer a question that feels obvious but usually isn't: what exactly are we trying to measure?

Amazon's team measured "likelihood to be hired." That's a real metric. But it wasn't the right one — because it captured something about the past, not something about actual skill or potential. A rubric without the right criteria is worse than no rubric at all, because it gives you false confidence.

Here's what a rubric actually needs before you test anything:

Criterion: One specific thing you are measuring. "Accuracy" is a criterion. "Speed" is a criterion. "Does not embarrass the company" is also a criterion; companies just rarely write that one down.

Weight: How much each criterion matters relative to the others. If accuracy is twice as important as speed, your rubric should say so. Otherwise you might accidentally optimize for the wrong thing.

Threshold: The minimum score that counts as "acceptable." Without a threshold, every score looks relative and you can always talk yourself into excusing a bad result.
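To make those three parts concrete, here is a minimal sketch of a rubric as data. Everything in it is an illustrative assumption: the `Criterion` class, the `evaluate` function, the 0-to-1 scoring scale, and the numbers are invented for this example, not taken from any real evaluation.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str         # the one specific thing being measured
    weight: float     # importance relative to the other criteria
    threshold: float  # minimum acceptable score, on an assumed 0-1 scale


def evaluate(rubric, scores):
    """Return (weighted_score, failed_names) for one tool.

    A tool passes only if every criterion meets its own threshold, so a
    strong score on one criterion cannot excuse a failing score on another.
    """
    total_weight = sum(c.weight for c in rubric)
    weighted = sum(c.weight * scores[c.name] for c in rubric) / total_weight
    failed = [c.name for c in rubric if scores[c.name] < c.threshold]
    return weighted, failed


# Hypothetical rubric: accuracy counts twice as much as speed.
rubric = [
    Criterion("accuracy", weight=2.0, threshold=0.85),
    Criterion("speed", weight=1.0, threshold=0.60),
]

# Hypothetical scores for one tool (made-up numbers, not real measurements).
scores = {"accuracy": 0.80, "speed": 0.95}

weighted, failed = evaluate(rubric, scores)
print(f"weighted score: {weighted:.2f}")
print(f"failed criteria: {failed}")
```

With these made-up numbers the weighted score comes out at about 0.85, which looks respectable, yet accuracy sits below its own threshold. That is the job of the threshold: it stops a high average from excusing a criterion that failed.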

Why Most People Skip This Step

In November 2022, researchers at Stanford's Center for Research on Foundation Models published a paper called "Holistic Evaluation of Language Models," known as HELM. Their central complaint was stark: almost every AI benchmark published before 2022 measured only one or two things (usually raw accuracy on a specific task) and ignored everything else. Speed, fairness, robustness, cost: all treated as someone else's problem.

The researchers found they couldn't even compare tools fairly because each one had been tested on different criteria. It was like comparing soccer players by having some of them play a real match and others do a skills drill, then announcing a winner.

Why does this happen? Because writing a rubric is slow and boring and building the actual tool feels exciting. Teams under deadline pressure skip the boring part. Then they test against whatever metric is easiest to calculate. Then they ship.

The result is that most AI tools in the world have never been formally evaluated against a rubric that covers all the things that actually matter to the people using them. You can now see what most people — including most professional engineers — skip entirely.

Ethical Question — No Clean Answer

Amazon's tool gave accurate predictions based on what it was trained to predict. If a company's historical hiring was biased, and an AI learns from that history, who is responsible for the outcome — the tool, the engineers who built it, the executives who approved it, or the system that created the biased history in the first place? There is no answer here that makes everyone comfortable. Sit with it.

The Five Dimensions That Actually Matter

Based on the HELM research and on real-world deployment failures like Amazon's, here are five dimensions that almost every serious AI evaluation should include. This module will build your rubric around these five — and by the end, you'll be able to apply them to any AI tool you encounter.

Dimension	What It Measures	Why It Can Be Tricky
Accuracy	Does it produce correct outputs on the task it claims to do?	Depends entirely on what "correct" means — which you have to define first.
Fairness	Does it perform equally well across different groups of people?	A tool can be accurate on average but terrible for specific groups.
Robustness	Does it stay reliable when inputs are slightly unusual or unexpected?	Most tools are tested on clean, typical data. Real life is messy.
Transparency	Can you understand why it gave a particular output?	High accuracy and low transparency often go together — the best models are the hardest to explain.
Fit for Purpose	Is this tool the right kind of AI for this specific job?	A hammer can be great at hammering and terrible at measuring. Same tool, wrong task.

In Lessons 2, 3, and 4 you'll stress-test each of these dimensions with real cases and build the actual scoring system. But right now, the important move is recognizing that all five dimensions must be present in any serious rubric. Leave one out and your evaluation has a blind spot — and blind spots are exactly where the expensive failures hide.

You Can Now See What Most People Miss

When a company announces their new AI is "95% accurate," you now know the right questions: Accurate on what task, measured how, across which groups, under what conditions? Those four follow-up questions — which almost no journalist and very few product managers ask — are the difference between a meaningful evaluation and a marketing slide.

Lesson 1 Quiz

What Does "Good" Even Mean? · 5 questions

1. Amazon's hiring tool was scrapped because it had learned to penalize certain résumés. What was the root cause of this problem?

Exactly. The tool did what it was asked to do — predict hiring outcomes. The mistake was in choosing that criterion without asking whether past hiring patterns were a fair standard. A different criterion would have produced a different tool.

The engineers didn't make a coding error — the tool performed as designed. The issue was that the design goal itself was flawed, because no one wrote a rubric that defined what "good hiring" actually meant.

2. What is a "threshold" in a testing rubric?

Right. Without a threshold, you end up comparing scores without knowing whether any of them are actually good enough. A threshold turns a relative score into a pass/fail judgment — which is what you need to make a real decision.

A threshold is the floor, not the ceiling — the minimum acceptable score. Without one, every result looks negotiable and it becomes too easy to rationalize a mediocre tool as "good enough."

3. A new AI writing assistant claims it is "97% accurate." You now know this claim is incomplete. Which follow-up question is MOST important to ask first?

Yes. "97% accurate" is meaningless without knowing what task is being measured, what "correct" means in that context, and whether the accuracy is consistent across different users and inputs. These are the rubric questions.

Training time and parameter count are interesting technical details, but they don't tell you whether the accuracy claim is meaningful. The rubric questions — what task, measured how, across which groups — are the ones that expose whether a claim holds up.

4. The Stanford HELM paper (2022) found that most AI benchmarks before that year had a major flaw. What was it?

Correct. HELM's core finding was that most benchmarks were narrow — they told you how well a model did on a specific task but ignored everything else that matters in real deployment: fairness, transparency, cost, and robustness to unusual inputs.

While dataset contamination is a real problem, HELM's main complaint was about narrow criteria — most benchmarks only measured one or two things and ignored the full range of dimensions that matter in real-world use.

5. Which of the five rubric dimensions asks: "Does this tool perform equally well for everyone who uses it?"

Yes. Fairness specifically looks at whether performance varies across different groups — by gender, race, age, language, or other characteristics. A tool can be 90% accurate on average but 60% accurate for a specific group, and only a fairness check will catch that.

Fairness is the dimension that measures consistency across groups. Accuracy measures whether outputs are correct on average — but average accuracy can hide major disparities between different types of users.

Lab 1: Define Before You Measure

Role: Rubric Architect · You design the criteria before anyone runs a test.

Your Assignment

A school district in Austin, Texas is about to buy an AI tool that reads student essays and gives them a score from 1–10. Before they spend the money, you've been asked to design the rubric that will be used to evaluate whether the tool is actually good enough to use. Your lab partner below has opinions — and will push back on yours.

Start by telling your lab partner: what criteria should be on this rubric, and why? Then defend your choices when challenged.

Opening move: Name at least two criteria you think must be on the rubric for an AI essay-grader used with real students. Explain why each one matters — don't just list them.

Lab Partner — Rubric Architect Session AI

Alright, I've seen a lot of rubrics that sound rigorous but fall apart when you actually test something against them. You're designing criteria for an AI essay-grader that a school district will use on real students. Tell me: what goes on the rubric, and more importantly — why those things and not other things? I'll tell you where I think you're wrong.

Module 3 · Lesson 2

Testing for Accuracy — and Its Limits

Accuracy is the first thing everyone measures. It's also the first thing people misuse.

In 2020, a dermatology AI outperformed human doctors on a benchmark — then struggled in clinics. Why?

In January 2020, a team at the University of Modena published results showing their deep-learning model for detecting skin cancer achieved an accuracy of 91.3% on a standard benchmark dataset — compared to 77.4% for the average dermatologist. Headlines ran. The story was picked up by the BBC, Forbes, and dozens of science outlets. "AI Beats Doctors," the summaries said.

What the headlines didn't say: the benchmark dataset contained almost entirely high-quality, dermatologist-taken photographs under controlled lighting. When the same model was tested on images taken with ordinary smartphones — which is how most teledermatology actually works — performance dropped by roughly 20 percentage points. When tested on patients with darker skin tones, it dropped further still, because the training dataset was over 80% images of light-skinned patients.

The model was genuinely accurate — at the specific, narrow task it was tested on. But "accurate on a curated benchmark dataset" and "accurate in the real world" are two completely different claims. The rubric had measured accuracy. It just measured the wrong version of it.

What Accuracy Really Means in a Rubric

When you write "accuracy" on a rubric, you have to specify three things or the criterion is useless:

1. Accurate on what inputs? — The dermatology AI was accurate on professional photographs. Real clinics use smartphones. The inputs in the test had to match the inputs in real use.

2. Accurate by what measure? Overall accuracy (what percentage of all predictions are correct) can be misleading. On a test set where 95% of the cases are "not cancer," a model could reach 95% accuracy by saying "not cancer" every single time, and it would miss every real cancer case. Researchers use additional measures: sensitivity (how often does it catch real positives?) and specificity (how often does it correctly clear negatives?).

3. Accurate for whom? — This is where accuracy and fairness overlap. A single overall accuracy number hides variation across subgroups. Your rubric should require accuracy to be reported separately for different groups when the tool will be used across a diverse population.
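The "always say not cancer" trap in point 2 is easy to check with arithmetic. The sketch below uses a made-up test set of 100 cases (5 real positives, 95 negatives) and a model that predicts "negative" every time. The function name `metrics` and the data are assumptions for illustration only.

```python
def metrics(actual, predicted):
    """Compute overall accuracy, sensitivity, and specificity.

    `actual` and `predicted` are equal-length lists of booleans,
    where True means "positive" (for example, "cancer").
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # real cases caught
    fn = sum(1 for a, p in pairs if a and not p)      # real cases missed
    tn = sum(1 for a, p in pairs if not a and not p)  # healthy, correctly cleared
    fp = sum(1 for a, p in pairs if not a and p)      # healthy, wrongly flagged
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }


# Hypothetical test set: 5 real positives and 95 negatives.
actual = [True] * 5 + [False] * 95

# A "model" that says "negative" for every single case.
predicted = [False] * 100

result = metrics(actual, predicted)
print(result)
```

The do-nothing model scores 0.95 on accuracy and 1.0 on specificity, but its sensitivity is 0.0: it catches none of the five real cases. A rubric that asks only for overall accuracy would pass it.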

Benchmark: A standard test used to compare AI tools. Benchmarks are useful shortcuts, but they only measure what their designers chose to measure, which may not match your actual use case.

Distribution Shift: When the inputs during real use look different from the inputs used during testing. The dermatology AI experienced distribution shift: test images were professional, real-world images were from phones.

How to Write an Accuracy Criterion That Holds Up

Here's a weak accuracy criterion you'd find in most corporate evaluations:

Weak Criterion (Do Not Use)

"The tool must achieve at least 90% accuracy."

Here's a stronger version that actually holds up under scrutiny:

Strong Criterion

"The tool must achieve at least 85% sensitivity and 85% specificity on inputs representative of real deployment conditions — including the range of device types, lighting conditions, and demographic groups present in our actual user population. Results must be reported separately by subgroup."

Notice what changed: the threshold got slightly lower (85% instead of 90%), but the criterion is vastly more demanding — because it now specifies the right inputs, the right measures, and requires disaggregated (separated-out) results. A tool that fails any one of those sub-requirements fails the criterion entirely, no matter how good it looks in aggregate.

This is what a rubric does. It prevents a tool from hiding its weaknesses behind its strengths.
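Here is one way the strong criterion could be checked, as a rough sketch. The subgroup names and confusion-matrix counts are invented for illustration, and the helper names (`sensitivity`, `specificity`, `passes`) are assumptions. The 0.85 threshold comes from the example criterion above.

```python
THRESHOLD = 0.85  # minimum sensitivity AND specificity, from the example criterion

# Hypothetical results, reported separately by subgroup.
# tp/fn/tn/fp = true positives, false negatives, true negatives, false positives.
results = {
    "professional_camera": {"tp": 95, "fn": 5, "tn": 95, "fp": 5},
    "smartphone": {"tp": 40, "fn": 10, "tn": 45, "fp": 5},
}


def sensitivity(counts):
    return counts["tp"] / (counts["tp"] + counts["fn"])


def specificity(counts):
    return counts["tn"] / (counts["tn"] + counts["fp"])


def passes(counts, threshold=THRESHOLD):
    return sensitivity(counts) >= threshold and specificity(counts) >= threshold


# The aggregate view: add up the counts across all subgroups.
aggregate = {
    key: sum(counts[key] for counts in results.values())
    for key in ("tp", "fn", "tn", "fp")
}

# The disaggregated view: every subgroup must pass on its own.
failing = [name for name, counts in results.items() if not passes(counts)]
criterion_met = not failing

print("aggregate passes:", passes(aggregate))
print("failing subgroups:", failing)
print("criterion met:", criterion_met)
```

In this made-up data the aggregate numbers clear the bar (sensitivity 0.90, specificity about 0.93), but the smartphone subgroup has a sensitivity of only 0.80, so the criterion fails. The aggregate hid exactly the weakness the disaggregated report exposes.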

Ethical Question — No Clean Answer

The dermatology AI genuinely performed well on the data it was tested on. The researchers published an honest result. Journalists made it into a larger claim than the data supported. When an AI tool underperforms for certain groups in real use — who should be held responsible: the researchers who built it, the journalists who simplified it, the hospitals that deployed it, or the regulatory agencies that approved it? All of them had information and made choices. None of them alone caused the problem.

Accuracy in Your Rubric: The Template

When you add accuracy to your rubric, use this three-part structure:

Part	Question to Answer	Example Answer
Task Definition	What exactly is the tool being asked to do?	"Flag student essays likely to contain plagiarism."
Test Conditions	On what inputs? How similar to real use?	"On essays written by students aged 11–16, submitted via the school's actual portal."
Measure + Threshold	What counts as correct, and what's the minimum?	"At least 80% sensitivity (catching real plagiarism) and 90% specificity (clearing innocent work). Reported by grade level."

If you can fill in all three parts for accuracy, you have a criterion that means something. If any part is blank, the criterion is a decoration — it looks official but doesn't constrain anything.
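If you keep rubrics in a spreadsheet or script, the "no blank parts" rule can be enforced mechanically. This is only a sketch: the `AccuracyCriterion` class and its `blank_parts` method are hypothetical names, and the example text is taken from the table and the weak criterion above.

```python
from dataclasses import dataclass, fields


@dataclass
class AccuracyCriterion:
    task_definition: str        # what exactly is the tool being asked to do?
    test_conditions: str        # on what inputs, and how close to real use?
    measure_and_threshold: str  # what counts as correct, and the minimum

    def blank_parts(self):
        """Return the names of any parts left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# The weak criterion only fills in a number.
weak = AccuracyCriterion(
    task_definition="",
    test_conditions="",
    measure_and_threshold="The tool must achieve at least 90% accuracy.",
)

# The template example from the table fills in all three parts.
strong = AccuracyCriterion(
    task_definition="Flag student essays likely to contain plagiarism.",
    test_conditions="Essays by students aged 11-16, submitted via the school's actual portal.",
    measure_and_threshold="At least 80% sensitivity and 90% specificity, reported by grade level.",
)

print("weak is missing:", weak.blank_parts())
print("strong is missing:", strong.blank_parts())
```

A check like this cannot tell you whether the answers are good, only whether they exist. It is a guard against the decorative criterion, not a substitute for judgment.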

Knowing how to spot the difference between a meaningful accuracy claim and a marketing number is a skill that most people working with AI tools — inside and outside of companies — do not have. You now do.

Lesson 2 Quiz

Testing for Accuracy — and Its Limits · 5 questions

1. The 2020 skin cancer AI achieved 91% accuracy on a benchmark but struggled in actual clinics. What specific problem caused this gap?

Correct. Distribution shift is exactly what happened — the inputs during testing looked very different from the inputs in real deployment. A rubric that specifies "accuracy on inputs representative of real use" would have caught this before deployment.

Distribution shift was the core problem. The test conditions were idealized (professional photos, mostly light-skinned patients) while real clinic conditions were messier and more diverse. The rubric failed to specify that test inputs must match real-world inputs.

2. A medical AI reports 98% overall accuracy at detecting a disease that affects only 2% of the population. What is the most serious problem with this accuracy claim?

Exactly right. When a condition is rare, overall accuracy is almost useless as a metric — because the easiest way to score well is to always predict the majority class. Sensitivity (catching real cases) and specificity (correctly clearing healthy patients) are what actually matter here.

The problem is more fundamental: a model that always says "no disease" would score 98% accuracy on this dataset — and catch zero actual cases. This is why overall accuracy is the wrong measure for rare-condition detection.

3. "Sensitivity" in a medical AI test means:

Right. Sensitivity measures the "catch rate" — out of all the real positive cases, how many did the model flag? High sensitivity means few real cases get missed. The related measure, specificity, measures how often the model correctly clears people who don't have the condition.

Sensitivity is the "catch rate" — how often the model correctly flags real cases. Specificity is the related measure for correctly clearing people without the condition. Both are needed when overall accuracy is misleading.

4. You're writing an accuracy criterion for an AI that will translate school announcements into Spanish for parents. Which version is stronger?

Yes. This criterion specifies: a real-world measure (understandability by native speakers), realistic test inputs (actual school announcements), and coverage of the actual user population (multiple dialects). The other options are either too vague, measuring the wrong thing, or comparing against a benchmark that may not match real use.

Option B is stronger because it specifies the measurement method (rated by native speakers), the test inputs (real announcements from the district), and the user population (multiple dialects). "95% accuracy" without those details is a number that means whatever you want it to mean.

5. An AI tool for identifying fake news articles scores 88% accuracy overall. A journalist reports this as "the AI is right 88% of the time." What's missing from this picture?

Exactly. An overall number hides the error pattern, the test conditions, and any variation across subgroups. Labeling real news as fake has different consequences than missing fake news — and a rubric needs to treat those as separate criteria, not average them together.

The accuracy number hides the type of errors (which direction are the 12% wrong?), the test conditions, and whether performance varies by topic or language. A rubric breaks accuracy into specific sub-criteria so these gaps can't hide behind a single number.

Lab 2: Stress-Testing an Accuracy Claim

Role: Evaluator · You receive a vendor's accuracy report and find the holes.

Your Assignment

A company has pitched their AI reading-level assessment tool to your school. Their sales sheet says: "93% accuracy — validated on 10,000 student texts." Your principal wants to know if this number means anything. You've been put in charge of figuring that out.

Your lab partner has seen this kind of claim dozens of times. Start by telling them what questions you'd ask the vendor to determine whether 93% actually means anything — and be specific about why each question matters.

Opening move: What are the two most important questions you'd ask the vendor about their 93% accuracy claim? Be specific about what answers would make you more or less confident in the tool.

Lab Partner — Accuracy Evaluator Session AI

93% accuracy on 10,000 student texts. I've seen that sentence in probably fifty vendor decks. Sometimes it means something. Often it doesn't. What do you want to know before you trust it?

Module 3 · Lesson 3

Fairness, Robustness, and Transparency

The three dimensions most evaluation rubrics skip — and the three that matter most when something goes wrong.

In 2019, a healthcare algorithm used on 200 million Americans was found to be systematically giving Black patients lower care scores. How do you test for something like that?

In October 2019, researchers at Berkeley published a paper in the journal Science documenting a healthcare management algorithm used by major US insurers — built by a company called Optum — that was being used to decide which patients needed additional care. The algorithm was deployed on an estimated 200 million people per year.

The researchers found that Black patients who were just as sick as white patients were systematically assigned lower risk scores, meaning they received less additional care. The gap was striking: at any given risk score, Black patients were on average significantly sicker than white patients.

The cause wasn't a coding error. The algorithm predicted future healthcare costs as a proxy for healthcare need. And because Black patients had historically spent less on healthcare — due to systemic economic barriers and unequal access — the algorithm learned that they "needed" less. The tool was technically accurate at predicting costs. It was profoundly unfair at predicting need.

Optum disputed aspects of the study but acknowledged the disparity and updated the algorithm. The damage, however, had already been done across years of deployment — and no rubric in the procurement process had included a fairness criterion that would have caught it.

Writing a Fairness Criterion

Fairness in a rubric is not about whether an AI was built with good intentions. It's about whether it produces consistent, equitable outcomes across the people who will use it. Here's how to write a fairness criterion that actually works:

Step 1 — Identify the relevant groups. Who will use this tool? Are there groups whose outcomes you need to check separately? For a healthcare algorithm: patients by race, income level, age, and chronic condition status. For a grading tool: students by first language, grade level, and disability status.

Step 2 — Define what "equal" means. Equal can mean different things. Equal accuracy rates across groups? Equal error rates? Equal outcomes? These are different — and sometimes in tension. You must choose.

Step 3 — Set a maximum allowed gap. If the tool performs 5 percentage points worse for one group than another, is that acceptable? 10 points? The rubric must specify a number. Without a number, every gap is negotiable.
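The three steps can be turned into a check you could actually run. The sketch below assumes "equal" means equal error rates (one possible choice from Step 2) and uses a 5-point maximum gap as in Step 3. The group labels and error rates are placeholders, not real measurements.

```python
MAX_GAP_POINTS = 5.0  # Step 3: the largest allowed gap, in percentage points

# Step 1: the groups you decided to check, with each group's measured
# error rate in percent. These are made-up numbers for illustration.
error_rates = {
    "group_a": 8.0,
    "group_b": 10.5,
    "group_c": 15.0,
}


def largest_gap(rates):
    """Return (gap, worst_group, best_group) for a dict of error rates."""
    worst = max(rates, key=rates.get)  # highest error rate
    best = min(rates, key=rates.get)   # lowest error rate
    return rates[worst] - rates[best], worst, best


# Step 2 is the choice of metric itself: here, error rate per group.
gap, worst, best = largest_gap(error_rates)
meets_criterion = gap <= MAX_GAP_POINTS

print(f"largest gap: {gap:.1f} points ({worst} vs {best})")
print("meets fairness criterion:", meets_criterion)
```

With these placeholder numbers the gap between the best-served and worst-served group is 7 points, so the tool fails a 5-point criterion. Because the limit is written down as a number, there is nothing left to negotiate.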

Strong Fairness Criterion — Example

"The tool's error rate must not differ by more than 5 percentage points across demographic groups defined by race, gender, and first language, as measured on a test set that proportionally represents the actual user population."

Ethical Question — No Clean Answer

The Optum algorithm made a technically rational choice: predict costs, because costs are measurable. The problem is that cost and need are not the same — especially when access to care is unequal. If you're building a rubric for a healthcare AI, should you require equal accuracy, equal outcomes, or some other definition of fairness? And what happens when achieving one kind of fairness makes another kind harder? Researchers call this the "impossibility theorem" of fairness — you genuinely cannot satisfy all definitions simultaneously.

Robustness: Testing at the Edges

In 2021, a research team at MIT published results showing that self-driving car systems from multiple manufacturers performed significantly worse in rain and snow than in clear conditions — even though most of their training data came from sunny California weather. This is a robustness failure: the tool works under ideal conditions but degrades when conditions become slightly unusual.

Robustness testing means deliberately giving a tool inputs that are different from the "typical" case and measuring how gracefully (or badly) it handles them. For a writing-assistance AI, robustness tests might include:

Test Type	What You Do	What Failure Looks Like
Edge Case	Feed it an unusually short or long input	Crashes, refuses, or gives absurd output
Adversarial	Add deliberate misspellings or unusual formatting	Produces confident but wrong output
Dialect/Language Shift	Use inputs from a different dialect or register	Drops in quality specifically for those inputs
Out-of-Domain	Give it a task slightly outside its training focus	Doesn't recognize its own limits — continues anyway

A robustness criterion in your rubric might say: "The tool must maintain at least 80% of its peak accuracy performance when tested on inputs with intentional formatting errors, non-standard dialects, and lengths more than twice the average training example length." This is measurable, specific, and actually tests whether the tool works in real life, not just in a lab.
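A criterion phrased as "at least 80% of peak accuracy" is a ratio, and it helps to see the arithmetic. In this sketch the peak accuracy and the stress-test accuracies are invented numbers, and the test names are hypothetical labels for the three conditions in the example criterion.

```python
MIN_RETENTION = 0.80  # the tool must keep at least 80% of its peak accuracy

# Hypothetical accuracy on clean, typical inputs.
peak_accuracy = 0.90

# Hypothetical accuracy under each stress condition.
stress_accuracy = {
    "formatting_errors": 0.81,
    "non_standard_dialect": 0.63,
    "long_inputs": 0.75,
}

# Retention = accuracy under stress divided by peak accuracy.
retention = {name: acc / peak_accuracy for name, acc in stress_accuracy.items()}

failed = sorted(name for name, kept in retention.items() if kept < MIN_RETENTION)

for name, kept in retention.items():
    print(f"{name}: keeps {kept:.0%} of peak accuracy")
print("failed robustness tests:", failed)
```

Here the tool keeps about 90% of its peak on formatting errors and about 83% on long inputs, but only about 70% on non-standard dialect, so it fails the criterion on that test alone.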

Transparency: Can You Understand Why?

In 2016, ProPublica published an investigation into a tool called COMPAS, used by courts in the US to predict whether a defendant was likely to commit another crime. Judges used its scores when deciding on bail and sentencing. Nobody outside the company that made it could examine how it worked, because the company called the algorithm proprietary (meaning they owned it and wouldn't reveal it).

The ProPublica investigation found COMPAS was nearly twice as likely to falsely flag Black defendants as high-risk compared to white defendants. Because the algorithm was a black box (no explanation available), there was no way for defendants to challenge the score in court. You can't argue against a number when no one can say where the number came from.

Transparency in a rubric doesn't always mean "fully open source." It means: can a decision made by this tool be explained to the person it affects? There are different levels:

Full Transparency: The tool provides a human-readable explanation for every output. Example: "This essay received a 6/10 because it scored low on paragraph structure and cited sources incorrectly."

Audit Trail: The tool logs its inputs and outputs so that a decision can be reviewed after the fact. Doesn't explain itself, but creates accountability.

Black Box: The tool gives an output with no explanation and no audit trail. Acceptable for low-stakes uses. Dangerous for decisions that affect people's lives, education, or freedom.

Your rubric should specify what transparency level is required for the stakes involved. A music recommendation AI can be a black box. A tool used in school discipline, criminal justice, healthcare, or hiring should not be. The rubric makes this explicit — so a vendor can't sell you a black box when your situation requires accountability.

This is what most institutions buying AI tools right now are failing to ask. You now know what question to put in writing before signing anything.
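One way to put that question in writing is to rank the three transparency levels and state the minimum level each kind of decision requires. The sketch below is an assumption-heavy illustration: the `Transparency` enum, the two stakes labels, and the choice of "audit trail" as the floor for high-stakes tools are all choices made for this example.

```python
from enum import IntEnum


class Transparency(IntEnum):
    """The three levels described above, ordered from least to most transparent."""
    BLACK_BOX = 0
    AUDIT_TRAIL = 1
    FULL = 2


# Assumed policy: low-stakes tools may be black boxes; high-stakes tools
# need at least an audit trail. Your own rubric might demand FULL instead.
REQUIRED = {
    "low": Transparency.BLACK_BOX,
    "high": Transparency.AUDIT_TRAIL,
}


def acceptable(stakes, offered):
    """True if the level a vendor offers meets the minimum for these stakes."""
    return offered >= REQUIRED[stakes]


# A music recommender (low stakes) can be a black box.
print(acceptable("low", Transparency.BLACK_BOX))

# A school-discipline tool (high stakes) cannot.
print(acceptable("high", Transparency.BLACK_BOX))

# The same tool with an audit trail meets the assumed minimum.
print(acceptable("high", Transparency.AUDIT_TRAIL))
```

The useful part is not the code but the table inside it: once the required level for each kind of decision is written down, a vendor's black box either meets it or it doesn't.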

Lesson 3 Quiz

Fairness, Robustness, and Transparency · 5 questions

1. The Optum healthcare algorithm was "accurate" at its stated task but unfair. What was the key design decision that caused the fairness problem?

Exactly. Predicting cost instead of need was the rubric-level failure. Cost is measurable, so it was used. Need is what actually mattered. A fairness criterion that required testing whether predicted scores matched actual medical outcomes across demographic groups would have surfaced this before deployment.

The algorithm didn't use race directly — it used healthcare cost, which correlated with race because access to care is unequal across racial groups. This is called "proxy discrimination" and it's one reason fairness criteria must test outcomes by group, not just inspect input variables.

2. A school adopts an AI attendance tracker that predicts which students are "at risk" of dropping out. To write a strong fairness criterion, your FIRST step should be:

Right. Before you can measure fairness, you have to know which groups to measure it across. That's always the first step — and it requires thinking about who uses the tool and who might be harmed if it works differently for them.

Identifying the relevant groups is always the first step in writing a fairness criterion — because you can't measure whether a tool is fair across groups until you've defined which groups to check. Setting thresholds and choosing comparison methods come later.

3. An AI writing assistant performs well on standard essay prompts but produces poor suggestions when students write in African American Vernacular English (AAVE). This is primarily a failure of:

Yes — it's both. It's a fairness failure because students who write in AAVE get worse service than students who write in standard American English. It's also a robustness failure because the tool degrades under dialect variation, which is exactly the kind of "edge case" robustness testing is designed to catch.

This is both a fairness failure (unequal outcomes by linguistic group) and a robustness failure (performance degrades on non-standard inputs). These two dimensions often appear together — and a rubric that includes both would catch problems that a rubric measuring only one would miss.

4. Which tool is most appropriate to operate as a "black box" — with no explanation for its outputs?

Right. Music recommendations are low-stakes — a bad recommendation costs you nothing meaningful, and there's no one whose life is affected by the explanation. All the other options involve decisions that affect people's freedom, education, or health — those require accountability, and that means the black-box level of transparency is unacceptable.

Music recommendations are the low-stakes case where a black box is acceptable — the consequence of a bad suggestion is just a song you don't like. Bail, test scoring, and healthcare triage all affect people's lives in serious ways, which means the rubric must require at least an audit trail, if not full explanations.

5. In 2016, ProPublica found the COMPAS tool predicted recidivism unfairly across racial groups. What made this especially difficult for defendants to challenge?

Exactly. Proprietary black-box algorithms in high-stakes settings create an accountability vacuum — people can't challenge what they can't see. This is the core transparency problem your rubric needs to catch: when a tool affects someone's life, they deserve to know why it scored them the way it did.

The key problem was transparency — the algorithm was a proprietary black box, so no one could explain the basis of a score to the person it affected. This is exactly why transparency level is a required rubric criterion for high-stakes AI tools.

Lab 3: Auditing for Fairness and Transparency

Role: Auditor · You examine a deployed tool and write the fairness and transparency sections of a rubric.

Your Assignment

A middle school in Chicago is using an AI tool to decide which students get placed in an accelerated math track. The tool takes in grades, test scores, and teacher ratings, and outputs a "readiness score" from 1–100. Students above 70 are considered for the accelerated track. The school won't tell anyone how the tool calculates the score — it came from a vendor who calls it proprietary.

A group of parents has asked you to write the fairness and transparency sections of an evaluation rubric for this tool. Your lab partner will push you to be precise. Vague criteria won't be accepted.

Opening move: Write your first fairness criterion for this tool. Be specific about the groups you'd test, what "equal" means in this context, and what threshold you'd require. Then explain your choices.

Lab Partner — Fairness Auditor Session AI

A proprietary "readiness score" deciding which kids get into accelerated math. Parents want a rubric. I've seen tools like this — they often pass internal audits and still produce wildly unequal outcomes. Give me your first fairness criterion. And I need specifics: which groups, what measure, what's your maximum allowed gap. Don't give me something I could put on any rubric for any tool.

Module 3 · Lesson 4

Fit for Purpose — and Putting It All Together

A brilliant tool solving the wrong problem is not a good tool. Your rubric needs a criterion for that.

In 2023, a US law firm used ChatGPT to research a legal brief, and the AI invented six court cases that never existed. What did their rubric miss?

In June 2023, a federal judge in New York held a very unusual hearing. The case was Mata v. Avianca, a personal injury lawsuit against an airline. Lawyers from the firm Levidow, Levidow and Oberman had submitted a legal brief that cited six court cases as precedents. The opposing lawyers couldn't find any of those cases. The judge ordered the filing attorneys to explain themselves.

The attorneys admitted that Steven Schwartz, a lawyer at the firm with 30 years of experience, had used ChatGPT to help research the brief. ChatGPT had generated the citations. All six cases were fabricated — they had never existed. When Schwartz had asked ChatGPT to confirm the cases were real, it had said yes. When he asked it to provide the actual text of the rulings, it produced plausible-sounding but entirely made-up text.

The judge sanctioned the lawyers and the firm a total of $5,000. The story ran internationally. And the core problem was almost embarrassingly simple: ChatGPT was not built for legal citation research. It is a text-generation tool. It generates plausible-sounding sequences of words. Legal citations require confirmed facts from verifiable sources. No one had asked whether the tool was appropriate for the task before using it.

The Fifth Dimension: Fit for Purpose

The first four dimensions — accuracy, fairness, robustness, transparency — assume you already have the right kind of tool and are measuring how well it performs. "Fit for Purpose" is the question you ask before all of that: is this type of AI even appropriate for this task?

Different AI systems have fundamentally different capabilities and failure modes. A rubric that doesn't ask this question first can produce high scores on all four dimensions — and still point you toward a catastrophically wrong choice.

Generative AI: Produces new content (text, images, code). Excellent at drafting, brainstorming, summarizing. Poor at tasks requiring confirmed facts from external sources, precise citations, or guaranteed accuracy.

Classification AI: Assigns inputs to categories (spam/not spam, fraudulent/legitimate, cancer/not cancer). Built for pattern matching. Does not generate new content. Should not be used for open-ended tasks.

Retrieval AI: Searches and retrieves information from a defined database or document set. Highly accurate for fact-finding within its data scope. Cannot generate, create, or go beyond its source material.

Schwartz used a generative AI for a retrieval task. That's not a failure of accuracy, fairness, robustness, or transparency — it's a failure to match tool type to task type. A fit-for-purpose criterion would have asked: "Does this AI retrieve verified information from authoritative legal databases, or does it generate plausible text?" That one question prevents the entire disaster.

Writing a Fit-for-Purpose Criterion

A fit-for-purpose criterion has two parts. First, a task analysis: what exactly does this job require? Second, a tool-type match: does the AI's underlying mechanism actually do that?

Task Type	What It Requires	Right Tool Type	Wrong Tool Type
Find confirmed facts from records	Retrieval from authoritative sources	Retrieval AI, database search	Generative AI (will hallucinate)
Sort incoming requests into categories	Pattern matching, consistent categorization	Classification AI	Generative AI (inconsistent, unpredictable)
Draft text from a brief outline	Fluent language generation	Generative AI	Classification AI (can't generate)
Flag fraud in financial transactions	Fast pattern matching at scale	Classification AI	Retrieval AI (not built for real-time classification)

A strong fit-for-purpose criterion in your rubric might look like this: "Prior to evaluation against any other criteria, the selection panel must confirm in writing that the tool's underlying architecture matches the task's core requirement. If the task requires retrieval of verified facts, a purely generative system cannot pass this criterion regardless of its accuracy scores on other dimensions."

That last clause matters. A tool cannot "make up for" a fit-for-purpose failure by scoring well on other criteria. This is a gate criterion — it comes first, and a failure here ends the evaluation.
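
The gate logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `REQUIRED_TOOL_TYPE`, `passes_fit_for_purpose_gate`, and `evaluate`, and the short task labels, are invented for this example. The lookup simply restates the task table from this lesson; it does not come from any real evaluation library.

```python
# Hypothetical lookup that restates the task/tool table from this lesson.
REQUIRED_TOOL_TYPE = {
    "find confirmed facts": "retrieval",
    "sort into categories": "classification",
    "draft text": "generative",
    "flag fraud": "classification",
}


def passes_fit_for_purpose_gate(task: str, tool_type: str) -> bool:
    """Return True only if the tool's type matches what the task requires."""
    required = REQUIRED_TOOL_TYPE.get(task)
    if required is None:
        # Unknown task: the panel has not done its task analysis yet.
        raise ValueError(f"No task analysis on file for: {task!r}")
    return tool_type == required


def evaluate(task: str, tool_type: str, other_scores: dict) -> str:
    """Apply the gate first; other scores are only looked at if it passes."""
    if not passes_fit_for_purpose_gate(task, tool_type):
        return "FAILED GATE: evaluation ends, other scores ignored"
    return f"Gate passed: continue with scored dimensions {sorted(other_scores)}"


# A generative tool with a high accuracy score still fails a retrieval task.
print(evaluate("find confirmed facts", "generative", {"accuracy": 0.95}))
print(evaluate("draft text", "generative", {"accuracy": 0.95}))
```

Notice that `evaluate` never reads the accuracy score when the gate fails. That is the design choice the criterion describes: a strong score elsewhere cannot buy back a mismatch between tool type and task type.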

Ethical Question — No Clean Answer

Steven Schwartz had 30 years of legal experience. He asked ChatGPT if the cases were real and it confirmed they were. Suppose a professional trusts an AI tool's confident-sounding response in a domain they don't fully understand (here, how language models actually work). How much of the responsibility for the harm belongs to the professional, and how much belongs to the tool's designers for building something that confidently fabricates without warning? And what about the law firm's clients, who paid for research that never happened?

Your Complete Rubric Template

You've now built all five dimensions. Here is the complete rubric template you can apply to any AI tool evaluation. This is what a serious, professional-grade evaluation framework looks like — and most organizations buying AI tools right now are using something far less complete.

Dimension	Gate / Scored	Minimum Requirement	What Must Be Reported
Fit for Purpose	Gate — must pass first	Tool architecture matches task type. Written confirmation required.	Tool type, task type, reasoning for match.
Accuracy	Scored	Sensitivity and specificity specified by task; tested on inputs representative of real deployment.	Accuracy by subgroup, test conditions described.
Fairness	Scored	Maximum performance gap across defined demographic groups specified in advance.	Performance reported by each group; gap compared to threshold.
Robustness	Scored	Performance maintained at a stated percentage of peak under edge cases, adversarial inputs, and dialect variation.	Degradation rate under each test type.
Transparency	Scored — scaled by stakes	Low-stakes: audit trail. High-stakes: human-readable explanation per output.	Transparency level, how decisions can be reviewed or appealed.
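
One way to keep this template honest is to write it down as data, so the order of evaluation and the stakes-based transparency rule cannot quietly change. The sketch below is a minimal illustration, not a standard format: the `Criterion` class, `RUBRIC` list, and both function names are assumptions made up for this example, and the requirement text is shortened from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    dimension: str
    is_gate: bool  # True = must pass before anything else is scored
    minimum_requirement: str


# Shortened restatement of the rubric table above (illustrative only).
RUBRIC = [
    Criterion("Fit for Purpose", True, "Tool architecture matches task type; written confirmation."),
    Criterion("Accuracy", False, "Sensitivity and specificity set by task; representative inputs."),
    Criterion("Fairness", False, "Maximum gap across defined groups set in advance."),
    Criterion("Robustness", False, "Stated percentage of peak kept under edge cases."),
    Criterion("Transparency", False, "Scaled by stakes."),
]


def evaluation_order(rubric):
    """Gate criteria first, then scored ones, keeping the listed order otherwise."""
    return [c.dimension for c in sorted(rubric, key=lambda c: not c.is_gate)]


def required_transparency(stakes: str) -> str:
    """Return the minimum transparency level for 'low' or 'high' stakes."""
    levels = {
        "low": "audit trail",
        "high": "human-readable explanation per output",
    }
    if stakes not in levels:
        raise ValueError(f"Stakes must be 'low' or 'high', got {stakes!r}")
    return levels[stakes]


print(evaluation_order(RUBRIC))
print(required_transparency("high"))
```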

What You Can Do Now That Most People Can't

When a government agency, a school district, a hospital, or a company announces it is adopting an AI tool, you now know what a complete evaluation rubric looks like — and you can tell, often from a single press release, whether anyone used one. Most haven't. That gap between the rubric that should exist and the rubric that does exist is where most of the high-profile AI failures of the last decade have occurred. You know how to close it.

Lesson 4 Quiz

Fit for Purpose — and Putting It All Together · 5 questions

1. In the Mata v. Avianca case (2023), attorney Steven Schwartz used ChatGPT for legal research. Which dimension of the rubric did his process most fundamentally fail?

Correct. The accuracy failure was downstream of the fit-for-purpose failure. ChatGPT was never built to retrieve confirmed legal citations — it generates plausible text. Using it for a fact-retrieval task is the root cause. A rubric with a fit-for-purpose gate criterion would have stopped the evaluation before it even reached accuracy testing.

The accuracy failure was real, but it was a symptom, not the root cause. The root cause was using a generative text tool for a task that requires confirmed retrieval from authoritative sources. Fit for Purpose is the gate criterion that prevents this — it must be checked before any other dimension.

2. A hospital wants to use AI to automatically sort incoming patient messages into "urgent" and "non-urgent" categories. Which AI type is most fit for this purpose?

Right. Sorting into categories is exactly what classification AI is designed to do — it's fast, consistent, and built for pattern matching at scale. Generative AI would produce unpredictable outputs; retrieval AI is designed for lookup, not categorization; and a general benchmark score doesn't tell you whether the tool type matches the task.

Categorization tasks require classification AI — built for exactly this kind of consistent, fast pattern matching. Generative AI produces text (not reliable categories), retrieval AI finds information (not classifies it), and benchmark scores don't tell you whether the tool's architecture matches the task.

3. In your complete rubric template, "Fit for Purpose" is a "gate criterion." What does that mean in practice?

Exactly. A gate criterion comes first and stops the evaluation if failed. A tool cannot compensate for a fit-for-purpose failure by scoring well elsewhere — because if the tool is the wrong type for the job, high scores on other dimensions are irrelevant and potentially misleading.

A gate criterion means you evaluate it first, and failure ends the process — there is no override. The logic is that if a tool is fundamentally the wrong type for the job, its scores on other dimensions don't matter and can even create false confidence.

4. A school board is evaluating two AI tools for student math tutoring. Tool A scores 91% on accuracy but is a black box with no explanations. Tool B scores 84% on accuracy and provides a human-readable explanation for every suggestion. For a student-facing educational tool, which rubric dimension determines that Tool B may actually be the better choice?

Yes. Transparency — specifically the "full transparency" level that provides human-readable explanations — is the defining criterion here. A tutoring tool that gives students a correct answer with no explanation doesn't actually teach anything. The rubric's transparency requirement, scaled to the educational stakes involved, points to Tool B despite its lower accuracy score.

Transparency is the key dimension. In an educational context, the purpose of feedback is to help students learn — which requires understanding why they got something wrong. A rubric that treats transparency as a required criterion for high-stakes, student-facing tools would flag Tool A's black-box design as a disqualifying problem regardless of its accuracy advantage.

5. A city government wants to adopt an AI tool to predict which neighborhoods need infrastructure repairs. Before applying any other rubric criteria, which question must the evaluation team answer first?

Right — fit for purpose is always the first question. Infrastructure decisions require predictions grounded in real, verified data about actual conditions. A generative AI could produce confident-sounding neighborhood risk scores with no connection to actual infrastructure data. Confirming tool type and data sourcing before anything else is the gate that prevents that failure mode.

Fit for purpose must come first. For a prediction task involving public infrastructure and resource allocation, the evaluation team needs to confirm that the tool's predictions are grounded in real, verified data — not generated. Accuracy, transparency, and cost are all important but meaningless if the tool type is wrong for the task.

Lab 4: Build the Full Rubric

Role: Lead Evaluator · You write the complete five-dimension rubric for a real scenario.

Your Assignment

A large city school district is considering adopting an AI tool to help decide which students qualify for special education services. The tool analyzes teacher reports, test scores, and behavioral records and outputs a recommendation: "evaluate further" or "no evaluation needed." This decision affects thousands of students per year.

You are the lead evaluator. Your job is to write the complete five-dimension rubric — fit for purpose, accuracy, fairness, robustness, and transparency — before the district signs any contract. Your lab partner will review each criterion and challenge you to make it more specific, more enforceable, or more honest about the tradeoffs.

Opening move: Start with your Fit for Purpose criterion. What type of AI is appropriate for this task, and what would disqualify a tool from passing this gate — regardless of its accuracy scores?

Lab Partner — Full Rubric Build Session AI

Special education eligibility decisions. High stakes, legally regulated, affects kids' entire educational trajectory. I've seen rubrics built for tools like this that sound thorough until you push on them — then they fall apart. Start with Fit for Purpose. Tell me what type of AI can actually do this job, and what would fail the gate before we even look at accuracy or fairness.

Module 3 Test

Build Your Own Testing Rubric · 15 questions · Pass at 80%

1. What is the primary purpose of an evaluation rubric when testing an AI tool?

Right. A rubric defines criteria, measurement methods, and thresholds before testing — which is what prevents evaluators from adjusting their standards after seeing results they like or dislike.

A rubric's purpose is to define criteria, methods, and thresholds before testing begins. This pre-commitment prevents standards from shifting based on what the results turn out to be.

2. Amazon's recruitment AI (built in 2014, scrapped by the time Reuters reported on it in 2018) penalized résumés containing the word "women's." The root cause was:

Correct. The criterion was "predict who gets hired" — a prediction the tool made accurately based on historical data. The problem was that historical hiring patterns embedded gender bias, and no rubric had asked whether "who got hired in the past" was a valid proxy for "who should be hired."

The tool performed as designed — predicting historical hiring outcomes. The failure was in choosing that criterion without questioning whether past hiring patterns were a fair or appropriate standard to optimize for.

3. "Distribution shift" in AI testing means:

Correct. Distribution shift is the gap between test conditions and real-world conditions. The dermatology AI showed this: tested on professional photos, deployed with smartphone images. A rubric specifying that test inputs must match real deployment inputs would catch this before deployment.

Distribution shift is when the test inputs and real-world inputs are different. It's one of the most common reasons an AI that looks great in testing performs poorly in actual use.

4. A content moderation AI is 94% accurate overall but removes content in Spanish at twice the rate it removes equivalent content in English. Which dimension of the rubric does this most directly violate?

Right. High overall accuracy hiding major performance disparities between language groups is a fairness failure. The rubric's fairness criterion — requiring disaggregated performance reporting — would surface this disparity that the overall number conceals.

Unequal treatment across language groups is a fairness failure. The 94% overall accuracy number hides a disparity that a fairness criterion — requiring performance to be reported separately by language — would catch and flag.

5. The HELM evaluation framework (Stanford, 2022) argued that most prior AI benchmarks were inadequate because:

Correct. HELM's core finding was that narrow benchmarks give a misleading picture — a tool can score well on a single accuracy metric while failing badly on fairness, robustness, or real-world performance. A multi-dimensional rubric is the solution.

HELM's main critique was about narrow criteria — measuring too few dimensions. This led to misleading comparisons and tools that looked good on benchmarks but failed in deployment on dimensions nobody had thought to test.

6. You are writing an accuracy criterion for an AI tool that flags potentially plagiarized student essays. Which version is strongest?

Yes. This criterion specifies the right measures (sensitivity and specificity, not just overall accuracy), representative test inputs (matching the real student population), and required disaggregation (by grade level and first language). Each of those additions closes a gap that vaguer criteria leave open.

Option C specifies what's being measured (sensitivity and specificity), what inputs to use (representative of real students), and requires results broken down by relevant subgroups. These three elements together make it a criterion that means something — the others leave too much undefined.
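
To see why sensitivity and specificity say more than overall accuracy, here is a small worked example in Python. All of the counts are invented for illustration; they do not describe any real plagiarism tool.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of actually plagiarized essays that the tool flagged."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of honest essays that the tool correctly left alone."""
    return true_neg / (true_neg + false_pos)


# Made-up test set: 1,000 essays, 50 of them actually plagiarized.
tp, fn = 40, 10    # plagiarized essays: caught vs. missed
tn, fp = 900, 50   # honest essays: cleared vs. wrongly flagged

overall = (tp + tn) / (tp + fn + tn + fp)

print(f"Overall accuracy: {overall:.1%}")              # 94.0%
print(f"Sensitivity:      {sensitivity(tp, fn):.1%}")  # 80.0%
print(f"Specificity:      {specificity(tn, fp):.1%}")  # 94.7%
```

In this made-up case the tool is 94% accurate overall, yet it misses one in five plagiarized essays and wrongly flags 50 honest students. A criterion that only names "accuracy" would never surface either number.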

7. The COMPAS recidivism tool was found to be biased against Black defendants. What made this especially hard to challenge legally?

Right. A proprietary black-box algorithm in a high-stakes legal context means no one can explain a score to the person it affects — and without an explanation, there's nothing to challenge. This is why the rubric's transparency criterion must require at minimum an audit trail, and for high-stakes decisions, a human-readable explanation.

The proprietary, black-box nature of COMPAS was the core transparency failure. When an algorithm affects someone's freedom but no one can explain how the score was produced, accountability is impossible. The rubric's transparency criterion addresses this directly.

8. In your complete rubric template, which transparency level is appropriate for an AI tool used to decide which students receive after-school tutoring resources?

Right. Decisions that affect individual students' access to educational resources should be reviewable — by the student, their parents, and their teachers. An audit trail is the floor; explanations are better. A black box is inappropriate here because accountability requires understanding.

When an AI decision affects an individual's access to educational resources, it is not low stakes. Parents and educators should be able to understand why a student was or wasn't recommended — which requires at minimum an audit trail and ideally a human-readable explanation.

9. A robustness test for an AI customer service chatbot would most usefully include:

Correct. Robustness testing deliberately introduces the kinds of messiness that appear in real use — typos, dialect variation, off-topic questions — to measure how gracefully the tool handles inputs that differ from its ideal training conditions.

Robustness tests are specifically designed to test edge cases and unusual inputs — not the ideal conditions used for accuracy testing. The goal is to measure how much performance degrades when inputs get messier, which is what real-world use looks like.
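
The idea of "how much performance degrades" can be made concrete with a short calculation. The sketch below assumes a rubric that requires the tool to keep at least 85% of its clean-input performance; that threshold, the scores, and the function names are all invented for this example.

```python
def retained_share(stressed_score: float, clean_score: float) -> float:
    """Fraction of clean-input performance kept under a stress test."""
    return stressed_score / clean_score


def failing_tests(clean_score: float, stressed_scores: dict, min_retained: float) -> list:
    """Names of stress tests where the tool keeps less than the required share."""
    return [
        name
        for name, score in stressed_scores.items()
        if retained_share(score, clean_score) < min_retained
    ]


# Made-up chatbot results: share of questions answered correctly.
clean = 0.90
stressed = {
    "typos": 0.81,                # keeps about 90% of clean performance
    "dialect variation": 0.72,    # keeps about 80%
    "off-topic questions": 0.63,  # keeps about 70%
}

for name, score in stressed.items():
    print(f"{name}: keeps {retained_share(score, clean):.0%} of clean performance")

print("Fails the 85% requirement:", failing_tests(clean, stressed, min_retained=0.85))
```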

10. Which of the following BEST describes why "overall accuracy" is an insufficient fairness metric?

Exactly. An average can hide enormous variation. A tool can be 92% accurate overall while being 60% accurate for a minority subgroup — and the overall number will never reveal that unless you require disaggregated reporting. That's what a fairness criterion forces.

Overall accuracy averages across all users — and high averages can conceal very poor performance for specific groups. Fairness requires measuring accuracy (and error rates) separately for each group that matters, which is exactly what disaggregated reporting in the rubric achieves.
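
The arithmetic behind this is easy to check. The Python sketch below uses invented counts (900 users in one group, 100 in another) and an assumed maximum allowed gap of 5 percentage points; the group names, numbers, and function names are hypothetical.

```python
# Made-up results: number of users tested and number of correct outputs.
GROUPS = {
    "majority group": {"n": 900, "correct": 860},
    "minority group": {"n": 100, "correct": 60},
}


def overall_accuracy(groups: dict) -> float:
    """Accuracy averaged over everyone, ignoring group membership."""
    total = sum(g["n"] for g in groups.values())
    correct = sum(g["correct"] for g in groups.values())
    return correct / total


def accuracy_by_group(groups: dict) -> dict:
    """Disaggregated reporting: accuracy computed separately for each group."""
    return {name: g["correct"] / g["n"] for name, g in groups.items()}


def passes_fairness(groups: dict, max_gap: float) -> bool:
    """True if the best and worst group accuracies differ by at most max_gap."""
    rates = accuracy_by_group(groups).values()
    return max(rates) - min(rates) <= max_gap


print(f"Overall: {overall_accuracy(GROUPS):.1%}")  # 92.0%
for name, rate in accuracy_by_group(GROUPS).items():
    print(f"{name}: {rate:.1%}")                   # 95.6% and 60.0%
print("Passes 5-point gap limit:", passes_fairness(GROUPS, max_gap=0.05))
```

The overall figure is 92% because the larger group dominates the average. Only the per-group numbers reveal the gap of roughly 36 points.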

11. A company claims their AI translation tool is "industry-leading" because it outperforms competitors on the WMT benchmark. Why might this claim not be enough to pass your rubric's accuracy criterion?

Right — though the threshold point is also worth thinking about. The core issue is that a benchmark score on a specific test set may not transfer to your actual use case. Your rubric specifies that accuracy must be measured on inputs representative of your real deployment conditions — which the WMT benchmark may not be.

Benchmark scores on standard datasets may not reflect performance on your specific inputs. Your rubric's accuracy criterion requires that testing be done on conditions representative of your actual use case — which a generic benchmark may not provide. Outperforming competitors on their chosen test set is not the same as passing your rubric.

12. The Optum healthcare algorithm (2019) used healthcare cost as a proxy for healthcare need. This caused a fairness failure because:

Correct. This is a proxy discrimination pattern: when a variable that seems neutral (cost) is systematically correlated with a protected characteristic (race) due to structural inequality, using it as a proxy for need produces unfair outcomes even without any intent to discriminate. A fairness criterion requiring outcome testing by demographic group would have caught this.

Cost and need diverged across racial groups because access to care was historically unequal. Patients with equal medical need showed different historical costs depending on their ability to access care — so optimizing for cost produced racially unequal predictions of need. This is called proxy discrimination.

13. A rubric's fit-for-purpose criterion acts as a "gate." Which of the following scenarios correctly applies this gate?

Yes — this is exactly how the gate works. A generative AI cannot retrieve confirmed facts; it can only generate text that sounds like confirmed facts. Because the task requires verified retrieval, the tool fails the gate regardless of how it scores on any other dimension. Evaluation stops here.

Option C correctly applies the gate logic: the tool type (generative) is wrong for the task type (verified retrieval), so evaluation stops — it doesn't matter how the tool scores on accuracy or other criteria, because those scores would be measuring the wrong capability for the wrong task.

14. A school district's procurement team evaluates an AI grading tool and reports it is "accurate and fast." Which elements of a complete rubric are missing from this evaluation?

Exactly. "Accurate and fast" addresses one dimension (accuracy, incompletely) and a performance attribute (speed, which isn't even on the five-dimension rubric as a primary criterion). Fit for purpose, fairness, robustness, and transparency are all unaddressed — and all relevant for a tool used to grade students.

"Accurate and fast" only touches one rubric dimension (accuracy, and incompletely at that). All other dimensions — fit for purpose, fairness, robustness, and transparency — are unaddressed. For a tool making decisions that affect students' academic records, all five dimensions are relevant.

15. You're advising a city that wants to use AI to help allocate funding for after-school programs across neighborhoods. Using the complete five-dimension rubric, which dimension should you evaluate FIRST, and why?

Right. Fit for Purpose is always the gate — it comes first in every evaluation. Only once you've confirmed that the tool type matches the task (and that it operates on verified, relevant data about actual neighborhood needs) does it make sense to evaluate accuracy, fairness, robustness, or transparency. A tool that fails the gate cannot redeem itself with strong scores on other dimensions.

Fit for Purpose is always evaluated first because it's the gate criterion. If the tool type is wrong for the task, no score on any other dimension matters — they would all be measuring the wrong thing. Fairness and transparency are critically important here, but they come after confirming the tool is even appropriate for resource allocation.