In 2014, Amazon's machine-learning team built a tool they were proud of. Its job: read thousands of software-engineering résumés and score each one from one to five stars, the same scale a recruiter would use. The team trained it on ten years of Amazon's own hiring data. They thought they were building an objective judge. They were wrong.
By 2015, internal testers noticed the system was quietly downgrading résumés that included the word "women's," as in "women's chess club" or "women's coding bootcamp." In 2018, Reuters reported the full story: Amazon had scrapped the tool entirely. It had learned from historical data that Amazon, like most big tech companies, had hired more men than women in technical roles. So it treated "male-coded" language as a quality signal.
Here is the critical detail most people overlook: the tool was performing exactly as designed. Amazon asked it to predict who would get hired. It predicted correctly. The problem was that nobody had written down what "good hiring" actually meant before building the evaluation. Was the goal to reproduce past patterns? To find the highest-skill candidates? To build a more diverse team? Three different goals call for three completely different rubrics, and Amazon had never chosen between them.
A rubric is a written-down set of criteria: the things you're measuring and what counts as a good or bad score on each one. Teachers use them when grading essays. Judges use them in competitions. Scientists use them when they review each other's research papers.
When you're testing an AI tool, a rubric does the same job. It forces you to answer a question that feels obvious but usually isn't: what exactly are we trying to measure?
Amazon's team measured "likelihood to be hired." That's a real metric. But it wasn't the right one, because it captured something about the past, not something about actual skill or potential. A rubric without the right criteria is worse than no rubric at all, because it gives you false confidence.
Here's what a rubric actually needs before you test anything:
In November 2022, a team of researchers at Stanford published a paper called "Holistic Evaluation of Language Models," known as HELM. Their central complaint was stark: almost every AI benchmark published before 2022 measured only one or two things (usually raw accuracy on a specific task) and ignored everything else. Speed, fairness, robustness, and cost were all treated as someone else's problem.
The researchers found they couldn't even compare tools fairly because each one had been tested on different criteria. It was like comparing soccer players by having some of them play a real match and others do a skills drill, then announcing a winner.
Why does this happen? Because writing a rubric is slow and boring and building the actual tool feels exciting. Teams under deadline pressure skip the boring part. Then they test against whatever metric is easiest to calculate. Then they ship.
The result is that most AI tools in the world have never been formally evaluated against a rubric that covers all the things that actually matter to the people using them. You can now see what most people, including most professional engineers, skip entirely.
Amazon's tool gave accurate predictions based on what it was trained to predict. If a company's historical hiring was biased, and an AI learns from that history, who is responsible for the outcome: the tool, the engineers who built it, the executives who approved it, or the system that created the biased history in the first place? There is no answer here that makes everyone comfortable. Sit with it.
Based on the HELM research and on real-world deployment failures like Amazon's, here are five dimensions that almost every serious AI evaluation should include. This module will build your rubric around these five, and by the end, you'll be able to apply them to any AI tool you encounter.
| Dimension | What It Measures | Why It Can Be Tricky |
|---|---|---|
| Accuracy | Does it produce correct outputs on the task it claims to do? | Depends entirely on what "correct" means, which you have to define first. |
| Fairness | Does it perform equally well across different groups of people? | A tool can be accurate on average but terrible for specific groups. |
| Robustness | Does it stay reliable when inputs are slightly unusual or unexpected? | Most tools are tested on clean, typical data. Real life is messy. |
| Transparency | Can you understand why it gave a particular output? | High accuracy and low transparency often go together: the best models are the hardest to explain. |
| Fit for Purpose | Is this tool the right kind of AI for this specific job? | A hammer can be great at hammering and terrible at measuring. Same tool, wrong task. |
In Lessons 2, 3, and 4 you'll stress-test each of these dimensions with real cases and build the actual scoring system. But right now, the important move is recognizing that all five dimensions must be present in any serious rubric. Leave one out and your evaluation has a blind spot, and blind spots are exactly where the expensive failures hide.
When a company announces their new AI is "95% accurate," you now know the right questions: Accurate on what task, measured how, across which groups, under what conditions? Those four follow-up questions, which almost no journalist and very few product managers ask, are the difference between a meaningful evaluation and a marketing slide.
A school district in Austin, Texas is about to buy an AI tool that reads student essays and gives them a score from 1 to 10. Before they spend the money, you've been asked to design the rubric that will be used to evaluate whether the tool is actually good enough to use. Your lab partner below has opinions and will push back on yours.
Start by telling your lab partner: what criteria should be on this rubric, and why? Then defend your choices when challenged.
In January 2020, a team at the University of Modena published results showing their deep-learning model for detecting skin cancer achieved an accuracy of 91.3% on a standard benchmark dataset, compared to 77.4% for the average dermatologist. Headlines ran. The story was picked up by the BBC, Forbes, and dozens of science outlets. "AI Beats Doctors," the summaries said.
What the headlines didn't say: the benchmark dataset consisted almost entirely of high-quality photographs taken by dermatologists under controlled lighting. When the same model was tested on images taken with ordinary smartphones, which is how most teledermatology actually works, performance dropped by roughly 20 percentage points. When tested on patients with darker skin tones, it dropped further still, because the training dataset was over 80% images of light-skinned patients.
The model was genuinely accurate at the specific, narrow task it was tested on. But "accurate on a curated benchmark dataset" and "accurate in the real world" are two completely different claims. The rubric had measured accuracy. It just measured the wrong version of it.
When you write "accuracy" on a rubric, you have to specify three things or the criterion is useless:
1. Accurate on what inputs? The dermatology AI was accurate on professional photographs. Real clinics use smartphones. The inputs in the test have to match the inputs in real use.
2. Accurate by what measure? Overall accuracy (the percentage of all predictions that are correct) can be misleading. On a test set where 95% of cases are "not cancer," a model could reach 95% accuracy by saying "not cancer" every single time, and it would miss every real cancer case. Researchers use additional measures: sensitivity (how often does it catch real positives?) and specificity (how often does it correctly clear negatives?).
3. Accurate for whom? This is where accuracy and fairness overlap. A single overall accuracy number hides variation across subgroups. Your rubric should require accuracy to be reported separately for different groups when the tool will be used across a diverse population.
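The "always say not cancer" trap from point 2 is easy to reproduce. The sketch below is illustrative only: the test set is invented (95 negative cases, 5 positive), and the function names `confusion_counts` and `accuracy_report` are hypothetical helpers, not part of any particular library.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes for binary labels, where 1 = cancer and 0 = not cancer."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy_report(y_true, y_pred):
    """Return overall accuracy plus sensitivity and specificity."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # Sensitivity: share of real positives the tool catches.
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        # Specificity: share of real negatives the tool correctly clears.
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }


# Invented test set: 95 "not cancer" cases and 5 cancer cases.
y_true = [0] * 95 + [1] * 5
# A "model" that says "not cancer" every single time.
always_negative = [0] * 100

report = accuracy_report(y_true, always_negative)
print(report)
```

On this made-up data the report shows 95% accuracy next to a sensitivity of zero: the headline number looks strong while every real cancer case is missed.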
Here's a weak accuracy criterion you'd find in most corporate evaluations:
"The tool must achieve at least 90% accuracy."
Here's a stronger version that actually holds up under scrutiny:
"The tool must achieve at least 85% sensitivity and 85% specificity on inputs representative of real deployment conditions, including the range of device types, lighting conditions, and demographic groups present in our actual user population. Results must be reported separately by subgroup."
Notice what changed: the threshold got slightly lower (85% instead of 90%), but the criterion is vastly more demanding, because it now specifies the right inputs and the right measures, and it requires disaggregated (separated-out) results. A tool that fails any one of those sub-requirements fails the criterion entirely, no matter how good it looks in aggregate.
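The "fails any one sub-requirement, fails entirely" rule can be written as a short check. This is a minimal sketch under stated assumptions: the subgroup names and scores are invented for illustration, and `passes_accuracy_criterion` is a hypothetical helper that assumes sensitivity and specificity have already been measured per subgroup.

```python
def passes_accuracy_criterion(results_by_subgroup,
                              min_sensitivity=0.85,
                              min_specificity=0.85):
    """Pass only if every subgroup meets both thresholds.

    Returns (passed, failing_subgroups).
    """
    failing = [
        name
        for name, r in results_by_subgroup.items()
        if r["sensitivity"] < min_sensitivity
        or r["specificity"] < min_specificity
    ]
    return len(failing) == 0, failing


# Invented, illustrative results reported separately by subgroup.
results = {
    "clinic camera": {"sensitivity": 0.92, "specificity": 0.90},
    "smartphone": {"sensitivity": 0.71, "specificity": 0.88},
}

passed, failing = passes_accuracy_criterion(results)
print(passed, failing)
```

With these made-up numbers the tool fails, because the smartphone subgroup misses the sensitivity threshold even though the clinic-camera results look strong.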
This is what a rubric does. It prevents a tool from hiding its weaknesses behind its strengths.
The dermatology AI genuinely performed well on the data it was tested on. The researchers published an honest result. Journalists made it into a larger claim than the data supported. When an AI tool underperforms for certain groups in real use, who should be held responsible: the researchers who built it, the journalists who simplified it, the hospitals that deployed it, or the regulatory agencies that approved it? All of them had information and made choices. None of them alone caused the problem.
When you add accuracy to your rubric, use this three-part structure:
| Part | Question to Answer | Example Answer |
|---|---|---|
| Task Definition | What exactly is the tool being asked to do? | "Flag student essays likely to contain plagiarism." |
| Test Conditions | On what inputs? How similar to real use? | "On essays written by students aged 11 to 16, submitted via the school's actual portal." |
| Measure + Threshold | What counts as correct, and what's the minimum? | "At least 80% sensitivity (catching real plagiarism) and 90% specificity (clearing innocent work). Reported by grade level." |
If you can fill in all three parts for accuracy, you have a criterion that means something. If any part is blank, the criterion is a decoration: it looks official but doesn't constrain anything.
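One way to keep yourself honest about blank parts is to treat the criterion as a small record and check it before use. The sketch below is only an illustration: `AccuracyCriterion` and `missing_parts` are hypothetical names, and the example text is taken from the table above.

```python
from dataclasses import dataclass


@dataclass
class AccuracyCriterion:
    """The three parts of an accuracy criterion, each as plain text."""
    task_definition: str = ""
    test_conditions: str = ""
    measure_and_threshold: str = ""

    def missing_parts(self):
        """List the names of any parts that are still blank."""
        parts = {
            "task_definition": self.task_definition,
            "test_conditions": self.test_conditions,
            "measure_and_threshold": self.measure_and_threshold,
        }
        return [name for name, text in parts.items() if not text.strip()]


# The weak corporate version: a threshold with no task or conditions.
weak = AccuracyCriterion(measure_and_threshold="At least 90% accuracy.")

# The version from the table, with all three parts filled in.
strong = AccuracyCriterion(
    task_definition="Flag student essays likely to contain plagiarism.",
    test_conditions="Essays by students aged 11 to 16, submitted via the school's actual portal.",
    measure_and_threshold="At least 80% sensitivity and 90% specificity, reported by grade level.",
)

print(weak.missing_parts())
print(strong.missing_parts())
```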
Knowing how to spot the difference between a meaningful accuracy claim and a marketing number is a skill that most people working with AI tools, inside and outside of companies, do not have. You now do.
A company has pitched their AI reading-level assessment tool to your school. Their sales sheet says: "93% accuracy, validated on 10,000 student texts." Your principal wants to know if this number means anything. You've been put in charge of figuring that out.
Your lab partner has seen this kind of claim dozens of times. Start by telling them what questions you'd ask the vendor to determine whether 93% actually means anything, and be specific about why each question matters.
In October 2019, researchers at Berkeley published a paper in the journal Science documenting a healthcare management algorithm, built by a company called Optum and used by major US insurers, that helped decide which patients needed additional care. Algorithms of this kind are applied to an estimated 200 million people in the US each year.
The researchers found that Black patients who were just as sick as white patients were systematically assigned lower risk scores, meaning they received less additional care. The gap was striking: at any given risk score, Black patients were on average significantly sicker than white patients.
The cause wasn't a coding error. The algorithm predicted future healthcare costs as a proxy for healthcare need. And because Black patients had historically spent less on healthcare, due to systemic economic barriers and unequal access, the algorithm learned that they "needed" less. The tool was technically accurate at predicting costs. It was profoundly unfair at predicting need.
Optum disputed aspects of the study but acknowledged the disparity and updated the algorithm. The damage, however, had already been done across years of deployment, and no rubric in the procurement process had included a fairness criterion that would have caught it.
Fairness in a rubric is not about whether an AI was built with good intentions. It's about whether it produces consistent, equitable outcomes across the people who will use it. Here's how to write a fairness criterion that actually works:
Step 1: Identify the relevant groups. Who will use this tool? Are there groups whose outcomes you need to check separately? For a healthcare algorithm: patients by race, income level, age, and chronic condition status. For a grading tool: students by first language, grade level, and disability status.
Step 2: Define what "equal" means. Equal can mean different things. Equal accuracy rates across groups? Equal error rates? Equal outcomes? These are different, and sometimes in tension. You must choose.
Step 3: Set a maximum allowed gap. If the tool performs 5 percentage points worse for one group than another, is that acceptable? 10 points? The rubric must specify a number. Without a number, every gap is negotiable.
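Once the groups, the definition of "equal," and the maximum gap are chosen, the check itself is mechanical. Here is a minimal sketch, assuming "equal" means equal error rates and the limit is 5 percentage points. The group names and error rates are invented for illustration, and `largest_gap` and `passes_fairness_criterion` are hypothetical helper names.

```python
def largest_gap(error_rates_pct):
    """Return (gap in percentage points, best group, worst group).

    error_rates_pct maps each group name to its error rate in percent.
    """
    worst = max(error_rates_pct, key=error_rates_pct.get)
    best = min(error_rates_pct, key=error_rates_pct.get)
    return error_rates_pct[worst] - error_rates_pct[best], best, worst


def passes_fairness_criterion(error_rates_pct, max_gap_points=5.0):
    """Pass only if no two groups differ by more than the allowed gap."""
    gap, _, _ = largest_gap(error_rates_pct)
    return gap <= max_gap_points


# Invented error rates (in percent) for a grading tool, by first language.
error_rates_pct = {
    "first language: English": 8.0,
    "first language: Spanish": 15.0,
    "first language: other": 11.0,
}

gap, best, worst = largest_gap(error_rates_pct)
print(gap, best, worst)
print(passes_fairness_criterion(error_rates_pct))
```

With these made-up numbers the gap is 7 points, so the tool fails a 5-point limit. Because the limit is written down in advance, that result is not open to negotiation afterward.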
"The tool's error rate must not differ by more than 5 percentage points across demographic groups defined by race, gender, and first language, as measured on a test set that proportionally represents the actual user population."
The Optum algorithm made a technically rational choice: predict costs, because costs are measurable. The problem is that cost and need are not the same, especially when access to care is unequal. If you're building a rubric for a healthcare AI, should you require equal accuracy, equal outcomes, or some other definition of fairness? And what happens when achieving one kind of fairness makes another kind harder? Researchers call this the "impossibility theorem" of fairness: except in special cases, several common definitions of fairness cannot all be satisfied at the same time.
In 2021, a research team at MIT published results showing that self-driving car systems from multiple manufacturers performed significantly worse in rain and snow than in clear conditions, in large part because most of their training data came from sunny California weather. This is a robustness failure: the tool works under ideal conditions but degrades when conditions become slightly unusual.
Robustness testing means deliberately giving a tool inputs that are different from the "typical" case and measuring how gracefully (or badly) it handles them. For a writing-assistance AI, robustness tests might include:
| Test Type | What You Do | What Failure Looks Like |
|---|---|---|
| Edge Case | Feed it an unusually short or long input | Crashes, refuses, or gives absurd output |
| Adversarial | Add deliberate misspellings or unusual formatting | Produces confident but wrong output |
| Dialect/Language Shift | Use inputs from a different dialect or register | Drops in quality specifically for those inputs |
| Out-of-Domain | Give it a task slightly outside its training focus | Doesn't recognize its own limits and continues anyway |
A robustness criterion in your rubric might say: "The tool must maintain at least 80% of its peak accuracy performance when tested on inputs with intentional formatting errors, non-standard dialects, and lengths more than twice the average training example length." This is measurable, specific, and actually tests whether the tool works in real life, not just in a lab.
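The "80% of peak accuracy" rule reduces to a ratio per stress test. The following sketch is illustrative only: the accuracy figures are invented, and `robustness_report` is a hypothetical helper that assumes accuracy has already been measured under each condition.

```python
def robustness_report(peak_accuracy, stressed_accuracy, min_retention=0.80):
    """Compare accuracy under each stress test with peak accuracy.

    Retention is stressed accuracy divided by peak accuracy.
    A test passes if retention is at least min_retention.
    """
    report = {}
    for test_name, accuracy in stressed_accuracy.items():
        retention = accuracy / peak_accuracy
        report[test_name] = {
            "retention": round(retention, 3),
            "passes": retention >= min_retention,
        }
    return report


# Invented numbers: accuracy on clean inputs versus three stress tests.
peak = 0.90
stressed = {
    "formatting errors": 0.81,
    "non-standard dialect": 0.63,
    "double-length inputs": 0.76,
}

report = robustness_report(peak, stressed)
for test_name, row in report.items():
    print(test_name, row)
```

In this made-up example the tool holds up under formatting errors and long inputs but keeps only about 70% of its peak accuracy on non-standard dialects, so it fails that part of the criterion.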
In 2016, ProPublica published an investigation into a tool called COMPAS, used by courts in the US to predict whether a defendant was likely to commit another crime. Judges used its scores when deciding on bail and sentencing. Nobody outside the company that made it could explain how it worked, because the company called the algorithm proprietary (meaning they owned it and wouldn't reveal it).
The ProPublica investigation found COMPAS was twice as likely to falsely flag Black defendants as high-risk compared to white defendants. Because the algorithm was a black box (no explanation available), there was no way for defendants to challenge the score in court. You can't argue against a number when no one can say where the number came from.
Transparency in a rubric doesn't always mean "fully open source." It means: can a decision made by this tool be explained to the person it affects? There are different levels:
Your rubric should specify what transparency level is required for the stakes involved. A music recommendation AI can be a black box. A tool used in school discipline, criminal justice, healthcare, or hiring should not be. The rubric makes this explicit β so a vendor can't sell you a black box when your situation requires accountability.
This is what most institutions buying AI tools right now are failing to ask. You now know what question to put in writing before signing anything.
A middle school in Chicago is using an AI tool to decide which students get placed in an accelerated math track. The tool takes in grades, test scores, and teacher ratings, and outputs a "readiness score" from 1 to 100. Students above 70 are considered for the accelerated track. The school won't tell anyone how the tool calculates the score; it came from a vendor who calls it proprietary.
A group of parents has asked you to write the fairness and transparency sections of an evaluation rubric for this tool. Your lab partner will push you to be precise. Vague criteria won't be accepted.
In June 2023, a federal judge in New York held a very unusual hearing. The case was Mata v. Avianca, a personal injury lawsuit against an airline. Lawyers from the firm Levidow, Levidow and Oberman had submitted a legal brief that cited six court cases as precedents. The opposing lawyers couldn't find any of those cases. The judge ordered the filing attorneys to explain themselves.
The attorneys admitted that Steven Schwartz, a lawyer at the firm with 30 years of experience, had used ChatGPT to help research the brief. ChatGPT had generated the citations. All six cases were fabricated: they had never existed. When Schwartz had asked ChatGPT to confirm the cases were real, it had said yes. When he asked it to provide the actual text of the rulings, it produced plausible-sounding but entirely made-up text.
The judge fined the lawyers and the firm a total of $5,000. The story ran internationally. And the core problem was almost embarrassingly simple: ChatGPT was not built for legal citation research. It is a text-generation tool. It generates plausible-sounding sequences of words. Legal citations require confirmed facts from verifiable sources. No one had asked whether the tool was appropriate for the task before using it.
The first four dimensions (accuracy, fairness, robustness, transparency) assume you already have the right kind of tool and are measuring how well it performs. "Fit for Purpose" is the question you ask before all of that: is this type of AI even appropriate for this task?
Different AI systems have fundamentally different capabilities and failure modes. A rubric that doesn't ask this question first can produce high scores on all four dimensions and still point you toward a catastrophically wrong choice.
Schwartz used a generative AI for a retrieval task. That's not a failure of accuracy, fairness, robustness, or transparency; it's a failure to match tool type to task type. A fit-for-purpose criterion would have asked: "Does this AI retrieve verified information from authoritative legal databases, or does it generate plausible text?" That one question prevents the entire disaster.
A fit-for-purpose criterion has two parts. First, a task analysis: what exactly does this job require? Second, a tool-type match: does the AI's underlying mechanism actually do that?
| Task Type | What It Requires | Right Tool Type | Wrong Tool Type |
|---|---|---|---|
| Find confirmed facts from records | Retrieval from authoritative sources | Retrieval AI, database search | Generative AI (will hallucinate) |
| Sort incoming requests into categories | Pattern matching, consistent categorization | Classification AI | Generative AI (inconsistent, unpredictable) |
| Draft text from a brief outline | Fluent language generation | Generative AI | Classification AI (can't generate) |
| Flag fraud in financial transactions | Fast pattern matching at scale | Classification AI | Retrieval AI (not built for real-time classification) |
A strong fit-for-purpose criterion in your rubric might look like this: "Prior to evaluation against any other criteria, the selection panel must confirm in writing that the tool's underlying architecture matches the task's core requirement. If the task requires retrieval of verified facts, a purely generative system cannot pass this criterion regardless of its accuracy scores on other dimensions."
That last clause matters. A tool cannot "make up for" a fit-for-purpose failure by scoring well on other criteria. This is a gate criterion: it comes first, and a failure here ends the evaluation.
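The difference between a gate and a scored dimension shows up clearly in code: the gate is checked first, and nothing after it runs if it fails. This is a minimal sketch, assuming each tool is described by a plain dictionary. The field names, the `evaluate` function, and the scores are all hypothetical.

```python
SCORED_DIMENSIONS = ("accuracy", "fairness", "robustness", "transparency")


def evaluate(tool):
    """Apply the fit-for-purpose gate before looking at any scores."""
    # Gate criterion: a failure here ends the evaluation immediately.
    if not tool["fit_for_purpose"]:
        return {"result": "failed gate", "scores": None}
    # Only tools that pass the gate are scored on the other dimensions.
    scores = {dim: tool["scores"][dim] for dim in SCORED_DIMENSIONS}
    return {"result": "scored", "scores": scores}


# Invented example: a generative chatbot proposed for legal citation lookup.
chatbot = {
    "fit_for_purpose": False,  # the task needs retrieval, the tool generates
    "scores": {"accuracy": 9, "fairness": 8, "robustness": 8, "transparency": 6},
}

# Invented example: a retrieval system over a legal database.
retrieval_tool = {
    "fit_for_purpose": True,
    "scores": {"accuracy": 7, "fairness": 8, "robustness": 7, "transparency": 9},
}

print(evaluate(chatbot))
print(evaluate(retrieval_tool))
```

The chatbot's high scores are never read, which is the point: a gate failure cannot be averaged away by strong results elsewhere.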
Steven Schwartz had 30 years of legal experience. He asked ChatGPT if the cases were real and it confirmed they were. When a professional trusts an AI tool's confident-sounding response in a domain they don't fully understand (in this case, how language models actually work), how much of the responsibility for the harm belongs to the professional, and how much belongs to the tool's designers for building something that confidently fabricates without warning? And what about the law firm's clients, who paid for research that never happened?
You've now built all five dimensions. Here is the complete rubric template you can apply to any AI tool evaluation. This is what a serious, professional-grade evaluation framework looks like, and most organizations buying AI tools right now are using something far less complete.
| Dimension | Gate / Scored | Minimum Requirement | What Must Be Reported |
|---|---|---|---|
| Fit for Purpose | Gate (must pass first) | Tool architecture matches task type. Written confirmation required. | Tool type, task type, reasoning for match. |
| Accuracy | Scored | Sensitivity and specificity specified by task; tested on inputs representative of real deployment. | Accuracy by subgroup, test conditions described. |
| Fairness | Scored | Maximum performance gap across defined demographic groups specified in advance. | Performance reported by each group; gap compared to threshold. |
| Robustness | Scored | Performance maintained at a stated percentage of peak under edge cases, adversarial inputs, and dialect variation. | Degradation rate under each test type. |
| Transparency | Scored (scaled by stakes) | Low-stakes: audit trail. High-stakes: human-readable explanation per output. | Transparency level, how decisions can be reviewed or appealed. |
When a government agency, a school district, a hospital, or a company announces it is adopting an AI tool, you now know what a complete evaluation rubric looks like, and you can tell, often from a single press release, whether anyone used one. Most haven't. That gap between the rubric that should exist and the rubric that does exist is where most of the high-profile AI failures of the last decade have occurred. You know how to close it.
A large city school district is considering adopting an AI tool to help decide which students qualify for special education services. The tool analyzes teacher reports, test scores, and behavioral records and outputs a recommendation: "evaluate further" or "no evaluation needed." This decision affects thousands of students per year.
You are the lead evaluator. Your job is to write the complete five-dimension rubric (fit for purpose, accuracy, fairness, robustness, and transparency) before the district signs any contract. Your lab partner will review each criterion and challenge you to make it more specific, more enforceable, or more honest about the tradeoffs.