Module 2 · Lesson 1

The Benchmark That Broke

How a test that was supposed to prove AI intelligence ended up proving the opposite

If an AI aces every exam, does that mean it's smart — or just that it studied the answers?

In August 2023, researchers at UC Berkeley published a short, alarming paper. They had noticed something strange about GPT-4 and other top AI models: the models were scoring extraordinarily high on a benchmark called MMLU — Massive Multitask Language Understanding — a test covering 57 subjects from college physics to moral philosophy.

But when the researchers looked more carefully, they found that many of the MMLU questions had the wrong answers listed as correct. The test had been written by humans and never carefully proofread. Some questions were ambiguous. Some had two defensible correct answers. And the AI models — trained on enormous amounts of internet text that included MMLU practice materials — had learned to give the answers that matched the answer key, not the answers that were actually right.

Aarohi Narayanan, a graduate student on the team, described it bluntly: "We found cases where the model answered incorrectly but would have been correct according to any reasonable interpretation, and cases where it answered correctly but for clearly the wrong reason." The models had gotten good at the test, not at the thing the test was supposed to measure.
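The gap between matching the answer key and actually being right can be shown with a toy quiz. The sketch below is only an illustration: the five questions, the flawed key, and the two test-takers are all invented, and none of it comes from MMLU itself.

```python
# Invented five-question quiz. The answer key has one mistake (question 3).
true_answers = ["B", "D", "A", "C", "A"]
answer_key = ["B", "D", "C", "C", "A"]  # question 3 is keyed wrong

memorizer = ["B", "D", "C", "C", "A"]  # reproduces the key, error included
reasoner = ["B", "D", "A", "C", "A"]   # gets every question actually right


def score(responses, reference):
    """Fraction of responses that match the reference answers."""
    matches = sum(r == ref for r, ref in zip(responses, reference))
    return matches / len(reference)


print("Graded against the flawed key:")
print("  memorizer:", score(memorizer, answer_key))
print("  reasoner: ", score(reasoner, answer_key))
print("Graded against the true answers:")
print("  memorizer:", score(memorizer, true_answers))
print("  reasoner: ", score(reasoner, true_answers))
```

Graded against the flawed key, the memorizer looks perfect and the reasoner loses a point. Graded against the true answers, the order flips.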

What Is a Benchmark, Exactly?

A benchmark is simply a test — a standardized set of questions or tasks used to measure how well an AI performs at something. Just like your school uses standardized tests to see how students are doing in math or reading, AI researchers use benchmarks to compare different AI systems against each other.

The idea sounds reasonable. You want to know which AI is better at understanding language? Give them all the same test and see who scores higher. Simple, right?

Here's the problem: a test only measures what it was designed to measure, and only if it was designed well. When the test has errors, when the AI has seen similar questions before, or when the skill being tested isn't the skill you actually care about — the score stops meaning anything real.

Benchmark: A standardized test used to compare AI systems. The score tells you how well an AI did on that specific test, not necessarily how capable the AI is in the real world.

MMLU: Massive Multitask Language Understanding, one of the most widely cited AI benchmarks, covering 57 academic subjects. Released in 2020, it became the go-to measure of AI "general intelligence" until researchers found serious problems with it.

Why AI Companies Love High Benchmark Scores

Here's something worth knowing: AI companies choose which benchmarks to publish. When OpenAI released GPT-4 in March 2023, its technical report listed scores on dozens of tests, and the model did remarkably well on most of them. The report included scores on the bar exam (the test lawyers take), the SAT, the GRE, and many more.

Those scores traveled fast. News headlines read things like "GPT-4 scores in the top 10% of bar exam takers." But those human exams were not designed to test AI; they were designed to test whether a lawyer or doctor has absorbed certain knowledge and can reason under time pressure. An AI doesn't experience time pressure. An AI may have been trained on test prep materials. And an AI doesn't have to hold months of study in memory the way a human test-taker does.

When you see a company announce their model "achieved state-of-the-art performance" on a benchmark, it almost never means the AI is smarter than what came before. It usually means the AI did better on that specific test. Whether that test actually reflects the thing you care about is a separate question entirely — and one most headlines skip over.

The Selective Reporting Problem

AI labs run their models on many benchmarks during development. They are not required to publish all the results. The ones you see in press releases are the ones that look good. This is called cherry-picking, and it's completely legal and extremely common. Knowing this changes how you should read every "breakthrough" headline.
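A little arithmetic shows how much selective reporting can change the picture. The sketch below is hypothetical: the benchmark names and scores are invented and do not describe any real model.

```python
# Invented scores for one model on eight benchmarks (percent correct).
all_scores = {
    "bench_a": 91, "bench_b": 88, "bench_c": 62, "bench_d": 57,
    "bench_e": 85, "bench_f": 49, "bench_g": 70, "bench_h": 58,
}


def mean(values):
    """Average of a non-empty collection of numbers."""
    values = list(values)
    return sum(values) / len(values)


def cherry_pick(scores, k):
    """Return the k highest-scoring benchmarks, as a press release might."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


published = cherry_pick(all_scores, k=3)
published_average = mean(published.values())
full_average = mean(all_scores.values())

print("Published:", published)
print(f"Average of published scores: {published_average:.1f}")
print(f"Average of all scores:       {full_average:.1f}")
```

With these made-up numbers, the three published scores average 88, while the full set averages 70. Nothing in the press release is false; it is just incomplete.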

The Contamination Problem

There's another problem on top of cherry-picking: data contamination. This is what the Berkeley researchers were really uncovering. AI language models are trained on vast amounts of text scraped from the internet. That internet includes websites where people post MMLU questions and answers, blog posts analyzing SAT problems, Reddit threads discussing bar exam strategies.

So when you test the AI on those same problems, you're not testing whether it can reason — you're partially testing whether it saw those questions during training. It's like studying by memorizing the answer key, then being graded on the same test. You might get an A without understanding anything.
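One simple way to look for contamination is to check whether word sequences from a test question also appear in the training text. The sketch below is a simplified illustration of that idea, not any lab's actual method; the function names and the example text are invented, and real checks are much harder because training data is often undisclosed.

```python
def ngrams(text, n):
    """Set of all n-word sequences in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(question, training_text, n=5):
    """Fraction of the question's n-grams that also appear in training_text."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0
    shared = question_grams & ngrams(training_text, n)
    return len(shared) / len(question_grams)


# Invented example data, not real benchmark or training text.
training_text = (
    "forum post: which planet has the shortest year in our solar system "
    "answer is mercury"
)
seen_question = "Which planet has the shortest year in our solar system"
fresh_question = "Which moon of Jupiter shows the most volcanic activity today"

print(overlap_fraction(seen_question, training_text))   # high overlap
print(overlap_fraction(fresh_question, training_text))  # no overlap
```

A high overlap does not prove the model memorized the answer, but it is a warning sign that the score may partly reflect exposure rather than reasoning.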

Researchers at Stanford, Princeton, and other universities have tried to measure how severe this contamination is. The honest answer is: we don't fully know, because AI companies don't always disclose exactly what data their models were trained on. Some companies have been more transparent than others, but none have been fully transparent.

This creates a situation where the people designing the tests, the people building the AI, and the people who fund the AI companies are often not the same as the people who need to trust the AI's scores. You — the person eventually using these tools — are at the end of a very long chain of decisions you weren't part of.

Data Contamination: When an AI is trained on data that includes the questions (or answers) from the benchmark it will later be tested on. The model "remembers" the test rather than reasoning through it fresh.

You Now See What Most People Miss

When you read that an AI scored 90% on some benchmark, you now know to ask: Was that benchmark well-designed? Did the AI train on similar data? Did the company cherry-pick this score? Most journalists don't ask these questions. Most adults don't either. You do now.

An Ethical Question Worth Sitting With

Here's something that doesn't have a clean answer: AI companies have a financial incentive to report high benchmark scores. Higher scores attract investors, generate press coverage, and push people to use their products. The benchmarks themselves are often created by academic researchers with no financial stake — but those researchers have a different incentive: they want their benchmark to be used widely, which means it helps them if major AI companies adopt it and report scores on it.

So you have a situation where the people creating the measuring stick benefit when the measuring stick gets used, and the people being measured benefit when the scores look good. The person with no say in any of this is you — the eventual user who relies on those scores to decide which AI to trust.

Should AI companies be required to report all benchmark scores, not just the favorable ones? Who should get to decide which benchmarks matter — the companies, the researchers, or the public? There's no obvious answer. Sit with it.

Lesson 1 Quiz

Five questions · Test your reasoning, not your memory

1. In August 2023, Berkeley researchers found that top AI models scored high on MMLU primarily because:

Correct. The models had been exposed to benchmark materials during training (data contamination), and the test itself had errors — so high scores reflected test familiarity, not genuine reasoning ability.

Not quite. The issue was data contamination — the AI training data included MMLU questions and answers from the internet — combined with flaws in the test itself. Nobody programmed the answers in directly, and the test wasn't designed to favor AI.

2. A new AI model scores 95% on a math reasoning benchmark. Which question is MOST important to ask before concluding this AI is excellent at math?

Exactly right. Data contamination is the most serious threat to benchmark validity. If the model trained on similar problems, a 95% score tells you very little about real mathematical reasoning ability.

That's not the critical question here. The core issue is data contamination — whether the model encountered similar or identical questions during training. Without knowing that, even a perfect score is hard to interpret.

3. Why is it significant that AI companies choose which benchmark results to publish?

Right. Selective reporting — publishing only favorable scores — is legal and common in the AI industry. It means the benchmark scores you see in headlines are a curated highlight reel, not a complete picture.

Not quite. Selective reporting doesn't prove all unpublished results are negative, and it doesn't mean benchmarks are useless — it means the results you see have been filtered. That's an important distinction.

4. A school district is deciding which AI tutoring tool to buy. The vendor shows benchmark scores from five different tests, all of which the AI aced. Applying what you learned, what should the district's purchasing committee investigate first?

Exactly. The committee should ask: how many benchmarks did you actually run? What did training data include? Selective reporting and contamination are the two biggest threats to benchmark credibility — and the most important things to probe before making a purchasing decision.

That's not the most critical investigation here. The core concerns from this lesson are selective reporting (were other benchmarks hidden?) and data contamination (did the AI train on similar questions?). Those should be the first questions.

5. What does "data contamination" mean in the context of AI benchmarking?

Correct. Data contamination is when the test materials appear in the training data, making benchmark scores reflect memorization rather than genuine reasoning. It's one of the hardest problems in AI evaluation to fully solve.

Not quite. Data contamination means the AI was trained on data that included the benchmark questions — so it may have "seen" the answers before the test, like a student who memorized an answer key.

Lab 1: The Benchmark Auditor

Role: Independent benchmark auditor · Challenge an AI that defends inflated scores

Your Role

You're an independent benchmark auditor. A fictional AI company called Axiom Labs has just announced that their new model, Axiom-7, scored 96% on MMLU and passed the bar exam in the top 5% of takers. The company's AI spokesperson is very proud of these results.

Your job is to challenge those claims. Ask hard questions about data contamination, selective reporting, benchmark design, and what the scores actually prove. The spokesperson will defend Axiom-7 — push back with what you've learned.

Start by asking your first tough question about Axiom-7's benchmark results. Remember: you're the auditor, not the fan.

Axiom Labs — AI Spokesperson

Benchmark Audit Mode

Welcome, auditor. I'm the Axiom Labs spokesperson. Axiom-7 just scored 96.2% on MMLU and ranked in the top 5% of bar exam takers — our best results yet. I'm confident in these numbers. What would you like to know?

Module 2 · Lesson 2

Passing Without Understanding

The difference between scoring high and actually knowing something

Can you pass a medical licensing exam without knowing how to treat a patient? An AI did — and people celebrated.

In January 2023, a team of researchers published a paper in PLOS Digital Health reporting that ChatGPT had performed at or near the passing threshold on all three parts of the United States Medical Licensing Examination (USMLE) without any special medical training. The headlines were immediate and electric. "AI passes medical licensing exam." Some articles suggested this meant AI could replace doctors. Others said it proved AI was now "medical-grade."

But the lead researcher, Dr. Victor Tseng, was considerably more careful than the headlines. He pointed out that the USMLE tests whether someone can recall and apply medical knowledge in a structured text format. It does not test whether someone can listen to a frightened patient, notice that they're leaving out important symptoms, manage uncertainty in real time, or make a judgment call when two treatment options are equally supported by evidence.

ChatGPT had cleared the exam's bar, or come close to it. It had done so by being very good at one thing: reading a description of a medical situation and selecting the most statistically likely answer from multiple options. That is a genuinely impressive skill. It is also not the same skill as practicing medicine. The test measured one thing. The headlines claimed something much larger.

The Difference Between Narrow and General Capability

This story reveals a distinction that almost every AI benchmark obscures: the difference between narrow capability and general capability.

A narrow capability is being good at a specific, well-defined task — like answering multiple-choice medical questions. A general capability is being good at a broad, flexible, real-world version of something — like being a trustworthy doctor.

Every benchmark tests narrow capability. It cannot test anything else, because you can only measure what you can precisely define. "Being a good doctor" is extremely hard to define precisely. "Getting 70% of these questions right" is easy to measure.

The dangerous slide happens in headlines and press releases: a company measures narrow capability, reports it accurately, and then the broader world interprets it as proof of general capability. That gap — between what was measured and what people believe was measured — is where most AI misinformation lives.

Narrow Capability: Being good at one specific, precisely defined task, like answering multiple-choice questions from a fixed set of options.

General Capability: Being flexibly good at a broad, real-world version of a skill, the kind that requires handling surprises, uncertainty, and edge cases. Much harder to measure.

What Benchmarks Are Actually Designed to Test

Good benchmark designers are usually quite aware of what they're measuring and what they're not. The people who built the USMLE didn't claim it could identify a great physician — they built it to create a minimum bar for medical knowledge recall. The problem isn't usually the benchmark itself. It's the chain of interpretation that follows.

Here's how that chain works: Researchers create a benchmark to test a narrow skill → AI company runs their model on it and reports a good score → journalists write "AI masters medicine" → the public forms a belief about AI that the benchmark never supported → policymakers make decisions based on that belief.

Each step in that chain adds a layer of distortion. And by the time a policy gets written or a purchasing decision gets made, the original careful definition from step one has completely evaporated.

What USMLE Measures

Medical knowledge recall. Ability to match symptoms to likely diagnoses using text descriptions. Statistical pattern recognition across clinical scenarios.

What Headlines Claimed

"AI passes medical licensing exam" — implying medical competence, potential to replace physicians, and trustworthiness in clinical settings.

The Winograd Schema: A Benchmark That Was Honest About Its Limits

Not all benchmarks fall into the interpretation trap. One of the more honest examples is the Winograd Schema Challenge, developed by Hector Levesque at the University of Toronto in 2012. It presents sentences where pronouns are ambiguous, and you have to figure out what the pronoun refers to using common sense — not word frequency or grammar tricks.

Example: "The trophy didn't fit in the suitcase because it was too big. What was too big?" Humans immediately say "the trophy." Early AI systems would guess wrong or guess randomly because no grammar rule helps. The challenge's designers were very careful to say: this tests one narrow slice of common-sense reasoning, and nothing more.

By 2019, GPT-2 and similar models were starting to do reasonably well on Winograd. But the designers didn't celebrate by saying AI had "mastered common sense." They pointed out that models might be using statistical patterns in the training data rather than genuine reasoning — and immediately started designing harder versions to probe that distinction.

That kind of intellectual honesty is rare in the benchmark world. When you see it, it's worth noticing. It usually signals researchers who care more about truth than about their model looking good.

How to Read a Benchmark Claim Like a Researcher

Ask: What exactly did this test measure? Is that the same as the headline claim? What did this test explicitly not measure? Did the researchers who designed the benchmark make the same claim the journalists made? If the researchers are more cautious than the headline, trust the researchers.

An Ethical Question Worth Sitting With

Hospitals and insurance companies are now making decisions about whether to use AI tools in clinical settings, partly based on benchmark scores like USMLE performance. Some of these tools are being used to suggest diagnoses, flag unusual lab results, or recommend treatments.

If those tools were adopted because of a score that measured narrow capability — but the decision-makers believed it proved general medical competence — and something goes wrong with a patient, who is responsible? The AI company that accurately reported the score? The journalists who overstated it? The hospital that didn't dig into what the score actually meant? The policymakers who approved the tool?

There's no clean answer. But the fact that you can now trace this chain of responsibility — from benchmark design to headline to policy to patient outcome — puts you in a position most people in that chain are not in. Use it.

Lesson 2 Quiz

Five questions · Think about what benchmarks actually prove

1. When ChatGPT passed the USMLE in January 2023, what did that result actually prove?

Exactly right. Passing the USMLE proved narrow capability — strong performance on structured multiple-choice questions. It said nothing about clinical judgment, patient communication, or handling real-world medical uncertainty.

That's a broader claim than the evidence supports. The USMLE result proved narrow capability: answering structured multiple-choice questions well. That's genuinely impressive but different from general medical competence.

2. What is the main danger in the chain from "benchmark result" to "public belief" to "policy decision"?

Correct. The distortion accumulates at each step: careful benchmark design → company reporting → journalistic interpretation → public belief → policy. By the end, the original precise meaning has evaporated.

The issue is compounding distortion across the chain — not intentional deception or simple number-reading errors. Each step simplifies and slightly over-claims, so the policy that emerges may be based on a belief the original benchmark never supported.

3. What made the Winograd Schema Challenge an unusually honest benchmark?

Right. Intellectual honesty in benchmark design means being explicit about limits and, when AI gets better at the test, updating the test rather than celebrating prematurely. That's what the Winograd Schema Challenge's designers did.

The Winograd Challenge was notable for its intellectual honesty: designers were explicit about what it measured, didn't overclaim when AI improved, and kept raising the bar. That's rare and worth recognizing.

4. A company claims their AI "mastered coding" because it scored 90% on a benchmark of programming puzzles. What is the most important limitation of this claim?

Exactly. Puzzle-solving is a narrow capability. Real software development involves understanding changing requirements, debugging unfamiliar systems, communicating with teammates, and navigating ambiguity — none of which a puzzle benchmark tests.

The key issue is the narrow vs. general capability gap. Puzzles test a well-defined subset of coding. Real-world software development requires handling ambiguity, broken requirements, and collaboration — things no puzzle benchmark captures.

5. According to this lesson, who is usually MORE careful about what a benchmark result means?

Correct. Benchmark designers know their instrument's limits better than anyone. When a researcher is more cautious than the headline — like Dr. Tseng after the USMLE paper — trust the researcher's caution over the headline's enthusiasm.

Benchmark designers are typically the most careful interpreters of their own results. They know what the test was and wasn't designed to measure. Journalists and marketing departments typically oversimplify — so when the researcher sounds cautious, that's a signal worth heeding.

Lab 2: The Headline Investigator

Role: Science journalist fact-checker · Interrogate overclaiming AI headlines

Your Role

You're a science journalist fact-checker at a major publication. Your editor just sent you three AI headlines to verify before publication. Your research partner — an AI assistant — has access to the underlying papers but tends to take things at face value.

Your job is to interrogate your partner: push them to identify the gap between what the benchmark measured and what the headline claims. Challenge every assumption. The partner will respond and push back.

Start with this headline: "AI Achieves Human-Level Reading Comprehension." Ask your partner what the underlying benchmark actually tested — and challenge whether that proves what the headline claims.

Research Partner — AI Assistant

Headline Investigation Mode

Ready when you are. I've pulled the papers behind several recent AI headlines. What are we fact-checking first?

Module 2 · Lesson 3

When the Measure Becomes the Target

What happens when AI companies stop trying to be good and start trying to score well

If everyone knows what's on the test, can the test still tell you anything real?

In 2022, a benchmark called HumanEval — created by OpenAI — became the standard way to measure AI coding ability. It presented 164 programming problems and checked whether the AI's code passed a series of automated tests. Companies started racing to top the HumanEval leaderboard.

By 2023, the researchers behind a project called EvalPlus noticed something troubling. They added more test cases to the same HumanEval problems: harder edge cases that the original benchmark hadn't included. When they ran the same AI models on this extended version, the rankings changed dramatically. Models that had looked dominant suddenly dropped. Some models that ranked lower on the original benchmark performed better on the harder version.

What had happened? AI companies had been optimizing for the original test cases specifically, sometimes deliberately and sometimes just as a side effect of how they trained. The benchmark had become the target. The models got better at passing HumanEval's 164 problems without getting proportionally better at writing code in general. Researchers called this overfitting to the benchmark, a pattern so common that it had been given a name decades before modern AI benchmarks existed.
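To see how extra test cases can change a score, consider a toy version of the idea. The sketch below is not EvalPlus's code and not a real HumanEval problem; the function, both test suites, and the names are invented for illustration.

```python
def median_submission(numbers):
    """A plausible-looking but flawed solution: mishandles even-length lists."""
    ordered = sorted(numbers)
    return ordered[len(ordered) // 2]


# Each test is (input, expected_output). Both suites are invented.
original_tests = [
    ([3, 1, 2], 2),
    ([5], 5),
    ([9, 7, 8, 1, 4], 7),
]
extra_edge_cases = [
    ([1, 2, 3, 4], 2.5),  # even length: median is the mean of the middle two
    ([10, 20], 15),
]


def pass_rate(solution, tests):
    """Fraction of tests where solution(input) equals the expected output."""
    passed = sum(1 for args, expected in tests if solution(args) == expected)
    return passed / len(tests)


print(pass_rate(median_submission, original_tests))
print(pass_rate(median_submission, original_tests + extra_edge_cases))
```

The same code passes every original test and only three of five once the edge cases are added. The code did not get worse; the test got better at noticing what was already wrong.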

Goodhart's Law

In 1975, a British economist named Charles Goodhart was studying how the Bank of England tried to control inflation by targeting specific financial measures. He noticed a pattern that, in the paraphrase most often quoted today, would become one of the best-known observations in economics, policy, and eventually AI:

Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure."

What this means in plain language: the moment you start trying to optimize a score specifically, the score stops telling you what it used to. This happens because there are usually many ways to improve a score. Some of those ways actually improve the underlying skill. Others just game the metric.

If your school starts tracking how many books students check out from the library, and teachers are evaluated based on that number, then students will check out books without reading them. The metric was "library checkouts as a proxy for reading engagement." Once checkouts became the target, they stopped measuring engagement — they just measured checkouts.

In AI, this plays out through benchmark optimization. Once a benchmark becomes prestigious enough that companies want to top it, they start — consciously or not — training their models to do well on that benchmark specifically. The benchmark erodes as a measure of real capability even as scores keep climbing.

Goodhart's Law: When a measure becomes a target, it ceases to be a good measure. Optimizing for the score corrupts the score's ability to measure what it originally measured.

Overfitting to Benchmark: When an AI model gets trained (deliberately or as a side effect) to score well on a specific benchmark without proportionally improving at the broader skill the benchmark was designed to measure.

The Arms Race That Ate the Leaderboard

By 2024, the AI benchmark ecosystem had become a kind of arms race. Researchers at Hugging Face — a platform that hosts AI models and leaderboards — noticed that scores on their Open LLM Leaderboard were inflating faster than actual AI capability was improving. Models were being fine-tuned specifically to score well on benchmark sets, and the community was starting to suspect that the leaderboard reflected "benchmark fitness" more than "real-world usefulness."

In mid-2024, Hugging Face announced it was retiring the original Open LLM Leaderboard v1 and rebuilding it around newer, harder benchmarks that models were less likely to have seen during training. The announcement was blunt: the old benchmark had been "saturated and gamed." They needed to start over.

This kind of benchmark reset is actually a sign of a healthy research community — people who recognized the problem and acted on it. But it also reveals how quickly a benchmark can lose its meaning once it becomes competitive currency. The half-life of a trustworthy benchmark is getting shorter every year.

The Institutional Stakes

Policy decisions about AI regulation, government contracts for AI tools, and standards for what "safe" AI means are all being written right now — and some of them reference specific benchmarks as thresholds. When Goodhart's Law erodes those benchmarks, the policies built on them become meaningless or even counterproductive. This is not a theoretical future problem. It is happening in legislative committees today.

What Researchers Are Doing About It

There is no perfect solution to Goodhart's Law in AI benchmarking, but researchers have developed several partial defenses worth knowing about.

Keep new benchmark data secret. If you don't tell model developers which exact questions will be on the test, they can't optimize for those questions specifically. Some benchmark maintainers hold back private test sets for exactly this reason. The problem is that this makes it harder for independent researchers to audit the benchmark itself.

Constantly update benchmarks. Rotate new questions in, retire questions that have appeared widely online, and track improvement curves rather than point-in-time scores. This is labor-intensive and expensive.

Use multiple diverse benchmarks and look for agreement. If an AI scores high on ten very different benchmarks that test different skills in different ways, that's more credible than scoring high on one. If the scores disagree — high on some, low on others — that's more informative than any single score.

Use real-world task performance, not just benchmark scores. Some researchers are moving toward evaluating AI on tasks drawn from actual deployment contexts — real customer service conversations, real code that got shipped, real patient notes — rather than abstract tests. This is harder to standardize but harder to game.
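The "look for agreement" defense can be made concrete by comparing each model's average score with how much its scores vary across benchmarks. The sketch below is a minimal illustration under invented data: the model names, benchmark names, and scores are all hypothetical.

```python
from statistics import mean, pstdev

# Invented scores (percent correct) for two hypothetical models.
scores = {
    "model_x": {"math": 94, "coding": 51, "reading": 58, "science": 49},
    "model_y": {"math": 74, "coding": 70, "reading": 77, "science": 71},
}


def summarize(benchmark_scores):
    """Return (average, spread); spread is the population standard deviation."""
    values = list(benchmark_scores.values())
    return mean(values), pstdev(values)


for name, results in scores.items():
    average, spread = summarize(results)
    print(f"{name}: average {average:.1f}, spread {spread:.1f}")
```

In this made-up data, model_x has the single best score (94 on math) but a much larger spread, which is a reason to ask whether that one benchmark has become a target. model_y never tops a chart, yet its scores agree with each other and its average is higher.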

The Deeper Insight You Now Have

Knowing Goodhart's Law changes how you read every AI leaderboard from now on. A rising score does not necessarily mean improving capability. It might mean improving optimization for that score. The question isn't "who's at the top?" — it's "has this benchmark become a target yet, and if so, what does the top position actually mean?"

An Ethical Question Worth Sitting With

Some AI researchers argue that benchmark gaming is just a form of cheating — companies are misleading the public and regulators by optimizing for scores rather than real capability. Others argue it's no different from students studying for a test: if you know the format, you prepare for the format, and there's nothing wrong with that.

But there's a third position: maybe the problem isn't the companies at all — it's the structure of an industry where benchmark scores function as currency for investment, talent, and public trust. If the incentive structure makes gaming nearly rational, changing individual behavior won't fix it. You'd have to change the structure.

Who should be responsible for designing and maintaining AI benchmarks — the companies that build AI, the researchers who study it, government regulators, or some independent body? And should benchmark results be legally regulated the same way that drug trial results are? There's no clean answer. Sit with it.

Lesson 3 Quiz

Five questions · Apply Goodhart's Law to new situations

1. What is Goodhart's Law?

Correct. Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Originally from economics in 1975, it now applies powerfully to AI benchmark competition.

Goodhart's Law states: when a measure becomes a target, it ceases to be a good measure. The act of optimizing for a score — rather than the underlying skill — corrupts what the score represents.

2. When EvalPlus researchers added harder test cases to HumanEval and rankings changed dramatically, what did this reveal?

Exactly right. Benchmark overfitting — models that look great on specific test cases but don't generalize — is a direct consequence of Goodhart's Law in action. The original 164 problems had become a target, not a measure.

The result revealed benchmark overfitting: models had gotten good at the original test cases specifically, without proportionally improving at general coding. When new edge cases appeared, the performance gap emerged.

3. A city starts tracking the number of potholes reported per month as a measure of road quality. After two years, the number drops significantly — but roads seem worse. What is the most likely Goodhart's Law explanation?

Exactly. This is Goodhart's Law in municipal policy. Once "reports" became the target metric, workers optimized for fewer reports — not for better roads. The measure stopped measuring road quality and started measuring report suppression.

This is a Goodhart's Law situation. The metric (pothole reports) became the target, so workers optimized for fewer reports — not for better roads. The measure stopped reflecting the underlying quality it was meant to track.

4. Why did Hugging Face shut down its Open LLM Leaderboard v1 in 2024?

Correct. Hugging Face acknowledged directly that the leaderboard had been gamed: Goodhart's Law at industrial scale. They rebuilt it with newer, harder test sets to make gaming more difficult. This kind of honest reset is a sign of a healthy research community.

Hugging Face stated directly that the leaderboard had been "saturated and gamed." Models were being fine-tuned to score well on those specific benchmarks, not to improve at general tasks — a textbook Goodhart's Law scenario.

5. Which strategy BEST reduces the risk of Goodhart's Law corrupting AI benchmark scores?

Right. Private test sets, continuous rotation, and multiple diverse benchmarks are the best partial defenses against Goodhart's Law. None are perfect — but together they make gaming much harder and more obvious when it occurs.

The best partial defense is a combination: keep test data private so models can't optimize for specific questions, rotate questions frequently, and require performance across multiple diverse benchmarks. No single fix fully solves Goodhart's Law.

Lab 3: The Leaderboard Designer

Role: Benchmark architect · Build a benchmark that's hard to game

Your Role

You've been hired by a fictional AI safety organization called Meridian Institute to design a new benchmark for measuring AI writing quality — one that resists gaming and Goodhart's Law. Your collaborator is a senior researcher who will challenge every design decision you make.

Don't just describe a test — defend every design choice against your collaborator's skepticism. Why will this resist gaming? How will you stop companies from optimizing for your specific metrics? What will you actually measure and what won't you measure?

Start by proposing your first key design decision: what will your writing quality benchmark actually test, and why is that the right thing to measure?

Dr. Kovacs — Senior Researcher, Meridian Institute

Benchmark Design Mode

Good. I've seen a lot of benchmark proposals fall apart under scrutiny. Tell me what you'd actually measure — and I'll tell you exactly how someone would game it.

Module 2 · Lesson 4

Reading the Fine Print

How to actually evaluate an AI benchmark claim when you encounter one in the real world

You see a headline: "New AI beats humans at common sense." What do you actually check?

In September 2019, the Allen Institute for Artificial Intelligence, an AI research lab known as AI2, made a striking announcement. Its model, Aristo, had scored 90% on an 8th-grade science exam and 83% on a 12th-grade science exam. The tests were real New York Regents exams: the standardized tests that students across New York State take at the end of middle and high school.

The AI2 researchers were scrupulous about what they claimed. Their press release said Aristo "passed" these exams, not that it understood science. The lead researcher, Peter Clark, explicitly noted that Aristo could not explain its answers, could not draw diagrams, could not perform lab procedures, and had no understanding of why the right answers were right — only that they were statistically associated with correctness in its training data.

And yet, many news outlets reported this as "AI now understands 8th-grade science." Some articles compared Aristo to a science teacher. The gap between what the researchers said and what the headlines claimed was so large that several science educators wrote public letters pointing out the distortion. The researchers' careful language had been discarded entirely by the time the story reached most readers.

A Framework for Evaluating Any Benchmark Claim

Every time you encounter an AI benchmark claim — in a news headline, a company press release, a school district's procurement decision, or a government policy document — you now have the background to apply a simple framework. Here it is.

Question 1: What exactly was tested?

Not the headline version. The actual benchmark. What kind of questions? Multiple choice or open-ended? Text, images, or both? What's explicitly not tested?

Question 2: Is the headline the same as the result?

Compare what the researchers actually claimed to what the headline claims. If the researchers are more cautious, trust them. The gap between the two is the distortion zone.

Question 3: Could the AI have trained on this?

Was this a well-known public benchmark? Were the questions available online before training? If yes, data contamination is likely. If training data is undisclosed, you can't rule it out.

Question 4: Is this benchmark a known target?

Is this one of the benchmarks that major companies compete to top publicly? If yes, Goodhart's Law is in play. High scores on prestigious leaderboards are least trustworthy precisely because everyone is optimizing for them.

You don't need to answer all four questions perfectly — often you won't have enough information to do so. But asking them changes your relationship with the claim. You go from passive receiver to active evaluator. That's the shift that matters.
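Question 3 is the one part of the framework that can sometimes be checked mechanically, if you happen to have access to text the model may have trained on. The sketch below is a minimal illustration, not how any particular lab does it: it measures how many of a benchmark question's word sequences also appear in a piece of scraped text. The function names, the 8-word window, and the sample question and pages are all assumptions made up for this example.

```python
import re

def ngrams(text, n=8):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(question, corpus_text, n=8):
    """Fraction of the question's n-grams that also appear in corpus_text."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0  # question shorter than n words: nothing to compare
    return len(question_grams & ngrams(corpus_text, n)) / len(question_grams)

# Hypothetical data, invented for illustration only.
question = ("Which of the following gases is most abundant in "
            "Earth's atmosphere by volume?")
scraped_page = ("Practice quiz. Which of the following gases is most abundant in "
                "Earth's atmosphere by volume? A) Oxygen B) Nitrogen")
unrelated_page = "The city council voted on Tuesday to repave three streets downtown."

print(overlap_fraction(question, scraped_page))    # high overlap: possible contamination
print(overlap_fraction(question, unrelated_page))  # no overlap
```

A high overlap does not prove the model memorized the question, and a low one does not rule out paraphrased copies. It is a first filter, which is why undisclosed training data leaves the question open.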

What "State of the Art" Actually Means

The phrase "state of the art" — or SOTA — appears constantly in AI research. It means: the best result anyone has published on a specific benchmark at this moment in time. Nothing more and nothing less.

It does not mean: the best possible AI. It does not mean: the best AI for your task. It does not mean: significantly better than the runner-up. It does not mean: better than humans at the underlying real-world skill. It simply means: highest number on a specific scoreboard right now.

SOTA changes constantly. A model that was SOTA in March might not be SOTA in June. And "best on benchmark X" often has no meaningful relationship to "best for use case Y." A model that tops a language benchmark might be mediocre at writing code. A model that leads a coding leaderboard might be poor at nuanced emotional understanding. SOTA is a narrow claim wearing a large costume.

State of the Art (SOTA): The best published result on a specific benchmark at a specific point in time. Not a general claim about AI quality or usefulness. Highly time-sensitive and benchmark-specific.
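To see why "highest number" need not mean "significantly better than the runner-up," it helps to put a rough margin of error around each score. The sketch below uses a standard normal-approximation interval for a proportion. The two models, their scores, and the 200-question benchmark size are hypothetical numbers chosen for illustration.

```python
import math

def accuracy_interval(correct, total, z=1.96):
    """Approximate 95% interval for an accuracy score (normal approximation)."""
    p = correct / total
    standard_error = math.sqrt(p * (1 - p) / total)
    return p - z * standard_error, p + z * standard_error

# Hypothetical leaderboard: two models on a 200-question benchmark.
leader = accuracy_interval(174, 200)     # 87.0% accuracy
runner_up = accuracy_interval(170, 200)  # 85.0% accuracy

# If the runner-up's upper bound exceeds the leader's lower bound,
# the two intervals overlap and the lead may be within noise.
intervals_overlap = runner_up[1] > leader[0]

print(f"leader:    {leader[0]:.3f} to {leader[1]:.3f}")
print(f"runner-up: {runner_up[0]:.3f} to {runner_up[1]:.3f}")
print("intervals overlap:", intervals_overlap)
```

Overlapping intervals are only a rough check, not a formal significance test. Still, on a benchmark this small, a two-point lead is the kind of gap that could plausibly flip if the questions were swapped for a different sample.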

The SOTA Trap in Procurement

When a government agency or school district issues a request for proposals requiring AI tools to demonstrate "state of the art performance," they may inadvertently be requiring a snapshot of one leaderboard at one moment — which could be gamed, contaminated, or irrelevant to their actual use case. This is happening in real procurement documents written right now, by people who don't have the context you now have.

What Trustworthy Evaluation Looks Like

Given everything this module has covered, what does a trustworthy AI evaluation actually look like? Here are the signals worth watching for — the things that indicate someone is taking evaluation seriously rather than using it as marketing.

They publish scores they didn't win. A company that shows you a benchmark where their model came second — or even last — is showing you something real. They're not cherry-picking. That restraint is rare and worth noting.

They distinguish capability from deployment readiness. A research paper that says "this model scores well on X but we have not tested it on real-world deployment in context Y" is being careful about scope. A company that says "our model is ready for clinical use because it passed the USMLE" is not.

They track trends, not point scores. Researchers who say "over the last six months, improvement on this benchmark has slowed, suggesting the model is hitting a ceiling" are telling you something more useful than researchers who say "we achieved a new high score." Trajectories matter more than snapshots.

They welcome independent replication. Trustworthy results can be reproduced by people who didn't conduct the original study. If a company won't share enough details about their model and evaluation setup to allow independent replication, the score is harder to trust — regardless of how impressive it looks.
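The point about trends can be made concrete with a few lines of arithmetic. In the sketch below, two hypothetical models end on the same score, but their histories tell different stories. The score lists, the three-release window, and the one-point threshold are assumptions picked for the example, not an established standard.

```python
def gains(scores):
    """Point change between each pair of consecutive scores."""
    return [round(later - earlier, 1) for earlier, later in zip(scores, scores[1:])]

def is_plateauing(scores, window=3, threshold=1.0):
    """True if each of the last `window` gains is smaller than `threshold` points."""
    recent = gains(scores)[-window:]
    return len(recent) == window and all(gain < threshold for gain in recent)

# Hypothetical benchmark scores across successive releases.
model_a = [60.0, 68.0, 77.0, 85.0]        # steady, large gains
model_b = [83.0, 84.2, 84.6, 84.9, 85.0]  # gains shrinking toward zero

for name, history in (("model_a", model_a), ("model_b", model_b)):
    print(name, "latest:", history[-1], "gains:", gains(history),
          "plateauing:", is_plateauing(history))
```

Both models report 85.0 as their latest score. Only the history shows that one is still climbing while the other has nearly stopped, which is exactly the information a single headline number hides.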

What You Can Do Now That You Couldn't Before

You now have a four-question framework and an understanding of data contamination, Goodhart's Law, and the narrow-vs-general capability gap. That's enough to read any AI benchmark claim more honestly than most of the journalists who cover AI, most of the executives who deploy it, and most of the policymakers who regulate it. This isn't an exaggeration. It's what the gap in public understanding actually looks like right now.

An Ethical Question Worth Sitting With

The Aristo story from 2019 shows researchers doing everything right — they were scrupulous and careful — and still having their work misrepresented. Several science educators wrote letters. The researchers published corrections and clarifications. And the misrepresentation still spread further than the correction.

This raises a question that doesn't have a clean answer: Do AI researchers have a responsibility to predict how their results will be misunderstood and proactively fight the misrepresentation — even if it means being louder and less nuanced than they'd prefer? Or does the responsibility for accurate reporting lie entirely with journalists and policymakers?

And here's the harder version: if a researcher publishes a careful, nuanced result knowing it will likely be misunderstood in ways that affect policy — and they do nothing to prevent that misunderstanding — are they responsible for the policy that results? There's no clean answer. Sit with it.

Lesson 4 Quiz

Five questions · Apply the full framework

1. In the Aristo story from 2019, what was the main gap between what AI2 researchers claimed and what headlines reported?

Correct. AI2's researchers were careful and specific — Aristo passed a test without understanding it. Headlines translated that to claims about AI understanding science, which the researchers never made. The distortion happened at the journalism stage.

The gap was between careful researcher language (passed a multiple-choice test, doesn't understand why answers are correct) and headline language ("AI understands science," compared to teachers). The distortion happened at the reporting stage.

2. What does "state of the art" (SOTA) actually mean in AI benchmarking?

Exactly. SOTA is a narrow, time-sensitive, benchmark-specific claim. It doesn't generalize to other tasks, doesn't imply deployment readiness, and doesn't mean better than humans at the underlying real-world skill.

SOTA means: the best published result on a specific benchmark right now. It's a narrow, time-sensitive claim. It says nothing about deployment readiness, how the model compares with humans at the real-world skill, or government verification.

3. A company says their AI model is "state of the art at customer service" because it topped a customer service benchmark leaderboard. Applying the four-question framework, which question is MOST important to ask first?

Right. Question 1 of the framework: what exactly was tested? Customer service is broad and nuanced. A benchmark of scripted, predefined exchanges is a narrow proxy. You need to understand the gap between what the benchmark tested and what real customer service requires.

The most important first question from the framework is: what exactly was tested? Understanding the specific nature of the benchmark — whether it used real conversations or narrow scripted scenarios — determines whether "state of the art" means anything relevant to real deployment.

4. Which of the following is a signal that an AI evaluation is being taken seriously rather than used as marketing?

Correct. Publishing non-winning scores and distinguishing benchmark performance from deployment readiness are both signs of intellectual honesty. They're also rare — which is exactly why they stand out as trustworthy signals.

Honest evaluation is signaled by publishing results the company didn't win, and by clearly distinguishing what the benchmark tests from what real-world deployment requires. Highlighting only first-place finishes and using "state of the art" language are marketing moves, not evaluation signals.

5. Why do researchers say that "tracking improvement trends over time" is more informative than "point-in-time scores"?

Exactly. Trends tell you whether a benchmark is still measuring real growth or whether scores are plateauing (saturation) or inflating (gaming). A model that moved from 85% to 86% after a year of development is telling a very different story than one that moved from 60% to 85%.

Trends are more informative because they reveal the rate and direction of change — whether a benchmark is saturating, whether improvement is real or slowing, and whether gaming might be inflating scores. A single number in isolation tells you very little about any of that.

Lab 4: The Policy Advisor

Role: AI policy advisor · Brief a government committee on what benchmark scores actually mean

Your Role

You're briefing a fictional government committee — the National AI Oversight Board — that is drafting rules requiring AI companies to demonstrate "state of the art performance" before their tools can be used in schools. Your interlocutor is the committee's senior analyst, who knows a lot about technology but hasn't yet understood the problems with benchmark-based standards.

Your job is to explain — clearly and specifically — why using benchmark scores as a legal threshold is more complicated than it sounds, and to propose what a better approach might look like. The analyst will push back with practical concerns and budget constraints.

Start by explaining to the analyst why requiring "state of the art performance on standardized benchmarks" might not actually guarantee good AI tools for students.

Senior Analyst — National AI Oversight Board

Policy Briefing Mode

Thanks for coming in. Our current draft says any AI tool used in schools must demonstrate state-of-the-art performance on at least three standardized benchmarks. That sounds like a rigorous standard to me — what's the problem with it?

Module 2 Test

15 questions · Score 80% or higher to pass · Benchmarks, measurement, and evaluation

1. In August 2023, Berkeley researchers found that MMLU — a major AI benchmark — had which two problems?

Correct. MMLU had both flawed questions (some with incorrect answer keys) and data contamination — making high AI scores hard to interpret as genuine reasoning ability.

The two problems were: some MMLU questions had incorrect answers in the answer key, and AI models had likely trained on MMLU materials available online. Both undermine the validity of high benchmark scores.

2. What is "data contamination" in the context of AI benchmarking?

Correct. Data contamination means the AI may have "seen" benchmark questions during training — making high scores reflect memorization rather than reasoning.

Data contamination is when training data includes benchmark questions or answers. The AI may have memorized them rather than reasoning through them, making high benchmark scores unreliable evidence of genuine capability.

3. Why is it significant that AI companies can choose which benchmark results to publish?

Correct. Selective reporting — publishing only favorable benchmark scores — is legal, common, and creates a systematically distorted picture of model capability for anyone relying on those scores.

Selective reporting means companies show you the benchmarks where they look good and hide the ones where they don't. This is legal and common — and it means published benchmark scores are a curated highlight reel, not a complete picture.

4. When ChatGPT passed the USMLE (medical licensing exam) in January 2023, what did the lead researcher actually claim?

Correct. Dr. Tseng was careful: the USMLE result proved narrow capability — answering multiple-choice questions — not general medical competence. That distinction was lost in most headlines.

The researcher was careful: ChatGPT did well on structured multiple-choice questions, but couldn't explain answers, perform procedures, or handle real clinical judgment. That's narrow capability — not medical competence.

5. What is the difference between narrow capability and general capability?

Correct. Benchmarks measure narrow capability — a specific, defined task. Real-world usefulness typically requires general capability — flexible performance across varied, uncertain, real situations. The gap between these two is where most AI overclaiming lives.

Narrow capability is doing well at one specific, precisely defined task. General capability means handling the broad, messy, real-world version of that skill — including surprises, ambiguity, and edge cases that no benchmark fully captures.

6. What is Goodhart's Law?

Correct. Goodhart's Law (1975): when a measure becomes a target, it ceases to be a good measure. This applies powerfully to AI benchmarks where companies compete for leaderboard positions.

Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Once companies optimize specifically for a benchmark score, the score stops reflecting the underlying capability it was designed to measure.

7. When EvalPlus researchers added harder test cases to HumanEval, AI model rankings changed dramatically. What does this reveal?

Correct. Overfitting to the benchmark (models improving on the specific test cases without improving proportionally at general coding) is Goodhart's Law in action in AI development.

Models had been optimizing for the specific 164 HumanEval problems, not for general coding ability. When new edge cases appeared, the difference in real coding skill became visible. This is benchmark overfitting — a direct consequence of Goodhart's Law.

8. Why did Hugging Face shut down and rebuild its Open LLM Leaderboard in 2024?

Correct. Hugging Face acknowledged the leaderboard had been "saturated and gamed." They rebuilt it with private, undisclosed test sets to make gaming harder. This kind of honest reset is a sign of research integrity.

Hugging Face said directly that the leaderboard had been saturated and gamed — Goodhart's Law at industrial scale. Models were being fine-tuned to score well on those specific benchmarks rather than to improve at real tasks.

9. According to the four-question framework from Lesson 4, which question asks you to consider whether Goodhart's Law might already be eroding a benchmark's meaning?

Correct. Question 4 asks whether the benchmark has become a competitive target — because if it has, Goodhart's Law is almost certainly in play, and high scores on it are least trustworthy precisely because of that competition.

Question 4 of the framework addresses Goodhart's Law directly: "Is this benchmark a known target?" The more prestigious and competitive a benchmark leaderboard is, the more likely companies have optimized specifically for it — eroding its meaning.

10. What does "state of the art" (SOTA) mean in AI research?

Correct. SOTA is narrow, time-sensitive, and benchmark-specific. It does not mean best at the real-world task, best for your use case, significantly better than the runner-up, or certified safe for deployment.

SOTA means: best published result on a specific benchmark right now. Nothing more. Not deployment-ready, not better than humans at the real skill, not a general achievement. A narrow leaderboard snapshot.

11. In the 2019 Aristo story, what did AI2 researchers do right that most companies don't?

Correct. AI2's researchers were scrupulous — explicitly naming what Aristo couldn't do. That's intellectual honesty in action. The distortion happened later, at the journalism stage, despite their carefulness.

AI2's researchers explicitly stated Aristo's limitations — couldn't explain answers, couldn't perform procedures, didn't understand why answers were correct. That careful scoping is what good researchers do. The headlines distorted the story despite it.

12. Which of the following would be the most trustworthy signal that an AI evaluation is honest, rather than marketing?

Correct. Publishing non-winning scores and distinguishing test performance from deployment readiness are both rare and honest signals. They suggest a company values accuracy over appearance — which is meaningful precisely because it's uncommon.

Honest evaluation signals: publishing results where you didn't win, and clearly stating the gap between benchmark performance and real-world deployment readiness. Using "state of the art" language and high scores across many benchmarks are not reliable honesty signals.

13. A school district is evaluating an AI tutoring tool. The vendor says their model is "state of the art at reading comprehension" based on three benchmark scores. What should the district ask that most procurement processes skip?

Correct. A rigorous procurement process applies all four framework questions: what was tested, whether the claim matches the result, whether the model could have trained on the test, and whether Goodhart's Law is in play. Most procurement processes skip all four.

The district should apply the full four-question framework: what exactly was tested, does the SOTA claim match the benchmark result, could the model have trained on this data, and is this a known competitive leaderboard subject to gaming? Those are the questions most procurement processes miss.

14. Why do researchers say tracking improvement trends over time is more useful than point-in-time scores?

Correct. A model jumping from 60% to 85% tells a different story than one moving from 91% to 92% after a year of work. Trends reveal real progress vs. saturation, and can signal when scoring gains are coming from gaming rather than genuine improvement.

Trends reveal the rate and pattern of change — whether real progress is happening, whether a benchmark is becoming saturated, whether gains are slowing. A single score tells you none of that. Comparing a model's trajectory to others is far more informative than comparing snapshots.

15. A researcher publishes a careful, limited result about AI performance — explicitly stating what it does and doesn't prove. Headlines then misrepresent the result widely and policy is made based on the misrepresentation. According to this module's ethical framing, which position is MOST defensible?

Correct. The module frames this as a chain-of-responsibility problem with structural causes — not purely individual failures. Each actor in the chain (researcher, journalist, policymaker) contributes, and the incentive structures that make this chain function as it does deserve examination alongside individual choices.

The module's ethical framing distributes responsibility across the chain rather than assigning it to one actor. It also points to structural incentives — not just individual choices — as part of the explanation. Blaming any single actor misses the systemic nature of the problem.