In August 2023, researchers at UC Berkeley published a short, alarming paper. They had noticed something strange about GPT-4 and other top AI models: the models were scoring extraordinarily high on a benchmark called MMLU (Massive Multitask Language Understanding), a test covering 57 subjects from college physics to moral philosophy.
But when the researchers looked more carefully, they found that many of the MMLU questions had the wrong answers listed as correct. The test had been written by humans and never carefully proofread. Some questions were ambiguous. Some had two defensible correct answers. And the AI models, trained on enormous amounts of internet text that included MMLU practice materials, had learned to give the answers that matched the answer key, not the answers that were actually right.
Aarohi Narayanan, a graduate student on the team, described it bluntly: "We found cases where the model answered incorrectly but would have been correct according to any reasonable interpretation, and cases where it answered correctly but for clearly the wrong reason." The models had gotten good at the test, not at the thing the test was supposed to measure.
A benchmark is simply a test: a standardized set of questions or tasks used to measure how well an AI performs at something. Just like your school uses standardized tests to see how students are doing in math or reading, AI researchers use benchmarks to compare different AI systems against each other.
The idea sounds reasonable. You want to know which AI is better at understanding language? Give them all the same test and see who scores higher. Simple, right?
Here's the problem: a test only measures what it was designed to measure, and only if it was designed well. When the test has errors, when the AI has seen similar questions before, or when the skill being tested isn't the skill you actually care about, the score stops meaning anything real.
Here's something worth knowing: AI companies choose which benchmarks to publish. When OpenAI released GPT-4 in March 2023, its technical report listed scores on dozens of tests, and the model did remarkably well on most of them. The report included scores on the bar exam (the test lawyers take), the SAT, the GRE, and many more.
Those scores traveled fast. News headlines read things like "GPT-4 scores in the top 10% of bar exam takers." But those human exams were not designed to test AI. They were designed to test whether a lawyer or doctor has absorbed certain knowledge and can reason under time pressure. An AI doesn't experience time pressure. An AI may have been trained on test prep materials. And an AI doesn't have to recall things from memory the way a person does; it processes the whole question at once.
When you see a company announce that its model "achieved state-of-the-art performance" on a benchmark, it almost never means the AI is smarter than what came before. It usually means the AI did better on that specific test. Whether that test actually reflects the thing you care about is a separate question entirely, and one most headlines skip over.
AI labs run their models on many benchmarks during development. They are not required to publish all the results. The ones you see in press releases are the ones that look good. This is called cherry-picking, and it's completely legal and extremely common. Knowing this changes how you should read every "breakthrough" headline.
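A small simulation makes the effect of selective reporting concrete. The sketch below assumes a hypothetical model whose true accuracy is the same (70%) on every benchmark, so that scores differ only by sampling noise. The number of benchmarks, the number of questions, and the random seed are illustrative choices, not figures from any real lab.

```python
import random

def simulate_scores(true_accuracy, n_benchmarks, n_questions, seed=0):
    """Score one hypothetical model on several equally hard benchmarks."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    scores = []
    for _ in range(n_benchmarks):
        # Each question is answered correctly with probability true_accuracy.
        correct = sum(rng.random() < true_accuracy for _ in range(n_questions))
        scores.append(correct / n_questions)
    return scores

scores = simulate_scores(true_accuracy=0.70, n_benchmarks=20, n_questions=200)

# What an honest summary would report: the average over everything that was run.
average = sum(scores) / len(scores)

# What a cherry-picked press release would report: only the three best results.
top_three = sorted(scores, reverse=True)[:3]
headline = sum(top_three) / len(top_three)

print(f"average over all 20 benchmarks:   {average:.3f}")
print(f"average of the 3 best benchmarks: {headline:.3f}")
```

The model never changes between the two lines of output. Only the choice of which results to show does, and the cherry-picked figure can never come out lower than the full average.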
There's another problem on top of cherry-picking: data contamination. This is what the Berkeley researchers were really uncovering. AI language models are trained on vast amounts of text scraped from the internet. That internet includes websites where people post MMLU questions and answers, blog posts analyzing SAT problems, and Reddit threads discussing bar exam strategies.
So when you test the AI on those same problems, you're not just testing whether it can reason; you're partially testing whether it saw those questions during training. It's like studying by memorizing the answer key, then being graded on the same test. You might get an A without understanding anything.
Researchers at Stanford, Princeton, and other universities have tried to measure how severe this contamination is. The honest answer is: we don't fully know, because AI companies don't always disclose exactly what data their models were trained on. Some companies have been more transparent than others, but none have been fully transparent.
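One simple family of contamination checks looks for long word sequences that a benchmark question shares with a training document. The sketch below is a minimal version of that idea, assuming you actually have access to the training text, which is exactly what outsiders usually lack. The function names, the window length of 8 words, and both sample texts are hypothetical; real checks are considerably more elaborate.

```python
import re

def ngrams(text, n):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(question, corpus_text, n=8):
    """Fraction of the question's n-grams that also appear in corpus_text."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0  # question shorter than n words: nothing to compare
    shared = question_grams & ngrams(corpus_text, n)
    return len(shared) / len(question_grams)

# Hypothetical benchmark question and two hypothetical scraped documents.
question = ("Which of the following best describes the role of "
            "mitochondria in a eukaryotic cell?")
leaked_page = ("Practice quiz, question 12: Which of the following best "
               "describes the role of mitochondria in a eukaryotic cell? "
               "Answer: B")
unrelated_page = "Mitochondria are organelles that produce most of a cell's energy."

print(overlap_fraction(question, leaked_page))     # question copied verbatim
print(overlap_fraction(question, unrelated_page))  # same topic, no shared wording
```

A high overlap suggests the model may have seen the question itself, while a low overlap only shows that this particular document did not contain it. Paraphrased or translated copies would slip past a check this simple.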
This creates a situation where the people designing the tests, the people building the AI, and the people who fund the AI companies are often not the same as the people who need to trust the AI's scores. You, the person eventually using these tools, are at the end of a very long chain of decisions you weren't part of.
When you read that an AI scored 90% on some benchmark, you now know to ask: Was that benchmark well-designed? Did the AI train on similar data? Did the company cherry-pick this score? Most journalists don't ask these questions. Most adults don't either. You do now.
Here's something that doesn't have a clean answer: AI companies have a financial incentive to report high benchmark scores. Higher scores attract investors, generate press coverage, and push people to use their products. The benchmarks themselves are often created by academic researchers with no financial stake, but those researchers have a different incentive: they want their benchmark to be used widely, which means it helps them if major AI companies adopt it and report scores on it.
So you have a situation where the people creating the measuring stick benefit when the measuring stick gets used, and the people being measured benefit when the scores look good. The person with no say in any of this is you, the eventual user who relies on those scores to decide which AI to trust.
Should AI companies be required to report all benchmark scores, not just the favorable ones? Who should get to decide which benchmarks matter: the companies, the researchers, or the public? There's no obvious answer. Sit with it.
You're an independent benchmark auditor. A fictional AI company called Axiom Labs has just announced that their new model, Axiom-7, scored 96% on MMLU and passed the bar exam in the top 5% of takers. The company's AI spokesperson is very proud of these results.
Your job is to challenge those claims. Ask hard questions about data contamination, selective reporting, benchmark design, and what the scores actually prove. The spokesperson will defend Axiom-7; push back with what you've learned.
In early 2023, a team of researchers published a paper in PLOS Digital Health reporting that ChatGPT had passed all three parts of the United States Medical Licensing Examination (the USMLE) without any special medical training. The headlines were immediate and electric. "AI passes medical licensing exam." Some articles suggested this meant AI could replace doctors. Others said it proved AI was now "medical-grade."
But the lead researcher, Dr. Victor Tseng, was considerably more careful than the headlines. He pointed out that the USMLE tests whether someone can recall and apply medical knowledge in a structured text format. It does not test whether someone can listen to a frightened patient, notice that they're leaving out important symptoms, manage uncertainty in real time, or make a judgment call when two treatment options are equally supported by evidence.
ChatGPT had passed the exam. It had done so by being very good at one thing: reading a description of a medical situation and selecting the most statistically likely answer from multiple options. That is a genuinely impressive skill. It is also not the same skill as practicing medicine. The test measured one thing. The headlines claimed something much larger.
This story reveals a distinction that almost every AI benchmark obscures: the difference between narrow capability and general capability.
A narrow capability is being good at a specific, well-defined task, like answering multiple-choice medical questions. A general capability is being good at a broad, flexible, real-world version of something, like being a trustworthy doctor.
Every benchmark tests narrow capability. It cannot test anything else, because you can only measure what you can precisely define. "Being a good doctor" is extremely hard to define precisely. "Getting 70% of these questions right" is easy to measure.
The dangerous slide happens in headlines and press releases: a company measures narrow capability, reports it accurately, and then the broader world interprets it as proof of general capability. That gap, between what was measured and what people believe was measured, is where most AI misinformation lives.
Good benchmark designers are usually quite aware of what they're measuring and what they're not. The people who built the USMLE didn't claim it could identify a great physician; they built it to create a minimum bar for medical knowledge recall. The problem isn't usually the benchmark itself. It's the chain of interpretation that follows.
Here's how that chain works: (1) researchers create a benchmark to test a narrow skill; (2) an AI company runs its model on it and reports a good score; (3) journalists write "AI masters medicine"; (4) the public forms a belief about AI that the benchmark never supported; (5) policymakers make decisions based on that belief.
Each step in that chain adds a layer of distortion. And by the time a policy gets written or a purchasing decision gets made, the original careful definition from step one has completely evaporated.
What the test measured: medical knowledge recall, the ability to match symptoms to likely diagnoses using text descriptions, and statistical pattern recognition across clinical scenarios.
What the headline claimed: "AI passes medical licensing exam," implying medical competence, potential to replace physicians, and trustworthiness in clinical settings.
Not all benchmarks fall into the interpretation trap. One of the more honest examples is the Winograd Schema Challenge, developed by Hector Levesque at the University of Toronto in 2012. It presents sentences where pronouns are ambiguous, and you have to figure out what the pronoun refers to using common sense, not word frequency or grammar tricks.
Example: "The trophy didn't fit in the suitcase because it was too big. What was too big?" Humans immediately say "the trophy." Early AI systems would guess wrong or guess randomly because no grammar rule helps. The challenge's designers were very careful to say: this tests one narrow slice of common-sense reasoning, and nothing more.
By 2019, GPT-2 and similar models were starting to do reasonably well on Winograd schemas. But the designers didn't celebrate by saying AI had "mastered common sense." They pointed out that models might be using statistical patterns in the training data rather than genuine reasoning, and immediately started designing harder versions to probe that distinction.
That kind of intellectual honesty is rare in the benchmark world. When you see it, it's worth noticing. It usually signals researchers who care more about truth than about their model looking good.
Ask: What exactly did this test measure? Is that the same as the headline claim? What did this test explicitly not measure? Did the researchers who designed the benchmark make the same claim the journalists made? If the researchers are more cautious than the headline, trust the researchers.
Hospitals and insurance companies are now making decisions about whether to use AI tools in clinical settings, partly based on benchmark scores like USMLE performance. Some of these tools are being used to suggest diagnoses, flag unusual lab results, or recommend treatments.
If those tools were adopted because of a score that measured narrow capability, but the decision-makers believed it proved general medical competence, and something goes wrong with a patient, who is responsible? The AI company that accurately reported the score? The journalists who overstated it? The hospital that didn't dig into what the score actually meant? The policymakers who approved the tool?
There's no clean answer. But the fact that you can now trace this chain of responsibility, from benchmark design to headline to policy to patient outcome, puts you in a position most people in that chain are not in. Use it.
You're a science journalist fact-checker at a major publication. Your editor just sent you three AI headlines to verify before publication. Your research partner, an AI assistant, has access to the underlying papers but tends to take things at face value.
Your job is to interrogate your partner: push them to identify the gap between what the benchmark measured and what the headline claims. Challenge every assumption. The partner will respond and push back.
In 2022, a benchmark called HumanEval, created by OpenAI, became the standard way to measure AI coding ability. It presented 164 programming problems and checked whether the AI's code passed a series of automated tests. Companies started racing to top the HumanEval leaderboard.
By 2023, the researchers behind EvalPlus noticed something troubling. They added more test cases to the same HumanEval problems, including harder edge cases that the original benchmark hadn't covered. When they ran the same AI models on this extended version, the rankings changed dramatically. Models that had looked dominant suddenly dropped. Some models that ranked lower on the original benchmark performed better on the harder version.
What had happened? AI companies had been optimizing for the original test cases specifically, sometimes deliberately and sometimes just as a side effect of how they trained. The benchmark had become the target. The models got better at passing HumanEval's 164 problems without getting proportionally better at writing code in general. Researchers called this overfitting to the benchmark. The underlying pattern is so systematic that it had its own name decades before anyone built an AI leaderboard.
In 1975, a British economist named Charles Goodhart was studying how the Bank of England tried to control inflation by targeting specific financial measures. He noticed something that would become one of the most quoted observations in economics, policy, and eventually AI:
"When a measure becomes a target, it ceases to be a good measure."
What this means in plain language: the moment you start trying to optimize a score specifically, the score stops telling you what it used to. This happens because there are usually many ways to improve a score. Some of those ways actually improve the underlying skill. Others just game the metric.
If your school starts tracking how many books students check out from the library, and teachers are evaluated based on that number, then teachers will push checkouts and students will check out books without reading them. The metric was "library checkouts as a proxy for reading engagement." Once checkouts became the target, they stopped measuring engagement; they just measured checkouts.
In AI, this plays out through benchmark optimization. Once a benchmark becomes prestigious enough that companies want to top it, they start, consciously or not, training their models to do well on that benchmark specifically. The benchmark erodes as a measure of real capability even as scores keep climbing.
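The gap between "passes the original tests" and "is actually correct" is easy to reproduce on a small scale. The sketch below is a toy illustration, not a problem taken from HumanEval or EvalPlus: `median_candidate` stands in for a hypothetical model-written solution, and both test lists are invented for the example.

```python
def median_candidate(values):
    """Hypothetical model-written solution: sort, then take the middle item.

    This is only correct when the list has an odd number of items.
    """
    ordered = sorted(values)
    return ordered[len(ordered) // 2]

def pass_rate(func, tests):
    """Fraction of (input, expected_output) pairs that func gets right."""
    passed = 0
    for argument, expected in tests:
        try:
            if func(argument) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(tests)

# The "original benchmark": every case happens to have an odd length.
original_tests = [([3, 1, 2], 2), ([5], 5), ([9, 7, 8, 1, 4], 7)]

# Extra edge cases of the kind a more thorough test suite would add.
extra_tests = [([1, 2, 3, 4], 2.5), ([10, 20], 15), ([-1, 1], 0)]

print(pass_rate(median_candidate, original_tests))
print(pass_rate(median_candidate, original_tests + extra_tests))
```

The same code earns a perfect score on the first suite and fails every added case on the second. Nothing about the solution changed; only the measuring stick did.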
By 2024, the AI benchmark ecosystem had become a kind of arms race. Researchers at Hugging Face, a platform that hosts AI models and leaderboards, noticed that scores on their Open LLM Leaderboard were inflating faster than actual AI capability was improving. Models were being fine-tuned specifically to score well on benchmark sets, and the community was starting to suspect that the leaderboard reflected "benchmark fitness" more than "real-world usefulness."
In mid-2024, Hugging Face announced it was retiring the original Open LLM Leaderboard v1 and rebuilding it around newer, harder benchmarks that were less saturated and less likely to have leaked into training data. The announcement was blunt: the old benchmark had been "saturated and gamed." They needed to start over.
This kind of benchmark reset is actually a sign of a healthy research community: people recognized the problem and acted on it. But it also reveals how quickly a benchmark can lose its meaning once it becomes competitive currency. The half-life of a trustworthy benchmark is getting shorter every year.
Policy decisions about AI regulation, government contracts for AI tools, and standards for what "safe" AI means are all being written right now, and some of them reference specific benchmarks as thresholds. When Goodhart's Law erodes those benchmarks, the policies built on them become meaningless or even counterproductive. This is not a theoretical future problem. It is happening in legislative committees today.
There is no perfect solution to Goodhart's Law in AI benchmarking, but researchers have developed several partial defenses worth knowing about.
Keep new benchmark data secret. If you don't tell model developers which exact questions will be on the test, they can't optimize for those questions specifically. Some evaluation efforts hold back private test sets for exactly this reason. The problem is that this makes it harder for independent researchers to audit the benchmark itself.
Constantly update benchmarks. Rotate new questions in, retire questions that have appeared widely online, and track improvement curves rather than point-in-time scores. This is labor-intensive and expensive.
Use multiple diverse benchmarks and look for agreement. If an AI scores high on ten very different benchmarks that test different skills in different ways, that's more credible than scoring high on one. If the scores disagree (high on some, low on others), that's more informative than any single score.
Use real-world task performance, not just benchmark scores. Some researchers are moving toward evaluating AI on tasks drawn from actual deployment contexts (real customer service conversations, real code that got shipped, real patient notes) rather than abstract tests. This is harder to standardize but harder to game.
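The "multiple diverse benchmarks" defense above can be sketched as a few lines of arithmetic: instead of reporting one number, report the average, the spread, and the weakest area. The benchmark names, the scores, and the 0.15 disagreement threshold below are all hypothetical values chosen for illustration.

```python
from statistics import mean

def summarize(scores_by_benchmark, disagreement_threshold=0.15):
    """Summarize one model's scores (each between 0 and 1) across benchmarks."""
    values = list(scores_by_benchmark.values())
    spread = max(values) - min(values)
    return {
        "mean": mean(values),
        "spread": spread,
        # The benchmark with the lowest score is often the most informative one.
        "weakest": min(scores_by_benchmark, key=scores_by_benchmark.get),
        # A large spread means the benchmarks are telling different stories.
        "benchmarks_disagree": spread > disagreement_threshold,
    }

# Two hypothetical models with similar headline potential.
consistent_model = {"reading": 0.81, "math": 0.78, "coding": 0.80}
lopsided_model = {"reading": 0.95, "math": 0.52, "coding": 0.60}

print(summarize(consistent_model))
print(summarize(lopsided_model))
```

A press release for the lopsided model could truthfully quote the 0.95 reading score. The summary makes the weak math result just as visible, which is the point of looking at agreement rather than a single best number.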
Knowing Goodhart's Law changes how you read every AI leaderboard from now on. A rising score does not necessarily mean improving capability. It might mean improving optimization for that score. The question isn't "who's at the top?" but "has this benchmark become a target yet, and if so, what does the top position actually mean?"
Some AI researchers argue that benchmark gaming is just a form of cheating: companies are misleading the public and regulators by optimizing for scores rather than real capability. Others argue it's no different from students studying for a test: if you know the format, you prepare for the format, and there's nothing wrong with that.
But there's a third position: maybe the problem isn't the companies at all. Maybe it's the structure of an industry where benchmark scores function as currency for investment, talent, and public trust. If the incentive structure makes gaming nearly rational, changing individual behavior won't fix it. You'd have to change the structure.
Who should be responsible for designing and maintaining AI benchmarks: the companies that build AI, the researchers who study it, government regulators, or some independent body? And should benchmark results be legally regulated the same way that drug trial results are? There's no clean answer. Sit with it.
You've been hired by a fictional AI safety organization called Meridian Institute to design a new benchmark for measuring AI writing quality, one that resists gaming and Goodhart's Law. Your collaborator is a senior researcher who will challenge every design decision you make.
Don't just describe a test. Defend every design choice against your collaborator's skepticism. Why will this resist gaming? How will you stop companies from optimizing for your specific metrics? What will you actually measure and what won't you measure?
In September 2019, the Allen Institute for Artificial Intelligence, a research lab known as AI2, made a striking announcement. Their model, Aristo, had scored 90% on an 8th-grade science exam and 83% on a 12th-grade science exam. The tests were real New York Regents exams, the standardized tests that students across New York State take at the end of middle and high school.
The AI2 researchers were scrupulous about what they claimed. Their press release said Aristo "passed" these exams, not that it understood science. The lead researcher, Peter Clark, explicitly noted that Aristo could not explain its answers, could not draw diagrams, could not perform lab procedures, and had no understanding of why the right answers were right, only that they were statistically associated with correctness in its training data.
And yet, many news outlets reported this as "AI now understands 8th-grade science." Some articles compared Aristo to a science teacher. The gap between what the researchers said and what the headlines claimed was so large that several science educators wrote public letters pointing out the distortion. The researchers' careful language had been discarded entirely by the time the story reached most readers.
Every time you encounter an AI benchmark claim (in a news headline, a company press release, a school district's procurement decision, or a government policy document), you now have the background to apply a simple framework. Here it is.
Question 1: What exactly was tested? Not the headline version. The actual benchmark. What kind of questions? Multiple choice or open-ended? Text, images, or both? What's explicitly not tested?
Question 2: Who is making the claim? Compare what the researchers actually claimed to what the headline claims. If the researchers are more cautious, trust them. The gap between the two is the distortion zone.
Question 3: Could the model have seen the test? Was this a well-known public benchmark? Were the questions available online before training? If yes, data contamination is likely. If training data is undisclosed, you can't rule it out.
Question 4: Has the benchmark become a target? Is this one of the benchmarks that major companies compete to top publicly? If yes, Goodhart's Law is in play. High scores on prestigious leaderboards are least trustworthy precisely because everyone is optimizing for them.
You don't need to answer all four questions perfectly; often you won't have enough information to do so. But asking them changes your relationship with the claim. You go from passive receiver to active evaluator. That's the shift that matters.
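For readers who like to see a checklist written down, the four questions above can be captured as a small record plus a function that lists the open concerns. This is only one possible encoding: the class name, field names, and wording of the concerns are invented for this sketch, and the example claim reuses the fictional Axiom-7 model from earlier in the module.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BenchmarkClaim:
    headline: str
    test_contents_known: bool           # Q1: do we know what was actually tested?
    researchers_made_same_claim: bool   # Q2: does the headline match the researchers?
    questions_public_before_training: Optional[bool]  # Q3: None means undisclosed
    is_competitive_leaderboard: bool    # Q4: do companies race to top it?

def open_concerns(claim):
    """Return the reasons to stay cautious about a benchmark claim."""
    concerns = []
    if not claim.test_contents_known:
        concerns.append("unclear what the test actually measured")
    if not claim.researchers_made_same_claim:
        concerns.append("headline goes beyond what the researchers claimed")
    if claim.questions_public_before_training is None:
        concerns.append("contamination cannot be ruled out (training data undisclosed)")
    elif claim.questions_public_before_training:
        concerns.append("contamination is likely (questions were public)")
    if claim.is_competitive_leaderboard:
        concerns.append("benchmark may have become a target (Goodhart's Law)")
    return concerns

claim = BenchmarkClaim(
    headline="Axiom-7 scores 96% on MMLU",
    test_contents_known=True,
    researchers_made_same_claim=True,
    questions_public_before_training=None,  # the company has not said
    is_competitive_leaderboard=True,
)

for concern in open_concerns(claim):
    print("-", concern)
```

Using `None` for the third question reflects the point made above: when training data is undisclosed, the honest answer is "unknown," and unknown is itself a concern.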
The phrase "state of the art" (or SOTA) appears constantly in AI research. It means: the best result anyone has published on a specific benchmark at this moment in time. Nothing more and nothing less.
It does not mean: the best possible AI. It does not mean: the best AI for your task. It does not mean: significantly better than the runner-up. It does not mean: better than humans at the underlying real-world skill. It simply means: highest number on a specific scoreboard right now.
SOTA changes constantly. A model that was SOTA in March might not be SOTA in June. And "best on benchmark X" often has no meaningful relationship to "best for use case Y." A model that tops a language benchmark might be mediocre at writing code. A model that leads a coding leaderboard might be poor at nuanced emotional understanding. SOTA is a narrow claim wearing a large costume.
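The point that SOTA does not mean "significantly better than the runner-up" can be checked with a rough calculation. The sketch below treats each benchmark question as an independent pass/fail trial and uses the normal approximation to estimate a 95% margin of error. That is a simplifying assumption, and comparing overlapping intervals is a quick heuristic rather than a formal statistical test. The 164-question size matches HumanEval as described earlier; the two accuracies are hypothetical.

```python
import math

def margin_of_error(accuracy, n_questions, z=1.96):
    """Approximate 95% margin of error for a score on n_questions items."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n_questions)

def intervals_overlap(acc_a, acc_b, n_questions):
    """True if the two scores' approximate 95% intervals overlap."""
    margin_a = margin_of_error(acc_a, n_questions)
    margin_b = margin_of_error(acc_b, n_questions)
    return (acc_a - margin_a) <= (acc_b + margin_b) and \
           (acc_b - margin_b) <= (acc_a + margin_a)

# Hypothetical leaderboard: the "SOTA" model at 82%, the runner-up at 80%.
small_benchmark = 164     # same size as HumanEval
large_benchmark = 20_000  # hypothetical much larger test

print(f"margin on 164 questions: +/- {margin_of_error(0.80, small_benchmark):.3f}")
print("overlap on 164 questions:   ", intervals_overlap(0.82, 0.80, small_benchmark))
print("overlap on 20,000 questions:", intervals_overlap(0.82, 0.80, large_benchmark))
```

On a 164-question test, a two-point lead sits well inside the noise, so "new state of the art" may describe luck as much as progress. The same two-point lead would be far more convincing on a much larger test.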
When a government agency or school district issues a request for proposals requiring AI tools to demonstrate "state of the art performance," they may inadvertently be requiring a snapshot of one leaderboard at one moment, which could be gamed, contaminated, or irrelevant to their actual use case. This is happening in real procurement documents written right now, by people who don't have the context you now have.
Given everything this module has covered, what does a trustworthy AI evaluation actually look like? Here are the signals worth watching for: the things that indicate someone is taking evaluation seriously rather than using it as marketing.
They publish scores they didn't win. A company that shows you a benchmark where their model came second, or even last, is showing you something real. They're not cherry-picking. That restraint is rare and worth noting.
They distinguish capability from deployment readiness. A research paper that says "this model scores well on X but we have not tested it on real-world deployment in context Y" is being careful about scope. A company that says "our model is ready for clinical use because it passed the USMLE" is not.
They track trends, not point scores. Researchers who say "over the last six months, improvement on this benchmark has slowed, suggesting the model is hitting a ceiling" are telling you something more useful than researchers who say "we achieved a new high score." Trajectories matter more than snapshots.
They welcome independent replication. Trustworthy results can be reproduced by people who didn't conduct the original study. If a company won't share enough details about their model and evaluation setup to allow independent replication, the score is harder to trust, regardless of how impressive it looks.
You now have a four-question framework and an understanding of data contamination, Goodhart's Law, and the narrow-vs-general capability gap. That's enough to read any AI benchmark claim more honestly than most of the journalists who cover AI, most of the executives who deploy it, and most of the policymakers who regulate it. This isn't an exaggeration. It's what the gap in public understanding actually looks like right now.
The Aristo story from 2019 shows researchers doing everything right (they were scrupulous and careful) and still having their work misrepresented. Several science educators wrote letters. The researchers published corrections and clarifications. And the misrepresentation still spread further than the correction.
This raises a question that doesn't have a clean answer: Do AI researchers have a responsibility to predict how their results will be misunderstood and proactively fight the misrepresentation, even if it means being louder and less nuanced than they'd prefer? Or does the responsibility for accurate reporting lie entirely with journalists and policymakers?
And here's the harder version: if a researcher publishes a careful, nuanced result knowing it will likely be misunderstood in ways that affect policy, and they do nothing to prevent that misunderstanding, are they responsible for the policy that results? There's no clean answer. Sit with it.
You're briefing a fictional government committee, the National AI Oversight Board, that is drafting rules requiring AI companies to demonstrate "state of the art performance" before their tools can be used in schools. Your interlocutor is the committee's senior analyst, who knows a lot about technology but hasn't yet understood the problems with benchmark-based standards.
Your job is to explain, clearly and specifically, why using benchmark scores as a legal threshold is more complicated than it sounds, and to propose what a better approach might look like. The analyst will push back with practical concerns and budget constraints.