When UC Berkeley researchers released MMLU (the Massive Multitask Language Understanding benchmark) in 2020, GPT-3 scored around 43%: well above the 25% expected from random guessing on a four-choice test, but far below the roughly 90% the authors estimated for human experts. About two and a half years later, GPT-4 would score above 86%, approaching that expert-level estimate. The benchmark had gone from challenge to near-solved in under 36 months.
MMLU (Massive Multitask Language Understanding), introduced by Dan Hendrycks et al. in 2020, contains 14,042 multiple-choice questions across 57 subjects ranging from elementary mathematics to professional law, medicine, and ethics. The benchmark was explicitly designed to measure a model's world knowledge and problem-solving ability across the breadth of academic disciplines a highly educated human might encounter.
Each question has four answer choices. Models are typically evaluated in a "5-shot" setting, meaning five example questions with answers are shown before the test question, allowing the model to see the expected format without fine-tuning on the dataset itself.
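As a rough illustration of the 5-shot setting, the sketch below assembles a prompt from five worked examples followed by the unanswered test question. The `Question` class, the `build_few_shot_prompt` helper, and the header wording are assumptions made for this example, not the benchmark's official harness.

```python
from dataclasses import dataclass

CHOICE_LABELS = "ABCD"


@dataclass
class Question:
    text: str
    choices: list[str]  # exactly four options
    answer: int         # index of the correct option (0 to 3)


def format_question(q: Question, include_answer: bool) -> str:
    """Render one question; leave the answer blank for the test item."""
    lines = [q.text]
    for label, choice in zip(CHOICE_LABELS, q.choices):
        lines.append(f"{label}. {choice}")
    answer = f" {CHOICE_LABELS[q.answer]}" if include_answer else ""
    lines.append(f"Answer:{answer}")
    return "\n".join(lines)


def build_few_shot_prompt(subject: str,
                          examples: list[Question],
                          test_question: Question) -> str:
    """Worked examples first, then the test question with no answer."""
    # Header wording is illustrative (an assumption for this sketch).
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject}."
    )
    blocks = [format_question(ex, include_answer=True) for ex in examples]
    blocks.append(format_question(test_question, include_answer=False))
    return header + "\n\n" + "\n\n".join(blocks)
```

The model's next token after the final "Answer:" is then compared with the correct letter. Because the examples only demonstrate the format, the model's weights are never updated on the dataset.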
MMLU's rapid conquest by frontier models exposed a structural problem that now haunts most static benchmarks: data contamination. Because MMLU questions are publicly available, they may appear in the training corpora of models being tested on them. When researchers at MIT and EleutherAI investigated in 2023, they found evidence that several open-weight models showed anomalously high MMLU scores inconsistent with their performance on held-out tasks — a signature of benchmark leakage.
OpenAI acknowledged in its GPT-4 technical report that they took steps to detect and mitigate contamination, but noted that completely eliminating web-scraped test data from training corpora at scale is an unsolved problem. This means MMLU scores, especially for models trained on large internet datasets, should be interpreted with caution.
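One simple way to look for this kind of leakage is to measure how many word n-grams of a test question also appear verbatim in a training document. The sketch below is a minimal illustration under stated assumptions: the function names and the 8-word window are hypothetical choices, and it is not the procedure used in the GPT-4 report or by any particular lab.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercased, whitespace-tokenized word n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the question's n-grams found verbatim in the document.

    n = 8 is an arbitrary illustrative window, not a recommended setting.
    Returns 0.0 when the question is shorter than n words.
    """
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0
    shared = question_grams & ngrams(corpus_doc, n)
    return len(shared) / len(question_grams)
```

A fraction near 1.0 suggests the question appears almost verbatim in the document. Exact matching of this kind misses paraphrased or translated copies, which is one reason contamination is hard to rule out completely.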
In late 2023, researchers discovered that the Llama-2 family of models scored notably higher on MMLU than on a "de-contaminated" version of the same questions rephrased to avoid direct text matches. The gap — sometimes 3–5 percentage points — suggested partial memorization rather than pure reasoning. Meta acknowledged the finding and noted that future model evaluations would use held-out benchmarks not released publicly prior to training cutoff.
By 2024, MMLU saturation prompted the creation of MMLU-Pro, released by the TIGER-Lab at the University of Waterloo. MMLU-Pro increases question difficulty substantially and expands answer choices from four to ten, reducing the effectiveness of educated guessing. It also emphasizes questions requiring multi-step reasoning rather than pure recall. Where top models scored above 86% on standard MMLU, they scored closer to 62–72% on MMLU-Pro at release — restoring meaningful spread between model tiers.
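The effect of moving from four to ten answer choices can be made concrete with a little arithmetic: the random-guessing floor drops from 25% to 10%. The sketch below also rescales accuracy so that 0 means chance and 1 means perfect. This rescaling is a common convention used here for illustration, not something the MMLU-Pro authors prescribe, and the function names are hypothetical.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy from uniform random guessing."""
    if num_choices < 2:
        raise ValueError("need at least two answer choices")
    return 1.0 / num_choices


def chance_corrected(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so chance maps to 0.0 and a perfect score to 1.0."""
    chance = chance_accuracy(num_choices)
    return (accuracy - chance) / (1.0 - chance)
```

Under this convention, 86% on a four-choice test corresponds to about 0.81 above chance, and 72% on a ten-choice test to about 0.69, so raw percentages from the two benchmarks are not directly comparable.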
This pattern — benchmark creation, rapid saturation, replacement by harder variant — has now repeated across nearly every major static evaluation. Understanding it is essential for reading leaderboard claims critically.
A benchmark score is only meaningful relative to the difficulty ceiling of the test and the probability of contamination. When evaluating a model's MMLU score, always ask: When was the model trained? Was MMLU publicly available then? And what is the model's MMLU-Pro score for comparison?
You are evaluating whether to use MMLU as a benchmark for selecting an AI model for a healthcare documentation tool. Ask the assistant at least three substantive questions about MMLU's design, its contamination risks, and whether its scores reliably predict performance on specialized medical knowledge tasks.
OpenAI released HumanEval alongside its Codex model, introducing a benchmark where correctness was determined not by human judges but by actually running the generated code against hidden test cases. A model could not bluff its way to a high score — the code either worked or it didn't. Early Codex solved 28.8% of 164 programming problems. By 2024, Claude 3.5 Sonnet would reach above 90% on the same set.
HumanEval, released by OpenAI in 2021, contains 164 hand-written Python programming challenges. Each challenge includes a function signature, a docstring, and several hidden unit tests. A model generates a function body, and the benchmark executes the code — if all hidden tests pass, the problem is considered solved. The primary metric is pass@k: the probability that at least one of k generated samples passes all tests.
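In practice pass@k is estimated by drawing n samples per problem (with n at least k), counting the c samples that pass, and computing 1 - C(n-c, k) / C(n, k), the combinatorial estimator described in the HumanEval paper. The sketch below is a minimal version of that calculation for a single problem; the function name and argument checks are choices made for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated
    c: number of those samples that passed every test
    k: sample budget being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    # Probability that k samples drawn without replacement all fail,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 10 samples of which 3 pass, pass@1 is 0.3 while pass@2 is about 0.53. The benchmark score is the mean of this value over all problems, which is why a high pass@10 can coexist with a much lower pass@1.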
This execution-based evaluation approach made HumanEval important: it reduced the ability to game the benchmark through surface-level fluency. A model that writes grammatically perfect but logically broken code scores zero on that problem. However, HumanEval's 164 problems were quickly noted to be relatively short, self-contained, and heavily weighted toward standard library manipulation — not representative of real-world software engineering involving multi-file codebases or debugging.
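The core of execution-based scoring can be sketched in a few lines: run the generated function, run the tests against it, and count the problem as solved only if nothing raises. This is a simplified illustration with a hypothetical `passes_tests` helper. It has no sandboxing or timeouts, so it should never be run on untrusted model output as written; real harnesses isolate execution.

```python
def passes_tests(candidate_source: str, test_source: str) -> bool:
    """Return True only if the candidate defines code that passes every test.

    WARNING: exec() runs arbitrary code. This sketch has no sandbox and
    no timeout, so an infinite loop in the candidate would hang it.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the generated function
        exec(test_source, namespace)       # run assert-based tests against it
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count
        # as an unsolved problem.
        return False
    return True
```

A fluent but wrong completion, such as an `add` function that subtracts, fails the assertion and scores zero no matter how plausible the code looks.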
In response to HumanEval's limitations, Princeton researchers released SWE-bench in late 2023, which tests models on real GitHub issues from popular open-source Python repositories. Models must generate code patches that resolve the issue and pass the repository's tests. Early results were humbling: GPT-4 solved only 1.7% of 2,294 issues, exposing the gap between clean benchmark performance and messy real-world software engineering. By late 2024, Claude 3.5 Sonnet with agentic scaffolding reached 49% on SWE-bench Verified, a human-validated subset. That was a dramatic improvement, but still far from human developer performance.
GSM8K (Grade School Math 8K), released by OpenAI in 2021, contains 8,500 grade-school-level math word problems requiring multi-step reasoning. Unlike MMLU's multiple choice, GSM8K requires generating the correct numerical answer, making it harder to guess. The benchmark was designed to assess whether models could perform the sequential arithmetic reasoning a ten-year-old might use — and early large models struggled significantly.
GPT-3 (175B parameters) scored roughly 35% on GSM8K with chain-of-thought prompting. By 2024, frontier models routinely exceed 90%. GSM8K is now considered largely saturated for top-tier models, though smaller open-weight models still show meaningful spread — making it more useful for evaluating efficient or edge-deployed systems than frontier comparisons.
The MATH benchmark, also from Dan Hendrycks (UC Berkeley, 2021), contains 12,500 problems drawn from high-school math competitions including AMC, AIME, and MATHCOUNTS. Problems are categorized by difficulty from Level 1 to Level 5. Early models scored in the single digits on Level 4 and 5 problems.
MATH proved more durable as a discriminating benchmark because Level 4–5 problems require genuine multi-step algebraic and proof-based reasoning. GPT-4 initially scored around 52% overall. The introduction of OpenAI's o1 model in September 2024, which uses explicit chain-of-thought reasoning before answering, pushed MATH scores above 90% — again triggering discussion about whether harder variants like AIME 2024 were needed.
For coding benchmarks, always check whether evaluation was done with execution against hidden tests (reliable) or by LLM-as-judge comparison (less reliable). HumanEval and SWE-bench use execution; many informal leaderboards do not. A model claiming "95% on coding tasks" with no mention of test execution should prompt skepticism.
Your company is evaluating AI coding assistants for integration into your development workflow. You need to select appropriate benchmarks to test candidates. Ask the assistant about the trade-offs between HumanEval and SWE-bench for your use case, and explore what pass@k values tell you about real-world reliability.
When Stephanie Lin and colleagues at Oxford released TruthfulQA, they had a specific hypothesis: that large language models would systematically reproduce false beliefs common on the internet — not because they lacked knowledge, but because training on human-generated text means absorbing human misconceptions. They were right. GPT-3 scored only 58% on truthfulness, worse than humans who averaged 94%. Larger models were sometimes less truthful than smaller ones — a pattern they called "inverse scaling."
TruthfulQA contains 817 questions across 38 categories including health, law, finance, and conspiracy theories. The benchmark specifically targets questions that "some humans answer falsely due to a misconception or false belief" — questions like "What happens if you swallow gum?" or "Is the Great Wall of China visible from space?" A truthful model should answer correctly even when popular culture has embedded a false answer.
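Scoring TruthfulQA is more complex than scoring multiple-choice benchmarks. The original paper relied on human evaluators and also introduced an automated metric, a fine-tuned GPT-3 judge, to rate responses on both truthfulness and informativeness. Both dimensions are needed because some models learn to dodge questions entirely, scoring technically "truthful" while providing no useful information. Later instruction-tuned models improved markedly on the original GPT-3 results, but reported figures depend on the variant (free-form generation or multiple choice) and on the judge used, so numbers from different sources are not directly comparable.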
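The reason for scoring two dimensions can be shown with a small tally. The sketch below assumes each response has already been labeled truthful or not and informative or not by some judge; the `Judgement` class and `summarize` function are hypothetical names for this example, not part of the TruthfulQA release.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    truthful: bool
    informative: bool


def summarize(judgements: list[Judgement]) -> dict[str, float]:
    """Fraction of responses that are truthful, informative, and both."""
    if not judgements:
        raise ValueError("need at least one judgement")
    n = len(judgements)
    return {
        "truthful": sum(j.truthful for j in judgements) / n,
        "informative": sum(j.informative for j in judgements) / n,
        "truthful_and_informative": sum(
            j.truthful and j.informative for j in judgements
        ) / n,
    }
```

A model that answers "I have no comment" to every question would score 1.0 on truthfulness but 0.0 on the combined measure, which is why the combined figure is the more meaningful one.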
TruthfulQA's 2021 paper documented that GPT-3 at 6.7B parameters scored higher on truthfulness than the 175B version. The authors attributed this to larger models being better at imitating the style of confident-but-wrong human writing found on the internet. This "inverse scaling" finding was influential in motivating RLHF (Reinforcement Learning from Human Feedback) as an alignment technique, later adopted by OpenAI for InstructGPT and GPT-4.
HellaSwag, from the University of Washington (2019), tests commonsense reasoning through sentence completion. Given a short description of a mundane activity, models choose the most plausible continuation from four options. The wrong answers ("adversarial" endings) are generated by a language model and then filtered so that only endings that fool discriminator models, yet still look obviously wrong to humans, are kept. This ensures the task requires genuine situational understanding, not just grammar.
At release, state-of-the-art models scored below 48% on HellaSwag, while humans scored above 95%. It became a standard benchmark in the 2019–2022 period. By 2023, top models exceeded 95%, and HellaSwag is now considered fully saturated for frontier systems, though it remains a useful check for smaller or specialized models.
BIG-bench (Beyond the Imitation Game Benchmark), published in 2022 by a collaboration of over 400 researchers across more than 130 institutions, contains 204 tasks spanning an unusually wide range of capabilities — from formal logic and mathematics to social reasoning, humor, and theory of mind. The project was explicitly designed to identify tasks where scale helps, tasks where it doesn't, and tasks where larger models perform paradoxically worse.
BIG-bench Hard (BBH) is a 23-task subset of BIG-bench, chosen because earlier language models failed to outperform the average human rater on those tasks. Tasks include logical deduction, multi-step arithmetic, date reasoning, and causal reasoning. BBH remains one of the more durable benchmarks as of 2024 because its tasks were selected specifically for being hard for large models at the time of creation.
TruthfulQA measures whether a model reproduces known misconceptions — not whether it will hallucinate facts in novel contexts. A model can score 90% on TruthfulQA while still confidently fabricating citations, dates, or statistics in real deployments. Safety benchmarks measure known failure modes; they cannot guarantee absence of unknown ones. Always pair benchmark scores with red-team testing for your specific deployment domain.
A vendor claims their model scores 88% on TruthfulQA and argues this makes it suitable for a legal research assistant. Ask the assistant at least three questions to explore what this score does and does not guarantee, what kinds of hallucination TruthfulQA would miss, and what additional evaluation you should require.
When Google announced Gemini 1.5 Pro in early 2024, the accompanying technical report cited performance on dozens of benchmarks. On some it topped the charts; on others it trailed GPT-4. Rather than surface a single number, Google emphasized specific task domains — a strategy that had become standard across frontier AI labs, each selecting the benchmarks where their model looked best. Reading these announcements required knowing which benchmarks had been omitted as much as which had been highlighted.
GPQA (Graduate-Level Google-Proof Q&A), released by Rein et al. in 2023, contains 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Crucially, the questions are designed to be Google-proof: answers cannot be found by searching for the question phrasing, so solving them requires genuine expert-level understanding. Highly skilled non-expert validators with unrestricted web access score around 34%; experts who hold or are pursuing PhDs in the relevant field score around 65%. A curated subset of the highest-quality questions is known as GPQA Diamond.
GPT-4 initially scored around 35–39% on GPQA Diamond — below expert human performance. OpenAI's o1 model, released in September 2024, reached approximately 78%, exceeding reported PhD expert performance and triggering significant discussion about whether current benchmarks could remain discriminating through 2025.
HELM (Holistic Evaluation of Language Models), developed by the Center for Research on Foundation Models (CRFM) at Stanford in 2022, takes a different philosophy from single-task benchmarks. Rather than producing one score, HELM evaluates models across 42 scenarios and 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) and presents the results side by side on a dashboard, so no single dimension hides the others.
HELM's approach reflects growing recognition that a single accuracy number obscures important trade-offs. A model can score high on accuracy while being poorly calibrated (overconfident), biased toward certain demographic groups, or brittle to small input perturbations. HELM was used in the Stanford HAI Foundation Model Transparency Index and influenced the evaluation frameworks of several government AI procurement guidelines.
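Calibration is the least intuitive of these dimensions, so a small worked example may help. The sketch below computes expected calibration error (ECE), one common calibration measure: predictions are grouped into confidence bins, and the gap between average confidence and actual accuracy is averaged across bins, weighted by bin size. The function name and the 10-bin default are assumptions for this example, and the text above does not specify whether HELM computes calibration exactly this way.

```python
def expected_calibration_error(confidences: list[float],
                               correct: list[bool],
                               num_bins: int = 10) -> float:
    """Weighted average gap between stated confidence and accuracy."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Assign each prediction to an equal-width confidence bin.
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A model that states 90% confidence and is right 9 times out of 10 has an ECE near 0. The same stated confidence with only 5 correct answers gives an ECE of about 0.4, a signature of overconfidence that a raw accuracy number would not reveal on its own.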
Chatbot Arena, developed by LMSYS (Large Model Systems Organization) at UC Berkeley, represents a fundamentally different evaluation paradigm. Instead of fixed benchmark questions, users submit real prompts and are shown two anonymized responses side by side, then vote for whichever they prefer. Elo ratings (borrowed from chess) are computed from millions of pairwise comparisons.
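The chess-style Elo update behind such ratings is short enough to sketch. After each vote, the winner gains points in proportion to how unexpected the win was. The K-factor of 32 and the function names below are conventional illustrative assumptions, not Chatbot Arena's actual configuration, and the leaderboard's real computation may differ from this online update.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float,
               rating_b: float,
               outcome_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return both new ratings after one comparison.

    outcome_a: 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    k: step size (32 is an assumed, conventional value).
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome_a - expected_a)
    new_b = rating_b + k * ((1.0 - outcome_a) - (1.0 - expected_a))
    return new_a, new_b
```

Two models that both start at 1000 move to 1016 and 984 after a single vote. A 400-point gap corresponds to the stronger model being preferred about 91% of the time, so small rating differences near the top of a leaderboard imply close to even odds.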
As of 2024, Chatbot Arena has collected over 1 million human preference votes and is considered one of the most reliable indicators of real-world usefulness precisely because it aggregates actual user preferences across natural, diverse prompts. Its limitation is the opposite of benchmark gaming: ratings reflect what users find satisfying, which may correlate imperfectly with accuracy, safety, or honesty. Models trained specifically to produce pleasing-sounding responses can score well even when subtly incorrect.
In May 2024, several AI researchers publicly noted that some labs released benchmark results selectively — citing improvements on subsets of evaluations while not publishing the full HELM dashboard or acknowledging declines on other dimensions. The practice prompted calls for standardized reporting requirements analogous to clinical trial pre-registration. The AI safety organization METR (formerly ARC Evals) began requiring labs it worked with to commit to evaluation sets before training began, to prevent retroactive benchmark selection.
When a model provider publishes benchmark results, apply these questions before drawing conclusions:
1. Who ran the evaluation? Self-reported benchmarks have obvious conflict-of-interest concerns. Third-party evaluations (Stanford HELM, METR, LMSYS) are more credible.
2. What was the prompting strategy? Zero-shot, few-shot, chain-of-thought, or retrieval-augmented? Different settings can change scores dramatically on the same benchmark.
3. Is the benchmark saturated? A 95% score on HellaSwag in 2024 tells you almost nothing about frontier capability differences.
4. What benchmarks are missing? Labs select the benchmarks where they perform best. Absence of GPQA, SWE-bench, or HELM in a report is informative.
5. What is the contamination risk? Check the model's training cutoff against the benchmark's public release date.
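Most of this checklist can be applied mechanically once the facts about a reported score are written down. The sketch below encodes questions 1, 2, 3, and 5 as simple flags; question 4 needs the whole report rather than a single claim, so it is left out. The `BenchmarkClaim` fields, the `review` function, and the 95% saturation threshold are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class BenchmarkClaim:
    benchmark: str
    score: float                      # fraction between 0 and 1
    self_reported: bool
    prompting: Optional[str]          # e.g. "5-shot"; None if undisclosed
    benchmark_release: date
    training_cutoff: Optional[date]   # None if the provider did not say


def review(claim: BenchmarkClaim,
           saturation_threshold: float = 0.95) -> list[str]:
    """Return human-readable warnings for one reported benchmark score."""
    flags = []
    if claim.self_reported:
        flags.append("self-reported: look for a third-party evaluation")
    if claim.prompting is None:
        flags.append("prompting strategy not disclosed")
    if claim.score >= saturation_threshold:
        # Threshold is an assumed rule of thumb, not an established cutoff.
        flags.append("score near ceiling: benchmark may be saturated")
    if claim.training_cutoff is None:
        flags.append("training cutoff unknown: contamination cannot be ruled out")
    elif claim.training_cutoff >= claim.benchmark_release:
        flags.append("contamination possible: benchmark was public before cutoff")
    return flags
```

An empty list does not prove a claim is sound. It only means none of these particular warning signs were triggered.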
Every benchmark faces a trilemma: it can be (1) cheap to run, (2) hard to contaminate, or (3) representative of real-world performance — but typically achieves at most two. HumanEval is cheap and executable but not representative of real software engineering. GPQA is hard to contaminate and discriminating but narrow. Chatbot Arena is representative but expensive to scale and biased toward user preference over accuracy. Use benchmarks as a portfolio, not a single score.
A vendor sends you a one-page summary: "Our model ranks #1 on Chatbot Arena and scores 91% on MMLU. It is the best general-purpose AI available." Ask the assistant at least three questions to systematically evaluate this claim using everything you've learned about benchmarks in this module — what's convincing, what's missing, and what else you'd need.