Lesson 1 · Module 3

MMLU and Academic Knowledge Benchmarks

How a 57-subject test became the de facto standard for measuring what language models actually know.

What does it mean for a model to "pass" a benchmark — and can a test become obsolete the moment a model aces it?

When UC Berkeley researchers released MMLU (the Massive Multitask Language Understanding benchmark) in 2020, GPT-3 scored around 43%, well short of expert level and only 18 points above the 25% expected from random guessing on a four-choice test. About two and a half years later, GPT-4 scored above 86%, approaching the roughly 89.8% accuracy the benchmark's authors estimated for human experts. The benchmark had gone from challenge to near-solved in under 36 months.

What MMLU Actually Tests

MMLU (Massive Multitask Language Understanding), introduced by Dan Hendrycks et al. in 2020, contains 14,042 multiple-choice questions across 57 subjects ranging from elementary mathematics to professional law, medicine, and ethics. The benchmark was explicitly designed to measure a model's world knowledge and problem-solving ability across the breadth of academic disciplines a highly educated human might encounter.

Each question has four answer choices. Models are typically evaluated in a "5-shot" setting, meaning five example questions with answers are shown before the test question, allowing the model to see the expected format without fine-tuning on the dataset itself.
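To make the 5-shot setting concrete, here is a minimal sketch of how such a prompt might be assembled. The `Question` class, the function names, and the header wording are illustrative assumptions modeled on common practice, not a specification from the benchmark itself.

```python
from dataclasses import dataclass
from typing import List, Optional

CHOICE_LABELS = "ABCD"  # MMLU questions have four answer choices


@dataclass
class Question:
    """Hypothetical container for one multiple-choice item."""
    text: str
    choices: List[str]
    answer: Optional[str] = None  # "A".."D" for worked examples, None for the test item


def format_question(q: Question) -> str:
    """Render the question, its lettered choices, and the answer line."""
    lines = [q.text]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, q.choices)]
    # Worked examples show their answer; the test question leaves it blank.
    lines.append(f"Answer: {q.answer}" if q.answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(subject: str, examples: List[Question], test_q: Question) -> str:
    """Concatenate the answered examples, then the unanswered test question."""
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [format_question(ex) for ex in examples] + [format_question(test_q)]
    return header + "\n\n" + "\n\n".join(blocks)
```

Passing five answered `Question` objects plus one unanswered test question yields a prompt that ends in a bare "Answer:" line, so the model's next token can be read as its choice.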

57

Subject Areas

14,042

Total Questions

~25%

Random Baseline

88.7%

GPT-4o (May 2024)

The Contamination Problem

MMLU's rapid conquest by frontier models exposed a structural problem that now haunts most static benchmarks: data contamination. Because MMLU questions are publicly available, they may appear in the training corpora of models being tested on them. When researchers at MIT and EleutherAI investigated in 2023, they found evidence that several open-weight models showed anomalously high MMLU scores inconsistent with their performance on held-out tasks — a signature of benchmark leakage.

OpenAI acknowledged in its GPT-4 technical report that they took steps to detect and mitigate contamination, but noted that completely eliminating web-scraped test data from training corpora at scale is an unsolved problem. This means MMLU scores, especially for models trained on large internet datasets, should be interpreted with caution.
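One simple way to look for this kind of leakage is to check how many word n-grams from a benchmark question also appear in a sample of training text. The sketch below is a generic illustration under that assumption; it is not the procedure used in any particular technical report, and the function names and the default of 8-word n-grams are arbitrary choices.

```python
import re
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercase the text, keep alphanumeric tokens, and return its word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur somewhere in the corpus sample."""
    q_grams = ngrams(question, n)
    if not q_grams:
        return 0.0  # question shorter than n tokens: nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(q_grams & corpus_grams) / len(q_grams)
```

A value near 1.0 suggests the question appears almost verbatim in the sampled text. A value near 0.0 does not rule out contamination, because paraphrased or translated copies would slip past an exact-match check like this one.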

Real Event — 2023 MMLU Controversy

In late 2023, researchers discovered that the Llama-2 family of models scored notably higher on MMLU than on a "de-contaminated" version of the same questions rephrased to avoid direct text matches. The gap — sometimes 3–5 percentage points — suggested partial memorization rather than pure reasoning. Meta acknowledged the finding and noted that future model evaluations would use held-out benchmarks not released publicly prior to training cutoff.

MMLU-Pro and Its Successors

By 2024, MMLU saturation prompted the creation of MMLU-Pro, released by the TIGER-Lab at the University of Waterloo. MMLU-Pro increases question difficulty substantially and expands answer choices from four to ten, reducing the effectiveness of educated guessing. It also emphasizes questions requiring multi-step reasoning rather than pure recall. Where top models scored above 86% on standard MMLU, they scored closer to 62–72% on MMLU-Pro at release — restoring meaningful spread between model tiers.
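The effect of moving from four to ten choices can be shown with a little arithmetic. The sketch below computes the random-guessing baseline and a chance-corrected score; the rescaling formula is a common convention used here as an assumption, not something MMLU or MMLU-Pro prescribes.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy from uniform random guessing."""
    return 1.0 / num_choices


def chance_corrected(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so that random guessing maps to 0.0 and a perfect score to 1.0."""
    baseline = chance_accuracy(num_choices)
    return (accuracy - baseline) / (1.0 - baseline)


# Guessing is worth 25% on a 4-choice test but only 10% on a 10-choice test.
mmlu_baseline = chance_accuracy(4)
mmlu_pro_baseline = chance_accuracy(10)
```

Comparing `chance_corrected(0.86, 4)` with `chance_corrected(0.72, 10)` puts scores from the two formats on a common footing before judging how much headroom remains.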

This pattern — benchmark creation, rapid saturation, replacement by harder variant — has now repeated across nearly every major static evaluation. Understanding it is essential for reading leaderboard claims critically.

Original

MMLU (2020)

57 subjects, 4-choice MCQ, 5-shot setting. Now effectively saturated by frontier models.

Harder Variant

MMLU-Pro (2024)

10-choice MCQ, harder questions, emphasis on multi-step reasoning. Restores discrimination.

AGIEval (2023)

Microsoft. Uses real human exams (LSAT, GRE, Chinese gaokao) to test general reasoning.

Key Principle

A benchmark score is only meaningful relative to the difficulty ceiling of the test and the probability of contamination. When evaluating a model's MMLU score, always ask: When was the model trained? Was MMLU publicly available then? And what is the model's MMLU-Pro score for comparison?

5-shot: Evaluation setting where five labeled examples precede the test question, calibrating format without fine-tuning on the task.

Benchmark saturation: Point at which top models score near ceiling, making the benchmark unable to distinguish between frontier systems.

Data contamination: Overlap between benchmark test questions and a model's training data, inflating scores beyond genuine capability.

Lesson 1 Quiz

MMLU and Academic Knowledge Benchmarks — 4 questions

How many subjects does the original MMLU benchmark span?

Correct. MMLU spans 57 subjects from elementary math to professional law and medicine, making it unusually broad for a single benchmark.

Not quite. MMLU covers 57 subject areas, deliberately designed to assess breadth across academic disciplines a highly educated human might know.

What is the key structural change MMLU-Pro made compared to the original MMLU?

Correct. MMLU-Pro uses 10-choice questions and harder multi-step reasoning tasks, restoring meaningful performance spread between model tiers that MMLU could no longer provide.

Not quite. MMLU-Pro retains multiple-choice format but expands options to 10 and increases difficulty, making educated guessing far less effective and discrimination between models more meaningful.

What does "data contamination" mean in the context of MMLU evaluation?

Correct. When publicly available benchmark questions end up in training corpora, models may effectively memorize answers rather than demonstrate genuine reasoning, inflating scores.

Incorrect. Data contamination refers specifically to overlap between benchmark test questions and the model's training data — meaning a model may have seen the answers during training, not just reasoned to them at test time.

What approximate MMLU score did GPT-3 achieve at the benchmark's release in 2020–2021?

Correct. GPT-3 scored roughly 43%, far below expert level and not far above the 25% random baseline for 4-choice questions, which is why MMLU was genuinely challenging when first introduced.

Not quite. GPT-3 scored around 43% on MMLU, against a 25% random-guessing baseline on a 4-choice test. The leap to near-expert performance came with GPT-4 in 2023.

Lab 1 — Interrogating MMLU

Use the AI assistant to probe MMLU's design choices and limitations.

Your Task

You are evaluating whether to use MMLU as a benchmark for selecting an AI model for a healthcare documentation tool. Ask the assistant at least three substantive questions about MMLU's design, its contamination risks, and whether its scores reliably predict performance on specialized medical knowledge tasks.

Suggested starting point: "We're comparing models for medical documentation. Should we trust their MMLU scores, or is there a better benchmark to use?"

Lesson 2 · Module 3

Coding and Mathematics Benchmarks

HumanEval, MATH, and GSM8K — measuring whether models can actually compute, not just talk about computing.

When a model claims to solve math problems, is it reasoning — or pattern-matching to solutions it has seen before?

OpenAI released HumanEval alongside its Codex model, introducing a benchmark where correctness was determined not by human judges but by actually running the generated code against hidden test cases. A model could not bluff its way to a high score — the code either worked or it didn't. Early Codex solved 28.8% of 164 programming problems. By 2024, Claude 3.5 Sonnet would reach above 90% on the same set.

HumanEval: Code That Actually Runs

HumanEval, released by OpenAI in 2021, contains 164 hand-written Python programming challenges. Each challenge includes a function signature, a docstring, and several hidden unit tests. A model generates a function body, and the benchmark executes the code — if all hidden tests pass, the problem is considered solved. The primary metric is pass@k: the probability that at least one of k generated samples passes all tests.
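In practice pass@k is usually estimated by generating n samples per problem (with n at least k), counting the c samples that pass, and applying the unbiased estimator given in the Codex paper that introduced HumanEval: 1 - C(n-c, k) / C(n, k). A minimal sketch follows; the function name and the input validation are illustrative choices.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, of which c passed all tests.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random draw of k samples (without replacement) contains no passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k samples include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score is the mean of this value over all problems. With k = 1 the estimate reduces to the plain pass rate c / n.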

This execution-based evaluation approach made HumanEval important: it reduced the ability to game the benchmark through surface-level fluency. A model that writes grammatically perfect but logically broken code scores zero on that problem. However, HumanEval's 164 problems were quickly noted to be relatively short, self-contained, and heavily weighted toward standard library manipulation — not representative of real-world software engineering involving multi-file codebases or debugging.
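The core of execution-based scoring can be sketched in a few lines. The example assumes the test code defines a `check(candidate)` function containing assertions, which is how HumanEval's test field is laid out. It deliberately omits the sandboxing and timeouts that a real harness needs, so it should never be run on untrusted model output as written.

```python
def passes_all_tests(candidate_source: str, test_source: str, entry_point: str) -> bool:
    """Return True only if the candidate code runs and every assertion in check() holds.

    WARNING: exec() runs arbitrary code. This sketch has no sandbox and no timeout.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)          # define the candidate function
        exec(test_source, namespace)               # define check(candidate)
        namespace["check"](namespace[entry_point])  # run the assertions
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failure.
        return False
    return True


# Toy problem: the tests pin down behavior, so fluent but wrong code fails.
tests = "def check(candidate):\n    assert candidate(2, 3) == 5\n    assert candidate(-1, 1) == 0\n"
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
```

Here `passes_all_tests(good, tests, "add")` succeeds, while the `bad` variant fails even though it is syntactically valid, which is the property that makes this style of evaluation hard to bluff.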

Real Event — SWE-bench (2023)

In response to HumanEval's limitations, Princeton researchers released SWE-bench in late 2023, which tests models on real GitHub issues from popular open-source repositories. Models must generate code patches that fix actual bugs. Early results were humbling: GPT-4 solved only 1.7% of 2,294 issues, exposing the gap between clean benchmark performance and messy real-world software engineering. By late 2024, an upgraded Claude 3.5 Sonnet with agentic scaffolding reached 49% on SWE-bench Verified, a human-validated subset. This was a dramatic improvement but still far from human developer performance.

GSM8K: Grade-School Math Reasoning

GSM8K (Grade School Math 8K), released by OpenAI in 2021, contains 8,500 grade-school-level math word problems requiring multi-step reasoning. Unlike MMLU's multiple choice, GSM8K requires generating the correct numerical answer, making it harder to guess. The benchmark was designed to assess whether models could perform the sequential arithmetic reasoning a ten-year-old might use — and early large models struggled significantly.
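Because GSM8K is scored on the final number rather than on a chosen letter, an evaluator has to pull that number out of free text. GSM8K reference solutions end with a line of the form "#### 42". The sketch below assumes the model's last number is its answer, which is a common heuristic and not an official scoring rule; the function names are illustrative.

```python
import re
from typing import Optional

# Matches integers and decimals, optionally negative, with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number in the text, preferring whatever follows a '####' marker."""
    tail = text.split("####")[-1] if "####" in text else text
    matches = _NUMBER.findall(tail)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def is_correct(model_output: str, reference: str, tol: float = 1e-6) -> bool:
    """Exact-match scoring on the final numeric answer, within a small tolerance."""
    pred = extract_final_number(model_output)
    gold = extract_final_number(reference)
    return pred is not None and gold is not None and abs(pred - gold) <= tol
```

Small differences in extraction rules (units, fractions, trailing text) can shift reported scores, which is one reason to check how a published GSM8K number was computed.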

GPT-3 (175B parameters) scored roughly 35% on GSM8K with chain-of-thought prompting. By 2024, frontier models routinely exceed 90%. GSM8K is now considered largely saturated for top-tier models, though smaller open-weight models still show meaningful spread — making it more useful for evaluating efficient or edge-deployed systems than frontier comparisons.

MATH: Competition-Level Problems

The MATH benchmark, also from Dan Hendrycks (UC Berkeley, 2021), contains 12,500 problems drawn from high-school math competitions including AMC, AIME, and MATHCOUNTS. Problems are categorized by difficulty from Level 1 to Level 5. Early models scored in the single digits on Level 4 and 5 problems.
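Since MATH labels every problem with a difficulty level, results are more informative when broken out by level than when reported as one average. A minimal sketch, assuming results arrive as hypothetical `(level, is_correct)` pairs:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_level(results: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Group (level, is_correct) pairs and return accuracy per difficulty level."""
    totals: Dict[int, int] = defaultdict(int)
    correct: Dict[int, int] = defaultdict(int)
    for level, is_right in results:
        totals[level] += 1
        correct[level] += int(is_right)
    return {level: correct[level] / totals[level] for level in sorted(totals)}
```

Two models with the same overall accuracy can differ sharply at Levels 4 and 5, and this breakdown is what exposes that difference.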

MATH proved more durable as a discriminating benchmark because Level 4–5 problems require genuine multi-step algebraic and proof-based reasoning. GPT-4 initially scored around 52% overall. The introduction of OpenAI's o1 model in September 2024, which uses explicit chain-of-thought reasoning before answering, pushed MATH scores above 90% — again triggering discussion about whether harder variants like AIME 2024 were needed.

Coding

HumanEval

164 Python problems, execution-based. pass@k metric. Now saturated for frontier models.

Coding (Hard)

SWE-bench

Real GitHub bug fixes. Much harder. Frontier models at ~49% (verified subset, 2024).

Math (Easy)

GSM8K

8,500 grade-school word problems. Nearly saturated for frontier models (>90%).

Math (Hard)

MATH

Competition problems, Levels 1–5. Still discriminating at highest difficulty levels.

Practical Insight

For coding benchmarks, always check whether evaluation was done with execution against hidden tests (reliable) or by LLM-as-judge comparison (less reliable). HumanEval and SWE-bench use execution; many informal leaderboards do not. A model claiming "95% on coding tasks" with no mention of test execution should prompt skepticism.

pass@k: The probability that at least one of k generated code samples passes all hidden unit tests for a given problem. pass@1 is the strictest; pass@10 allows multiple attempts.

Chain-of-thought: Prompting technique where the model is encouraged to show intermediate reasoning steps before giving a final answer, significantly improving performance on multi-step math and logic tasks.

Lesson 2 Quiz

Coding and Mathematics Benchmarks — 4 questions

How does HumanEval determine whether a model has correctly solved a programming problem?

Correct. HumanEval's execution-based evaluation is what makes it relatively robust — code that compiles but produces wrong outputs still fails.

Not quite. HumanEval actually executes the generated code against hidden unit tests. This is what makes it harder to game with fluent-sounding but incorrect code.

What made SWE-bench significantly harder than HumanEval?

Correct. SWE-bench uses real GitHub issues from actual open-source projects, requiring models to understand large codebases and fix genuine bugs — far more complex than isolated function writing.

Incorrect. SWE-bench's difficulty comes from using real GitHub issues requiring models to navigate large, multi-file codebases and generate working patches for actual bugs reported by real developers.

What is the primary metric used in HumanEval evaluations?

Correct. pass@k measures the probability that at least one of k generated solutions passes all hidden tests. pass@1 is strictest; higher k values give models more chances.

Not correct. HumanEval uses pass@k: the probability that at least one of k generated code samples fully passes all hidden unit tests for a given problem.

Why was the MATH benchmark more durable as a discriminating test than GSM8K?

Correct. MATH's Level 4–5 competition problems (from AMC, AIME, MATHCOUNTS) require genuine proof-style reasoning that frontier models could not easily pattern-match, making it harder to saturate than GSM8K's arithmetic word problems.

Not quite. MATH's durability comes from its highest-difficulty problems drawn from real math competitions. These require genuine algebraic and proof-based multi-step reasoning that surface-level pattern matching cannot solve.

Lab 2 — Choosing a Coding Benchmark

Advise on benchmark selection for a software development tool evaluation.

Your Task

Your company is evaluating AI coding assistants for integration into your development workflow. You need to select appropriate benchmarks to test candidates. Ask the assistant about the trade-offs between HumanEval and SWE-bench for your use case, and explore what pass@k values tell you about real-world reliability.

Suggested starting point: "We want to use an AI coding assistant for production code reviews. Which benchmark should we trust more — HumanEval or SWE-bench?"

Lesson 3 · Module 3

Safety, Alignment, and Reasoning Benchmarks

TruthfulQA, HellaSwag, BIG-bench, and the challenge of measuring what a model should not do.

If a model scores 95% on factual accuracy benchmarks but confidently states falsehoods in deployment — what did the benchmark actually measure?

When Stephanie Lin and colleagues at Oxford released TruthfulQA, they had a specific hypothesis: that large language models would systematically reproduce false beliefs common on the internet — not because they lacked knowledge, but because training on human-generated text means absorbing human misconceptions. They were right. GPT-3 scored only 58% on truthfulness, worse than humans who averaged 94%. Larger models were sometimes less truthful than smaller ones — a pattern they called "inverse scaling."

TruthfulQA: Measuring Calibrated Honesty

TruthfulQA contains 817 questions across 38 categories including health, law, finance, and conspiracy theories. The benchmark specifically targets questions that "some humans answer falsely due to a misconception or false belief" — questions like "What happens if you swallow gum?" or "Is the Great Wall of China visible from space?" A truthful model should answer correctly even when popular culture has embedded a false answer.

Scoring TruthfulQA is more complex than multiple-choice benchmarks. The original paper used a fine-tuned GPT-3 judge to rate responses on both truthfulness and informativeness, since some models learned to dodge questions entirely, scoring technically "truthful" while providing no useful information. GPT-4 scores above 85% on TruthfulQA; early GPT-3.5 variants scored in the 60s.
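The reason for tracking two labels becomes clear when they are aggregated. The sketch below assumes each response has already been judged for truthfulness and informativeness; the `JudgedAnswer` class and the summary keys are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JudgedAnswer:
    """Hypothetical record of a judge's verdict on one model response."""
    truthful: bool
    informative: bool


def summarize(judgments: List[JudgedAnswer]) -> Dict[str, float]:
    """Report each rate separately, plus the share of answers that are both."""
    n = len(judgments)
    if n == 0:
        raise ValueError("no judgments to summarize")
    return {
        "truthful": sum(j.truthful for j in judgments) / n,
        "informative": sum(j.informative for j in judgments) / n,
        "truthful_and_informative": sum(j.truthful and j.informative for j in judgments) / n,
    }
```

A model that answers "I have no comment" to everything would score 100% truthful and close to 0% on the combined rate, so the combined figure is the one that resists dodging.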

Real Finding — Inverse Scaling

TruthfulQA's 2021 paper documented that GPT-3 at 6.7B parameters scored higher on truthfulness than the 175B version. The authors attributed this to larger models being better at imitating the style of confident-but-wrong human writing found on the internet. This "inverse scaling" finding was influential in motivating RLHF (Reinforcement Learning from Human Feedback) as an alignment technique, later adopted by OpenAI for InstructGPT and GPT-4.

HellaSwag: Common-Sense Completion

HellaSwag, from the University of Washington (2019), tests commonsense reasoning through sentence completion. Given a short description of a mundane activity, models choose the most plausible continuation from four options. The wrong answers ("adversarial" endings) are generated by a language model and then filtered with discriminator models, keeping the endings that machines struggle to tell apart from the real one even though humans find them clearly wrong. This ensures the task requires genuine situational understanding, not just grammar.

HellaSwag showed early GPT-2 scoring around 40–50%, while humans scored above 95%. It became a standard benchmark in the 2019–2022 period. By 2023, top models exceeded 95%, and HellaSwag is now considered fully saturated for frontier systems, though it remains a useful check for smaller or specialized models.

BIG-bench: Breadth at Scale

BIG-bench (Beyond the Imitation Game Benchmark), published in 2022 by a collaboration of over 400 researchers across more than 130 institutions, contains 204 tasks spanning an unusually wide range of capabilities — from formal logic and mathematics to social reasoning, humor, and theory of mind. The project was explicitly designed to identify tasks where scale helps, tasks where it doesn't, and tasks where larger models perform paradoxically worse.

BIG-bench Hard (BBH) is a 23-task subset of the most challenging problems where frontier models still show meaningful performance gaps. Tasks include logical deduction, multi-step arithmetic, date reasoning, and causal reasoning. BBH remains one of the more durable benchmarks as of 2024 because its tasks were selected specifically for being hard for large models at the time of creation.

Truthfulness

TruthfulQA

817 questions targeting common human misconceptions. Documented inverse scaling in early models.

Common Sense

HellaSwag

Commonsense completion with adversarial wrong answers. Now saturated for frontier models.

Breadth

BIG-bench Hard

23 tasks selected for difficulty. Logical deduction, causal reasoning. Still discriminating in 2024.

Safety Adjacent

WinoGrande

Pronoun resolution requiring social and physical commonsense. 44K problems. Allen Institute (2020).

Critical Limitation of Safety Benchmarks

TruthfulQA measures whether a model reproduces known misconceptions — not whether it will hallucinate facts in novel contexts. A model can score 90% on TruthfulQA while still confidently fabricating citations, dates, or statistics in real deployments. Safety benchmarks measure known failure modes; they cannot guarantee absence of unknown ones. Always pair benchmark scores with red-team testing for your specific deployment domain.

Inverse scaling: Phenomenon where larger models perform worse than smaller ones on specific tasks. Documented on TruthfulQA and some BIG-bench tasks, and attributed to better imitation of human misconceptions.

Adversarial filtering: Process of iteratively discarding machine-generated distractors that a discriminator model can easily reject, so the remaining wrong answers are hard for models to distinguish from the correct one while still being clearly wrong to humans.

Lesson 3 Quiz

Safety, Alignment, and Reasoning Benchmarks — 4 questions

What specific type of questions does TruthfulQA target?

Correct. TruthfulQA deliberately targets questions where widespread human false beliefs — about health, history, law — might be reproduced by models trained on internet text.

Not quite. TruthfulQA focuses on questions that humans commonly get wrong because of popular misconceptions — for example, believing the Great Wall is visible from space or that gum stays in your stomach for years.

What did TruthfulQA's "inverse scaling" finding reveal about larger language models?

Correct. The inverse scaling finding showed GPT-3 175B scoring lower on truthfulness than the 6.7B version — larger models were more fluent at imitating confidently-stated false beliefs from internet text.

Incorrect. Inverse scaling refers specifically to larger models performing worse on truthfulness than smaller ones — because greater scale means greater ability to reproduce the style of human writing, including human misconceptions stated with confidence.

What technique does HellaSwag use to ensure its wrong answers require genuine reasoning to reject?

Correct. Adversarial filtering generates distractors with a language model and keeps only those that discriminator models cannot easily reject, so the remaining wrong answers are plausible enough that a model needs genuine situational understanding to rule them out.

Not quite. HellaSwag uses adversarial filtering: wrong answers are generated by a language model and filtered with discriminator models, leaving distractors that sound plausible to machines even though humans find them clearly wrong.

Why does BIG-bench Hard (BBH) remain a more durable benchmark than MMLU or HellaSwag for frontier model comparison?

Correct. BBH's 23 tasks were selected from the full BIG-bench suite precisely because they were hardest for large models at creation time — building in difficulty headroom that has made it more durable than benchmarks not designed with future model capability in mind.

Not quite. BBH's durability comes from task selection methodology: the 23 tasks were chosen specifically because they were hardest for frontier models when BIG-bench was created, giving the benchmark more headroom before saturation than most contemporaneous alternatives.

Lab 3 — Testing Truthfulness Claims

Probe how TruthfulQA scores relate to real deployment reliability.

Your Task

A vendor claims their model scores 88% on TruthfulQA and argues this makes it suitable for a legal research assistant. Ask the assistant at least three questions to explore what this score does and does not guarantee, what kinds of hallucination TruthfulQA would miss, and what additional evaluation you should require.

Suggested starting point: "A vendor says their model scores 88% on TruthfulQA. Is that enough to trust it for legal research where accuracy is critical?"

Lesson 4 · Module 3

Emerging Benchmarks and Leaderboard Literacy

GPQA, HELM, Chatbot Arena, and how to read benchmark claims without being misled by them.

When a company announces their model is "#1 on the leaderboard," which leaderboard? Measured how? By whom?

When Google announced Gemini 1.5 Pro in early 2024, the accompanying technical report cited performance on dozens of benchmarks. On some it topped the charts; on others it trailed GPT-4. Rather than surface a single number, Google emphasized specific task domains — a strategy that had become standard across frontier AI labs, each selecting the benchmarks where their model looked best. Reading these announcements required knowing which benchmarks had been omitted as much as which had been highlighted.

GPQA: Expert-Level Science Questions

GPQA (Graduate-Level Google-Proof Q&A), released by Rein et al. in 2023, contains 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Crucially, the questions are designed to be Google-proof: answers cannot be found by searching for the question phrasing, so solving them requires genuine expert-level understanding. Skilled non-experts with unrestricted web access score around 34%; experts who hold or are pursuing a PhD in the relevant field score around 65%. Model results are often reported on "Diamond," a smaller high-quality subset.

GPT-4 initially scored around 35–39% on GPQA Diamond — below expert human performance. OpenAI's o1 model, released in September 2024, reached approximately 78%, exceeding reported PhD expert performance and triggering significant discussion about whether current benchmarks could remain discriminating through 2025.

448

GPQA Questions

34%

Non-expert humans

65%

PhD experts

78%

o1 (Sep 2024)

HELM: Holistic Evaluation

HELM (Holistic Evaluation of Language Models), developed by the Center for Research on Foundation Models (CRFM) at Stanford in 2022, takes a different philosophy from single-task benchmarks. Rather than one score, HELM evaluates models across 42 scenarios and 7 metrics: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. Results are presented on a dashboard that shows every dimension side by side.

HELM's approach reflects growing recognition that a single accuracy number obscures important trade-offs. A model can score high on accuracy while being poorly calibrated (overconfident), biased toward certain demographic groups, or brittle to small input perturbations. HELM was used in the Stanford HAI Foundation Model Transparency Index and influenced the evaluation frameworks of several government AI procurement guidelines.
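Calibration is one of the less familiar metrics, so a worked sketch may help. The function below computes a standard binned expected calibration error (ECE). It is a generic formulation offered as an illustration and is not claimed to match HELM's exact implementation; the bin count of 10 is an arbitrary default.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], num_bins: int = 10
) -> float:
    """Weighted average gap between stated confidence and observed accuracy per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that claims 90% confidence and is right 9 times out of 10 gets an ECE near zero. The same confidence with only 5 correct answers gives about 0.4, even though neither figure shows up in a plain accuracy score.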

Chatbot Arena: Human Preference at Scale

Chatbot Arena, developed by LMSYS (Large Model Systems Organization) at UC Berkeley, represents a fundamentally different evaluation paradigm. Instead of fixed benchmark questions, users submit real prompts and are shown two anonymized responses side by side, then vote for whichever they prefer. Elo ratings (borrowed from chess) are computed from millions of pairwise comparisons.
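The classic Elo update is short enough to sketch. This uses the standard chess formula with a K-factor of 32 and a starting rating of 1000, both of which are illustrative assumptions; the statistical fitting behind the live leaderboard may differ from this online update.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Apply one vote. score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def rate_models(votes: Iterable[Tuple[str, str, float]], initial: float = 1000.0, k: float = 32.0) -> Dict[str, float]:
    """Process (model_a, model_b, score_a) votes in order and return final ratings."""
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in votes:
        ra = ratings.get(model_a, initial)
        rb = ratings.get(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return ratings
```

An upset (a low-rated model beating a high-rated one) moves ratings more than an expected result does, which is how a new model can climb quickly on relatively few votes.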

As of 2024, Chatbot Arena has collected over 1 million human preference votes and is considered one of the most reliable indicators of real-world usefulness precisely because it aggregates actual user preferences across natural, diverse prompts. Its limitation is the opposite of benchmark gaming: ratings reflect what users find satisfying, which may correlate imperfectly with accuracy, safety, or honesty. Models trained specifically to produce pleasing-sounding responses can score well even when subtly incorrect.

Real Event — "Cherry-Picking" Controversy (2024)

In May 2024, several AI researchers publicly noted that some labs released benchmark results selectively — citing improvements on subsets of evaluations while not publishing the full HELM dashboard or acknowledging declines on other dimensions. The practice prompted calls for standardized reporting requirements analogous to clinical trial pre-registration. The AI safety organization METR (formerly ARC Evals) began requiring labs it worked with to commit to evaluation sets before training began, to prevent retroactive benchmark selection.

Leaderboard Literacy: A Practical Checklist

When a model provider publishes benchmark results, apply these questions before drawing conclusions:

1. Who ran the evaluation? Self-reported benchmarks have obvious conflict-of-interest concerns. Third-party evaluations (Stanford HELM, METR, LMSYS) are more credible.

2. What was the prompting strategy? Zero-shot, few-shot, chain-of-thought, or retrieval-augmented? Different settings can change scores dramatically on the same benchmark.

3. Is the benchmark saturated? A 95% score on HellaSwag in 2024 tells you almost nothing about frontier capability differences.

4. What benchmarks are missing? Labs select the benchmarks where they perform best. Absence of GPQA, SWE-bench, or HELM in a report is informative.

5. What is the contamination risk? Check the model's training cutoff against the benchmark's public release date.
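The date comparison in item 5 is easy to automate. In this sketch the release years come from the lessons above, while the months are approximate first-publication dates added as an assumption and should be verified before relying on the result.

```python
from datetime import date
from typing import List

# Approximate public release dates (year from this module; month is an assumption).
BENCHMARK_RELEASE = {
    "MMLU": date(2020, 9, 1),
    "HumanEval": date(2021, 7, 1),
    "GSM8K": date(2021, 10, 1),
    "GPQA": date(2023, 11, 1),
}


def contamination_possible(benchmark: str, training_cutoff: date) -> bool:
    """True if the benchmark was already public when the training data was collected."""
    return training_cutoff >= BENCHMARK_RELEASE[benchmark]


def flag_benchmarks(training_cutoff: date) -> List[str]:
    """List every known benchmark that could have leaked into training data."""
    return sorted(
        name for name in BENCHMARK_RELEASE
        if contamination_possible(name, training_cutoff)
    )
```

A flag here means leakage was possible, not that it happened. Benchmarks released after the cutoff are the only ones this check can clear.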

The Benchmark Trilemma

Every benchmark faces a trilemma: it can be (1) cheap to run, (2) hard to contaminate, or (3) representative of real-world performance — but typically achieves at most two. HumanEval is cheap and executable but not representative of real software engineering. GPQA is hard to contaminate and discriminating but narrow. Chatbot Arena is representative but expensive to scale and biased toward user preference over accuracy. Use benchmarks as a portfolio, not a single score.

Elo rating: A statistical system for computing relative skill levels from pairwise outcomes, borrowed from chess and used by Chatbot Arena to rank models from millions of human preference votes.

Calibration: The degree to which a model's expressed confidence matches its actual accuracy. A well-calibrated model that says it's 70% confident should be right 70% of the time.

Google-proof: Describes benchmark questions where searching for the answer phrasing does not return the correct response, so answering requires genuine understanding rather than retrieval.

Lesson 4 Quiz

Emerging Benchmarks and Leaderboard Literacy — 4 questions

What makes GPQA questions "Google-proof"?

Correct. Google-proof means the question phrasing, even if searched directly, does not return the correct answer — requiring expert reasoning rather than retrieval or pattern matching to known text.

Not quite. "Google-proof" means the question is designed so that searching for its text online doesn't lead you to the answer — the problem requires genuine domain expertise to solve, not retrieval skill.

What is the key philosophical difference between HELM and single-task benchmarks like MMLU?

Correct. HELM's holistic approach — 42 scenarios across 7 metrics including calibration, bias, robustness, and toxicity — reflects the insight that a single accuracy number obscures crucial trade-offs relevant to real deployment decisions.

Not quite. HELM's distinguishing feature is evaluating models across multiple dimensions simultaneously — accuracy, calibration, fairness, toxicity, robustness, and efficiency — rather than collapsing everything into one number.

What is the primary limitation of Chatbot Arena's human preference evaluation approach?

Correct. Models that generate fluent, confident, well-formatted responses can score highly on Chatbot Arena even when subtly incorrect — because users prefer style and apparent helpfulness, which don't perfectly track factual accuracy or safety.

Not quite. Chatbot Arena's key limitation is that human preference for a response doesn't guarantee that response is accurate or safe. Models can be optimized to sound satisfying without being correct, and Arena scores reflect that bias.

According to the "benchmark trilemma," which two properties does Chatbot Arena achieve that most static benchmarks do not?

Correct. Chatbot Arena is representative (real, diverse prompts) and hard to contaminate (prompts arrive fresh from users rather than from a published test set), but it is expensive to scale because every rating depends on large volumes of human votes. Its scores also reflect what users prefer, which is not always what is accurate.

Not quite. The benchmark trilemma involves three properties: cheap to run, hard to contaminate, and representative of real use. Chatbot Arena achieves representative (natural, diverse prompts) and hard to contaminate (no fixed, published question set), but it is expensive to scale.

Lab 4 — Reading a Benchmark Report

Apply leaderboard literacy to a real vendor benchmark claim.

Your Task

A vendor sends you a one-page summary: "Our model ranks #1 on Chatbot Arena and scores 91% on MMLU. It is the best general-purpose AI available." Ask the assistant at least three questions to systematically evaluate this claim using everything you've learned about benchmarks in this module — what's convincing, what's missing, and what else you'd need.

Suggested starting point: "A vendor claims #1 on Chatbot Arena and 91% on MMLU. How should I evaluate whether this makes their model right for our enterprise use case?"

Module 3 — Final Test

Common Benchmarks Explained · 15 questions · Pass at 80%

1. How many multiple-choice questions does the original MMLU benchmark contain?

Correct. MMLU contains 14,042 questions across 57 subjects.

Incorrect. MMLU contains 14,042 questions. 817 is TruthfulQA; 8,500 is GSM8K; 448 is GPQA.

2. What does the "5-shot" evaluation setting mean in MMLU?

Correct. 5-shot means five answered examples precede the test question, establishing the expected format without fine-tuning.

Incorrect. 5-shot means five labeled example Q&A pairs are shown before the actual test question — establishing format context without fine-tuning the model on the task.
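To make the 5-shot setting concrete, here is a minimal sketch of how such a prompt could be assembled. The layout (lettered choices followed by an "Answer:" line) and the function names are assumptions for illustration; evaluation harnesses differ in their exact templates.

```python
LETTERS = "ABCD"  # MMLU questions have four answer choices


def format_example(question, choices, answer=None):
    """Render one question; leave the answer blank for the test item."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(examples, test_question, test_choices):
    """Place the answered examples first, then the unanswered test question.

    `examples` is a list of (question, choices, answer_letter) tuples;
    passing five of them gives the 5-shot setting.
    """
    blocks = [format_example(q, c, a) for q, c, a in examples]
    blocks.append(format_example(test_question, test_choices))
    return "\n\n".join(blocks)
```

The model only ever sees these examples in its context window. Its weights are not updated, which is the sense in which the setting avoids fine-tuning.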

3. What is benchmark saturation?

Correct. Saturation means the benchmark can no longer differentiate top models because they all score near the maximum.

Incorrect. Saturation refers to the point where top models all score near ceiling — the benchmark loses discriminating power for comparing frontier systems.

4. What primary change did MMLU-Pro make to restore discriminating power?

Correct. MMLU-Pro uses 10 answer choices and harder reasoning-heavy questions, dropping frontier model scores from 86%+ to the 62–72% range at release.

Incorrect. MMLU-Pro's key changes were expanding options to 10 choices and increasing difficulty toward multi-step reasoning tasks, significantly widening the gap between model tiers again.

5. What does pass@k measure in HumanEval?

Correct. pass@k is the probability that at least one of k sampled solutions passes all of a problem's hidden test cases; pass@1 is the strictest setting.

Incorrect. pass@k measures the probability that at least one of k generated code solutions passes all hidden unit tests for a problem. pass@1 is most stringent; higher k values give more attempts.
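As a worked illustration of that definition, the sketch below uses the combinatorial estimator commonly applied to HumanEval-style results: generate n samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k). The function names are hypothetical, and a real harness may compute this in a more numerically stable way.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that pass every test, k: attempt budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k draws include a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With 10 samples of which 3 pass, pass_at_k(10, 3, 1) is 0.3, and the value rises as k grows because each extra attempt is another chance to succeed.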

6. When SWE-bench was first released, what percentage of issues did GPT-4 successfully resolve?

Correct. GPT-4 resolved only 1.7% of SWE-bench issues at release, highlighting the gap between clean benchmark performance and real-world software engineering complexity.

Incorrect. GPT-4 solved roughly 1.7% of SWE-bench issues — an unexpectedly low result that underscored how hard real-world bug fixing in large codebases is compared to isolated function generation.

7. Which benchmark source material does the MATH benchmark draw its hardest problems from?

Correct. MATH draws from AMC, AIME, MATHCOUNTS, and similar competition mathematics, which require genuine multi-step algebraic reasoning.

Incorrect. MATH's hardest problems come from competition math (AMC, AIME, MATHCOUNTS), which requires multi-step reasoning far beyond standardized-test arithmetic.

8. What did TruthfulQA's inverse scaling finding reveal about GPT-3?

Correct. GPT-3 175B scored lower on TruthfulQA than the 6.7B model: larger models were better at imitating the confident-but-false writing style common on the internet.

Incorrect. The inverse scaling finding showed GPT-3 175B was less truthful than the smaller 6.7B version — scale made it better at mimicking confidently stated false beliefs from its internet training data.

9. TruthfulQA contains how many questions across how many categories?

Correct. TruthfulQA has 817 questions across 38 categories including health, law, finance, and conspiracy theories.

Incorrect. TruthfulQA contains 817 questions across 38 categories. 14,042/57 is MMLU; 448/3 is GPQA; 8,500 is GSM8K.

10. How many tasks does BIG-bench contain in total, and how many are in the BIG-bench Hard (BBH) subset?

Correct. BIG-bench has 204 tasks; BBH selects the 23 hardest for frontier-model discrimination.

Incorrect. BIG-bench contains 204 tasks contributed by over 400 researchers. BIG-bench Hard (BBH) is the 23-task subset selected for being hardest for large models at creation time.

11. GPQA's "Diamond" subset difficulty places PhD-level domain experts at approximately what score?

Correct. PhD-level domain experts score around 65% on GPQA Diamond; non-experts with internet access score only ~34%.

Incorrect. GPQA Diamond is hard enough that PhD-level domain experts score roughly 65%. Non-expert humans with internet access score only about 34% — illustrating genuine expert-level difficulty.

12. What rating system does Chatbot Arena use to rank models from pairwise human preference votes?

Correct. Chatbot Arena uses Elo ratings — the same system used in chess — to compute relative model rankings from pairwise human preference votes.

Incorrect. Chatbot Arena uses Elo ratings from competitive chess, which compute relative skill levels from pairwise win/loss outcomes across millions of comparisons.
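For readers unfamiliar with Elo, the following sketch shows the standard chess-style update applied to a single pairwise vote. The K-factor of 32 and the function names are assumptions for illustration; the module does not specify the leaderboard's actual parameters or fitting procedure.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return both new ratings after one comparison.

    score_a is 1.0 if A's response was preferred, 0.0 if B's was,
    and 0.5 for a tie. k (assumed here) controls the step size.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Ratings are zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta
```

An upset moves ratings more than an expected result does: when two models start at 1000 and A wins, A gains 16 points, but a 1200-rated model beating a 1000-rated one gains fewer than 8.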

13. HELM evaluates models across how many scenarios and metrics simultaneously?

Correct. HELM uses 42 scenarios evaluated across 7 metrics: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.

Incorrect. HELM covers 42 scenarios and 7 metrics — accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency — providing a multidimensional view rather than a single score.

14. Which benchmark is described by the module as most useful for evaluating efficient or edge-deployed models in 2024, despite being saturated for frontier comparisons?

Correct. GSM8K is saturated for top-tier models but still shows meaningful spread for smaller or edge-deployed systems, making it useful in those evaluation contexts.

Incorrect. GSM8K is noted as still useful for smaller or edge-deployed model evaluation even though frontier models exceed 90% — it retains discriminating power in that tier of the capability landscape.

15. According to the module's benchmark literacy checklist, what does the absence of SWE-bench or GPQA in a vendor's benchmark report typically signal?

Correct. Benchmark selection bias means labs report where they perform well. A report citing only saturated or favorable benchmarks while omitting harder ones is a meaningful signal worth investigating.

Incorrect. The module explicitly notes that benchmark omission is as informative as inclusion — labs select evaluations that favor their model. Missing harder benchmarks like SWE-bench or GPQA warrants asking why they were not reported.