Lesson 1 · Module 2

What a Number Actually Means

Score anatomy — understanding what sits behind a single benchmark percentage

When a model scores 89.7% on MMLU, what exactly is being measured — and what is not?

In July 2023, Meta released Llama 2 with a technical report claiming strong performance across a range of benchmarks. The same month, Anthropic released Claude 2. Both teams published tables of numbers. Within days, journalists and engineers were writing headlines that one model "beat" the other on reasoning, citing a single MMLU score to make the case. The problem: the two teams used subtly different few-shot prompting setups, different answer extraction methods, and different evaluation subsets. The numbers were real. The comparison was not.

Anatomy of a Benchmark Score

A benchmark score is a compression: it collapses thousands of individual model decisions into a single number for legibility. Understanding what that compression discards is as important as reading the number itself.

Every published score is the product of at least five distinct choices: (1) the task set, meaning which questions or prompts are used; (2) the prompting format, meaning how many worked examples precede the test question (zero-shot vs. few-shot); (3) the decoding strategy, such as greedy decoding, temperature sampling, or beam search; (4) the answer extraction rule, for example whether "The answer is B" counts the same as a bare "B"; and (5) the aggregation method, such as micro vs. macro averaging across subtopics.

Each of these choices can shift scores by several percentage points on the same underlying model. The Open LLM Leaderboard run by Hugging Face standardized all five for exactly this reason when it launched in 2023 — yet even there, models trained on data that includes benchmark questions ("benchmark contamination") produce inflated scores that do not reflect generalization.
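To make the aggregation choice concrete, here is a minimal sketch of micro vs. macro averaging. The subtopic names and counts are invented for illustration and do not come from any real benchmark run.

```python
from typing import Dict, Tuple

# Hypothetical per-subtopic results as (correct, total); values are invented.
RESULTS: Dict[str, Tuple[int, int]] = {
    "large_subtopic": (900, 1000),  # 90% on many questions
    "small_subtopic": (10, 50),     # 20% on few questions
}


def micro_average(per_topic: Dict[str, Tuple[int, int]]) -> float:
    """Pool every question together, so each question has equal weight."""
    correct = sum(c for c, _ in per_topic.values())
    total = sum(n for _, n in per_topic.values())
    return correct / total


def macro_average(per_topic: Dict[str, Tuple[int, int]]) -> float:
    """Average per-subtopic accuracies, so each subtopic has equal weight."""
    accuracies = [c / n for c, n in per_topic.values()]
    return sum(accuracies) / len(accuracies)


micro = micro_average(RESULTS)
macro = macro_average(RESULTS)
print(f"micro: {micro:.1%}  macro: {macro:.1%}")
```

With these invented counts the pooled (micro) figure is about 87% while the per-subtopic (macro) figure is 55%. The same raw answers produce two very different headline numbers, depending only on how they are averaged.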

Real Case — MMLU Score Variance

Researchers at EleutherAI demonstrated in 2023 that re-running the same model (LLaMA-1 65B) on MMLU with a different answer extraction regex — matching the first capital letter anywhere in the response vs. only at the start — changed the reported score by up to 3.2 percentage points. No weights changed. No data changed. Only the post-processing changed.
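The effect is easy to reproduce in miniature. The sketch below scores the same four responses under two extraction rules. Both regular expressions and all four responses are assumptions made up for this example; they are not the patterns or data from the case above.

```python
import re
from typing import List, Optional, Tuple

# Two hypothetical extraction rules, loosely modeled on the case above.
ANYWHERE = re.compile(r"\b([A-D])\b")    # first standalone A-D anywhere
AT_START = re.compile(r"^\s*([A-D])\b")  # standalone A-D only at the start

# Invented (response, gold answer) pairs; not real model output.
RESPONSES: List[Tuple[str, str]] = [
    ("B", "B"),
    ("The answer is C.", "C"),
    ("A is tempting, but the answer is D.", "D"),
    ("C", "C"),
]


def extract(pattern: re.Pattern, response: str) -> Optional[str]:
    """Return the extracted option letter, or None if nothing matches."""
    match = pattern.search(response)
    return match.group(1) if match else None


def accuracy(pattern: re.Pattern, items: List[Tuple[str, str]]) -> float:
    """Fraction of items whose extracted letter equals the gold answer."""
    hits = sum(1 for response, gold in items if extract(pattern, response) == gold)
    return hits / len(items)


score_anywhere = accuracy(ANYWHERE, RESPONSES)
score_at_start = accuracy(AT_START, RESPONSES)
print(score_anywhere, score_at_start)
```

The "anywhere" rule credits "The answer is C." but is fooled by the leading "A" in the third response. The "at start" rule misses both. Neither rule is right in general, which is why the rule has to be reported alongside the score.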

Reading the Score Line

When you encounter a published benchmark score, the minimum information required to interpret it responsibly is:

Benchmark name: Which dataset and task type. MMLU (multiple-choice knowledge), HumanEval (code generation), HellaSwag (commonsense NLI), GSM8K (grade-school math) all measure fundamentally different things.

Shot count: 0-shot, 5-shot, 25-shot. More shots generally inflate scores because the model sees formatting examples. Cross-shot comparisons are nearly meaningless.

Metric: Accuracy, exact match, pass@k, BLEU, normalized accuracy. Each rewards different behaviors. Pass@k for code is especially sensitive to k.

Evaluator: Self-reported scores from model labs are not the same as third-party reproductions. Both Google (PaLM 2 report, 2023) and Meta (Llama 2 report, 2023) published scores that independent reproducers found slightly different when replicating conditions exactly.
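The sensitivity of pass@k to k can be seen with the estimator commonly used for HumanEval-style evaluation: given n sampled solutions of which c pass the tests, pass@k is estimated as 1 - C(n-c, k) / C(n, k). The lesson does not say which estimator any particular report used, so treat this as one widely used option. The sample counts below are invented.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which passed, for 1 <= k <= n."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1. comb() returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 20 samples drawn, 3 of them pass.
for k in (1, 5, 10):
    print(k, round(pass_at_k(20, 3, k), 3))
```

For the same 3-in-20 success rate, pass@1 is 15% while pass@10 is close to 90%. A pass@k figure quoted without its k is not interpretable.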

Score vs. Rank — A Critical Distinction

A model scoring 88% on MMLU and another scoring 85% may not meaningfully differ in practical capability. Sampling error alone on a 14,000-question test puts differences under about 1–2 percentage points within statistical noise, and differences in evaluation setup can add several points on top of that. Yet leaderboards display such scores as distinct ranked positions, creating a false precision that influences purchasing decisions and research directions.

The correct question to ask is never "which score is higher?" in isolation — it is: Is the difference statistically significant? Is it replicated across multiple benchmarks? Does it generalize to the actual task I care about?
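The first of those questions can be approximated with a two-proportion z-test. This is a sketch under a simplifying assumption: it treats the two models' results as independent samples. Because both models answer the same questions, a paired test on per-question results would be more appropriate when that data is available. The 0.6-point pair below is hypothetical.

```python
from math import erf, sqrt
from typing import Tuple


def two_proportion_z(acc_a: float, acc_b: float, n_a: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two accuracies."""
    pooled = (acc_a * n_a + acc_b * n_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (acc_a - acc_b) / se
    # Standard normal CDF via erf, then a two-sided tail probability.
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p


N = 14_000  # approximate MMLU test size quoted above
z_small, p_small = two_proportion_z(0.866, 0.860, N, N)  # hypothetical 0.6pp gap
z_large, p_large = two_proportion_z(0.88, 0.85, N, N)    # the 3pp gap above
print(f"0.6pp gap: p={p_small:.3f}   3pp gap: p={p_large:.2g}")
```

A small p-value only rules out sampling noise. It says nothing about whether the two scores were produced with the same prompting, decoding, and extraction setup, which is why the other two questions still have to be asked.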

Core Principle

A benchmark score without its evaluation methodology is like a clinical trial result without its study design. The number alone tells you almost nothing about whether the result is real, reproducible, or relevant to your use case.

Element	What to look for	Why it matters
Shot count	0-shot or few-shot?	Few-shot adds up to ~5pp on MMLU
Decoding	Greedy vs. sampled?	Greedy inflates exact-match metrics
Extraction	How is the answer pulled?	Regex differences → ±3pp
Aggregation	Macro or micro average?	Large subtasks dominate micro-avg; macro-avg weights every subtask equally
Contamination	Test set in training data?	Inflates by unknown, possibly large amount

Lesson 1 Quiz

What a Number Actually Means · 4 questions

1. EleutherAI researchers showed that changing only the answer extraction regex on LLaMA-1 65B shifted MMLU scores by up to how much?

Correct. A 3.2pp shift from post-processing alone — with no model change — illustrates how sensitive benchmark numbers are to implementation details.

Not quite. EleutherAI found up to 3.2 percentage points of difference from answer extraction alone — a striking illustration of methodological sensitivity.

2. Which of the following is NOT one of the five key choices that shape a benchmark score?

Correct. Server location is irrelevant to benchmark score computation. The five key choices are: task set, prompting format, decoding strategy, answer extraction rule, and aggregation method.

Server location has no effect on benchmark scores. The five choices that do matter are: task set, prompting format, decoding strategy, answer extraction rule, and aggregation method.

3. Why did the Hugging Face Open LLM Leaderboard standardize all five evaluation choices when it launched in 2023?

Exactly right. Standardization removes the noise introduced by different labs using different prompting, decoding, and extraction setups — enabling fairer comparisons.

The leaderboard standardized to enable meaningful comparisons — when each lab uses different methods, you cannot tell whether score differences reflect real capability differences or just methodological ones.

4. When two models score 88% and 85% on MMLU, the most appropriate conclusion is:

Correct. Small gaps on a single benchmark are often within statistical noise and should be confirmed across multiple benchmarks before drawing conclusions about overall capability.

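A 3-point gap on a single benchmark can come from evaluation setup alone: answer extraction differences shifted MMLU scores by up to 3.2 percentage points in the EleutherAI case. Responsible interpretation requires checking significance and replicating the pattern across multiple benchmarks.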

Lab 1 — Score Dissection

Practice reading and critiquing benchmark score reports

Your Task

You will be shown a hypothetical (but realistic) benchmark claim. Ask the AI to help you identify what information is missing, what methodology questions need answers, and whether the comparison is valid. Complete at least 3 exchanges to finish the lab.

Starting scenario: "Model A scores 91.2% on MMLU (5-shot). Model B scores 89.4% on MMLU. Therefore Model A has superior general knowledge." — What questions should you ask before accepting this conclusion?

Score Dissection Lab

Welcome to Lab 1. I'm here to help you practice critically reading benchmark scores. The scenario above presents a common type of claim you'll encounter in model announcements and blog posts. What's the first question you'd want to ask about that comparison?

Lesson 2 · Module 2

Leaderboards and Their Distortions

How ranking systems shape incentives, inflate scores, and mislead practitioners

Why do leaderboard-topping models sometimes underperform on real tasks — and how do you read past the ranking?

When the LMSYS Chatbot Arena launched in 2023, it introduced an Elo-based ranking system driven by human preference votes. Almost immediately, model developers began noticing that optimizing for Chatbot Arena Elo — producing verbose, confident, well-formatted responses — diverged from optimizing for accuracy on factual tasks. A model could climb the leaderboard by sounding authoritative while being measurably less accurate on knowledge benchmarks. The ranking was real. The capability it implied was not always.

How Leaderboards Work

Most AI leaderboards fall into two families: automated benchmarks (like Hugging Face's Open LLM Leaderboard, which runs fixed test sets with standardized evaluation code) and human preference rankings (like Chatbot Arena, which aggregates pairwise human votes into an Elo score).

Each family measures something real but different. Automated benchmarks measure accuracy on curated tasks under controlled conditions. Human preference rankings measure whether a response feels better — which conflates accuracy, style, confidence, length, and formatting. Neither directly measures what most practitioners actually care about: task-specific performance in production.
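To show why a preference ranking carries no notion of correctness, here is a minimal sketch of the classic Elo update. The K-factor of 32 and the 400-point scale are conventional chess defaults used here as assumptions. The lesson does not specify how Chatbot Arena computes its ratings.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Apply one vote. score_a is 1.0 (A wins), 0.5 (tie), or 0.0 (B wins)."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Two hypothetical models start level; a rater prefers model A's answer.
new_a, new_b = update(1000.0, 1000.0, score_a=1.0)
print(new_a, new_b)
```

The only input is which response the rater preferred. Whether the preferred response was factually correct never enters the calculation.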

Goodhart's Law in Action — 2023–2024

Once benchmark scores became selection criteria for enterprise procurement, labs optimized explicitly for benchmark performance. Mistral AI's documentation for Mixtral 8x7B (December 2023) was notably careful to list exact evaluation configurations, precisely because the company knew that other labs' numbers used incompatible setups. Meanwhile, models fine-tuned specifically on MMLU-adjacent data showed scores that did not transfer to novel reasoning tasks — a textbook instance of Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.

The Contamination Problem

Benchmark contamination occurs when test questions (or near-duplicates) appear in a model's training data. Because the internet contains solutions to many standardized test questions, and because large language models train on internet-scale data, contamination is nearly universal to some degree.

In April 2024, researchers at the Allen Institute for AI published a contamination analysis showing that several top-ranked models on MMLU had non-trivial overlap between their pretraining data and the MMLU test set. Scores were inflated by an estimated 2–7 percentage points depending on the model and subject area. The leaderboard ranks held — but they measured memorization alongside (and sometimes instead of) reasoning.

Elo score: A rating derived from pairwise comparisons (win/loss/tie). Chatbot Arena Elo reflects human preference in head-to-head matchups, not factual accuracy or task completion rate.

Contamination: Test questions appearing in training data, inflating scores by unknown amounts. Detected via n-gram overlap analysis, but difficult to fully eliminate at web scale.

Goodhart's Law: Originally from economics: "When a measure becomes a target, it ceases to be a good measure." In AI: optimizing explicitly for benchmark scores degrades the benchmark's validity as a signal of capability.
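The n-gram overlap analysis mentioned under Contamination can be sketched in a few lines. This is a toy version under stated assumptions: whitespace tokenization, lowercasing, and an in-memory corpus. Real contamination studies work at far larger scale and choose their own n and thresholds. The question and documents below are invented.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercased, whitespace-tokenized n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


question = "Which planet in the solar system has the most known moons"
leaked = ["Quiz answers: which planet in the solar system has the most known moons ? Saturn."]
clean = ["A recipe for sourdough bread with a long cold fermentation"]

print(overlap_fraction(question, leaked, n=5), overlap_fraction(question, clean, n=5))
```

Exact n-gram matching catches verbatim leaks but misses paraphrased or translated copies, which is one reason contamination is hard to eliminate fully.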

Reading Past the Rank

Experienced practitioners use leaderboards as a coarse filter, not a final verdict. A model in the top 10 of a well-run automated leaderboard is probably worth evaluating further. A model at rank 1 is not necessarily the best choice for your specific application.

The right move is to look at the confidence interval or standard error around each score (if published), check whether the evaluation was third-party or self-reported, and look at per-category breakdowns rather than aggregate scores. A model that ranks 3rd overall but ranks 1st in medical question answering may be exactly what a healthcare application needs.

Source	Type	Characteristics
Automated Leaderboard	Fixed	Task set, standardized eval code, reproducible
Human Preference (Arena)	Dynamic	Pairwise votes, Elo rating, style-sensitive
Self-Reported	Variable	Lab's own setup, not independently verified
Third-Party Repro	Gold	Most trustworthy; often differs from self-report

Practitioner Rule

Use leaderboards to build a shortlist. Use your own evaluation data to make the final decision. A model's leaderboard rank is a hypothesis about its capability — your production task is the experiment that tests it.

Lesson 2 Quiz

Leaderboards and Their Distortions · 4 questions

1. The LMSYS Chatbot Arena uses which scoring system to rank models?

Correct. Chatbot Arena Elo is derived from pairwise head-to-head votes by human raters — making it a preference signal, not an accuracy signal.

Chatbot Arena uses Elo ratings from pairwise human preference comparisons — not accuracy benchmarks. This distinction matters because preference can diverge from accuracy.

2. Allen Institute for AI research (2024) estimated that benchmark contamination inflated MMLU scores by approximately how much for affected models?

Correct. The 2–7pp range illustrates why contamination is a serious concern — it can reverse the ordering of models that are otherwise close in capability.

The Allen Institute found 2–7 percentage points of inflation depending on model and subject area — enough to meaningfully change leaderboard positions among closely ranked models.

3. According to Goodhart's Law as applied to AI benchmarks, what happens when a benchmark score becomes an explicit optimization target?

Exactly right. Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Models fine-tuned for benchmark performance can achieve high scores without the underlying capability generalizing.

Goodhart's Law says the opposite: once a metric is the target, it degrades as a signal of the underlying quality it was supposed to measure. High benchmark scores achieved through explicit optimization may not generalize.

4. A model ranks 3rd overall on the Open LLM Leaderboard but 1st in medical question answering. For a healthcare application, this model is:

Correct. Domain-specific sub-scores are often more predictive of production performance than overall aggregate rank. A practitioner should always look at per-category breakdowns.

For domain-specific applications, the relevant category score is far more predictive than overall rank. A model dominating medical QA is a strong candidate for a healthcare use case regardless of aggregate position.

Lab 2 — Leaderboard Analysis

Practice evaluating leaderboard claims and spotting distortions

Your Task

Explore leaderboard-related scenarios with the AI assistant. Ask about specific cases of Goodhart's Law, contamination, or the difference between Elo-based and accuracy-based rankings. Complete at least 3 substantive exchanges.

Try asking: "How would I tell whether a model's leaderboard rank is driven by contamination vs. genuine capability?" — or explore any leaderboard distortion scenario from the lesson.

Leaderboard Analysis Lab

Welcome to Lab 2. Let's dig into leaderboard mechanics. I can help you work through contamination detection, Goodhart's Law examples in AI benchmarking, or the difference between human preference rankings and accuracy-based automated evaluations. What would you like to explore?

Lesson 3 · Module 2

Comparing Scores Across Models

The conditions that make cross-model comparison valid — and the many ways it goes wrong

What needs to be true before you can say one model score is "better than" another?

When OpenAI released GPT-4 in March 2023 and Anthropic released Claude 2 in July 2023, both companies published benchmark tables. Both reported MMLU scores. GPT-4 used 5-shot. Claude 2's technical report used a different few-shot format with chain-of-thought prompting. The numbers sat in adjacent columns of comparison tables across the internet, but they were not produced by equivalent evaluation pipelines. Comparing them directly meant comparing the output of two different measurement procedures rather than two models under one procedure.

The Comparability Requirements

For two benchmark scores to support a valid comparison, they must share: the same benchmark version (test sets are sometimes revised), the same shot count, the same prompting template, the same decoding configuration, and ideally the same evaluator running both. When any of these differ, the delta you observe is a mixture of real capability difference and methodological noise — and you often cannot separate them.
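One way to make these requirements checkable is to record each score's evaluation setup as structured data and diff the two records. The class, field names, and example values below are hypothetical, chosen to mirror the list above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical record of how one benchmark score was produced."""
    benchmark_version: str
    shot_count: int
    prompt_template: str
    decoding: str
    evaluator: str


def mismatched_fields(a: EvalConfig, b: EvalConfig) -> List[str]:
    """Names of the settings on which the two evaluations differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Invented configurations for two published scores.
score_a = EvalConfig("v1", 5, "standard", "greedy", "lab_a_self_report")
score_b = EvalConfig("v1", 5, "chain_of_thought", "greedy", "lab_b_self_report")

differences = mismatched_fields(score_a, score_b)
print(differences)
```

An empty list is the precondition for reading the score gap as a capability gap. Any non-empty list names the methodological differences that are mixed into the delta.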

The Open LLM Leaderboard exists precisely to solve this: it runs all models through identical code, at identical settings, on identical test data. When scores from that system are compared, the comparability requirements are met. When scores from two different labs' own technical reports are compared, they almost never are.

Real Case — GPT-4 Technical Report, 2023

OpenAI's GPT-4 technical report explicitly noted that their MMLU evaluation used a "chain-of-thought" prompting variant in some comparisons and standard 5-shot in others. Footnotes clarified which was which — but most news coverage and blog posts dropped the footnotes, presenting all numbers as directly comparable. This is one of the most common real-world failure modes in benchmark reporting.

Model Size vs. Score — The Scaling Confound

A 70B parameter model and a 7B parameter model are not competing on equal terms. Comparing their raw scores without controlling for scale is like comparing a car's fuel efficiency without noting one is a compact and the other is a truck. Yet leaderboards routinely rank them together.

The appropriate comparison units are: same scale class vs. same scale class, or score per compute budget. Mistral 7B's 2023 paper was notable for explicitly framing its results as "performance per parameter" comparisons — a much more informative frame than raw score ranking against much larger models.

Confidence Intervals and Effect Size

MMLU contains 14,042 questions. At 85% accuracy, a model answers roughly 11,936 correctly. The 95% confidence interval for this proportion is approximately ±0.6 percentage points. Two models scoring 85.0% and 85.8% are statistically indistinguishable by this measure. Yet they appear as distinct ranked positions with a "winner."

GSM8K (grade-school math) contains only 1,319 test examples. The confidence interval is correspondingly wider: roughly ±2.2pp at 80% accuracy. Differences under 3 percentage points on GSM8K are often within noise. On HumanEval (164 problems), the intervals are so wide that differences under 5–10 percentage points may not be meaningful.

Benchmark	N (test questions)	Approx. 95% CI at 80% acc.	Min. meaningful diff.
MMLU	14,042	±0.66pp	~2pp
GSM8K	1,319	±2.15pp	~5pp
HumanEval	164	±6.1pp	~10pp
HellaSwag	10,042	±0.78pp	~2–3pp
ARC-Challenge	1,172	±2.3pp	~5pp
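The interval column can be closely reproduced with the normal approximation for a binomial proportion, half-width = z * sqrt(p(1-p)/n) with z = 1.96. The lesson does not state which method produced the table, so the choice of approximation here is an assumption. The test-set sizes are the ones listed in the table.

```python
from math import sqrt


def ci_half_width_pp(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation CI, in percentage points."""
    return 100 * z * sqrt(accuracy * (1 - accuracy) / n)


# Test-set sizes as listed in the table above.
TEST_SET_SIZES = {
    "MMLU": 14_042,
    "GSM8K": 1_319,
    "HumanEval": 164,
    "HellaSwag": 10_042,
    "ARC-Challenge": 1_172,
}

for name, size in TEST_SET_SIZES.items():
    print(f"{name:14s} n={size:6d}  ±{ci_half_width_pp(0.80, size):.2f}pp")
```

The half-width shrinks with the square root of n, so a test set 100 times larger gives an interval only 10 times narrower. The approximation also ignores that benchmark questions are not independent draws, so real uncertainty is likely somewhat larger.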

Practical Cross-Model Comparison Protocol

When you need to compare two models and cannot run them yourself on a standardized leaderboard, the most defensible approach is: 1) Check whether both scores come from the same evaluator. 2) Verify the shot count matches. 3) Check the confidence intervals — is the gap larger than noise? 4) Look for consistency across at least three benchmarks. 5) Find at least one third-party reproduction of each claim. If you cannot satisfy all five, treat the comparison as indicative rather than definitive.
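The five checks can be written down as a small checklist function. The class, field names, and verdict strings are hypothetical. The inputs are judgments the reader still has to make from the source material, so this is a sketch of the bookkeeping rather than an automated evaluator.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ComparisonEvidence:
    """Hypothetical record of what is known about a two-model comparison."""
    same_evaluator: bool
    shot_counts_match: bool
    gap_exceeds_noise: bool
    benchmarks_in_agreement: int
    third_party_repro_for_each: bool


def assess(evidence: ComparisonEvidence) -> Tuple[str, List[str]]:
    """Return ('definitive' or 'indicative', list of failed checks)."""
    checks = {
        "same evaluator": evidence.same_evaluator,
        "matching shot count": evidence.shot_counts_match,
        "gap larger than noise": evidence.gap_exceeds_noise,
        "consistent across 3+ benchmarks": evidence.benchmarks_in_agreement >= 3,
        "third-party reproduction of each claim": evidence.third_party_repro_for_each,
    }
    failed = [name for name, passed in checks.items() if not passed]
    return ("definitive" if not failed else "indicative", failed)


# Invented case: a self-reported 0-shot score vs. a leaderboard 5-shot score.
verdict, failed_checks = assess(ComparisonEvidence(False, False, True, 1, False))
print(verdict, failed_checks)
```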

The One-Number Trap

No single benchmark score is sufficient to characterize a model's capability. The most rigorous comparisons use a battery of benchmarks across different capability dimensions — reasoning, knowledge, code, instruction-following — and look for consistent patterns rather than isolated wins.

Lesson 3 Quiz

Comparing Scores Across Models · 4 questions

1. When comparing GPT-4 and Claude 2 MMLU scores from their respective technical reports, the main validity problem is:

Correct. GPT-4 used 5-shot standard evaluation; Claude 2 used a chain-of-thought prompting variant. Same benchmark name, incompatible pipelines.

The core problem is that both teams used different prompting setups (standard 5-shot vs. chain-of-thought variants), making the scores methodologically incomparable even though both called it "MMLU."

2. Mistral 7B's 2023 paper was notable for framing its benchmark results in terms of:

Correct. Framing results as "performance per parameter" is a more informative comparison when models differ substantially in scale — it separates efficiency from brute-force compute.

Mistral's paper emphasized performance per parameter — a framing that makes much more sense when comparing a 7B model against 70B+ models, since raw scores conflate capability with scale.

3. HumanEval contains only 164 test problems. What does this imply about interpreting score differences between two models on HumanEval?

Exactly right. With only 164 problems, the 95% confidence interval at typical performance levels is roughly ±6 percentage points — making small differences statistically unreliable.

Small test sets produce wide confidence intervals. With 164 problems, differences under approximately 10 percentage points often fall within statistical noise on HumanEval.

4. Which of the following is the most defensible approach when comparing two models using scores from their respective technical reports?

Correct. This five-step protocol addresses the key validity threats: methodological equivalence, statistical significance, and reproducibility.

The defensible approach requires checking: same evaluator, matching shot counts, confidence interval analysis, cross-benchmark consistency, and third-party reproduction. Shortcuts risk accepting methodological artifacts as real capability differences.

Lab 3 — Cross-Model Comparison

Practice applying the comparability protocol to realistic scenarios

Your Task

Work through cross-model comparison scenarios with the AI. Practice applying the five-step comparability protocol and reasoning about confidence intervals. Complete at least 3 exchanges.

Starting scenario: Your team is choosing between Model X (89% MMLU, self-reported, 0-shot) and Model Y (87% MMLU, Open LLM Leaderboard, 5-shot). Which is better? Walk through the comparability protocol with the AI.

Cross-Model Comparison Lab

Welcome to Lab 3. Let's practice cross-model comparison. The scenario above is the kind of comparison you'll encounter constantly in practice — two numbers that look comparable but aren't. Walk me through your thinking on the Model X vs. Model Y scenario, or bring your own example to analyze.

Lesson 4 · Module 2

Translating Scores to Real-World Expectations

The gap between benchmark performance and production performance — and how to bridge it

A model aces your benchmark of choice. Why might it still disappoint in deployment — and what can you do about it?

In 2023, Bloomberg reported on enterprise teams that had selected large language models based on MMLU leaderboard position for customer-facing legal summarization tasks. The models ranked highly. In production, they hallucinated citations, missed jurisdiction-specific nuances, and failed on document lengths that exceeded their context windows — none of which MMLU tests for. The benchmark scores were accurate. The translation assumption was flawed.

The Distribution Shift Problem

Benchmark scores measure performance on a fixed, curated dataset. Production tasks involve a distribution of real user inputs that almost certainly differs from that dataset in vocabulary, length, format, ambiguity, and domain specificity. This is the distribution shift problem: a model optimized or selected for one distribution will degrade on another, and the degree of degradation is not predictable from the benchmark score alone.

MMLU tests multiple-choice academic questions drawn from US standardized test materials. Most production NLP tasks are not multiple-choice and are not drawn from US standardized tests. The overlap in skill requirements is real but partial. High MMLU performance is a necessary but not sufficient condition for high performance on most production reasoning tasks.

Real Case — Medical AI, 2023

Google's Med-PaLM 2 was reported to achieve expert-level performance on USMLE (United States Medical Licensing Exam) questions in 2023. However, researchers and clinicians reviewing the system noted that USMLE multiple-choice questions are a specific, structured format quite unlike the open-ended, context-rich queries that arise in clinical settings. High USMLE scores indicated strong medical knowledge encoding — but did not directly predict clinical utility, which required additional evaluation on realistic clinical vignettes and open-ended question formats.

What Benchmarks Cannot Measure

Current standardized benchmarks are particularly poor at measuring: instruction-following fidelity on novel tasks (following complex multi-step instructions the model has never seen); calibration (whether the model knows when it doesn't know something); long-context coherence (maintaining consistency across 50k+ token documents); tool use reliability (correctly calling APIs and handling errors); and adversarial robustness (maintaining accuracy under deliberate prompt manipulation).

Several of these gaps are now partly addressed by newer benchmark families. RULER and LongBench test long-context performance. TruthfulQA probes truthfulness, specifically whether a model repeats common human misconceptions. MT-Bench tests multi-turn instruction following. But none of these fully replicate production conditions for specific domains.

Distribution shift: The difference between the data distribution a model was evaluated on and the distribution it encounters in production. A primary driver of benchmark-to-production performance gaps.

Calibration: The alignment between a model's expressed confidence and its actual accuracy. Poorly calibrated models can score well on accuracy benchmarks while being dangerously overconfident in production.

Task-specific evaluation: Evaluation on examples drawn from or closely resembling the actual production task. The most reliable predictor of production performance.
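Calibration can be quantified. One common measure, expected calibration error (ECE), is not named in this lesson and is brought in here as an illustration. It bins predictions by stated confidence and averages the gap between confidence and accuracy in each bin. The bin count and the example data are assumptions.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    if len(confidences) == 0 or len(confidences) != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - acc)
    return ece


# Invented data: a model that says 90% but is right half the time,
# and one that says 75% and is right three times out of four.
overconfident = expected_calibration_error([0.9] * 4, [True, False, True, False])
well_calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
print(overconfident, well_calibrated)
```

Accuracy alone cannot separate two models like these on the same questions. A calibration measure shows which one's stated confidence can be trusted.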

Building a Translation Bridge

The most reliable path from benchmark score to production expectation is a layered evaluation strategy. Start with standardized leaderboards for initial filtering. Then run the shortlisted models on a task-specific held-out evaluation set — ideally 100+ examples drawn from or similar to your actual production inputs. Then run a controlled production pilot with real users on a subset of traffic. Only at the third stage do you have reliable evidence for production performance.

The benchmark score tells you which models are worth the cost of step 2. It does not tell you which model will win step 2 or step 3. Teams that skip steps 2 and 3 and deploy based on leaderboard rank alone consistently report disappointing production outcomes.
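Before building the step 2 evaluation set, it helps to know how precise a result that set can give. The sketch below uses the normal approximation for a binomial proportion, which is an assumption and is rough for small samples or accuracies near 0% or 100%. The 80% accuracy and the 3-point target are illustrative.

```python
from math import ceil, sqrt


def half_width_pp(accuracy: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% CI half-width, in percentage points, for n examples."""
    return 100 * z * sqrt(accuracy * (1 - accuracy) / n)


def examples_needed(accuracy: float, target_half_width_pp: float, z: float = 1.96) -> int:
    """Smallest n whose approximate CI half-width meets the target."""
    w = target_half_width_pp / 100
    return ceil(z * z * accuracy * (1 - accuracy) / (w * w))


# Illustrative planning numbers, not measurements.
print(round(half_width_pp(0.80, 100), 2))  # precision of a 100-example set
print(examples_needed(0.80, 3.0))          # examples needed for about ±3pp
```

At around 80% accuracy, 100 examples give an interval of nearly ±8 points. That is enough to separate a strong model from a weak one, but not two models a few points apart, which takes several hundred examples.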

Stage	Method	What it gives you
Stage 1: Filter	Leaderboard	Narrows field; low cost; not predictive of domain performance
Stage 2: Evaluate	Task-Specific	100+ domain examples; most predictive of production performance
Stage 3: Validate	Pilot	Real users, real traffic, real feedback; gold standard

The Practitioner's Translation Rule

Treat benchmark scores as evidence that a model is capable of learning the skills your task requires — not as evidence that it has already learned them for your specific task. The gap between "can" and "does" is closed by task-specific evaluation, not by leaderboard position.

Score Interpretation Summary

Across this module: a benchmark score is a compressed, methodology-dependent, potentially contaminated, statistically noisy estimate of performance on a specific task set that may or may not resemble your production task. Read it that way — as a starting point for investigation, not a final verdict — and you will make substantially better model selection decisions.

Lesson 4 Quiz

Translating Scores to Real-World Expectations · 4 questions

1. Google's Med-PaLM 2 demonstrated expert-level USMLE performance in 2023. Clinicians noted this result mainly showed:

Correct. USMLE is multiple-choice and structured — very different from the open-ended, context-rich queries in clinical practice. High USMLE scores indicate knowledge but not clinical utility directly.

USMLE performance indicates medical knowledge encoding — but clinical practice involves open-ended, context-rich queries that the structured multiple-choice format does not capture. Additional task-specific evaluation was needed.

2. Which of the following capabilities are current standardized benchmarks POOREST at measuring?

Correct. These are exactly the capability gaps that make high benchmark scores poor predictors of real-world performance in complex deployment scenarios.

Standard benchmarks are reasonably good at knowledge recall, code generation (HumanEval), and arithmetic (GSM8K). They are notably weak on long-context coherence, calibration, and adversarial robustness.

3. In the three-stage evaluation strategy described in this lesson, what is the role of the leaderboard score?

Correct. Leaderboard scores filter the field at low cost. Task-specific evaluation (Stage 2) and production pilots (Stage 3) are required to make reliable deployment decisions.

In the recommended three-stage approach, leaderboard scores serve as a filter — they narrow the candidate field at low cost. They do not replace task-specific evaluation or production validation.

4. The "distribution shift" problem in benchmarking refers to:

Exactly right. Distribution shift is the core reason why benchmark performance often fails to predict production performance — the evaluation data and the production data come from different distributions.

Distribution shift refers to the mismatch between evaluation data distribution and production data distribution. It is the primary structural reason why benchmark scores often don't predict production performance.

Lab 4 — Production Translation

Design an evaluation strategy that bridges benchmarks to your real task

Your Task

Use the AI to design a concrete three-stage evaluation strategy for a realistic deployment scenario. Practice identifying distribution shift risks, capability gaps, and what task-specific evaluation data you'd need. Complete at least 3 exchanges.

Choose a scenario to explore: (A) Selecting a model for legal document summarization. (B) Selecting a model for customer support chatbot. (C) Selecting a model for medical triage assistance. Walk through the translation challenges with the AI.

Production Translation Lab

Welcome to Lab 4 — the capstone lab for this module. We're going to practice translating benchmark knowledge into deployment decisions. Pick a scenario from the prompt above (or bring your own), and let's work through the three-stage evaluation strategy together. Which use case would you like to start with?

Module 2 Test

Reading Benchmark Scores · 15 questions · Pass at 80%

1. Which five choices most directly shape a benchmark score's reported value?

Correct. These five implementation choices can shift reported scores by several percentage points on the same model with no weight changes.

The five choices that shape benchmark scores are: task set, prompting format, decoding strategy, answer extraction rule, and aggregation method.

2. What specific technique did EleutherAI use to demonstrate score sensitivity without changing model weights?

Correct. Changing only the regex for answer extraction shifted LLaMA-1 65B MMLU scores by up to 3.2 percentage points.

EleutherAI changed the answer extraction regex — how the model's text output is parsed to determine if it chose the correct option — producing up to 3.2pp of score difference.

3. The Hugging Face Open LLM Leaderboard was designed primarily to:

Correct. Standardization is the leaderboard's core value proposition — same code, same settings, same data for every model.

The Open LLM Leaderboard standardizes all five evaluation choices so that score differences reflect genuine capability differences rather than methodological noise.

4. Chatbot Arena Elo scores measure:

Correct. Elo reflects human preference — which is influenced by verbosity, confidence, formatting, and style alongside actual accuracy.

Chatbot Arena Elo measures human preference in head-to-head matchups. Preference is influenced by many factors beyond accuracy, including response length, confidence, and presentation style.

5. Goodhart's Law as applied to AI benchmarks predicts that when a benchmark becomes an explicit training target:

Correct. Models can be optimized to score highly on a benchmark without the underlying generalized capability improving — invalidating the benchmark as a signal.

Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Benchmark optimization inflates scores without necessarily improving the underlying capability.

6. The Allen Institute for AI contamination analysis (2024) found that MMLU scores were inflated by approximately how much for affected models?

Correct. 2–7pp is a substantial margin — enough to change leaderboard ordering between closely competing models.

The Allen Institute found 2–7 percentage points of inflation from contamination, varying by model and MMLU subject category.

7. The approximate 95% confidence interval for MMLU performance at 80% accuracy (14,042 questions) is:

Correct. The large MMLU test set produces tight confidence intervals — but even here, differences under ~2pp may not be meaningful.

With 14,042 questions, the 95% CI at 80% accuracy is approximately ±0.66pp. That is narrower than HumanEval or GSM8K, but still wide enough to matter when interpreting sub-point score differences.
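
The ±0.66pp figure can be reproduced with the normal-approximation (Wald) interval for a proportion. This is a sketch under two assumptions: every question is treated as an independent trial, and all three benchmarks are evaluated at the same 80% accuracy for comparison. The GSM8K test-set size of 1,319 is supplied here and does not appear elsewhere in this module.

```python
import math

def ci_half_width(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% interval for a proportion."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

# Test-set sizes: MMLU and HumanEval as stated in this module; GSM8K added here.
benchmarks = {"MMLU": 14_042, "GSM8K": 1_319, "HumanEval": 164}

for name, n in benchmarks.items():
    half_width_pp = ci_half_width(0.80, n) * 100
    print(f"{name}: +/-{half_width_pp:.2f}pp at 80% accuracy")
```

The interval shrinks with the square root of the number of questions, which is why MMLU's interval is about nine times tighter than HumanEval's despite having roughly 85 times as many questions.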

8. Why is comparing self-reported GPT-4 and Claude 2 MMLU scores from their technical reports problematic?

Correct. Same benchmark name, different pipelines. Comparing the raw numbers treats methodological noise as capability signal.

Both models used MMLU but with different prompting setups. GPT-4 used standard 5-shot; Claude 2 used chain-of-thought prompting variants. The numbers are not directly comparable.

9. Why did Mistral's 2023 technical report use "performance per parameter" framing?

Correct. Comparing raw scores across vastly different model sizes confounds capability with scale. Performance-per-parameter isolates efficiency as the comparison axis.

Performance-per-parameter framing makes sense when your model is much smaller: it separates genuine efficiency from the brute-force advantage of using far more parameters.

10. HumanEval's 164-question test set implies that score differences under approximately what threshold are likely within statistical noise?

Correct. With only 164 questions, the 95% CI is approximately ±6pp at typical performance levels — making differences under ~10pp unreliable.

The 95% confidence interval on HumanEval is approximately ±6pp at typical performance. Differences under about 10pp are likely not statistically meaningful.
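
The "about 10pp" threshold follows from comparing two scores rather than reading one. The sketch below computes the 95% margin on the difference between two accuracies, assuming the two models' results are independent. Because both models actually answer the same questions, a paired test would typically give a somewhat tighter margin, so treat this as a conservative rule of thumb. The 80% and 75% accuracies are illustrative.

```python
import math

def diff_margin(acc_a: float, acc_b: float, n: int, z: float = 1.96) -> float:
    """95% margin on (acc_a - acc_b) for two models scored on n questions each."""
    se = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
    return z * se

def is_distinguishable(acc_a: float, acc_b: float, n: int) -> bool:
    """True when the observed gap exceeds the 95% margin on the difference."""
    return abs(acc_a - acc_b) > diff_margin(acc_a, acc_b, n)

# Illustrative scores: the same 5pp gap on a small and a large test set.
for name, n in (("HumanEval", 164), ("MMLU", 14_042)):
    margin_pp = diff_margin(0.80, 0.75, n) * 100
    print(f"{name}: margin {margin_pp:.1f}pp, distinguishable: {is_distinguishable(0.80, 0.75, n)}")
```

A 5pp gap falls inside the margin on 164 questions and well outside it on 14,042, so the same headline difference supports opposite conclusions depending on test-set size.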

11. The "distribution shift" problem causes benchmark scores to poorly predict production performance because:

Correct. Distribution shift is the core structural reason benchmarks often fail to predict production performance.

Distribution shift means the data in the evaluation set comes from a different statistical distribution than real production inputs — causing degraded performance that the benchmark score couldn't have predicted.

12. Google Med-PaLM 2's expert-level USMLE performance demonstrated what limitation of benchmark-to-production translation?

Correct. USMLE format is structurally very different from real clinical interaction, illustrating the distribution shift problem in a high-stakes domain.

Med-PaLM 2's USMLE results illustrated that structured multiple-choice performance — even at expert level — doesn't directly predict performance on the open-ended, context-rich queries of real clinical practice.

13. In the recommended three-stage evaluation strategy, what is the primary function of Stage 2 (task-specific evaluation)?

Correct. Task-specific evaluation on domain-relevant examples is the bridge between coarse leaderboard filtering and production validation.

Stage 2's role is to evaluate candidates on task-specific examples — the most predictive pre-deployment signal for how a model will actually perform in your use case.

14. Which of the following capabilities are BEST covered by existing standardized benchmarks as of 2024?

Correct. MMLU, GSM8K, and HumanEval are well-established benchmarks for knowledge, math, and code respectively. The other options represent gaps that newer benchmark families are only beginning to address.

Existing benchmarks cover knowledge recall (MMLU), grade-school math word problems (GSM8K), and short code generation (HumanEval) reasonably well. Long-context, adversarial, calibration, and tool-use evaluation remain less mature.

15. The practitioner's core principle for translating benchmark scores to production expectations is best summarized as:

Correct. This is the synthesis of the entire module: benchmarks filter candidates, task-specific evaluation selects, and production pilots validate.

Benchmark scores are useful as filters — they indicate capability potential. Task-specific evaluation and production pilots are required to confirm that capability applies to your specific task and data distribution.
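
The filter-then-select logic can be sketched in a few lines. Everything below is hypothetical: the model names, the scores, the 0.75 benchmark floor, and the choice of two finalists are placeholders, and Stage 3 (the production pilot) is deliberately left outside the code because it depends on live traffic.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    benchmark_score: float  # Stage 1 input: public leaderboard number
    task_score: float       # Stage 2 input: accuracy on your own labeled examples

def stage1_filter(candidates, min_benchmark):
    """Coarse filter: drop models below a benchmark floor."""
    return [c for c in candidates if c.benchmark_score >= min_benchmark]

def stage2_select(candidates, top_k):
    """Rank the survivors by task-specific accuracy and keep the best few."""
    return sorted(candidates, key=lambda c: c.task_score, reverse=True)[:top_k]

# Hypothetical candidates. All numbers are invented for illustration.
candidates = [
    Candidate("model-a", 0.86, 0.71),
    Candidate("model-b", 0.82, 0.79),
    Candidate("model-c", 0.64, 0.58),
    Candidate("model-d", 0.80, 0.74),
]

shortlist = stage1_filter(candidates, min_benchmark=0.75)
finalists = stage2_select(shortlist, top_k=2)
print([c.name for c in finalists])  # these go on to the Stage 3 pilot
```

In this invented data the benchmark leader (model-a) passes the filter but does not make the final two, which is the pattern the principle warns about: the benchmark decides who gets evaluated, not who gets deployed.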