Model Evaluation and Benchmarks · Introduction

We Have Always Needed Ways to Measure What Machines Can Do

Every powerful technology eventually forces the question: how good is good enough, and who gets to decide?

In 1844, Samuel Morse sent his first public telegraph message from Washington to Baltimore, and within months investors and governments were demanding to know: how fast is it, how far can it reach, and how often does it fail? The first formal telegraph performance standards appeared by 1850, not because engineers were pedantic, but because capital would not flow — and regulators would not approve — without agreed-upon measures. Standardized testing of copper wire conductivity, operator error rates, and message throughput became the invisible scaffolding that turned an interesting demonstration into a continental nervous system.

The same dynamic is unfolding today with large language models. OpenAI released GPT-3 in June 2020 as a 175-billion-parameter model that dazzled journalists and confused researchers, because nobody had a shared vocabulary for comparing it to anything. Within roughly two and a half years, evaluation suites like BIG-bench and HELM (both released in 2022) had appeared, academic labs were publishing benchmark papers faster than models could be trained, and the phrase "state of the art" had become both indispensable and nearly meaningless, depending entirely on which benchmark you happened to cite.

This course is about that measurement problem: where benchmarks come from, what they actually capture, where they fail, and how practitioners make decisions when the numbers are incomplete or misleading. We will examine real benchmarks — MMLU, HumanEval, TruthfulQA, HellaSwag, and others — alongside the documented cases where high scores concealed serious real-world limitations. By the end, you will be able to read a model evaluation report critically, identify the questions it does not answer, and make more defensible choices about which model fits which task.

Model Evaluation and Benchmarks · Lesson 1

The Measurement Problem: Why Benchmarks Exist at All

Before anyone could agree on which AI model was better, they had to agree on what "better" meant.

What does it actually mean to say one model outperforms another — and who set the rules?

In the summer of 2022, Google engineer Blake Lemoine published transcripts of conversations with LaMDA, the company's dialogue model, claiming it had become sentient. The claim generated enormous press coverage. Meanwhile, researchers at Stanford and elsewhere were quietly noting that LaMDA, along with GPT-3 and PaLM, all scored near chance on the BIG-bench "Causal Reasoning" tasks — tasks that any sentient creature would handle trivially. Two radically different pictures of the same technology. The gap between them was not a matter of opinion; it was a measurement gap. Nobody had agreed on what to measure, or why, before the systems were deployed.

That gap is not new. It was present in 1950 when Alan Turing proposed what became known as the Turing Test, not as a rigorous benchmark but as a philosophical thought experiment. It widened through the expert-system era of the 1970s and 1980s, when systems like MYCIN could rival specialists at recommending therapy for a narrow class of bacterial infections while being completely useless for anything else. It persists today in every leaderboard screenshot posted on social media. Benchmarks exist, at their core, because capability claims without measurement are just marketing.

1.1 What a Benchmark Actually Is

A benchmark, in the machine learning sense, is a standardized dataset of inputs paired with expected outputs, together with a scoring procedure that maps model responses onto a numeric performance measure. That definition sounds dry, but each component matters.

The dataset encodes assumptions about what the task domain looks like. MMLU (Massive Multitask Language Understanding), released by Dan Hendrycks and colleagues in 2020, consists of multiple-choice questions spanning 57 academic subject areas, drawn from freely available standardized tests: SAT, GRE, professional licensing exams. Those sources reflect what the creators could gather legally at scale. The scoring procedure for MMLU is multiple-choice accuracy, which is clean and reproducible but excludes open-ended reasoning, calibration, and the ability to say "I don't know."

The expected output — the ground truth — is often the most contested part. For factual question-answering benchmarks, ground truth might be a Wikipedia sentence. For code generation benchmarks like HumanEval (OpenAI, 2021), ground truth is whether the generated code passes a suite of unit tests. Each choice of ground-truth source bakes in assumptions about what correctness means.
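The three components described above (inputs, expected outputs, and a scoring procedure) can be made concrete in a few lines. The following is a minimal sketch, not the harness of any real benchmark: the `Item` class, the `accuracy` function, the toy questions, and the `always_first` "model" are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Item:
    """One benchmark example: an input paired with its expected output."""
    prompt: str
    choices: Sequence[str]   # answer options shown to the model
    answer: int              # index of the option the answer key marks correct


def accuracy(items: Sequence[Item], model: Callable[[Item], int]) -> float:
    """Scoring procedure: fraction of items where the model picks the keyed option."""
    if not items:
        raise ValueError("benchmark has no items")
    correct = sum(1 for item in items if model(item) == item.answer)
    return correct / len(items)


# Toy data, invented for illustration only.
TOY_BENCHMARK = [
    Item("2 + 2 = ?", ["3", "4", "5", "6"], 1),
    Item("Capital of France?", ["Paris", "Rome", "Oslo", "Bern"], 0),
    Item("H2O is commonly called?", ["salt", "sand", "water", "air"], 2),
    Item("Largest planet?", ["Mars", "Venus", "Earth", "Jupiter"], 3),
]


def always_first(item: Item) -> int:
    """A hypothetical 'model' that always picks the first option."""
    return 0


if __name__ == "__main__":
    print(accuracy(TOY_BENCHMARK, always_first))  # 0.25: one keyed answer is option 0
```

Note what the scorer cannot express: a model has to return an option index, so there is no way to earn credit for abstaining, and a wrong entry in the `answer` field silently penalizes a correct response.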

Why This Matters

When a company announces "our model achieves 90% on MMLU," the number is only meaningful if you know what MMLU measures, what it excludes, and whether the model was trained on data that overlaps with the test set. All three of those factors have been contested for every major benchmark released since 2018.

1.2 The Pre-History: Evaluation Before Deep Learning

Formal NLP evaluation predates neural networks by decades. The TREC (Text REtrieval Conference) benchmarks, organized by the National Institute of Standards and Technology beginning in 1992, were designed to evaluate information-retrieval systems — essentially search engines — on standardized document collections. The approach was rigorous: human assessors judged relevance, metrics like mean average precision were carefully defined, and results were submitted blind. TREC established the template that most subsequent AI benchmarks follow.

By 2001, the Penn Treebank had already become the standard evaluation set for parsing accuracy in syntactic analysis. Researchers knew that progress on Penn Treebank did not necessarily generalize to other text domains (a problem they called domain shift), but the benchmark's value as a shared yardstick outweighed its limitations for a decade. This tension between a useful imperfect benchmark and the search for a more perfect one runs through every era of the field.

The shift to deep learning, beginning with AlexNet's ImageNet victory in 2012, intensified the stakes. ImageNet, a project that Fei-Fei Li and colleagues began in 2007 at Princeton and continued at Stanford, contained over 14 million labeled images across 20,000 categories; the annual competition used a 1,000-category subset. When AlexNet cut the best top-5 error rate from roughly 26% to 15.3% in a single year, it was not just a benchmark win: it restructured research funding, company strategy, and hiring pipelines globally. Benchmarks were no longer academic scorekeeping; they were market signals.

1.3 Three Reasons Benchmarks Exist

It is worth being explicit about the distinct functions benchmarks serve, because they are often conflated — and that conflation causes confusion about why a benchmark might be inadequate for one purpose while perfectly suited for another.

Scientific progress tracking Researchers need a shared yardstick to know whether a new architecture or training method is actually better than what preceded it. Without reproducible baselines, the field cannot accumulate knowledge systematically.

Deployment decision-making Practitioners choosing a model for a production system need evidence that the model will behave acceptably on their specific task. A benchmark that covers the relevant domain can reduce (though never eliminate) the risk of a costly deployment failure.

Regulatory and accountability signaling Governments and standards bodies increasingly require documented evidence of model safety, fairness, and capability limits before deployment in high-stakes domains. Benchmarks provide an auditable record, however imperfect.

These three purposes pull in different directions. A benchmark optimized for scientific comparability (narrow, controlled, reproducible) may be useless for deployment decisions (which require coverage of real-world distribution shifts). A benchmark designed to satisfy regulators (covering legally mandated fairness criteria) may tell researchers very little about architectural tradeoffs. Understanding which purpose a benchmark was built for is step one in deciding whether to trust it for your own purpose.

1.4 The Goodhart Problem

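In 1975, British economist Charles Goodhart observed that a statistical regularity tends to break down once it is used as an instrument of control, an idea now usually summarized as "when a measure becomes a target, it ceases to be a good measure." This principle, now called Goodhart's Law, applies to AI benchmarks with uncomfortable precision.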

When a benchmark becomes the primary signal by which models are compared publicly, model developers — whether consciously or through the ordinary mechanics of hyperparameter search and training data curation — optimize specifically for that benchmark. The result is that benchmark scores rise while the underlying capability the benchmark was meant to proxy may not rise proportionally, or may even degrade on closely related out-of-distribution tasks.

A documented example: the HellaSwag benchmark (Zellers et al., 2019) was designed specifically to be adversarially hard for the BERT-era models of the time, which scored around 48% on it. By 2023, GPT-4 and its contemporaries scored above 95%. But several studies found that these same models, when presented with slight surface-form variations of HellaSwag questions — rephrasing without changing the underlying reasoning challenge — showed significant performance drops, suggesting that high scores partly reflected training-set overlap and surface-pattern matching rather than the robust commonsense reasoning HellaSwag was meant to measure.

The Goodhart Problem does not mean benchmarks are useless. It means they must be read with an understanding of how they were constructed, how widely they have been used as optimization targets, and what alternative measures exist to triangulate the same underlying capability.

The Central Tension of This Course

We need benchmarks to make progress measurable and comparable. But every benchmark, once it becomes consequential, begins to distort the thing it measures. This course is about navigating that tension — using benchmarks as the imperfect instruments they are, rather than treating them as ground truth.

1.5 The Modern Benchmark Landscape

As of 2024, the benchmark landscape for large language models has fragmented into hundreds of individual evaluations. Several families are worth knowing by name, because they appear constantly in model release documentation.

MMLU (Hendrycks et al., 2020) remains the single most-cited general-knowledge benchmark. Its 57 subject areas and 14,079 test questions have made it the de facto "academic IQ test" for LLMs, despite well-documented concerns about answer-key errors in the original dataset and extensive overlap with publicly available training corpora.

HumanEval (Chen et al., OpenAI, 2021) evaluates code generation: the model is given a Python function signature and docstring and must complete the function so that it passes a set of unit tests. The pass@k metric — the probability that at least one of k generated samples passes all tests — introduced a probabilistic framing that influenced subsequent benchmark design.
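To make the pass@k definition concrete, here is a small sketch of the combinatorial estimator usually associated with the metric: generate n samples per problem, count the c that pass all tests, and compute the chance that a random subset of k samples contains at least one passing sample. The function name `pass_at_k` and its argument names are choices made for this example, and the text above does not specify an estimator, so treat the exact formula as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed all tests.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n samples contains no passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples, 3 of which pass: pass@1 is about 0.3, pass@5 is much higher.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

A per-problem value like this would then be averaged over all problems in the benchmark. The same model can look very different at k = 1 and k = 100, which is one reason a reported pass@k is hard to interpret without knowing k.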

TruthfulQA (Lin et al., 2021) tests whether models give factually accurate answers to questions specifically selected because humans commonly believe false things about them. High scores on other benchmarks do not predict TruthfulQA performance. In the original paper, the largest models tested were generally the least truthful, and models that "know more" sometimes hallucinate more confidently.

BIG-bench (Srivastava et al., 2022) is a collaborative benchmark comprising over 200 tasks contributed by researchers worldwide, ranging from formal logic to social reasoning to creative writing. Its scale makes it comprehensive but also unwieldy; no model is typically evaluated on all tasks.

The existence of this proliferation is itself informative: no single benchmark has achieved consensus as a sufficient summary of LLM capability, and the field has responded by adding more benchmarks rather than converging on fewer, better ones. That situation is unlikely to resolve soon.

Lesson 1 Quiz

Five questions · Select the best answer for each

1. Which of the following best describes what a benchmark, in the machine learning sense, consists of?

Correct. A benchmark requires all three components: a dataset of inputs, ground-truth expected outputs, and a reproducible scoring procedure that maps responses to a numeric measure.

Not quite. While some of those descriptions touch on related concepts, the defining components of a benchmark are a standardized dataset, ground-truth outputs, and a scoring procedure.

2. The TREC (Text REtrieval Conference) benchmarks, organized beginning in 1992, established which important practice for subsequent AI evaluation?

Correct. TREC established the template of blind result submission, human relevance assessors, and carefully defined retrieval metrics like mean average precision — a template most subsequent AI benchmarks follow.

That does not describe TREC's contribution. TREC pioneered blind submission, human relevance judgments, and rigorously defined scoring metrics — the foundational template for later AI evaluation.

3. Goodhart's Law, as applied to AI benchmarks, predicts which outcome when a benchmark becomes a primary public comparison signal?

Correct. Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Optimization pressure on a benchmark inflates scores without necessarily reflecting genuine capability gains on the broader skill.

Goodhart's Law predicts the opposite dynamic. When benchmarks become targets, scores inflate — models learn to perform well on the specific measure while the broader underlying capability may lag significantly behind the numbers.

4. TruthfulQA was designed specifically to test which property of language models?

Correct. TruthfulQA (Lin et al., 2021) specifically targeted questions where common human misconceptions could lead a model trained to imitate human text to reproduce false beliefs confidently.

TruthfulQA specifically targets factual accuracy on questions where humans frequently believe false things — making it a test of whether models parrot misconceptions or correct them.

5. Which of the three distinct purposes of benchmarks — scientific progress tracking, deployment decision-making, and regulatory signaling — is MOST directly served by HumanEval's pass@k metric for code generation?

Correct. HumanEval's primary value is enabling reproducible comparison of code generation capability across model versions — the scientific progress tracking function. It is less suited to deployment decisions (which require coverage of the actual production code distribution) or regulatory purposes.

HumanEval's pass@k metric, with its clean unit-test-based scoring, is most directly useful for scientific progress tracking — giving researchers a reproducible yardstick to compare models. Its applicability to deployment or regulation is more limited.

Lab 1: Interrogating a Benchmark

Conversation-based lab · Minimum 3 exchanges to complete

What You Will Do

In this lab you will interrogate the design of the MMLU benchmark — asking your AI tutor about what it measures, what it excludes, and whether a model's MMLU score is sufficient evidence of capability for a specific task. The goal is to practice the critical reading of benchmark claims.

Start by asking: "What are the main limitations of using MMLU score as evidence that a model is suitable for professional medical question answering?"


Model Evaluation and Benchmarks · Lesson 2

How Benchmarks Are Built: Choices That Shape What Gets Measured

Every benchmark encodes decisions about what counts as intelligence, correctness, and fairness.

What happens when the people who build the ruler decide what counts as tall?

In 2021, researchers at the University of Washington and AI2 published a paper titled "Are NLP Datasets Consistent?" They systematically re-annotated subsets of eight widely used NLP benchmarks and found that between 6% and 10% of labels in most datasets were incorrect: wrong answers presented as ground truth. For MMLU specifically, subsequent audits found errors ranging from ambiguous questions to outright incorrect answer keys in several subject areas, including virology and formal logic. A model "failing" a question on MMLU might in fact have given the correct answer, only to be marked wrong by a faulty answer key.

The construction of a benchmark is not a neutral technical act. It involves choices about whose knowledge counts, which languages and cultures are represented, what difficulty level is appropriate, whether the task should be multiple-choice or open-ended, and how ground truth is established. Each choice introduces systematic biases that propagate into every comparison made using that benchmark for years afterward.

2.1 Data Collection and Its Biases

Benchmark datasets are assembled from source materials, and those sources are never a neutral sample of human knowledge. MMLU drew from publicly available standardized tests, which skew heavily toward American English, Western academic curricula, and test-taking formats that are culturally specific. A model trained primarily on English-language text may perform well on MMLU not because it has broad knowledge, but because it has been exposed to the same cultural register that produced the benchmark.

The GLUE benchmark (Wang et al., 2018) — designed to evaluate general language understanding — was criticized within two years of release because models had saturated it (exceeded human performance) without demonstrating the kind of flexible language understanding the benchmark was intended to measure. Its successor, SuperGLUE (Wang et al., 2019), was deliberately made harder, but the same saturation dynamic recurred. This arms-race pattern — benchmark saturates, researchers design harder benchmark, repeat — is one of the defining structural problems in the field.

Crowdsourced labeling introduces its own biases. The SNLI (Stanford Natural Language Inference) dataset, a foundational benchmark for textual entailment, was collected via Amazon Mechanical Turk. Workers were paid per annotation and were predominantly North American. Studies subsequently found that SNLI contained systematic annotation artifacts — patterns in how Turk workers chose to rephrase sentences — that allowed models to perform well by learning those artifacts rather than genuine inferential reasoning.

2.2 The Ground-Truth Problem

For many tasks, there is no unambiguous single correct answer. Open-domain question answering, summarization, translation, and dialogue all involve outputs where multiple responses could be correct. Benchmarks handle this in different ways, each with tradeoffs.

Multiple-choice format sidesteps the problem by constraining answers to a small set, but introduces a new one: the distractor options must be chosen by humans, and those choices signal what kinds of mistakes are expected. If the distractors are easy to rule out, the benchmark measures something shallower than the original task. If distractors are ambiguous, test-taker performance may reflect ability to second-guess the item writer rather than genuine knowledge.

Human reference answers create a ceiling: model performance is compared against one or more human-written references. The SQuAD (Stanford Question Answering Dataset) family of benchmarks used F1 overlap with human answers as the primary metric. But F1 overlap can be gamed by models that reproduce key phrases from the passage without understanding the question, and it penalizes semantically equivalent answers phrased differently.
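The weakness of overlap-based scoring is easy to see in code. Below is a simplified sketch of token-level F1 against a single reference. It only lowercases and splits on whitespace; real evaluation scripts typically apply more normalization, so the function `token_f1` and its behavior are an illustrative assumption rather than the official SQuAD scorer.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and one reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("1969", "1969"))                 # identical answer
    print(token_f1("in 1969", "1969"))              # extra word lowers precision
    print(token_f1("nineteen sixty-nine", "1969"))  # same meaning, no shared tokens
```

The third call scores zero even though the answer is semantically correct, while an answer that merely copies the right span from the passage scores perfectly whether or not the model understood the question.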

Code execution, used in HumanEval and its successors, avoids many of these problems by using deterministic unit tests as ground truth. But unit tests only test what the test author thought to test. A function can pass all provided unit tests while being incorrect on inputs the test author did not anticipate — a property well known to software engineers and largely unresolved in code benchmarks.
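A toy illustration of that limitation follows. The `median` function stands in for a hypothetical model-generated solution, and `provided_tests_pass` stands in for a benchmark's unit tests; neither is taken from HumanEval itself.

```python
from typing import Sequence


def median(values: Sequence[float]) -> float:
    """Hypothetical model-generated solution: returns the middle element."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]  # bug: wrong for even-length input


def provided_tests_pass() -> bool:
    """Stand-in for a benchmark's unit tests; they only use odd-length lists."""
    return (
        median([3, 1, 2]) == 2
        and median([5]) == 5
        and median([9, 7, 8, 1, 3]) == 7
    )


if __name__ == "__main__":
    print(provided_tests_pass())   # True: the benchmark would count this as solved
    print(median([1, 2, 3, 4]))    # 3, although the true median is 2.5
```

The scoring is fully deterministic and the solution is still wrong. The benchmark records a pass because no test author thought to include an even-length list.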

Documented Case

When OpenAI released GPT-4 in March 2023, its technical report noted that GPT-4 scored 86.4% on MMLU. What the report did not prominently feature: subsequent analysis by independent researchers found substantial overlap between MMLU test questions and content in Common Crawl, one of GPT-4's likely training sources. Contamination estimates varied, but the methodological concern — that models may be partially tested on their training data — became a central debate in benchmark validity for the remainder of 2023.

2.3 Difficulty Calibration and Task Validity

A benchmark is only useful if it discriminates between models at the current frontier. A benchmark where all models score 95% tells you nothing; a benchmark where all models score 2% also tells you nothing. Calibrating difficulty to the capability frontier is an ongoing challenge because the frontier moves.

The ARC (AI2 Reasoning Challenge) benchmark, released in 2018, divided science questions into an "Easy Set" and a "Challenge Set" based on whether retrieval-based and word-co-occurrence methods could answer them. By 2022, the Challenge Set had been largely saturated by large language models. ARC-Challenge now appears in evaluation suites primarily as a historical comparison point rather than a discriminating measure of frontier capability.

Task validity (whether the benchmark actually measures what it claims to measure) is a related and harder problem. HellaSwag claims to measure commonsense reasoning. But commonsense reasoning is not a unitary capacity; it encompasses physical intuition, social inference, causal understanding, and temporal reasoning, among others. A single benchmark score conflates these. A model might excel at one type of commonsense reasoning while failing at another, producing a misleading aggregate score.
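One practical response is to report a per-category breakdown next to the headline number. The sketch below assumes each evaluated item carries a category label, which is not true of every benchmark; the category names and the results in `TOY_RESULTS` are invented for illustration.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Return (overall accuracy, {category: accuracy}) from (category, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        hits[category] += int(is_correct)
    if not totals:
        raise ValueError("no results to score")
    per_category = {name: hits[name] / totals[name] for name in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_category


# Invented results: strong on physical items, weak on temporal ones.
TOY_RESULTS = (
    [("physical", True)] * 9 + [("physical", False)] * 1
    + [("social", True)] * 8 + [("social", False)] * 2
    + [("temporal", True)] * 4 + [("temporal", False)] * 6
)

if __name__ == "__main__":
    overall, per_category = accuracy_by_category(TOY_RESULTS)
    print(f"overall: {overall:.2f}")
    for name, value in per_category.items():
        print(f"{name}: {value:.2f}")
```

Here the overall score is 70%, which hides a 90% score on one category and a 40% score on another. A reader who sees only the aggregate has no way to tell which kind of reasoning the model actually handles.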

2.4 Who Builds Benchmarks and Why It Matters

Most foundational NLP benchmarks have been built by academic research groups with access to graduate student labor, cloud compute credits, and annotation budgets. This creates systematic tendencies: tasks are often designed around what is easy to annotate at scale rather than what is important to measure; academic prestige systems reward novel benchmarks over rigorous ones; and the researchers who build benchmarks often also evaluate their own models on them.

Commercial labs have begun releasing their own evaluation frameworks (OpenAI's Evals repository, Anthropic's model card evaluations, Google's BIG-bench participation), but these are conducted by the same organizations that train the models being evaluated, creating obvious conflict-of-interest dynamics. The independent evaluation framework HELM (Holistic Evaluation of Language Models), developed at Stanford's CRFM and released in 2022, was explicitly designed to address this by evaluating multiple models from multiple labs on a standardized suite of tasks using a consistent methodology. HELM remains one of the most credible sources of comparative evaluation precisely because it is conducted by a party without a direct commercial stake in the results.

Key Takeaway

Reading a benchmark result requires asking not just "what score did the model achieve" but "who built this benchmark, from what sources, with what ground-truth mechanism, and who ran the evaluation?" Those questions do not invalidate benchmark results — they contextualize them.

Lesson 2 Quiz

Five questions · Select the best answer for each

6. Research auditing the SNLI benchmark found that model performance on it could be inflated by learning what feature of the dataset?

Correct. Studies found that SNLI annotation artifacts — consistent surface-level patterns in how crowdworkers generated hypothesis sentences for each label — allowed models to achieve high accuracy by learning those stylistic cues rather than genuine natural language inference.

The documented problem with SNLI was annotation artifacts: systematic patterns in how Amazon Mechanical Turk workers phrased hypothesis sentences that correlated with label categories, letting models game the benchmark without genuine inference.

7. What specific problem does using code execution (unit tests) as ground truth in benchmarks like HumanEval address compared to human-reference-answer metrics?

Correct. Unit-test-based scoring is deterministic — a function either passes or fails the tests, with no subjective judgment. This addresses the ambiguity of reference-match metrics like F1. However, it still only tests what the test author anticipated, which is a separate limitation.

Code execution ground truth provides deterministic, unambiguous scoring — a key advantage over F1 overlap with human references. But it does not solve contamination (models may have seen similar code in training) or comprehensiveness (only tests what the author anticipated).

8. HELM (Holistic Evaluation of Language Models) was designed to address which specific concern in existing model evaluation practices?

Correct. HELM, developed at Stanford's CRFM, was explicitly designed to evaluate models from multiple commercial labs using a consistent methodology conducted by a party without a direct stake in the results — addressing the obvious conflict of interest when labs evaluate their own models.

HELM's defining feature is independence: it was designed to evaluate models from multiple labs using a consistent methodology, conducted by a party without a direct commercial interest in the results — addressing the conflict-of-interest problem in self-reported evaluations.

9. The "arms race" pattern in NLP benchmarking refers to which recurring dynamic?

Correct. GLUE saturated within ~2 years of release, prompting SuperGLUE, which also saturated. ARC-Challenge is now primarily a historical comparison point. The frontier of model capability keeps outpacing benchmarks designed to measure it — a structural problem in the field.

The arms-race pattern specifically describes benchmarks being saturated — models exceeding human performance — prompting harder replacements, which are subsequently saturated in turn. GLUE and SuperGLUE are the canonical example.

10. Why is multiple-choice format insufficient for measuring open-domain question answering capability, according to the issues discussed in Lesson 2?

Correct. Multiple-choice format introduces the distractor-design problem: if distractors are easy to rule out, the benchmark measures something shallower than intended, and model performance may partly reflect learning item-writer conventions rather than the target knowledge.

The core issue is distractor quality: the choice of wrong answers shapes apparent difficulty, and models (and humans) can sometimes identify correct answers by recognizing patterns in how distractors are constructed rather than by genuinely knowing the answer.

Lab 2: Ground Truth Under a Microscope

Conversation-based lab · Minimum 3 exchanges to complete

What You Will Do

In this lab you will examine the choices embedded in benchmark construction — specifically how ground-truth mechanisms shape what gets measured. You will ask your AI tutor to walk through the implications of different ground-truth approaches for a task you care about.

Start by asking: "If I wanted to benchmark a model's ability to write clear technical documentation, what ground-truth mechanism would you recommend, and what are the tradeoffs?"


Model Evaluation and Benchmarks · Lesson 3

What Benchmark Scores Hide: Distribution Shift, Contamination, and Gaming

A score on a test tells you how the model performed on that test. It does not tell you what the model will do next.

If a model aces the exam but fails on the job, was the exam measuring the wrong thing — or was the model cheating?

In April 2023, a team at Stanford released a paper titled "Are Emergent Abilities of Large Language Models a Mirage?" They argued that many dramatic capability jumps observed on benchmarks, including the apparent sudden emergence of multi-step arithmetic, were artifacts of nonlinear or discontinuous scoring metrics rather than genuine discontinuous capability changes. When they re-analyzed the same model outputs using linear or continuous metrics, the apparent emergent jumps disappeared, replaced by smooth, gradual improvement curves. The striking benchmark result was real; the inference drawn from it was not.

This is not fraud. It is what happens when a benchmark score is treated as a window into capability rather than as a data point generated by a specific measurement instrument with specific properties. The three phenomena covered in this lesson — distribution shift, training data contamination, and benchmark gaming — are the most common mechanisms by which benchmark scores mislead practitioners who are not looking for them.
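The metric effect behind the "mirage" argument can be reproduced with simple arithmetic. The sketch below is a toy model, not the paper's analysis: it assumes each token of an answer is correct independently with the same probability, and the per-token accuracies in `PER_TOKEN` are invented to represent five successively larger models.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Chance that every token of an answer is right, assuming independent tokens."""
    return per_token_accuracy ** answer_length


# Hypothetical per-token accuracies for five successively larger models.
PER_TOKEN = [0.50, 0.60, 0.70, 0.80, 0.90]
ANSWER_LENGTH = 10  # for example, a ten-token arithmetic result

EXACT_MATCH = [exact_match_rate(p, ANSWER_LENGTH) for p in PER_TOKEN]

if __name__ == "__main__":
    for p, em in zip(PER_TOKEN, EXACT_MATCH):
        print(f"per-token {p:.2f} -> exact match {em:.3f}")
```

Per-token accuracy rises in equal steps of ten points, yet exact match stays below 3% for the first three models and then climbs to about 35% for the last. Plotted against model size, the second curve looks like a sudden jump even though the underlying improvement is smooth.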

3.1 Distribution Shift

Distribution shift occurs when the data a model encounters in deployment differs systematically from the data used to evaluate it. Every benchmark represents a particular slice of the possible input space — a particular time, a particular text register, a particular set of topics. When the deployment environment differs from that slice, benchmark performance does not generalize.

A documented example is the MIMIC-III clinical notes dataset, widely used to evaluate medical NLP models. Models achieving strong performance on MIMIC-III (recorded at Beth Israel Deaconess Medical Center in Boston) showed significantly degraded performance when applied to clinical notes from other hospitals using different documentation conventions, different patient populations, and different local terminology — even within the same country and language. The benchmark correctly measured capability on the MIMIC-III distribution. It could not be expected to measure capability on different distributions.

Distribution shift operates along multiple axes simultaneously: temporal shift (world events after training cutoff), domain shift (different subject matter), format shift (different text structure or conventions), and demographic shift (different user populations). A model evaluated before a major geopolitical event may perform poorly on questions about it; a model evaluated on formal prose may struggle with social media text; a model evaluated on American English may perform worse on British English. These are not model failures per se — they are benchmarking failures, cases where the benchmark underspecified the deployment context.

3.2 Training Data Contamination

Training data contamination — sometimes called "benchmark leakage" — occurs when a model is trained on data that includes, or closely resembles, the benchmark test set. This is not necessarily deliberate; it arises naturally because the same publicly available text sources (Wikipedia, Common Crawl, GitHub, Stack Overflow) that researchers use to build benchmarks are also used as training data for large language models.

The problem was first documented publicly at scale in the GPT-3 paper (Brown et al., 2020, OpenAI), which included a contamination analysis disclosing which benchmark test sets had overlapping n-grams with the training data. The analysis found 13-gram overlap between training data and several standard benchmarks, including Winograd, WinoGrande, and PIQA. The authors concluded that contamination likely had "negligible" effect, a conclusion later researchers disputed.
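An n-gram overlap check of the kind described above can be sketched in a few lines. This is a simplified illustration and not the procedure used in the GPT-3 paper: the function names are hypothetical, tokenization is plain whitespace splitting, and a real pipeline would need an index that scales to web-sized corpora.

```python
def ngrams(text: str, n: int) -> set:
    """Set of word n-grams in text, after lowercasing and whitespace splitting."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def flag_contaminated(test_items, training_docs, n=13):
    """Return the test items that share at least one n-gram with any training doc.

    Items shorter than n words produce no n-grams and are never flagged.
    """
    seen = set()
    for doc in training_docs:
        seen |= ngrams(doc, n)
    return [item for item in test_items if ngrams(item, n) & seen]


if __name__ == "__main__":
    training = ["the quick brown fox jumps over the lazy dog"]
    tests = ["a quick brown fox appeared", "completely unrelated sentence here now"]
    # n=3 only so the toy strings are long enough to overlap.
    print(flag_contaminated(tests, training, n=3))
```

The choice of n is a tradeoff: a small n flags common phrases that prove nothing, while a large n misses test items that were lightly reworded before landing in the training data. That blind spot is the motivation for behavior-based checks.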

Subsequent work by Jacovi et al. (2023) and others developed more sensitive contamination detection methods based on model behavior rather than text overlap. If a model's performance degrades dramatically when questions are rephrased — without changing the underlying knowledge requirement — it is evidence that the model memorized surface forms rather than learning the underlying capability. This behavioral contamination detection approach is increasingly used as a complement to text-overlap analysis.
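The rephrasing test can be expressed as a simple accuracy gap. The sketch below assumes a hypothetical `model` callable that maps a question string to an answer string, exact-match scoring, and paired lists of original and rephrased questions with the same gold answers. None of these names come from a specific paper or library.

```python
from typing import Callable, Sequence, Tuple

QA = Tuple[str, str]  # (question, gold answer)


def accuracy(model: Callable[[str], str], items: Sequence[QA]) -> float:
    """Exact-match accuracy, ignoring case and surrounding whitespace."""
    if not items:
        raise ValueError("items must not be empty")
    hits = sum(model(q).strip().lower() == a.strip().lower() for q, a in items)
    return hits / len(items)


def rephrasing_gap(model: Callable[[str], str],
                   original: Sequence[QA],
                   rephrased: Sequence[QA]) -> float:
    """Accuracy on original phrasings minus accuracy on rephrasings.

    A large positive gap suggests the model depends on surface form.
    """
    return accuracy(model, original) - accuracy(model, rephrased)


# Toy "model" that has memorized one exact question string (illustration only).
memorized = {"What is the capital of France?": "Paris"}


def toy_model(question: str) -> str:
    return memorized.get(question, "unknown")


gap = rephrasing_gap(
    toy_model,
    original=[("What is the capital of France?", "Paris")],
    rephrased=[("Which city serves as France's capital?", "Paris")],
)
```

How large a gap counts as evidence of memorization is a judgment call. Some drop is expected even for an uncontaminated model, since rephrasings can be harder or more ambiguous than the originals.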

The Contamination Dilemma

There is no fully satisfactory solution to benchmark contamination as long as training sets and evaluation sets are drawn from the same web-crawled sources. Private, held-out test sets (like the ones used in competitive ML challenges) partially address the problem but require trusted third-party custodians and make it harder for the research community to examine evaluation data for quality issues. Both horns of this dilemma are real.

3.3 Benchmark Gaming

Benchmark gaming is the deliberate or emergent optimization of model development choices to improve scores on a specific benchmark at the expense of broader capability. It exists on a spectrum from legitimate to problematic.

At the legitimate end, selecting training data that covers a benchmark's topic distribution is reasonable and expected. At the problematic end, including benchmark test-set examples in training data is straightforward contamination. In between lies a large gray zone: choosing evaluation prompts that match the format of a popular benchmark, tuning on held-out splits of benchmark-adjacent data, or fine-tuning on data specifically selected to improve benchmark performance without disclosing that fine-tuning occurred.

A well-documented instance of gaming dynamics appeared in the Chatbot Arena leaderboard, operated by LMSYS (Large Model Systems Organization). Arena rankings are based on human preference votes in head-to-head model comparisons. When researchers analyzed the distribution of model responses that attracted high votes, they found that verbose, confident, well-structured responses were systematically preferred regardless of factual accuracy, a pattern that model developers appeared to optimize for. The Arena ranking became as much a measure of response style as of underlying capability.

The LMSYS researchers themselves acknowledged this tension in a 2024 paper, noting that optimizing for Arena Elo could diverge from optimizing for task performance on specific applications. Their proposed solution — more structured evaluation criteria for raters — illustrates the general principle: making implicit scoring criteria explicit reduces but does not eliminate the gaming surface.
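To see why a preference-vote ranking is open to style optimization, it helps to look at how pairwise votes become a rating. The sketch below is the classic Elo update, assuming a K-factor of 32 and a 400-point scale. These are conventional chess defaults and are assumptions here, not the leaderboard's documented settings, and the leaderboard's actual rating computation may differ.

```python
from typing import Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B implied by the current ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> Tuple[float, float]:
    """Return updated (rating_a, rating_b) after one head-to-head vote."""
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - expected_score(r_a, r_b))
    # Whatever A gains, B loses: the update is zero-sum.
    return r_a + delta, r_b - delta
```

The only input is which response won the vote. Nothing in the update distinguishes a win earned by accuracy from a win earned by length or confident tone, so any systematic rater preference flows directly into the rating.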

3.4 Reading a Benchmark Result Critically

Given these three failure modes, a practitioner reading a benchmark result should habitually ask four questions before drawing conclusions about deployment suitability.

1. Domain match: Does the benchmark's data distribution match my deployment context in terms of topic, text format, language variety, and time period? If not, the score is only weak evidence of deployment performance.

2. Contamination disclosure: Has the evaluating party disclosed a contamination analysis? Was the analysis text-overlap-based (necessary but not sufficient) or behavior-based (stronger evidence)? Absence of disclosure is itself informative.

3. Optimization pressure: Has this benchmark been widely used as a development target? If many models have been optimized against it, high scores may partly reflect benchmark-specific optimization rather than the underlying capability.

4. Triangulation: Does performance on this benchmark cohere with performance on other benchmarks measuring related capabilities? Divergence between related benchmarks is a signal worth investigating before making deployment decisions.
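The triangulation question can be made concrete with a rank comparison. The sketch below assumes you have scores for several candidate models on two or more related benchmarks, and flags any model whose rank differs by more than a chosen gap. The function names and the default gap of 1 are illustrative assumptions, not an established method.

```python
from typing import Dict, List


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank models on one benchmark, with 1 as the highest score."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def divergent_models(results: Dict[str, Dict[str, float]],
                     max_rank_gap: int = 1) -> List[str]:
    """results maps benchmark name -> {model name: score}.

    Returns models whose best and worst ranks across the benchmarks
    differ by more than max_rank_gap.
    """
    per_benchmark = [ranks(scores) for scores in results.values()]
    # Only compare models that were scored on every benchmark.
    common = set.intersection(*(set(r) for r in per_benchmark))
    flagged = []
    for name in sorted(common):
        model_ranks = [r[name] for r in per_benchmark]
        if max(model_ranks) - min(model_ranks) > max_rank_gap:
            flagged.append(name)
    return flagged
```

A flagged model is not necessarily contaminated or gamed. The flag only marks a divergence worth investigating, for example by checking whether one of the benchmarks was a known development target.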

Practical Guidance

For high-stakes deployment decisions, a single benchmark score — however reputable — is insufficient evidence. The minimum defensible practice is triangulation across multiple evaluation sources including, where possible, a small-scale pilot evaluation on data drawn from your own deployment context.

Lesson 3 Quiz

Five questions · Select the best answer for each

11. The paper "Are Emergent Abilities of Large Language Models a Mirage?" (2023) argued that apparent sudden capability jumps in benchmark results were caused by what specific methodological artifact?

Correct. The key argument was that nonlinear scoring metrics — such as exact-match accuracy on multi-step arithmetic, which jumps from 0% to a high score once the model gets all steps right — created the visual appearance of discontinuous emergence that flattened out under linear measures.

The paper's argument centered on nonlinear scoring metrics: measures that appear to jump discontinuously when a model crosses a threshold produce striking benchmark graphs that disappear entirely under linear analysis — suggesting the appearance of emergence was a measurement artifact.

12. Behavioral contamination detection — using model behavior rather than text overlap — provides what advantage over n-gram-overlap analysis?

Correct. Behavioral contamination detection exploits the fact that a model that memorized surface forms will show sharp performance drops when questions are rephrased, whereas a model with genuine understanding will maintain performance across surface variations — a signal n-gram overlap analysis cannot capture.

Behavioral contamination detection's advantage is sensitivity: it catches cases where a model memorized specific phrasings of questions without exact text overlap, by checking whether rephrasing the question while preserving the knowledge requirement causes a disproportionate performance drop.

13. The Chatbot Arena leaderboard, as analyzed in research published around 2024, showed that Arena Elo rankings were significantly influenced by which model property beyond task accuracy?

Correct. Analysis of Chatbot Arena voting patterns found that verbose, well-structured, confident responses were systematically upvoted over more accurate but less polished responses, making Arena Elo as much a measure of presentation style as of underlying capability.

The documented finding was that verbose, confident, well-structured responses attracted higher preference votes regardless of factual accuracy — a form of benchmark gaming where models learned to optimize for presentation style rather than (or alongside) genuine capability.

14. Distribution shift operates along multiple axes. Which of the following describes "temporal shift" as a source of benchmark performance degradation?

Correct. Temporal shift refers specifically to the mismatch between a model's training data cutoff and the time of evaluation — models do not know about events after their training cutoff, so evaluations that include post-cutoff questions will show degraded performance not because the model is less capable but because it lacks the relevant information.

Temporal shift means the world has changed since the model's training data was collected — events, facts, and terminology that postdate the training cutoff create a systematic performance gap on questions about those topics.

15. According to the "four questions" framework in Lesson 3, which practice represents the minimum defensible standard for high-stakes deployment decisions based on benchmark evidence?

Correct. The lesson explicitly states that a single benchmark score is insufficient for high-stakes deployment decisions. The minimum defensible practice is triangulation across multiple evaluation sources, supplemented where possible by a pilot evaluation on data drawn from the actual deployment environment.

The minimum defensible standard described is triangulation across multiple sources plus a pilot evaluation on deployment-context data — not reliance on any single score, however reputable the benchmark or certifying body.

Lab 3: Detecting What Scores Hide

Conversation-based lab · Minimum 3 exchanges to complete

What You Will Do

In this lab you will work through a realistic scenario: a model release announcement claims impressive benchmark scores, and you must identify what the scores might be hiding. Ask your AI tutor to help you reason through distribution shift, contamination, and gaming concerns for a specific claim.

Start by asking: "A company announces their new model scored 89% on MMLU and 72% on HumanEval. They plan to deploy it for financial compliance document review. What should I be worried about before trusting those numbers for this use case?"

AI Tutor

Lab 3

Welcome to Lab 3. We're practicing critical reading of benchmark claims in a deployment context. Bring me a real or hypothetical scenario — a model release, a use case, a benchmark claim — and we'll work through what the numbers might and might not tell you.

Model Evaluation and Benchmarks · Lesson 4

Beyond Accuracy: Safety, Fairness, and the Limits of Numeric Scores

Some of the most important things to know about a model are not captured by any current benchmark.

When a model scores 90% on every test you have, but 10% of the time it harms someone, what does the score mean?

In March 2023, Stanford researchers published an evaluation of six commercially deployed large language model APIs — including GPT-4, Claude, and Bard — on a battery of tasks related to medical advice. The models performed impressively on medical knowledge benchmarks. But when the researchers presented the same models with clinical vignettes designed to elicit advice on medication overdose or self-harm, several models provided detailed information that clinical guidelines explicitly recommend against disclosing to patients expressing suicidal ideation. Benchmark scores had not predicted this behavior. The evaluations that would have caught it did not exist in standard benchmark suites.

This gap — between what benchmarks measure and what matters in deployment — is most acute in the domains of safety and fairness. These properties are not impossible to measure, but they resist the clean numeric summarization that accuracy-on-a-test-set provides. They depend on context, on who is using the model, on what harm means in a specific situation, and on values that different people hold differently. That does not make them unmeasurable. It makes them differently measurable.

4.1 What Safety Benchmarks Measure (and Miss)

The dominant approach to AI safety evaluation in 2023–2024 centers on red-teaming — deliberately trying to elicit harmful outputs from a model — and on taxonomy-based harm benchmarks that categorize potential harms and test model behavior against each category. Both approaches have value and both have significant limitations.

BBQ (Bias Benchmark for QA), released by Parrish et al. in 2021, tests whether models make biased inferences about social groups when context is ambiguous. A model is presented with an ambiguous social situation and asked which group is more likely to be responsible for a negative outcome. A well-calibrated model should express uncertainty when context is genuinely ambiguous; a biased model will assign blame disproportionately to historically stereotyped groups. BBQ found measurable bias in all models it evaluated in 2021, with considerable variation in degree.

WinoBias and WinoGender test gender-stereotyping in coreference resolution — whether a model assumes that "the nurse" refers to a woman and "the engineer" to a man. Models that score well on these benchmarks in 2024 have largely been fine-tuned specifically to avoid these particular stereotyping patterns, which illustrates both the value of targeted benchmarks (they created optimization pressure for a real problem) and their limitation (the models have learned to avoid the specific forms tested, not necessarily the underlying bias mechanism).

The most significant limitation of current safety benchmarks is their adversarial incompleteness: they test against known harmful patterns identified before the benchmark was created. Novel jailbreaks, unanticipated deployment contexts, and emergent model behaviors are not covered. A model that passes all available safety benchmarks may fail in ways that none of those benchmarks measured.

4.2 Fairness Metrics and Their Tensions

Fairness in machine learning is a formally contested concept: there are multiple mathematically precise definitions of fairness that are mutually incompatible in general. This is not a philosophical quibble — it has direct consequences for benchmark design.

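Demographic parity requires that a model's positive-outcome rate be equal across demographic groups. Equalized odds requires that true positive rates and false positive rates be equal across groups. Individual fairness requires that similar individuals receive similar outcomes. In a 2016 paper, Chouldechova demonstrated that when base rates differ across groups (as they frequently do in criminal justice, medical, and lending datasets), a risk score cannot simultaneously satisfy predictive parity (equal positive predictive value across groups) and equal false positive and false negative rates. Demographic parity and equalized odds conflict under the same condition: with differing base rates, both can hold only for a classifier whose true positive and false positive rates coincide, which means its predictions carry no information about the outcome. Any benchmark that tests for one of these fairness criteria and not others is testing a partial view of fairness.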
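The tension between demographic parity and equalized odds follows from simple arithmetic. A group's positive-prediction rate equals TPR × base rate + FPR × (1 − base rate), so two groups with identical TPR and FPR but different base rates end up with different positive-prediction rates unless TPR equals FPR. The sketch below checks this on toy labels invented purely for illustration; the function name `group_rates` is hypothetical.

```python
from typing import Dict, Sequence


def group_rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Positive-prediction rate, true positive rate, and false positive rate for one group."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return {
        "base_rate": positives / len(y_true),
        "positive_rate": sum(y_pred) / len(y_pred),  # compared under demographic parity
        "tpr": tp / positives,                       # compared under equalized odds
        "fpr": fp / negatives,                       # compared under equalized odds
    }


# Toy data: group A has base rate 0.5, group B has base rate 0.2.
group_a = group_rates(y_true=[1, 1, 1, 1, 0, 0, 0, 0],
                      y_pred=[1, 1, 0, 0, 1, 0, 0, 0])
group_b = group_rates(y_true=[1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                      y_pred=[1, 0, 1, 1, 0, 0, 0, 0, 0, 0])

# Equalized odds holds (TPR 0.5 and FPR 0.25 in both groups),
# but the positive-prediction rates differ (0.375 versus 0.3).
```

A benchmark that reports only one of these quantities would score this classifier as fair or unfair depending on which definition it encodes.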

The BOLD (Bias in Open-ended Language Generation) benchmark evaluates sentiment and regard in model-generated text about different demographic groups. It found that LLMs generated consistently more positive text about some racial, religious, and gender groups than others. By 2023, most large commercial models had been fine-tuned to perform much better on BOLD. But independent analysis found cases where this fine-tuning introduced a different form of unfairness: over-refusal, where the model declined to generate text about certain groups at all rather than generating text with measured positive regard — a behavior that BOLD's scoring mechanism did not detect as a problem.

The Over-Refusal Problem

Several studies in 2023 found that safety and fairness fine-tuning sometimes caused models to refuse reasonable requests at elevated rates — sometimes disproportionately for certain demographic groups, certain languages, or certain topics. A benchmark that measures harmful outputs but not over-refusals captures only half the relevant behavior space.
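Measuring both halves of the behavior space means reporting two rates side by side. The sketch below assumes each evaluated request carries a human label saying whether a refusal was warranted, plus an observed flag for whether the model refused. The `Outcome` class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Outcome:
    should_refuse: bool  # label: the request is one the model ought to decline
    refused: bool        # observation: the model declined to answer


def safety_rates(outcomes: Sequence[Outcome]) -> Dict[str, float]:
    """Return the harmful-compliance rate and the over-refusal rate."""
    harmful = [o for o in outcomes if o.should_refuse]
    benign = [o for o in outcomes if not o.should_refuse]
    if not harmful or not benign:
        raise ValueError("need at least one harmful and one benign request")
    return {
        # Share of requests that warranted refusal but were answered anyway.
        "harmful_compliance_rate": sum(not o.refused for o in harmful) / len(harmful),
        # Share of reasonable requests that were refused.
        "over_refusal_rate": sum(o.refused for o in benign) / len(benign),
    }
```

Computing the same two rates separately per demographic group, language, or topic is one way to surface the differential over-refusal described above.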

4.3 Calibration: The Benchmark Nobody Discusses Enough

A model's calibration — the correspondence between its expressed confidence and its actual accuracy — is one of the most practically important properties for deployment and one of the least prominently featured in public benchmark reporting.

A model that says "I am 90% confident" and is right 90% of the time is well-calibrated. A model that expresses high confidence while being right only 60% of the time is miscalibrated in the direction of overconfidence, and this is a deployment risk: users may make consequential decisions based on confident incorrect statements without realizing they should verify.

TruthfulQA, discussed in Lesson 1, partially addresses calibration by testing whether models give accurate answers to questions where overconfident wrong answers are common. The Expected Calibration Error (ECE) metric, used in the HELM evaluation suite, measures the gap between expressed probability and empirical accuracy more directly. HELM's 2022 evaluation found substantial variation in calibration across models, with some frontier models showing worse calibration (higher overconfidence) than smaller models on certain task categories — a counterintuitive finding that has not been prominently highlighted in commercial model releases.
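Expected Calibration Error is straightforward to compute once each prediction carries a confidence value. The sketch below uses the common equal-width binning formulation: group predictions by confidence, then take the size-weighted average of the gap between accuracy and mean confidence in each bin. The choice of ten bins and the bin-edge convention are assumptions here, and HELM's exact configuration may differ.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """ECE with equal-width confidence bins over [0, 1]."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the top bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(acc - mean_conf)
    return ece


# The overconfident model from the text: 90% stated confidence, 60% accuracy.
overconfident = expected_calibration_error([0.9] * 10, [True] * 6 + [False] * 4)
```

For the overconfident example the result is approximately 0.3, the gap between 90% stated confidence and 60% accuracy. ECE does not indicate the direction of miscalibration, so it is usually read alongside a per-bin comparison of accuracy and confidence.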

4.4 What Evaluation Cannot (Currently) Capture

Several important properties of deployed AI systems remain largely outside the scope of current benchmark evaluation.

Consistency across equivalent inputs: A model may answer a question correctly when phrased one way and incorrectly when semantically equivalent phrasing is used. Several papers from 2021–2023 documented this phenomenon — sometimes called "brittleness" — but no widely adopted benchmark systematically quantifies consistency across input variations at scale.

Long-context coherence: Benchmarks are almost universally short-context. The ability to maintain accurate reasoning across a 100,000-token document — a capability increasingly marketed in commercial model releases — is difficult to benchmark with available infrastructure and has received limited systematic evaluation compared to its claimed importance.

User-interaction dynamics: Most benchmarks evaluate single-turn or short multi-turn interactions. Real deployment involves extended interactions, follow-up questions, error recovery, and user adaptation. How model quality evolves across a full conversation is poorly understood and rarely evaluated.

Societal-scale effects: The aggregate impact of millions of simultaneous model-mediated interactions on information ecosystems, political discourse, and collective reasoning is not measurable by any current benchmark. This is not a gap that better benchmarks can easily fill — it requires different research methodologies entirely.

Closing the Module

Benchmarks exist because claims without measurement are just marketing. They fail — predictably, in documented ways — because measuring complex capabilities with a single number always loses information. The practical skill this course develops is not benchmark skepticism for its own sake, but a calibrated reading practice: using benchmark evidence where it applies, identifying where it does not, and asking what additional evidence would reduce your uncertainty before a consequential decision.

Lesson 4 Quiz

Five questions · Select the best answer for each

16. The Stanford 2023 evaluation of commercial LLM APIs on medical tasks found a critical limitation of standard benchmark scores. What did that evaluation specifically reveal?

Correct. The Stanford study found that strong medical benchmark performance did not predict safe behavior on clinical vignettes where models were asked about topics like medication overdose in patients expressing suicidal ideation — a safety-relevant behavior that no standard benchmark had measured.

The key finding was that strong medical benchmark scores coexisted with potentially harmful behavior on safety-sensitive clinical vignettes — a gap between what benchmarks measured and what mattered in deployment that standard evaluations had not addressed.

17. Chouldechova's 2016 paper demonstrated which fundamental constraint on algorithmic fairness?

Correct. Chouldechova proved that when base rates differ across groups, a common condition in real datasets, predictive parity and equal false positive and false negative rates cannot both hold. Demographic parity and equalized odds conflict under the same condition for any informative classifier, so a benchmark testing for one criterion necessarily accepts violation of another.

Chouldechova's key result: differing base rates across demographic groups make it mathematically impossible to satisfy predictive parity and equal error rates at the same time. The same tension arises between demographic parity and equalized odds, meaning any fairness benchmark is implicitly taking a position on which definition of fairness matters.

18. The "over-refusal problem" discovered in analysis of safety-fine-tuned models describes what specific behavior not captured by standard harm benchmarks?

Correct. Safety fine-tuning sometimes led models to over-refuse — declining reasonable requests, sometimes with patterns suggesting differential treatment by topic or demographic group — a behavior that benchmarks measuring harmful outputs did not detect because the model produced no output at all.

Over-refusal means the model becomes too restrictive — refusing reasonable requests rather than engaging appropriately — sometimes with differential rates across topics or groups. Standard harm benchmarks test for harmful outputs, not for the absence of output when output was warranted.

19. HELM's 2022 evaluation using the Expected Calibration Error metric found which counterintuitive result about frontier models?

Correct. HELM found that higher accuracy did not reliably predict better calibration — some frontier models were more overconfident on certain task categories than smaller models, making them potentially more dangerous in deployment where users might over-trust confident incorrect answers.

HELM's counterintuitive calibration finding: larger, more accurate frontier models were sometimes worse-calibrated (more overconfident) than smaller models on specific task categories — a deployment risk that accuracy benchmarks alone would not reveal.

20. Which currently important model property is described in Lesson 4 as largely outside the scope of existing benchmarks due to infrastructure limitations rather than conceptual difficulty?

Correct. Long-context coherence — the ability to reason accurately across 100,000+ token documents, a capability actively marketed in commercial models — has received limited systematic benchmark evaluation compared to its claimed importance, partly because evaluating it at scale requires substantial infrastructure investment.

Lesson 4 specifically flags long-context coherence as a capability increasingly marketed in commercial releases but inadequately benchmarked, due in part to the infrastructure cost of systematically evaluating performance across very long contexts.

Lab 4: Designing an Evaluation Strategy

Conversation-based lab · Minimum 3 exchanges to complete

What You Will Do

In this final lab you will synthesize the full module by designing an evaluation strategy for a specific deployment scenario. Your AI tutor will push back, identify gaps, and help you think through which benchmarks are relevant, which are insufficient, and what additional evidence you would need.

Start by asking: "I'm choosing a model to help triage customer support tickets for a healthcare company. Help me design an evaluation strategy that goes beyond standard accuracy benchmarks to cover safety, fairness, and calibration."

AI Tutor

Lab 4

Welcome to Lab 4 — the synthesis lab. You've covered why benchmarks exist, how they're built, what they hide, and what they miss entirely. Now let's apply that to a real evaluation design problem. Tell me your deployment scenario and we'll build an evaluation strategy together — including the parts that no existing benchmark covers.

Module 1 Test

15 questions · 80% required to pass · Covers all four lessons

1. Which three components are required for a complete benchmark in the machine learning sense?

Correct. A benchmark requires: inputs, expected outputs (ground truth), and a scoring procedure.

The three required components are: a standardized input dataset, ground-truth expected outputs, and a defined scoring procedure.

2. TREC (Text REtrieval Conference), beginning in 1992, established what practice that most subsequent AI benchmarks follow?

Correct. TREC established blind submission, human relevance judgments, and carefully defined scoring metrics as the foundational template.

TREC pioneered blind submission, human relevance assessors, and rigorously defined metrics — the template most later AI evaluation follows.

3. AlexNet's victory on the ImageNet benchmark in 2012 is significant because it illustrates which point about benchmarks?

Correct. AlexNet's ImageNet performance, a top-5 error rate of 15.3% against roughly 26% for the runner-up, became a market signal that restructured entire industries, not just a research result.

The significance is that AlexNet's benchmark win became a market signal, restructuring research funding and industry strategy — elevating benchmarks from academic scorekeeping to consequential instruments.

4. Goodhart's Law predicts that once a benchmark becomes a primary public comparison signal, what will happen to model scores over time?

Correct. Goodhart's Law: when a measure becomes a target, it ceases to be a good measure — scores inflate through optimization without necessarily reflecting genuine capability gains.

Goodhart's Law: optimization pressure on a benchmark inflates its scores without necessarily improving the underlying capability the benchmark was meant to measure.

5. What does the "arms race" pattern in NLP benchmarking specifically describe?

Correct. GLUE saturated, leading to SuperGLUE, which also saturated. ARC-Challenge is now primarily historical. The frontier of capability continuously outpaces benchmarks designed for it.

The arms-race pattern: models saturate benchmarks, researchers build harder ones, those get saturated too. GLUE and SuperGLUE are the canonical example.

6. The SNLI benchmark was found to be gameable because crowdworkers' annotation patterns created what kind of unintended signal?

Correct. Crowdworkers consistently used specific phrasings and constructions for each label type, creating surface-level cues models could exploit without genuine natural language inference.

SNLI's problem was annotation artifacts: systematic stylistic patterns in how workers wrote hypothesis sentences for each label, giving models a shortcut to high performance without genuine inferential reasoning.

7. GPT-3's 2020 technical report included what notable transparency contribution regarding benchmarks?

Correct. GPT-3's technical paper included an n-gram overlap contamination analysis — the first public documentation of this issue at scale for large language models, even though the conclusions were later disputed.

GPT-3's paper included an n-gram-overlap contamination analysis — notable as the first public disclosure of this kind for a large language model, even if later researchers questioned its conclusions.

8. Behavioral contamination detection identifies memorization of surface forms by testing which observable model property?

Correct. If a model memorized surface forms rather than learned the underlying capability, rephrasing the question should cause a disproportionate performance drop — the key behavioral signal for contamination.

Behavioral contamination detection works by rephrasing: a model with genuine understanding performs consistently across phrasings; a model that memorized specific question forms shows large performance drops on rephrasings.

9. Distribution shift along the "temporal" axis describes what specific source of benchmark performance degradation?

Correct. Temporal shift: the model's training data has a cutoff date, so benchmark questions about post-cutoff events expose a systematic gap between what the benchmark asks and what the model could have learned.

Temporal distribution shift: the real world changes after a model's training cutoff, so benchmarks that include post-cutoff questions test a gap in knowledge, not a gap in capability.

10. The BBQ (Bias Benchmark for QA) benchmark tests for bias in language models by presenting what specific type of scenario?

Correct. BBQ specifically uses ambiguous contexts — where no correct answer exists — to test whether models default to stereotyped attributions when they should express genuine uncertainty.

BBQ's design: present ambiguous contexts where no definitive answer is available. A fair model should express uncertainty; a biased model assigns blame to stereotyped groups — revealing bias through its responses to genuinely ambiguous situations.

11. Chouldechova's 2016 mathematical result about fairness metrics has what direct consequence for benchmark design?

Correct. When base rates differ, Chouldechova's result shows that predictive parity and equal error rates are mathematically incompatible, and demographic parity and equalized odds conflict under the same condition. A benchmark testing one criterion is implicitly taking a position on which fairness definition matters: a value choice, not a technical one.

Chouldechova's result: with differing base rates, no risk score can satisfy both predictive parity and equal false positive and false negative rates, and the same tension arises between demographic parity and equalized odds. Benchmark designers choosing one definition are making an implicit value judgment.

12. HELM (Holistic Evaluation of Language Models) was designed to address what specific structural problem in commercial AI evaluation?

Correct. HELM at Stanford CRFM was explicitly designed to evaluate models from multiple commercial labs using a consistent methodology conducted by a party without a direct stake in the results.

HELM's defining feature is independence from commercial interests — evaluating models from multiple labs using consistent methodology at a research institution without financial stake in the outcomes.

13. The "emergent abilities" paper (2023) argued that apparent sudden capability jumps in LLM benchmarks were caused by what?

Correct. The paper showed that nonlinear metrics — like exact-match accuracy on multi-step tasks — create apparent threshold jumps that are artifacts of measurement, not genuine discontinuous capability changes.

The argument was specifically about nonlinear scoring metrics: measures that require all-or-nothing success create visual jumps in performance curves that disappear under continuous scoring — suggesting the apparent emergence was a measurement artifact.

14. HELM's calibration evaluation using Expected Calibration Error (ECE) found which counterintuitive result about large frontier models?

Correct. HELM found that larger, more accurate models were sometimes more overconfident — a worse calibration than smaller models on specific tasks — a deployment risk that accuracy benchmarks alone would not identify.

HELM's finding: frontier models can be more overconfident (worse calibration) than smaller models on specific task categories, even while achieving higher accuracy — a critical deployment risk not visible in accuracy-only reporting.

15. According to the module's closing framework, the minimum defensible standard for benchmark evidence in high-stakes deployment decisions is:

Correct. No single benchmark, however reputable, is sufficient for high-stakes decisions. The minimum is triangulation across multiple sources plus deployment-context-specific evidence.

The module's consistent conclusion: triangulation across multiple sources, plus a pilot evaluation on deployment-context data where possible, is the minimum defensible standard — not reliance on any single score or leaderboard.