Module 4 · Lesson 1

What Is Benchmark Contamination?

When the exam answers are in the textbook — and the textbook is the training set.

How do test questions end up inside a model's training data, and why does it matter so much?

In September 2023, researchers at MIT and the University of Washington published a paper titled "Investigating Data Contamination in Modern LLMs." They found that several frontier models showed statistically anomalous performance on certain multiple-choice benchmarks — performance that dropped meaningfully when the questions were slightly paraphrased. The implication was uncomfortable: models appeared to have memorised specific answer sequences, not learned the underlying reasoning the benchmarks were designed to test.

The Core Problem

Benchmark contamination — also called data contamination or test leakage — occurs when examples from a held-out evaluation dataset appear in a model's training corpus. The result is that a model's reported score on that benchmark no longer reliably measures the capability the benchmark was designed to assess. The model may have, in effect, memorised the questions and answers rather than generalised the skill.

The internet is enormous. Modern LLMs are trained on crawls that span hundreds of billions to trillions of tokens drawn from Common Crawl, GitHub, academic repositories, and countless other sources. Benchmark datasets (BoolQ, HellaSwag, MMLU, HumanEval, and many others) are publicly available. Their questions, answer keys, and even their metadata have been indexed, discussed, reproduced, and mirrored across the web. Keeping them out of a multi-terabyte training corpus is genuinely difficult.

Why This Is Not Cheating in the Ordinary Sense

Contamination is usually unintentional. Training pipelines consume web data at scale; no human reviews every token. The problem is structural: the incentive to publish benchmarks publicly (for reproducibility) conflicts with the need to keep evaluation data unseen. This tension has no easy resolution.

A Precise Vocabulary

Researchers distinguish several contamination scenarios, each with different severity:

Input contamination: The question text appears in training data but not the answer. The model has seen the question before but must still reason to the answer. Considered a mild form.

Input-output contamination: Both the question and the correct answer appear in training data. The model can pattern-match without reasoning. Considered severe.

Partial contamination: Some subset of the benchmark (e.g., a particular subject category in MMLU) appears in training while others do not. Aggregate scores mask the inflation in contaminated subsets.

Indirect contamination: Training data contains documents that discuss benchmark answers (e.g., forum posts, blog posts about model results) without being the original benchmark file.
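The first three categories above can be sketched as a labelling rule. This is a minimal illustration, assuming training documents are available as plain strings and that exact substring search is an acceptable stand-in for a real matching pipeline; the names `Contamination`, `classify_item`, and `contaminated_fraction` are hypothetical. Indirect contamination is not covered, since documents that merely discuss an answer rarely contain the question verbatim.

```python
from enum import Enum


class Contamination(Enum):
    NONE = "none"
    INPUT = "input"
    INPUT_OUTPUT = "input-output"


def classify_item(question: str, answer: str, corpus_docs: list[str]) -> Contamination:
    """Label one benchmark item by exact-substring search over training documents."""
    seen_question = False
    for doc in corpus_docs:
        if question in doc:
            seen_question = True
            # Severe case: the answer appears in the same document as the question.
            if answer in doc:
                return Contamination.INPUT_OUTPUT
    return Contamination.INPUT if seen_question else Contamination.NONE


def contaminated_fraction(labels: list[Contamination]) -> float:
    """Share of items with any contamination; a value strictly between 0 and 1 is partial contamination."""
    if not labels:
        return 0.0
    return sum(label is not Contamination.NONE for label in labels) / len(labels)
```

Very short answers (a single digit or letter) will match by accident, so a practical version would likely require the answer to appear near the question rather than anywhere in the document.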

The Memorisation–Generalisation Distinction

A fundamental goal of evaluation is to measure generalisation — the capacity to handle novel situations. Contamination short-circuits this by converting a generalisation test into a recall test. A model that has memorised MMLU answers achieves high scores by retrieval, not reasoning. When those same questions are rephrased or the answer options are shuffled, the memorisation advantage erodes, which is exactly the pattern the 2023 MIT/UW study documented.
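Option shuffling is the cheaper of the two perturbations mentioned above. A minimal sketch, assuming each multiple-choice item is stored as a list of option strings plus the index of the correct one (the function name `shuffle_options` is illustrative):

```python
import random


def shuffle_options(options: list[str], correct_index: int, seed: int) -> tuple[list[str], int]:
    """Return the options in a new order plus the new index of the correct answer."""
    rng = random.Random(seed)  # seeded so the perturbed benchmark is reproducible
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The correct option moved to wherever its old index landed in `order`.
    return shuffled, order.index(correct_index)
```

A model that has memorised "the answer is C" rather than the content of option C should lose accuracy once the letters no longer line up with the original answer key.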

OpenAI's technical report for GPT-4 (March 2023) addressed contamination directly, noting that they performed contamination analysis by checking for n-gram overlap between training data and benchmark questions. They found some overlap with several benchmarks, including MMLU, and disclosed this explicitly — a relatively rare degree of transparency at the time. Their reported scores for contaminated subsets were annotated with caveats.

Key Insight

The same property that makes benchmarks useful — public accessibility — makes them vulnerable. Once a benchmark question is on the internet, it is a candidate for training data inclusion. This is a systems problem, not a character problem.

Historical Antecedents

The problem predates large language models. In supervised machine learning, the distinction between training, validation, and test sets has been foundational since the 1990s. Kaggle competitions have documented cases where participants inadvertently or deliberately incorporated test labels. The NLP community grappled with train-test overlap in earlier systems: the SQuAD reading comprehension benchmark, for instance, had passages drawn from Wikipedia, and early neural models were trained on Wikipedia, creating at least indirect overlap concerns. What changed with LLMs is scale: the training corpora are so vast that some contamination of any widely mirrored public benchmark is close to certain rather than merely possible.

Lesson 1 Quiz

What Is Benchmark Contamination?

1. Which of the following best defines "input-output contamination"?

Correct. Input-output contamination is the most severe form because the model can retrieve the answer directly rather than reason to it.

Not quite. Input-output contamination specifically means both question and answer appear in training — the model can recall the correct answer without reasoning.

2. Why is benchmark contamination described as a "structural" problem rather than an ethical failure?

Correct. The conflict between reproducibility (requiring public benchmarks) and evaluation integrity (requiring unseen test data) is structural — it arises from how the field operates, not from deliberate misconduct.

Reconsider. The structural framing points to the systems-level tension between public benchmark accessibility and training data scale — contamination is typically unintentional.

3. The 2023 MIT/University of Washington study found that contaminated models' scores dropped when benchmark questions were slightly paraphrased. What does this suggest?

Correct. Sensitivity to superficial paraphrasing is the signature of memorisation rather than understanding — a model that has genuinely learned the concept should handle rephrasings equally well.

Think about what it means for performance to drop specifically on paraphrased versions of the same question. That pattern implicates memorisation of exact formulations.

Lab 1 — Contamination Identification

Explore how contamination arises and how to recognise its fingerprints.

Your Task

In this lab you'll discuss benchmark contamination with an AI tutor. Explore the mechanics of how contamination enters training data, the difference between input and input-output contamination, and how researchers detect it. Ask at least three substantive questions to complete the lab.

Suggested opening: "Walk me through exactly how a benchmark question from MMLU could end up in a model's training data, step by step."

Module 4 · Lesson 2

Documented Cases of Contamination

Real incidents where training leakage distorted reported benchmark scores.

What have post-hoc investigations actually found — and how did labs respond when contamination was discovered?

When OpenAI released the GPT-4 technical report, it included an unusual section: a contamination analysis. The team had checked n-gram overlap between their training data and every evaluation benchmark they reported results on. They found varying degrees of overlap with AMC 2022/2023, MMLU, HumanEval, and several others. Rather than omitting contaminated benchmarks, they disclosed the findings and, where overlap was high, provided results both with and without the potentially contaminated examples. It was an imperfect but transparent approach that set a precedent.

The GPT-4 Contamination Findings in Detail

OpenAI's analysis found that AMC 2022 and 2023 math competitions had substantial overlap with training data. However, on closer inspection, the contaminated subset did not show dramatically higher performance than the clean subset — suggesting the contamination may not have provided a meaningful advantage, or that the benchmark was simply measuring something the model had genuinely learned. This nuance is important: overlap does not automatically mean inflated scores.
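The comparison described above, accuracy on flagged items versus accuracy on clean items, can be sketched as follows. This assumes per-item correctness and per-item contamination flags are already available as parallel lists; the function names are illustrative and this is not a reconstruction of OpenAI's analysis.

```python
def accuracy(results: list[bool]) -> float:
    """Fraction of correct items; NaN if the subset is empty."""
    return sum(results) / len(results) if results else float("nan")


def subset_gap(correct: list[bool], contaminated: list[bool]) -> dict[str, float]:
    """Accuracy on flagged vs. unflagged items and the difference between them."""
    flagged = [c for c, f in zip(correct, contaminated) if f]
    clean = [c for c, f in zip(correct, contaminated) if not f]
    acc_flagged, acc_clean = accuracy(flagged), accuracy(clean)
    return {"contaminated": acc_flagged, "clean": acc_clean, "gap": acc_flagged - acc_clean}
```

A gap near zero is the pattern the paragraph above describes: overlap was detected, but the flagged items were not answered noticeably better than the rest.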

For HumanEval (a coding benchmark), overlap was lower. OpenAI noted that the nature of code evaluation — running programs and checking outputs — makes contamination less helpful than in multiple-choice settings where answer keys can be memorised directly.

Phi-1 and the HumanEval Investigation (2023)

Microsoft Research's Phi-1 model (June 2023) achieved 50.6% pass@1 on HumanEval despite having only 1.3 billion parameters — a result that surprised many observers given that much larger models scored lower. Researchers outside Microsoft raised contamination concerns. The Phi-1 paper acknowledged that their training data included "textbook-quality" synthetically generated code problems that were topically similar to HumanEval. Whether this constituted contamination or simply effective domain coverage was debated, but the episode highlighted how even indirect overlap — not literal benchmark inclusion — can inflate scores when the benchmark and training data share the same distribution.
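For readers unfamiliar with the metric: pass@k is usually estimated with the unbiased estimator from the paper that introduced HumanEval (Chen et al., 2021). A short sketch, assuming `n` completions were sampled per problem and `c` of them passed the unit tests; the benchmark score is the mean of this value over all problems.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed the unit tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to c / n, the fraction of sampled completions that pass.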

The Distribution Overlap Problem

True contamination (exact benchmark examples in training) is one end of a spectrum. At the other end is distribution overlap — training on data that is drawn from the same distribution as the benchmark. The second is harder to detect and arguably harder to prevent. Phi-1's HumanEval result sits in this grey zone.

The Llama 2 Evaluation Controversy (2023)

Meta's Llama 2 paper (July 2023) reported strong benchmark results across MMLU, TruthfulQA, and other evaluations. External researchers, including a team at Hugging Face, noted that the reported numbers for some benchmarks appeared higher than those produced by community re-evaluations using the same benchmark datasets. Discrepancies were attributed partly to evaluation methodology differences (prompt formatting, shot counts) and partly to possible contamination. Meta provided some contamination analysis in the paper but the methodology was criticised as insufficiently rigorous by several independent researchers.

The BenchmarkLeakage Repository (2024)

In early 2024, a group of researchers published a systematic study cataloguing contamination evidence across 30+ open-source and proprietary models, releasing their findings as a public repository. Their methodology involved querying models directly with benchmark questions and measuring whether models could complete answer sequences with higher-than-expected accuracy — a behavioural contamination probe rather than a data-inspection approach. They found evidence of contamination in multiple models across MMLU, ARC-Challenge, and TriviaQA. The study could not establish causation (contamination versus genuine knowledge) but provided quantitative contamination risk scores for each model-benchmark pair.

Model / Paper	Benchmark	Contamination Finding	Response
GPT-4 (OpenAI, 2023)	AMC 2022/23, MMLU, HumanEval	N-gram overlap detected; disclosed in tech report	Results annotated; clean vs. contaminated subsets reported
Phi-1 (Microsoft, 2023)	HumanEval	Distribution overlap via synthetic training data	Acknowledged in paper; debated externally
Llama 2 (Meta, 2023)	MMLU, TruthfulQA	Score discrepancies vs. community re-eval	Contamination analysis in paper; methodology criticised
Multiple models (2024 study)	MMLU, ARC-Challenge, TriviaQA	Behavioural probing suggested memorisation	Public repository; industry largely did not respond directly

What the Cases Have in Common

Across these documented cases, several patterns emerge. First, contamination is almost always discovered after the fact by external researchers, not prevented during training. Second, the response from labs has been variable — OpenAI's transparency in the GPT-4 report is the positive outlier. Third, the contamination rarely produces a simple, clean verdict: the relationship between data overlap and score inflation is complex and depends on the benchmark type, the degree of overlap, and how the model is evaluated.

The Transparency Baseline

The field has not yet converged on a standard for contamination disclosure. GPT-4's approach — proactive n-gram analysis with public disclosure — remains one of the most thorough examples. That it is exceptional rather than routine reflects a systemic gap in evaluation norms.

Lesson 2 Quiz

Documented Cases of Contamination

1. What was notable about OpenAI's handling of contamination in the GPT-4 technical report?

Correct. OpenAI's proactive disclosure and dual reporting (clean vs. contaminated subsets) set a positive precedent, though the approach was still debated as insufficiently rigorous by some researchers.

Revisit the GPT-4 case. Their approach was notable for transparency — disclosure and annotated results — not for excluding data or claiming clean pipelines.

2. The Phi-1 model's strong HumanEval performance raised contamination concerns. What made this case different from direct benchmark inclusion?

Correct. Distribution overlap — training on data from the same domain and difficulty level as the benchmark — is a subtler and harder-to-detect form of contamination than direct inclusion of benchmark text.

The Phi-1 case is an example of distribution overlap, not verbatim benchmark inclusion. This grey zone is harder to detect and harder to adjudicate.

3. The 2024 "BenchmarkLeakage" study used what methodology to detect contamination without access to training data?

Correct. Behavioural probing is a black-box contamination detection method — it infers memorisation from model outputs rather than requiring access to training data internals.

The key methodological innovation in the 2024 study was behavioural probing — inferring memorisation from how models complete benchmark answer sequences, without needing to inspect training data directly.

Lab 2 — Case Analysis

Analyse documented contamination incidents and their industry implications.

Your Task

In this lab, discuss the documented contamination cases from Lesson 2 with the AI tutor. Focus on comparing how different labs responded, what the cases reveal about industry norms, and what "distribution overlap" means versus direct contamination. Ask at least three substantive questions.

Suggested opening: "Why is the Phi-1 / HumanEval case considered a 'grey zone' — what makes distribution overlap harder to adjudicate than direct contamination?"

Module 4 · Lesson 3

Detection Methods

From n-gram overlap to membership inference — the toolkit for finding contamination.

How do researchers detect contamination when they cannot inspect the training data directly?

Most contamination detection research operates under a fundamental constraint: the training data is not publicly available. Researchers studying GPT-4 or Claude cannot download the training corpus and search for benchmark examples. This black-box condition has driven the development of inference-based detection methods — techniques that query model behaviour to infer what the model has memorised.

Method 1: N-gram Overlap Analysis

The most direct method is n-gram overlap: compare substrings of benchmark questions against the training corpus. This requires access to the training data and is therefore primarily used internally by labs. OpenAI's GPT-4 contamination analysis used this approach — they searched for 50-character substrings from benchmark questions within their training set.

N-gram overlap has known limitations. A high overlap score does not guarantee score inflation (the model may not have learned the answer from those specific occurrences), and a low overlap score does not guarantee clean evaluation (paraphrased versions of the same question may be present). Threshold selection is also arbitrary — there is no consensus on what length or frequency of overlap constitutes "contamination."
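A substring check of the kind described above might look like the sketch below. It assumes the training corpus fits in memory as a list of strings, which no real corpus does; production pipelines would use an index or hashing instead. The 50-character length comes from the text above, while the number of sampled substrings (`count=3`) and the function names are assumptions made for illustration.

```python
import random


def sample_substrings(text: str, length: int = 50, count: int = 3, seed: int = 0) -> list[str]:
    """Draw `count` random substrings of `length` characters (the whole text if it is shorter)."""
    if len(text) <= length:
        return [text]
    rng = random.Random(seed)
    starts = [rng.randrange(len(text) - length + 1) for _ in range(count)]
    return [text[s:s + length] for s in starts]


def is_flagged(example: str, corpus_docs: list[str], length: int = 50, count: int = 3) -> bool:
    """Flag the example if any sampled substring occurs verbatim in any document."""
    probes = sample_substrings(example, length, count)
    return any(probe in doc for probe in probes for doc in corpus_docs)
```

Because matching is verbatim, a reworded or reformatted copy of the question passes unflagged, which is the false-negative limitation noted above.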

N-gram overlap: Searching for shared character or word sequences between training data and benchmark questions. Requires training data access. Used internally by labs.

Membership inference: Statistical techniques that determine, from model outputs, whether a specific example was in training data. Based on the observation that models assign higher probability to memorised sequences.

Canary insertion: Seeding unique, distinctive strings into training data to later test whether the model has memorised them. Used to measure extraction risk rather than benchmark contamination specifically.

Method 2: Perplexity-Based Probing

Language models assign probability scores to text sequences. If a model has memorised a benchmark question and answer, it will assign lower perplexity (higher probability) to that sequence than to a paraphrased equivalent. Researchers can exploit this: compute a model's perplexity on the original benchmark text versus carefully paraphrased versions. A significant perplexity gap is evidence of memorisation.

The 2021 paper "Extracting Training Data from Large Language Models" by Carlini et al. (Google Brain) demonstrated that models with high perplexity gaps on specific sequences had typically memorised those sequences verbatim. This methodology was later adapted for benchmark contamination detection. The challenge is generating appropriate paraphrases that preserve semantic content while differing sufficiently in surface form.

Perplexity Gap Test

If a model assigns dramatically lower perplexity to "The capital of France is Paris" than to "France's capital city is Paris," this gap suggests memorisation of the first phrasing specifically — a red flag if that phrasing appears in a benchmark answer key.
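The gap test can be written down compactly once per-token log-probabilities are available. The sketch below assumes natural-log token probabilities have already been obtained from the model (how to get them depends on the model and is not shown); `perplexity_gap` is an illustrative name, and expressing the gap as a ratio is one choice among several.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from natural-log token probabilities: exp of the mean negative log-prob."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_gap(original: list[float], paraphrases: list[list[float]]) -> float:
    """Mean paraphrase perplexity divided by the original's perplexity."""
    mean_para = sum(perplexity(p) for p in paraphrases) / len(paraphrases)
    return mean_para / perplexity(original)
```

A ratio well above 1 means the model finds the original wording much more predictable than its paraphrases. What counts as "well above" has no agreed threshold and would need calibrating against text known to be unseen.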

Method 3: Completion-Based Behavioural Probing

Without access to perplexity scores (which require white-box access), researchers have developed completion probing: present the model with the beginning of a benchmark question and measure whether it can accurately complete the question text or produce the answer before being asked. Models that have memorised benchmark examples tend to complete them with higher accuracy than they would achieve through reasoning alone.

The 2023 paper "Time Travel in LLMs: Tracing Data Contamination in Large Language Models" (Golchin & Surdeanu) formalised this approach. They found that, for some model and dataset pairs, a model prompted with the first portion of a benchmark instance could generate the remainder verbatim or nearly so, which is strong evidence of memorisation. Their prompt, which they called a "guided instruction" (referred to in this module as guided prompting), names the dataset and split before asking for the completion, and they applied the test across multiple benchmarks and models.
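A completion probe can be sketched with the model abstracted away as a callable. Here `complete` is a hypothetical function that takes a prompt and returns the model's continuation, and `difflib`'s similarity ratio is a simple stand-in, not the overlap metric used in the paper.

```python
from difflib import SequenceMatcher
from typing import Callable


def split_instance(text: str, prefix_fraction: float = 0.5) -> tuple[str, str]:
    """Split a benchmark instance into a prompt prefix and the held-back remainder (by words)."""
    words = text.split()
    cut = max(1, int(len(words) * prefix_fraction))
    return " ".join(words[:cut]), " ".join(words[cut:])


def completion_similarity(text: str, complete: Callable[[str], str]) -> float:
    """Similarity in [0, 1] between the model's continuation and the true remainder."""
    prefix, remainder = split_instance(text)
    generated = complete(prefix)
    # Compare word sequences so small punctuation differences matter less than with characters.
    return SequenceMatcher(None, generated.split(), remainder.split()).ratio()
```

Scores near 1 across many instances suggest verbatim recall. Scores should be compared against a baseline prompt that omits the dataset cue, since formulaic text can be completed correctly without memorisation.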

Method 4: Paraphrase Degradation Testing

Perhaps the most practically actionable detection method: re-run the benchmark with semantically equivalent but lexically varied questions. If performance on the paraphrased version is significantly lower than on the original, contamination is the likely explanation. This approach was used in the MIT/UW 2023 study.

The limitation is constructing high-quality paraphrases — paraphrases that change surface form without inadvertently changing difficulty or introducing ambiguity. Automated paraphrasing with LLMs introduces its own biases; human paraphrasing is expensive at benchmark scale.
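Whether a drop is "significantly lower" can be checked with a simple test statistic. The sketch below uses a pooled two-proportion z statistic and assumes the same number of items `n` in both versions. Because the same items are answered twice, a paired test such as McNemar's would be the more appropriate choice in practice; the unpaired version is shown only because it is the shortest to write down.

```python
import math


def paraphrase_drop(correct_orig: int, correct_para: int, n: int) -> tuple[float, float]:
    """Accuracy drop (original minus paraphrased) and a pooled two-proportion z statistic."""
    p_orig, p_para = correct_orig / n, correct_para / n
    pooled = (correct_orig + correct_para) / (2 * n)
    # Standard error of the difference between two proportions under the pooled estimate.
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p_orig - p_para) / se if se > 0 else 0.0
    return p_orig - p_para, z
```

A z value above roughly 1.96 corresponds to the conventional 5% two-sided threshold, but a significant drop can also come from paraphrases that are accidentally harder, which is the limitation discussed above.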

Method 5: Temporal Analysis

Benchmarks and training corpora both carry dates: the benchmark's public release and the model's training data cutoff. If the training cutoff predates the benchmark's release and the model still shows high performance, contamination via that specific benchmark is unlikely (though the model may have trained on similar-distribution data). Conversely, if the benchmark was released several months or more before the training cutoff, the window for contamination is wide.

Epoch AI and other research organisations have used temporal analysis to assess contamination risk across benchmark-model pairs, flagging combinations where the temporal window is suspicious relative to observed performance.
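The date comparison behind temporal analysis is simple enough to sketch directly. The 90-day boundary between a "narrow" and a "wide" window is an arbitrary assumption chosen for illustration, as are the function names and labels; this is not a description of any organisation's actual procedure.

```python
from datetime import date


def exposure_days(benchmark_release: date, training_cutoff: date) -> int:
    """Days the benchmark was public before the cutoff; zero or negative if released at or after it."""
    return (training_cutoff - benchmark_release).days


def temporal_risk(benchmark_release: date, training_cutoff: date, wide_after_days: int = 90) -> str:
    """Coarse label for one benchmark-model pair based only on dates."""
    days = exposure_days(benchmark_release, training_cutoff)
    if days <= 0:
        return "released at or after cutoff"
    if days < wide_after_days:
        return "narrow window"
    return "wide window"
```

The label says only whether exposure was possible, not whether it happened, which is why the method is described as correlational.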

Limitations of All Current Methods

No current detection method is definitive. N-gram overlap requires training data access most researchers don't have. Perplexity probing requires white-box access to logits. Completion probing can produce false positives (a model might know the answer without memorising the benchmark). Paraphrase testing is expensive and methodologically fraught. Temporal analysis is correlational. The honest conclusion is that the field lacks a gold-standard contamination detection protocol — which is itself an argument for prevention rather than relying on detection.

The Detection Gap

Contamination is nearly impossible to definitively rule out from the outside. This asymmetry — where proving contamination is hard and disproving it is harder — puts the burden of credibility on the labs that produce training data, not on external researchers trying to audit them.

Lesson 3 Quiz

Detection Methods

1. What is the core limitation of n-gram overlap analysis as a contamination detection tool?

Correct. The double limitation — access requirement and imperfect signal — makes n-gram overlap necessary but not sufficient as a contamination audit tool.

The key limitations are the training data access requirement and the imperfect relationship between overlap and score inflation. Even high overlap doesn't guarantee the model learned the answer from those specific occurrences.

2. Completion-based behavioural probing detects contamination by:

Correct. This "guided prompting" approach, formalised by Golchin & Surdeanu (2023), exploits the fact that memorised sequences can be retrieved with high fidelity given partial cues.

Completion probing works by testing whether partial cues from a benchmark question trigger verbatim completion — a behavioural signature of memorisation rather than reasoning.

3. Why does the existence of multiple imperfect detection methods argue for prevention over detection as a strategy?

Correct. The detection gap — the difficulty of definitively proving or disproving contamination from the outside — is the core argument for building prevention into training pipelines rather than auditing after the fact.

The argument for prevention stems from the fundamental inadequacy of post-hoc detection: with no definitive method available, post-hoc auditing cannot restore confidence in already-reported results.

Lab 3 — Detection Tool Comparison

Work through the tradeoffs of contamination detection methods.

Your Task

In this lab, you'll work through the strengths and weaknesses of each contamination detection method with the AI tutor. Think about which methods you could apply as an external researcher (no training data access) versus an internal researcher (full access). Ask at least three substantive questions to complete the lab.

Suggested opening: "If I'm an external researcher trying to audit a closed-source model for contamination on MMLU, which detection methods are actually available to me and which are blocked by lack of training data access?"

Module 4 · Lesson 4

Prevention, Mitigation, and Living with Uncertainty

Decontamination pipelines, dynamic benchmarks, and the evaluation arms race.

What are the best available strategies for preventing contamination — and what are their costs and limits?

Every benchmark that becomes publicly associated with frontier model performance becomes a target for inadvertent contamination. The more prominent a benchmark — the more papers cite it, the more leaderboards feature it, the more blog posts discuss it — the more likely its questions are to appear in future training crawls. In this sense, benchmark success is self-defeating: the benchmarks that matter most are the ones most at risk.

Prevention Strategy 1: Decontamination Pipelines

The most direct prevention approach is to build explicit decontamination into the training data pipeline: before finalising training data, systematically search for and remove benchmark examples. This requires knowing in advance which benchmarks will be used for evaluation — a planning requirement that is easy to satisfy for widely used benchmarks but harder for newly proposed ones.

Practical decontamination pipelines vary in aggressiveness. A minimal approach removes exact matches of benchmark questions. A more aggressive approach removes documents with high n-gram overlap. The most aggressive approaches attempt to remove documents topically related to benchmark domains, but this risks removing genuinely useful training signal.

EleutherAI's decontamination approach for their evaluation harness sets a notable precedent: when evaluating models trained on the Pile (their open training dataset), they removed any training examples with 13-gram overlap with evaluation benchmark test sets. This transparent, documented decontamination protocol became a reference point for the open-source community.
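A 13-gram filter of the kind described above can be sketched in a few lines. This version assumes lowercased whitespace tokenisation and in-memory lists, both simplifications; it illustrates the criterion and is not EleutherAI's implementation.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Set of word n-grams after lowercasing and whitespace tokenisation."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def decontaminate(train_docs: list[str], test_examples: list[str], n: int = 13) -> list[str]:
    """Keep only training documents that share no n-gram with any test example."""
    blocked: set[tuple[str, ...]] = set()
    for example in test_examples:
        blocked |= ngrams(example, n)
    return [doc for doc in train_docs if ngrams(doc, n).isdisjoint(blocked)]
```

Test examples shorter than `n` words produce no n-grams and can never trigger a removal, so short questions need a separate rule such as exact matching.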

Prevention Strategy 2: Private or Partially-Hidden Benchmarks

If benchmark questions never appear on the public internet, they cannot contaminate training corpora derived from web crawls. Several organisations maintain private evaluation sets for this reason. BIG-bench (Beyond the Imitation Game benchmark, a large collaboration led by Google researchers, 2022) included a deliberate mix of public and non-public tasks to allow comparative analysis. HELM (Holistic Evaluation of Language Models, Stanford 2022) used some licensed datasets not freely available online.

The limitation is significant: private benchmarks cannot be independently replicated. A lab claims X% on a private test set and external researchers must take this on faith or run their own evaluation. This trades contamination risk for verification risk — a difficult tradeoff.

The Tradeoff Matrix

Public benchmark: Reproducible, independently verifiable, contamination-prone. Private benchmark: Contamination-resistant, not independently verifiable. Neither is strictly superior — the appropriate choice depends on the evaluation purpose and the importance of independent verification.

Prevention Strategy 3: Dynamic and Living Benchmarks

If benchmark questions are rotated, generated on-the-fly, or continuously refreshed, contamination is structurally prevented — a model cannot memorise questions that did not exist when it was trained. Several efforts have pursued this direction:

Dynabench (Facebook AI, 2021)

A platform for dynamic adversarial data collection where new examples are continuously added via human-AI collaboration. Benchmark questions are created after model training, making memorisation impossible. Applied to NLI, QA, and sentiment tasks.

HELMET (2024, Princeton)

A suite designed with contamination resistance explicitly in mind, using tasks that require reasoning over novel, generated inputs rather than retrieving memorised facts. Preliminary results showed more differentiation between model sizes than static benchmarks.

LiveBench (2024)

A benchmark that uses questions derived from recently published papers, news, and competition problems — all postdating typical training cutoffs. Updated monthly. Designed so that any model claiming strong performance must be reasoning from genuinely novel inputs.

Prevention Strategy 4: Held-Out Private Test Sets with Third-Party Audit

A proposed institutional solution: evaluation benchmarks are maintained by a trusted third party (an independent research organisation or standards body). Labs submit models for evaluation; the third party runs the evaluation against questions that are never disclosed publicly. Results are certified by the third party rather than self-reported by the lab.

This model is analogous to financial auditing — the entity being audited does not get to design or see the audit questions. MLCommons (which administers MLPerf benchmarks) uses variants of this approach for inference benchmarks. Extending it to capability evaluation for LLMs remains an ongoing institutional challenge.

Living with Residual Uncertainty

Even with aggressive prevention, complete elimination of contamination risk is probably not achievable. The internet is vast; training data pipelines make tradeoffs; new benchmarks emerge after training decisions are made. The realistic goal is not zero contamination but rather contamination that is disclosed, quantified where possible, and appropriately caveated in benchmark reporting.

The responsible evaluation practice that is emerging — slowly — involves four elements: running explicit decontamination analysis, disclosing any detected overlap, reporting scores with and without suspected contamination, and using multiple diverse benchmarks so that contamination of any single one does not dominate the overall evaluation picture.

The Role of Benchmark Diversity

One underappreciated mitigation is simply using many benchmarks. If a model is contaminated on MMLU but MMLU is one of twenty evaluation dimensions, the contamination signal is diluted. If MMLU is the primary or only evaluation, contamination is catastrophic. HELM's approach of evaluating across dozens of tasks and scenarios, and reporting disaggregated scores rather than a single composite, provides more resilience to single-benchmark contamination than leaderboard-style ranking systems.
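The dilution effect is plain arithmetic. A tiny sketch, assuming an unweighted mean over benchmarks and that exactly one benchmark is inflated (the function name is illustrative):

```python
def mean_score_shift(inflation_points: float, num_benchmarks: int) -> float:
    """Shift in an unweighted mean score when exactly one benchmark is inflated."""
    if num_benchmarks < 1:
        raise ValueError("need at least one benchmark")
    return inflation_points / num_benchmarks
```

For example, a 10-point inflation on one of twenty benchmarks moves the mean by 0.5 points, while the same inflation on a single-benchmark evaluation moves it by the full 10. The per-benchmark score is still wrong either way, which is why disaggregated reporting matters alongside diversity.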

The Field's Direction

The community is moving — unevenly — toward dynamic benchmarks, third-party evaluation, and multi-dimensional reporting. Each approach has costs: dynamic benchmarks are expensive to maintain; third-party evaluation requires institutional trust and coordination; disaggregated reporting is harder to communicate to non-expert audiences. The path forward is probably some combination of all three rather than any single solution.

Lesson 4 Quiz

Prevention, Mitigation, and Living with Uncertainty

1. EleutherAI's decontamination approach for their evaluation harness removed training examples with what overlap criterion?

Correct. The 13-gram overlap criterion became a reference standard in the open-source community because it was transparent, documented, and operationally specific.

EleutherAI used 13-gram overlap as their decontamination criterion — a specific, documented standard that became influential in the open-source community.

2. What is the core tradeoff between public and private benchmarks in the context of contamination?

Correct. This is the fundamental tradeoff — contamination resistance versus independent verifiability — and neither option dominates. The appropriate choice depends on evaluation purpose and context.

The key tension is between contamination resistance (favours private) and independent verifiability (favours public). Neither is strictly superior; the tradeoff depends on what you're trying to protect against.

3. Why does LiveBench (2024) claim structural contamination resistance without requiring private benchmark questions?

Correct. Temporal contamination resistance — using questions from after the training cutoff — is a clever structural solution that maintains public accessibility while limiting memorisation risk.

LiveBench's innovation is temporal: using questions derived from events and publications after training cutoffs. This provides contamination resistance without requiring question secrecy.

Lab 4 — Prevention Strategy Design

Design a contamination prevention approach for a realistic evaluation scenario.

Your Task

In this lab you'll work with the AI tutor to design a contamination prevention strategy. Imagine you're responsible for evaluating a new frontier model and need to produce benchmark results that stakeholders will trust. Consider the tradeoffs between different prevention approaches for your scenario. Ask at least three substantive questions to complete the lab.

Suggested opening: "I need to evaluate a model that was trained with a data cutoff of January 2025. I want to use MMLU, but I'm worried about contamination. Walk me through what I should do before I report results."

AI Tutor — Prevention Strategy

Welcome to Lab 4. Let's work through contamination prevention design together. Tell me about the evaluation scenario you're working with, and we'll think through what strategies make sense and at what cost.

Module Test — Benchmark Contamination

15 questions · Score 80% or above to pass · All four lessons

1. Benchmark contamination occurs when:

Correct.

Contamination specifically refers to training data including evaluation examples, undermining the measurement validity of the benchmark.

2. "Input contamination" (as opposed to input-output contamination) means:

Correct. Input contamination is milder because the model still needs to reason to the answer rather than recall it directly.

Input contamination is question-only leakage — the answer key is not in training data. This is considered less severe than full input-output contamination.

3. The MIT/University of Washington 2023 study detected contamination by observing that:

Correct. Paraphrase degradation — performance dropping on semantically equivalent but lexically different questions — is the behavioural signature of memorisation.

The paraphrase degradation finding is the key result: genuine understanding should be robust to surface rephrasing; memorisation is not.

4. OpenAI's GPT-4 contamination analysis found overlap with which benchmarks?

Correct. The proactive disclosure across multiple benchmarks, with annotated results, was notable for its transparency relative to industry norms.

OpenAI found and disclosed overlap with AMC competitions, MMLU, HumanEval, and others — and reported results with appropriate caveats for contaminated subsets.

5. Phi-1's strong HumanEval score is considered a "grey zone" contamination case because:

Correct. Distribution overlap sits in a conceptual grey zone: the model was trained on similar data, not the benchmark itself, which makes contamination claims harder both to establish and to rebut.

The grey zone arises from distribution overlap: Phi-1's training data was similar to, not identical to, HumanEval questions. This makes contamination claims hard to prove or refute.

6. Llama 2's benchmark scores were contested by external researchers primarily because:

Correct. Score discrepancies between self-reported and community-replicated results, combined with methodological concerns about the contamination analysis, drove the controversy.

The Llama 2 controversy involved discrepancies between Meta's reported scores and community re-evaluations, plus criticism of Meta's contamination analysis methodology.

7. Perplexity-based contamination probing works by:

Correct. The perplexity gap between original and paraphrased versions is the contamination signal — memorised text gets lower perplexity (higher probability) than semantically equivalent novel phrasing.

Perplexity probing exploits the fact that models assign higher probability to memorised sequences. A gap between original and paraphrased versions indicates the model has memorised the specific phrasing.
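The gap described above can be computed directly from per-token log-probabilities. The sketch below assumes you have already obtained natural-log token probabilities from a model with logit access; the numbers are made up for illustration, and the function names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity is exp of the negative mean per-token natural-log probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_gap(original_logprobs, paraphrase_logprobs):
    """Ratio of paraphrase perplexity to original perplexity.

    A ratio well above 1 means the model finds the original wording much
    less surprising than an equivalent rewording.
    """
    return perplexity(paraphrase_logprobs) / perplexity(original_logprobs)


# Made-up log-probabilities, for illustration only.
original = [-0.1, -0.2, -0.1, -0.3, -0.1]
paraphrase = [-1.2, -0.9, -1.5, -1.1, -1.3]

print(round(perplexity_gap(original, paraphrase), 2))
```

A single large ratio is weak evidence on its own, since paraphrases can simply be less natural; in practice the gap would be compared against the same ratio measured on text the model is known not to have seen.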

8. The "guided prompting" contamination test, formalised by Golchin & Surdeanu (2023), involves:

Correct. Verbatim completion of benchmark question text from partial cues is a strong signal of memorisation — a model that only understands the concept should not be able to reproduce the exact original phrasing.

Guided prompting tests whether a model can complete benchmark questions from partial cues — verbatim completion implies the full question was in training data.
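The completion check can be sketched as follows. This is a simplified illustration, not Golchin & Surdeanu's exact protocol: `complete` is a hypothetical callable standing in for whatever model API you use, the stub "models" are invented, and word-level `difflib` similarity is an assumed stand-in for a proper overlap metric.

```python
from difflib import SequenceMatcher


def completion_similarity(complete, item_text, split_at=0.5):
    """Give the model the first part of an item and score its continuation.

    Returns a 0..1 word-level similarity between the model's continuation
    and the true remainder of the benchmark item.
    """
    words = item_text.split()
    cut = max(1, int(len(words) * split_at))
    prefix = " ".join(words[:cut])
    reference = words[cut:]
    candidate = complete(prefix).split()
    return SequenceMatcher(None, candidate, reference).ratio()


# Invented item and two stub models (assumptions for illustration).
item = ("A train leaves the station at noon travelling north "
        "at sixty miles per hour")


def memoriser(prefix):
    # Reproduces the original continuation verbatim.
    return "travelling north at sixty miles per hour"


def generaliser(prefix):
    # Produces a plausible but different continuation.
    return "and arrives two hours later"


print(completion_similarity(memoriser, item))
print(completion_similarity(generaliser, item))
```

Because the check only needs generated text, it works against black-box APIs; the threshold for "suspiciously similar" is a judgment call that should be calibrated on items known to be unseen.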

9. Which detection method requires white-box access to a model's internal logits?

Correct. Perplexity computation requires access to the model's output probability distribution, which is available for open-weight models but typically not for closed-source API-only models.

Perplexity probing requires logit access — the model's probability assignments — making it unavailable for black-box API-only models. The other methods can work from model outputs alone.

10. EleutherAI's decontamination standard for their open-source evaluation harness used what threshold?

Correct. The 13-gram standard became a widely cited reference point for decontamination in open-source evaluation practice.

EleutherAI used 13-gram overlap as their documented decontamination criterion — specific enough to be operationalised and transparent enough to be reproduced.

11. Dynamic benchmarks like LiveBench (2024) achieve contamination resistance primarily through:

Correct. Temporal contamination resistance — using post-cutoff sources — allows public questions to remain contamination-free because they didn't exist when training data was collected.

LiveBench's key innovation is temporal: questions from after training cutoffs cannot have been memorised during training, regardless of whether they are publicly accessible.

12. Dynabench (Facebook AI, 2021) addressed contamination through:

Correct. Dynabench's continuous collection model structurally prevents contamination by ensuring the benchmark grows after models are trained.

Dynabench uses a living benchmark approach: continuous collection of new examples via human-AI collaboration, so the benchmark always contains questions newer than any model being evaluated.

13. The core tradeoff between public and private benchmarks is:

Correct. This fundamental tradeoff — contamination risk vs. verification risk — shapes the design choices for any evaluation programme.

The core tension is reproducibility versus contamination resistance. A private benchmark is much harder to contaminate but also can't be independently audited, which trades one trust problem for another.

14. Using multiple diverse benchmarks rather than relying on a single benchmark mitigates contamination risk because:

Correct. Benchmark diversity is a practical contamination resilience strategy — it reduces the stakes of any single benchmark being contaminated.

Benchmark diversity mitigates contamination impact: if one benchmark is contaminated, it distorts a smaller share of a many-benchmark evaluation than it would as the sole metric.
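A quick worked example of the proportion argument: if scores are averaged with equal weight, inflating one of k benchmarks by some amount shifts the average by that amount divided by k. The benchmark names and scores below are invented, and equal weighting is an assumption of this sketch.

```python
def mean_score(scores):
    """Unweighted mean over a dict of benchmark name -> score."""
    return sum(scores.values()) / len(scores)


def inflation_effect(clean_scores, contaminated_name, inflation):
    """How much the mean shifts when one benchmark's score is inflated."""
    observed = dict(clean_scores)
    observed[contaminated_name] += inflation
    return mean_score(observed) - mean_score(clean_scores)


# Invented scores for hypothetical benchmarks.
single = {"bench_a": 60.0}
diverse = {"bench_a": 60.0, "bench_b": 55.0, "bench_c": 70.0,
           "bench_d": 65.0, "bench_e": 50.0}

print(inflation_effect(single, "bench_a", 10.0))   # whole inflation passes through
print(inflation_effect(diverse, "bench_a", 10.0))  # inflation diluted across five
```

The dilution only helps if the benchmarks are contaminated independently; benchmarks scraped from the same sources may be inflated together.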

15. The "benchmark success is self-defeating" observation refers to:

Correct. This is the central irony of benchmark contamination: the benchmarks that the field most relies upon for progress measurement are the ones most at risk of being contaminated by that very reliance.

Benchmark success is self-defeating because prominence creates internet presence, and internet presence creates training data inclusion risk. The most trusted benchmarks become the most contaminated.