In September 2023, researchers at MIT and the University of Washington published a paper titled "Investigating Data Contamination in Modern LLMs." They found that several frontier models showed statistically anomalous performance on certain multiple-choice benchmarks — performance that dropped meaningfully when the questions were slightly paraphrased. The implication was uncomfortable: models appeared to have memorised specific answer sequences, not learned the underlying reasoning the benchmarks were designed to test.
Benchmark contamination — also called data contamination or test leakage — occurs when examples from a held-out evaluation dataset appear in a model's training corpus. The result is that a model's reported score on that benchmark no longer reliably measures the capability the benchmark was designed to assess. The model may have, in effect, memorised the questions and answers rather than generalised the skill.
The internet is enormous. Modern LLMs are trained on corpora of hundreds of billions to trillions of tokens drawn from Common Crawl, GitHub, academic repositories, and countless other sources. Benchmark datasets (BoolQ, HellaSwag, MMLU, HumanEval, and many others) are publicly available. Their questions, answer keys, and even their metadata have been indexed, discussed, reproduced, and mirrored across the web. Keeping them out of a multi-terabyte training corpus is genuinely difficult.
Contamination is usually unintentional. Training pipelines consume web data at scale; no human reviews every token. The problem is structural: the incentive to publish benchmarks publicly (for reproducibility) conflicts with the need to keep evaluation data unseen. This tension has no easy resolution.
Researchers distinguish several contamination scenarios, each with different severity:
A fundamental goal of evaluation is to measure generalisation — the capacity to handle novel situations. Contamination short-circuits this by converting a generalisation test into a recall test. A model that has memorised MMLU answers achieves high scores by retrieval, not reasoning. When those same questions are rephrased or the answer options are shuffled, the memorisation advantage erodes, which is exactly the pattern the 2023 MIT/UW study documented.
OpenAI's technical report for GPT-4 (March 2023) addressed contamination directly, noting that they performed contamination analysis by checking for n-gram overlap between training data and benchmark questions. They found some overlap with several benchmarks, including MMLU, and disclosed this explicitly — a relatively rare degree of transparency at the time. Their reported scores for contaminated subsets were annotated with caveats.
The same property that makes benchmarks useful — public accessibility — makes them vulnerable. Once a benchmark question is on the internet, it is a candidate for training data inclusion. This is a systems problem, not a character problem.
The problem predates large language models. In supervised machine learning, the distinction between training, validation, and test sets has been foundational since the 1990s. Kaggle competitions have documented cases where participants inadvertently or deliberately incorporated test labels. The NLP community grappled with train-test overlap in earlier systems: the SQuAD reading comprehension benchmark, for instance, had passages drawn from Wikipedia, and early neural models were trained on Wikipedia, creating at least indirect overlap concerns. What changed with LLMs is scale: training corpora are now so vast that some contamination of any widely mirrored public benchmark is close to certain rather than merely possible.
In this lab you'll discuss benchmark contamination with an AI tutor. Explore the mechanics of how contamination enters training data, the difference between input and input-output contamination, and how researchers detect it. Ask at least three substantive questions to complete the lab.
When OpenAI released the GPT-4 technical report, it included an unusual section: a contamination analysis. The team had checked n-gram overlap between their training data and every evaluation benchmark they reported results on. They found varying degrees of overlap with AMC 2022/2023, MMLU, HumanEval, and several others. Rather than omitting contaminated benchmarks, they disclosed the findings and, where overlap was high, provided results both with and without the potentially contaminated examples. It was an imperfect but transparent approach that set a precedent.
OpenAI's analysis found that AMC 2022 and 2023 math competitions had substantial overlap with training data. However, on closer inspection, the contaminated subset did not show dramatically higher performance than the clean subset — suggesting the contamination may not have provided a meaningful advantage, or that the benchmark was simply measuring something the model had genuinely learned. This nuance is important: overlap does not automatically mean inflated scores.
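The clean-versus-contaminated comparison described above can be sketched in a few lines. This is a minimal illustration, not OpenAI's actual analysis code: the `Result` record, the `subset_report` helper, and the data are all hypothetical, and the `contaminated` flag is assumed to come from some separate overlap check.

```python
from dataclasses import dataclass


@dataclass
class Result:
    correct: bool        # did the model answer this item correctly?
    contaminated: bool   # was the item flagged by an overlap check?


def accuracy(results):
    """Fraction of correct answers; None when the subset is empty."""
    results = list(results)
    if not results:
        return None
    return sum(r.correct for r in results) / len(results)


def subset_report(results):
    """Report accuracy overall, on unflagged items, and on flagged items."""
    clean = [r for r in results if not r.contaminated]
    flagged = [r for r in results if r.contaminated]
    return {
        "overall": accuracy(results),
        "clean": accuracy(clean),
        "flagged": accuracy(flagged),
        "n_clean": len(clean),
        "n_flagged": len(flagged),
    }


# Made-up data: 6 clean items (4 correct) and 4 flagged items (3 correct).
results = (
    [Result(True, False)] * 4 + [Result(False, False)] * 2
    + [Result(True, True)] * 3 + [Result(False, True)] * 1
)
report = subset_report(results)
print(report)
```

A small gap between the `clean` and `flagged` figures, as in this toy data, is the situation the paragraph above describes: overlap was detected, but it does not obviously translate into a large score advantage. With subsets this small the difference would not be statistically meaningful either way.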
For HumanEval (a coding benchmark), overlap was lower. OpenAI noted that the nature of code evaluation — running programs and checking outputs — makes contamination less helpful than in multiple-choice settings where answer keys can be memorised directly.
Microsoft Research's Phi-1 model (June 2023) achieved 50.6% pass@1 on HumanEval despite having only 1.3 billion parameters — a result that surprised many observers given that much larger models scored lower. Researchers outside Microsoft raised contamination concerns. The Phi-1 paper acknowledged that their training data included "textbook-quality" synthetically generated code problems that were topically similar to HumanEval. Whether this constituted contamination or simply effective domain coverage was debated, but the episode highlighted how even indirect overlap — not literal benchmark inclusion — can inflate scores when the benchmark and training data share the same distribution.
True contamination (exact benchmark examples in training) is one end of a spectrum. At the other end is distribution overlap — training on data that is drawn from the same distribution as the benchmark. The second is harder to detect and arguably harder to prevent. Phi-1's HumanEval result sits in this grey zone.
Meta's Llama 2 paper (July 2023) reported strong benchmark results across MMLU, TruthfulQA, and other evaluations. External researchers, including a team at Hugging Face, noted that the reported numbers for some benchmarks appeared higher than those produced by community re-evaluations using the same benchmark datasets. Discrepancies were attributed partly to evaluation methodology differences (prompt formatting, shot counts) and partly to possible contamination. Meta provided some contamination analysis in the paper but the methodology was criticised as insufficiently rigorous by several independent researchers.
In early 2024, a group of researchers published a systematic study cataloguing contamination evidence across 30+ open-source and proprietary models, releasing their findings as a public repository. Their methodology involved querying models directly with benchmark questions and measuring whether models could complete answer sequences with higher-than-expected accuracy — a behavioural contamination probe rather than a data-inspection approach. They found evidence of contamination in multiple models across MMLU, ARC-Challenge, and TriviaQA. The study could not establish causation (contamination versus genuine knowledge) but provided quantitative contamination risk scores for each model-benchmark pair.
| Model / Paper | Benchmark | Contamination Finding | Response |
|---|---|---|---|
| GPT-4 (OpenAI, 2023) | AMC 2022/23, MMLU, HumanEval | N-gram overlap detected; disclosed in tech report | Results annotated; clean vs. contaminated subsets reported |
| Phi-1 (Microsoft, 2023) | HumanEval | Distribution overlap via synthetic training data | Acknowledged in paper; debated externally |
| Llama 2 (Meta, 2023) | MMLU, TruthfulQA | Score discrepancies vs. community re-eval | Contamination analysis in paper; methodology criticised |
| Multiple models (2024 study) | MMLU, ARC-Challenge, TriviaQA | Behavioural probing suggested memorisation | Public repository; industry largely did not respond directly |
Across these documented cases, several patterns emerge. First, contamination is almost always discovered after the fact by external researchers, not prevented during training. Second, the response from labs has been variable, with OpenAI's transparency in the GPT-4 report as the positive outlier. Third, contamination rarely produces a simple, clean verdict: the relationship between data overlap and score inflation is complex and depends on the benchmark type, the degree of overlap, and how the model is evaluated.
The field has not yet converged on a standard for contamination disclosure. GPT-4's approach — proactive n-gram analysis with public disclosure — remains one of the most thorough examples. That it is exceptional rather than routine reflects a systemic gap in evaluation norms.
In this lab, discuss the documented contamination cases from Lesson 2 with the AI tutor. Focus on comparing how different labs responded, what the cases reveal about industry norms, and what "distribution overlap" means versus direct contamination. Ask at least three substantive questions.
Most contamination detection research operates under a fundamental constraint: the training data is not publicly available. Researchers studying GPT-4 or Claude cannot download the training corpus and search for benchmark examples. This black-box condition has driven the development of inference-based detection methods — techniques that query model behaviour to infer what the model has memorised.
The most direct method is n-gram overlap: compare substrings of benchmark questions against the training corpus. This requires access to the training data and is therefore used mainly inside labs. OpenAI's GPT-4 contamination analysis took this approach in the form of substring matching: after stripping spaces and symbols, they sampled 50-character substrings from each benchmark example and flagged the example if any of them appeared in a training document.
N-gram overlap has known limitations. A high overlap score does not guarantee score inflation (the model may not have learned the answer from those specific occurrences), and a low overlap score does not guarantee clean evaluation (paraphrased versions of the same question may be present). Threshold selection is also arbitrary — there is no consensus on what length or frequency of overlap constitutes "contamination."
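A substring-overlap check of the kind described above might look like the following sketch. Stripping non-alphanumeric characters and sampling three 50-character substrings per item follow the description in the GPT-4 technical report; lowercasing, the fixed random seed, and all function names are assumptions made for this illustration, and the example texts are invented.

```python
import random
import re


def normalise(text):
    """Lowercase (an assumption) and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def sample_substrings(text, length=50, k=3, rng=None):
    """Pick k random substrings of `length` chars, or the whole text if it is short."""
    rng = rng or random.Random(0)
    if len(text) <= length:
        return [text]
    starts = [rng.randrange(len(text) - length + 1) for _ in range(k)]
    return [text[s:s + length] for s in starts]


def is_flagged(benchmark_item, training_docs, length=50, k=3, rng=None):
    """True if any sampled substring of the item occurs in any training document."""
    probe = normalise(benchmark_item)
    if not probe:
        return False
    needles = sample_substrings(probe, length, k, rng)
    haystacks = (normalise(doc) for doc in training_docs)
    return any(n in h for h in haystacks for n in needles)


question = ("Which of the following best describes the role of "
            "mitochondria in eukaryotic cells?")
paraphrase = "What role do mitochondria play in cells with a nucleus?"
docs = [
    "Forum post: Which of the following best describes the role of "
    "mitochondria in eukaryotic cells? I think B.",
    "An unrelated page about gardening.",
]

print(is_flagged(question, docs))    # the verbatim copy is caught
print(is_flagged(paraphrase, docs))  # the paraphrase slips through
```

The second call illustrates the limitation noted above: a paraphrased copy of the same question produces no substring match, so a low overlap score does not guarantee a clean evaluation.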
Language models assign probability scores to text sequences. If a model has memorised a benchmark question and answer, it will assign lower perplexity (higher probability) to that sequence than to a paraphrased equivalent. Researchers can exploit this: compute a model's perplexity on the original benchmark text versus carefully paraphrased versions. A significant perplexity gap is evidence of memorisation.
The 2021 paper "Extracting Training Data from Large Language Models" by Carlini et al. (Google Brain) demonstrated that models with high perplexity gaps on specific sequences had typically memorised those sequences verbatim. This methodology was later adapted for benchmark contamination detection. The challenge is generating appropriate paraphrases that preserve semantic content while differing sufficiently in surface form.
If a model assigns dramatically lower perplexity to "The capital of France is Paris" than to "France's capital city is Paris," this gap suggests memorisation of the first phrasing specifically — a red flag if that phrasing appears in a benchmark answer key.
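The perplexity comparison can be made concrete with a short sketch. It assumes per-token natural-log probabilities have already been obtained from the model by some means; the function names are hypothetical and the numbers are made up purely for illustration.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_ratio(original_logprobs, paraphrase_logprobs):
    """Ratio well below 1: the original phrasing is far 'easier' for the model."""
    return perplexity(original_logprobs) / perplexity(paraphrase_logprobs)


# Made-up log-probabilities for an original sentence and a paraphrase of it.
original = [-0.1, -0.2, -0.1, -0.05, -0.05]
paraphrase = [-1.2, -0.9, -1.5, -0.8, -1.1]

print(perplexity(original))
print(perplexity(paraphrase))
print(perplexity_ratio(original, paraphrase))
```

How small the ratio must be before it counts as evidence of memorisation is a judgment call; common phrasings can have low perplexity without any benchmark exposure, so a single sentence pair proves little on its own.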
Without access to perplexity scores (which require token-level probabilities that many hosted models do not expose), researchers have developed completion probing: present the model with the beginning of a benchmark question and measure whether it can accurately complete the question text or produce the answer before being asked. Models that have memorised benchmark examples tend to complete them with higher accuracy than they would achieve through reasoning alone.
The 2023 paper "Time Travel in LLMs: Tracing Data Contamination in Large Language Models" (Golchin & Surdeanu) formalised this approach. They found that several models, when prompted with the first portion of a benchmark instance, could often generate the remainder verbatim or nearly so, which is strong evidence of memorisation. They called this "guided instruction" prompting and applied it across multiple benchmarks and models.
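The completion-probing idea described above can be sketched as follows. This is a simplified illustration rather than the published method: `generate` is a hypothetical callable standing in for a model API, the two stub "models" are invented, and character-level similarity from `difflib` is one of several possible ways to score a completion.

```python
from difflib import SequenceMatcher


def split_prompt(text, prefix_fraction=0.5):
    """Split an item into a prefix shown to the model and a held-back remainder."""
    words = text.split()
    cut = max(1, int(len(words) * prefix_fraction))
    return " ".join(words[:cut]), " ".join(words[cut:])


def completion_similarity(generate, text, prefix_fraction=0.5):
    """Similarity in [0, 1] between the model's continuation and the true remainder."""
    prefix, remainder = split_prompt(text, prefix_fraction)
    if not remainder:
        return 0.0
    completion = generate(prefix)
    # Compare only as many characters as the held-back remainder contains.
    return SequenceMatcher(None, completion[:len(remainder)], remainder).ratio()


item = ("Which planet in the solar system has the largest number of "
        "confirmed moons as of the survey date")


def memoriser(prefix):
    # Stub standing in for a model that has seen the item verbatim.
    return item[len(prefix):].strip()


def fresh(prefix):
    # Stub standing in for a model that continues plausibly but differently.
    return "is a question that depends on recent observations"


print(completion_similarity(memoriser, item))
print(completion_similarity(fresh, item))
```

In practice the score would be averaged over many benchmark items and compared against a baseline, such as the same probe run on text known to postdate the model's training data.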
Perhaps the most practically actionable detection method: re-run the benchmark with semantically equivalent but lexically varied questions. If performance on the paraphrased version is significantly lower than on the original, contamination is a likely explanation, provided the paraphrases are of comparable difficulty. This approach was used in the MIT/UW 2023 study.
The limitation is constructing high-quality paraphrases — paraphrases that change surface form without inadvertently changing difficulty or introducing ambiguity. Automated paraphrasing with LLMs introduces its own biases; human paraphrasing is expensive at benchmark scale.
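Because each paraphrase is paired with its original question, one reasonable way to test whether a performance drop is "significant" is an exact McNemar test on the items where the two outcomes disagree. The sketch below is an assumption about how such a comparison could be run, not the procedure of any particular study; the function name and the data are hypothetical.

```python
from math import comb


def mcnemar_exact(original_correct, paraphrase_correct):
    """Two-sided exact McNemar test on paired per-item correctness flags.

    Returns (only_original, only_paraphrase, p_value), where the first two
    values count items answered correctly in one version but not the other.
    """
    if len(original_correct) != len(paraphrase_correct):
        raise ValueError("inputs must be paired item by item")
    pairs = list(zip(original_correct, paraphrase_correct))
    only_original = sum(1 for o, p in pairs if o and not p)
    only_paraphrase = sum(1 for o, p in pairs if p and not o)
    n = only_original + only_paraphrase
    if n == 0:
        return only_original, only_paraphrase, 1.0
    k = min(only_original, only_paraphrase)
    # Binomial tail under the null that disagreements fall either way equally.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_original, only_paraphrase, min(1.0, 2 * tail)


# Made-up results for 20 paired items:
# 10 correct in both, 8 correct only on the original,
# 1 correct only on the paraphrase, 1 wrong in both.
original = [True] * 10 + [True] * 8 + [False] * 1 + [False] * 1
paraphrase = [True] * 10 + [False] * 8 + [True] * 1 + [False] * 1

b, c, p = mcnemar_exact(original, paraphrase)
print(b, c, p)
```

A small p-value says only that the drop is unlikely to be chance. It is consistent with memorisation but does not exclude the alternative raised above, that the paraphrases are simply harder or more ambiguous than the originals.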
Some benchmarks have associated timestamps — when the benchmark was released versus when training data was collected. If a model's training cutoff predates a benchmark's release and the model still shows high performance, contamination via that specific benchmark is less likely (though the model may have trained on similar-distribution data). Conversely, if benchmark release predates the training cutoff by several months, the window for contamination is wide.
Epoch AI and other research organisations have used temporal analysis to assess contamination risk across benchmark-model pairs, flagging combinations where the temporal window is suspicious relative to observed performance.
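The temporal reasoning above reduces to simple date arithmetic. The following sketch is illustrative only: the function names, the 90-day threshold for a "wide" window, and the dates are assumptions, not values used by Epoch AI or any other organisation.

```python
from datetime import date


def contamination_window_days(benchmark_release, training_cutoff):
    """Days the benchmark was public before the training cutoff (0 if none)."""
    return max(0, (training_cutoff - benchmark_release).days)


def temporal_risk(benchmark_release, training_cutoff, wide_window_days=90):
    """Coarse label for the exposure window; the threshold is an arbitrary assumption."""
    window = contamination_window_days(benchmark_release, training_cutoff)
    if window == 0:
        return "benchmark postdates cutoff: direct contamination unlikely"
    if window < wide_window_days:
        return "narrow window: some exposure possible"
    return "wide window: exposure plausible"


# Hypothetical dates for one model and two benchmarks.
cutoff = date(2023, 1, 1)
older_benchmark = date(2021, 1, 1)
newer_benchmark = date(2023, 6, 1)

print(contamination_window_days(older_benchmark, cutoff), temporal_risk(older_benchmark, cutoff))
print(contamination_window_days(newer_benchmark, cutoff), temporal_risk(newer_benchmark, cutoff))
```

The label describes opportunity, not evidence: a wide window shows only that exposure was possible, and a benchmark that postdates the cutoff may still draw on source material that was online earlier.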
No current detection method is definitive. N-gram overlap requires training data access most researchers don't have. Perplexity probing requires white-box access to logits. Completion probing can produce false positives (a model might know the answer without memorising the benchmark). Paraphrase testing is expensive and methodologically fraught. Temporal analysis is correlational. The honest conclusion is that the field lacks a gold-standard contamination detection protocol — which is itself an argument for prevention rather than relying on detection.
Contamination is nearly impossible to definitively rule out from the outside. This asymmetry — where proving contamination is hard and disproving it is harder — puts the burden of credibility on the labs that produce training data, not on external researchers trying to audit them.
In this lab, you'll work through the strengths and weaknesses of each contamination detection method with the AI tutor. Think about which methods you could apply as an external researcher (no training data access) versus an internal researcher (full access). Ask at least three substantive questions to complete the lab.
Every benchmark that becomes publicly associated with frontier model performance becomes a target for inadvertent contamination. The more prominent a benchmark — the more papers cite it, the more leaderboards feature it, the more blog posts discuss it — the more likely its questions are to appear in future training crawls. In this sense, benchmark success is self-defeating: the benchmarks that matter most are the ones most at risk.
The most direct prevention approach is to build explicit decontamination into the training data pipeline: before finalising training data, systematically search for and remove benchmark examples. This requires knowing in advance which benchmarks will be used for evaluation — a planning requirement that is easy to satisfy for widely used benchmarks but harder for newly proposed ones.
Practical decontamination pipelines vary in aggressiveness. A minimal approach removes exact matches of benchmark questions. A more aggressive approach removes documents with high n-gram overlap. The most aggressive approaches attempt to remove documents topically related to benchmark domains, but this risks removing genuinely useful training signal.
EleutherAI's decontamination approach for their evaluation harness sets a notable precedent: when evaluating models trained on the Pile (their open training dataset), they removed any training examples with 13-gram overlap with evaluation benchmark test sets. This transparent, documented decontamination protocol became a reference point for the open-source community.
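A 13-gram overlap filter of the kind described above can be sketched as follows. This is a simplified illustration and not EleutherAI's implementation: it tokenises on whitespace after lowercasing, holds the whole benchmark index in memory, and uses hypothetical function names and invented example texts.

```python
def ngrams(text, n=13):
    """Set of word n-grams in `text`; empty if the text has fewer than n words."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts, n=13):
    """Union of all n-grams that occur in any benchmark item."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def decontaminate(training_docs, benchmark_texts, n=13):
    """Split training documents into (kept, removed) by n-gram overlap."""
    index = build_benchmark_index(benchmark_texts, n)
    kept, removed = [], []
    for doc in training_docs:
        (removed if ngrams(doc, n) & index else kept).append(doc)
    return kept, removed


benchmark = [
    "A train leaves the station at noon travelling east at sixty "
    "kilometres per hour toward the coast"
]
docs = [
    "Practice set: A train leaves the station at noon travelling east at "
    "sixty kilometres per hour toward the coast. Answer: C",
    "Trains are a common way to travel between cities.",
]

kept, removed = decontaminate(docs, benchmark)
print(len(kept), len(removed))
```

Two properties of this sketch matter for the tradeoffs discussed above. Benchmark items shorter than 13 words contribute nothing to the index and can never be matched, and lowering `n` to catch them makes the filter more aggressive, at the cost of removing more ordinary text.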
If benchmark questions never appear on the public internet, they cannot contaminate training corpora derived from web crawls. Several organisations maintain private evaluation sets for this reason. BIG-bench (Beyond the Imitation Game benchmark, Google and collaborators, 2022) included a deliberate mix of public and non-public tasks to allow comparative analysis. HELM (Holistic Evaluation of Language Models, Stanford 2022) used some licensed datasets not freely available online.
The limitation is significant: private benchmarks cannot be independently replicated. A lab claims X% on a private test set and external researchers must take this on faith or run their own evaluation. This trades contamination risk for verification risk — a difficult tradeoff.
Public benchmarks are reproducible and independently verifiable, but contamination-prone. Private benchmarks are contamination-resistant, but not independently verifiable. Neither is strictly superior: the appropriate choice depends on the evaluation purpose and the importance of independent verification.
If benchmark questions are rotated, generated on-the-fly, or continuously refreshed, contamination is structurally prevented — a model cannot memorise questions that did not exist when it was trained. Several efforts have pursued this direction:
A proposed institutional solution: evaluation benchmarks are maintained by a trusted third party (an independent research organisation or standards body). Labs submit models for evaluation; the third party runs the evaluation against questions that are never disclosed publicly. Results are certified by the third party rather than self-reported by the lab.
This model is analogous to financial auditing — the entity being audited does not get to design or see the audit questions. MLCommons (which administers MLPerf benchmarks) uses variants of this approach for inference benchmarks. Extending it to capability evaluation for LLMs remains an ongoing institutional challenge.
Even with aggressive prevention, complete elimination of contamination risk is probably not achievable. The internet is vast; training data pipelines make tradeoffs; new benchmarks emerge after training decisions are made. The realistic goal is not zero contamination but rather contamination that is disclosed, quantified where possible, and appropriately caveated in benchmark reporting.
The responsible evaluation practice that is emerging — slowly — involves four elements: running explicit decontamination analysis, disclosing any detected overlap, reporting scores with and without suspected contamination, and using multiple diverse benchmarks so that contamination of any single one does not dominate the overall evaluation picture.
One underappreciated mitigation is simply using many benchmarks. If a model is contaminated on MMLU but MMLU is one of twenty evaluation dimensions, the contamination signal is diluted. If MMLU is the primary or only evaluation, contamination is catastrophic. HELM's approach of evaluating across dozens of tasks and scenarios, and reporting disaggregated scores rather than a single composite, provides more resilience to single-benchmark contamination than leaderboard-style ranking systems.
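The dilution argument above is simple arithmetic, shown here with made-up numbers. The sketch assumes an unweighted mean over benchmarks and a hypothetical 10-point inflation on one of them; the benchmark names other than MMLU are placeholders, and none of the scores are real results.

```python
def mean_score(scores):
    """Unweighted mean across benchmarks."""
    return sum(scores.values()) / len(scores)


def inflation_effect(scores, benchmark, inflation_points):
    """How much the mean rises because `benchmark` is inflated by `inflation_points`."""
    adjusted = dict(scores)
    adjusted[benchmark] = scores[benchmark] - inflation_points
    return mean_score(scores) - mean_score(adjusted)


# Twenty benchmarks: nineteen placeholders at 60.0, plus MMLU reported at 70.0,
# of which 10 points are assumed (for illustration) to come from contamination.
suite = {f"benchmark_{i}": 60.0 for i in range(19)}
suite["MMLU"] = 70.0

single = {"MMLU": 70.0}

print(inflation_effect(suite, "MMLU", 10.0))   # effect on a 20-benchmark mean
print(inflation_effect(single, "MMLU", 10.0))  # effect when MMLU is the only benchmark
```

Averaging hides the inflated benchmark as well as diluting it, which is one reason disaggregated reporting is useful: the per-benchmark table still shows MMLU standing ten points above everything else.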
The community is moving — unevenly — toward dynamic benchmarks, third-party evaluation, and multi-dimensional reporting. Each approach has costs: dynamic benchmarks are expensive to maintain; third-party evaluation requires institutional trust and coordination; disaggregated reporting is harder to communicate to non-expert audiences. The path forward is probably some combination of all three rather than any single solution.
In this lab you'll work with the AI tutor to design a contamination prevention strategy. Imagine you're responsible for evaluating a new frontier model and need to produce benchmark results that stakeholders will trust. Consider the tradeoffs between different prevention approaches for your scenario. Ask at least three substantive questions to complete the lab.