When the Stanford Question Answering Dataset launched in 2016, it seemed to offer a clean scoreboard for machine reading comprehension. By early 2018, two systems, one from Microsoft and one from Alibaba, posted Exact Match scores above the human baseline on the leaderboard within days of each other. Headlines declared the reading comprehension problem solved. It wasn't.
SQuAD (Stanford Question Answering Dataset) presents models with a Wikipedia passage and a question whose answer is a contiguous span of text within that passage. Performance is measured with two metrics: Exact Match (EM), which checks whether the predicted span matches a gold answer exactly after light normalization (lowercasing and stripping punctuation and articles), and F1, which gives partial credit for overlapping tokens.
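The two metrics can be sketched in a few lines. This is a minimal illustration modeled on the normalization used by the official SQuAD evaluation script (lowercase, strip punctuation and articles, collapse whitespace); the function names are chosen here for illustration, and the real script additionally takes the maximum score over several gold answers per question.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Two empty answers agree; one empty answer is a complete miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction of "in Paris, France" against the gold answer "Paris" gets no EM credit but partial F1 credit, which is why the two numbers are usually reported together.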
The benchmark was carefully constructed: crowdworkers read passages and wrote questions, then separate workers identified answer spans. The result was 100,000+ question-answer pairs grounded in real text. But it had a structural constraint that would prove consequential — every answer had to be extractable verbatim from the passage.
In 2018, Rajpurkar et al. released SQuAD 2.0, adding 50,000 unanswerable questions — questions that looked plausible but had no answer in the passage. Models that had seemingly "solved" SQuAD 1.1 dropped dramatically. The best models at SQuAD 2.0 launch scored around 66–67% F1; humans scored ~89%. The gap had re-opened.
This exposed a core weakness: models were doing sophisticated pattern matching against passage text rather than understanding when information was absent. A model trained to always find an answer span would confidently extract wrong spans when the correct answer was "I don't know."
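One common way to give an extractive model an "I don't know" option is to compare the score of its best span against a separate no-answer score and abstain when the no-answer score wins by a margin. The sketch below assumes hypothetical scores and a hypothetical `threshold` parameter; it illustrates the idea and is not any particular system's implementation.

```python
def choose_answer(best_span: str, span_score: float,
                  null_score: float, threshold: float = 0.0) -> str:
    """Return the best span, or "" (abstain) when no-answer wins by > threshold."""
    if null_score - span_score > threshold:
        return ""
    return best_span


def exact_accuracy(predictions, golds) -> float:
    """Fraction of exact string matches; "" is the gold for unanswerable items."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)


# Hypothetical model outputs: (best span, span score, no-answer score, gold).
EXAMPLES = [
    ("1889", 9.1, 2.0, "1889"),                      # answerable
    ("Gustave Eiffel", 7.5, 3.2, "Gustave Eiffel"),  # answerable
    ("Paris", 4.0, 6.5, ""),                         # unanswerable
    ("324 metres", 3.1, 8.0, ""),                    # unanswerable
]

golds = [gold for _, _, _, gold in EXAMPLES]
always_answer = [span for span, _, _, _ in EXAMPLES]
with_abstention = [choose_answer(span, s, n) for span, s, n, _ in EXAMPLES]

always_answer_acc = exact_accuracy(always_answer, golds)
with_abstention_acc = exact_accuracy(with_abstention, golds)
```

On this toy set the always-answer policy is right only on the answerable half, while the abstaining policy is right on every item. In practice the threshold is tuned on a development set.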
BERT-large, which arrived a few months after SQuAD 2.0 and quickly became the dominant architecture, scored 80.0 EM and 83.1 F1 on SQuAD 2.0, still about 6 F1 points below human performance of 89.5. Architectures that had "beaten" human-level performance on SQuAD 1.1 remained clearly below humans once unanswerable questions were introduced.
BoolQ (Boolean Questions, 2019) shifted the format: given a passage and a yes/no question, the model must output True or False. This sounds simpler but requires genuine inference — questions like "Can you survive a lightning strike?" paired with a passage about lightning strike survival rates demand synthesis, not span extraction.
HotpotQA (2018) introduced multi-hop reasoning: answering a question requires combining information from two separate Wikipedia paragraphs. Both supporting facts must be identified, not just the final answer. This made it much harder to shortcut through superficial text matching.
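Scoring supporting facts as well as the answer can be sketched as set-based precision and recall over (article title, sentence index) pairs, combined with the answer score so that a lucky answer with the wrong evidence is penalized. The sketch below multiplies answer and evidence precision and recall in the spirit of HotpotQA's joint metric; the function names and the toy data are illustrative assumptions.

```python
def set_prf(predicted: set, gold: set) -> tuple[float, float, float]:
    """Precision, recall, and F1 between two sets of items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


def joint_f1(answer_pr: tuple[float, float],
             support_pr: tuple[float, float]) -> float:
    """Combine answer and supporting-fact scores by multiplying P and R."""
    precision = answer_pr[0] * support_pr[0]
    recall = answer_pr[1] * support_pr[1]
    denom = precision + recall
    return 2 * precision * recall / denom if denom > 0 else 0.0


# Toy example: supporting facts are (article title, sentence index) pairs.
gold_support = {("Article A", 0), ("Article B", 2)}
pred_support = {("Article A", 0), ("Article C", 1)}

sp_precision, sp_recall, sp_f1 = set_prf(pred_support, gold_support)
# A fully correct answer (P = R = 1.0) backed by half-correct evidence.
joint = joint_f1((1.0, 1.0), (sp_precision, sp_recall))
```

Here a correct final answer still receives only half credit on the joint score because one of the two cited sentences is wrong.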
The progression (SQuAD → SQuAD 2.0 → BoolQ → HotpotQA) illustrates benchmark iteration: each new benchmark closed a loophole that had allowed models to score well without truly understanding text.
When selecting a QA benchmark, verify whether it tests extraction, inference, or abstention. A model that scores 90% on SQuAD 1.1 may score 70% on SQuAD 2.0 and fail on multi-hop tasks — the same capability gap, revealed by different benchmark design choices.
You're evaluating a reading comprehension system for a legal document search tool. The system must both extract answers from contracts and recognize when a contract doesn't address the question. Discuss which metrics and benchmark design choices apply to this scenario.
When OpenAI published the Codex paper in July 2021, they introduced HumanEval: 164 hand-written programming problems, each with a function signature, docstring, and unit tests. The central metric was pass@k: given k generated solutions, does at least one pass all tests? It was a deliberately functional metric. The code either ran correctly or it didn't.
HumanEval was designed to avoid the contamination problems of code benchmarks derived from GitHub: if a model is trained on GitHub, it may have seen benchmark solutions verbatim. The 164 problems were original, spanning string manipulation, list operations, math, and simple algorithms. Problems came with an average of 7.7 unit tests each.
The key metric, pass@k, works as follows: generate k code samples for a given problem. If any one passes all unit tests, the problem is solved. Measuring this directly with exactly k samples gives a high-variance estimate, so the Codex paper instead generates n ≥ k samples per problem, counts the number c that pass, and computes the unbiased estimate 1 - C(n - c, k) / C(n, k), averaged over problems.
164 problems is a small dataset. By 2023, multiple papers documented that models' HumanEval scores could be inflated by training data contamination: the problems and reference solutions are public and had spread across repositories and discussion forums. A separate weakness, thin test coverage, was addressed by HumanEval+ (Liu et al., 2023, the EvalPlus project), which expanded test coverage per problem, but the sample size remained limited.
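The unbiased estimator from the Codex paper draws n ≥ k samples per problem, counts the c samples that pass every test, and computes 1 - C(n - c, k) / C(n, k): the probability that a random size-k subset of the n samples contains at least one passing sample. A minimal sketch follows, with argument names chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated, c: samples that passed all tests,
    k: the k in pass@k (must satisfy 1 <= k <= n).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With n = 4 samples of which c = 1 passes, pass@2 is 1 - C(3, 2) / C(4, 2) = 0.5. Drawing many more samples than k and applying this formula gives a much more stable number than literally sampling k completions once.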
More importantly, HumanEval problems are self-contained functions. Real software engineering involves multi-file codebases, API integration, debugging existing code, and writing tests themselves. The SWE-bench benchmark (2023, Princeton NLP) addressed this directly: it presents models with actual GitHub issues from real Python repositories and asks them to produce a code patch that passes the repository's existing test suite. SWE-bench pass rates for GPT-4 at launch were under 2% — a stark contrast to HumanEval scores above 85%.
SWE-bench evaluated 2,294 real GitHub issues from 12 popular Python repositories. In the paper's oracle-retrieval setting, where the model is shown the files touched by the reference fix, GPT-4 resolved 1.74% of issues and Claude 2 resolved 4.80%. The gap with HumanEval scores (GPT-4: ~87%, Claude 2: ~71%) illustrated how isolated function completion differs from real-world software engineering.
Google's MBPP (Mostly Basic Programming Problems, 2021) provided 974 crowdsourced problems targeting beginner-level Python. Unlike HumanEval's expert-written problems, MBPP problems came from crowd workers, introducing more natural language variation and some ambiguity in problem statements.
At the other extreme, APPS (Automated Programming Progress Standard, 2021) included competitive programming problems at introductory, interview, and competition difficulty levels. pass@1 on APPS competition problems remained below 5% for all models tested at launch, establishing that code generation models of that period, despite high scores on easier function-level benchmarks, could not reliably solve hard algorithmic problems.
LiveCodeBench (2024) addressed contamination by continuously collecting new, time-stamped problems from competitive programming contests, so that a model can be evaluated only on problems published after its training cutoff.
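The time-based control can be sketched as a simple date filter. The `Problem` record and its fields are hypothetical names used for illustration, not LiveCodeBench's actual schema, and the training cutoff is assumed to be known.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str   # hypothetical identifier
    released: date    # date the contest problem was published


def contamination_free(problems: list[Problem],
                       training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


problems = [
    Problem("contest-101-a", date(2023, 3, 4)),
    Problem("contest-140-c", date(2024, 2, 17)),
    Problem("contest-152-b", date(2024, 6, 1)),
]
eval_set = contamination_free(problems, training_cutoff=date(2023, 12, 31))
```

The filter is only as trustworthy as the stated cutoff, so a drop in accuracy on post-cutoff problems relative to pre-cutoff ones is itself a useful contamination signal.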
| Benchmark | Size | Problem Type | Key Metric |
|---|---|---|---|
| HumanEval | 164 | Self-contained functions | pass@k (functional) |
| MBPP | 974 | Beginner Python | pass@k |
| APPS | 10,000 | Competitive programming | pass@k by difficulty |
| SWE-bench | 2,294 | Real GitHub issues | % resolved (full repo tests) |
| LiveCodeBench | Ongoing | New contest problems | pass@k, contamination-controlled |
Always match code benchmark difficulty to your deployment context. A high HumanEval score tells you about isolated function generation from docstrings — it tells you very little about multi-file debugging, codebase navigation, or test-driven development. For production code generation tools, SWE-bench-style evaluation on real repositories provides far more signal.
Your company is evaluating code generation models to integrate into an internal Python data-engineering tool. Engineers will use it to write ETL scripts, fix bugs in pandas/dask pipelines, and generate SQL queries from natural language. You need to choose or design a benchmark strategy.
BLEU — Bilingual Evaluation Understudy — was published by Papineni et al. at IBM in 2002. For the next two decades, it was the standard for machine translation evaluation. By the 2020s, researchers were publishing papers showing BLEU correlated poorly with human judgments on modern neural MT outputs, and the ACL community was actively debating whether to retire it. The metric had outlived the era of translation it was designed for.
BLEU computes modified n-gram precision between a hypothesis (model output) and one or more reference translations, applying a brevity penalty for short outputs. It correlates well with human judgment when systems are clearly far apart in quality — but poorly discriminates between modern high-quality systems.
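The computation can be sketched for a single sentence as follows: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by the brevity penalty. This is a simplified illustration, since BLEU as originally defined is computed over a whole corpus and practical toolkits add smoothing and standardized tokenization; the function names are chosen here.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hyp: list[str], refs: list[list[str]], n: int) -> float:
    """Clip each hypothesis n-gram count by its maximum count in any reference."""
    hyp_counts = ngrams(hyp, n)
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


def brevity_penalty(hyp_len: int, ref_lens: list[int]) -> float:
    """Penalize hypotheses shorter than the closest reference length."""
    if hyp_len == 0:
        return 0.0
    closest = min(ref_lens, key=lambda r: (abs(r - hyp_len), r))
    return 1.0 if hyp_len > closest else math.exp(1 - closest / hyp_len)


def sentence_bleu(hyp: list[str], refs: list[list[str]], max_n: int = 4) -> float:
    precisions = [modified_precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed: any missing n-gram order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(hyp), [len(r) for r in refs]) * math.exp(log_mean)


reference = "the cat is on the mat".split()
paraphrase = "a feline sits on the rug".split()
paraphrase_score = sentence_bleu(paraphrase, [reference])
```

The paraphrase shares no trigram with the reference, so its unsmoothed score is zero even though the meaning is close. Clipping also blocks the degenerate output "the the the the the the the", whose unigram precision is 2/7 rather than 7/7.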
A 2020 meta-analysis by Mathur et al. at ACL demonstrated that BLEU had near-zero correlation with human rankings when comparing top-performing neural MT systems on the WMT19 shared task. Systems ranked very differently by human evaluators had almost identical BLEU scores. The problem: BLEU rewards n-gram overlap but ignores paraphrase, synonymy, and word order at the sentence level.
The CNN/DailyMail dataset, originally released by Hermann et al. (2015) at DeepMind for reading comprehension, became the standard summarization benchmark. News articles were paired with their human-written bullet-point highlights, which served as reference summaries. ROUGE-1, ROUGE-2, and ROUGE-L became the default metrics.
By 2020, multiple papers documented that high-ROUGE abstractive summaries were frequently factually inconsistent with source documents. A 2020 study by Kryscinski et al. (Salesforce Research) found that 30% of summaries from then-state-of-the-art models contained factual errors. ROUGE had no mechanism to detect hallucination — a model that made up plausible-sounding facts could score just as high as a faithful one.
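The blind spot is easy to reproduce with a toy example. ROUGE compares a candidate summary only against the reference summary, never against the source article, so two candidates that differ in a fact the reference does not mention receive the same score. The sketch below implements a simplified ROUGE-N F1 (whitespace tokenization, no stemming); the sentences are invented for illustration.

```python
from collections import Counter


def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """Simplified ROUGE-N F1 over lowercased whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_grams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_grams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum((cand_grams & ref_grams).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_grams.values())
    recall = overlap / sum(ref_grams.values())
    return 2 * precision * recall / (precision + recall)


# The source is never consulted by the metric, which is exactly the problem.
source = "The company reported a quarterly profit of 5 million dollars."
reference = "company posts strong quarterly profit"
faithful = "company posts quarterly profit of 5 million dollars"
hallucinated = "company posts quarterly profit of 9 million dollars"

faithful_score = rouge_n_f1(faithful, reference)
hallucinated_score = rouge_n_f1(hallucinated, reference)
```

Both candidates score identically, although the second one contradicts the source. A factuality check has to compare the summary against the source document itself.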
This led to factuality-aware evaluation: FactCC, DAE, and similar metrics that check whether each sentence of a summary is logically entailed by the source document. SummEval (2021, Yale and Salesforce Research) provided a large-scale human annotation study comparing 14 automatic metrics against human judgments of coherence, consistency, fluency, and relevance.
Kryscinski et al. evaluated summarization systems on CNN/DailyMail using a trained fact-checking model (FactCC). Models achieving state-of-the-art ROUGE scores still produced a large share of summaries, roughly 30%, that were inconsistent with the source article. High ROUGE and high factuality were not correlated.
BERTScore (Zhang et al., 2020) addressed the semantic blindness problem by computing greedy matching between BERT embeddings of hypothesis and reference tokens. It outperformed BLEU and ROUGE in correlation with human judgments across multiple datasets. However, BERTScore is still reference-dependent: it requires gold-standard reference outputs.
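Greedy matching can be sketched without a neural model by substituting small hand-made vectors for the contextual embeddings. Each reference token is matched to its most similar hypothesis token (recall), each hypothesis token to its most similar reference token (precision), and the two are combined into F1. The two-dimensional vectors below are invented for illustration; the real metric uses BERT embeddings and offers optional idf weighting and baseline rescaling.

```python
import math

Vector = tuple[float, ...]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_f1(hyp_vecs: list[Vector], ref_vecs: list[Vector]) -> float:
    """F1 from best-match cosine similarity in each direction."""
    if not hyp_vecs or not ref_vecs:
        return 0.0
    recall = sum(max(cosine(r, h) for h in hyp_vecs) for r in ref_vecs) / len(ref_vecs)
    precision = sum(max(cosine(h, r) for r in ref_vecs) for h in hyp_vecs) / len(hyp_vecs)
    denom = precision + recall
    return 2 * precision * recall / denom if denom > 0 else 0.0


# Toy "embeddings" (assumed, not produced by any model).
TOY = {"cat": (1.0, 0.0), "feline": (0.9, 0.1), "car": (0.0, 1.0)}

synonym_score = greedy_match_f1([TOY["feline"]], [TOY["cat"]])
unrelated_score = greedy_match_f1([TOY["car"]], [TOY["cat"]])
```

With exact-match n-gram metrics, "feline" and "car" would both score zero against "cat". With embedding similarity, the synonym scores close to 1 and the unrelated word scores 0.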
COMET (Crosslingual Optimized Metric for Evaluation of Translation, 2020) went further: a learned metric trained directly on human quality judgments from WMT shared tasks. By conditioning on source, reference, and hypothesis, COMET captures translation adequacy in ways n-gram overlap cannot. It became the primary metric for WMT 2022 and 2023.
For summarization, the field has moved toward multi-dimensional evaluation: separate scores for faithfulness, relevance, coherence, and fluency. UniEval (Zhong et al., 2022), a unified multi-dimensional evaluator, used a pre-trained model to ask binary questions about each dimension, enabling reference-free evaluation.
For summarization, never use ROUGE alone — add a factuality check (FactCC, DAE, or an LLM-based consistency judge). For translation, prefer COMET over BLEU for system comparison. For both tasks, human evaluation of a stratified sample remains essential when deployment quality matters.
Your team is building a clinical note summarization system. Doctors will use it to generate discharge summaries from lengthy patient records. Accuracy is critical — a factual error could affect patient care. You need to choose an evaluation metric suite for this system.
GSM8K — Grade School Math — launched in 2021 with 8,500 linguistically diverse grade-school word problems. GPT-3 scored 35% with chain-of-thought prompting. By 2023, GPT-4 scored above 90%. Researchers moved to MATH: 12,500 competition mathematics problems from AMC, AIME, and other olympiad tracks. GPT-4's score: around 52% on MATH at launch — far lower, and for the hardest subset, under 20%.
GSM8K tests multi-step arithmetic reasoning in natural language. Each problem requires 2–8 steps. The benchmark was important because it specifically required showing work — chain-of-thought prompting, where models generate intermediate reasoning steps, was shown to dramatically improve accuracy (Wei et al., 2022). But GSM8K problems are solvable without symbolic manipulation; they rely on arithmetic and basic algebra.
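Although the reasoning is written out in prose, scoring is usually done on the final number only. GSM8K reference solutions end with a line of the form `#### <answer>`; a model's chain of thought has no fixed format, so a common heuristic (an assumption here, not part of the benchmark) is to take the last number in the output. The problem text below is a toy example.

```python
import re
from typing import Optional

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def gold_answer(solution: str) -> str:
    """Extract the final answer that follows the '####' marker."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(model_output: str) -> Optional[str]:
    """Heuristic: treat the last number in the output as the final answer."""
    matches = NUMBER.findall(model_output)
    return matches[-1].replace(",", "") if matches else None


def is_correct(model_output: str, solution: str) -> bool:
    prediction = predicted_answer(model_output)
    if prediction is None:
        return False
    return float(prediction) == float(gold_answer(solution))


solution = "She sells 16 - 3 - 4 = 9 eggs a day.\nShe makes 9 * 2 = 18 dollars.\n#### 18"
good_output = "There are 16 - 3 - 4 = 9 eggs left, so she earns 9 * 2 = $18.00"
bad_output = "She sells 16 eggs at 2 dollars each, so she earns 32 dollars."
```

Because only the final number is checked, a model can reach the right answer through flawed steps, which is one reason step-level verification became a research topic of its own.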
MATH (Hendrycks et al., 2021) introduced genuine mathematical difficulty: competition problems in algebra, geometry, number theory, precalculus, and combinatorics. Accuracy on difficulty levels 4 and 5 (the hardest) remained below 25% for most frontier models until 2024. This revealed a gap between "following arithmetic steps" and "mathematical reasoning."
Minerva (Google, 2022) specifically examined whether models could solve quantitative scientific problems requiring multi-step symbolic reasoning. Minerva is a model rather than a benchmark: PaLM further trained on mathematical web pages and arXiv papers, and evaluated on MATH, GSM8K, and a new set of 272 undergraduate-level STEM problems. On MATH, few-shot PaLM 540B solved under 10% of problems, while Minerva 540B reached 33.6%, demonstrating that training data composition (mathematical text) substantially drives math capability.
MMLU (Hendrycks et al., 2021) presented 15,908 multiple-choice questions across 57 subjects: professional law, medicine, history, computer science, ethics, and more. It became the most widely cited general-capability benchmark. GPT-3 scored 43.9%; GPT-4 scored approximately 86.4% at launch. Human expert performance was estimated at ~89%, assuming each subject is answered by a specialist in that field.
However, MMLU measures primarily recall and surface-level reasoning. Models that perform near human-expert level on MMLU still fail significantly on tasks requiring novel problem formulation, causal reasoning, or adversarial inputs. MMLU-Pro (2024) addressed this by adding harder reasoning-required questions and expanding answer choices from 4 to 10, substantially reducing score inflation from random guessing.
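The effect of the number of answer choices on the guessing floor is simple arithmetic: uniform random guessing yields 1/4 = 25% with four options and 1/10 = 10% with ten. The sketch below also shows a chance-corrected accuracy, which is an illustrative normalization added here and not a score that either benchmark reports.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing."""
    if num_choices < 2:
        raise ValueError("need at least two answer choices")
    return 1.0 / num_choices


def chance_corrected(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so that random guessing maps to 0 and perfect to 1."""
    chance = chance_accuracy(num_choices)
    return (accuracy - chance) / (1.0 - chance)


four_way_floor = chance_accuracy(4)    # 0.25
ten_way_floor = chance_accuracy(10)    # 0.10
```

A raw score of 40% means something quite different under the two formats: it is 20% of the way from chance to perfect with four choices, and about 33% of the way with ten.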
A 2023 study found that several frontier models appeared to have inflated MMLU scores due to test set contamination — the multiple-choice questions and answers were present in web-scraped training data. Models could retrieve rather than reason. This prompted the adoption of stricter data decontamination protocols and renewed interest in dynamic evaluation.
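A basic decontamination check looks for long word n-grams shared between a benchmark item and the training corpus. The sketch below is a minimal in-memory version; the n-gram length of 8 and the whitespace tokenization are assumptions for illustration, and real pipelines differ in tokenization, window size, and how they index corpora far too large to hold in a set.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that appear verbatim in the corpus."""
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= word_ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


item = "what is the capital city of the country that hosted the games"
leaked_doc = "forum post: " + item + " answer inside"
clean_doc = "an unrelated article about reading comprehension benchmarks and metrics"

leaked_fraction = overlap_fraction(item, [clean_doc, leaked_doc])
clean_fraction = overlap_fraction(item, [clean_doc])
```

Exact n-gram matching misses paraphrased or translated copies of a question, which is part of why dynamic, continuously refreshed benchmarks attracted renewed interest.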
Safety evaluation faces a unique challenge: the costs of false negatives (failing to refuse harmful content) and false positives (refusing benign content) are both real, but they are measured very differently. Anthropic's model cards describe red-teaming processes where human specialists attempt to elicit harmful outputs using jailbreaks, prompt injections, and social engineering. The key metrics include:
Harmful content rate — the fraction of attempts that successfully elicit policy-violating outputs. Over-refusal rate — the fraction of benign requests incorrectly refused. Attack success rate (ASR) — specifically for adversarial red-teaming, the fraction of jailbreak attempts that succeed.
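The three rates can be computed from a single table of labeled trials. The sketch below assumes each trial has already been labeled by a human or automated grader; the `Trial` record and its field names are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Trial:
    prompt_is_harmful: bool  # ground-truth label of the request
    adversarial: bool        # produced by a jailbreak or red-team attack
    model_complied: bool     # model produced the requested content


def rate(flags: Iterable[bool]) -> float:
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0  # 0.0 when no trials apply


def harmful_content_rate(trials: list[Trial]) -> float:
    return rate(t.model_complied for t in trials if t.prompt_is_harmful)


def over_refusal_rate(trials: list[Trial]) -> float:
    return rate(not t.model_complied for t in trials if not t.prompt_is_harmful)


def attack_success_rate(trials: list[Trial]) -> float:
    return rate(t.model_complied for t in trials
                if t.prompt_is_harmful and t.adversarial)


# A degenerate model that refuses every request.
refuse_everything = [
    Trial(prompt_is_harmful=True, adversarial=True, model_complied=False),
    Trial(prompt_is_harmful=True, adversarial=False, model_complied=False),
    Trial(prompt_is_harmful=False, adversarial=False, model_complied=False),
    Trial(prompt_is_harmful=False, adversarial=False, model_complied=False),
]
```

The refuse-everything model scores 0.0 on harmful content rate and attack success rate but 1.0 on over-refusal rate, which is why the harm-side metrics are uninformative unless reported alongside the benign-side one.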
AdvBench (2023) presented 520 harmful instruction strings testing model refusals across categories: harmful information synthesis, hate speech, cybercrime, and others. The GCG (Greedy Coordinate Gradient) attack by Zou et al. (2023) demonstrated that gradient-based adversarial suffixes could achieve near-100% ASR on some open-weight aligned models such as Vicuna, and transferred to other models at lower rates, causing them to produce harmful content they would normally refuse.
The critical evaluation design challenge: safety metrics must resist Goodhart's Law. A model optimized to minimize measured harmful content rate might simply refuse everything — scoring perfectly on harm while failing completely on utility. Published safety evaluations from Anthropic (2023) explicitly report both harmful content rates and over-refusal rates on curated benign sets.
Select math benchmarks that match your difficulty target: GSM8K is saturated for frontier models; use MATH, AIME, or AMC problems for discriminative evaluation. For safety, always report both harmful content rate and over-refusal rate on paired harmful/benign sets — a model that refuses everything scores perfectly on harm but fails on utility.
You're designing a safety evaluation protocol for a consumer-facing AI assistant that will be used by a general audience including minors. The model must refuse truly harmful requests but must not over-refuse everyday questions about health, history, chemistry, or current events.