By 2022, large language models could write fluent prose, summarize documents, and pass many reading-comprehension tests. But a quiet alarm was growing among researchers: the benchmarks were breaking. Models trained on internet text were memorizing benchmark answer patterns rather than developing genuine reasoning. The field needed a harder test — one where you cannot guess your way to a right answer.
Math and code became that test. A program either runs correctly or it doesn't. A proof is either valid or it isn't. Unlike natural-language tasks, correctness is checkable by machine. This property transformed these domains into the gold standard for evaluating whether AI systems can actually reason.
Early NLP benchmarks like GLUE (2018) and SuperGLUE (2019) were designed to test language understanding across tasks including sentiment, entailment, and reading comprehension. Models improved rapidly — GPT-3 matched human performance on SuperGLUE within months of its release in 2020. The benchmarks had been saturated faster than anticipated.
The core problem: models trained on web-scraped text had likely encountered phrasings similar to benchmark test items during pretraining. This is called data contamination. When a model "sees" test-adjacent data during training, its benchmark score overstates its true reasoning ability. Math problems and programming tasks are far more robust to this failure mode because their solutions require chains of valid logical steps, not just pattern-matching on surface features.
A secondary problem was multiple-choice exploitability. On many benchmarks, a model that learns statistical correlations between question phrasing and correct-answer positions can score well above chance without understanding the content. Mathematical problem-solving and open-ended code generation largely eliminate this shortcut.
Microsoft Research's paper introducing SuperGLUE noted that GPT-2 fine-tuned models had already narrowed the gap to human performance on original GLUE to just 0.1 points on average — within a year of GLUE's publication. The authors designed SuperGLUE specifically to restore headroom. By July 2020, the T5 model from Google had surpassed the human baseline on SuperGLUE too. The arms race made clear that language-only benchmarks were insufficient.
A mathematical answer to a competition problem is right or wrong in a binary, checkable way. A Python function either passes a suite of unit tests or it doesn't. This property — called automatic verifiability — is what makes these domains so valuable as benchmarks. It enables:
1. Scalable evaluation. You can run thousands of test cases without human judges. This makes large-scale capability measurement practical.
2. Reward signal for training. Because correctness is checkable, it can serve as a training signal. A model can attempt a problem, check its answer, and learn from the outcome. This is the foundation of reinforcement learning from verifiable rewards (RLVR), which became central to models like DeepSeek-R1 and OpenAI's o1.
3. Resistance to reward hacking. When the ground truth is a computable answer rather than a human preference rating, models have a harder time "gaming" the metric through style or verbosity rather than correctness.
Researchers have focused on three primary domains. Formal mathematics (theorem proving) requires step-by-step logical deductions verified by proof assistants like Lean 4 or Isabelle. Every inference step is checkable. Competition mathematics (problems from AMC, AIME, and Olympiad contests) requires creative problem-solving with a verifiable numerical or algebraic answer. Competitive programming (problems from Codeforces, LeetCode, IOI) requires writing code that passes hidden test suites — combining algorithmic reasoning with syntactically valid implementation.
Each domain tests a different facet. Formal proofs test long-chain deductive validity. Competition math tests creative mathematical insight. Competitive coding tests the translation of abstract reasoning into executable logic. Together, they form a multi-dimensional picture of AI reasoning capability that natural language benchmarks simply cannot provide.
Math and code benchmarks are not just academic scorecards. The training methods that improve performance on these benchmarks — particularly reinforcement learning on verifiable rewards — turn out to generalize. Models that learn to reason carefully on math and code problems become better reasoners across many other domains. Understanding the benchmark landscape is therefore foundational to understanding the broader reasoning revolution.
You've just learned why math and code became the gold standard for AI benchmarking. Now put that knowledge to work. In this lab, discuss with the AI assistant the properties of good vs. bad benchmarks, explore the GLUE saturation event in more depth, and think through what "automatic verifiability" means in practice.
In October 2021, Dan Hendrycks and colleagues at UC Berkeley and elsewhere released the MATH benchmark — 12,500 problems drawn from AMC 8, AMC 10, AMC 12, AIME, and the MATHCOUNTS competitions, spanning seven subject areas: prealgebra, algebra, number theory, counting and probability, geometry, intermediate algebra, and precalculus. Each problem was labeled with a difficulty from 1 to 5.
The initial GPT-3 result on MATH was 4.5% accuracy on the hardest Level 5 problems. Fine-tuned models reached around 6–7%. This was not a benchmark that would saturate quickly. The authors wrote: "We find that many state-of-the-art AI models struggle to solve these problems."
Early attempts to improve MATH scores focused on scaling — bigger models trained on more data. The Minerva model from Google Research, published in June 2022, represented the most systematic effort. Trained on 118 billion tokens of scientific content including LaTeX-formatted mathematics from arXiv papers and web pages, Minerva (540B parameters) achieved 33.6% on MATH and 50% on MATH's Level 1–2 problems. This was a dramatic improvement from GPT-3's baseline, but still far below human performance on the harder problems.
Minerva also used majority voting: it generated 64 or 256 candidate solutions to each problem and selected the answer that appeared most frequently. This sampling-and-voting technique (called "self-consistency" in the broader literature, introduced by Google Brain in 2022) showed that inference-time compute could substitute for training-time compute on math tasks.
Google Research's paper "Solving Quantitative Reasoning Problems with Language Models" showed that domain-specific pretraining on mathematical content — not just general scaling — was important. Minerva's 540B parameter model achieved 50.3% on MATH using 64-sample majority voting. The result demonstrated that mathematical reasoning benefited substantially from seeing mathematical notation and structure during training, not just natural language descriptions of math.
Parallel to Minerva's scaling approach, Jason Wei, Denny Zhou, and colleagues at Google Brain showed in January 2022 that chain-of-thought prompting — asking models to write out intermediate reasoning steps before giving a final answer — dramatically improved mathematical problem-solving. On the GSM8K grade-school math benchmark, chain-of-thought prompting with a 540B PaLM model reached 74%, compared to 17% for standard prompting of the same model.
The key insight was structural: many math problems require multi-step reasoning where each step must be correct for the final answer to be correct. Forcing the model to make intermediate steps explicit gives it a "scratchpad" that helps maintain logical coherence across the solution.
This finding would become foundational. The o1, o3, and DeepSeek-R1 reasoning models that emerged in 2024–2025 can be understood as taking chain-of-thought to its logical conclusion: instead of prompting for one chain, train the model to generate extended, self-correcting reasoning traces.
The American Invitational Mathematics Examination (AIME) is the second stage of the AMC competition pathway — 15 problems with integer answers from 0 to 999, typically requiring several hours for top human competitors. Median scores for students who qualify for AIME (themselves in approximately the top 5% of AMC participants) are typically 3–5 out of 15.
In February 2025, OpenAI's o3 model scored 25.3% on AIME 2024 at low compute settings and significantly higher with more sampling, reaching scores that would place it in qualifying range for the American Mathematics Olympiad (AIMO). DeepSeek-R1, released in January 2025, scored 72.6% on AIME 2024 (pass@1). These numbers would have been considered impossible extrapolations from Minerva's 2022 baseline just three years earlier.
Three factors compounded. First, domain-specific pretraining data: models trained on LaTeX-formatted mathematical content learned to handle formal notation and multi-step algebraic manipulation. Second, chain-of-thought and extended reasoning: eliciting step-by-step solutions substantially improved accuracy even without changing model weights. Third, and most consequentially for 2024–2025, reinforcement learning on verifiable rewards: training models to optimize for correct final answers — checked automatically — allowed them to develop long internal reasoning chains that catch and correct their own errors.
The AIME results from 2025 are especially significant because AIME problems are not on the internet in solved form in large quantities, making data contamination a less plausible explanation. The improvements appear to reflect genuine advances in mathematical reasoning capability.
Dig into the specifics of competition math as an AI benchmark. Discuss with the assistant how difficulty levels matter, what chain-of-thought prompting actually does to a model's reasoning process, and why the jump from GSM8K (grade school) to MATH (competition) was so consequential for the field.
In July 2021, OpenAI released Codex and its associated evaluation suite, HumanEval — 164 Python programming problems, each requiring the model to complete a function given its docstring. The benchmark used hidden unit tests to check correctness. The initial Codex model (12B parameters) solved 28.8% of problems on the first attempt.
HumanEval was a genuine advance. Unlike prior code benchmarks that evaluated token overlap with reference solutions, it checked whether the code actually ran correctly. But within two years, GPT-4 would score above 85% on HumanEval — saturation arrived again, and in exactly the same pattern as with GLUE.
HumanEval's 164 problems became a standard leaderboard. Key milestones: Codex (12B, 2021) — 28.8%; GPT-3.5 (2022) — approximately 48%; GPT-4 (March 2023) — 67%; with sampling (pass@10 or pass@100) all of these numbers rose substantially. By mid-2023, specialized code models like WizardCoder and Code Llama were achieving 73–78% on HumanEval in a single attempt.
The problems in HumanEval are self-contained: a function that reverses a string, checks if a number is prime, or computes a Fibonacci sequence. They do not require understanding a codebase, reading documentation, or reasoning about multi-file dependencies. As models saturated the benchmark, the field had to ask: does this score measure what we actually care about?
Researchers created harder variants. MBPP (Mostly Basic Programming Problems, Google Research 2021) added 374 crowd-sourced Python problems at varying difficulty. DS-1000 (Lai et al., 2022) focused on data science tasks using real APIs from NumPy, Pandas, TensorFlow, PyTorch, and SciPy — testing whether models could correctly use complex library interfaces. APPS (Hendrycks et al., 2021) included 10,000 competitive programming problems drawn from platforms like Codeforces, at difficulty levels ranging from interview to Olympiad.
APPS was harder than HumanEval: early GPT-3 solved roughly 0.9% of competition-level problems on APPS. But even APPS evaluated individual problems — it did not test the ability to navigate a real software project.
Princeton NLP and the University of Chicago released SWE-bench in October 2023: 2,294 real GitHub issues from 12 popular Python repositories including Django, Flask, NumPy, Pandas, Matplotlib, and Sympy. Each issue required generating a code patch that passed the repository's existing test suite. The initial evaluation of Claude 2 and GPT-4 resolved approximately 1.96% and 1.74% of issues respectively — dramatically lower than HumanEval scores suggested. SWE-bench demonstrated a massive gap between generating isolated functions and reasoning about real software engineering contexts.
SWE-bench problems are radically different from HumanEval. Resolving a GitHub issue requires: reading and understanding an existing bug report; navigating a large, unfamiliar codebase to locate the relevant code; understanding the intended behavior from documentation or tests; writing a targeted patch; and ensuring the patch does not break other parts of the system. This is much closer to what professional software engineers actually do.
The gap between HumanEval performance and SWE-bench performance was stark and informative. A model that scores 80% on HumanEval might score 2% on SWE-bench because the former tests pattern-completion while the latter tests multi-step reasoning about complex, real-world contexts.
Progress on SWE-bench has since accelerated substantially. Devin (Cognition AI, March 2024) demonstrated an agentic approach — using the model to iteratively explore the codebase, form hypotheses, and refine patches — and achieved 13.86% on SWE-bench. By early 2025, leading systems including SWE-agent with Claude 3.5 Sonnet and purpose-built agentic systems were approaching 40–50% on the verified subset (SWE-bench Verified), which filters to 500 problems confirmed by human engineers as having clear correct solutions.
The progression from HumanEval to APPS to SWE-bench traces a path from function completion to competitive algorithm design to real software engineering. Each step adds a new dimension of reasoning: multi-step problem decomposition, codebase navigation, hypothesis formation and testing, and integration awareness.
The agentic approaches that performed best on SWE-bench — where a model loops through reading, planning, editing, and verifying in multiple rounds — mirror what a human developer actually does. This suggests that the most important variable for coding benchmarks is not raw model capability in isolation but the ability to engage in extended, self-correcting reasoning over complex contexts. That observation connects directly back to RLVR and the reasoning training methods explored in earlier modules.
You've traced the evolution from HumanEval to SWE-bench. Now use this lab to go deeper: explore what agentic coding systems actually do differently from single-pass generation, discuss what "real software engineering" demands that benchmarks are still failing to capture, and think through where the next benchmark gap might appear.
In July 2024, Google DeepMind announced that AlphaProof and AlphaGeometry 2, working together, had solved four of six problems from the 2024 International Mathematical Olympiad — achieving a silver-medal score equivalent. Two of those problems were solved by AlphaProof, which generated formal Lean 4 proofs verified by the proof assistant. One problem took AlphaProof three days of continuous compute to solve. This was the first time an AI system had achieved IMO silver-medal performance.
The IMO result was qualitatively different from AIME performance. IMO problems require multi-step creative mathematical insight at a level that separates the top few hundred students globally each year. Solving them — and producing machine-verified proofs — represented a new category of accomplishment.
A proof assistant like Lean 4 (or Isabelle, Coq, or Metamath) is a software system that checks mathematical proofs step by step according to formal logical rules. Every inference in a Lean 4 proof must be explicitly justified; the system accepts the proof if and only if every step is valid. This eliminates the possibility of a plausible-looking but subtly incorrect argument — the failure mode that sometimes affects competition math where the answer is correct but the reasoning that produced it is flawed.
Formal proofs are therefore the most rigorous form of verifiable correctness in mathematics. When AlphaProof produced a Lean 4 proof of an IMO problem, there was no ambiguity about whether the solution was correct. The proof assistant had verified every step.
AlphaProof was built on top of Gemini, fine-tuned to translate mathematical problems into formal Lean 4 statements and then search for proofs using reinforcement learning. The system was trained on a library of formalized mathematical theorems and used self-play — attempting to prove statements, learning from failures, generating new candidate lemmas — to improve its proof-search capability.
AlphaGeometry 2 handled the two geometry problems using a neuro-symbolic approach: a language model generates auxiliary geometric constructions (insight steps), while a rule-based symbolic engine derives consequences from those constructions. This hybrid approach was more effective on geometry than pure neural methods because geometric reasoning benefits from the exact constraint-satisfaction that symbolic systems provide.
The IMO 2024 result was not without caveats. AlphaProof required significantly more time per problem than human competitors (three days for one problem vs. hours for human solvers). It also could not fully solve Problems 5 and 6, which required number-theoretic and combinatorial insights the system hadn't developed. But the result demonstrated that formal mathematical reasoning at an elite competition level was now within reach for AI systems.
DeepMind's blog post confirmed the result: AlphaProof solved IMO 2024 Problems 1, 2, and 6; AlphaGeometry 2 solved Problem 4. This amounted to 28 points out of a possible 42 — equivalent to a silver medal. The proof for Problem 6 (a combinatorics problem about rectangular boards) took AlphaProof approximately 72 hours of compute to find and verify. The result was praised by IMO coordinators as representing correct, complete proofs rather than approximations.
Even as AlphaProof achieved IMO silver, a new benchmark was being designed to push further. FrontierMath, assembled by Epoch AI and announced in November 2024, consists of hundreds of problems at the research mathematics level — problems that require expertise comparable to a PhD in mathematics or beyond, covering areas like algebraic geometry, analytic number theory, and advanced combinatorics. The problems were created by professional mathematicians and reviewed for novelty to minimize contamination risk.
Early evaluations showed that top models including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet solved fewer than 2% of FrontierMath problems. This benchmark established a frontier well beyond competition mathematics — one that current AI systems cannot approach, providing years of headroom for measuring progress.
The existence of FrontierMath points to a possible future in which AI systems contribute to original mathematical research, not just reproduce or extend known results. This remains speculative for now, but the trajectory from GPT-3's 4.5% on MATH Level 5 to AlphaProof's IMO silver in three years suggests that the extrapolation should not be dismissed.
Taken together, the progression from MATH → AIME → IMO → FrontierMath in mathematics, and from HumanEval → SWE-bench → agentic coding systems in software, tells a coherent story. AI reasoning capability has improved substantially and continues to improve. The improvements are driven by training methods — particularly reinforcement learning on verifiable rewards — that were discovered precisely because math and code provide the automatically checkable ground truth those methods require.
The relationship is circular in a productive way: better benchmarks enable better training methods; better training methods produce systems that can attempt harder benchmarks; harder benchmarks reveal new capability gaps that drive the next round of research. Understanding this cycle is fundamental to understanding why the reasoning revolution happened when it did, and what direction it is heading.
The reason math and code benchmarks have been so generative for AI research is not just that they are hard — it is that they provide the automatically verifiable feedback signal that makes training methods like RLVR work. Difficulty alone is not sufficient; verifiability is the essential ingredient. FrontierMath provides both, which is why it may prove as consequential for the next phase of reasoning research as MATH was for 2022–2024.
This lab focuses on the frontier: what formal proof verification means for AI reliability, what the IMO 2024 result implies about AI's mathematical capabilities, and what FrontierMath's sub-2% solve rate tells us about where genuine gaps remain. Discuss with the assistant what "AI doing original mathematics" might actually look like — and what it would require.