Module 6 · Lesson 1

Why Math and Code? The Benchmark Problem

When language alone stopped being enough to test reasoning, researchers turned to domains with objectively verifiable answers.

What makes mathematics and programming uniquely powerful lenses for measuring AI reasoning?

By 2022, large language models could write fluent prose, summarize documents, and pass many reading-comprehension tests. But a quiet alarm was growing among researchers: the benchmarks were breaking. Models trained on internet text were memorizing benchmark answer patterns rather than developing genuine reasoning. The field needed a harder test — one where you cannot guess your way to a right answer.

Math and code became that test. A program either runs correctly or it doesn't. A proof is either valid or it isn't. Unlike natural-language tasks, correctness is checkable by machine. This property transformed these domains into the gold standard for evaluating whether AI systems can actually reason.

The Benchmark Saturation Crisis

Early NLP benchmarks like GLUE (2018) and SuperGLUE (2019) were designed to test language understanding across tasks including sentiment, entailment, and reading comprehension. Models improved rapidly — GPT-3 matched human performance on SuperGLUE within months of its release in 2020. The benchmarks had been saturated faster than anticipated.

The core problem: models trained on web-scraped text had likely encountered phrasings similar to benchmark test items during pretraining. This is called data contamination. When a model "sees" test-adjacent data during training, its benchmark score overstates its true reasoning ability. Math problems and programming tasks are far more robust to this failure mode because their solutions require chains of valid logical steps, not just pattern-matching on surface features.

A secondary problem was multiple-choice exploitability. On many benchmarks, a model that learns statistical correlations between question phrasing and correct-answer positions can score well above chance without understanding the content. Mathematical problem-solving and open-ended code generation largely eliminate this shortcut.

Real Event — GLUE Saturation, September 2019

Microsoft Research's paper introducing SuperGLUE noted that GPT-2 fine-tuned models had already narrowed the gap to human performance on original GLUE to just 0.1 points on average — within a year of GLUE's publication. The authors designed SuperGLUE specifically to restore headroom. By July 2020, the T5 model from Google had surpassed the human baseline on SuperGLUE too. The arms race made clear that language-only benchmarks were insufficient.

What "Verifiable Correctness" Actually Means

A mathematical answer to a competition problem is right or wrong in a binary, checkable way. A Python function either passes a suite of unit tests or it doesn't. This property — called automatic verifiability — is what makes these domains so valuable as benchmarks. It enables:

1. Scalable evaluation. You can run thousands of test cases without human judges. This makes large-scale capability measurement practical.

2. Reward signal for training. Because correctness is checkable, it can serve as a training signal. A model can attempt a problem, check its answer, and learn from the outcome. This is the foundation of reinforcement learning from verifiable rewards (RLVR), which became central to models like DeepSeek-R1 and OpenAI's o1.

3. Resistance to reward hacking. When the ground truth is a computable answer rather than a human preference rating, models have a harder time "gaming" the metric through style or verbosity rather than correctness.

Key Domains and Their Properties

Researchers have focused on three primary domains. Formal mathematics (theorem proving) requires step-by-step logical deductions verified by proof assistants like Lean 4 or Isabelle. Every inference step is checkable. Competition mathematics (problems from AMC, AIME, and Olympiad contests) requires creative problem-solving with a verifiable numerical or algebraic answer. Competitive programming (problems from Codeforces, LeetCode, IOI) requires writing code that passes hidden test suites — combining algorithmic reasoning with syntactically valid implementation.

Each domain tests a different facet. Formal proofs test long-chain deductive validity. Competition math tests creative mathematical insight. Competitive coding tests the translation of abstract reasoning into executable logic. Together, they form a multi-dimensional picture of AI reasoning capability that natural language benchmarks simply cannot provide.

Why This Matters

Math and code benchmarks are not just academic scorecards. The training methods that improve performance on these benchmarks — particularly reinforcement learning on verifiable rewards — turn out to generalize. Models that learn to reason carefully on math and code problems become better reasoners across many other domains. Understanding the benchmark landscape is therefore foundational to understanding the broader reasoning revolution.

Key Terms

Data ContaminationWhen test-set items or near-identical phrasings appear in a model's training data, inflating benchmark scores beyond what genuine reasoning ability would produce.

Automatic VerifiabilityThe property of a task whose correct answer can be checked by a deterministic program rather than requiring human judgment.

RLVRReinforcement Learning from Verifiable Rewards — training AI using correctness signals from automatically checkable answers rather than human preference ratings.

Benchmark SaturationWhen AI models match or exceed the human baseline on a benchmark, rendering it no longer useful for measuring further progress.

Lesson 1 Quiz

Why Math and Code? The Benchmark Problem

Why did NLP benchmark saturation happen so quickly with benchmarks like GLUE and SuperGLUE?

Correct. Data contamination combined with tasks that rewarded surface pattern-matching rather than deep reasoning allowed models to saturate these benchmarks faster than their designers anticipated.

Not quite. The core issues were data contamination and insufficient task difficulty — the benchmarks rewarded pattern-matching more than genuine reasoning.

What is "automatic verifiability" and why does it matter for AI evaluation?

Exactly right. When correctness is checkable by machine, evaluation scales cheaply and models cannot "game" the score through style or verbosity alone.

Not quite. Automatic verifiability means a deterministic program — not another model or a human — can check the answer. This is what makes math and code so valuable as benchmarks.

Which of the following best describes why multiple-choice benchmarks are vulnerable to exploitation by language models?

Correct. This is the "option bias" or "answer position" exploit — models learn shortcut heuristics rather than reasoning through the content.

The vulnerability is subtler: models learn which answer positions or phrasings correlate with correctness in the training data, bypassing genuine comprehension.

Lab 1: Benchmark Design Thinking

Explore what makes a good AI reasoning benchmark — and where existing ones fall short.

Your Mission

You've just learned why math and code became the gold standard for AI benchmarking. Now put that knowledge to work. In this lab, discuss with the AI assistant the properties of good vs. bad benchmarks, explore the GLUE saturation event in more depth, and think through what "automatic verifiability" means in practice.

Start here: "What's the difference between a benchmark that tests reasoning and one that tests memorization? Give me a concrete example from the GLUE or SuperGLUE era."

AI Lab Assistant

Benchmark Design

Welcome to Lab 1. I'm here to help you think through AI benchmark design — what makes a benchmark genuinely test reasoning versus memorization, and why math and code emerged as the gold standard. What would you like to explore first?

Module 6 · Lesson 2

Competition Math — From MATH Dataset to AIME

How a dataset of 12,500 competition problems became the proving ground for the most important AI reasoning advances of the 2020s.

How did performance on competition mathematics evolve from near-zero to near-human, and what drove those leaps?

In October 2021, Dan Hendrycks and colleagues at UC Berkeley and elsewhere released the MATH benchmark — 12,500 problems drawn from AMC 8, AMC 10, AMC 12, AIME, and the MATHCOUNTS competitions, spanning seven subject areas: prealgebra, algebra, number theory, counting and probability, geometry, intermediate algebra, and precalculus. Each problem was labeled with a difficulty from 1 to 5.

The initial GPT-3 result on MATH was 4.5% accuracy on the hardest Level 5 problems. Fine-tuned models reached around 6–7%. This was not a benchmark that would saturate quickly. The authors wrote: "We find that many state-of-the-art AI models struggle to solve these problems."

The Scaling Baseline (2021–2022)

Early attempts to improve MATH scores focused on scaling — bigger models trained on more data. The Minerva model from Google Research, published in June 2022, represented the most systematic effort. Trained on 118 billion tokens of scientific content including LaTeX-formatted mathematics from arXiv papers and web pages, Minerva (540B parameters) achieved 33.6% on MATH and 50% on MATH's Level 1–2 problems. This was a dramatic improvement from GPT-3's baseline, but still far below human performance on the harder problems.

Minerva also used majority voting: it generated 64 or 256 candidate solutions to each problem and selected the answer that appeared most frequently. This sampling-and-voting technique (called "self-consistency" in the broader literature, introduced by Google Brain in 2022) showed that inference-time compute could substitute for training-time compute on math tasks.

Real Event — Minerva Paper, June 2022

Google Research's paper "Solving Quantitative Reasoning Problems with Language Models" showed that domain-specific pretraining on mathematical content — not just general scaling — was important. Minerva's 540B parameter model achieved 50.3% on MATH using 64-sample majority voting. The result demonstrated that mathematical reasoning benefited substantially from seeing mathematical notation and structure during training, not just natural language descriptions of math.

The Chain-of-Thought Breakthrough

Parallel to Minerva's scaling approach, Jason Wei, Denny Zhou, and colleagues at Google Brain showed in January 2022 that chain-of-thought prompting — asking models to write out intermediate reasoning steps before giving a final answer — dramatically improved mathematical problem-solving. On the GSM8K grade-school math benchmark, chain-of-thought prompting with a 540B PaLM model reached 74%, compared to 17% for standard prompting of the same model.

The key insight was structural: many math problems require multi-step reasoning where each step must be correct for the final answer to be correct. Forcing the model to make intermediate steps explicit gives it a "scratchpad" that helps maintain logical coherence across the solution.

This finding would become foundational. The o1, o3, and DeepSeek-R1 reasoning models that emerged in 2024–2025 can be understood as taking chain-of-thought to its logical conclusion: instead of prompting for one chain, train the model to generate extended, self-correcting reasoning traces.

AIME 2024: The Benchmark Moment

The American Invitational Mathematics Examination (AIME) is the second stage of the AMC competition pathway — 15 problems with integer answers from 0 to 999, typically requiring several hours for top human competitors. Median scores for students who qualify for AIME (themselves in approximately the top 5% of AMC participants) are typically 3–5 out of 15.

In February 2025, OpenAI's o3 model scored 25.3% on AIME 2024 at low compute settings and significantly higher with more sampling, reaching scores that would place it in qualifying range for the American Mathematics Olympiad (AIMO). DeepSeek-R1, released in January 2025, scored 72.6% on AIME 2024 (pass@1). These numbers would have been considered impossible extrapolations from Minerva's 2022 baseline just three years earlier.

4.5%

GPT-3 on MATH Level 5 (2021)

33.6%

Minerva 540B on MATH (2022)

72.6%

DeepSeek-R1 on AIME 2024 (2025)

What Drove the Improvement?

Three factors compounded. First, domain-specific pretraining data: models trained on LaTeX-formatted mathematical content learned to handle formal notation and multi-step algebraic manipulation. Second, chain-of-thought and extended reasoning: eliciting step-by-step solutions substantially improved accuracy even without changing model weights. Third, and most consequentially for 2024–2025, reinforcement learning on verifiable rewards: training models to optimize for correct final answers — checked automatically — allowed them to develop long internal reasoning chains that catch and correct their own errors.

The AIME results from 2025 are especially significant because AIME problems are not on the internet in solved form in large quantities, making data contamination a less plausible explanation. The improvements appear to reflect genuine advances in mathematical reasoning capability.

Key Terms

MATH BenchmarkA dataset of 12,500 competition mathematics problems released by Hendrycks et al. in 2021, spanning difficulty levels 1–5 and seven subject areas.

Chain-of-Thought PromptingA prompting technique where models are asked to produce intermediate reasoning steps before a final answer, improving accuracy on multi-step problems.

Majority Voting / Self-ConsistencyGenerating multiple candidate solutions and selecting the most common answer, trading inference-time compute for higher accuracy.

AIMEAmerican Invitational Mathematics Examination — a prestigious competition exam used as a high-difficulty benchmark for AI math reasoning due to its creative problem structure and human-verifiable answers.

Lesson 2 Quiz

Competition Math — From MATH Dataset to AIME

What was GPT-3's approximate accuracy on MATH Level 5 problems when the benchmark was released in 2021?

Correct. GPT-3 achieved approximately 4.5% on Level 5 problems, confirming that the MATH benchmark would not saturate quickly.

Not quite. GPT-3 scored approximately 4.5% on Level 5 problems — the MATH benchmark was designed specifically to resist rapid saturation.

What was the key innovation in Google Research's Minerva model that helped it score 33.6% on MATH?

Exactly. Domain-specific pretraining on mathematical text combined with 64-sample majority voting were the core contributions of the Minerva approach.

The key factors were domain-specific pretraining on LaTeX-formatted mathematical content and majority voting across multiple generated solutions — not a symbolic engine or simple fine-tuning.

Why is AIME considered a particularly rigorous benchmark for AI reasoning, beyond just being difficult?

Correct. The combination of automatically verifiable answers and limited contamination risk makes AIME scores particularly credible as evidence of genuine reasoning progress.

The key features are automatic verifiability (integer answers 0–999) and limited data contamination risk — not the grading format or licensing status.

Lab 2: Exploring Math Benchmarks

Probe the relationship between benchmark design choices and what AI capabilities they actually measure.

Your Mission

Dig into the specifics of competition math as an AI benchmark. Discuss with the assistant how difficulty levels matter, what chain-of-thought prompting actually does to a model's reasoning process, and why the jump from GSM8K (grade school) to MATH (competition) was so consequential for the field.

Start here: "Explain why chain-of-thought prompting helps on multi-step math problems. What's actually happening inside the model when it writes out intermediate steps?"

AI Lab Assistant

Competition Math

Welcome to Lab 2. I'm ready to explore competition mathematics benchmarks with you — how they're structured, what they test, and why chain-of-thought reasoning was such a pivotal discovery. What's on your mind?

Module 6 · Lesson 3

Code Generation Benchmarks — HumanEval to SWE-bench

From generating standalone functions to resolving real GitHub issues: how coding benchmarks evolved to test progressively deeper reasoning.

What does the progression from HumanEval to SWE-bench reveal about what AI needs to do to genuinely reason about software?

In July 2021, OpenAI released Codex and its associated evaluation suite, HumanEval — 164 Python programming problems, each requiring the model to complete a function given its docstring. The benchmark used hidden unit tests to check correctness. The initial Codex model (12B parameters) solved 28.8% of problems on the first attempt.

HumanEval was a genuine advance. Unlike prior code benchmarks that evaluated token overlap with reference solutions, it checked whether the code actually ran correctly. But within two years, GPT-4 would score above 85% on HumanEval — saturation arrived again, and in exactly the same pattern as with GLUE.

The HumanEval Era (2021–2023)

HumanEval's 164 problems became a standard leaderboard. Key milestones: Codex (12B, 2021) — 28.8%; GPT-3.5 (2022) — approximately 48%; GPT-4 (March 2023) — 67%; with sampling (pass@10 or pass@100) all of these numbers rose substantially. By mid-2023, specialized code models like WizardCoder and Code Llama were achieving 73–78% on HumanEval in a single attempt.

The problems in HumanEval are self-contained: a function that reverses a string, checks if a number is prime, or computes a Fibonacci sequence. They do not require understanding a codebase, reading documentation, or reasoning about multi-file dependencies. As models saturated the benchmark, the field had to ask: does this score measure what we actually care about?

MBPP, DS-1000, and Intermediate Benchmarks

Researchers created harder variants. MBPP (Mostly Basic Programming Problems, Google Research 2021) added 374 crowd-sourced Python problems at varying difficulty. DS-1000 (Lai et al., 2022) focused on data science tasks using real APIs from NumPy, Pandas, TensorFlow, PyTorch, and SciPy — testing whether models could correctly use complex library interfaces. APPS (Hendrycks et al., 2021) included 10,000 competitive programming problems drawn from platforms like Codeforces, at difficulty levels ranging from interview to Olympiad.

APPS was harder than HumanEval: early GPT-3 solved roughly 0.9% of competition-level problems on APPS. But even APPS evaluated individual problems — it did not test the ability to navigate a real software project.

Real Event — SWE-bench Release, October 2023

Princeton NLP and the University of Chicago released SWE-bench in October 2023: 2,294 real GitHub issues from 12 popular Python repositories including Django, Flask, NumPy, Pandas, Matplotlib, and Sympy. Each issue required generating a code patch that passed the repository's existing test suite. The initial evaluation of Claude 2 and GPT-4 resolved approximately 1.96% and 1.74% of issues respectively — dramatically lower than HumanEval scores suggested. SWE-bench demonstrated a massive gap between generating isolated functions and reasoning about real software engineering contexts.

SWE-bench: Why It Matters

SWE-bench problems are radically different from HumanEval. Resolving a GitHub issue requires: reading and understanding an existing bug report; navigating a large, unfamiliar codebase to locate the relevant code; understanding the intended behavior from documentation or tests; writing a targeted patch; and ensuring the patch does not break other parts of the system. This is much closer to what professional software engineers actually do.

The gap between HumanEval performance and SWE-bench performance was stark and informative. A model that scores 80% on HumanEval might score 2% on SWE-bench because the former tests pattern-completion while the latter tests multi-step reasoning about complex, real-world contexts.

Progress on SWE-bench has since accelerated substantially. Devin (Cognition AI, March 2024) demonstrated an agentic approach — using the model to iteratively explore the codebase, form hypotheses, and refine patches — and achieved 13.86% on SWE-bench. By early 2025, leading systems including SWE-agent with Claude 3.5 Sonnet and purpose-built agentic systems were approaching 40–50% on the verified subset (SWE-bench Verified), which filters to 500 problems confirmed by human engineers as having clear correct solutions.

1.74%

GPT-4 on SWE-bench (Oct 2023)

13.86%

Devin on SWE-bench (Mar 2024)

~49%

Top agentic systems on SWE-bench Verified (2025)

What the Progression Reveals

The progression from HumanEval to APPS to SWE-bench traces a path from function completion to competitive algorithm design to real software engineering. Each step adds a new dimension of reasoning: multi-step problem decomposition, codebase navigation, hypothesis formation and testing, and integration awareness.

The agentic approaches that performed best on SWE-bench — where a model loops through reading, planning, editing, and verifying in multiple rounds — mirror what a human developer actually does. This suggests that the most important variable for coding benchmarks is not raw model capability in isolation but the ability to engage in extended, self-correcting reasoning over complex contexts. That observation connects directly back to RLVR and the reasoning training methods explored in earlier modules.

Key Terms

HumanEvalOpenAI's 164-problem Python function completion benchmark (2021), using unit tests for automatic verification. Saturated by top models by 2023.

SWE-benchA benchmark of 2,294 real GitHub issues from popular Python repositories, requiring patch generation that passes existing test suites. Released October 2023.

Agentic CodingAn approach where an AI iteratively explores a codebase, forms hypotheses, writes and tests patches in a multi-step loop rather than generating code in a single pass.

Pass@kA metric measuring whether at least one of k generated code samples passes the test suite, used to evaluate code generation quality across a distribution of attempts.

Lesson 3 Quiz

Code Generation Benchmarks — HumanEval to SWE-bench

What was the most important methodological advance HumanEval introduced over prior code benchmarks?

Correct. Execution-based evaluation — checking whether code actually works — was a fundamental step forward from lexical similarity metrics like BLEU score.

The key advance was execution-based evaluation: using unit tests to check whether generated code actually produces correct outputs, rather than comparing token sequences to a reference answer.

Why did GPT-4's HumanEval score of ~67% dramatically overstate its real-world software engineering capability, as demonstrated by SWE-bench?

Exactly right. The two benchmarks test fundamentally different reasoning tasks. SWE-bench revealed that function completion skill does not transfer to real software engineering contexts without additional capabilities.

The gap reflects the different reasoning demands of the two tasks — HumanEval tests pattern completion, while SWE-bench requires multi-step reasoning across a complex, real codebase.

What approach did Devin (Cognition AI, 2024) take to achieve 13.86% on SWE-bench — substantially above single-pass model attempts?

Correct. The agentic, iterative approach — explore, hypothesize, implement, test, revise — was the key innovation, mirroring how human software engineers actually work.

Devin's breakthrough was the agentic loop: iterative exploration, hypothesis formation, patching, and testing — not a larger model or fine-tuning on the test distribution.

Lab 3: Code Benchmark Analysis

Examine what different code benchmarks actually measure and what the gaps between them reveal.

Your Mission

You've traced the evolution from HumanEval to SWE-bench. Now use this lab to go deeper: explore what agentic coding systems actually do differently from single-pass generation, discuss what "real software engineering" demands that benchmarks are still failing to capture, and think through where the next benchmark gap might appear.

Start here: "Walk me through what an AI agent solving a SWE-bench problem actually needs to do, step by step, compared to what a model solving a HumanEval problem does."

AI Lab Assistant

Code Benchmarks

Welcome to Lab 3. I'm ready to explore the world of code generation benchmarks with you — from HumanEval's function completion tasks to SWE-bench's real GitHub issues. What aspect would you like to dig into?

Module 6 · Lesson 4

Formal Proofs, Frontier Models, and What Comes Next

Lean 4, AlphaProof, and the question of whether AI can do mathematics that humans haven't done yet.

What is the difference between solving competition math and doing formal mathematics — and why does it matter for the future of AI reasoning?

In July 2024, Google DeepMind announced that AlphaProof and AlphaGeometry 2, working together, had solved four of six problems from the 2024 International Mathematical Olympiad — achieving a silver-medal score equivalent. Two of those problems were solved by AlphaProof, which generated formal Lean 4 proofs verified by the proof assistant. One problem took AlphaProof three days of continuous compute to solve. This was the first time an AI system had achieved IMO silver-medal performance.

The IMO result was qualitatively different from AIME performance. IMO problems require multi-step creative mathematical insight at a level that separates the top few hundred students globally each year. Solving them — and producing machine-verified proofs — represented a new category of accomplishment.

Formal Verification: What Lean 4 Does

A proof assistant like Lean 4 (or Isabelle, Coq, or Metamath) is a software system that checks mathematical proofs step by step according to formal logical rules. Every inference in a Lean 4 proof must be explicitly justified; the system accepts the proof if and only if every step is valid. This eliminates the possibility of a plausible-looking but subtly incorrect argument — the failure mode that sometimes affects competition math where the answer is correct but the reasoning that produced it is flawed.

Formal proofs are therefore the most rigorous form of verifiable correctness in mathematics. When AlphaProof produced a Lean 4 proof of an IMO problem, there was no ambiguity about whether the solution was correct. The proof assistant had verified every step.

AlphaProof and AlphaGeometry 2

AlphaProof was built on top of Gemini, fine-tuned to translate mathematical problems into formal Lean 4 statements and then search for proofs using reinforcement learning. The system was trained on a library of formalized mathematical theorems and used self-play — attempting to prove statements, learning from failures, generating new candidate lemmas — to improve its proof-search capability.

AlphaGeometry 2 handled the two geometry problems using a neuro-symbolic approach: a language model generates auxiliary geometric constructions (insight steps), while a rule-based symbolic engine derives consequences from those constructions. This hybrid approach was more effective on geometry than pure neural methods because geometric reasoning benefits from the exact constraint-satisfaction that symbolic systems provide.

The IMO 2024 result was not without caveats. AlphaProof required significantly more time per problem than human competitors (three days for one problem vs. hours for human solvers). It also could not fully solve Problems 5 and 6, which required number-theoretic and combinatorial insights the system hadn't developed. But the result demonstrated that formal mathematical reasoning at an elite competition level was now within reach for AI systems.

Real Event — AlphaProof IMO Result, July 2024

DeepMind's blog post confirmed the result: AlphaProof solved IMO 2024 Problems 1, 2, and 6; AlphaGeometry 2 solved Problem 4. This amounted to 28 points out of a possible 42 — equivalent to a silver medal. The proof for Problem 6 (a combinatorics problem about rectangular boards) took AlphaProof approximately 72 hours of compute to find and verify. The result was praised by IMO coordinators as representing correct, complete proofs rather than approximations.

FrontierMath and the Next Level

Even as AlphaProof achieved IMO silver, a new benchmark was being designed to push further. FrontierMath, assembled by Epoch AI and announced in November 2024, consists of hundreds of problems at the research mathematics level — problems that require expertise comparable to a PhD in mathematics or beyond, covering areas like algebraic geometry, analytic number theory, and advanced combinatorics. The problems were created by professional mathematicians and reviewed for novelty to minimize contamination risk.

Early evaluations showed that top models including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet solved fewer than 2% of FrontierMath problems. This benchmark established a frontier well beyond competition mathematics — one that current AI systems cannot approach, providing years of headroom for measuring progress.

The existence of FrontierMath points to a possible future in which AI systems contribute to original mathematical research, not just reproduce or extend known results. This remains speculative for now, but the trajectory from GPT-3's 4.5% on MATH Level 5 to AlphaProof's IMO silver in three years suggests that the extrapolation should not be dismissed.

The Bigger Picture: What These Benchmarks Reveal

Taken together, the progression from MATH → AIME → IMO → FrontierMath in mathematics, and from HumanEval → SWE-bench → agentic coding systems in software, tells a coherent story. AI reasoning capability has improved substantially and continues to improve. The improvements are driven by training methods — particularly reinforcement learning on verifiable rewards — that were discovered precisely because math and code provide the automatically checkable ground truth those methods require.

The relationship is circular in a productive way: better benchmarks enable better training methods; better training methods produce systems that can attempt harder benchmarks; harder benchmarks reveal new capability gaps that drive the next round of research. Understanding this cycle is fundamental to understanding why the reasoning revolution happened when it did, and what direction it is heading.

4/6

IMO 2024 problems solved by DeepMind's AlphaProof + AlphaGeometry 2

<2%

Top models on FrontierMath (Nov 2024)

Silver

IMO 2024 equivalent medal level

Core Insight

The reason math and code benchmarks have been so generative for AI research is not just that they are hard — it is that they provide the automatically verifiable feedback signal that makes training methods like RLVR work. Difficulty alone is not sufficient; verifiability is the essential ingredient. FrontierMath provides both, which is why it may prove as consequential for the next phase of reasoning research as MATH was for 2022–2024.

Key Terms

Proof AssistantA software system (e.g., Lean 4, Coq, Isabelle) that verifies mathematical proofs step by step against formal logical rules, providing machine-certified correctness.

AlphaProofDeepMind's system (2024) that combined Gemini fine-tuning with reinforcement learning to generate and verify Lean 4 proofs, achieving IMO 2024 silver-medal performance.

FrontierMathA benchmark of research-level mathematics problems (Epoch AI, November 2024) designed to measure AI progress beyond competition mathematics, currently solved at <2% by top models.

Neuro-symbolicA hybrid AI approach combining neural network components (for pattern recognition and generation) with symbolic reasoning components (for exact logical or geometric inference).

Lesson 4 Quiz

Formal Proofs, Frontier Models, and What Comes Next

What made AlphaProof's IMO 2024 result qualitatively different from earlier AI competition math achievements?

Correct. The formal verification aspect — machine-certified Lean 4 proofs — distinguishes this from simply producing a correct numerical answer. Every step of the reasoning was verified.

The key distinction is formal verification. AlphaProof didn't just find correct answers — it produced Lean 4 proofs where every logical step was machine-verified. It solved 4 of 6 problems (not all 6) and took much longer than human competitors.

Why does FrontierMath represent a more meaningful test of AI mathematical reasoning than AIME or standard competition benchmarks?

Exactly right. Research-level problems with novelty review and a sub-2% solve rate for current top models give FrontierMath the combination of difficulty and contamination resistance needed to track genuine progress.

FrontierMath's significance comes from problem difficulty at the research-mathematics level, novelty review to limit contamination, and a sub-2% solve rate for current top models — providing long-horizon headroom for measuring progress.

What is the "productive circularity" described in Lesson 4 between benchmarks and training methods?

Correct. The cycle — verifiable benchmarks enable RLVR-style training, improved models tackle harder benchmarks, new gaps drive research — explains much of the reasoning revolution's momentum.

The productive circularity is more specific: verifiable benchmarks provide the feedback signal RLVR needs; RLVR-trained models push performance on harder benchmarks; new capability gaps motivate the next benchmark and training advance. This is what's driven the rapid progress.

Lab 4: Frontier Reasoning Capabilities

Explore what formal proofs, AlphaProof, and FrontierMath reveal about where AI reasoning is headed.

Your Mission

This lab focuses on the frontier: what formal proof verification means for AI reliability, what the IMO 2024 result implies about AI's mathematical capabilities, and what FrontierMath's sub-2% solve rate tells us about where genuine gaps remain. Discuss with the assistant what "AI doing original mathematics" might actually look like — and what it would require.

Start here: "What's the difference between an AI solving an IMO problem by finding the right answer versus by generating a verified Lean 4 proof? Why does that distinction matter for AI trustworthiness?"

AI Lab Assistant

Frontier Reasoning

Welcome to Lab 4. We're at the frontier now — AlphaProof, formal verification, FrontierMath, and the question of what it would mean for AI to do genuinely original mathematics. What would you like to explore?

Module 6 Test

Math and Code as Reasoning Benchmarks — 15 questions · 80% to pass

1. What property makes mathematical and programming tasks superior to natural-language tasks for evaluating AI reasoning?

Correct. Automatic verifiability is the core property that makes math and code valuable for both evaluation and training.

The essential property is automatic verifiability — correctness checked by a program, not a human — enabling scalable evaluation and RLVR-style training.

2. The GLUE benchmark was saturated by AI models within approximately how long of its release?

Correct. GLUE was published in 2018 and models had nearly matched the human baseline by 2019, leading to SuperGLUE — which was itself saturated by mid-2020.

GLUE (2018) was nearly saturated by mid-2019 — approximately one year. SuperGLUE followed and was saturated by T5 by July 2020.

3. What was the primary training-data innovation behind Google Research's Minerva model?

Correct. Domain-specific pretraining on mathematical content — not fine-tuning or RLVR — was Minerva's key contribution in 2022.

Minerva's key innovation was pretraining on scientific and mathematical content including LaTeX notation from arXiv, combined with majority voting at inference time.

4. Chain-of-thought prompting improved PaLM's performance on GSM8K grade-school math from approximately 17% to 74%. What structural property of math problems explains this large gain?

Correct. The scratchpad function of intermediate steps is the mechanistic reason chain-of-thought helps: errors in early steps can be caught rather than silently propagating.

Chain-of-thought helps because multi-step problems require sequential correct reasoning. Writing steps out creates a scratchpad that maintains logical coherence across the chain — errors become visible rather than silent.

5. DeepSeek-R1 scored 72.6% on AIME 2024 (pass@1). Why is this result considered strong evidence of genuine reasoning improvement rather than data contamination?

Correct. Limited online availability of fully worked AIME solutions significantly reduces contamination risk compared to more widely published benchmarks.

The contamination-resistance argument rests on limited online availability of complete AIME solutions and the difficulty of memorizing specific integer answers for problems requiring multi-step creative reasoning.

6. The MATH benchmark was released in 2021 with problems from which competitions?

Correct. The MATH benchmark drew problems from the AMC competition family and MATHCOUNTS, spanning difficulty levels 1–5 and seven subject areas.

MATH drew from the AMC competition family (AMC 8, 10, 12, and AIME) plus MATHCOUNTS — not Olympiad-level or standardized test problems.

7. What does "pass@k" measure in code generation evaluation?

Correct. Pass@k allows researchers to evaluate models that may produce correct solutions probabilistically — a single failure doesn't disqualify a model that succeeds 30% of the time.

Pass@k asks: if we sample k attempts, does at least one pass? This is useful when models solve problems probabilistically and single-attempt evaluation understates their actual capability.

8. What was GPT-4's approximate solve rate on SWE-bench when it was first evaluated in October 2023?

Correct. GPT-4's 1.74% SWE-bench solve rate, despite its high HumanEval score, was the pivotal demonstration that function completion does not transfer to real software engineering.

GPT-4 resolved approximately 1.74% of SWE-bench issues — a dramatic contrast to its ~67% HumanEval score, revealing the gap between function completion and real-world software engineering.

9. DS-1000 tested AI code generation on real APIs from libraries including NumPy, Pandas, and PyTorch. What capability does this test that HumanEval does not?

Correct. Real-world API usage requires knowing library-specific conventions, argument semantics, and return types — not just general programming logic.

DS-1000 tests whether models can correctly use complex real-world libraries (NumPy, Pandas, etc.) — a different capability from the algorithmic function completion HumanEval measures.

10. AlphaGeometry 2 handled IMO 2024 geometry problems using a "neuro-symbolic" approach. What does this mean?

Correct. The neuro-symbolic hybrid leverages neural creativity for insight generation and symbolic precision for exact inference — combining the strengths of both approaches.

Neuro-symbolic means a hybrid: a language model handles creative reasoning (generating constructions), while a symbolic engine handles exact formal inference. Each component does what it does best.

11. What is the significance of a proof assistant (like Lean 4) accepting a proof, compared to a human mathematician reviewing it?

Correct. Formal verification provides mathematical certainty that human review, however expert, cannot fully guarantee — every step is machine-checked against explicit rules.

Proof assistants provide formal certainty that human review cannot: every logical step is verified against explicit rules, making subtle errors impossible to miss the way human reviewers sometimes do.

12. FrontierMath was designed by Epoch AI and consists of problems at what level of difficulty?

Correct. FrontierMath targets research mathematics, well beyond competition level, with novelty review to minimize contamination risk.

FrontierMath operates at the research mathematics level — PhD+ expertise required in areas like algebraic geometry and analytic number theory — not competition or undergraduate level.

13. What does the ~50% solve rate of leading agentic systems on SWE-bench Verified (2025) suggest about the direction of AI coding progress?

Correct. The rapid improvement from ~2% (GPT-4, 2023) to ~50% (leading agentic systems, 2025) shows that the agentic loop approach is the right architectural direction for complex software engineering.

The trajectory from ~2% to ~50% in under two years demonstrates that agentic, iterative approaches are the key to software engineering capability — and improvement is ongoing.

14. Why did researchers create APPS (Hendrycks et al., 2021) if HumanEval already existed as a code benchmark?

Correct. APPS added the dimension of algorithmic problem-solving at competitive difficulty levels — from interview problems to Olympiad-grade challenges.

APPS filled the gap between HumanEval's simple function completion and competitive programming — 10,000 problems at difficulty levels from interview to Olympiad, testing creative algorithmic reasoning.

15. Which statement best describes the relationship between math/code benchmarks and RLVR training methods?

Correct. This mutually reinforcing cycle — verifiable benchmarks enable RLVR; RLVR models expose new benchmark frontiers — is the engine of the reasoning revolution described throughout this module.

The relationship is deeply intertwined: automatic verifiability provides the RLVR training signal; RLVR models push benchmark frontiers; new frontiers motivate next-generation benchmarks and training methods. This cycle is central to the reasoning revolution.