GPT vs. Claude vs. Gemini · Module 5 · Lesson 1

The Benchmark Industrial Complex

Why the leaderboards you read in headlines are almost never measuring what you think they are.

In May 2023, a now-famous Reddit thread documented something strange: GPT-4 had begun failing tasks it had handled easily just weeks earlier. Researchers Lingjiao Chen, Matei Zaharia, and James Zou, at Stanford and UC Berkeley, published "How Is ChatGPT's Behavior Changing over Time?" in July 2023, confirming the drift with systematic measurements. Their best-known test was simple: "Is 17077 a prime number?" The March 2023 version of GPT-4 got it right. The June 2023 version was confidently wrong. The paper became a flashpoint, not because models degrade, but because no one had built benchmark infrastructure to catch it in real time.

The benchmarks AI companies publish at launch are snapshots. They are marketing artifacts as much as scientific instruments. Understanding what they actually measure — and what they systematically miss — is the difference between choosing a tool intelligently and being dazzled by a number on a press release.

What Benchmarks Actually Are

An AI benchmark is a fixed dataset of questions or tasks with known correct answers, used to score model performance as a percentage. The most cited ones include MMLU (Massive Multitask Language Understanding — 57 academic subjects, 14,000+ questions), HumanEval (164 Python programming problems released by OpenAI in 2021), HellaSwag (commonsense reasoning, sentence completion), GSM8K (grade-school math word problems), and BIG-Bench Hard (23 difficult reasoning tasks).

Each was designed with a specific, narrow purpose. MMLU tests breadth of factual recall. HumanEval tests code correctness on toy problems. GSM8K tests arithmetic reasoning in plain language. None were designed as general intelligence tests, yet they are routinely presented that way in launch announcements.

When Google announced Gemini Ultra in December 2023, it led with the claim that Gemini Ultra achieved 90.0% on MMLU, making it the first model to surpass human expert performance (estimated at 89.8%). The headline spread globally. What received far less coverage: the "human expert" baseline was an estimate published by the original MMLU authors rather than a controlled measurement, and Gemini's headline score used a non-standard method, chain-of-thought prompting with 32 sampled responses per question (CoT@32), rather than the 5-shot format used in most prior comparisons. Under 5-shot prompting, Google's own report put Gemini Ultra at 83.7%, below GPT-4's reported 86.4%. Microsoft researchers soon reported that GPT-4, given a comparably elaborate prompting strategy, also exceeded 90%.
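Two ingredients of a prompting protocol can change a score without changing the model: how many worked examples go into the prompt, and whether the final answer is a single response or a vote over many sampled responses. The sketch below illustrates both under simplifying assumptions. The function names and the sample data are hypothetical, and it is not a reconstruction of any lab's evaluation harness.

```python
from collections import Counter

def build_few_shot_prompt(examples, question):
    """Format k worked examples followed by the target question (a k-shot prompt)."""
    shots = [f"Q: {q}\nA: {a}" for q, a in examples]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])

def majority_vote(sampled_answers):
    """Return the most common final answer among several sampled responses."""
    counts = Counter(sampled_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical data: two worked examples make this a 2-shot prompt.
prompt = build_few_shot_prompt([("2+2?", "4"), ("3+5?", "8")], "7+6?")

# Hypothetical final answers extracted from five sampled reasoning chains.
voted = majority_vote(["B", "C", "B", "B", "A"])
```

A score produced with voting over many samples is not directly comparable to a score produced from one response per question, which is why the protocol belongs next to every reported number.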

The Contamination Problem

The single most serious structural flaw in modern AI benchmarking is data contamination: benchmark questions appear in training data, so the model has effectively memorized answers rather than demonstrating generalization. Because training datasets for frontier models span trillions of tokens scraped largely from the public internet, and because benchmark datasets are published on the public internet, contamination is nearly impossible to prevent and difficult to detect after the fact.

In April 2024, researchers at MIT and the University of Washington published a study (arXiv:2404.00329) demonstrating that when MMLU and GSM8K test questions were slightly rephrased (same concept, different wording), frontier model performance dropped by 10–15 percentage points on average. A drop of that size is consistent with significant contamination in current scores, although sensitivity to wording can also contribute to it.

OpenAI acknowledged this problem in the GPT-4 technical report, noting they attempted to remove contaminating data but could not guarantee completeness. Anthropic's model cards for Claude 2 and Claude 3 include similar caveats. Google's Gemini technical report discusses decontamination procedures for Gemini 1.5, but independent verification remains limited.
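One common family of contamination checks looks for long word sequences that a benchmark item shares with a training document. The sketch below is a minimal illustration of that idea, assuming word-level 8-grams and exact matching after lowercasing. The function names, the window size, and the sample texts are all hypothetical, and the procedures the labs describe in their own reports may differ substantially.

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(benchmark_item, corpus_doc, n=8):
    """Fraction of the item's n-grams that also occur in the corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)

# Hypothetical benchmark item and two hypothetical scraped documents.
question = "a baker sells 12 loaves each morning and 7 loaves each afternoon for 5 days"
scraped_page = ("homework help a baker sells 12 loaves each morning and 7 loaves "
                "each afternoon for 5 days how many loaves")
clean_page = "a recipe for sourdough bread that needs flour water salt and a lot of patience"

contaminated = overlap_ratio(question, scraped_page)  # every 8-gram is shared
clean = overlap_ratio(question, clean_page)           # no 8-gram is shared
```

Exact matching of this kind misses paraphrased or translated copies, which is one reason contamination is hard to rule out after the fact.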

CRITICAL DISTINCTION

A model scoring 95% on GSM8K does not mean it can reliably do your company's financial modeling. GSM8K problems are short, multi-step (typically two to eight steps) arithmetic word problems in clean natural language. Real financial analysis involves ambiguous data, long chains of dependent calculations, unit conversions, and domain conventions the benchmark never tests. Benchmark scores describe performance on a specific artifact, not capability in your use case.

Goodhart's Law Comes for AI

Goodhart's Law states: "When a measure becomes a target, it ceases to be a good measure." In AI development, this dynamic is acute. Because benchmark scores drive press coverage, investor confidence, and enterprise adoption decisions, labs have strong incentives to optimize for benchmarks specifically — not for general capability.

This manifests in documented ways. In late 2023, multiple independent researchers noted that several models showed dramatically better performance on MMLU when prompted in specific formats that resembled the benchmark's structure, but performed significantly worse on equivalent questions asked conversationally. The benchmark score was real. The generalization was not.

In January 2024, contributors found that Hugging Face's Open LLM Leaderboard, which had become an influential ranking for open-source models, had been gamed by several model submissions that appeared to be fine-tuned specifically on benchmark test sets. Hugging Face subsequently added contamination detection and revised the leaderboard architecture. The incident illustrated that even community-run, transparency-focused evaluation infrastructure is vulnerable to Goodhart dynamics.

WHAT TO REMEMBER

Benchmark scores are the beginning of evaluation, not the end. The questions worth asking: What dataset was used, and when was it published relative to training? What prompting format was used? Has the benchmark been independently replicated? Does the task distribution match your actual use case? Labs that publish detailed technical reports with methodology transparency — all three major labs do to varying degrees — give you more to work with than launch-day press releases alone.

Lesson 1 Quiz

In the July 2023 Stanford study by Chen, Zaharia, and Zou, what specific phenomenon did researchers document about GPT-4?

✓ Correct. Chen et al. documented that GPT-4's behavior on tasks like primality testing had drifted significantly over time — a finding that exposed the lack of continuous benchmark monitoring in the industry.

✗ Not quite. The Stanford paper specifically documented behavioral drift — that GPT-4's answers to the same questions changed measurably between March and June 2023, including on tasks it had previously handled correctly.

When Google announced Gemini Ultra surpassing "human expert performance" on MMLU at 90.0%, what methodological difference did critics highlight?

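✓ Correct. The prompting method was the key methodological gap: Gemini's 90.0% came from chain-of-thought prompting with 32 sampled responses per question (CoT@32), while most prior comparisons used 5-shot prompting. Under 5-shot prompting, Google's own report put Gemini Ultra below GPT-4, which didn't make the initial headlines.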

✗ The key issue was prompting format: Gemini's headline score used chain-of-thought prompting with 32 sampled responses per question rather than the standard 5-shot format, making direct comparisons to prior model scores misleading.

The 2024 arXiv:2404.00329 study found that slightly rephrasing MMLU and GSM8K questions caused frontier model performance to drop by approximately how much?

✓ Correct. The 10–15 percentage point drop on rephrased-but-equivalent questions is strong evidence for training data contamination. Models appear to have partially memorized benchmark answers rather than fully generalizing the underlying reasoning.

✗ The drop was 10–15 percentage points on average. This substantial gap on semantically equivalent but differently-worded questions is the primary evidence cited for significant benchmark contamination.

Lab 1: Benchmark Anatomy

Dissect what specific benchmarks actually test — and what they miss.

Interrogating Benchmark Claims

In this lab you'll practice asking the right questions about benchmark scores. The AI assistant knows the specific details of MMLU, HumanEval, GSM8K, HellaSwag, and BIG-Bench Hard — including their design assumptions, known limitations, and documented contamination incidents.

Try to understand what a specific benchmark actually measures, then probe its limitations for a real-world use case you care about.

Try asking: "I'm evaluating AI models for customer support email drafting. A vendor says their model scores 87% on MMLU. How should I interpret that score for my use case, and what benchmarks would actually be relevant?"

GPT vs. Claude vs. Gemini · Module 5 · Lesson 2

Reading the Technical Reports

GPT-4, Claude 3, and Gemini 1.5 each published detailed technical papers. Here is what they actually say — and what they quietly omit.

On March 14, 2023, OpenAI released the GPT-4 technical report, all 98 pages of it. Within hours, AI researchers noted something remarkable: the report meticulously documented benchmark performance across dozens of evaluations, yet revealed almost nothing about training data, parameter count, or architecture. The report stated plainly: "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar." Later that month, the Future of Life Institute published an open letter, signed by thousands, calling for a pause on training systems more powerful than GPT-4, and the debate over AI transparency intensified. The report also forced Anthropic and Google to decide how much their own reports would reveal.

The GPT-4 Technical Report: What It Shows

The GPT-4 technical report (OpenAI, March 2023) is primarily an evaluation report, not a model architecture paper. It documents performance on 26 standardized exams (including the Bar Exam, where GPT-4 scored ~90th percentile), multiple academic benchmarks, and a series of red-teaming and safety evaluations conducted by 50+ external experts.

The report introduced the concept of predictable scaling — showing that capability on certain tasks could be predicted from earlier training checkpoints. This was scientifically significant. It also introduced the Evals framework, which OpenAI subsequently open-sourced, allowing third parties to run standardized evaluations.

What the report omits: training data composition, model size, compute budget, RLHF methodology details, and the composition of the red-teaming sample. The justification given was competitive sensitivity and safety (disclosing too much architecture detail could help bad actors). Critics, including many of the signatories to the March 2023 open letter organized by the Future of Life Institute, argued this made independent safety verification impossible.

The Claude 3 Model Card: Anthropic's Approach

Anthropic's Claude 3 model card (March 2024) takes a different structure. Rather than a single technical report, Anthropic publishes model cards following the format proposed by Mitchell et al. (2019) — standardized documents covering intended use, evaluation data, and ethical considerations alongside performance metrics.

The Claude 3 family card documents performance for three model tiers (Haiku, Sonnet, Opus) on the same benchmark suite, enabling direct within-family comparison. It explicitly discusses Constitutional AI methodology — the RLHF variant Anthropic developed where the model is trained against a written constitution of principles rather than purely human preference ratings. This level of methodology disclosure is more detailed than OpenAI's RLHF documentation.

On benchmarks, Claude 3 Opus reportedly scored 86.8% on MMLU (5-shot), 95.0% on GSM8K, and showed strong results on graduate-level reasoning (GPQA: 50.4%). The model card notes that Anthropic used a "held-out evaluation set" for MMLU to reduce contamination risk, though the exact decontamination procedure is not detailed.

The card also contains unusually specific failure mode documentation — noting specific categories where Claude 3 underperforms, including some spatial reasoning tasks and certain multilingual contexts. This type of honest limitation disclosure is more common in Anthropic's documentation than in OpenAI's or Google's equivalent materials.

The Gemini Technical Reports: Scale and Omission

Google published two major technical reports in this period: the original Gemini report (December 2023) and the Gemini 1.5 report (February 2024). Together they represent the most extensive multimodal evaluation documentation of the three companies — covering audio, image, video, and text tasks that GPT-4 and Claude 3 reports did not systematically address.

The Gemini 1.5 Pro report in particular introduced the 1-million-token context window evaluation, documenting performance on tasks like "needle in a haystack" retrieval (finding a specific fact buried in up to 1M tokens of text) with impressive results. Long-context retrieval at that scale had not been systematically evaluated by OpenAI or Anthropic in their published reports, giving Google a documentation advantage in that dimension.
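The mechanics of a "needle in a haystack" test are simple enough to sketch: bury one distinctive fact at a chosen depth in filler text, ask the model to retrieve it, and score whether the answer contains the fact. The code below is a minimal illustration under those assumptions. The helper names, the filler text, and the needle are hypothetical, and the model call itself is left out.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)

def retrieved(model_answer, expected_fact):
    """Score one trial: does the model's answer contain the expected fact?"""
    return expected_fact.lower() in model_answer.lower()

# Hypothetical filler and needle; a real test would use natural long documents.
filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The secret launch code is 7421."

# One prompt per insertion depth, so recall can be reported by position.
prompts = {d: build_haystack(filler, needle, d) for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

Reporting recall by depth and by context length matters, because a single aggregate number can hide weak spots in the middle of the window.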

However, independent researchers identified inconsistencies in the Gemini Ultra launch. In December 2023, days after the announcement, a Google DeepMind researcher acknowledged in a post on X that the video demonstrations shown in the original Gemini announcement were not real-time: the video was sped up and the model's inputs were still frames, not live video. Google updated its blog post to clarify this. The incident, while not directly a benchmark issue, illustrates how evaluation claims can be shaped by presentation choices.

DOCUMENTATION COMPARISON

OpenAI's reports excel at exam-based benchmarks and real-world professional tests. Anthropic's model cards provide the most transparent failure mode documentation and RLHF methodology detail. Google's reports are strongest on multimodal and long-context evaluations. Each company's documentation strengths align suspiciously well with their respective competitive advantages — a pattern worth noticing.

PRACTICAL GUIDANCE

When a benchmark score is cited without a link to the underlying technical report, ask for it. All three labs publish their reports publicly. Within the report, check: (1) the prompting format used, (2) whether a decontamination procedure is described, (3) whether results are from the company's own evaluation or independently replicated, and (4) whether the specific model version tested matches the model version available to you via API.

Lesson 2 Quiz

What is the primary reason OpenAI's GPT-4 technical report does not disclose model size, training data, or architecture details?

✓ Correct. OpenAI explicitly stated competitive sensitivity and the argument that architectural details could assist adversarial use as justifications. Critics, including Future of Life Institute letter signatories, argued this made independent safety verification impossible.

✗ OpenAI explicitly cited competitive sensitivity and safety — specifically that disclosing architecture could assist bad actors. This rationale was criticized by the Future of Life Institute open letter signatories as making independent safety verification impossible.

What distinguishes Anthropic's Constitutional AI (CAI) methodology disclosure in Claude 3's model card compared to OpenAI's RLHF documentation?

✓ Correct. Anthropic's Constitutional AI involves training the model against an explicit written constitution of principles — a methodological distinction that is more thoroughly documented than OpenAI's RLHF approach. The Claude 3 model card explicitly discusses this methodology.

✗ Constitutional AI is meaningfully different from standard RLHF: it involves an explicit written constitution of principles rather than purely human preference ratings. Anthropic's model card documents this methodology in more detail than OpenAI's equivalent materials cover RLHF.

What specific misrepresentation did Google acknowledge regarding its original Gemini announcement video demonstrations?

✓ Correct. A Google DeepMind researcher acknowledged the video was not real-time interaction — it was sped up and used still image frames as inputs. Google updated its blog post to clarify this after independent researchers flagged the discrepancy.

✗ The acknowledged issue was that the demonstration video was sped up and the model received still image frames, not live video input — contrary to what the presentation implied. Google updated its blog post to reflect this after researchers raised the issue.

Lab 2: Technical Report Analysis

Practice reading between the lines of AI company documentation.

What the Reports Reveal — and Conceal

In this lab you'll practice analyzing specific claims from OpenAI, Anthropic, and Google's technical documentation. The AI assistant can walk you through specific passages, methodology choices, and how to identify what's missing from a technical report.

Bring a real benchmark score you've seen cited, or ask about specific sections of the published reports. The goal is to build the analytical habit of reading documentation critically.

Try asking: "In the Gemini 1.5 technical report, what evidence is given for the 1-million-token context window claims, and how was 'needle in a haystack' performance actually measured?"

GPT vs. Claude vs. Gemini · Module 5 · Lesson 3

Independent Evaluations and What They Found

Third-party researchers, the LMSYS Chatbot Arena, and enterprise evaluators have run tests the labs didn't. The results differ — sometimes dramatically.

By early 2024, the LMSYS Chatbot Arena had accumulated over 500,000 human preference votes on head-to-head model comparisons — making it one of the largest independent AI evaluation datasets ever assembled. The methodology was elegant in its simplicity: show two anonymous model responses to the same prompt, let a human pick the better one, reveal the models only after voting. The resulting Elo ratings painted a picture of model quality that did not always match company-published benchmarks. GPT-4 dominated early. Claude 3 Opus briefly took the lead. Gemini 1.5 Pro surged in multimodal categories. The leaderboard moved in real time, reflecting actual deployed model versions — not the snapshot evaluated in a technical report months earlier.

LMSYS Chatbot Arena: Methodology and Findings

The LMSYS Chatbot Arena (developed by researchers at UC Berkeley and LMSYS) uses Elo ratings derived from pairwise human preference votes. As of mid-2024, the dataset included over 1 million votes across 50+ models. Because voters self-select their prompts, the distribution reflects what real users actually ask — a significant advantage over fixed benchmark datasets whose prompt distribution may not match real usage.
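Turning pairwise votes into a ranking can be sketched with the textbook online Elo update: each vote moves the winner up and the loser down by an amount that depends on how surprising the result was. The starting rating of 1000, the K-factor of 32, and the model names are assumptions for illustration, and the Arena's published rating computation may differ from this simple version.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings, winner, loser, k=32):
    """Apply one pairwise vote in place; the total rating mass is conserved."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_win)
    ratings[winner] += delta
    ratings[loser] -= delta

# Hypothetical models and votes, each vote recorded as (winner, loser).
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
```

Because each update depends on the current ratings, the order of votes affects the result, so small rating gaps between models should not be over-read.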

Key findings from Arena data through 2024: GPT-4o (released May 2024) achieved the highest Arena Elo score of any model at its release, briefly holding the #1 position. Claude 3 Opus was rated highest for instruction following and nuanced writing tasks by Arena voters. Gemini 1.5 Pro showed particularly strong performance on longer prompts and multimodal tasks where Arena added image support. Open-weight models such as Llama 3 70B consistently ranked well for their parameter count but did not match frontier closed models on Arena Elo through mid-2024.

The Arena methodology has its own limitations. Voter bias is real: the platform skews toward English-speaking, technically literate users. Prompt distribution may favor conversational and creative tasks over enterprise workflows. And because blinding is imperfect (stylistic tells can reveal model identity), voter neutrality is not fully guaranteed.

Enterprise and Domain-Specific Evaluations

Several well-documented third-party evaluations have produced results that nuance the headline benchmark story. In October 2023, a team at Johns Hopkins published an evaluation of GPT-4, Claude 2, and PaLM 2 on clinical reasoning tasks using actual USMLE questions beyond the standard MedQA dataset. GPT-4 led, but the performance gap narrowed significantly on Step 2 CK questions involving ambiguous clinical presentations — tasks requiring judgment rather than recall.

In March 2024, LlamaIndex published an evaluation of RAG (Retrieval-Augmented Generation) pipeline performance across GPT-4, Claude 3, and Gemini 1.5. The finding: Gemini 1.5 Pro significantly outperformed both on long-document retrieval tasks due to its larger context window, but underperformed on multi-hop reasoning across retrieved chunks. GPT-4 showed the best multi-hop reasoning. Claude 3 Sonnet offered the best cost-to-performance ratio for standard RAG deployments.

In April 2024, Scale AI published their SEAL (Safety, Evaluations, and Alignment Lab) leaderboard results, which used human expert evaluators rather than automated scoring. Claude 3 Opus rated highest for instruction following precision. GPT-4 rated highest for factual accuracy in STEM domains. Gemini 1.5 Pro rated highest for multilingual task handling. The pattern: each model leads in different dimensions depending on evaluation methodology.

The HumanEval Controversy

HumanEval — the 164-problem Python coding benchmark released by OpenAI in 2021 — became one of the most-cited scores in AI capability discussions. By 2024, multiple frontier models reported pass@1 scores above 85%, with GPT-4o reporting 90.2% and Claude 3.5 Sonnet reporting 92% in their respective technical materials.

However, two independent analyses cast doubt on what these scores mean in practice. In 2023, researchers at the University of Edinburgh found that model performance on HumanEval problems was significantly better when the docstring matched training data patterns, again suggesting contamination. More practically, a 2024 evaluation by EvalPlus (an extended HumanEval dataset with additional test cases for each problem) found that pass rates dropped by an average of 15–20 percentage points when additional edge cases were added to the original test suite. A model scoring 90% on standard HumanEval might score 70–75% on EvalPlus, which better approximates production coding requirements.
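For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled solutions passes a problem's tests. The sketch below uses the unbiased estimator described in the paper that introduced HumanEval: with n samples of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). The sample counts are hypothetical and only illustrate how stricter tests lower the same model's score.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n sampled solutions of which c passed the tests."""
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples, 9 pass the original tests,
# but only 7 still pass once extra edge-case tests are added.
base = pass_at_k(n=10, c=9, k=1)
extended = pass_at_k(n=10, c=7, k=1)
```

The model's outputs are identical in both cases. Only the test suite changed, which is the core of the EvalPlus critique.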

KEY PATTERN

Independent evaluations consistently reveal a pattern: no single model dominates across all task types when evaluation is done rigorously. GPT-4 family models tend to lead on factual STEM tasks and multi-step reasoning under time pressure. Claude models tend to lead on instruction precision, nuanced writing, and long-document handling. Gemini models tend to lead on multimodal tasks, multilingual performance, and very long context retrieval. This pattern should shape which model you evaluate first for a given use case — not leaderboard rankings alone.

Lesson 3 Quiz

What is the core methodological advantage of the LMSYS Chatbot Arena compared to traditional benchmarks like MMLU?

✓ Correct. Because Arena voters choose their own prompts, the evaluation distribution reflects actual user needs — a significant improvement over fixed benchmark datasets designed years before current models existed.

✗ The key advantage is that Arena prompt distribution reflects what real users actually ask — voters bring their own tasks. This avoids the mismatch between fixed benchmark task distributions and real-world use cases.

The EvalPlus extension of HumanEval found that adding additional edge-case test cases caused average model performance to drop by approximately how much?

✓ Correct. The 15–20 percentage point drop when additional edge-case test cases were added is a strong indicator that reported pass rates on standard HumanEval overestimate real-world coding performance.

✗ The drop was 15–20 percentage points. EvalPlus extended each HumanEval problem with additional test cases; models that passed standard HumanEval often failed the new edge cases, indicating benchmark scores don't fully capture production coding capability.

According to the Scale AI SEAL leaderboard, which model rated highest for instruction following precision in April 2024?

✓ Correct. Scale AI's SEAL leaderboard, which used human expert evaluators rather than automated scoring, rated Claude 3 Opus highest for instruction following precision. GPT-4 led on STEM factual accuracy; Gemini 1.5 Pro led on multilingual tasks.

✗ Claude 3 Opus rated highest for instruction following precision on the Scale AI SEAL leaderboard. The leaderboard used human expert evaluators, which surfaces different strengths than automated benchmarks — GPT-4 led on STEM accuracy and Gemini 1.5 Pro led on multilingual tasks.

Lab 3: Independent Evaluation

Practice interpreting third-party evaluation results and Arena data critically.

Reading Between the Rankings

In this lab you'll work through the logic of independent evaluations — LMSYS Chatbot Arena, SEAL, EvalPlus, domain-specific evals — and practice explaining why rankings diverge from company benchmarks. The AI assistant can walk through specific evaluation methodologies and help you apply the right framework for a given use case.

Bring a real scenario where you need to pick a model, or ask about why Arena Elo and MMLU scores sometimes point in opposite directions.

Try asking: "I'm building a RAG pipeline for a legal document review tool. How should I use Arena data, EvalPlus results, and domain-specific evals together to choose between GPT-4, Claude 3, and Gemini 1.5 Pro?"

GPT vs. Claude vs. Gemini · Benchmark Reality Check · Lesson 4

Building Your Own Evaluation: How to Actually Know Which Model is Best for Your Task

Generic benchmarks cannot predict your use case. Here is how to design task-specific evals that give you answers you can act on.

In 2023, a fintech company was choosing between GPT-4 and Claude 2 for a contract summarization product. On MMLU, GPT-4 led by four points. On the company's own 50-document test set — real contracts from their legal team, graded by their in-house counsel — Claude 2 outperformed GPT-4 on precision of obligation extraction by a margin that mattered. The MMLU gap had predicted nothing. The in-house eval predicted everything. They shipped with Claude. The lesson is not that Claude is better at contracts. The lesson is that the only evaluation that predicts your outcome is one that uses your task, your data, and your quality standard.

Why Generic Benchmarks Don't Predict Your Use Case

MMLU tests breadth across 57 academic subjects using multiple-choice format. HumanEval tests correctness on 164 toy Python problems with clean, unambiguous specs. Neither dataset resembles a customer support email, a legal clause, a financial narrative, or a product description — the tasks most organizations actually care about. When a benchmark measures something different from your task, its predictive validity for your task approaches zero regardless of sample size.

The mismatch runs deeper than topic. Benchmark prompts are typically short, self-contained, and unambiguous. Real-world prompts are often long, context-dependent, and underspecified. Benchmark scoring is binary (correct/incorrect) or uses fixed rubrics. Real-world quality involves judgment calls about tone, completeness, format compliance, and domain conventions that a fixed rubric cannot capture. And benchmark data is static — the same questions every time — while real-world inputs are unpredictable, evolving, and occasionally adversarial.

This does not mean benchmarks are useless for your decision. They establish a rough capability floor. A model that scores 45% on MMLU is unlikely to perform well on your knowledge-intensive task. But between models scoring 82% and 87%, the benchmark tells you almost nothing about which one to pick for your specific application.

Designing Task-Specific Evaluations

A useful task-specific eval has four components: a representative input set, a clear quality rubric, a reliable scoring mechanism, and enough volume to distinguish signal from noise.

Input set: Collect 50–200 real examples from your actual use case. If you do not have real examples yet, create them using domain experts who understand what realistic inputs look like. Avoid synthetic inputs generated by AI models — they tend to be cleaner and more uniform than real inputs, which biases your eval toward models that handle tidy prompts well.

Quality rubric: Define what "good" looks like before you see any model outputs. For extraction tasks: precision (did it avoid including things that weren't there?) and recall (did it capture everything relevant?). For generation tasks: accuracy, format compliance, appropriate length, tone. For reasoning tasks: correctness of conclusion and quality of reasoning chain. The rubric should be specific enough that two raters applying it independently agree at least 80% of the time.
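For extraction tasks the rubric can be made concrete with two numbers. Precision is the share of extracted items that are actually correct, and recall is the share of the correct items that were extracted. The sketch below assumes items can be compared by exact string match, which real contract text rarely allows without normalization. The obligation strings are hypothetical.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for a set of extracted items against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical obligations marked by a reviewer (gold) and extracted by a model.
gold = {"pay within 30 days", "deliver by March 1",
        "maintain insurance", "give 60 days notice"}
predicted = {"pay within 30 days", "deliver by March 1", "provide weekly reports"}

precision, recall = precision_recall(predicted, gold)
# Two of three extracted items are correct; two of four gold items were found.
```

Which of the two matters more depends on the task: a missed obligation (low recall) and an invented one (low precision) carry different costs.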

Scoring mechanism: Human evaluation is the gold standard but expensive. Automated scoring using a reference answer works for tasks with clear correct answers (extraction, translation, summarization against a reference). LLM-as-judge (using a separate model to score outputs) is increasingly common for open-ended generation — but calibrate it by checking agreement with human raters on a sample. Never use the same model family to judge outputs from that same family.
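Calibrating an automated judge against human raters comes down to measuring agreement on the same outputs. The sketch below computes raw percent agreement and Cohen's kappa, which discounts the agreement two raters would reach by chance. The labels are hypothetical, and kappa is offered here as one reasonable choice of statistic, not the only one.

```python
from collections import Counter

def percent_agreement(labels_a, labels_b):
    """Share of items on which two raters gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical verdicts on the same ten outputs.
human = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "pass", "pass", "fail", "fail", "pass"]

agreement = percent_agreement(human, judge)
kappa = cohens_kappa(human, judge)
```

When most outputs pass, raw agreement can look high even for a judge that adds little information, which is why a chance-corrected number is worth reporting alongside it.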

Volume: With fewer than 30 examples, you cannot reliably distinguish a real 5-point accuracy difference from noise. For most practical decisions, 50–100 examples provide useful signal, particularly when both models are scored on the same inputs and the gap between them is large. If you are making a decision that will affect millions of requests or cost millions of dollars, invest in 200–500.
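A rough way to sanity-check these volumes is to compute the margin of error on a measured accuracy. The sketch below uses the normal approximation for a proportion and assumes independent examples and an accuracy of 85%, both of which are illustrative assumptions. It describes the uncertainty on one model's score, not a full significance test between two models.

```python
from math import sqrt

def margin_of_error(accuracy, n, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n examples."""
    return z * sqrt(accuracy * (1.0 - accuracy) / n)

# Margin in percentage points at an assumed 85% accuracy.
margins = {n: round(100 * margin_of_error(0.85, n), 1) for n in (30, 50, 100, 200, 500)}
```

At the assumed 85% accuracy the margin on 100 examples is about 7 points, so small gaps between models call for more examples or for comparing the models item by item on the same inputs.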

Prompt Sensitivity Testing

One of the most important and most neglected dimensions of model evaluation is prompt sensitivity — how much model output quality varies when you rephrase the same instruction. The 2024 contamination research that found 10–15 point drops from rephrasing was exposing this dynamic at benchmark scale. The same phenomenon applies to your production prompts.

To test prompt sensitivity: take your best-performing prompt for a given task and create three to five variations, with different ordering of instructions, different phrasing of the same constraints, and different examples in the few-shot block. Run each variation against your eval set. If quality is stable across variations, you have a robust prompt-model combination. If quality swings by more than 10 percentage points across reasonable phrasings, you have a fragile setup that will degrade when real inputs drift from the examples in your eval set.

Sensitivity testing also reveals which model is safer to deploy. A model that scores 88% on your best prompt but 70% on a reasonable alternative is riskier than one scoring 83% consistently across all variants. In production, you cannot control how users phrase their requests — consistency matters more than peak performance.
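The comparison above can be reduced to a small report per model: best score, worst score, mean, and the spread between best and worst across prompt variants. The sketch below assumes you already have one eval score per variant. The variant names and scores are hypothetical, and the 10-point threshold is taken from the rule of thumb in this lesson.

```python
def sensitivity_report(scores_by_variant):
    """Summarize eval scores (in percent) across prompt variants."""
    scores = list(scores_by_variant.values())
    return {
        "best": max(scores),
        "worst": min(scores),
        "mean": sum(scores) / len(scores),
        "spread": max(scores) - min(scores),
    }

def is_fragile(scores_by_variant, max_spread=10.0):
    """Flag a prompt-model pair whose score swings more than max_spread points."""
    return sensitivity_report(scores_by_variant)["spread"] > max_spread

# Hypothetical scores for two models on three phrasings of the same task.
model_x = {"original": 88.0, "reordered": 70.0, "rephrased": 81.0}
model_y = {"original": 83.0, "reordered": 82.0, "rephrased": 84.0}

report_x = sensitivity_report(model_x)
report_y = sensitivity_report(model_y)
```

Sorting candidates by worst-case score rather than best-case score is one simple way to bake this preference for consistency into a selection process.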

Latency and Cost as Evaluation Dimensions

Quality metrics are necessary but not sufficient for a deployment decision. Latency and cost are evaluation dimensions that belong in your model comparison framework alongside accuracy scores.

Latency matters differently by application type. A synchronous user-facing feature (autocomplete, real-time chat) requires median latency under 2–3 seconds and tail latency (P95) under 5 seconds. A batch processing pipeline that runs overnight can tolerate 30-second requests. Measure time-to-first-token (relevant for streaming interfaces) separately from total response time. All three frontier labs' APIs show significant latency variance under load — benchmark under realistic concurrency, not just sequential single requests.
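Median and tail latency can be summarized from a list of measured response times with the standard library. The sketch below assumes total response times in seconds collected under realistic concurrency. The sample values are hypothetical, and the thresholds are the user-facing targets mentioned above. Time-to-first-token would be collected and summarized separately in the same way.

```python
import statistics

def latency_summary(latencies_seconds):
    """Return the median (P50) and 95th percentile (P95) of measured latencies."""
    cut_points = statistics.quantiles(latencies_seconds, n=100, method="inclusive")
    return {"p50": statistics.median(latencies_seconds), "p95": cut_points[94]}

def meets_interactive_target(summary, p50_limit=3.0, p95_limit=5.0):
    """Check the summary against the synchronous, user-facing thresholds."""
    return summary["p50"] <= p50_limit and summary["p95"] <= p95_limit

# Hypothetical measurements: mostly fast, with two slow outliers.
samples = [0.8, 0.9, 1.0, 1.1, 1.2, 1.0, 0.9, 1.3, 1.1, 1.0,
           1.4, 0.95, 1.05, 1.2, 1.1, 0.85, 1.15, 1.25, 4.8, 6.5]

summary = latency_summary(samples)
ok_for_chat = meets_interactive_target(summary)
```

With only 20 samples the P95 is driven almost entirely by the two slowest requests, so tail estimates need far more measurements than medians do.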

Cost math compounds quickly at scale. A model with 30% better accuracy at 4× the cost may be the wrong choice for a feature handling 10 million requests per day. Build a simple cost model: (requests per day) × (average tokens per request) × (price per token). Compare that across models at your expected quality threshold. Often a smaller, cheaper model (GPT-4o mini, Claude 3 Haiku, Gemini 1.5 Flash) comes within 5–10 points of a frontier model on your specific task while costing 10–20× less per token.
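The cost formula above translates directly into a few lines of code. The sketch below assumes a single blended price per million tokens, whereas real pricing usually distinguishes input and output tokens. The request volume, token count, and prices are hypothetical placeholders, not any vendor's actual rates.

```python
def daily_cost(requests_per_day, avg_tokens_per_request, price_per_million_tokens):
    """Projected daily spend in dollars for one model."""
    total_tokens = requests_per_day * avg_tokens_per_request
    return total_tokens * price_per_million_tokens / 1_000_000

# Hypothetical workload: 10 million requests per day, 800 tokens each.
frontier = daily_cost(10_000_000, 800, price_per_million_tokens=5.00)
small = daily_cost(10_000_000, 800, price_per_million_tokens=0.25)

cost_ratio = frontier / small
```

Putting the projected spend next to each model's score on your own eval makes the trade-off explicit: the question becomes whether the extra accuracy is worth the difference in daily cost at your volume.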

YOUR EVALUATION CHECKLIST

Before committing to a model for production: (1) Run your own task-specific eval on at least 50 real examples. (2) Test at least three prompt variants to check sensitivity. (3) Measure latency under realistic concurrency. (4) Build a cost projection at your expected request volume. (5) Check whether the model version you tested is the version available in production — labs update models silently. Only then treat a benchmark score as a useful signal.

THE BOTTOM LINE

The entire benchmark ecosystem exists because evaluating AI is genuinely hard and expensive. Labs publish standardized benchmarks because the alternative — running custom evals for every potential customer — is impossible. But you are not every potential customer. You have a specific task, specific quality requirements, and specific cost constraints. The 30–100 hours it takes to build a real task-specific eval is almost always worth it before committing to a model at production scale. No headline number substitutes for it.

Lesson 4 Quiz

3 questions — free, untracked, retake anytime.

When designing a task-specific evaluation, what is the minimum number of real examples typically needed to reliably distinguish a genuine 5-point accuracy difference from noise?

✓ Correct. With fewer than 30 examples you cannot reliably distinguish real differences from statistical noise. 50–100 examples provide useful signal for most practical decisions.

✗ Fewer than 30 examples cannot reliably distinguish a real 5-point difference from noise. The lesson recommends 50–100 examples for most decisions, and 200–500 for high-stakes or high-volume production choices.

What does prompt sensitivity testing reveal that a single best-prompt evaluation cannot?

✓ Correct. Testing multiple prompt variants reveals whether the model-prompt combination is robust or fragile. A model scoring 88% on your best prompt but 70% on reasonable alternatives is riskier to deploy than one scoring 83% consistently.

✗ Prompt sensitivity testing reveals robustness: whether quality holds across different phrasings of the same instruction. In production you cannot control how users phrase requests, so a model with consistent performance across variants is safer than one with a high peak but a weak worst case.

According to the lesson's cost analysis guidance, what should you always compare before ruling out a smaller, cheaper model like GPT-4o mini or Claude 3 Haiku?

✓ Correct. Smaller models often come within 5–10 points on task-specific quality while costing 10–20× less per token. At high request volume that gap compounds dramatically — the cost math must be run against your actual quality threshold, not benchmark scores.

✗ The right comparison is task-specific quality vs. cost-per-token at your request volume. Smaller models like Haiku or Flash often fall within 5–10 points of frontier models on specific tasks while costing 10–20× less — a tradeoff benchmark scores alone cannot reveal.

Lab 4: Synthesis and Integration

Apply and extend the concepts from this lesson through guided conversation with an AI assistant.

Use this lab to explore how the concepts from Lesson 4 apply to your own questions and interests. The AI assistant is here to help you think through complex scenarios.

Module Test

15 questions covering all lessons — free, untracked, retake anytime.

What does MMLU stand for, and how many academic subjects does it cover?

✓ Correct. MMLU stands for Massive Multitask Language Understanding. It covers 57 academic subjects ranging from elementary mathematics to professional law and medicine, using multiple-choice format throughout.

✗ MMLU stands for Massive Multitask Language Understanding and covers 57 academic subjects in multiple-choice format. It is one of the broadest knowledge benchmarks used to evaluate frontier models.

HumanEval, OpenAI's coding benchmark, consists of how many Python programming problems?

✓ Correct. HumanEval consists of 164 Python programming problems released by OpenAI in 2021. Each problem includes a function signature, docstring, and test cases; models are scored on whether their generated code passes the tests (pass@1).

✗ HumanEval contains 164 Python programming problems. Released by OpenAI in 2021, it became one of the most widely cited coding benchmarks despite concerns about data contamination and the simplicity of its test cases relative to production code.

What is the commonly quoted formulation of Goodhart's Law as it applies to AI benchmarking?

✓ Correct. Goodhart's Law states "When a measure becomes a target, it ceases to be a good measure." In AI, this means that once benchmark scores drive commercial decisions, labs have strong incentives to optimize specifically for those scores rather than the underlying capability the benchmark was designed to measure.

✗ Goodhart's Law states: "When a measure becomes a target, it ceases to be a good measure." Applied to AI benchmarks, this explains why optimizing for MMLU or HumanEval scores can diverge from improving actual capability — the measure stops tracking what it was meant to track.

What is benchmark gaming via test set contamination, and why is it difficult to prevent in frontier model training?

✓ Correct. Because benchmarks are published on the public internet and training data spans hundreds of billions of tokens scraped from the web, benchmark questions naturally appear in training data. Models may then partially memorize answers rather than generalize the underlying reasoning — which is why performance drops when questions are rephrased.

✗ Contamination occurs because benchmark datasets are published publicly online, and frontier model training data is scraped from the public internet at massive scale. This means benchmark questions naturally appear in training data, and models may memorize answers rather than learn the underlying reasoning — the key evidence being the 10–15 point drop when questions are rephrased.

The LMSYS Chatbot Arena uses what rating system to rank models based on human preference votes?

✓ Correct. The LMSYS Chatbot Arena uses Elo ratings derived from pairwise human preference votes, the same system used in chess rankings. Two anonymous model responses are shown; a human picks the better one; the vote then updates both models' Elo scores.

✗ LMSYS Chatbot Arena uses Elo ratings, the same system used in chess rankings. Two anonymous model outputs are shown for the same prompt; the human picks the better one; the result updates both models' scores. This pairwise approach has accumulated over 1 million votes across 50+ models.

The GPT-4 technical report (March 2023) is notable primarily for what it omits. Which of the following was NOT disclosed in that report?

✓ Correct. OpenAI's GPT-4 technical report omitted model size, training data composition, and compute budget, citing competitive sensitivity and safety concerns. The report did document exam performance, red-teaming, and predictable scaling — but withheld the architectural details that would enable independent safety verification.

✗ The GPT-4 technical report omitted model parameter count, training data composition, and compute budget. OpenAI cited competitive sensitivity and the argument that architecture disclosure could assist bad actors. The report did include exam performance (Bar Exam ~90th percentile), red-teaming results, and the predictable scaling finding.

When Google announced Gemini Ultra achieved 90.0% on MMLU — surpassing the "human expert" baseline of 89.8% — what methodological issue did critics identify?

✓ Correct. Gemini used 32-shot chain-of-thought prompting while most prior model comparisons on MMLU used 5-shot prompting. OpenAI noted that GPT-4, evaluated under the same 32-shot CoT conditions, also exceeded 90% — a fact that received far less coverage than the initial announcement.

✗ The key methodological gap was prompting format: Gemini used 32-shot chain-of-thought prompting vs. the standard 5-shot format used in prior comparisons. Under identical 32-shot CoT conditions, GPT-4 also exceeded 90% on MMLU — which significantly undermined the "first to surpass human expert performance" framing.

MT-Bench is designed to evaluate what specific capability that most standard benchmarks do not measure?

✓ Correct. MT-Bench (Multi-Turn Benchmark) evaluates multi-turn conversation quality — specifically how well a model handles follow-up questions, maintains context, and produces coherent responses across an extended exchange. This dimension is absent from single-turn benchmarks like MMLU and HumanEval.

✗ MT-Bench evaluates multi-turn conversation quality — how well a model responds to follow-up questions and maintains coherence across multiple turns of dialogue. Standard benchmarks like MMLU and HumanEval are single-turn; they do not test the conversational continuity that matters in real chat-based applications.

HELM (Holistic Evaluation of Language Models) was developed by which institution, and what distinguishes its approach from single-benchmark evaluations?

✓ Correct. HELM was developed at Stanford and takes a holistic approach — evaluating models across multiple scenarios and multiple metrics simultaneously (accuracy, calibration, robustness, fairness, efficiency). This multi-dimensional view contrasts with single-benchmark evaluations that optimize for one number.

✗ HELM is Stanford's Holistic Evaluation of Language Models. Its distinguishing feature is breadth: rather than a single accuracy score, it evaluates models across many scenarios and metrics simultaneously — accuracy, calibration, robustness, fairness, and efficiency — giving a more complete picture of capability and risk.

What consistent finding emerged from the LMSYS Chatbot Arena leaderboard compared to company-published benchmark rankings?

✓ Correct. A key finding from Arena data is that leaderboard rankings frequently diverge from company-published benchmark rankings. Different models led in different Arena categories (instruction following, multimodal, long-context) in ways that did not track MMLU or HumanEval standings — reinforcing that benchmark scores are poor proxies for overall human preference.

✗ Arena rankings consistently diverged from benchmark rankings. Models that led on MMLU or HumanEval did not reliably lead on human preference votes, and different models dominated different Arena categories. This divergence is one of the strongest pieces of evidence that generic benchmark scores are poor predictors of real-world user preference.

Which type of information do all three major labs (OpenAI, Anthropic, Google) commonly omit from their technical reports, regardless of differences in disclosure philosophy?

✓ Correct. Across all three labs, training data composition details and exact parameter counts are consistently withheld — the stated reasons ranging from competitive sensitivity (OpenAI) to safety concerns. This omission is what makes independent safety verification and contamination auditing so difficult.

✗ All three labs consistently omit training data composition and exact model parameter counts from their technical reports. This is true even as they differ in how much RLHF methodology or failure mode detail they disclose. The omission of training data details is what makes independent benchmark contamination auditing nearly impossible.

The 2024 study (arXiv:2404.00329) demonstrated prompt sensitivity by rephrasing benchmark questions. What did the 10–15 percentage point performance drop on rephrased questions most strongly suggest?

✓ Correct. A 10–15 point drop on semantically equivalent but differently-worded questions is strong evidence that models partially memorized original benchmark questions from training data rather than fully learning the underlying reasoning. True understanding should be robust to paraphrase; memorization is not.

✗ The 10–15 point drop on rephrased-but-equivalent questions is the primary evidence for significant training data contamination. If models had genuinely learned the underlying reasoning, paraphrasing should cause minimal performance change. The large drop indicates answers were partly memorized from benchmark questions that appeared in training data.

EvalPlus extended HumanEval by adding extra test cases to each problem. What did this reveal about reported pass@1 scores?

✓ Correct. EvalPlus found average pass rates dropped 15–20 percentage points when additional edge cases were added. A model scoring 90% on standard HumanEval might score 72–75% on EvalPlus — a gap that better approximates production coding requirements where edge cases are the norm.

✗ EvalPlus found pass rates dropped by 15–20 percentage points on average. Models that passed original HumanEval test cases often failed the additional edge cases, meaning standard HumanEval scores significantly overstate real-world coding performance. A 90% HumanEval score might correspond to roughly 72–75% on EvalPlus.

When designing a task-specific evaluation, what is the recommended minimum inter-rater agreement rate for a quality rubric to be considered specific enough for reliable scoring?

✓ Correct. A quality rubric should be specific enough that two independent raters applying it agree at least 80% of the time. Below that threshold, the rubric is too vague and the scores it produces are not reproducible — making comparisons between models unreliable.

✗ The lesson recommends 80% inter-rater agreement as the threshold for a rubric being specific enough to use. If two raters applying the same rubric independently agree less than 80% of the time, the rubric is too ambiguous and the resulting scores cannot reliably distinguish between models.

According to the lesson's guidance on latency benchmarking, what is the recommended practice for measuring latency before a production deployment decision?

✓ Correct. Real production latency must be measured under realistic concurrency (not sequential single requests), and you need both time-to-first-token (critical for streaming interfaces) and total response time, plus P95 tail latency — because all three labs' APIs show significant variance under load that single-request tests do not reveal.

✗ The lesson recommends benchmarking under realistic concurrency, measuring time-to-first-token separately from total response time, and checking P95 tail latency in addition to median. Single sequential requests dramatically underestimate real latency under load, and all three frontier API providers show significant variance that only concurrent testing reveals.
