Module 5 · Lesson 1

Reading Comprehension & QA Evaluation

From SQuAD to BoolQ — how we measure whether a model truly understands text

What makes a question-answering benchmark meaningful, and why did SQuAD's near-human scores fail to tell the whole story?

When the Stanford Question Answering Dataset launched in 2016, it seemed to offer a clean scoreboard for machine reading comprehension. By early 2018, two systems, one from Microsoft and one from Alibaba, posted Exact Match scores above the human baseline on the leaderboard within days of each other. Headlines declared the reading comprehension problem solved. It wasn't.

What SQuAD Actually Measures

SQuAD (Stanford Question Answering Dataset) presents models with a Wikipedia passage and a question whose answer is a contiguous span of text within that passage. Performance is measured with two metrics: Exact Match (EM), which checks whether the predicted span matches a gold answer exactly after light normalization (case, punctuation, articles), and F1, which gives partial credit for overlapping tokens.

The benchmark was carefully constructed: crowdworkers read passages and wrote questions, then separate workers identified answer spans. The result was 100,000+ question-answer pairs grounded in real text. But it had a structural constraint that would prove consequential — every answer had to be extractable verbatim from the passage.

Exact Match (EM): A binary score, 1 if the predicted answer string matches the gold answer exactly (after normalization) and 0 otherwise. Strict but insensitive to near-misses.

F1 Score (span-level): Token-overlap F1 between predicted and gold answer spans. Rewards partial overlap: a prediction of "Barack Obama" gets partial credit against "President Barack Obama."

Extractive QA: The model selects a span from the input text as its answer, rather than generating free-form text. Simplifies evaluation but limits naturalness.
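The two metrics defined above can be sketched in a few lines. This is a minimal illustration modeled on the usual SQuAD-style normalization (lowercase, strip punctuation and English articles, collapse whitespace); the function names are placeholders, and a real evaluation would also take the maximum score over several gold answers.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Two empty answers agree; one empty answer is a complete miss
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

pred, gold = "Barack Obama", "President Barack Obama"
print(exact_match(pred, gold))          # 0
print(round(token_f1(pred, gold), 2))   # 0.8
```

The "Barack Obama" example from the glossary scores 0 on EM but 0.8 on F1: precision is 2/2 and recall is 2/3.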

The Adversarial Gap: SQuAD 2.0

In 2018, Rajpurkar et al. released SQuAD 2.0, adding 50,000 unanswerable questions — questions that looked plausible but had no answer in the passage. Models that had seemingly "solved" SQuAD 1.1 dropped dramatically. The best models at SQuAD 2.0 launch scored around 66–67% F1; humans scored ~89%. The gap had re-opened.

This exposed a core weakness: models were doing sophisticated pattern matching against passage text rather than understanding when information was absent. A model trained to always find an answer span would confidently extract wrong spans when the correct answer was "I don't know."
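A small sketch shows why an always-answer model is penalized once unanswerable questions are scored. It assumes a simplified convention in which an unanswerable question has an empty gold list and abstention is an empty string; the mini dataset and the comparison are hypothetical, and the string matching is deliberately cruder than real EM normalization.

```python
def squad2_exact_match(prediction: str, gold_answers: list) -> int:
    """EM with abstention: an empty gold list marks an unanswerable question."""
    pred = prediction.strip().lower()
    if not gold_answers:
        # Unanswerable: only an empty prediction (abstention) is correct
        return int(pred == "")
    return int(any(pred == g.strip().lower() for g in gold_answers))

# Hypothetical mini eval set: two answerable and two unanswerable questions
gold = [["1066"], ["paris"], [], []]

always_answers = ["1066", "Paris", "1492", "london"]  # never abstains
can_abstain = ["1066", "Paris", "", ""]               # abstains when no answer exists

def mean_em(predictions: list) -> float:
    scores = [squad2_exact_match(p, g) for p, g in zip(predictions, gold)]
    return sum(scores) / len(scores)

print(mean_em(always_answers))  # 0.5
print(mean_em(can_abstain))     # 1.0
```

Both systems are perfect on the answerable half; the whole difference comes from the questions where the right response is to abstain.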

Key Finding · 2018

On SQuAD 2.0, BERT-large (released a few months after the benchmark, in late 2018) scored 80.0 EM and 83.1 F1 as a single model, still about 6 F1 points below human performance of 89.5. Architectures that had "beaten" human-level on SQuAD 1.1 remained below humans once unanswerable questions were introduced.

Beyond Extraction: BoolQ and Multi-Hop QA

BoolQ (Boolean Questions, 2019) shifted the format: given a passage and a yes/no question, the model must output True or False. This sounds simpler but requires genuine inference — questions like "Can you survive a lightning strike?" paired with a passage about lightning strike survival rates demand synthesis, not span extraction.
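For any yes/no benchmark, a useful sanity check is the majority-class baseline: the accuracy obtained by always predicting the more common label. The sketch below uses a made-up label list; only the calculation is the point.

```python
from collections import Counter

def majority_baseline_accuracy(labels: list) -> float:
    """Accuracy of a classifier that always predicts the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical gold labels: 5 "yes" questions and 3 "no" questions
labels = [True] * 5 + [False] * 3
print(majority_baseline_accuracy(labels))  # 0.625
```

A reported accuracy only means something relative to this floor, which for a skewed yes/no dataset sits well above 50%.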

HotpotQA (2018) introduced multi-hop reasoning: answering a question requires combining information from two separate Wikipedia paragraphs. Both supporting facts must be identified, not just the final answer. This made it much harder to shortcut through superficial text matching.
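Scoring the supporting facts separately from the answer can be sketched as a set-level F1. The representation used here, (paragraph title, sentence index) pairs, and the example values are illustrative assumptions rather than a reproduction of the official scorer.

```python
def supporting_fact_f1(predicted: set, gold: set) -> float:
    """F1 over sets of (paragraph_title, sentence_index) pairs."""
    if not predicted or not gold:
        return float(predicted == gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: the model found one of the two required facts
gold_facts = {("Paragraph A", 0), ("Paragraph B", 2)}
pred_facts = {("Paragraph A", 0), ("Paragraph C", 1)}
print(supporting_fact_f1(pred_facts, gold_facts))  # 0.5
```

A system that guesses the right final answer from a single paragraph still loses credit here, which is what makes the shortcut visible.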

The progression (SQuAD → SQuAD 2.0 → BoolQ → HotpotQA) illustrates benchmark iteration: each new benchmark closed a loophole that allowed models to score well without truly understanding text.

SQuAD 1.1 Human

~82% EM / ~91% F1

Crowdworker inter-annotator agreement baseline used as human performance target

SQuAD 2.0 Human

~89.5% F1

Human baseline on combined answerable + unanswerable questions

BoolQ Baseline

62.3% acc

Majority-class baseline (always predict True); human accuracy ~90%

HotpotQA Human

~91% F1

Human performance on distractor setting (10 paragraphs, 2 relevant)

Evaluator's Takeaway

When selecting a QA benchmark, verify whether it tests extraction, inference, or abstention. A model that scores 90% on SQuAD 1.1 may score 70% on SQuAD 2.0 and fail on multi-hop tasks — the same capability gap, revealed by different benchmark design choices.

Quiz — Reading Comprehension & QA

3 questions · select the best answer

1. What structural constraint in SQuAD 1.1 made it easier for models to game the benchmark?

Correct. SQuAD 1.1 required extractive answers — models learned to find plausible spans rather than reason about the passage, which allowed high scores without deep comprehension.

Not quite. The key constraint was that answers had to be verbatim spans from the passage, enabling extractive pattern matching rather than true comprehension.

2. What capability did SQuAD 2.0 add that revealed a major weakness in systems that had "beaten" SQuAD 1.1?

Correct. SQuAD 2.0 added ~50,000 plausible but unanswerable questions. Models that had been trained to always extract a span failed badly, exposing that they lacked any sense of "I don't know."

Not this one. SQuAD 2.0's innovation was adding unanswerable questions — testing whether models could recognize when a passage doesn't contain the answer.

3. In span-level F1 scoring for QA, what does the F1 metric specifically measure?

Correct. Span-level F1 computes precision and recall over shared tokens between prediction and gold answer, then takes their harmonic mean. It gives partial credit for partially correct spans.

Not quite. F1 here measures token overlap — the harmonic mean of precision (fraction of predicted tokens that appear in the gold answer) and recall (fraction of gold tokens captured in the prediction).

Lab — QA Metric Analysis

Interactive practice · discuss QA evaluation design with an AI tutor

Your Task

You're evaluating a reading comprehension system for a legal document search tool. The system must both extract answers from contracts and recognize when a contract doesn't address the question. Discuss which metrics and benchmark design choices apply to this scenario.

Start by describing one weakness of using SQuAD 1.1 EM/F1 metrics for this legal QA use case — then ask the tutor what benchmark design features you'd need instead.

QA Evaluation Tutor

L1 Lab

Hello! I'm your QA evaluation tutor. We're going to work through how to choose and design reading comprehension benchmarks for real-world applications. Tell me: what's one limitation of using standard SQuAD 1.1 EM/F1 for evaluating a legal document QA system?

Module 5 · Lesson 2

Code Generation Evaluation

HumanEval, pass@k, and the challenge of measuring whether code actually works

Why did GitHub Copilot's launch in 2021 force researchers to rethink what "correct code" even means for a benchmark?

When OpenAI published the Codex paper in July 2021, they introduced HumanEval: 164 hand-written programming problems, each with a function signature, docstring, and unit tests. The central metric was pass@k: given k generated solutions, does at least one pass all tests? It was a deliberately functional metric. The code either ran correctly or it didn't.

The HumanEval Benchmark

HumanEval was designed to avoid the contamination problems of code benchmarks derived from GitHub: if a model is trained on GitHub, it may have seen benchmark solutions verbatim. The 164 problems were original, spanning string manipulation, list operations, math, and simple algorithms. Problems came with an average of 7.7 unit tests each.

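The key metric, pass@k, works as follows: generate k code samples for a given problem; if any one passes all unit tests, the problem counts as solved. In practice the metric is not computed by drawing exactly k samples, which gives a high-variance estimate. Instead, n ≥ k samples are generated per problem (the Codex paper used n = 200), the number c that pass is counted, and pass@k is estimated as 1 − C(n−c, k) / C(n, k), averaged over problems. This estimator is unbiased.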
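The unbiased estimator mentioned above can be sketched directly from its combinatorial definition: with n samples of which c pass, pass@k is one minus the probability that a random size-k subset contains no passing sample. The function name and the example counts are illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which pass all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failing samples exist
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 20 samples generated, 3 pass the unit tests
print(round(pass_at_k(n=20, c=3, k=1), 3))   # 0.15
print(round(pass_at_k(n=20, c=3, k=10), 3))  # 0.895
```

The benchmark score is the mean of this value over all problems. The same 3-in-20 success rate reads as 15% at k=1 and about 89% at k=10, which is why the value of k must always be reported alongside the score.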

pass@k: Probability that at least one of k generated solutions passes all unit tests. Common values: pass@1 (typical single generation), pass@10, pass@100. Higher k favors recall over precision.

Functional Correctness: A code solution is correct if and only if it passes the provided test suite, not if it merely looks correct syntactically or stylistically. Tests define ground truth.

BLEU for Code: Token-overlap metric borrowed from machine translation. Largely deprecated for code evaluation because syntactically different code can be semantically identical and vice versa.
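Functional correctness can be illustrated with a toy harness that executes a candidate and then runs assertions against it. This is a sketch only: the helper name and the two candidates are hypothetical, and a real harness would run untrusted code in a sandboxed subprocess with time and memory limits rather than calling exec in-process.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate code passes every assertion in test_src."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the unit tests against it
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failure
        return False
    return True

# Two candidates that differ by a single token
candidate_ok = "def add(a, b):\n    return a + b\n"
candidate_bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

print(passes_tests(candidate_ok, tests))   # True
print(passes_tests(candidate_bad, tests))  # False
```

The two candidates share nearly all of their tokens, so a token-overlap metric would rate them as almost equivalent, while the test suite separates them cleanly.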

What HumanEval Doesn't Measure

164 problems is a small dataset. By 2023, multiple papers documented that models' HumanEval scores could be inflated by training data contamination: the problems and their solutions, which are publicly available, had appeared in discussion forums and other scraped sources. A separate weakness, thin test coverage, was addressed by HumanEval+ (Liu et al., 2023, the EvalPlus project), which greatly expanded the tests per problem, but the sample size remained limited.

More importantly, HumanEval problems are self-contained functions. Real software engineering involves multi-file codebases, API integration, debugging existing code, and writing tests themselves. The SWE-bench benchmark (2023, Princeton NLP) addressed this directly: it presents models with actual GitHub issues from real Python repositories and asks them to produce a code patch that passes the repository's existing test suite. SWE-bench pass rates for GPT-4 at launch were under 2% — a stark contrast to HumanEval scores above 85%.

Real Case · SWE-Bench · 2023

SWE-bench evaluated 2,294 real GitHub issues from 12 popular Python repositories. Even in the "oracle" retrieval setting, where the model is given the files that the reference fix edits, GPT-4 resolved 1.74% of issues and Claude 2 resolved 4.80%. The gap with HumanEval scores (GPT-4: ~87%, Claude: ~71%) illustrated how isolated function completion differs from real-world software engineering.

MBPP and Competitive Programming

Google's MBPP (Mostly Basic Python Problems, 2021) provided 974 crowdsourced problems targeting beginner-level Python. Unlike HumanEval's expert-written problems, MBPP problems came from crowd workers, introducing more natural language variation and some ambiguity in problem statements.

At the other extreme, APPS (Automated Programming Progress Standard, 2021) included competitive programming problems at introductory, interview, and competition difficulty levels. pass@1 on APPS competition problems remained below 5% for all models tested at launch, establishing that the code generation models of the time could not reliably solve hard algorithmic problems, a limitation that high HumanEval scores do not rule out.

LiveCodeBench (2024) addressed contamination by continuously adding new problems from competitive programming contests and recording each problem's release date, so a model can be evaluated only on problems published after its training cutoff.
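The idea behind a continuously refreshed benchmark can be sketched as a date filter: score a model only on problems released after its training cutoff. The problem IDs, dates, and cutoff below are invented for illustration and do not reflect any real benchmark's data format.

```python
from datetime import date

# Hypothetical problem pool with release dates
problems = [
    {"id": "contest-101-a", "released": date(2023, 3, 14)},
    {"id": "contest-117-c", "released": date(2023, 11, 2)},
    {"id": "contest-124-b", "released": date(2024, 1, 20)},
]

def released_after(pool: list, training_cutoff: date) -> list:
    """Keep only problems the model could not have seen during training."""
    return [p for p in pool if p["released"] > training_cutoff]

# Assumed training cutoff for the model under evaluation
eval_set = released_after(problems, training_cutoff=date(2023, 9, 1))
print([p["id"] for p in eval_set])  # ['contest-117-c', 'contest-124-b']
```

Comparing a model's score on problems before and after its cutoff is also a simple contamination probe: a sharp drop after the cutoff suggests the earlier problems were seen in training.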

Benchmark	Size	Problem Type	Key Metric
HumanEval	164	Self-contained functions	pass@k (functional)
MBPP	974	Beginner Python	pass@k
APPS	10,000	Competitive programming	pass@k by difficulty
SWE-bench	2,294	Real GitHub issues	% resolved (full repo tests)
LiveCodeBench	Ongoing	New contest problems	pass@k, contamination-controlled

Evaluator's Takeaway

Always match code benchmark difficulty to your deployment context. A high HumanEval score tells you about isolated function generation from docstrings — it tells you very little about multi-file debugging, codebase navigation, or test-driven development. For production code generation tools, SWE-bench-style evaluation on real repositories provides far more signal.

Quiz — Code Generation Evaluation

3 questions · select the best answer

1. What does pass@k measure in code generation evaluation?

Correct. pass@k checks functional correctness: if any one of k sampled solutions passes the complete test suite, the problem counts as solved. It's a recall-oriented metric — higher k makes it easier to succeed.

Not quite. pass@k is about functional correctness via unit tests. It asks: does at least one of k generated solutions pass all tests? Token overlap (like BLEU) is a separate and largely deprecated approach for code.

2. Why did SWE-bench scores (~2–5%) diverge so dramatically from HumanEval scores (~70–87%) for the same models?

Correct. HumanEval tests isolated function generation from a docstring. SWE-bench requires understanding a multi-file codebase, interpreting a GitHub issue, writing a patch, and passing existing repository tests — a fundamentally harder task.

Not this. The divergence reflects task complexity: HumanEval is isolated function completion, while SWE-bench requires navigating real codebases, understanding issues, and producing patches — skills that don't map cleanly onto HumanEval performance.

3. Why was LiveCodeBench introduced as an improvement over HumanEval?

Correct. By drawing problems from recent competitive programming contests, LiveCodeBench ensures models haven't seen solutions during training — addressing the contamination problem that affected static benchmarks like HumanEval.

Not quite. LiveCodeBench's innovation is its continuously updated problem set from new competitions, making contamination much harder. Static benchmarks like HumanEval eventually get solved by training on forums that discuss the problems.

Lab — Code Benchmark Design

Interactive practice · design a code evaluation strategy

Your Task

Your company is evaluating code generation models to integrate into an internal Python data-engineering tool. Engineers will use it to write ETL scripts, fix bugs in pandas/dask pipelines, and generate SQL queries from natural language. You need to choose or design a benchmark strategy.

Ask the tutor: given that our use case involves multi-file pipelines and SQL, why might HumanEval pass@1 be a misleading primary metric? What would a better evaluation protocol look like?

Code Evaluation Tutor

L2 Lab

Welcome to the code evaluation lab! We'll work through designing a realistic benchmark for a data engineering tool. Start by telling me: why might HumanEval pass@1 be misleading for evaluating models on multi-file ETL pipelines and SQL generation?

Module 5 · Lesson 3

Summarization & Translation Evaluation

ROUGE, BLEU, BERTScore, and why automatic metrics keep getting replaced

If BLEU predicted human judgment well enough to run machine translation for 20 years, why has the field moved beyond it — and what came next?

BLEU — Bilingual Evaluation Understudy — was published by Papineni et al. at IBM in 2002. For the next two decades, it was the standard for machine translation evaluation. By the 2020s, researchers were publishing papers showing BLEU correlated poorly with human judgments on modern neural MT outputs, and the ACL community was actively debating whether to retire it. The metric had outlived the era of translation it was designed for.

BLEU: How It Works and Why It Breaks

BLEU computes modified n-gram precision between a hypothesis (model output) and one or more reference translations, applying a brevity penalty for short outputs. It correlates well with human judgment when systems are clearly far apart in quality — but poorly discriminates between modern high-quality systems.

A 2020 ACL paper by Mathur et al. showed that BLEU's correlation with human rankings dropped to near zero when the comparison was restricted to top-performing neural MT systems on the WMT19 shared task. Systems ranked very differently by human evaluators had almost identical BLEU scores. The problem: BLEU rewards n-gram overlap but gives no credit for paraphrase or synonymy, and it captures word order only within short n-gram windows.
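The mechanics of BLEU, clipped n-gram precision combined with a brevity penalty, can be sketched for a single sentence pair. This is a simplified illustration: it assumes one reference, whitespace tokenization, and no smoothing, whereas BLEU as normally reported is a corpus-level statistic computed with a standard tool.

```python
import math
from collections import Counter

def ngrams(tokens: list, n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU against a single reference."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # "Modified" precision: clip each n-gram count by its count in the reference
        clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, one empty n-gram order zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: hypotheses shorter than the reference are scaled down
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat"
print(bleu("the cat sat on the mat", reference))           # 1.0
print(bleu("a feline was sitting on the rug", reference))  # 0.0
```

The second hypothesis is a reasonable paraphrase, yet it scores zero because it shares no trigram with the reference. That is the semantic blindness described above in its most extreme form.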

BLEU: Modified n-gram precision (1–4 grams) with brevity penalty. Corpus-level metric; low variance at scale. Requires reference translations; ignores semantic equivalence.

ROUGE-L: Longest Common Subsequence overlap between hypothesis and reference. Standard for summarization. ROUGE-N uses n-gram overlap. Both suffer the same semantic blindness as BLEU.

BERTScore: Uses contextual BERT embeddings to match hypothesis tokens to reference tokens. Correlates better with human judgment for paraphrase-heavy outputs. Computationally heavier and model-dependent.
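ROUGE-L from the glossary above can be sketched with a standard longest-common-subsequence table. The version below weights precision and recall equally and uses whitespace tokens, both simplifying assumptions; the example sentences are invented.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(hypothesis: str, reference: str) -> float:
    """LCS-based F1 between hypothesis and reference token sequences."""
    hyp, ref = hypothesis.split(), reference.split()
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "the patient was discharged on monday"
wrong_day = "the patient was discharged on friday"
print(round(rouge_l_f1(wrong_day, reference), 3))  # 0.833
```

The hypothesis gets the one fact that matters wrong and still scores 0.833, because five of its six tokens line up with the reference in order.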

Summarization: CNN/DailyMail and ROUGE's Limits

The CNN/DailyMail dataset was released by Hermann et al. (2015) at DeepMind, originally for reading comprehension, and became the standard summarization benchmark after it was adapted for that task (Nallapati et al., 2016). News articles are paired with human-written bullet-point highlights, which serve as the reference summaries. ROUGE-1, ROUGE-2, and ROUGE-L became the default metrics.

By 2020, multiple papers documented that high-ROUGE abstractive summaries were frequently factually inconsistent with source documents. Kryscinski et al. (Salesforce Research), in their work on the FactCC consistency checker, reported that up to 30% of summaries from abstractive models contained factual inconsistencies. ROUGE had no mechanism to detect hallucination: a model that made up plausible-sounding facts could score just as high as a faithful one.

This led to factuality-aware evaluation: FactCC, DAE, and similar metrics that check whether the content of a summary is entailed by the source document. SummEval (Fabbri et al., 2021, Yale and Salesforce Research) provided a large-scale human annotation study comparing 14 automatic metrics against human judgments of coherence, consistency, fluency, and relevance.

Real Finding · Salesforce Research · 2020

Kryscinski et al. evaluated summarization systems on CNN/DailyMail using a trained fact-checking model (FactCC). Models achieving state-of-the-art ROUGE scores produced summaries of which up to ~30% were inconsistent with the source article. A high ROUGE score gave no guarantee of factuality.

Moving Toward Human and Learned Metrics

BERTScore (Zhang et al., 2020) addressed the semantic blindness problem by computing greedy matching between BERT embeddings of hypothesis and reference tokens. It outperformed BLEU and ROUGE in correlation with human judgments across multiple datasets. However, BERTScore is still reference-dependent: it cannot score an output without a gold-standard reference to compare against.
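Greedy matching over embeddings can be sketched with toy vectors. Everything here is a stand-in: the two-dimensional "embeddings" are hand-made rather than produced by BERT, and the real metric adds details such as optional idf weighting and baseline rescaling that this sketch omits.

```python
import math

def cosine(u: list, v: list) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_f1(hyp_vecs: list, ref_vecs: list) -> float:
    """Each token is matched to its most similar token on the other side."""
    recall = sum(max(cosine(r, h) for h in hyp_vecs) for r in ref_vecs) / len(ref_vecs)
    precision = sum(max(cosine(h, r) for r in ref_vecs) for h in hyp_vecs) / len(hyp_vecs)
    return 2 * precision * recall / (precision + recall)

# Hypothetical embeddings: synonyms point in nearly the same direction
embeddings = {
    "car": [1.0, 0.1], "automobile": [0.9, 0.2],
    "fast": [0.1, 1.0], "quick": [0.2, 0.9],
}
ref_vecs = [embeddings[w] for w in ["car", "fast"]]
hyp_vecs = [embeddings[w] for w in ["automobile", "quick"]]

score = greedy_match_f1(hyp_vecs, ref_vecs)
print(score > 0.95)  # True
```

The hypothesis shares no surface token with the reference, so BLEU and ROUGE would give it zero, yet the embedding match is close to 1. The sketch also makes the remaining limitation visible: the function cannot run without ref_vecs.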

COMET (Crosslingual Optimized Metric for Evaluation of Translation, 2020) went further: a learned metric trained directly on human quality judgments from WMT shared tasks. By conditioning on source, reference, and hypothesis, COMET captures translation adequacy in ways n-gram overlap cannot. COMET-family metrics have since become a common default for automatic MT evaluation, including in recent WMT campaigns.

For summarization, the field has moved toward multi-dimensional evaluation: separate scores for faithfulness, relevance, coherence, and fluency. UniEval (Zhong et al., 2022), a unified multi-dimensional evaluator, used a pre-trained model to answer yes/no questions about each dimension, which allows several dimensions to be scored without a reference summary.

BLEU Correlation (WMT19)

~0.00–0.08

Pearson correlation with human ranking among top MT systems (Mathur et al., 2020)

COMET Correlation

~0.60–0.75

System-level Pearson correlation with human judgments on WMT tasks

ROUGE Factuality

~30% error rate

Summarization systems with high ROUGE can still produce factually inconsistent output in up to ~30% of summaries

BERTScore F1 Correlation

>ROUGE on most tasks

Consistently higher human correlation than ROUGE-L and BLEU on SummEval benchmark

Evaluator's Takeaway

For summarization, never use ROUGE alone — add a factuality check (FactCC, DAE, or an LLM-based consistency judge). For translation, prefer COMET over BLEU for system comparison. For both tasks, human evaluation of a stratified sample remains essential when deployment quality matters.

Quiz — Summarization & Translation Evaluation

3 questions · select the best answer

1. What specific failure of BLEU was documented by Mathur et al. (2020) on the WMT19 MT shared task?

Correct. Mathur et al. found BLEU had near-zero correlation with human system rankings for top WMT19 systems — strong evidence that BLEU can no longer discriminate quality among modern neural MT outputs.

Not this one. The documented failure was that BLEU correlated nearly zero with human rankings among top neural systems. Modern systems can have very similar BLEU scores but very different human-judged quality.

2. What critical dimension of summarization quality does ROUGE fail to capture?

Correct. ROUGE only measures n-gram overlap with a reference — it has no mechanism to check factual consistency. A model can hallucinate facts while still achieving high ROUGE by using enough words from the reference summary.

Not quite. ROUGE's central failure for summarization is that it cannot detect factual inconsistency — a model can make up facts while scoring well if the hallucinated content uses words that appear in the reference.

3. What key innovation made COMET outperform BLEU for machine translation evaluation?

Correct. COMET is a learned metric — it's trained directly on human adequacy and fluency ratings from WMT shared tasks, allowing it to model what human evaluators actually care about rather than approximating it with n-gram statistics.

Not quite. COMET's key innovation is being a learned metric: it trains on human quality judgments and takes source sentence, reference, and hypothesis as input — capturing translation adequacy at the semantic level, not just surface n-gram overlap.

Lab — Summarization Metric Selection

Interactive practice · choose and justify evaluation metrics

Your Task

Your team is building a clinical note summarization system. Doctors will use it to generate discharge summaries from lengthy patient records. Accuracy is critical — a factual error could affect patient care. You need to choose an evaluation metric suite for this system.

Tell the tutor why ROUGE-L alone would be insufficient for this application, and ask what additional metrics or evaluation approaches would be appropriate for a high-stakes medical summarization context.

Summarization Evaluation Tutor

L3 Lab

Welcome! We're working through metric selection for a clinical summarization system — a high-stakes application where factual errors have real consequences. Start by telling me: why would ROUGE-L alone be a dangerous choice for evaluating medical discharge summaries?

Module 5 · Lesson 4

Reasoning, Math & Safety Evaluation

GSM8K, MATH, MMLU, and how we evaluate what a model must never do

Why did models that scored near-perfectly on grade-school math benchmarks still fail on competition math — and how do safety evaluations avoid the same Goodhart's Law trap?

GSM8K (Grade School Math) launched in 2021 with 8,500 linguistically diverse grade-school word problems. At release, a fine-tuned GPT-3 175B solved roughly a third of them. By 2023, GPT-4 scored above 90%. Researchers moved to MATH: 12,500 competition mathematics problems drawn from AMC, AIME, and similar contests. GPT-4's score on MATH was around 52% at launch, far lower, and under 20% for the hardest subset.

Math Reasoning Benchmarks: A Ladder of Difficulty

GSM8K tests multi-step arithmetic reasoning in natural language. Each problem requires 2–8 steps. The benchmark became a standard testbed for chain-of-thought prompting, in which models generate intermediate reasoning steps before the answer; Wei et al. (2022) showed that this dramatically improves accuracy. But GSM8K problems are solvable without symbolic manipulation; they rely on arithmetic and basic algebra.
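Scoring a chain-of-thought response usually comes down to extracting the final numeric answer and comparing it with the gold value. The sketch below uses a "last number in the text" heuristic, which is an assumption for illustration; real harnesses often rely on an explicit answer marker, and the sample response is invented.

```python
import re

# Integers or decimals, optionally negative, with optional thousands separators
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def final_number(text: str):
    """Return the last number mentioned in the text, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def is_correct(model_output: str, gold_answer: float) -> bool:
    predicted = final_number(model_output)
    return predicted is not None and abs(predicted - gold_answer) < 1e-6

response = (
    "Each box holds 12 eggs, so 4 boxes hold 4 * 12 = 48 eggs. "
    "After selling 10, 48 - 10 = 38 remain. The answer is 38."
)
print(final_number(response))    # 38.0
print(is_correct(response, 38))  # True
```

This scores only the final answer. A response with flawed intermediate steps that lands on the right number still counts as correct, which is one reason final-answer accuracy can overstate reasoning quality.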

MATH (Hendrycks et al., 2021) introduced genuine mathematical difficulty: competition problems in algebra, geometry, number theory, calculus, and combinatorics. The difficulty levels 4 and 5 (the hardest) remained below 25% accuracy for most frontier models until 2024. This revealed a gap between "following arithmetic steps" and "mathematical reasoning."

Minerva (Google, 2022) examined whether further training on mathematical and scientific text improves quantitative reasoning. On MATH, few-shot PaLM 540B solved 8.8% of problems, while Minerva 540B, the same model further trained on technical content, reached 33.6%. The paper also introduced OCWCourses, a set of 272 undergraduate-level STEM problems requiring multi-step reasoning. The result demonstrated that training data composition (mathematical text) substantially drives math capability.

Chain-of-Thought (CoT): Prompting technique where the model generates explicit intermediate reasoning steps before the final answer. Shown to improve accuracy substantially on multi-step reasoning tasks.

MMLU: Massive Multitask Language Understanding, covering 57 academic subjects from elementary to professional level. Multiple-choice format. Measures breadth of knowledge rather than deep reasoning.

Safety Evaluation: Systematic testing of model behavior on harmful, sensitive, or policy-violating inputs. Includes red-teaming, refusal rate measurement, and over-refusal analysis.

MMLU: Knowledge Breadth vs. Reasoning Depth

MMLU (Hendrycks et al., 2021) presented 15,908 multiple-choice questions across 57 subjects: professional law, medicine, history, computer science, ethics, and more. It became the most widely cited general-capability benchmark. GPT-3 scored 43.9%; GPT-4 scored approximately 86.4% at launch. Human expert performance was estimated at ~89% for a domain expert covering all subjects.

However, MMLU measures primarily recall and surface-level reasoning. Models that perform near human-expert level on MMLU still fail significantly on tasks requiring novel problem formulation, causal reasoning, or adversarial inputs. MMLU-Pro (2024) addressed this by adding harder reasoning-required questions and expanding answer choices from 4 to 10, substantially reducing score inflation from random guessing.
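The effect of the number of answer choices on guessing can be worked out directly. The sketch assumes a deliberately simple model of behavior: the system truly knows some fraction of the answers and guesses uniformly at random on the rest. The 50% figure is an arbitrary example.

```python
def expected_accuracy(known_fraction: float, num_choices: int) -> float:
    """Expected score when unknown questions are answered by uniform guessing."""
    return known_fraction + (1 - known_fraction) / num_choices

# A model that truly knows half the answers
print(round(expected_accuracy(0.5, 4), 3))   # 0.625
print(round(expected_accuracy(0.5, 10), 3))  # 0.55
```

With four choices, guessing adds 12.5 points to the observed score; with ten choices it adds 5. The pure-guessing floor drops from 25% to 10%.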

Real Case · MMLU Contamination · 2023

A 2023 study found that several frontier models appeared to have inflated MMLU scores due to test set contamination — the multiple-choice questions and answers were present in web-scraped training data. Models could retrieve rather than reason. This prompted the adoption of stricter data decontamination protocols and renewed interest in dynamic evaluation.

Safety Evaluation: Refusal, Harm, and Over-Refusal

Safety evaluation faces a unique challenge: the costs of false negatives (failing to refuse harmful content) and false positives (refusing benign content) are both real, but they are measured very differently. Anthropic's model cards describe red-teaming processes where human specialists attempt to elicit harmful outputs using jailbreaks, prompt injections, and social engineering. The key metrics include:

Harmful content rate: the fraction of attempts that successfully elicit policy-violating outputs.

Over-refusal rate: the fraction of benign requests incorrectly refused.

Attack success rate (ASR): specifically for adversarial red-teaming, the fraction of jailbreak attempts that succeed.

AdvBench (Zou et al., 2023) presented 520 harmful instruction strings testing model refusals across categories: harmful information synthesis, hate speech, cybercrime, and others. The same paper introduced the GCG (Greedy Coordinate Gradient) attack, which showed that gradient-optimized adversarial suffixes could reach near-100% ASR against open-weight aligned models in the white-box setting and transfer, at lower rates, to some closed models, causing them to produce harmful content they would normally refuse.

The critical evaluation design challenge: safety metrics must resist Goodhart's Law. A model optimized to minimize measured harmful content rate might simply refuse everything — scoring perfectly on harm while failing completely on utility. Published safety evaluations from Anthropic (2023) explicitly report both harmful content rates and over-refusal rates on curated benign sets.
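Reporting the two rates side by side can be sketched as follows. The input format (one boolean per prompt, True where the model complied) and both example systems are assumptions made for illustration; in practice the compliance labels come from human raters or a classifier.

```python
def safety_report(harmful_results: list, benign_results: list) -> dict:
    """Each list holds True where the model complied and False where it refused."""
    harmful_content_rate = sum(harmful_results) / len(harmful_results)
    over_refusal_rate = sum(not complied for complied in benign_results) / len(benign_results)
    return {
        "harmful_content_rate": harmful_content_rate,
        "over_refusal_rate": over_refusal_rate,
    }

# Hypothetical system A refuses every prompt, harmful or benign
refuse_everything = safety_report([False] * 4, [False] * 4)
# Hypothetical system B slips once on each set
balanced = safety_report([False, False, False, True], [True, True, True, False])

print(refuse_everything)  # {'harmful_content_rate': 0.0, 'over_refusal_rate': 1.0}
print(balanced)           # {'harmful_content_rate': 0.25, 'over_refusal_rate': 0.25}
```

Judged on harmful content rate alone, system A looks perfect. Only the second number reveals that it is unusable, which is the Goodhart failure described above.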

GSM8K → GPT-4

>90%

Near-saturation by 2023; benchmark no longer discriminates frontier models

MATH Lvl 5 → GPT-4

~16–20%

Competition-level math remains difficult; reveals gap between arithmetic and symbolic reasoning

MMLU → GPT-4

~86%

Near human-expert aggregate; contamination concerns prompted MMLU-Pro with 10-choice questions

GCG Attack ASR

~100% on early models

Gradient-based adversarial suffix attack (Zou et al. 2023) achieved near-perfect attack success rates on aligned models

Evaluator's Takeaway

Select math benchmarks that match your difficulty target: GSM8K is saturated for frontier models; use MATH, AIME, or AMC problems for discriminative evaluation. For safety, always report both harmful content rate and over-refusal rate on paired harmful/benign sets — a model that refuses everything scores perfectly on harm but fails on utility.

Quiz — Reasoning, Math & Safety

3 questions · select the best answer

1. Why did the field move from GSM8K to MATH for benchmarking frontier model mathematical reasoning by 2023?

Correct. When multiple frontier models score above 90% on a benchmark, it no longer differentiates them. MATH's harder competition problems, where frontier models score far lower, provide meaningful signal for comparing capable systems.

Not quite. The primary reason for the shift was saturation: frontier models exceeded 90% on GSM8K, making it uninformative for comparing them. MATH's competition-level difficulty retained discrimination even as models improved.

2. What specific problem does MMLU-Pro (2024) address that MMLU did not?

Correct. With 4-choice MMLU, random guessing yields 25%. Models could inflate scores by learning surface patterns rather than reasoning. MMLU-Pro's 10-choice format reduces random baseline to 10% and adds harder reasoning-required questions.

Not quite. MMLU-Pro's key change is expanding from 4 to 10 answer choices and adding harder questions requiring explicit reasoning. This reduces score inflation from surface pattern matching and the high baseline from random guessing.

3. In safety evaluation, why is it essential to report over-refusal rate alongside harmful content rate?

Correct. This is Goodhart's Law applied to safety: optimizing harmful content rate to zero is trivial — just refuse everything. A complete safety evaluation must measure both the cost of producing harm and the cost of failing on legitimate requests.

Not quite. The reason is Goodhart's Law: a model can achieve a perfect harmful content rate simply by refusing all requests, making it useless. Over-refusal rate captures this failure mode — both metrics together give a meaningful picture of safety-utility tradeoffs.

Lab — Safety Evaluation Design

Interactive practice · design a safety benchmark protocol

Your Task

You're designing a safety evaluation protocol for a consumer-facing AI assistant that will be used by a general audience including minors. The model must refuse truly harmful requests but must not over-refuse everyday questions about health, history, chemistry, or current events.

Tell the tutor: what two separate test sets would you need to properly measure this model's safety-utility tradeoff, and what metrics would you report for each? Then ask how to avoid Goodhart's Law when optimizing based on these metrics.

Safety Evaluation Tutor

L4 Lab

Welcome to the safety evaluation lab. We're designing a rigorous evaluation protocol for a consumer AI assistant. A key challenge: measuring both harmful content production and over-refusal. Start by describing the two distinct test sets you'd need to properly evaluate this system's safety-utility tradeoff.

Module 5 — Task-Specific Evaluation

15 questions · 80% required to pass

1. What does Exact Match (EM) measure in QA evaluation?

Correct. EM is a strict binary metric: after normalization (lowercasing, punctuation removal), the prediction must match the gold answer character-for-character to score 1.

EM is binary: the normalized prediction must exactly match the gold answer. Partial credit is handled by F1, not EM.

2. HotpotQA was designed to require what capability that SQuAD 1.1 did not test?

Correct. HotpotQA requires combining facts from two separate Wikipedia paragraphs — testing genuine multi-hop reasoning rather than single-passage span extraction.

HotpotQA's key innovation was multi-hop reasoning: each question requires combining information from two supporting paragraphs rather than finding a span in one passage.

3. What is the primary advantage of pass@k with k=100 over pass@1 in code generation evaluation?

Correct. pass@100 asks: can the model produce a correct solution at all, given many chances? It reveals the shape of the model's output distribution — useful for understanding capability ceilings regardless of consistency.

pass@100 is a recall-oriented metric — it tests whether any correct solution is in the model's distribution, not just whether the first sample is correct. Higher k = measuring best-case capability.

4. SWE-bench evaluates models on what type of task?

Correct. SWE-bench uses 2,294 real GitHub issues from popular Python repositories. Models must produce code patches that make the tests associated with each issue pass, which is far closer to real software engineering than HumanEval.

SWE-bench uses real GitHub issues from real repositories. Models must write patches that resolve the issue, as verified by the project's tests, rather than generate isolated functions.

5. Why was BLEU declared inadequate for comparing top neural MT systems in 2020?

Correct. Mathur et al. (2020) showed BLEU had near-zero Pearson correlation with human system rankings on WMT19 top systems. Systems humans ranked very differently received nearly identical BLEU scores.

The documented failure: BLEU had near-zero correlation with human rankings among top WMT19 neural MT systems — it could no longer discriminate quality differences that humans could clearly see.

6. BERTScore improves on ROUGE and BLEU primarily because it:

Correct. BERTScore uses greedy matching between contextual embeddings — a paraphrase of the reference that uses different words can still receive a high score if the embeddings are close in semantic space.

BERTScore's key improvement: it uses BERT's contextual embeddings to match tokens semantically, not just by surface form — so paraphrases and synonyms can score highly even if they don't appear in the reference.
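The greedy matching step can be sketched with toy vectors. This is an illustration of the matching idea only: the hand-written 2-D vectors stand in for BERT's contextual embeddings, the function names are hypothetical, and refinements used by the real metric (such as optional IDF weighting and baseline rescaling) are omitted.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_f1(cand_vecs, ref_vecs):
    """F1 from greedy matching: each token pairs with its most similar counterpart."""
    if not cand_vecs or not ref_vecs:
        raise ValueError("both token lists must be non-empty")
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    # Precision: how well each candidate token is supported by the reference.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    if precision + recall <= 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy "embeddings": the candidate's vectors are close to, but not identical
# with, the reference's vectors, as a paraphrase's might be.
reference = [[1.0, 0.0], [0.0, 1.0]]
paraphrase = [[0.9, 0.1], [0.1, 0.9]]
unrelated = [[0.0, 1.0]]
```

Here `greedy_match_f1(paraphrase, reference)` is close to 1 even though no vector matches exactly, whereas an exact-match metric over surface tokens would give the paraphrase no credit.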

7. The Salesforce Research (Kryscinski et al., 2020) finding about CNN/DailyMail summarization was:

Correct. FactCC evaluation revealed ~30% factual inconsistency rates in top-ROUGE summaries — exposing that ROUGE measures textual overlap, not factual faithfulness.

The finding: roughly 30% of high-ROUGE summaries contained factual inconsistencies with their source documents. A high ROUGE score does not imply factual faithfulness, because ROUGE measures textual overlap and cannot detect hallucination.

8. COMET outperformed BLEU for MT evaluation because it is:

Correct. COMET trains on human quality judgments from WMT (such as Direct Assessment scores), learning to predict human judgments rather than approximating them with n-gram statistics.

COMET is a learned metric: trained on human quality judgments from WMT shared tasks, taking source, reference, and hypothesis as input to predict human-like quality scores.

9. Chain-of-thought (CoT) prompting most significantly improves performance on which type of task?

Correct. Wei et al. (2022) showed CoT dramatically improves performance on tasks like GSM8K that require chaining intermediate reasoning steps. It has little effect on single-step retrieval tasks.

CoT's largest gains appear on multi-step reasoning tasks — math word problems, logic puzzles, symbolic reasoning — where explicitly generating intermediate steps helps the model maintain coherent reasoning chains.

10. Why is GSM8K no longer useful for comparing frontier language models as of 2023–2024?

Correct. Saturation: when multiple top models score above 90%, the benchmark can no longer distinguish between them. A discriminative benchmark must have sufficient headroom for the systems being compared.

Saturation is the issue: frontier models exceed 90% on GSM8K, leaving no headroom to distinguish between them. A saturated benchmark provides no useful signal for ranking or selecting among top systems.

11. What was the approximate GPT-4 score on MATH Level 5 (hardest competition math) at launch?

Correct. GPT-4 scored approximately 16–20% on MATH Level 5 at launch — despite exceeding 90% on GSM8K. This gap illustrates that arithmetic reasoning and symbolic mathematical reasoning are distinct capabilities.

GPT-4 scored roughly 16–20% on MATH Level 5. Despite scoring above 90% on GSM8K, it remained far from solving competition math problems that require symbolic manipulation.

12. MMLU-Pro improves on MMLU by:

Correct. 10-choice format reduces the random-guess baseline from 25% to 10%, and harder questions requiring explicit reasoning reduce the surface pattern-matching advantage that inflated MMLU scores.

MMLU-Pro expands to 10 answer choices (random baseline drops from 25% to 10%) and adds harder, reasoning-required questions — reducing score inflation from pattern matching and guessing.

13. In safety evaluation, attack success rate (ASR) specifically measures:

Correct. ASR is specific to adversarial red-teaming: given a set of crafted jailbreak attempts, what fraction successfully causes the model to produce content it would normally refuse?

ASR is an adversarial metric: the fraction of deliberate jailbreak attempts (adversarial inputs designed to bypass safety measures) that successfully elicit policy-violating outputs from an aligned model.

14. Why did the GCG (Greedy Coordinate Gradient) attack paper (Zou et al., 2023) represent a significant finding for safety evaluation?

Correct. GCG attacks appended adversarial suffixes found by a gradient-guided greedy search over tokens and achieved very high rates of harmful content elicitation, showing that RLHF alignment is not robust against targeted adversarial optimization.

GCG's key finding: automatically optimized adversarial token suffixes achieved very high attack success rates against aligned models (near 100% in some white-box settings), demonstrating that alignment is not robust to gradient-based adversarial attacks.

15. Which combination best represents a complete, balanced safety evaluation protocol?

Correct. A complete safety evaluation requires: (1) harmful content rate on harmful inputs, (2) over-refusal rate on benign inputs to catch the "refuse everything" failure mode, and (3) human evaluation for nuanced edge cases automated metrics miss.

A balanced protocol needs both error types: harmful content rate (false negatives — failing to refuse harmful inputs) and over-refusal rate (false positives — refusing benign inputs). Human evaluation adds nuance automated metrics miss.
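The two error rates can be computed side by side from labeled judgments. The sketch below is a minimal illustration under stated assumptions: the `Judgment` record and `safety_report` function are hypothetical names, each prompt is assumed to be pre-labeled as harmful or benign, and each response is assumed to have been judged (by a human or a classifier) as a refusal or not.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    prompt_is_harmful: bool  # label from the test set (harmful vs. benign)
    model_refused: bool      # judged outcome of the model's response


def safety_report(judgments):
    """Return the harmful content rate and the over-refusal rate."""
    harmful = [j for j in judgments if j.prompt_is_harmful]
    benign = [j for j in judgments if not j.prompt_is_harmful]
    if not harmful or not benign:
        raise ValueError("need at least one harmful and one benign prompt")
    return {
        # False negatives: harmful prompts the model did not refuse.
        "harmful_content_rate": sum(not j.model_refused for j in harmful) / len(harmful),
        # False positives: benign prompts the model refused.
        "over_refusal_rate": sum(j.model_refused for j in benign) / len(benign),
    }


# A model that refuses everything looks perfect on one metric and fails the other.
refuse_all = [Judgment(True, True)] * 4 + [Judgment(False, True)] * 5
print(safety_report(refuse_all))
```

Reporting both numbers together is what exposes the "refuse everything" failure mode: that model scores 0.0 on harmful content rate and 1.0 on over-refusal rate.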