Lesson 1 · Module 2

What Is an Eval Suite?

From single test cases to structured collections that reveal system behavior

How do you turn a vague worry about your AI into a structured set of tests?

In November 2022, Anthropic released an internal research document describing their model cards and evaluation approach for Claude's early versions. The document revealed that instead of testing a handful of prompts before deployment, the team had assembled hundreds of targeted test cases spanning helpfulness, honesty, and harm avoidance — organized into what they called an evaluation suite. The structure mattered as much as the individual items: without organization, test results produced noise rather than signal.

Defining an Eval Suite

An eval suite is a curated, organized collection of test inputs and expected outputs (or evaluation criteria) designed to measure specific properties of an AI system. The word "suite" is deliberate: it implies grouping, hierarchy, and intentional coverage — not just a pile of random prompts.

Where a single test case tells you whether the model got one thing right, a suite tells you how the model behaves across a domain. A suite for a customer-service chatbot might contain 300 items organized into eight sub-categories: product questions, refund requests, emotional escalations, off-topic redirects, multi-language queries, edge-case policies, competitor mentions, and adversarial probes.

The Anatomy of a Suite

Every robust eval suite shares a common internal structure regardless of domain:

Test Cases

The atomic unit. Each case has an input (prompt or conversation), a reference (expected behavior or gold label), and optional metadata (category, difficulty, source).

Dimensions

Logical groupings within the suite — often called categories or slices. Each dimension tests a distinct capability or risk. Results can be reported per-dimension.

Scoring Protocol

The rule that converts raw model output into a score. Options include exact match, human rating, model-as-judge, regex, or task-specific metrics like BLEU or F1.

Baseline Record

The score of a reference system (often a prior model version) that new releases are compared against. Without a baseline, you cannot tell whether a score is good or whether it represents a regression.
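
One way to make these four components concrete is a small data model. The sketch below is illustrative only: the names EvalCase, EvalSuite, case_insensitive_match, and toy_model are assumptions made for this example, not a standard schema or any lab's actual tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """Atomic unit: an input, a reference, a dimension, optional metadata."""
    input: str
    reference: str
    dimension: str
    metadata: Dict[str, str] = field(default_factory=dict)


def case_insensitive_match(output: str, reference: str) -> float:
    """Scoring protocol: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0


@dataclass
class EvalSuite:
    cases: List[EvalCase]
    scorer: Callable[[str, str], float]
    baseline: Dict[str, float]  # per-dimension scores of the reference system

    def run(self, model: Callable[[str], str]) -> Dict[str, float]:
        """Return the mean score per dimension for the given model."""
        per_dim: Dict[str, List[float]] = {}
        for case in self.cases:
            score = self.scorer(model(case.input), case.reference)
            per_dim.setdefault(case.dimension, []).append(score)
        return {dim: sum(s) / len(s) for dim, s in per_dim.items()}

    def compare_to_baseline(self, scores: Dict[str, float]) -> Dict[str, float]:
        """Return score minus baseline for each dimension that has a baseline."""
        return {d: scores[d] - self.baseline[d] for d in scores if d in self.baseline}


# Toy usage with a stand-in "model" (a plain lookup function, purely hypothetical).
suite = EvalSuite(
    cases=[
        EvalCase("2 + 2 = ?", "4", "arithmetic"),
        EvalCase("Capital of France?", "Paris", "geography"),
        EvalCase("Capital of Japan?", "Tokyo", "geography"),
    ],
    scorer=case_insensitive_match,
    baseline={"arithmetic": 1.0, "geography": 1.0},
)


def toy_model(prompt: str) -> str:
    return {"2 + 2 = ?": "4", "Capital of France?": "paris"}.get(prompt, "")


scores = suite.run(toy_model)
deltas = suite.compare_to_baseline(scores)
```

Keeping the dimension on each case, rather than in separate files, is one design choice among several; it makes per-dimension reporting a simple group-by.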

A Real Taxonomy: HELM (2022)

Stanford's Holistic Evaluation of Language Models (HELM), published in November 2022, is one of the most widely cited public eval suite frameworks. HELM organized evaluations across 42 scenarios and 7 metric categories: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. The insight was that no single number could characterize a model; a suite had to be multi-dimensional by design.

HELM's structure influenced how many labs subsequently organized their internal evaluations. The lesson: suites are not just more tests; they are tests organized around a theory of what matters.

Core Insight

An eval suite is the materialization of a threat model or capability specification. If you cannot articulate what properties you care about before writing test cases, your suite will measure whatever is easy to measure — not whatever matters.

Suite vs. Benchmark vs. Test Set

These terms are often used interchangeably but carry distinct meanings:

Benchmark: A standardized eval that enables comparison across many systems. Benchmarks are public, fixed, and shared. Examples: MMLU, HellaSwag, BIG-Bench.

Test set: A held-out partition of a dataset, excluded from training, used to estimate generalization during development. Usually private, and replaced as distributions shift.

Eval suite: An organization-owned, purpose-built collection covering capabilities and risks specific to a deployment. May incorporate benchmarks as sub-components but extends beyond them.

Why suites get contaminated

When a benchmark becomes widely known, models trained on internet data absorb the test cases indirectly, a phenomenon called data contamination. This is why production teams maintain private eval suites alongside public benchmarks: cases that are never published cannot be scraped into training data.

Starting Small: The Minimum Viable Suite

Teams new to evaluation often wait until they have hundreds of cases. This is a mistake. A minimum viable eval suite can be built from three sources:

Known failure modes — every bug report, user complaint, or red-team finding from the past becomes a test case.
Capability requirements — enumerate the tasks the system must perform and write at least 10 cases per task, varying difficulty and phrasing.
Safety boundaries — document the behaviors the system must never exhibit and write cases that probe those boundaries directly.

Even 50 well-chosen cases across these three sources, with clear expected behavior and a scoring rule, constitute an eval suite that catches regressions. The goal is not comprehensiveness on day one; it is living documentation of what good looks like.

Lesson 1 Quiz

What Is an Eval Suite? — 4 questions

What distinguishes an eval suite from a simple collection of test cases?

Correct. Organization, a scoring protocol, and a baseline record are what elevate a list of prompts into an eval suite. Size is secondary.

Not quite. The defining feature is structure and comparability — dimensions, scoring rules, and a baseline — not the scoring method or case count alone.

Stanford's HELM framework organized evaluations across 42 scenarios and how many metric categories?

Correct. HELM used 7 metric categories: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.

HELM used 7 metric categories — accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency — across 42 scenarios.

What is "data contamination" in the context of public benchmarks?

Correct. When test cases are publicly known, models trained on web data can memorize answers, making benchmark scores unreliable proxies for actual capability.

Data contamination refers specifically to models indirectly learning benchmark answers from training data, inflating scores beyond true capability.

A minimum viable eval suite should draw cases from which three sources?

Correct. Starting from existing failures, required capabilities, and prohibited behaviors gives a practical foundation that immediately catches regressions.

The three recommended sources are known failure modes (bugs and complaints), capability requirements, and safety boundaries — the things that already tell you what matters.

Lab 1 — Building a Suite Skeleton

Practice structuring eval dimensions for a real product scenario

Your Task

You are designing an eval suite for a legal document summarization tool. The tool takes lengthy contracts and produces plain-language summaries for non-lawyer clients. Work with the AI tutor to identify the right dimensions, write sample test cases, and define a scoring protocol.

Start by telling the tutor: what capability or risk dimension would you put first in your eval suite for this tool, and why?

Lesson 2 · Module 2

Choosing Dimensions and Coverage

Deciding what your suite must cover — and what it can safely skip

How do you decide which slices of behavior are worth testing without building an infinite suite?

According to internal documents revealed in a 2022 legal proceeding, when GitHub and OpenAI evaluated Copilot before its June 2021 launch, the team organized tests into three primary dimensions: code correctness (does the completion compile and pass unit tests?), security (does the suggestion introduce known vulnerable patterns like SQL injection or buffer overflows?), and copyright risk (does the output reproduce verbatim training data?). Each dimension had its own dataset and scoring rule. The security dimension alone contained over 80 distinct vulnerability patterns drawn from the MITRE CWE list. Without this dimensional structure, security regressions would have been invisible inside an aggregate accuracy score.

The Dimensionality Problem

Every AI system has more testable properties than you can afford to test exhaustively. The practical skill in suite design is choosing dimensions that decompose the failure space — meaning a failure in one dimension cannot hide inside a pass in another.

A single aggregate score is almost always misleading. A model that scores 85% overall might score 95% on easy cases and 40% on the hard edge cases that actually matter in production. Dimensions force you to report disaggregated results and catch that gap.
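
The gap described above is easy to reproduce with a few lines of arithmetic. The sketch below uses made-up results (180 routine cases and 40 edge cases, numbers chosen only so the totals match the 85% / 95% / 40% example) and hypothetical function names.

```python
from collections import defaultdict

# Hypothetical results: one (slice name, passed?) pair per test case.
results = (
    [("routine", True)] * 171 + [("routine", False)] * 9
    + [("edge", True)] * 16 + [("edge", False)] * 24
)


def aggregate_score(rows):
    """Single pass rate over every case, ignoring slices."""
    return sum(passed for _, passed in rows) / len(rows)


def scores_by_slice(rows):
    """Pass rate computed separately for each slice."""
    buckets = defaultdict(list)
    for name, passed in rows:
        buckets[name].append(passed)
    return {name: sum(v) / len(v) for name, v in buckets.items()}


overall = aggregate_score(results)    # 187 of 220 cases pass
per_slice = scores_by_slice(results)  # routine: 171 of 180, edge: 16 of 40
```

The aggregate looks healthy only because routine cases outnumber edge cases more than four to one; the per-slice view exposes the weak slice.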

Frameworks for Choosing Dimensions

Three frameworks guide dimension selection in practice:

Risk-based decomposition. List every way the system could cause harm or fail to deliver value. Each distinct failure mode becomes a candidate dimension. Prioritize by probability × severity.
Capability taxonomy. Enumerate the tasks the system must perform. Sub-divide by difficulty (routine vs. edge case) and by input type (short vs. long context, structured vs. free-form).
User population slicing. Consider whether performance should be consistent across demographic groups, language varieties, or domain expertise levels. Each slice that matters becomes a dimension.
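
The probability × severity prioritization in the first framework can be sketched as a simple ranking. The risks, probabilities, and the 1 to 5 severity scale below are invented for illustration; real estimates would come from incident data and domain review.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    probability: float  # estimated chance of occurring per interaction, 0..1
    severity: int       # impact on an assumed 1 (minor) to 5 (critical) scale

    @property
    def priority(self) -> float:
        return self.probability * self.severity


# Hypothetical failure modes for a customer-service chatbot.
risks = [
    Risk("wrong refund amount quoted", 0.10, 4),
    Risk("off-topic answer", 0.30, 1),
    Risk("leaks another customer's data", 0.01, 5),
]

# Highest priority first; each entry is a candidate dimension.
ranked = sorted(risks, key=lambda r: r.priority, reverse=True)
```

A plain product can under-rank rare but catastrophic failures (the data leak lands last here), so teams may want to force such items into the suite regardless of their computed rank.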

The Coverage Matrix

A coverage matrix is a two-dimensional grid where rows are dimensions and columns are difficulty levels (or sub-types). Each cell shows how many test cases cover that intersection. The goal is to have no empty cells in your priority area — and to be honest about which cells you are intentionally leaving sparse.

Dimension	Routine	Edge Case	Adversarial
Factual accuracy	40 cases	20 cases	15 cases
Refusal behavior	10 cases	25 cases	40 cases
Format compliance	30 cases	10 cases	5 cases
Multilingual parity	20 cases	10 cases	0 cases ⚠

The ⚠ flag on the last cell is intentional: you are acknowledging a known gap rather than pretending it does not exist. This is far better than an unchecked assumption of coverage.
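
A coverage matrix like the one above can be generated from case metadata instead of being maintained by hand. This is a minimal sketch: it assumes each case carries a (dimension, level) pair, and the function names are hypothetical.

```python
from collections import Counter

DIMENSIONS = ["Factual accuracy", "Refusal behavior", "Format compliance", "Multilingual parity"]
LEVELS = ["Routine", "Edge Case", "Adversarial"]


def coverage_matrix(cases):
    """cases: iterable of (dimension, level) pairs, one per test case."""
    counts = Counter(cases)
    return {d: {lvl: counts[(d, lvl)] for lvl in LEVELS} for d in DIMENSIONS}


def empty_cells(matrix):
    """List every (dimension, level) intersection with zero cases."""
    return [(d, lvl) for d, row in matrix.items() for lvl, n in row.items() if n == 0]


# Rebuild the example table from its per-cell counts.
table = {
    ("Factual accuracy", "Routine"): 40, ("Factual accuracy", "Edge Case"): 20,
    ("Factual accuracy", "Adversarial"): 15,
    ("Refusal behavior", "Routine"): 10, ("Refusal behavior", "Edge Case"): 25,
    ("Refusal behavior", "Adversarial"): 40,
    ("Format compliance", "Routine"): 30, ("Format compliance", "Edge Case"): 10,
    ("Format compliance", "Adversarial"): 5,
    ("Multilingual parity", "Routine"): 20, ("Multilingual parity", "Edge Case"): 10,
    ("Multilingual parity", "Adversarial"): 0,
}
cases = [cell for cell, n in table.items() for _ in range(n)]

matrix = coverage_matrix(cases)
gaps = empty_cells(matrix)
```

Here gaps contains the single flagged cell, so the warning in the table is produced by the data rather than remembered by a person.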

The 80/20 Rule for Test Distribution

A common mistake is over-investing in routine cases. If 80% of your suite tests normal, easy inputs, you will get a high score that tells you nothing about reliability at the edges where the system actually fails users. A practical guideline: spend at least 40% of cases on edge cases and adversarial inputs, even though they represent a smaller fraction of real usage — because they represent a disproportionate fraction of real failures.

The Measurement Gap

In 2023, researchers at the AI Now Institute documented that AI systems deployed in hiring contexts typically had eval suites that tested only "average case" resumes. When real demographic data was analyzed, error rates for non-white-sounding names were 2–4× higher — invisible to suites that lacked demographic slicing as a dimension.

Minimum Cases Per Dimension

How many cases do you need per dimension to trust the score? Statistical guidance:

≥ 30 cases: Minimum for a dimension score to be stable enough to read. Below this, a single wrong answer swings the score by more than 3 percentage points.

≥ 100 cases: The practical target for high-stakes dimensions. For a pass rate near 90%, the 95% margin of error is about ±6 percentage points, so only regressions of roughly 5 points or more begin to be distinguishable from sampling noise.

≥ 300 cases: Narrows that margin to about ±3.4 points, so changes of about 3 points become visible. Relevant for safety-critical behaviors where small regressions matter.
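
These thresholds are rules of thumb, and it is worth checking them against your own pass rates. The sketch below uses the normal-approximation margin of error for a proportion; it assumes independent pass/fail cases and a 90% pass rate, both of which are assumptions for illustration.

```python
import math


def one_case_swing(n: int) -> float:
    """Percentage points the score moves when a single case flips."""
    return 100.0 / n


def margin_of_error(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error (as a fraction) for a pass rate on n cases."""
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n)


for n in (30, 100, 300):
    swing = round(one_case_swing(n), 2)
    moe_points = round(100 * margin_of_error(0.90, n), 1)
    print(f"n={n}: one case = {swing} points, margin of error = +/-{moe_points} points")
# n=30:  one case = 3.33 points, margin of error = +/-10.7 points
# n=100: one case = 1.0 points,  margin of error = +/-5.9 points
# n=300: one case = 0.33 points, margin of error = +/-3.4 points
```

Pass rates closer to 50% widen these margins, and comparing two noisy runs against each other needs more cases than comparing one run against a fixed number.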

Lesson from GPT-4 Evals

OpenAI's GPT-4 technical report (March 2023) disclosed that their internal "dangerous capabilities" eval suite contained separate dimensions for CBRN knowledge uplift, cyberattack enablement, and persuasion, each scored independently with its own case count. An aggregate score would have been meaningless: a model could ace general knowledge while failing dangerous-capability limits.

Handling Competing Dimensions

Sometimes dimensions trade off. A model that maximally refuses ambiguous requests scores perfectly on the safety dimension but poorly on helpfulness. Your suite must make these trade-offs visible — not hide them. The professional practice is to report results in a radar chart or tabular breakdown, allowing stakeholders to see the trade-off surface rather than a single number that obscures it.

Lesson 2 Quiz

Choosing Dimensions and Coverage — 4 questions

When GitHub Copilot was evaluated before launch, which three primary dimensions did the team organize its tests around?

Correct. Copilot's suite covered code correctness, security vulnerability patterns, and copyright/verbatim reproduction risk — each with its own dataset.

Copilot's three primary dimensions were code correctness (compiles and passes tests), security (CWE vulnerability patterns), and copyright risk (verbatim training data reproduction).

What is the main purpose of a coverage matrix in eval suite design?

Correct. The coverage matrix makes gaps visible and deliberate rather than accidental — a ⚠ flag on an empty cell is better than an invisible assumption of coverage.

The coverage matrix exists to make coverage explicit — showing where you have cases, where you have gaps, and allowing you to flag intentional sparse cells rather than hiding them.

How many test cases per dimension does research suggest is the minimum for a score to carry meaningful statistical weight?

Correct. Below 30 cases, a single wrong answer shifts the dimension score by more than 3 percentage points, making the number unreliable for tracking regressions.

The practical minimum is 30 cases. Below that, variance is too high for the score to be meaningful. 100 cases is the target for reliably detecting a 5% regression.

Why does the lesson recommend spending at least 40% of test cases on edge cases and adversarial inputs?

Correct. Edge cases are rare in frequency but over-represented in failure reports. A suite heavy on routine cases produces inflated scores that don't reflect production reliability.

The reasoning is that edge and adversarial cases cause a disproportionate share of real failures — so testing them heavily gives better failure signal even though they're rare in normal usage.

Lab 2 — Coverage Matrix Design

Map dimensions and difficulty levels for a real evaluation scenario

Your Task

You are designing the coverage matrix for an AI medical triage chatbot — a system that helps patients decide whether to go to the emergency room, urgent care, or wait for a regular appointment. The stakes are high: under-triaging sends people home who need emergency care; over-triaging overwhelms ERs with non-urgent cases.

Start by telling the tutor: what are the three most important dimensions you would include in the coverage matrix for this system, and why is each one safety-critical?

Lesson 3 · Module 2

Scoring Methods and Ground Truth

How you decide whether the model passed — and why that decision is harder than it looks

If a model gives a partially correct answer, who decides the score — and how do you make that decision consistent?

When Stephanie Lin, Jacob Hilton, and Owain Evans (University of Oxford and OpenAI) released TruthfulQA in September 2021, they faced a scoring problem that illuminated the entire field. The benchmark contained 817 questions designed to elicit model falsehoods, but how do you score a free-text answer for truthfulness? Their first approach, exact string matching against gold answers, rejected many true answers phrased differently. Their second approach, human raters, cost $15,000 and took six weeks. Their final published method used a fine-tuned GPT-3 classifier trained to match human judgments. The lesson documented in their paper: every scoring method embeds assumptions about what "correct" means, and those assumptions must be made explicit and tested for reliability.

The Scoring Method Spectrum

Scoring methods lie on a spectrum from fully automated to fully human. Each point on the spectrum trades off cost, consistency, and validity differently.

Method	Cost	Consistency	Best For
Exact match	Near zero	Perfect	Classification, multiple choice, structured outputs
Regex / rule-based	Low	High	Format compliance, keyword presence, code patterns
Reference-based (BLEU/ROUGE)	Low	High	Translation, summarization — when reference texts exist
Model-as-judge	Medium	Medium-high	Open-ended generation quality, safety filtering
Human rating	High	Medium	Nuanced quality, novel capability, calibration data
Task completion	Medium	High	Agentic tasks with defined end states
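
The cheapest rows of this table fit in a few lines each. The sketch below shows one plausible form of an exact-match scorer, a regex scorer, and a token-level F1 scorer; the normalization choices (lowercasing, whitespace tokenization) are assumptions, and real suites usually need task-specific normalization.

```python
import re
from collections import Counter


def exact_match(output: str, reference: str) -> float:
    """1.0 when the normalized output equals the reference, else 0.0."""
    return float(output.strip().lower() == reference.strip().lower())


def regex_match(output: str, pattern: str) -> float:
    """1.0 when the pattern occurs anywhere in the output, else 0.0."""
    return float(re.search(pattern, output) is not None)


def token_f1(output: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    out_tokens = output.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match(" Paris ", "paris")
has_amount = regex_match("Refund issued: $12.50", r"\$\d+\.\d{2}")
f1 = token_f1("the cat sat", "the cat sat down")  # precision 1.0, recall 0.75
```

All three are deterministic, which is why the table rates their consistency as high or perfect; their weakness is validity on open-ended outputs.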

Ground Truth: Four Sources

Before you can score, you need a reference: a ground truth. Ground truth for AI evals comes from four places, each with a different reliability profile:

Expert annotation

Domain experts (doctors, lawyers, engineers) label correct responses. High validity but slow and expensive. Used for safety-critical or highly technical dimensions.

Crowdsourced rating

Non-expert raters judge response quality on defined rubrics. Fast and scalable but noisy. Requires inter-rater reliability checks (Cohen's kappa ≥ 0.6 is the usual bar).

Synthetic / programmatic

Answers derived algorithmically from a knowledge source (e.g., database lookup, code execution). Perfectly consistent but only covers questions with definite answers.

Model-generated + reviewed

A stronger model proposes labels; humans review a sample. Cost-effective at scale but risks propagating model biases into the ground truth set.
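
The inter-rater check mentioned under crowdsourced rating can be computed directly. Below is a minimal sketch of Cohen's kappa for two raters labeling the same items; the labels and data are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if each rater assigned labels independently at their own rates.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings: the raters agree on 8 of 10 items.
rater_a = ["good"] * 6 + ["bad"] * 4
rater_b = ["good"] * 5 + ["bad"] * 4 + ["good"]

kappa = cohens_kappa(rater_a, rater_b)  # (0.80 - 0.52) / (1 - 0.52)
```

In this example the raw agreement is 80%, yet kappa comes out just under the 0.6 bar, which is why raw agreement alone can overstate label quality.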

Model-as-Judge: Opportunities and Risks

Using a language model to score another language model's outputs — "LLM-as-judge" — became widespread after the MT-Bench paper (Zheng et al., 2023). The approach is powerful but introduces specific failure modes that must be explicitly tested:

Position bias: Judge models score the first-listed option higher regardless of quality. Mitigation: swap response order in a second judging pass and average.

Verbosity bias: Longer responses score higher regardless of accuracy. Mitigation: include length-controlled test cases in judge calibration.

Self-preference bias: A judge model from the same family as the system under test may rate it higher. Mitigation: use a different-family judge or a fine-tuned specialized judge.

Calibration drift: The judge's standards shift between versions. Mitigation: anchor the judge with a fixed calibration set and check its scores on that set before every eval run.
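
The swap-and-average mitigation for position bias can be sketched as follows. The judge here is a stand-in function, not a real model call; its signature (prompt, first response, second response, returning a preference for the first-listed response between 0 and 1) is an assumption for this example.

```python
from typing import Callable

# Assumed judge interface: returns the preference for the FIRST-listed response.
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, prompt: str, a: str, b: str) -> float:
    """Preference for response `a`, averaged over both presentation orders."""
    a_listed_first = judge(prompt, a, b)
    a_listed_second = 1.0 - judge(prompt, b, a)  # invert to a's point of view
    return (a_listed_first + a_listed_second) / 2.0


def position_gap(judge: Judge, prompt: str, a: str, b: str) -> float:
    """How much a's score changes purely from being listed first."""
    return judge(prompt, a, b) - (1.0 - judge(prompt, b, a))


def biased_judge(prompt: str, first: str, second: str) -> float:
    """Stand-in judge with pure position bias: always favors the first slot."""
    return 0.7


fair = debiased_preference(biased_judge, "Summarize the clause.", "A", "B")
gap = position_gap(biased_judge, "Summarize the clause.", "A", "B")
```

Tracking the gap across a calibration set gives a direct measurement of how strong the bias is, in addition to cancelling it in the averaged score.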

Documented Case — Chatbot Arena (2023)

The LMSYS Chatbot Arena, which uses human head-to-head preference votes to rank models, found that GPT-4 used as an automated judge matched human Elo rankings at r=0.97 — but systematically overrated verbose responses by about 8%. This bias was detectable only because human ratings existed to compare against. Without a human-rated calibration set, the bias would have been invisible.

Writing a Scoring Rubric

For any dimension that cannot be scored by exact match, you need a scoring rubric — a written definition of what each score level means. A four-point rubric for factual accuracy might read:

Score 0 — Incorrect. The response contains a factual error that would mislead a reasonable user.
Score 1 — Incomplete. The response is factually accurate as far as it goes but omits material information the user needs.
Score 2 — Adequate. The response is accurate and includes the key information, with minor omissions or imprecision.
Score 3 — Excellent. The response is accurate, complete, appropriately caveated, and contextually appropriate.

A rubric is only useful if raters agree on it. Pilot every rubric on 20–30 examples with at least two independent raters and report inter-rater agreement before treating the rubric as production-ready.

The Goodhart Problem

Any scoring method, once used to drive model training, becomes a proxy that can be gamed. Models optimized on BLEU scores produce fluent but factually empty text. Models trained to maximize a safety classifier's score learn to avoid trigger words while retaining harmful content. The practical lesson: rotate scoring methods periodically and maintain a held-out set of cases scored by humans that the model has never been trained against.

Lesson 3 Quiz

Scoring Methods and Ground Truth — 4 questions

What was the key lesson from TruthfulQA's scoring problem?

Correct. TruthfulQA's journey through three scoring methods (exact match → human → trained classifier) illustrated that each encodes a theory of correctness that must be tested and disclosed.

The key lesson was that every scoring method — exact match, human rating, classifier — embeds a definition of "correct" that must be made explicit and validated against actual truth.

Which LLM-as-judge bias involves the model scoring the first-presented response higher regardless of quality?

Correct. Position bias leads judge models to favor whichever response appears first in the prompt. The mitigation is to swap the order and average scores across both runs.

This is position bias — the tendency to score the first-listed option higher regardless of quality. Mitigation: swap response order in a second judging pass and average the results.

What minimum inter-rater reliability score (Cohen's kappa) is the conventional bar for crowdsourced ground-truth annotation?

Correct. Cohen's kappa ≥ 0.6 is the widely used threshold for "substantial agreement" in annotation quality. Below this, the labels are too noisy to use as ground truth.

The conventional bar is Cohen's kappa ≥ 0.6, which represents "substantial agreement" among raters. Below 0.6, the annotation noise undermines the value of the ground truth labels.

What does "Goodhart's Law" imply for eval suite scoring methods?

Correct. Goodhart's Law states that when a measure becomes a target, it ceases to be a good measure. This requires rotating scoring methods and maintaining human-rated held-out sets.

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Models trained on a scoring signal learn to maximize the proxy, not the underlying quality it was meant to reflect.

Lab 3 — Scoring Protocol Workshop

Design scoring methods and rubrics for hard-to-evaluate outputs

Your Task

You are evaluating an AI creative writing assistant that helps novelists develop plot ideas. The system generates narrative suggestions in response to story prompts. You need to design scoring methods for two dimensions: narrative coherence (does the suggestion make sense and fit the story?) and originality (is it genuinely novel or a cliché rehash?).

Start by telling the tutor: for the "narrative coherence" dimension, what scoring method would you choose and why? Would you use exact match, a rubric with human raters, model-as-judge, or something else?

Lesson 4 · Module 2

Maintenance, Versioning, and Living Suites

Eval suites are not artifacts — they are living systems that decay if not tended

How do you prevent your eval suite from becoming outdated, contaminated, or no longer meaningful as your AI system evolves?

When Google Brain and collaborators released BIG-Bench in June 2022 — a benchmark of 204 tasks designed to challenge the largest language models — they built in a review mechanism from the start. Within one year, as GPT-4 and Claude solved many tasks that had seemed hard, the team documented in a follow-up paper that 23 of the 204 tasks had become "solved" benchmarks — meaning top models scored above the estimated human ceiling. The original test cases had not changed, but the bar they represented had become irrelevant. This is sometimes called benchmark saturation, and it is the most common form of eval suite decay.

Why Eval Suites Decay

An eval suite that was meaningful at launch can become useless through several mechanisms. Understanding each helps you build maintenance procedures to counter them:

Saturation: Models achieve scores near the ceiling, removing the ability to distinguish better from best. Solution: add harder cases, increase task complexity, or retire saturated dimensions.

Distribution shift: The real-world inputs the system receives evolve, but test cases don't. The suite stops representing actual usage. Solution: periodic refresh with sampled production logs (anonymized).

Model contamination: Test cases leak into training data, either through public disclosure or data supply chains. Solution: keep a private holdout, rotate cases, and version everything.

Rubric drift: The definition of "good" in a rubric shifts informally as team members change. Solution: store rubrics in version control, require formal change review to modify them.

Capability mismatch: New system capabilities (tools, multi-step reasoning) are not represented in the suite. Solution: quarterly capability audits to identify missing dimensions.

Versioning Your Suite

Treat your eval suite like production software. Every change to a test case, rubric, or scoring protocol should produce a new version with a changelog. This matters because:

Regression detection requires version consistency. If test cases change between model versions, you cannot tell whether a score change reflects model improvement or test change.
Audits require historical records. Regulatory frameworks for AI (EU AI Act, FDA guidance for AI-as-medical-device) require documentation of how evaluation was conducted, what version of the suite was used, and when cases were added or retired.
Reproducibility demands a fixed reference. Published model cards that cite eval results must tie those results to a specific, archived suite version — otherwise the claim cannot be verified.
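
One lightweight way to tie results to an exact suite version is a content fingerprint stored next to every score. The sketch below hashes the cases, rubric, and scorer name; the field names and the 12-character truncation are assumptions for illustration, and this complements a human-readable changelog without replacing it.

```python
import hashlib
import json


def suite_fingerprint(cases, rubric, scorer_name):
    """Short hash that changes whenever any case, the rubric, or the scorer changes."""
    payload = json.dumps(
        {
            "cases": sorted(cases, key=lambda c: c["id"]),  # order-independent
            "rubric": rubric,
            "scorer": scorer_name,
        },
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical suite contents.
cases = [
    {"id": "c1", "input": "2 + 2 = ?", "reference": "4"},
    {"id": "c2", "input": "Capital of France?", "reference": "Paris"},
]
rubric = {"0": "Incorrect", "1": "Correct"}

v1 = suite_fingerprint(cases, rubric, "exact_match")
v1_reordered = suite_fingerprint(list(reversed(cases)), rubric, "exact_match")
edited_cases = [dict(cases[0], reference="four"), cases[1]]
v2 = suite_fingerprint(edited_cases, rubric, "exact_match")
```

Reordering cases leaves the fingerprint unchanged, while editing a single reference answer produces a new one, so two scores are comparable only when their fingerprints match.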

Industry Practice — Anthropic Model Cards

Anthropic's published model cards for Claude models list specific eval suite names and version numbers alongside reported scores, along with acknowledgment of known limitations and saturation concerns. This allows external researchers to assess whether the reported metrics are still meaningful or have been superseded by newer, harder versions of the same suite.

A Practical Maintenance Schedule

Teams that maintain healthy suites over multi-year periods typically follow a cadence like this:

Frequency	Activity
Every model release	Run full suite; compare to prior version baseline; flag any dimension where score moves ±3% or more
Monthly	Review production incident log; convert new failure modes into test cases within 2 weeks of incident
Quarterly	Audit coverage matrix; identify dimensions approaching saturation (>90% score); audit rubric consistency across raters
Annually	Full suite review: retire obsolete dimensions, add new capability dimensions, refresh distribution-shifted cases, archive old version
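
The per-release and quarterly checks in this schedule lend themselves to automation. The sketch below flags dimensions that moved by 3 points or more against the baseline, or that exceed the 90% saturation line; it reads the "±3%" in the table as percentage points, which is an interpretation, and the function name and scores are hypothetical.

```python
def review_release(current, baseline, move_threshold=0.03, saturation=0.90):
    """Return {dimension: [notes]} for dimensions that need a human look."""
    flags = {}
    for dim, score in current.items():
        notes = []
        if dim in baseline and abs(score - baseline[dim]) >= move_threshold:
            notes.append("moved")
        if score > saturation:
            notes.append("near saturation")
        if notes:
            flags[dim] = notes
    return flags


# Hypothetical per-dimension scores as fractions of cases passed.
baseline = {"factual": 0.91, "format": 0.96, "refusal": 0.81}
current = {"factual": 0.86, "format": 0.97, "refusal": 0.80}

flags = review_release(current, baseline)
```

Improvements are flagged as well as regressions, since an unexpected jump can indicate contamination or a scoring change as easily as real progress.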

The Holdout Principle

Inspired by the train/test split in ML, a well-managed suite maintains a permanently held-out partition — typically 15–25% of cases — that is never used in any training or fine-tuning pipeline. This partition is the only reliable long-term measure of true generalization. It should:

Never be published or shared outside the evaluation team, even internally.
Only be run at designated release checkpoints — not during iterative development — to preserve its clean status.
Be refreshed annually to prevent de facto contamination through the iterative fine-tuning process over time.
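
A holdout partition is easier to protect when membership is deterministic rather than stored in a shared list. The sketch below assigns roughly 20% of cases to the holdout by hashing each case ID with a salt; the salt value, ID format, and function name are assumptions for this example.

```python
import hashlib


def is_holdout(case_id: str, fraction: float = 0.2, salt: str = "suite-v1") -> bool:
    """Deterministically place about `fraction` of cases in the holdout."""
    digest = hashlib.sha256(f"{salt}:{case_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction


# Hypothetical case IDs.
case_ids = [f"case-{i}" for i in range(10000)]
holdout = [cid for cid in case_ids if is_holdout(cid)]
share = len(holdout) / len(case_ids)
```

Because the assignment depends only on the ID and the salt, a case never migrates between partitions by accident, and keeping the salt within the evaluation team keeps membership private. Note that changing the salt reassigns every case, so an annual refresh is better done by adding new cases than by re-salting existing ones.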

The Meta-Eval Problem

The final challenge of living suites is that your scoring process itself must be evaluated. Does your rubric still capture what you mean by "good"? Does your judge model still agree with human raters at the same rate it did when you calibrated it? A meta-eval — periodically re-running a fixed calibration set through your scoring process and comparing to human judgments from the original calibration — answers this question. The recommended cadence is quarterly or after any change to the scoring model or rubric.
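
A meta-eval of this kind reduces to comparing the judge's current labels on the calibration set with the stored human labels. The sketch below is one simple version; the 5-point tolerance, the label values, and the function names are assumptions for illustration.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of calibration items where the judge matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be the same non-zero length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def judge_has_drifted(current_agreement, calibrated_agreement, tolerance=0.05):
    """True when agreement has fallen by more than `tolerance` since calibration."""
    return (calibrated_agreement - current_agreement) > tolerance


# Hypothetical calibration set of 10 items with stored human labels.
human = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge_at_calibration = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
judge_today = ["pass", "fail", "fail", "pass", "pass", "pass", "pass", "fail", "pass", "fail"]

then = agreement_rate(judge_at_calibration, human)  # 9 of 10 match
now = agreement_rate(judge_today, human)            # 7 of 10 match
drifted = judge_has_drifted(now, then)
```

A real calibration set would need far more than 10 items for the agreement rate to be stable; the small list here only keeps the example readable.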

Connecting Suites to Decision Processes

The most common eval suite failure is not technical decay — it is organizational disconnection. Suites that are not tied to release gates, model card requirements, or incident review processes drift into exercises that teams run but no one acts on. The practical requirement is simple: identify at least one decision that will be blocked by a failing eval score. This creates the organizational pressure that keeps suites maintained and meaningful.
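
Tying a suite to a decision can be as simple as a gate function that a release pipeline calls. This is a minimal sketch with hypothetical dimension names and thresholds; a dimension with no reported score is treated as failing, which is a deliberate assumption so that skipped evals cannot pass silently.

```python
def release_gate(scores, minimums):
    """Return (ok, failures); failures maps dimension -> (score or None, minimum)."""
    failures = {}
    for dim, minimum in minimums.items():
        score = scores.get(dim)
        if score is None or score < minimum:
            failures[dim] = (score, minimum)
    return (not failures, failures)


# Hypothetical minimum scores agreed with release owners.
minimums = {"safety": 0.98, "factual": 0.85, "format": 0.90}
scores = {"safety": 0.99, "factual": 0.82}  # "format" was never run

ok, failures = release_gate(scores, minimums)
```

In a CI pipeline the caller would typically exit with a non-zero status when ok is False, which is what turns the suite from a report into a blocking decision.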

Lesson 4 Quiz

Maintenance, Versioning, and Living Suites — 4 questions

What happened to 23 of BIG-Bench's 204 tasks within one year of its 2022 release?

Correct. Benchmark saturation — models exceeding human-level performance on tasks — is the most common form of eval suite decay, rendering those dimensions unable to distinguish better from best.

Those 23 tasks became saturated: top models scored above the estimated human ceiling, meaning the tasks could no longer distinguish between model quality levels. This is benchmark saturation.

Why is versioning an eval suite treated the same as versioning production software?

Correct. If test cases change between model evaluations, score differences are uninterpretable — you cannot tell whether the model improved or the test got easier. Versioning preserves interpretability.

The reason is interpretability: if you change test cases without versioning, you cannot determine whether a score change reflects model improvement or test change. Versioning preserves the ability to track regression.

What is the "holdout principle" in eval suite management?

Correct. The holdout partition — typically 15–25% of cases — is never shared, never used in training, and only run at designated release points to preserve its value as a true measure of generalization.

The holdout principle means keeping 15–25% of cases permanently sequestered from all training and development pipelines, running them only at official release checkpoints to maintain clean generalization signal.

What organizational practice ensures an eval suite remains meaningful over time?

Correct. Suites disconnected from real decisions become theater. Tying at least one release gate or deployment decision to eval results creates the organizational pressure to keep suites maintained and honest.

The key organizational connection is tying eval results to a decision that can be blocked by failure. Without this, suites drift into exercises no one acts on, and maintenance motivation disappears.

Lab 4 — Suite Lifecycle Planning

Design a maintenance and versioning strategy for a long-lived eval suite

Your Task

You have just shipped a content moderation AI for a social media platform. The eval suite you built at launch contains 500 test cases across 6 dimensions: hate speech, harassment, spam, misinformation, CSAM detection, and self-harm content. You need to plan for 3 years of maintenance as the platform evolves, adversarial users adapt, and model capabilities improve.

Start by telling the tutor: which of the five decay mechanisms (saturation, distribution shift, contamination, rubric drift, capability mismatch) is your highest risk in year one for a content moderation system, and what would your first maintenance action be?

Module 2 — Module Test

Designing Eval Suites · 15 questions · Pass at 80%

1. Which of the following best describes an eval suite?

Correct.

An eval suite is distinguished by its organization: dimensions, scoring protocol, and a baseline for comparison — not just a collection of test prompts.

2. HELM's key methodological contribution was organizing evaluations across multiple dimensions rather than a single aggregate score. How many metric categories did HELM use?

Correct. HELM used 7 metric categories: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.

HELM used 7 metric categories across 42 scenarios.

3. Why do production teams maintain private eval suites alongside public benchmarks?

Correct. Data contamination — models absorbing public test cases through training — makes private, unseen test cases the only reliable proxy for true capability.

Private suites exist because public benchmarks can be indirectly memorized through training on internet data — a phenomenon called data contamination.

4. The three sources for building a minimum viable eval suite are:

Correct. These three sources give immediate practical coverage: existing failures, required capabilities, and hard limits.

The three practical sources are known failure modes (bugs and complaints), capability requirements (what the system must do), and safety boundaries (what it must never do).

5. GitHub Copilot's pre-launch evaluation drew its security dimension test cases primarily from:

Correct. The security dimension contained 80+ patterns drawn from the MITRE CWE list, making it the first documented use of a public vulnerability taxonomy to structure an LLM eval suite.

Copilot's security eval dimension drew from the MITRE CWE list — over 80 known vulnerability patterns.

6. What is a coverage matrix in eval suite design?

Correct. The coverage matrix makes gaps visible — a ⚠ flag on an empty cell is better than an invisible assumption of coverage.

A coverage matrix is a grid with dimensions as rows and difficulty levels as columns, showing which cells are covered and whether each gap is intentional or an oversight.
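A coverage matrix can be built directly from test-case metadata. The sketch below assumes each case is a dictionary with `dimension` and `difficulty` keys (hypothetical field names) and marks empty cells with ⚠ unless they are listed as intentional gaps.

```python
from collections import Counter

def coverage_matrix(cases, dimensions, difficulties):
    """Count cases per (dimension, difficulty) cell, including empty cells."""
    counts = Counter((c["dimension"], c["difficulty"]) for c in cases)
    return {
        dim: {level: counts.get((dim, level), 0) for level in difficulties}
        for dim in dimensions
    }

def render(matrix, difficulties, intentional_gaps=frozenset()):
    """Render one line per dimension: a count, '-' for an intentional gap, or ⚠."""
    lines = []
    for dim, row in matrix.items():
        cells = []
        for level in difficulties:
            if row[level] > 0:
                cells.append(str(row[level]))
            elif (dim, level) in intentional_gaps:
                cells.append("-")
            else:
                cells.append("⚠")  # unexplained gap: make it visible
        lines.append(f"{dim}: " + " | ".join(cells))
    return "\n".join(lines)
```

Passing the full list of dimensions and difficulty levels explicitly, rather than inferring them from the cases, is what lets a completely empty row or column show up in the output.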

7. To reliably detect a 5% regression in a dimension score, approximately how many test cases are needed?

Correct. 100 cases is the working target for detecting a 5% regression in a dimension score. The statistical power actually achieved depends on the baseline pass rate and on whether the two runs are compared on the same cases.

100 cases is the working target for detecting a 5% regression. 30 is the minimum for a meaningful variance estimate; 300 is the target for detecting 3% changes.
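To get a feel for how suite size affects the precision of a dimension score, the sketch below computes the approximate 95% margin of error on a pass rate using the normal approximation. The 80% baseline pass rate and the function name are assumptions for illustration; real margins depend on the actual pass rate, and comparing two runs on the same cases (a paired comparison) can be more sensitive than these single-run margins suggest.

```python
import math

def pass_rate_margin(n_cases: int, pass_rate: float = 0.8, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval on a pass rate."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n_cases)

for n in (30, 100, 300):
    print(n, round(pass_rate_margin(n), 3))
# 30 0.143
# 100 0.078
# 300 0.045
```

Under these assumptions, a single run on 100 cases pins the pass rate down to roughly plus or minus 8 points, so small regressions are easier to confirm when the same cases are re-run and compared item by item.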

8. The AI Now Institute documented that hiring AI systems had error rates 2–4× higher for non-white-sounding names. What eval suite design failure caused this to remain invisible?

Correct. Without demographic slicing as an explicit dimension, the bias was absorbed into the average score and invisible until real-world data was analyzed.

The failure was dimensional: demographic performance was never a dimension in the eval suite, so the 2–4× error rate disparity hid inside aggregate accuracy numbers.
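The effect of slicing can be sketched with a small helper that reports error rates per slice next to the aggregate. The field names (`group`, `correct`) are hypothetical, and the numbers in the usage lines are synthetic toy values, not figures from any report.

```python
from collections import defaultdict

def overall_error_rate(results):
    """Fraction of results marked incorrect across the whole suite."""
    return sum(not r["correct"] for r in results) / len(results)

def error_rate_by_slice(results, slice_key):
    """Error rate for each distinct value of results[i][slice_key]."""
    totals, errors = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r[slice_key]] += 1
        if not r["correct"]:
            errors[r[slice_key]] += 1
    return {s: errors[s] / totals[s] for s in totals}

# Synthetic toy data: 90 cases in group A (5 errors), 10 in group B (2 errors).
results = (
    [{"group": "A", "correct": i >= 5} for i in range(90)]
    + [{"group": "B", "correct": i >= 2} for i in range(10)]
)
print(overall_error_rate(results))            # aggregate looks acceptable
print(error_rate_by_slice(results, "group"))  # the per-group gap becomes visible
```

In this toy example the aggregate error rate is 7%, while group B's error rate is 20%, which is the kind of gap that stays hidden when no slice is reported.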

9. TruthfulQA's final published scoring method was:

Correct. After rejecting exact match (too rigid) and pure human rating (too expensive), TruthfulQA settled on a fine-tuned classifier that approximated human judgment at scale.

TruthfulQA's final method was a fine-tuned GPT-3 classifier trained to replicate human truthfulness ratings — after finding exact match and pure human rating both inadequate.

10. Which LLM-as-judge bias is mitigated by swapping response order between two judging passes and averaging the scores?

Correct. Position bias — scoring the first-presented response higher — is mitigated by presenting both orders and averaging, canceling out the positional advantage.

Position bias (favoring the first-listed response) is mitigated by swapping order and averaging. Verbosity bias requires length-controlled calibration; self-preference requires a different-family judge.
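The order-swap mitigation can be sketched as a thin wrapper around any pairwise judge. Here `judge` is a hypothetical callable that takes a prompt and two responses in presentation order and returns a score for each; it stands in for whatever model-as-judge call a team actually uses.

```python
from typing import Callable, Tuple

# Hypothetical judge signature: (prompt, first_response, second_response)
# -> (score_for_first, score_for_second)
Judge = Callable[[str, str, str], Tuple[float, float]]

def debiased_pairwise_scores(
    judge: Judge, prompt: str, response_a: str, response_b: str
) -> Tuple[float, float]:
    """Judge both presentation orders and average each response's two scores."""
    a_first, b_second = judge(prompt, response_a, response_b)  # A shown first
    b_first, a_second = judge(prompt, response_b, response_a)  # B shown first
    score_a = (a_first + a_second) / 2
    score_b = (b_first + b_second) / 2
    return score_a, score_b
```

If the judge adds a constant bonus to whichever response appears first, each response collects that bonus exactly once across the two passes, so it cancels in the comparison. This costs two judge calls per pair and does nothing for verbosity or self-preference bias.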

11. What is "Goodhart's Law" and why does it matter for eval suite scoring?

Correct. Goodhart's Law explains why models trained to maximize BLEU produce fluent nonsense, and why safety classifiers can be gamed — the proxy becomes the target.

Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Models optimize the proxy rather than the underlying property, which is why scoring methods must be rotated and held-out sets maintained.

12. What is "benchmark saturation"?

Correct. Saturation means the benchmark can no longer discriminate — as happened with 23 of BIG-Bench's 204 tasks within one year of its release.

Benchmark saturation occurs when top models score at or above the human ceiling, making the benchmark unable to rank models by capability. This happened to 23 BIG-Bench tasks within one year.
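A saturation check can be as simple as comparing the best observed score on each task with its human ceiling. The sketch below assumes both are available as dictionaries keyed by task name; the function name and the optional `margin` parameter are illustrative assumptions.

```python
def saturated_tasks(top_scores: dict, human_ceiling: dict, margin: float = 0.0) -> list:
    """Return tasks whose best model score is within `margin` of, or above, the human ceiling.

    Tasks without a recorded human ceiling are skipped rather than guessed at.
    """
    return sorted(
        task
        for task, score in top_scores.items()
        if task in human_ceiling and score >= human_ceiling[task] - margin
    )
```

Running this after each evaluation cycle gives an early list of dimensions that can no longer separate strong models and may need harder cases.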

13. The recommended size of a permanently held-out suite partition is:

Correct. A 15–25% holdout is large enough to produce reliable scores at release but small enough to leave the majority of cases available for development-time iteration.

The practical guidance is 15–25%: large enough for statistical reliability, small enough to preserve most cases for active development use.

14. How often should a "meta-eval" — re-running calibration cases through the scoring process and comparing to original human judgments — be conducted?

Correct. Quarterly meta-evals catch scoring drift before it quietly invalidates an entire evaluation cycle's worth of results.

The recommended cadence for meta-eval is quarterly or after any change to the scoring process — catching drift before it silently corrupts an entire evaluation period's results.
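A meta-eval comparison can be sketched as two agreement statistics between the current scorer's labels and the original human labels on the calibration cases: raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The function names are illustrative, and any alert threshold on these values is a team decision not specified here.

```python
from collections import Counter

def agreement_rate(scorer_labels, human_labels) -> float:
    """Fraction of calibration cases where the scorer matches the human label."""
    if len(scorer_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(s == h for s, h in zip(scorer_labels, human_labels))
    return matches / len(human_labels)

def cohens_kappa(scorer_labels, human_labels) -> float:
    """Chance-corrected agreement between scorer and human labels."""
    n = len(human_labels)
    p_observed = agreement_rate(scorer_labels, human_labels)
    scorer_counts, human_counts = Counter(scorer_labels), Counter(human_labels)
    labels = set(scorer_counts) | set(human_counts)
    p_expected = sum(scorer_counts[l] * human_counts[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both sides used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)
```

Tracking both numbers from one meta-eval to the next makes drift visible as a trend: a scorer whose kappa falls between runs is changing its behavior even if headline suite scores look stable.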

15. What is the most common cause of eval suites becoming useless in organizations — not through technical decay, but through organizational failure?

Correct. Eval suites that don't block anything become theater — run for compliance, ignored for decisions. Tying at least one decision to suite results is the simplest fix.

The core organizational failure is disconnection: when no decision is blocked by a failing eval, the suite loses its reason for existing and maintenance motivation evaporates.