Model Evaluation and Benchmarks

1. The BBQ (Bias Benchmark for QA) benchmark tests for bias in language models by presenting what specific type of scenario?

Correct. BBQ uses ambiguous contexts, where the passage does not contain enough information to answer and the right response is "unknown", to test whether models default to stereotyped attributions when they should express uncertainty.

BBQ's design: present ambiguous contexts where the passage gives no basis for a definitive answer. A fair model should express uncertainty; a biased model attributes the behavior in question to the stereotyped group, so its answers under ambiguity reveal the bias.
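
One way to summarize a model's behavior on ambiguous items is sketched below. This is a simplified illustration and not BBQ's official bias score; the label names, the `ambiguous_context_summary` helper, and the example answers are all assumptions made for this sketch.

```python
def ambiguous_context_summary(answers):
    """Summarize answers to ambiguous-context items.

    answers: list of 'unknown', 'stereotyped', or 'anti_stereotyped' labels
    (hypothetical label names, one per item).
    """
    if not answers:
        raise ValueError("answers must not be empty")
    total = len(answers)
    unknown = answers.count("unknown")
    committed = total - unknown  # items where the model picked a group
    stereotyped = answers.count("stereotyped")
    return {
        # How often the model correctly declines to pick a group.
        "unknown_rate": unknown / total,
        # Among committed answers, how many follow the stereotype.
        "stereotyped_share": stereotyped / committed if committed else 0.0,
    }

# Invented answers for ten ambiguous items.
example = ["unknown"] * 6 + ["stereotyped"] * 3 + ["anti_stereotyped"]
print(ambiguous_context_summary(example))
```

A low `unknown_rate` combined with a `stereotyped_share` well above 0.5 would be the pattern the feedback above describes.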

2. What is "reward hacking" in RLHF, and which principle does it instantiate?

Correct. Reward hacking is Goodhart's Law in the RLHF setting: the policy learns to maximize the reward proxy through means that don't reflect genuine quality improvement. Regular human spot-checks of high-reward outputs are the primary detection mechanism.

Not quite. Reward hacking is when the model learns to score highly on the reward model without genuinely improving — an instance of Goodhart's Law applied to the reward proxy.
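
The spot-check idea can be sketched as a small sampling routine: rank outputs by reward-model score and pull a random sample from the top slice for human review. The function name, the `top_fraction` and `sample_size` defaults, and the data are hypothetical choices for illustration.

```python
import random

def sample_high_reward(outputs, rewards, top_fraction=0.05, sample_size=20, seed=0):
    """Pick a random sample of the highest-reward outputs for human review."""
    if len(outputs) != len(rewards):
        raise ValueError("outputs and rewards must have the same length")
    # Sort by reward, highest first.
    ranked = sorted(zip(rewards, outputs), key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    top_outputs = [text for _, text in ranked[:cutoff]]
    rng = random.Random(seed)  # fixed seed so the review batch is reproducible
    return rng.sample(top_outputs, min(sample_size, len(top_outputs)))

# Invented data: 100 outputs whose reward equals their index.
outputs = [f"o{i}" for i in range(100)]
rewards = list(range(100))
print(sample_high_reward(outputs, rewards, top_fraction=0.1, sample_size=3))
```

If reviewers judge many of these top-scoring outputs to be poor, the reward proxy and real quality have diverged.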

3. The Pass@k metric in code evaluation measures:

Correct. Pass@k is the probability that at least one of k sampled solutions is functionally correct — used by HumanEval and other code benchmarks to account for stochastic generation.

Incorrect. Pass@k measures whether at least one of k samples solves the problem — a probabilistic measure for stochastic code generation. Review Lesson 3.
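
As a worked illustration, the unbiased pass@k estimator described in the HumanEval paper (Chen et al., 2021) can be sketched as follows. With n samples per problem, of which c pass the unit tests, it computes 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` and the example counts are assumptions for this sketch.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed the unit tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no passing sample.
    # comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    prob_all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - prob_all_fail

# Hypothetical problem: 200 samples drawn, 10 of them pass.
print(pass_at_k(200, 10, 1))    # for k=1 this reduces to c / n
print(pass_at_k(200, 10, 100))
```

Per-problem estimates are then averaged over the benchmark to give the reported score.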

4. Which type of evaluation gap is MOST significant for assessing whether a model is safe to deploy in a high-stakes autonomous setting?

Correct. Autonomous high-stakes deployment requires understanding how a model behaves across many sequential decisions, under error recovery demands, and with real tools — none of which single-turn benchmarks can reveal.

Incorrect. For autonomous high-stakes deployment, the most critical gap is the inability of standard benchmarks to assess multi-step, agentic, long-horizon behavior — precisely what matters most when a model is acting autonomously.

5. What innovation did BIG-Bench introduce specifically to address training data contamination?

Correct. BIG-Bench introduced canary strings — distinctive token sequences that let practitioners check whether benchmark data appears in training corpora.

Incorrect. BIG-Bench's contamination solution was canary strings. Review Lesson 2.
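
A canary check amounts to a substring search over the training corpus. The sketch below assumes the corpus is available as an iterable of strings; the `CANARY` value is a placeholder invented for illustration and should be replaced with the string published alongside the benchmark.

```python
from typing import Iterable

# Placeholder value for illustration only, not a real benchmark canary.
CANARY = "canary GUID 00000000-0000-0000-0000-000000000000"

def find_canary(documents: Iterable[str], canary: str = CANARY) -> list[int]:
    """Return the indices of documents that contain the canary string."""
    return [i for i, doc in enumerate(documents) if canary in doc]

corpus = [
    "An ordinary web page about cooking.",
    f"Scraped benchmark file. {CANARY} Question: ...",
]
print(find_canary(corpus))
```

Any hit means benchmark files were scraped into the corpus; an empty result does not prove the absence of contamination, since copies may have had the canary stripped.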

6. How many multiple-choice questions does the original MMLU benchmark contain?

Correct. The MMLU test set contains 14,042 questions across 57 subjects.

Incorrect. The MMLU test set contains 14,042 questions. 817 is TruthfulQA; roughly 8,500 is GSM8K; 448 is GPQA.

7. A model scores 98% on an eval dataset and scores haven't changed across five model versions. This describes which dataset failure mode?

Correct. When scores are consistently near-perfect and don't differentiate between model versions, the dataset has saturated — it has no remaining discriminative power.

Incorrect. Near-perfect, non-differentiating scores indicate saturation. Review Lesson 2's three failure modes.
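
A simple saturation check can be sketched as a rule on the score history: every version scores near the ceiling and the spread across versions is negligible. The `ceiling` and `max_spread` thresholds below are arbitrary assumptions, not values from the lesson.

```python
def is_saturated(scores, ceiling=0.95, max_spread=0.01):
    """Flag a dataset whose scores sit near the maximum and barely move across versions."""
    if not scores:
        raise ValueError("scores must not be empty")
    near_ceiling = min(scores) >= ceiling
    flat = max(scores) - min(scores) <= max_spread
    return near_ceiling and flat

# Five model versions all at 98%: no discriminative power left.
print(is_saturated([0.98, 0.98, 0.98, 0.98, 0.98]))  # True
# Scores still climbing across versions: the dataset still differentiates.
print(is_saturated([0.71, 0.78, 0.84, 0.90, 0.93]))  # False
```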

8. Which approach to resolving rater disagreement is most appropriate when annotator subjectivity reflects genuine diversity in human opinion rather than measurement error?

Correct. When disagreement reflects genuine diversity of perspective (as in subjective tasks like sentiment or offensiveness), forcing a majority-vote gold label erases minority viewpoints. Preserving the annotation distribution is the recommended approach.

Not quite. For tasks where disagreement reflects legitimate diversity of perspective, the emerging best practice is to preserve the full distribution of rater judgments — majority vote erases minority perspectives that may be equally valid.
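
The difference between the two approaches is easy to see in code. This minimal sketch contrasts a majority-vote gold label with a preserved label distribution (a "soft label"); the helper names and the ratings are invented for illustration.

```python
from collections import Counter

def majority_label(ratings):
    """Collapse ratings to the single most common label."""
    return Counter(ratings).most_common(1)[0][0]

def label_distribution(ratings):
    """Keep every rater's judgment as a probability distribution over labels."""
    counts = Counter(ratings)
    total = len(ratings)
    return {label: count / total for label, count in counts.items()}

# Five hypothetical raters judging one item.
ratings = ["offensive", "not_offensive", "offensive", "not_offensive", "offensive"]
print(majority_label(ratings))      # offensive
print(label_distribution(ratings))  # {'offensive': 0.6, 'not_offensive': 0.4}
```

The majority label hides the fact that two of five raters disagreed; the distribution keeps that information available for training or evaluation.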

9. What statistical test is appropriate for comparing two model versions on paired binary (pass/fail) eval items?

Correct. McNemar's test is specifically designed for paired binary outcomes — comparing two classifiers (or model versions) on the same set of items, where each item has a binary pass/fail outcome.

Incorrect. McNemar's test is the appropriate test for paired binary outcomes in model comparison. Review Lesson 4.
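
A minimal sketch of the exact (binomial) form of McNemar's test follows. Only the discordant items matter: those where exactly one of the two versions passes. The function name and the example results are assumptions; statistics libraries also provide this test, and the sketch only makes the mechanics visible.

```python
from math import comb

def mcnemar_exact(results_a, results_b):
    """Two-sided exact McNemar test on paired pass/fail results.

    Returns (b, c, p_value), where b counts items only version A passes
    and c counts items only version B passes.
    """
    if len(results_a) != len(results_b):
        raise ValueError("results must be paired item by item")
    b = sum(1 for a, bb in zip(results_a, results_b) if a and not bb)
    c = sum(1 for a, bb in zip(results_a, results_b) if not a and bb)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant items, so no evidence of a difference
    # Under the null hypothesis each discordant item favors either version with probability 0.5.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Invented results: 5 items both pass, 1 only A passes, 8 only B passes, 2 both fail.
version_a = [True] * 5 + [True] * 1 + [False] * 8 + [False] * 2
version_b = [True] * 5 + [False] * 1 + [True] * 8 + [False] * 2
print(mcnemar_exact(version_a, version_b))  # (1, 8, 0.0390625)
```

Items where both versions pass or both fail carry no information about which version is better, which is why they drop out of the calculation.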

10. "Format gaming" of a benchmark refers to:

Correct. Format gaming exploits the structure of evaluation — answer distribution regularities, positional biases, formatting conventions — rather than improving the capability the benchmark is designed to measure.

Incorrect. Format gaming specifically exploits the structural conventions of benchmark evaluation (e.g., answer position distributions) to gain points without improving the underlying measured capability.
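
One concrete way to see how much a benchmark's structure gives away is to score a "model" that ignores the question entirely and always picks the most common answer position. The sketch below is illustrative; the answer key is invented and `position_baseline` is a hypothetical helper.

```python
from collections import Counter

def position_baseline(answer_keys):
    """Return the most frequent answer position and the accuracy of always guessing it."""
    if not answer_keys:
        raise ValueError("answer_keys must not be empty")
    position, hits = Counter(answer_keys).most_common(1)[0]
    return position, hits / len(answer_keys)

# Invented answer key for ten four-option questions.
keys = list("CACBCDCACC")
print(position_baseline(keys))  # ('C', 0.6)
```

With four balanced options this baseline would sit near 0.25; a much higher value means scores can rise by exploiting the answer distribution alone.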

11. The MIT/University of Washington 2023 study detected contamination by observing that:

Correct. Paraphrase degradation — performance dropping on semantically equivalent but lexically different questions — is the behavioural signature of memorisation.

The paraphrase degradation finding is the key result: genuine understanding should be robust to surface rephrasing; memorisation is not.
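
Paraphrase degradation can be measured as the accuracy drop between the original items and their paraphrases. The sketch below assumes per-item correctness flags are already available for both versions; the helper names and the numbers are invented and are not taken from the study.

```python
def accuracy(correct_flags):
    """Fraction of items answered correctly."""
    if not correct_flags:
        raise ValueError("correct_flags must not be empty")
    return sum(correct_flags) / len(correct_flags)

def paraphrase_drop(original_correct, paraphrased_correct):
    """Accuracy on original wording minus accuracy on paraphrased wording."""
    return accuracy(original_correct) - accuracy(paraphrased_correct)

# Invented results on ten items and their paraphrases.
original = [True] * 9 + [False] * 1
paraphrased = [True] * 6 + [False] * 4
print(round(paraphrase_drop(original, paraphrased), 3))
```

A drop near zero is what robust understanding predicts; a large positive drop is the memorisation signature described above.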

12. What is the purpose of "anchor examples" in a human annotation protocol?

Correct. Anchor examples give raters a concrete reference for each point on the scale, aligning their internal standards so that a "4" from one rater means the same thing as a "4" from another.

Not quite. Anchor examples are pre-labeled outputs at specific scale points that calibrate rater judgment — without them, different raters may operate with completely different mental models of what each rating level means.

13. A research team reports Fleiss's κ = 0.35 on their annotation task. According to conventional interpretation, what does this indicate?

Correct. Under Fleiss's conventional guidelines, κ below 0.40 indicates poor agreement (the Landis and Koch scale labels the 0.21 to 0.40 band "fair"). A κ of 0.35 suggests the annotation protocol, most likely the operationalization, instructions, or anchor examples, needs substantial revision before the data can be trusted.

Not quite. Fleiss's κ = 0.35 falls below 0.4, the conventional threshold for poor agreement. This indicates the annotation protocol needs revision — the task is likely under-operationalized or lacks adequate anchor examples.
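
For reference, Fleiss's κ can be computed from a table with one row per item and one column per category, where each cell counts the raters who chose that category. This is a minimal sketch with an invented table; it assumes every item received the same number of ratings and that ratings are not all in a single category (which would make the denominator zero).

```python
def fleiss_kappa(counts):
    """Fleiss's kappa for a table of per-item category counts."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])
    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_categories)]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)  # agreement expected by chance
    return (p_observed - p_expected) / (1 - p_expected)

# Invented table: 4 items, 3 raters, 2 categories.
table = [[3, 0], [2, 1], [1, 2], [0, 3]]
print(round(fleiss_kappa(table), 3))
```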

14. What is benchmark saturation?

Correct. Saturation means the benchmark can no longer differentiate top models because they all score near the maximum.

Incorrect. Saturation refers to the point where top models all score near ceiling — the benchmark loses discriminating power for comparing frontier systems.

15. What specific technique did EleutherAI use to demonstrate score sensitivity without changing model weights?

Correct. Changing only the regex for answer extraction shifted LLaMA-1 65B MMLU scores by up to 3.2 percentage points.

EleutherAI changed the answer extraction regex — how the model's text output is parsed to determine if it chose the correct option — producing up to 3.2pp of score difference.
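
The effect is easy to reproduce in miniature. The two patterns below are invented for illustration and are not the ones EleutherAI used; the model outputs are also made up. The point is only that the same outputs receive different scores under different extraction rules.

```python
import re

# Hypothetical strict rule: the whole output must be "Answer: X".
STRICT = re.compile(r"^Answer:\s*([ABCD])\s*$")
# Hypothetical lenient rule: first standalone capital letter A-D anywhere in the output.
LENIENT = re.compile(r"\b([ABCD])\b")

def score(outputs, gold, pattern):
    """Accuracy when answers are parsed from raw text with the given pattern."""
    hits = 0
    for text, answer in zip(outputs, gold):
        match = pattern.search(text)
        if match and match.group(1) == answer:
            hits += 1
    return hits / len(gold)

outputs = ["Answer: B", "The answer is C.", "Answer: (A)", "I think D is right"]
gold = ["B", "C", "A", "D"]
print(score(outputs, gold, STRICT))   # 0.25
print(score(outputs, gold, LENIENT))  # 1.0
```

Neither rule is "right": the strict one penalizes formatting, while the lenient one can pick up stray letters (for example the article "A" at the start of a sentence), so the extraction rule should be reported alongside the score.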

16. What is the primary advantage of pass@k with k=100 over pass@1 in code generation evaluation?

Correct. pass@100 asks whether the model can produce a correct solution at all when given many attempts. It shows whether correct solutions exist anywhere in the model's output distribution, which is useful for estimating capability ceilings independently of consistency.

pass@100 is a recall-oriented metric: it tests whether any correct solution is in the model's distribution, not just whether the first sample is correct. A higher k measures best-case capability.

17. The 2021 Sap et al. toxicity study found that aggregate IRR statistics had concealed what kind of pattern?

Correct. Demographic disaggregation of disagreements revealed a systematic pattern: White annotators rated African American English text as toxic at significantly higher rates than Black annotators, a bias invisible in aggregate agreement statistics.

Not quite. The concealed pattern was systematic demographic bias — specifically that White annotators rated AAE text as more toxic than Black annotators did — invisible until rater demographics were disaggregated.
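
Disaggregation itself is a small grouping step. The sketch below uses fully synthetic annotations with neutral placeholder labels (`group_1`, `group_2`, `variety_x`); it does not reproduce the study's data, and the helper name is hypothetical.

```python
from collections import defaultdict

def toxic_rate_by_group(annotations):
    """Toxic-label rate per (rater_group, text_variety) pair.

    annotations: iterable of (rater_group, text_variety, is_toxic) tuples.
    """
    totals = defaultdict(int)
    toxic = defaultdict(int)
    for group, variety, is_toxic in annotations:
        key = (group, variety)
        totals[key] += 1
        toxic[key] += int(is_toxic)
    return {key: toxic[key] / totals[key] for key in totals}

# Synthetic annotations of the same kind of text by two rater groups.
annotations = (
    [("group_1", "variety_x", flag) for flag in (True, True, False, True)]
    + [("group_2", "variety_x", flag) for flag in (False, True, False, False)]
)
pooled = sum(flag for _, _, flag in annotations) / len(annotations)
print(pooled)                            # 0.5
print(toxic_rate_by_group(annotations))  # group_1: 0.75, group_2: 0.25
```

The pooled rate of 0.5 looks unremarkable; only the per-group breakdown shows that the two groups rate the same text very differently.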

18. In a three-gate CI/CD eval pipeline, which gate runs the FULL eval suite including LLM-as-judge scoring?

Correct. Gate 2 (staging deploy) runs the complete eval suite, targeting under 30 minutes. Gate 1 runs a fast 50–100 item deterministic subset; Gate 3 runs online eval on live traffic.

Incorrect. The full eval suite including LLM-as-judge runs at Gate 2 (staging). Review Lesson 4's CI/CD integration patterns.

19. Perplexity-based contamination probing works by:

Correct. The perplexity gap between original and paraphrased versions is the contamination signal — memorised text gets lower perplexity (higher probability) than semantically equivalent novel phrasing.

Perplexity probing exploits the fact that models assign higher probability to memorised sequences. A gap between original and paraphrased versions indicates the model has memorised the specific phrasing.
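
The probe can be sketched from per-token log-probabilities, assuming the model under test exposes them. Perplexity is the exponential of the negative mean log-probability; the log-probability values below are invented, and no specific model API is implied.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def perplexity_gap(original_logprobs, paraphrase_logprobs):
    """Paraphrase perplexity minus original perplexity; large positive values suggest memorisation."""
    return perplexity(paraphrase_logprobs) - perplexity(original_logprobs)

# Invented log-probabilities for a benchmark item and a paraphrase of it.
original = [-0.1, -0.2, -0.1, -0.05]
paraphrase = [-1.2, -0.9, -1.5, -1.1]
print(perplexity_gap(original, paraphrase))
```

In practice the gap would be compared against the gap observed on text known not to be in the training data, since paraphrases can be slightly less fluent for reasons unrelated to memorisation.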

20. Chain-of-thought (CoT) prompting most significantly improves performance on which type of task?

Correct. Wei et al. (2022) showed CoT dramatically improves performance on tasks like GSM8K that require chaining intermediate reasoning steps. It has little effect on single-step retrieval tasks.

CoT's largest gains appear on multi-step reasoning tasks — math word problems, logic puzzles, symbolic reasoning — where explicitly generating intermediate steps helps the model maintain coherent reasoning chains.

Final Exam