Evaluation and Testing for AI

1. The COMPAS recidivism algorithm case illustrates which evaluation failure?

Correct.

COMPAS's designers measured aggregate accuracy but did not evaluate subgroup-specific false positive rates. Later subgroup analysis revealed that Black defendants were incorrectly flagged as high-risk at nearly twice the rate of white defendants.
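A subgroup check of this kind can be sketched in a few lines. The records below are invented toy data, not COMPAS data, and the group names "A" and "B" and the function names are hypothetical; the point is only that one aggregate accuracy figure can hide a twofold gap in false positive rates.

```python
from collections import defaultdict

def false_positive_rate_by_group(records):
    """records: iterable of (group, predicted_high_risk, reoffended) tuples."""
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, predicted, actual in records:
        if not actual:                 # person did not reoffend
            negatives[group] += 1
            if predicted:              # but was flagged high-risk
                false_positives[group] += 1
    return {g: false_positives[g] / negatives[g] for g in negatives}

def accuracy(records):
    records = list(records)
    return sum(predicted == actual for _, predicted, actual in records) / len(records)

# Invented toy data: (group, predicted_high_risk, reoffended)
records = (
    [("A", True, False)] * 4 + [("A", False, False)] * 6
    + [("A", True, True)] * 6 + [("A", False, True)] * 4
    + [("B", True, False)] * 2 + [("B", False, False)] * 8
    + [("B", True, True)] * 8 + [("B", False, True)] * 2
)

overall_accuracy = accuracy(records)               # one number for everyone
fpr_by_group = false_positive_rate_by_group(records)
print(overall_accuracy, fpr_by_group)
```

In this toy set, group A's false positive rate is twice group B's even though only a single overall accuracy is reported.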

2. The Biden Administration's October 2023 Executive Order on AI Safety made red-teaming:

Correct. The EO made sharing red-team findings with the government mandatory before deployment of powerful models, converting red-teaming from a voluntary best practice to a regulatory requirement.

The EO required sharing red-team results with the federal government before deploying powerful models — converting voluntary practice to a regulatory condition.

3. Which of the following is an example of Goodhart's Law applied to AI benchmarks?

Correct. Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Fine-tuning on benchmark-adjacent data improves the score without a corresponding improvement in the underlying capability the benchmark was designed to measure.

Goodhart's Law describes the corruption of a metric when it becomes a target. The clearest benchmark manifestation is optimizing specifically for the score — through data selection, fine-tuning, or cherry-picking — rather than developing the underlying capability.

4. What type of production drift is hardest to detect automatically, and why?

Correct. Concept drift requires knowing what the correct answer should be — which often requires human labeling or indirect proxies, neither of which is available instantaneously at scale.

Concept drift is hardest because detecting it requires knowing what correct outputs should look like after the world has changed, which requires ground truth or proxy signals, not just distribution statistics.
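A toy illustration of why distribution statistics alone miss concept drift, under heavy simplifying assumptions: the "model" is a hypothetical fixed threshold rule, the inputs are identical in both periods, and only the labeling rule changes. All names and numbers here are invented for the sketch.

```python
from statistics import mean

def input_mean_shift(reference_inputs, live_inputs):
    """A crude input-distribution monitor: absolute change in the mean."""
    return abs(mean(live_inputs) - mean(reference_inputs))

def accuracy(model, inputs, labels):
    """Needs ground-truth labels, which a production monitor rarely has in real time."""
    return mean([1.0 if model(x) == y else 0.0 for x, y in zip(inputs, labels)])

def model(x):
    # Hypothetical decision rule learned before the world changed.
    return x >= 50

inputs = list(range(0, 100, 5))              # same inputs in both periods
labels_before = [x >= 50 for x in inputs]    # original concept
labels_after = [x >= 70 for x in inputs]     # the concept moved; the inputs did not

shift = input_mean_shift(inputs, inputs)
acc_before = accuracy(model, inputs, labels_before)
acc_after = accuracy(model, inputs, labels_after)
print(shift, acc_before, acc_after)
```

The input monitor reports no shift at all, while accuracy drops; only the comparison against fresh labels exposes the change.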

5. A benchmark reaches saturation when:

Correct. Saturation means the best systems are indistinguishable from each other on the metric, typically within 1–2 percentage points, so the benchmark can no longer serve its function of discriminating capability levels.

Saturation is specifically about discriminative power: when top models cluster within measurement noise, the benchmark cannot tell them apart and should be retired or replaced.
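One rough way to check whether top models "cluster within measurement noise" is to compare the gap between the best and worst of the leaders against the sampling error of an accuracy score. The sketch below assumes accuracy on `n_items` independent test items and treats the two scores as independent samples, which is a simplification (the same items are normally reused across models). The function names and the 1.96 cutoff are illustrative choices, not a standard definition of saturation.

```python
import math

def accuracy_standard_error(accuracy, n_items):
    """Binomial standard error of an accuracy measured on n_items."""
    return math.sqrt(accuracy * (1 - accuracy) / n_items)

def is_saturated(top_scores, n_items, z=1.96):
    """True if the spread among the top scores is within ~95% sampling noise."""
    best, worst = max(top_scores), min(top_scores)
    se_diff = math.sqrt(
        accuracy_standard_error(best, n_items) ** 2
        + accuracy_standard_error(worst, n_items) ** 2
    )
    return (best - worst) <= z * se_diff

# Hypothetical leaderboards on a 1,000-item benchmark.
print(is_saturated([0.97, 0.965, 0.96], n_items=1000))  # leaders one point apart
print(is_saturated([0.80, 0.70, 0.60], n_items=1000))   # leaders clearly separated
```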

6. The MT-Bench paper (Zheng et al., 2023) is significant because it demonstrated that:

Correct. MT-Bench showed ~80% agreement between GPT-4 judgments and human preferences, matching inter-human agreement rates, which legitimized automated LLM-as-Judge evaluation.

MT-Bench demonstrated that GPT-4 as a judge achieved ~80% agreement with human preferences — comparable to inter-human agreement. This was the key finding that drove adoption of LLM-as-Judge methods.

7. SWE-bench was created because HumanEval was found to have low ecological validity for which type of work?

Correct. HumanEval tests isolated algorithmic functions solvable in under 20 lines. SWE-bench was created to test repository-level engineering (fixing real GitHub issues in real codebases), where HumanEval performance was found to be a poor predictor.

SWE-bench specifically targets real repository-level engineering: models must navigate existing codebases, understand architectural context, and produce patches that pass existing test suites — tasks fundamentally different from HumanEval's isolated function generation.

8. Why is red-teaming described as a falsification tool rather than a verification tool?

Correct. The logical asymmetry — failures are informative, non-failures are ambiguous — means red-team results can discover specific vulnerabilities but cannot certify that no vulnerabilities exist.

Red-teaming falsifies safety claims by finding failures, but cannot verify safety — the absence of found failures may reflect an incomplete or insufficiently creative exercise rather than genuine safety.

9. A 2022 meta-analysis found that fewer than what percentage of NLP benchmark papers reported effect-size measures?

Correct. Fewer than 8% of NLP benchmark papers reported effect-size measures, revealing that the field routinely reports statistical significance without an accompanying measure of whether the difference is practically meaningful.

The documented figure was fewer than 8%, which is strikingly low compared to fields like psychology and medicine where effect-size reporting is standard practice.
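To make the distinction concrete, the sketch below compares two hypothetical models whose accuracies differ by half a point on a very large test set. It uses a two-proportion z-test for significance and Cohen's h as the effect size; both are standard formulas, but the accuracies, sample sizes, and the assumption of independent samples are invented for illustration.

```python
import math

def two_proportion_z(p1, p2, n1, n2):
    """z statistic for the difference of two independent proportions (pooled SE)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions (arcsine transform)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Hypothetical results: 90.5% vs 90.0% accuracy on 200,000 items each.
z = two_proportion_z(0.905, 0.900, 200_000, 200_000)
h = cohens_h(0.905, 0.900)

significant = abs(z) > 1.96   # conventional 5% two-sided cutoff
negligible = abs(h) < 0.2     # below Cohen's conventional "small" effect
print(f"z={z:.2f} significant={significant}  h={h:.4f} negligible={negligible}")
```

With a large enough test set the difference is statistically significant while the effect size stays far below even the "small" convention, which is exactly the information a significance-only report omits.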

10. Why is ROUGE-L an inappropriate primary quality gate for an open-ended generation task like customer support responses?

Correct. The University of Edinburgh analysis found low correlation between ROUGE-L and human judgments for open-ended generation tasks.

ROUGE-L correlates poorly with human judgments for open-ended generation, making it an unreliable gate for that task type.
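The weakness is easy to reproduce with a simplified ROUGE-L. The sketch below is a minimal token-level version (whitespace tokenization, plain F1 over the longest common subsequence), not the reference implementation, and the support-reply strings are invented. A helpful paraphrase scores low, while a reply that contradicts the reference scores high.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "you can reset your password from the account settings page"
paraphrase = "go to settings and choose forgot password to get a new one"
contradiction = "you can not reset your password from the account settings page"

score_paraphrase = rouge_l_f1(reference, paraphrase)
score_contradiction = rouge_l_f1(reference, contradiction)
print(round(score_paraphrase, 3), round(score_contradiction, 3))
```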

11. What Cohen's κ threshold is generally used to flag poor inter-annotator agreement requiring guideline revision?

Correct. κ below 0.4 is considered poor agreement — raters are interpreting the task inconsistently, signaling that guidelines need revision before the full annotation run proceeds.

κ below 0.4 signals poor agreement. A common rule of thumb: <0.4 poor, 0.4–0.6 moderate, 0.6–0.8 substantial, >0.8 near-perfect.
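For reference, Cohen's κ for two raters is (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. A minimal sketch follows; the annotation labels are invented, and the function does not handle the degenerate case where p_e equals 1.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented pilot annotations on 10 items.
rater_a = ["good"] * 6 + ["bad"] * 4
rater_b = ["good"] * 4 + ["bad"] * 4 + ["good"] * 2

kappa = cohens_kappa(rater_a, rater_b)
needs_guideline_revision = kappa < 0.4
print(round(kappa, 3), needs_guideline_revision)
```

Here the raters agree on 60% of items, but because chance agreement is already 52%, κ lands well under the 0.4 threshold.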

12. The Zou et al. (2023) adversarial suffix paper's most alarming finding was:

Correct. Transferability was the key finding: suffixes discovered against accessible open-source models worked against proprietary commercial models despite never having been tested on them.

The critical finding was transferability — suffixes optimized against open-source models successfully attacked proprietary black-box commercial models, including GPT-4.

13. The 2024 CMU cipher-based jailbreak study found which key vulnerability across GPT-4, Claude 2, and Gemini Pro?

Correct. Simple encoding transformations bypassed safety training across multiple frontier models — demonstrating that safety training in natural language does not automatically extend to encoded representations of the same harmful content.

Simple ciphers (Caesar, ROT13, Base64) reliably bypassed safety training across all three models because safety training in natural language didn't generalize to encoded inputs.

14. RLAIF stands for:

Correct. RLAIF (Reinforcement Learning from AI Feedback) uses AI-generated preference labels instead of human labels to train reward models, dramatically reducing annotation costs.

RLAIF stands for Reinforcement Learning from AI Feedback. It substitutes AI-generated preference labels for human ones in the reward modeling stage of RLHF training pipelines.

15. HELM's key methodological contribution was organizing evaluations across multiple dimensions rather than a single aggregate score. How many metric categories did HELM use?

Correct. HELM used 7 metric categories: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.

HELM used 7 metric categories across 42 scenarios.

16. Which of the following is the best description of a "regression surface" in an AI system?

Correct. The regression surface includes model weights, prompts, retrieval systems, API versions, infrastructure, and preprocessing — every change point where behavior can silently degrade.

The regression surface is every change point in the system — model weights, prompts, retrieval indexes, API versions, preprocessing — where a change could cause previously acceptable behavior to degrade.
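One way to make the regression surface visible is to fingerprint each change point and diff the fingerprints between releases. The sketch below is an assumption about how that might look, not a described practice from the course: the component names mirror the list above, and the manifest values (checkpoint names, prompt text, versions) are hypothetical.

```python
import hashlib
import json

def fingerprint(component):
    """Stable hash of any JSON-serializable component description."""
    payload = json.dumps(component, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def changed_components(previous, current):
    """Names of components whose fingerprint differs between two manifests."""
    names = sorted(set(previous) | set(current))
    return [
        name for name in names
        if fingerprint(previous.get(name)) != fingerprint(current.get(name))
    ]

# Hypothetical release manifests.
previous = {
    "model_weights": "ckpt-001",
    "system_prompt": "You are a helpful support agent.",
    "retrieval_index": "2024-01",
    "api_version": "v1",
    "preprocessing": {"lowercase": True},
}
current = dict(
    previous,
    system_prompt="You are a concise support agent.",
    retrieval_index="2024-02",
)

changed = changed_components(previous, current)
print(changed)  # any entry here is a reason to rerun the regression suite
```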

17. The Kevin Liu / Bing Chat "Sydney" incident in 2023 demonstrated which security risk?

Correct. Liu's injection revealed the hidden "Sydney" system prompt including its instruction to remain confidential — demonstrating that system prompt secrecy cannot be enforced through prompt-level instructions alone.

Liu used direct injection to extract the hidden system prompt, including its self-referential instruction to remain secret — demonstrating that system prompt confidentiality cannot be reliably enforced via instructions to the model.

18. FinanceBench was designed to expose the validity gap between MMLU Finance scores and what real-world capability?

Correct. FinanceBench targets the gap between MMLU Finance's multiple-choice conceptual recall and the actual skills required in financial analysis: synthesizing noisy data, identifying material risks in qualitative disclosures, and reasoning across conflicting sources.

FinanceBench specifically tests numerical reasoning over real financial documents and qualitative synthesis — capabilities that MMLU Finance's multiple-choice format cannot assess despite measuring related domain knowledge.

19. What is position bias and how is it mitigated in pairwise evaluation?

Correct. Position bias is the tendency to prefer the first or last item presented. Randomizing which response appears in position A vs. B allows detection and correction.

Position bias is primacy/recency preference — favoring the first or last item. Randomize presentation order across raters to detect and correct for it.
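A closely related check, sketched here as an assumption rather than the method described above, is to present each pair in both orders and keep only verdicts that survive the swap. `toy_judge` is a deliberately biased stand-in for a human rater or LLM judge: it prefers the longer response, but falls back to position A when the two are close in length.

```python
def toy_judge(response_a, response_b):
    """Hypothetical judge: returns "A" or "B". Biased toward A on near-ties."""
    if abs(len(response_a) - len(response_b)) <= 5:
        return "A"
    return "A" if len(response_a) > len(response_b) else "B"

def judge_both_orders(judge, response_1, response_2):
    """Return the winning response, or None if the verdict depends on order."""
    first = judge(response_1, response_2)
    second = judge(response_2, response_1)   # same pair, positions swapped
    winner_first = response_1 if first == "A" else response_2
    winner_second = response_2 if second == "A" else response_1
    return winner_first if winner_first == winner_second else None

clear = judge_both_orders(toy_judge, "short", "a much longer and more detailed answer")
near_tie = judge_both_orders(toy_judge, "answer one", "answer two!")
print(clear, near_tie)
```

Pairs that return `None` are the ones where position, not content, decided the outcome; their rate across an evaluation set gives an estimate of how much position bias is present.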

20. What is "threshold creep" in the context of AI regression programs?

Correct. Threshold creep is an organizational failure mode: each sprint, rather than fixing the regression, the team adjusts the threshold. Over time, standards erode to the point where the regression program no longer protects quality.

Threshold creep is an organizational failure mode. Rather than blocking deployment and fixing a regression, teams under pressure relax the acceptable threshold, just this once. Repeated across many deployments, this progressively erodes quality standards until the regression program is meaningless.
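Because each individual adjustment looks small, creep is easiest to see in the threshold's history rather than in any single release. The sketch below assumes the team records a minimum pass rate per release; the release names and values are invented, and the function name is hypothetical.

```python
def loosened_releases(threshold_history):
    """threshold_history: chronological list of (release, minimum_pass_rate).

    Returns (release, amount) for every release that lowered the bar.
    """
    loosened = []
    for (_, prev), (release, curr) in zip(threshold_history, threshold_history[1:]):
        if curr < prev:
            loosened.append((release, round(prev - curr, 4)))
    return loosened

# Invented history of a regression gate's minimum pass rate.
history = [("r1", 0.95), ("r2", 0.95), ("r3", 0.94), ("r4", 0.94), ("r5", 0.92)]

creep = loosened_releases(history)
total_erosion = round(history[0][1] - history[-1][1], 4)
print(creep, total_erosion)
```

A simple guard built on this idea would require explicit sign-off whenever `loosened_releases` is non-empty for a proposed change.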
