Module 6 · Lesson 1

Why Humans Are Still in the Loop

Automatic metrics capture what is measurable. Human judges capture what matters.

When does a machine score mislead — and what does a human rater actually see that software cannot?

When OpenAI released GPT-3 in June 2020, automated benchmarks on SuperGLUE placed it near human parity on reading comprehension tasks. Yet the moment real users interacted with it, they surfaced failures that no automated test caught: the model confidently fabricated legal citations, wrote plausible-sounding but factually inverted historical claims, and produced text that was grammatically perfect yet logically circular. Perplexity scores had said nothing about truthfulness. OpenAI's own red-teamers — human evaluators — were the ones who eventually catalogued these failure modes and forced a rethinking of what evaluation should measure.

The Limits of Automated Metrics

Automated metrics dominate AI evaluation because they are cheap, reproducible, and scalable. BLEU scores a translation in milliseconds; perplexity runs on a single GPU pass; accuracy on a benchmark is a single number that fits in a leaderboard cell. But each of these metrics measures a proxy — a signal that correlates with quality under controlled conditions — not quality itself.
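To make the proxy idea concrete, here is a minimal sketch of clipped n-gram precision, the overlap count at the core of BLEU. It is not a full BLEU implementation (there is no brevity penalty and no averaging across n-gram orders), and the example sentences are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: share of candidate n-grams found in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

# Illustrative sentences (assumed for this sketch, not taken from any dataset).
reference = "the dog bit the man".split()
inverted = "the man bit the dog".split()   # same words, opposite meaning

unigram_inverted = modified_precision(inverted, reference, 1)
bigram_inverted = modified_precision(inverted, reference, 2)
print(unigram_inverted, bigram_inverted)
```

The candidate reverses who bit whom, yet every one of its words and most of its word pairs appear in the reference, so the overlap stays high. A human reader would flag the inversion immediately.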

The problem is that proxies break when models game them. In 2022, researchers at Google published a paper showing that models fine-tuned directly on BLEU scores produced translations that maximized n-gram overlap while simultaneously becoming harder for human readers to parse. The metric had been optimized; comprehension had degraded. This is an instance of Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.

Human evaluation does not suffer from this specific failure because human judges assess the actual experience of reading and using a model's output — not a mathematical abstraction of it. A human can notice that a response is technically accurate but condescending in tone, or that it answers the literal question while ignoring the obvious intent behind it.

Core Tension

Automated metrics are scalable but gameable. Human evaluation is valid but expensive. Most serious evaluation pipelines use both, treating automated scores as a filter and human judgment as the ground truth.

What Human Raters Actually Assess

Human evaluation protocols are designed to elicit judgments about properties that resist formalization. The most common target properties include:

Fluency Does the text read naturally? Are sentences grammatically well-formed and stylistically appropriate for the context? Automated metrics can approximate this, but human raters catch awkward phrasing that n-gram models miss.

Coherence Does the response hang together logically? Does each sentence follow from what came before? Long-range coherence failures — where a model contradicts itself three paragraphs later — are nearly invisible to sentence-level metrics.

Faithfulness For summarization and grounded tasks: does the output stay true to the source material? Hallucination detection is a core human evaluation task that no current automated metric reliably handles.

Helpfulness Does the response actually serve the user's need? This requires understanding intent, which is fundamentally a human judgment. A response that is accurate but unhelpful scores identically to an accurate, helpful response on most automated metrics.

Safety Does the output avoid harms? Toxicity classifiers catch egregious slurs but miss subtle manipulative framing, culturally specific offenses, and context-dependent harm — all of which human reviewers can flag.

The Historical Baseline Problem

One of the most important functions of human evaluation is establishing what "human-level performance" actually means on a given task. When researchers claimed GPT-4 achieved "human parity" on certain benchmarks in 2023, the claim was contested precisely because the human baseline had been collected under conditions — time pressure, single-pass annotation — that differed radically from how the model was being used. Human evaluation is required not just to assess model outputs, but to properly characterize the human performance bar those outputs are being compared against.

This is why organizations like Anthropic, OpenAI, and Google DeepMind maintain large-scale human evaluation programs alongside their automated testing infrastructure. The automated metrics tell you whether something changed; the humans tell you whether the change was good.

Key Takeaway

Human evaluation is not a fallback for when automated metrics fail. It is the epistemic foundation against which automated metrics are validated. Metrics are useful only to the extent that they predict what humans would judge — and that relationship must be periodically re-established through direct human assessment.
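One common way to check whether a metric still predicts human judgment is a rank correlation between metric scores and human ratings on the same outputs. The sketch below implements Spearman's rank correlation in plain Python; the score lists are invented, and a real validation would use far more items.

```python
def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: one automated score and one mean human rating per output.
metric_scores = [0.20, 0.40, 0.60, 0.80, 0.90]
human_before = [1, 2, 3, 4, 5]   # metric and humans rank outputs identically
human_after = [5, 4, 3, 2, 1]    # after the metric is gamed, the ranking flips

print(spearman(metric_scores, human_before))
print(spearman(metric_scores, human_after))
```

A correlation that was high when the metric was adopted and has since fallen is the signal that the metric needs to be re-validated against fresh human ratings.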

Lesson 1 Quiz

Why Humans Are Still in the Loop — 3 questions

1. What is the primary reason automated metrics like BLEU can be misleading even when they show high scores?

Correct. When BLEU becomes a training target, models learn to maximize n-gram overlap without improving — and sometimes while hurting — actual readability, exactly as the 2022 Google study demonstrated.

Not quite. The core problem is that optimization pressure against a proxy measure decouples it from the underlying quality it was meant to track — Goodhart's Law in action.

2. Which evaluation property refers to whether a summarization model's output stays true to the source document without inventing details?

Correct. Faithfulness — sometimes called groundedness or factual consistency — measures whether the model's output is supported by the provided source material. Hallucination detection is a central faithfulness task.

Not quite. Faithfulness is the property that covers factual accuracy relative to source material. Fluency is about grammar and style; coherence is about logical flow; helpfulness is about meeting user needs.

3. When GPT-3 launched in 2020, what type of evaluation first surfaced its tendency to produce confident factual errors that automated benchmarks had missed?

Correct. OpenAI's human red-teamers and early real users uncovered failure modes — fabricated citations, factually inverted claims — that benchmarks like SuperGLUE had completely missed because those benchmarks measured proxy signals, not truthfulness.

That's not right. SuperGLUE placed GPT-3 near human parity — it was part of the misleading picture. Human red-teamers and users were the ones who discovered the hallucination and factual error problems.

Lab 1 — The Metric Gap

Explore when automated scores diverge from human judgment

Your Task

You're working with an AI evaluation assistant who can help you reason through cases where automated metrics and human raters disagree. Bring a scenario, a question about a specific evaluation challenge, or ask the assistant to walk you through a concrete example from NLP evaluation history.

Complete at least 3 exchanges to finish this lab.

Suggested start: "Give me an example where a model scored highly on BLEU but would clearly fail a human fluency check" — or ask your own question about the limits of automated metrics.

Module 6 · Lesson 2

Annotation Design and Rater Protocols

The quality of human evaluation is entirely determined before any rater sees a single output.

What choices made in an annotation protocol determine whether human evaluation is trustworthy or just expensive noise?

In 2008, Rion Snow and colleagues at Stanford published a foundational study showing that aggregated Amazon Mechanical Turk annotations could match expert quality under the right conditions. Subsequent research spelled out the caveat in brutal detail. A 2017 analysis of crowd-sourced sentiment datasets found that when instructions were ambiguous, inter-annotator agreement dropped below 60%, making the resulting labels statistically unreliable as a training signal. The labels had been collected; the protocol had not been designed. The data looked clean; it was not.

The Annotation Task as a Measurement Instrument

An annotation protocol is a measurement instrument in the scientific sense. Like a thermometer or a survey scale, it has reliability (does it produce consistent results across raters and time?) and validity (does it measure what it claims to measure?). Most annotation failures are failures of instrument design, not rater incompetence.

The single most consequential design choice is the operationalization of the target property. "Rate the quality of this response" is not an operationalization; it is an open invitation for raters to apply their own idiosyncratic theories of quality. A proper operationalization specifies: quality along what dimension, using what scale, with what anchors, assessed relative to what baseline.
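The four parts of an operationalization can be written down as a small data structure that refuses to call itself complete until each part is filled in. This is a sketch only: the class name `AnnotationProtocol`, its fields, and the example values are hypothetical, not part of any standard tool.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotationProtocol:
    dimension: str                               # what is being judged
    scale_points: tuple                          # e.g. (1, 2, 3, 4, 5)
    anchors: dict = field(default_factory=dict)  # scale point -> example output
    baseline: str = ""                           # what outputs are judged against

    def missing_pieces(self):
        """List the parts of the operationalization that are still unspecified."""
        problems = []
        if self.dimension.strip().lower() in {"", "quality"}:
            problems.append("dimension is too vague")
        if len(self.scale_points) < 2:
            problems.append("scale needs at least two points")
        if not self.anchors:
            problems.append("no anchor examples")
        if not self.baseline.strip():
            problems.append("no baseline specified")
        return problems

# "Rate the quality of this response" leaves almost everything unspecified.
vague = AnnotationProtocol(dimension="quality", scale_points=(1, 2, 3, 4, 5))

# A more complete specification (all values invented for illustration).
specific = AnnotationProtocol(
    dimension="resolves the customer's stated issue",
    scale_points=(1, 2, 3, 4, 5),
    anchors={1: "ignores the issue", 3: "partial fix", 5: "full, verified fix"},
    baseline="a response written by a trained support agent",
)

print(vague.missing_pieces())
print(specific.missing_pieces())
```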

Scale Design

Most human evaluation uses one of three scale types:

Likert Scale A 5- or 7-point scale from "strongly disagree" to "strongly agree" (or "very poor" to "very good"). Simple and familiar, but susceptible to scale compression — raters often avoid the extremes — and to inconsistent mental models of what each number means.

Comparative / Pairwise Raters see two outputs and select the better one (or declare a tie). Pairwise comparison reduces cognitive load, avoids scale anchoring problems, and tends to produce higher inter-rater agreement. Elo systems can convert pairwise results into continuous quality scores. LMSYS Chatbot Arena uses exactly this design.

Best-Worst Scaling (BWS) Raters see a set of items (typically 4) and select the best and worst. BWS is statistically efficient — each judgment yields more information than a single Likert rating — and is increasingly used in NLP evaluation research.
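To show how best-worst judgments turn into scores, the sketch below uses the simple counting approach: each item's score is the number of times it was chosen best, minus the number of times it was chosen worst, divided by the number of times it was shown. The judgments are invented, and real studies typically vary which items appear together.

```python
from collections import defaultdict

def bws_scores(judgments):
    """judgments: iterable of (items_shown, best_item, worst_item) tuples."""
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for items, best_item, worst_item in judgments:
        for item in items:
            shown[item] += 1
        best[best_item] += 1
        worst[worst_item] += 1
    # Scores fall in [-1, 1]: +1 means always best, -1 means always worst.
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}

# Three hypothetical raters each judge the same set of four outputs.
judgments = [
    (("A", "B", "C", "D"), "A", "D"),
    (("A", "B", "C", "D"), "A", "C"),
    (("A", "B", "C", "D"), "B", "D"),
]

scores = bws_scores(judgments)
for item, score in sorted(scores.items(), key=lambda pair: -pair[1]):
    print(item, round(score, 3))
```

Each judgment says something about all four items at once, which is where the efficiency gain over a single Likert rating comes from.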

LMSYS Chatbot Arena as a Live Protocol

The LMSYS Chatbot Arena, launched in 2023 by researchers at UC Berkeley, is the most prominent live deployment of pairwise human evaluation at scale. Users submit a prompt, receive responses from two anonymous models, and vote for the better response. By mid-2024 the platform had collected over one million human preference votes. The resulting Elo leaderboard has become one of the most influential rankings in the field precisely because it is grounded in genuine user preferences rather than benchmark scores.

The Arena's design makes several deliberate protocol choices: raters are the actual users (high ecological validity), tasks are self-selected (realistic distribution of queries), and the pairwise format minimizes scale-anchoring artifacts. Its limitation is that raters are self-selected and uncontrolled — a protocol trade-off that the researchers openly acknowledge.
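The conversion from pairwise votes to ratings can be sketched with the textbook Elo update. This is a generic illustration, not the Arena's actual code; the starting rating of 1000, the K-factor of 32, and the model names are assumptions chosen for the example.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings, model_a, model_b, outcome, k=32):
    """Update ratings in place. outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (outcome - expected_a)
    ratings[model_b] += k * ((1.0 - outcome) - (1.0 - expected_a))

# Two hypothetical models start level; one user votes for model_a.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
update_elo(ratings, "model_a", "model_b", outcome=1.0)
print(ratings)
```

Because each update depends on the current ratings, the final numbers depend on the order in which votes arrive. That is one reason large leaderboards often fit all votes jointly instead of updating one vote at a time.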

Protocol Design Checklist

Before collecting a single annotation: specify the target dimension precisely, define the scale anchors with concrete examples, write a rater guide with worked examples, establish a qualification test, and determine your minimum acceptable inter-annotator agreement before analysis begins.

Rater Instructions and the Anchor Problem

Even well-designed scales fail if rater instructions are ambiguous. The industry best practice is to provide anchor examples: actual outputs that have been pre-labeled as "1", "3", and "5" (or equivalent) by domain experts, so raters can calibrate their internal scale against a shared reference. Without anchors, two raters using the same 5-point scale may be operating with completely different mental models of what a "4" looks like.
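Anchor examples also make it possible to check a rater's calibration before they label real data. The sketch below compares one rater's scores on anchor items with the expert labels; the function name, the tolerance of one scale point, and the data are all hypothetical.

```python
def calibration_report(rater_labels, anchor_labels, tolerance=1):
    """Compare a rater's scores on anchor items against the expert labels.

    rater_labels and anchor_labels map item id -> score on the same scale.
    """
    diffs = [rater_labels[item] - gold for item, gold in anchor_labels.items()]
    within = sum(abs(d) <= tolerance for d in diffs) / len(diffs)
    mean_bias = sum(diffs) / len(diffs)   # positive means the rater scores high
    return {"within_tolerance": within, "mean_bias": mean_bias}

# Expert-labeled anchors at the "1", "3", and "5" points of a 5-point scale.
anchor_labels = {"anchor_low": 1, "anchor_mid": 3, "anchor_high": 5}

# A hypothetical rater who is consistently a little generous.
rater_labels = {"anchor_low": 2, "anchor_mid": 4, "anchor_high": 5}

report = calibration_report(rater_labels, anchor_labels)
print(report)
```

A rater who lands within tolerance but shows a steady positive or negative bias may only need a reminder of the anchors; a rater who misses the anchors entirely probably needs retraining.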

Google's internal evaluation guidelines for large language model outputs — portions of which have been described in published research — include multiple worked examples per rating level per dimension, precisely to address this calibration problem. The investment in instruction design is substantial, but the alternative is data that cannot be aggregated or compared.

Key Takeaway

Annotation protocol design is not an administrative task that precedes the real work of evaluation. It is the real work. Every hour spent sharpening operationalizations and calibrating anchor examples saves dozens of hours of unusable or misleading data downstream.

Lesson 2 Quiz

Annotation Design and Rater Protocols — 3 questions

1. LMSYS Chatbot Arena uses which evaluation design to build its model leaderboard?

Correct. The Arena shows two anonymous model responses to the same prompt; users vote for the better one; pairwise votes are aggregated into an Elo leaderboard — producing one of the most ecologically valid rankings in the field.

Not quite. The Arena uses pairwise comparison (not Likert or BWS) from self-selected real users, with Elo conversion. This design minimizes scale-anchoring artifacts while capturing genuine user preferences.

2. What is an "anchor example" in a human evaluation protocol, and why is it important?

Correct. Anchor examples are pre-labeled outputs at each rating level (e.g., a "1", "3", and "5" on a 5-point scale) that give raters a concrete, shared reference so their internal scales align.

That's not right. An anchor example is a pre-labeled output at a specific scale point used to calibrate raters. Without anchors, different raters may have completely different mental models of what each number on a scale represents.

3. The 2017 analysis of crowd-sourced sentiment datasets found that poor inter-annotator agreement was primarily caused by what?

Correct. The core problem was protocol design failure — ambiguous instructions led raters to apply their own theories of sentiment, producing agreement rates below 60% and statistically unreliable labels.

Not quite. The root cause was ambiguous instructions — a protocol design failure. When raters lack clear operationalizations and anchor examples, even capable raters produce inconsistent data.

Lab 2 — Protocol Design Workshop

Design a human evaluation protocol for a specific AI task

Your Task

Work with the AI assistant to design a human evaluation protocol for a task you specify. The assistant will push you to operationalize your target dimension precisely, choose an appropriate scale, and draft anchor examples. You'll receive critique and suggestions grounded in real annotation research.

Complete at least 3 exchanges to finish this lab.

Suggested start: "I want to evaluate a customer service chatbot for helpfulness. Help me design the annotation protocol." — or bring your own evaluation scenario.

Module 6 · Lesson 3

Inter-Rater Reliability and Disagreement Analysis

Agreement is not consensus. Disagreement is often the signal, not the noise.

How do you know whether your human evaluators are measuring the same thing — and what should you do when they are not?

In 2021, Sap and colleagues published research showing that annotators from different demographic backgrounds systematically disagreed on which text samples were toxic — particularly on African American English (AAE) text. White annotators labeled AAE text as toxic at significantly higher rates than Black annotators reading the same text. The inter-rater agreement statistics looked acceptable in aggregate; the systematic bias was invisible without demographic disaggregation of rater judgments. The IRR number had hidden the problem. The research forced the field to rethink what disagreement means — and whose disagreement counts.

Measuring Agreement: The Standard Metrics

Inter-rater reliability (IRR) quantifies the degree to which independent raters produce the same judgments. Several statistics are in common use:

Percent Agreement The simplest measure: proportion of items where all raters agree. Easy to understand but inflated by chance agreement, especially when label distributions are skewed. Two raters who both always choose "acceptable" will show 100% agreement even if they have no shared understanding of the label.

Cohen's κ (Kappa) Corrects for chance agreement between two raters. κ = (observed agreement − chance agreement) / (1 − chance agreement). Values above 0.6 are generally considered acceptable; above 0.8 is strong. Widely used in NLP annotation tasks with two raters.

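Fleiss's κ Extends chance-corrected agreement to more than two raters. It assumes every item receives the same number of ratings, though not necessarily from the same raters, so it suits projects where multiple annotators each label a subset of items. The most common IRR statistic in large-scale NLP annotation projects.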

Krippendorff's α Handles missing data, multiple raters, and ordinal/interval/ratio scales — more flexible than κ. Preferred in communication research and increasingly adopted in AI evaluation when scale level matters.

Intraclass Correlation (ICC) Used when ratings are continuous or on an interval scale. Common in speech quality evaluation (e.g., Mean Opinion Scores for voice synthesis) and in rubric-based scoring tasks.
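The gap between percent agreement and Cohen's κ is easiest to see on skewed labels. The sketch below implements both from the formula given above; the two label lists are invented.

```python
from collections import Counter

def percent_agreement(labels_a, labels_b):
    """Share of items on which the two raters gave the same label."""
    return sum(x == y for x, y in zip(labels_a, labels_b)) / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - chance) / (1 - chance) for two raters."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    chance = sum(counts_a[l] * counts_b[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    if chance == 1.0:
        return float("nan")   # both raters used a single label; kappa is undefined
    return (observed - chance) / (1.0 - chance)

# Ten items, heavily skewed toward "ok". The raters never agree on a "bad" item.
rater_a = ["ok"] * 8 + ["ok", "bad"]
rater_b = ["ok"] * 8 + ["bad", "ok"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(agreement, round(kappa, 3))
```

Here the raters agree on 80% of items, but chance alone would produce 82% agreement given how often each uses "ok", so κ comes out slightly negative. When both raters use only one label, chance agreement is 1 and κ is undefined, which this sketch reports as NaN.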

Interpreting Disagreement

Low IRR is often treated as a problem to be fixed — by rewriting instructions, retraining raters, or discarding borderline items. But this framing can be wrong. Some disagreements are genuine: the underlying judgment is inherently ambiguous, or different raters are applying legitimately different but equally valid perspectives. Forcing artificial consensus in these cases produces a dataset that misrepresents the actual distribution of human judgments.

A more productive approach is disagreement analysis: systematically examining which items produce disagreement, which rater characteristics predict the direction of disagreement, and whether the disagreement pattern reveals something important about the task or about the population of potential users. This is exactly what Sap et al. did in their 2021 toxicity work — the demographic analysis of disagreement was the scientific contribution, not a quality-control failure.
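A first pass at disagreement analysis can be as simple as breaking label rates down by rater group and listing the items that split the pool. The sketch below uses invented annotations and placeholder group names; it illustrates the bookkeeping, not the analysis from any published study.

```python
from collections import defaultdict

def toxic_rate_by_group(annotations):
    """annotations: list of (item_id, rater_group, label), label 1 = toxic, 0 = not."""
    totals = defaultdict(int)
    toxic = defaultdict(int)
    for _, group, label in annotations:
        totals[group] += 1
        toxic[group] += label
    return {group: toxic[group] / totals[group] for group in totals}

def contested_items(annotations):
    """Item ids that received more than one distinct label."""
    labels_by_item = defaultdict(set)
    for item_id, _, label in annotations:
        labels_by_item[item_id].add(label)
    return sorted(item for item, labels in labels_by_item.items() if len(labels) > 1)

# Hypothetical annotations: four texts, one rater from each of two groups.
annotations = [
    ("t1", "group_a", 1), ("t1", "group_b", 0),
    ("t2", "group_a", 1), ("t2", "group_b", 0),
    ("t3", "group_a", 1), ("t3", "group_b", 1),
    ("t4", "group_a", 0), ("t4", "group_b", 0),
]

rates = toxic_rate_by_group(annotations)
print(rates)
print(contested_items(annotations))
```

If the disagreements were random noise, the two groups' rates would be close. A large, consistent gap suggests the groups are applying different standards, which is worth investigating before any labels are merged.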

Practical Threshold

Most published NLP evaluation work reports Fleiss's κ and uses 0.6 as an informal acceptance threshold. Below 0.4 is generally considered poor agreement; 0.4–0.6 is moderate; 0.6–0.8 is good; above 0.8 is near-perfect. But these thresholds are conventions, not laws — some tasks (e.g., sarcasm detection) are inherently harder to agree on than others (e.g., grammaticality).
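For reference, here is a minimal sketch of Fleiss's κ together with the informal bands above. The input is a count table with one row per item and one column per category; the handling of exact boundary values (0.4, 0.6, 0.8) is an assumption, since the conventions do not specify it, and the example tables are invented.

```python
def fleiss_kappa(table):
    """table[i][j] = number of raters who put item i in category j.

    Every row must sum to the same number of ratings per item.
    """
    n_items = len(table)
    n_ratings = sum(table[0])
    if any(sum(row) != n_ratings for row in table):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(table[0])
    # Overall share of ratings that fall in each category.
    p_cat = [sum(row[j] for row in table) / (n_items * n_ratings) for j in range(n_categories)]
    # Per-item agreement: share of rater pairs that agree on that item.
    p_item = [(sum(c * c for c in row) - n_ratings) / (n_ratings * (n_ratings - 1)) for row in table]
    observed = sum(p_item) / n_items
    chance = sum(p * p for p in p_cat)
    if chance == 1.0:
        return float("nan")   # only one category was ever used
    return (observed - chance) / (1.0 - chance)

def agreement_band(kappa):
    """Map kappa to the informal bands; boundary handling is an assumption."""
    if kappa < 0.4:
        return "poor"
    if kappa < 0.6:
        return "moderate"
    if kappa <= 0.8:
        return "good"
    return "near-perfect"

# Three raters, two categories, four items (hypothetical counts).
unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
mixed = [[3, 0], [2, 1], [1, 2], [0, 3]]

print(fleiss_kappa(unanimous), agreement_band(fleiss_kappa(unanimous)))
print(round(fleiss_kappa(mixed), 3), agreement_band(fleiss_kappa(mixed)))
```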

Adjudication and Gold Labels

When raters disagree, three resolution strategies are standard:

Majority vote: the label chosen by the largest number of raters becomes the gold label. Simple and scalable, but discards minority perspectives and can systematically exclude the views of demographic minorities who are outnumbered in the annotation pool.

Expert adjudication: a domain expert reviews disagreed items and assigns the gold label. More expensive but preserves quality on difficult edge cases. The expert's decisions should be logged and auditable.

Preserving disagreement: rather than forcing a single gold label, the dataset retains the full distribution of rater judgments. This approach, championed by researchers including Dirk Hovy and Barbara Plank, is gaining traction for tasks where annotator subjectivity reflects real diversity in human opinion — not measurement error.
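The difference between collapsing to one label and preserving disagreement is visible in a few lines. In this sketch a tie is returned as `None` so it can be routed to expert adjudication; that routing rule, the function names, and the labels are assumptions for illustration.

```python
from collections import Counter

def majority_label(labels):
    """Most frequent label, or None on a tie (to be sent to an adjudicator)."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def soft_label(labels):
    """Full distribution of rater judgments as label -> share of raters."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}

# Five hypothetical raters judge the same text.
labels = ["toxic", "not_toxic", "not_toxic", "toxic", "not_toxic"]

print(majority_label(labels))
print(soft_label(labels))
```

The majority label records only "not_toxic", while the soft label keeps the fact that two of five raters disagreed. A model trained on the distribution can learn that this item is contested.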

Key Takeaway

IRR statistics are diagnostic tools, not pass/fail gates. High agreement on a poorly designed task is meaningless. Low agreement on an inherently contested judgment may be scientifically valid. Always ask whether the disagreement is noise or signal before deciding how to handle it.

Lesson 3 Quiz

Inter-Rater Reliability and Disagreement Analysis — 3 questions

1. Cohen's κ improves on simple percent agreement by doing what?

Correct. κ = (observed − chance) / (1 − chance). This correction prevents artificially inflated agreement when one label dominates: two raters who almost always pick the majority label can show very high percent agreement but a low κ.

Not quite. Cohen's κ corrects for chance agreement: it subtracts the agreement expected by chance from the observed agreement, then normalizes. This prevents the inflation that occurs when label distributions are skewed.

2. What did the 2021 Sap et al. research on toxicity annotation reveal about demographic disaggregation of rater judgments?

Correct. This was the key finding — a systematic bias that aggregate IRR statistics had concealed. Demographic disaggregation of disagreement revealed that the "noise" was structured, not random, and reflected different cultural frameworks for interpreting language.

Not quite. The finding was that White annotators labeled African American English text as toxic at significantly higher rates than Black annotators reading the same text, a systematic pattern hidden by aggregate agreement statistics.

3. Which approach to handling rater disagreement is increasingly advocated for tasks where annotator subjectivity reflects genuine diversity of human opinion?

Correct. Researchers like Dirk Hovy and Barbara Plank advocate retaining the full distribution of annotations rather than collapsing to a majority label — especially for tasks where disagreement reflects legitimate diversity in human perspective rather than measurement error.

Not quite. For tasks where disagreement reflects real differences in human perspective (not noise), the emerging best practice is to preserve the full distribution of rater judgments rather than forcing a single label that erases minority viewpoints.

Lab 3 — IRR Analyst

Interpret agreement statistics and diagnose disagreement patterns

Your Task

Practice interpreting inter-rater reliability statistics and thinking through disagreement analysis with the AI assistant. You can present a scenario with given κ values, ask about which IRR metric is appropriate for your task, or explore what to do with low-agreement items in a real dataset.

Complete at least 3 exchanges to finish this lab.

Suggested start: "My annotation task has three raters and ordinal ratings 1–5. We got Fleiss's κ = 0.41. Is that acceptable, and what should I do?" — or bring your own IRR question.

Module 6 · Lesson 4

RLHF, Preference Data, and the Future of Human Feedback

Human feedback stopped being a validation step. It became the training signal.

How did the InstructGPT paper change the role of human evaluation — and what does it mean when the thing you're evaluating is also the thing being trained on your evaluations?

When OpenAI published "Training language models to follow instructions with human feedback" in January 2022, the paper described a training pipeline that had been quietly reshaping how the field thought about human evaluation. Labelers, a team of approximately 40 contractors, were not just assessing outputs after the fact. They were generating the preference data used to train a reward model, which was then used to fine-tune GPT-3 via proximal policy optimization. Human judgment had become gradient signal. Labelers preferred outputs from the resulting InstructGPT models over those from GPT-3, even when the InstructGPT model had far fewer parameters (1.3 billion versus 175 billion), demonstrating that the alignment of a model's behavior with human preferences was now as important as raw capability.

What RLHF Changed About Human Evaluation

Before Reinforcement Learning from Human Feedback (RLHF), human evaluation was primarily a measurement activity: you trained a model, then humans assessed it. RLHF made human evaluation a production activity: human preference judgments directly shaped what the model learned to do. This changed everything about how preference data is collected, who collects it, and what biases in that collection process get amplified into model behavior.

The InstructGPT paper was explicit about this: the labeler pool was mostly English-speaking, relatively young, and drawn largely from the United States and Southeast Asia. Their preferences (which outputs they found helpful, which they found harmful) became encoded in the model's reward function. A model trained to maximize labeler approval learns, at least in part, to reflect labeler demographics.

The Preference Data Pipeline

A standard RLHF preference collection pipeline has three stages:

Demonstration Collection Labelers write ideal responses to sampled prompts. These demonstrations are used to fine-tune the base model via supervised learning, producing an initial "instruction-following" model before reward modeling begins.

Comparison Collection The instruction-following model generates multiple outputs for each prompt; labelers rank them from best to worst. These ranked comparisons become the training data for the reward model — a separate neural network that learns to predict labeler preference.

RL Fine-Tuning The instruction-following model is updated using PPO (Proximal Policy Optimization) to maximize the reward model's score on new outputs, with a KL-divergence penalty that keeps the updated policy from drifting too far from the supervised instruction-following model it started from.
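Two pieces of arithmetic sit underneath these stages: the pairwise loss that trains the reward model on ranked comparisons, and the KL-penalized reward that the RL step maximizes. The sketch below shows both on scalar values. It is a simplification under stated assumptions: real pipelines compute these over batches of token sequences with a neural network, and the penalty weight `beta=0.1` is an arbitrary illustrative value.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps input order, so the first element of each pair ranks higher.
    return list(combinations(ranked_outputs, 2))

def pairwise_loss(reward_preferred, reward_rejected):
    """Negative log-sigmoid of the reward gap: -log(sigmoid(r_w - r_l))."""
    gap = reward_preferred - reward_rejected
    return math.log1p(math.exp(-gap))

def kl_penalized_reward(reward, logprob_policy, logprob_reference, beta=0.1):
    """Reward minus a penalty for assigning higher log-probability than the reference model."""
    return reward - beta * (logprob_policy - logprob_reference)

# A labeler ranks four hypothetical outputs from best to worst.
pairs = ranking_to_pairs(["out_1", "out_2", "out_3", "out_4"])
print(len(pairs), pairs[0])

# The loss is small when the reward model agrees with the labeler, large when it does not.
print(round(pairwise_loss(2.0, 0.0), 4), round(pairwise_loss(0.0, 2.0), 4))

# The policy likes this output more than the reference does, so the reward is reduced.
print(kl_penalized_reward(reward=1.0, logprob_policy=-1.0, logprob_reference=-2.0))
```

One ranking of four outputs yields six training pairs, which is why ranking is a more efficient use of labeler time than collecting one pairwise vote at a time.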

Constitutional AI and Scalable Oversight

A critical limitation of RLHF as described above is that it requires human labelers to evaluate every comparison — an increasingly impractical requirement as models become more capable. If a model can write code that humans cannot easily evaluate, human preference labels for that code are unreliable. Anthropic's Constitutional AI (CAI), described in a 2022 paper, addressed this partly by using the AI model itself to critique and revise its own outputs according to a set of stated principles (a "constitution"), reducing the reliance on human judgment for the revision step while preserving human-authored principles as the normative foundation.

OpenAI's "scalable oversight" research program takes a related approach: using AI assistance to help human evaluators assess outputs in domains where their unaided judgment would be unreliable, and studying whether this human-AI collaborative evaluation maintains the integrity of the feedback signal.

The Reward Hacking Problem

Reward hacking occurs when a model learns to maximize the reward model's score through means that don't reflect genuine quality — essentially overfitting to the reward model's imperfections. It is the RLHF equivalent of Goodhart's Law applied to automated metrics. Monitoring for reward hacking requires ongoing human spot-checks of highly-rewarded outputs, not just automated reward scores.

What Practitioners Should Know

For teams deploying or evaluating RLHF-trained models, several practical implications follow from this architecture:

Labeler demographics matter. Who is in your annotation pool shapes what the reward model learns. This is not speculation — it is documented in the InstructGPT paper and subsequent research. Diverse, representative labeler pools produce more broadly aligned models.

Reward model validity decays. A reward model trained on data from six months ago may not capture current labeler preferences, especially as the model's capabilities and the distribution of user queries evolve. Reward models need periodic re-evaluation and retraining.

Human evaluation remains the audit layer. Even with a reward model in the loop, periodic human evaluation of model outputs — especially high-reward outputs and edge cases — is required to detect reward hacking and distributional drift.
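One way to put the audit layer into practice is to send human reviewers a mix of the highest-reward outputs (where reward hacking would show up) and a random sample of the rest (to catch drift elsewhere). The sketch below is a hypothetical selection routine; the batch sizes, the fixed seed, and the output ids are assumptions.

```python
import random

def select_for_audit(scored_outputs, top_k=2, random_k=2, seed=0):
    """Pick output ids for human review.

    scored_outputs: list of (output_id, reward_model_score) pairs.
    Returns the top_k highest-scoring ids followed by random_k ids sampled from the rest.
    """
    ranked = sorted(scored_outputs, key=lambda pair: pair[1], reverse=True)
    top = [output_id for output_id, _ in ranked[:top_k]]
    rest = [output_id for output_id, _ in ranked[top_k:]]
    rng = random.Random(seed)   # fixed seed so the audit batch is reproducible
    sampled = rng.sample(rest, min(random_k, len(rest)))
    return top + sampled

# Hypothetical reward-model scores for six outputs.
scored_outputs = [
    ("out1", 0.20), ("out2", 0.90), ("out3", 0.50),
    ("out4", 0.95), ("out5", 0.10), ("out6", 0.70),
]

audit_batch = select_for_audit(scored_outputs)
print(audit_batch)
```

If human ratings of the top-scoring outputs fall while their reward scores stay high, that divergence is the practical signature of reward hacking.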

Key Takeaway

RLHF transformed human evaluation from a quality-assurance activity into a training data production activity. This raises the stakes for every design decision in the annotation pipeline — because those decisions no longer just affect how you measure model quality. They affect what the model becomes.

Lesson 4 Quiz

RLHF, Preference Data, and the Future of Human Feedback — 3 questions

1. In the InstructGPT RLHF pipeline, what is the role of the "comparison collection" stage?

Correct. In comparison collection, human labelers rank multiple model-generated outputs from best to worst for each prompt. These ranked pairs become the training data for the reward model — the neural network that will score outputs during RL fine-tuning.

Not quite. Comparison collection is the stage where human labelers rank multiple model outputs for the same prompt. These ranked comparisons are what the reward model learns from — not automated comparisons or constitutional principles.

2. What is "reward hacking" in the context of RLHF, and which well-known principle does it instantiate?

Correct. Reward hacking is the RLHF instantiation of Goodhart's Law: the reward model is a proxy for human preference, and the policy model learns to exploit the reward model's imperfections rather than genuinely improving. Regular human spot-checks of high-reward outputs are the primary detection mechanism.

Not quite. Reward hacking is when the model learns to score highly on the reward model without genuinely improving quality — it exploits the reward model's flaws. This is Goodhart's Law applied to RLHF: when the reward model becomes the target, it ceases to be a good measure.

3. Anthropic's Constitutional AI (CAI) approach addressed a core RLHF scalability problem by doing what?

Correct. CAI uses the model to critique and revise its own outputs according to a stated "constitution" of principles written by humans. This reduces the need for labelers to evaluate every comparison during the revision phase, addressing the scalability bottleneck while keeping human values as the normative source.

Not quite. Constitutional AI has the model critique and revise its own outputs using a set of human-authored principles — reducing labeler workload for the revision step while preserving human values as the foundation. It's a form of scalable oversight, not a replacement of human judgment with automation.

Lab 4 — RLHF Design Consultant

Work through real RLHF pipeline design challenges

Your Task

Consult with the AI assistant on designing or auditing an RLHF preference data collection pipeline. Bring a scenario — a domain, a model capability, a labeler pool challenge — and the assistant will help you reason through reward model validity, labeler demographics, reward hacking risks, and scalable oversight tradeoffs.

Complete at least 3 exchanges to finish this lab.

Suggested start: "I'm building an RLHF pipeline for a medical information chatbot. What are the most critical decisions I need to make about my labeler pool?" — or bring your own RLHF design question.

Module 6 — Module Test

Human Evaluation Methods · 15 questions · 80% to pass

1. Which of the following best describes Goodhart's Law as applied to AI evaluation metrics?

Correct. Goodhart's Law: optimizing a proxy measure decouples it from the underlying quality it was meant to track — seen in BLEU optimization and reward hacking alike.

Not quite. Goodhart's Law states that when a measure becomes a target, it ceases to be a good measure — models learn to game the metric rather than improving the underlying quality.

2. What evaluation property measures whether a model's output accurately reflects information in a provided source document?

Correct. Faithfulness (also called groundedness or factual consistency) measures whether model outputs stay true to source material — hallucination detection is a central faithfulness evaluation task.

Not quite. Faithfulness is the property covering accuracy relative to source material. Fluency is grammar and style; coherence is logical flow; safety is harm avoidance.

3. When GPT-3 launched in 2020, which method first identified its hallucination and factual error problems?

Correct. SuperGLUE showed near-human parity; human red-teamers and users discovered fabricated citations, inverted historical claims, and other hallucination patterns that automated benchmarks had completely missed.

Not quite. Automated benchmarks like SuperGLUE showed high scores. Human red-teamers and real users were the ones who surfaced hallucination and factual accuracy failures.

4. LMSYS Chatbot Arena collects human preference data using which evaluation design?

Correct. The Arena shows users two anonymous model responses; they vote for the better one; pairwise votes are converted to Elo ratings — producing an ecologically valid ranking grounded in genuine user preferences.

Not quite. The Arena uses pairwise comparison from real users and Elo aggregation — not Likert scales, not BWS, not automated scoring.
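The conversion from pairwise votes to ratings can be sketched as follows. This is a minimal illustration of the standard Elo update, not the Arena's actual code: the function name `elo_ratings`, the K-factor of 32, the starting rating of 1000, and the model names are all assumptions made for the example.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Turn a sequence of (winner, loser) pairwise votes into Elo ratings."""
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Expected score of the winner given the current rating gap.
        gap = ratings[loser] - ratings[winner]
        expected_win = 1.0 / (1.0 + 10 ** (gap / 400.0))
        # An upset (low expected_win) moves ratings more than an expected result.
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]
ratings = elo_ratings(votes)
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

Because sequential Elo updates depend on the order in which votes arrive, a leaderboard built this way is usually recomputed or averaged over many orderings before it is reported.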

5. What is the purpose of "anchor examples" in a human annotation protocol?

Correct. Anchor examples give raters a concrete reference for each point on the scale, aligning their internal standards so that a "4" from one rater means the same thing as a "4" from another.

Not quite. Anchor examples are pre-labeled outputs at specific scale points that calibrate rater judgment — without them, different raters may operate with completely different mental models of what each rating level means.

6. Cohen's κ corrects for what limitation of simple percent agreement?

Correct. κ = (observed − chance) / (1 − chance). This prevents two raters who always choose the dominant label from appearing to have high agreement when they are actually providing no discriminative signal.

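Not quite. Cohen's κ corrects for chance agreement. If labels are heavily skewed, two raters who both almost always pick the majority label can show very high percent agreement yet a κ near zero. (If both pick it every single time, chance agreement equals 1 and κ is undefined.)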
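The formula κ = (observed − chance) / (1 − chance) is easy to check by hand. The sketch below computes percent agreement and Cohen's κ for two raters; the function names and the skewed toy labels are made up for illustration.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items on which the two raters gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two raters who labeled the same items."""
    n = len(a)
    p_o = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label
    # if each labeled at random according to their own label frequencies.
    p_e = sum((freq_a[lab] / n) * (freq_b[lab] / n) for lab in set(a) | set(b))
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical skewed task: both raters say "ok" about 95% of the time,
# but they never agree on which items are "bad".
rater_1 = ["ok"] * 95 + ["bad"] * 5
rater_2 = ["ok"] * 90 + ["bad"] * 5 + ["ok"] * 5

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
```

On this toy data percent agreement is 0.90 while κ comes out slightly below zero, because chance agreement alone (0.905) already exceeds what the raters achieved.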

7. Which IRR statistic is most appropriate when you have more than two raters each annotating a subset of items?

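Correct. Fleiss's κ extends the chance-corrected idea behind Cohen's κ to more than two raters. It does not require the same individuals to rate every item, only that each item receives the same number of ratings, which makes it a standard IRR statistic in large-scale NLP annotation projects where different subsets of items are rated by different annotators.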

Not quite. Cohen's κ is for exactly two raters. Fleiss's κ handles multiple raters annotating different subsets, which is the typical large-scale annotation scenario.
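For readers who want to see the computation, here is a minimal sketch of Fleiss's κ. It assumes the ratings have already been tallied into a count table with one row per item and one column per category, and that every item received the same number of ratings; the function name and the toy tables are illustrative only.

```python
def fleiss_kappa(counts):
    """Fleiss's kappa from an items x categories table of rating counts."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Per-item agreement: share of rater pairs that agree on that item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy tables: 4 items, 3 ratings each, 2 categories.
unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]

kappa_unanimous = fleiss_kappa(unanimous)
kappa_split = fleiss_kappa(split)
```

When items receive differing numbers of ratings, this table form no longer applies and a statistic that tolerates missing ratings, such as Krippendorff's α, is the usual alternative.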

8. The 2021 Sap et al. toxicity study found that aggregate IRR statistics had concealed what kind of pattern?

Correct. Demographic disaggregation of disagreements revealed a systematic pattern: White annotators rated African American English text as toxic at significantly higher rates than Black annotators, a bias invisible in aggregate agreement statistics.

Not quite. The concealed pattern was systematic demographic bias — specifically that White annotators rated AAE text as more toxic than Black annotators did — invisible until rater demographics were disaggregated.
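Disaggregation itself is a simple grouping step. The sketch below uses entirely made-up annotations and placeholder group names (it is not the study's data or code) to show how a pooled rate can hide a gap that appears once ratings are split by rater group.

```python
from collections import defaultdict


def toxic_rate_by_group(annotations):
    """annotations: iterable of (rater_group, text_variety, is_toxic)."""
    totals = defaultdict(int)
    toxic = defaultdict(int)
    for group, variety, is_toxic in annotations:
        totals[(group, variety)] += 1
        toxic[(group, variety)] += int(is_toxic)
    return {key: toxic[key] / totals[key] for key in totals}


# Hypothetical annotations on the same text variety by two rater groups.
annotations = [
    ("group_1", "variety_a", True),
    ("group_1", "variety_a", True),
    ("group_1", "variety_a", True),
    ("group_1", "variety_a", False),
    ("group_2", "variety_a", True),
    ("group_2", "variety_a", False),
    ("group_2", "variety_a", False),
    ("group_2", "variety_a", False),
]

pooled_rate = sum(is_toxic for _, _, is_toxic in annotations) / len(annotations)
rates = toxic_rate_by_group(annotations)
```

Here the pooled toxic rate is 0.5, which looks unremarkable, while the per-group rates are 0.75 and 0.25.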

9. In the InstructGPT paper, approximately how large was the labeler team that produced the comparison and demonstration data?

Correct. The InstructGPT paper described a team of approximately 40 contractors — a notably small pool whose preferences became encoded in the reward model and thus into the model's behavior at scale.

Not quite. The InstructGPT labeler team was approximately 40 contractors — a small pool, which is one reason the paper's discussion of labeler demographics and their potential influence on model behavior is significant.

10. What is "reward hacking" in RLHF, and which principle does it instantiate?

Correct. Reward hacking is Goodhart's Law in the RLHF setting: the policy learns to maximize the reward proxy through means that don't reflect genuine quality improvement. Regular human spot-checks of high-reward outputs are the primary detection mechanism.

Not quite. Reward hacking is when the model learns to score highly on the reward model without genuinely improving — an instance of Goodhart's Law applied to the reward proxy.

11. Anthropic's Constitutional AI (CAI) approach reduces the need for human labelers at which stage of the RLHF pipeline?

Correct. CAI has the model critique and revise its own outputs according to a human-authored "constitution" of principles, reducing labeler workload for the revision step while keeping human values as the normative foundation.

Not quite. CAI reduces labeler load at the critique and revision step — the model critiques and revises its own outputs using stated principles. Human judgment remains the source of those principles.

12. Best-Worst Scaling (BWS) is considered statistically efficient because it does what compared to Likert ratings?

Correct. By selecting both the best and the worst item from a set of four, each BWS judgment fixes five of the six pairwise orderings among those items, which is far more information per annotation than a single Likert rating on a single item.

Not quite. BWS is efficient because each judgment identifies both the best and the worst item in a set, so a single annotation task yields several ordering constraints at once: more information per unit of rater effort than a Likert rating.
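One way to see where the efficiency comes from is to enumerate the pairwise orderings a single best/worst judgment implies, and then to aggregate many judgments with simple best-minus-worst counting. The sketch below does both; the function names and the items A to E are hypothetical.

```python
from collections import Counter
from itertools import combinations


def implied_pairs(items, best, worst):
    """Pairwise orderings (higher, lower) fixed by one best/worst judgment."""
    pairs = {(best, other) for other in items if other != best}
    pairs |= {(other, worst) for other in items if other != worst}
    return pairs


def bws_scores(judgments):
    """Best-minus-worst counting: (#best - #worst) / #appearances per item."""
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in judgments:
        seen.update(items)
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


items = ("A", "B", "C", "D")
known = implied_pairs(items, best="A", worst="D")
possible = len(list(combinations(items, 2)))

# Hypothetical judgments: (items shown, chosen best, chosen worst).
judgments = [
    (("A", "B", "C", "D"), "A", "D"),
    (("A", "B", "C", "E"), "A", "C"),
    (("B", "C", "D", "E"), "B", "D"),
]
scores = bws_scores(judgments)
```

For the four-item set, `known` contains five of the six possible pairs; only the order of the two middle items is left open.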

13. Which approach to resolving rater disagreement is most appropriate when annotator subjectivity reflects genuine diversity in human opinion rather than measurement error?

Correct. When disagreement reflects genuine diversity of perspective (as in subjective tasks like sentiment or offensiveness), forcing a majority-vote gold label erases minority viewpoints. Preserving the annotation distribution is the recommended approach.

Not quite. For tasks where disagreement reflects legitimate diversity of perspective, the emerging best practice is to preserve the full distribution of rater judgments — majority vote erases minority perspectives that may be equally valid.
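The difference between the two resolution strategies is small in code but large in what the dataset retains. A minimal sketch, with hypothetical function names and made-up ratings for a single item:

```python
from collections import Counter


def majority_label(ratings):
    """Collapse all ratings for one item into a single gold label."""
    return Counter(ratings).most_common(1)[0][0]


def label_distribution(ratings):
    """Keep the full distribution of rater judgments as a soft label."""
    counts = Counter(ratings)
    total = len(ratings)
    return {label: count / total for label, count in counts.items()}


# Hypothetical item rated by five annotators.
ratings = ["offensive", "offensive", "offensive", "not_offensive", "not_offensive"]

gold = majority_label(ratings)
soft = label_distribution(ratings)
```

The majority label records only "offensive", while the distribution keeps the fact that two of five raters disagreed, which a model or a downstream analysis can then use.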

14. A research team reports Fleiss's κ = 0.35 on their annotation task. According to conventional interpretation, what does this indicate?

Correct. Under the commonly cited Fleiss guideline, κ below 0.40 indicates poor agreement (the Landis and Koch scale is more lenient and labels 0.21 to 0.40 "fair"). Either way, a κ of 0.35 suggests the annotation protocol, most likely the operationalization, instructions, or anchor examples, needs substantial revision before the data can be trusted.

Not quite. Fleiss's κ = 0.35 falls below 0.40, a widely used threshold for poor agreement. This indicates the annotation protocol needs revision: the task is likely under-operationalized or lacks adequate anchor examples.

15. Which statement best describes how RLHF fundamentally changed the role of human evaluation in AI development?

Correct. RLHF made human preference judgments into gradient signal — not just a quality check after training, but the data that shapes what the model learns to do. Every annotation protocol decision now has direct consequences for model behavior at scale.

Not quite. RLHF's fundamental change was making human evaluation a training data production activity. Preference judgments became gradient signal — the protocol choices that govern how those judgments are collected now directly determine what the model learns.