When OpenAI released GPT-3 in June 2020, automated benchmarks such as SuperGLUE placed it near human parity on some reading comprehension tasks. Yet the moment real users interacted with it, they surfaced failures that no automated test had caught: the model confidently fabricated legal citations, wrote plausible-sounding but factually inverted historical claims, and produced text that was grammatically perfect yet logically circular. Perplexity scores had said nothing about truthfulness. It was human evaluators, including OpenAI's own red-teamers, who eventually catalogued these failure modes and forced a rethinking of what evaluation should measure.
Automated metrics dominate AI evaluation because they are cheap, reproducible, and scalable. BLEU scores a translation in milliseconds; perplexity runs on a single GPU pass; accuracy on a benchmark is a single number that fits in a leaderboard cell. But each of these metrics measures a proxy — a signal that correlates with quality under controlled conditions — not quality itself.
The problem is that proxies break when models game them. In 2022, researchers at Google published a paper showing that models fine-tuned directly on BLEU scores produced translations that maximized n-gram overlap while simultaneously becoming harder for human readers to parse. The metric had been optimized; comprehension had degraded. This is an instance of Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.
Human evaluation does not suffer from this specific failure because human judges assess the actual experience of reading and using a model's output — not a mathematical abstraction of it. A human can notice that a response is technically accurate but condescending in tone, or that it answers the literal question while ignoring the obvious intent behind it.
Automated metrics are scalable but gameable. Human evaluation is valid but expensive. Most serious evaluation pipelines use both, treating automated scores as a filter and human judgment as the ground truth.
Human evaluation protocols are designed to elicit judgments about properties that resist formalization. The most common target properties include:
One of the most important functions of human evaluation is establishing what "human-level performance" actually means on a given task. When researchers claimed GPT-4 achieved "human parity" on certain benchmarks in 2023, the claim was contested precisely because the human baseline had been collected under conditions — time pressure, single-pass annotation — that differed radically from how the model was being used. Human evaluation is required not just to assess model outputs, but to properly characterize the human performance bar those outputs are being compared against.
This is why organizations like Anthropic, OpenAI, and Google DeepMind maintain large-scale human evaluation programs alongside their automated testing infrastructure. The automated metrics tell you whether something changed; the humans tell you whether the change was good.
Human evaluation is not a fallback for when automated metrics fail. It is the epistemic foundation against which automated metrics are validated. Metrics are useful only to the extent that they predict what humans would judge — and that relationship must be periodically re-established through direct human assessment.
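One way to make "metrics are useful only to the extent that they predict what humans would judge" concrete is to compute a rank correlation between metric scores and human ratings on the same outputs. The sketch below is a minimal, stdlib-only Spearman correlation. The scores and ratings are made-up illustrative values, and the function names are assumptions introduced here, not part of any particular evaluation toolkit.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: one automated metric score and one mean human rating per output.
metric_scores = [0.31, 0.45, 0.52, 0.60, 0.74]
human_ratings = [2.0, 3.5, 3.0, 4.0, 4.5]

rho = spearman(metric_scores, human_ratings)
print(f"Spearman correlation between metric and human ratings: {rho:.2f}")
```

A correlation measured once does not stay valid indefinitely. Re-running this kind of check on fresh human ratings is what "periodically re-established" amounts to in practice.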
You're working with an AI evaluation assistant who can help you reason through cases where automated metrics and human raters disagree. Bring a scenario, a question about a specific evaluation challenge, or ask the assistant to walk you through a concrete example from NLP evaluation history.
Complete at least 3 exchanges to finish this lab.
In 2008, Rion Snow and colleagues at Stanford published a foundational study showing that aggregated Amazon Mechanical Turk annotations could match expert quality under the right conditions. But subsequent research revealed the caveat in brutal detail. A 2017 analysis of crowd-sourced sentiment datasets found that when instructions were ambiguous, inter-annotator agreement dropped below 60%, making the resulting labels statistically unreliable as a training signal. The labels had been collected; the protocol had not been designed. The data looked clean; it was not.
An annotation protocol is a measurement instrument in the scientific sense. Like a thermometer or a survey scale, it has reliability (does it produce consistent results across raters and time?) and validity (does it measure what it claims to measure?). Most annotation failures are failures of instrument design, not rater incompetence.
The single most consequential design choice is the operationalization of the target property. "Rate the quality of this response" is not an operationalization; it is an open invitation for raters to apply their own idiosyncratic theories of quality. A proper operationalization specifies: quality along what dimension, using what scale, with what anchors, assessed relative to what baseline.
Most human evaluation uses one of three scale types:
The LMSYS Chatbot Arena, launched in 2023 by researchers at UC Berkeley, is the most prominent live deployment of pairwise human evaluation at scale. Users submit a prompt, receive responses from two anonymous models, and vote for the better response. By mid-2024 the platform had collected over one million human preference votes. The resulting Elo leaderboard has become one of the most influential rankings in the field precisely because it is grounded in genuine user preferences rather than benchmark scores.
The Arena's design makes several deliberate protocol choices: raters are the actual users (high ecological validity), tasks are self-selected (realistic distribution of queries), and the pairwise format minimizes scale-anchoring artifacts. Its limitation is that raters are self-selected and uncontrolled — a protocol trade-off that the researchers openly acknowledge.
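To show how pairwise votes can turn into a leaderboard, here is a minimal sketch of the classic online Elo update. It is an illustration of the general technique, not a reproduction of the Arena's own pipeline. The starting rating of 1000, the K-factor of 32, the model names, and the vote list are all assumptions chosen for the example.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model (logistic, base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(ratings, winner, loser, k=32.0):
    """Apply one pairwise vote in place: the winner gains what the loser loses."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_win)  # bigger update when the win was unexpected
    ratings[winner] += delta
    ratings[loser] -= delta

# Hypothetical votes: each tuple is (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]

ratings = {name: 1000.0 for name in ("model_a", "model_b", "model_c")}
for winner, loser in votes:
    update_elo(ratings, winner, loser)

leaderboard = sorted(ratings, key=ratings.get, reverse=True)
for name in leaderboard:
    print(f"{name}: {ratings[name]:.1f}")
```

Because each vote only asks "which is better?", raters never need to agree on what a "4 out of 5" means, which is the scale-anchoring advantage described above. One known property of online Elo is that results depend on vote order, so large-scale leaderboards often fit all votes jointly instead.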
Before collecting a single annotation: specify the target dimension precisely, define the scale anchors with concrete examples, write a rater guide with worked examples, establish a qualification test, and determine your minimum acceptable inter-annotator agreement before analysis begins.
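The checklist above can be enforced mechanically before any annotation budget is spent. The following sketch is one hypothetical way to do that: the class name, field names, and the rule that at least two anchors are required are assumptions made for illustration, not an established standard.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AnnotationProtocol:
    """Hypothetical container for the pre-collection checklist items."""
    target_dimension: str = ""
    scale_anchors: dict = field(default_factory=dict)  # scale point -> example output
    rater_guide_path: str = ""
    qualification_test_ready: bool = False
    min_agreement: Optional[float] = None  # threshold fixed before analysis begins

    def missing_items(self):
        """Return the checklist items that are still unfilled."""
        checks = {
            "target dimension": bool(self.target_dimension.strip()),
            "scale anchors": len(self.scale_anchors) >= 2,
            "rater guide": bool(self.rater_guide_path),
            "qualification test": self.qualification_test_ready,
            "minimum agreement threshold": self.min_agreement is not None,
        }
        return [name for name, ok in checks.items() if not ok]

# A draft protocol that names the dimension and threshold but nothing else yet.
draft = AnnotationProtocol(
    target_dimension="factual accuracy relative to the source document",
    min_agreement=0.6,
)
missing = draft.missing_items()
print("Not ready to collect. Missing:", ", ".join(missing))
```

The point of the sketch is the ordering: the agreement threshold is a field of the protocol, set before data exists, so it cannot be quietly adjusted after the numbers come in.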
Even well-designed scales fail if rater instructions are ambiguous. The industry best practice is to provide anchor examples: actual outputs that have been pre-labeled as "1", "3", and "5" (or equivalent) by domain experts, so raters can calibrate their internal scale against a shared reference. Without anchors, two raters using the same 5-point scale may be operating with completely different mental models of what a "4" looks like.
Google's internal evaluation guidelines for large language model outputs — portions of which have been described in published research — include multiple worked examples per rating level per dimension, precisely to address this calibration problem. The investment in instruction design is substantial, but the alternative is data that cannot be aggregated or compared.
Annotation protocol design is not an administrative task that precedes the real work of evaluation. It is the real work. Every hour spent sharpening operationalizations and calibrating anchor examples saves dozens of hours of unusable or misleading data downstream.
Work with the AI assistant to design a human evaluation protocol for a task you specify. The assistant will push you to operationalize your target dimension precisely, choose an appropriate scale, and draft anchor examples. You'll receive critique and suggestions grounded in real annotation research.
Complete at least 3 exchanges to finish this lab.
In 2021, Sap and colleagues published research showing that annotators from different demographic backgrounds systematically disagreed on which text samples were toxic — particularly on African American English (AAE) text. White annotators labeled AAE text as toxic at significantly higher rates than Black annotators reading the same text. The inter-rater agreement statistics looked acceptable in aggregate; the systematic bias was invisible without demographic disaggregation of rater judgments. The IRR number had hidden the problem. The research forced the field to rethink what disagreement means — and whose disagreement counts.
Inter-rater reliability (IRR) quantifies the degree to which independent raters produce the same judgments. Several statistics are in common use:
Low IRR is often treated as a problem to be fixed — by rewriting instructions, retraining raters, or discarding borderline items. But this framing can be wrong. Some disagreements are genuine: the underlying judgment is inherently ambiguous, or different raters are applying legitimately different but equally valid perspectives. Forcing artificial consensus in these cases produces a dataset that misrepresents the actual distribution of human judgments.
A more productive approach is disagreement analysis: systematically examining which items produce disagreement, which rater characteristics predict the direction of disagreement, and whether the disagreement pattern reveals something important about the task or about the population of potential users. This is exactly what Sap et al. did in their 2021 toxicity work — the demographic analysis of disagreement was the scientific contribution, not a quality-control failure.
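A first pass at disagreement analysis can be as simple as disaggregating label rates by rater group and ranking items by how far the groups diverge. The sketch below assumes binary labels (1 = "toxic") and uses invented annotations with placeholder group names; it illustrates the idea and does not reproduce any published analysis.

```python
from collections import defaultdict

# Hypothetical annotations: (item_id, rater_group, label), where label 1 means "toxic".
annotations = [
    ("t1", "group_x", 1), ("t1", "group_x", 1), ("t1", "group_y", 0), ("t1", "group_y", 0),
    ("t2", "group_x", 1), ("t2", "group_x", 0), ("t2", "group_y", 0), ("t2", "group_y", 0),
    ("t3", "group_x", 0), ("t3", "group_x", 0), ("t3", "group_y", 0), ("t3", "group_y", 0),
    ("t4", "group_x", 1), ("t4", "group_x", 1), ("t4", "group_y", 1), ("t4", "group_y", 1),
]

def positive_rate_by_group(rows):
    """Fraction of 'toxic' labels assigned by each rater group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for _, group, label in rows:
        totals[group] += 1
        positives[group] += label
    return {group: positives[group] / totals[group] for group in totals}

def item_gaps(rows):
    """Per item, the spread between the highest and lowest group rate, largest first."""
    per_item = defaultdict(list)
    for row in rows:
        per_item[row[0]].append(row)
    gaps = {}
    for item, item_rows in per_item.items():
        rates = positive_rate_by_group(item_rows)
        gaps[item] = max(rates.values()) - min(rates.values())
    return sorted(gaps.items(), key=lambda pair: pair[1], reverse=True)

group_rates = positive_rate_by_group(annotations)
ranked_items = item_gaps(annotations)
print("Toxic rate by group:", group_rates)
print("Items with the largest between-group gap:", ranked_items[:2])
```

In this toy data the two groups agree fully on half the items, so an aggregate agreement figure could look tolerable while the group-level rates differ sharply. The items at the top of the gap ranking are the ones worth reading by hand.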
Much published NLP evaluation work reports Fleiss' κ and uses 0.6 as an informal acceptance threshold. A common rule of thumb, loosely following the bands proposed by Landis and Koch, treats values below 0.4 as poor agreement, 0.4–0.6 as moderate, 0.6–0.8 as good, and above 0.8 as near-perfect. These thresholds are conventions, not laws: some tasks (e.g., sarcasm detection) are inherently harder to agree on than others (e.g., grammaticality).
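For readers who want to see where a κ value comes from, here is a minimal stdlib implementation of Fleiss' kappa. It assumes every item was labeled by the same number of raters, and the small count matrix is invented for illustration.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a matrix where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same value."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(counts[0])

    # Observed agreement: for each item, the share of rater pairs that agree.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: based on the overall proportion of each category.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    # Undefined (division by zero) if all raters always pick one category.
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical data: 4 items, 3 raters each, 2 categories.
counts = [
    [3, 0],  # unanimous
    [0, 3],  # unanimous
    [2, 1],  # split
    [1, 2],  # split
]
kappa = fleiss_kappa(counts)
print(f"Fleiss' kappa: {kappa:.3f}")
```

In this example the raw pairwise agreement is about 0.67, yet κ is only about 0.33, because with two equally frequent categories half of all agreement is expected by chance. That gap is why percent agreement alone overstates reliability.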
When raters disagree, three resolution strategies are standard:
Majority vote: the label chosen by the plurality of raters becomes the gold label. Simple and scalable, but discards minority perspectives and can systematically exclude the views of demographic minorities who are outnumbered in the annotation pool.
Expert adjudication: a domain expert reviews disagreed items and assigns the gold label. More expensive but preserves quality on difficult edge cases. The expert's decisions should be logged and auditable.
Preserving disagreement: rather than forcing a single gold label, the dataset retains the full distribution of rater judgments. This approach, championed by researchers including Dirk Hovy and Barbara Plank, is gaining traction for tasks where annotator subjectivity reflects real diversity in human opinion — not measurement error.
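The difference between the first and third strategies is easy to see in code. This sketch contrasts a majority-vote gold label with a retained label distribution. Returning `None` on a tie, as a flag for expert adjudication, is a design choice made here for illustration, and the label names are invented.

```python
from collections import Counter

def majority_label(labels):
    """Most frequent label, or None when the top two labels tie
    (a natural hand-off point to expert adjudication)."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def label_distribution(labels):
    """Keep the full distribution of rater judgments instead of one gold label."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

# Hypothetical judgments from five raters on a single item.
judgments = ["toxic", "not_toxic", "not_toxic", "toxic", "not_toxic"]

gold = majority_label(judgments)
soft = label_distribution(judgments)
print("Majority vote:", gold)
print("Preserved distribution:", soft)
```

Majority vote reports only "not_toxic" here. The distribution records that two of five raters disagreed, which is exactly the information that is lost if those two raters share a perspective underrepresented in the pool.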
IRR statistics are diagnostic tools, not pass/fail gates. High agreement on a poorly designed task is meaningless. Low agreement on an inherently contested judgment may be scientifically valid. Always ask whether the disagreement is noise or signal before deciding how to handle it.
Practice interpreting inter-rater reliability statistics and thinking through disagreement analysis with the AI assistant. You can present a scenario with given κ values, ask about which IRR metric is appropriate for your task, or explore what to do with low-agreement items in a real dataset.
Complete at least 3 exchanges to finish this lab.
When OpenAI published "Training language models to follow instructions with human feedback" in January 2022, the paper described a training pipeline that had been quietly reshaping how the field thought about human evaluation. Labelers, a team of approximately 40 contractors, were not just assessing outputs after the fact. They were generating the preference data used to train a reward model, which was then used to fine-tune GPT-3 via proximal policy optimization. Human judgment had become gradient signal. Labelers preferred outputs from the resulting model, InstructGPT, over those of GPT-3 even when the InstructGPT model had far fewer parameters (1.3B versus 175B), suggesting that aligning a model's behavior with human preferences could matter as much as raw capability.
Before Reinforcement Learning from Human Feedback (RLHF), human evaluation was primarily a measurement activity: you trained a model, then humans assessed it. RLHF made human evaluation a production activity: human preference judgments directly shaped what the model learned to do. This changed everything about how preference data is collected, who collects it, and what biases in that collection process get amplified into model behavior.
The InstructGPT paper was explicit about this: the labeler pool was mostly English-speaking, relatively young, and drawn largely from the United States and Southeast Asia. Their preferences (which outputs they found helpful, which they found harmful) became encoded in the model's reward function. A model trained to maximize labeler approval learns, at least in part, to reflect labeler demographics.
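To see how a labeler's choice becomes "gradient signal", it helps to look at the pairwise loss commonly used to train reward models from preference comparisons: the negative log of the sigmoid of the reward difference between the preferred and the rejected output. The sketch below computes that loss for scalar rewards. The reward values are invented, and a real reward model would produce them from a neural network and backpropagate through this loss.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log(sigmoid(reward_chosen - reward_rejected)).
    Computed as log(1 + exp(-margin)) in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, factor out exp(-margin) to avoid overflow.
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward-model scores for the output a labeler preferred
# and the output the labeler rejected.
loss_agrees = preference_loss(2.0, 0.0)     # reward model ranks the pair like the labeler
loss_undecided = preference_loss(0.0, 0.0)  # reward model has no preference
loss_disagrees = preference_loss(0.0, 2.0)  # reward model ranks the pair the other way

print(f"agrees: {loss_agrees:.3f}  undecided: {loss_undecided:.3f}  disagrees: {loss_disagrees:.3f}")
```

The loss is smallest when the reward model already ranks the pair the way the labeler did and largest when it ranks them the other way, so training pushes the reward model toward whatever this particular labeler pool preferred. That is the mechanism by which labeler demographics reach the model.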
A standard RLHF preference collection pipeline has three stages:
A critical limitation of RLHF as described above is that it relies on human labelers to judge every comparison, a requirement that becomes increasingly impractical as models become more capable. If a model can write code that humans cannot easily evaluate, human preference labels for that code are unreliable. Anthropic's Constitutional AI (CAI), described in a 2022 paper, partly addressed this by having the AI model itself critique and revise its own outputs according to a set of stated principles (a "constitution"), reducing the reliance on human judgment for the revision step while preserving human-authored principles as the normative foundation.
OpenAI's "scalable oversight" research program takes a related approach: using AI assistance to help human evaluators assess outputs in domains where their unaided judgment would be unreliable, and studying whether this human-AI collaborative evaluation maintains the integrity of the feedback signal.
Reward hacking occurs when a model learns to maximize the reward model's score through means that don't reflect genuine quality, essentially overfitting to the reward model's imperfections. It is the RLHF counterpart of the Goodhart's Law failure seen with automated metrics. Monitoring for reward hacking requires ongoing human spot-checks of highly rewarded outputs, not just automated reward scores.
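A spot-check process needs a rule for choosing which outputs humans actually read. One simple option, sketched below, is to sample at random from the top-scoring slice of outputs, since that is where reward hacking would concentrate. The function name, the 10% cutoff, the sample size, and the synthetic scores are all assumptions for illustration.

```python
import random

def select_for_human_review(scored_outputs, top_fraction=0.1, sample_size=5, seed=0):
    """Randomly sample outputs from the highest-reward slice for human spot-checks.
    scored_outputs is a list of (output_id, reward_score) pairs."""
    ranked = sorted(scored_outputs, key=lambda pair: pair[1], reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    top_slice = ranked[:n_top]
    rng = random.Random(seed)  # fixed seed keeps the audit sample reproducible
    return rng.sample(top_slice, min(sample_size, len(top_slice)))

# Synthetic example: 100 outputs with reward scores from 0.00 to 0.99.
scored = [(f"output_{i}", i / 100) for i in range(100)]
review_batch = select_for_human_review(scored)
for output_id, score in review_batch:
    print(output_id, score)
```

Sampling within the top slice, instead of always taking the very highest scores, avoids reviewing the same few outliers every cycle. If human raters consistently judge this batch worse than its reward scores suggest, that gap is the signal to investigate.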
For teams deploying or evaluating RLHF-trained models, several practical implications follow from this architecture:
Labeler demographics matter. Who is in your annotation pool shapes what the reward model learns. This is not speculation: the InstructGPT paper documents its labeler demographics and acknowledges their influence, and subsequent research has examined the effect further. Diverse, representative labeler pools are more likely to produce broadly aligned models.
Reward model validity decays. A reward model trained on data from six months ago may not capture current labeler preferences, especially as the model's capabilities and the distribution of user queries evolve. Reward models need periodic re-evaluation and retraining.
Human evaluation remains the audit layer. Even with a reward model in the loop, periodic human evaluation of model outputs — especially high-reward outputs and edge cases — is required to detect reward hacking and distributional drift.
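One hedged way to operationalize the audit described in these three points is to track how often the reward model's ranking matches fresh human preferences on the same pairs. The sketch below computes that agreement rate. The tuple layout and the sample values are assumptions, and what counts as an acceptable rate is a judgment call for each team.

```python
def reward_model_agreement(comparisons):
    """Share of comparisons where the reward model ranks the pair the same way
    as the human labeler. Each comparison is (reward_a, reward_b, human_prefers_a).
    A reward tie is counted as the model preferring output B."""
    matches = sum(
        (reward_a > reward_b) == human_prefers_a
        for reward_a, reward_b, human_prefers_a in comparisons
    )
    return matches / len(comparisons)

# Hypothetical fresh audit set: reward scores for outputs A and B,
# plus whether a human labeler preferred A.
fresh_audit = [
    (1.2, 0.3, True),   # model and human both prefer A
    (0.1, 0.9, False),  # model and human both prefer B
    (0.8, 0.2, False),  # model prefers A, human prefers B
    (0.5, 0.4, True),   # model and human both prefer A
]

agreement = reward_model_agreement(fresh_audit)
print(f"Reward model agrees with fresh human labels on {agreement:.0%} of pairs")
```

Computing this rate on newly collected comparisons at regular intervals, and separately for each labeler group where group information is available, gives an early warning when the reward model has drifted from current human preferences.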
RLHF transformed human evaluation from a quality-assurance activity into a training data production activity. This raises the stakes for every design decision in the annotation pipeline — because those decisions no longer just affect how you measure model quality. They affect what the model becomes.
Consult with the AI assistant on designing or auditing an RLHF preference data collection pipeline. Bring a scenario — a domain, a model capability, a labeler pool challenge — and the assistant will help you reason through reward model validity, labeler demographics, reward hacking risks, and scalable oversight tradeoffs.
Complete at least 3 exchanges to finish this lab.