Module 5 · Lesson 1

Evaluation Fundamentals & Metrics

Precision, recall, BLEU scores, perplexity — the instruments that tell you whether your model actually works.

How do you know when an AI system is good enough to ship?

When Google announced Bard in February 2023, its first public demo contained a factual error: it stated that the James Webb Space Telescope took the first images of an exoplanet. Astronomers flagged the error on Twitter within hours, and Alphabet's market capitalization fell by roughly $100 billion that day. The failure was arguably an evaluation failure as much as a training failure: whatever checks ran before the demo did not catch a claim that contradicted the scientific record.

Why Evaluation Is Not Optional

Evaluation is the discipline of rigorously measuring what your AI system actually does versus what you want it to do. In classical software, you write a test: input X should produce output Y. In AI, outputs are probabilistic, context-dependent, and sometimes genuinely novel — which makes evaluation simultaneously harder and more important.

The core problem: a model can look good on the metrics you measure and fail badly on the ones you didn't think to measure. Goodhart's Law makes this worse: when a measure becomes a target, it ceases to be a good measure, so optimizing hard for a tracked metric tends to widen the gap. Many serious production AI failures share this pattern.

The Core Metric Families

Different tasks demand different measurement frameworks. Understanding which family applies to your task is the first decision in any evaluation design.

Classification

Precision · Recall · F1

For tasks with discrete labels. Precision = correct positives / all predicted positives. Recall = correct positives / all actual positives. F1 is their harmonic mean. Choose based on cost of false positives vs. false negatives.

Language Generation

BLEU · ROUGE · BERTScore

BLEU (2002, Papineni et al.) measures n-gram overlap with reference translations. ROUGE measures recall of n-grams for summarization. BERTScore uses contextual embeddings — far more semantically aware than n-gram methods.

Language Modeling

Perplexity · Cross-Entropy Loss

Perplexity measures how surprised a model is by test text. Lower = better. GPT-3's zero-shot perplexity on Penn Treebank was 20.50, well below the previous state of the art. Useful for comparing base models, but a weak proxy for downstream task performance.
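To make the link between perplexity and cross-entropy concrete, here is a minimal sketch. It assumes you already have the natural-log probability the model assigned to each test token; the two probability lists are made-up illustrations, not real model outputs.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token).

    Expects natural-log probabilities, one per token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical per-token log-probabilities for the same 4-token sentence.
confident = [math.log(0.5)] * 4    # every token predicted with p = 0.5
uncertain = [math.log(0.05)] * 4   # every token predicted with p = 0.05

print(perplexity(confident))   # about 2: like choosing among 2 equally likely tokens
print(perplexity(uncertain))   # about 20: like choosing among 20 equally likely tokens
```

Cross-entropy loss (in nats) is the `mean_nll` value inside the function, so the two metrics always rank models identically on the same tokenized text.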

Retrieval / Ranking

MRR · NDCG · Hit@K

Mean Reciprocal Rank measures where the first correct answer appears. NDCG (Normalized Discounted Cumulative Gain) accounts for rank position with logarithmic discount. Hit@K: did the correct answer appear in the top K results?
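The three retrieval metrics above can be sketched in a few lines of plain Python. This is an illustration under simple assumptions: document IDs are strings, relevance for MRR and Hit@K is binary, and NDCG uses graded gains with the common log2 discount. The function names and sample data are invented for the example.

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """queries: list of (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

def hit_at_k(ranked_ids, relevant_ids, k):
    """True if any relevant result appears in the top k."""
    return any(doc_id in relevant_ids for doc_id in ranked_ids[:k])

def dcg(gains):
    """Discounted cumulative gain: position i (0-based) is divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_ids, gain_by_id, k):
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    gains = [gain_by_id.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal = sorted(gain_by_id.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical results for two queries.
queries = [
    (["d3", "d1", "d2"], {"d1"}),   # first relevant result at rank 2
    (["d5", "d6", "d7"], {"d5"}),   # first relevant result at rank 1
]
print(mean_reciprocal_rank(queries))                             # (1/2 + 1/1) / 2 = 0.75
print(hit_at_k(["d3", "d1", "d2"], {"d1"}, k=1))                 # False
print(ndcg_at_k(["d3", "d1", "d2"], {"d1": 3.0, "d2": 1.0}, 3))  # below 1.0: best doc is not first
```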

Classification Metrics Deep Dive

The confusion matrix is the foundation of classification evaluation. Every metric derives from its four cells:

# Confusion Matrix → Core Metrics
from sklearn.metrics import classification_report, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 1]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()
# TP=3, FP=2, FN=1, TN=2

precision = TP / (TP + FP)   # = 3/5 = 0.60
recall    = TP / (TP + FN)   # = 3/4 = 0.75
f1        = 2 * (precision * recall) / (precision + recall)  # = 0.667

# Per-class precision, recall, and F1 in one table
print(classification_report(y_true, y_pred))

# When to prioritize which metric:
# Medical diagnosis → high recall (miss no disease)
# Spam filter → high precision (don't block real email)
# Search → F1 or NDCG depending on rank importance

BLEU Score — What It Is and Why It's Broken

BLEU (Bilingual Evaluation Understudy) was introduced in 2002 and became the default metric for machine translation for twenty years. It counts n-gram overlaps between model output and human reference translations, then applies a brevity penalty to discourage very short outputs.

The problem: BLEU rewards only surface overlap. "The cat sat on the mat" and "A feline rested atop the rug" share almost no n-grams and so score near zero, even though they mean nearly the same thing. By the early 2020s, much of the NLP research community had begun supplementing or replacing BLEU with embedding-based and learned metrics such as BERTScore, which uses contextual embeddings to measure semantic similarity rather than surface string overlap.
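The paraphrase problem is easy to reproduce. The sketch below implements only BLEU's clipped (modified) n-gram precision for a single reference. It is not full BLEU: there is no brevity penalty and no geometric mean over n-gram orders. The function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

reference = "the cat sat on the mat"
paraphrase = "a feline rested atop the rug"

print(modified_precision(paraphrase, reference, 1))  # 1/6: only "the" matches
print(modified_precision(paraphrase, reference, 2))  # 0.0: no shared bigrams
print(modified_precision(reference, reference, 2))   # 1.0: exact copy
```

A metric built on these counts cannot distinguish a faithful paraphrase from an unrelated sentence, which is the gap embedding-based metrics aim to close.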

Industry Reality Check

The WMT 2022 Metrics Shared Task found that human judgments of translation quality correlated more strongly with COMET (a learned metric trained on human ratings) than with BLEU. Meta AI's NLLB paper (2022) still reported BLEU-family scores, largely for comparability with earlier work. Know the limitations of every metric you use.

Evaluation Anti-Patterns to Avoid

Test Set Contamination: Training data that overlaps with your evaluation set. Common with web-scraped data. GPT-4's technical report explicitly describes contamination analysis as part of its evaluation methodology.

Single-Metric Myopia: Optimizing for one number while ignoring others. Amazon's experimental recruiting tool, developed from 2014 and reported by Reuters in 2018, learned to downgrade resumes that signaled the applicant was a woman, a dimension its headline metrics did not measure.

Distribution Shift Blindness: The test set matches the training distribution but not the real-world distribution. Recht et al. (2019) built ImageNet-V2 by closely following the original ImageNet collection process, yet a broad range of ImageNet-trained models lost roughly 11 to 14 percentage points of accuracy on it.

Aggregation Masking: Average metrics hiding subgroup failure. A model averaging 85% F1 may score 40% on a minority subgroup. Always disaggregate metrics by demographic, topic, and input type.
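Aggregation masking is the easiest of these anti-patterns to check for in code. The sketch below computes binary F1 overall and per subgroup on a small invented dataset; the `group`, `label`, and `pred` field names and the records themselves are assumptions made for the example.

```python
from collections import defaultdict

def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_by_group(records):
    """Compute F1 separately for each value of the 'group' field."""
    buckets = defaultdict(lambda: ([], []))
    for r in records:
        labels, preds = buckets[r["group"]]
        labels.append(r["label"])
        preds.append(r["pred"])
    return {g: f1_binary(labels, preds) for g, (labels, preds) in buckets.items()}

# Hypothetical evaluation records: 8 from a majority group, 4 from a minority group.
records = (
    [{"group": "majority", "label": y, "pred": y} for y in [1, 1, 1, 1, 0, 0, 0, 0]]
    + [{"group": "minority", "label": y, "pred": p}
       for y, p in [(1, 1), (1, 0), (1, 0), (0, 0)]]
)

overall = f1_binary([r["label"] for r in records], [r["pred"] for r in records])
per_group = f1_by_group(records)
print(f"overall F1: {overall:.2f}")   # looks healthy
print(per_group)                      # the minority group tells a different story
```

Reporting `per_group` next to `overall` in every evaluation run makes this kind of gap hard to miss.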

Design Principle

Build your evaluation suite before training your model. This is the AI equivalent of test-driven development. Define what "good enough" means for each user-facing dimension, then measure against it throughout training — not just at the end.

Lesson 1 Quiz

Evaluation Fundamentals & Metrics — 4 questions

1. A medical AI for cancer screening produces 95% accuracy. A doctor points out it never flags anyone as having cancer. What metric exposes this problem?

✓ Correct. Recall measures true positives / all actual positives. A model that predicts "no cancer" for everyone has 0% recall — it misses every real case. Accuracy is misleading here because the dataset is imbalanced (most people don't have cancer).

✗ Not quite. Recall (true positive rate) is the metric that exposes a model that never predicts positive. It measures: of all actual positive cases, how many did the model catch?

2. Why has the NLP research community increasingly moved away from BLEU score as a primary metric?

✓ Correct. BLEU counts exact n-gram matches, so semantically identical but lexically different sentences score near zero. BERTScore and learned metrics like COMET correlate far better with human judgments of quality.

✗ The key limitation is semantic blindness — BLEU penalizes paraphrases that mean exactly the same thing. "A cat sat" and "A feline rested" score poorly despite conveying identical meaning.

3. Recht et al. (2019) found that ImageNet-trained models lost roughly 11 to 14 percentage points of accuracy on ImageNet-V2. This demonstrates which evaluation failure mode?

✓ Correct. The models had been evaluated for years on the original ImageNet test set. When tested on freshly collected images gathered with a closely matched protocol, accuracy dropped noticeably, revealing that generalization was more limited than the benchmark suggested.

✗ This is distribution shift blindness: strong scores on the original test set did not carry over to newly collected images, even though the new set was built to mimic the original.

4. What is the recommended best practice for building your evaluation suite relative to model training?

✓ Correct. This is the AI equivalent of test-driven development. Pre-defining success criteria prevents post-hoc rationalization and ensures metrics reflect real user needs, not model convenience.

✗ Building evaluation after training risks Goodhart's Law: you end up measuring what the model happens to do, not what you actually need it to do. Define success criteria first.

Lab 1: Design an Evaluation Suite

Hands-on · AI-Assisted · Metric selection & evaluation design practice

Your Mission: Build a Complete Eval Framework

You're the ML engineer responsible for a customer support AI that handles billing questions, technical troubleshooting, and complaint escalation for a telecom company. The model must ship in 6 weeks. Your task: design the evaluation suite before a single training run begins.

Work with your AI evaluation coach to design metrics, test sets, and success thresholds. Push back, ask hard questions, and stress-test your framework.

Lab Objectives

Select appropriate metrics for each task type (classification, generation, retrieval)
Identify at least 3 evaluation anti-patterns relevant to this scenario
Define concrete success thresholds (not just "as high as possible")
Design subgroup disaggregation strategy to catch aggregation masking
Propose a test set construction plan that avoids contamination

Start by describing your initial instinct for metrics, then we'll stress-test your choices together. Or ask: "What are the three most important evaluation mistakes I could make on this project?"

Eval Design Coach

Evaluation & Metrics

Ready to build your eval suite. You're working on a customer support AI for a telecom company — three task types, six weeks to ship. Where do you want to start? Tell me your initial instinct for how you'd measure success, and I'll push back hard where needed.

Module 5 · Lesson 2

Benchmark Design & Test Set Construction

How you construct your test set is as important as how you train your model — and most teams get it wrong.

What makes a benchmark trustworthy, and what makes it a leaderboard game?

In August 2022, Google's PaLM and dozens of other models were reported to have saturated BIG-bench, achieving near-human performance on many tasks. Researchers then discovered that multiple BIG-bench tasks had leaked into Common Crawl, the web dataset used for pre-training. Some of what looked like reasoning was memorization of test answers. The paper "Are Large Language Models Really Good Few-Shot Learners?" (2023) showed systematic contamination across GLUE, SuperGLUE, and BIG-bench. Treat benchmark saturation as a reason to check for contamination before crediting capability.

The Anatomy of a Good Test Set

A test set is a sample of the problem space that should generalize to the actual deployment distribution. That sentence contains three things most teams get wrong: sample (representativeness), problem space (coverage), and deployment distribution (real-world fidelity).

Property 1

Held-Out Isolation

No overlap with training or validation data. Use temporal splits for time-series. Use user-splits for user-level data (no user appears in both train and test). Hash-based deterministic splits for reproducibility.
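A hash-based user split can be sketched as follows. It assumes each record carries a stable user identifier; the salt string and the 20% test fraction are arbitrary choices for illustration.

```python
import hashlib

def assign_split(user_id, test_fraction=0.2, salt="eval-split-v1"):
    """Deterministically map a user to 'train' or 'test'.

    The same user_id always lands in the same split, so no user
    can appear on both sides. Changing the salt reshuffles everyone.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # uniform value in [0, 1)
    return "test" if bucket < test_fraction else "train"

# Hypothetical user IDs
splits = {uid: assign_split(uid) for uid in ("user-001", "user-002", "user-003")}
print(splits)
```

Because the assignment depends only on the ID and the salt, the split is reproducible across machines and reruns without storing a lookup table. For time-series data, a temporal cutoff is still the safer choice.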

Property 2

Distributional Fidelity

Must reflect actual deployment inputs. If users submit short informal queries, your test set must include them. SQuAD was collected from Wikipedia; models trained on SQuAD often degrade sharply on messy real-world questions.

Property 3

Adversarial Coverage

Deliberately include edge cases, ambiguities, and adversarial inputs. CheckList (Ribeiro et al. 2020) introduced behavioral testing: minimum functionality tests, invariance tests, directional expectation tests.

Property 4

Label Quality Control

Inter-annotator agreement (Cohen's Kappa, Fleiss' Kappa) must be measured and reported. SNLI (Stanford NLI dataset) had annotator disagreement rates that introduced systematic noise — later discovered to affect model reliability conclusions.

Contamination Detection

Before trusting any benchmark result, test for contamination. The standard method is to check for overlap between test examples and training data. The GPT-3 paper searched for 13-gram overlaps against the pre-training corpus; the GPT-4 technical report describes a substring-match check that samples 50-character substrings from each evaluation example.

# Basic contamination check: n-gram overlap detection

def get_ngrams(text, n=13):
    # Whitespace tokenization; texts shorter than n tokens yield an empty set
    tokens = text.lower().split()
    return set(
        ' '.join(tokens[i:i+n])
        for i in range(len(tokens) - n + 1)
    )

def contamination_rate(train_corpus, test_examples, n=13):
    # Build train n-gram index (held in memory; large corpora need a
    # disk-backed or probabilistic index instead)
    train_ngrams = set()
    for doc in train_corpus:
        train_ngrams.update(get_ngrams(doc, n))

    # Check each test example
    contaminated = []
    for example in test_examples:
        test_grams = get_ngrams(example['text'], n)
        overlap = test_grams & train_ngrams
        if len(overlap) > 0:
            contaminated.append({
                'example': example,
                'overlap_count': len(overlap),
                'overlap_rate': len(overlap) / len(test_grams)
            })

    # Guard against an empty test set
    rate = len(contaminated) / len(test_examples) if test_examples else 0.0
    return contaminated, rate

The CheckList Framework

Published at ACL 2020 by Ribeiro et al. (Microsoft Research, University of Washington, and UC Irvine), CheckList reframes NLP evaluation as behavioral testing inspired by software engineering. Instead of asking "what's your F1?", it asks "what behaviors does your model exhibit?"

MFT (Minimum Functionality Tests): Simple test cases targeting a specific behavior. "The service was terrible" should classify as negative sentiment. A model that fails MFTs on basic cases is not production-ready regardless of aggregate F1.

INV (Invariance Tests): Perturbations that should not change model output. Changing a customer's name in a support ticket should not change the complaint classification. Typos and formatting changes should be invariant for robust models.

DIR (Directional Expectation Tests): Changes that should produce predictable output shifts. Adding "but the price was excellent" to a negative review should raise the sentiment score; a score that drops or stays flat is a failure. These catch specific failure modes invisible to aggregate metrics.
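The three test types can be expressed as small harness functions. The sketch below does not use the CheckList library itself; it is a plain-Python illustration with a toy lexicon-based sentiment model standing in for a real one, and every name in it is made up for the example.

```python
POSITIVE = {"excellent", "great", "good", "love"}
NEGATIVE = {"terrible", "awful", "bad", "slow"}

def toy_score(text):
    """Toy sentiment score in [-1, 1] from word counts. Stand-in for a real model."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / max(pos + neg, 1)

def toy_label(text):
    score = toy_score(text)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def run_mft(label_fn, cases):
    """cases: (text, expected_label). Returns the failing cases."""
    return [(text, expected) for text, expected in cases if label_fn(text) != expected]

def run_inv(label_fn, pairs):
    """pairs: (original, perturbed). The label must not change."""
    return [(a, b) for a, b in pairs if label_fn(a) != label_fn(b)]

def run_dir(score_fn, pairs):
    """pairs: (before, after). The score must strictly increase."""
    return [(a, b) for a, b in pairs if score_fn(b) <= score_fn(a)]

mft_cases = [("The service was terrible", "negative"), ("I love this plan", "positive")]
inv_pairs = [("John Smith says the service was terrible",
              "Wei Chen says the service was terrible")]
dir_pairs = [("The service was terrible",
              "The service was terrible but the price was excellent")]

mft_failures = run_mft(toy_label, mft_cases)
inv_failures = run_inv(toy_label, inv_pairs)
dir_failures = run_dir(toy_score, dir_pairs)
print(mft_failures, inv_failures, dir_failures)   # empty lists mean every test passed
```

Swapping `toy_label` and `toy_score` for calls to a production model turns the same three lists of cases into a regression suite.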

Dynamic vs. Static Benchmarks

Static benchmarks get "solved": models overfit to benchmark quirks rather than developing genuine capability. Dynabench (Kiela et al., 2021, Facebook AI Research, now Meta AI) introduced human-in-the-loop dynamic benchmarking: humans write examples that fool the current best model, which become the new test set. As long as new rounds are collected, the benchmark stays hard for current models.

For production systems, the equivalent is a living evaluation set: regularly updated with real user inputs that exposed failures, sampled from actual traffic, and re-labeled by domain experts. Mature teams typically keep a proprietary living eval set alongside public benchmarks.

Practical Rule

Never use the same test set more than once to make a go/no-go decision. Each time you evaluate and adjust based on results, your model implicitly "sees" the test set. Keep a final held-out set that you evaluate exactly once — for the launch decision. Everything else is development evaluation.

Lesson 2 Quiz

Benchmark Design & Test Set Construction — 4 questions

1. A team discovers their model achieves 97% on SuperGLUE. Before announcing results, what should they check first?

✓ Correct. Near-human or superhuman benchmark performance is a contamination red flag. The standard practice (used by OpenAI, Google, and others) is to run n-gram overlap detection between training data and benchmark test examples before announcing results.

✗ Suspiciously high benchmark scores require a contamination check first. Large language models trained on web data frequently encounter benchmark test examples during pre-training — detecting this is the priority.

2. CheckList's "Invariance Tests" check for what property?

✓ Correct. Invariance tests apply perturbations that should be semantically neutral — changing a customer's name, adding a typo, reformatting whitespace. A robust model should produce identical (or near-identical) outputs for these changes.

✗ Invariance tests perturb inputs in ways that should not change the output. If swapping "John Smith" for "Wei Chen" changes a model's complaint classification, that's a bias failure that aggregate F1 would never reveal.

3. Why should you never use the same test set more than once for go/no-go decisions?

✓ Correct. Every time you look at test set results and make a decision, you're leaking information. Even if your model never trains on the test set, your judgment and subsequent development choices are shaped by test set performance — this is the "multiple comparisons" problem applied to model development.

✗ The problem is implicit data leakage through the development process. When you see test results and adjust your approach, your decisions are being shaped by the test set — equivalent to a subtle form of overfitting.

4. What is Dynabench's key innovation over traditional static benchmarks?

✓ Correct. Dynabench (Kiela et al., Meta AI, 2021) uses human-in-the-loop adversarial data collection. Annotators are shown the current best model's predictions and challenged to write examples the model gets wrong. These failures become the new benchmark — creating a perpetually hard test set.

✗ Dynabench's innovation is human-in-the-loop adversarial construction. Annotators specifically write examples that fool the current SOTA model, so the benchmark always stays ahead of model improvements — preventing saturation.

Lab 2: Test Set Audit & CheckList Design

Hands-on · AI-Assisted · Build behavioral test cases for a production NLP system

Your Mission: Audit and Rebuild a Flawed Test Set

You've inherited a sentiment analysis model for product reviews. The team claims 91% accuracy on their test set. Your gut says something is wrong. You've discovered that their training data was scraped from the same review platform as their test set during the same time period, and their test set has only 200 examples — all 5-star or 1-star reviews, no middle ratings.

Work with your AI testing coach to (1) identify all the problems with this test set, (2) design a proper replacement using CheckList methodology, and (3) write actual test cases for MFT, INV, and DIR categories.

Lab Objectives

Enumerate every flaw in the existing test set and explain its impact
Design a contamination check procedure for this specific scenario
Write 3 MFT (minimum functionality) test cases with expected outputs
Write 3 INV (invariance) test cases — what perturbations should leave sentiment unchanged?
Write 3 DIR (directional) test cases — what changes should shift sentiment in predictable ways?

Start by listing every problem you can spot with the existing test set, then we'll design the replacement together. The more specific you are, the better feedback you'll get.

Test Set Design Coach

Benchmark Engineering

Let's audit this test set together. You've identified the setup: 200 examples, all extreme ratings, scraped same-platform same-period as training data, 91% claimed accuracy. Start by listing every red flag you see — don't hold back. Then we'll fix it.

Module 5 · Lesson 3

Human Evaluation & LLM-as-Judge

When automated metrics fail, humans are the ground truth — and LLMs are learning to stand in for them at scale.

Can a language model reliably judge another language model — and when does that go wrong?

When Chatbot Arena (LMSYS, 2023) launched, it offered an elegant solution to the evaluation problem: show users two anonymous model responses side-by-side and ask which is better. Within months, 500,000+ human preference votes had created the Elo leaderboard, one of the most widely cited rankings of conversational AI models. But researchers noticed a pattern: more verbose and sycophantic responses tended to win votes, sometimes at the expense of factual accuracy. Human preference is not the same as human benefit.

The Human Evaluation Framework

Automated metrics measure surface properties. Human evaluation measures what actually matters — and is the only reliable signal for open-ended generation quality. The challenge is scale, cost, and inter-annotator reliability.

Method 1

Absolute Rating

Annotators rate each output on a scale (1-5, Likert). Simple but noisy — different annotators have different baselines. Requires clear rubrics. Used by: OpenAI for InstructGPT, Anthropic for Constitutional AI evaluation.

Method 2

Pairwise Preference

Show two outputs, pick the better one. More reliable than absolute ratings — humans are better at comparative judgments. Produces preference rates that feed into Elo/Bradley-Terry ranking models. Used by: Chatbot Arena, RLHF training pipelines.

Method 3

Reference-Based

Annotators compare model output to expert-written reference answer. High quality but requires expensive expert annotation. Used for: medical AI evaluation, legal document review, technical domain tasks where experts are necessary.

Method 4

Task-Completion

Measure whether the model's response helped a real user complete a real task. Gold standard for user-facing products. Requires A/B testing infrastructure and sufficient traffic. Measures actual utility, not perceived quality.
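To show how pairwise preference votes (Method 2) become a ranking, here is a minimal Elo update sketch. The starting rating of 1000, the K-factor of 32, the model names, and the votes are all assumptions for illustration; production leaderboards often fit a Bradley-Terry model over all votes at once instead, because sequential Elo depends on vote order.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(ratings, model_a, model_b, outcome, k=32.0):
    """Update ratings in place after one vote.

    outcome: 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    """
    ra, rb = ratings[model_a], ratings[model_b]
    ea = expected_score(ra, rb)
    ratings[model_a] = ra + k * (outcome - ea)
    ratings[model_b] = rb - k * (outcome - ea)

# Hypothetical models and votes: (model_a, model_b, outcome)
ratings = {"model-x": 1000.0, "model-y": 1000.0, "model-z": 1000.0}
votes = [
    ("model-x", "model-y", 1.0),
    ("model-y", "model-z", 0.5),
    ("model-x", "model-z", 1.0),
]
for a, b, outcome in votes:
    update_elo(ratings, a, b, outcome)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```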

Annotation Quality & Inter-Annotator Agreement

Human evaluation is only as good as your annotators and your rubrics. Inter-annotator agreement (IAA) must be measured before trusting any human eval results.

# Measuring annotation quality
from sklearn.metrics import cohen_kappa_score

# Two annotators rating 10 outputs on 1-5 scale
annotator_1 = [5, 3, 4, 2, 5, 1, 4, 3, 2, 5]
annotator_2 = [4, 3, 3, 2, 5, 2, 4, 4, 1, 5]

# Unweighted kappa treats every disagreement alike (4 vs 5 counts the same as 1 vs 5).
# For ordinal scales, passing weights="quadratic" gives partial credit for near misses.
kappa = cohen_kappa_score(annotator_1, annotator_2)
# Interpret: < 0.2 slight, 0.2-0.4 fair, 0.4-0.6 moderate
#            0.6-0.8 substantial, > 0.8 almost perfect

# Common rule of thumb for NLP annotation: kappa of at least ~0.6
# Below this: your rubric is unclear or task is too subjective

# These sample ratings give an unweighted kappa of about 0.37, so the check below fires
if kappa < 0.6:
    raise ValueError(f"IAA too low ({kappa:.2f}). Revise rubric before proceeding.")

LLM-as-Judge: The State of the Art

In 2023, Zheng et al. (UC Berkeley and collaborators) published "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena". They showed that GPT-4's pairwise judgments agreed with human expert judgments about as often as human experts agreed with each other: 85% agreement versus 81% for human-human in their non-tie setup. This enabled automated evaluation at a fraction of the cost.

The practical implication: you can run LLM-as-judge at scale to get a continuous signal approximating human evaluation, then use actual human evaluation to calibrate and validate the judge model's reliability.

# LLM-as-Judge implementation pattern
# call_llm and reconcile_judgments are placeholders for your own client and merge logic

JUDGE_PROMPT = """You are an expert evaluator. Compare two responses to the user's question.

Question: {question}
Response A: {response_a}
Response B: {response_b}

Evaluate on these dimensions (1-10 each):
1. Factual accuracy: does it contain correct information?
2. Completeness: does it address all aspects of the question?
3. Clarity: is it well-organized and easy to understand?
4. Helpfulness: would this response satisfy the user's actual need?

Output JSON: {{"winner": "A" or "B" or "tie", "scores_a": [...], "scores_b": [...], "reasoning": "..."}}"""

async def judge_pair(question, response_a, response_b, judge_model="gpt-4"):
    # Run twice with A/B swapped to detect position bias
    prompt_ab = JUDGE_PROMPT.format(
        question=question,
        response_a=response_a,
        response_b=response_b
    )
    # Build the swapped prompt from the template. Chained str.replace calls
    # would overwrite both slots with the same text.
    prompt_ba = JUDGE_PROMPT.format(
        question=question,
        response_a=response_b,
        response_b=response_a
    )
    result_1 = await call_llm(judge_model, prompt_ab)
    result_2 = await call_llm(judge_model, prompt_ba)
    # In result_2 the labels are flipped: "A" refers to response_b
    return reconcile_judgments(result_1, result_2)

Known Biases in LLM Judges

Position Bias: LLM judges often favor a response because of where it appears, most commonly the first position. Mitigation: always run both orderings (A vs B and B vs A) and average. If they disagree, mark as a tie.

Verbosity Bias: LLM judges prefer longer, more detailed responses even when brevity is correct. A three-sentence accurate answer can lose to a five-paragraph answer with errors. Control for length in your rubric explicitly.

Self-Enhancement Bias: GPT-4 tends to prefer GPT-4 outputs when judging, even when blinded. Claude tends to prefer Claude outputs. Use a judge model from a different family than your production model when possible.

Sycophancy Cascade: If the judge model was RLHF-trained on human feedback, it may prefer "helpful-sounding" responses over accurate ones, replicating the same sycophancy problem it's meant to measure.
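The position-bias mitigation described above (run both orderings, treat disagreement as a tie) can be sketched as a small reconciliation step. This version works on winner labels only and assumes each label refers to the position shown to the judge, so the swapped run's label has to be flipped back. The function names are hypothetical.

```python
FLIP = {"A": "B", "B": "A", "tie": "tie"}

def reconcile_verdicts(original_winner, swapped_winner):
    """Combine verdicts from the (A, B) run and the (B, A) run.

    swapped_winner uses displayed positions, so "A" there means the
    response that was originally labeled B.
    """
    swapped_in_original_terms = FLIP[swapped_winner]
    if original_winner == swapped_in_original_terms:
        return original_winner
    return "tie"   # the judge changed its mind when the order changed

def first_position_win_rate(verdict_pairs):
    """Share of pairs where the first-shown response won in both orderings.

    A high value suggests the judge is rewarding position, not content.
    """
    flagged = sum(1 for orig, swapped in verdict_pairs if orig == "A" and swapped == "A")
    return flagged / len(verdict_pairs)

# Hypothetical verdicts: (winner in A/B order, winner in B/A order)
verdict_pairs = [("A", "B"), ("A", "A"), ("B", "A"), ("tie", "tie")]
print([reconcile_verdicts(o, s) for o, s in verdict_pairs])
print(first_position_win_rate(verdict_pairs))
```

Tracking the first-position win rate over time gives a cheap health check on the judge itself, separate from the models it is scoring.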

Production Pattern

A robust production setup combines three methods: LLM-as-judge for continuous monitoring at scale, recurring human evaluation (for example, weekly) for calibration, and red-teaming for adversarial coverage. No single evaluation method is sufficient. Build a pipeline that combines all three.

Lesson 3 Quiz

Human Evaluation & LLM-as-Judge — 4 questions

1. Chatbot Arena's Elo leaderboard showed that more verbose, sycophantic responses scored higher. What does this reveal about human preference-based evaluation?

✓ Correct. This is a fundamental tension in RLHF-based alignment: optimizing for human preference can diverge from optimizing for factual accuracy or genuine helpfulness. It is a form of reward hacking through sycophancy: the model learns to flatter rather than inform.

✗ The lesson is more specific: human preference ≠ human benefit. People often prefer longer, confident-sounding responses even when shorter, accurate ones would serve them better. This is a known limitation of preference-based evaluation that must be explicitly controlled for.

2. When running LLM-as-judge evaluation, why should you always run the evaluation twice with A and B positions swapped?

✓ Correct. Position bias is one of the most robust findings in LLM-as-judge research (Zheng et al., 2023). The first response shown tends to win, regardless of quality. Running both orders and averaging — treating disagreements as ties — substantially reduces this artifact.

✗ The reason is position bias — a systematic preference for whichever response appears first in the prompt. Swapping positions and looking for consistency (or marking inconsistencies as ties) is the standard mitigation.

3. Cohen's Kappa of 0.35 between two annotators on a sentiment task indicates what?

✓ Correct. 0.35 falls in the "fair" range (0.2–0.4). For NLP annotation tasks, the minimum acceptable threshold is approximately 0.6 (substantial agreement). Below this, annotator disagreement will introduce more noise than signal into your labels — your rubric needs to be clearer and more specific.

✗ A Kappa of 0.35 is "fair" agreement — significantly below the ~0.6 threshold needed for reliable NLP annotation. The rubric needs revision: add concrete examples, clearer definitions, and calibration sessions between annotators before collecting real data.

4. Why is it recommended to use a judge LLM from a different model family than your production model?

✓ Correct. Self-enhancement bias is well-documented: GPT-4 judging GPT-4 outputs inflates scores relative to human judgment. Using Claude to judge GPT-4 outputs (or vice versa) reduces this bias — though cross-family bias can still exist. The key is not to use the same model to evaluate itself.

✗ The issue is self-enhancement bias — LLMs systematically prefer outputs from their own model family, creating circular evaluation where the model essentially grades its own homework. Cross-family judging substantially reduces (though doesn't eliminate) this artifact.

Lab 3: Design & Stress-Test an LLM Judge

Hands-on · AI-Assisted · Build evaluation pipelines that catch real AI failures

Your Mission: Build a Production-Grade LLM Judge

Your team is building an automated evaluation pipeline for a legal document summarization AI. Lawyers need accurate, concise summaries of contracts — getting this wrong has real consequences. You need an LLM judge that flags poor summaries before they reach clients.

Design the judge prompt, identify its failure modes, test it against adversarial cases, and build mitigation strategies. Your AI coach will challenge your prompt, give you outputs to evaluate, and help you discover where your judge breaks.

Lab Objectives

Write a judge system prompt with explicit rubric dimensions for legal summarization
Design the position-bias mitigation strategy (specific implementation)
Identify 4+ failure modes specific to legal text evaluation
Write adversarial test cases that expose verbosity bias in your judge
Propose a calibration process using 20 expert-labeled "golden" examples

Start by drafting your judge prompt — write it as if you're actually deploying it. Include your rubric dimensions, scoring instructions, and output format. Then I'll give you a test case to run through it.

LLM Judge Design Coach

Evaluation Engineering

Let's build your legal summarization judge. This is high-stakes — a bad summary could cause a lawyer to miss a critical clause. Draft your judge prompt first: what dimensions matter, how should it score, what format should it output? Write it as production code. Once you have a draft, I'll give you tricky test cases to run through it.

Module 5 · Lesson 4

Continuous Evaluation & Production Monitoring

Evaluation doesn't end at launch. Models degrade, distributions shift, and edge cases accumulate — you need systems that catch this in real time.

How do you know when your shipped model starts failing — before your users do?

In 2021, shortly after GitHub Copilot's technical preview opened, researchers at NYU (Pearce et al., "Asleep at the Keyboard?") tested it on security-relevant scenarios such as SQL queries and cryptography and found that roughly 40% of the generated programs contained vulnerabilities. The model had learned to produce plausible-looking but insecure code from the large volume of insecure code in public repositories. The broader lesson was stark: production AI systems require continuous security and quality monitoring, because the distribution of user inputs drifts away from anything you could have tested pre-launch.

Why Post-Launch Evaluation is Mandatory

Pre-launch evaluation is a snapshot. Production is a movie. Users ask questions you never anticipated. The world changes. Competitors release models that change user expectations. Underlying APIs and data sources update. Each of these introduces quality drift that your pre-launch metrics cannot detect.

The goal of production monitoring is to detect degradation before users report it — and before it compounds into a reputational or safety incident.

The Monitoring Stack

Layer 1

Input Distribution Monitoring

Track statistical properties of incoming queries: length distribution, vocabulary drift, topic distribution, language mix. Significant shifts signal that your model is being used differently than expected. Tools: population stability index (PSI), KL divergence on input embeddings.

Layer 2

Output Quality Signals

Automated quality signals: refusal rate, output length distribution, hallucination detector hits, toxicity scorer triggers, factuality checker alerts. These run on every response in real time — no human needed until a threshold is crossed.

Layer 3

Behavioral Regression Testing

Run your full pre-launch CheckList test suite against production daily. Any regression in MFT/INV/DIR tests triggers an alert. This catches silent model updates, prompt injection attacks, or A/B test configuration errors that degrade specific behaviors.

Layer 4

User Feedback Integration

Thumbs up/down and explicit corrections are direct quality signals; abandonment rates and follow-up question rates are implicit ones. Build a pipeline that routes low-rated outputs to human review and automatically adds egregious failures to your regression test suite.
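The population stability index mentioned under Layer 1 can be computed with a few lines of Python. The sketch below compares binned query lengths from a baseline window against a current window. The bin edges, sample data, and the 0.25 alert threshold (a widely used rule of thumb, not a hard standard) are assumptions for illustration.

```python
import math

def bin_proportions(values, bin_edges, eps=1e-6):
    """Share of values in each half-open bin [edge_i, edge_i+1).

    Values outside the edges are ignored. Empty bins are floored at eps
    so the logarithm below stays defined.
    """
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")
    return [max(c / total, eps) for c in counts]

def population_stability_index(baseline, current, bin_edges):
    """PSI = sum over bins of (current - baseline) * ln(current / baseline)."""
    p_base = bin_proportions(baseline, bin_edges)
    p_cur = bin_proportions(current, bin_edges)
    return sum((c - b) * math.log(c / b) for b, c in zip(p_base, p_cur))

# Hypothetical query lengths (in tokens) at launch vs. this week.
edges = [0, 10, 20, 50, 200]
launch_lengths = [5] * 50 + [15] * 30 + [30] * 15 + [100] * 5
current_lengths = [5] * 20 + [15] * 30 + [30] * 30 + [100] * 20

psi = population_stability_index(launch_lengths, current_lengths, edges)
print(f"PSI = {psi:.3f}")
if psi > 0.25:
    print("Input distribution has shifted substantially; review recent traffic.")
```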

Implementing Hallucination Detection in Production

Hallucination detection at scale requires a pipeline, not a single check. A common production pattern combines multiple signals:

# Production hallucination detection pipeline (sketch).
# nli_check, get_min_logprob, extract_entities, verify_entities, regenerate,
# measure_consistency, and weighted_ensemble are placeholders for your own
# components. Each signal is assumed to be a score in [0, 1], higher = better grounded.
import asyncio
import math
from dataclasses import dataclass

@dataclass
class HallucinationSignal:
    response_id: str
    confidence_score: float   # 0-1, lower = more likely hallucinated
    signals: dict

async def check_hallucination(response, context, facts_db):
    signals = {}

    # Signal 1: Factual consistency with retrieved context
    signals['context_consistency'] = await nli_check(response, context)

    # Signal 2: Confidence from model logprobs (if available).
    # The minimum token logprob is <= 0; exp() maps it to a probability in (0, 1]
    # so it shares a scale with the other signals.
    signals['token_confidence'] = math.exp(get_min_logprob(response))

    # Signal 3: Named entity verification against knowledge base
    entities = extract_entities(response)
    signals['entity_grounding'] = verify_entities(entities, facts_db)

    # Signal 4: Self-consistency (sample 3 responses concurrently, check agreement)
    samples = await asyncio.gather(*[regenerate(context) for _ in range(3)])
    signals['self_consistency'] = measure_consistency(response, samples)

    # Ensemble: weight signals by task-specific calibration
    score = weighted_ensemble(signals, weights={
        'context_consistency': 0.4,
        'token_confidence': 0.2,
        'entity_grounding': 0.25,
        'self_consistency': 0.15
    })
    return HallucinationSignal(response.id, score, signals)
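
The pipeline leaves `weighted_ensemble` undefined. One plausible shape for it, offered as a sketch and not as the author's implementation, is a weighted average that renormalizes over whichever signals are present. It assumes every signal is already a score in [0, 1] and that an unavailable signal (for example, an API that returns no logprobs) is passed as `None`.

```python
def weighted_ensemble(signals, weights):
    """Weighted average of the available signals.

    Signals set to None, or without a weight, are skipped and the remaining
    weights are renormalized so the result stays in [0, 1].
    """
    available = {k: v for k, v in signals.items() if v is not None and k in weights}
    if not available:
        raise ValueError("no usable signals")
    total_weight = sum(weights[k] for k in available)
    return sum(weights[k] * v for k, v in available.items()) / total_weight

WEIGHTS = {
    'context_consistency': 0.4,
    'token_confidence': 0.2,
    'entity_grounding': 0.25,
    'self_consistency': 0.15,
}

# Hypothetical signal values for two responses.
full = weighted_ensemble(
    {'context_consistency': 0.9, 'token_confidence': 0.5,
     'entity_grounding': 0.8, 'self_consistency': 0.6},
    WEIGHTS,
)
no_logprobs = weighted_ensemble(
    {'context_consistency': 0.9, 'token_confidence': None,
     'entity_grounding': 0.8, 'self_consistency': 0.6},
    WEIGHTS,
)
print(round(full, 4), round(no_logprobs, 4))
```

Renormalizing keeps scores comparable across responses with and without logprobs. Whether that is appropriate depends on how the weights were calibrated, so treat it as a design choice to validate against labeled examples.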
    

A/B Testing AI System Changes

Every change to a production AI system — new model version, updated prompt, different retrieval strategy — must be evaluated via controlled experiment before full rollout. This is non-negotiable at the scale where your model touches millions of users.

Treatment Assignment

Randomly assign users to control (current model) or treatment (new model) groups. Use user-stable hashing so the same user always gets the same experience; a user who bounces between variants sees inconsistent behavior, which confounds the comparison. Treat 1,000 users per arm as a floor: the sample size you actually need depends on the baseline rate and the smallest effect you want to detect.

Primary Metrics

Define these before running the experiment: task completion rate, session length, explicit quality ratings. Never choose your primary metric after seeing results; that is p-hacking. Pre-register your analysis plan.

Guardrail Metrics

Metrics that must not regress: safety classifier trigger rate, latency p95, error rate. If a guardrail metric regresses beyond its threshold, auto-kill the experiment regardless of primary metric results.

Novelty Effects

Users interact differently with new things. A new model may show inflated engagement in week one that decays. Run experiments for at least two weeks (ideally four) to measure steady-state behavior, not the novelty response.
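
Two of the mechanics above, user-stable assignment and the guardrail kill switch, can be sketched as follows. The function names, the experiment key, the metric names, and the tolerance values are all hypothetical; the hashing uses Python's standard `hashlib`.

```python
import hashlib

def assign_arm(user_id, experiment, treatment_share=0.5):
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing the experiment name with the user ID keeps a user in the same arm
    for the whole experiment while decorrelating assignments across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

def guardrail_breaches(control, treatment, max_relative_increase):
    """Names of guardrail metrics where treatment exceeds control by more than allowed."""
    return [
        name
        for name, limit in max_relative_increase.items()
        if treatment[name] > control[name] * (1 + limit)
    ]

# Hypothetical aggregated metrics for each arm (lower is better for all three).
control = {"safety_trigger_rate": 0.010, "latency_p95_ms": 1200, "error_rate": 0.002}
treatment = {"safety_trigger_rate": 0.021, "latency_p95_ms": 1230, "error_rate": 0.002}
limits = {"safety_trigger_rate": 0.10, "latency_p95_ms": 0.05, "error_rate": 0.10}

breaches = guardrail_breaches(control, treatment, limits)
if breaches:
    print("auto-kill experiment, guardrails breached:", breaches)
```

A production version would compare arms with a statistical test or confidence interval instead of a raw ratio, so that noise in a small sample does not kill a healthy experiment.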

Building an Eval-Ops Culture

Evaluation is not a pre-launch checklist — it is a continuous engineering discipline. The teams that get this right (Anthropic's Trust & Safety, Google's Model Evaluation, OpenAI's Red Team) treat evaluation as a first-class engineering investment: automated infrastructure, dedicated personnel, and explicit evaluation coverage requirements before any model update ships.

The practical minimum for any production AI team: weekly regression runs on your behavioral test suite, monthly human evaluation calibration, continuous automated quality signal dashboards with alerting, and a documented incident response process when evaluation signals degrade beyond threshold.

The Meta-Lesson of This Module

Every documented AI failure in this module (Bard's James Webb Space Telescope hallucination, Amazon's biased recruiting tool, ImageNet overfitting, Copilot security vulnerabilities) was a measurement failure before it was a model failure. The model did what it was optimized to do. Nobody measured the right thing. Build the measurement system first. Build it continuously. Treat evaluation failures as production incidents.

Lesson 4 Quiz

Continuous Evaluation & Production Monitoring — 4 questions

1. GitHub Copilot's pre-launch evaluation found 40% of security-sensitive code suggestions contained vulnerabilities. What does this illustrate about training data for code generation?

✓ Correct. This is a fundamental property of large language models: they model the distribution of their training data. Public code repositories contain a large amount of insecure code. Without explicit evaluation and targeted mitigation (RLHF on security, fine-tuning on secure patterns), the model reflects this distribution faithfully.

✗ The core issue is distributional learning. Models don't have a concept of "correct" vs "incorrect" code — they model what code looks like. GitHub has enormous amounts of insecure code, so models trained on it produce insecure code proportionally. Evaluation must specifically test for this.

2. What is the purpose of "guardrail metrics" in an A/B test for an AI system change?

✓ Correct. Guardrail metrics protect against "winning on one dimension while losing on another" — a new model might improve task completion rate while doubling safety classifier triggers. Guardrails (safety rate, latency, error rate) auto-terminate experiments that degrade beyond acceptable thresholds, preventing harm from reaching full traffic.

✗ Guardrail metrics are automatic kill switches. They define acceptable ranges for critical dimensions (safety, latency, errors) — if violated, the experiment stops regardless of how good the primary metric looks. They prevent a "one metric wins, another metric catastrophically fails" outcome.

3. Why should A/B tests for AI system changes run for at least two weeks rather than being cut short when statistical significance is reached?

✓ Correct. Novelty effects are a well-documented phenomenon in experimentation: users interact more with new experiences regardless of quality. Cutting an experiment early when novelty-inflated metrics look good leads to shipping changes that perform well for one week and then decay to baseline or below. Steady-state measurement requires patience.

✗ The problem is novelty effects — users naturally engage more with new experiences, creating artificially inflated early metrics. An A/B test cut short during the novelty window can show apparent improvement that evaporates after a week. Wait for steady-state behavior before drawing conclusions.

4. The self-consistency signal in hallucination detection works by:

✓ Correct. Self-consistency (introduced by Wang et al., 2022, for chain-of-thought reasoning) exploits the intuition that true facts should appear consistently across multiple independent samples, while hallucinated specifics (made-up names, dates, statistics) will vary between samples. Disagreement across samples is a strong signal of hallucination. It is computationally expensive, since every check requires extra generations, but it works without any external knowledge source.

✗ Self-consistency checks agreement across multiple independent generations. If you ask the model the same question 5 times and get 5 different "facts," those facts are likely hallucinated. True information tends to appear consistently; invented information varies. This is the core signal.

Lab 4: Production Monitoring System Design

Hands-on · AI-Assisted · Build the eval-ops infrastructure for a live AI product

Your Mission: Design a Full Production Monitoring Stack

You're the lead ML engineer at a startup that just launched an AI medical information assistant. It answers patient questions about symptoms, medications, and when to seek care. 50,000 daily active users. You have no monitoring infrastructure — just a server log and user ratings. Something feels off: session abandonment is up 12% this week.

Build the complete production monitoring stack from scratch. Your coach will challenge your design, present you with real monitoring scenarios, and help you decide what to do when alerts fire.

Lab Objectives

Define your 4-layer monitoring stack for a medical information AI specifically
Design specific alerting thresholds and escalation procedures
Diagnose the 12% abandonment spike — propose 3 hypotheses and investigation steps
Design an A/B test for a proposed fix, including guardrail metrics
Build the incident response runbook for a detected safety classifier regression

Start with the abandonment spike: what's your hypothesis for what's causing it, and what data would you look at first? Then we'll build the monitoring system that would have caught this earlier.

Eval-Ops Engineering Coach

Production Monitoring

Medical AI, 50k DAU, 12% abandonment spike, no monitoring infrastructure. This is a real incident. Walk me through your initial hypothesis: what do you think is causing the abandonment increase, and what would you look at in the server logs first? Be specific — "user dissatisfaction" is not a hypothesis.

Module 5 — Final Test

Evaluation & Testing · 15 questions · 80% to pass

1. A spam filter has 98% accuracy but 15% recall on actual spam. What does this tell you?

✓ Correct. With 98% accuracy and 15% recall on spam, the model is essentially flagging almost nothing as spam. Since spam is a small fraction of total email, predicting "not spam" for everything gives 98%+ accuracy. Recall exposes this: 85% of real spam is passing through. Accuracy is useless as a metric here.

✗ This is the classic imbalanced dataset problem. 98% accuracy + 15% recall = the model is nearly always predicting "not spam" and only catching 15% of actual spam. Accuracy is misleading on imbalanced datasets — always report precision, recall, and F1 separately.
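
To make the arithmetic concrete, here is a small worked example. The counts are hypothetical, chosen so that the filter lands on exactly 98% accuracy and 15% recall: 10,000 emails, of which 200 (2%) are spam.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical filter: catches 30 of 200 spam emails, wrongly flags 30 legitimate ones.
spam_filter = classification_metrics(tp=30, fp=30, fn=170, tn=9770)

# Trivial baseline: never flags anything as spam.
always_not_spam = classification_metrics(tp=0, fp=0, fn=200, tn=9800)

for name, m in [("filter", spam_filter), ("always 'not spam'", always_not_spam)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Under these assumed counts the filter and the do-nothing baseline have the same accuracy, while recall and F1 separate them immediately.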

2. BERTScore is preferred over BLEU for text generation evaluation because:

✓ Correct. BLEU counts surface n-gram matches — semantically equivalent paraphrases score near zero. BERTScore compares token-level contextual embeddings, capturing synonyms, paraphrases, and semantic equivalence that BLEU completely misses.

✗ The key advantage is semantic awareness. BLEU sees "cat" and "feline" as completely different words with zero overlap. BERTScore sees them as nearly identical in contextual embedding space. This makes BERTScore correlate far better with human judgments of translation quality.

3. Amazon's 2015 recruiting AI systematically downgraded women's resumes despite high classification accuracy. This is an example of:

✓ Correct. The model had good classification accuracy on the task it was optimized for (matching resumes to historically hired candidates). Nobody measured demographic bias. Single-metric myopia: high performance on the measured dimension, catastrophic failure on an unmeasured one.

✗ This is single-metric myopia — optimizing for a single measure (resume-to-hire match) without measuring fairness across demographic groups. The model was doing exactly what it was optimized to do: predict who would historically have been hired, which reflected historical gender bias in the hiring data.

4. When constructing a test set for a user-facing AI product, what split strategy is most appropriate for user-level data?

✓ Correct. User-stable splits prevent data leakage through user-level patterns. If User A's examples appear in both train and test, the model may learn User A's specific writing style, vocabulary, and preferences — inflating test performance in ways that don't generalize to new users.

✗ Random splits of user data allow the same user's examples in both train and test. The model learns user-specific patterns that inflate test metrics without generalizing. Always split at the user level for user-generated content.

5. CheckList's Directional Expectation Tests (DIR) verify that:

✓ Correct. DIR tests check that the model's outputs respond to input changes in the expected direction. Adding "but the service was excellent" to a negative review should shift sentiment up. Adding a negation ("not happy" vs "happy") should shift it down. Violations catch failure modes that aggregate metrics never surface.

✗ DIR tests verify directional behavior: that input changes produce logically consistent output changes. A sentiment model that scores "This was not good" higher than "This was good" is failing a basic DIR test — even if its aggregate F1 looks fine.

6. The Zheng et al. 2023 paper on MT-Bench found that GPT-4's pairwise judgments agreed with human experts at approximately what rate?

✓ Correct. 85% agreement with human expert judgments — notably exceeding the ~80% human-human agreement rate — was the key finding that validated LLM-as-judge for production use. This doesn't make human evaluation unnecessary, but it enables scalable automated evaluation with known reliability bounds.

✗ The finding was 85% agreement — higher than the ~80% agreement between human experts. This was the key result validating LLM-as-judge at scale. It doesn't replace human evaluation but provides a scalable approximation with known reliability characteristics.

7. Why is perplexity a poor metric for measuring task performance of a fine-tuned language model?

✓ Correct. Perplexity measures language modeling quality — how well does the model predict the next token? A model with excellent perplexity might produce fluent, confident text that is completely factually wrong or unhelpful for the task. Perplexity correlates with fluency, not correctness or task success.

✗ Perplexity measures next-token prediction quality, not task performance. A model could have low perplexity (predicting fluent text) while being factually wrong, biased, or useless for the task at hand. Always use task-specific metrics for fine-tuned models.

8. Dynabench's key innovation over traditional static benchmarks is:

✓ Correct. Dynabench (Kiela et al., Meta AI, 2021) uses human annotators who see the current best model's predictions and specifically write examples that fool it. These adversarial examples become the new test set — ensuring the benchmark perpetually stays ahead of model capabilities instead of being saturated.

✗ Dynabench is human-in-the-loop adversarial benchmarking. Annotators write examples designed to fool the current SOTA model. This creates a continuously hard benchmark that can't be saturated by contamination or overfitting — it adapts faster than models can memorize it.

9. Cohen's Kappa measures what property of annotation quality?

✓ Correct. Raw agreement is misleading: two annotators labeling at random, each marking 90% of examples negative, will agree 82% of the time by pure chance (0.9 × 0.9 + 0.1 × 0.1 = 0.82). Cohen's Kappa subtracts chance agreement: Kappa = (observed agreement - chance agreement) / (1 - chance agreement). This gives the true signal of shared understanding.

✗ Cohen's Kappa accounts for chance agreement. Two annotators guessing randomly would still agree sometimes — Kappa removes this floor. This is why Kappa of 0.8 means something quite different from 80% raw agreement rate, especially on imbalanced label distributions.
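
The Kappa formula can be checked with a short sketch. The two label lists are hypothetical: 100 items, each annotator marks 90 as negative, and they agree on 90 items overall.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label if each labels
    # independently at their own observed label rates.
    chance = sum((freq_a[l] / n) * (freq_b[l] / n) for l in set(freq_a) | set(freq_b))
    if chance == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - chance) / (1 - chance)

# Hypothetical annotations: 85 both-negative, 5 + 5 disagreements, 5 both-positive.
annotator_a = ["neg"] * 85 + ["neg"] * 5 + ["pos"] * 5 + ["pos"] * 5
annotator_b = ["neg"] * 85 + ["pos"] * 5 + ["neg"] * 5 + ["pos"] * 5

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"raw agreement: 0.90, kappa: {kappa:.2f}")
```

Here observed agreement is 0.90 and chance agreement is 0.82, so Kappa is (0.90 - 0.82) / (1 - 0.82), roughly 0.44. A raw agreement rate that looks strong corresponds to only moderate agreement once the skewed label distribution is accounted for.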

10. In LLM-as-judge evaluation, "self-enhancement bias" refers to:

✓ Correct. Self-enhancement bias is a documented systematic preference: GPT-4 gives inflated scores to GPT-4 outputs; Claude inflates Claude outputs. This happens even in blind evaluation because models share stylistic tendencies and the judge model "resonates" with familiar patterns. Use cross-family judges when possible.

✗ Self-enhancement bias means LLMs prefer outputs from their own model family. GPT-4 judging GPT-4 vs Claude — GPT-4's judgments systematically favor GPT-4 style. This creates circular evaluation that inflates scores for the "home" model. Cross-family judging substantially reduces (not eliminates) this bias.

11. Population Stability Index (PSI) in production monitoring is used to detect:

✓ Correct. PSI measures distributional shift between two populations, typically the training-time input distribution and the current production input distribution. A PSI above 0.2 is a commonly used rule-of-thumb threshold for significant drift that may call for re-evaluating or retraining the model. It's a leading indicator: you can detect drift before quality visibly degrades.

✗ PSI detects input distribution drift — statistical differences between the distribution of data your model was trained on and the distribution of data it's currently seeing in production. Distribution shift is a leading indicator of quality degradation — catch it before users notice.

12. The "final held-out test set" in model development should be evaluated: