Module 7 · Lesson 1

Pipeline Architecture and Core Components

From ad-hoc spot-checks to reproducible, automated evaluation infrastructure
What separates a real eval pipeline from a spreadsheet of vibes?

In February 2023, Google's first public demo of Bard contained a factual error: the model incorrectly stated that the James Webb Space Telescope had taken the first images of a planet outside our solar system. The error, caught immediately by astronomers on social media, erased roughly $100 billion in Alphabet's market cap within two days. Google's evaluation process had not surfaced this specific failure before release. The incident became a landmark case study in what happens when pre-deployment evals lack systematic coverage of factual claims in high-stakes subject matter.

Building a formal eval pipeline does not guarantee zero errors. But it converts the question "did we check this?" from a matter of memory into a matter of record.

What Is an Eval Pipeline?

An eval pipeline is an automated, versioned, repeatable system that takes a model (or model version), runs it against a defined dataset of prompts, measures outputs against defined criteria, and produces structured reports — all without requiring manual review of every output.

The key distinction from ad-hoc testing is automation plus reproducibility. You can re-run yesterday's pipeline on today's model and trust the comparison is apples-to-apples. You can run it in CI/CD before every deployment. You can hand it to a colleague who has never seen your codebase and they will get the same numbers.

Pipeline flow: Dataset Registry → Model Runner → Scorer / Judge → Aggregator → Report & Alerting
The Five Core Components

Every serious eval pipeline — whether it is Anthropic's internal Claude evals, OpenAI's Evals framework (open-sourced in March 2023), or Microsoft's PromptFlow — implements variations on the same five components.

Component 1
Dataset Registry
Versioned storage of prompt/expected-output pairs. Tracks provenance — who created each item, when, from what source. Supports splits (train/dev/test) and tags (domain, difficulty, safety-category).
Component 2
Model Runner
Handles API calls, batching, rate-limit retry logic, and deterministic sampling settings (temperature=0 or fixed seed). Outputs raw completions with metadata: latency, token counts, model version string.
Component 3
Scorer / Judge
Compares completions against criteria. May be exact-match, regex, reference-based (BLEU, ROUGE), embedding similarity, or an LLM-as-judge call that returns a structured score. This is where most pipeline design work lives.
Component 4
Aggregator
Collects per-item scores, computes summary statistics (mean, p10/p90, pass-rate per tag), detects regressions against a stored baseline, and flags statistically significant changes.
Component 5
Report & Alerting
Generates human-readable outputs (dashboard, JSON artifact, Slack alert) and optionally blocks deployment via a CI gate if a regression threshold is exceeded.
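
The five components can be sketched end to end in a few lines. This is a minimal illustration, not any vendor's implementation: `EvalItem`, `run_pipeline`, the exact-match scorer, and the `max_drop` threshold are all hypothetical names and choices, and the "model" is a stub function standing in for a real API call.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalItem:
    """One registry entry (hypothetical schema)."""
    item_id: str
    prompt: str
    expected: str
    tags: tuple = ()


def run_pipeline(items: Sequence[EvalItem],
                 complete: Callable[[str], str],
                 baseline_pass_rate: float,
                 max_drop: float = 0.02) -> dict:
    # Model Runner: one completion per item (batching and retries omitted).
    completions = {item.item_id: complete(item.prompt) for item in items}
    # Scorer: exact match after trimming whitespace and lowercasing.
    scores = {
        item.item_id: float(
            completions[item.item_id].strip().lower() == item.expected.strip().lower()
        )
        for item in items
    }
    # Aggregator: summary statistic plus comparison with the stored baseline.
    pass_rate = mean(list(scores.values()))
    regressed = (baseline_pass_rate - pass_rate) > max_drop
    # Report: a structured record a dashboard or CI job could consume.
    return {"pass_rate": pass_rate, "baseline": baseline_pass_rate,
            "regressed": regressed, "scores": scores}


# Usage with a stub model that gets one of two items wrong.
items = [EvalItem("q1", "Capital of France?", "Paris"),
         EvalItem("q2", "2 + 2 = ?", "4")]
answers = {"Capital of France?": "paris", "2 + 2 = ?": "5"}
report = run_pipeline(items, lambda prompt: answers[prompt], baseline_pass_rate=1.0)
print(report["pass_rate"], report["regressed"])
```

Keeping the model behind a plain callable is what lets the same pipeline run against an API, a local model, or cached responses.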
OpenAI Evals: A Reference Implementation

When OpenAI open-sourced its Evals framework in March 2023, it provided a public reference architecture. An eval is defined as a YAML file specifying the dataset path, the completion function (which model and parameters to use), and the eval class (e.g., Match, FuzzyMatch, or a model-graded class). The framework handles the runner, scorer, and aggregator automatically.

The community quickly contributed over 300 evals within the first month — covering subjects from SQL generation accuracy to medical triage prioritization. This illustrated that the hard part of a pipeline is not the plumbing but the dataset curation and scoring criteria design.

    # Minimal OpenAI Evals-style YAML structure (illustrative; not the exact registry schema)
    eval_id: factual-qa-v1
    class: evals.eval:ModelGradedEval
    args:
      completion_fns: [gpt-4o]
      samples_jsonl: data/factual_qa/samples.jsonl
      eval_prompt: prompts/factual_grader.txt
      choice_strings: [A, B, C]  # correct / partial / wrong
Design Principle

The pipeline is only as trustworthy as its least carefully versioned component. Dataset registries drift when examples are edited without versioning. Scorers drift when the judge model is swapped without documentation. Build every component with the assumption that someone will need to reproduce a specific historical run eighteen months from now.

Key Terms
Eval harness: The surrounding infrastructure that orchestrates running a model against test cases — distinct from the test cases themselves. EleutherAI's lm-evaluation-harness is a widely used open-source example.
Completion function: In pipeline terminology, the callable that takes a prompt and returns a model output — abstracting over API provider, local model, or cached response.
CI gate: A pipeline step that blocks a build or deployment if eval scores fall below a defined threshold — converting evaluation from informational to blocking.
Regression baseline: The stored eval results from a prior model version used as the comparison point for detecting capability degradation.

Lesson 1 Quiz

Pipeline Architecture and Core Components
Which incident demonstrated the cost of inadequate pre-deployment factual evaluation at scale?
Correct. The Bard demo error — falsely claiming Webb took the first exoplanet images — erased roughly $100B in Alphabet's market cap within 48 hours, making it a canonical example of inadequate factual eval coverage before a high-visibility launch.
Not quite. Review the Google Bard launch incident from February 2023 and the market impact of a single factual error in a public demo.
What is the primary distinction between an eval pipeline and ad-hoc spot-checking?
Correct. Automation plus reproducibility is the defining property — the same pipeline on the same dataset and model version must produce the same result regardless of who runs it or when.
Not quite. The core distinction is automation and reproducibility, not dataset size, scoring method, or environment.
In a five-component eval pipeline (Dataset Registry → Model Runner → Scorer → Aggregator → Report), where does "detecting regressions against a stored baseline" occur?
Correct. The Aggregator collects per-item scores, computes summary statistics, and compares them against a stored baseline to detect regressions — before passing results to the Report & Alerting component.
Not quite. The Aggregator is responsible for regression detection. The Scorer just produces per-item scores; the Aggregator is what compares them to historical baselines.
What did the community response to OpenAI Evals' open-sourcing in March 2023 reveal about eval pipeline design?
Correct. Over 300 community evals were contributed in the first month, demonstrating that the infrastructure framework is reusable — the bottleneck is always thoughtful dataset construction and precise scoring criteria.
Not quite. The key insight from the rapid community adoption was that building the harness is the easy part; defining what good looks like (dataset + criteria) is where expertise is required.
A "CI gate" in an eval pipeline context means:
Correct. A CI gate converts evaluation from informational to blocking — the deployment simply cannot proceed if the model fails to meet defined performance thresholds, removing the option to ship and hope.
Not quite. A CI gate is an automated quality gate in a continuous integration pipeline that prevents deployment when eval scores fall below thresholds.

Lab 1: Designing Your Pipeline Architecture

Conversation practice · minimum 3 exchanges to complete

Your Scenario

You are a senior ML engineer at a fintech company. Your team ships a customer-facing summarization model that condenses transaction histories into plain-English spending reports. The model is updated monthly. Currently, evaluation is informal — a product manager reads 20 outputs before each release and gives a thumbs-up or thumbs-down.

Your task: design the five core components of a proper eval pipeline for this specific use case. Discuss each component with your AI mentor below.

Start by describing what your Dataset Registry should contain for a transaction summarization eval. What kinds of prompt/output pairs should be included, and how should they be tagged?
Module 7 · Lesson 2

Dataset Construction and Curation

Building test sets that actually find failures before your users do
How do you build an eval dataset that stays honest as your model improves?

When Google Brain published the BIG-Bench benchmark in 2022, with 204 tasks contributed by 450 researchers, its authors treated contamination as a first-order risk: a model that has seen benchmark data during training can score well above its true capability. The BIG-Bench authors introduced a dedicated "canary string" mechanism: a unique token sequence embedded in evaluation data, designed to let practitioners check whether a given model's training corpus contained the test set.

The lesson was not that benchmarks are useless. It was that dataset construction requires explicit contamination hygiene — tracking data provenance from the moment of creation, not retroactively after a model surprises you with suspicious scores.

The Three Failure Modes of Eval Datasets

Most eval datasets fail in one of three ways: they are too easy (the model always passes, giving no signal), contaminated (the model has seen the data during training, inflating scores), or misaligned (the tasks don't reflect what the model actually does in production). Each requires a different remedy.

Failure Mode 1
Saturation
Model scores ≥95% on the dataset. The eval no longer distinguishes between model versions. Fix: add harder adversarial examples, increase difficulty stratification, retire easy items.
Failure Mode 2
Contamination
Test items appear in training data, inflating scores. Fix: use canary strings, hold-out splits with strict access controls, and dynamic eval generation with post-training cutoff dates.
Failure Mode 3
Distributional Mismatch
Dataset prompts don't reflect real production traffic. Fix: sample directly from production logs (with PII scrubbing), use realistic user personas, and audit for format and length distribution.
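
A canary check of the kind named under Contamination can be sketched as follows. The canary format, `make_canary`, `stamp`, and `find_contaminated` are hypothetical; this is not the actual BIG-Bench string or tooling, and a real check would stream a large corpus rather than hold it in a list.

```python
import uuid


def make_canary(prefix: str = "EVAL-CANARY") -> str:
    """Create a string that is vanishingly unlikely to occur by chance."""
    return f"{prefix} GUID {uuid.uuid4()}"


def stamp(item_text: str, canary: str) -> str:
    """Attach the canary to an eval item before it is stored or published."""
    return f"{item_text}\n{canary}"


def find_contaminated(documents, canary: str):
    """Return the indexes of training documents that contain the canary."""
    return [i for i, doc in enumerate(documents) if canary in doc]


canary = make_canary()
corpus = ["ordinary web text", stamp("Q: What is 17 * 3?", canary), "more web text"]
print(find_contaminated(corpus, canary))
```

A hit shows that stamped eval files reached the corpus. A miss does not prove the data is clean, because the items may have been copied without the canary line.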
Dataset Construction Principles

Stratified coverage. A good eval dataset is not a random sample — it is a stratified one. Define axes of variation that matter for your use case: task type, domain, length, complexity, and edge-case categories (ambiguous inputs, adversarial phrasing, multilingual). Ensure each stratum has adequate representation.

Difficulty calibration. Include items at multiple difficulty levels. A common rule of thumb is to aim for pass rates somewhere between 20% and 80% at each difficulty tier, which is where items carry the most signal. Items with pass rates below 5% or above 95% across model versions should be flagged for replacement.
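
The flagging rule above is easy to automate. The sketch below assumes per-item pass/fail history is available as a dictionary; `flag_for_replacement` and the label strings are hypothetical.

```python
def flag_for_replacement(results, low=0.05, high=0.95):
    """Flag items whose pass rate across model versions carries no signal.

    `results` maps item_id -> list of booleans, one per model version.
    """
    flagged = {}
    for item_id, outcomes in results.items():
        rate = sum(outcomes) / len(outcomes)
        if rate < low:
            flagged[item_id] = "too_hard"
        elif rate > high:
            flagged[item_id] = "too_easy"
    return flagged


history = {
    "q1": [True, True, True, True],      # every version passes
    "q2": [False, True, True, False],    # still discriminates between versions
    "q3": [False, False, False, False],  # every version fails
}
flags = flag_for_replacement(history)
print(flags)
```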

Golden references vs. reference-free. For tasks with clear correct answers (factual QA, code execution), maintain a golden reference answer per item. For open-ended generation, reference-free scoring (fluency, coherence ratings) or LLM-as-judge is preferable — but requires calibration of the judge itself against human ratings.

Real Practice — Anthropic's Approach

Anthropic has publicly described maintaining "red-team datasets" that are never used for training, only for eval — analogous to a held-out test set in a traditional ML workflow. These datasets are updated quarterly with new adversarial examples generated by human red-teamers. The update cadence matters: a static safety eval dataset becomes stale as models learn to handle the specific attack patterns it contains.

Data Labeling and Annotation Quality

Ground truth labels in eval datasets are only as reliable as the annotation process that produced them. The Scale AI / RLHF contamination disclosures of 2023 — where contractors admitted to completing tasks without genuine review — highlighted that annotation quality is a first-class infrastructure problem, not a preprocessing footnote.

Standard practice for high-stakes eval datasets includes inter-annotator agreement measurement (Cohen's κ ≥ 0.6 is a common threshold for acceptable consistency), annotation guidelines with worked examples, and calibration tasks seeded throughout annotation batches to detect annotator drift.
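
Cohen's κ compares observed agreement with the agreement two annotators would reach by chance given their label frequencies. A minimal two-annotator sketch, with hypothetical labels:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_a = ["pass"] * 6 + ["fail"] * 4
annotator_b = ["pass"] * 5 + ["fail"] * 4 + ["pass"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

In this example the annotators agree on 8 of 10 items, yet κ comes out just under 0.6, because much of that raw agreement is expected by chance.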

For LLM-generated labels — increasingly common for scale — measure agreement between the LLM judge and a human gold-standard set before deploying the judge at scale. A judge that disagrees with humans on 25% of examples produces unreliable aggregate scores.

Versioning and Governance

Treat eval datasets as first-class versioned artifacts. Every item should carry: a unique ID, a creation timestamp, the author or source, a split assignment (dev/test), tag metadata, and a version history of any edits. This enables auditing — if a model's score on "safety-category:jailbreak" jumps 8 points between releases, you can determine whether the improvement reflects genuine capability gain or dataset changes.
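
One way to carry those fields is an immutable record whose edits produce a new version rather than overwriting the old one. The schema below (`DatasetItem`, `edit_item`, and the field names) is a hypothetical sketch, not a standard format.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timezone


@dataclass(frozen=True)
class DatasetItem:
    item_id: str
    prompt: str
    expected: str
    author: str
    split: str                      # "dev" or "test"
    tags: tuple = ()
    version: int = 1
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    history: tuple = ()             # (version, prompt, expected) of earlier versions


def edit_item(item: DatasetItem, **changes) -> DatasetItem:
    """Return a new version of the item and record the old content in history."""
    previous = (item.version, item.prompt, item.expected)
    return replace(item, **changes,
                   version=item.version + 1,
                   history=item.history + (previous,))


v1 = DatasetItem("jb-001", "original prompt", "refuse", "alice", "test",
                 tags=("safety-category:jailbreak",))
v2 = edit_item(v1, expected="refuse and explain")
print(v2.version, v2.history)
```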

Canary string: A unique token sequence embedded in eval data to detect whether the data appears in a model's training corpus.
Saturation: When a model achieves near-perfect scores on a benchmark, making it unable to distinguish between model versions — the benchmark has lost discriminative power.
Inter-annotator agreement: A statistical measure of how consistently different annotators assign the same label to the same item. Cohen's κ is the most common metric.
Distributional mismatch: A gap between the distribution of eval prompts and real production traffic, making eval scores a poor predictor of real-world performance.

Lesson 2 Quiz

Dataset Construction and Curation
What mechanism did the BIG-Bench benchmark introduce to address training data contamination?
Correct. BIG-Bench introduced canary strings — unique token sequences embedded in the benchmark data — so practitioners could grep model training corpora to check for contamination.
Not quite. BIG-Bench's specific contamination mechanism was the canary string: a distinctive token sequence that makes it possible to detect whether test data ended up in a training corpus.
A model scores 97% on your eval dataset, and the score is identical across the last four model versions. What does this indicate?
Correct. A near-perfect score that doesn't vary across versions means the dataset has saturated — it has lost discriminative power and needs harder examples to provide useful signal.
Not quite. When scores are very high and don't vary across model versions, this is a saturation problem — the eval can no longer tell the versions apart. It needs harder items.
What is the standard minimum acceptable Cohen's κ for inter-annotator agreement in high-stakes eval datasets?
Correct. Cohen's κ ≥ 0.6 is the widely-cited threshold for "substantial agreement" — below this, annotation inconsistency introduces enough noise to undermine the reliability of the eval dataset's ground truth labels.
Not quite. The standard threshold for acceptable inter-annotator agreement (Cohen's κ) in high-stakes eval annotation work is 0.6.
According to Anthropic's publicly described approach, how frequently are red-team eval datasets updated?
Correct. Anthropic has described updating safety red-team datasets quarterly — the rationale being that a static dataset becomes stale as models learn to handle the specific attack patterns it contains.
Not quite. Anthropic's described cadence is quarterly updates, driven by the insight that models adapt to known attack patterns and static eval datasets lose signal over time.
What is the primary remedy for a distributional mismatch between your eval dataset and production traffic?
Correct. The most direct fix for distributional mismatch is grounding the eval dataset in real production traffic — sampling actual user prompts (with PII removed) to ensure the eval reflects what the model encounters in the real world.
Not quite. Distributional mismatch is best addressed by sourcing eval items from actual production logs — making the eval distribution match the deployment distribution.

Lab 2: Dataset Audit and Repair

Conversation practice · minimum 3 exchanges to complete

Your Scenario

You've inherited an eval dataset for a legal document classification model. The dataset was built 18 months ago by a contractor. Initial audit reveals: 450 items total, no tagging, no version history, a 98% pass rate on your current model, and the original prompts were sourced from a legal Q&A website that was scraped into a popular open LLM training set.

Diagnose what's wrong with this dataset and propose a remediation plan, component by component.

Start by diagnosing which of the three failure modes (saturation, contamination, distributional mismatch) are present in this dataset, and explain your reasoning for each.
Module 7 · Lesson 3

Scoring Systems and LLM-as-Judge

Choosing the right measurement tool for each evaluation task
When should you trust a model to grade another model — and when is that a catastrophic idea?

In June 2023, researchers at UC Berkeley published MT-Bench — a multi-turn conversational benchmark evaluated entirely by GPT-4 as the judge. The paper, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," measured not just model quality but judge quality: how well GPT-4's ratings correlated with human expert ratings across eight task categories. The finding was striking: GPT-4's agreement with human experts (85.1% average) exceeded the agreement between two different groups of human experts (81.2%). The paper also documented systematic biases — GPT-4 exhibited verbosity bias (preferring longer responses) and self-enhancement bias (rating GPT-4 outputs higher than equally good outputs from other models).

MT-Bench established that LLM-as-judge is viable at scale — but only when its biases are measured, documented, and mitigated.

The Scoring Spectrum

Scoring methods exist on a spectrum from fully deterministic to fully model-based. The choice depends on the task type, the cost of false positives, and whether reference answers exist.

Most Deterministic
Exact Match / Regex
Binary: the output either contains the expected string or it doesn't. Zero ambiguity. Best for: closed-form answers, specific entity extraction, code that must produce exact output. Limitation: fragile to paraphrase and formatting variation.
Reference-Based
BLEU / ROUGE / BERTScore
Measures overlap between output and a reference answer. BLEU: n-gram precision (machine translation). ROUGE: recall-oriented (summarization). BERTScore: semantic similarity via embeddings. Better than exact match for natural language; worse than human judgment for nuance.
Execution-Based
Code Execution / Unit Tests
Run generated code and check functional correctness. Used by HumanEval (OpenAI, 2021) and MBPP. Pass@k metric: does at least one of k samples solve the problem? Highly reliable for code; requires a sandbox execution environment.
Model-Based
LLM-as-Judge
A separate LLM scores outputs on defined criteria. Flexible for open-ended generation, multi-dimensional quality. Requires: structured output format, calibration against human labels, documented bias audit, versioned judge model. Most expensive; highest coverage.
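
For the pass@k metric mentioned under execution-based scoring, the HumanEval paper describes an unbiased estimator: draw n samples per problem, count the c that pass, and compute 1 − C(n−c, k) / C(n, k). A minimal sketch of that formula for a single problem (the function name is an assumption):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples of which c passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

The benchmark-level score is the mean of this value over all problems.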
Designing an LLM-as-Judge System

Single-answer grading vs. pairwise comparison. Single-answer grading asks the judge to score one response on a rubric (e.g., 1–5 on accuracy, helpfulness, safety). Pairwise comparison shows the judge two responses and asks which is better. Pairwise comparison generally produces more reliable relative rankings; single-answer grading is more useful for tracking absolute quality over time.

Structured output is mandatory. A judge prompt that asks for free-text reasoning followed by a score is unreliable at scale — parsing errors accumulate. The industry standard (used by both MT-Bench and Anthropic's eval infrastructure) is to require JSON output with explicit field names and a chain-of-thought reasoning field before the score field. This forces the model to reason before committing, improving calibration.

Position bias mitigation. In pairwise comparisons, LLM judges can show a significant preference for the response that appears first in the prompt, regardless of its content. The standard mitigation is to run each pair twice with positions swapped and combine the two verdicts, discarding pairs where the judge contradicts itself.
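
The swap-and-compare step can be sketched as below. The `judge` callable is hypothetical: it is assumed to take a query and two responses and return "A" if it prefers the first-shown response or "B" if it prefers the second.

```python
def debiased_verdict(judge, query, resp_1, resp_2):
    """Return 'first', 'second', or 'inconsistent' for the pair (resp_1, resp_2)."""
    forward = judge(query, resp_1, resp_2)    # resp_1 shown in position A
    backward = judge(query, resp_2, resp_1)   # resp_1 shown in position B
    if forward == "A" and backward == "B":
        return "first"
    if forward == "B" and backward == "A":
        return "second"
    return "inconsistent"  # verdict followed position, not content


# Stub judges for illustration only.
def always_first(query, a, b):
    return "A"


def prefers_longer(query, a, b):
    return "A" if len(a) >= len(b) else "B"


print(debiased_verdict(always_first, "q", "x", "y"))
print(debiased_verdict(prefers_longer, "q", "a longer answer", "short"))
```

The share of "inconsistent" verdicts is itself worth reporting, since it estimates how strongly position drives the judge.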

    # LLM-as-Judge prompt template (structured output pattern)
    # Literal JSON braces are doubled so JUDGE_PROMPT.format(query=..., response=...) works.
    JUDGE_PROMPT = """
    You are evaluating an AI assistant's response to the following query.

    Query: {query}
    Response: {response}

    Score the response on each dimension from 1-5:
    - accuracy: Does it state only true, verifiable facts?
    - helpfulness: Does it directly address what was asked?
    - safety: Does it avoid harmful or misleading content?

    Respond ONLY with valid JSON:
    {{
      "reasoning": "...",
      "accuracy": <int 1-5>,
      "helpfulness": <int 1-5>,
      "safety": <int 1-5>
    }}
    """
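
Judge output still has to be validated before it reaches the aggregator. A minimal sketch of a parser for the schema in the template above; `parse_judge_output` is a hypothetical helper, and returning `None` for malformed output is one possible policy (retrying or logging are others).

```python
import json

DIMENSIONS = ("accuracy", "helpfulness", "safety")


def parse_judge_output(raw: str):
    """Return {dimension: score} if the output is valid, otherwise None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not isinstance(data.get("reasoning"), str):
        return None
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            return None
        scores[dim] = value
    return scores


good = '{"reasoning": "Accurate and direct.", "accuracy": 5, "helpfulness": 4, "safety": 5}'
print(parse_judge_output(good))
print(parse_judge_output("Score: 4/5"))
```

Counting how often the parser returns `None` gives a direct measure of judge format reliability.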
Known Judge Biases and Mitigations

The MT-Bench paper catalogued several systematic biases that appear consistently across LLM judges. Being aware of them is not sufficient — mitigation must be built into the scoring infrastructure.

Bias Type
Verbosity Bias
Judge prefers longer responses regardless of content quality. Mitigation: normalize score by response length; add explicit rubric instruction: "Length alone does not indicate quality."
Bias Type
Self-Enhancement Bias
A judge model rates outputs from its own model family higher. Mitigation: use a different judge model than the model under evaluation; cross-validate with multiple judge models.
Bias Type
Position Bias
In pairwise, judge favors the first-presented response. Mitigation: run each comparison twice with swapped positions; discard contradictory results.
Bias Type
Sycophancy Bias
Judge rewards responses that match its own style or "agree" with implicit preferences. Mitigation: calibrate judge against human gold labels on a held-out set before deploying at scale.
When NOT to Use LLM-as-Judge

For safety-critical evaluations — detecting harmful outputs, measuring refusal rates, testing for jailbreak success — LLM-as-judge introduces unacceptable risk of false negatives. A judge model can be manipulated by the same adversarial prompts that fool the primary model. Safety evals should use deterministic classifiers (fine-tuned classifiers on labeled harm data), human review, or both.

Pass@k: A code evaluation metric: the probability that at least one of k model-generated samples for a given problem is functionally correct. Used by HumanEval and related benchmarks.
Verbosity bias: The tendency of LLM judges to rate longer responses more favorably, independent of actual quality — documented in the MT-Bench study.
Position bias: In pairwise LLM-as-judge evaluations, the tendency to prefer the response presented first in the prompt.
Self-enhancement bias: An LLM judge's tendency to rate outputs from its own model family higher than objectively equivalent outputs from other models.

Lesson 3 Quiz

Scoring Systems and LLM-as-Judge
In the MT-Bench study, GPT-4 as judge showed 85.1% agreement with human experts. What was the human expert agreement with each other?
Correct. The MT-Bench paper found that GPT-4's 85.1% agreement with human experts actually exceeded the 81.2% agreement between two separate groups of human experts — a key finding supporting the viability of LLM-as-judge methodology.
Not quite. Human-expert-to-human-expert agreement was 81.2% — lower than GPT-4's 85.1% agreement with humans. This was a central finding of the MT-Bench paper.
What is the standard mitigation for position bias in pairwise LLM-as-judge comparisons?
Correct. The standard mitigation is to run each comparison twice with positions swapped and average — and to discard pairs where the judge contradicts itself (giving inconsistent verdicts across the two orderings).
Not quite. The standard mitigation for position bias is to run each comparison twice with the response order swapped, then average results and flag contradictions.
Which scoring method is most appropriate for evaluating code generation tasks?
Correct. Execution-based scoring — running the generated code against unit tests and measuring pass@k — is the gold standard for code evaluation. It measures functional correctness directly, not surface similarity to a reference solution.
Not quite. For code generation, execution-based scoring (pass@k) is the most reliable method — it tests whether the code actually works, not how similar it looks to a reference.
Why should LLM-as-judge NOT be used as the primary scorer for safety evaluations?
Correct. The core vulnerability is that an LLM judge shares the same attack surface as the primary model — adversarial prompts designed to bypass safety guardrails can also fool the judge into rating unsafe outputs as safe.
Not quite. The critical problem is that adversarial inputs that successfully bypass the primary model's safety filters often also fool the LLM judge — making it an unreliable detector of exactly the failures you most need to catch.
What is the primary advantage of requiring structured JSON output from an LLM judge over free-text scoring?
Correct. Structured JSON output with a required reasoning field before the score field prevents parsing errors that accumulate at scale, and the chain-of-thought pattern (reason first, then score) improves calibration by forcing deliberation.
Not quite. The two key benefits of structured JSON are: (1) eliminating parsing failures at scale, and (2) a reasoning field before the score field forces the model to deliberate before committing, improving calibration.

Lab 3: Designing a Judge Prompt

Conversation practice · minimum 3 exchanges to complete

Your Scenario

Your team evaluates a customer service chatbot that handles billing disputes, technical support, and account management. You need to design an LLM-as-judge scoring system for response quality. The judge must score on accuracy, tone (professional but empathetic), and resolution effectiveness.

You must also identify which interactions should NOT use LLM-as-judge at all, and design mitigations for at least two known judge biases.

Start by drafting the JSON output schema for your judge prompt. What fields do you need, and why does the reasoning field need to come before the score fields?
Module 7 · Lesson 4

Regression Detection and CI/CD Integration

Making evaluation blocking, automated, and actionable in production deployment workflows
How do you turn eval results from a report into a deployment decision?

In July 2023, a widely discussed paper from Stanford and UC Berkeley documented that GPT-4's performance on specific tasks had measurably changed between its March and June 2023 versions. On an "is this number prime?" classification task, GPT-4's accuracy fell from 97.6% to 2.4% over that period. On a code generation task, the percentage of directly executable code dropped from 52% to 10%. OpenAI did publish dated snapshots (such as gpt-4-0314), but the unversioned "gpt-4" alias was repointed to the newer snapshot, so applications that called the alias saw behavior change under an identifier that looked the same.

The incident — later disputed by OpenAI in terms of its interpretation but never fully resolved — illustrated two non-negotiable pipeline requirements: model versioning must be explicit in eval records, and continuous monitoring must run against pinned model checkpoints, not floating aliases.

Regression Detection Fundamentals

A regression is a measurable decline in a metric relative to a defined baseline — usually the last released model version or a pinned "golden" checkpoint. Detecting regressions requires three things: a baseline score stored with full metadata (model version, eval dataset version, date), a current score on the same dataset version, and a statistical test to determine whether the difference is meaningful.

The naive approach of flagging any negative delta produces excessive false alarms. A one-point accuracy drop on a 100-item dataset is a single item flipping from pass to fail, which is almost certainly within sampling noise. The standard approach is to use McNemar's test for paired binary outcomes or a bootstrap confidence interval for continuous metrics, and to require both statistical significance and a minimum absolute delta (e.g., ≥2 percentage points AND p < 0.05).
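
The combined rule can be sketched with an exact McNemar test, which looks only at items whose outcome differs between the two model versions. The function names and the example data are hypothetical; the 2-point and 0.05 thresholds are the example values from the text.

```python
from math import comb


def mcnemar_exact(baseline, candidate):
    """Two-sided exact McNemar p-value for paired pass/fail outcomes."""
    if len(baseline) != len(candidate):
        raise ValueError("outcome lists must be paired item by item")
    lost = sum(b and not c for b, c in zip(baseline, candidate))    # pass -> fail
    gained = sum(c and not b for b, c in zip(baseline, candidate))  # fail -> pass
    n = lost + gained
    if n == 0:
        return 1.0
    # Under the null hypothesis each discordant item is a fair coin flip.
    tail = sum(comb(n, i) for i in range(min(lost, gained) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def is_regression(baseline, candidate, min_delta=0.02, alpha=0.05):
    """Flag only drops that are both large enough and statistically significant."""
    delta = (sum(baseline) - sum(candidate)) / len(baseline)
    return delta >= min_delta and mcnemar_exact(baseline, candidate) < alpha


# 100 items: baseline passes 80; candidate loses 12 of those and gains 2 others.
baseline = [True] * 80 + [False] * 20
candidate = [False] * 12 + [True] * 68 + [True] * 2 + [False] * 18
print(mcnemar_exact(baseline, candidate), is_regression(baseline, candidate))
```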

Statistical Significance in Eval Results

Eval datasets are finite samples. A dataset of 200 items has a standard error on a 75% pass rate of approximately ±3 percentage points. This means a measured decline from 75% to 73% is not statistically distinguishable from noise at conventional significance levels. Teams that don't account for this end up either ignoring real regressions (threshold too high) or blocking good releases on noise (threshold too low).

Best practice, used in production eval systems at companies including Cohere and AI21 Labs, is to:

1. Set primary alert thresholds at 2–3× the standard error of the metric.

2. Require regression confirmation on a separate holdout slice before blocking deployment.

3. Track multiple metrics simultaneously — a model that regresses on one metric while improving on two others may still represent a net improvement.
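
The standard-error arithmetic behind the first rule is short enough to check directly. A minimal sketch using the binomial standard error of a pass rate; the function names and the default multiplier of 2 are assumptions for illustration.

```python
from math import sqrt


def pass_rate_standard_error(pass_rate: float, n_items: int) -> float:
    """Binomial standard error of a pass rate measured on n_items."""
    return sqrt(pass_rate * (1 - pass_rate) / n_items)


def alert_threshold(pass_rate: float, n_items: int, multiplier: float = 2.0) -> float:
    """Smallest drop worth alerting on, as a multiple of the standard error."""
    return multiplier * pass_rate_standard_error(pass_rate, n_items)


se = pass_rate_standard_error(0.75, 200)
threshold = alert_threshold(0.75, 200)
print(round(se, 4), round(threshold, 4))
```

With 200 items at a 75% pass rate the standard error is about 3 points, so a 2-point drop falls below a 2× threshold of roughly 6 points. Because the standard error shrinks with the square root of the item count, quadrupling the dataset to 800 items only halves it.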

CI/CD Integration Patterns

The standard integration pattern for LLM eval in a CI/CD pipeline uses three gates at different stages of the deployment pipeline:

PR Merge (fast eval) → Staging Deploy (full eval suite) → Canary Release (online eval) → Full Production (monitoring)

Gate 1 — PR merge (fast eval): A lightweight eval suite running 50–100 high-signal items with deterministic scoring. Target: under 2 minutes. Catches obvious regressions before code merges. Used by teams at Anthropic (described in their model card methodology) and widely adopted in open-source LLM development.

Gate 2 — Staging deploy (full eval suite): The complete eval pipeline including LLM-as-judge scoring. Target: under 30 minutes. This gate sees every eval dimension and produces the formal regression report. Blocks staging-to-production promotion if regressions exceed thresholds.

Gate 3 — Canary + production monitoring: Online eval on live traffic, comparing the new model version against the baseline on a sample of real requests (with consent and PII controls). This catches distributional shift — regressions on query types that don't appear in offline eval datasets.
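
Whatever the stage, a gate reduces to a script that compares current scores with the baseline and returns a non-zero exit code on failure, which is what a CI runner acts on. A minimal sketch with hypothetical metric names and thresholds:

```python
import sys


def gate(current: dict, baseline: dict, max_drop: dict) -> list:
    """Return failure messages; an empty list means the gate passes."""
    failures = []
    for metric, allowed in max_drop.items():
        drop = baseline[metric] - current[metric]
        if drop > allowed:
            failures.append(f"{metric}: dropped {drop:.3f} (allowed {allowed:.3f})")
    return failures


def main(current: dict, baseline: dict, max_drop: dict) -> int:
    """Print any failures to stderr and return a process exit code."""
    failures = gate(current, baseline, max_drop)
    for line in failures:
        print("EVAL GATE FAILED", line, file=sys.stderr)
    return 1 if failures else 0


baseline_scores = {"accuracy": 0.85, "safety": 0.99}
current_scores = {"accuracy": 0.80, "safety": 0.99}
limits = {"accuracy": 0.02, "safety": 0.0}
exit_code = main(current_scores, baseline_scores, limits)
# In a CI job the script would end with: sys.exit(exit_code)
```

In practice the score dictionaries would be loaded from the JSON artifacts written by the aggregator rather than defined inline.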

Real Implementation — Weights & Biases Evaluations

Weights & Biases launched its LLM evaluation feature in 2023 with explicit support for the CI gate pattern: eval runs are tracked as W&B runs with model version metadata, and the W&B SDK provides a native GitHub Actions integration that reads eval results and returns a pass/fail exit code to the CI runner. This is the pattern many teams adopt for integrating eval into existing MLOps infrastructure without building custom tooling.

Alerting and Escalation Design

Not all regressions are equal. A 5% regression on "formatting correctness" in a summarization model is operationally different from a 5% regression on "refusal rate for harmful requests" in a safety-critical deployment. Alerting systems must map metric categories to severity levels and escalation paths.

The practical standard is a three-tier alert system: informational (logged, no action required — small metric movements within noise), warning (Slack notification to the model team — significant movement requiring investigation), and blocking (deployment halt — regression on safety or core-capability metrics above defined thresholds). The mapping of metrics to tiers must be documented and reviewed at least quarterly.
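One way to make the metric-to-tier mapping reviewable is to keep it as data rather than scattering thresholds through alerting code. The sketch below assumes three hypothetical metric categories and made-up thresholds; the tier names follow the three-tier system described above, but the numbers are placeholders that a team would set and review for its own deployment.

```python
from enum import Enum


class Tier(Enum):
    INFORMATIONAL = 1  # logged only
    WARNING = 2        # notify the model team
    BLOCKING = 3       # halt deployment


# Hypothetical policy: category -> (warning threshold, blocking threshold),
# both in percentage points of regression. None means "never blocks".
POLICY = {
    "safety": (0.5, 1.0),
    "core_capability": (1.0, 3.0),
    "formatting": (3.0, None),
}


def classify(category, regression_pp):
    """Map a regression of `regression_pp` points in `category` to a tier."""
    warn_at, block_at = POLICY[category]
    if block_at is not None and regression_pp >= block_at:
        return Tier.BLOCKING
    if regression_pp >= warn_at:
        return Tier.WARNING
    return Tier.INFORMATIONAL
```

Under this illustrative policy, the same 5-point regression produces a warning for a formatting metric and a deployment halt for a safety metric, which is the distinction the tier system exists to encode.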

Model Version Pinning

The GPT-4 capability regression incident highlighted that floating model aliases — identifiers that map to different underlying checkpoints over time — are incompatible with reproducible eval pipelines. Every eval record must store the precise model version string, the API endpoint, any system prompt version, and the temperature/sampling parameters used. These four pieces together define a "model configuration" — and regressions can only be meaningfully attributed to model changes if the configuration is fully pinned.
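A pinned configuration can be represented as an immutable record stored alongside every eval result. The sketch below is one possible shape, assuming hypothetical field names and a short SHA-256 fingerprint as the record key; the four fields mirror the four pieces listed above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelConfig:
    """The four pieces that define a model configuration (names are assumptions)."""
    model_version: str          # exact version string, never a floating alias
    api_endpoint: str
    system_prompt_version: str
    sampling: tuple             # e.g. (("temperature", 0.0), ("top_p", 1.0))

    def fingerprint(self):
        """Stable short hash to store with every eval record."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def changed_fields(old, new):
    """List which configuration fields differ between two eval runs."""
    a, b = asdict(old), asdict(new)
    return sorted(key for key in a if a[key] != b[key])
```

If `changed_fields` returns anything other than `["model_version"]`, a score difference between the two runs cannot be attributed to the model alone, because the prompt, endpoint, or sampling settings also moved.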

McNemar's test: A statistical test for paired binary outcomes, appropriate for comparing two model versions on the same set of binary (pass/fail) eval items.
Bootstrap confidence interval: A resampling method for estimating the uncertainty in an eval metric; it provides confidence bounds without assuming a specific distribution.
Canary release: A deployment pattern where a new model version receives a small percentage of production traffic (e.g., 5%) while online metrics are monitored before full rollout.
Floating alias: A model identifier (e.g., "gpt-4") that maps to different underlying checkpoints over time. It is incompatible with reproducible eval comparison because it makes it impossible to determine which model produced historical scores.
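The two statistical terms above can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions: it uses the exact (binomial) form of McNemar's test, which looks only at items where the two versions disagree, and a simple percentile bootstrap. The function names and default parameters are illustrative, and a production pipeline would more likely rely on an established statistics library.

```python
import random
from math import comb


def mcnemar_exact(old_pass, new_pass):
    """Two-sided exact McNemar p-value for paired pass/fail results."""
    # b: items the old version passed and the new one failed; c: the reverse.
    b = sum(1 for o, n in zip(old_pass, new_pass) if o and not n)
    c = sum(1 for o, n in zip(old_pass, new_pass) if n and not o)
    n = b + c
    if n == 0:
        return 1.0  # the versions never disagree
    # Under the null hypothesis, each disagreement is a fair coin flip.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a pass rate."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Both functions take per-item boolean results, so they can run directly on the paired pass/fail records that a pinned eval run already stores.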

Lesson 4 Quiz

Regression Detection and CI/CD Integration
In the Stanford/UC Berkeley study on GPT-4 capability changes (preprint posted July 2023), what happened to accuracy on the prime number classification task between the March and June 2023 versions?
Correct. The paper documented an extraordinary drop from 97.6% to 2.4% on prime number classification — one of the most dramatic specific examples of model capability drift documented in that period, illustrating why pinned model versioning is essential.
Not quite. The paper documented a dramatic decline from 97.6% to 2.4% accuracy on the prime classification task — a 95+ percentage point drop that highlighted the risks of floating model aliases and lack of version pinning.
Why is flagging any negative metric delta a poor approach to regression detection?
Correct. A 200-item dataset with a 75% pass rate has a standard error of about ±3 percentage points. A measured drop of 2 percentage points is well within this noise range, so flagging it as a regression would produce constant false alarms and erode trust in the alerting system.
Not quite. The problem with flagging any negative delta is statistical: small eval datasets have meaningful sampling noise, so tiny negative movements are often indistinguishable from random variation and do not indicate a genuine capability regression.
In a three-gate CI/CD eval pipeline (PR merge → staging → canary), what is the target runtime for the Gate 1 (PR merge) fast eval?
Correct. The PR-merge fast eval targets under 2 minutes, using a small set of high-signal, deterministically scored items. If it takes longer, developers will disable or bypass it. Gate 2 (staging) can afford the full 30-minute suite since it runs less frequently.
Not quite. Gate 1 (PR merge fast eval) must complete in under 2 minutes to be viable in a normal development workflow. Anything slower will be disabled or ignored by the development team.
What four pieces of information together define a "model configuration" for reproducible eval records?
Correct. A reproducible eval requires pinning all four: the exact model version string (not a floating alias), the API endpoint, the system prompt version (which affects behavior significantly), and the sampling parameters (temperature, top-p, seed).
Not quite. The four components of a pinned "model configuration" are: the precise model version string, the API endpoint, the system prompt version, and the temperature/sampling parameters. Together they uniquely identify what generated the outputs.
In a three-tier alert system for eval regressions, which tier should a 5% regression in "refusal rate for harmful requests" trigger?
Correct. Safety metrics like refusal rates for harmful requests should map to the blocking tier — a regression here represents a direct increase in harm risk and is categorically different from a regression in formatting quality or stylistic metrics.
Not quite. A regression in harmful request refusal rate is a safety-critical metric and must map to the blocking tier — this is exactly the type of metric that justifies a hard deployment halt regardless of improvements elsewhere.

Lab 4: Building a Regression Detection Spec

Conversation practice · minimum 3 exchanges to complete

Your Scenario

You are the ML infrastructure lead at a healthcare information company. Your AI assistant answers patient questions about symptoms, medications, and care instructions. The model is updated bi-weekly. You need to design a regression detection and CI/CD gating specification before the next update ships.

You must define: your three CI gates and their pass/fail criteria, your three-tier alert system with metric-to-tier mappings, and how you will handle floating alias risk given you use an API-based model provider.

Start with the most safety-critical component: your blocking gate criteria. What specific metrics, thresholds, and statistical tests define a "do not ship" decision for a healthcare AI assistant?
CI/CD Gating Mentor: You're designing a regression detection specification for a healthcare AI assistant, a high-stakes deployment where false negatives (missed regressions) can cause real patient harm. Let's start with your blocking gate criteria. What metrics, at what thresholds, with what statistical tests, should constitute a hard "do not ship" decision?

Module 7 Test

Building an Eval Pipeline — 15 questions · 80% to pass
1. Which component of an eval pipeline is responsible for storing versioned prompt/expected-output pairs with provenance metadata?
Correct. The Dataset Registry is the versioned, provenance-tracked store of prompt/expected-output pairs — the foundation that all other pipeline components depend on.
Incorrect. The Dataset Registry stores versioned eval items with provenance. Review Module 7 Lesson 1.
2. Google Bard's February 2023 launch error — incorrectly stating which telescope first imaged an exoplanet — erased approximately how much market value from Alphabet?
Correct. The Bard demo error erased roughly $100 billion in Alphabet market cap within 48 hours — the landmark case illustrating the cost of inadequate pre-deployment factual eval coverage.
Incorrect. The Bard error erased approximately $100 billion from Alphabet's market cap. Review Lesson 1.
3. What innovation did BIG-Bench introduce specifically to address training data contamination?
Correct. BIG-Bench introduced canary strings — distinctive token sequences that let practitioners check whether benchmark data appears in training corpora.
Incorrect. BIG-Bench's contamination solution was canary strings. Review Lesson 2.
4. A model scores 98% on an eval dataset and scores haven't changed across five model versions. This describes which dataset failure mode?
Correct. When scores are consistently near-perfect and don't differentiate between model versions, the dataset has saturated — it has no remaining discriminative power.
Incorrect. Near-perfect, non-differentiating scores indicate saturation. Review Lesson 2's three failure modes.
5. In the MT-Bench study, which two systematic biases were documented in GPT-4 as a judge?
Correct. MT-Bench specifically documented verbosity bias (preferring longer responses) and self-enhancement bias (rating GPT-4 family outputs higher) as the two primary systematic biases.
Incorrect. MT-Bench documented verbosity bias and self-enhancement bias as the two main systematic biases in GPT-4 as judge. Review Lesson 3.
6. The Pass@k metric in code evaluation measures:
Correct. Pass@k is the probability that at least one of k sampled solutions is functionally correct — used by HumanEval and other code benchmarks to account for stochastic generation.
Incorrect. Pass@k measures whether at least one of k samples solves the problem — a probabilistic measure for stochastic code generation. Review Lesson 3.
7. What statistical test is appropriate for comparing two model versions on paired binary (pass/fail) eval items?
Correct. McNemar's test is specifically designed for paired binary outcomes — comparing two classifiers (or model versions) on the same set of items, where each item has a binary pass/fail outcome.
Incorrect. McNemar's test is the appropriate test for paired binary outcomes in model comparison. Review Lesson 4.
8. In a three-gate CI/CD eval pipeline, which gate runs the FULL eval suite including LLM-as-judge scoring?
Correct. Gate 2 (staging deploy) runs the complete eval suite, targeting under 30 minutes. Gate 1 runs a fast 50–100 item deterministic subset; Gate 3 runs online eval on live traffic.
Incorrect. The full eval suite including LLM-as-judge runs at Gate 2 (staging). Review Lesson 4's CI/CD integration patterns.
9. Which approach does OpenAI Evals use to define an evaluation?
Correct. OpenAI Evals uses YAML configuration files specifying the dataset path, the completion function (model + parameters), and the eval class (Match, ModelGradedEval, etc.).
Incorrect. OpenAI Evals uses YAML files to define evaluations declaratively. Review Lesson 1's reference implementation section.
10. The GPT-4 capability regression study highlighted what specific infrastructure requirement for reproducible eval comparison?
Correct. The inability to pin to specific checkpoints under a floating alias ("gpt-4") made it impossible to determine exactly which model version was responsible for observed capability changes.
Incorrect. The lesson was explicit model version pinning — not floating aliases. Review Lesson 4's model versioning section.
11. For which category of eval task should LLM-as-judge absolutely NOT be the primary scorer?
Correct. Safety evals must not rely primarily on LLM-as-judge because the judge shares the same adversarial attack surface as the primary model — adversarial prompts that bypass safety filters can also fool the judge.
Incorrect. Safety evaluations (jailbreak detection, harm classification) are the category where LLM-as-judge is most dangerous. Review Lesson 3's callout on when NOT to use LLM-as-judge.
12. What is the minimum acceptable Cohen's κ for inter-annotator agreement in high-stakes eval dataset annotation?
Correct. Cohen's κ ≥ 0.6 ("substantial agreement") is the standard threshold — below this, annotation inconsistency introduces enough noise to make ground truth labels unreliable as eval benchmarks.
Incorrect. The standard threshold for acceptable inter-annotator agreement (Cohen's κ) is 0.6. Review Lesson 2.
13. A "floating alias" (like "gpt-4") in model versioning is problematic for eval pipelines because:
Correct. When the model behind an alias changes without notification, you cannot determine whether a score change in your eval reflects a genuine capability shift or a provider-side model update — destroying reproducibility.
Incorrect. Floating aliases break eval reproducibility because the underlying model can change silently. Review Lesson 4's model version pinning section.
14. What is the primary purpose of including a "reasoning" field BEFORE the score fields in an LLM-as-judge JSON output schema?
Correct. Requiring the reasoning field first implements a chain-of-thought pattern — the model must articulate its evaluation reasoning before committing to numerical scores, which consistently improves calibration against human ratings.
Incorrect. The reasoning-first pattern forces deliberation before commitment, improving score calibration. Review Lesson 3's scoring system design section.
15. In a three-tier alert system for eval regressions, which regression should trigger a deployment halt (blocking tier)?
Correct. Refusal rate for harmful requests is a safety-critical metric that maps to the blocking tier. A regression here directly increases the probability of harm to users and warrants an unconditional deployment halt.
Incorrect. Safety metrics like refusal rates for harmful content must map to the blocking tier — they represent direct harm risk, not quality-of-life degradation. Review Lesson 4's alerting design section.