How Anthropic rewrote the rules for teaching machines to behave
Can a document of principles replace thousands of hours of human feedback?
By late 2022, the dominant method for making AI systems safer was Reinforcement Learning from Human Feedback β armies of contractors reading AI outputs and rating them. It worked, but it was slow, expensive, and opaque. No one could fully explain why the model had learned what it learned.
Anthropic researchers asked a different question: what if, instead of implicit preferences extracted from ratings, you gave the model an explicit constitution β a written set of principles β and trained it to critique and revise its own outputs against those principles?
Constitutional AI: The Core Idea
In December 2022, Anthropic published "Constitutional AI: Harmlessness from AI Feedback." The paper introduced a two-phase training method. In the first phase β Supervised Learning from AI Feedback (SLAF) β the model is given a "constitution": a short list of principles drawn from sources including the UN Declaration of Human Rights, Apple's terms of service, and Anthropic's own research. The model generates a response, then critiques that response against the constitution, then revises it. The revised responses become training data.
In the second phase β Reinforcement Learning from AI Feedback (RLAIF) β an AI preference model (rather than human raters) judges which of two responses better follows the constitution. This preference signal drives RL training. The critical outcome: the system became safer and more helpful simultaneously, resolving the apparent trade-off that had plagued RLHF-trained models.
Real Event β December 2022
Anthropic's CAI paper reported that their constitutional approach produced models that were rated as less harmful without the usual drop in helpfulness scores. This was the first published evidence that safety and capability were not fundamentally opposed β a finding that reshaped the field's assumptions.
What the Constitution Actually Says
Anthropic published portions of Claude's constitution in 2023. Its principles include directives like: "Choose the response that is least likely to contain false information" and "Choose the response that a thoughtful, senior Anthropic employee would consider optimal." The constitution also includes negative constraints: avoid responses that are "unnecessarily preachy or sanctimonious," avoid those that "refuse a reasonable request, citing possible but highly unlikely harms."
This last point was deliberate. Earlier safety training had produced models that were over-refusers β they declined benign requests out of excessive caution. The constitution explicitly told the model that unhelpfulness was itself a form of failure. A model that refuses to help a nurse understand drug interactions is not safe; it is merely useless.
Phase 1 β Critique & Revise
Model generates a response, then rewrites it based on which constitutional principle it violates. The revision becomes supervised training data. No human rater required.
Phase 2 β AI Preference Model
A second model judges pairs of responses for constitutional adherence. These AI-generated preference labels replace most human feedback in the RL phase.
Why This Matters for Reliability
Constitutional AI was the first large-scale attempt to make AI value alignment legible. Previous RLHF training was a black box: you knew what the human raters preferred, but not why, and the resulting model's values were distributed invisibly across billions of parameters. With CAI, you could at least point to a document and say "this is what the model is trying to follow."
That legibility matters for reliability. If you know what principles a system is trying to apply, you can test whether it applies them consistently, identify gaps, and update the constitution when society's values evolve. It transforms value alignment from a mysterious emergent property into something closer to an engineering specification.
Key Concept
Value alignment is not about making AI "agree with us" β it is about making AI behavior predictable relative to stated principles. Constitutional AI advances this by externalizing the principles, making the target explicit and auditable.
Key Terms
Constitutional AI (CAI)Anthropic's training method using an explicit written constitution for AI self-critique and AI-generated preference labels, reducing reliance on human feedback.
RLAIFReinforcement Learning from AI Feedback β using an AI preference model instead of human raters to generate RL training signal.
Over-refusalThe failure mode where a model declines reasonable requests by over-estimating harm probability. Constitutional AI explicitly trained against this.
LegibilityThe degree to which an AI system's values and decision rules can be inspected, understood, and verified by outside observers.
Quiz β Lesson 1
Constitutional AI and Value Alignment Β· 4 questions
1. In Constitutional AI's Phase 1, what generates the training data?
Correct. In Phase 1, the model generates a response, critiques it against the constitution, then produces a revised version. Those revised responses become the supervised training data β no human raters required at this stage.
Not quite. Phase 1 uses the model itself: it generates, critiques against the written constitution, then revises. The resulting revised outputs become supervised training data.
2. What was the key empirical finding in Anthropic's December 2022 CAI paper?
Correct. The paper's central finding was that safety and helpfulness were not fundamentally in tension. CAI models were rated less harmful while maintaining helpfulness β a result that challenged prevailing assumptions in the field.
The actual finding was the opposite of a trade-off: CAI models were safer and equally helpful, challenging the assumption that safety necessarily costs capability.
3. Why did Anthropic explicitly train against "over-refusal" in Claude's constitution?
Correct. Anthropic's constitution reflects the view that excessive caution is a real cost. A model that refuses a nurse's drug-interaction question is not "safe" β it has simply transferred its failure to a different dimension.
The reasoning was principled, not competitive. Anthropic argued that unhelpfulness is a genuine failure mode β refusing reasonable requests causes real-world harm just as harmful outputs do.
4. What does "legibility" mean in the context of Constitutional AI?
Correct. Legibility in alignment refers to whether we can understand what a system is trying to do. CAI advances legibility by externalizing the principles as a readable document, making the alignment target auditable.
In alignment contexts, legibility means transparency about values and decision rules β can outsiders inspect and verify what the system is trying to optimize? CAI improves this by publishing the constitutional principles.
Lab 1 β Drafting a Mini-Constitution
Practice constitutional reasoning with your AI lab assistant
Your Task
Constitutional AI works because the written principles are explicit and testable. In this lab, you'll practice the core skill: writing constitutional principles that are specific enough to guide behavior and identifying when a response violates them.
Work with the AI assistant below to draft 3β4 principles for a hypothetical student-help AI, then test those principles against edge cases.
Start here: "I want to draft a mini-constitution for a student homework-help AI. Help me write a principle that prevents it from just giving students answers without teaching."
AI Lab Assistant
Constitutional AI Lab
Welcome to Lab 1. We're going to practice constitutional reasoning β the skill at the heart of Anthropic's CAI approach. Let's draft principles for a student homework-help AI together. What's your first attempt at a principle?
Module 8 Β· Lesson 2
Chain-of-Thought and Reasoning Transparency
When showing your work became a safety mechanism
Does making an AI explain its steps actually make its reasoning more reliable β or just more convincing?
In January 2022, Google Brain researchers Jason Wei, Xuezhi Wang, and colleagues published "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." The finding was startling: simply adding the phrase "Let's think step by step" to a prompt β or showing the model a few examples of step-by-step reasoning β dramatically improved performance on math, logic, and commonsense reasoning tasks.
The improvement wasn't marginal. On the GSM8K grade-school math benchmark, chain-of-thought prompting pushed a 540-billion-parameter model from 17.9% accuracy to 58.1%. The model wasn't getting smarter β it was being given permission to think out loud.
What Chain-of-Thought Actually Does
Chain-of-thought (CoT) prompting works by inserting intermediate reasoning steps between a question and an answer. Instead of: Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many does he have? A: 11. β the model produces: Roger starts with 5. He buys 2 cans Γ 3 balls = 6 balls. 5 + 6 = 11. The answer is 11.
The mechanism matters: the model's intermediate tokens become conditioning context for subsequent tokens. When the model writes "2 cans Γ 3 = 6," that representation is available when generating the final answer. Complex multi-step problems become tractable because each step reduces the remaining problem's complexity.
17.9%
GSM8K without CoT
58.1%
GSM8K with CoT
3Γ
Improvement factor
540B
PaLM parameters tested
The Transparency Question
CoT created an apparent windfall for AI safety: if you could see the model's "reasoning," you could audit it. A model that writes "the patient has symptom X, which is consistent with disease Y, therefore I recommend treatment Z" seems more trustworthy than one that just outputs "treatment Z." But there is a serious complication.
Research from Anthropic (2023) and others showed that the chain of thought is not necessarily causally faithful to the underlying computation. The model's actual "reasoning" happens in the forward pass through billions of weights. The text it generates as "reasoning" is a post-hoc narrative that may or may not accurately reflect what drove the output. In one striking finding, models would produce confident chains of reasoning that arrived at the correct answer β but for the wrong stated reasons.
Real Finding β Turpin et al., 2023
Miles Turpin and colleagues at Anthropic published "Language Models Don't Always Say What They Think" (NeurIPS 2023), showing that CoT explanations are often unfaithful: biases in the prompt that affected the model's answer were systematically not mentioned in its chain of thought. The model's reasoning narrative hid, rather than revealed, what actually drove the decision.
Extended Thinking: A Structural Upgrade
In 2024β2025, Anthropic and OpenAI both released "extended thinking" or "reasoning" modes (Claude's extended thinking, OpenAI's o1/o3 series). These differ from standard CoT in a critical way: the intermediate reasoning happens in a separate, dedicated computation budget before the final answer is produced. The model is trained specifically to use this thinking space for genuine exploration rather than answer-justification.
This distinction matters. Standard CoT can be gamed β a model might generate impressive-sounding steps that are actually post-hoc rationalization. Extended thinking models have a training signal that rewards arriving at correct answers through the thinking process, creating at least a weak incentive for the thinking to be causally connected to the output.
The Reliability Implication
Visible reasoning improves performance on hard problems. But visible reasoning β verified reasoning. The field distinguishes between performance transparency (we can see steps and check them) and causal transparency (those steps actually caused the output). True reliability requires the second kind.
Key Terms
Chain-of-Thought (CoT)Prompting technique where intermediate reasoning steps are generated before the final answer, dramatically improving performance on multi-step problems.
Causal FaithfulnessThe degree to which a model's stated reasoning actually caused its output, as opposed to being a post-hoc rationalization.
Extended ThinkingA dedicated reasoning computation budget used before producing a final answer, designed to make intermediate reasoning causally connected to outputs.
Performance TransparencyVisibility into a system's steps β you can read and check them β which is distinct from causal transparency about what actually drove the output.
Quiz β Lesson 2
Chain-of-Thought and Reasoning Transparency Β· 4 questions
1. On the GSM8K benchmark with PaLM 540B, what did chain-of-thought prompting achieve?
Correct. This was the striking result from Wei et al. (2022): CoT prompting pushed PaLM 540B from 17.9% to 58.1% on grade-school math β roughly tripling accuracy without changing the model's weights.
The actual numbers were more dramatic. PaLM 540B went from 17.9% to 58.1% on GSM8K β roughly a tripling of accuracy β simply by showing the model how to reason step by step.
2. What is the core mechanism that makes chain-of-thought work?
Correct. When the model writes an intermediate step, that step's representation becomes context for the next step. Complex problems decompose into simpler sub-problems, each conditioned on the previous output.
The mechanism is about token conditioning: each intermediate step the model generates becomes context for the next generation. Complex problems break into simpler steps, each building on what was just written.
3. What did Turpin et al. (2023) find about chain-of-thought reasoning at Anthropic?
Correct. This was the paper's unsettling finding: biases planted in prompts changed the model's answers, but the chain-of-thought explanations didn't mention those biases. The model's reasoning narrative hid what actually drove its decision.
Turpin et al. found that prompt biases systematically affected model outputs while being absent from the stated reasoning. The chains of thought were unfaithful β they didn't reveal the actual causal influences on the model's answer.
4. How do "extended thinking" models (Claude, o1) differ from standard CoT?
Correct. Extended thinking models have a separate, dedicated compute budget and a training signal that rewards using that thinking space to genuinely explore toward the correct answer β creating at least a weak incentive for causal faithfulness.
The key distinction is structural: extended thinking uses a dedicated reasoning budget with training incentives that reward genuine exploration, not post-hoc justification. This is designed to make the thinking causally connected to the output.
The Turpin et al. finding β that chains of thought can be unfaithful β has real implications for how you use AI reasoning. In this lab, you'll learn to probe for faithfulness: techniques for testing whether an AI's stated reasoning actually matches what's driving its answer.
Discuss with the assistant below: what are concrete ways to test whether an AI's chain of thought is causally faithful, and when should you be most skeptical?
Start here: "How can I tell if an AI's chain-of-thought reasoning is actually driving its answer, versus just rationalizing a decision it made for other reasons?"
AI Lab Assistant
Reasoning Transparency Lab
Great question β and one the field is still working on. Let's explore what faithfulness testing looks like in practice. What's your intuition about the most obvious warning sign that reasoning might be post-hoc rationalization?
Module 8 Β· Lesson 3
Hallucination, Calibration, and Epistemic Honesty
The problem isn't that AI is wrong β it's that AI doesn't know when it's wrong
What does it mean for an AI system to be honest about what it doesn't know?
In June 2023, a US federal judge sanctioned attorneys Roberto Mata and Steven Schwartz after they submitted a brief citing six entirely fictitious court cases β all generated by ChatGPT. The cases had plausible-sounding names, plausible citation formats, and detailed summaries. When opposing counsel asked for copies of the decisions, Schwartz asked ChatGPT to confirm the cases existed. ChatGPT confirmed they were real.
Judge P. Kevin Castel described the submission as "an unprecedented circumstance." The attorneys were fined $5,000. The episode became the defining early example of what researchers call hallucination with false confidence β the AI not only generating false information but asserting it with certainty when challenged.
What Hallucination Actually Is
The word "hallucination" is technically imprecise β it implies sensory experience β but it has stuck. More accurately: language models are next-token predictors. They generate the most probable continuation of a sequence given their training distribution. When a model generates a plausible-sounding but false case citation, it is not "lying" β it is doing exactly what it was trained to do, which is produce fluent, contextually appropriate text. The problem is that "fluent and contextually appropriate" and "factually accurate" are not the same objective.
The deeper issue is calibration: the alignment between a model's expressed confidence and its actual accuracy. A well-calibrated model that says "I'm 90% confident" should be right about 90% of the time. Research from various groups has found that LLMs are systematically miscalibrated β they express high confidence on topics where they are frequently wrong, and sometimes express uncertainty about things they reliably know.
Real Research β Kadavath et al., Anthropic 2022
Anthropic published "Language Models (Mostly) Know What They Know" in 2022, showing that Claude could be trained to produce well-calibrated probability estimates for factual claims. Models trained with Constitutional AI showed better calibration than standard RLHF models on many tasks β a direct benefit of training against overconfident assertions.
Types of Hallucination
Not all hallucinations are equal. Researchers distinguish: intrinsic hallucinations (the model directly contradicts its source material β contradicting a document it was given to summarize) versus extrinsic hallucinations (the model adds information not present in its sources β the invented case citations). Extrinsic hallucinations are often harder to catch because they look like helpful additional detail.
A second distinction: open-domain hallucinations (claiming facts about the world that are false) versus closed-domain hallucinations (misrepresenting the content of a specific provided document). RAG (Retrieval Augmented Generation) systems are better at reducing closed-domain hallucinations because the relevant document is present in context, but they do not reliably prevent open-domain errors.
Training on uncertainty expressions ("I'm not certainβ¦")
Constitutional principles against overconfidence
RLHF reward signals for expressed hedging
Retrieval grounding β require citations from context
Refusal training for specific high-risk domains
Epistemic Honesty as a Design Goal
Anthropic's model specification for Claude includes an explicit concept of "epistemic cowardice" β giving deliberately vague answers to avoid controversy. The spec identifies this as a form of dishonesty distinct from lying. A model that says "there are many perspectives on this" when it has a well-grounded position is being epistemically cowardly. This matters for reliability: a system that hedges everything to avoid being wrong is less useful, not more trustworthy.
The goal is calibrated confidence: high confidence where warranted, clear uncertainty where not, and genuine engagement with hard questions rather than evasion. This requires training not just on what to say but on how to represent the model's epistemic state β its actual uncertainty β in ways that help users make good decisions.
The Reliability Principle
A reliable AI system is not one that is always right β that is impossible. It is one that accurately signals when it might be wrong, giving users the information they need to decide when to verify. Hallucination is dangerous not because AI is wrong but because it is confidently wrong.
Key Terms
CalibrationThe alignment between expressed confidence and actual accuracy. A calibrated system that says "90% confident" is right about 90% of the time.
Intrinsic HallucinationWhen a model contradicts information in its provided source material β directly falsifying what it was given.
Extrinsic HallucinationWhen a model generates information not present in its sources β adding plausible-sounding but unsupported details.
Epistemic CowardiceAnthropic's term for deliberately vague responses that avoid controversy β treated as a form of dishonesty distinct from outright lying.
Quiz β Lesson 3
Hallucination, Calibration, and Epistemic Honesty Β· 4 questions
1. In the Mata v. Avianca case (2023), what made ChatGPT's hallucinations particularly dangerous?
Correct. The compounding harm was that when attorney Schwartz asked ChatGPT to verify the cases' existence, it confirmed they were real. This double-down on false information β hallucination with false confirmation β is the most dangerous failure mode.
The critical detail is what happened when challenged: ChatGPT confirmed the invented cases were real. The model didn't just hallucinate β it doubled down, actively reinforcing the false belief when directly questioned.
2. Why do language models hallucinate from a technical standpoint?
Correct. LLMs are next-token predictors trained on fluency and contextual appropriateness. A plausible-sounding legal citation is a "good" continuation of a legal-advice sequence even if the case doesn't exist. The training objective doesn't directly reward factual grounding.
The root cause is the training objective: LLMs predict the next most probable token, optimizing for fluency and contextual fit. A plausible-sounding case citation fits the context perfectly β whether or not the case exists is a separate constraint not built into standard training.
3. What is the difference between intrinsic and extrinsic hallucination?
Correct. Intrinsic hallucinations contradict the provided sources (directly falsifying what the model was given), while extrinsic hallucinations add plausible-sounding information that wasn't in the source material at all. The invented legal citations in Mata were extrinsic.
The distinction is about relationship to sources: intrinsic hallucinations contradict provided documents, while extrinsic hallucinations add invented details that weren't there. Extrinsic ones are often harder to catch because they look like helpful elaboration.
4. What does Anthropic's model specification mean by "epistemic cowardice"?
Correct. Epistemic cowardice is specifically about evasion: when the model has a reasoned position but gives a vague non-answer to stay out of controversy. It's treated as dishonest because it withholds genuine reasoning the user could benefit from.
Epistemic cowardice is about strategic vagueness: saying "there are many perspectives" when you actually have a well-grounded view. It's the opposite of calibrated uncertainty β it's false uncertainty used as a shield against controversy.
Lab 3 β Hallucination Probing Techniques
Learn to detect and reduce hallucination risk in AI outputs
Your Task
After the Mata case, legal and medical professionals needed practical techniques to catch hallucinations before they caused harm. In this lab, you'll work through a systematic approach to hallucination-probing: identifying high-risk output types, designing verification prompts, and distinguishing calibrated uncertainty from epistemic cowardice.
Start here: "I'm a paralegal using AI to research case law. What are the three most dangerous hallucination patterns I should watch for, and how do I probe for each one?"
AI Lab Assistant
Hallucination & Calibration Lab
Good framing β high-stakes professional contexts are exactly where hallucination probing matters most. Let's build a systematic toolkit. What's your current practice when an AI gives you a specific case citation?
Module 8 Β· Lesson 4
The Evaluation Gap and the Path Forward
Why measuring AI reliability is itself unsolved β and what serious efforts look like
If we can't reliably test whether an AI system is reliable, how do we build the trustworthy systems the world needs?
In June 2024, FranΓ§ois Chollet updated the ARC-AGI (Abstraction and Reasoning Corpus) prize β a benchmark specifically designed to resist memorization. ARC tasks require genuine novel pattern recognition: given three or four input-output grid examples, deduce the rule and apply it to a new grid. State-of-the-art language models in 2023 scored around 0β5%. Human performance was roughly 85%.
By December 2024, OpenAI's o3 model β using extended chain-of-thought reasoning β scored 87.5% on the semi-private ARC-AGI-1 set, effectively matching human performance. The benchmark that was supposed to be resistant to AI progress had been cracked. Chollet immediately noted: this does not mean AGI. It means we need harder benchmarks. The evaluation gap had struck again.
The Evaluation Gap
The evaluation gap is the recurring phenomenon where AI benchmarks become unreliable measures of capability shortly after they become widely used. There are two failure modes: data contamination (the benchmark's test cases end up in training data, inflating scores) and benchmark saturation (models improve on the benchmark's specific task structure without acquiring the general capability the benchmark was meant to proxy).
Examples proliferate. The MMLU benchmark (Massive Multitask Language Understanding, 57 academic subjects) was treated as a gold standard from 2020 to 2023. By 2024, multiple models were scoring above 90%. Research by Gonen et al. (2023) and others found systematic evidence of data contamination β portions of MMLU test questions appeared verbatim in training corpora. The scores were partly measuring memorization, not comprehension.
2020
MMLU Published
Hendrycks et al. introduce the 57-subject benchmark as a measure of language model knowledge. Best model scores ~43%.
2022
GPT-4 Era Approaches
Models approach and then surpass human expert baselines on MMLU. The benchmark begins to saturate.
2023
Contamination Evidence
Multiple studies find MMLU test questions in training corpora. Scores increasingly reflect memorization.
2024
MMLU-Pro and New Benchmarks
MMLU-Pro (harder variants) and new benchmarks like GPQA, BIG-Bench Hard replace MMLU as capability indicators.
2024β25
ARC-AGI Cracked; New Versions Needed
o3 scores 87.5% on ARC-AGI-1. Chollet releases ARC-AGI-2, specifically designed to resist current reasoning architectures.
What Serious Evaluation Looks Like
The AI safety and reliability community has converged on several principles for evaluation that resists gaming. Held-out, never-published test sets β METR (Model Evaluation and Threat Research) maintains private test sets for autonomy and dangerous-capability evaluations that are never released publicly. Task diversity beyond benchmark structure β rather than testing on fixed-format questions, evaluators present genuinely novel problem types. Red-teaming as structured adversarial evaluation β human experts are paid to find failure modes, not to confirm capabilities.
Anthropic's Responsible Scaling Policy (RSP), published in 2023 and updated in 2024, represents a concrete commitment: before deploying a model at a new capability level, Anthropic must evaluate it against defined threat thresholds (particularly for CBRN β chemical, biological, radiological, nuclear β risk and for autonomous replication/adaptation). If the model exceeds a threshold, deployment is blocked until mitigations are in place.
Real Policy β Anthropic RSP, 2023
The Responsible Scaling Policy introduced "AI Safety Levels" (ASL-2, ASL-3, ASL-4) analogous to biosafety levels. ASL-3 requires that the model "could provide serious uplift to those seeking to create biological, chemical, nuclear, or radiological weapons with potential for mass casualties." If a model reaches ASL-3, specific containment protocols apply. As of 2024, no Anthropic model had been assessed above ASL-2 thresholds.
The Frontier Problem
The deepest challenge is that the most capable AI systems are also the hardest to evaluate. A model that can solve problems beyond human expert ability cannot be fully evaluated by human experts. This creates a principal-agent problem in evaluation: the evaluator must understand the domain well enough to judge whether the answer is correct, but if they could do that, they might not need the AI.
Proposed solutions include: formal verification (for mathematical reasoning, require proofs checkable by automated theorem provers β used in AlphaProof's 2024 performance on International Mathematical Olympiad problems); debate protocols (two AI systems argue different answers and a human judges the argument quality, not the content directly); and scalable oversight (use AI assistance to help humans evaluate AI outputs, carefully designed so the assisting AI cannot game the evaluation).
Where the Field Stands
Reliability in AI systems is not a solved problem β it is an active research frontier. Constitutional AI, chain-of-thought reasoning, calibration training, and structured evaluation represent genuine progress. The honest summary: we have built systems more reliable than anything before, and we still lack the tools to fully characterize what "reliable" means at the frontier.
Key Terms
Evaluation GapThe recurring phenomenon where AI benchmarks become unreliable measures of capability after they become widely used, due to data contamination or task-specific overfitting.
Data ContaminationWhen benchmark test questions appear in training data, causing models to perform through memorization rather than genuine capability.
Responsible Scaling Policy (RSP)Anthropic's commitment to evaluate models against capability thresholds before deployment and block or contain models that exceed safety levels.
Scalable OversightUsing AI assistance to help humans evaluate AI outputs, designed to extend human oversight to domains where direct human judgment is insufficient.
Debate ProtocolAn evaluation method where two AI systems argue opposing answers and a human judges argument quality β allowing evaluation without direct subject expertise.
Quiz β Lesson 4
The Evaluation Gap and the Path Forward Β· 4 questions
1. What happened to the ARC-AGI benchmark in December 2024, and why did Chollet say it didn't mean AGI?
Correct. o3 scoring 87.5% on a benchmark designed to resist AI was striking β but Chollet immediately noted it illustrated the evaluation gap: when a benchmark is cracked, you need a harder one. ARC-AGI-2 was developed in response.
o3 scored 87.5% in December 2024 β effectively matching human performance on a benchmark specifically designed to resist AI. But Chollet argued this proved the evaluation gap, not AGI: it meant the benchmark needed to be harder, not that general intelligence had arrived.
2. What are the two main failure modes of the evaluation gap?
Correct. Data contamination means the test set leaked into training data (partially explaining MMLU scores). Benchmark saturation means models optimize for the task format rather than the underlying capability β both make scores misleading.
The two core failure modes are: (1) data contamination β test questions end up in training corpora, turning evaluation into a memorization test; and (2) benchmark saturation β models overfit to the specific task structure without acquiring the general capability the benchmark was measuring.
3. What does Anthropic's ASL-3 threshold in the Responsible Scaling Policy specifically refer to?
Correct. ASL-3 is specifically pegged to CBRN (chemical, biological, radiological, nuclear) threat uplift β a model at this level could meaningfully help someone create weapons capable of mass casualties. This threshold triggers specific containment requirements.
ASL-3 is specifically about CBRN weapons uplift: a model that could provide serious assistance to someone trying to create biological, chemical, nuclear, or radiological weapons with mass-casualty potential. This threshold triggers mandatory containment protocols before deployment.
4. What is the "debate protocol" as a solution to the frontier evaluation problem?
Correct. The debate protocol is designed for the frontier problem: when domains exceed human expertise, humans can still evaluate arguments rather than conclusions. If one AI can identify flaws in another's reasoning, a human judge can follow the argument quality without needing to know the answer independently.
The debate protocol puts two AI systems in opposition: one argues for answer A, one for answer B. A human evaluator judges which argument is more coherent and sound β this leverages human ability to evaluate reasoning quality even when direct evaluation of the answer requires expertise the human lacks.
Lab 4 β Designing Reliable Evaluation
Apply evaluation principles to real AI deployment scenarios
Your Task
You've seen how benchmarks fail, how contamination corrupts scores, and how Anthropic's RSP creates structured capability thresholds. Now you'll practice the evaluator's craft: given a real deployment scenario, design an evaluation approach that resists the known failure modes.
Work with the assistant below to design a reliability evaluation plan for a specific AI deployment context β and identify which failure modes your plan is most vulnerable to.
Start here: "I'm deploying an AI assistant to help emergency room nurses triage patients. Walk me through how I should evaluate its reliability before launch β and what failure modes I should be most worried about."
AI Lab Assistant
Evaluation Design Lab
An ER triage assistant is an excellent high-stakes case β the failure modes matter enormously here. Before we design the evaluation, what's your biggest worry? False confidence in a wrong triage recommendation, or refusal to engage when a recommendation is needed?
Module 8 β Final Test
The Road to Reliable Reasoning Β· 15 questions Β· Pass mark: 80%
1. Constitutional AI replaced human raters in which phase of training?
Correct. In CAI's Phase 2 (RLAIF), an AI preference model judges which responses better follow the constitution, replacing most human raters in the RL feedback loop.
CAI's key innovation in Phase 2 (RLAIF) was using an AI preference model β trained on constitutional principles β to generate the preference labels that drive RL training, replacing most human raters.
2. What sources did Anthropic draw on when constructing Claude's constitution?
Correct. Anthropic's constitution drew eclectically from human rights frameworks (UN Declaration), existing acceptable-use policies (Apple ToS), and the company's own safety research.
The constitution drew from the UN Declaration of Human Rights, Apple's terms of service, and Anthropic's own research β a mix of international human rights frameworks, existing platform policies, and internal safety work.
3. Wei et al. (2022) found that chain-of-thought prompting required what minimum model size to show significant benefits?
Correct. This was a critical finding in the Wei et al. paper: CoT only emerged as beneficial at roughly 100B+ parameters. In smaller models, showing reasoning steps had no benefit and sometimes hurt performance β it is an emergent capability.
Wei et al. found that CoT is an emergent capability: it only helps at roughly 100B+ parameters. Smaller models showed flat or negative effects. This made CoT one of the first documented examples of a sharp emergent capability threshold.
4. "Language Models Don't Always Say What They Think" (Turpin et al., 2023) found that biased prompts:
Correct. This was the paper's central finding: the model's behavior was influenced by biases the user didn't intend to communicate, and the chain-of-thought narrative concealed rather than revealed those influences.
Turpin et al. found unfaithfulness: biases in prompts affected model outputs, but the chains of thought didn't mention them. The reasoning narrative was hiding the actual causal influences on the answer.
5. In what way does extended thinking (o1, Claude's thinking mode) attempt to improve on standard CoT?
Correct. The structural innovation is the dedicated compute budget with training signals designed to reward genuine exploration β creating at least a weak incentive for the reasoning to causally drive the output rather than rationalize it.
Extended thinking adds a dedicated reasoning compute budget with training incentives that reward arriving at correct answers through the thinking process. This is designed to make the thinking causally connected to outputs, not just a narrative layer added after the fact.
6. In the Mata v. Avianca case, Judge Castel described the filing as "unprecedented." What made this a landmark AI-reliability case?
Correct. The compounding failure β hallucinate, then confirm under questioning β demonstrated that AI confidently wrong is more dangerous than AI admittedly uncertain. The attorneys were fined $5,000.
The landmark element was the double failure: hallucinated citations, then confirmed them as real when directly challenged. This demonstrated that AI systems could be confidently wrong under questioning β not just passively wrong β creating real-world legal consequences.
7. Anthropic's 2022 paper "Language Models (Mostly) Know What They Know" found that:
Correct. The Kadavath et al. paper showed calibration was trainable β and that constitutional training improved it. This was early evidence that the CAI approach had benefits beyond harmlessness.
The Kadavath et al. paper (Anthropic, 2022) found that calibrated uncertainty was trainable, and CAI-trained models showed better calibration on many tasks than standard RLHF models β a non-obvious benefit of constitutional training.
8. What distinguishes intrinsic from extrinsic hallucination?
Correct. Intrinsic hallucinations directly falsify the provided source (the model contradicts what it was told), while extrinsic hallucinations add plausible-sounding material that wasn't in the source at all β like invented citations.
The distinction is about source relationship: intrinsic hallucinations contradict provided material (the model says the opposite of what its sources say), while extrinsic hallucinations add invented details that weren't in the sources at all.
9. MMLU's decline as a reliable benchmark was primarily due to:
Correct. Both failure modes struck MMLU: contamination evidence showed test questions in training data, and scores rising above 90% meant the benchmark could no longer discriminate among top models. MMLU-Pro and other harder variants emerged to replace it.
MMLU fell to both evaluation gap failure modes: contamination (test questions found verbatim in training corpora) and saturation (multiple models above 90%, no longer discriminating). This is why MMLU-Pro and new benchmarks emerged by 2024.
10. What is "scalable oversight" in the context of frontier AI evaluation?
Correct. Scalable oversight addresses the frontier problem: when domains exceed human expertise, use AI to help humans evaluate, but design the protocol so the helper AI can't corrupt the process.
Scalable oversight uses AI to extend human evaluation capacity into domains beyond direct human expertise β but requires careful protocol design to prevent the assisting AI from gaming or corrupting the evaluation process it's meant to support.
11. Anthropic's model specification concept of "epistemic cowardice" refers to:
Correct. Anthropic distinguishes epistemic cowardice from appropriate uncertainty: if the model has a reasoned position and withholds it to avoid controversy, that's a form of dishonesty. Calibrated confidence means being clear when you have grounds for confidence.
Epistemic cowardice is strategic vagueness β saying "there are many perspectives" when you have a genuine, reasoned position. Anthropic treats this as dishonest because it withholds reasoning the user could benefit from, using false uncertainty as a shield.
12. AlphaProof's 2024 performance on IMO problems used what technique to enable reliable evaluation?
Correct. Formal verification is one of the most promising solutions to the frontier evaluation problem in mathematical domains: if a proof is machine-checkable, correctness can be verified without requiring the evaluator to be smarter than the model that generated the proof.
AlphaProof used formal verification: solutions were expressed as machine-checkable proofs. This allows evaluation without the evaluator needing to understand the mathematics at the level the model operates β automated theorem provers can verify correctness mechanically.
13. The "debate protocol" for AI evaluation works by:
Correct. The debate protocol leverages human ability to judge reasoning quality even in domains where they lack direct expertise to evaluate conclusions. If an AI can expose flaws in another's reasoning, a non-expert judge can follow the argument.
The debate protocol has two AI systems argue opposing positions. A human judge evaluates argument quality β coherence, handling of objections, logical soundness β rather than needing to know the correct answer independently. This extends human oversight into expert domains.
14. What is the "principal-agent problem" as it applies to frontier AI evaluation?
Correct. This is the deepest challenge in frontier evaluation: as AI exceeds human expert ability, the humans best positioned to evaluate it are also those who least need it. Solutions like debate, formal verification, and scalable oversight try to work around this constraint.
The frontier evaluation principal-agent problem: to judge whether an AI's answer is correct, you need domain expertise β but if you had enough expertise to fully evaluate it, you might not need the AI. This creates a circular dependency that debate, formal verification, and scalable oversight try to break.
15. Which statement best summarizes the current state of AI reliability as of 2024β2025?
Correct. This is the honest summary: meaningful, documented progress on multiple fronts, and still no complete solution. The tools to characterize reliability at the frontier remain inadequate relative to the rate of capability advance.
The honest summary is: real, documented progress (CAI, CoT, calibration, RSP, structured evaluation) alongside genuine unsolved problems (evaluation gap, causal faithfulness, frontier oversight). Better than anything before, and still not good enough.