Module 3 · Lesson 1

Prompting at Scale: Instruction-Following Pipelines

How a carefully crafted prompt becomes thousands of training examples

If a model can follow instructions, can it write its own instruction manual?

Stanford researchers released Alpaca on March 13, 2023: a 7-billion-parameter model fine-tuned on just 52,000 instruction-response pairs. The pairs were generated entirely by GPT-3.5 (specifically text-davinci-003) over a weekend, at an API cost of under $500. In the team's blind pairwise evaluation, the result performed roughly on par with text-davinci-003 itself. The AI community's reaction was roughly: wait, you can just do that?

What Is an Instruction-Following Pipeline?

An instruction-following pipeline is a systematic process for generating large quantities of (instruction, response) pairs that can be used to fine-tune a base language model. The core idea is deceptively simple: you already have a capable model — use it to produce the training data that will make another (often smaller, cheaper) model capable too.

The canonical workflow has three stages. First, a set of seed tasks (typically 100–200 human-written examples) establishes the distribution of instructions: their length, domain, complexity, and style. Second, a generation model (usually a frontier API) is prompted to produce many more instructions in the same style. Third, the same or a different model generates responses to those instructions. The result is a dataset that can contain tens of thousands of examples.
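The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not the Alpaca or Self-Instruct code: the `Generator` interface, the prompt wording, and the `make_stub_model` placeholder are all assumptions introduced here so the example runs without API access.

```python
import random
from typing import Callable, Dict, List, Optional

# Hypothetical interface: any function that maps a prompt string to model text.
# A real pipeline would wrap an API call here.
Generator = Callable[[str], str]
Pair = Dict[str, str]


def build_dataset(
    seed_tasks: List[Pair],
    generate: Generator,
    n_new: int,
    shots: int = 3,
    rng: Optional[random.Random] = None,
) -> List[Pair]:
    """Grow a seed set into a larger list of (instruction, response) pairs."""
    rng = rng or random.Random(0)
    pairs = list(seed_tasks)  # Stage 1: start from the human-written seeds
    for _ in range(n_new):
        # Stage 2: show a few existing instructions and ask for a new one.
        pool = [p["instruction"] for p in pairs]
        examples = rng.sample(pool, k=min(shots, len(pool)))
        prompt = "Write one new task instruction in the style of:\n" + "\n".join(
            f"- {e}" for e in examples
        )
        instruction = generate(prompt).strip()
        # Stage 3: ask a model (here the same one) to answer the new instruction.
        response = generate(f"Instruction: {instruction}\nResponse:").strip()
        pairs.append({"instruction": instruction, "response": response})
    return pairs


def make_stub_model() -> Generator:
    """Deterministic placeholder so the sketch runs without any API access."""
    count = 0

    def model(prompt: str) -> str:
        nonlocal count
        if prompt.startswith("Instruction:"):
            return "stub response"
        count += 1
        return f"Stub instruction {count}"

    return model
```

Swapping `make_stub_model()` for a function that calls a real model is the only change needed to turn the sketch into a working generator. A production pipeline would also add filtering between stages 2 and 3.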

Stanford's Alpaca paper formalized this approach in early 2023. Its authors prompted text-davinci-003 with a meta-prompt that asked it to come up with a set of 20 diverse task instructions and outputs in the style of a few sampled examples. The model complied, and 52,000 pairs later, Alpaca existed.

Key Finding — Alpaca, 2023

The Alpaca team reported spending under $500 on the OpenAI API to generate the full 52,000-example dataset, and under $100 on cloud GPUs to fine-tune the LLaMA-7B base model, for a total of under $600. In a blind pairwise comparison against text-davinci-003, Alpaca won 90 comparisons to text-davinci-003's 89, a striking result given the cost differential.

The Three Shapes of Synthetic Instructions

Not all instruction-following data looks the same. Researchers have settled on three broad categories that shape what a fine-tuned model learns to do:

Open-ended Generation

Tasks with no fixed answer: Write a haiku about entropy. Summarize this article. Explain quantum entanglement to a 10-year-old. These dominate datasets like Alpaca and Dolly.

Closed-ended Classification

Tasks with a small answer space: Classify the sentiment of this review. Is this email spam? What language is this text? Responses are short, verifiable, and easy to validate automatically.

Dialogue Continuation

Multi-turn exchanges that teach conversational behavior: User: Can you help me debug this code? Assistant: Sure, paste it here. Used in ShareGPT-derived datasets collected from real human-ChatGPT conversations.

Self-Instruct: The Generator Teaches Itself

The Self-Instruct paper (Wang et al., 2022, published on arXiv before Alpaca) identified something even more striking: a model can generate instructions for itself. The pipeline feeds the model a small set of seed instructions, asks it to generate new instructions it hasn't seen, then asks it to complete those instructions, then uses the resulting (instruction, completion) pairs as fine-tuning data. The fine-tuned model is then used as the generator for the next round.

Self-Instruct applied to the vanilla GPT-3 model (davinci) produced a dataset of roughly 52K instructions, the direct precursor to Alpaca. Fine-tuning GPT-3 on that data yielded a 33% absolute improvement over the original model on the Super-NaturalInstructions benchmark, without any human-labeled data beyond the initial 175 seed tasks.

The critical engineering detail: a ROUGE-L similarity filter discards any generated instruction that overlaps too heavily with existing ones. This diversity filter is what prevents the pipeline from collapsing into the same five instruction types repeated 10,000 times.

Key Term — ROUGE-L Filter

ROUGE-L measures the longest common subsequence between two text strings. In Self-Instruct pipelines, any newly generated instruction with a ROUGE-L score above 0.7 against any existing instruction is discarded. This enforces diversity in the training distribution — a critical quality control step that human labelers perform intuitively but machines require explicit guidance to do.
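A small sketch of the diversity filter described above. It computes a ROUGE-L F-score from the longest common subsequence of two token lists and drops any candidate scoring above 0.7 against an instruction already accepted. The lowercase whitespace tokenizer and the function names are simplifying assumptions made here. A real pipeline would more likely use an established ROUGE implementation with proper tokenization.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F-score over lowercase whitespace tokens (a simplification)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def filter_novel(
    candidates: List[str], pool: List[str], threshold: float = 0.7
) -> List[str]:
    """Keep candidates whose score against every accepted instruction is <= threshold."""
    accepted = list(pool)
    kept = []
    for cand in candidates:
        if all(rouge_l(cand, existing) <= threshold for existing in accepted):
            kept.append(cand)
            accepted.append(cand)  # later candidates are compared to this one too
    return kept
```

Each kept candidate joins the comparison pool immediately, so near-duplicates within a single batch are also caught, not just duplicates of the seed set.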

Why This Works (and When It Doesn't)

The mechanism underlying instruction-following pipelines depends on the base model already possessing latent capability. GPT-3 could perform many tasks it was never explicitly trained to do — the knowledge was present in weights shaped by internet-scale pretraining. Fine-tuning on instruction-response pairs doesn't add new knowledge so much as it unlocks access to existing knowledge by teaching the model the format of being helpful.

This is also the failure mode. When the base model lacks genuine capability in a domain — say, advanced mathematics or verified medical diagnosis — instruction-following fine-tuning doesn't fix that. It makes the model sound helpful in that domain while producing confidently wrong answers. The Alpaca paper itself warned that the model "can produce false information" and "is not fit for real-world use" in high-stakes domains.

The downstream lesson: synthetic instruction data is a delivery mechanism for capability already present, not a mechanism for creating new capability from nothing.

Self-Instruct: A bootstrapping method where a language model generates its own instruction-following training data, starting from a small seed set of human-written examples. Proposed by Wang et al. (2022), it became the basis for the Alpaca dataset.

Seed Tasks: The small set of human-written (instruction, response) pairs that define the distribution of a synthetic dataset. The quality and diversity of the seed tasks directly determine the quality and diversity of the generated data.

Instruction Tuning: Fine-tuning a pretrained language model on (instruction, response) pairs to improve its ability to follow natural language directions. Distinct from RLHF: instruction tuning uses supervised learning on formatted examples rather than reward modeling.

Lesson 1 Quiz

Instruction-Following Pipelines · 4 questions

Stanford's Alpaca model was fine-tuned on how many instruction-response pairs, and who generated them?

Correct. The Alpaca dataset consisted of 52,000 (instruction, response) pairs generated by prompting text-davinci-003 at a total API cost under $500. This was a central finding of the Stanford Alpaca paper (March 2023).

Not quite. Alpaca used 52,000 pairs generated by GPT-3.5 via the OpenAI API — no human labelers were involved in generating the pairs, and the cost was under $500 total.

What is the primary purpose of a ROUGE-L similarity filter in Self-Instruct pipelines?

Correct. The ROUGE-L filter discards any generated instruction with a similarity score above 0.7 against existing instructions. This prevents the dataset from collapsing into repetitive variations of the same few task types.

Not quite. The ROUGE-L filter is specifically a diversity enforcement tool — it removes instructions that are too similar to ones already in the dataset. It does not check factual accuracy or grammar.

Which statement best describes what instruction-tuning actually does to a base language model?

Correct. Instruction tuning doesn't inject new knowledge — it teaches the model to surface knowledge already present in its pretrained weights in the format of helpful responses to instructions. This is why it can fail in domains where the base model lacks genuine capability.

Not quite. Instruction tuning works by teaching the model to express capabilities already embedded during pretraining, not by adding new knowledge or eliminating hallucination.

The Self-Instruct paper (Wang et al., 2022) reported that bootstrapping GPT-3 with self-generated instruction data improved its instruction-following performance by approximately how much?

Correct. The Self-Instruct paper showed a ~33% improvement on their instruction-following evaluation benchmark using only 175 human-written seed tasks and entirely model-generated training data for the rest.

Not quite. The Self-Instruct paper reported a ~33% improvement — substantial, especially given that only 175 human-written seed tasks were used and all remaining data was synthetically generated.

Lab 1: Design a Seed Task Set

Practice building instruction-following pipelines with an AI tutor

Your Objective

You are designing a Self-Instruct pipeline to create training data for a customer-service assistant for an e-commerce platform. The AI tutor will help you think through how to construct your seed task set — the 100–200 human-written examples that will shape everything generated after them.

Start by telling the tutor what domain you want your seed tasks to cover, and ask how to maximize diversity within a single domain. Or ask: "What makes a good seed task for a customer service instruction pipeline?"

AESOP Lab Tutor

Instruction Pipeline Design

Welcome to Lab 1. We're going to design a seed task set for a Self-Instruct pipeline. Seed tasks are the foundation — their diversity and quality directly determine the quality of the thousands of examples that will be generated from them. What domain or use case are you targeting, and what kinds of instructions do you want your fine-tuned model to handle well?

Module 3 · Lesson 2

Constitutional AI: Models That Critique Themselves

How Anthropic taught Claude to revise its own outputs using a written constitution

What happens when you give a model a list of values and ask it to judge its own answers?

In December 2022, Anthropic published a paper describing a new training method they called Constitutional AI. The paper described a model — an early version of Claude — that had been trained partly by critiquing and revising its own outputs against a written set of principles. Instead of relying solely on human feedback to identify harmful responses, the model was given a constitution and asked to be its own editor.

The Two Phases of Constitutional AI

Constitutional AI (CAI) as described by Anthropic in their December 2022 paper operates in two distinct phases, each using the model's own generation capability to produce training signal.

Supervised Learning Phase (SL-CAI). A "helpful-only" model — one trained to follow instructions without any harmlessness constraints — is prompted with potentially harmful requests. It generates responses. Then, using the constitutional principles, the model is asked to critique each response ("Does this response contain harmful content? Which principle does it violate?") and then to revise the response to be less harmful. The original harmful-prompt / revised-response pairs become supervised training data.

RL from AI Feedback Phase (RLAIF). The SL-CAI model generates pairs of responses to prompts. A separate "preference model" is asked to choose which response better conforms to the constitution. These AI-generated preference labels — rather than human labels — are used to train a reward model. The reward model then drives reinforcement learning fine-tuning. This is what makes the approach scalable: humans write the constitution, but AI generates the preference labels.
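The labeling step in the RLAIF phase can be sketched as follows. A judge function is shown a principle, a prompt, and two responses, and its choice is turned into a (chosen, rejected) record of the kind used to train a reward model. The `Judge` signature, the record keys, and the keyword-based `keyword_judge` are assumptions made for illustration. In practice the judge would be a language model prompted with the constitution.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: (principle, prompt, response_a, response_b) -> "A" or "B"
Judge = Callable[[str, str, str, str], str]


def label_preferences(
    samples: List[Tuple[str, str, str]], principle: str, judge: Judge
) -> List[Dict[str, str]]:
    """Turn (prompt, response_a, response_b) triples into preference records."""
    records = []
    for prompt, a, b in samples:
        choice = judge(principle, prompt, a, b)
        if choice not in ("A", "B"):
            continue  # skip outputs that cannot be parsed as a clear preference
        chosen, rejected = (a, b) if choice == "A" else (b, a)
        records.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return records


def keyword_judge(principle: str, prompt: str, a: str, b: str) -> str:
    """Toy stand-in for a model judge: prefers the response that avoids an insult."""
    return "B" if "idiot" in a.lower() else "A"
```

No human appears in the loop: the only human-written input is the `principle` string.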

What Is In the Constitution?

Anthropic's published constitution for CAI is not a simple list of banned topics. It is a set of natural language principles, each framed as a question or a comparison instruction that the model applies when evaluating its responses. Examples from the published paper include:

Sample Constitutional Principles (Anthropic, 2022)

"Choose the response that is least likely to contain harmful or unethical content."

"Choose the response that is most supportive of life, liberty, and personal security."

"Which response is less likely to contain content that would be objectionable to a thoughtful, senior Anthropic employee?"

"Choose the response that is most consistent with the Universal Declaration of Human Rights."

The principles are deliberately high-level and value-laden, not narrow technical rules. This means the model must do genuine interpretation when applying them — a feature, not a bug. Narrow rules can be gamed; abstract principles require reasoning about intent.

Anthropic also published what they called a "helpfulness constitution" alongside the harmlessness principles — the model is also asked to critique its responses for being unnecessarily unhelpful or paternalistic. This dual-sided critique is intended to prevent the model from becoming so cautious it stops being useful.

RLAIF vs RLHF: What Changes?

In standard Reinforcement Learning from Human Feedback (RLHF — the method used to train InstructGPT and the first versions of ChatGPT), human labelers read pairs of model responses and mark which one is better. Thousands of such preference pairs train a reward model. The reward model then guides RL fine-tuning.

RLAIF replaces the human labelers with another model. The preference model reads the constitution, reads two candidate responses, and produces a preference label. In Anthropic's 2022 experiments, RLAIF-trained models were rated as less harmful than RLHF-trained models at comparable helpfulness, while using no human preference labels for harmlessness (human labels were still used for helpfulness).

The implication is significant: scaling harmlessness training no longer requires linearly scaling human labeling effort. The bottleneck shifts from human time to the quality of the written constitution and the capability of the preference model.

Documented Outcome — CAI Paper, December 2022

Anthropic's experiments showed that CAI-trained models were rated as less harmful than RLHF-trained models on the same prompts, while being rated as equally or more helpful. Critically, CAI models were less likely to refuse benign requests — the "dual newspaper test" failure mode where a model appears in the headline "AI Refuses to Help with Innocuous Request."

The Critique-Revise Loop as Data Generator

The most novel aspect of CAI from a synthetic data perspective is the critique-revise loop. The model generates a draft response, then is prompted to identify problems with it, then is prompted to rewrite it fixing those problems. Each revision cycle produces a new (prompt, response) training pair.
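The loop described above might look like this in outline. The prompt templates, the `Model` interface, and the `stub_model` placeholder are assumptions introduced for illustration and are not Anthropic's actual prompts. The sketch runs one critique-revise cycle per principle supplied and records one training pair per cycle.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function that maps a prompt string to model text.
Model = Callable[[str], str]


def critique_revise(
    prompt: str, model: Model, principles: List[str]
) -> List[Dict[str, str]]:
    """Run one critique-revise cycle per principle; return one pair per cycle."""
    response = model(f"Request: {prompt}\nResponse:")  # initial draft
    pairs = []
    for principle in principles:
        critique = model(
            f"Critique the response using this principle: {principle}\n"
            f"Request: {prompt}\nResponse: {response}"
        )
        response = model(
            "Revise the response to address the critique.\n"
            f"Request: {prompt}\nResponse: {response}\nCritique: {critique}"
        )
        # Each revision is paired with the ORIGINAL prompt, not the critique.
        pairs.append({"prompt": prompt, "response": response})
    return pairs


def stub_model(text: str) -> str:
    """Deterministic placeholder so the sketch runs without a real model."""
    if text.startswith("Critique"):
        return "The draft could be more careful."
    if text.startswith("Revise"):
        return "Revised answer."
    return "Draft answer."
```

The critique itself is discarded from the training pair. The fine-tuned model learns to produce the revised response directly.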

This is recursive self-improvement applied to alignment rather than capability. The model's improved outputs become training data for a further improved model. The process stays on target because each revision is anchored to an external document, the constitution, rather than to the model's own unconstrained preferences.

Constitutional AI (CAI): Anthropic's training method (December 2022) in which a model critiques and revises its own outputs against a written set of principles, generating training data without human labelers in the loop for harmlessness evaluation.

RLAIF: Reinforcement Learning from AI Feedback. A variant of RLHF where another model (rather than human labelers) generates the preference labels used to train a reward model. Reduces the human labeling bottleneck in alignment training.

Critique-Revise Loop: The iterative process in CAI where a model generates a response, critiques it against constitutional principles, and rewrites it to better conform. The revised outputs serve as supervised training data.

Lesson 2 Quiz

Constitutional AI · 4 questions

In Phase 1 of Constitutional AI (SL-CAI), what does the model use the constitution for?

Correct. In SL-CAI, the model generates a response, then is prompted to critique that response against constitutional principles, then rewrites it. The harmful-prompt / revised-response pairs become supervised training data.

Not quite. In SL-CAI, the model reads its own generated response, uses the constitution to identify problems, and then revises the response. The critique-revise loop produces the supervised training data.

What is the key difference between RLHF and RLAIF?

Correct. RLAIF's defining feature is using a model to generate the preference labels that in RLHF would require human annotators. The constitution gives the AI preference model its evaluation criteria.

Not quite. Both use reinforcement learning. The key distinction is who generates the preference labels: humans (RLHF) versus another AI model reading a written constitution (RLAIF).

Why does Anthropic's CAI constitution include principles about helpfulness alongside harmlessness principles?

Correct. The dual-sided critique — checking both for harm and for unnecessary unhelpfulness — prevents over-refusal. Anthropic described this as the "dual newspaper test": the model should not appear in either a story about AI harm or a story about AI refusing innocuous requests.

Not quite. The helpfulness principles exist to balance harmlessness training and prevent over-refusal — making the model so cautious it becomes useless. This is the "dual newspaper test" problem CAI explicitly tries to solve.

According to Anthropic's 2022 CAI paper, how did RLAIF-trained models compare to RLHF-trained models on harmlessness evaluations?

Correct. The CAI paper reported that RLAIF-trained models were rated less harmful than RLHF-trained counterparts on the same prompts, while maintaining equal or better helpfulness scores — a key result demonstrating the viability of AI-generated feedback.

Not quite. The CAI paper showed RLAIF models were rated as less harmful and equally or more helpful than RLHF models — a result that supported the viability of replacing human preference labelers with constitutionally-guided AI feedback.

Lab 2: Write a Mini-Constitution

Apply Constitutional AI principles to a real design problem

Your Objective

You are designing a Constitutional AI system for a mental health support chatbot. This is a high-stakes domain where both excessive harmlessness (refusing to engage) and insufficient safety (providing harmful advice) create real-world risk. Work with the tutor to draft a small set of constitutional principles — and then test them against edge cases.

Start by drafting 2–3 constitutional principles for your mental health chatbot. Then ask the tutor to help you test those principles against edge cases: situations where your principles might conflict with each other.

AESOP Lab Tutor

Constitutional AI Design

Welcome to Lab 2. You're writing a mini-constitution for a mental health support chatbot. This is one of the hardest domains for constitutional design — safety and helpfulness can directly conflict. For example: a principle saying "never provide advice about medications" could prevent you from telling someone to take their prescribed medication as directed. Let's start: what 2–3 principles would you propose, and how would you phrase them as natural language evaluation criteria?

Module 3 · Lesson 3

Process Reward Models and Step-Level Verification

How OpenAI taught models to check their own reasoning one step at a time

What if instead of grading the final answer, you graded every step along the way?

In May 2023, OpenAI published a paper called Let's Verify Step by Step. The finding was unexpected in its clarity: for mathematical reasoning, process reward models — which judge each step of a solution rather than just the final answer — dramatically outperformed outcome reward models on MATH benchmark problems. The gap was large enough that a model using process supervision reached 78% accuracy on MATH, compared to 72% for the same model using outcome supervision. Not a trivial difference on a hard benchmark.

Outcome Rewards vs. Process Rewards

The distinction between outcome and process reward models is straightforward in concept but has large practical consequences.

An outcome reward model (ORM) looks at the final answer and says: right or wrong. This is how most early RLHF reward models worked — and it creates an obvious problem. A model can arrive at a correct answer via incorrect or lucky reasoning. On mathematical problems, a model might guess "42" and be rewarded, even if its chain of thought is incoherent. The reward signal doesn't distinguish how the answer was reached.

A process reward model (PRM) evaluates each intermediate step. For a multi-step math problem, the PRM might score each line of algebra: Step 1 is correct. Step 2 is correct. Step 3 contains an error — this value should be negative. The reward signal is attached to the reasoning process, not just the conclusion.

Outcome Reward Model (ORM)

Evaluates only the final output. Simple to train — just check if the answer matches the ground truth. Vulnerable to reward hacking: models find paths to correct answers through incorrect reasoning.

Process Reward Model (PRM)

Evaluates each intermediate step. Harder to train — requires step-level labels. More robust: a model can't get credit for a correct answer reached by faulty reasoning. OpenAI's PRM800K dataset annotated 800,000 reasoning steps.

PRM800K: Building Step-Level Training Data

The bottleneck for process reward models is obvious: you need human labelers to evaluate individual reasoning steps, not just final answers. For mathematics, OpenAI created PRM800K — a dataset of 800,000 step-level human labels on reasoning chains for MATH benchmark problems.

Labelers were shown each step of a model-generated solution and asked to mark it as: positive (correct step), negative (incorrect step), or neutral (the step is correct but doesn't meaningfully advance the solution). This granular labeling allows the PRM to learn which reasoning moves are legitimate and which are errors, even within solutions that happen to reach correct final answers.

The critical insight from the OpenAI paper: the PRM trained on PRM800K could be used as a verifier during inference. Instead of generating one answer and accepting it, the model generates many candidate solutions, the PRM scores each step of each solution, and the highest-scoring complete solution is selected. This "best-of-N" approach with a PRM verifier significantly outperforms best-of-N with an ORM.
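A toy sketch of best-of-N selection with a step-level verifier. The text above does not say how step scores are combined into a solution score. Taking the product of per-step correctness probabilities is one common choice and is an assumption here, as are the `StepScorer` interface and the hard-coded `toy_prm`.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical PRM interface: (previous steps, current step) -> P(step is correct)
StepScorer = Callable[[Sequence[str], str], float]


def solution_score(steps: Sequence[str], prm: StepScorer) -> float:
    """Score a whole solution as the product of its per-step probabilities."""
    return math.prod(prm(steps[:i], step) for i, step in enumerate(steps))


def best_of_n(candidates: List[List[str]], prm: StepScorer) -> List[str]:
    """Return the candidate solution with the highest aggregated PRM score."""
    return max(candidates, key=lambda steps: solution_score(steps, prm))


def toy_prm(previous: Sequence[str], step: str) -> float:
    """Stand-in for a trained PRM: flags one known-bad algebra step."""
    return 0.05 if step == "2x = 16" else 0.9


candidates = [
    ["2x + 6 = 10", "2x = 4", "x = 2"],
    ["2x + 6 = 10", "2x = 16", "x = 2"],  # faulty middle step, same final answer
]
selected = best_of_n(candidates, toy_prm)
```

Both candidates end in `x = 2`, so a verifier that checks only the final answer cannot tell them apart. The step-level score separates them because one bad step drags the product down.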

Key Result — Let's Verify Step by Step (OpenAI, May 2023)

Using a process reward model for best-of-N selection, a model fine-tuned from the base GPT-4 solved 78.2% of problems from a representative subset of the MATH test set. The same setup with an outcome reward model reached 72.4%. For human reference, the MATH dataset's authors reported that a computer science PhD student who did not especially like mathematics scored about 40%, while a three-time IMO gold medalist scored 90%. The PRM approach represented the state of the art for automated mathematical reasoning verification at the time of publication.

Synthetic Data for PRM Training

Human step-level annotation is expensive at scale. The natural extension — pursued by later work including DeepMind's work on AlphaProof and OpenAI's o1-series models — is to generate step-level training data synthetically.

One approach: Monte Carlo step verification. Given a solution up to step N, generate many completions from step N onward. If the majority of completions reach the correct final answer, step N is labeled positive. If most completions fail from step N, step N is labeled negative. The label is assigned without human input — just by running the model forward many times and aggregating outcomes.
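The labeling rule just described can be sketched as follows. The sampler interface, the default of 8 rollouts, and the treatment of an exact tie as negative are assumptions made here. The stub sampler replays fixed answers so the example is deterministic, whereas a real one would sample completions from the model.

```python
import itertools
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: given the steps so far, roll the solution forward
# and return the final answer that the completion arrives at.
Sampler = Callable[[Sequence[str]], str]


def label_step(
    prefix: Sequence[str],
    sample_final_answer: Sampler,
    correct_answer: str,
    n_samples: int = 8,
) -> str:
    """Label the last step of `prefix` by majority outcome of sampled rollouts."""
    hits = sum(
        sample_final_answer(prefix) == correct_answer for _ in range(n_samples)
    )
    # Assumption: a tie counts as negative.
    return "positive" if hits > n_samples / 2 else "negative"


def make_stub_sampler(answers_by_prefix: Dict[Tuple[str, ...], List[str]]) -> Sampler:
    """Deterministic placeholder that cycles through canned final answers."""
    cycles = {
        prefix: itertools.cycle(answers)
        for prefix, answers in answers_by_prefix.items()
    }

    def sample(prefix: Sequence[str]) -> str:
        return next(cycles[tuple(prefix)])

    return sample
```

Only the ground-truth final answer is needed. The step label is inferred from how often rollouts from that point succeed.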

This is one of the techniques believed to underlie the training of OpenAI's o1 and o3 models, which demonstrate substantially stronger multi-step reasoning than GPT-4. The key idea: use outcome verification (which requires only ground-truth answers) to infer process-level labels, then train a PRM on those inferred labels.

Connection — From PRM to o1

OpenAI's o1 model (released September 2024) is believed to use extended "thinking" chains where the model reasons through problems step by step before answering. The process reward modeling research from 2023 likely forms part of the training infrastructure that makes o1's extended reasoning reliable rather than wandering. The PRM ensures that additional thinking steps improve rather than degrade answer quality.

The Generalization Question

Process reward models trained on mathematics don't automatically generalize to other domains. Mathematical reasoning has a key property that makes PRM training tractable: steps are verifiable. Algebra steps can be checked symbolically. The correct answer to most competition math problems is known.

For domains like scientific reasoning, legal analysis, or coding, step-level ground truth is harder to establish. A "step" in a legal argument may be a judgment call rather than a verifiable fact. This is an active research frontier: how to extend process supervision to domains where correctness is ambiguous rather than binary.

Process Reward Model (PRM): A reward model that evaluates the quality of each individual reasoning step rather than only the final answer. Demonstrated to outperform outcome reward models on mathematical reasoning tasks in OpenAI's "Let's Verify Step by Step" (2023).

PRM800K: OpenAI's dataset of 800,000 human-labeled reasoning steps on MATH benchmark problems, used to train the process reward models described in their 2023 paper. Each step is labeled positive, negative, or neutral.

Monte Carlo Step Verification: A synthetic data generation method for PRM training: generate many completions from a given reasoning step, infer whether the step is correct based on majority-vote outcomes, and assign labels without human input.

Lesson 3 Quiz

Process Reward Models · 4 questions

What is the core problem with outcome reward models (ORMs) that process reward models (PRMs) are designed to address?

Correct. An ORM can't distinguish between a model that reasoned correctly and one that guessed correctly. A PRM evaluates each step, ensuring reward is tied to the quality of the reasoning process, not just the final output.

Not quite. The key problem with ORMs is that they reward correct answers regardless of reasoning quality — a model can reach the right answer through faulty logic and still receive full reward, which doesn't teach better reasoning.

What accuracy did the OpenAI process-supervised model achieve on the MATH benchmark in the "Let's Verify Step by Step" paper?

Correct. The PRM-supervised model reached 78.2% on MATH, compared to 72.4% for the ORM-supervised model. This ~6 percentage point gap on a difficult benchmark is considered substantial.

Not quite. The process-supervised model reached 78.2% on MATH, versus 72.4% for outcome supervision — a meaningful gap that demonstrated the value of step-level feedback for mathematical reasoning.

How does Monte Carlo step verification generate process-level labels without human annotation?

Correct. Monte Carlo step verification generates many completions from step N onward. If most completions reach the correct answer, step N is labeled positive (it's likely a good step). If most fail, step N is labeled negative. This uses outcome verification to infer process labels.

Not quite. Monte Carlo step verification works by sampling many completions from a given step. If the majority reach the correct final answer, the step is labeled positive; if most completions fail, the step is labeled negative. No human annotation or symbolic solver needed.

Why do process reward models trained on mathematics not automatically generalize to domains like legal analysis or scientific reasoning?

Correct. The tractability of PRM training in math depends on the fact that reasoning steps are verifiable — each algebraic manipulation is either correct or not. In legal or scientific domains, what counts as a "correct step" is often contested, making step-level labeling (human or synthetic) much harder.

Not quite. The fundamental challenge is verifiability: mathematical steps have clear right/wrong status. A step in a legal argument may be a reasonable judgment call that experts disagree on — there's no ground truth to verify against.

Lab 3: Evaluate Reasoning Steps

Practice identifying valid and invalid reasoning steps like a PRM

Your Objective

You're going to think like a process reward model. The tutor will walk you through multi-step reasoning problems and ask you to evaluate each step — marking it correct, incorrect, or neutral. Then you'll discuss why step-level evaluation is harder than it looks, especially outside pure mathematics.

Ask the tutor to give you a multi-step reasoning problem with a flawed solution — one where the final answer happens to be correct but one of the intermediate steps contains an error. Try to identify which step is wrong.

AESOP Lab Tutor

Process Reward Modeling

Welcome to Lab 3. You're going to think like a Process Reward Model — evaluating reasoning step by step, not just the final answer. This is surprisingly difficult, even for humans. I'll give you reasoning chains to evaluate. Some will have subtle errors in intermediate steps that still somehow lead to correct conclusions. Ready? Ask me to give you a flawed-but-correct reasoning chain to evaluate, or tell me what domain you'd like to practice in: math, logic, science, or something else.

Module 3 · Lesson 4

Scalable Oversight and Debate as Data Generation

When the AI is smarter than the evaluator, how do you train it to be honest?

If a model's outputs are too complex for humans to evaluate, who verifies the verifier?

As AI systems grew more capable, an uncomfortable problem became increasingly concrete: the models being trained were starting to produce outputs — complex proofs, intricate code, sophisticated arguments — that human evaluators could no longer reliably judge. If a model generates a subtle but wrong mathematical proof, a human reviewer may lack the expertise to catch the error. And if the reward model is trained on flawed human judgments, the AI learns to produce outputs that seem correct to humans rather than outputs that are correct. This gap between appearance and reality is the scalable oversight problem.

Scalable Oversight: The Core Problem

Scalable oversight is the challenge of maintaining reliable human control over AI systems as those systems become more capable than the humans evaluating them. It was named as a research problem in "Concrete Problems in AI Safety" (Amodei et al., 2016), and has since become a central research area at both OpenAI and Anthropic.

The problem has a specific structure: you need ground truth to train a reward model; you need the reward model to be accurate to train a capable AI; but the only reliable ground truth available is human judgment — and human judgment fails precisely in the domains where you most need capable AI to succeed (highly technical reasoning, complex strategy, long-horizon planning).

Two proposed solutions have generated significant research and, importantly, have been used as mechanisms for generating training data: debate and recursive reward modeling (RRM).

AI Debate as a Training Data Generator

The debate approach (Irving et al., OpenAI, 2018; further developed by other labs, including Anthropic, subsequently) works on the following principle: if two AI agents argue opposite sides of a question, and each is trying to win the argument, a dishonest or incorrect argument is easier for the other agent to attack. Therefore, the outcome of a well-structured debate between AI agents is more reliably correct than a single AI's uncontested answer, even if neither debater is fully trusted.

In practice, debate generates training data as follows. Two models are prompted to argue opposing positions on a complex claim. A human judge — who may not have the domain expertise to evaluate the claim directly — watches the exchange and votes on which argument is more persuasive and internally consistent. The debate transcripts and votes become preference data that trains future models to argue more carefully and to identify errors in opponent arguments.
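One way to picture the conversion from debate transcript to preference data is the sketch below. The `Debate` fields and the output record keys are hypothetical names chosen for illustration. The text above does not specify a storage format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Debate:
    claim: str            # the contested statement both sides argue about
    pro_turns: List[str]  # arguments from the model defending the claim
    con_turns: List[str]  # arguments from the model attacking the claim
    judge_vote: str       # "pro" or "con", as recorded by the human judge


def to_preference_record(debate: Debate) -> Dict[str, str]:
    """Convert one judged debate into a (prompt, chosen, rejected) record."""
    pro = "\n".join(debate.pro_turns)
    con = "\n".join(debate.con_turns)
    if debate.judge_vote == "pro":
        chosen, rejected = pro, con
    elif debate.judge_vote == "con":
        chosen, rejected = con, pro
    else:
        raise ValueError(f"unexpected vote: {debate.judge_vote!r}")
    return {"prompt": debate.claim, "chosen": chosen, "rejected": rejected}
```

The judge never has to decide whether the claim is true. The record only captures which side argued more convincingly, and that is the signal the preference data carries.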

Anthropic's Scalable Oversight Research — 2022

Anthropic's team published "Measuring Progress on Scalable Oversight for Large Language Models" in 2022, describing an experiment in which models were asked complex questions from QuALITY (a reading comprehension benchmark requiring careful reasoning). Human evaluators with access to the full text answered baseline questions. Separately, models were used to assist evaluators by highlighting relevant passages and arguing for answers. The AI-assisted humans outperformed unassisted humans, suggesting AI debate/assistance can extend effective human oversight into harder domains.

Recursive Reward Modeling (RRM)

Recursive Reward Modeling addresses the oversight problem differently. Instead of having humans directly evaluate complex AI outputs, you decompose the evaluation task into subtasks that are within human evaluative capability. A model generates a complex output. That output is broken into components. Humans evaluate the components. A reward model aggregates the component scores into an overall score. The reward model is then used to evaluate future complex outputs without human input.

The key insight: humans may not be able to evaluate "Is this 2,000-line codebase correct?" but they can evaluate "Is this 20-line function doing what its docstring says?" Breaking evaluation into tractable pieces makes scalable oversight possible even when the aggregate task is too hard for direct human judgment.

This approach has a direct synthetic data implication: you can generate training data for difficult domains by generating many hard tasks, decomposing them into verifiable subtasks, having humans or simpler models verify the subtasks, and using the aggregated verdicts as labels for the harder tasks. The training data for complex reasoning is built from verifiable components.
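The aggregation step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the `SubtaskVerdict` type and `aggregate` function are hypothetical, and the rule used here (score is the fraction of subtasks passed; the task is labeled positive only if every subtask passes) is one possible choice. A real system would more likely train a reward model on such verdicts than apply a fixed rule.

```python
from dataclasses import dataclass


@dataclass
class SubtaskVerdict:
    subtask: str   # e.g. "function parse_header matches its docstring"
    passed: bool   # verdict from a human or a simpler verifier model


def aggregate(verdicts: list[SubtaskVerdict]) -> dict:
    """Combine subtask verdicts into a label for the full task.

    Assumed rule: score = fraction passed; label is positive only if all pass.
    """
    if not verdicts:
        raise ValueError("cannot label a task with no verified subtasks")
    passed = sum(v.passed for v in verdicts)
    return {
        "score": passed / len(verdicts),
        "label": passed == len(verdicts),
        "failed": [v.subtask for v in verdicts if not v.passed],
    }


# Hypothetical verdicts for a codebase decomposed into four functions.
verdicts = [
    SubtaskVerdict("parse_header matches docstring", True),
    SubtaskVerdict("validate_input matches docstring", True),
    SubtaskVerdict("compute_total matches docstring", False),
    SubtaskVerdict("write_report matches docstring", True),
]
result = aggregate(verdicts)
```

Keeping the list of failed subtasks alongside the score makes it possible to route only those components back to a human reviewer.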

Superalignment: OpenAI's 2023 Initiative

In July 2023, OpenAI announced the Superalignment initiative, committing 20% of the compute it had secured to date, over the following four years, to the challenge of using AI to assist in the alignment of future, more capable AI systems. The core research agenda explicitly included using weaker AI models to supervise and evaluate stronger AI models, a form of scalable oversight that inverts the typical training hierarchy.

The Superalignment team's initial technical direction, as described in their public writing, included: using GPT-4-class models to generate automated evaluation data for hypothetical superintelligent outputs; developing interpretability tools to verify AI reasoning without relying on output plausibility; and exploring debate as a mechanism for catching errors that evaluators would otherwise miss.

The program has since undergone significant internal disruption — several key researchers, including co-lead Ilya Sutskever, departed OpenAI in 2024 — but the research agenda it articulated continues to influence how the broader field thinks about generating training data for capability domains that exceed human expertise.

The Synthesis Across This Module

Lessons 1–4 of this module trace a single arc: the progressive automation of training data generation. Instruction pipelines automated the generation of instruction-formatted examples. Constitutional AI automated preference labeling. Process reward models automated step-level verification. Debate and recursive reward modeling automate evaluation of complex outputs. Each step reduces the human labor required while extending the range of capabilities that can be trained. The open question, and the subject of ongoing research, is whether this arc can continue safely as the systems being trained surpass the systems doing the training.

Scalable Oversight: The challenge of maintaining reliable human supervision of AI systems as those systems become more capable than the humans evaluating them. Central research area at OpenAI, Anthropic, and DeepMind since 2021.

AI Debate: A training paradigm where two AI agents argue opposing positions; a human judge evaluates the exchange. The assumption is that incorrect or dishonest arguments are easier to attack, so debate outcomes are more reliable than single-model answers even when neither model is fully trusted.

Recursive Reward Modeling: Breaking complex evaluation tasks into subtasks within human evaluative capability, having humans or simpler models verify the subtasks, then aggregating verdicts into reward signals for harder tasks. Addresses the oversight problem by decomposing rather than ignoring evaluation difficulty.

Lesson 4 Quiz

Scalable Oversight and Debate · 4 questions

What is the "scalable oversight problem" as it applies to AI training?

Correct. Scalable oversight is specifically about the mismatch between evaluator capability and AI output complexity. As models produce more sophisticated outputs, human evaluators can no longer reliably distinguish correct from incorrect — meaning reward models trained on their judgments become unreliable.

Not quite. Scalable oversight is about the evaluation challenge: as AI outputs grow more sophisticated, human evaluators lose the ability to reliably assess their correctness, undermining the quality of training signal.

What is the core theoretical justification for using AI debate to generate reliable preference data?

Correct. The key assumption in debate is that truth has an asymmetric advantage: false claims are more attackable than true ones. Therefore, even if neither debater is fully trusted, the outcome of a well-structured debate is a more reliable signal than an uncontested assertion.

Not quite. The justification for debate is specifically the asymmetry of error: incorrect claims are more vulnerable to attack by a capable opponent than correct claims. This makes debate outcomes more reliable as preference signals even when neither individual model is trustworthy.

What does Recursive Reward Modeling (RRM) do differently from standard reward modeling to address the scalable oversight problem?

Correct. RRM's key mechanism is decomposition: "Is this 2,000-line codebase correct?" is intractable for human evaluators, but "Does this 20-line function match its docstring?" is not. By breaking complex evaluation into verifiable pieces, RRM extends effective human oversight.

Not quite. RRM solves scalable oversight by decomposing hard evaluation tasks into easier subtasks. Humans evaluate the subtasks they can handle, and the aggregated scores produce reward signals for complex outputs that humans couldn't evaluate directly.

What did Anthropic's 2022 scalable oversight experiment demonstrate using the QuALITY reading comprehension benchmark?

Correct. The Anthropic experiment showed that when AI models helped human evaluators by highlighting relevant passages and arguing for answers, the humans performed better than unassisted evaluators — a direct demonstration that AI assistance can extend human oversight capability.

Not quite. The experiment's key finding was positive: AI-assisted humans outperformed unassisted humans on the QuALITY benchmark, supporting the idea that AI can help maintain human oversight even in domains that exceed unaided human capability.

Lab 4: Scalable Oversight Strategy

Design an oversight mechanism for a domain that exceeds human expertise

Your Objective

You are advising a research lab building an AI system for drug discovery — identifying candidate molecules for novel antibiotics. The system produces complex outputs (molecular structures, binding affinity predictions, synthesis pathways) that expert chemists can evaluate slowly but not at the scale or speed the AI generates them. How do you maintain reliable oversight?

Tell the tutor which oversight approach you'd propose — debate, recursive reward modeling, process supervision, or a hybrid — and explain your reasoning. The tutor will challenge your design and help you identify its weaknesses and strengths.

AESOP Lab Tutor

Scalable Oversight Design

Welcome to Lab 4. You're designing a scalable oversight system for an AI drug discovery platform. The challenge: the AI generates thousands of candidate molecule evaluations per day; expert medicinal chemists can carefully review maybe 50. You need a way to maintain reliable quality control over outputs that exceed human review capacity. Which oversight approach would you propose — debate between AI agents, recursive decomposition of evaluation, process-level reward modeling, or something hybrid? Tell me your initial thinking and we'll stress-test it.

Module 3 Test

How Models Generate Training Data · 15 questions · 80% to pass

1. The Stanford Alpaca paper demonstrated that a capable instruction-following model could be produced for approximately what total cost?

Correct. The Alpaca paper reported approximately $500 in API costs and ~$100 in compute for fine-tuning — under $600 total.

Alpaca's total cost was under $600 — around $500 in API costs and $100 in GPU compute for fine-tuning LLaMA-7B.

2. In the Self-Instruct framework, what is the function of the ROUGE-L similarity filter applied during instruction generation?

Correct. The ROUGE-L filter enforces diversity by discarding instructions with similarity scores above 0.7 against any existing instruction in the dataset.

The ROUGE-L filter is a diversity enforcement mechanism — it discards instructions that are too similar to ones already in the dataset, measured by longest common subsequence overlap.

3. Which of the following best describes what instruction tuning actually accomplishes in a base language model?

Correct. Instruction tuning is a delivery mechanism — it unlocks access to latent capability without injecting new knowledge. This is why it fails when the base model genuinely lacks capability in a domain.

Instruction tuning teaches the model the format of being helpful, unlocking capabilities already embedded during pretraining. It doesn't add knowledge or fix hallucination at the source.

4. Constitutional AI's Phase 1 (SL-CAI) produces training data by:

Correct. SL-CAI uses the critique-revise loop: generate → critique against constitution → revise. The harmful-prompt / revised-response pairs become supervised training data without human labelers.

SL-CAI's data generation is: generate a response (potentially harmful), critique it using constitutional principles, revise it. The original prompt + revised response become training data. No human labelers are needed for this phase.

5. RLAIF differs from standard RLHF primarily in that:

Correct. The defining difference is who generates preference labels: humans (RLHF) versus an AI model guided by a written constitution (RLAIF). This removes the human labeling bottleneck from harmlessness training.

The key distinction is the source of preference labels. RLHF uses humans; RLAIF uses another AI model that evaluates candidate responses against a written constitution. Same RL framework, different feedback source.

6. Anthropic's published CAI constitution principles are best described as:

Correct. The constitutional principles are high-level, value-laden natural language questions ("Choose the response least likely to contain harmful content") that require genuine interpretation. Narrow rules can be gamed; abstract principles require reasoning about intent.

CAI principles are high-level natural language value statements requiring interpretation — not narrow rules or formal specifications. Examples include references to the Universal Declaration of Human Rights and "thoughtful senior employee" tests.

7. An outcome reward model (ORM) and a process reward model (PRM) are given the same flawed but accidentally correct math solution. How do they behave?

Correct. This is exactly the ORM/PRM distinction. The ORM sees a correct final answer and rewards it. The PRM evaluates each step, detects the error, and penalizes it — even though the conclusion happened to be right.

The ORM assigns high reward (correct answer) regardless of how it was reached. The PRM evaluates each step — detecting the error — and assigns a lower reward. This is the core problem with outcome-only supervision.

8. OpenAI's PRM800K dataset consists of:

Correct. PRM800K contains 800,000 step-level human labels — each individual reasoning step in a model-generated math solution was evaluated and marked positive (correct), negative (incorrect), or neutral (correct but non-advancing).

PRM800K is a step-level annotation dataset: 800,000 labels on individual reasoning steps, each marked positive, negative, or neutral by human evaluators. It enables training a reward model to evaluate reasoning quality at each intermediate step.

9. Monte Carlo step verification generates step-level labels by:

Correct. Monte Carlo step verification uses outcome verification (which only needs ground-truth answers) to infer process labels. Many completions from step N are sampled; majority-vote outcomes determine the step label.

Monte Carlo step verification runs many forward completions from a given step. If most completions reach the correct answer, the step is labeled positive; if most fail, negative. This converts outcome-level ground truth into step-level labels without human annotation.

10. Why does process supervision for step-level reward modeling generalize poorly beyond mathematics to domains like legal reasoning?

Correct. The tractability of PRM training in math depends on clear step-level verifiability. A legal argument step may be reasonable under some interpretations and not others — there's no binary ground truth to assign labels against.

The key issue is verifiability. Math steps are either correct or not. Legal reasoning steps are often contested judgment calls — experts disagree on what constitutes a valid inference. Without ground truth, step-level labeling (human or synthetic) becomes unreliable.

11. The scalable oversight problem becomes critical when:

Correct. Scalable oversight becomes critical precisely when the evaluation gap opens: human evaluators can no longer reliably assess AI outputs in the domains where AI capability is advancing fastest.

Scalable oversight is triggered by the evaluation gap — when AI outputs exceed human evaluative capability in the relevant domain. A reward model trained on unreliable human judgments will train the AI to appear correct rather than be correct.

12. The theoretical justification for AI debate as a source of reliable training data is:

Correct. The key assumption is the asymmetry of error under adversarial scrutiny: false claims are more attackable than true claims. Therefore, debate outcomes are more reliable signals than single-model assertions.

The justification is the asymmetry of error: incorrect arguments have more vulnerabilities for an opponent to exploit. Under adversarial debate, false positions are disproportionately likely to be defeated, making the outcome a more reliable signal than any single model's uncontested answer.

13. Recursive Reward Modeling (RRM) addresses scalable oversight by:

Correct. RRM's mechanism is decomposition: break tasks that humans can't evaluate into subtasks they can. This extends effective human oversight into complex domains by working at the component level rather than the aggregate level.

RRM decomposes the hard evaluation problem into easier subtasks. A 2,000-line codebase may be impossible for humans to evaluate holistically, but individual functions can be checked. Aggregating subtask scores produces a reliable reward signal for the complex task.

14. OpenAI's Superalignment initiative (announced July 2023) proposed using AI systems to assist in alignment of future AI by:

Correct. The Superalignment research agenda explicitly included using current-generation AI (GPT-4 class) to help evaluate and supervise hypothetical future, more capable AI systems — a form of scalable oversight that inverts the typical training hierarchy.

Superalignment's technical agenda included using weaker models (like GPT-4) to supervise stronger models — effectively automating the oversight function that humans can't perform at the capability level of future systems. It also included interpretability research as a complementary approach.

15. Across the four data-generation methods covered in this module (instruction pipelines, Constitutional AI, process reward models, and debate/RRM), what is the common underlying trend?

Correct. The arc is: instruction pipelines automated format generation → CAI automated preference labeling → PRMs automated step verification → debate/RRM automated evaluation of complex outputs. Each step reduces human labor while pushing the capability frontier of what can be reliably trained on.

The common thread is progressive automation of training data generation. Instruction pipelines → format. CAI → preference labels. PRM → step verification. Debate/RRM → complex output evaluation. Each step removes a human bottleneck while extending the capability ceiling of what can be trained on.