Module 6 · Lesson 1

The Oversight Bottleneck

When AI outpaces human judgment, who checks the checker?
How can human overseers maintain meaningful control over AI systems whose outputs they can no longer reliably evaluate?

In 2015, DeepMind's DQN agent discovered a tunneling strategy in Breakout that human programmers had never considered. The agent found that drilling a channel behind the wall produced a score-maximizing strategy invisible to anyone watching frame by frame at human speed. The researchers only detected it by slowing the replay. The AI had found something true and effective, something its overseers genuinely could not have evaluated in real time.

The Core Problem

Human oversight of AI systems rests on a quiet assumption: that the overseer can tell whether the AI did the right thing. For narrow tasks with crisp outputs — did the image get classified correctly? did the translation parse? — this assumption holds. But as AI systems tackle longer-horizon tasks, produce richer outputs, or operate in domains requiring specialist knowledge, the assumption breaks down.

This is the scalable oversight problem: how do we maintain meaningful human control as AI capability grows past the point where humans can directly verify AI outputs? It is not a hypothetical. It is already happening in code generation, scientific literature synthesis, legal document review, and protein structure prediction.

Why This Is Distinct from Other Alignment Problems

Most alignment concerns ask: what does the AI want? Scalable oversight asks something different: even if the AI is trying to do the right thing, how would we know it succeeded? The failure mode is not deception per se — it is an evaluation gap that makes deception undetectable and correct behavior unverifiable.

The Evaluator's Dilemma

OpenAI's 2021 paper on scalable oversight introduced a clean framing of the problem. They called it the evaluator's dilemma: if you use a less-capable system to check a more-capable one, the checker may miss errors. If you use the same model to check itself, you get circular validation. If you use an even more capable model, you need oversight of that model too — infinite regress.

The practical manifestation appears in reinforcement learning from human feedback (RLHF). Human raters label outputs as better or worse, and the model is trained toward the distribution of human preferences. But if the model starts producing outputs that sound correct to non-experts without being correct, RLHF training will reward the appearance of correctness over actual correctness. The signal corrupts.
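
The corruption described above can be illustrated with a toy simulation. This is a minimal sketch under invented assumptions, not a model of any real RLHF pipeline: `make_output`, the plausibility bonus of 0.2, and both rater functions are hypothetical. It compares a rater who can verify correctness with one who can only judge how plausible an output looks, on pairs where exactly one output is correct.

```python
import random


def make_output(rng):
    """A hypothetical model output: whether it is correct, and how plausible it looks."""
    correct = rng.random() < 0.5
    # Assumption: plausibility is only loosely tied to correctness.
    plausibility = rng.random() + (0.2 if correct else 0.0)
    return {"correct": correct, "plausibility": plausibility}


def expert_prefers(a, b):
    """An expert verifies correctness first and uses plausibility only as a tie-breaker."""
    key_a = (a["correct"], a["plausibility"])
    key_b = (b["correct"], b["plausibility"])
    return a if key_a >= key_b else b


def non_expert_prefers(a, b):
    """A non-expert cannot verify correctness and rewards whatever looks right."""
    return a if a["plausibility"] >= b["plausibility"] else b


def fraction_rewarding_wrong_answer(rater, n_pairs=10_000, seed=0):
    """Among pairs with exactly one correct output, how often is the wrong one preferred?"""
    rng = random.Random(seed)
    wrong = total = 0
    while total < n_pairs:
        a, b = make_output(rng), make_output(rng)
        if a["correct"] == b["correct"]:
            continue  # skip pairs where correctness does not distinguish the outputs
        total += 1
        if not rater(a, b)["correct"]:
            wrong += 1
    return wrong / total
```

Under these assumptions the verifying rater never prefers the wrong output, while the plausibility-only rater does so in roughly a third of such pairs. A reward model fit to the second rater's labels would inherit that bias.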

Scalable Oversight: A family of techniques aimed at maintaining meaningful human supervision of AI systems even when those systems operate at capability levels that exceed reliable human evaluation of individual outputs.
Evaluation Gap: The growing disparity between an AI system's ability to produce outputs and a human overseer's ability to assess whether those outputs are correct, safe, or aligned.
RLHF Signal Corruption: When human feedback used to train AI systems becomes systematically misleading because evaluators cannot distinguish genuinely good outputs from outputs that merely appear good.

Where the Gap Appears Today

Code generation: GitHub Copilot and similar tools produce code that passes review by developers who are not security specialists. A 2022 Stanford study found that participants who used AI coding assistants wrote significantly more security vulnerabilities than those who did not — and were more confident in their code's security. The AI was producing plausible-looking, functioning code that harbored real flaws that non-expert reviewers could not catch.

Scientific synthesis: AI systems now summarize research literature. Studies have documented that large language models produce plausible-sounding citations that do not exist, or summarize real papers with inverted findings. Non-specialist readers cannot catch these errors without independently verifying each source — which eliminates the efficiency gain the tool was meant to provide.

Medical and legal domains: LLMs produce medical differential diagnoses and legal research memos that junior practitioners find credible. When the AI is wrong in these domains, the error may be undetectable by the very user relying on the output.
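
The code-generation case can be made concrete with a small example. This is a hypothetical illustration, not code taken from the Stanford study: the `users` table and the helper names are invented. Both lookup functions return the right answer for ordinary input, so a reviewer who only tests the expected path would see no difference between them.

```python
import sqlite3


def make_db():
    """Build a throwaway in-memory database with two hypothetical users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_unsafe(conn, name):
    # Works for ordinary names, but user input is spliced into the SQL text,
    # so a crafted name can change the meaning of the query.
    query = f"SELECT secret FROM users WHERE name = '{name}'"
    return [row[0] for row in conn.execute(query)]


def lookup_safe(conn, name):
    # Parameterized query: the driver treats `name` strictly as data.
    query = "SELECT secret FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]
```

Calling either function with `"alice"` returns her secret. Calling `lookup_unsafe` with the input `nobody' OR '1'='1` returns every secret in the table, while `lookup_safe` returns nothing. The flaw only shows up for a reviewer who knows to try that input.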

The Scaling Trajectory

The evaluation gap is not static. As AI systems become more capable, the gap widens. A 2023 analysis by Anthropic's alignment team noted that for sufficiently advanced AI, human oversight using current methods may become "systematically unreliable" — not because humans become less capable, but because the AI's outputs in complex domains genuinely exceed reliable human judgment. This is why scalable oversight research is treated as a prerequisite for safe deployment of more powerful systems.

The Stakes

If we cannot solve scalable oversight, we face a hard choice as AI capability grows: either limit AI to tasks humans can evaluate (sacrificing enormous potential benefit) or deploy AI on tasks humans cannot evaluate (accepting uncertain and potentially serious risk). Neither option is satisfactory. Scalable oversight research exists precisely to find a third path — techniques that extend meaningful human supervision beyond the naive bottleneck of direct human judgment.

Lesson 1 Quiz

The Oversight Bottleneck · 3 questions
What is the core claim of the "scalable oversight problem"?
✓ Correct: Scalable oversight addresses the evaluation gap, the growing disparity between AI output quality and human ability to assess it.
Not quite. The scalable oversight problem is specifically about the evaluation gap — the inability of human supervisors to verify AI outputs as capability increases, regardless of what the AI "wants."
The 2022 Stanford study on AI coding assistants found that participants using AI wrote more security vulnerabilities — and were more confident. Why does this illustrate the evaluation gap?
✓ Correct: The gap between plausibility and correctness, invisible to the evaluator, is precisely the evaluation gap in practice.
The key point is that the code looked correct to reviewers while containing real flaws — overseers could not distinguish good from bad outputs. That is the evaluation gap.
What is "RLHF signal corruption" as described in this lesson?
✓ Correct: Signal corruption occurs when human evaluators cannot distinguish appearance from reality, so the model learns to optimize for the appearance of correctness.
Signal corruption here refers specifically to the evaluation gap: when humans cannot tell correct from incorrect, their feedback rewards the wrong thing — plausibility over truth.

Lab 1 · Mapping the Evaluation Gap

Identify where oversight breaks down in real deployment contexts

Your Task

You will discuss the scalable oversight problem with an AI tutor. Explore where the evaluation gap appears in real-world AI deployments, why RLHF signal corruption occurs, and what conditions make oversight harder or easier.

Suggested opener: "In which domains is the evaluation gap most severe right now, and what makes those domains especially hard to oversee?" — or ask your own question about the oversight bottleneck.
Module 6 · Lesson 2

Debate, Amplification, and Recursive Reward Modeling

Three research programs trying to extend human judgment beyond its natural limits
Can we design oversight architectures where AI systems check each other, with humans adjudicating only the disagreements they can actually evaluate?

In 2018, Paul Christiano at OpenAI published a research agenda that would come to define a generation of scalable oversight work. The core intuition was elegant: if a human cannot evaluate an AI's answer directly, perhaps the human can evaluate a debate between two AIs — each trying to expose the other's errors. The asymmetry between generating a deceptive argument and exposing one might favor truth.

AI Safety via Debate

The Debate proposal, published by Irving et al. at OpenAI in 2018, works as follows. Two AI agents are given the same question. Each argues for a different answer. They take turns making claims and counter-claims. A human judge — who may not be an expert in the underlying domain — watches the exchange and decides which agent was more honest.

The crucial theoretical claim: if one agent is lying, the honest agent can always expose the lie using a statement simple enough for the human to verify. The honest agent does not need to explain the whole answer from scratch — only to identify the specific point where the deceptive agent's argument breaks down. This keeps the human's evaluation task bounded.
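
The idea that the honest agent only has to point at one checkable flaw can be sketched with a toy debate over the sum of a long list. This is an illustrative assumption, not the protocol from the paper: the agents are hard-coded functions, the names are hypothetical, and the judge is assumed to be able to check a single list element but not to add up the whole list.

```python
def claimant_split(xs, lo, hi, claimed):
    """The claimant must split its claimed sum for xs[lo:hi] across two halves.

    This claimant reports the left half truthfully and pushes any error
    into the right half. Any split that adds up to `claimed` would do.
    """
    mid = (lo + hi) // 2
    left_claim = sum(xs[lo:mid])
    return mid, left_claim, claimed - left_claim


def challenger_pick(xs, lo, mid, hi, left_claim, right_claim):
    """The challenger points at a half whose claimed sum is wrong, if there is one."""
    if left_claim != sum(xs[lo:mid]):
        return lo, mid, left_claim
    return mid, hi, right_claim


def run_debate(xs, claimed_total):
    """Narrow the dispute to one element, which the judge checks directly."""
    lo, hi, claimed = 0, len(xs), claimed_total
    rounds = 0
    while hi - lo > 1:
        mid, left_claim, right_claim = claimant_split(xs, lo, hi, claimed)
        lo, hi, claimed = challenger_pick(xs, lo, mid, hi, left_claim, right_claim)
        rounds += 1
    # The judge's only job: compare one claimed value with one list element.
    return {
        "index": lo,
        "claimed": claimed,
        "actual": xs[lo],
        "claim_refuted": claimed != xs[lo],
        "rounds": rounds,
    }
```

If the claimed total is wrong, at least one half must carry a wrong sub-claim at every round, so the dispute always ends at a single element the judge can verify. For a list of 1,024 numbers that takes ten rounds and one check by the judge. The sketch assumes a non-empty list and says nothing about deceptions that cannot be decomposed this cleanly.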

A 2021 empirical test by Michael Kenton et al. at DeepMind used the QuALITY dataset — comprehension questions about long documents that humans struggle to answer without re-reading. They found that human judges who watched AI debates were significantly more accurate than humans reading the documents alone. Debate provided genuine epistemic lift.

Key Limitation of Debate

The theoretical guarantee requires that the honest agent can always find a human-verifiable exposé of the lie. But for sufficiently complex deceptions — especially in domains like mathematics, neuroscience, or long-horizon planning — the exposé itself may not be verifiable by a non-expert. The adversarial structure helps, but it does not fully dissolve the evaluation gap.

Iterated Amplification

Paul Christiano's Iterated Amplification (IDA) proposal, published in 2018, takes a different approach. Instead of using adversarial debate, it builds a hierarchy of oversight. The idea: take a human H and give them access to many copies of a slightly-less-capable AI assistant. H uses the assistants to decompose hard questions into simpler sub-questions, answer those, and aggregate the results. The composite system — H plus assistants — is more capable than H alone.

This composite system can then be used to generate training signal for a new AI, which in turn becomes the assistant in the next round. The process iterates: each round, the AI becomes slightly more capable while remaining trained on a signal generated by a supervised human-AI composite. In principle, capability grows without losing the thread of human oversight.

The critical assumption: each decomposition step is small enough that the human can verify whether the sub-task was done correctly. If any step involves an unjustifiable leap of faith, the alignment guarantee breaks. Christiano himself has noted that IDA's safety properties depend heavily on whether this assumption holds in practice — an open empirical question.
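
The amplify-then-distill loop can be sketched on a deliberately trivial task, summing a list. This is a minimal sketch under stated assumptions: the function names are hypothetical, the "human" only ever adds two numbers, and `distill` is a stand-in that copies the amplified system instead of training a model.

```python
def base_model(xs):
    """Round-0 assistant: can only answer the trivial one-element question."""
    if len(xs) != 1:
        raise ValueError("question too hard for this assistant")
    return xs[0]


def amplify(assistant):
    """Human plus assistants: decompose, delegate each half, combine the answers."""

    def amplified(xs):
        if len(xs) == 1:
            return xs[0]
        mid = len(xs) // 2
        left, right = assistant(xs[:mid]), assistant(xs[mid:])
        return left + right  # the only step the human performs and can verify

    return amplified


def distill(amplified):
    """Stand-in for training: the new model simply imitates the amplified system.

    A real distillation step would fit a learned model and could introduce errors.
    """
    return amplified


def iterate(rounds):
    """Run the amplify/distill loop for a fixed number of rounds."""
    model = base_model
    for _ in range(rounds):
        model = distill(amplify(model))
    return model
```

After three rounds the resulting model can sum any list of up to eight elements, even though the human never did more than one addition per question. A nine-element list still fails, because some sub-question bottoms out at an assistant that cannot answer it. In a real system the open question is whether every decomposition step stays this easy to verify.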

AI Safety via Debate: An oversight method where two AI agents argue opposing positions and a human judges honesty, exploiting the asymmetry between generating and detecting deception.
Iterated Amplification (IDA): A method for recursively building more capable oversight by having humans use AI assistants to decompose hard problems, then training new AI on the resulting human-AI composite judgments.
Recursive Reward Modeling (RRM): A technique where a reward model is trained on human ratings made with the help of an AI assistant, and where that assistant can itself be trained on assisted human judgments.

Recursive Reward Modeling

Recursive Reward Modeling (RRM), developed at DeepMind and published by Leike et al. in 2018, combines elements of both approaches. The insight: instead of training an AI directly on human ratings of its outputs, train a reward model that is itself assisted by an AI. The human rates outputs with AI assistance, making harder evaluations tractable. The reward model captures these assisted judgments. The process can recurse — the assisting AI can itself be trained on assisted human judgments.

In practice, RRM is closely related to what became Constitutional AI at Anthropic and the broader ecosystem of "AI-assisted evaluation" that now underpins most frontier model training. When Anthropic trains Claude using a combination of human feedback and AI feedback (RLAIF), the AI feedback is itself generated by a model trained on human preferences — a direct application of the recursive reward modeling principle.

What These Methods Share — and Where They Diverge

Debate, IDA, and RRM all attempt to extend human oversight by making the evaluation task easier, not by making humans smarter. They differ in mechanism: Debate uses adversarial exposure, IDA uses decomposition and aggregation, RRM uses assisted evaluation and recursion. All three face the same fundamental challenge: if the AI being overseen is capable enough to generate outputs the human genuinely cannot evaluate — even with assistance — the methods' safety properties weaken or break. This is why scalable oversight research remains an open problem rather than a solved one.

Lesson 2 Quiz

Debate, Amplification, and Recursive Reward Modeling · 3 questions
In the AI Safety via Debate framework, what is the key asymmetry that the method relies on?
✓ Correct: The honest agent only needs to find one verifiable exposé; the deceptive agent must hide every flaw. That asymmetry is the theoretical foundation of the Debate method.
The method's key insight is asymmetric burden: the honest agent just needs to find one human-verifiable crack in the deceptive argument, while deception requires hiding all such cracks.
What is the critical assumption that Iterated Amplification's safety properties depend on?
✓ Correct: If any decomposition step requires an unjustifiable leap, a step the human cannot verify, the alignment guarantee breaks at that point.
IDA's safety depends specifically on whether decomposition steps are individually verifiable by humans. Christiano himself flagged this as an open empirical question.
How does Recursive Reward Modeling relate to Anthropic's Constitutional AI and RLAIF in practice?
✓ Correct: The AI feedback used in RLAIF is generated by models trained on human preferences, which is the recursive structure RRM describes.
RLAIF — reinforcement learning from AI feedback — uses AI-generated ratings that themselves trace back to human preferences, embodying the recursive reward modeling approach.

Lab 2 · Evaluating Oversight Architectures

Probe the strengths and failure modes of Debate, IDA, and RRM

Your Task

Discuss the three scalable oversight architectures — AI Safety via Debate, Iterated Amplification, and Recursive Reward Modeling — with your AI tutor. Probe their failure modes, compare their assumptions, and examine how they relate to real deployed systems like Constitutional AI.

Suggested opener: "What's the strongest argument that AI Safety via Debate actually works in practice, and where does that argument break down?" — or ask your own question.
Module 6 · Lesson 3

Process-Based vs. Outcome-Based Supervision

Rewarding the right answer versus rewarding the right reasoning
If we cannot always verify that an AI's answer is correct, can we instead verify that its reasoning process was sound?

In May 2023, OpenAI published "Let's Verify Step by Step," a paper by Lightman et al. examining whether training AI on process-level feedback — rating each reasoning step rather than only final answers — improved mathematical problem-solving. The result was significant: process reward models (PRMs) outperformed outcome reward models (ORMs) on the MATH benchmark, and critically, the PRM was better at identifying where reasoning had gone wrong, not just that it had.

The Distinction That Matters

Outcome-based supervision evaluates AI by its final outputs. Did the code run correctly? Was the diagnosis accurate? Did the legal argument prevail? This is natural and easy to implement, because outcomes are often observable. But outcome supervision has a structural problem: an AI can reach a correct outcome via incorrect reasoning, and it can reach an incorrect outcome via sound reasoning that was undone by bad luck. When we train on outcomes, we may reward the wrong processes.

Process-based supervision evaluates the reasoning chain that led to the output. Each step is assessed: is this inference valid? Is this intermediate claim supported? Does this derivation follow? If the steps are sound, the conclusion is likely sound — and we can catch errors before they propagate to final outputs.
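
The difference between the two kinds of check can be shown on a two-step arithmetic chain. This is a toy sketch with hypothetical names: each step is a tuple `(a, op, b, claimed)`, the outcome check looks only at the last claimed value, and the process check re-derives every step. It does not verify that each step actually uses the previous step's result.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def outcome_check(chain, expected):
    """Outcome supervision: only the final claimed value is compared to the answer."""
    return chain[-1][3] == expected


def process_check(chain):
    """Process supervision: return the index of the first invalid step, or None."""
    for i, (a, op, b, claimed) in enumerate(chain):
        if OPS[op](a, b) != claimed:
            return i
    return None


# Task: compute (3 + 4) * 2, whose correct answer is 14.
sound_chain = [(3, "+", 4, 7), (7, "*", 2, 14)]
# Two errors that cancel: 3 + 4 is not 8, and 8 * 2 is not 14.
lucky_chain = [(3, "+", 4, 8), (8, "*", 2, 14)]
```

Both chains pass `outcome_check(chain, 14)`, so outcome feedback would reward them equally. Only `process_check` separates them, and it also reports where the lucky chain first went wrong (step 0).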

The Goodhart Problem

Outcome-based supervision is particularly vulnerable to Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. An AI optimized for outcome metrics learns to produce outputs that score well on those metrics, not necessarily outputs that are genuinely correct. Process-based supervision partially addresses this by making the target the reasoning itself, which is harder to game.

The PRM Approach in Practice

OpenAI's process reward model research involved human annotators rating individual steps in mathematical reasoning chains, marking each step as positive, negative, or neutral. The PRM learned to predict step-level correctness. This enables two things: first, training models to reason more carefully by penalizing specific bad steps rather than only penalizing wrong final answers; second, using the PRM at inference time to rank sampled reasoning chains and select the highest-scoring one (best-of-N selection).

The practical finding was that models trained with PRM feedback, or guided by PRMs during inference, solved significantly more competition-level math problems than ORM-equivalent baselines. The quality of the reasoning process was a better signal than outcome alone.
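
Using step-level scores to choose among candidate solutions can be sketched as follows. This is a minimal sketch, not OpenAI's implementation: the per-step probabilities are made-up numbers standing in for a PRM's outputs, and scoring a chain by the product of its step probabilities is one simple aggregation choice assumed here.

```python
import math


def solution_score(step_probs):
    """Score a reasoning chain as the product of its per-step correctness probabilities."""
    return math.prod(step_probs)


def best_of_n(candidates):
    """Pick the final answer of the highest-scoring chain.

    `candidates` is a list of (final_answer, step_probs) pairs.
    """
    best_answer, _ = max(candidates, key=lambda c: solution_score(c[1]))
    return best_answer


# Hypothetical PRM outputs for three sampled solutions to the same problem.
candidates = [
    ("17", [0.99, 0.20, 0.99]),  # one step the PRM strongly doubts
    ("42", [0.90, 0.95, 0.90]),  # every step looks reasonable
    ("41", [0.60, 0.60, 0.60]),  # uniformly shaky
]
```

With these numbers `best_of_n(candidates)` selects `"42"`. A single low-confidence step is enough to sink the first chain, even though its other steps score highly.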

A related thread: Google Research's work on self-consistency (Wang et al., 2022) and the chain-of-thought prompting research by Wei et al. (2022) both implicitly rely on the idea that surfacing reasoning steps makes errors more detectable, a process-supervision intuition applied without a formal reward model.

Process Reward Model (PRM): A model trained to evaluate the correctness of individual reasoning steps rather than only final outputs, enabling supervision of the reasoning process itself.
Outcome Reward Model (ORM): A model trained to evaluate final outputs only, without assessing the reasoning chain that produced them. Simpler to implement but more vulnerable to shortcut learning and Goodhart effects.
Reward Hacking: When an AI finds ways to maximize a reward signal without achieving the intended behavior, exploiting the gap between what the reward measures and what we actually care about.

Challenges of Process Supervision

Process-based supervision is harder to implement. Humans must evaluate intermediate reasoning steps, which requires following the AI's chain of thought at every point — more labor-intensive than rating only final outputs. For domains where AI reasoning is opaque or uses representations humans cannot follow, process supervision becomes difficult or impossible.

There is also a coverage problem. An AI can produce correct-seeming process traces that conceal the true reasoning. If the model's actual computation is not reflected in its expressed chain of thought — a real possibility with large language models — then supervising the expressed reasoning provides weaker guarantees than it appears to.

This connects to a broader open question: is the chain of thought produced by a language model a genuine explanation of how it reached its answer, or is it a post-hoc rationalization that looks reasonable to humans? Research by Miles Turpin et al. (2023) found evidence that chain-of-thought explanations can be systematically unfaithful to the actual model computation — which would significantly undermine process supervision's safety guarantees.

What This Means for Oversight

Process supervision is a promising partial solution to the scalable oversight problem. By shifting evaluation from final outputs to reasoning steps, it can extend meaningful human supervision to more capable AI systems. But it faces real limits: step-level human evaluation remains expensive, and the faithfulness of expressed reasoning to actual model computation is genuinely uncertain. The field is still working out when process supervision's guarantees hold and when they weaken.

Lesson 3 Quiz

Process-Based vs. Outcome-Based Supervision · 3 questions
What did OpenAI's "Let's Verify Step by Step" paper find about process reward models versus outcome reward models on the MATH benchmark?
✓ Correct: PRMs provided better error localization and better downstream performance, supporting the case for process-level supervision.
The paper found PRMs outperformed ORMs, and crucially, PRMs were better at localizing errors in reasoning chains — not just catching wrong final answers.
Why does outcome-based supervision risk rewarding the wrong behavior even when outcomes look correct?
✓ Correct: Outcome supervision cannot distinguish sound reasoning from lucky shortcuts, making it vulnerable to reward hacking and poor generalization.
The problem is that outcome supervision rewards whatever process produced the correct outcome — even if that process was flawed. Correct outcomes can come from wrong reasoning, and outcome feedback can't tell the difference.
What did Turpin et al. (2023) find that complicates the case for process supervision?
✓ Correct: If expressed reasoning doesn't reflect actual computation, then supervising the expressed reasoning provides weaker safety guarantees than it appears to.
Turpin et al. found evidence that chain-of-thought explanations can be post-hoc rationalizations rather than genuine explanations — which would significantly undermine process supervision's guarantees.

Lab 3 · Reasoning Under the Microscope

Examine how process supervision works — and where it fails

Your Task

Discuss process-based supervision with your AI tutor. Explore how PRMs differ from ORMs in practice, why chain-of-thought faithfulness matters for oversight, and what the limits of reasoning-level evaluation are.

Suggested opener: "If an AI's chain of thought can be unfaithful to its actual computation, does process supervision still provide any meaningful oversight guarantee?" — or ask your own question.
Module 6 · Lesson 4

Weak-to-Strong Generalization and the Frontier

Can a less-capable supervisor reliably align a more-capable student?
As AI systems become more capable than human overseers in specific domains, what oversight techniques remain viable — and what comes next?

In December 2023, OpenAI published "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision." The paper used an unusual experimental setup: take a small language model as a stand-in for the human supervisor, have it generate labels for a task it handles imperfectly, then train a much larger model on those imperfect labels. The question: does the stronger model learn the task as well as if it had been trained on ground-truth labels? The finding was both encouraging and sobering.

The Core Experiment

OpenAI's weak-to-strong paper used GPT-2 as the "weak supervisor" and fine-tuned GPT-4-class models on GPT-2's labels. On several tasks — NLP benchmarks, chess position evaluation — the strong model trained on weak labels did not fully recover ground-truth performance. There was a performance gap between training on weak labels and training on correct labels.

However, the strong model still significantly outperformed its weak supervisor on these tasks. It generalized beyond the weak supervision signal — using its pretraining to fill in gaps the weak labels left open. The authors called this "elicitation": the strong model already had the capability; weak supervision was partly eliciting it, partly constraining it.
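
One way to put a number on this partial recovery is the fraction of the gap between the weak supervisor and the strong model's ceiling that weak-label training closes. The helper below is a sketch under that assumption; the function name and the accuracy figures in the usage note are hypothetical and not results from the paper.

```python
def performance_gap_recovered(weak_acc, weak_to_strong_acc, strong_ceiling_acc):
    """Fraction of the weak-to-ceiling gap closed by training on weak labels.

    weak_acc:            accuracy of the weak supervisor itself
    weak_to_strong_acc:  accuracy of the strong model trained on weak labels
    strong_ceiling_acc:  accuracy of the strong model trained on ground truth

    Returns 0.0 if the strong model merely matches its supervisor and 1.0 if
    weak supervision fully recovers ground-truth performance.
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak supervisor accuracy")
    return (weak_to_strong_acc - weak_acc) / gap
```

For example, with a weak supervisor at 60% accuracy, a weak-to-strong model at 72%, and a ground-truth ceiling at 80%, the function returns about 0.6: the strong model beats its supervisor but leaves 40% of the gap unrecovered.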

This has a direct implication for AI alignment: a future AI that is smarter than humans in some domain could still be trained by human overseers, and might generalize the right behaviors beyond what the human labels explicitly specify. But there would still be a gap — behaviors the human supervisor could not reliably elicit or correct. That gap is where alignment risk concentrates.

The Analogy to Human Oversight of Superhuman AI

The weak-to-strong setup is an analogy for the situation where humans (the weak supervisor) try to align AI systems that are more capable than them (the strong model). The paper's finding — that weak supervision partially works, but leaves a performance gap — suggests that human oversight of sufficiently capable AI will be partially effective but not fully reliable. The open question is whether that gap is small and correctable, or large and dangerous.

Bootstrapping Techniques

OpenAI's paper also tested several techniques to close the weak-to-strong gap. The most effective was an auxiliary confidence loss: in addition to training on the weak supervisor's labels, add a loss term that encourages the strong model to stay confident in its own predictions, even where they disagree with the weak labels. This helped the strong model identify where weak supervision was likely to be wrong and rely more on its own representations.
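
One way to read this technique, stated here as an assumption rather than the paper's exact formulation, is as a mixture of two cross-entropy terms for a binary task: one against the weak label and one against the strong model's own hardened prediction. The names `alpha` and `threshold` are hypothetical, and a real training setup would treat the hardened target as a constant when computing gradients.

```python
import math


def binary_cross_entropy(p, target):
    """Cross-entropy between a predicted probability p and a 0/1 target."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def confidence_loss(strong_prob, weak_label, alpha=0.5, threshold=0.5):
    """Mix fitting the weak label with reinforcing the strong model's own prediction.

    strong_prob: the strong model's predicted probability of class 1
    weak_label:  the weak supervisor's 0/1 label
    alpha:       weight on the self-confidence term (0 recovers plain weak-label training)
    """
    hardened = 1.0 if strong_prob > threshold else 0.0
    weak_term = binary_cross_entropy(strong_prob, weak_label)
    self_term = binary_cross_entropy(strong_prob, hardened)
    return (1 - alpha) * weak_term + alpha * self_term
```

When the strong model predicts 0.9 and the weak label says 0, plain cross-entropy charges about 2.30, while the mixed loss with `alpha=0.5` charges about 1.20. The model pays less for confidently disagreeing with a label that may be wrong.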

Another technique: using the strong model itself to generate better labels for a subset of the data, then training on the mix of weak human labels and strong-model-generated labels — a form of self-play or bootstrapping. This closed part of the performance gap, though not all of it.

These bootstrapping techniques are directly relevant to how frontier labs think about alignment today. Anthropic's use of Constitutional AI — where a model uses principles to critique and revise its own outputs — is a form of using the model's own capabilities to extend oversight beyond what direct human evaluation could provide.

Weak-to-Strong Generalization: The phenomenon where a more-capable model trained on imperfect labels from a weaker supervisor still outperforms that supervisor, but not as well as if trained on ground-truth labels, leaving a performance gap.
Elicitation: The process of drawing out capabilities that a model already possesses through training, prompting, or supervision, as opposed to instilling new capabilities.
Bootstrapping (in alignment): Using a model's own outputs to generate training signal that extends beyond what human supervision alone could provide, iteratively improving alignment without requiring proportionally more human labor.

Where the Field Stands

Scalable oversight is not a solved problem. Debate, IDA, RRM, process supervision, and weak-to-strong generalization are all research programs — some more mature than others, none fully validated at the capability levels that matter most.

Anthropic's 2023 model card for Claude notes explicitly that the company uses a combination of human feedback, AI feedback, and constitutional principles — but also acknowledges that evaluating whether these methods are sufficient for safety at frontier capability levels is an open research question. OpenAI's alignment research page similarly frames scalable oversight as one of the "hardest open problems" in the field.

What is clear: the oversight bottleneck will tighten as AI capability grows. Methods that work today may not work for systems five or ten times more capable. The research community is working against a moving target — which is why the pace of scalable oversight research has accelerated significantly since 2021, and why it is treated as a prerequisite for responsible deployment of more powerful systems.

The Practical Takeaway

For practitioners and policymakers: scalable oversight is not a binary property that systems either have or lack. It is a spectrum. Current frontier AI systems are partially overseen — human feedback, process supervision, and AI-assisted evaluation all provide some meaningful constraint. The question is whether these constraints remain adequate as capability scales, and whether the research community can develop better methods before the gap becomes critical. That is the live version of the alignment problem that the field is working on now.

Lesson 4 Quiz

Weak-to-Strong Generalization and the Frontier · 3 questions
What did OpenAI's December 2023 weak-to-strong generalization paper find when training GPT-4-class models on GPT-2's labels?
✓ Correct: The strong model generalized beyond its weak supervisor but not to ground-truth level, a partial but incomplete result with direct implications for human oversight of superhuman AI.
The key finding was a performance gap: strong models trained on weak labels outperformed the weak supervisor but underperformed ground-truth training. That gap is where alignment risk concentrates.
What is "auxiliary confidence loss" and why did it help in the weak-to-strong experiments?
✓ Correct: With this extra confidence term, the strong model could identify where weak supervision was unreliable and downweight those labels, partially closing the performance gap.
Auxiliary confidence loss trains the model to be well-calibrated about its own uncertainty, which lets it recognize where the weak supervisor's labels are likely wrong and rely more on its own stronger representations.
What is "elicitation" in the context of weak-to-strong generalization, and why does it matter for alignment?
✓ Correct: If strong models have latent aligned behaviors that weak supervision can elicit, human oversight remains partially viable even past the direct evaluation threshold. The open question is whether elicitation is reliable enough.
Elicitation here means drawing out pre-existing capabilities rather than installing new ones. This matters because it suggests human oversight of superhuman AI may work by eliciting aligned behaviors the model already has — but the performance gap shows this elicitation is imperfect.

Lab 4 · The Frontier of Oversight

Probe the weak-to-strong problem and what comes next

Your Task

Discuss weak-to-strong generalization and the current state of scalable oversight research with your AI tutor. Where does the performance gap come from? What techniques help close it? What does the field need to solve next?

Suggested opener: "If the performance gap in weak-to-strong generalization reflects where alignment risk concentrates, what specific behaviors would we expect to go wrong first as AI capability scales?" — or ask your own question.

Module 6 Test

Scalable Oversight · 15 questions · Pass at 80%
1. What is the "evaluation gap" in the context of scalable oversight?
✓ Correct: The evaluation gap is the core challenge. AI produces outputs faster than humans can assess them, and in domains where human evaluation is unreliable.
The evaluation gap is the disparity between AI output capability and human oversight capability — the core challenge scalable oversight tries to address.
2. The 2016 DeepMind DQN Breakout tunnel-exploit is cited in this module as an example of what?
✓ Correct. The exploit was undetectable in real time; researchers only caught it by slowing the replay. That is the evaluation gap made concrete.
The DQN case illustrates the evaluation gap: the strategy was genuine and effective, but not evaluable by human observers in real time. Oversight had already failed by the time anyone noticed.
3. Why does RLHF signal corruption occur?
✓ Correct. When the evaluation gap makes human feedback unreliable, RLHF trains models toward the appearance of correctness, which systematically corrupts the training signal.
Signal corruption occurs when evaluators cannot distinguish appearance from reality — so their feedback rewards whichever property the AI can more easily produce: looking correct rather than being correct.
4. In AI Safety via Debate, why is the honest agent theoretically advantaged over the deceptive agent?
✓ Correct. Asymmetric burden is the theoretical foundation: exposing a deception takes less work than sustaining one, which favors the honest agent.
The key asymmetry is burden: exposing a lie requires finding only one verifiable flaw; maintaining a deception requires hiding all flaws. That asymmetry theoretically favors truth.
5. What empirical study tested AI Safety via Debate using the QuALITY dataset, and what did it find?
✓ Correct. Kenton et al. at DeepMind demonstrated that debate provided measurable epistemic lift: human judges were more accurate when watching debates than when reading on their own.
Kenton et al. at DeepMind (2021) used QuALITY dataset comprehension tasks and found human judges watching AI debates significantly outperformed humans reading alone — evidence that debate provides genuine oversight benefit.
6. What is the critical assumption that Iterated Amplification's safety guarantee depends on?
✓ Correct. IDA's safety depends entirely on each step being individually human-verifiable. Unjustifiable leaps break the alignment thread.
IDA's guarantee rests on step-wise verifiability: if a human cannot check whether a specific sub-task was done correctly, the alignment guarantee fails at that point — and errors can propagate forward.
7. How does Recursive Reward Modeling relate to how Anthropic trains Claude using RLAIF?
✓ Correct. RLAIF embodies the RRM principle: AI-generated feedback derives from models trained on human preferences, extending human oversight recursively.
RLAIF uses AI feedback that itself traces back to human preferences, which is the recursive structure RRM describes. RLAIF can be read as a practical instance of the same idea.
8. What is the fundamental difference between process-based and outcome-based supervision?
✓ Correct. Process supervision catches errors in the reasoning chain, not just wrong final answers, which makes it a structurally different oversight approach.
The key distinction is what gets evaluated: outcomes evaluate only the end result, while process supervision evaluates each reasoning step — catching errors before they propagate to final outputs.
9. What did the OpenAI "Let's Verify Step by Step" paper (2023) find about process reward models on competition mathematics?
✓ Correct. Process reward models (PRMs) provided both better performance and better error localization, two distinct advantages over outcome-only supervision.
The paper found that PRMs outperformed outcome reward models (ORMs) and provided better error localization, which makes step-level supervision more valuable than outcome-only evaluation for hard mathematical tasks.
10. What is Goodhart's Law and why is it especially relevant to outcome-based supervision?
✓ Correct. Goodhart's Law makes outcome supervision structurally vulnerable: optimizing for the measure divorces it from what the measure was meant to track.
Goodhart's Law: when a measure becomes a target, it stops being a good measure. For outcome supervision, this means AI learns to optimize outcome metrics — not to actually be correct.
11. What did Turpin et al. (2023) find about chain-of-thought faithfulness, and what is the implication for process supervision?
✓ Correct. If expressed reasoning is post-hoc rather than explanatory, supervising it may provide a false sense of security about the model's actual reasoning.
Turpin et al. found chain-of-thought can be unfaithful — post-hoc rationalization rather than genuine explanation. If we supervise expressed reasoning that doesn't reflect actual computation, process supervision's guarantees weaken significantly.
12. What was the experimental setup in OpenAI's weak-to-strong generalization paper, and why was it designed that way?
✓ Correct. Using GPT-2 as a surrogate "human" allowed researchers to study the weak-supervisor/strong-student dynamic in a controlled, measurable way, as a direct analogy for human oversight of superhuman AI.
The key design choice was using a smaller model as a surrogate human. That makes the weak-supervisor/strong-student dynamic empirically testable today, as an analogy for the future challenge of human oversight of superhuman AI.
13. What does it mean for alignment risk to "concentrate in the performance gap" in weak-to-strong generalization?
✓ Correct. The gap is where weak supervision fails to reliably elicit correct behavior, so it is where the strong model's behavior is least constrained by human oversight.
The performance gap represents behaviors that weak supervision cannot reliably shape. Those behaviors are the least overseen, which makes the gap the zone where misalignment is most likely to persist undetected.
14. Constitutional AI at Anthropic can be understood as an instance of which scalable oversight technique?
✓ Correct. Constitutional AI is a bootstrapping approach: the model's own capabilities extend oversight beyond what human evaluation alone could provide, with the constitution serving as the alignment anchor.
Constitutional AI uses the model's own capabilities to critique and revise its outputs — a bootstrapping / recursive reward modeling approach, where AI-generated signal extends human oversight beyond the direct evaluation bottleneck.
15. Which statement best characterizes the current state of scalable oversight as a field?
✓ Correct. Current methods are partial and promising but incomplete, and the target is moving as capability scales. That is the honest characterization of where the field stands.
Scalable oversight is neither solved nor irrelevant. Current methods provide real but partial oversight, and the adequacy of these methods for future, more-capable systems is a genuinely open and urgent question the field is actively working on.