Module 8 · Lesson 1

Long-Horizon Planning and Autonomous Execution

How agents are breaking the single-turn barrier — running for hours, days, and weeks without human checkpoints

When an agent can run for a week on its own, what does "control" even mean?

In November 2023, Cognition AI was barely a company. By March 2024 it released Devin — publicly described as the first AI software engineer. A demo showed Devin autonomously receiving a task, opening a browser, reading documentation, writing code, running tests, debugging failures, and committing a working solution — all without a human in the loop. The AI research community immediately began stress-testing the claim. The real story was more complicated, and more instructive.

What "Long-Horizon" Actually Means

Most LLM-based systems before roughly 2023 operated in what is often called short-horizon mode: receive a prompt, produce a response, stop. The user evaluates the result and issues the next prompt, so the human provides continuity. Long-horizon agents invert this: the agent itself maintains a running plan across many sub-tasks, tool calls, and decisions, sometimes across hours or days of wall-clock time.

The key technical ingredients that made this shift possible are not new individually. What changed is their combination: persistent memory stores that survive across sessions, structured planning loops (goal decomposition → sub-task execution → verification → replanning), reliable tool-use with error handling, and sufficiently large context windows to hold the full task state.
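The structured planning loop named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular framework's implementation: `run_plan`, `SubTask`, and the four callables (`decompose`, `execute`, `verify`, `replan`) are hypothetical names, and in a real agent each callable would wrap a model or tool call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubTask:
    description: str
    attempts: int = 0


def run_plan(
    goal: str,
    decompose: Callable[[str], List[str]],
    execute: Callable[[str], str],
    verify: Callable[[str, str], bool],
    replan: Callable[[str, str], List[str]],
    max_attempts: int = 3,
) -> List[str]:
    """Goal decomposition -> execution -> verification -> replanning."""
    queue = [SubTask(d) for d in decompose(goal)]
    completed: List[str] = []
    while queue:
        task = queue.pop(0)
        task.attempts += 1
        result = execute(task.description)
        if verify(task.description, result):
            completed.append(task.description)
        elif task.attempts < max_attempts:
            # Replanning: swap the failed sub-task for revised ones,
            # carrying the attempt count so retries stay bounded.
            revised = [SubTask(d, task.attempts) for d in replan(task.description, result)]
            queue = revised + queue
        else:
            raise RuntimeError(f"gave up on sub-task: {task.description}")
    return completed


# Toy stand-ins (assumptions for illustration only).
def toy_decompose(goal: str) -> List[str]:
    return ["reproduce bug", "write fix", "run tests"]


def toy_execute(step: str) -> str:
    return "error" if step == "write fix" else "ok"


def toy_verify(step: str, result: str) -> bool:
    return result == "ok"


def toy_replan(step: str, result: str) -> List[str]:
    return [step + " (revised)"]


completed = run_plan("resolve issue", toy_decompose, toy_execute, toy_verify, toy_replan)
print(completed)
```

The attempt budget is the design choice worth noticing: without it, a sub-task that can never be verified would keep the loop replanning indefinitely.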

SWE-bench, a widely used benchmark for software engineering agents, operationalizes this precisely. Given a GitHub issue, an agent must navigate a real codebase, reproduce the bug, write a fix, and pass a test suite. Reported resolve rates rose from under 5% for the baselines published with the benchmark in late 2023 to 49% for Claude 3.5 Sonnet on the SWE-bench Verified subset in late 2024. Devin reported 13.86% at its March 2024 launch, measured on a sampled subset of the benchmark. That figure was later overtaken by other scaffolds, including open-source ones, but Devin was among the first commercially deployed end-to-end systems.

SWE-Bench Reality Check

Independent researcher Albert Ziegler published a detailed replication study in April 2024 showing that Devin's demo tasks were cherry-picked easier instances. On the full unfiltered benchmark, Devin performed in line with published numbers. The lesson: benchmark selection matters as much as benchmark score.

The Architecture of Sustained Agency

Long-horizon agents need architecture that short-horizon chatbots do not. The core additions are: a task ledger (persistent record of goals, sub-goals, and completion status), a working memory buffer (recent context relevant to current sub-task), and a scratchpad (intermediate reasoning the model writes to and reads from). Together these create something like an executive function — the capacity to track "where am I in this plan and what must I do next."
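One way to picture the three components is as fields on a single state object. The sketch below is an assumption-laden illustration: `AgentState` and its method names are hypothetical, and the working-memory size of 3 is arbitrary.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional


@dataclass
class AgentState:
    goal: str
    # Task ledger: sub-goal name -> "pending" or "done" (insertion-ordered).
    ledger: Dict[str, str] = field(default_factory=dict)
    # Working memory: only the most recent observations are kept.
    working_memory: Deque[str] = field(default_factory=lambda: deque(maxlen=3))
    # Scratchpad: intermediate reasoning, append-only.
    scratchpad: List[str] = field(default_factory=list)

    def add_subgoal(self, name: str) -> None:
        self.ledger[name] = "pending"

    def complete(self, name: str) -> None:
        if name not in self.ledger:
            raise KeyError(f"unknown sub-goal: {name}")
        self.ledger[name] = "done"

    def observe(self, note: str) -> None:
        self.working_memory.append(note)

    def think(self, note: str) -> None:
        self.scratchpad.append(note)

    def next_subgoal(self) -> Optional[str]:
        """Answer 'where am I in this plan and what must I do next?'"""
        for name, status in self.ledger.items():
            if status == "pending":
                return name
        return None


state = AgentState(goal="resolve GitHub issue")
for name in ["reproduce bug", "write fix", "run tests"]:
    state.add_subgoal(name)
state.complete("reproduce bug")
state.think("failure only occurs with empty input")
for note in ["opened parser.py", "ran test suite", "saw traceback", "read docs"]:
    state.observe(note)
print(state.next_subgoal(), list(state.working_memory))
```

The ledger persists for the whole task while working memory deliberately forgets, which is the separation the paragraph above describes.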

OpenAI's o1 (previewed in September 2024) and o3 (announced in December 2024) made extended chain-of-thought a first-class feature. Before generating a final output, the model works through a private scratchpad that can run for hundreds or thousands of tokens. This can improve long-horizon performance because errors in early sub-tasks have a chance to be caught in reasoning before they propagate.

Context Window

1M tokens

Gemini 1.5 Pro's context window (April 2024) — enough to hold entire codebases in a single pass.

SWE-Bench Score

49%

Claude 3.5 Sonnet on SWE-bench Verified, late 2024, up from sub-5% baselines about a year earlier.

Devin Launch

Mar 2024

First commercially deployed full software-engineering agent with public benchmark evaluation.

Failure Modes at Horizon

The longer an agent runs, the more ways it can fail. Three failure modes dominate empirical reports from 2024 deployments. Goal drift: the agent pursues a sub-goal so aggressively it loses track of the original objective — Anthropic's published research on Claude's "computer use" capability noted cases where Claude would fixate on a specific file path rather than the underlying task intent. Compounding error: a wrong assumption in step 3 propagates uncorrected through steps 4–12, producing a coherent-looking but fundamentally wrong result. Context poisoning: the agent's own earlier outputs — wrong code, bad documentation summaries — pollute the context used for later decisions.

Practical long-horizon systems in 2024 addressed these with checkpointing (periodic human review gates), verifiers (automated tests run after each major action), and explicit uncertainty signaling (the agent is prompted to flag when confidence drops below a threshold rather than proceeding blindly).
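The three mitigations can be combined into a single decision made after every step. The following is a rough sketch, not a description of any deployed system: `StepResult`, `guard_step`, the 0.6 confidence floor, and the five-step checkpoint interval are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    action: str
    tests_passed: bool   # outcome of the automated verifier
    confidence: float    # agent's self-reported confidence in [0, 1] (assumed)


def guard_step(
    step_index: int,
    result: StepResult,
    confidence_floor: float = 0.6,
    checkpoint_every: int = 5,
) -> str:
    """Return what the runner should do after a step (step_index is 0-based)."""
    if not result.tests_passed:
        return "rollback"      # verifier failed: do not build on this step
    if result.confidence < confidence_floor:
        return "escalate"      # uncertainty signal: ask a human
    if (step_index + 1) % checkpoint_every == 0:
        return "checkpoint"    # periodic human review gate
    return "continue"


print(guard_step(4, StepResult("edit parser.py", True, 0.9)))
```

The ordering matters: a failed verifier takes priority over everything else, because building on a broken step is how compounding error starts.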

Key Terms

Long-horizon agent: An agent that maintains a running plan across many sub-tasks and tool calls without human re-prompting at each step.

SWE-bench: A widely used benchmark for software engineering agents. Given a GitHub issue, the agent must resolve it in a real codebase.

Goal drift: Failure mode where an agent pursues a sub-goal at the expense of the original task objective.

Checkpointing: Periodic human review gates inserted into long-running agent tasks to catch accumulated errors before they compound.

Lesson 1 Quiz

Long-Horizon Planning and Autonomous Execution

1. What score did Devin achieve on the official SWE-bench evaluation at launch in March 2024?

Correct. Devin's reported SWE-bench score at launch was 13.86%, a landmark result for a commercially deployed system, though other scaffolds later surpassed it. The 49% figure belongs to Claude 3.5 Sonnet.

Not quite. Devin's reported SWE-bench score at launch was 13.86%. Other scaffolds, including open-source ones, later outperformed it on the benchmark.

2. Which failure mode describes an agent losing sight of its original objective while pursuing a sub-task?

Correct. Goal drift is when an agent fixates on a sub-goal at the expense of the original objective — Anthropic's computer-use research documented this pattern.

Not quite. Goal drift is the term for losing the original objective while pursuing a sub-task. Context poisoning refers to the agent's earlier wrong outputs corrupting later decisions.

3. What was Gemini 1.5 Pro's context window size, announced in April 2024?

Correct. Gemini 1.5 Pro launched with a 1 million token context window — large enough to hold entire codebases, enabling qualitatively new long-horizon tasks.

Not quite. Gemini 1.5 Pro's context window was 1 million tokens — a landmark that enabled holding entire codebases in a single pass.

4. What technical feature did OpenAI's o1 introduce that directly improves long-horizon agent performance?

Correct. o1's key innovation was first-class extended chain-of-thought — a private scratchpad where the model reasons before producing output, catching errors before they propagate.

Not quite. o1's defining contribution to long-horizon tasks was its extended chain-of-thought private scratchpad, which lets the model catch early errors before they compound.

Lab 1 — Designing for Long Horizons

Explore architectural choices that separate short- from long-horizon agents

Your Scenario

You are advising a team building a software engineering agent similar to Devin. They want it to autonomously resolve GitHub issues — potentially running for 30–60 minutes per task without human review. They ask you to help them identify which architectural features are essential and which failure modes they must guard against.

Start by asking: "What are the three architectural elements I must have for a 30-minute autonomous coding agent?" Then dig into failure modes and mitigation strategies with the assistant.

Long-Horizon Architecture Advisor

AI Agents M8 · L1

Welcome. I'm here to help you think through the architecture for a long-horizon software engineering agent. What would you like to explore first — the core structural components, the failure modes, or the human oversight checkpoints?

Module 8 · Lesson 2

Multi-Agent Networks and Emergent Coordination

When agents work with agents — the new architecture of distributed AI labor

If ten agents are collaborating on a task none of them fully understands, who is responsible for the outcome?

In early 2024, Anthropic published a technical document on multi-agent architectures used internally. The document described an orchestrator–subagent pattern: one model receives the high-level task and decomposes it into sub-tasks; separate model instances execute each sub-task in parallel. The orchestrator then synthesizes results. Critically, Anthropic noted that subagents cannot verify that their orchestrator is legitimate — a problem with significant security implications they called "prompt injection via orchestration."

Why Multi-Agent Networks Emerged

A single agent faces hard limits: a finite context window, sequential execution, and the difficulty of maintaining coherent expertise across very different domains. Multi-agent systems address these by distributing work. A research task might be split: one agent searches the web, a second reads and summarizes papers, a third cross-checks claims, and a fourth writes the final report. Each agent operates within its context limit; the orchestrator manages the larger plan.
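A minimal sketch of the fan-out and synthesis step is shown below, using Python's standard `concurrent.futures`. It treats the sub-tasks as independent so they can run in parallel; a real research pipeline like the one just described would chain some stages. The `orchestrate` function and the toy subagents are hypothetical, and each lambda stands in for what would be a separate model call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def orchestrate(
    task: str,
    subagents: Dict[str, Callable[[str], str]],
    synthesize: Callable[[Dict[str, str]], str],
) -> str:
    """Fan a task out to role-specific subagents, then merge their outputs."""
    with ThreadPoolExecutor(max_workers=max(1, len(subagents))) as pool:
        futures = {role: pool.submit(agent, task) for role, agent in subagents.items()}
        # .result() blocks until that subagent finishes and re-raises its errors.
        results = {role: future.result() for role, future in futures.items()}
    return synthesize(results)


# Toy subagents (assumptions): each would be a separate model call in practice.
subagents = {
    "searcher": lambda task: f"sources for '{task}'",
    "summarizer": lambda task: f"summary of '{task}'",
    "checker": lambda task: f"claims checked for '{task}'",
}
report = orchestrate(
    "agent benchmarks",
    subagents,
    synthesize=lambda results: " | ".join(results[role] for role in sorted(results)),
)
print(report)
```

Only the orchestrator sees all results; each subagent receives just the task string, which is how the pattern keeps every agent inside its own context limit.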

The commercial ecosystem formalized this quickly. LangChain's multi-agent framework (LangGraph) reached 1 million monthly downloads in early 2024. Microsoft's AutoGen framework — open-sourced in October 2023 — became the most starred AI framework on GitHub within three months, with its "conversational agents" pattern allowing models to message each other with structured handoffs. CrewAI added role specialization: agents are explicitly defined as "researcher," "writer," "critic," and so on, mimicking organizational structure.

Oct 2023

Microsoft AutoGen open-sourced

Became most-starred AI framework on GitHub within 90 days. Introduced structured multi-agent conversation patterns with human-in-the-loop integration.

Jan 2024

CrewAI launches

Role-specialized multi-agent framework with explicit "crew" metaphor. Gained 10,000 GitHub stars in its first two weeks.

Mar 2024

Anthropic multi-agent guidance published

Anthropic's technical documentation on orchestrator–subagent patterns, including first public discussion of trust boundaries between agents.

May 2024

OpenAI releases Assistants API v2

Added persistent threads and tool handoffs, enabling production multi-agent pipelines on GPT-4o without third-party orchestration frameworks.

Trust, Verification, and Security in Agent Networks

Multi-agent systems introduce a security surface single agents do not have: inter-agent trust. When an orchestrator instructs a subagent to execute an action, the subagent has no cryptographic way to verify the instruction is legitimate. A malicious or compromised intermediary could inject instructions. This is the core of prompt injection attacks on agent pipelines, documented in practice in 2024 by security researcher Johann Rehberger, who demonstrated that web content retrieved by a browsing agent could contain hidden instructions that redirected the agent's behavior without the user's knowledge.

The practical mitigations deployed in 2024 include minimal privilege by default (subagents are granted only the specific tools they need), output validation layers (an additional model reviews subagent outputs before they enter the orchestrator's context), and explicit human confirmation gates for high-stakes actions regardless of the agent's confidence level.
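Two of these mitigations, minimal privilege and human confirmation gates, can be enforced in a thin wrapper around the tools a subagent is given. The sketch below is illustrative only: `ScopedToolbox`, `ToolPolicyError`, and the tool names are hypothetical, and the `confirm` callback stands in for a real human prompt. Output validation would sit in a separate layer and is not shown.

```python
from typing import Callable, Dict, Set


class ToolPolicyError(PermissionError):
    """Raised when a subagent requests a tool call its policy forbids."""


class ScopedToolbox:
    def __init__(
        self,
        tools: Dict[str, Callable[..., str]],
        allowed: Set[str],
        high_stakes: Set[str],
        confirm: Callable[[str], bool],
    ) -> None:
        self._tools = tools
        self._allowed = allowed
        self._high_stakes = high_stakes
        self._confirm = confirm

    def call(self, name: str, *args: str) -> str:
        if name not in self._allowed or name not in self._tools:
            raise ToolPolicyError(f"tool not granted: {name}")
        # High-stakes actions need approval regardless of agent confidence.
        if name in self._high_stakes and not self._confirm(name):
            raise ToolPolicyError(f"human declined: {name}")
        return self._tools[name](*args)


tools = {
    "read_file": lambda path: f"contents of {path}",
    "send_email": lambda to: f"sent to {to}",
    "delete_repo": lambda name: f"deleted {name}",
}
toolbox = ScopedToolbox(
    tools,
    allowed={"read_file", "send_email"},
    high_stakes={"send_email"},
    confirm=lambda name: False,   # stand-in for a real human prompt
)
print(toolbox.call("read_file", "notes.txt"))
```

Because the check lives outside the model, an injected instruction can ask for `delete_repo` but cannot obtain it.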

Documented Attack

In 2024, researcher Johann Rehberger demonstrated a live prompt injection attack against a multi-agent system: a webpage retrieved by a browsing agent contained hidden text instructing the agent to exfiltrate conversation history to an external URL. The attack succeeded because the agent treated retrieved web content with the same trust level as system instructions.

Emergent Coordination Patterns

A surprising finding from early multi-agent deployments: agents can develop implicit coordination that was not explicitly programmed. In a 2024 study by Google DeepMind ("Gemini for Tasks"), multiple Gemini agents coordinating on long research tasks showed spontaneous specialization — agents that were initialized identically began producing outputs with distinct characteristics after several rounds of interaction, with some becoming more focused on evidence-gathering and others on synthesis. Whether this is genuine specialization or an artifact of context divergence remains debated.

For practitioners, the relevant fact is simpler: multi-agent systems are harder to predict than their components suggest. Testing a single agent in isolation does not reliably predict its behavior as part of a network. This is why leading teams in 2024 began instrumenting agent networks with interaction tracing — logging not just inputs and outputs but every inter-agent message — to make debugging tractable.
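Interaction tracing needs little machinery to get started. The sketch below records every inter-agent message with a sequence number and exports the log as JSON Lines; `InteractionTracer` and `TraceEvent` are hypothetical names, and a production system would likely add timestamps and durable storage.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class TraceEvent:
    seq: int
    sender: str
    recipient: str
    content: str


class InteractionTracer:
    def __init__(self) -> None:
        self.events: List[TraceEvent] = []

    def record(self, sender: str, recipient: str, content: str) -> TraceEvent:
        event = TraceEvent(len(self.events), sender, recipient, content)
        self.events.append(event)
        return event

    def between(self, a: str, b: str) -> List[TraceEvent]:
        """All messages exchanged between two agents, in order."""
        return [e for e in self.events if {e.sender, e.recipient} == {a, b}]

    def to_jsonl(self) -> str:
        return "\n".join(json.dumps(asdict(e)) for e in self.events)


tracer = InteractionTracer()
tracer.record("orchestrator", "browser", "find three sources on OSWorld")
tracer.record("browser", "orchestrator", "found 3 pages")
tracer.record("orchestrator", "writer", "draft summary")
print(tracer.to_jsonl())
```

Filtering by agent pair, as `between` does, is often the first query needed when a network misbehaves: it shows what one agent actually told another, not what it was supposed to say.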

Key Terms

Orchestrator–subagent pattern: Architecture where one model decomposes tasks and manages sub-agents that execute components in parallel.

Inter-agent trust: The problem that subagents cannot cryptographically verify that instructions from an orchestrator are legitimate.

Prompt injection (pipeline): Attack where content retrieved from external sources contains hidden instructions that hijack agent behavior.

Interaction tracing: Logging every inter-agent message to make multi-agent system debugging tractable.

Lesson 2 Quiz

Multi-Agent Networks and Emergent Coordination

1. Which framework, open-sourced by Microsoft in October 2023, became the most-starred AI framework on GitHub within 90 days?

Correct. Microsoft's AutoGen, open-sourced October 2023, became the most-starred AI framework on GitHub within 90 days, popularizing structured multi-agent conversation patterns.

Not quite. AutoGen — Microsoft's framework open-sourced in October 2023 — achieved this milestone. CrewAI launched later in January 2024.

2. Security researcher Johann Rehberger demonstrated what type of attack against a multi-agent system in 2024?

Correct. Rehberger showed that hidden text in a webpage retrieved by a browsing agent could redirect the agent to exfiltrate conversation history — a real pipeline prompt injection attack.

Not quite. Rehberger's documented attack used hidden instructions in retrieved web content to redirect the agent's behavior — a form of prompt injection targeting the pipeline trust model.

3. What is the core security problem with the orchestrator–subagent pattern as identified by Anthropic?

Correct. Anthropic's documentation explicitly flagged that subagents have no mechanism to verify orchestrator legitimacy — a trust boundary problem with real security consequences.

Not quite. The core problem is verification: subagents cannot confirm that the orchestrator sending them instructions is legitimate, opening the door to injection attacks via compromised intermediaries.

4. What did Google DeepMind's 2024 "Gemini for Tasks" multi-agent study observe about identically-initialized agents after several rounds of interaction?

Correct. The DeepMind study found spontaneous specialization: identically initialized agents diverged over interaction rounds, with some focusing on evidence-gathering and others on synthesis.

Not quite. The study found the opposite of convergence — agents developed distinct characteristics over interaction rounds, a form of spontaneous specialization that was not explicitly programmed.

Lab 2 — Securing Multi-Agent Pipelines

Design trust boundaries and attack mitigations for a real orchestrator–subagent system

Your Scenario

You're the security lead for a company deploying a multi-agent research assistant. The system uses an orchestrator agent that assigns tasks to a web-browsing subagent, a document-reading subagent, and a writing subagent. Your job is to define the trust architecture before launch.

Start by describing the pipeline above and asking: "Where are the highest-risk trust boundaries, and what specific controls should I put at each one?"

Multi-Agent Security Advisor

AI Agents M8 · L2

Ready to help you secure your multi-agent pipeline. Describe your architecture and we'll work through trust boundaries, injection attack surfaces, and practical controls for each interface.

Module 8 · Lesson 3

Computer Use and Physical-World Agency

Agents that see screens, click buttons, and operate outside the sandbox of structured APIs

When an agent can operate any software a human can, what happens to every digital safeguard designed for humans?

On October 22, 2024, Anthropic released computer use as a public beta for Claude 3.5 Sonnet. The capability is straightforward to describe and startling in implication: given access to a virtual machine, Claude can see a screenshot of the screen and decide what to click, type, or drag. No structured API. No defined action space. Just pixels in, actions out — the same interface every piece of software designed for humans provides.

In the launch technical blog, Anthropic was unusually candid: computer use was "one of Anthropic's most risky capabilities to date." They listed specific concerns before listing the use cases.

How Computer Use Works

Anthropic's computer use capability gives Claude three primitive tools: screenshot (capture the current screen state), click (move the cursor and click at coordinates), and type (input keyboard characters). From these three primitives, an agent can operate any graphical software, fill forms, navigate browsers, write and execute code, manage files — anything a human can do on a desktop.

The architecture is a tight vision–action loop. Claude receives a screenshot, outputs a thought and an action, the action is executed in the VM, a new screenshot is taken, and the loop repeats. Each iteration is a complete inference call. A task like "book a flight" might require 20–40 such iterations, each navigating a different screen state. The model must maintain task coherence across all of them without any explicit memory beyond its context window.
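The loop itself is simple to express. The sketch below is not Anthropic's API: `vision_action_loop`, `Action`, and the `FakeDesktop` environment are hypothetical, and the "screenshot" is a page name rather than pixels so the example can run without a VM. The step budget of 40 mirrors the upper end of the iteration count mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    payload: str = ""


def vision_action_loop(
    take_screenshot: Callable[[], str],
    decide: Callable[[str, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 40,
) -> List[Action]:
    """Screenshot -> model inference -> action -> new screenshot, repeated."""
    history: List[Action] = []
    for _ in range(max_steps):
        screen = take_screenshot()          # pixels in
        action = decide(screen, history)    # one inference call per iteration
        history.append(action)
        if action.kind == "done":
            return history
        execute(action)                     # actions out
    raise TimeoutError("step budget exhausted before the task finished")


class FakeDesktop:
    """Toy environment: the 'screenshot' is just the name of the current page."""

    def __init__(self) -> None:
        self.page = "search form"

    def screenshot(self) -> str:
        return self.page

    def execute(self, action: Action) -> None:
        transitions = {"search form": "results", "results": "confirmation"}
        self.page = transitions.get(self.page, self.page)


def toy_policy(screen: str, history: List[Action]) -> Action:
    if screen == "search form":
        return Action("type", "SFO to JFK")
    if screen == "results":
        return Action("click", "first result")
    return Action("done")


desktop = FakeDesktop()
steps = vision_action_loop(desktop.screenshot, toy_policy, desktop.execute)
print([a.kind for a in steps])
```

Passing `history` into `decide` reflects the point made above: the only memory the agent has across iterations is whatever is carried in its context.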

OpenAI followed with Operator, announced in January 2025, taking a slightly different approach: rather than raw screenshot control, Operator uses a web-focused action space with accessibility tree access (structured DOM data alongside pixel input), improving reliability on web tasks while sacrificing generality.

Claude Computer Use Beta

Oct 2024

Anthropic released computer use for Claude 3.5 Sonnet — described in launch docs as "one of our most risky capabilities to date."

OSWorld Benchmark

14.9%

Claude 3.5 Sonnet's score on OSWorld (computer task benchmark, screenshot-only category) at beta launch, vs. ~72% for human evaluators.

OpenAI Operator

Jan 2025

OpenAI's web-use agent, using accessibility tree + pixel input for more reliable web automation than pure screenshot control.

The Security Surface Problem

Computer use fundamentally breaks the assumption underlying most software security: that the entity operating the interface is a human, with human limitations and human accountability. When an agent can click any button, submit any form, and authenticate using stored credentials, the attack surface is every piece of software the agent has access to.

Three specific risk classes dominated security discussions after the October 2024 launch. Credential abuse: an agent with access to a user's browser can interact with any site where the user is already logged in — email, banking, administrative tools — without explicit authorization for that specific action. Cross-context contamination: an agent browsing the web for task A might encounter a malicious page designed to redirect it to perform actions for task B. Irreversibility: many computer actions (sending an email, deleting a file, submitting a form) cannot be undone. A single wrong click has consequences that a wrong word in a chatbot response does not.

Anthropic's mitigations at launch included isolating computer use to fresh VM instances (no persistent credentials), recommending human confirmation before any action that could not be reversed, and publishing explicit guidance that computer use should not be given access to production systems with real credentials during the beta period.
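Guidance like this can be turned into an explicit policy that is checked before each action runs. The sketch below is one possible encoding, not Anthropic's: the `Tier` names, the `ProposedAction` fields, and the rule that stored credentials always require confirmation are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AUTONOMOUS = "autonomous"
    CONFIRM = "needs human confirmation"
    BLOCKED = "out of scope"


@dataclass(frozen=True)
class ProposedAction:
    name: str
    reversible: bool
    uses_stored_credentials: bool
    touches_production: bool


def classify(action: ProposedAction) -> Tier:
    if action.touches_production:
        return Tier.BLOCKED       # no production systems during a beta
    if not action.reversible or action.uses_stored_credentials:
        return Tier.CONFIRM       # a human approves before execution
    return Tier.AUTONOMOUS


draft = ProposedAction("draft welcome email", True, False, False)
send = ProposedAction("send welcome email", False, False, False)
print(classify(draft).value, "/", classify(send).value)
```

Drafting and sending the same email land in different tiers because only one of them can be undone, which is the asymmetry the irreversibility risk describes.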

Benchmark Context

On OSWorld, a benchmark of real computer tasks, Claude 3.5 Sonnet scored 14.9% in the screenshot-only category at launch. Human evaluators score ~72% on the same tasks. This gap is useful context: computer use agents in late 2024 were capable at narrow, well-specified tasks but far from general desktop operation. The gap of nearly 60 percentage points represents real error modes that practitioners must design around.

Agentic Browsing: A Narrower Slice

Even before full computer use, agentic web browsing was deployed at scale. Perplexity AI ran an agentic search product throughout 2024 in which an LLM autonomously issued search queries, read results, and synthesized answers — without any human reviewing the intermediate steps. At peak, Perplexity processed tens of millions of queries per day through this pipeline. The product worked well for informational queries but generated significant controversy in mid-2024 when it was reported to be accessing and summarizing paywalled content and sometimes attributing fabricated claims to real sources — failures of the verification layer, not the browsing layer.

Key Terms

Computer use: Capability allowing an agent to see screenshots and issue click/type commands, operating any graphical software without a structured API.

Vision–action loop: The iterative cycle of screenshot → model inference → action → screenshot that drives computer-use agent execution.

OSWorld: Benchmark measuring agent performance on real computer tasks; humans score ~72%, and Claude 3.5 Sonnet scored 14.9% (screenshot-only) at beta launch.

Irreversibility risk: The asymmetric danger that many computer actions (delete, send, submit) cannot be undone, making agent errors costlier than chatbot errors.

Lesson 3 Quiz

Computer Use and Physical-World Agency

1. What three primitive tools does Anthropic's computer use capability give Claude?

Correct. The three primitives are screenshot (capture screen state), click (move cursor and click at coordinates), and type (keyboard input). Everything else is composed from these.

Not quite. The three primitives are screenshot, click, and type. From these simple tools, an agent can operate any graphical software — but only these three are the base layer.

2. What score did Claude 3.5 Sonnet achieve on the OSWorld computer task benchmark at beta launch?

Correct. Claude 3.5 Sonnet scored 14.9% on OSWorld (screenshot-only) at launch, compared to ~72% for human evaluators: a useful reminder that the capability was real but narrow, not general desktop operation.

Not quite. The score was 14.9% in the screenshot-only category. Human evaluators score ~72% on the same benchmark, illustrating the substantial gap between demonstrated capability and general computer operation.

3. What key difference distinguished OpenAI's Operator from Anthropic's computer use approach?

Correct. Operator used accessibility tree (structured DOM data) alongside pixel input, improving web reliability but sacrificing the generality of pure screenshot control.

Not quite. The architectural difference was Operator's use of accessibility tree data (structured DOM) alongside pixel input — improving web task reliability compared to Anthropic's pure screenshot approach.

4. What specific safety measure did Anthropic recommend for computer use with irreversible actions?

Correct. Anthropic's launch guidance explicitly recommended human confirmation before irreversible actions — recognizing that the asymmetric cost of wrong clicks demands a different safety model than chatbot errors.

Not quite. Anthropic's recommended mitigation was requiring human confirmation before any irreversible action. Automatic rollback was not part of the launch guidance.

Lab 3 — Scoping Computer Use Safely

Design a risk-tiered deployment policy for a computer-use agent in a real enterprise context

Your Scenario

Your company wants to deploy a computer-use agent to help HR staff onboard new employees: creating accounts, filling intake forms, and generating welcome emails. Before deployment, you need a risk tiering policy — which actions can the agent take autonomously, which need human confirmation, and which are out of scope entirely.

Start by asking: "Help me create a three-tier risk policy for a computer-use agent handling HR onboarding tasks. What belongs in each tier?"

Computer Use Policy Advisor

AI Agents M8 · L3

Let's build your risk tiering policy for computer-use agents. I'll help you think through action reversibility, credential exposure, and data sensitivity for each category of task. What actions does your HR onboarding agent need to perform?

Module 8 · Lesson 4

Alignment, Oversight, and the Road Ahead

What frontier agent capability means for human control — and the real institutional responses to that question

If we can build agents that outperform humans at most tasks, can we still meaningfully oversee them?

On May 16, 2023, OpenAI CEO Sam Altman testified before a U.S. Senate Judiciary subcommittee, alongside IBM's Christina Montgomery and AI researcher Gary Marcus. The topic was AI oversight and risk. Altman's testimony included a striking admission: "If this technology goes wrong, it can go quite wrong." Six months later, OpenAI's board briefly removed Altman over undisclosed disagreements, reinstated him within five days under pressure from employees and investors, and restructured its governance. The episode revealed how thin the institutional frameworks governing frontier AI actually are.

The Scalable Oversight Problem

The core challenge is not that agents will be obviously malicious. It is subtler: as agents become more capable than humans at specific tasks, humans lose the ability to evaluate whether the agent's output is correct. A human reviewer cannot reliably detect a subtle flaw in a 10,000-line codebase generated by a coding agent. They cannot verify a complex legal analysis on a domain they lack expertise in. This is the scalable oversight problem — the question of how to maintain meaningful human supervision as agent capability exceeds human domain expertise.

Anthropic's published work in this area includes Constitutional AI, a training method, and later its Responsible Scaling Policy (RSP). The RSP, published in September 2023 and updated in October 2024, defines capability thresholds called "AI Safety Levels" at which the company commits to pause development and deployment until adequate safety mitigations exist. It was the first such public commitment by a frontier lab to capability-gated deployment, though it is self-imposed rather than legally binding.

Sep 2023

Anthropic publishes Responsible Scaling Policy

First public capability-gated deployment commitment by a frontier lab. Defines AI Safety Levels (ASL-1 through ASL-4) with explicit pause triggers at each threshold.

Nov 2023

OpenAI board crisis

Sam Altman briefly removed and reinstated within five days. Episode exposed fragility of nonprofit-controlled governance structures at commercially scaled AI labs.

Mar 2024

EU AI Act passes

World's first comprehensive AI regulatory framework. Classifies "general purpose AI" (GPAI) systems above a compute threshold as requiring additional transparency obligations.

Oct 2024

NIST AI RMF 1.0 adoption accelerates

U.S. federal agencies begin formally requiring NIST AI Risk Management Framework compliance for AI procurement, including agentic systems.

Technical Approaches to Oversight at Scale

Three research directions dominated frontier alignment work in 2024. Debate (proposed by OpenAI): two AI systems argue opposing positions while a human judge evaluates; the idea is that detecting flawed reasoning is easier than generating correct reasoning. Recursive reward modeling: humans supervise AI assistants that help other AI assistants, creating a supervision hierarchy. Interpretability: mechanistic understanding of model internals to detect deceptive or misaligned behavior before it manifests in outputs. Anthropic's 2024 interpretability work on "features" in Claude — identifying which internal representations correspond to specific concepts — was published as a major research milestone.
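Of the three directions, debate has the most mechanical structure, so its protocol can be sketched directly. This is a toy under stated assumptions: `run_debate` is a hypothetical name, the debaters are lambdas rather than models, and the judge (a human in the proposal) is a stub that returns no verdict.

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]   # (speaker, argument)
Debater = Callable[[str, Transcript], str]


def run_debate(
    question: str,
    debater_a: Debater,
    debater_b: Debater,
    judge: Callable[[str, Transcript], str],
    rounds: int = 2,
) -> Tuple[str, Transcript]:
    transcript: Transcript = []
    for _ in range(rounds):
        # Each debater sees everything said so far and can rebut it.
        transcript.append(("A", debater_a(question, list(transcript))))
        transcript.append(("B", debater_b(question, list(transcript))))
    return judge(question, transcript), transcript


winner, transcript = run_debate(
    "Is the patch correct?",
    debater_a=lambda q, t: f"A argues yes (turn {len(t)})",
    debater_b=lambda q, t: f"B argues no (turn {len(t)})",
    judge=lambda q, t: "undecided",   # stand-in for a human judge
)
for speaker, argument in transcript:
    print(speaker, argument)
```

The judge only ever reads the transcript, never the underlying artifact, which is what lets the approach extend to outputs the judge could not have produced.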

None of these is solved. What the 2024 research record shows is that the alignment community is aware of the scalable oversight problem, has multiple active research threads, and has produced partial but not comprehensive solutions. The honest summary for practitioners: current alignment techniques provide meaningful but not absolute guarantees for frontier agents.

The EU AI Act and Agents

The EU AI Act, passed in March 2024, classifies AI systems used in critical infrastructure, employment, education, and law enforcement as "high-risk" systems subject to mandatory conformity assessment, human oversight requirements, and transparency obligations; agents deployed in those domains fall under the same rules. General-purpose AI models trained with more than 10²⁵ FLOPs of compute are presumed to pose systemic risk and face additional obligations. These were among the first legally binding requirements applicable to agents in a major jurisdiction.

What the Frontier Looks Like in Practice

By the end of 2024, the frontier of agent capability was roughly this: agents could reliably complete well-specified software engineering sub-tasks (SWE-bench ~50%), navigate web interfaces to complete single-session tasks (WebArena ~40%), and operate desktop software for narrow use cases (OSWorld ~15%). They could maintain coherent plans across dozens of steps but drifted on tasks exceeding ~60 minutes. They could coordinate in multi-agent networks with meaningful quality improvement over single agents but with non-trivial security surfaces.

The institutional layer was catching up slowly: the EU AI Act was law but not yet in enforcement phase; the U.S. had executive orders (Biden's October 2023 AI EO) but no comprehensive legislation; Anthropic's RSP was a self-imposed constraint with no external enforcement mechanism. The gap between technical capability and institutional governance was, by every measure, widening.

Key Terms

Scalable oversight: The research problem of maintaining meaningful human supervision as agent capability exceeds human domain expertise in specific tasks.

Responsible Scaling Policy (RSP): Anthropic's capability-gated deployment commitment, defining AI Safety Levels at which development pauses until mitigations exist.

Debate (alignment technique): Method where two AI systems argue opposing positions and a human judges, exploiting the easier task of detecting flawed reasoning vs. generating correct reasoning.

EU AI Act: World's first comprehensive AI regulatory framework (passed March 2024), classifying agents in critical domains as high-risk and imposing mandatory oversight requirements.

Lesson 4 Quiz

Alignment, Oversight, and the Road Ahead

1. Anthropic's Responsible Scaling Policy defines evaluation thresholds called what?

Correct. The RSP defines AI Safety Levels (ASL-1 through ASL-4), with explicit commitments to pause development and deployment at each threshold until adequate mitigations exist.

Not quite. Anthropic's RSP uses the term "AI Safety Levels" (ASL) for its capability thresholds. These are the trigger points at which the company commits to pause until mitigations exist.

2. What is the core idea behind the "debate" alignment technique proposed by OpenAI?

Correct. Debate's key insight is asymmetry: detecting flawed reasoning is an easier task for human judges than generating correct reasoning — allowing humans to maintain oversight even when they couldn't produce the answer themselves.

Not quite. Debate pits two AI systems against each other with a human judge. The key insight is that humans can detect bad arguments more reliably than they can generate good ones — enabling oversight beyond human expertise level.

3. What compute threshold triggers additional systemic risk obligations under the EU AI Act for general-purpose AI models?

Correct. The EU AI Act's 10²⁵ FLOPs training compute threshold is the trigger for GPAI systemic risk obligations, a specific technical measure embedded in binding law.

Not quite. The EU AI Act sets the threshold at 10²⁵ FLOPs of training compute, above which general-purpose AI models face additional systemic risk transparency and assessment obligations.

4. What event in November 2023 exposed fragility in AI lab governance structures?

Correct. The November 2023 OpenAI board crisis — Altman removed and reinstated in five days — revealed how thin governance structures at commercially scaled frontier labs actually are.

Not quite. The November 2023 OpenAI board crisis — Sam Altman's five-day removal and reinstatement — was the event that exposed the fragility of nonprofit-controlled governance at commercial scale.

Lab 4 — Oversight Policy Design

Apply scalable oversight principles to a real frontier agent deployment scenario

Your Scenario

You are advising a hospital system that wants to deploy a long-horizon AI agent to assist radiologists: the agent would read scan reports, flag anomalies, suggest differential diagnoses, and draft referral letters — autonomously, at scale. You need to design an oversight framework that works even as the agent's diagnostic accuracy approaches or exceeds average radiologist performance.

Start by asking: "What does meaningful human oversight look like when the agent may be more accurate than the human supervisor in specific diagnostic tasks?"

Frontier Oversight Policy Advisor


This is one of the hardest problems in deployed AI: designing oversight for a system that may outperform its supervisors. Let's work through what "meaningful oversight" actually means in high-stakes domains where human verification is unreliable. What specific tasks do you want the agent to handle autonomously vs. with human review?
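
One possible starting point for the lab is to write the oversight policy down as an explicit routing rule, so that its assumptions can be argued about. The sketch below is illustrative only: the task names, the `CONFIDENCE_FLOOR` value, and the `route` function are all hypothetical, and a real policy would be set with clinicians and validated against outcome data. It shows one design choice in particular: the agent's self-reported confidence can raise the level of review but never lower it for irreversible or clinically decisive outputs.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    SPOT_CHECK = "spot_check"      # proceeds, but is sampled for later audit
    HUMAN_REVIEW = "human_review"  # a radiologist must sign off first


@dataclass(frozen=True)
class AgentOutput:
    task: str           # hypothetical labels, e.g. "flag_anomaly"
    confidence: float   # agent's self-reported confidence in [0, 1]
    irreversible: bool  # True if acting on the output is hard to undo


# Hypothetical policy values, chosen only for illustration.
ALWAYS_REVIEW_TASKS = {"suggest_diagnosis"}
CONFIDENCE_FLOOR = 0.90


def route(output: AgentOutput) -> Route:
    """Decide how much human oversight a single agent output receives."""
    if not 0.0 <= output.confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    # Irreversible or clinically decisive outputs always get sign-off,
    # no matter how confident the agent claims to be.
    if output.irreversible or output.task in ALWAYS_REVIEW_TASKS:
        return Route.HUMAN_REVIEW
    # Low self-reported confidence escalates to a human.
    if output.confidence < CONFIDENCE_FLOOR:
        return Route.HUMAN_REVIEW
    # Everything else proceeds but remains eligible for audit sampling.
    return Route.SPOT_CHECK


if __name__ == "__main__":
    print(route(AgentOutput("draft_referral", 0.99, irreversible=True)))
    print(route(AgentOutput("flag_anomaly", 0.95, irreversible=False)))
```

Note that this sketch has no fully unmonitored path: even the lowest tier is audited after the fact. Whether that is affordable at hospital scale, and whether self-reported confidence is trustworthy enough to gate on at all, are exactly the questions the lab asks you to defend.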

Module 8 Test

The Frontier of Agent Capability · 15 questions · Pass at 80%

1. What was the primary architectural innovation that enabled "long-horizon" agents compared to short-horizon chatbots?

Correct. The shift to long-horizon was architectural: persistent task ledgers, working memory, and structured planning loops — not raw model size.

Not quite. The key was architectural: persistent state management across sub-tasks through task ledgers, working memory buffers, and planning loops.

2. At its March 2024 launch, what score did Devin achieve on the official SWE-bench evaluation?

Correct. Devin's official SWE-bench score was 13.86% — significant as the first commercially deployed end-to-end software engineering agent, though below several open-source scaffolds.

Not quite. Devin scored 13.86% on the official SWE-bench evaluation. The 49% figure belongs to Claude 3.5 Sonnet with SWE-agent scaffolding, mid-2024.

3. What is "context poisoning" in long-horizon agents?

Correct. Context poisoning is self-contamination: wrong code, bad summaries, or incorrect intermediate results produced earlier in the task corrupt later decisions in the same run.

Not quite. Context poisoning is an internal failure: the agent's own earlier wrong outputs contaminate the context it uses for subsequent decisions.

4. Microsoft's AutoGen framework was open-sourced in which month and year?

Correct. AutoGen was open-sourced in October 2023 and became the most-starred AI framework on GitHub within 90 days.

Not quite. AutoGen launched in October 2023, becoming GitHub's most-starred AI framework within three months.

5. The prompt injection attack documented by Johann Rehberger against multi-agent systems worked by:

Correct. Rehberger's attack embedded hidden instructions in webpage content; the browsing agent treated retrieved content with the same trust level as system instructions, enabling the redirect.

Not quite. The attack used hidden text in a webpage to redirect the agent — exploiting the agent's failure to distinguish retrieved external content from trusted system instructions.

6. What does the "minimal privilege by default" mitigation mean in multi-agent security?

Correct. Minimal privilege limits each subagent's tool access to only what its specific sub-task requires, reducing the blast radius of any compromise or misbehavior.

Not quite. Minimal privilege means each subagent receives only the tools its specific sub-task requires — not a blanket full tool set that might be exploited.

7. Anthropic's computer use capability was released as a public beta in which month and year?

Correct. Computer use launched as a public beta on October 22, 2024, alongside the upgraded Claude 3.5 Sonnet.

Not quite. Computer use launched in October 2024. OpenAI's Operator followed in January 2025.

8. What success rate do humans achieve on OSWorld, the computer-task benchmark?

Correct. Humans complete approximately 72% of OSWorld tasks, compared to 14.9% for Claude 3.5 Sonnet at launch (screenshot-only setting), illustrating the substantial gap between current agents and human-level general computer operation.

Not quite. Humans score ~72% on OSWorld. Claude 3.5 Sonnet scored 14.9% at launch in the screenshot-only setting, showing the scale of the remaining gap.

9. What is the "irreversibility risk" specific to computer-use agents that does not apply to chatbots?

Correct. The asymmetric cost of computer actions — sending an email or deleting a file cannot be undone — demands different safety models than chatbot responses, which can simply be ignored.

Not quite. The core irreversibility risk is that real-world computer actions (send, delete, submit) cannot be undone, unlike chatbot text outputs which can be ignored or corrected in the next turn.

10. What is the "scalable oversight problem" in AI alignment?

Correct. Scalable oversight is the fundamental challenge: when agents outperform humans in a domain, human reviewers can no longer reliably detect whether the agent's output is correct or subtly wrong.

Not quite. Scalable oversight is specifically about verification capability: when an agent surpasses human expertise in a domain, humans can no longer reliably evaluate whether the agent's outputs are correct.

11. Anthropic's Responsible Scaling Policy was first published in:

Correct. The RSP was first published in September 2023, making it the first public capability-gated deployment commitment by a frontier AI lab.

Not quite. The RSP was published in September 2023 and updated in October 2024. It was the first public binding commitment to pause development at defined capability thresholds.

12. Which multi-agent framework introduced "role specialization" — defining agents as researcher, writer, critic, etc.?

Correct. CrewAI, launched January 2024, brought explicit role specialization to multi-agent systems — the "crew" metaphor of distinct agents with distinct organizational roles.

Not quite. CrewAI introduced role specialization, launching in January 2024 with a "crew" metaphor. AutoGen focused on conversational agent patterns; LangGraph on stateful graph execution.

13. What compute threshold in the EU AI Act triggers additional systemic risk obligations for general-purpose AI?

Correct. The EU AI Act sets 10²⁵ FLOPs of training compute as the threshold for GPAI systemic risk obligations, making it an early legally binding use of a technical compute metric in AI regulation.

Not quite. The EU AI Act's threshold is 10²⁵ FLOPs, a specific technical measure embedded in binding law that captures frontier models while excluding smaller systems.

14. What key insight makes the "debate" alignment technique promising for scalable oversight?

Correct. Debate's key asymmetry: humans can detect bad arguments more reliably than they can generate good ones — allowing oversight to scale beyond human domain expertise level.

Not quite. The insight is an asymmetry in human cognition: detecting that an argument is wrong is easier than constructing the correct argument from scratch — enabling oversight beyond direct human expertise.

15. What was the approximate SWE-bench score range for the best agents by mid-2024?

Correct. By mid-2024, the leading SWE-bench scores had reached ~45–50% (Claude 3.5 Sonnet with SWE-agent scaffolding at ~49%), up from under 5% eighteen months earlier.

Not quite. By mid-2024, the frontier had reached the 45–50% range. The 5% figure is where the best agents stood in early 2023 — the improvement over 18 months was dramatic.