In November 2023, Cognition AI was barely a company. By March 2024 it had released Devin, publicly described as the first AI software engineer. A demo showed Devin autonomously receiving a task, opening a browser, reading documentation, writing code, running tests, debugging failures, and committing a working solution, all without a human in the loop. The AI research community immediately began stress-testing the claim. The real story was more complicated, and more instructive.
Every agent system before roughly 2023 operated in what researchers call short-horizon mode: receive a prompt, produce a response, stop. The user evaluates the result and issues the next prompt. The human provides continuity. Long-horizon agents invert this: the agent itself maintains a running plan across many sub-tasks, tool calls, and decisions, sometimes across hours or days of wall-clock time.
The key technical ingredients that made this shift possible are not new individually. What changed is their combination: persistent memory stores that survive across sessions, structured planning loops (goal decomposition → sub-task execution → verification → replanning), reliable tool use with error handling, and sufficiently large context windows to hold the full task state.
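The planning loop named above (decompose, execute, verify, replan) can be sketched in a few lines. This is a minimal illustration, not any particular product's implementation: `run_plan`, its callable parameters, and the `max_replans` budget are all hypothetical names chosen for the example, and replanning is simplified to "decompose the failed sub-task again and try those pieces first."

```python
from typing import Callable, List, Tuple


def run_plan(
    goal: str,
    decompose: Callable[[str], List[str]],
    execute: Callable[[str], str],
    verify: Callable[[str, str], bool],
    max_replans: int = 3,
) -> List[Tuple[str, str]]:
    """Decompose a goal, run each sub-task, verify it, and replan on failure."""
    pending = list(decompose(goal))
    completed: List[Tuple[str, str]] = []
    replans = 0
    while pending:
        task = pending.pop(0)
        result = execute(task)
        if verify(task, result):
            completed.append((task, result))
            continue
        if replans >= max_replans:
            raise RuntimeError(f"gave up on {task!r} after {replans} replans")
        replans += 1
        # Replan: break the failed sub-task down further and try those pieces first.
        pending = list(decompose(task)) + pending
    return completed
```

The replan budget is the important design choice here: without it, a sub-task that can never pass verification would keep the loop running indefinitely.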
SWE-bench, the standard benchmark for software engineering agents, operationalizes this precisely. Given a GitHub issue, an agent must navigate a real codebase, reproduce the bug, write a fix, and pass a test suite. Between late 2023, when the benchmark was released, and late 2024, leaderboard scores jumped from under 5% (GPT-4 with basic scaffolding) to 49% (Claude 3.5 Sonnet with agent scaffolding, on the SWE-bench Verified subset), with Devin scoring 13.86% on the official evaluation: below several open-source scaffolds, but the first commercially deployed end-to-end system.
Independent researcher Albert Ziegler published a detailed replication study in April 2024 showing that Devin's demo tasks were cherry-picked easier instances. On the full unfiltered benchmark, Devin performed in line with published numbers. The lesson: benchmark selection matters as much as benchmark score.
Long-horizon agents need architecture that short-horizon chatbots do not. The core additions are: a task ledger (persistent record of goals, sub-goals, and completion status), a working memory buffer (recent context relevant to the current sub-task), and a scratchpad (intermediate reasoning the model writes to and reads from). Together these create something like an executive function: the capacity to track "where am I in this plan and what must I do next."
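One way to picture the three components is as plain data structures. The sketch below is illustrative only: `LedgerEntry`, `AgentState`, the status strings, and the buffer size of 3 are assumptions made for the example, not a published design.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class LedgerEntry:
    goal: str
    status: str = "pending"  # one of "pending", "done", "failed"


@dataclass
class AgentState:
    # Task ledger: every goal and its completion status.
    ledger: List[LedgerEntry] = field(default_factory=list)
    # Working memory: a bounded buffer of recent context (size is arbitrary here).
    working_memory: Deque[str] = field(default_factory=lambda: deque(maxlen=3))
    # Scratchpad: intermediate reasoning, append-only in this sketch.
    scratchpad: List[str] = field(default_factory=list)

    def add_goal(self, goal: str) -> None:
        self.ledger.append(LedgerEntry(goal))

    def next_goal(self) -> Optional[LedgerEntry]:
        """Answer 'what must I do next': the first goal still pending."""
        return next((e for e in self.ledger if e.status == "pending"), None)

    def observe(self, note: str) -> None:
        """Keep only the most recent notes; older ones fall out of the buffer."""
        self.working_memory.append(note)

    def think(self, note: str) -> None:
        self.scratchpad.append(note)
```

Keeping the ledger separate from working memory is what lets the agent forget old observations without forgetting where it is in the plan.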
OpenAI's o1 and o3 models introduced extended chain-of-thought as a first-class feature in late 2024: o1-preview shipped in September 2024, and o3 was announced that December. Before generating a final output, the model reasons through a private scratchpad that can run for hundreds or thousands of tokens. This directly improves long-horizon performance because errors in early sub-tasks can be caught in reasoning before they propagate.
The longer an agent runs, the more ways it can fail. Three failure modes dominate empirical reports from 2024 deployments. Goal drift: the agent pursues a sub-goal so aggressively it loses track of the original objective. Anthropic's published research on Claude's "computer use" capability noted cases where Claude would fixate on a specific file path rather than the underlying task intent. Compounding error: a wrong assumption in step 3 propagates uncorrected through steps 4 to 12, producing a coherent-looking but fundamentally wrong result. Context poisoning: the agent's own earlier outputs (wrong code, bad documentation summaries) pollute the context used for later decisions.
Practical long-horizon systems in 2024 addressed these with checkpointing (periodic human review gates), verifiers (automated tests run after each major action), and explicit uncertainty signaling (the agent is prompted to flag when confidence drops below a threshold rather than proceeding blindly).
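The three safeguards combine naturally into one guarded execution loop. The following is a rough sketch under stated assumptions: `guarded_run`, the `(name, action, confidence)` step shape, the 0.7 confidence floor, and the checkpoint interval are all hypothetical, and a real system would obtain the confidence value from the model rather than from a hard-coded number.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], object], float]  # (name, action, self-reported confidence)


def guarded_run(
    steps: List[Step],
    verifier: Callable[[object], bool],
    confidence_floor: float = 0.7,
    checkpoint_every: int = 2,
) -> List[Tuple[str, str]]:
    """Run steps in order, stopping at the first low-confidence or unverified step."""
    log: List[Tuple[str, str]] = []
    for index, (name, action, confidence) in enumerate(steps, start=1):
        if confidence < confidence_floor:
            # Uncertainty signaling: flag and stop instead of proceeding blindly.
            log.append((name, "flagged_low_confidence"))
            break
        result = action()
        if not verifier(result):
            # Verifier: an automated check runs after each action.
            log.append((name, "failed_verification"))
            break
        log.append((name, "ok"))
        if index % checkpoint_every == 0:
            # Checkpointing: a periodic gate where a human can review progress.
            log.append((name, "checkpoint_for_human_review"))
    return log
```

Stopping at the first failed check, rather than continuing, is what keeps a wrong early step from compounding through later ones.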
You are advising a team building a software engineering agent similar to Devin. They want it to autonomously resolve GitHub issues, potentially running for 30 to 60 minutes per task without human review. They ask you to help them identify which architectural features are essential and which failure modes they must guard against.
In early 2024, Anthropic published a technical document on multi-agent architectures used internally. The document described an orchestrator-subagent pattern: one model receives the high-level task and decomposes it into sub-tasks; separate model instances execute each sub-task in parallel. The orchestrator then synthesizes results. Critically, Anthropic noted that subagents cannot verify that their orchestrator is legitimate, a problem with significant security implications they called "prompt injection via orchestration."
A single agent faces hard limits: a finite context window, sequential execution, and the difficulty of maintaining coherent expertise across very different domains. Multi-agent systems address these by distributing work. A research task might be split: one agent searches the web, a second reads and summarizes papers, a third cross-checks claims, and a fourth writes the final report. Each agent operates within its context limit; the orchestrator manages the larger plan.
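The orchestrator pattern described here reduces to three stages: decompose, fan out, synthesize. The sketch below uses only the standard library and stands in plain functions for model calls; `orchestrate`, the `(role, payload)` sub-task shape, and the `subagents` mapping are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

SubTask = Tuple[str, str]  # (role of the subagent, payload to hand it)


def orchestrate(
    task: str,
    decompose: Callable[[str], List[SubTask]],
    subagents: Dict[str, Callable[[str], str]],
    synthesize: Callable[[List[str]], str],
) -> str:
    """Split a task, run each piece on its subagent in parallel, merge the results."""
    subtasks = decompose(task)
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        futures = [pool.submit(subagents[role], payload) for role, payload in subtasks]
        # Collect in submission order so synthesis sees a stable ordering.
        results = [future.result() for future in futures]
    return synthesize(results)
```

For example, with two toy subagents:

```python
agents = {"search": lambda q: f"found:{q}", "summarize": lambda q: f"summary:{q}"}
report = orchestrate(
    "llm agents",
    lambda t: [("search", t), ("summarize", t)],
    agents,
    lambda parts: " | ".join(parts),
)
```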
The commercial ecosystem formalized this quickly. LangChain's multi-agent framework (LangGraph) reached 1 million monthly downloads in early 2024. Microsoft's AutoGen framework, open-sourced in October 2023, became the most starred AI framework on GitHub within three months, with its "conversational agents" pattern allowing models to message each other with structured handoffs. CrewAI added role specialization: agents are explicitly defined as "researcher," "writer," "critic," and so on, mimicking organizational structure.
Multi-agent systems introduce a security surface single agents do not have: inter-agent trust. When an orchestrator instructs a subagent to execute an action, the subagent has no cryptographic way to verify the instruction is legitimate. A malicious or compromised intermediary could inject instructions. This is the core of prompt injection attacks on agent pipelines, documented in practice in 2024 by security researcher Johann Rehberger, who demonstrated that web content retrieved by a browsing agent could contain hidden instructions that redirected the agent's behavior without the user's knowledge.
The practical mitigations deployed in 2024 include minimal privilege by default (subagents are granted only the specific tools they need), output validation layers (an additional model reviews subagent outputs before they enter the orchestrator's context), and explicit human confirmation gates for high-stakes actions regardless of the agent's confidence level.
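Two of these mitigations, minimal privilege and confirmation gates, fit in a single authorization check; output validation is a separate filter in front of the orchestrator. This is a hedged sketch: `ToolPolicy`, `admit_output`, the tool names, and the three return strings are hypothetical, and the reviewer is a plain callable standing in for a reviewing model.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass(frozen=True)
class ToolPolicy:
    allowed: FrozenSet[str]                  # minimal privilege: an explicit allowlist
    high_stakes: FrozenSet[str] = frozenset()  # tools that always need a human

    def authorize(self, tool: str, human_confirmed: bool = False) -> str:
        if tool not in self.allowed:
            return "denied"
        # The gate ignores agent confidence entirely: only a human can open it.
        if tool in self.high_stakes and not human_confirmed:
            return "needs_confirmation"
        return "allowed"


def admit_output(output: str, reviewer: Callable[[str], bool]) -> Optional[str]:
    """Output validation: only reviewed text may enter the orchestrator's context."""
    return output if reviewer(output) else None
```

Note that `authorize` takes no confidence argument at all, which mirrors the point that high-stakes actions are gated regardless of how sure the agent is.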
In 2024, researcher Johann Rehberger demonstrated a live prompt injection attack against a multi-agent system: a webpage retrieved by a browsing agent contained hidden text instructing the agent to exfiltrate conversation history to an external URL. The attack succeeded because the agent treated retrieved web content with the same trust level as system instructions.
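The root cause named here, retrieved content sharing a trust level with system instructions, suggests tagging every message with its provenance. The sketch below is one simplified way to express that idea, not a complete defense: `Trust`, `Message`, and `split_context` are hypothetical names, and separating text into "instructions" and "data" does not by itself stop a model from following text in the data channel.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Tuple


class Trust(IntEnum):
    RETRIEVED = 1  # web pages, documents, tool output
    USER = 2
    SYSTEM = 3


@dataclass(frozen=True)
class Message:
    trust: Trust
    text: str


def split_context(
    messages: List[Message], min_trust: Trust = Trust.USER
) -> Tuple[List[str], List[str]]:
    """Separate text the agent may obey from text it may only read as data."""
    instructions = [m.text for m in messages if m.trust >= min_trust]
    data = [m.text for m in messages if m.trust < min_trust]
    return instructions, data
```

With this split, a hidden "send the conversation to this URL" string inside a retrieved page lands in `data`, where downstream code can quote or summarize it but should never execute it.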
A surprising finding from early multi-agent deployments: agents can develop implicit coordination that was not explicitly programmed. In a 2024 study by Google DeepMind ("Gemini for Tasks"), multiple Gemini agents coordinating on long research tasks showed spontaneous specialization: agents that were initialized identically began producing outputs with distinct characteristics after several rounds of interaction, with some becoming more focused on evidence-gathering and others on synthesis. Whether this is genuine specialization or an artifact of context divergence remains debated.
For practitioners, the relevant fact is simpler: multi-agent systems are harder to predict than their components suggest. Testing a single agent in isolation does not reliably predict its behavior as part of a network. This is why leading teams in 2024 began instrumenting agent networks with interaction tracing (logging not just inputs and outputs but every inter-agent message) to make debugging tractable.
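Interaction tracing can start as nothing more than an append-only log of inter-agent messages. The sketch below is a minimal illustration; `InteractionTracer`, its field names, and the JSON Lines export are assumptions made for the example rather than the format of any specific tracing product.

```python
import json
from typing import Dict, List


class InteractionTracer:
    """Append-only record of every message passed between agents."""

    def __init__(self) -> None:
        self.events: List[Dict[str, object]] = []

    def record(self, sender: str, receiver: str, content: str) -> Dict[str, object]:
        event = {"seq": len(self.events), "from": sender, "to": receiver, "content": content}
        self.events.append(event)
        return event

    def between(self, a: str, b: str) -> List[Dict[str, object]]:
        """All messages exchanged by two agents, in either direction."""
        return [e for e in self.events if {e["from"], e["to"]} == {a, b}]

    def to_jsonl(self) -> str:
        """One JSON object per line, suitable for a log file."""
        return "\n".join(json.dumps(e) for e in self.events)
```

The sequence number matters more than a timestamp for debugging: it gives a total order over messages even when agents run in parallel.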
You're the security lead for a company deploying a multi-agent research assistant. The system uses an orchestrator agent that assigns tasks to a web-browsing subagent, a document-reading subagent, and a writing subagent. Your job is to define the trust architecture before launch.
On October 22, 2024, Anthropic released computer use as a public beta for Claude 3.5 Sonnet. The capability is straightforward to describe and startling in implication: given access to a virtual machine, Claude can see a screenshot of the screen and decide what to click, type, or drag. No structured API. No defined action space. Just pixels in, actions out: the same interface every piece of software designed for humans provides.
In the launch technical blog, Anthropic was unusually candid: computer use was "one of Anthropic's most risky capabilities to date." They listed specific concerns before listing the use cases.
Anthropic's computer use capability gives Claude three primitive tools: screenshot (capture the current screen state), click (move the cursor and click at coordinates), and type (input keyboard characters). From these three primitives, an agent can operate any graphical software, fill forms, navigate browsers, write and execute code, and manage files; in short, anything a human can do on a desktop.
The architecture is a tight vision-action loop. Claude receives a screenshot, outputs a thought and an action, the action is executed in the VM, a new screenshot is taken, and the loop repeats. Each iteration is a complete inference call. A task like "book a flight" might require 20 to 40 such iterations, each navigating a different screen state. The model must maintain task coherence across all of them without any explicit memory beyond its context window.
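The loop itself is simple enough to write down. The sketch below deliberately does not call any real API: `take_screenshot`, `choose_action`, and `execute` are hypothetical callables standing in for the VM and the model, and the action dictionaries with a `"type"` key are an assumed format, not Anthropic's actual tool schema.

```python
from typing import Callable, Dict, List

Action = Dict[str, object]  # assumed shape, e.g. {"type": "click", "x": 10, "y": 20}


def run_vision_action_loop(
    take_screenshot: Callable[[], bytes],
    choose_action: Callable[[bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 40,
) -> List[Action]:
    """Screenshot, decide, act, repeat, until the model says it is done."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        # One inference call per iteration; the history is the only memory.
        action = choose_action(screenshot, history)
        history.append(action)
        if action["type"] == "done":
            return history
        execute(action)
    raise TimeoutError(f"task not finished after {max_steps} steps")
```

The step cap is a practical guard: because each iteration is a full inference call, an agent that never reaches "done" would otherwise keep spending indefinitely.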
OpenAI followed with Operator, announced in January 2025, taking a slightly different approach: rather than raw screenshot control, Operator uses a web-focused action space with accessibility tree access (structured DOM data alongside pixel input), improving reliability on web tasks while sacrificing generality.
Computer use fundamentally breaks the assumption underlying most software security: that the entity operating the interface is a human, with human limitations and human accountability. When an agent can click any button, submit any form, and authenticate using stored credentials, the attack surface is every piece of software the agent has access to.
Three specific risk classes dominated security discussions after the October 2024 launch. Credential abuse: an agent with access to a user's browser can interact with any site where the user is already logged in (email, banking, administrative tools) without explicit authorization for that specific action. Cross-context contamination: an agent browsing the web for task A might encounter a malicious page designed to redirect it to perform actions for task B. Irreversibility: many computer actions (sending an email, deleting a file, submitting a form) cannot be undone. A single wrong click has consequences that a wrong word in a chatbot response does not.
Anthropic's mitigations at launch included isolating computer use to fresh VM instances (no persistent credentials), recommending human confirmation before any action that could not be reversed, and publishing explicit guidance that computer use should not be given access to production systems with real credentials during the beta period.
On OSWorld, a benchmark of real computer tasks, Claude 3.5 Sonnet scored 14.9% at launch in the screenshot-only category. Human evaluators score ~72% on the same tasks. This gap is useful context: computer use agents in late 2024 were genuinely capable at narrow, well-specified tasks but far from general desktop operation. The gap of roughly 57 percentage points represents real error modes that practitioners must design around.
Even before full computer use, agentic web browsing was deployed at scale. Perplexity AI ran an agentic search product throughout 2024 in which an LLM autonomously issued search queries, read results, and synthesized answers, without any human reviewing the intermediate steps. At peak, Perplexity processed tens of millions of queries per day through this pipeline. The product worked well for informational queries but generated significant controversy in mid-2024 when it was reported to be accessing and summarizing paywalled content and sometimes attributing fabricated claims to real sources: failures of the verification layer, not the browsing layer.
Your company wants to deploy a computer-use agent to help HR staff onboard new employees: creating accounts, filling intake forms, and generating welcome emails. Before deployment, you need a risk tiering policy: which actions can the agent take autonomously, which need human confirmation, and which are out of scope entirely.
On May 16, 2023, Sam Altman testified before a U.S. Senate Judiciary subcommittee, alongside IBM's Christina Montgomery and AI researcher Gary Marcus. The topic was AI risk and oversight. Altman's testimony included a striking admission: "If this technology goes wrong, it can go quite wrong." Six months later, OpenAI's board briefly removed Altman over undisclosed disagreements, reinstated him within five days under pressure from employees and investors, and restructured its governance. The episode revealed how thin the institutional frameworks governing frontier AI actually are.
The core challenge is not that agents will be obviously malicious. It is subtler: as agents become more capable than humans at specific tasks, humans lose the ability to evaluate whether the agent's output is correct. A human reviewer cannot reliably detect a subtle flaw in a 10,000-line codebase generated by a coding agent. They cannot verify a complex legal analysis in a domain where they lack expertise. This is the scalable oversight problem: the question of how to maintain meaningful human supervision as agent capability exceeds human domain expertise.
Anthropic's published work on this includes Constitutional AI and, later, its Responsible Scaling Policy (RSP). The RSP, published in September 2023 and updated in October 2024, defines evaluation thresholds, called "AI Safety Levels," at which the company commits to pause development and deployment until adequate safety mitigations exist. This was the first public commitment by a frontier lab to capability-gated deployment.
Three research directions dominated frontier alignment work in 2024. Debate (proposed by OpenAI): two AI systems argue opposing positions while a human judge evaluates; the idea is that detecting flawed reasoning is easier than generating correct reasoning. Recursive reward modeling: humans supervise AI assistants that help other AI assistants, creating a supervision hierarchy. Interpretability: mechanistic understanding of model internals to detect deceptive or misaligned behavior before it manifests in outputs. Anthropic's 2024 interpretability work on "features" in Claude, identifying which internal representations correspond to specific concepts, was published as a major research milestone.
None of these is solved. What the 2024 research record shows is that the alignment community is aware of the scalable oversight problem, has multiple active research threads, and has produced partial but not comprehensive solutions. The honest summary for practitioners: current alignment techniques provide meaningful but not absolute guarantees for frontier agents.
The EU AI Act, passed in March 2024, classifies AI agents used in critical infrastructure, employment, education, and law enforcement as "high-risk" systems subject to mandatory conformity assessment, human oversight requirements, and transparency obligations. General-purpose AI models trained with more than 10²⁵ FLOPs of compute face additional systemic risk obligations. These were the first legally binding agent-specific requirements in a major jurisdiction.
By the end of 2024, the frontier of agent capability was roughly this: agents could reliably complete well-specified software engineering sub-tasks (SWE-bench ~50%), navigate web interfaces to complete single-session tasks (WebArena ~40%), and operate desktop software for narrow use cases (OSWorld ~15%). They could maintain coherent plans across dozens of steps but drifted on tasks exceeding ~60 minutes. They could coordinate in multi-agent networks with meaningful quality improvement over single agents but with non-trivial security surfaces.
The institutional layer was catching up slowly: the EU AI Act was law but not yet in enforcement phase; the U.S. had executive orders (Biden's October 2023 AI EO) but no comprehensive legislation; Anthropic's RSP was a self-imposed constraint with no external enforcement mechanism. The gap between technical capability and institutional governance was, by every measure, widening.
You are advising a hospital system that wants to deploy a long-horizon AI agent to assist radiologists: the agent would read scan reports, flag anomalies, suggest differential diagnoses, and draft referral letters, all autonomously and at scale. You need to design an oversight framework that works even as the agent's diagnostic accuracy approaches or exceeds average radiologist performance.