On March 22, 2023, GitHub announced Copilot X, which extended Copilot beyond line completion into a conversational coding assistant capable of explaining entire functions, suggesting test suites, and proposing fixes for flagged security vulnerabilities, all within the editor. The announcement crystallised what researchers had been watching for two years: coding agents were no longer autocomplete tools dressed up with a chat interface. They were autonomous loops that read a codebase, planned a change, wrote it, checked whether it compiled, and iterated if it did not.
That same month, a Stanford Human-Centered AI paper documented how developers using GitHub Copilot accepted roughly 26–35% of all suggestions outright — a rate that implied the agent was shaping production code at scale, not merely prompting human creativity.
A coding agent is an AI system that takes a goal expressed in natural language (for example "add pagination to the user table," "fix the failing CI test," or "refactor the authentication module to use JWT") and autonomously produces, runs, and iterates on code to satisfy that goal. This distinguishes coding agents from code-completion systems, which respond only to the immediate context in the editor, and from chatbots, which describe code but do not execute or test it.
The architecture typically combines a large language model (LLM) as the reasoning core with a set of tool integrations: a code interpreter or sandbox, a file system reader/writer, a terminal executor, a linter or compiler, and often a version-control interface. The agent uses these tools in a loop: plan → write → execute → observe output → revise.
The critical difference between 2020-era code generation and 2023-onwards coding agents is the feedback loop. Earlier systems produced a block of code and stopped. Agents receive the error message when that code fails, incorporate it as new context, and try again — often multiple times before surfacing a result to the user.
Because the agent can observe runtime errors, it can fix classes of bugs that static generation cannot — including import errors, off-by-one logic failures, and API signature mismatches that only become visible at execution time.
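The feedback loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: `run_snippet`, `toy_revise`, and `feedback_loop` are hypothetical names, and `toy_revise` is a hard-coded stand-in for the LLM that only knows how to repair one class of runtime error (a missing `math` import).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_snippet(source: str) -> tuple[bool, str]:
    """Run source in a fresh Python process; return (succeeded, stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(source)
        proc = subprocess.run(
            [sys.executable, str(path)],
            capture_output=True,
            text=True,
            timeout=30,
        )
    return proc.returncode == 0, proc.stderr


def toy_revise(source: str, error: str) -> str:
    """Hypothetical stand-in for the model: patches one known error class."""
    if "NameError" in error and "math" in error:
        return "import math\n" + source
    return source


def feedback_loop(source, revise, max_attempts=3):
    """Execute, observe the error, revise, and retry up to max_attempts."""
    history = []
    for _ in range(max_attempts):
        ok, err = run_snippet(source)
        history.append(err)  # the error text becomes new context
        if ok:
            return source, history
        source = revise(source, err)
    raise RuntimeError("no working version within the attempt limit")


# The first attempt fails at runtime; the second succeeds after revision.
fixed, errors = feedback_loop("print(math.sqrt(16))", toy_revise)
```

The point of the sketch is that the missing import is invisible until the code runs; the traceback is what lets the second attempt succeed.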
The standard coding-agent loop, sometimes called ReAct (Reasoning + Acting), proceeds through four phases repeated until a success criterion is met or a step limit is hit:
Observe: The agent reads relevant files, error logs, test results, or documentation. It converts everything into tokens the LLM can process.
Plan: The LLM generates an internal reasoning trace — often called a "scratchpad" or "chain of thought" — that narrates what needs to change and why. This reasoning is typically hidden from end users but is critical to correctness.
Act: The agent calls a tool — writing a file, running a shell command, executing a Python snippet, calling a web search, or submitting a pull request.
Observe (again): The tool returns an output — success, error, or a side-effect result. This output enters the context window, and the loop repeats.
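The four phases above can be expressed as a small dispatch loop. The following is a sketch under stated assumptions: `Step`, `AgentState`, `run_agent`, and `scripted_policy` are illustrative names, the tools are in-memory fakes, and the scripted policy stands in for the LLM's planning step.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    thought: str   # the reasoning trace ("scratchpad")
    tool: str      # which tool to call, or "finish"
    argument: str


@dataclass
class AgentState:
    goal: str
    observations: list[str] = field(default_factory=list)


def run_agent(goal, policy, tools, max_steps=5):
    """Repeat plan -> act -> observe until the policy finishes or steps run out."""
    state = AgentState(goal)
    for _ in range(max_steps):
        step = policy(state)                                  # Plan
        if step.tool == "finish":
            return state.observations
        result = tools[step.tool](step.argument)              # Act
        state.observations.append(f"{step.tool}: {result}")   # Observe
    raise RuntimeError("step limit hit before the goal was met")


# Fake tools and a scripted policy, both hypothetical.
files = {"app.py": "def add(a, b): return a - b"}
tools = {
    "read_file": lambda name: files[name],
    "run_tests": lambda _: "1 failed",
}


def scripted_policy(state):
    if not state.observations:
        return Step("Need to see the code first", "read_file", "app.py")
    if len(state.observations) == 1:
        return Step("Check current behaviour", "run_tests", "")
    return Step("Enough information gathered", "finish", "")


trace = run_agent("fix the failing CI test", scripted_policy, tools)
```

In a real agent the policy would be a model call that receives the accumulated observations as context; the step limit is what keeps a confused agent from looping indefinitely.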
In March 2024, Cognition Labs announced Devin, described as the first "fully autonomous software engineer." Devin's reported SWE-bench evaluation showed it could resolve approximately 13.86% of real GitHub issues end-to-end without human guidance. The result, while contested in methodology, demonstrated that the observe-plan-act loop had matured enough to handle multi-step, multi-file engineering tasks.
By mid-2024, the best-performing coding agents on SWE-bench full-split were resolving between 18% and 26% of issues depending on the scaffold and model used (Anthropic's Claude 3.5 Sonnet powering SWE-agent at ~26% as of July 2024). This sounds low until you appreciate that the benchmark consists of real, hard, open issues on popular open-source projects — tasks that junior developers often need days to solve.
Common failure modes cluster into three categories. Context starvation: the agent cannot read all relevant files because the codebase exceeds the context window, and it makes assumptions that prove wrong. Silent failures: the code compiles and tests pass, but the logic is subtly incorrect in a way only humans would notice. Scope creep: the agent refactors far beyond the request, introducing changes that break unrelated functionality.
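Of the three failure modes, scope creep is the easiest to check mechanically. One possible guard, sketched here with hypothetical names and paths, compares the files an agent touched against glob patterns the task was allowed to modify.

```python
from fnmatch import fnmatch


def out_of_scope(changed_files, allowed_patterns):
    """Return the changed paths that match none of the allowed patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical task: "refactor the authentication module to use JWT".
changed = ["auth/jwt.py", "tests/auth/test_jwt.py", "billing/invoice.py"]
violations = out_of_scope(changed, ["auth/*", "tests/auth/*"])
```

A non-empty result would be a reason to pause for human review rather than proof of a bug; silent failures and context starvation need different defences.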
OpenAI's Code Interpreter (launched June 2023 inside ChatGPT) illustrated the power and limits simultaneously: it could execute multi-step data analysis, produce publication-quality charts, and iterate on errors — but it ran in an isolated environment with no internet access or persistent state, which capped its utility for real software engineering workflows.
Coding agents are most reliable on well-scoped tasks with clear success criteria (tests pass, linter is clean) and most fragile on tasks requiring deep domain context, cross-file architecture decisions, or understanding of undocumented team conventions.
In this lab you will explore the internal mechanics of coding agents through guided discussion. Ask about the ReAct loop, sandbox design, context window constraints, and tool integrations. Try to map each element of the loop to a concrete real-world coding scenario.
On March 12, 2024, Cognition Labs publicly demonstrated Devin solving a real Upwork freelance task (setting up a web scraper, running it, and delivering results) entirely without human input. The video circulated widely in developer communities. Within days, commentators including Gergely Orosz's newsletter and several open-source contributors examined the demo's claimed benchmark numbers and noted that the SWE-bench evaluation had been run on a random 25% subset of the benchmark rather than the full problem set. The corrected figure for the full benchmark was lower, but the episode illustrated something important about the coding-agent market: benchmarks and demos were driving enormous investment decisions, and the gap between demo performance and production reliability was already a serious concern.
GitHub Copilot, launched in technical preview in June 2021 and made generally available in June 2022, is the most widely deployed coding agent in production. As of early 2024, GitHub reported more than 1.3 million paid subscribers and over 50,000 organizations using the product. It operates primarily as an IDE plugin — initially for VS Code and JetBrains IDEs, later extended to Neovim and Visual Studio — and its core interaction model is inline suggestion: the developer writes, the agent completes.
The architectural choice is deliberate. By sitting in the editor and completing tokens rather than executing code autonomously, Copilot keeps the human firmly in the loop on every action. Copilot X, announced in March 2023, added a chat sidebar, pull-request summarization, and CLI integration, moving closer to agentic behavior while retaining the suggestion-not-execution model as the default.
GitHub's own internal study, published in a 2022 research paper, found developers using Copilot completed a specific HTTP server coding task 55% faster than those without it. The result has been cited extensively — but critics note the task was narrow and lab-controlled, not representative of multi-file production workflows.
Devin (Cognition Labs, 2024) represents the maximally autonomous end of the spectrum: it receives a task, spins up its own development environment, reads documentation, writes and runs code, debugs failures, and submits results — with no mid-task human prompts expected. Its architecture includes a persistent shell, a browser for reading docs and forums, and a code editor, all orchestrated by an agent loop running on top of Claude or GPT-4 class models.
SWE-agent, released by Princeton NLP as open-source in April 2024, takes a leaner approach. It wraps an LLM (defaulting to GPT-4) in a structured interface that provides file-editing commands, a bash shell, and explicit scaffolding to prevent the model from losing track of its location in the codebase. On the SWE-bench benchmark it achieved approximately 12.5% resolution on the full split — competitive with or exceeding Devin's corrected numbers at a fraction of the compute cost.
OpenHands (formerly OpenDevin), an open-source community project, aimed to democratize the Devin-style autonomous agent architecture. By mid-2024 it had accumulated over 30,000 GitHub stars, demonstrating significant developer interest in controllable, self-hostable coding agents. Cursor, a VS Code fork with deep LLM integration, focused on a different trade-off: maximum context awareness of the local codebase rather than full autonomy, letting developers maintain tight control while benefiting from AI suggestions informed by thousands of project files.
High autonomy (Devin-style) maximizes throughput on well-scoped tasks but makes errors harder to catch mid-execution. Low autonomy (Copilot suggestion-style) keeps humans in the loop but requires more developer attention per unit of output.
A McKinsey survey published in December 2023 found that 67% of software developers in large enterprises had experimented with AI coding tools, but only 22% were using them for tasks beyond simple code completion in a production context. The primary barriers cited were security review of AI-generated code (61% of respondents), intellectual property uncertainty (48%), and integration with existing CI/CD pipelines (41%).
Amazon's internal CodeWhisperer deployment — their in-house alternative to Copilot — required the security team to build an additional scanning layer to catch AI-suggested code that inadvertently replicated patterns from known-vulnerable open-source functions. This added pipeline step became a de-facto industry pattern: AI coding output goes through human review and automated security scanning before merge.
The distinction between inline assistant (Copilot, Cursor), agentic patch generator (SWE-agent, Devin), and full autonomous engineer (theoretical) maps roughly to increasing capability and increasing risk of undetected errors entering production. Most enterprise deployments in 2024 remained firmly in the inline-assistant tier for production code, with agentic tools reserved for internal tooling, test generation, and documentation.
The coding agent market in 2024 was not a single category but a spectrum from smart autocomplete to fully autonomous deployment pipelines. Where an organization placed itself on that spectrum depended far more on risk tolerance and security posture than on pure capability preferences.
Use this lab to think critically about how different coding agent architectures fit different organizational needs. Discuss the autonomy spectrum, the Devin benchmark controversy, and what security-conscious enterprise deployment looks like in practice.
In November 2022, a team at Stanford University released a controlled study (Perry et al., later presented at ACM CCS 2023): they asked 47 developers to solve security-sensitive coding tasks, some with access to an AI code assistant built on OpenAI's Codex, the model family behind GitHub Copilot, and some without. The result was striking: developers with AI assistance were significantly more likely to introduce security vulnerabilities than those coding unaided. Crucially, the AI-assisted developers were also more confident that their code was secure. The combination of higher vulnerability rates and lower perceived risk was precisely the pattern security researchers had feared: the tool didn't just introduce bugs, it suppressed the developer's own doubt that drove manual review.
The Stanford finding built on earlier work: a 2021 NYU study by Pearce et al. tested Copilot on 89 diverse security-sensitive coding scenarios drawn from the MITRE CWE (Common Weakness Enumeration) list. Approximately 40% of Copilot's suggestions contained at least one vulnerability. The most common weakness types were buffer overflow risks (CWE-119), SQL injection patterns (CWE-89), and insufficient input validation (CWE-20).
The root mechanism is straightforward: LLMs are trained on public code, and public code contains vulnerabilities at high density. Stack Overflow answers, GitHub repositories, and tutorial sites routinely demonstrate techniques using known-insecure patterns — hard-coded credentials, eval() on user input, raw SQL string concatenation — because the goal of those examples is clarity of concept, not production security. The model learns from all of it.
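Two of the patterns named above, eval() on input and hard-coded credentials, are simple enough to flag with a static pass over the syntax tree. The sketch below uses Python's standard `ast` module; `find_risky_patterns` and `SECRET_HINTS` are illustrative names, and the name-matching heuristic is an assumption that will produce both false positives and false negatives. It is not a substitute for a real security scanner.

```python
import ast

# Substrings that suggest a variable holds a credential (assumed heuristic).
SECRET_HINTS = ("password", "secret", "token", "api_key")


def find_risky_patterns(source: str) -> list[tuple[int, str]]:
    """Return (line number, description) pairs for two simple risk patterns."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Pattern 1: direct calls to eval() or exec().
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in {"eval", "exec"}
        ):
            findings.append((node.lineno, f"call to {node.func.id}()"))
        # Pattern 2: a string literal assigned to a credential-like name.
        elif (
            isinstance(node, ast.Assign)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        ):
            for target in node.targets:
                if isinstance(target, ast.Name) and any(
                    hint in target.id.lower() for hint in SECRET_HINTS
                ):
                    findings.append(
                        (node.lineno, f"hard-coded value for {target.id}")
                    )
    return sorted(findings)


# The sample is only parsed, never executed.
sample = 'DB_PASSWORD = "hunter2"\nresult = eval(user_input)\n'
report = find_risky_patterns(sample)
```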
When a developer asks an AI agent to "add a login endpoint," the agent draws on this training distribution. If most of the training examples for login endpoints used MD5 password hashing (common in tutorials circa 2010–2018), the model's prior skews toward suggesting MD5 — unless explicitly instructed otherwise or unless the system prompt enforces modern security standards.
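For contrast with an unsalted MD5 digest, here is a minimal sketch of salted, iterated password hashing using only the Python standard library. The function names and the iteration count are assumptions for illustration; a production system should follow its own security guidance and may well prefer a dedicated password-hashing library.

```python
import hashlib
import hmac
import os


def hash_password(password: str, *, iterations: int = 600_000):
    """Return (salt, digest, iterations) using PBKDF2-HMAC-SHA256."""
    salt = os.urandom(16)  # fresh random salt for every password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, digest, iterations


def verify_password(password: str, salt: bytes, expected: bytes, iterations: int) -> bool:
    """Recompute the digest and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return hmac.compare_digest(candidate, expected)


salt, digest, rounds = hash_password("correct horse battery staple")
accepted = verify_password("correct horse battery staple", salt, digest, rounds)
rejected = verify_password("wrong guess", salt, digest, rounds)
```

Asking the agent for this shape explicitly, or enforcing it in a system prompt, is one way to counteract a training prior that leans toward older tutorial code.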
In 2023, security researchers at Vulcan Cyber documented a novel attack vector: package hallucination. AI coding agents, when generating code that requires a third-party library, sometimes invent package names that do not exist in npm, PyPI, or other registries. If a developer installs from the generated requirements file without verifying each package, the install simply fails. But an attacker who has noticed the hallucinated package name can register it with malicious code and serve it to every developer who subsequently runs that AI-generated dependency list.
This is a variant of the earlier dependency confusion attack documented by Alex Birsan in 2021 — but AI-assisted coding dramatically expands the attack surface by introducing novel fake package names at scale across millions of developer interactions.
A follow-up study published in early 2024 by Lanyado et al. tested multiple LLMs and found that hallucinated package recommendations occurred in roughly 5.2% of queries to popular models when asked to implement code requiring third-party libraries. At GitHub Copilot's scale of 1.3 million developers, that rate implies millions of potentially exploitable hallucinated dependency recommendations annually.
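Whether the 5.2% rate really implies "millions" per year depends on how many library-requiring queries each developer makes, a number the passage does not give. The back-of-envelope calculation below makes that dependency explicit; the query volumes are assumptions chosen purely for illustration.

```python
HALLUCINATION_RATE = 0.052   # rate quoted in the text
DEVELOPERS = 1_300_000       # subscriber count quoted in the text


def hallucinated_per_year(queries_per_dev_per_year: int) -> float:
    """Expected hallucinated recommendations per year across all developers."""
    return DEVELOPERS * queries_per_dev_per_year * HALLUCINATION_RATE


# Assumed (not sourced) query volumes per developer per year.
for queries in (10, 50, 200):
    total = hallucinated_per_year(queries)
    print(f"{queries:>3} queries/dev/year -> {total:,.0f} hallucinated recommendations")
```

Under these assumptions the total passes one million once each developer makes roughly 15 such queries a year, so the claim is plausible but rests on usage figures that would need to be measured.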
Attacker workflow: (1) monitor AI coding forums and outputs for hallucinated package names, (2) register the name on npm/PyPI with a malicious payload, (3) wait for developers to run AI-generated code that pulls the package. No phishing or social engineering required.
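One defence against step (3) is to refuse any dependency that has not already been vetted. The sketch below checks an AI-generated requirements file against a team-maintained allowlist without touching the network. The function names, the allowlist, and the package name `fastjson-utils-pro` are all hypothetical, and the parser handles only simple requirement lines.

```python
import re


def requirement_names(requirements_text: str) -> list[str]:
    """Extract normalized package names from simple requirements lines."""
    names = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line or line.startswith("-"):   # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            # Normalize the way package indexes compare names:
            # runs of '-', '_' and '.' collapse to '-', then lowercase.
            names.append(re.sub(r"[-_.]+", "-", match.group(0)).lower())
    return names


def unvetted(requirements_text: str, vetted: set[str]) -> list[str]:
    """Return the requested packages that are not on the vetted allowlist."""
    return [name for name in requirement_names(requirements_text) if name not in vetted]


generated = "requests==2.31.0\nFlask_Login>=0.6  # auth\nfastjson-utils-pro\n"
suspicious = unvetted(generated, {"requests", "flask-login"})
```

An allowlist catches a hallucinated name whether or not an attacker has registered it yet, which a plain "does this package exist?" lookup would not.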
Coding agents that read external files, web pages, or API documentation as part of their task are vulnerable to prompt injection — hidden instructions embedded in the content they read that redirect the agent's behavior. In a documented proof-of-concept by researcher Johann Rehberger in 2023, a coding agent asked to summarize a GitHub README file processed hidden instructions in that README and exfiltrated the contents of other local files to an attacker-controlled server.
The risk is particularly acute because coding agents often operate with elevated permissions: they can write files, execute commands, and commit to repositories. A successful prompt injection attack on a coding agent is therefore not just a data leak — it can mean arbitrary code execution in the developer's environment or malicious commits to a production codebase.
Mitigation strategies documented by the OWASP Top-10 for LLMs (2023 edition) include: sandboxing all agent execution, privilege separation (the agent should not have write access unless the task explicitly requires it), human-in-the-loop checkpoints before any irreversible action, and output validation to detect anomalous file access patterns.
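Privilege separation and human-in-the-loop checkpoints can both be enforced at the point where the agent calls a tool. The following is a sketch of that idea with hypothetical tool names and a hypothetical `Policy` type; it does not reproduce any framework's API.

```python
from dataclasses import dataclass
from typing import Callable

# Tools treated as irreversible in this sketch (an assumption).
IRREVERSIBLE = {"git_push", "delete_file"}


@dataclass(frozen=True)
class Policy:
    allowed_tools: frozenset[str]
    approve: Callable[[str, str], bool]  # human checkpoint: (tool, argument) -> ok?


def guarded_call(policy: Policy, tools: dict, name: str, argument: str):
    """Run a tool only if the policy grants it and any checkpoint approves."""
    if name not in policy.allowed_tools:
        raise PermissionError(f"tool {name!r} is not granted for this task")
    if name in IRREVERSIBLE and not policy.approve(name, argument):
        raise PermissionError(f"human reviewer declined {name!r}")
    return tools[name](argument)


tools = {
    "read_file": lambda path: f"<contents of {path}>",
    "write_file": lambda path: f"wrote {path}",
    "git_push": lambda branch: f"pushed {branch}",
}

# A summarization task needs read access only; the reviewer stub declines everything.
read_only = Policy(frozenset({"read_file"}), approve=lambda tool, arg: False)
summary_input = guarded_call(read_only, tools, "read_file", "README.md")
```

With this policy, instructions injected into a README could still mislead the model, but a resulting attempt to call `write_file` or `git_push` would raise `PermissionError` instead of executing.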
The security risk of AI coding agents is not primarily that the AI "makes mistakes" — it is that developers trust AI output more than human output, reducing the review intensity that would normally catch those mistakes. Restoring appropriate skepticism about AI-generated code is the foundational security control, regardless of what other technical mitigations are layered on top.
In this lab you will practice reasoning about the security risks introduced by coding agents. Ask the assistant to review hypothetical AI-generated code snippets for vulnerabilities, discuss package hallucination attack vectors, and explore how to build secure-by-default prompting practices for your team.
In June 2023, researchers at Princeton released "InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback," a framework for studying agents that act on execution feedback. Work in this vein showed that an agent given access to a bash shell could generate test cases, run them against its own code, observe failures, and iterate, functionally automating a portion of the test-driven development cycle that had previously required a human to specify expected behavior. The performance improvement on coding benchmarks when self-testing was enabled was substantial: error correction rates roughly doubled compared to single-pass generation.
The ability of an agent to generate its own tests is a significant capability amplifier. Traditional TDD (test-driven development) requires a developer to specify expected behavior before writing code — a slow, discipline-intensive process. When an agent can generate reasonable test cases from a function's docstring or type annotations, the feedback loop accelerates dramatically.
However, self-generated tests carry an important failure mode: the agent tends to write tests that match its own interpretation of the requirement, not necessarily the correct one. If the agent misunderstands the specification, the tests it writes will pass the code it writes, and the entire suite will be incorrect in a self-consistent way. Researchers at Google DeepMind, in the AlphaCode 2 technical report (December 2023), described this as the "specification alignment problem": the agent needs an external ground truth to compare against, or self-evaluation degenerates into circular validation.
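A small example makes the circularity concrete. Suppose the requirement is "round halves up" and the agent reaches for Python's built-in `round`, which rounds halves to the nearest even integer. The function and variable names below are hypothetical; the behaviour of `round` is standard Python.

```python
def agent_round(x: float) -> int:
    """The agent's implementation of 'round halves up' (a misreading)."""
    return round(x)  # built-in round() sends 0.5 -> 0 and 2.5 -> 2


def passes(func, cases) -> list[bool]:
    """Run func against (input, expected) pairs."""
    return [func(value) == expected for value, expected in cases]


inputs = (0.5, 1.5, 2.5)

# Self-generated tests: expectations derived from the agent's own code.
self_generated = [(x, agent_round(x)) for x in inputs]

# Human-authored tests: expectations derived from the actual requirement.
human_authored = [(0.5, 1), (1.5, 2), (2.5, 3)]

self_results = passes(agent_round, self_generated)
human_results = passes(agent_round, human_authored)
```

The self-generated suite passes on every case because it encodes the same misunderstanding as the code, while the human-authored suite fails on 0.5 and 2.5. Only the second suite carries information about the specification.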
Practical deployments address this by keeping human-authored tests for business logic and allowing agent-generated tests only for utility functions, edge-case coverage expansion, and regression tests that reproduce specific observed bugs. This hybrid approach captures the efficiency gain while maintaining a human anchor on correctness.
The next architectural evolution beyond the single coding agent is multi-agent orchestration: one coordinator agent decomposes a large task into sub-tasks and spawns specialized sub-agents to execute them in parallel. In October 2023, Microsoft Research published AutoGen, a framework specifically designed to orchestrate conversations between multiple LLM-powered agents that collectively write, review, test, and refine code.
In AutoGen's documented examples, a three-agent system — a "developer" agent that writes code, a "critic" agent that reviews it for errors, and a "tester" agent that generates and runs tests — outperformed a single-agent baseline on complex multi-step coding tasks by 15–25% across multiple benchmarks. The gains came primarily from the critic catching errors that the developer agent would not catch in its own output.
This mirrors how human software teams work: the person who wrote the code is least likely to spot its bugs, while a fresh reviewer catches structural problems quickly. Multi-agent systems computationally replicate the cognitive diversity that makes human code review effective.
Amazon Web Services began piloting multi-agent code generation internally in Q1 2024, with one documented use case generating infrastructure-as-code (CloudFormation templates) through an orchestration pipeline where a planner agent, a security-policy-checker agent, and a syntax-validator agent all acted sequentially on the same artifact before human review.
Multi-agent pipelines compound errors: if the coordinator misunderstands the original requirement, every downstream sub-agent operates on a flawed premise. Error propagation is faster and harder to diagnose than in single-agent systems.
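A rough way to see the compounding is to treat each stage as succeeding with some probability and multiply along the chain. This assumes stages fail independently, which is a simplification, and the probabilities used are illustrative rather than measured.

```python
import math


def chain_success(stage_probabilities) -> float:
    """Probability that every stage in a sequential pipeline is correct,
    assuming independent stages (a simplifying assumption)."""
    return math.prod(stage_probabilities)


three_stage = chain_success([0.9, 0.9, 0.9])   # coordinator plus two sub-agents
five_stage = chain_success([0.9] * 5)          # a longer chain at the same quality
```

With every stage at 90%, a three-stage chain is right about 73% of the time and a five-stage chain about 59%. A coordinator error is worse than this model suggests, because it makes every downstream stage wrong at once rather than independently.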
A 2024 GitHub survey of 500 developers who had used Copilot for more than 12 months found that 72% reported spending significantly more time on system design and code review than before adoption, while time spent on boilerplate implementation had fallen sharply. The emerging role is less "code writer" and more code director: specifying intent clearly, reviewing AI output critically, and making architectural decisions the agent cannot make on its own.
This creates a bifurcated labour market risk. Developers who master the skills of directing, reviewing, and testing AI-generated code become significantly more productive. Developers who defer to AI output without developing strong review skills may find their code quality declining invisibly — the vulnerabilities and architectural mistakes accumulate unnoticed until a major incident surfaces them.
Goldman Sachs, in a widely-cited April 2023 research note, estimated that AI coding tools could automate approximately 25–50% of current programming tasks over the next decade. The note emphasized that this would displace some roles while creating demand for higher-abstraction engineering work — requirements modeling, system architecture, AI output auditing — that current educational pipelines do not explicitly train for.
The most concrete near-term shift visible in job postings by late 2024: AI code review as an explicit listed skill, and experience with LLM output evaluation appearing in senior engineering job descriptions at companies ranging from Stripe to Shopify to the UK Government Digital Service.
Coding agents do not replace software engineers — they raise the abstraction level at which engineers must operate. The critical skill shift is from implementation fluency to specification precision: the developer who can clearly articulate what correct behavior looks like will extract dramatically more value from an agent than one who relies on the agent to infer intent.
In this lab you will reason through multi-agent coding architectures, the specification alignment problem in self-generated tests, and the concrete skill shifts coding agents are creating in the engineering workforce. Try designing a multi-agent pipeline for a real task you or your team faces.