Module 4 · Lesson 1

From Prompt to Pull Request

Coding agents bridge the gap between natural-language intent and executable software — but the path from instruction to running code is stranger than most people expect.

What does it actually mean for an AI to "write code," and how do coding agents differ from autocomplete?

On March 22, 2023, GitHub announced Copilot X, an evolution of Copilot beyond line completion into a conversational coding assistant capable of explaining entire functions, suggesting test suites, and proposing fixes for flagged security vulnerabilities, all within the editor. The announcement crystallised what researchers had been watching for two years: coding agents were no longer autocomplete tools dressed up with a chat interface. They were autonomous loops that read a codebase, planned a change, wrote it, checked whether it compiled, and iterated if it did not.

That same month, a Stanford Human-Centered AI paper documented how developers using GitHub Copilot accepted roughly 26–35% of all suggestions outright — a rate that implied the agent was shaping production code at scale, not merely prompting human creativity.

1.1 — Defining Coding Agents

A coding agent is an AI system that takes a goal expressed in natural language, such as "add pagination to the user table," "fix the failing CI test," or "refactor the authentication module to use JWT," and autonomously produces, runs, and iterates on code to satisfy that goal. This distinguishes coding agents from code-completion systems, which respond only to the immediate context in the editor, and from chatbots, which describe code but do not execute or test it.

The architecture typically combines a large language model (LLM) as the reasoning core with a set of tool integrations: a code interpreter or sandbox, a file system reader/writer, a terminal executor, a linter or compiler, and often a version-control interface. The agent uses these tools in a loop: plan → write → execute → observe output → revise.

The critical difference between 2020-era code generation and 2023-onwards coding agents is the feedback loop. Earlier systems produced a block of code and stopped. Agents receive the error message when that code fails, incorporate it as new context, and try again — often multiple times before surfacing a result to the user.

Why This Matters

Because the agent can observe runtime errors, it can fix classes of bugs that static generation cannot — including import errors, off-by-one logic failures, and API signature mismatches that only become visible at execution time.
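To make the point concrete, here is a minimal sketch of how a runtime error can be captured and turned into an observation. It assumes the generated code is Python and runs it in a child interpreter; the helper names `run_snippet` and `as_observation` and the misspelled module name are hypothetical, and a plain subprocess is not a real sandbox.

```python
import subprocess
import sys


def run_snippet(source: str, timeout: float = 10.0) -> dict:
    """Run Python source in a child interpreter and capture the outcome."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}


def as_observation(result: dict) -> str:
    """Format a run result as text that could be appended to the model's context."""
    if result["ok"]:
        return "Execution succeeded. Output:\n" + result["stdout"]
    # The last lines of a traceback name the actual error, so keep only the tail.
    tail = "\n".join(result["stderr"].strip().splitlines()[-5:])
    return "Execution failed. Error:\n" + tail


# Hypothetical agent output with a misspelled import that only fails when run.
broken = "import jsn_helpers_not_real\nprint('ok')"
observation = as_observation(run_snippet(broken))
```

Static generation would have stopped after producing `broken`. Here the import error comes back as text the model can read on its next turn.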

1.2 — The Observe-Plan-Act Loop in Practice

The standard coding-agent loop, sometimes called ReAct (Reasoning + Acting), proceeds through four phases repeated until a success criterion is met or a step limit is hit:

Observe: The agent reads relevant files, error logs, test results, or documentation. It converts everything into tokens the LLM can process.

Plan: The LLM generates an internal reasoning trace — often called a "scratchpad" or "chain of thought" — that narrates what needs to change and why. This reasoning is typically hidden from end users but is critical to correctness.

Act: The agent calls a tool — writing a file, running a shell command, executing a Python snippet, calling a web search, or submitting a pull request.

Observe (again): The tool returns an output — success, error, or a side-effect result. This output enters the context window, and the loop repeats.
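The four phases above can be sketched as a small loop. This is an illustration under stated assumptions, not any particular product's implementation: `scripted_plan` stands in for the LLM call, `exec_tool` stands in for a sandboxed executor (Python's `exec` is not a sandbox), and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class AgentState:
    goal: str
    observations: List[str] = field(default_factory=list)
    done: bool = False


def run_loop(
    state: AgentState,
    plan: Callable[[AgentState], str],
    act: Callable[[str], Tuple[bool, str]],
    max_steps: int = 5,
) -> int:
    """Repeat plan, act, observe until success or the step limit; return steps used."""
    for step in range(1, max_steps + 1):
        action = plan(state)               # Plan: choose the next action from context
        success, output = act(action)      # Act: call a tool
        state.observations.append(output)  # Observe: the tool output joins the context
        if success:
            state.done = True
            return step
    return max_steps


def scripted_plan(state: AgentState) -> str:
    # Stand-in for an LLM call: react to the most recent observation.
    if state.observations and "NameError" in state.observations[-1]:
        return "result = 2 + 2"
    return "result = 2 + too"  # first attempt contains a bug


def exec_tool(code: str) -> Tuple[bool, str]:
    # Stand-in for a sandboxed executor.
    scope: dict = {}
    try:
        exec(code, scope)
        return True, f"result = {scope['result']}"
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"


state = AgentState(goal="compute 2 + 2")
steps_used = run_loop(state, scripted_plan, exec_tool)
```

The first attempt fails with a `NameError`, the error text enters `state.observations`, and the second plan uses it to produce a working action. The `max_steps` argument is the step limit mentioned above.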

In March 2024, Cognition Labs announced Devin, described as the first "fully autonomous software engineer." Devin's reported SWE-bench evaluation showed it resolving 13.86% of real GitHub issues end-to-end without human guidance, a result that, while contested in methodology, demonstrated that the observe-plan-act loop had matured enough to handle multi-step, multi-file engineering tasks.

ReAct Loop: A reasoning-then-action pattern where the model interleaves natural-language reasoning traces with calls to external tools, allowing each tool result to update subsequent reasoning.

SWE-bench: A benchmark released by Princeton NLP in 2023 that tests coding agents against real GitHub issues taken from open-source repositories, each of which was later fixed by a human-written pull request. The agent is graded on whether it can produce a patch that passes all associated test cases.

Sandbox: An isolated execution environment where the agent can run code without risk to the host system; outputs are captured and fed back to the agent as observations.
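The SWE-bench pass criterion can be sketched in a few lines. The sketch assumes each task carries two lists of test identifiers: tests that should go from failing to passing once the issue is fixed, and tests that must keep passing. The parameter names mirror the `FAIL_TO_PASS` and `PASS_TO_PASS` fields of the public dataset as we understand them; the function names and sample test ids are hypothetical.

```python
from typing import Dict, List


def is_resolved(
    fail_to_pass: List[str],
    pass_to_pass: List[str],
    test_results: Dict[str, bool],
) -> bool:
    """test_results maps a test id to True if it passed after the patch was applied."""
    fixed = all(test_results.get(t, False) for t in fail_to_pass)
    unbroken = all(test_results.get(t, False) for t in pass_to_pass)
    return fixed and unbroken


def resolution_rate(outcomes: List[bool]) -> float:
    """Fraction of task instances resolved."""
    return sum(outcomes) / len(outcomes)


results = {"test_login_ok": True, "test_logout": True, "test_signup": False}
patch_a = is_resolved(["test_login_ok"], ["test_logout"], results)
patch_b = is_resolved(["test_login_ok"], ["test_logout", "test_signup"], results)
```

In this example `patch_b` is not counted as resolved: it fixes the target test but breaks a test that previously passed.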

1.3 — Capabilities and Limits as of 2024

By mid-2024, the best-performing coding agents on SWE-bench full-split were resolving between 18% and 26% of issues depending on the scaffold and model used (Anthropic's Claude 3.5 Sonnet powering SWE-agent at ~26% as of July 2024). This sounds low until you appreciate that the benchmark consists of real, hard, open issues on popular open-source projects — tasks that junior developers often need days to solve.

Common failure modes cluster into three categories. Context starvation: the agent cannot read all relevant files because the codebase exceeds the context window, and it makes assumptions that prove wrong. Silent failures: the code compiles and tests pass, but the logic is subtly incorrect in a way only humans would notice. Scope creep: the agent refactors far beyond the request, introducing changes that break unrelated functionality.
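Of the three failure modes, scope creep is the easiest to check mechanically. A minimal sketch, assuming the task comes with a list of path patterns the agent is allowed to touch; the function name, patterns, and file paths are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List


def out_of_scope(changed_files: List[str], allowed_patterns: List[str]) -> List[str]:
    """Return the changed paths that match none of the allowed patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


allowed = ["src/auth/*", "tests/test_auth*.py"]
changed = ["src/auth/jwt.py", "tests/test_auth_jwt.py", "src/billing/invoice.py"]
violations = out_of_scope(changed, allowed)
```

A non-empty `violations` list is a signal to stop and ask for review before the patch is surfaced. It does nothing for silent failures, which still need human-authored tests or human review.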

OpenAI's Code Interpreter (launched June 2023 inside ChatGPT) illustrated the power and limits simultaneously: it could execute multi-step data analysis, produce publication-quality charts, and iterate on errors — but it ran in an isolated environment with no internet access or persistent state, which capped its utility for real software engineering workflows.

Key Insight

Coding agents are most reliable on well-scoped tasks with clear success criteria (tests pass, linter is clean) and most fragile on tasks requiring deep domain context, cross-file architecture decisions, or understanding of undocumented team conventions.

Lesson 1 Quiz

From Prompt to Pull Request — five questions

1. What primarily distinguishes a coding agent from a code-completion system like the original GitHub Copilot?

Correct. The feedback loop — executing code, reading the output, and revising — is the defining characteristic that separates coding agents from earlier completion tools.

Not quite. The core distinction is the observe-plan-act feedback loop, not model size or language constraints.

2. SWE-bench, released by Princeton NLP in 2023, tests coding agents by asking them to do what?

Correct. SWE-bench uses real open GitHub issues and grades agents on whether their patches cause the associated test suite to pass.

SWE-bench uses real GitHub issues, not algorithmic puzzles or documentation tasks.

3. In the ReAct (Reasoning + Acting) pattern used by coding agents, what is the role of the "scratchpad" or chain-of-thought?

Correct. The chain-of-thought or scratchpad is an internal reasoning layer that precedes tool calls; it is usually hidden from end users.

The scratchpad is an internal reasoning trace before tool calls, not persistent memory or a user-facing log.

4. What was the approximate rate at which developers accepted GitHub Copilot suggestions outright, according to the 2023 Stanford HAI research cited in this lesson?

Correct. The Stanford paper documented a 26–35% outright acceptance rate, which was high enough to indicate real influence on production code at scale.

The documented acceptance rate was 26–35%, not the higher or lower figures listed here.

5. Which failure mode describes a coding agent that refactors far beyond its original task scope and breaks unrelated functionality?

Correct. Scope creep occurs when the agent makes changes well beyond what was requested, risking unintended side-effects in the broader codebase.

Scope creep is the specific failure mode for over-broad changes. Context starvation refers to insufficient file-reading; silent failure is when code compiles but logic is wrong.

Lab 1 · Anatomy of a Coding Agent

Conversation-based lab — discuss the observe-plan-act loop with an AI assistant.

Lab Objective

In this lab you will explore the internal mechanics of coding agents through guided discussion. Ask about the ReAct loop, sandbox design, context window constraints, and tool integrations. Try to map each element of the loop to a concrete real-world coding scenario.

Suggested opener: "Walk me through exactly what happens — step by step — when a coding agent receives the instruction 'Fix the failing unit test in auth.py.'"

Module 4 · Lesson 2

The Coding Agent Landscape

GitHub Copilot, Devin, SWE-agent, OpenHands, Cursor — the platforms differ sharply in architecture, integration model, and real-world deployment outcomes.

How do the leading coding agent platforms compare, and what trade-offs do their architectural choices create?

On March 12, 2024, Cognition Labs publicly demonstrated Devin solving a real Upwork freelance task (setting up a web scraper, running it, and delivering results) entirely without human input. The video circulated widely in developer communities. Within days, independent reviewers, including Gergely Orosz's newsletter and several open-source contributors, tried to replicate the demo's claimed benchmark numbers and found that the headline SWE-bench evaluation had used a random 25% subset of the benchmark, not the full test set. The corrected figure for the full benchmark was lower, but the episode illustrated something important about the coding-agent market: benchmarks and demos were driving enormous investment decisions, and the gap between demo performance and production reliability was already a serious concern.

2.1 — GitHub Copilot: IDE-Native, Suggestion-First

GitHub Copilot, launched in technical preview in June 2021 and made generally available in June 2022, is the most widely deployed coding agent in production. As of early 2024, GitHub reported more than 1.3 million paid subscribers and over 50,000 organizations using the product. It operates primarily as an IDE plugin — initially for VS Code and JetBrains IDEs, later extended to Neovim and Visual Studio — and its core interaction model is inline suggestion: the developer writes, the agent completes.

The architectural choice is deliberate. By sitting in the editor and completing tokens rather than executing code autonomously, Copilot keeps the human firmly in the loop on every action. Copilot X, announced in March 2023, added a chat sidebar, pull-request summarization, and CLI integration, moving closer to agentic behavior while retaining the suggestion-not-execution model as the default.

GitHub's own internal study, published in a 2022 research paper, found developers using Copilot completed a specific HTTP server coding task 55% faster than those without it. The result has been cited extensively — but critics note the task was narrow and lab-controlled, not representative of multi-file production workflows.

2.2 — Devin, SWE-agent, and Autonomous Patch Generation

Devin (Cognition Labs, 2024) represents the maximally autonomous end of the spectrum: it receives a task, spins up its own development environment, reads documentation, writes and runs code, debugs failures, and submits results — with no mid-task human prompts expected. Its architecture includes a persistent shell, a browser for reading docs and forums, and a code editor, all orchestrated by an agent loop running on top of Claude or GPT-4 class models.

SWE-agent, released by Princeton NLP as open-source in April 2024, takes a leaner approach. It wraps an LLM (defaulting to GPT-4) in a structured interface that provides file-editing commands, a bash shell, and explicit scaffolding to prevent the model from losing track of its location in the codebase. On the SWE-bench benchmark it achieved approximately 12.5% resolution on the full split — competitive with or exceeding Devin's corrected numbers at a fraction of the compute cost.

OpenHands (formerly OpenDevin), an open-source community project, aimed to democratize the Devin-style autonomous agent architecture. By mid-2024 it had accumulated over 30,000 GitHub stars, demonstrating significant developer interest in controllable, self-hostable coding agents. Cursor, a VS Code fork with deep LLM integration, focused on a different trade-off: maximum context awareness of the local codebase rather than full autonomy, letting developers maintain tight control while benefiting from AI suggestions informed by thousands of project files.

Architecture Trade-off

High autonomy (Devin-style) maximizes throughput on well-scoped tasks but makes errors harder to catch mid-execution. Low autonomy (Copilot suggestion-style) keeps humans in the loop but requires more developer attention per unit of output.

2.3 — Enterprise Adoption Patterns

A McKinsey survey published in December 2023 found that 67% of software developers in large enterprises had experimented with AI coding tools, but only 22% were using them for tasks beyond simple code completion in a production context. The primary barriers cited were security review of AI-generated code (61% of respondents), intellectual property uncertainty (48%), and integration with existing CI/CD pipelines (41%).

Amazon's internal CodeWhisperer deployment — their in-house alternative to Copilot — required the security team to build an additional scanning layer to catch AI-suggested code that inadvertently replicated patterns from known-vulnerable open-source functions. This added pipeline step became a de-facto industry pattern: AI coding output goes through human review and automated security scanning before merge.
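The review-then-scan pattern can be expressed as a simple merge gate. This is a sketch of the general idea, not Amazon's pipeline: the `Change` fields and the `may_merge` rule are hypothetical, and the scanner is represented only by a count of its findings.

```python
from dataclasses import dataclass


@dataclass
class Change:
    ai_generated: bool
    scan_findings: int    # findings reported by an automated security scanner
    human_approved: bool


def may_merge(change: Change) -> bool:
    """Require human approval always, and a clean scan for AI-generated code."""
    if not change.human_approved:
        return False
    if change.ai_generated and change.scan_findings > 0:
        return False
    return True


clean_ai_change = Change(ai_generated=True, scan_findings=0, human_approved=True)
flagged_ai_change = Change(ai_generated=True, scan_findings=2, human_approved=True)
```

In a real pipeline this rule would sit in CI as a required check, so that neither the reviewer nor the agent can bypass it.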

The distinction between inline assistant (Copilot, Cursor), agentic patch generator (SWE-agent, Devin), and full autonomous engineer (theoretical) maps roughly to increasing capability and increasing risk of undetected errors entering production. Most enterprise deployments in 2024 remained firmly in the inline-assistant tier for production code, with agentic tools reserved for internal tooling, test generation, and documentation.

Key Insight

The coding agent market in 2024 was not a single category but a spectrum from smart autocomplete to fully autonomous deployment pipelines. Where an organization placed itself on that spectrum depended far more on risk tolerance and security posture than on pure capability preferences.

Lesson 2 Quiz

The Coding Agent Landscape — five questions

1. What controversy surrounded Devin's initial SWE-bench benchmark results in March 2024?

Correct. Independent researchers found that Devin's headline numbers used a 25% subset of the benchmark, not the standard full split, inflating the apparent performance.

The controversy was about the subset used (a 25% sample rather than the full test set), not hard-coding or private codebases.

2. How many paid subscribers did GitHub report for GitHub Copilot as of early 2024?

Correct. GitHub reported more than 1.3 million paid subscribers and over 50,000 organizations by early 2024.

GitHub's reported figure was more than 1.3 million paid subscribers as of early 2024.

3. What was the primary architectural advantage of SWE-agent's design compared to Devin, according to this lesson?

Correct. SWE-agent's structured interface kept the model from losing track of its codebase position and achieved comparable benchmark numbers to Devin at significantly less compute.

SWE-agent's advantage was its lean structured scaffold producing competitive results economically — not proprietary data or browser access.

4. According to the McKinsey December 2023 survey, what was the most commonly cited barrier to using AI coding agents in production enterprise contexts?

Correct. Security review of AI-generated code was the top concern at 61%, ahead of IP uncertainty (48%) and CI/CD integration (41%).

Security review was the top barrier at 61%. IP and CI/CD concerns followed at 48% and 41% respectively.

5. What additional pipeline step did Amazon build for its internal CodeWhisperer deployment to handle AI-generated code in production?

Correct. Amazon added a security scanning layer that checked AI suggestions against patterns from known-vulnerable open-source code — a practice that became an industry template.

Amazon's specific addition was an automated security scanner for vulnerable patterns, not a time delay or rewrite process.

Lab 2 · Comparing Coding Agent Platforms

Conversation-based lab — compare architectures, trade-offs, and enterprise fit.

Lab Objective

Use this lab to think critically about how different coding agent architectures fit different organizational needs. Discuss the autonomy spectrum, the Devin benchmark controversy, and what security-conscious enterprise deployment looks like in practice.

Suggested opener: "My company is evaluating GitHub Copilot vs. a more autonomous agentic tool for our development team. What questions should we be asking before we decide?"

Module 4 · Lesson 3

When the Agent Writes the Vulnerability

AI-generated code introduces novel supply-chain risks — insecure defaults, hallucinated dependencies, and subtle logic flaws that pass review and reach production.

How do AI coding agents introduce security risks that differ qualitatively from those in human-written code?

In August 2022, a team at Stanford University published a controlled study in IEEE S&P: they asked 47 developers to solve security-sensitive coding tasks, half of them using GitHub Copilot. The result was striking — developers with AI assistance were significantly more likely to introduce security vulnerabilities than those coding unaided. Crucially, the AI-assisted developers were more confident their code was secure. The combination of higher vulnerability rates and lower perceived risk was precisely the pattern security researchers had feared: the tool didn't just introduce bugs, it suppressed the developer's own doubt that drove manual review.

3.1 — Insecure Code at Scale

The Stanford finding built on earlier work: a 2021 NYU study by Pearce et al. tested Copilot on 89 diverse security-sensitive coding scenarios drawn from the MITRE CWE (Common Weakness Enumeration) list. Approximately 40% of Copilot's suggestions contained at least one vulnerability. The most common weakness types were buffer overflow risks (CWE-119), SQL injection patterns (CWE-89), and insufficient input validation (CWE-20).

The root mechanism is straightforward: LLMs are trained on public code, and public code contains vulnerabilities at high density. Stack Overflow answers, GitHub repositories, and tutorial sites routinely demonstrate techniques using known-insecure patterns — hard-coded credentials, eval() on user input, raw SQL string concatenation — because the goal of those examples is clarity of concept, not production security. The model learns from all of it.
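Raw SQL string concatenation is the easiest of these patterns to demonstrate. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table, the function names, and the payload are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "user")],
)


def find_user_unsafe(name: str) -> list:
    # Tutorial-style pattern: user input is spliced directly into the SQL text.
    query = "SELECT name, role FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(name: str) -> list:
    # Parameterized query: the driver binds the value, so it is never parsed as SQL.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()


payload = "nobody' OR '1'='1"
leaked_rows = find_user_unsafe(payload)  # the injected OR clause matches every row
safe_rows = find_user_safe(payload)      # no user has this literal name
```

Both functions look equally plausible in a suggestion list, which is why a model trained on tutorial code can surface the first one without any signal that it is dangerous.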

When a developer asks an AI agent to "add a login endpoint," the agent draws on this training distribution. If most of the training examples for login endpoints used MD5 password hashing (common in tutorials circa 2010–2018), the model's prior skews toward suggesting MD5 — unless explicitly instructed otherwise or unless the system prompt enforces modern security standards.
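For contrast, here is a sketch of what a system prompt enforcing modern standards might steer the agent toward: a salted, deliberately slow key-derivation function instead of a single fast hash. It uses PBKDF2 from Python's standard library as one example; the iteration count is an illustrative assumption to be checked against current guidance, and a team may well prefer a dedicated library for bcrypt, scrypt, or Argon2.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 600_000  # assumption: illustrative work factor, tune for your hardware


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) using salted PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


# What the tutorial-era prior looks like: fast, unsalted, unsuitable for passwords.
weak = hashlib.md5(b"correct horse").hexdigest()

salt, stored = hash_password("correct horse")
```

The unsalted MD5 digest is identical for every user with the same password and can be computed extremely quickly by an attacker. The salted, iterated version removes both properties.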

Package Hallucination: When an AI coding agent invents a plausible-sounding but non-existent package name in an import statement — a risk attackers can exploit by publishing a malicious package under that name before developers notice the error.

3.2 — Package Hallucination and Dependency Confusion

In 2023, security researchers at Vulcan Cyber documented a novel attack vector: package hallucination. AI coding agents, when generating code that requires a third-party library, sometimes invent package names that do not exist in npm, PyPI, or other registries. If a developer installs from the generated requirements file without verifying each package, the usual outcome is a harmless installation error. But an attacker who has noticed the hallucinated name can register it with malicious code, and from then on every developer who installs that AI-generated dependency list receives the attacker's package.

This is a variant of the earlier dependency confusion attack documented by Alex Birsan in 2021 — but AI-assisted coding dramatically expands the attack surface by introducing novel fake package names at scale across millions of developer interactions.

A follow-up study published in early 2024 by Lanyado et al. tested multiple LLMs and found that hallucinated package recommendations occurred in roughly 5.2% of queries to popular models when asked to implement code requiring third-party libraries. At GitHub Copilot's scale of 1.3 million developers, that rate implies millions of potentially exploitable hallucinated dependency recommendations annually.
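The "millions" figure depends on how many library-requiring queries each developer makes, which the text does not state. The arithmetic below makes that dependence explicit; the two constants come from the paragraph above, while the queries-per-developer values are purely hypothetical inputs.

```python
DEVELOPERS = 1_300_000       # from the text
HALLUCINATION_RATE = 0.052   # from the text


def hallucinated_per_year(queries_per_dev: int) -> int:
    """Expected hallucinated recommendations per year for a given query volume."""
    return round(DEVELOPERS * queries_per_dev * HALLUCINATION_RATE)


# Hypothetical volumes: 15 and 50 library-requiring queries per developer per year.
low_estimate = hallucinated_per_year(15)   # 1,014,000
high_estimate = hallucinated_per_year(50)  # 3,380,000
```

Under these assumptions, roughly 15 such queries per developer per year is already enough to cross one million hallucinated recommendations.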

Attack Vector

Attacker workflow: (1) monitor AI coding forums and outputs for hallucinated package names, (2) register the name on npm/PyPI with a malicious payload, (3) wait for developers to run AI-generated code that pulls the package. No phishing or social engineering required.
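One simple defence against this workflow is to refuse to install anything that is not on a vetted list. The sketch below parses a requirements file and reports unvetted names. It handles only simple `name==version` style lines; the vetted set and the third package name are hypothetical, and the name normalization follows the usual PyPI convention of lowercasing and collapsing runs of `-`, `_`, and `.`.

```python
import re
from typing import List, Optional, Set


def normalize(name: str) -> str:
    """Lowercase and collapse runs of '-', '_' and '.' into a single '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(line: str) -> Optional[str]:
    """Extract the package name from one requirements line, or None if there is none."""
    line = line.split("#", 1)[0].strip()     # drop comments
    if not line or line.startswith("-"):     # skip blanks and pip options
        return None
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
    return normalize(match.group(0)) if match else None


def unvetted(requirements_text: str, vetted: Set[str]) -> List[str]:
    """Return normalized package names that are not on the vetted list."""
    names = (requirement_name(line) for line in requirements_text.splitlines())
    return [name for name in names if name is not None and name not in vetted]


generated = """
requests==2.31.0
Flask_Login>=0.6
fastjson-utils-pro  # hypothetical name suggested by the agent
"""
suspicious = unvetted(generated, vetted={"requests", "flask-login"})
```

Anything in `suspicious` should be looked up by a human before installation. Existence on the registry is not enough, since the attack works precisely by making the hallucinated name exist.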

3.3 — Prompt Injection in Coding Agents

Coding agents that read external files, web pages, or API documentation as part of their task are vulnerable to prompt injection — hidden instructions embedded in the content they read that redirect the agent's behavior. In a documented proof-of-concept by researcher Johann Rehberger in 2023, a coding agent asked to summarize a GitHub README file processed hidden instructions in that README and exfiltrated the contents of other local files to an attacker-controlled server.

The risk is particularly acute because coding agents often operate with elevated permissions: they can write files, execute commands, and commit to repositories. A successful prompt injection attack on a coding agent is therefore not just a data leak — it can mean arbitrary code execution in the developer's environment or malicious commits to a production codebase.

Mitigation strategies documented by the OWASP Top-10 for LLMs (2023 edition) include: sandboxing all agent execution, privilege separation (the agent should not have write access unless the task explicitly requires it), human-in-the-loop checkpoints before any irreversible action, and output validation to detect anomalous file access patterns.
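Two of these mitigations, privilege separation and human-in-the-loop checkpoints, can be sketched as a gate that every tool call must pass. The tool names, the `ToolCall` type, and the `authorize` function are hypothetical; a real agent framework would supply its own equivalents.

```python
from dataclasses import dataclass
from typing import Callable, Set

READ_ONLY = {"read_file", "list_dir", "run_tests"}
IRREVERSIBLE = {"git_push", "delete_file", "deploy"}


@dataclass(frozen=True)
class ToolCall:
    name: str
    argument: str


def authorize(
    call: ToolCall,
    granted_write_tools: Set[str],
    approve: Callable[[ToolCall], bool],
) -> bool:
    """Decide whether the agent may execute this tool call."""
    if call.name in READ_ONLY:
        return True
    if call.name not in granted_write_tools:
        return False              # privilege separation: not granted for this task
    if call.name in IRREVERSIBLE:
        return approve(call)      # human checkpoint before irreversible actions
    return True


def deny(call: ToolCall) -> bool:
    # Stand-in for a human reviewer who declines.
    return False


push_allowed = authorize(ToolCall("git_push", "main"), {"git_push"}, deny)
```

Even if injected instructions persuade the model to request `git_push`, the call is refused unless the task was granted that tool and a human approves it.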

Key Insight

The security risk of AI coding agents is not primarily that the AI "makes mistakes" — it is that developers trust AI output more than human output, reducing the review intensity that would normally catch those mistakes. Restoring appropriate skepticism about AI-generated code is the foundational security control, regardless of what other technical mitigations are layered on top.

Lesson 3 Quiz

When the Agent Writes the Vulnerability — five questions

1. The 2022 Stanford study on AI-assisted coding found which alarming combination in developers who used GitHub Copilot?

Correct. The Stanford study found AI-assisted developers had more vulnerabilities and were more confident their code was safe — the worst possible combination for security review culture.

The finding was specifically that AI-assisted developers had both more vulnerabilities and more confidence in their code's security — not the same rate, and not self-correcting.

2. In the 2021 NYU study (Pearce et al.) testing Copilot on 89 security-sensitive scenarios, approximately what percentage of Copilot's suggestions contained at least one vulnerability?

Correct. Roughly 40% of Copilot's suggestions in the study contained at least one vulnerability from the MITRE CWE list.

The NYU study found approximately 40% of Copilot suggestions contained vulnerabilities — higher than most developers would expect or assume.

3. What is "package hallucination" in the context of AI coding agents, and why is it a security risk?

Correct. Package hallucination creates names that don't exist yet — attackers can claim them and harm developers who run the AI-generated code without verification.

Package hallucination is the invention of non-existent package names, which attackers can then register with malicious payloads.

4. In the 2024 Lanyado et al. study on LLM package recommendations, approximately what percentage of queries produced hallucinated package names?

Correct. At roughly 5.2% across tested models, and at Copilot's scale of 1.3 million developers, this represents millions of potentially exploitable hallucinated dependencies per year.

The study documented roughly 5.2% of queries producing hallucinated package recommendations — low per query but significant at scale.

5. According to OWASP's Top-10 for LLMs (2023), which of the following is NOT listed as a mitigation strategy for prompt injection in coding agents?

Correct. OWASP recommends sandboxing, privilege separation, human checkpoints, and output validation — not replacing LLMs with rule-based systems, which is not a practical mitigation.

OWASP's listed mitigations include sandboxing, privilege separation, human checkpoints, and output validation. Replacing the LLM with a rule-based generator is not among them.

Lab 3 · Security Review of AI-Generated Code

Conversation-based lab — practice identifying and mitigating security risks in AI coding agent output.

Lab Objective

In this lab you will practice reasoning about the security risks introduced by coding agents. Ask the assistant to review hypothetical AI-generated code snippets for vulnerabilities, discuss package hallucination attack vectors, and explore how to build secure-by-default prompting practices for your team.

Suggested opener: "Here's a code snippet an AI agent generated for a login endpoint — it uses MD5 for password hashing. Walk me through everything wrong with it from a security standpoint and how I'd prompt the agent to do better next time."

Module 4 · Lesson 4

The Self-Improving Codebase

Agents that write tests for their own code, generate documentation, and spawn sub-agents to parallelize work are changing what software engineering looks like at every level of the stack.

What happens when coding agents can not only write code but also evaluate, test, and improve it — and what does that mean for the developers who use them?

In mid-2023, researchers at Princeton NLP published results on agent scaffolding that lets LLMs generate their own test suites. Their approach, documented in the paper "InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback," showed that an agent given access to a bash shell could generate test cases, run them against its own code, observe failures, and iterate, functionally automating a portion of the test-driven development cycle that had previously required a human to specify expected behavior. The performance improvement on coding benchmarks when self-testing was enabled was substantial: error correction rates roughly doubled compared to single-pass generation.

4.1 — Test Generation and Self-Evaluation

The ability of an agent to generate its own tests is a significant capability amplifier. Traditional TDD (test-driven development) requires a developer to specify expected behavior before writing code — a slow, discipline-intensive process. When an agent can generate reasonable test cases from a function's docstring or type annotations, the feedback loop accelerates dramatically.

However, self-generated tests carry an important failure mode: the agent tends to write tests that match its own interpretation of the requirement, not necessarily the correct one. If the agent misunderstands the specification, the tests it writes will pass the code it writes — and the entire suite will be incorrect in a self-consistent way. Researchers at Google DeepMind, in their 2024 AlphaCode 2 paper, described this as the "specification alignment problem": the agent needs an external ground truth to compare against, or self-evaluation degenerates into circular validation.
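A toy example shows how circular validation arises. Suppose the requirement reads "orders of 100 or more qualify for a discount" and the agent misreads it as "more than 100". The requirement and all names here are hypothetical.

```python
def qualifies_for_discount(total: float) -> bool:
    # Agent's implementation: misreads "100 or more" as "more than 100".
    return total > 100


def agent_written_tests() -> bool:
    # Derived from the same misreading, so the tests agree with the code.
    return qualifies_for_discount(150) and not qualifies_for_discount(100)


def human_written_tests() -> bool:
    # Derived from the actual requirement: exactly 100 must qualify.
    return qualifies_for_discount(150) and qualifies_for_discount(100)


agent_suite_passes = agent_written_tests()
human_suite_passes = human_written_tests()
```

The agent's suite passes and the human-authored suite fails on the same code. Only the second result carries information about the requirement, which is the external ground truth the paragraph above refers to.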

Practical deployments address this by keeping human-authored tests for business logic and allowing agent-generated tests only for utility functions, edge-case coverage expansion, and regression tests that reproduce specific observed bugs. This hybrid approach captures the efficiency gain while maintaining a human anchor on correctness.

Specification Alignment: The requirement that an agent's tests reflect the intended behavior of the system rather than its own (potentially misaligned) implementation assumptions.

4.2 — Multi-Agent Coding Architectures

The next architectural evolution beyond the single coding agent is multi-agent orchestration: one coordinator agent decomposes a large task into sub-tasks and spawns specialized sub-agents to execute them in parallel. In October 2023, Microsoft Research published AutoGen, a framework specifically designed to orchestrate conversations between multiple LLM-powered agents that collectively write, review, test, and refine code.

In AutoGen's documented examples, a three-agent system — a "developer" agent that writes code, a "critic" agent that reviews it for errors, and a "tester" agent that generates and runs tests — outperformed a single-agent baseline on complex multi-step coding tasks by 15–25% across multiple benchmarks. The gains came primarily from the critic catching errors that the developer agent would not catch in its own output.

This mirrors how human software teams work: the person who wrote the code is least likely to spot its bugs, while a fresh reviewer catches structural problems quickly. Multi-agent systems computationally replicate the cognitive diversity that makes human code review effective.

Amazon Web Services began piloting multi-agent code generation internally in Q1 2024, with one documented use case generating infrastructure-as-code (CloudFormation templates) through an orchestration pipeline where a planner agent, a security-policy-checker agent, and a syntax-validator agent all acted sequentially on the same artifact before human review.
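A sequential pipeline of this shape can be sketched as a list of stages that each inspect the same artifact and may reject it. This is not AWS's implementation: the planner agent that produces the template is left out, the policy rule is a hypothetical string check, and the stage names simply echo the roles named above.

```python
import json
from typing import Callable, List, Optional, Tuple

# A stage returns an error message, or None if the artifact passes.
Stage = Callable[[str], Optional[str]]


def policy_checker(artifact: str) -> Optional[str]:
    # Hypothetical rule: no bucket may be publicly readable.
    if '"PublicRead"' in artifact:
        return "policy violation: public-read access"
    return None


def syntax_validator(artifact: str) -> Optional[str]:
    try:
        json.loads(artifact)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    return None


def run_pipeline(artifact: str, stages: List[Tuple[str, Stage]]) -> str:
    """Run stages in order; stop at the first rejection."""
    for name, stage in stages:
        error = stage(artifact)
        if error is not None:
            return f"{name} rejected: {error}"
    return "ready for human review"


STAGES = [
    ("security-policy-checker", policy_checker),
    ("syntax-validator", syntax_validator),
]

template = '{"Resources": {"Logs": {"Type": "AWS::S3::Bucket"}}}'
outcome = run_pipeline(template, STAGES)
```

Because every stage acts on the same artifact, a rejection message identifies which role caught the problem, and a passing artifact still goes to a human before deployment.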

Multi-Agent Risk

Multi-agent pipelines compound errors: if the coordinator misunderstands the original requirement, every downstream sub-agent operates on a flawed premise. Error propagation is faster and harder to diagnose than in single-agent systems.

4.3 — Workforce and Skill-Set Implications

A 2024 GitHub survey of 500 developers who had used Copilot for more than 12 months found that 72% reported spending significantly more time on system design and code review than before adoption, while time spent on boilerplate implementation had fallen sharply. The emerging role is less "code writer" and more code director: specifying intent clearly, reviewing AI output critically, and making architectural decisions the agent cannot make on its own.

This creates a bifurcated labour market risk. Developers who master the skills of directing, reviewing, and testing AI-generated code become significantly more productive. Developers who defer to AI output without developing strong review skills may find their code quality declining invisibly — the vulnerabilities and architectural mistakes accumulate unnoticed until a major incident surfaces them.

Goldman Sachs, in a widely cited April 2023 research note, estimated that AI coding tools could automate approximately 25–50% of current programming tasks over the next decade. The note emphasized that this would displace some roles while creating demand for higher-abstraction engineering work (requirements modeling, system architecture, AI output auditing) that current educational pipelines do not explicitly train for.

By late 2024, the most concrete near-term shift was visible in job postings: AI code review appeared as an explicitly listed skill, and experience with LLM output evaluation appeared in senior engineering job descriptions at employers ranging from Stripe to Shopify to the UK Government Digital Service.

Key Insight

Coding agents do not replace software engineers — they raise the abstraction level at which engineers must operate. The critical skill shift is from implementation fluency to specification precision: the developer who can clearly articulate what correct behavior looks like will extract dramatically more value from an agent than one who relies on the agent to infer intent.

Lesson 4 Quiz

The Self-Improving Codebase — five questions

1. What is the "specification alignment problem" as described in Google DeepMind's AlphaCode 2 paper?

Correct. Specification alignment fails when self-generated tests validate the agent's own misunderstanding of the requirement rather than the correct intended behavior.

Specification alignment specifically refers to the circular validation problem when an agent tests its own misinterpretation and passes it — not formal-language issues or API drift.

2. In Microsoft Research's AutoGen multi-agent system, what role did the "critic" agent play, and what performance improvement was observed?

Correct. The critic reviewed output from the developer agent, and the three-agent system (developer, critic, tester) outperformed single-agent approaches by 15–25% on complex coding benchmarks.

The critic reviewed the developer's code for errors; the three-agent system showed 15–25% improvement, not 5% or a speed decrease.

3. According to the 2024 GitHub survey of long-term Copilot users, how did developers report their time allocation had changed?

Correct. 72% of surveyed developers reported spending more time on system design and code review, with boilerplate implementation time falling sharply.

The survey found the shift was toward system design and code review, away from boilerplate implementation — not documentation or overall hour reduction.

4. What did the Goldman Sachs April 2023 research note estimate regarding AI coding tools and the programming workforce over the next decade?

Correct. Goldman's estimate was 25–50% task automation, with the note emphasizing role displacement paired with new demand for higher-abstraction engineering skills.

Goldman's note projected 25–50% task automation with a mix of displacement and new role creation, not full elimination, salary predictions, or a simple increase in demand.

5. What is the recommended hybrid approach for deploying agent-generated tests in production codebases?

Correct. This hybrid captures efficiency gains on lower-risk test categories while preserving human specification authority on the business-logic tests that define correct behavior.

The recommended hybrid keeps human tests for business logic and limits agent tests to utility functions, edge cases, and regressions — not all-or-nothing approaches.

Lab 4 · Multi-Agent Coding & the Future of Engineering

Conversation-based lab — design multi-agent pipelines and explore the evolving developer role.

Lab Objective

In this lab you will reason through multi-agent coding architectures, the specification alignment problem in self-generated tests, and the concrete skill shifts coding agents are creating in the engineering workforce. Try designing a multi-agent pipeline for a real task you or your team faces.

Suggested opener: "I want to design a multi-agent system to handle our team's GitHub issue backlog — writing code, generating tests, and doing a security review before a human signs off. What agents do I need, and what are the biggest failure risks in that pipeline?"

Module 4 Test

Coding Agents — 15 questions · 80% to pass

1. What is the defining architectural feature that distinguishes a coding agent from a code-completion tool?

Correct. The feedback loop — executing code, observing output, and revising — is the core distinction.

Model size and language support are not the defining distinction. The feedback loop is.

2. Which benchmark tests coding agents against real GitHub issues and grades them on patch success?

Correct. SWE-bench (Princeton NLP, 2023) uses real GitHub issues and tests whether submitted patches make associated test suites pass.

SWE-bench is the GitHub-issue benchmark. HumanEval and MBPP test function-level code generation on synthetic problems.

3. When Cognition Labs first announced Devin's SWE-bench results in March 2024, what did subsequent independent analysis find?

Correct. Independent reviewers found the headline numbers used 25 problems rather than the full split, overstating performance.

The issue was use of a non-standard 25-problem subset, not fabrication or language restrictions.

4. GitHub Copilot's core interaction model deliberately keeps humans in the loop by defaulting to which mode of operation?

Correct. Inline suggestion with human accept/reject keeps the developer firmly in the control loop.

Copilot's default is inline suggestion, not autonomous PRs or batch generation.

5. The 2022 Stanford study found that AI-assisted developers had what combination of outcomes compared to unaided developers?

Correct. Higher vulnerabilities plus higher confidence is the dangerous combination the Stanford study documented.

The study specifically found more vulnerabilities combined with more confidence — not fewer, same rate, or better self-review.

6. What percentage of Copilot suggestions in the 2021 NYU Pearce et al. study contained at least one security vulnerability?

Correct. Roughly 40% of suggestions in security-sensitive scenarios contained at least one CWE-listed vulnerability.

The NYU study found approximately 40% — higher than developers typically assume or expect.

7. Why does an AI coding agent tend to suggest insecure patterns like MD5 password hashing when asked to build authentication?

Correct. The model's prior is shaped by the distribution of training examples, which includes many insecure legacy patterns from tutorials and old repos.

The mechanism is training distribution bias toward older tutorial patterns, not simplicity preference or recency.

8. Package hallucination attacks exploit what specific behavior of AI coding agents?

Correct. Hallucinated names create an exploitable window for attackers to register the non-existent package before developers notice the error.

Package hallucination is specifically about invented package names that attackers can then register — not version modification or unofficial mirrors.

9. What does the OWASP Top 10 for LLM Applications (2023) recommend as a key mitigation against prompt injection attacks on coding agents?

Correct. Sandboxing and privilege separation are the primary OWASP mitigations for prompt injection in agentic coding contexts.

OWASP recommends sandboxing and privilege separation — not disabling internet access or running as root.

10. Microsoft Research released AutoGen in October 2023 to solve which specific challenge in coding-agent deployment?

Correct. AutoGen is a multi-agent orchestration framework enabling specialized agents to collaborate on complex coding tasks.

AutoGen addresses multi-agent orchestration — not hardware optimization, API unification, or fine-tuned IaC models.

11. According to the McKinsey December 2023 survey, what percentage of large-enterprise software developers were using AI coding tools beyond simple code completion in production?

Correct. 67% had experimented, but only 22% were using agents beyond simple completion in production — a large gap driven by security and IP concerns.

67% had experimented with AI tools, but only 22% used them beyond completion in production. The question asks about production use beyond completion.

12. What is the "specification alignment problem" in the context of AI-generated test suites?

Correct. Specification alignment fails when the agent's tests confirm its own misunderstanding — without an external ground truth, all tests pass despite incorrect behavior.

Specification alignment is about circular validation of misunderstandings — not compilation issues, doc sync, or external API limitations.

13. Johann Rehberger's 2023 proof-of-concept demonstrated that coding agents reading external files are vulnerable to what attack?

Correct. Rehberger demonstrated that instructions hidden in a README could redirect a coding agent to exfiltrate local files — prompt injection in the file-reading context.

The Rehberger PoC was prompt injection via hidden instructions in a README — not XSS, buffer overflow, or SQL injection.

14. According to the 2024 GitHub survey of long-term Copilot users, what skill shift did 72% of developers report?

Correct. System design and code review increased; boilerplate implementation time fell — reflecting the developer-as-director role shift.

The shift was toward system design and code review, away from boilerplate — not toward documentation, reduced hours, or pure pipeline management.

15. OpenHands (formerly OpenDevin) is best described as which type of coding agent project?

Correct. OpenHands is an open-source, self-hostable alternative to proprietary autonomous agents like Devin; it had accumulated over 30,000 GitHub stars by mid-2024.

OpenHands is open-source and community-built — not a Microsoft product, a benchmark, or an Anthropic model.