In 1882, Thomas Edison's Pearl Street Station began supplying electric light to a small grid of customers in lower Manhattan. Within five years, insurance underwriters realized their standard fire-inspection checklists, written for gas lamps and candles, were simply wrong for the new medium. Wiring that passed every existing criterion could still arc, overheat, and burn. The National Board of Fire Underwriters published its first electrical code in 1897, not because electricity was uniquely dangerous, but because it was differently dangerous: its failure modes were invisible, fast, and systemic in ways that flame never was.
The parallel to AI-generated code in the 2020s is striking in its specificity. Between 2022 and 2024, GitHub Copilot, Amazon CodeWhisperer, and their successors moved from curiosity to daily tool for millions of professional developers. Stanford researchers published findings in 2022 showing that developers using AI assistants produced insecure code at a measurably higher rate when they trusted the output without scrutiny — not because the tools were malicious, but because they were fluent. The code looked right. It compiled. Its failure modes were invisible until runtime, or until an attacker found them first.
This course will not tell you AI-generated code is bad or that you should fear it. It will show you, concretely and with documented evidence, why the review process that works for human-written code is insufficient for AI-generated code — and what a better process looks like. You will finish with a mental model, a set of heuristics, and practiced judgment. You will not finish with certainty; no one has that yet. But you will have considerably more than you started with.
If you finish every module, here's who you become:
In August 2023, a solo developer at a fintech startup in San Francisco accepted a GitHub Copilot suggestion for a JWT validation function. The function compiled cleanly, passed the existing unit tests, and looked, to a reviewer who skimmed it in a pull request, like a textbook implementation. Six weeks later, a penetration tester discovered that the function accepted tokens whose algorithm field was set to "none" without verifying any signature, an attack vector that has been publicly documented since 2015. The AI had reproduced a pattern from training data that predated the security advisory. No human on the team had written that bug; no human had caught it either.
This is not an isolated case. It is the canonical shape of AI code review failure: confident surface, fragile interior. Understanding why that shape occurs is where this module begins.
Large language models generate code by predicting the next token given the tokens that came before, conditioned on a massive corpus of text that includes GitHub repositories, Stack Overflow answers, documentation, and tutorials. The key word is predicting. The model is not executing the code, not running tests against it, not reasoning from first principles about whether the logic is correct. It is producing a statistically likely continuation of the pattern it sees in the prompt.
This produces a distinctive kind of artifact. AI-generated code tends to be syntactically fluent — it uses the right variable naming conventions, follows the project's style, handles the happy path elegantly. It is often semantically plausible — the algorithm it implements is recognizably related to the task described. But it can be logically incorrect in ways that are invisible to a quick read, because the failure modes are in edge cases, in security boundaries, in the gap between what the function appears to do and what it actually does under adversarial conditions.
Human developers make different kinds of mistakes. A human writing a JWT validator from scratch in 2024 is unlikely to reintroduce the "alg:none" vulnerability because they are likely to have encountered the advisory, or to have looked up a reference implementation that already patches it. An AI trained on a corpus with a long tail of pre-2015 code has no such temporal awareness. It reproduces the distribution of its training data, not the current state of knowledge.
A 2022 study by Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh at Stanford ("Do Users Write More Insecure Code with AI Assistants?") found that participants with access to an AI coding assistant wrote significantly less secure code in security-sensitive tasks, and were more likely to believe their insecure code was correct. The combination of fluency and misplaced confidence is the core risk.
Traditional code review practices — whether informal pair review or structured checklists like those from Google's engineering handbook or the OWASP Code Review Guide — were designed around human authorship. They look for: logic errors the author was too close to see; style and consistency violations; missing test coverage; known anti-patterns the reviewer recognizes from experience; and security issues that arise from misunderstanding a library's API.
These practices work because human developers make predictable mistakes. Experienced reviewers develop pattern recognition for the errors that junior developers commonly make with specific languages, frameworks, or problem domains. The reviewer is essentially a more experienced version of the author, correcting for the author's blind spots.
AI-generated code breaks this model in two ways. First, the error distribution is different: AI mistakes are not the mistakes of a junior developer with incomplete knowledge of the domain, but of a statistical process with no temporal awareness, no understanding of deployment context, and no ability to distinguish between a pattern that is common because it is correct and a pattern that is common because it appears frequently in tutorials (including tutorials that document what not to do).
Second, the reviewer's heuristic — "does this look like something a developer would write?" — is actively misleading when applied to AI output, because AI output is optimized to look exactly like something a developer would write.
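Research and post-mortems from 2022–2024 suggest that AI-generated code fails in three ways that are qualitatively different from typical human errors. Stale Pattern Replay: the model reproduces a pattern that was common in its training data but has since been superseded, such as a security practice that predates an advisory. Context Blindness: the code is reasonable in isolation but wrong for the environment it will run in, because the model knows nothing about the deployment topology or trust boundaries. Confident Hallucination: the code calls methods, or uses constants, that look plausible but do not exist in the installed library.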
The implication is not that AI-generated code is worse than human-written code in aggregate — the evidence on that question is genuinely mixed, and productivity gains are real. The implication is that the failure modes are different, and a review process calibrated for human failure modes will systematically miss AI failure modes.
A reviewer who approaches AI-generated code the way they approach a pull request from a senior colleague — reading for logic flow, checking style, confirming the happy path — is performing a review that is insufficient for the task. The review needs to add: explicit verification that any security-sensitive patterns reflect current best practice (not just common practice); confirmation that every external API call references the current documentation; and a more deliberate check of edge cases and adversarial inputs, because the author had no understanding of those contexts when generating the code.
The lessons that follow in this module will build each of those additions into a usable practice. Lesson 2 addresses how to identify AI-generated code in the review queue. Lesson 3 covers a structured checklist for the three failure classes above. Lesson 4 addresses the organizational and workflow questions: when to flag, how to escalate, and how to calibrate trust over time.
Reviewing AI-generated code is not harder than reviewing human-written code — it is differently hard. The skills that make a great human-code reviewer are necessary but not sufficient. The skill this course adds is knowing what specifically to distrust, and why.
You will be presented with short code snippets that contain errors characteristic of AI generation. For each snippet, identify which of the three failure classes applies — Stale Pattern Replay, Context Blindness, or Confident Hallucination — and explain your reasoning. The lab assistant will respond with analysis and may challenge your classification or offer follow-up cases.
In October 2023, a security team at a mid-sized European bank conducting a post-incident review discovered that the 400-line authentication module at the center of a data exposure had been almost entirely AI-generated — a fact not disclosed in the pull request, not visible in any commit message, and not known to the two engineers who had reviewed and approved it. Their review comments were substantive and engaged. They had simply been applying their normal review frame to code whose risk profile was different from what that frame assumed.
The disclosure problem is real and unresolved. GitHub's 2023 developer survey found that fewer than 40% of developers consistently disclose AI tool usage to their reviewers. This lesson is about building identification skills that do not depend on disclosure.
AI-generated code has measurable structural tendencies that differ from human-written code in the same codebase. These are probabilistic, not deterministic — any single signal is weak, but clusters of signals are meaningful.
Uniform comment density at function boundaries. AI tools tend to generate docstrings or block comments at the top of every function, with a consistency that human developers in the same codebase rarely match. If every function in a new file has a well-formed docstring but functions elsewhere in the project are sparsely documented, that asymmetry is a signal.
Happy-path completeness with thin error handling. AI models trained on tutorial code generate the nominal case with high fidelity and generate error handling as a formulaic afterthought. Look for try/catch blocks that catch a broad exception type and either swallow it or log a generic message, in functions where the happy path is elaborately specified.
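The broad-catch signal described above can be checked mechanically. The following is a minimal sketch, assuming Python source and using only the standard `ast` module; the function name `find_broad_handlers` and the definition of "thin" (a handler body that is a single `pass` or a single bare call such as a log line) are illustrative choices, not an established rule.

```python
import ast

# Exception names treated as "broad" for this sketch (an assumption, tune per project).
BROAD_NAMES = {"Exception", "BaseException"}


def find_broad_handlers(source):
    """Return line numbers of except clauses that catch broadly and do very little."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # A bare `except:` has type None; otherwise look for Exception/BaseException.
        is_broad = node.type is None or (
            isinstance(node.type, ast.Name) and node.type.id in BROAD_NAMES
        )
        # "Thin" body: exactly one statement that is `pass` or a bare call.
        only = node.body[0] if len(node.body) == 1 else None
        is_thin = isinstance(only, ast.Pass) or (
            isinstance(only, ast.Expr) and isinstance(only.value, ast.Call)
        )
        if is_broad and is_thin:
            flagged.append(node.lineno)
    return sorted(flagged)


SAMPLE = '''
def load(path):
    try:
        return open(path).read()
    except Exception:
        print("error")

def parse(text):
    try:
        return int(text)
    except ValueError as exc:
        raise RuntimeError("bad integer") from exc
'''

print(find_broad_handlers(SAMPLE))  # [5]
```

A hit from a check like this is a prompt to read the function more closely, not evidence of authorship: plenty of human-written code swallows exceptions too.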
Verbosity mismatches. AI-generated code frequently names variables with a clarity that suggests it is explaining itself to a reader — authenticationTokenExpirationTimestamp where the existing codebase uses tokenExp. This over-explanation is characteristic of training on documentation and tutorial prose.
Beyond structure, context provides strong signals. A sudden productivity spike — a commit that adds 300 lines in a time window where the developer's typical output is 40 lines — warrants closer attention, not as a punitive measure but as a calibration cue. The question is not whether AI was used; it is whether the review process is calibrated for what was produced.
Framework and library version mismatches are among the stronger signals of AI involvement. If the project uses React 18 but a new component uses patterns characteristic of React before hooks arrived in version 16.8 (class components where the rest of the codebase is functional, lifecycle methods that have been superseded by hooks), the explanation is usually either copy-paste from old code or AI generation from a model trained on older data. Both require the same additional review step: explicit verification against current documentation.
Security-sensitive functions that appear in a PR without a corresponding update to test coverage are worth flagging regardless of authorship, but AI-generated security functions are particularly likely to have this property because the model generates the function and the tests separately, and the tests are often coverage theater — they pass assertions on the happy path without probing edge cases.
The goal of identification is not to penalize AI tool use or to create a surveillance culture. It is to trigger the appropriate review process. Teams that normalize disclosure — treating "I used Copilot for this block" as unremarkable information in a PR description — have a simpler path to calibrated review than teams where disclosure is politicized or absent.
How an author responds to review questions about AI-generated code is itself informative. Human developers who wrote a function can typically explain why they made a specific implementation choice — why they used a particular algorithm, what edge cases they considered, why a certain constant has the value it does. Developers who accepted an AI suggestion often cannot answer these questions, not because they are evasive, but because they genuinely do not know: the AI made those choices, and the developer reviewed the output for plausibility, not for design intent.
This is not a deficiency to exploit in code review. It is a signal to act on. When a reviewer asks "why did you use SHA-256 here instead of bcrypt?" and the author's response is uncertain or deferred, that is the moment to go to the documentation together — not to establish blame, but because the choice matters and nobody in the room has verified it.
Build a review instinct that asks three questions of any new code block: Does the structure match the codebase's typical patterns? Does the library usage match the project's current dependency versions? Can the author explain the specific implementation choices? Uncertainty on any two of three is a strong signal to review with the AI-specific checklist from Lesson 3.
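The three-question rule above can be written down so that every reviewer applies the same threshold. This is a minimal sketch; the class and field names are hypothetical, and treating both "no" and "unsure" as uncertainty is an assumption about how the rule is meant to be read.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockAssessment:
    # True = yes, False = no, None = the reviewer could not tell.
    structure_matches_codebase: Optional[bool]
    library_usage_matches_versions: Optional[bool]
    author_can_explain_choices: Optional[bool]


def needs_ai_checklist(assessment: BlockAssessment) -> bool:
    """Apply the 'uncertain on any two of three' rule."""
    answers = (
        assessment.structure_matches_codebase,
        assessment.library_usage_matches_versions,
        assessment.author_can_explain_choices,
    )
    # Anything other than a clear "yes" counts as uncertainty in this sketch.
    uncertain = sum(1 for answer in answers if answer is not True)
    return uncertain >= 2


print(needs_ai_checklist(BlockAssessment(True, None, False)))  # True
print(needs_ai_checklist(BlockAssessment(True, True, None)))   # False
```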
The assistant will present you with pull request descriptions, commit contexts, or short code blocks. Your job is to identify which structural, temporal, or behavioral signals suggest AI generation — and articulate how confident you are and why. This lab focuses on the identification step, before you apply the review checklist.
In early 2024, the security team at a payments infrastructure company published an internal post-mortem after discovering that a rate-limiting function — accepted from a Copilot suggestion and reviewed without incident — used a Redis INCR pattern that was correct for a single-instance deployment but silently failed in their multi-region active-active configuration. The function had been in production for four months. The team's post-mortem conclusion was precise: their review checklist had no step for verifying that implementation assumptions matched the deployment topology. The checklist was updated. This lesson is that updated checklist, generalized.
The AI-specific review checklist has three sections, each targeting one of the failure classes from Lesson 1. It is designed to be run after standard review, not instead of it. Estimate 10–20 additional minutes for a security-sensitive block; less for utility code.
Step A1 — Identify the security-sensitive patterns. Walk through the function and flag every line that touches: authentication, authorization, cryptography, input validation, session management, or external API calls. These are the locations where stale patterns cause the most harm.
Step A2 — Cross-reference against current advisories. For each flagged location, verify the implementation against the current version of the relevant specification or advisory — not a tutorial, not Stack Overflow, the primary source. For JWT, that is RFC 7519 and the current OWASP JWT Cheat Sheet. For password hashing, that is the current OWASP Password Storage Cheat Sheet. This step is non-negotiable for authentication and cryptography code.
Step A3 — Check dependency versions. Confirm that the code uses the version of any library present in the project's current dependency manifest, not a version the model may have been trained on. If the code calls a method that does not appear in the installed version's documentation, flag it.
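The JWT "alg:none" vulnerability, publicly disclosed in 2015 and affecting numerous JWT libraries, is the canonical Stale Pattern Replay example. (CVE-2015-9235, sometimes cited for it, tracks the related algorithm-confusion flaw in the Node.js jsonwebtoken library.) Verification code that honored the "alg:none" value was common, though never safe, in implementations predating the 2015 advisory. Any AI-generated JWT validation code should have Step A2 applied regardless of how clean it looks.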
Step B1 — Map the function's assumptions to the deployment environment. List the assumptions the function makes about its environment: single-instance vs. distributed; synchronous vs. async; trusted vs. untrusted input; read-only vs. write context. Compare these to the actual deployment topology. This is the step the payments company's checklist was missing.
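One way a single-instance assumption can break is sketched below. The simulation is hypothetical: `CounterStore` stands in for one Redis instance and models only an INCR-like operation, window expiry is omitted, and each region is assumed to keep its own counter with no synchronous replication. It is not a reconstruction of the payments company's actual failure.

```python
class CounterStore:
    """Stand-in for one Redis instance; only an INCR-like operation is modeled."""

    def __init__(self):
        self._counts = {}

    def incr(self, key: str) -> int:
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]


def allow_request(store: CounterStore, client_id: str, limit: int) -> bool:
    # The limiter itself: correct as long as every request hits the same store.
    return store.incr(f"rate:{client_id}") <= limit


def simulate(regions: int, requests: int, limit: int) -> int:
    """Count how many requests are allowed when traffic is spread across regions."""
    stores = [CounterStore() for _ in range(regions)]
    allowed = 0
    for i in range(requests):
        store = stores[i % regions]  # round-robin routing between regions
        if allow_request(store, "client-42", limit):
            allowed += 1
    return allowed


print(simulate(regions=1, requests=30, limit=10))  # 10
print(simulate(regions=3, requests=30, limit=10))  # 30
```

Nothing in `allow_request` is wrong when read in isolation, which is why Step B1 asks for the assumption list to be compared against the topology rather than against the code.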
Step B2 — Trace data flows from untrusted sources. Starting from every point where external data enters the function — HTTP request, database read, file input, IPC — trace the data through every transformation until it either reaches a trust boundary (sanitization, validation, parameterization) or exits the function. Any path where untrusted data reaches a sink (SQL query, shell command, HTML output) without passing a trust boundary is a finding.
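The source-to-sink trace in Step B2 is easiest to see with a SQL sink. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the table, function names, and payload are illustrative. In the first function the untrusted value reaches the query through string formatting; in the second it passes a trust boundary (parameter binding) first.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "admin"), ("bob", "user")],
    )
    return conn


def find_user_unsafe(conn, name):
    # FINDING: untrusted `name` is interpolated straight into the SQL sink.
    query = f"SELECT name, role FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Trust boundary: the driver binds `name` as data, never as SQL text.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = make_db()
payload = "x' OR '1'='1"
print(len(find_user_unsafe(conn, payload)))  # 2: the injected clause matches every row
print(len(find_user_safe(conn, payload)))    # 0: no user has that literal name
```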
Step B3 — Verify that the function's behavior is correct in failure modes. What happens when the network is unavailable? When the downstream service returns an unexpected status code? When the input is at the boundary of expected size? AI-generated functions tend to handle the defined happy path and one or two explicit error conditions, and to be silent on everything else.
Step C1 — Verify every external API call against the current documentation. For every call to an external library, SDK, or service API, open the current documentation and confirm that the method exists, accepts the parameters as called, and returns what the code expects. This step takes two minutes per call and catches hallucinated methods before they reach production.
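Reading the documentation is the authoritative check, but part of Step C1 can be pre-screened without running the code under review. The following is a minimal sketch using Python's standard `inspect` module; the helper name `call_is_plausible` is hypothetical, and the `urlsafe=True` keyword is an invented example of a hallucinated parameter. Note that `inspect.signature` raises `ValueError` for some C-implemented callables, so this only works where a signature is available.

```python
import base64
import inspect


def call_is_plausible(func, *args, **kwargs) -> bool:
    """Check that the arguments fit func's signature, without calling func."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


# A method that no longer exists: dict.has_key was removed in Python 3.
print(hasattr({}, "has_key"))  # False

# A real function called with a real parameter, then with an invented one.
print(call_is_plausible(base64.b64encode, b"data", altchars=b"-_"))  # True
print(call_is_plausible(base64.b64encode, b"data", urlsafe=True))    # False
```

A passing check only shows that the call shape is accepted by the installed version. It says nothing about whether the return value is what the calling code expects, which still requires the documentation.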
Step C2 — Verify constants and magic values. AI models frequently generate plausible-looking constants (error codes, configuration keys, algorithm identifiers) that are subtly wrong. A TLS configuration that specifies TLSv1_2 to an API expecting TLSv1.2, or the reverse (which spelling is valid depends on the library), will fail at runtime in ways that are hard to diagnose. Check every constant that is not obviously derived from the codebase against its authoritative source.
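Which spelling of a constant is authoritative depends on the library, so the check has to be made against the library actually in use. As a small illustration, Python's standard `ssl` module exposes TLS versions as the `ssl.TLSVersion` enum, where the member is spelled `TLSv1_2` and the dotted form is not a member (other tools and languages use the dotted string instead). The helper name below is hypothetical.

```python
import ssl


def is_valid_tls_version_name(name: str) -> bool:
    """Check a candidate name against the enum the installed ssl module defines."""
    return name in ssl.TLSVersion.__members__


print(is_valid_tls_version_name("TLSv1_2"))  # True
print(is_valid_tls_version_name("TLSv1.2"))  # False
```

The same idea applies to any constant that comes from an enumeration or a documented set of keys: compare against the set the installed library defines, not against what looks right.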
Step C3 — Run with dependencies resolved before approving. For dynamic languages, do not approve a function until it has been run at least once with all dependencies resolved. In Python and JavaScript, a hallucinated method call on a real object raises no error until the line executes. A simple smoke test catches this class of defect on every code path it exercises, so make sure the test reaches the calls flagged in Step C1.
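A short sketch of why Step C3 matters in Python. The function below defines without complaint even though `to_lowercase` is not a `str` method (it is an invented example of a hallucinated call; the real method is `lower`). The error appears only when the offending line runs, and an input that skips that line hides it.

```python
def normalize_names(names):
    # `to_lowercase` does not exist on str; nothing complains at definition time.
    return [name.strip().to_lowercase() for name in names]


def smoke_test(func, *args) -> str:
    """Run func once and report whether a missing attribute surfaced."""
    try:
        func(*args)
    except AttributeError as exc:
        return f"FAIL: {exc}"
    return "ok"


# Input that reaches the hallucinated call: the defect surfaces.
print(smoke_test(normalize_names, [" Ada "]))

# Empty input never executes the call, so the same defect goes unnoticed.
print(smoke_test(normalize_names, []))
```

The second call is the caution: a smoke test is only as good as the paths it exercises, so choose inputs that reach every external call in the function.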
Not all code requires all steps. Utility functions with no security surface, no external dependencies, and no distributed-system assumptions can pass with Steps A3, C1, and C2 only. The full checklist is for security-sensitive, infrastructure-touching, or deployment-topology-sensitive code. Applying it uniformly is inefficient; calibrating it to risk is the skill.
You will work through applying the Lesson 3 checklist — Steps A, B, and C — to code scenarios presented by the assistant. For each scenario, identify which checklist steps apply, what you would verify and how, and what findings you would raise. The assistant will challenge gaps in your application of the checklist.
In mid-2023, the engineering team at Cursor — the AI-first code editor — published notes from their internal review process describing a problem they called "review fatigue asymmetry." Their developers were generating code faster than reviewers could apply adequate scrutiny at full depth. The solution was not to slow down generation or to hire more reviewers. It was to tier the review process: a fast-path review for low-risk code, and a structured AI-specific checklist for code meeting defined risk criteria. The tier assignment happened at the PR description stage, not the review stage. Reviewers knew before they opened the diff what depth of review was expected.
This lesson is about building that infrastructure — not the checklist itself, but the organizational layer that makes the checklist used consistently rather than sporadically.
A tiered review model assigns incoming code changes to one of three tracks based on risk criteria assessed at the time the PR is opened, before review begins. This front-loads the classification decision, which is faster and more consistent than making it per-reviewer at review time.
Track 1 — Standard Review. Code with no security surface, no external API calls, no distributed-system assumptions, and no authentication or authorization logic. AI generation is low-risk here; the existing review process is sufficient. Example: a utility function that formats a timestamp string.
Track 2 — AI-Aware Review. Code with any of the risk signals from Lesson 2, or code touching dependencies, configuration, or data validation. The Lesson 3 checklist applies at minimum for Steps A3, B3, and C1–C2. Estimated additional time: 10–15 minutes per security-sensitive block.
Track 3 — Security Review. Code that directly implements authentication, cryptography, authorization policy, or external trust boundaries. Requires the full Lesson 3 checklist plus a dedicated security reviewer if the team has one, or explicit sign-off from a senior engineer who has run every checklist step. No exceptions.
Escalation criteria should be written down, not left to reviewer judgment in the moment. The following conditions warrant automatic escalation to Track 3 regardless of initial triage:
Any code that modifies session token generation, validation, or storage. Any code that introduces a new external service integration. Any code that handles payment data, PII, or regulated data categories. Any code where the reviewer cannot determine the deployment topology assumption within five minutes of reading. Any code where Step C1 reveals a hallucinated API method.
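Writing the criteria down can be as literal as encoding them. The sketch below turns the track definitions and the escalation conditions above into a single function; the class name, field names, and the idea of filling them in from a PR description are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ChangeProfile:
    # Track 3 criteria
    implements_auth_crypto_or_trust_boundary: bool = False
    # Automatic escalation conditions
    modifies_session_tokens: bool = False
    new_external_integration: bool = False
    handles_regulated_data: bool = False
    topology_assumption_unclear: bool = False
    hallucinated_api_found: bool = False
    # Track 2 criteria
    has_lesson2_risk_signals: bool = False
    touches_deps_config_or_validation: bool = False


def assign_track(change: ChangeProfile) -> int:
    """Return 1 (Standard), 2 (AI-Aware), or 3 (Security) for a change."""
    escalate = any(
        (
            change.modifies_session_tokens,
            change.new_external_integration,
            change.handles_regulated_data,
            change.topology_assumption_unclear,
            change.hallucinated_api_found,
        )
    )
    if escalate or change.implements_auth_crypto_or_trust_boundary:
        return 3
    if change.has_lesson2_risk_signals or change.touches_deps_config_or_validation:
        return 2
    return 1


print(assign_track(ChangeProfile()))                                        # 1
print(assign_track(ChangeProfile(touches_deps_config_or_validation=True)))  # 2
print(assign_track(ChangeProfile(has_lesson2_risk_signals=True,
                                 hallucinated_api_found=True)))             # 3
```

Because escalation is checked first, a finding made during a Track 2 review (such as a hallucinated method in Step C1) moves the change to Track 3 simply by updating the profile and re-running the function.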
Escalation is not a blame assignment. It is a statement that this decision requires more eyes or more expertise than the current reviewer has available. Teams that normalize escalation as a professional skill, rather than an admission of inadequacy, tend to catch more security issues before merge than teams where reviewers feel pressure to approve rather than escalate.
The most effective teams in 2023–2024 adopted what amounts to a "trust but verify, then remember" model: AI-generated code receives full checklist review on first submission, and the findings are recorded. Over time, patterns emerge — certain types of AI-generated code are clean; others consistently require findings. That history informs the next triage decision. It is not about trusting or distrusting the AI tool; it is about building an evidence base for where the risk concentrates.
Trust in AI-generated code should be earned through accumulated evidence, not assumed from the tool's reputation or the author's confidence. The mechanism for earning trust is a review log: for every AI-assisted PR that goes through Track 2 or Track 3, record what checklist steps were applied, what findings were raised, and how they were resolved.
After several months, patterns in this log become actionable. If a particular developer's AI-assisted PRs consistently pass Track 2 review with no findings in a specific domain — say, frontend utility code — that is evidence that their review-before-commit process is effective, and the triage for their PRs in that domain can be calibrated accordingly. If JWT validation code consistently generates findings regardless of author, that is evidence that Track 3 should be automatic for that code category.
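A review log does not need tooling beyond a structured record and one aggregate. The sketch below is one possible shape, with hypothetical field names and invented sample entries; it computes, per code category, the share of reviewed PRs that produced at least one finding, which is the number the triage decisions above would draw on.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReviewRecord:
    category: str                   # e.g. "jwt-validation", "frontend-utility"
    track: int                      # 2 or 3
    steps_applied: Tuple[str, ...]  # checklist steps that were run
    findings: Tuple[str, ...]       # empty tuple means a clean review
    resolution: str = ""            # how the findings were resolved


def finding_rate_by_category(log):
    """Share of reviewed PRs per category that produced at least one finding."""
    reviewed = defaultdict(int)
    with_findings = defaultdict(int)
    for record in log:
        reviewed[record.category] += 1
        if record.findings:
            with_findings[record.category] += 1
    return {cat: with_findings[cat] / reviewed[cat] for cat in reviewed}


# Invented sample entries, for illustration only.
LOG = [
    ReviewRecord("jwt-validation", 3, ("A1", "A2", "A3"), ("accepts alg none",), "fixed"),
    ReviewRecord("jwt-validation", 3, ("A1", "A2", "A3"), ("no expiry check",), "fixed"),
    ReviewRecord("frontend-utility", 2, ("A3", "C1", "C2"), ()),
    ReviewRecord("frontend-utility", 2, ("A3", "C1", "C2"), ()),
]

print(finding_rate_by_category(LOG))
# {'jwt-validation': 1.0, 'frontend-utility': 0.0}
```

Aggregating by code category rather than by person keeps the log aligned with its purpose: allocating review attention, not ranking developers.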
This is a calibration process, not a scoring process. The goal is not to rank developers or tools. It is to allocate limited review attention to the places where findings actually occur — which is the same goal as all engineering process improvement.
Teams where AI tool disclosure is normalized have a structural advantage over teams where it is not. Normalizing disclosure does not require mandating it — it requires making it unremarkable. A PR template that includes "AI assistance used: [ ] yes [ ] no — if yes, which blocks?" creates a disclosure pathway that feels like a routine completion rather than a confession. The presence of that field, filled out routinely, eliminates the identification problem from Lesson 2 for the majority of cases and focuses identification effort on the minority where disclosure is absent.
The cultural work is in the management response to disclosure. If the first time a developer discloses AI assistance their PR receives unusually aggressive review or their manager expresses concern, disclosure rates will drop. If disclosure is met with a calibrated, professional response — "great, let me apply Track 2 review to these blocks" — disclosure rates climb and the review process improves for everyone.
This module established why AI code needs different review (the three failure classes), how to identify AI-generated code without depending on disclosure, what a structured checklist looks like in practice, and how to build the organizational infrastructure that makes consistent review possible. The next module applies these principles to specific language environments and security domains.
The assistant will describe a team's current code review workflow and AI tool usage patterns. Your job is to design a tiered review process appropriate to their context — defining track criteria, escalation triggers, and disclosure norms. The assistant will probe your design choices and present edge cases that test the robustness of your framework.