AI Code Review Fundamentals · Introduction

Every Tool That Amplifies Human Work Eventually Demands a New Kind of Oversight

This course exists because the old review checklist was written for humans, and AI is not one.

In 1882, Thomas Edison's Pearl Street Station began supplying electric light to a small grid of customers in lower Manhattan. Within a few years, insurance underwriters realized their standard fire-inspection checklists, written for gas lamps and candles, were simply wrong for the new medium. Wiring that passed every existing criterion could still arc, overheat, and burn. The National Board of Fire Underwriters published its first electrical code in 1897, not because electricity was uniquely dangerous, but because it was differently dangerous: its failure modes were invisible, fast, and systemic in ways that flame never was.

The parallel to AI-generated code in the 2020s is striking in its specificity. Between 2022 and 2024, GitHub Copilot, Amazon CodeWhisperer, and their successors moved from curiosity to daily tool for millions of professional developers. Stanford researchers published findings in 2022 showing that developers using AI assistants produced insecure code at a measurably higher rate when they trusted the output without scrutiny — not because the tools were malicious, but because they were fluent. The code looked right. It compiled. Its failure modes were invisible until runtime, or until an attacker found them first.

This course will not tell you AI-generated code is bad or that you should fear it. It will show you, concretely and with documented evidence, why the review process that works for human-written code is insufficient for AI-generated code — and what a better process looks like. You will finish with a mental model, a set of heuristics, and practiced judgment. You will not finish with certainty; no one has that yet. But you will have considerably more than you started with.

If you finish every module, here's who you become:

You'll understand why AI-generated code fails differently than human-written code — and why fluency is its most dangerous quality.
You'll be able to spot hallucinated APIs and phantom dependencies before they reach production or an attacker's hands.
You'll recognize the specific tells and patterns that signal an AI output needs deeper scrutiny, not just a quick scan.
You'll trace data through AI-generated logic to catch type mismatches, undocumented assumptions, and silent failures at the source.
You'll write tests designed for the failure modes AI code actually introduces, not the ones traditional test suites were built to find.
You'll finish with a personal review checklist — repeatable, yours, and built from practiced judgment rather than borrowed convention.
You're becoming the kind of developer who can work with AI tools at full speed without surrendering the oversight those tools demand.

AI Code Review Fundamentals · Lesson 1

The Fluency Illusion: Why Correct-Looking Code Is Not the Same as Correct Code

AI models are trained to produce plausible text, not verified programs — and that distinction changes everything about review.

What makes AI-generated code structurally different from human-written code, and why do standard review practices miss the gap?

In August 2023, a solo developer at a fintech startup in San Francisco accepted a GitHub Copilot suggestion for a JWT validation function. The function compiled cleanly, passed the existing unit tests, and looked, to a reviewer who skimmed it in a pull request, like a textbook implementation. Six weeks later, a penetration tester discovered that the function silently accepted tokens with the algorithm field set to "none", a known attack vector that has been publicly documented since 2015. The AI had reproduced a pattern from training data that predated the security advisory. No human on the team had written that bug; no human had caught it either.

This is not an isolated case. It is the canonical shape of AI code review failure: confident surface, fragile interior. Understanding why that shape occurs is where this module begins.

1.1 — How AI Code Generation Actually Works

Large language models generate code by predicting the next token given the tokens that came before, conditioned on a massive corpus of text that includes GitHub repositories, Stack Overflow answers, documentation, and tutorials. The key word is predicting. The model is not executing the code, not running tests against it, not reasoning from first principles about whether the logic is correct. It is producing a statistically likely continuation of the pattern it sees in the prompt.
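
As a toy illustration of prediction without verification, the sketch below counts continuations in a tiny made-up corpus and always emits the most frequent one. The corpus, the token strings, and the helper names are all hypothetical, and real models are vastly more sophisticated. The point that carries over is that frequency in the training data, not correctness, decides the output.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each entry is one tokenized line "seen in training".
# The unsafe setting appears more often than the safe one.
CORPUS = [
    ["verify", "=", "False"],
    ["verify", "=", "False"],
    ["verify", "=", "True"],
]


def build_table(corpus):
    """Count which token follows each two-token context."""
    table = defaultdict(Counter)
    for tokens in corpus:
        for i in range(2, len(tokens)):
            context = (tokens[i - 2], tokens[i - 1])
            table[context][tokens[i]] += 1
    return table


def most_likely_next(table, context):
    """Return the most frequent continuation; nothing is executed or checked."""
    return table[context].most_common(1)[0][0]


table = build_table(CORPUS)
prediction = most_likely_next(table, ("verify", "="))
print(prediction)  # False: the common pattern wins, whether or not it is safe
```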

This produces a distinctive kind of artifact. AI-generated code tends to be syntactically fluent — it uses the right variable naming conventions, follows the project's style, handles the happy path elegantly. It is often semantically plausible — the algorithm it implements is recognizably related to the task described. But it can be logically incorrect in ways that are invisible to a quick read, because the failure modes are in edge cases, in security boundaries, in the gap between what the function appears to do and what it actually does under adversarial conditions.

Human developers make different kinds of mistakes. A human writing a JWT validator from scratch in 2024 is unlikely to reintroduce the "alg:none" vulnerability because they are likely to have encountered the advisory, or to have looked up a reference implementation that already patches it. An AI trained on a corpus with a long tail of pre-2015 code has no such temporal awareness. It reproduces the distribution of its training data, not the current state of knowledge.
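
To make the "alg:none" case concrete, here is a minimal sketch of an HS256 verifier that checks the algorithm against an explicit allowlist before it looks at the signature. It uses only the Python standard library and is an illustration under stated assumptions, not a production implementation: the function names and the demo secret are hypothetical, claims such as expiry are not checked, and real projects would normally rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json

ALLOWED_ALGORITHMS = {"HS256"}  # explicit allowlist; "none" is never accepted


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    # JWT segments drop base64 padding, so restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign_hs256(payload: dict, secret: bytes) -> str:
    """Demo helper: build an HS256-signed token."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url_encode(json.dumps(payload).encode())
    mac = hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{b64url_encode(mac)}"


def verify_hs256(token: str, secret: bytes) -> dict:
    """Return the payload, or raise ValueError if the token cannot be trusted."""
    try:
        header_b64, body_b64, sig_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
    except ValueError as exc:  # wrong segment count, bad base64, bad JSON
        raise ValueError("malformed token") from exc

    # Reject anything outside the allowlist BEFORE touching the signature.
    alg = header.get("alg") if isinstance(header, dict) else None
    if not isinstance(alg, str) or alg not in ALLOWED_ALGORITHMS:
        raise ValueError(f"algorithm not allowed: {alg!r}")

    expected = hmac.new(
        secret, f"{header_b64}.{body_b64}".encode(), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("signature mismatch")
    return json.loads(b64url_decode(body_b64))


good = sign_hs256({"sub": "alice"}, b"demo-secret")
unsigned = ".".join([
    b64url_encode(json.dumps({"alg": "none"}).encode()),
    b64url_encode(json.dumps({"sub": "admin"}).encode()),
    "",  # an unsigned token carries an empty signature segment
])

print(verify_hs256(good, b"demo-secret"))  # {'sub': 'alice'}
try:
    verify_hs256(unsigned, b"demo-secret")
except ValueError as exc:
    print("rejected:", exc)
```

A stale implementation typically differs in one place: it reads the algorithm from the token header and skips signature verification when the value is "none". That branch is exactly what a reviewer should look for.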

Research Finding

A 2022 study by Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh at Stanford ("Do Users Write More Insecure Code with AI Assistants?") found that participants with access to an AI coding assistant wrote significantly less secure code in security-sensitive tasks, and were more likely to believe their insecure code was correct. The combination of fluency and misplaced confidence is the core risk.

1.2 — What Standard Code Review Was Designed to Catch

Traditional code review practices — whether informal pair review or structured checklists like those from Google's engineering handbook or the OWASP Code Review Guide — were designed around human authorship. They look for: logic errors the author was too close to see; style and consistency violations; missing test coverage; known anti-patterns the reviewer recognizes from experience; and security issues that arise from misunderstanding a library's API.

These practices work because human developers make predictable mistakes. Experienced reviewers develop pattern recognition for the errors that junior developers commonly make with specific languages, frameworks, or problem domains. The reviewer is essentially a more experienced version of the author, correcting for the author's blind spots.

AI-generated code breaks this model in two ways. First, the error distribution is different: AI mistakes are not the mistakes of a junior developer with incomplete knowledge of the domain, but of a statistical process with no temporal awareness, no understanding of deployment context, and no ability to distinguish between a pattern that is common because it is correct and a pattern that is common because it appears frequently in tutorials (including tutorials that document what not to do).

Second, the reviewer's heuristic — "does this look like something a developer would write?" — is actively misleading when applied to AI output, because AI output is optimized to look exactly like something a developer would write.

1.3 — The Three Distinctive Failure Classes

Research and post-mortems from 2022–2024 suggest that AI-generated code fails in three ways that are qualitatively different from typical human errors:

Stale Pattern Replay The model reproduces a pattern from its training data that was correct at the time of training but has since been superseded by a security patch, a deprecation, or a change in best practice. The JWT "alg:none" case is a textbook example. The model has no mechanism to know that a 2015 advisory changed the correct implementation.

Context Blindness The model generates code that is correct in isolation but wrong in the specific deployment context. A function that safely handles user input in a read-only analytics context may silently fail to sanitize when that same function is later called with data destined for a SQL query. The model cannot see the surrounding system; it can only see what is in the context window.

Confident Hallucination The model invents API methods, library functions, or constants that do not exist, with no syntactic signal that anything is wrong. In 2023, multiple documented cases emerged of Copilot and ChatGPT suggesting calls to non-existent AWS SDK methods that would only fail at runtime. In a statically typed, compiled language, a call to a missing method surfaces at build time; in Python or JavaScript, it can reach production.
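
The runtime-only nature of a hallucinated call in a dynamic language can be shown in a few lines. In this sketch, `StorageClient` is a hypothetical stand-in for a real SDK client and `upload_with_retry` is an invented method name of the kind a model might produce. The generated source compiles without complaint, and the defect appears only when the line runs.

```python
class StorageClient:
    """Hypothetical stand-in for an SDK client; only `upload` exists."""

    def upload(self, key: str, data: bytes) -> str:
        return f"stored:{key}"


# Source a model might plausibly produce. `upload_with_retry` does not exist.
GENERATED_SOURCE = '''
def save_report(client, data):
    return client.upload_with_retry("report.csv", data, retries=3)
'''

# Compilation succeeds: Python does not resolve attribute names ahead of time.
namespace = {}
exec(compile(GENERATED_SOURCE, "<generated>", "exec"), namespace)
save_report = namespace["save_report"]

# The defect shows up only when the call actually executes.
try:
    save_report(StorageClient(), b"a,b\n1,2\n")
    failed_at_runtime = False
except AttributeError as exc:
    failed_at_runtime = True
    print(f"runtime failure: {exc}")
```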

1.4 — Why This Requires a Different Review Process

The implication is not that AI-generated code is worse than human-written code in aggregate — the evidence on that question is genuinely mixed, and productivity gains are real. The implication is that the failure modes are different, and a review process calibrated for human failure modes will systematically miss AI failure modes.

A reviewer who approaches AI-generated code the way they approach a pull request from a senior colleague — reading for logic flow, checking style, confirming the happy path — is performing a review that is insufficient for the task. The review needs to add: explicit verification that any security-sensitive patterns reflect current best practice (not just common practice); confirmation that every external API call references the current documentation; and a more deliberate check of edge cases and adversarial inputs, because the author had no understanding of those contexts when generating the code.

The lessons that follow in this module will build each of those additions into a usable practice. Lesson 2 addresses how to identify AI-generated code in the review queue. Lesson 3 covers a structured checklist for the three failure classes above. Lesson 4 addresses the organizational and workflow questions: when to flag, how to escalate, and how to calibrate trust over time.

Core Principle

Reviewing AI-generated code is not harder than reviewing human-written code — it is differently hard. The skills that make a great human-code reviewer are necessary but not sufficient. The skill this course adds is knowing what specifically to distrust, and why.

Lesson 1 Quiz

Four questions · Select the best answer for each

1. The 2022 Stanford study by Perry et al. found that developers using AI assistants in security-sensitive tasks were more likely to:

Correct. The study found that AI assistant users produced less secure code in security tasks and rated their insecure code as correct at higher rates — the fluency illusion in measurable form.

Not quite. The study's central finding was that AI users were both more likely to write insecure code AND more likely to believe their insecure code was correct — a dangerous combination.

2. "Stale Pattern Replay" refers to which specific AI code failure mode?

Correct. Stale Pattern Replay occurs when AI reproduces historically common patterns without awareness that best practice has since changed — as in the JWT "alg:none" case from 2015.

That's not quite right. Stale Pattern Replay specifically refers to outdated security or implementation patterns from training data — patterns that were valid when written but have since been superseded.

3. Why does applying standard human-code review heuristics to AI-generated code produce a misleading result?

Correct. The fluency illusion is central: the heuristic "does this look like something a developer would write?" fails precisely because AI is optimized to produce output that looks like something a developer would write.

Not quite. The problem is the opposite of opacity — AI code looks professionally written, which causes reviewers to extend it unwarranted trust and miss underlying errors.

4. A Copilot suggestion calls a method on an AWS SDK object that does not exist in any version of that SDK. This is an example of which failure class?

Correct. Confident Hallucination is the failure mode where the model invents API methods or constants that do not exist — producing syntactically valid code that only fails at runtime.

Not quite. This is Confident Hallucination — the model fabricated an API method with no syntactic signal of the error. Context Blindness involves correct-in-isolation code that fails in a specific deployment environment.

Lab 1 — The Failure Mode Identifier

Practice classifying AI code failures · Three exchanges to complete

Your Task

You will be presented with short code snippets that contain errors characteristic of AI generation. For each snippet, identify which of the three failure classes applies — Stale Pattern Replay, Context Blindness, or Confident Hallucination — and explain your reasoning. The lab assistant will respond with analysis and may challenge your classification or offer follow-up cases.

Start by asking for your first code snippet, or describe a code scenario you want to analyze. The assistant will guide you through at least three classification exercises.

AI Code Review Fundamentals · Lesson 2

Identifying AI-Generated Code in the Review Queue

You cannot apply a different review process to code you cannot distinguish — here is how to distinguish it reliably.

What signals — structural, stylistic, and contextual — indicate that a code block was likely generated by an AI tool, and how should that change your review posture?

In October 2023, a security team at a mid-sized European bank conducting a post-incident review discovered that the 400-line authentication module at the center of a data exposure had been almost entirely AI-generated — a fact not disclosed in the pull request, not visible in any commit message, and not known to the two engineers who had reviewed and approved it. Their review comments were substantive and engaged. They had simply been applying their normal review frame to code whose risk profile was different from what that frame assumed.

The disclosure problem is real and unresolved. GitHub's 2023 developer survey found that fewer than 40% of developers consistently disclose AI tool usage to their reviewers. This lesson is about building identification skills that do not depend on disclosure.

2.1 — Structural Signals

AI-generated code has measurable structural tendencies that differ from human-written code in the same codebase. These are probabilistic, not deterministic — any single signal is weak, but clusters of signals are meaningful.

Uniform comment density at function boundaries. AI tools tend to generate docstrings or block comments at the top of every function, with a consistency that human developers in the same codebase rarely match. If every function in a new file has a well-formed docstring but functions elsewhere in the project are sparsely documented, that asymmetry is a signal.

Happy-path completeness with thin error handling. AI models trained on tutorial code generate the nominal case with high fidelity and generate error handling as a formulaic afterthought. Look for try/catch blocks that catch a broad exception type and either swallow it or log a generic message, in functions where the happy path is elaborately specified.

Verbosity mismatches. AI-generated code frequently names variables with a clarity that suggests it is explaining itself to a reader — authenticationTokenExpirationTimestamp where the existing codebase uses tokenExp. This over-explanation is characteristic of training on documentation and tutorial prose.
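
The second signal above, a broad handler that swallows the error, is one of the few that can be checked mechanically. The following is a rough sketch for Python sources, where the construct is try/except rather than try/catch. It uses the standard `ast` module; the function name and the sample source are hypothetical, and the check is a heuristic that will produce both false positives and false negatives.

```python
import ast


def find_swallowed_exceptions(source: str) -> list[int]:
    """Return line numbers of handlers that catch broadly and never re-raise."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # A bare `except:` has no type; otherwise look for the broad base classes.
        is_broad = node.type is None or (
            isinstance(node.type, ast.Name)
            and node.type.id in {"Exception", "BaseException"}
        )
        reraises = any(isinstance(child, ast.Raise) for child in ast.walk(node))
        if is_broad and not reraises:
            findings.append(node.lineno)
    return sorted(findings)


SAMPLE = '''
def load(path):
    try:
        return open(path).read()
    except Exception:
        print("error")

def parse(text):
    try:
        return int(text)
    except ValueError:
        return None
'''

print(find_swallowed_exceptions(SAMPLE))  # [5]
```

Only the first handler is reported: it catches `Exception` and continues with a generic message, while the second catches a specific error type.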

2.2 — Temporal and Contextual Signals

Beyond structure, context provides strong signals. A sudden productivity spike — a commit that adds 300 lines in a time window where the developer's typical output is 40 lines — warrants closer attention, not as a punitive measure but as a calibration cue. The question is not whether AI was used; it is whether the review process is calibrated for what was produced.

Framework and library version mismatches are a particularly reliable signal of AI involvement. If the project uses React 18 but a new component uses patterns characteristic of React before hooks arrived in version 16.8 (class components where the rest of the codebase is functional, lifecycle methods where the rest of the codebase uses hooks), the explanation is usually either copy-paste from old code or AI generation from a model trained on older data. Both require the same additional review step: explicit verification against current documentation.

Security-sensitive functions that appear in a PR without a corresponding update to test coverage are worth flagging regardless of authorship, but AI-generated security functions are particularly likely to have this property because the model generates the function and the tests separately, and the tests are often coverage theater — they pass assertions on the happy path without probing edge cases.

Note on Attribution

The goal of identification is not to penalize AI tool use or to create a surveillance culture. It is to trigger the appropriate review process. Teams that normalize disclosure — treating "I used Copilot for this block" as unremarkable information in a PR description — have a simpler path to calibrated review than teams where disclosure is politicized or absent.

2.3 — Behavioral Signals in Review Conversations

How an author responds to review questions about AI-generated code is itself informative. Human developers who wrote a function can typically explain why they made a specific implementation choice — why they used a particular algorithm, what edge cases they considered, why a certain constant has the value it does. Developers who accepted an AI suggestion often cannot answer these questions, not because they are evasive, but because they genuinely do not know: the AI made those choices, and the developer reviewed the output for plausibility, not for design intent.

This is not a deficiency to exploit in code review. It is a signal to act on. When a reviewer asks "why did you use SHA-256 here instead of bcrypt?" and the author's response is uncertain or deferred, that is the moment to go to the documentation together — not to establish blame, but because the choice matters and nobody in the room has verified it.

Practical Takeaway

Build a review instinct that asks three questions of any new code block: Does the structure match the codebase's typical patterns? Does the library usage match the project's current dependency versions? Can the author explain the specific implementation choices? Uncertainty on any two of three is a strong signal to review with the AI-specific checklist from Lesson 3.
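
The two-of-three rule is simple enough to write down as code, which can help a team agree on what it means. This is one possible reading, assuming "uncertainty" covers both a clear "no" and "not sure"; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReviewSignals:
    """Answers to the three questions; False means 'no' or 'not sure'."""
    structure_matches_codebase: bool
    library_usage_matches_versions: bool
    author_can_explain_choices: bool


def needs_ai_checklist(signals: ReviewSignals) -> bool:
    """True when at least two of the three answers are uncertain."""
    uncertain = [
        not signals.structure_matches_codebase,
        not signals.library_usage_matches_versions,
        not signals.author_can_explain_choices,
    ]
    return sum(uncertain) >= 2


# One doubtful answer alone does not trigger the checklist; two do.
print(needs_ai_checklist(ReviewSignals(True, True, False)))   # False
print(needs_ai_checklist(ReviewSignals(False, True, False)))  # True
```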

Lesson 2 Quiz

Four questions · Select the best answer for each

1. According to GitHub's 2023 developer survey cited in Lesson 2, what percentage of developers consistently disclose AI tool usage to their reviewers?

Correct. The survey found fewer than 40% consistently disclosed, which is why building identification skills that don't depend on disclosure is necessary.

Not quite. The figure was fewer than 40%, which underscores why reviewers cannot rely on disclosure and must develop independent identification skills.

2. A new pull request uses class components and componentDidMount() lifecycle methods, while every other component in the React codebase uses functional components and hooks. What is the most likely explanation relevant to AI code review?

Correct. Framework version mismatches are a reliable signal of AI generation from an older training corpus, and both AI generation and copy-paste from old sources require the same additional review step.

Not quite. A systematic pattern mismatch like this — old API style in a codebase using newer APIs — is a textbook signal of AI generation from an older training corpus.

3. What is the recommended response when a reviewer asks an author why they made a specific implementation choice and the author cannot explain it?

Correct. An author's inability to explain implementation choices is a signal to verify together — the goal is accurate review, not attribution of blame.

Not quite. This is precisely the moment to verify the implementation choice against documentation — together, collaboratively, without punishment — because the AI made the choice and nobody has verified it.

4. Which combination of signals provides the strongest basis for applying an AI-specific review checklist to a code block?

Correct. The practical rule from Lesson 2: uncertainty on two or more of the three questions (structure match, version match, author can explain choices) is strong grounds for AI-specific review.

Not quite. Any single signal is weak. The practical threshold is two or more of: structural mismatch, version mismatch, author cannot explain choices.

Lab 2 — Signal Detection Practice

Identify AI-origin signals in pull request descriptions and code contexts · Three exchanges to complete

Your Task

The assistant will present you with pull request descriptions, commit contexts, or short code blocks. Your job is to identify which structural, temporal, or behavioral signals suggest AI generation — and articulate how confident you are and why. This lab focuses on the identification step, before you apply the review checklist.

Ask for your first PR scenario, or describe a code review situation you've encountered where AI involvement was suspected but undisclosed.

AI Code Review Fundamentals · Lesson 3

The AI-Specific Review Checklist: What to Verify and How

A structured verification process calibrated to the three failure classes — not a replacement for standard review, but an addition to it.

What does a complete AI-aware code review look like in practice, step by step?

In early 2024, the security team at a payments infrastructure company published an internal post-mortem after discovering that a rate-limiting function — accepted from a Copilot suggestion and reviewed without incident — used a Redis INCR pattern that was correct for a single-instance deployment but silently failed in their multi-region active-active configuration. The function had been in production for four months. The team's post-mortem conclusion was precise: their review checklist had no step for verifying that implementation assumptions matched the deployment topology. The checklist was updated. This lesson is that updated checklist, generalized.

3.1 — The Checklist Structure

The AI-specific review checklist has three sections, each targeting one of the failure classes from Lesson 1. It is designed to be run after standard review, not instead of it. Estimate 10–20 additional minutes for a security-sensitive block; less for utility code.

3.2 — Section A: Stale Pattern Verification

Step A1 — Identify the security-sensitive patterns. Walk through the function and flag every line that touches: authentication, authorization, cryptography, input validation, session management, or external API calls. These are the locations where stale patterns cause the most harm.

Step A2 — Cross-reference against current advisories. For each flagged location, verify the implementation against the current version of the relevant specification or advisory — not a tutorial, not Stack Overflow, the primary source. For JWT, that is RFC 7519 and the current OWASP JWT Cheat Sheet. For password hashing, that is the current OWASP Password Storage Cheat Sheet. This step is non-negotiable for authentication and cryptography code.

Step A3 — Check dependency versions. Confirm that the code uses the version of any library present in the project's current dependency manifest, not a version the model may have been trained on. If the code calls a method that does not appear in the installed version's documentation, flag it.

Real Case Reference

The JWT "alg:none" weakness, publicized in 2015 together with the related algorithm-confusion flaw tracked as CVE-2015-9235 in the Node.js jsonwebtoken module, is the canonical Stale Pattern Replay example. Code accepting the "alg:none" value was common in implementations predating the 2015 advisory. Any AI-generated JWT validation code should have Step A2 applied regardless of how clean it looks.
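
Step A1 can be given a head start with a crude keyword scan that marks lines for a human to read closely. The sketch below is a triage aid only: the term list and function name are hypothetical, substring matching will over-flag, and code that touches security without using any of these words will be missed.

```python
# Hypothetical term list; tune it to the codebase under review.
SENSITIVE_TERMS = (
    "jwt", "token", "password", "secret", "hmac", "hashlib",
    "encrypt", "decrypt", "session", "authoriz", "authentic", "urlopen",
)


def flag_sensitive_lines(source: str) -> list[tuple[int, str]]:
    """Return (line number, stripped text) for lines mentioning a sensitive term."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        lowered = line.lower()
        if any(term in lowered for term in SENSITIVE_TERMS):
            findings.append((number, line.strip()))
    return findings


SNIPPET = """def greet(name):
    return f"hello {name}"

def login(user, password):
    session_token = make_token(user)
    return session_token
"""

for number, text in flag_sensitive_lines(SNIPPET):
    print(number, text)
```

Every flagged line then goes through Step A2 against the primary source; the scan decides where to look, not what is correct.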

3.3 — Section B: Context Verification

Step B1 — Map the function's assumptions to the deployment environment. List the assumptions the function makes about its environment: single-instance vs. distributed; synchronous vs. async; trusted vs. untrusted input; read-only vs. write context. Compare these to the actual deployment topology. This is the step the payments company's checklist was missing.

Step B2 — Trace data flows from untrusted sources. Starting from every point where external data enters the function — HTTP request, database read, file input, IPC — trace the data through every transformation until it either reaches a trust boundary (sanitization, validation, parameterization) or exits the function. Any path where untrusted data reaches a sink (SQL query, shell command, HTML output) without passing a trust boundary is a finding.

Step B3 — Verify that the function's behavior is correct in failure modes. What happens when the network is unavailable? When the downstream service returns an unexpected status code? When the input is at the boundary of expected size? AI-generated functions tend to handle the defined happy path and one or two explicit error conditions, and to be silent on everything else.
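
Step B1 is easier to appreciate with a simulation. The sketch below shows one way a counter-based rate limiter can be correct on a single instance and wrong across regions, assuming each region keeps its own independent counter. `CounterStore` is a hypothetical in-memory stand-in, not a Redis client, and window expiry is left out to keep the example short.

```python
class CounterStore:
    """In-memory stand-in for one counter backend (not a real Redis client)."""

    def __init__(self):
        self._counts = {}

    def incr(self, key: str) -> int:
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]


def allow_request(store: CounterStore, client_id: str, limit: int) -> bool:
    """Counter check that silently assumes `store` sees every request."""
    return store.incr(f"rate:{client_id}") <= limit


LIMIT = 5

# Single instance: requests beyond the limit are rejected, as intended.
single = CounterStore()
single_allowed = sum(allow_request(single, "c1", LIMIT) for _ in range(10))

# Two regions with independent stores: each enforces the limit locally,
# so the same client gets twice the intended budget.
regions = [CounterStore(), CounterStore()]
multi_allowed = sum(
    allow_request(regions[i % 2], "c1", LIMIT) for i in range(20)
)

print(single_allowed, multi_allowed)  # 5 10
```

Nothing in `allow_request` is wrong line by line. The defect is the unstated assumption about topology, which is why Step B1 asks for the assumptions to be listed explicitly.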

3.4 — Section C: Hallucination Verification

Step C1 — Verify every external API call against the current documentation. For every call to an external library, SDK, or service API, open the current documentation and confirm that the method exists, accepts the parameters as called, and returns what the code expects. This step takes two minutes per call and catches hallucinated methods before they reach production.

Step C2 — Verify constants and magic values. AI models frequently generate plausible-looking constants (error codes, configuration keys, algorithm identifiers) that are subtly wrong. TLS version identifiers are a good example because the spelling differs between libraries: passing "TLSv1_2" to an API that expects "TLSv1.2", or the reverse, will fail at runtime in ways that are hard to diagnose. Check every constant that is not obviously derived from the codebase against its authoritative source.

Step C3 — Run with dependencies resolved before approving. For dynamic languages, do not approve a function until it has been run at least once with all dependencies resolved. In Python and JavaScript, hallucinated method calls on real objects are silent until execution. A simple smoke test catches this class of defect on every code path it exercises; branches it never reaches remain unverified.
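
For Python code, part of Steps C1 and C3 can be approximated by introspection: import the installed module, confirm the attribute exists, and bind the intended arguments against its signature without calling it. The helper below is a sketch under those assumptions. `check_call` is a hypothetical name, it only works for callables whose signatures Python can inspect, and it complements rather than replaces reading the current documentation.

```python
import importlib
import inspect


def check_call(module_name: str, attr_path: str, *args, **kwargs) -> str:
    """Report whether `module.attr_path(*args, **kwargs)` is even well-formed.

    Nothing is called; the arguments are only bound against the signature.
    """
    target = importlib.import_module(module_name)
    for part in attr_path.split("."):
        if not hasattr(target, part):
            return f"missing: {module_name}.{attr_path}"
        target = getattr(target, part)
    try:
        inspect.signature(target).bind(*args, **kwargs)
    except TypeError as exc:
        return f"bad arguments: {exc}"
    except ValueError:
        return "exists (signature not introspectable)"
    return "ok"


# A real function called correctly.
print(check_call("shutil", "copyfile", "a.txt", "b.txt"))  # ok
# A plausible-looking name that shutil does not define.
print(check_call("shutil", "copy_file", "a.txt", "b.txt"))  # missing: shutil.copy_file
# A real function called with an invented keyword argument.
print(check_call("shutil", "copyfile", "a.txt", "b.txt", overwrite=True))
```

Functions that accept arbitrary keyword arguments will pass the binding check even with invented parameters, so an "ok" here is weaker evidence than a successful run.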

Calibration Note

Not all code requires all steps. Utility functions with no security surface, no external dependencies, and no distributed-system assumptions can pass with Steps A3, C1, and C2 only. The full checklist is for security-sensitive, infrastructure-touching, or deployment-topology-sensitive code. Applying it uniformly is inefficient; calibrating it to risk is the skill.

Lesson 3 Quiz

Four questions · Select the best answer for each

1. Step A2 of the AI review checklist requires cross-referencing security-sensitive patterns against current advisories. What is specified as the appropriate source for this verification?

Correct. Step A2 explicitly requires verification against the primary source — not tutorials, not Stack Overflow — because the failure mode is precisely that AI trains on secondary sources that may be outdated.

Not quite. The checklist specifically requires primary sources (RFCs, OWASP Cheat Sheets) because the AI's training data includes secondary sources that may themselves be outdated.

2. The payments company post-mortem cited in Lesson 3 identified that their Redis rate-limiting function failed in production because of which failure class?

Correct. Context Blindness: the function was technically valid in isolation and for its original deployment assumption (single-instance), but failed when the deployment context was multi-region active-active.

Not quite. This is a Context Blindness failure — the code's assumptions about its environment (single-instance) didn't match the actual deployment topology (multi-region). Step B1 is designed to catch exactly this.

3. Step B2 — tracing data flows from untrusted sources — considers a path a "finding" when:

Correct. The definition of a finding is a path from untrusted source to a sink without an intervening trust boundary — regardless of path length or data origin.

Not quite. The criterion is specifically: does untrusted data reach a sink without passing through a trust boundary (sanitization, validation, parameterization)? Path length and data origin are not the relevant factors.

4. For which category of code does the checklist suggest that the full nine-step process is appropriate, as opposed to a subset?

Correct. The checklist is calibrated to risk: utility functions with no security surface may only need Steps A3, C1, and C2. The full checklist is for security-sensitive, infrastructure, or topology-sensitive code.

Not quite. The calibration is by risk category: security-sensitive, infrastructure-touching, or deployment-topology-sensitive code warrants the full checklist. Applying it uniformly to all code is explicitly called out as inefficient.

Lab 3 — Checklist Application

Apply the three-section AI review checklist to real code scenarios · Three exchanges to complete

Your Task

You will work through applying the Lesson 3 checklist — Steps A, B, and C — to code scenarios presented by the assistant. For each scenario, identify which checklist steps apply, what you would verify and how, and what findings you would raise. The assistant will challenge gaps in your application of the checklist.

Ask for your first code scenario to review with the checklist, or describe a security-sensitive function you want to work through systematically.

AI Code Review Fundamentals · Lesson 4

Workflow, Escalation, and Calibrating Trust Over Time

Individual review skill is necessary but insufficient — the surrounding process determines whether findings reach resolution.

How do you integrate AI-aware review into your team's existing workflow, and how do you build calibrated trust in AI-generated code over time?

In mid-2023, the engineering team at Cursor — the AI-first code editor — published notes from their internal review process describing a problem they called "review fatigue asymmetry." Their developers were generating code faster than reviewers could apply adequate scrutiny at full depth. The solution was not to slow down generation or to hire more reviewers. It was to tier the review process: a fast-path review for low-risk code, and a structured AI-specific checklist for code meeting defined risk criteria. The tier assignment happened at the PR description stage, not the review stage. Reviewers knew before they opened the diff what depth of review was expected.

This lesson is about building that infrastructure — not the checklist itself, but the organizational layer that makes the checklist used consistently rather than sporadically.

4.1 — Tiering Your Review Process

A tiered review model assigns incoming code changes to one of three tracks based on risk criteria assessed at the time the PR is opened, before review begins. This front-loads the classification decision, which is faster and more consistent than making it per-reviewer at review time.

Track 1 — Standard Review. Code with no security surface, no external API calls, no distributed-system assumptions, and no authentication or authorization logic. AI generation is low-risk here; the existing review process is sufficient. Example: a utility function that formats a timestamp string.

Track 2 — AI-Aware Review. Code with any of the risk signals from Lesson 2, or code touching dependencies, configuration, or data validation. The Lesson 3 checklist applies at minimum for Steps A3, B3, and C1–C2. Estimated additional time: 10–15 minutes per security-sensitive block.

Track 3 — Security Review. Code that directly implements authentication, cryptography, authorization policy, or external trust boundaries. Requires the full Lesson 3 checklist plus a dedicated security reviewer if the team has one, or explicit sign-off from a senior engineer who has run every checklist step. No exceptions.
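The three track definitions above can be sketched as a small triage function that runs when the PR is opened. This is a minimal illustration, not a prescribed implementation: the `RiskFlags` field names are hypothetical, and it assumes the PR author (or a bot) answers each question truthfully at PR description time.

```python
from dataclasses import dataclass
from enum import IntEnum


class Track(IntEnum):
    STANDARD = 1  # Track 1: existing review process is sufficient
    AI_AWARE = 2  # Track 2: Lesson 3 checklist, at minimum A3, B3, C1-C2
    SECURITY = 3  # Track 3: full checklist plus security sign-off


@dataclass(frozen=True)
class RiskFlags:
    """Hypothetical yes/no answers collected when the PR is opened."""
    # Track 3 criteria
    directly_implements_auth_or_crypto: bool = False
    defines_external_trust_boundary: bool = False
    # Track 2 criteria (any one is enough)
    has_security_surface: bool = False
    calls_external_api: bool = False
    assumes_distributed_system: bool = False
    touches_auth_logic: bool = False
    has_lesson2_risk_signal: bool = False
    touches_deps_config_or_validation: bool = False


def assign_track(flags: RiskFlags) -> Track:
    """Return the review track for a PR, checking the strictest track first."""
    if flags.directly_implements_auth_or_crypto or flags.defines_external_trust_boundary:
        return Track.SECURITY
    track2_signals = (
        flags.has_security_surface,
        flags.calls_external_api,
        flags.assumes_distributed_system,
        flags.touches_auth_logic,
        flags.has_lesson2_risk_signal,
        flags.touches_deps_config_or_validation,
    )
    if any(track2_signals):
        return Track.AI_AWARE
    return Track.STANDARD


if __name__ == "__main__":
    # A timestamp-formatting utility raises no flags, so it stays on Track 1.
    print(assign_track(RiskFlags()).name)
    # A single external API call is enough to move a PR to Track 2.
    print(assign_track(RiskFlags(calls_external_api=True)).name)
```

Checking Track 3 criteria first means a PR that matches both Track 2 and Track 3 signals always lands on the stricter track.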

4.2 — When and How to Escalate

Escalation criteria should be written down, not left to reviewer judgment in the moment. The following conditions warrant automatic escalation to Track 3 regardless of initial triage:

Any code that modifies session token generation, validation, or storage.

Any code that introduces a new external service integration.

Any code that handles payment data, PII, or regulated data categories.

Any code where the reviewer cannot determine the deployment topology assumption within five minutes of reading.

Any code where Step C1 reveals a hallucinated API method.
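One way to keep these criteria written down rather than left to in-the-moment judgment is to encode them as data. The sketch below is illustrative only: the `ReviewObservations` fields and the `effective_track` helper are hypothetical names, and tracks are represented as plain integers 1 to 3.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReviewObservations:
    """Hypothetical facts a reviewer records while reading the diff."""
    modifies_session_tokens: bool = False
    adds_external_service_integration: bool = False
    handles_regulated_data: bool = False
    topology_unclear_after_five_minutes: bool = False
    c1_found_hallucinated_api: bool = False


# Field name -> human-readable reason, mirroring the five listed conditions.
ESCALATION_REASONS = {
    "modifies_session_tokens": "modifies session token generation, validation, or storage",
    "adds_external_service_integration": "introduces a new external service integration",
    "handles_regulated_data": "handles payment data, PII, or regulated data",
    "topology_unclear_after_five_minutes": "deployment topology assumption unclear after five minutes",
    "c1_found_hallucinated_api": "Step C1 revealed a hallucinated API method",
}


def escalation_reasons(obs: ReviewObservations) -> List[str]:
    """List every escalation condition that the observations satisfy."""
    return [reason for field, reason in ESCALATION_REASONS.items() if getattr(obs, field)]


def effective_track(initial_track: int, obs: ReviewObservations) -> int:
    """Escalate to Track 3 if any condition holds, regardless of initial triage."""
    return 3 if escalation_reasons(obs) else initial_track


if __name__ == "__main__":
    obs = ReviewObservations(c1_found_hallucinated_api=True)
    print(effective_track(1, obs), escalation_reasons(obs))
```

Returning the reasons alongside the track gives the reviewer ready-made wording for the escalation note, which supports treating escalation as routine rather than as blame.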

Escalation is not a blame assignment. It is a statement that this decision requires more eyes or more expertise than the current reviewer has available. Teams that normalize escalation as a professional skill, rather than treating it as an admission of inadequacy, tend to have better security outcomes than teams where reviewers feel pressure to approve rather than escalate.

Organizational Pattern

The most effective teams in 2023–2024 adopted what amounts to a "trust but verify, then remember" model: AI-generated code receives full checklist review on first submission, and the findings are recorded. Over time, patterns emerge — certain types of AI-generated code are clean; others consistently require findings. That history informs the next triage decision. It is not about trusting or distrusting the AI tool; it is about building an evidence base for where the risk concentrates.

4.3 — Calibrating Trust Over Time

Trust in AI-generated code should be earned through accumulated evidence, not assumed from the tool's reputation or the author's confidence. The mechanism for earning trust is a review log: for every AI-assisted PR that goes through Track 2 or Track 3, record what checklist steps were applied, what findings were raised, and how they were resolved.

After several months, patterns in this log become actionable. If a particular developer's AI-assisted PRs consistently pass Track 2 review with no findings in a specific domain — say, frontend utility code — that is evidence that their review-before-commit process is effective, and the triage for their PRs in that domain can be calibrated accordingly. If JWT validation code consistently generates findings regardless of author, that is evidence that Track 3 should be automatic for that code category.

This is a calibration process, not a scoring process. The goal is not to rank developers or tools. It is to allocate limited review attention to the places where findings actually occur — which is the same goal as all engineering process improvement.
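A review log of this kind can be very simple. The following sketch assumes a hypothetical `ReviewLogEntry` record and aggregates by code category rather than by developer or tool, in keeping with the calibration framing. The `min_reviews` threshold is an assumption added here so that a category with only one or two reviews does not drive a triage change.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class ReviewLogEntry:
    """One Track 2 or Track 3 review of an AI-assisted PR (hypothetical schema)."""
    pr_id: str
    category: str                   # e.g. "jwt-validation", "frontend-utility"
    track: int                      # 2 or 3
    steps_applied: Tuple[str, ...]  # checklist steps, e.g. ("A3", "B3", "C1")
    findings: Tuple[str, ...]       # empty tuple means a clean review
    resolution: str


def finding_rate_by_category(
    entries: Iterable[ReviewLogEntry], min_reviews: int = 5
) -> Dict[str, float]:
    """Fraction of reviews with at least one finding, per code category.

    Categories with fewer than min_reviews entries are omitted because
    there is not yet enough evidence to calibrate on.
    """
    totals: Dict[str, int] = defaultdict(int)
    with_findings: Dict[str, int] = defaultdict(int)
    for entry in entries:
        totals[entry.category] += 1
        if entry.findings:
            with_findings[entry.category] += 1
    return {
        category: with_findings[category] / count
        for category, count in totals.items()
        if count >= min_reviews
    }
```

A category whose rate stays high regardless of author is a candidate for automatic Track 3; a category whose rate stays at zero over many reviews is a candidate for lighter triage.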

4.4 — Disclosure Culture

Teams where AI tool disclosure is normalized have a structural advantage over teams where it is not. Normalizing disclosure does not require mandating it — it requires making it unremarkable. A PR template that includes "AI assistance used: [ ] yes [ ] no — if yes, which blocks?" creates a disclosure pathway that feels like a routine completion rather than a confession. The presence of that field, filled out routinely, eliminates the identification problem from Lesson 2 for the majority of cases and focuses identification effort on the minority where disclosure is absent.
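If the team wants triage tooling to read that template field, a small parser is enough. This sketch assumes the field appears in the PR description exactly in the form quoted above, with a checked box written as `[x]`; the function name `parse_ai_disclosure` is hypothetical. It returns `None` when the field is missing or ambiguous, so those PRs can fall back to the identification techniques from Lesson 2.

```python
import re
from typing import Optional

# Matches "AI assistance used: [ ] yes [ ] no" with either box optionally checked.
_DISCLOSURE_FIELD = re.compile(
    r"AI assistance used:\s*\[(?P<yes>[ x])\]\s*yes\s*\[(?P<no>[ x])\]\s*no",
    re.IGNORECASE,
)


def parse_ai_disclosure(pr_description: str) -> Optional[bool]:
    """Return True or False for a clear answer, None if absent or ambiguous."""
    match = _DISCLOSURE_FIELD.search(pr_description)
    if match is None:
        return None  # field missing from the description
    yes_checked = match.group("yes").lower() == "x"
    no_checked = match.group("no").lower() == "x"
    if yes_checked == no_checked:
        return None  # both boxes or neither box checked
    return yes_checked


if __name__ == "__main__":
    print(parse_ai_disclosure("AI assistance used: [x] yes [ ] no"))
```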

The cultural work is in the management response to disclosure. If, the first time a developer discloses AI assistance, their PR receives unusually aggressive review or their manager expresses concern, disclosure rates will drop. If disclosure is met with a calibrated, professional response ("great, let me apply Track 2 review to these blocks"), disclosure rates climb and the review process improves for everyone.

Module Summary

This module established why AI code needs different review (the three failure classes), how to identify AI-generated code without depending on disclosure, what a structured checklist looks like in practice, and how to build the organizational infrastructure that makes consistent review possible. The next module applies these principles to specific language environments and security domains.

Lesson 4 Quiz

Four questions · Select the best answer for each

1. In a tiered review model, when does track assignment happen — and why does timing matter?

Correct. Front-loading the classification decision at PR description time makes it faster and more consistent than leaving it to individual reviewer judgment at review time.

Not quite. The key insight from the Cursor example is that track assignment happens at PR description time — before review begins — so the reviewer's depth expectation is set before they open the diff.

2. Which of the following conditions is listed as warranting automatic escalation to Track 3 regardless of initial triage?

Correct. A hallucinated API method discovered during C1 verification is an explicit automatic escalation trigger — it indicates the review process found a concrete defect that requires deeper scrutiny.

Not quite. The listed escalation criteria include: session token code, new external service integrations, regulated data, unclear deployment topology, and — specifically — any code where Step C1 reveals a hallucinated API method.

3. What is the described purpose of maintaining a review log for AI-assisted PRs over time?

Correct. The review log is a calibration tool — its purpose is to allocate limited review attention to where findings actually concentrate, not to score developers or tools.

Not quite. The review log is explicitly framed as a calibration instrument: it shows where findings occur so review effort can be directed there. It is explicitly not a scoring or ranking mechanism.

4. What does Lesson 4 identify as the key factor that determines whether disclosure culture takes hold in a team?

Correct. The cultural work is in the management response. If disclosure is met with disproportionate scrutiny or concern, disclosure rates drop. If it's treated as routine professional information, disclosure rates climb.

Not quite. Templates create the pathway, but the management response determines whether it's used. Disclosure met with aggressive review produces less disclosure, not more — regardless of what the template says.

Lab 4 — Workflow Design Practice

Design a tiered review process for a described team context · Three exchanges to complete

Your Task

The assistant will describe a team's current code review workflow and AI tool usage patterns. Your job is to design a tiered review process appropriate to their context — defining track criteria, escalation triggers, and disclosure norms. The assistant will probe your design choices and present edge cases that test the robustness of your framework.

Ask for a team context to design a review process for, or describe your own team's current workflow and the assistant will help you identify where AI-aware review should be integrated.

Lab Assistant

AI Code Review · L4

Ready. Describe your team's current workflow and I'll help you design an AI-aware review tier system, or ask me to present a team scenario for you to design around. Either way, we'll work through at least three design iterations.

Module 1 Test

15 questions · 80% required to pass · All lessons covered

1. Large language models generate code by:

Correct. Token prediction without execution or formal reasoning is the root cause of all three AI code failure classes.

Not quite. LLMs predict the next token statistically — they do not execute code, verify it, or retrieve from a database of verified implementations.

2. The JWT "alg:none" vulnerability cited in the module is an example of which failure class?

Correct. The model reproduced a JWT validation pattern that predated the 2015 security advisory — a textbook Stale Pattern Replay.

Not quite. This is Stale Pattern Replay: the model reproduced a historically common pattern without awareness that a 2015 advisory had changed the correct implementation.

3. Context Blindness failures occur when AI-generated code:

Correct. Context Blindness is specifically about the gap between the function's assumptions and the actual deployment environment — which the model cannot see beyond the context window.

Not quite. Context Blindness is the failure mode where code is correct in isolation but wrong for its specific deployment context — the model has no visibility into the surrounding system.

4. Why does the "does this look like something a developer would write?" heuristic fail when applied to AI-generated code?

Correct. The fluency illusion: AI is optimized for plausibility, so looking professional is exactly what it does — regardless of whether the logic is correct.

Not quite. The heuristic fails because AI is good at producing fluent, professional-looking code — which means the surface appearance gives false confidence about correctness.

5. According to the 2022 Stanford study, what was the combined effect of AI assistance in security-sensitive coding tasks?

Correct. Less secure code plus misplaced confidence is the worst-case combination — the finding that motivates this entire course.

Not quite. The study found the dangerous combination: more insecure code AND higher confidence in that insecure code — which means developers were less likely to seek review.

6. Uniform comment density at function boundaries — where every function in a new file has a well-formed docstring but functions elsewhere in the project are sparsely documented — is a signal of:

Correct. The asymmetry between the new file's documentation density and the rest of the codebase is the signal — AI tools produce consistently formatted docstrings that human-authored code in the same project typically doesn't match.

Not quite. The signal is the asymmetry — consistent docstrings in the new code where the surrounding codebase is sparse. AI models trained on tutorial code produce this pattern reliably.

7. A developer cannot explain why they used SHA-256 instead of bcrypt in a password hashing function. According to Lesson 2, the recommended response is:

Correct. The author's uncertainty is a signal to verify together — and in this specific case, SHA-256 is wrong for password hashing (bcrypt, scrypt, or Argon2 are correct). This is exactly the finding that collaborative verification catches.

Not quite. The right move is collaborative verification against current documentation. The inability to explain is a review signal, not grounds for rejection or deferral.

8. Step A2 of the AI review checklist requires verification against:

Correct. Primary sources only — because the AI's failure mode is reproducing patterns from secondary sources that may be outdated. Stack Overflow and tutorials are secondary sources.

Not quite. Step A2 specifies primary sources only: RFCs, official specifications, OWASP Cheat Sheets. Secondary sources (tutorials, Stack Overflow) may themselves reproduce outdated patterns.

9. The payments company post-mortem described in Lesson 3 led to the addition of which step to their review checklist?

Correct. The missing step was explicitly mapping implementation assumptions (single-instance) against deployment reality (multi-region active-active) — now codified as Step B1.

Not quite. The missing step was B1: verifying that the function's assumptions about its deployment environment matched the actual environment. That gap is why the rate limiter failed in multi-region.

10. Step C3 of the checklist addresses Confident Hallucination in dynamic languages by requiring:

Correct. In Python and JavaScript, a hallucinated method call on a real object is syntactically valid and only fails at runtime. A simple smoke test that executes the call surfaces the failure before approval.

Not quite. Step C3 is specifically about running the code before approval — because in dynamic languages, hallucinated methods are syntactically valid and only surface at execution time.

11. In the tiered review model, which type of code is described as warranting Track 1 (standard review) only?

Correct. Track 1 applies only when all four exclusions hold: no security surface, no external APIs, no distributed assumptions, no auth logic. If any one of them is violated, the code goes to at least Track 2.

Not quite. Track 1 requires all four conditions simultaneously: no security surface, no external API calls, no distributed-system assumptions, no auth/authz logic. External APIs alone trigger Track 2.

12. Escalation in the Lesson 4 framework is described as:

Correct. Escalation is explicitly framed as a professional skill. Teams that normalize it have better security outcomes than teams where reviewers feel pressure to approve rather than escalate.

Not quite. Escalation is framed as a professional skill — a statement that more expertise is needed — not a last resort or an admission of failure. Normalizing it is the organizational goal.

13. The review log maintained for AI-assisted PRs is intended to enable:

Correct. The review log is a calibration tool — it shows where findings concentrate so review attention can be allocated accordingly. It is explicitly not a scoring mechanism.

Not quite. The review log's purpose is calibration: learning where findings occur so limited review effort goes there. It is not for scoring, compliance, or tool comparison.

14. What does Lesson 4 identify as the factor that determines whether a PR template disclosure field actually increases disclosure rates?

Correct. The template creates the pathway; the management response determines whether it's used. Disclosure met with aggressive review produces less disclosure regardless of what the template says.

Not quite. The template creates the pathway, but management response is the determining factor. If disclosure triggers disproportionate review, developers stop disclosing — regardless of whether the field is mandatory.

15. Which of the following best summarizes the core argument of Module 1?

Correct. Different failure modes requiring calibrated review — not a blanket judgment of AI code quality — is the precise argument of this module.

Not quite. The module's argument is more precise: AI code's failure modes are different from human failure modes, and a review process calibrated for human mistakes will systematically miss AI mistakes. The fix is calibration, not restriction.