Designing a Team AI Code Review Process
Learning Objectives
- Understand how the volume and characteristics of AI-generated code change the demands on a code review process.
- Identify the structural components of an effective AI code review workflow: stages, roles, and handoffs.
- Recognize the failure modes of traditional review processes when applied to AI-generated output at volume.
- Begin mapping an AI code review process appropriate to their own team's context and tooling.
Session Overview
Most engineering teams that have adopted AI coding assistants have not redesigned their code review process to match. They are using the same pull request workflow, the same reviewer assignment, and the same approval criteria they built for a world where humans wrote every line. The result is review fatigue, inconsistent depth, and a growing gap between the volume of code that ships and the scrutiny it receives.
This session establishes the design problem clearly: AI-assisted development changes the nature of code in review — more of it, more syntactically consistent, with characteristic failure patterns that differ from human mistakes. A review process designed for this environment is not simply "more review" — it is a different kind of review, organized around different questions, supported by different tooling, and distributed across roles with different responsibilities.
Key Teaching Points
- Volume is the first structural challenge. AI tools can generate hundreds of lines of code per minute. A team that previously reviewed 200 lines per day per developer may find itself reviewing 2,000. The bottleneck moves from "writing the code" to "reviewing the code," and a review process whose cost scales linearly with code volume will break quickly. The design question is which review activities can be automated, which can be parallelized, and which require focused human attention.
- AI code has characteristic patterns that warrant systematic, not just holistic, review. Human-written code tends to fail in idiosyncratic ways — a developer's blind spots, misunderstood requirements, copy-paste errors specific to their history. AI code fails in predictable categories: injection, missing validation, insecure defaults, hardcoded secrets. A review process that incorporates checklists for these categories catches more issues more consistently than a "read through it and see if anything feels wrong" approach.
- The author's role changes when the author is a human plus an AI. The developer who submits AI-generated code for review is responsible for that code — but their relationship to it is different from code they wrote themselves. They may not have read every line; they may not understand every function the AI generated. A good review process acknowledges this and expects the author to have done a first-pass review of the AI's output before submitting it, not just forwarded the AI's response.
- Review stages should be explicit and sequenced. Effective AI code review typically involves at least three distinct stages: automated checks that run on every commit (linters, SAST, secrets scanning), author self-review before submission, and peer review focused on logic, architecture, and contextual correctness. Each stage has a different purpose and a different appropriate depth. Conflating them produces shallow coverage at every stage.
- Reviewer assignment must account for AI-specific expertise. Not all reviewers approach AI-generated code with equal skepticism or equal familiarity with its failure modes. Teams should consider assigning reviewers who have training in AI code review patterns to high-risk code paths, rather than simply rotating by availability.
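The three-stage sequence described above can be sketched as a small ordering check. This is an illustrative model only: the `Stage`, `Submission`, `next_stage`, and `complete` names are hypothetical and do not correspond to any particular review tool.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Review stages in the order they are expected to run (hypothetical names)."""
    AUTOMATED_CHECKS = 1    # linters, SAST, secrets scanning
    AUTHOR_SELF_REVIEW = 2  # author reads the AI output before submitting
    PEER_REVIEW = 3         # logic, architecture, contextual correctness


@dataclass
class Submission:
    title: str
    completed: set = field(default_factory=set)


def next_stage(submission: Submission) -> Optional[Stage]:
    """Return the earliest stage not yet completed, or None when all are done."""
    for stage in Stage:  # Enum iteration follows definition order
        if stage not in submission.completed:
            return stage
    return None


def complete(submission: Submission, stage: Stage) -> None:
    """Mark a stage complete, refusing to skip an earlier stage."""
    expected = next_stage(submission)
    if stage is not expected:
        expected_name = expected.name if expected else "none"
        raise ValueError(
            f"cannot complete {stage.name}; next required stage is {expected_name}"
        )
    submission.completed.add(stage)
```

The point of the sketch is the explicit ordering: peer review cannot be recorded until automated checks and author self-review have been, which keeps each stage's purpose distinct.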
Discussion Prompts
- How has the volume of code your team reviews changed since adopting AI coding assistants? Has your review process changed to match, or are you using the same process at higher volume?
- When a developer submits AI-generated code for review, what do you expect them to have done before hitting "submit"? Is that expectation written down anywhere?
- Who in your team currently has the most experience recognizing AI code failure patterns? Is that expertise distributed or concentrated in one or two people?
- If you could change one thing about your current review process tomorrow to better handle AI-generated code, what would it be?
Start by asking participants to raise hands if their team's code review process has been explicitly updated to account for AI tool adoption. The typical response — very few hands, or none — immediately validates the premise of the course without requiring the instructor to argue for it.
The "author responsibility" point often generates the most discussion, because it surfaces a genuine organizational ambiguity: who is responsible for AI-generated code that passes review but causes a production incident? Make clear that the answer is the human developer who submitted it — but acknowledge that the review process needs to support that developer in taking meaningful ownership, not just pro-forma ownership.
This is a framing session, so resist the temptation to go deep on tooling or specific checklists — those are Sessions 3 and 4. Keep the discussion at the workflow design level: stages, roles, responsibilities, and volume.
Timing Guide
Transition to Session 2
Session 1 established the structure of the review process. Session 2 addresses the criteria: before a team can review consistently, it needs to agree on what "passing" looks like. Quality gates and acceptance criteria for AI-generated code are the subject of the next session.
Quality Gates and Acceptance Criteria
Learning Objectives
- Distinguish between binary quality gates (automated pass/fail) and graduated acceptance criteria (human judgment standards).
- Identify the dimensions along which AI-generated code should be evaluated beyond functional correctness.
- Design context-appropriate acceptance criteria for different code risk tiers within a single organization.
- Recognize how inconsistent acceptance criteria lead to review variability and compounding technical debt.
Session Overview
Code review without explicit acceptance criteria is essentially a measure of how confident each individual reviewer happens to feel on any given day. Different reviewers approve different things; the same reviewer approves different things depending on how busy they are. This problem existed before AI coding tools, but AI amplifies it significantly: when a reviewer is faced with 800 lines of clean-looking, syntactically consistent AI-generated code, the absence of clear criteria tends to resolve as "I don't see anything obviously wrong — approved."
Quality gates are the explicit, pre-defined standards that code must meet before it can proceed. Some are binary and automated — all tests pass, no SAST findings above a severity threshold, no secrets detected. Others are graduated and require judgment — is the logic correct for the business requirement? Is the performance characteristic appropriate for the load? Is the design maintainable by the next developer who touches it? This session covers how to define both types and how to communicate them consistently across a team.
Key Teaching Points
- Binary gates belong in automation, not in human review. Any criterion that can be evaluated mechanically (test coverage threshold, linting compliance, presence of known vulnerabilities, secrets detection) should be an automated gate that blocks merge before a human reviewer sees the code. Human attention is too valuable and too inconsistent to apply to questions automation can answer reliably. Every binary criterion that stays in human review is a criterion that will occasionally be skipped or overridden under time pressure.
- Human review criteria should be explicit and written down. "The code should be readable" is not a criterion — it is a feeling. "New functions should have names that describe their purpose without requiring a comment, and should not exceed 40 lines" is a criterion. The more concrete the written standard, the more consistently reviewers apply it and the more clearly authors know what they are aiming for when they submit AI output for review.
- Risk tiering is necessary for a sustainable process. Not all code carries the same risk. Authentication middleware, payment processing logic, and data encryption code warrant stricter acceptance criteria and deeper review than a utility function that formats a date string. Define at least two or three risk tiers with explicitly different review requirements, and create a simple rule for assigning code to a tier based on the functionality it touches.
- AI-specific acceptance criteria address AI-specific failure modes. Standard acceptance criteria were designed for human-written code and do not cover the characteristic failures of AI output. A complete acceptance framework for AI-generated code should add criteria specifically addressing the failure patterns this course covers: injection surfaces reviewed, validation logic verified against requirements, dependencies checked against current advisories, no credentials in code.
- Acceptance criteria must be revisited as the tooling evolves. AI coding tools improve, change behavior, and introduce new capabilities on a rapid cycle. Criteria defined for one version of a tool may be insufficient or unnecessarily strict for the next. Schedule regular reviews of acceptance criteria — quarterly is a reasonable cadence — and update them when the team's experience with AI output reveals new patterns.
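The "simple rule for assigning code to a tier" mentioned above could look something like the following sketch. The path patterns, tier names, and review requirements are all assumptions made for illustration; a real team would substitute its own directory layout and policies.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns, ordered from highest risk to lowest.
TIER_RULES = [
    ("high", ["*/auth/*", "*/payments/*", "*/crypto/*"]),
    ("medium", ["*/api/*", "*/models/*"]),
]
DEFAULT_TIER = "low"
TIER_ORDER = ["low", "medium", "high"]

# Hypothetical review requirements per tier.
REVIEW_REQUIREMENTS = {
    "high": {"min_reviewers": 2, "ai_checklist_required": True},
    "medium": {"min_reviewers": 1, "ai_checklist_required": True},
    "low": {"min_reviewers": 1, "ai_checklist_required": False},
}


def tier_for_path(path: str) -> str:
    """Return the first (highest-risk) tier whose pattern matches the path."""
    for tier, patterns in TIER_RULES:
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            return tier
    return DEFAULT_TIER


def tier_for_change(paths: list) -> str:
    """A change takes the highest tier of any file it touches."""
    return max(
        (tier_for_path(p) for p in paths),
        key=TIER_ORDER.index,
        default=DEFAULT_TIER,
    )
```

For example, `tier_for_change(["src/utils/dates.py", "src/auth/middleware.py"])` resolves to the high tier because one touched file sits under an `auth` directory. Taking the maximum across files is a deliberate choice in this sketch: a mostly low-risk change that also touches authentication code is reviewed as high-risk.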
Discussion Prompts
- Does your team have written acceptance criteria for code review today? If so, how often are they consulted during a review? If not, how do you think reviewers are currently making approval decisions?
- How would you determine what risk tier to assign to a given piece of code? What information would you need, and who would make that determination?
- If you introduced explicit acceptance criteria for AI-generated code tomorrow, how would your current team receive them — as useful clarity or as bureaucratic overhead? What would make the difference?
- Think of a recent review where you approved something and later regretted it, or rejected something and felt it was too aggressive. What criterion, if it had existed, would have guided the decision correctly?
The binary-vs-graduated distinction is useful framing that most participants will not have articulated before. Spend time on the concept that human review should be reserved for questions that require judgment; anything with a deterministic answer belongs in automation. This frames the tooling sessions that follow not as optional extras but as essential components of a correctly designed review process.
Risk tiering tends to generate practical discussion because participants immediately start debating which of their code falls into which tier. This is a productive conversation — do not cut it off early, but do keep it grounded in the principle (what makes code high-risk?) rather than getting too deep into organizational specifics.
Participants often ask for a template or example acceptance criteria document. If you have one from a relevant domain, sharing it is useful. If not, suggest that a good starting point is listing the five things the team most commonly catches in review and turning each into an explicit criterion.
Timing Guide
Transition to Session 3
Session 2 defined what passing looks like. Session 3 covers the automated layer that enforces the binary criteria before code ever reaches a human reviewer — linting, static analysis, and the configurations that make these tools effective specifically for AI-generated code.
Automated Linting and Static Analysis
Learning Objectives
- Identify the categories of automated analysis tools relevant to AI-generated code review: linters, SAST, secrets scanners, and SCA.
- Understand how to configure and tune static analysis tools to reduce false positives while catching AI-characteristic failure patterns.
- Design a CI pipeline stage that runs automated checks in the correct sequence and blocks on the right failure conditions.
- Recognize the limitations of automated analysis and articulate what it cannot catch.
Session Overview
Automated analysis is the first line of defense in an AI code review workflow — and the most consistent one. Human reviewers have variable attention, variable expertise, and variable time. Automated tools apply the same rules to every commit, every time, without fatigue. The goal of this session is to help participants understand not just which tools exist, but how to configure and integrate them as a coherent quality layer rather than a collection of disconnected scripts that occasionally produce noise.
The specific challenge with AI-generated code is that off-the-shelf tool configurations were designed for typical human code patterns. AI code is syntactically cleaner (fewer style violations) but semantically riskier in specific ways (more injection surfaces, more hardcoded secrets, more insecure defaults). Effective tooling for AI code review requires tuning: turning up sensitivity on the patterns AI produces and turning down noise on the style issues that are largely irrelevant.
Key Teaching Points
- Linters address style, complexity, and obvious anti-patterns. Language linters (ESLint, Pylint, RuboCop, golangci-lint) enforce code style, flag overly complex functions, identify unused variables, and catch obvious API misuse. For AI code specifically, complexity metrics are valuable — AI will sometimes generate deeply nested logic or very long functions that pass style checks but are difficult to reason about. Configure complexity thresholds appropriate to your codebase and enforce them automatically.
- SAST tools catch security-relevant code patterns. Static Application Security Testing tools (Semgrep, CodeQL, Bandit, Brakeman) analyze code for patterns associated with known vulnerability classes. For AI code, configure rules that specifically target injection patterns, insecure function calls, and cryptographic misuse. Semgrep is particularly useful because its rule language allows teams to write custom rules targeting AI-specific patterns they have observed in their own codebase.
- Secrets scanning must cover the full commit history, not just the current state. Secrets scanners (truffleHog, gitleaks, detect-secrets) find credentials, API keys, and tokens in code. For AI-generated code, this is a critical control because AI tools often produce hardcoded credentials when generating code that connects to external services. Run secrets scanning on every push, and configure it to scan every commit in the push rather than just the current file state: a secret that was added and then removed in a subsequent commit is still present in git history. The existing history of the repository should be scanned as well.
- Software Composition Analysis covers the dependency surface. SCA tools (Grype, Snyk, OWASP Dependency-Check, Dependabot) compare installed package versions against vulnerability databases. Because AI tools recommend packages based on training data that has a fixed cutoff date, recently published CVEs will not be reflected in AI recommendations. SCA that runs on every dependency change catches the gap between what the AI knew and what the current vulnerability landscape looks like.
- Pipeline design determines whether tools are enforced or merely advisory. A tool that runs and reports findings but does not block the merge is an advisory tool, not a gate. For binary quality criteria, tools must be configured to fail the CI build and block merge on relevant findings. Define clearly which tool categories are blocking (SAST critical findings, secrets, known-vulnerable dependencies) and which are informational (style warnings, low-severity linting), and enforce that distinction consistently.
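The advisory-versus-gate distinction in the last point can be made concrete with a small classification step at the end of a CI job. The sketch below assumes findings have already been normalized into a common shape; the `Finding` fields and the category and severity strings are hypothetical and are not the output format of any specific tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    tool_category: str  # assumed values: "lint", "sast", "secrets", "sca"
    severity: str       # assumed values: "low", "medium", "high", "critical"
    message: str


# Categories where any finding blocks the merge, per the policy described above.
ALWAYS_BLOCKING = {"secrets", "sca"}


def is_blocking(finding: Finding) -> bool:
    """Secrets and vulnerable dependencies always block; SAST blocks on critical."""
    if finding.tool_category in ALWAYS_BLOCKING:
        return True
    if finding.tool_category == "sast":
        return finding.severity == "critical"
    return False  # style warnings and other linting stay informational


def evaluate_gate(findings: list) -> tuple:
    """Return (exit_code, blocking_findings, advisory_findings)."""
    blocking = [f for f in findings if is_blocking(f)]
    advisory = [f for f in findings if not is_blocking(f)]
    exit_code = 1 if blocking else 0
    return exit_code, blocking, advisory
```

In a pipeline, the job would print both lists and then exit with `exit_code`, so that a nonzero status fails the build only when a blocking finding is present.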
Discussion Prompts
- Which of the tool categories discussed — linting, SAST, secrets scanning, SCA — does your team currently run, and which ones block merge on failure versus just report? What is the gap?
- If your SAST tool currently has a high false positive rate and developers are ignoring or suppressing findings, what would you need to change to make it effective? Where does that work fall?
- Secrets scanning requires access to git history for completeness. What would you find if you ran a historical secrets scan on your repository today? Is that a conversation your organization has had?
- Writing custom Semgrep rules for AI-specific patterns your team has observed — who would do that work, and where would it fit in your team's current workload?
The most common pain point participants raise with static analysis is false positives — tools that fire constantly on acceptable code patterns until developers start ignoring them entirely. Validate this experience and address it directly: the solution is not to tolerate noise or turn the tools off, it is to invest in configuration. Semgrep in particular can be tuned very precisely, and the investment in custom rules pays off quickly if the team is seeing recurring AI-generated patterns.
The pipeline design discussion often reveals that teams have tools configured but not enforcing. The advisory-vs-gate distinction is a useful intervention: ask participants to think about whether a finding from each of their current tools would actually stop a bad merge, or just produce a report that gets filed and forgotten. That question often motivates a concrete action item.
Keep tool-specific technical depth moderate — the goal is that participants understand what each category does and can make informed decisions, not that they can configure each tool from scratch in this session. Reference documentation and community rule sets as starting points rather than building the impression that this requires deep expertise to begin.
Timing Guide
Transition to Session 4
Session 3 covered the tools that generate data about code quality. Session 4 addresses how to use that data: measuring review quality, tracking patterns in what the tools find, and turning aggregate findings into insights that improve the development process over time.
Code Review Metrics and Tracking
Learning Objectives
- Identify the metrics that meaningfully indicate review process health versus the vanity metrics that feel useful but are not.
- Design a tracking approach for AI-generated code issues that reveals patterns rather than just recording incidents.
- Understand how to use review data to create feedback loops between the review process and developer behavior.
- Recognize the leading indicators of review process degradation before it produces production incidents.
Session Overview
Code review generates a stream of data that most teams discard. Comments made and resolved, findings from static analysis, approval patterns, time from submission to merge — all of these are signals about how the review process is functioning and whether AI-generated code quality is improving, degrading, or holding steady. Teams that measure systematically can identify emerging patterns and intervene proactively. Teams that do not measure are responding to fires after they are lit.
For AI-generated code in particular, patterns matter more than individual incidents. A single injection vulnerability found in review is a fix. Five injection vulnerabilities found in code generated when a specific developer uses a specific prompt pattern are a training opportunity and a process intervention. Getting from incidents to patterns requires deliberate tracking and a willingness to look at aggregate data rather than treating each case in isolation.
Key Teaching Points
- Time-based metrics are useful but easily gamed. Mean time from PR submission to first review, mean time to approval, and review throughput per reviewer are visible and easy to calculate. They are also easy to game — an approval that takes 30 seconds is faster than one that takes 30 minutes, but not better. Use time metrics as a sanity check on overall process health, not as a measure of review quality. A review process that is very fast and producing production incidents is not a healthy process.
- Issue density is a leading indicator of process stress. Track the number of substantive review comments per pull request over time, broken down by issue type. A rising issue density in AI-generated code suggests that the AI output quality or the author's self-review is degrading; either warrants investigation. A falling issue density could mean the code is improving, or it could mean reviewers have stopped looking as carefully. Context from other metrics, such as escape rate, determines which interpretation is correct.
- Categorize findings by origin to build an AI-specific dataset. If your team tags review comments or SAST findings by category (security, logic, style, performance) and records whether the code was AI-generated or human-written, you can eventually analyze which categories are most common in each. This data is valuable for configuring tools, writing checklists, and prioritizing reviewer training. It requires only a lightweight taxonomy applied consistently; it does not need to be elaborate.
- Track escape rate: what gets through review and causes production issues. The most important metric is the rate at which issues that should have been caught in review escape to production. For AI-generated code, track whether escaped issues follow a pattern — the same category, the same developer, the same type of prompt. Escape rate analysis turns individual incidents into process information and is the most direct measure of whether the review process is achieving its purpose.
- Reviewer calibration data reveals inconsistency. If the same code is reviewed by different reviewers and the outcomes vary significantly — one approves quickly, another raises five substantive comments — the team has a calibration problem. This is detectable if approval and comment data is maintained. Regular calibration sessions, where the team reviews the same sample code together and compares notes, are the operational intervention for this problem.
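A minimal sketch of how the metrics above might be computed from tagged findings follows. The record fields are hypothetical, and the escape rate definition used here (escaped findings divided by all findings, caught plus escaped) is one reasonable assumption rather than a standard formula.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewFinding:
    pr_id: int
    category: str       # lightweight taxonomy: security, logic, style, performance
    ai_generated: bool  # origin tag for the code the finding applies to
    escaped: bool       # True if discovered in production rather than in review


def issue_density(findings: list, pr_count: int) -> float:
    """Findings caught in review per pull request."""
    if pr_count == 0:
        return 0.0
    caught = sum(1 for f in findings if not f.escaped)
    return caught / pr_count


def escape_rate(findings: list) -> float:
    """Share of all known findings that reached production (assumed definition)."""
    if not findings:
        return 0.0
    return sum(1 for f in findings if f.escaped) / len(findings)


def categories_by_origin(findings: list) -> dict:
    """Count finding categories separately for AI-generated and human-written code."""
    counts = {True: Counter(), False: Counter()}
    for f in findings:
        counts[f.ai_generated][f.category] += 1
    return counts
```

Grouping the same records by developer or by prompt pattern would follow the same shape as `categories_by_origin`, which is how individual incidents start to show up as patterns.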
Discussion Prompts
- What data does your team currently collect from code review? How is it used, and who looks at it? If nobody is looking at it systematically, what would need to change for that to happen?
- Have you noticed recurring patterns in the issues found in AI-generated code on your team? If so, have those patterns driven any changes to the development or review process?
- How would you build a lightweight tagging system for review findings without creating so much process overhead that reviewers stop using it?
- If you discovered that a specific developer's AI-generated code consistently had a higher escape rate than others, how would you handle that conversation? What would the process look like?
The escape rate concept tends to be the most resonant part of this session. It reframes the purpose of review metrics from "how fast is our process?" to "is our process actually doing its job?" Most teams that have not been tracking escape rate are initially uncertain what they would find — and that uncertainty is itself informative. It suggests the team does not have a clear picture of review effectiveness.
Be cautious about the calibration discussion in teams where the culture does not support this kind of peer comparison. Frame calibration exercises as team learning rather than individual performance evaluation — the goal is to understand where the standard is inconsistently applied, not to identify "bad" reviewers. In some organizational contexts, this distinction needs to be made explicitly and repeatedly.
Emphasize throughout that the goal of metrics is to improve the process, not to create accountability theater. If tracking makes people anxious or drives gaming behavior, the framing of the metrics program needs to be revisited. Metrics should surface information that helps the team make better decisions.
Timing Guide
Transition to Session 5
Sessions 1 through 4 have covered the structural design of a review process: workflow, criteria, tooling, and measurement. Session 5 turns to the human element — how do you train the people doing the reviewing to approach AI-generated code with appropriate skepticism and appropriate speed?
Onboarding Reviewers to AI Code
Learning Objectives
- Understand why reviewing AI-generated code requires a different mental model than reviewing human-written code.
- Identify the cognitive biases that make AI-generated code appear more trustworthy than it is, and how to counteract them.
- Design an onboarding program that develops AI code review skills progressively rather than assuming reviewers already have them.
- Establish practices that sustain reviewer engagement and skepticism over time, not just in the first week.
Session Overview
Reviewing AI-generated code is a skill that most developers do not have yet and have not been asked to develop. The common assumption — "you already know how to review code, just apply that to AI output" — underestimates the extent to which the AI context changes the review task. AI code is stylistically consistent, free of the personal idiosyncrasies that signal where a human developer cut corners, and often accompanied by implicit pressure to trust it because it came from a sophisticated tool. Reviewers who do not account for these dynamics will be less effective than they would be reviewing equivalent human-written code.
A structured onboarding program for AI code review is not a long course — it can be built as a series of short practical exercises. The key is to build two things: knowledge of AI-specific failure patterns (what to look for) and a calibrated skepticism response (how to maintain appropriate doubt in the face of clean-looking code). Both require practice, not just information.
Key Teaching Points
- The "automation bias" problem is real and has a specific AI variant. Automation bias is the tendency to over-rely on automated or algorithmic outputs. Reviewers who know code came from an AI tool often subconsciously apply less scrutiny, on the assumption that the tool must have gotten it right. Making this bias explicit in onboarding — naming it, giving it a handle, showing examples of how it causes missed issues — is one of the most effective interventions a training program can provide.
- Train reviewers on AI failure patterns before they encounter them in review. A reviewer who has never seen a hardcoded API key in AI output will not be looking for one during review. Exposure training — reviewing samples of annotated AI-generated code that contains known vulnerabilities — is the most efficient way to build the pattern recognition needed for effective review. Use real AI-generated examples, not synthetic ones constructed for training purposes.
- Pair reviewing is a powerful onboarding technique. Pairing a new reviewer with an experienced one and having them review the same code independently, then compare notes, reveals calibration gaps and teaches the reasoning behind experienced reviewers' decisions in a way that guidelines documents cannot. Make pair reviewing a standard part of the first month for anyone who will be reviewing AI code.
- Speed calibration matters as much as depth calibration. New reviewers tend to fall into one of two failure modes: reviewing so cautiously that they become bottlenecks, or reviewing so quickly that they miss substance. Help reviewers develop a sense for which parts of AI output warrant close reading (auth logic, data handling, external calls, crypto) and which parts can be assessed more quickly (utility functions, UI rendering, test assertions). Risk-based reading is a skill that can be taught.
- Ongoing practice is required to maintain skills over time. The initial onboarding builds skills; ongoing practice maintains them. Regular team exercises — monthly annotated code review sessions, retrospectives on escaped issues, calibration sessions — keep AI code review skills sharp and update them as AI tool behavior evolves. Onboarding is not a one-time event for a skill that needs to grow with the technology.
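The "review independently, then compare notes" step in pair reviewing can be supported by a simple comparison of the two sets of findings. The sketch below assumes each finding is recorded as a hypothetical `(file, line, category)` tuple and uses set overlap (Jaccard similarity) as a rough agreement score; neither choice comes from a specific tool.

```python
from typing import NamedTuple


class PairComparison(NamedTuple):
    shared: set                  # findings both reviewers raised
    only_new_reviewer: set       # raised only by the new reviewer
    only_experienced: set        # raised only by the experienced reviewer
    agreement: float             # |shared| / |all findings|, 1.0 if both found nothing


def compare_reviews(new_reviewer: set, experienced: set) -> PairComparison:
    """Compare two independent reviews of the same code."""
    shared = new_reviewer & experienced
    union = new_reviewer | experienced
    agreement = len(shared) / len(union) if union else 1.0
    return PairComparison(
        shared=shared,
        only_new_reviewer=new_reviewer - experienced,
        only_experienced=experienced - new_reviewer,
        agreement=agreement,
    )
```

The `only_experienced` set is usually the most useful output for onboarding: it lists exactly what the new reviewer did not flag, which gives the follow-up conversation a concrete agenda.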
Discussion Prompts
- Have you experienced automation bias in your own code review — approving something from an AI tool more readily than you would have approved the same thing written by a human? What was that situation like?
- If you were designing a two-hour onboarding session for a new reviewer who will be reviewing AI-generated code, what would you include and in what order?
- What is the most valuable thing an experienced reviewer on your team could teach a new reviewer about AI code — and is there a way to systematize that transfer?
- How do you balance the need for reviewers to maintain healthy skepticism with the organizational pressure to review and merge AI-generated code quickly? Where does that tension get resolved in your team?
Naming automation bias directly and explicitly is usually a moment of recognition for participants, who have experienced it without having a name for it. Having a named cognitive bias makes it actionable, because once you know the pattern you can build a mental check for it. The practical intervention is something like a deliberate pause before approving AI code: "Am I approving this because I have verified it, or because it came from an AI and I assumed that was enough?"
Pair reviewing is underutilized in most software teams and tends to generate enthusiasm when introduced as a concept, even among experienced reviewers who assume they don't need it. Frame it not as remediation for weak reviewers but as a calibration tool for everyone — even experts benefit from comparing notes with another expert, especially as AI tools change.
The speed calibration point is worth spending extra time on for teams that are under strong delivery pressure. The practical question "which parts of AI output warrant slow reading?" gives reviewers permission to be faster on lower-risk sections while being appropriately thorough on high-risk ones — it is a more workable framing than "review everything with the same depth."
Timing Guide
Transition to Session 6
Session 5 focused on how reviewers develop and maintain skills. Session 6 addresses one of the most practically difficult situations in AI code review: what happens when a reviewer's judgment conflicts with what the AI generated, and neither party is obviously wrong. Navigating that conflict well is a skill in itself.
When AI and Human Disagree
Learning Objectives
- Recognize the categories of AI-reviewer conflict: stylistic, correctness, approach, and contextual disagreements.
- Understand when reviewer preference should yield to AI output and when it should override it.
- Apply a resolution framework that produces consistent, defensible decisions without creating conflict escalation patterns.
- Distinguish between a reviewer overriding AI output for good reasons and a reviewer blocking AI adoption out of preference for familiar patterns.
Session Overview
When a reviewer and a human developer disagree about a piece of code, there are established social and technical processes for resolution. When the disagreement is between a reviewer and AI-generated code, the dynamics are different and often poorly navigated. The developer who submitted the code may not have strong ownership of the AI's implementation choices. The reviewer may be uncertain whether their objection reflects a real quality concern or an unfamiliarity with how AI approaches a problem. And both parties may invoke "the AI said so" or "the AI is wrong" without applying rigorous reasoning to the question of which is true.
This session provides a framework for resolving AI-reviewer conflicts systematically — one that keeps the focus on code quality rather than on the source of the code, and that produces consistent decisions that can be documented and learned from.
Key Teaching Points
- Categorize the disagreement before resolving it. A stylistic disagreement (the reviewer prefers a different naming convention) is different from a correctness disagreement (the reviewer believes the logic is wrong for the requirement), which in turn is different from a contextual disagreement (the AI didn't know about a domain constraint the reviewer knows). Each category has a different appropriate resolution process. Conflating them produces either unnecessary conflict over style or insufficient rigor on correctness.
- Stylistic disagreements should defer to documented standards, not individual preference. If the team has documented coding standards, the question "does this match our standards?" has a factual answer. If the AI's output follows the documented standard and the reviewer's preference is different, the standard wins. If neither the AI nor the reviewer is following a documented standard, the resolution process should create one rather than relitigating the same argument on every pull request.
- Correctness disagreements require demonstrating the flaw, not asserting it. When a reviewer believes AI-generated code is logically incorrect, the burden is on the reviewer to show why: a specific input that produces the wrong output, a requirement that the code fails to satisfy, or a test case that reveals the error. "I don't think this is right" is not a review comment; "this function returns null when passed an empty list, but the requirement specifies it should return an empty collection" is. Requiring reviewers to demonstrate correctness issues raises the quality of the review and reduces unfounded rejections.
- Contextual disagreements are the most common and the most valuable. The AI that generated the code had access to what was in the prompt; the reviewer has access to the broader system, the organization's constraints, the production environment, and the history of why certain decisions were made. When a reviewer flags a contextual concern — "this approach doesn't account for how we handle authentication in legacy systems" — that information is genuinely valuable and should be incorporated, regardless of what the AI generated. This is the clearest case for reviewer override.
- Document resolutions so the same conflicts do not recur. Every significant AI-reviewer conflict that requires a decision should be documented: what the disagreement was, what was decided, and why. This documentation feeds the team knowledge base (Session 7's topic), prevents the same conflict from recurring with different people, and creates a record that can be used to identify patterns in what the AI consistently gets wrong in the team's specific context.
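A correctness objection framed as a demonstration might look like the sketch below. The function `active_user_names` and its requirement are hypothetical, invented here to mirror the empty-list example in the teaching points. The point is only that the reviewer attaches one concrete input with expected and actual output instead of an opinion.

```python
from typing import Optional


def active_user_names(users: list[dict]) -> Optional[list[str]]:
    """Hypothetical AI-generated function under review."""
    if not users:
        # Flaw: the (assumed) requirement says an empty input yields [].
        return None
    return [u["name"] for u in users if u.get("active")]


def demonstrate_flaw() -> str:
    """Reviewer's evidence: one input, the expected output, the actual output."""
    actual = active_user_names([])
    expected: list[str] = []
    if actual == expected:
        return "no flaw demonstrated"
    return f"input=[] expected={expected!r} actual={actual!r}"
```

Calling `demonstrate_flaw()` returns `input=[] expected=[] actual=None`, which can be pasted into the review comment or turned into a failing test that the fix must pass.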
Discussion Prompts
- Can you think of a time when a reviewer's objection to AI-generated code turned out to be a style preference rather than a correctness issue? How was that resolved, and how could the resolution have been cleaner?
- What is your current process for resolving review disagreements that cannot be settled between the author and the reviewer? Does it work well, or does it tend to either escalate unnecessarily or let too many things through?
- How do you distinguish between a reviewer who is applying rigorous judgment to AI output and a reviewer who is functionally resistant to AI-generated code on principle? Does that distinction matter for how you handle the conflict?
- If a reviewer overrides AI-generated code and the override turns out to be wrong — producing a regression or a new bug — what is the process for handling that, and does it differ from how you handle other bugs?
The categorization framework is the core practical tool of this session — spend real time making each category concrete with examples. Participants often realize mid-discussion that conflicts they have experienced were actually stylistic (masquerading as correctness) or contextual (which they were right to raise but did not have clear language for). Naming the categories is itself useful.
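One way to make the categories concrete in a workshop is to write them down as a small lookup, as in the sketch below. It covers the three categories the teaching points define; the names `Disagreement`, `RESOLUTION`, and `next_step` are illustrative assumptions, and the resolution strings paraphrase this session's guidance, not a formal policy.

```python
from enum import Enum


class Disagreement(Enum):
    STYLISTIC = "stylistic"
    CORRECTNESS = "correctness"
    CONTEXTUAL = "contextual"


# Default resolution step per category, paraphrased from the teaching points.
RESOLUTION = {
    Disagreement.STYLISTIC: "Check the documented standard; the standard wins.",
    Disagreement.CORRECTNESS: "Reviewer demonstrates the flaw with an input, requirement, or test.",
    Disagreement.CONTEXTUAL: "Incorporate the reviewer's context; reviewer override applies.",
}


def next_step(category: Disagreement, standard_exists: bool = True) -> str:
    """Return the resolution step for a categorized disagreement."""
    if category is Disagreement.STYLISTIC and not standard_exists:
        # No standard to defer to: create one so the argument is settled once.
        return "Create a documented standard instead of re-arguing per pull request."
    return RESOLUTION[category]
```

Asking participants to classify a past conflict before looking up its resolution step tends to surface the stylistic-masquerading-as-correctness cases described above.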
The question of resistance to AI code on principle is delicate and worth addressing directly rather than avoiding. Some developers have principled objections to AI-assisted development; some have aesthetic preferences for familiar patterns; some have legitimate quality concerns about specific AI outputs. A team needs the ability to distinguish between these — resistance grounded in quality reasoning should be respected and documented; resistance grounded in preference should be addressed differently. Avoid either dismissing reviewer concerns or treating all reviewer objections as presumptively valid.
Emphasize that "document resolutions" is not bureaucratic overhead — it is the input to the knowledge base that Session 7 covers. Each documented resolution is a piece of institutional knowledge about how the team's context differs from what AI knows. Those documents have high future value.
Timing Guide
Transition to Session 7
Session 6 established that resolved conflicts should be documented. Session 7 turns documentation into a living system — a team knowledge base that captures patterns, anti-patterns, and decisions and makes that institutional knowledge available to every reviewer and every developer who uses AI tools going forward.
Building a Review Knowledge Base
Learning Objectives
- Understand what should and should not be captured in a team AI code review knowledge base.
- Design a knowledge base structure that is useful to reviewers during active review, not just as a reference document people read once.
- Identify the workflow hooks that feed new knowledge into the base without creating a dedicated documentation burden.
- Recognize the lifecycle of knowledge base content and how to keep it current as AI tool behavior evolves.
Session Overview
Every team that has been using AI coding assistants for more than a few months has accumulated informal knowledge: the patterns the AI gets wrong in their specific codebase, the prompts that produce better or worse results, the contextual constraints the AI does not know about, the decisions that were made and why. This knowledge lives in the heads of individual developers and in the resolved threads of pull request conversations. When those developers leave or move to other teams, the knowledge leaves with them.
A team review knowledge base is the deliberate capture of that institutional knowledge in a form that persists, is searchable, and is usable by every reviewer regardless of their tenure. Building it is not a large up-front project — it is a practice of capturing decisions at the moment they are made, using the pull request workflow that already exists as the primary input channel.
Key Teaching Points
- The knowledge base has three primary content types. Anti-patterns are code constructs the AI produces that the team has decided are unacceptable, with an explanation of why and the preferred alternative. Contextual constraints are facts about the codebase, domain, or operational environment that the AI cannot know and that regularly affect review decisions. Decision records are documented resolutions to significant AI-reviewer conflicts — what was decided, by whom, and on what reasoning. Each type serves a different purpose and benefits from different formatting.
- Pull request review threads are the primary raw material. When a reviewer catches an AI-generated pattern and explains why it is wrong, that explanation is already written — in the PR comment. The operational challenge is moving that explanation from the PR thread (where it is buried and will eventually be archived) to the knowledge base (where it is findable and organized). Establish a lightweight tagging convention — a label, a keyword, a Slack reaction — that signals "this comment should be added to the knowledge base" without requiring the reviewer to do the capture work in the moment.
- Organize for findability during review, not for completeness. A knowledge base that is comprehensive but organized alphabetically or chronologically is not useful during an active code review. Reviewers need to find "is there anything we know about how AI handles database transactions in our context?" quickly, not browse a comprehensive reference. Organize content by code domain (authentication, data access, external APIs), by vulnerability category, and by the AI tool or prompt type that tends to generate the relevant pattern.
- The knowledge base feeds back into tooling and prompts. Anti-patterns that appear frequently enough to document should be candidates for custom linting rules or SAST configurations — automated detection is more reliable than expecting reviewers to catch the same thing repeatedly. Similarly, documented contextual constraints can be incorporated into prompt templates the team uses with AI tools, reducing the frequency with which the AI generates code that ignores those constraints.
- Assign maintenance ownership and schedule reviews. Knowledge bases decay without maintenance. Facts become stale; updated AI models stop producing some documented anti-patterns; new patterns emerge that are not yet captured. Assign clear ownership for maintenance: not a single person, but a rotation or a team norm. Schedule quarterly reviews of knowledge base content to identify what is outdated, what is missing, and what has been superseded by changes in team practice or AI tool behavior.
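The content types, the tagging convention, and domain-based lookup described above can be sketched as a small data model. Everything here is an assumption for illustration: the `#kb` keyword, the field names, and the domain labels are hypothetical, and a real team would likely store entries in a wiki or repository, not in memory.

```python
from dataclasses import dataclass, field
from enum import Enum


class EntryType(Enum):
    ANTI_PATTERN = "anti-pattern"
    CONTEXTUAL_CONSTRAINT = "contextual-constraint"
    DECISION_RECORD = "decision-record"


@dataclass
class Entry:
    entry_type: EntryType
    title: str
    explanation: str
    domains: set[str] = field(default_factory=set)  # e.g. {"data-access"}


KB_TAG = "#kb"  # hypothetical keyword a reviewer adds to a PR comment


def capture_tagged(comments: list[str]) -> list[str]:
    """Return PR comments flagged for the knowledge base, with the tag removed."""
    return [c.replace(KB_TAG, "").strip() for c in comments if KB_TAG in c]


def find_by_domain(entries: list[Entry], domain: str) -> list[Entry]:
    """Lookup used during review: everything known about one code domain."""
    return [e for e in entries if domain in e.domains]
```

Tagging entries with a set of domains, not a single category, lets one entry surface under both "authentication" and "external-apis" without duplication.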
Discussion Prompts
- Where does your team's institutional knowledge about AI code patterns currently live? If a new developer joined tomorrow, how would they access it?
- Think about the last three substantive review comments you made on AI-generated code. Would any of them have been more useful as a persistent knowledge base entry than as a one-time PR comment? What made them significant enough to document?
- Who in your team would be the right person to own the initial setup of a review knowledge base? What authority and time would they need to make it work?
- How do you keep a knowledge base current as AI tools change — specifically, as newer AI models stop making mistakes that older ones made regularly? What is the lifecycle of a documented anti-pattern?
Emphasize the pull request thread as raw material — this reframes the documentation work from "something extra we need to do" to "something we have already done that we just need to capture more efficiently." Most of the knowledge already exists; the practice is about moving it from an ephemeral location to a persistent one, and that is a much smaller lift than starting from scratch.
The feedback loop to tooling is worth spending extra time on if the group is technical. The path from "we keep catching this AI pattern manually" to "we now have a Semgrep rule that catches it automatically" is a concrete improvement in the review process that most teams can implement. A knowledge base that feeds back into automation is more valuable than one that remains purely documentary.
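For a technical group, the step from "caught manually" to "caught automatically" can be illustrated with a toy check. Semgrep rules themselves are written in YAML; the sketch below is not a Semgrep rule but a stand-in built on Python's standard `ast` module. The secret-like name hints in `SUSPECT_NAMES` are an assumption, and a real rule would need tuning to avoid false positives.

```python
import ast

# Assumed naming hints for secret-like variables; adjust per codebase.
SUSPECT_NAMES = ("password", "secret", "api_key", "token")


def find_hardcoded_secrets(source: str) -> list[tuple[int, str]]:
    """Return (line, variable) for string literals assigned to secret-like names."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        value = node.value
        # Only flag literal strings; calls such as os.getenv(...) are ignored.
        if not (isinstance(value, ast.Constant) and isinstance(value.value, str)):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and any(
                hint in target.id.lower() for hint in SUSPECT_NAMES
            ):
                findings.append((node.lineno, target.id))
    return sorted(findings)
```

A check like this would run in CI against changed files, so reviewers stop spending attention on a pattern the team has already documented.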
Be realistic about maintenance. Knowledge bases that are set up enthusiastically and then allowed to decay can be worse than no knowledge base — they create the impression that the team's knowledge is captured when it is actually outdated. Set expectations for maintenance from the beginning, with realistic time estimates and ownership norms.
Timing Guide
Transition to Session 8
Sessions 1 through 7 have built a complete review practice for a single team. Session 8 addresses the expansion challenge: how do the processes, criteria, tooling, and knowledge structures we have designed propagate effectively across multiple teams in a larger organization, without requiring each team to rebuild everything from scratch?
Scaling Audit Practices Across Teams
Learning Objectives
- Understand the organizational dynamics that affect how AI code review practices spread — and fail to spread — across teams.
- Identify the components of a review practice that should be standardized centrally versus adapted locally by each team.
- Design a center-of-excellence or community-of-practice model that supports review quality without creating a coordination bottleneck.
- Recognize the inflection points in organizational AI adoption at which review standards need to be deliberately re-examined and updated.
Session Overview
What works for one team rarely spreads to ten teams without deliberate design. The practices, tooling configurations, acceptance criteria, and knowledge bases developed by an early-adopter team are genuinely valuable — but they are embedded in that team's context. Other teams face different codebases, different technology stacks, different regulatory environments, and different cultures of review. A scaling strategy that mandates exact replication typically fails; one that identifies the transferable principles and provides tools for local adaptation has a better chance of producing consistent quality across a growing organization.
This session focuses on the governance, structure, and communication mechanisms that allow AI code review practices to scale. It addresses the roles, forums, and artifacts through which a team's hard-won knowledge about reviewing AI code can be made available to every team in the organization — without requiring a centralized bureaucracy to maintain it.
Key Teaching Points
- Distinguish mandated standards from recommended practices. Some review requirements are non-negotiable across the organization: secrets must be scanned, critical SAST findings must be addressed, certain security-critical code paths require dedicated security review. Others are recommended but locally adaptable: the specific linting rules, the structure of the team knowledge base, the reviewer rotation policy. Being explicit about which is which prevents mandated standards from being ignored as bureaucracy and prevents recommended practices from being treated as optional extras by teams that need them most.
- A center of excellence model provides expertise without ownership. A small, cross-organizational team with deep expertise in AI code review can serve as a resource and a standard-setter without becoming a bottleneck. This team develops and maintains the organizational standard for review tooling, produces shared training materials, maintains a cross-team knowledge base of AI patterns observed across the organization, and serves as an escalation point for difficult review decisions. Critically, it does not own review for any specific team; that remains with the team itself.
- Communities of practice spread knowledge laterally. A regular forum — bi-weekly or monthly — where AI code reviewers from different teams share patterns they have observed, tools they have found effective, and approaches to common challenges is a lightweight but high-value structure. These forums do not require a centralized team to run; they require a facilitator, a communication channel, and a norm of contribution. The knowledge base from Session 7 scales naturally if multiple teams contribute to a shared instance.
- Tooling standardization is the highest-leverage scaling mechanism. If every team is running the same core set of automated analysis tools with a common baseline configuration, review quality across the organization rises to that baseline without requiring any team-by-team enforcement. A shared CI pipeline template that includes SAST, secrets scanning, and SCA with appropriate configurations is more effective than a policy document describing what teams should do. Configuration as code, version-controlled and distributed through an internal developer platform, is the practical implementation.
- AI adoption inflection points require deliberate review standard updates. Review standards designed for a team at 20% AI code generation will not be adequate for a team at 70%. As the proportion of AI-generated code in the codebase grows, the pressure on review processes intensifies and the stakes of review failures rise. Identify in advance the adoption thresholds that trigger a re-examination of review standards (for example, when AI generates more than half of new code for any team) and treat those thresholds as scheduled governance events, not surprises.
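The split between mandated and recommended settings, and the adoption threshold, can both be expressed as configuration logic. The sketch below assumes hypothetical setting names (`secrets_scan`, `lint_ruleset`, and so on) and a 50% threshold taken from the example in the last teaching point. It is one possible shape for a shared template, not a reference to any specific CI product.

```python
# Hypothetical organization-wide settings; names are illustrative only.
MANDATED = {"secrets_scan": True, "sast_block_on_critical": True}
RECOMMENDED = {"lint_ruleset": "org-default", "reviewer_rotation_days": 14}


def effective_config(team_overrides: dict) -> dict:
    """Apply team overrides to recommended defaults; mandated settings cannot change."""
    rejected = sorted(k for k in team_overrides if k in MANDATED)
    if rejected:
        raise ValueError(f"cannot override mandated settings: {rejected}")
    return {**RECOMMENDED, **team_overrides, **MANDATED}


def needs_standards_review(
    ai_lines: int, total_new_lines: int, threshold: float = 0.5
) -> bool:
    """True when the AI-generated share of new code exceeds the threshold."""
    if total_new_lines <= 0:
        return False
    return ai_lines / total_new_lines > threshold
```

Failing loudly on an attempted override of a mandated setting, instead of silently ignoring it, makes the mandated-versus-recommended boundary visible to the team that hits it.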
Discussion Prompts
- In your organization, who currently has the mandate to set standards for how AI-generated code is reviewed? If nobody does, what would be needed to create that function?
- What aspects of the review practice you have designed in this course do you think would transfer well to other teams in your organization, and which would need significant adaptation? What drives that difference?
- Have you participated in cross-team communities of practice or guilds before? What made them effective or ineffective, and how would you design one for AI code review specifically?
- If your organization's AI adoption rate doubles in the next year, is your current review practice capable of scaling with it? What breaks first, and how would you address it?
This session benefits from inviting participants to think organizationally rather than just about their immediate team. Many will be in a position to influence how their team's practice develops, but not to mandate standards across other teams. The practical question for them is: what can I do from my position to spread the practices in this course, without authority over other teams? Community of practice participation, sharing the team knowledge base, and contributing to shared tooling configurations are all answers to that question.
The "configuration as code" point resonates strongly with technically senior participants. If the organization has an internal developer platform or a shared CI template, position shared review tooling configuration as a contribution to that platform — it frames the work in terms the organization already values.
Close the course on an action-oriented note. Each participant should leave with a clear sense of one thing they will do differently next week and one larger initiative they will propose or champion in their organization. The knowledge from this course is most valuable when it creates behavior change, not when it is merely understood.
Timing Guide
Closing Remarks
This course has moved from the design of a review workflow through its criteria, tooling, measurement, human development, conflict resolution, knowledge capture, and organizational scaling. Every topic connects to the central premise: the speed AI enables is only sustainable if the quality practices that surround it scale at the same rate. Teams that invest in review infrastructure now are building the capability to move fast with confidence. Teams that defer that investment are accumulating a risk debt that becomes harder to address with every AI-generated line that ships without rigorous review.
Encourage participants to share this course's frameworks with their teams, contribute to or start a community of practice in their organization, and revisit the knowledge base and metrics practices they design here in six months. The field is moving quickly, and the practices that work today will need to evolve — the habit of deliberate reflection is the most durable thing to take away.