Module 8 · Lesson 1

Why Checklists Fail — and What Works Instead

From the Boeing 737 MAX to your pull request queue: the science of review systems that actually catch things.

What separates a checklist that closes bugs from one that gets checkbox-clicked into oblivion?

In October 2018, Lion Air Flight 610 crashed thirteen minutes after takeoff. Investigators later found that software engineers had reviewed the MCAS flight-control system's code — but the review checklist used at Boeing focused on individual function correctness, not on system-level failure modes. The same pattern recurred with Ethiopian Airlines Flight 302 in March 2019. The checklists existed. They were checked. They missed what mattered.

The failure wasn't laziness. It was checklist design. Items were too broad, lacked falsifiability, and were disconnected from the actual failure modes the system could exhibit. This is the canonical industrial case for checklist architecture — and it applies directly to code review.

The Two Failure Modes of Code Review Checklists

Empirical research on code review outcomes, including work by Alberto Bacchelli and Christian Bird published in the 2013 ICSE proceedings ("Expectations, Outcomes, and Challenges of Modern Code Review"), identified that most review comments fall into a small number of recurring categories — and that reviewers without explicit structure default to style and surface issues while missing logic and security problems.

Checklists fail in two distinct ways. Type I failure is the checklist that is too generic: "Is error handling correct?" cannot be answered by looking at a diff without knowing the error-handling contract of the system. Type II failure is the checklist that is too long: research from the aviation and surgical domains (Atul Gawande's 2009 WHO Surgical Safety Checklist study) shows that checklists longer than roughly nine items suffer significant compliance degradation under time pressure — which describes almost every code review.
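
The two failure modes can be approximated with a small lint over the checklist text itself. This is a minimal sketch, not a validated tool: the nine-item limit is the rough figure quoted above, while `VAGUE_WORDS` and the function name `lint_checklist` are illustrative assumptions.

```python
# Hypothetical checklist lint. The word list and threshold are illustrative.
MAX_ITEMS = 9  # rough limit quoted in the lesson text
VAGUE_WORDS = {"correct", "correctly", "safe", "appropriate", "proper", "properly", "good"}


def lint_checklist(items):
    """Return warning strings for a checklist given as a list of item strings."""
    warnings = []
    # Type II: the list is too long to complete under time pressure.
    if len(items) > MAX_ITEMS:
        warnings.append(f"Type II risk: {len(items)} items exceeds {MAX_ITEMS}")
    # Type I: an item leans on a judgment word instead of an observable state.
    for item in items:
        words = {w.strip("?.,").lower() for w in item.split()}
        hits = sorted(words & VAGUE_WORDS)
        if hits:
            warnings.append(f"Type I risk: '{item}' uses vague term(s): {', '.join(hits)}")
    return warnings
```

A keyword match is only a prompt for a second look. An item can avoid every word in the list and still be impossible to answer from a diff.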

Research Finding

Bacchelli & Bird (2013, ICSE) found that in over 570 Microsoft code reviews studied, only 14% of useful review comments addressed logic defects — despite logic errors being the category developers most wanted peers to catch. The gap is structural, not motivational.

What Makes a Checklist Item Actionable

A checklist item is actionable when it can be answered yes or no by inspecting the diff alone — without requiring the reviewer to hold the entire codebase in working memory. This is the principle behind the WHO Surgical Safety Checklist's design: every item is a concrete observable state, not a judgment call.

For code review, this means translating judgment calls into binary observables. "Is authentication handled correctly?" becomes three items: (1) Does every new endpoint call the auth middleware? (2) Are user IDs taken from the authenticated session object, not from request parameters? (3) Are authorization checks present before any database write? Each can be answered by reading the diff.
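
One way to make the yes/no property explicit is to store each item with a three-state answer. The sketch below reuses the three authentication items from the paragraph above; the class name `ChecklistItem` and the function `review_status` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChecklistItem:
    question: str
    answer: Optional[bool] = None  # None means the item has not been evaluated yet


AUTH_ITEMS = [
    ChecklistItem("Does every new endpoint call the auth middleware?"),
    ChecklistItem("Are user IDs taken from the authenticated session object, not from request parameters?"),
    ChecklistItem("Are authorization checks present before any database write?"),
]


def review_status(items):
    """Report 'incomplete' until every item has an explicit yes or no."""
    if any(item.answer is None for item in items):
        return "incomplete"
    return "pass" if all(item.answer for item in items) else "fail"
```

Because an unanswered item is distinct from a "yes", skimming past an item leaves the review incomplete rather than silently passing it.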

The Role of AI in Checklist Execution

Large language models applied to code review — GitHub Copilot code review (released to GA in February 2025), Amazon CodeGuru, and Sourcegraph Cody — operate most effectively when they are given explicit, structured prompts that mirror a well-designed checklist. A 2024 study by Hasan et al. at Carnegie Mellon ("Automated Code Review with LLMs: A Controlled Experiment") found that models prompted with category-specific instructions produced 38% more actionable comments than models prompted with free-form "review this code" instructions.

This means your personal checklist serves double duty: it guides your own cognitive attention during manual review, and it becomes the prompt structure you use when directing AI review tools. The architecture of the checklist is the architecture of both processes.
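
A minimal sketch of prompt mirroring, assuming the checklist is held as a mapping from category name to yes/no items. The function `build_review_prompt` and the prompt wording are illustrative and are not taken from any particular tool.

```python
def build_review_prompt(checklist, diff):
    """Build a review prompt whose sections mirror the checklist categories.

    checklist maps a category name to a list of yes/no items.
    """
    lines = [
        "Review the diff below. For each item answer yes, no, or not applicable,",
        "and cite the diff line that supports the answer.",
        "",
    ]
    for category, items in checklist.items():
        lines.append(f"## {category}")
        for n, item in enumerate(items, start=1):
            lines.append(f"{n}. {item}")
        lines.append("")
    lines.append("Diff:")
    lines.append(diff)
    return "\n".join(lines)
```

The same dictionary can be printed as the manual checklist, so the human pass and the AI pass share one taxonomy.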

Design Principle

Each checklist item should map to a known failure mode — a documented bug class, a real CVE category, a post-mortem pattern — not to an abstract quality attribute. "Security" is not a checklist item. "SQL query uses parameterized input, not string concatenation" is.
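
The SQL item can be seen failing and passing in a few lines. This sketch uses Python's standard `sqlite3` module with an in-memory database; the table, function names, and payload are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])


def find_user_unsafe(name):
    # Fails the checklist item: user input is concatenated into the SQL text.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(name):
    # Passes the checklist item: the value is bound as a parameter.
    return conn.execute("SELECT id, name FROM users WHERE name = ?", (name,)).fetchall()


payload = "x' OR '1'='1"
leaked = find_user_unsafe(payload)  # the injected OR clause matches every row
bound = find_user_safe(payload)     # the payload is treated as a plain string
```

A reviewer can tell the two functions apart from the diff alone, which is what makes the item falsifiable.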

Key Terms

Falsifiability: A checklist item is falsifiable when the code can demonstrate it is false; that is, the item can be failed, not just skimmed past.

Type I checklist failure: Items too abstract to evaluate from a diff without deep contextual knowledge, causing reviewers to skip or guess.

Type II checklist failure: Checklists too long to complete under real review time pressure, causing compliance drift toward checkbox theater.

Prompt mirroring: Using your review checklist categories as the structure of AI review prompts, so manual and automated review share the same taxonomy.

Lesson 1 Quiz

Why Checklists Fail — and What Works Instead

According to Bacchelli & Bird's 2013 ICSE study of Microsoft code reviews, what percentage of useful review comments actually addressed logic defects?

Correct. Only 14% of useful comments addressed logic defects — the category developers most wanted peers to catch. This gap is structural, caused by the absence of explicit review structure.

Not quite. The study found only 14% of useful comments addressed logic defects, despite that being what developers most wanted from review. The structural deficit is the point.

What is a "Type II checklist failure" in the context of code review?

Correct. Type II failure: checklists exceeding roughly nine items suffer significant compliance degradation under time pressure. Gawande's surgical checklist research documents this effect.

Type II failure refers to length-driven compliance drift. When checklists are too long, reviewers under time pressure begin checkbox-clicking without real evaluation.

Which of the following is the best example of a falsifiable, actionable checklist item?

Correct. This item can be verified by reading the diff: either every new endpoint calls auth middleware or it doesn't. No additional context is required to evaluate it.

That item requires judgment about what "correct" or "needed" means in context. A falsifiable item can be answered yes/no from the diff alone — like "every new endpoint calls auth middleware."

The Hasan et al. (2024, CMU) study on LLM code review found that models prompted with category-specific instructions produced how many more actionable comments than those given free-form prompts?

Correct. 38% more actionable comments. This is why your review checklist categories should directly inform how you prompt AI review tools — the structure transfers.

The study found 38% more actionable comments with structured prompts. This finding is the empirical basis for "prompt mirroring" — using checklist categories as AI prompt structure.

Lab 1: Diagnosing Checklist Design

Practice converting abstract review criteria into falsifiable, diff-verifiable items with AI assistance.

Your Task

You will work with the AI to analyze a set of weak checklist items and transform them into falsifiable, actionable items following the principles from Lesson 1. Bring specific examples from your own domain or use the suggested prompts below.

Try: "Here is a checklist item from my team: 'Ensure database access is safe.' Help me rewrite it as two or three falsifiable items." — or — "What failure mode does the Boeing MCAS checklist case illustrate for code review design?"

AI Review Advisor

Lab 1

Welcome to Lab 1. I'm here to help you build falsifiable, actionable checklist items. Share a vague or abstract review criterion from your codebase — or one you've seen on a team — and we'll transform it together into items that can be evaluated directly from a diff. What would you like to start with?

Module 8 · Lesson 2

Anatomy of a High-Signal Checklist

The seven categories that cover 90% of production defects — derived from real post-mortems.

Which categories should anchor your personal checklist, and how do you weight them for your specific codebase?

In August 2012, Knight Capital Group lost $440 million in 45 minutes due to a deployment error that activated deprecated trading code. A post-mortem published by the SEC in 2013 noted that code review had not included a category for deployment flag verification — a gap invisible to reviewers focused on logic correctness. The category didn't exist in their checklist because it hadn't caused a problem before. After the incident, Knight's surviving team explicitly added deployment artifact review as a mandatory category. They learned the hard way what category design determines: what your process can see.

The Seven Core Categories

Analysis of post-mortems from Google's Site Reliability Engineering practices (published in the 2016 SRE book), Amazon's COE (Correction of Errors) database, and academic studies of open-source defect histories (Sliwerski et al., MSR 2005) converges on seven categories that account for the majority of production defects that code review could have caught:

Category	Signal Priority	What to Look For
Logic correctness	High	Off-by-one errors, inverted conditions, missing early returns, incorrect loop bounds
Input validation & trust boundaries	High	Data from external sources used without sanitization; user-controlled values reaching sensitive sinks
Error handling & failure modes	High	Exceptions swallowed silently, missing rollback on partial writes, cascade failure paths
Concurrency & state	Med	Shared mutable state accessed without synchronization; race conditions in async paths
Dependency & interface contracts	Med	Callers assuming non-guaranteed behaviors; breaking changes to public interfaces
Observability	Med	New error paths without logging; metrics not updated; distributed trace context dropped
Deployment artifacts	Low–High*	Feature flags, config keys, migration scripts — correct state for target environment

Note on Priority

Deployment artifact priority is marked Low–High because its importance is highly context-dependent: it is low signal in library code and extremely high signal in service deployments, database migrations, or any change involving feature flags. Your checklist should annotate this variability explicitly.

Weighting for Your Codebase

The seven categories are not equally applicable to every change. A migration script diff has near-zero concurrency surface but maximum deployment artifact surface. A new async message handler inverts that completely. Effective personal checklists include a scope qualifier for each category — a one-line description of which change types activate that category.
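
A scope qualifier can be as simple as a set of path patterns per category. The sketch below is one possible encoding using the standard `fnmatch` module; the patterns, the `ALWAYS_ON` list, and the function name are assumptions to be replaced by your own repository layout.

```python
from fnmatch import fnmatchcase

# Hypothetical scope qualifiers: path patterns that activate a category.
SCOPE_QUALIFIERS = {
    "Deployment artifacts": ["migrations/*", "*.yaml", "*.yml", "config/*"],
    "Concurrency & state": ["*/handlers/*", "*/workers/*"],
}
# Categories reviewed on every change, regardless of which files it touches.
ALWAYS_ON = ["Logic correctness", "Input validation & trust boundaries"]


def active_categories(changed_paths):
    """Return the categories a reviewer should apply to this set of changed files."""
    active = list(ALWAYS_ON)
    for category, patterns in SCOPE_QUALIFIERS.items():
        if any(fnmatchcase(path, pat) for path in changed_paths for pat in patterns):
            active.append(category)
    return active
```

With these patterns a migration script activates the deployment category but not the concurrency one, and a new message handler does the reverse.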

Google's internal code review guidelines (partially described in the public-facing Google Engineering Practices documentation) distinguish between "must review every CL" items and "review when applicable" items. This two-tier approach prevents checklist fatigue while ensuring critical items are never skipped.

Category Ordering Matters

A 2015 study by Czerwonka et al. at Microsoft ("Code Reviews Do Not Find Bugs") noted that reviewers tend to run out of attention by the third or fourth item on a list and that later items receive disproportionately less scrutiny. This means your highest-signal categories should appear first, not as a matter of organization but as a cognitive load management decision. Logic correctness and input validation should never appear at items 6 and 7.

Practical Principle

Order your checklist by the historical frequency of defects in your own codebase, not by abstract severity. If your team's post-mortems show 60% of production incidents trace to missing error handling, that category earns slot 1 — regardless of what any generic template says.
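
Ordering by defect history is a short sort once each incident carries a category label. This is a sketch under that assumption; the function name and the sample data are illustrative.

```python
from collections import Counter


def order_by_incident_frequency(categories, incident_categories):
    """Sort categories so the ones behind the most incidents come first.

    incident_categories holds one category label per past incident.
    Ties keep their original relative order because sorted() is stable.
    """
    counts = Counter(incident_categories)
    return sorted(categories, key=lambda c: -counts[c])


history = ["Error handling"] * 6 + ["Logic correctness"] * 3 + ["Input validation"]
ordered = order_by_incident_frequency(
    ["Logic correctness", "Input validation", "Error handling"], history
)
```

With six of the ten sample incidents traced to error handling, that category moves to slot 1, matching the 60% example above.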

Key Terms

Trust boundary: The point in the code where data transitions from untrusted (external, user-controlled) to trusted (internal system) scope. A primary input validation checkpoint.

Scope qualifier: A condition attached to a checklist category that specifies which change types trigger that category for review.

Cascade failure: A sequence where one component's failure propagates to dependent components, amplifying impact; often enabled by swallowed exceptions or missing circuit breakers.

Lesson 2 Quiz

Anatomy of a High-Signal Checklist

What was the missing checklist category that enabled the Knight Capital Group $440M loss in August 2012?

Correct. The SEC post-mortem noted that review had no category for deployment flag verification, allowing deprecated code to be activated by an incorrectly set flag during deployment.

The post-mortem identified a missing deployment artifact/flag verification category. The logic itself had been reviewed — the problem was that deprecated code paths could be activated by deployment configuration that no review category examined.

According to the Czerwonka et al. (2015, Microsoft) study, why does category ordering matter in a review checklist?

Correct. Attention degrades through the list. This means high-signal categories must appear early — it is a cognitive load management decision, not an organizational preference.

The study found attention degrades through a checklist, with items 3–4 onward receiving disproportionately less scrutiny. High-signal categories must appear first for this reason.

What is a "scope qualifier" in the context of a personal review checklist?

Correct. A scope qualifier prevents checklist fatigue by activating categories only when applicable — for example, marking "deployment artifacts" as relevant only for service deployments, not library changes.

A scope qualifier is a condition you attach to each category: "review this category when the change includes [specific type of code]." It prevents irrelevant categories from wasting attention on every review.

Google's Engineering Practices documentation distinguishes between two tiers of checklist items. What are they?

Correct. This two-tier structure prevents fatigue on universal items while ensuring conditional items are not silently skipped — a practical architecture for sustainable review practice.

Google's approach uses "must review every CL" and "review when applicable" tiers. This prevents the cognitive load of a 20-item checklist while ensuring nothing critical is structurally invisible.

Lab 2: Mapping Categories to Your Codebase

Build your own weighted category structure using the seven-category framework and your team's defect history.

Your Task

Work with the AI to map the seven core checklist categories to your actual tech stack and team context. Add scope qualifiers, establish ordering by your defect history, and identify any domain-specific eighth category your codebase needs.

Try: "My team works on a Node.js REST API with PostgreSQL. Help me add scope qualifiers to the seven categories and reorder them by relevance." — or — "What domain-specific category should I add for a financial transactions service?"

AI Review Advisor

Lab 2

Welcome to Lab 2. Let's tailor the seven-category framework to your specific codebase. Tell me your tech stack, primary application type (API, frontend, data pipeline, embedded, etc.), and any recurring bug patterns your team has seen. I'll help you add scope qualifiers, reorder by relevance, and identify whether you need an eighth domain-specific category.

Module 8 · Lesson 3

Integrating AI Tools Into Your Checklist Workflow

How to assign checklist categories to AI, which to keep human, and how to avoid the automation complacency trap.

When you hand a category to an AI reviewer, what cognitive responsibility are you actually giving up — and how do you get it back when the AI is wrong?

In May 2023, Samsung engineers leaked proprietary source code and internal meeting notes by pasting them into ChatGPT during code review sessions, according to reporting by The Verge and confirmed by Samsung in a company-wide policy memo. The engineers were using AI as a review tool — but their personal checklist had no category for data classification before external tool use. The checklist they were following was optimized for code quality, not for the information security implications of the review process itself.

This case became a reference point for enterprise AI governance policies worldwide. Over 40% of Fortune 500 companies subsequently added AI tool use restrictions to their code review guidelines, according to a June 2023 survey by Cyberhaven.

What AI Tools Do Well in Code Review

Current generation AI code review tools — GitHub Copilot Code Review, Amazon CodeGuru Reviewer, and DeepCode (now Snyk Code) — demonstrate measurable advantage in specific categories. Amazon's published benchmarks for CodeGuru Reviewer show 89% recall on Java concurrency defects in their test suite. Snyk Code's 2023 benchmark report shows precision above 85% on known CWE vulnerability patterns across Java, Python, and JavaScript.

These tools are strongest on pattern-matching tasks: known vulnerability signatures (SQL injection, XSS, path traversal), common concurrency anti-patterns, and style/convention violations. They are weakest on semantic understanding tasks: whether the business logic is correct, whether an error is handled appropriately for the calling context, and whether an interface contract is being violated at the semantic level.

Assign to AI

Known vulnerability patterns (CWE top 25), dependency version checks, obvious null dereferences, missing input length checks, common crypto misuse, license compliance in dependencies, code style and formatting consistency.

Keep Human

Business logic correctness, system-level failure mode analysis, authorization model coherence, architectural boundary violations, semantic error handling adequacy, post-mortem pattern matching from your specific history.

The Automation Complacency Trap

Automation complacency, closely related to "automation bias," is the well-documented human tendency to under-scrutinize outputs from automated systems. First identified in an aviation context by Mosier & Skitka in their 1996 paper "Human Decision Makers and Automated Decision Aids", the effect has been reproduced in software contexts. A 2022 study by Liang et al. at Microsoft Research ("Is AI-Assisted Code Review Beneficial?") found that developers who received AI review suggestions reduced their own review time by 17% on average but also reduced detection of novel bugs by 23%.

The implication for checklist design is specific: categories you assign to AI must have a human verification step built into the checklist. Not "AI checked this" but "AI checked this AND I reviewed the AI's output for false negatives in these specific ways."

Process Design

For each AI-assigned category, your checklist should include a one-line verification instruction: what you skim the AI output for, what would indicate a missed finding, and the maximum time budget for that verification. This prevents AI assistance from becoming invisible rubber-stamping.

Data Classification as a Review Category

The Samsung case directly motivated the addition of a "review process security" category to enterprise checklists. Before pasting code into any AI tool — internal or external — the checklist should require: (1) Is this code classified as confidential or proprietary? (2) Is the target AI tool approved for this data classification? (3) Have I removed identifying comments, credentials, and internal endpoint references before submission?

GitHub Copilot Business and Enterprise tiers, as of 2023, include explicit data handling agreements and do not use customer code for model training. OpenAI's API with the zero-data-retention option provides similar guarantees. Consumer-tier tools (ChatGPT.com, Claude.ai) do not. Your checklist should encode which tools are approved for which data classifications — this is a falsifiable binary item for each AI-assisted review category.
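
The tool-approval item can be encoded as a lookup so that the answer is always yes or no. This is a minimal sketch: the tool identifiers, the three classification levels, and the policy table are hypothetical placeholders, not statements about any vendor's actual terms.

```python
# Classification levels, ordered from least to most sensitive (assumed scheme).
LEVELS = ["public", "internal", "confidential"]

# Hypothetical policy table: the highest classification each tool may receive.
APPROVED_TOOLS = {
    "enterprise-assistant": "confidential",
    "consumer-chatbot": "public",
}


def may_submit(tool, data_classification):
    """Return True only if the tool is approved for data at this classification."""
    max_level = APPROVED_TOOLS.get(tool)
    if max_level is None:
        return False  # tools missing from the policy table are never approved
    return LEVELS.index(data_classification) <= LEVELS.index(max_level)
```

Defaulting unknown tools to "not approved" keeps the gate closed when a new tool appears before the policy has been updated.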

Checklist Architecture Rule

For every AI-delegated category: (1) specify which tool is approved, (2) specify the verification step you take after reviewing the AI output, and (3) specify the data classification threshold above which human-only review applies. These three sub-items transform AI assistance from a trust-and-forget step into a managed handoff.
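
The three sub-items can be recorded per category so that a missing one is easy to spot. The sketch below is one possible shape; the class `AIDelegation`, its field names, and the default time budget are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AIDelegation:
    category: str
    approved_tool: str       # (1) which tool is approved
    verification_step: str   # (2) what the human checks in the AI output
    human_only_above: str    # (3) classification above which no AI tool is used
    verify_minutes: int = 5  # assumed time budget for the verification step


def incomplete_fields(delegation):
    """Return the names of the required sub-items that were left blank."""
    required = {
        "approved_tool": delegation.approved_tool,
        "verification_step": delegation.verification_step,
        "human_only_above": delegation.human_only_above,
    }
    return [name for name, value in required.items() if not value.strip()]
```

A delegation that reports any incomplete field is still a trust-and-forget step and should not yet count as a managed handoff.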

Key Terms

Automation bias: The tendency to over-rely on automated system outputs and under-apply independent scrutiny, identified by Mosier & Skitka (1996) in aviation and reproduced in software review contexts.

Human verification step: A structured action the reviewer takes after receiving AI output to check for false negatives, preventing automation bias from creating invisible blind spots.

Data classification gate: A checklist item requiring the reviewer to confirm that code being submitted to an AI tool meets that tool's approved data classification level.

Lesson 3 Quiz

Integrating AI Tools Into Your Checklist Workflow

What did the 2022 Liang et al. (Microsoft Research) study find about developers who received AI code review suggestions?

Correct. Faster reviews, but fewer novel bugs caught — a classic automation bias effect. The implication is that AI-assisted categories need human verification steps built into the checklist.

The study found a 17% time reduction paired with a 23% reduction in novel bug detection. Speed improved but the cost was reduced independent scrutiny — exactly the automation complacency pattern.

The Samsung May 2023 code leak through ChatGPT revealed a missing checklist category. What was that category?

Correct. The engineers' checklists were optimized for code quality — they had no category for "is this code classified as proprietary, and is the tool I'm using approved for this classification?"

The missing category was data classification before AI tool use. The code being reviewed was proprietary — but the checklists had no gate for "is it safe to paste this into this tool?"

Which of these review tasks is best suited for AI tool delegation based on current tool capabilities?

Correct. SQL injection is a known CWE pattern — exactly the kind of signature-matching task where AI tools like Snyk Code and CodeGuru show high precision. Semantic and contextual judgments remain human tasks.

AI tools are strongest on known vulnerability patterns (like SQL injection via string concatenation) because these are signature-matching tasks. Business logic, error handling adequacy, and architectural semantics require human judgment.

What three sub-items should a well-designed checklist include for every AI-delegated review category?

Correct. These three items transform AI assistance from a trust-and-forget step into a managed handoff: (1) which approved tool, (2) how you verify the AI output, and (3) when the data is too sensitive for this tool.

The three required sub-items are: which tool is approved for this category, what verification step you take after reviewing AI output, and the data classification threshold above which human-only review applies.

Lab 3: Designing Your AI Handoff Protocol

Build the AI-delegation section of your checklist — with verification steps, tool approvals, and data classification gates.

Your Task

Work with the AI to design the AI-delegation section of your personal review checklist. For each category you plan to delegate, define the approved tool, the human verification step, and the data classification gate. Use your specific tech stack context.

Try: "I want to use GitHub Copilot code review for input validation checks. Help me write the verification step and data classification gate I should include in my checklist." — or — "My team works on healthcare data. What data classification gates should I include before any AI-assisted review?"

AI Review Advisor

Lab 3

Welcome to Lab 3. Let's design your AI handoff protocol — the section of your checklist that governs when and how you use AI tools in review. Tell me which AI review tools you have access to, what your team's data sensitivity level is (public product, internal enterprise, regulated industry, etc.), and which checklist categories you're considering delegating to AI. We'll build the three sub-items for each: approved tool, verification step, and data classification gate.

Module 8 · Lesson 4

Maintaining and Evolving Your Checklist

Checklists that don't update become noise. The feedback loops that keep your review system sharp over time.

How do you wire your checklist to your incident history so that every production problem makes the next review smarter?

In September 2021, Coinbase disclosed a critical bug in their advanced trading platform that could have allowed users to place orders without sufficient funds. The bug had passed multiple code reviews. In their public post-mortem, Coinbase's engineering team noted that their review checklist had no explicit item for invariant preservation in state transitions — specifically that balance checks needed to be atomic with order placement. They updated their checklist immediately. By their own account, the same class of defect had appeared in a slightly different form six months earlier and also passed review — because the checklist still didn't cover it.

This is the canonical case for post-mortem-driven checklist evolution: a defect class that recurs until it is explicitly encoded as a falsifiable checklist item.
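
The invariant behind such an item, that the balance check and the debit happen as one atomic step, can be illustrated with a toy in-process account. This is not the code from the incident. It is a sketch in which a `threading.Lock` stands in for whatever transaction mechanism a real trading system would use, and the `Account` class is hypothetical.

```python
import threading


class Account:
    """Toy account whose balance must never go negative."""

    def __init__(self, balance):
        self.balance = balance
        self.orders = []
        self._lock = threading.Lock()

    def place_order(self, cost):
        # The check and the debit sit under the same lock, so no other thread
        # can spend the same funds between the check and the write.
        with self._lock:
            if cost > self.balance:
                return False
            self.balance -= cost
            self.orders.append(cost)
            return True
```

A falsifiable item derived from this pattern might read: "Is every balance check inside the same lock or transaction as the write it guards?" A reviewer can answer that by looking at where the check sits in the diff.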

The Checklist Feedback Loop

A review checklist without a feedback mechanism is a static artifact that decays in relevance as your codebase and team evolve. The mechanism for keeping a checklist current is a retrospective trigger: a defined event that initiates checklist review. Three triggers cover the vast majority of cases:

Production incident: Any defect that reached production that a code review could have caught triggers a checklist audit. The audit question: which category should have caught this, and was it absent, too abstract, or ordered too late?
Near-miss in review: A bug caught in review that was not explicitly covered by the checklist triggers a checklist addition. This is the positive signal — the reviewer found it, but the checklist didn't prompt them to look.
Quarterly scheduled review: A calendar-driven audit of the entire checklist to remove items that have been rendered obsolete by architectural changes, tooling improvements, or changed team practices.

Checklist Version Control

Storing your checklist in a version-controlled repository — even a personal dotfiles repository — creates an automatic audit trail of how your review practice has evolved. More importantly, it enables diff-based retrospectives: when a defect class recurs, you can inspect whether the relevant checklist item existed at the time of the earlier incident.

Teams at Netflix, as described in their engineering blog posts on review culture (2019–2022), maintained checklists in their team wiki with explicit change history annotations: each item includes the date it was added, the incident or near-miss that motivated it, and the name of the engineer who added it. This context prevents item staleness — when the engineer who added an item leaves and the context is lost, items tend to become cargo-cult checkboxes.
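
If the checklist lives in a repository, the annotations described above can be carried as fields on each item. This is a minimal sketch of one possible record shape; the class `AnnotatedItem`, its field names, and the helper `missing_context` are assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class AnnotatedItem:
    text: str
    added_on: date
    motivated_by: str  # incident or near-miss that prompted the item
    added_by: str


def missing_context(items):
    """Return the text of items whose motivating incident was never recorded."""
    return [item.text for item in items if not item.motivated_by.strip()]
```

Items returned by `missing_context` are natural candidates for the quarterly review, since nobody can say which failure they were meant to prevent.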

Anti-Pattern

The most common checklist evolution failure is addition without removal. Teams add items after every incident but rarely remove items after architectural changes make them obsolete. A checklist that grows monotonically will eventually exceed cognitive capacity and trigger the Type II failure described in Lesson 1.

Using AI to Audit Your Checklist

Large language models can serve as checklist auditors when given structured prompts. A productive workflow: provide the current checklist, the post-mortem summary of a recent incident, and ask the model to identify (1) which checklist item should have caught this defect, (2) whether that item is present and falsifiable, and (3) how to rewrite it if not.
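
The audit workflow above amounts to a fixed prompt template with two inputs. The sketch below assembles it as plain text; the function name and exact wording are illustrative, and sending the prompt to a model is left out because it depends on the tool in use.

```python
AUDIT_QUESTIONS = [
    "Which checklist item should have caught this defect?",
    "Is that item present and falsifiable (answerable yes/no from a diff)?",
    "If not, how should it be rewritten?",
]


def build_audit_prompt(checklist_items, postmortem_summary):
    """Combine the current checklist and one post-mortem into an audit prompt."""
    parts = ["Current checklist:"]
    parts += [f"- {item}" for item in checklist_items]
    parts += ["", "Post-mortem summary:", postmortem_summary, "", "Answer each question:"]
    parts += [f"({n}) {q}" for n, q in enumerate(AUDIT_QUESTIONS, start=1)]
    return "\n".join(parts)
```

Keeping the three questions in a constant makes successive audits comparable, because every incident is examined with the same wording.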

This workflow was documented in a 2023 internal engineering blog post by Shopify (shared at SREcon 2023), where their reliability engineering team described using GPT-4 to audit their review checklists against a corpus of 18 months of post-mortems. The process identified 11 items that were too abstract to be falsifiable and 4 incident classes with no corresponding checklist coverage.

Calibration: When to Trust Your Checklist

A calibrated checklist is one whose coverage matches the actual defect distribution of your codebase. You can measure this by tracking, over a quarter, which checklist category caught each bug found in review — and comparing that distribution to which categories appear in your incident history. Persistent mismatches indicate either missing categories or miscalibrated ordering.
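
The comparison can be computed directly from two lists of category labels. This is a sketch under stated assumptions: each review catch and each incident has already been labeled with a category, and the 0.2 threshold for a notable mismatch is arbitrary.

```python
from collections import Counter


def distribution(labels):
    """Return the share of each label, as a dict of fractions that sum to 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def calibration_gaps(review_catches, incidents, threshold=0.2):
    """Return categories whose incident share differs from their review-catch share.

    A positive value means the category shows up more in incidents than in
    review catches, which suggests review is under-covering it.
    """
    caught = distribution(review_catches)
    escaped = distribution(incidents)
    gaps = {}
    for category in set(caught) | set(escaped):
        diff = escaped.get(category, 0.0) - caught.get(category, 0.0)
        if abs(diff) >= threshold:
            gaps[category] = round(diff, 2)
    return gaps
```

For example, if logic accounts for 80% of review catches but only 40% of incidents, while error handling accounts for 20% of catches and 60% of incidents, both categories are reported with gaps of -0.4 and 0.4.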

A well-calibrated checklist has a measurable effect: SmartBear's annual State of Code Review survey (2022 edition, N=1,035 developers) found that teams that performed explicit checklist reviews after incidents reported 34% fewer repeat defect classes over a 12-month period than teams that did not.

The Living Document Principle

Your checklist is not a finished artifact — it is a model of your team's accumulated knowledge about where defects live. Every incident that passes review is evidence the model is wrong. Treat checklist updates with the same rigor as code changes: version them, explain them, and review them with your team.

Key Terms

Retrospective trigger: A defined event (incident, near-miss, or scheduled review) that initiates a structured audit of the current checklist for coverage gaps or staleness.

Item staleness: The condition where a checklist item no longer corresponds to a real defect risk, usually because the architecture or tooling has changed, but the item was never removed.

Calibration: The degree to which a checklist's category distribution matches the actual defect distribution of the codebase it is applied to.

Context annotation: A note attached to each checklist item recording why it was added, when, and by whom, which prevents items from becoming cargo-cult checkboxes after the original context is lost.

Lesson 4 Quiz

Maintaining and Evolving Your Checklist

The Coinbase September 2021 trading bug post-mortem revealed which missing checklist category?

Correct. The checklist had no item for atomicity of state transitions — and the same defect class had appeared six months earlier without prompting a checklist update, leading to the second occurrence.

The post-mortem identified a missing category around invariant preservation in state transitions — ensuring balance checks are atomic with order placement. The same defect class had recurred because the checklist wasn't updated after the first occurrence.

What is the most common checklist evolution failure described in Lesson 4?

Correct. Monotonic growth eventually causes Type II failure — the checklist becomes too long to complete under real time pressure, and compliance degrades back to checkbox theater.

The failure is addition without removal. Checklists accrete items from incidents but never shed items that have become obsolete. Eventually this triggers the Type II length failure described in Lesson 1.

What did the Shopify SREcon 2023 presentation report about using GPT-4 to audit their review checklists against 18 months of post-mortems?

Correct. 11 items rewritten for falsifiability, 4 missing coverage areas identified. This demonstrates the specific value of AI-assisted checklist auditing: pattern-matching abstract items and gap analysis against incident history.

The Shopify audit found 11 items too abstract to be falsifiable and 4 incident classes with no checklist coverage — demonstrating that even mature engineering organizations have systematic gaps in their review structures.

According to SmartBear's 2022 State of Code Review survey (N=1,035), teams that performed explicit checklist reviews after incidents experienced what outcome over 12 months?

Correct. 34% fewer repeat defect classes — the specific benefit of calibrated, incident-driven checklist evolution. Not fewer bugs overall, but fewer of the same bugs twice.

The survey found 34% fewer repeat defect classes — not fewer bugs in general, but significantly fewer recurring defect patterns. This is the measurable output of treating a checklist as a living model of where defects live.

Lab 4: Auditing and Evolving Your Checklist

Use AI to simulate a post-mortem-driven checklist audit and build your retrospective trigger protocol.

Your Task

Work with the AI to simulate the checklist audit process. Describe a real or hypothetical incident from your domain and use the AI to identify coverage gaps in your current checklist. Then design your three retrospective triggers and decide which checklist items need context annotations.

Try: "Here is a summary of an incident my team had: [describe it]. Help me identify which checklist category should have caught it and how to write a falsifiable item." — or — "Review this checklist and tell me which items are too abstract, which might be obsolete for a microservices architecture, and what I'm missing."

AI Review Advisor

Welcome to Lab 4 — the culminating lab for this module. We're going to simulate a post-mortem-driven checklist audit. Share either: (1) a real incident summary from your team where a bug reached production through code review, (2) your current checklist draft for gap analysis, or (3) a description of your architecture so we can identify likely missing categories. The goal is to leave this lab with a specific update to your checklist and a defined retrospective trigger protocol. What would you like to start with?

Module 8 Test

Building a Personal Review Checklist — 15 questions · 80% to pass

1. What does it mean for a checklist item to be "falsifiable" in code review?

Correct. Falsifiability means the item can be evaluated — pass or fail — from the diff itself, without needing to hold the entire codebase in working memory.

Falsifiability in checklist design means the item can be answered yes/no from the diff alone. It can be failed — not just skimmed past — because the evidence is directly observable.

2. Which Boeing aircraft incident is used in this module as the canonical industrial case for checklist architecture failure?

Correct. The MCAS review checklists focused on individual function correctness, not system-level failure modes — precisely the Type I failure that allows abstract items to miss real defects.

The module uses the Boeing 737 MAX / MCAS case — two crashes (Lion Air 610 and Ethiopian 302) where checklists existed but were too focused on function-level correctness to catch system-level failure modes.

3. Atul Gawande's WHO Surgical Safety Checklist research found that checklists longer than approximately how many items suffer significant compliance degradation under time pressure?

Correct. The ~9-item threshold from Gawande's surgical checklist work is the empirical basis for the Type II failure definition in code review checklist design.

The threshold is approximately 9 items. Beyond that, compliance degrades significantly under time pressure — which describes most real code review situations.

4. Of the seven core checklist categories, which is marked as variable priority (Low–High) depending on the type of change?

Correct. Deployment artifact review is low signal for library code and extremely high signal for service deployments or anything involving feature flags — the Knight Capital case is the anchor for this category.

Deployment artifacts are variable priority — low for pure library code, critical for service deployments, database migrations, or feature flag changes. The Knight Capital $440M loss is the canonical case for this category's importance.

5. Bacchelli and Bird's 2013 ICSE study found reviewers defaulted to which type of comments when lacking explicit review structure?

Correct. Without structure, reviewers gravitate toward surface-level style issues. Logic and security defects — the categories developers most want peers to catch — fall through.

Without explicit structure, reviewers default to style and surface issues. This is the core finding that motivates structured checklist design: the gap between what reviewers produce and what they're most needed to catch.

6. The Czerwonka et al. (2020) Microsoft study on code review found that reviewer attention degrades by which checklist items?

Correct. Attention degrades by items 3–4, with later items receiving disproportionately less scrutiny. This is why highest-signal categories must occupy the first positions.

Attention degrades by items 3–4. Items after that receive disproportionately less scrutiny, so ordering is a cognitive load management decision, not just an organizational preference.
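
The ordering rule can be sketched in a few lines of Python. The signal scores are an assumption: you would assign them yourself, for instance from how often each category has caught a real defect.

```python
def order_by_signal(items):
    """Sort (category, signal) pairs so the highest-signal categories come first."""
    ranked = sorted(items, key=lambda pair: pair[1], reverse=True)
    return [name for name, _signal in ranked]
```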

7. What is "prompt mirroring" as defined in this module?

Correct. Prompt mirroring means your checklist does double duty: guiding your own cognitive attention and structuring the prompts you give AI tools — both processes share the same category taxonomy.

Prompt mirroring is using checklist categories directly as AI prompt structure. The Hasan et al. study showed this produces 38% more actionable AI comments than free-form "review this code" prompts.
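
A minimal sketch of prompt mirroring in Python, assuming your checklist is stored as (category, question) pairs. The function name and prompt wording are hypothetical, and the sketch only builds the prompt text; it does not call any AI tool.

```python
def build_review_prompt(categories, diff):
    """Build one prompt whose numbered sections mirror the checklist categories."""
    lines = ["Review the diff below. Answer each item PASS or FAIL and cite the line."]
    for n, (name, question) in enumerate(categories, start=1):
        lines.append(f"{n}. [{name}] {question}")
    lines.append("")
    lines.append("Diff:")
    lines.append(diff)
    return "\n".join(lines)
```

Because the prompt is generated from the checklist, updating a checklist item updates the AI prompt with it.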

8. Amazon CodeGuru Reviewer's published benchmarks show approximately what recall rate for Java concurrency defects?

Correct. 89% recall on Java concurrency defects in their test suite. This is the kind of pattern-matching task where AI tools demonstrate measurable advantage over unstructured human review.

Amazon CodeGuru shows 89% recall on Java concurrency defects. High recall on known patterns is where AI tools provide real value, which justifies delegating those categories to them in a well-designed checklist.

9. The Samsung May 2023 ChatGPT data leak involved which type of sensitive data?

Correct. Engineers pasted proprietary source code and meeting notes into ChatGPT during review. The missing category was data classification before external AI tool use.

Samsung engineers pasted proprietary source code and internal meeting notes into ChatGPT during code review. Their checklists had no data classification gate for AI tool use.

10. What three sub-items should accompany every AI-delegated checklist category?

Correct. These three transform AI assistance from trust-and-forget into a managed handoff with clear accountability for what the human reviewer still owns.

The three required sub-items are: (1) which tool is approved, (2) what human verification you perform after reviewing AI output, and (3) the data classification level above which this tool should not be used.
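
These three sub-items map naturally onto a small record type. The sketch below is illustrative: the classification levels are hypothetical and should be replaced with your organization's own scheme.

```python
from dataclasses import dataclass
from enum import IntEnum


class DataClass(IntEnum):
    # Hypothetical classification levels, lowest to highest sensitivity.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class AIDelegation:
    approved_tool: str          # (1) which tool is approved
    human_verification: str     # (2) what the human still checks afterwards
    max_data_class: DataClass   # (3) highest classification the tool may see

    def allows(self, change_class):
        """True if a change at this classification may be sent to the tool."""
        return change_class <= self.max_data_class
```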

11. What does "item staleness" mean in the context of checklist maintenance?

Correct. Stale items accumulate when teams add items after incidents but never remove them after the architecture evolves, which triggers Type II failure as the checklist grows beyond usable length.

Item staleness is when an item survives past the architectural or tooling change that made its defect class irrelevant. Stale items contribute to Type II failure without providing any review value.

12. Netflix engineering's checklist practice (2019–2022 blog posts) included what annotation on each checklist item?

Correct. These context annotations prevent items from becoming cargo-cult checkboxes. When the context behind an item is preserved, reviewers understand why it matters and what they're actually looking for.

Netflix annotated each item with when it was added, what incident motivated it, and who added it. Without this context, items become cargo-cult checkboxes when the original engineer leaves.
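
If your checklist lives in code or config, the same three annotations can travel with each item. This is a minimal sketch with hypothetical field names, not Netflix's actual format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AnnotatedItem:
    text: str
    added_on: date   # when the item was added
    incident: str    # what incident motivated it
    added_by: str    # who added it

    def render(self):
        """Render the item with its context on one line."""
        return (
            f"{self.text} (added {self.added_on.isoformat()} "
            f"by {self.added_by}; motivated by {self.incident})"
        )
```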

13. The Coinbase 2021 trading bug occurred twice because of which failure?

Correct. No retrospective trigger fired after the first near-miss. The defect class recurred because there was no process to encode new defect classes into the checklist.

The defect class had no checklist coverage the first time — and because no update was made, it had no coverage the second time either. This is the canonical case for retrospective trigger protocols.

14. A "trust boundary" in the context of input validation checklist items refers to what?

Correct. The trust boundary is the primary checkpoint for input validation — the point where user-controlled data enters your system and must be sanitized before reaching internal sinks.

A trust boundary is the point where data transitions from external/user-controlled to internal system scope. Input validation checklist items should be anchored to these specific crossing points in the code.
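
A small Python sketch of a trust-boundary crossing, under assumed names: parse_quantity is the single place where user-controlled text becomes a validated internal type, and internal code accepts only Quantity. The limit of 1000 is an arbitrary example value.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """Internal type: can only be obtained via parse_quantity in this sketch."""
    value: int


def parse_quantity(raw, max_value=1000):
    """Trust boundary: turn user-controlled text into a validated Quantity."""
    text = raw.strip()
    # Accept plain ASCII digits only; reject signs, separators, and injected text.
    if not re.fullmatch(r"[0-9]{1,9}", text):
        raise ValueError("quantity must be a positive whole number")
    value = int(text)
    if not 1 <= value <= max_value:
        raise ValueError("quantity out of range")
    return Quantity(value)
```

A reviewer can then check one concrete thing in the diff: does any new code path reach an internal sink with a raw string instead of a Quantity?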

15. SmartBear's 2022 State of Code Review survey (N=1,035) found that teams performing explicit post-incident checklist reviews experienced what specific improvement over 12 months?

Correct. 34% fewer repeat defect classes — not fewer bugs in general, but the specific elimination of recurring patterns. This is the measurable output of treating a checklist as a living model.

The outcome was 34% fewer repeat defect classes. The checklist-as-living-model approach doesn't necessarily reduce total bugs, but it systematically eliminates recurrence of known defect patterns.