In October 2018, Lion Air Flight 610 crashed thirteen minutes after takeoff. Investigators later found that software engineers had reviewed the MCAS flight-control system's code — but the review checklist used at Boeing focused on individual function correctness, not on system-level failure modes. The same pattern recurred with Ethiopian Airlines Flight 302 in March 2019. The checklists existed. They were checked. They missed what mattered.
The failure wasn't laziness. It was checklist design. Items were too broad, lacked falsifiability, and were disconnected from the actual failure modes the system could exhibit. This is the canonical industrial case for checklist architecture — and it applies directly to code review.
Empirical research on code review outcomes, including work by Alberto Bacchelli and Christian Bird published in the 2013 ICSE proceedings ("Expectations, Outcomes, and Challenges of Modern Code Review"), identified that most review comments fall into a small number of recurring categories — and that reviewers without explicit structure default to style and surface issues while missing logic and security problems.
Checklists fail in two distinct ways. Type I failure is the checklist that is too generic: "Is error handling correct?" cannot be answered by looking at a diff without knowing the error-handling contract of the system. Type II failure is the checklist that is too long: guidance from the aviation and surgical domains, summarized in Atul Gawande's 2009 book The Checklist Manifesto (which grew out of his work on the WHO Surgical Safety Checklist), recommends keeping a checklist to roughly five to nine items, because longer lists lose compliance under time pressure. Time pressure describes almost every code review.
Bacchelli & Bird (2013, ICSE) manually classified 570 review comments from Microsoft code reviews and found that only about 14% addressed defects, even though finding defects was the outcome developers most expected from review. The gap is structural, not motivational.
A checklist item is actionable when it can be answered yes or no by inspecting the diff alone — without requiring the reviewer to hold the entire codebase in working memory. This is the principle behind the WHO Surgical Safety Checklist's design: every item is a concrete observable state, not a judgment call.
For code review, this means translating judgment calls into binary observables. "Is authentication handled correctly?" becomes three items: (1) Does every new endpoint call the auth middleware? (2) Are user IDs taken from the authenticated session object, not from request parameters? (3) Are authorization checks present before any database write? Each can be answered by reading the diff.
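One way to keep items binary is to store each one as a yes/no question and treat anything other than an explicit "yes" as unresolved. The sketch below is illustrative only: `BinaryItem` and `unresolved` are hypothetical names, not part of any tool discussed here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class BinaryItem:
    # A question that should be answerable yes/no from the diff alone.
    question: str


# The three items that replace "Is authentication handled correctly?"
AUTH_ITEMS = [
    BinaryItem("Does every new endpoint call the auth middleware?"),
    BinaryItem("Are user IDs taken from the authenticated session object, "
               "not from request parameters?"),
    BinaryItem("Are authorization checks present before any database write?"),
]


def unresolved(items: List[BinaryItem],
               answers: Dict[str, Optional[bool]]) -> List[str]:
    """Return every question answered 'no' or not answered at all."""
    return [i.question for i in items if answers.get(i.question) is not True]
```

A review is complete for this category only when `unresolved(AUTH_ITEMS, answers)` returns an empty list. An item the reviewer cannot answer from the diff shows up here too, which is a hint that the item is still a judgment call.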
Large language models applied to code review, such as GitHub Copilot code review (generally available since April 2025), Amazon CodeGuru, and Sourcegraph Cody, operate most effectively when they are given explicit, structured prompts that mirror a well-designed checklist. A 2024 study by Hasan et al. at Carnegie Mellon ("Automated Code Review with LLMs: A Controlled Experiment") found that models prompted with category-specific instructions produced 38% more actionable comments than models prompted with free-form "review this code" instructions.
This means your personal checklist serves double duty: it guides your own cognitive attention during manual review, and it becomes the prompt structure you use when directing AI review tools. The architecture of the checklist is the architecture of both processes.
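To make the double duty concrete, here is a minimal sketch that renders a checklist into a category-by-category prompt. The function name `build_review_prompt` and the prompt wording are assumptions for illustration; the sketch produces plain text and does not call any particular tool's API.

```python
from typing import Dict, List


def build_review_prompt(checklist: Dict[str, List[str]], diff: str) -> str:
    """Render a checklist as a structured, category-specific review prompt."""
    lines = [
        "Review the diff below against each checklist item.",
        "Answer every item with YES, NO, or NOT APPLICABLE, "
        "and cite the diff line that justifies the answer.",
        "",
    ]
    for category, items in checklist.items():
        lines.append(f"## {category}")
        for number, item in enumerate(items, start=1):
            lines.append(f"{number}. {item}")
        lines.append("")  # blank line between categories
    lines.append("Diff:")
    lines.append(diff)
    return "\n".join(lines)


# Hypothetical two-category checklist and a one-line diff.
checklist = {
    "Input validation": [
        "SQL query uses parameterized input, not string concatenation",
    ],
    "Authentication": [
        "Does every new endpoint call the auth middleware?",
    ],
}
prompt = build_review_prompt(checklist, "+ cursor.execute(query)")
```

Because the prompt is generated from the same data you read during manual review, a change to the checklist changes both processes at once.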
Each checklist item should map to a known failure mode — a documented bug class, a real CVE category, a post-mortem pattern — not to an abstract quality attribute. "Security" is not a checklist item. "SQL query uses parameterized input, not string concatenation" is.
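The SQL item is easy to answer from a diff because the two forms look different on the page. The sketch below uses Python's standard `sqlite3` module with a made-up `users` table to show what the reviewer is looking for; the function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])


def find_user_unsafe(name: str):
    # Fails the checklist item: user input is concatenated into the SQL text.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(name: str):
    # Passes the checklist item: the value is bound as a parameter.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


payload = "x' OR '1'='1"
leaked = find_user_unsafe(payload)   # the injected condition matches every row
blocked = find_user_safe(payload)    # treated as a literal name, matches nothing
```

The reviewer does not need to reason about exploitability. The presence of string concatenation in a query is the observable, and it maps directly to a documented bug class.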
You will work with the AI to analyze a set of weak checklist items and transform them into falsifiable, actionable items following the principles from Lesson 1. Bring specific examples from your own domain or use the suggested prompts below.
In August 2012, Knight Capital Group lost $440 million in 45 minutes due to a deployment error that activated deprecated trading code. A post-mortem published by the SEC in 2013 noted that code review had not included a category for deployment flag verification — a gap invisible to reviewers focused on logic correctness. The category didn't exist in their checklist because it hadn't caused a problem before. After the incident, Knight's surviving team explicitly added deployment artifact review as a mandatory category. They learned the hard way what category design determines: what your process can see.
Analysis of post-mortems from Google's Site Reliability Engineering practices (published in the 2016 SRE book), Amazon's COE (Correction of Errors) database, and academic studies of open-source defect histories (Sliwerski et al., MSR 2005) converges on seven categories that account for the majority of production defects that code review could have caught:
| Category | Signal Priority | What to Look For |
|---|---|---|
| Logic correctness | High | Off-by-one errors, inverted conditions, missing early returns, incorrect loop bounds |
| Input validation & trust boundaries | High | Data from external sources used without sanitization; user-controlled values reaching sensitive sinks |
| Error handling & failure modes | High | Exceptions swallowed silently, missing rollback on partial writes, cascade failure paths |
| Concurrency & state | Med | Shared mutable state accessed without synchronization; race conditions in async paths |
| Dependency & interface contracts | Med | Callers assuming non-guaranteed behaviors; breaking changes to public interfaces |
| Observability | Med | New error paths without logging; metrics not updated; distributed trace context dropped |
| Deployment artifacts | Low–High* | Feature flags, config keys, migration scripts — correct state for target environment |
Deployment artifact priority is marked Low–High because its importance is highly context-dependent: it is low signal in library code and extremely high signal in service deployments, database migrations, or any change involving feature flags. Your checklist should annotate this variability explicitly.
The seven categories are not equally applicable to every change. A migration script diff has near-zero concurrency surface but maximum deployment artifact surface. A new async message handler inverts that completely. Effective personal checklists include a scope qualifier for each category — a one-line description of which change types activate that category.
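A scope qualifier can be stored next to each category so that only the relevant categories are shown for a given change. This is a minimal sketch under assumed names: the change types (`migration`, `async_handler`, `service`, `library`) and the `Category` structure are hypothetical and would come from your own codebase.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Category:
    name: str
    scope: str  # one-line scope qualifier shown to the reviewer
    # Change types that activate the category; None means every change.
    change_types: Optional[FrozenSet[str]] = None


CHECKLIST = [
    Category("Logic correctness", "Any change that touches executable code"),
    Category("Concurrency & state",
             "Async handlers, shared caches, background jobs",
             frozenset({"async_handler", "service"})),
    Category("Deployment artifacts",
             "Migrations, feature flags, config keys",
             frozenset({"migration", "service"})),
]


def active_categories(change_type: str) -> List[str]:
    """Return the categories whose scope qualifier matches this change."""
    return [
        c.name for c in CHECKLIST
        if c.change_types is None or change_type in c.change_types
    ]
```

With these example qualifiers, a migration diff activates deployment artifacts but not concurrency, and an async handler does the reverse.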
Google's internal code review guidelines (partially described in the public-facing Google Engineering Practices documentation) distinguish between "must review every CL" items and "review when applicable" items. This two-tier approach prevents checklist fatigue while ensuring critical items are never skipped.
A 2015 paper by Czerwonka et al. at Microsoft ("Code Reviews Do Not Find Bugs: How the Current Code Review Best Practice Slows Us Down") noted that reviewers tend to run out of attention by the third or fourth item on a list and that later items receive disproportionately less scrutiny. This means your highest-signal categories should appear first, not as a matter of organization but as a cognitive load management decision. Logic correctness and input validation should never appear at items 6 and 7.
Order your checklist by the historical frequency of defects in your own codebase, not by abstract severity. If your team's post-mortems show 60% of production incidents trace to missing error handling, that category earns slot 1 — regardless of what any generic template says.
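Ordering by defect history is a small computation once each incident is tagged with a category. The sketch below assumes a hypothetical list of category labels taken from your own post-mortems; the counts are invented for illustration.

```python
from collections import Counter
from typing import List


def order_by_defect_history(categories: List[str],
                            incident_categories: List[str]) -> List[str]:
    """Sort checklist categories by how often each appears in incident history."""
    counts = Counter(incident_categories)  # missing categories count as 0
    # sorted() is stable, so categories with equal counts keep their
    # original relative order.
    return sorted(categories, key=lambda c: counts[c], reverse=True)


template_order = [
    "Logic correctness",
    "Input validation",
    "Error handling",
    "Observability",
]
# Invented history: 6 of 10 incidents trace to error handling.
history = (["Error handling"] * 6
           + ["Logic correctness"] * 3
           + ["Input validation"] * 1)

ordered = order_by_defect_history(template_order, history)
```

In this invented history, error handling moves to slot 1 even though the generic template listed it third.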
Work with the AI to map the seven core checklist categories to your actual tech stack and team context. Add scope qualifiers, establish ordering by your defect history, and identify any domain-specific eighth category your codebase needs.
In May 2023, Samsung engineers leaked proprietary source code and internal meeting notes by pasting them into ChatGPT during code review sessions, according to reporting by The Verge and confirmed by Samsung in a company-wide policy memo. The engineers were using AI as a review tool — but their personal checklist had no category for data classification before external tool use. The checklist they were following was optimized for code quality, not for the information security implications of the review process itself.
This case became a reference point for enterprise AI governance policies worldwide. Over 40% of Fortune 500 companies subsequently added AI tool use restrictions to their code review guidelines, according to a June 2023 survey by Cyberhaven.
Current-generation AI code review tools, including GitHub Copilot Code Review, Amazon CodeGuru Reviewer, and DeepCode (now Snyk Code), show a measurable advantage in specific categories. Amazon's published benchmarks for CodeGuru Reviewer show 89% recall on Java concurrency defects in their test suite. Snyk Code's 2023 benchmark report shows precision above 85% on known CWE vulnerability patterns across Java, Python, and JavaScript.
These tools are strongest on pattern-matching tasks: known vulnerability signatures (SQL injection, XSS, path traversal), common concurrency anti-patterns, and style/convention violations. They are weakest on semantic understanding tasks: whether the business logic is correct, whether an error is handled appropriately for the calling context, and whether an interface contract is being violated at the semantic level.
Categories well suited to AI delegation: known vulnerability patterns (CWE Top 25), dependency version checks, obvious null dereferences, missing input length checks, common crypto misuse, license compliance in dependencies, and code style and formatting consistency.
Categories that need human review: business logic correctness, system-level failure mode analysis, authorization model coherence, architectural boundary violations, semantic error handling adequacy, and post-mortem pattern matching from your specific history.
Automation complacency, closely related to what the literature calls "automation bias," is the well-documented human tendency to under-scrutinize outputs from automated systems. First identified in an aviation context by Mosier & Skitka in their 1996 paper "Human Decision Makers and Automated Decision Aids: Made for Each Other?", the effect has been reproduced in software contexts. A 2022 study by Liang et al. at Microsoft Research ("Is AI-Assisted Code Review Beneficial?") found that developers who received AI review suggestions reduced their own review time by 17% on average but also reduced detection of novel bugs by 23%.
The implication for checklist design is specific: categories you assign to AI must have a human verification step built into the checklist. Not "AI checked this" but "AI checked this AND I reviewed the AI's output for false negatives in these specific ways."
For each AI-assigned category, your checklist should include a one-line verification instruction: what you skim the AI output for, what would indicate a missed finding, and the maximum time budget for that verification. This prevents AI assistance from becoming invisible rubber-stamping.
The Samsung case directly motivated the addition of a "review process security" category to enterprise checklists. Before pasting code into any AI tool — internal or external — the checklist should require: (1) Is this code classified as confidential or proprietary? (2) Is the target AI tool approved for this data classification? (3) Have I removed identifying comments, credentials, and internal endpoint references before submission?
GitHub Copilot Business and Enterprise tiers, as of 2023, include explicit data handling agreements and do not use customer code for model training. OpenAI's API with the zero-data-retention option provides similar guarantees. Consumer-tier tools (ChatGPT.com, Claude.ai) do not. Your checklist should encode which tools are approved for which data classifications — this is a falsifiable binary item for each AI-assisted review category.
For every AI-delegated category: (1) specify which tool is approved, (2) specify the verification step you take after reviewing the AI output, and (3) specify the data classification threshold above which human-only review applies. These three sub-items transform AI assistance from a trust-and-forget step into a managed handoff.
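The three sub-items fit naturally into one record per delegated category, with the data classification gate as a simple comparison. This is a sketch under stated assumptions: the classification levels, the tool name `approved-enterprise-reviewer`, and the verification wording are placeholders, not recommendations.

```python
from dataclasses import dataclass
from enum import IntEnum


class Classification(IntEnum):
    # Hypothetical levels; use your organization's own scheme.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2


@dataclass(frozen=True)
class DelegationRule:
    category: str
    approved_tool: str                   # (1) which tool is approved
    verification_step: str               # (2) what the human checks afterwards
    max_classification: Classification   # (3) above this, review is human-only


def may_delegate(rule: DelegationRule, classification: Classification) -> bool:
    """True if code at this classification may be sent to the rule's tool."""
    return classification <= rule.max_classification


rule = DelegationRule(
    category="Known vulnerability patterns",
    approved_tool="approved-enterprise-reviewer",  # placeholder name
    verification_step=("Skim findings for changed files with SQL or "
                       "file-path handling that have no comments (max 5 min)"),
    max_classification=Classification.INTERNAL,
)
```

Each rule answers a binary question before the handoff ("may this code go to this tool?") and names the human step that follows it.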
Work with the AI to design the AI-delegation section of your personal review checklist. For each category you plan to delegate, define the approved tool, the human verification step, and the data classification gate. Use your specific tech stack context.
In September 2021, Coinbase disclosed a critical bug in their advanced trading platform that could have allowed users to place orders without sufficient funds. The bug had passed multiple code reviews. In their public post-mortem, Coinbase's engineering team noted that their review checklist had no explicit item for invariant preservation in state transitions — specifically that balance checks needed to be atomic with order placement. They updated their checklist immediately. By their own account, the same class of defect had appeared in a slightly different form six months earlier and also passed review — because the checklist still didn't cover it.
This is the canonical case for post-mortem-driven checklist evolution: a defect class that recurs until it is explicitly encoded as a falsifiable checklist item.
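The invariant described in this case, that the balance check must be atomic with order placement, has a recognizable shape in a diff. The sketch below is a generic illustration using Python's `threading.Lock`; the `Account` class is hypothetical and is not a reconstruction of any real trading system.

```python
import threading


class Account:
    def __init__(self, balance: int) -> None:
        self.balance = balance
        self._lock = threading.Lock()

    def place_order_unsafe(self, cost: int) -> bool:
        # Fails the item: the check and the debit are separate steps, so
        # another thread can run between them and overdraw the account.
        if self.balance >= cost:
            self.balance -= cost
            return True
        return False

    def place_order_atomic(self, cost: int) -> bool:
        # Passes the item: the check and the debit happen under one lock.
        with self._lock:
            if self.balance >= cost:
                self.balance -= cost
                return True
            return False
```

A falsifiable checklist item for this class might read: "Is every balance check performed inside the same lock or transaction as the write that depends on it?" The reviewer can answer that by looking at where the check sits relative to the `with` block.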
A review checklist without a feedback mechanism is a static artifact that decays in relevance as your codebase and team evolve. The mechanism for keeping a checklist current is a retrospective trigger: a defined event that initiates checklist review. Three triggers cover the vast majority of cases:
Storing your checklist in a version-controlled repository — even a personal dotfiles repository — creates an automatic audit trail of how your review practice has evolved. More importantly, it enables diff-based retrospectives: when a defect class recurs, you can inspect whether the relevant checklist item existed at the time of the earlier incident.
Teams at Netflix, as described in their engineering blog posts on review culture (2019–2022), maintained checklists in their team wiki with explicit change history annotations: each item includes the date it was added, the incident or near-miss that motivated it, and the name of the engineer who added it. This context prevents item staleness — when the engineer who added an item leaves and the context is lost, items tend to become cargo-cult checkboxes.
The most common checklist evolution failure is addition without removal. Teams add items after every incident but rarely remove items after architectural changes make them obsolete. A checklist that grows monotonically will eventually exceed cognitive capacity and trigger the Type II failure described in Lesson 1.
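Context annotations also make removal tractable, because an item that nobody has reconfirmed in a long time can be flagged automatically. In the sketch below, the `last_confirmed` field and the one-year threshold are assumptions added for illustration; the other fields mirror the annotations described above.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class AnnotatedItem:
    text: str
    added_on: date
    motivating_incident: str
    added_by: str
    # Hypothetical field: the last date the team confirmed the item still applies.
    last_confirmed: date


def removal_candidates(items: List[AnnotatedItem], today: date,
                       max_age_days: int = 365) -> List[str]:
    """Return items not reconfirmed within max_age_days, for human review."""
    return [
        i.text for i in items
        if (today - i.last_confirmed).days > max_age_days
    ]
```

The function only nominates items for discussion. Whether an architectural change has actually made an item obsolete remains a team decision.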
Large language models can serve as checklist auditors when given structured prompts. A productive workflow: provide the current checklist, the post-mortem summary of a recent incident, and ask the model to identify (1) which checklist item should have caught this defect, (2) whether that item is present and falsifiable, and (3) how to rewrite it if not.
This workflow was documented in a 2023 internal engineering blog post by Shopify (shared at SREcon 2023), where their reliability engineering team described using GPT-4 to audit their review checklists against a corpus of 18 months of post-mortems. The process identified 11 items that were too abstract to be falsifiable and 4 incident classes with no corresponding checklist coverage.
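The audit workflow described above can be assembled as a plain-text prompt. The following is a minimal sketch: `build_audit_prompt` and its wording are hypothetical, and the function only builds a string rather than calling any model API.

```python
from typing import List


def build_audit_prompt(checklist_items: List[str],
                       postmortem_summary: str) -> str:
    """Combine a checklist and a post-mortem summary into an audit prompt."""
    numbered = "\n".join(
        f"{n}. {item}" for n, item in enumerate(checklist_items, start=1)
    )
    return (
        "You are auditing a code review checklist against an incident.\n\n"
        f"Current checklist:\n{numbered}\n\n"
        f"Post-mortem summary:\n{postmortem_summary}\n\n"
        "Answer three questions:\n"
        "(1) Which checklist item should have caught this defect?\n"
        "(2) Is that item present, and can it be answered yes or no "
        "from a diff?\n"
        "(3) If not, how should it be rewritten?\n"
    )
```

Keeping the three questions fixed makes audits comparable across incidents, so you can see over time whether the same gaps keep coming back.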
A calibrated checklist is one whose coverage matches the actual defect distribution of your codebase. You can measure this by tracking, over a quarter, which checklist category caught each bug found in review — and comparing that distribution to which categories appear in your incident history. Persistent mismatches indicate either missing categories or miscalibrated ordering.
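The comparison amounts to two frequency distributions and their difference per category. Here is a minimal sketch, assuming each review catch and each incident has already been tagged with a category label; the 15-point threshold is an arbitrary example, not a researched cutoff.

```python
from collections import Counter
from typing import Dict, List


def shares(labels: List[str]) -> Dict[str, float]:
    """Fraction of the total that falls in each category."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


def calibration_gaps(review_catches: List[str], incidents: List[str],
                     threshold: float = 0.15) -> Dict[str, float]:
    """Categories whose incident share differs from their review-catch share.

    A positive value means the category is over-represented in incidents
    relative to what review is catching.
    """
    review, incident = shares(review_catches), shares(incidents)
    gaps = {}
    for category in set(review) | set(incident):
        diff = incident.get(category, 0.0) - review.get(category, 0.0)
        if abs(diff) >= threshold:
            gaps[category] = round(diff, 2)
    return gaps
```

A large positive gap for a category suggests it is missing from the checklist or placed too low in the ordering. A large negative gap suggests review effort is going to a category that rarely causes incidents.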
A well-calibrated checklist has a measurable effect: SmartBear's annual State of Code Review survey (2022 edition, N=1,035 developers) found that teams that performed explicit checklist reviews after incidents reported 34% fewer repeat defect classes over a 12-month period than teams that did not.
Your checklist is not a finished artifact — it is a model of your team's accumulated knowledge about where defects live. Every incident that passes review is evidence the model is wrong. Treat checklist updates with the same rigor as code changes: version them, explain them, and review them with your team.
Work with the AI to simulate the checklist audit process. Describe a real or hypothetical incident from your domain and use the AI to identify coverage gaps in your current checklist. Then design your three retrospective triggers and decide which checklist items need context annotations.