An Amazon internal post-mortem on CodeWhisperer-assisted development noted a recurring problem: generated functions were syntactically correct and passed unit tests, yet reviewers consistently flagged them for missing intent documentation — no explanation of why a particular algorithm was chosen, what edge cases the author had consciously accepted, or what the function was not designed to handle. The code said what; nothing said why.
Documentation in professional codebases serves at least three distinct purposes: it records intent (what the author meant to accomplish), constraints (what the code explicitly does not handle), and rationale (why this approach rather than alternatives). A language model trained to predict syntactically plausible completions has no mechanism for generating authentic content in any of these categories.
When an LLM writes a docstring, it is producing a statistically likely description of what a function with that name and signature would probably do — not a record of deliberate design choices. This distinction becomes critical during incident response: engineers reading generated code have no documented trail of the original assumptions.
A model can describe what code does by reading it. It cannot describe why the code was written this way rather than another — because that reason lives in the human's mind, not the token stream.
A 2021 analysis by researchers at NYU studying early Copilot output found that generated code produced docstrings at a rate comparable to human-written code — but those docstrings were almost exclusively descriptive (restating what parameters were) rather than normative (explaining constraints, failure modes, or intent). Human-written docstrings in the same dataset included constraint language ("does not handle null input"), assumption language ("assumes sorted input"), and rationale ("uses insertion sort for n<10 for cache efficiency") at rates ten to twenty times higher.
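Reviewers should watch for three specific absences in AI-generated documentation: constraint language (what the code does not handle), assumption language (what the inputs or environment must satisfy), and rationale (why this approach was chosen over the alternatives).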
For every function in an AI-assisted pull request, ask: if the original developer were unavailable, could an engineer reading only this code understand what it deliberately excludes? If not, documentation is incomplete regardless of how syntactically correct the code is.
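To make the contrast between descriptive and normative documentation concrete, here is a minimal sketch of a docstring that carries assumption, constraint, and rationale language. The function `find_first_at_or_above` and its calling scenario are hypothetical, invented only for illustration.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def find_first_at_or_above(thresholds: Sequence[int], value: int) -> Optional[int]:
    """Return the index of the first threshold >= value, or None if there is none.

    Assumptions:
        thresholds is sorted in ascending order. This is NOT checked here,
        because checking would cost O(n) on every call.

    Constraints (deliberately not handled):
        Unsorted input gives an arbitrary result rather than an error.
        None entries are not supported.

    Rationale:
        Binary search (bisect_left) was chosen over a linear scan because
        the hypothetical caller performs many lookups on one large,
        already sorted list.
    """
    index = bisect_left(thresholds, value)
    return index if index < len(thresholds) else None
```

A purely descriptive docstring would stop after the first line. Everything below it is the part a reviewer has to supply or elicit from the human author.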
Code review is not merely defect detection — it is the primary checkpoint for ensuring that institutional knowledge is encoded into the codebase. When AI generates the code, no institutional knowledge was ever present to begin with. The reviewer must supply or elicit it. This means the reviewer's job expands: not only checking correctness but reconstructing and documenting intent before the PR merges.
Teams at Stripe and Shopify that publicly discussed their AI code review policies in 2023 each independently converged on the same rule: AI-assisted PRs require a mandatory documentation section in the PR description that a human author explicitly writes — not generated by the model.
You'll be given short AI-generated code snippets. Your job is to identify what documentation is missing and explain what a human reviewer would need to add before the PR could be approved.
The AI assistant will give you a code snippet. You analyze the documentation gaps, and it responds with feedback and a new snippet. Complete at least 3 exchanges.
Air Canada's AI chatbot told a customer that a bereavement fare discount could be applied retroactively after purchase. The chatbot's response was generated from policy documents — but the model inferred a policy behavior that did not exist, encoding an assumption about how the refund system worked. The assumption was never documented. Air Canada's customer service team had no record that the bot was operating on this inference. A British Columbia tribunal ruled Air Canada liable for the refund.
The deeper engineering lesson: the system's assumed behavior was nowhere written down — not in the bot's configuration, not in any system document, not in any code comment. Nobody had audited what the model silently believed about the refund workflow.
An assumption audit is the practice of systematically identifying the implicit beliefs encoded in AI-generated code — beliefs about data formats, environmental conditions, system states, user behavior, and external service behavior that the code relies on but does not verify or document.
Unlike a traditional code review, which checks correctness against stated requirements, an assumption audit asks: what must be true about the world for this code to work correctly? The answers are almost never written in the code itself.
An assumption audit catalogs every implicit precondition in a piece of code — data shape, range, encoding, concurrency model, API contract, user privilege level — and verifies that each assumption is either validated at runtime or explicitly documented as a known constraint.
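One possible way to keep such a catalog is as structured data, so that unresolved assumptions can be listed mechanically. The sketch below is an illustration only: the `Assumption` and `Status` types and the sample entries are hypothetical, not part of any established tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Status(Enum):
    VALIDATED = "validated at runtime"
    DOCUMENTED = "documented as a known constraint"
    UNRESOLVED = "neither validated nor documented"


@dataclass(frozen=True)
class Assumption:
    category: str   # e.g. "data shape", "encoding", "API contract"
    statement: str  # what the code needs to be true
    status: Status


def unresolved(catalog: List[Assumption]) -> List[Assumption]:
    """Return the assumptions that still need a guard or a documented constraint."""
    return [a for a in catalog if a.status is Status.UNRESOLVED]


# Hypothetical catalog for a single function under review.
catalog = [
    Assumption("data shape", "payload has an 'items' list", Status.VALIDATED),
    Assumption("encoding", "names are UTF-8", Status.DOCUMENTED),
    Assumption("API contract", "service always returns 'total'", Status.UNRESOLVED),
]

for item in unresolved(catalog):
    print(f"UNRESOLVED [{item.category}]: {item.statement}")
```

The audit is complete for a function when `unresolved` returns an empty list, meaning every named assumption has either a guard or a written constraint.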
The 2011 NASA/NHTSA analysis of the source code for Toyota's Electronic Throttle Control System, which covered more than 280,000 lines of embedded C, identified over 7,000 violations of MISRA C coding standards. More critically, expert witnesses in the 2013 Bookout v. Toyota trial testified that the code contained numerous implicit assumptions about task scheduling timing that were never documented. When real-world timing violated those undocumented assumptions, the system could enter unintended states. This case, though predating LLMs, is widely cited for a legal and engineering principle: undocumented assumptions in safety-relevant code constitute a defect, regardless of whether the code is otherwise syntactically correct.
A structured audit should examine five categories:
Data shape assumptions: The code expects a specific JSON structure, column order, or array length. The model generated code matching its training data patterns, not your actual schema.
Range and encoding assumptions: Numeric ranges (age must be positive), string encodings (assumes UTF-8), date formats (assumes ISO 8601). None verified, none documented.
Concurrency assumptions: The code assumes single-threaded execution, or that a shared resource is accessed sequentially, without documenting or enforcing that constraint.
External service assumptions: The code assumes an API will always return a specific field, or that a service will respond within a timeout window, without defensive handling or documentation.
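As a rough illustration of how these categories look once they are named, the sketch below turns each one into either a runtime guard or a written constraint. The function `total_quantity` and the payload layout are hypothetical.

```python
from typing import Any, Dict


def total_quantity(payload: Dict[str, Any]) -> int:
    """Sum the 'quantity' of every entry in payload['items'].

    Concurrency constraint (documented, not enforced): the caller must not
    mutate payload from another thread while this function runs.
    """
    # Data shape: 'items' must exist and be a list of objects.
    items = payload.get("items")
    if not isinstance(items, list):
        raise ValueError("payload['items'] must be a list")

    total = 0
    for position, item in enumerate(items):
        if not isinstance(item, dict):
            raise ValueError(f"items[{position}] must be an object")
        # External service: the upstream API may omit 'quantity'.
        if "quantity" not in item:
            raise ValueError(f"items[{position}] has no 'quantity' field")
        quantity = item["quantity"]
        # Range: quantity must be a non-negative integer (bool is rejected).
        if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity < 0:
            raise ValueError(f"items[{position}]['quantity'] must be an int >= 0")
        total += quantity
    return total
```

Whether each assumption deserves a guard or only a documented constraint is a judgment call for the reviewer; the sketch simply shows both outcomes side by side.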
A practical assumption audit proceeds in three passes. In the first pass, read the function signature and body and list every variable or parameter that the code does not validate before using. In the second pass, trace every external call and ask what the code assumes about the response. In the third pass, read any error handling — the absence of error handling often reveals the most consequential assumptions: that the operation will always succeed.
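The first pass can be partly mechanized. The sketch below uses Python's standard `ast` module to list parameters that never appear in the test of an `if` or `assert` statement. It is a heuristic under stated assumptions, not a complete audit: a parameter that appears in a test is not necessarily validated well, validation delegated to a helper is invisible to it, and `*args`/`**kwargs` are ignored. The sample function `apply_discount` is hypothetical.

```python
import ast
from typing import Dict, List

# Hypothetical AI-generated snippet to audit. It is parsed, never executed.
SNIPPET = '''
def apply_discount(price, percent, currency):
    if percent < 0 or percent > 100:
        raise ValueError("percent out of range")
    return f"{price * (1 - percent / 100):.2f} {currency}"
'''


def unchecked_parameters(source: str) -> Dict[str, List[str]]:
    """Map each function name to parameters never mentioned in an if/assert test."""
    report: Dict[str, List[str]] = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        args = node.args
        params = [a.arg for a in args.posonlyargs + args.args + args.kwonlyargs]
        tested = set()
        for inner in ast.walk(node):
            if isinstance(inner, (ast.If, ast.Assert)):
                # Collect every name used inside the condition expression.
                for name in ast.walk(inner.test):
                    if isinstance(name, ast.Name):
                        tested.add(name.id)
        report[node.name] = [p for p in params if p not in tested]
    return report


print(unchecked_parameters(SNIPPET))
# prints {'apply_discount': ['price', 'currency']}
```

The output is a starting list for the reviewer, who still has to decide what the code assumes about `price` and `currency` and whether that is acceptable.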
Microsoft's responsible AI team documented this three-pass method in their internal engineering playbooks in 2023 after finding that AI-generated service integration code consistently omitted handling for partial failures — encoding the assumption that external calls either fully succeed or throw an exception, ignoring the common case of partial or malformed responses.
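A minimal sketch of what handling partial or malformed responses can look like follows. The response layout (a `results` list whose entries carry `id` and `status`) is an assumption made for this example, and `parse_batch_response` is a hypothetical helper, not an API of any real service.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class BatchResult:
    accepted: List[Dict[str, Any]] = field(default_factory=list)
    rejected: List[str] = field(default_factory=list)  # one reason per bad entry

    @property
    def is_partial(self) -> bool:
        """True when some entries were usable and some were not."""
        return bool(self.accepted) and bool(self.rejected)


def parse_batch_response(response: Any) -> BatchResult:
    """Split a response into usable and unusable entries instead of assuming all-or-nothing."""
    result = BatchResult()
    entries = response.get("results") if isinstance(response, dict) else None
    if not isinstance(entries, list):
        result.rejected.append("response has no 'results' list")
        return result
    for position, entry in enumerate(entries):
        if not isinstance(entry, dict):
            result.rejected.append(f"results[{position}] is not an object")
        elif not isinstance(entry.get("id"), str):
            result.rejected.append(f"results[{position}] has no string 'id'")
        elif entry.get("status") != "ok":
            result.rejected.append(f"results[{position}] status is {entry.get('status')!r}")
        else:
            result.accepted.append(entry)
    return result


outcome = parse_batch_response(
    {"results": [{"id": "a1", "status": "ok"}, {"id": "b2"}, "garbage"]}
)
print(outcome.is_partial, len(outcome.accepted), len(outcome.rejected))
```

The design choice here is to return the rejected entries with reasons rather than raise on the first problem, so the caller is forced to decide what a partial result means for it.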
Wherever AI-generated code has no validation, no guard clause, and no error branch, there is an assumption. The reviewer's job is to name it, decide whether it is acceptable, and either document it as a constraint or add the missing guard.
You'll apply the three-pass audit method to AI-generated code snippets. For each snippet, identify data shape assumptions, range/encoding assumptions, concurrency assumptions, and external service assumptions.
The assistant will provide code. You audit it using the three-pass method, and it evaluates your audit and provides the next snippet. Complete at least 3 exchanges.
Google DeepMind's 2023 internal engineering guidance for AI-assisted code specifically introduced the concept of retroactive design documentation — a requirement that any PR where more than 50% of the code was AI-generated must include a "Decision Record" attachment. The record had to answer three questions: why this approach rather than alternatives, what constraints were accepted, and what a future engineer would need to know to safely modify this code. The model could not fill in this document; the human author had to write it.
Architecture Decision Records — popularized by Michael Nygard's 2011 blog post and widely adopted at Thoughtworks — are short documents that record a significant architectural decision: the context, the decision, the consequences, and the alternatives considered. They were designed to preserve institutional memory when human engineers make complex choices.
The challenge with AI-generated code is that ADRs assume a human made a decision. When the model generated the code, there was no decision — there was a generation. The reviewer's task is to reverse-engineer the decision that would justify the code, assess whether that decision is defensible, and write it down. This is fundamentally different from documenting a decision you made.
Reviewing AI code without writing a retroactive decision record leaves a gap in institutional memory that cannot be reconstructed later. Future engineers will not know whether an implementation detail was intentional, incidental, or a model artifact.
During the 2015–2016 Volkswagen emissions scandal investigation, regulatory engineers discovered that the defeat device code contained no documentation distinguishing intentional behavior (detecting test cycles) from normal operation. Investigators had to reconstruct intent forensically — examining branch conditions, timing logic, and sensor thresholds to infer what the code was designed to do. The absence of documentation did not make the behavior legal; it made reconstruction expensive and left Volkswagen unable to credibly argue any alternative interpretation.
While the VW case involved intentional fraud, the documentation lesson applies to AI-generated code: undocumented behavior will be reconstructed by others under adversarial conditions. Better to document intent precisely when writing than to leave it to forensic inference.
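A retroactive decision record for AI-generated code should answer four questions explicitly: what non-obvious choices were made, what alternatives exist, what constraints are accepted, and what a future engineer needs to know to modify the code safely.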
Sometimes the retroactive decision record process reveals that the AI's implementation choice cannot be defended for the target context. The model used a linked list where the use case demands O(1) indexed access. The model chose a recursive implementation whose call stack could overflow on deep inputs. The model hardcoded a timeout that is wrong for the production SLA.
In these cases, the correct response is not to write a document defending the indefensible — it is to reject the AI-generated implementation and require a human-authored replacement. The retroactive decision record process is diagnostic: it will surface implementations that were statistically plausible but contextually wrong.
If you cannot write a defensible retroactive decision record for an AI-generated implementation, do not merge it. The inability to justify the decision is evidence that the implementation is wrong for your context, not merely under-documented.
Decision records can live inline (as extended block comments above the function) or externally (as ADR files in a docs/ directory). The choice depends on team convention. What matters is that the record is version-controlled alongside the code, so that a change to the code is reviewed together with the record that explains it. A decision record on a separate wiki page will drift and become misleading; one in the repository will at minimum be visible during code review of future changes.
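For the inline form, a decision record might look like the sketch below. The function `reachable`, the scenario, and the wording of the record are all hypothetical; the point is only the shape of the comment block and where it sits.

```python
from typing import Dict, List

# DECISION RECORD (written by the human reviewer, not generated)
# Non-obvious choice: the traversal uses an explicit stack, not recursion.
# Alternatives considered: recursive depth-first search (shorter, but bounded
#   by the interpreter's recursion limit on deep graphs); breadth-first search.
# Constraints accepted: visit order differs from the recursive version;
#   nodes absent from `graph` are treated as leaves.
# For future engineers: callers in this hypothetical codebase depend only on
#   the set of reachable nodes, not on order. Re-check that before changing it.
def reachable(graph: Dict[str, List[str]], start: str) -> List[str]:
    """Return every node reachable from start, in visit order."""
    seen = {start}
    order = []
    stack = [start]
    while stack:
        node = stack.pop()
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return order
```

Because the record sits directly above the function, any diff that touches the traversal shows the record in the same review, which is the property the text asks for.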
Given AI-generated code snippets, you will write a retroactive decision record answering: what non-obvious choices were made, what alternatives exist, what constraints are accepted, and what a future engineer needs to know.
The assistant will provide code, evaluate your decision record, and guide you toward completeness. Complete at least 3 exchanges.
Palantir's 2023 engineering blog post on AI code integration described a documentation review checklist they developed after six months of incidents with AI-assisted PRs. The checklist had three sections: Intent Documentation (does the code explain what it is designed to do and not do), Assumption Documentation (are all implicit preconditions named), and Change Safety (does the documentation give a future engineer enough context to safely modify the code without breaking undocumented invariants). PRs missing any section were returned without review.
Atul Gawande's 2009 research on surgical checklists — and the subsequent World Health Organization Surgical Safety Checklist adoption — demonstrated that expert practitioners under cognitive load systematically skip steps they know to be important. The same dynamic applies to code review: senior engineers reviewing complex AI-generated code will focus cognitive effort on correctness and security, and documentation gaps will be deprioritized under time pressure.
A documentation checklist forces explicit attention to documentation quality as a separate review pass, not an afterthought. It also creates a common standard across a team — preventing the situation where documentation rigor depends entirely on individual reviewer preference.
Based on documented practices from Palantir, Stripe, and Google DeepMind, the following five elements should appear on every AI code documentation review checklist:
The 2012 Knight Capital Group incident, in which a software deployment error caused $440 million in losses in 45 minutes, was partially attributable to undocumented legacy code behavior. A flag repurposed by the new code had previously activated dormant functionality, known as Power Peg, in Knight's SMARS order router; no documentation connected the old behavior to the new deployment. Engineers had no way to know from documentation alone that reusing the flag would activate dormant code. Knight did not survive as an independent firm: it needed an emergency capital infusion within days and was acquired the following year.
While this predates LLMs, it illustrates the category of failure: when code behavior is not documented, future modifications operate on incomplete information. In AI-generated code, this risk is compounded — the original implementation had no human author who could be consulted. The documentation checklist is the only defense.
Effective integration requires three structural changes to PR review workflow. First, the checklist should be embedded in the PR template — not a separate document, but a section authors must complete before requesting review. Second, reviewers should make a separate documentation review pass before the correctness review pass, not simultaneously. Third, documentation failures should be blocking — the same status as a failing test — not advisory comments that authors can resolve at their discretion.
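The first and third changes can be combined in a small check that a CI job could run against the PR description. The sketch below is one possible shape, under stated assumptions: the `## Heading` template format and the placeholder words are invented for this example, and the three section names are taken from the Palantir checklist described earlier.

```python
import re
from typing import List

REQUIRED_SECTIONS = ["Intent Documentation", "Assumption Documentation", "Change Safety"]
PLACEHOLDERS = {"", "todo", "tbd", "n/a"}  # assumed markers of an unfilled section


def missing_sections(pr_description: str) -> List[str]:
    """Return required sections that are absent or left as a placeholder."""
    # Split on lines that look like '## Heading' (the heading format is an assumption).
    parts = re.split(r"^##[ \t]+(.+?)[ \t]*$", pr_description, flags=re.MULTILINE)
    # parts is [preamble, heading1, body1, heading2, body2, ...]
    bodies = {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}
    return [
        name for name in REQUIRED_SECTIONS
        if bodies.get(name, "").lower() in PLACEHOLDERS
    ]


description = """## Intent Documentation
Retries failed webhook deliveries; does not deduplicate events.

## Assumption Documentation
TODO

## Testing
Unit tests added.
"""

print(missing_sections(description))
# prints ['Assumption Documentation', 'Change Safety']
```

To make the check blocking rather than advisory, the calling script would exit with a non-zero status whenever the returned list is non-empty, which gives it the same standing as a failing test.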
The most common reason documentation checklists fail is that they are treated as advisory rather than blocking. If a reviewer can approve a PR while noting "documentation could be improved," the checklist will not change behavior. Make it blocking or don't use it.
Not all AI-generated code warrants the same documentation rigor. A configuration helper script that is trivially replaceable carries different risk than a payment processing function or a data retention policy enforcement routine. Teams should define documentation tiers — typically three: low-risk utilities, medium-risk business logic, and high-risk safety/security/compliance-adjacent code — and apply proportionally scaled checklists. High-risk code should require all five elements plus external ADR files; low-risk code might require only an intent statement and assumption catalog.
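Tier definitions are easier to enforce when they are written down as data. The sketch below is one hedged way to do that. Two points are assumptions: the medium tier is not specified in the text, so it is mapped here to the full checklist without an ADR file, and `TEAM_CHECKLIST` is a hypothetical, incomplete stand-in for a team's own list of elements.

```python
from enum import Enum
from typing import FrozenSet, Iterable


class RiskTier(Enum):
    LOW = "low-risk utility"
    MEDIUM = "medium-risk business logic"
    HIGH = "high-risk safety/security/compliance-adjacent code"


# The two elements the text names as a possible minimum for low-risk code.
LOW_TIER_ELEMENTS = frozenset({"intent statement", "assumption catalog"})


def required_documentation(tier: RiskTier, full_checklist: Iterable[str]) -> FrozenSet[str]:
    """Return the documentation elements a PR in the given tier must include."""
    full = frozenset(full_checklist)
    if tier is RiskTier.LOW:
        return LOW_TIER_ELEMENTS
    if tier is RiskTier.MEDIUM:
        # Assumption: medium-risk code needs every element but no external ADR.
        return full
    # High-risk code: every element plus an external ADR file.
    return full | {"external ADR file"}


# Hypothetical placeholder; a real team would supply its own complete checklist.
TEAM_CHECKLIST = ["intent statement", "assumption catalog", "change-safety notes"]

print(sorted(required_documentation(RiskTier.HIGH, TEAM_CHECKLIST)))
```

Passing the full checklist in as a parameter keeps the tier logic separate from the checklist contents, so a team can revise its elements without touching the tier rules.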
Documentation review for AI-generated code is not about style or thoroughness — it is about ensuring that the codebase contains enough information for the organization to understand, defend, and safely evolve every system component, even after the people who reviewed it have left.
You'll design a documentation review checklist for a specific context the assistant gives you: a payment processing service, a data pipeline, or a public API. Your checklist must include all five elements, specify blocking vs. advisory status for each, and calibrate rigor to risk tier.
The assistant will evaluate your checklist against real-world cases and ask you to refine it. Complete at least 3 exchanges.