In the 2023 Stack Overflow Developer Survey, 70% of respondents said they use or plan to use AI coding tools. Yet in the same survey, only about 42% said they trust the accuracy of AI output, and very few of those described themselves as "highly" trusting. The gap reveals a shared intuition that practitioners haven't yet turned into systematic skill: AI prose and AI correctness are not the same thing.
Large language models are trained to produce text that is statistically coherent. Fluency — smooth grammar, confident tone, well-structured paragraphs — is baked into the objective. Accuracy is not directly optimized. The result is that an LLM generating wrong code often does so with the same confident phrasing as when it generates correct code.
GitHub's internal research on Copilot (published at ICSE 2023) showed that suggestion acceptance rates do not correlate with suggestion correctness. Developers accept fluent-looking completions at similar rates regardless of whether the suggestion contains a logic error. The implication: fluency is a poor proxy for correctness, yet it dominates human first-pass judgment.
A 2022 Stanford study (Perry et al., "Do Users Write More Insecure Code with AI Assistants?") found that participants with access to an AI code assistant built on OpenAI's Codex model produced security vulnerabilities more often than the control group, and were more confident that their code was secure. Fluency bred overconfidence, not correctness.
When an AI writes "The correct approach here is…" or "You should always use…", it is reproducing phrasing that appeared in high-confidence training text. The phrase encodes no epistemic state. The model has no internal meter that flags when it is near the edge of its knowledge.
Token probability — the internal score the model assigns to each word — does correlate weakly with correctness on well-represented topics. But on niche APIs, recent library versions, or domain-specific edge cases, probability scores remain high even when the model is essentially hallucinating from plausible-looking patterns.
Experienced reviewers learn to read linguistic tells. None is definitive alone, but clusters raise the alert level:
Plausible-sounding method names that do not exist (for example, df.fillna_advanced() in pandas).
A practical technique: after receiving AI code, probe the AI's own confidence explicitly. Ask: "How certain are you this function exists in version 2.x of this library? What would I search to verify?" A well-calibrated response will produce a specific doc URL or acknowledge uncertainty. A poorly calibrated one will double down with more confident-sounding prose, which is itself diagnostic.
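The same existence question can also be put to the installed library rather than to the model. A minimal sketch, assuming the library is importable in your review environment; `identifier_exists` is a hypothetical helper name, and the standard-library `json` module stands in here for a third-party package such as pandas:

```python
import importlib


def identifier_exists(module_name: str, dotted_attr: str) -> bool:
    """Return True if module_name.dotted_attr resolves in the installed version."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False  # the module itself is missing
    # Walk each attribute in turn, e.g. "JSONDecoder.decode".
    for part in dotted_attr.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


print(identifier_exists("json", "loads"))               # a real function
print(identifier_exists("json", "JSONDecoder.decode"))  # a real method
print(identifier_exists("json", "loads_advanced"))      # a made-up name
```

A check like this only proves the name exists in the version you have installed. It says nothing about whether the function does what the AI claimed, so the documentation lookup still matters.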
Google's DeepMind team used exactly this approach when evaluating Gemini's code generation reliability (2024): they measured whether the model could correctly flag its own uncertain completions when prompted. Models that could not self-flag were far more likely to produce plausible-looking but incorrect code in production contexts.
Treat AI output as a first draft from a confident junior developer — someone who writes fluently, gets most things right, but has blind spots they can't see. Your job is not to distrust everything; it is to know which categories of claim require verification.
Below is a simulated AI code suggestion. Your job is to interrogate the AI assistant about the confidence level of specific claims in the suggestion. Ask about version compatibility, whether function names actually exist, and what you'd search to verify. The AI will respond as a code-review tutor.
Try asking: "Does df.interpolate_sparse() actually exist? How would I verify this?" Or probe any other claim you notice.
Amazon's engineering teams reported internally (later disclosed in a 2023 re:Invent session) that CodeWhisperer suggestions accepted by developers had a syntactic correctness rate above 90% but a much lower rate of "correct integration": the code worked in isolation but mishandled the surrounding system's contracts, with wrong error types propagated, incorrect assumptions about caller state, or resource handles left open. Syntax reviewers missed these because they stopped at the function boundary.
LLMs generate code token by token, optimizing local coherence. A function body looks internally consistent because the model conditions on what it has already written. But the model has weaker signal about the calling context — what the caller assumes, what invariants must hold at entry, what cleanup must happen at exit.
This is why AI-generated code frequently fails at boundaries: function signatures match but postconditions don't, resources are allocated but not released in error paths, or exceptions are caught but silently swallowed.
Contracts: Do the function's return type, exception behavior, and side effects match what callers will expect? AI often widens or narrows contracts silently, for example by returning None where callers expect an object, or by raising a new exception type not listed in the interface.
Error paths: Check every try/except or catch block. AI frequently catches broad exception classes and then fails to re-raise, log, or handle them. The "happy path" is usually correct; error paths are where AI generation degrades sharply.
Resource lifecycle: Check file handles, network connections, database transactions, and locks. AI often opens resources correctly but leaves them unclosed in non-happy paths. Look specifically at exception branches and early returns.
State assumptions: Does the code assume shared mutable state is in a particular condition? AI-generated functions sometimes assume initialization that the actual system hasn't performed, or mutate state that callers expected to be unchanged.
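Part of the error-path check can be mechanized. The sketch below uses Python's standard `ast` module to list broad exception handlers that never re-raise. The function name and the sample source are illustrative assumptions, and the heuristic only points a reviewer at lines to read: a handler that logs and recovers deliberately will also be flagged.

```python
import ast


def find_swallowed_exceptions(source: str) -> list[int]:
    """Return line numbers of broad except handlers that contain no raise."""
    broad = {"Exception", "BaseException"}
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # A bare "except:" has type None; otherwise compare the class name.
        is_broad = node.type is None or (
            isinstance(node.type, ast.Name) and node.type.id in broad
        )
        reraises = any(isinstance(n, ast.Raise) for n in ast.walk(node))
        if is_broad and not reraises:
            flagged.append(node.lineno)
    return sorted(flagged)


SAMPLE = '''
def load(path):
    try:
        return open(path).read()
    except Exception:
        return None

def load_strict(path):
    try:
        return open(path).read()
    except Exception as exc:
        raise RuntimeError("load failed") from exc
'''

print(find_swallowed_exceptions(SAMPLE))  # flags the handler in load() only
```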
In a 2023 analysis by Pearce et al. ("Examining Zero-Shot Vulnerability Repair with Large Language Models"), the most common structural failure in AI-generated security patches was incomplete error path coverage: the model fixed the happy path but left error branches with the original vulnerable pattern. The fix looked correct at first glance because the reviewer's eye went to the changed lines, not the untouched catch blocks.
When reviewing AI code, deliberately trace the error path first, not the happy path. The happy path is where the AI's training signal was strongest. Error paths are where coverage drops. Ask: "If an exception fires on line 3, what happens to every resource opened on lines 1–2?"
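To make that question concrete, here is a small simulation. `Resource` is a hypothetical stand-in for a file handle or connection, and `failing_step` plays the role of "line 3". The leaky version mirrors the pattern to look for; the safe version uses `contextlib.ExitStack` from the standard library, which is one of several ways to guarantee cleanup.

```python
from contextlib import ExitStack


class Resource:
    """Stand-in for a handle or connection; records whether it was closed."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.closed = False

    def close(self) -> None:
        self.closed = True


def open_resource(name: str, opened: list) -> Resource:
    res = Resource(name)
    opened.append(res)  # keep a reference so the caller can inspect it later
    return res


def failing_step() -> None:
    raise ValueError("simulated failure on line 3")


def leaky(opened: list) -> None:
    a = open_resource("conn", opened)  # line 1
    b = open_resource("file", opened)  # line 2
    failing_step()                     # line 3 raises
    b.close()                          # never reached
    a.close()


def safe(opened: list) -> None:
    with ExitStack() as stack:
        a = open_resource("conn", opened)
        stack.callback(a.close)        # registered cleanup runs even on error
        b = open_resource("file", opened)
        stack.callback(b.close)
        failing_step()


def closed_after_failure(fn) -> list[bool]:
    opened: list = []
    try:
        fn(opened)
    except ValueError:
        pass  # the failure is expected; only the cleanup state is of interest
    return [res.closed for res in opened]


print(closed_after_failure(leaky))  # [False, False]
print(closed_after_failure(safe))   # [True, True]
```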
Beyond individual functions, AI sometimes generates code that is architecturally mismatched: it solves the problem but introduces a pattern that conflicts with the existing codebase's conventions. Signs include:
Examine the AI-generated database connection code below. Identify structural problems across the four review layers: contracts, error paths, resource lifecycle, and state assumptions. Discuss your findings with the tutor.
In 2023, security researcher Bar Lanyado documented that ChatGPT recommended npm and PyPI package names that did not exist. The research showed how an attacker could register a malicious package under a hallucinated name, so that developers who copied the suggestion and ran npm install would download attacker-controlled code. The hallucination wasn't random noise; it was a plausible name assembled from real package naming patterns, which made it both convincing and exploitable.
The word "hallucination" is used loosely to mean any AI error. But different error types have different detection strategies and different risk profiles. Treating all AI errors as the same category leads to inefficient review — checking the wrong things, in the wrong order, at the wrong level of scrutiny.
Type 1 (fabricated identifiers): Fails with an AttributeError or import failure, so it is immediately visible. Severity: Low if caught in testing.
Type 3 (semantic inversions): > instead of <, AND instead of OR, or the wrong variable used in a comparison. Most dangerous because it passes compilation, linting, and sometimes even tests. Detection requires reading logic against requirements, not just reading code.
Type 5 (security confabulations): For example, MD5 labeled as "hashing for verification" in a context where collision resistance matters. Documented extensively by Pearce et al. and the Stanford Perry study.
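For the MD5 case, a minimal sketch of a safer pattern, assuming the goal is to verify a payload against a known digest. The helper names are hypothetical, and SHA-256 is an assumption about what the context requires; password storage, for instance, calls for a dedicated key-derivation function instead.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Digest with a collision-resistant hash, where MD5 would not be."""
    return hashlib.sha256(data).hexdigest()


def verify_digest(data: bytes, expected_hex: str) -> bool:
    """Compare digests in constant time to avoid leaking match position."""
    return hmac.compare_digest(sha256_hex(data), expected_hex.lower())


payload = b"abc"
known = sha256_hex(payload)

print(verify_digest(payload, known))  # matching payload
print(verify_digest(b"abd", known))   # tampered payload
```

In review, the thing to check is less the function call than the label: if a comment says "verification" or "secure" and the primitive underneath is MD5 or SHA-1, the comment is the confabulation.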
Type 1: Fabricated identifiers fail at import or first call. Any run of the code surfaces them. Low priority in static review, since testing catches them.
Type 3: Semantic inversions look correct to a casual reader and may pass tests designed by the same person who wrote the requirements. They require deliberate logic tracing.
Type 5: Security confabulations pass all functional tests (they work for the happy path) and fail only under adversarial conditions or cryptographic analysis. They may sit undetected for months.
Type 2: Stale API references may work correctly on the developer's local machine if their library version is old, then fail in production on the latest version. The library version in the CI environment matters.
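Type 3 is the one that most rewards a worked example. The sketch below assumes a made-up requirement, "allow a request while the caller has made fewer than `limit` requests", and shows how a single test at the boundary can pass for both the inverted and the correct comparison, while a small table of cases derived from the requirement separates them. All names are illustrative.

```python
def allow_request_inverted(count: int, limit: int) -> bool:
    return count > limit  # reads plausibly, but the comparison is inverted


def allow_request(count: int, limit: int) -> bool:
    return count < limit  # matches the stated requirement


# A lone test at count == limit passes for BOTH versions.
assert allow_request_inverted(5, 5) is False
assert allow_request(5, 5) is False

# Cases written from the requirement, covering both sides of the boundary.
CASES = [((0, 5), True), ((4, 5), True), ((5, 5), False), ((6, 5), False)]


def failing_cases(fn) -> list[tuple[int, int]]:
    """Return the inputs for which fn disagrees with the requirement."""
    return [args for args, expected in CASES if fn(*args) != expected]


print(failing_cases(allow_request_inverted))  # [(0, 5), (4, 5), (6, 5)]
print(failing_cases(allow_request))           # []
```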
A 2024 Snyk report analyzed 500 AI-generated code samples across five languages. Type 5 (security confabulation) accounted for 28% of identified vulnerabilities and had the longest mean time to detection — an average of 47 days between introduction and discovery, compared to under 24 hours for Type 1.
Rather than reviewing AI code uniformly, prioritize by type and context:
Check version-sensitive API calls against the version pinned in your requirements.txt or package.json, not just "the library's docs."
The hallucination that costs the most is not the one that crashes the program; it's the one that runs silently with wrong results. Type 3 and Type 5 are your highest-priority review targets in AI-generated code.
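To know which version of the docs to read, start from the pin itself. A minimal sketch that pulls exact `==` pins out of requirements.txt-style text; the function name and sample contents are assumptions, and the parser deliberately ignores ranges, extras, and environment markers.

```python
import re

# Matches lines such as "pandas==1.5.3" and tolerates spaces and trailing comments.
PIN = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*([^\s;#]+)")


def pinned_versions(requirements_text: str) -> dict[str, str]:
    """Map lowercase package name to its exactly pinned version."""
    pins: dict[str, str] = {}
    for line in requirements_text.splitlines():
        match = PIN.match(line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


REQUIREMENTS = """\
# project dependencies
pandas==1.5.3  # pinned
requests>=2.31
NumPy == 1.26.4
"""

pins = pinned_versions(REQUIREMENTS)
print(pins.get("pandas"))    # the version whose docs you should read
print(pins.get("requests"))  # None: a range, so check what CI actually installs
```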
Review the two AI-generated snippets below. Each contains a different hallucination type. Classify the error type in each, explain why it's dangerous, and describe how you'd verify or fix it. Discuss with the tutor.
In a 2024 blog post and associated internal engineering documentation (referenced at the ICML 2024 workshop on AI-assisted development), Google DeepMind described how teams reviewing AI-generated code adopted tiered review protocols rather than uniform scrutiny. Code touching cryptography, authentication, or external data parsing received mandatory secondary human review plus automated static analysis. Code for internal utilities received an expedited single-reviewer pass. The tiering reduced review time by approximately 30% without increasing post-deployment defect rates.
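A tiering rule like this can start as a few lines of code run against the changed file paths. The sketch below is an assumption about how one might encode it, not a description of any team's actual tooling: the keywords, the directory prefixes, and the middle tier for "everything else" are all hypothetical.

```python
# Hypothetical keywords for security-sensitive code paths.
SENSITIVE_KEYWORDS = ("crypto", "auth", "parser")
# Hypothetical locations of internal utilities.
UTILITY_PREFIXES = ("tools/", "scripts/")


def review_tier(path: str) -> int:
    """Return 1 (secondary review + static analysis), 2 (default), or 3 (expedited)."""
    lowered = path.lower()
    if any(keyword in lowered for keyword in SENSITIVE_KEYWORDS):
        return 1
    if lowered.startswith(UTILITY_PREFIXES):
        return 3
    return 2


def change_tier(paths: list[str]) -> int:
    """A change is reviewed at the strictest tier of any file it touches."""
    return min((review_tier(p) for p in paths), default=2)


print(review_tier("src/auth/login.py"))
print(review_tier("tools/format_logs.py"))
print(change_tier(["tools/format_logs.py", "src/crypto/keys.py"]))
```

Substring matching is crude ("auth" also matches "author"), so a real rule set would likely err toward over-classifying into Tier 1 and let a human downgrade.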
Individual reviewers applying their own intuition produce inconsistent outcomes. One engineer checks error paths thoroughly; another focuses on naming and style. Neither approach is complete, and the gaps aren't visible until a bug ships. When AI tooling increases code output volume — which it reliably does — the gaps scale proportionally.
Microsoft's DevDiv team published findings in 2023 showing that teams using Copilot without updated review processes saw a higher rate of review-escaped defects than teams that also updated their review checklists to account for AI-specific failure modes. The tooling improved throughput; the review process hadn't kept pace.
Building on Lesson 1, expert reviewers standardize calibration probing for Tier 1 code. Before accepting a security-relevant AI suggestion, they ask the AI itself:
If the AI produces confident-sounding answers without citing verifiable sources, that response itself is a signal to increase scrutiny — not a green light.
Personal protocols become organizational value when they're written down and shared. The practical format is a review checklist embedded directly in your pull-request template — not a separate document, but a required field in the PR description. Teams at Stripe, Shopify, and Atlassian have published variants of this approach publicly in engineering blog posts (2023–2024), citing improved review consistency and faster onboarding of new reviewers.
Add a required checkbox section: "If AI-generated code is included, confirm: [ ] Tier classified [ ] Hallucination sweep completed [ ] Error paths traced [ ] Security primitives verified against docs."
GitHub, GitLab, and Phabricator all support custom review checklists. Marking these as required before merge creates an audit trail and prevents review steps from being silently skipped under deadline pressure.
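If your platform cannot enforce the checklist natively, a small script in CI can. This is a sketch under assumptions: it takes the PR description as a plain string (how you fetch it depends on your platform), expects the four checklist labels given above, and treats `[x]` or `[X]` as checked.

```python
import re

REQUIRED_ITEMS = (
    "Tier classified",
    "Hallucination sweep completed",
    "Error paths traced",
    "Security primitives verified against docs",
)

# Captures the label after a checked box, up to the next box or line end.
CHECKED = re.compile(r"\[[xX]\]\s*([^\[\n]+)")


def unchecked_items(pr_body: str) -> list[str]:
    """Return the required checklist items that are not ticked in the PR body."""
    checked = {m.group(1).strip().lower() for m in CHECKED.finditer(pr_body)}
    return [item for item in REQUIRED_ITEMS if item.lower() not in checked]


PR_BODY = """\
If AI-generated code is included, confirm:
- [x] Tier classified
- [X] Hallucination sweep completed
- [ ] Error paths traced
- [x] Security primitives verified against docs
"""

missing = unchecked_items(PR_BODY)
print(missing)  # a CI job could fail the build when this list is non-empty
```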
AI tooling raises code velocity. Faster generation creates implicit pressure to review faster too. The reviewers who resist this pressure, maintaining protocol depth even as merge volume increases, are the ones who catch the expensive bugs. Protect time for Phase 3, the integration check, explicitly; it is the first thing cut under pressure and the last thing that should be.
Reading AI output like an expert is not a single skill; it is a stack. Fluency-illusion awareness (Lesson 1) → structural pattern recognition (Lesson 2) → hallucination taxonomy (Lesson 3) → systematic protocol (Lesson 4). Each layer catches what the layers above miss. The protocol encodes all four into a repeatable, teachable process.
Design your own three-phase review protocol for AI-generated code in a specific context — your current team, a side project, or a hypothetical company. Describe the tier classification criteria, the hallucination sweep priorities, and the integration check scope. The tutor will push back on gaps and help you refine it.