Lesson 1 · Module 2

Confidence vs. Correctness

Why AI output sounds certain even when it is wrong — and how to detect the gap.

How do you tell the difference between an AI that knows and an AI that merely sounds like it knows?

In the 2023 Stack Overflow Developer Survey, 70% of respondents said they use or plan to use AI tools in their development process. Yet in the same survey, only about 42% said they trust the accuracy of AI output, and very few of those said they "highly trust" it. The gap reveals a shared intuition that practitioners haven't yet turned into systematic skill: AI prose and AI correctness are not the same thing.

The Fluency Illusion

Large language models are trained to produce text that is statistically coherent. Fluency — smooth grammar, confident tone, well-structured paragraphs — is baked into the objective. Accuracy is not directly optimized. The result is that an LLM generating wrong code often does so with the same confident phrasing as when it generates correct code.

GitHub's internal research on Copilot (published at ICSE 2023) showed that suggestion acceptance rates do not correlate with suggestion correctness. Developers accept fluent-looking completions at similar rates regardless of whether the suggestion contains a logic error. The implication: fluency is a poor proxy for correctness, yet it dominates human first-pass judgment.

Key Finding

A 2022 Stanford study (Perry et al., "Do Users Write More Insecure Code with AI Assistants?") found that participants with access to an AI code assistant based on an OpenAI Codex model wrote less secure code than the control group, and were more likely to believe their code was secure. Fluency bred overconfidence, not correctness.

What "Confidence" Actually Encodes

When an AI writes "The correct approach here is…" or "You should always use…", it is reproducing phrasing that appeared in high-confidence training text. The phrase encodes no epistemic state. The model has no internal meter that flags when it is near the edge of its knowledge.

Token probability, the internal score the model assigns to each token it generates, does correlate weakly with correctness on well-represented topics. But on niche APIs, recent library versions, or domain-specific edge cases, probabilities can remain high even when the model is essentially hallucinating from plausible-looking patterns.

Surface Signals of False Confidence

Experienced reviewers learn to read linguistic tells. None is definitive alone, but clusters raise the alert level:

🔴 Absolute phrasing without citation: "This is the standard way," "always," "never," "guaranteed to work."

🔴 Version-agnostic claims on version-sensitive APIs: Saying a method exists without specifying which version of the library introduced or removed it.

🔴 Fabricated function names that sound real: Names assembled from plausible components (e.g., df.fillna_advanced() in pandas, which doesn't exist).

🟡 Hedged phrasing buried after a confident opener: "The correct solution is X (though you may need to adjust for your specific case)." The hedge is real; the opener misleads.

🟢 Explicit uncertainty markers: "I'm not certain which version introduced this," "you should verify in the docs." These are honest signals; treat them seriously.
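The linguistic tells above can be turned into a rough first-pass scan. The sketch below is only an illustration: the phrase lists are assumptions taken from the examples in this lesson, not a validated lexicon, and a match is a prompt to look closer rather than a verdict.

```python
import re

# Illustrative phrase lists (assumptions drawn from the examples above).
ABSOLUTE_PATTERNS = [
    r"\balways\b",
    r"\bnever\b",
    r"\bguaranteed to work\b",
    r"\bthe standard way\b",
    r"\bthe correct (?:approach|solution)\b",
]
UNCERTAINTY_PATTERNS = [
    r"\bI'm not (?:certain|sure)\b",
    r"\bverify in the docs\b",
    r"\byou should verify\b",
    r"\bmay need to adjust\b",
]


def find_tells(text: str) -> dict[str, list[str]]:
    """Return the absolute and uncertainty phrases found in `text`."""
    def matches(patterns):
        found = []
        for pattern in patterns:
            found.extend(m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE))
        return found

    return {
        "absolute": matches(ABSOLUTE_PATTERNS),
        "uncertain": matches(UNCERTAINTY_PATTERNS),
    }
```

A response with several "absolute" hits and no "uncertain" hits on a version-sensitive API is the cluster worth a second look; a single hit on its own says little.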

The Calibration Test

A practical technique: after receiving AI code, probe the AI's own confidence explicitly. Ask: "How certain are you this function exists in version 2.x of this library? What would I search to verify?" A well-calibrated response will produce a specific doc URL or acknowledge uncertainty. A poorly calibrated one will double down with more confident-sounding prose — which is itself diagnostic.

Google's DeepMind team used exactly this approach when evaluating Gemini's code generation reliability (2024): they measured whether the model could correctly flag its own uncertain completions when prompted. Models that could not self-flag were far more likely to produce plausible-looking but incorrect code in production contexts.
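Alongside probing the model, a claim that an identifier exists can be checked directly against what is installed. The helper below is a minimal sketch; `identifier_exists` is a hypothetical name, and it only reports on the version present in the current environment, which may differ from production.

```python
import importlib


def identifier_exists(module_name: str, dotted_attr: str) -> bool:
    """Return True if `module_name` imports and `dotted_attr` resolves on it.

    Only call this for packages you already trust: importing a module runs its code.
    """
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False
    # Walk the dotted path one attribute at a time, e.g. "DataFrame.fillna".
    for part in dotted_attr.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True
```

With pandas installed, `identifier_exists("pandas", "DataFrame.fillna")` and `identifier_exists("pandas", "DataFrame.fillna_advanced")` would be the two calls to compare for the fabricated name mentioned earlier.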

Expert Habit

Treat AI output as a first draft from a confident junior developer — someone who writes fluently, gets most things right, but has blind spots they can't see. Your job is not to distrust everything; it is to know which categories of claim require verification.

Fluency illusion — The tendency for readers to conflate well-written text with accurate text; a core risk when reviewing AI-generated code.

Calibration — The degree to which a model's expressed confidence matches its actual accuracy; well-calibrated models are uncertain when they're likely wrong.

Epistemic opacity — The inability of LLMs to accurately report their own knowledge state; the model cannot distinguish "I know this" from "this sounds right."
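Calibration as defined above can be measured once you log outcomes. The sketch below assumes a hypothetical review log of (confidence, correct) pairs, where confidence is a number between 0 and 1 that the reviewer recorded from the model's answer to a calibration probe; the bin edges are arbitrary choices.

```python
from collections import defaultdict


def calibration_table(samples, bins=(0.5, 0.7, 0.9, 1.0)):
    """Group (confidence, correct) pairs by confidence bin.

    Returns {bin_upper_bound: (mean_confidence, accuracy, count)}.
    Assumes every confidence value lies between 0 and 1 inclusive.
    """
    grouped = defaultdict(list)
    for confidence, correct in samples:
        # Place each sample in the first bin whose upper bound covers it.
        upper = next(b for b in bins if confidence <= b)
        grouped[upper].append((confidence, correct))

    table = {}
    for upper, rows in grouped.items():
        mean_conf = sum(c for c, _ in rows) / len(rows)
        accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        table[upper] = (mean_conf, accuracy, len(rows))
    return table
```

A well-calibrated source shows accuracy close to mean confidence in every bin. A bin with mean confidence near 0.95 and accuracy near 0.5 is the fluency illusion expressed as a number.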

L1 Quiz — Confidence vs. Correctness

3 questions · select the best answer

1. The 2022 Stanford study (Perry et al.) on AI assistants and security found that participants using an AI assistant:

Correct. Perry et al. (2022) showed AI-assisted participants introduced more vulnerabilities and simultaneously rated their code as more secure — a direct demonstration of the fluency illusion.

Not quite. Perry et al. found the opposite: AI users introduced more vulnerabilities while feeling more confident. This is the fluency illusion in action.

2. When an AI writes "This is always the correct approach," the phrasing primarily reflects:

Correct. LLMs reproduce phrasing from their training distribution. "Always correct" was common phrasing in authoritative-sounding text, so the model produces it regardless of actual accuracy.

Not quite. LLMs have no internal truth-verification. Confident phrasing reproduces patterns from training data, not verified facts or explicit confidence scores.

3. Which of the following is the most reliable signal that an AI response may be unreliable on a specific claim?

Correct. Version-agnostic claims on version-sensitive APIs are a red-flag pattern. Library APIs change between major versions; a model trained on older data can confidently recommend removed or renamed methods.

Not quite. Length and formatting don't correlate with accuracy. Explicit uncertainty followed by an attempt is actually a good sign. Version-agnostic absolute claims on version-sensitive APIs are the strongest unreliability signal listed.

Lab 1 — Calibration Interrogation

Practice detecting false confidence in AI code suggestions

Your Task

Below is a simulated AI code suggestion. Your job is to interrogate the AI assistant about the confidence level of specific claims in the suggestion. Ask about version compatibility, whether function names actually exist, and what you'd search to verify. The AI will respond as a code-review tutor.

Try asking: "Does pandas actually have a method called df.interpolate_sparse()? How would I verify this?" — or probe any other claim you notice.

# AI suggestion for handling missing values
import pandas as pd

df = pd.read_csv('data.csv')
# Use interpolate_sparse for memory-efficient gap filling
df_clean = df.interpolate_sparse(method='linear', limit=5)
df_clean = df_clean.fillna(df_clean.mean())
print("Missing values handled successfully.")

AI Review Tutor

Confidence Analysis

I'm your code review tutor for this lab. The code snippet above contains at least one fabricated method name — a classic fluency-illusion trap. Ask me about any specific claim in the code: function existence, version compatibility, or how you'd verify it. What do you want to probe first?

Lesson 2 · Module 2

Reading Structure, Not Just Syntax

AI code often passes a syntax check but fails at the architectural level — here is how to see it.

What structural patterns in AI-generated code predict integration failures before you run a single test?

Amazon's engineering teams reported internally (later disclosed in a 2023 re:Invent session) that CodeWhisperer suggestions accepted by developers had a syntactic correctness rate above 90% — but a much lower rate of "correct integration," meaning the code worked in isolation but mishandled the surrounding system's contracts: wrong error types propagated, incorrect assumptions about caller state, or resource handles left open. Syntax reviewers missed these because they stopped at the function boundary.

The Boundary Problem

LLMs generate code token by token, optimizing local coherence. A function body looks internally consistent because the model conditions on what it has already written. But the model has weaker signal about the calling context — what the caller assumes, what invariants must hold at entry, what cleanup must happen at exit.

This is why AI-generated code frequently fails at boundaries: function signatures match but postconditions don't, resources are allocated but not released in error paths, or exceptions are caught but silently swallowed.

Four Structural Review Layers

Layer 1 · Contracts

Do the function's return type, exception behavior, and side effects match what callers will expect? AI often widens or narrows contracts silently, for example by returning None where callers expect an object, or by raising a new exception type not listed in the interface.

Layer 2 · Error Paths

Check every try/except or catch block. AI frequently catches broad exception classes and then fails to re-raise, log, or handle. The "happy path" is usually correct; error paths are where AI generation degrades sharply.

Layer 3 · Resource Lifecycle

File handles, network connections, database transactions, locks. AI often opens resources correctly but leaves them unclosed in non-happy paths. Look specifically at exception branches and early returns.

Layer 4 · State Assumptions

Does the code assume shared mutable state is in a particular condition? AI-generated functions sometimes assume initialization that the actual system hasn't performed, or mutate state that callers expected to be unchanged.
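Part of the Layer 2 check can be automated for Python code. The sketch below uses the standard `ast` module to flag broad exception handlers that never re-raise; the function name is hypothetical, and the rule is deliberately crude (a handler that logs properly but does not re-raise is still flagged), so treat hits as review prompts.

```python
import ast


def find_swallowed_exceptions(source: str) -> list[int]:
    """Return line numbers of broad `except` handlers that contain no `raise`."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # A bare `except:` has type None; Exception and BaseException count as broad.
        is_broad = node.type is None or (
            isinstance(node.type, ast.Name)
            and node.type.id in {"Exception", "BaseException"}
        )
        reraises = any(isinstance(child, ast.Raise) for child in ast.walk(node))
        if is_broad and not reraises:
            flagged.append(node.lineno)
    return sorted(flagged)
```

Resource lifecycle and state assumptions (Layers 3 and 4) are harder to check mechanically and still need a human trace through each exit path.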

A Documented Example Pattern

In a 2023 analysis by Pearce et al. ("Examining Zero-Shot Vulnerability Repair with Large Language Models"), the most common structural failure in AI-generated security patches was incomplete error path coverage: the model fixed the happy path but left error branches with the original vulnerable pattern. The fix looked correct at first glance because the reviewer's eye went to the changed lines, not the untouched catch blocks.

// Common AI-generated pattern — structural issue
public String readConfig(String path) {
    try {
        FileReader fr = new FileReader(path);
        // ... reads file correctly ...
        fr.close();         // closed on happy path ✓
        return result;
    } catch (IOException e) {
        // fr is never closed here ✗
        return null; // exception swallowed ✗
    }
}

Review Technique

When reviewing AI code, deliberately trace the error path first, not the happy path. The happy path is where the AI's training signal was strongest. Error paths are where coverage drops. Ask: "If an exception fires on line 3, what happens to every resource opened on lines 1–2?"
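For comparison, here is one way the same config reader could look in Python once the error path is traced first. This is a sketch under assumptions: `ConfigError` is a hypothetical domain exception, and whether to raise or return a default is a contract decision for the real codebase.

```python
class ConfigError(Exception):
    """Hypothetical domain exception used only in this sketch."""


def read_config(path: str) -> str:
    """Return the file's text, or raise ConfigError if it cannot be read."""
    try:
        # `with` closes the handle on every exit path, including exceptions.
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError as exc:
        # Surface the failure to the caller instead of returning None.
        raise ConfigError(f"cannot read config at {path!r}") from exc
```

The two changes map to the two ✗ marks in the Java version: the resource is released on the failure branch, and the exception is translated rather than swallowed.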

Architectural Mismatch Signals

Beyond individual functions, AI sometimes generates code that is architecturally mismatched: it solves the problem but introduces a pattern that conflicts with the existing codebase's conventions. Signs include:

Introducing a new third-party import for a utility the codebase already has internally.
Using synchronous I/O in an async codebase (or vice versa).
Applying a design pattern (e.g., Singleton) where the codebase explicitly avoids it.
Creating a new exception class when the project's convention is to reuse existing domain exceptions.
Hardcoding values that the rest of the codebase reads from configuration.
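Some of these mismatch signals can be spotted mechanically. As one illustration, the sketch below looks for blocking calls inside `async def` functions in Python source. The list of blocking calls is an assumption and far from complete, and the helper names are hypothetical.

```python
import ast

# Illustrative, incomplete list of blocking calls (an assumption for this sketch).
BLOCKING_CALLS = {"time.sleep", "requests.get", "requests.post", "open"}


def _call_name(node: ast.Call) -> str:
    """Render a call target such as `time.sleep` or `open` as dotted text."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
        return f"{func.value.id}.{func.attr}"
    return ""


def blocking_calls_in_async(source: str) -> list[tuple[str, int]]:
    """Return (call, line) pairs for listed blocking calls inside `async def`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.AsyncFunctionDef):
            for child in ast.walk(node):
                if isinstance(child, ast.Call) and _call_name(child) in BLOCKING_CALLS:
                    hits.append((_call_name(child), child.lineno))
    # De-duplicate (nested async functions are walked twice) and order by line.
    return sorted(set(hits), key=lambda hit: hit[1])
```

The other signals in the list, such as duplicated utilities or convention-breaking exception classes, depend on knowledge of the specific codebase and are better handled by a reviewer who knows it.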

Contract — The implicit or explicit agreement between a function and its callers about inputs, outputs, exceptions, and side effects.

Error path — The execution path followed when an operation fails; AI generation quality degrades significantly on error paths relative to happy paths.

L2 Quiz — Reading Structure

3 questions · select the best answer

1. Amazon's CodeWhisperer internal study (2023) found that accepted suggestions had high syntactic correctness but low "correct integration." The most common integration failure involved:

Correct. The Amazon study found that code worked in isolation but failed at integration points — wrong exception types, incorrect assumptions about caller state, resource handles left open.

Not quite. The failures were structural integration issues: error propagation, caller state assumptions, and resource lifecycle — not naming or documentation problems.

2. The Pearce et al. (2023) study on AI-generated security patches found the most common structural failure was:

Correct. Pearce et al. found that AI fixed the happy path but left error branches with the original vulnerable code — the reviewer's eye went to changed lines and missed untouched catch blocks.

Not quite. Pearce et al. specifically found incomplete error path coverage: happy paths were fixed, but error branches retained the vulnerable original pattern.

3. When reviewing AI-generated code structurally, the recommended approach is to:

Correct. Because AI training signal is strongest on happy paths, error paths degrade first. Deliberately reviewing error paths and resource lifecycle in non-happy branches catches what automated and cursory reviews miss.

Not quite. The lesson recommends reviewing error paths first — that's where AI quality degrades most. Tests alone don't cover all structural integration failures.

Lab 2 — Structural Pattern Diagnosis

Practice tracing error paths and resource lifecycle in AI-generated code

Your Task

Examine the AI-generated database connection code below. Identify structural problems across the four review layers: contracts, error paths, resource lifecycle, and state assumptions. Discuss your findings with the tutor.

Start by asking: "What structural problems exist in the error path of this code?" or trace the resource lifecycle aloud and ask if your analysis is correct.

# AI-generated: database query helper
import psycopg2

def get_user_record(user_id, conn_string):
    conn = psycopg2.connect(conn_string)
    cursor = conn.cursor()
    try:
        cursor.execute(
            "SELECT * FROM users WHERE id = %s", (user_id,)
        )
        row = cursor.fetchone()
        conn.commit()
        return row
    except Exception as e:
        print(f"Error: {e}")
        return None
    # No finally block
    # cursor and conn not closed in except branch

AI Review Tutor

Structural Analysis

This lab focuses on structural review. The code above has multiple issues across the four layers we covered: error paths, resource lifecycle, contracts, and state assumptions. Tell me what you see — or ask me to walk through a specific layer with you. What would you like to analyze first?

Lesson 3 · Module 2

A Taxonomy of AI Hallucinations in Code

Not all AI errors are the same — classifying hallucination type changes how you review and verify.

Which category of AI hallucination is hardest to catch, and why does the category matter for your review process?

In 2023, security researcher Bar Lanyado documented that ChatGPT recommended non-existent package names, including npm packages, in its coding answers. His proof of concept showed the attack path: an attacker collects the hallucinated names and registers malicious packages under them, so a developer who copies the suggestion and runs npm install downloads the attacker's code. The hallucination wasn't random noise. It was a plausible name assembled from real package naming patterns, which made it both convincing and exploitable.

Why a Taxonomy Matters

The word "hallucination" is used loosely to mean any AI error. But different error types have different detection strategies and different risk profiles. Treating all AI errors as the same category leads to inefficient review — checking the wrong things, in the wrong order, at the wrong level of scrutiny.

The Five Hallucination Categories

Type 1 · Fabricated identifiers: Function names, class names, or module names that don't exist. These sound real because they follow real naming conventions. Detection: search the library's actual documentation or source. Risk: a runtime AttributeError or import failure, which is immediately visible. Severity: low if caught in testing, with one exception: a fabricated package name that someone has since registered installs cleanly and runs whatever its owner published.

Type 2 · Stale API references: Real functions that existed in an older version but were removed or renamed. The model's training data contains the old API, so it reproduces it confidently. Detection: check the changelog or migration guide for the current major version. Risk: silent failure if the old function still exists with changed behavior, or an overt crash if it was removed.

Type 3 · Semantic inversion: Logic that is syntactically and structurally correct but inverted: > instead of <, AND instead of OR, or the wrong variable used in a comparison. Among the most dangerous types because it passes compilation, linting, and sometimes even tests. Detection requires reading the logic against requirements, not just reading code.

Type 4 · Plausible-but-wrong algorithms: A real algorithm applied to the wrong problem, or a correct algorithm with a subtle off-by-one error or incorrect termination condition. Common in sorting, search, and numeric code. Detection: trace through with a concrete example, especially edge cases (empty input, single element, maximum values).

Type 5 · Security-pattern confabulation: Code that mimics the visual pattern of secure code but uses the wrong primitive or configuration. Example: encrypting with AES but using ECB mode (which leaks repeated plaintext patterns into the ciphertext), or using MD5 labeled as "hashing for verification" in a context where collision resistance matters. Documented extensively by Pearce et al. and the Stanford Perry study.
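A Type 3 inversion is easiest to see by tracing concrete values against the requirement. The sketch below uses an invented requirement ("orders of 50 or more ship free"); the threshold, function names, and cases are all hypothetical.

```python
FREE_SHIPPING_THRESHOLD = 50  # hypothetical requirement: free shipping at 50 or more


def qualifies_as_written(total: float) -> bool:
    """Inverted comparison: reads plausibly, but contradicts the requirement."""
    return total < FREE_SHIPPING_THRESHOLD


def qualifies_as_required(total: float) -> bool:
    """Matches the stated requirement, including the boundary value."""
    return total >= FREE_SHIPPING_THRESHOLD


def trace(fn, cases):
    """Return the inputs whose actual result differs from the expected one."""
    return [value for value, expected in cases if fn(value) != expected]


# Expected results come from the requirement, not from reading the code.
CASES = [(49.99, False), (50, True), (120, True)]
```

`trace(qualifies_as_written, CASES)` returns every input, while `trace(qualifies_as_required, CASES)` returns an empty list. Both functions compile, lint cleanly, and look equally reasonable in a diff, which is why the expected values have to come from the requirement.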

Detection Difficulty and Review Priority

Easy to Detect

Type 1 Fabricated identifiers fail at import or first call. Any run of the code surfaces them. Low priority in static review — testing catches them.

Hard to Detect

Type 3 Semantic inversions look correct to a casual reader and may pass tests designed by the same person who wrote the requirements. Require deliberate logic tracing.

Insidious

Type 5 Security confabulations pass all functional tests (they work for the happy path), only fail under adversarial conditions or cryptographic analysis. May sit undetected for months.

Version-Dependent

Type 2 Stale API references may work correctly on the developer's local machine if their library version is old, then fail in production on the latest version. CI environment version matters.
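The version dependence of Type 2 can be made explicit by checking a call against the version you actually pin. The sketch below is illustrative: the lookup table is something a team would maintain by hand from changelogs (the single entry reflects pandas removing `DataFrame.append` in 2.0), and the version parser is a simplification that only handles plain dotted numbers.

```python
# Hypothetical lookup table built from changelogs; one real entry is shown:
# pandas removed DataFrame.append in version 2.0.
REMOVED_IN = {"pandas.DataFrame.append": "2.0"}


def parse_version(text: str) -> tuple[int, ...]:
    """Parse a plain dotted version such as '2.1.4' (no pre-release tags)."""
    return tuple(int(part) for part in text.split("."))


def is_removed(api: str, pinned_version: str) -> bool:
    """Return True if `api` is listed as removed at or before `pinned_version`."""
    removed = REMOVED_IN.get(api)
    if removed is None:
        return False  # not listed; this says nothing about whether the API exists
    return parse_version(pinned_version) >= parse_version(removed)
```

Run against the version pinned for production and CI, not the one on a developer laptop: `is_removed("pandas.DataFrame.append", "1.5.3")` is False while the same call with `"2.1.4"` is True.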

Documented Risk

A 2024 Snyk report analyzed 500 AI-generated code samples across five languages. Type 5 (security confabulation) accounted for 28% of identified vulnerabilities and had the longest mean time to detection — an average of 47 days between introduction and discovery, compared to under 24 hours for Type 1.

Building a Hallucination-Aware Review Checklist

Rather than reviewing AI code uniformly, prioritize by type and context:

For any cryptography, authentication, or authorization code: assume Type 5 risk and verify every primitive against current security best-practices documentation (OWASP, NIST).
For any numerical or comparison logic: trace Type 3 risk by substituting concrete values and checking output against expected behavior.
For any third-party library calls: verify against the specific version in your requirements.txt or package.json, not just "the library's docs."
For import statements and package names: verify the package exists in the registry before installing, especially for less common names.
For algorithm implementations: test with empty input, single-element input, and maximum-size input as minimum edge case coverage.
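The package-name item on the checklist can be scripted. The sketch below checks a Python package name against PyPI's JSON endpoint; the function names are hypothetical, and the `fetch` parameter exists so the logic can be exercised without network access. An equivalent check for npm would query the npm registry instead and is not shown.

```python
import urllib.error
import urllib.request


def fetch_status(url: str) -> int:
    """Return the HTTP status code for `url` (performs a network request)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code


def package_exists_on_pypi(name: str, fetch=fetch_status) -> bool:
    """Return True if PyPI's JSON endpoint reports a project called `name`."""
    status = fetch(f"https://pypi.org/pypi/{name}/json")
    return status == 200
```

Existence is necessary, not sufficient. A hallucinated name that an attacker has already registered will pass this check, so for unfamiliar packages also look at the publish date, maintainer, and source repository before installing.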

Expert Principle

The hallucination that costs the most is not the one that crashes the program — it's the one that runs silently with wrong results. Type 3 and Type 5 are your highest-priority review targets in AI-generated code.
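To see why a Type 5 error such as ECB mode runs silently, it helps to look at what a functional test never checks. The sketch below does not use AES: `toy_block_encrypt` is a stand-in built from SHA-256 purely to mimic the one property that matters here, namely that ECB processes each block independently and deterministically. It is not encryption and cannot be decrypted.

```python
import hashlib

BLOCK_SIZE = 16


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for a block cipher: deterministic per (key, block). NOT real encryption."""
    return hashlib.sha256(key + block).digest()[:BLOCK_SIZE]


def ecb_like(key: bytes, plaintext: bytes) -> list[bytes]:
    """Process each 16-byte block independently, as ECB mode does."""
    blocks = [plaintext[i:i + BLOCK_SIZE] for i in range(0, len(plaintext), BLOCK_SIZE)]
    return [toy_block_encrypt(key, block) for block in blocks]


def repeated_blocks(ciphertext_blocks: list[bytes]) -> int:
    """Count ciphertext blocks that duplicate an earlier block."""
    return len(ciphertext_blocks) - len(set(ciphertext_blocks))
```

With a plaintext of three blocks where the first and third are identical, the first and third output blocks are identical too. An observer learns the plaintext's structure without the key, yet a round-trip "encrypt then decrypt" test on real AES-ECB would still pass.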

L3 Quiz — Hallucination Taxonomy

3 questions · select the best answer

1. The 2023 npm package hallucination documented by Bar Lanyado was particularly dangerous because:

Correct. This is a Type 1 hallucination weaponized: the model produces fabricated but plausible package names, and an attacker can register those exact names in the npm registry.

Not quite. The hallucination mechanism was an AI assistant suggesting non-existent package names that followed real naming conventions. Attackers can register those names to intercept developers who copy the suggestions.

2. Which hallucination type is hardest to detect through code review and has the longest mean time to discovery according to the 2024 Snyk report?

Correct. Security confabulations pass all functional tests and only fail under adversarial or cryptographic analysis — making them the highest-priority and hardest-to-catch hallucination type.

Not quite. The Snyk report found Type 5 (security confabulation) had the longest detection time — 47 days on average — because these errors pass functional tests and only manifest under adversarial conditions.

3. When reviewing AI-generated code that uses AES encryption, the primary Type 5 risk to check is:

Correct. This is textbook Type 5 confabulation: the code uses a real, correct function (AES) but with an insecure configuration (ECB mode) that visually resembles correct security code while being cryptographically weak.

Not quite. Type 5 security confabulation is about using real primitives with wrong configurations. AES-ECB is the canonical example — it uses real AES encryption but in a mode that reveals patterns in the plaintext.

Lab 3 — Hallucination Classification

Identify and classify hallucination types in real-looking AI code samples

Your Task

Review the two AI-generated snippets below. Each contains a different hallucination type. Classify the error type in each, explain why it's dangerous, and describe how you'd verify or fix it. Discuss with the tutor.

Try: "What type of hallucination is in Snippet A?" — or compare the risk severity of both snippets and explain which you'd prioritize reviewing first.

## Snippet A — Authentication token check
import hashlib

def verify_token(token, stored_hash):
    # Compare submitted token to stored hash
    return hashlib.md5(token.encode()).hexdigest() == stored_hash

## Snippet B — List deduplication
def deduplicate_sorted(items):
    # Remove duplicates from sorted list
    result = [items[0]]
    for i in range(1, len(items)):
        if items[i] != items[i-1]:   # correct logic ✓
            result.append(items[i])
    return result
    # Bug: crashes on empty list — items[0] raises IndexError

AI Review Tutor

Hallucination Classification

Two snippets, two different hallucination types from our taxonomy. Snippet A involves a security primitive — think about which type targets security-pattern confabulation. Snippet B is a logic issue — which type covers plausible-but-wrong algorithms? Tell me your classification for either one, and I'll help you sharpen the analysis.

Lesson 4 · Module 2

Building Your Personal Review Protocol

How expert reviewers systematize what they've learned about AI output into repeatable, teachable process.

What separates ad-hoc code review from a systematic protocol — and why does the difference matter at team scale?

In a 2024 blog post and associated internal engineering documentation (referenced at the ICML 2024 workshop on AI-assisted development), Google DeepMind described how teams reviewing AI-generated code adopted tiered review protocols rather than uniform scrutiny. Code touching cryptography, authentication, or external data parsing received mandatory secondary human review plus automated static analysis. Code for internal utilities received expedited single-reviewer pass. The tiering reduced review time by approximately 30% without increasing post-deployment defect rates.

Why Ad-Hoc Review Fails at Scale

Individual reviewers applying their own intuition produce inconsistent outcomes. One engineer checks error paths thoroughly; another focuses on naming and style. Neither approach is complete, and the gaps aren't visible until a bug ships. When AI tooling increases code output volume — which it reliably does — the gaps scale proportionally.

Microsoft's DevDiv team published findings in 2023 showing that teams using Copilot without updated review processes saw a higher rate of review-escaped defects than teams that also updated their review checklists to account for AI-specific failure modes. The tooling improved throughput; the review process hadn't kept pace.

Protocol Architecture: The Three-Phase Review

Phase 1 — Risk Classification (30 seconds). Before reading any code, classify the change by risk tier. Security-touching, data-touching, and external-interface-touching code is Tier 1. Internal logic without security implications is Tier 2. Cosmetic or configuration changes are Tier 3. Tier determines depth of subsequent phases.
Phase 2 — Hallucination Sweep (type-prioritized). For Tier 1: check Type 5 (security primitives, modes, configurations) and Type 3 (logic inversions) before reading comprehensively. For Tier 2: check Type 4 (algorithm correctness on edge cases) and Type 2 (API version compatibility). For Tier 3: light scan only.
Phase 3 — Structural Integration Check. For any tier touching shared state or external systems: trace error paths explicitly, verify resource lifecycle, confirm contract alignment with callers. This phase is most often skipped under time pressure — which is precisely when it matters most.
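The three phases can be written down as a small decision function, which makes the protocol easier to share and argue about. Everything named below is an assumption for illustration: the tag vocabulary, the `ReviewPlan` structure, and the rule that untagged changes default to Tier 2.

```python
from dataclasses import dataclass

# Hypothetical tags a team might attach to a change; not a standard vocabulary.
TIER_1_TAGS = {"security", "data", "external-interface"}
TIER_3_TAGS = {"cosmetic", "config"}

# Phase 2 priorities per tier, using the hallucination types from Lesson 3.
SWEEP_BY_TIER = {
    1: ["Type 5", "Type 3"],
    2: ["Type 4", "Type 2"],
    3: [],  # light scan only
}


@dataclass
class ReviewPlan:
    tier: int
    sweep: list[str]
    integration_check: bool


def plan_review(tags: set[str], touches_shared_or_external: bool) -> ReviewPlan:
    """Map a change's tags to a tier, a sweep order, and a Phase 3 decision."""
    if tags & TIER_1_TAGS:
        tier = 1  # any high-risk tag wins, even alongside cosmetic ones
    elif tags and tags <= TIER_3_TAGS:
        tier = 3  # only cosmetic or configuration tags
    else:
        tier = 2  # internal logic, or no tags at all
    return ReviewPlan(tier, SWEEP_BY_TIER[tier], touches_shared_or_external)
```

Note that Phase 3 is driven by its own flag rather than by the tier, mirroring the rule that any tier touching shared state or external systems gets the integration check.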

Calibration Probes as a Standard Step

Building on Lesson 1, expert reviewers standardize calibration probing for Tier 1 code. Before accepting a security-relevant AI suggestion, they ask the AI itself:

"What version of this library introduced this method?"

"Is this encryption mode considered secure for this use case?"

"What does the OWASP guidance say about this approach?"

"What would you search to verify this claim?"

If the AI produces confident-sounding answers without citing verifiable sources, that response itself is a signal to increase scrutiny — not a green light.

Team-Level Protocol Adoption

Personal protocols become organizational value when they're written down and shared. The practical format is a review checklist embedded directly in your pull-request template — not a separate document, but a required field in the PR description. Teams at Stripe, Shopify, and Atlassian have published variants of this approach publicly in engineering blog posts (2023–2024), citing improved review consistency and faster onboarding of new reviewers.

In the PR Template

Add a required checkbox section: "If AI-generated code is included, confirm: [ ] Tier classified [ ] Hallucination sweep completed [ ] Error paths traced [ ] Security primitives verified against docs."

In Code Review Tools

GitHub, GitLab, and Phabricator all support custom review checklists. Marking these as required before merge creates an audit trail and prevents review steps from being silently skipped under deadline pressure.
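A checklist only creates an audit trail if something verifies it. The sketch below checks a pull-request description for the four items from the template text; how the description reaches the script (for example, from a CI job) is left as an assumption, and the function name is hypothetical.

```python
import re

# Checklist items taken from the PR template text in this lesson.
REQUIRED_ITEMS = [
    "Tier classified",
    "Hallucination sweep completed",
    "Error paths traced",
    "Security primitives verified against docs",
]


def unchecked_items(pr_body: str) -> list[str]:
    """Return required checklist items that are missing or not ticked."""
    missing = []
    for item in REQUIRED_ITEMS:
        # Accept "[x]" or "[X]" immediately before the item text.
        ticked = re.search(r"\[[xX]\]\s*" + re.escape(item), pr_body)
        if not ticked:
            missing.append(item)
    return missing
```

A merge gate could fail when `unchecked_items` returns a non-empty list for a PR that declares AI-generated code. A ticked box is still only an attestation, so this complements the review itself rather than replacing it.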

The Throughput Trap

AI tooling raises code velocity. Faster generation creates implicit pressure to review faster too. The reviewers who resist this pressure — maintaining protocol depth even as merge volume increases — are the ones who catch the expensive bugs. Protect Phase 3 time explicitly; it is the first thing cut under pressure and the last thing that should be.

Module 2 Synthesis

Reading AI output like an expert is not a single skill — it is a stack. Fluency-illusion awareness (L1) → structural pattern recognition (L2) → hallucination taxonomy (L3) → systematic protocol (L4). Each layer catches what the layers above miss. The protocol encodes all four into a repeatable, teachable process.

L4 Quiz — Review Protocol

3 questions · select the best answer

1. Microsoft's DevDiv team (2023) found that teams using Copilot without updated review processes experienced:

Correct. The tool improved throughput but the review process hadn't adapted to AI-specific failure modes — defects escaped at a higher rate until teams updated their review checklists to match.

Not quite. Microsoft found that teams using Copilot without updating review processes saw more review-escaped defects — the tooling raised throughput but the unchanged review process created gaps.

2. In the three-phase review protocol, which phase is described as most often skipped under time pressure, and as the last thing that should be cut?

Correct. Phase 3 — tracing error paths, verifying resource lifecycle, and confirming contract alignment — is the most time-intensive and the most commonly sacrificed under deadline pressure.

Not quite. Phase 3 (Structural Integration Check) is called out as the phase most often skipped under time pressure and the most important to protect — it catches integration failures that Phases 1 and 2 don't.

3. Google DeepMind's tiered review protocol (2024) achieved approximately a 30% reduction in review time without increasing post-deployment defect rates by:

Correct. Tiered protocols concentrate deep review where risk is highest and allow faster review where risk is lower — achieving efficiency gains without sacrificing defect detection on high-risk code.

Not quite. DeepMind's approach was tiered scrutiny: mandatory secondary human review plus automated analysis for security-touching code; expedited review for lower-risk changes. This concentrates effort where it matters most.

Lab 4 — Protocol Design Workshop

Build and stress-test your personal AI code review protocol

Your Task

Design your own three-phase review protocol for AI-generated code in a specific context — your current team, a side project, or a hypothetical company. Describe the tier classification criteria, the hallucination sweep priorities, and the integration check scope. The tutor will push back on gaps and help you refine it.

Start by stating your context: "I'm designing a protocol for a fintech startup's Python backend" — or ask the tutor to walk you through building one step by step. After 3 exchanges, you've completed this lab.

AI Review Tutor

Protocol Design

Let's build your review protocol. Start by telling me: what kind of codebase or project are you designing this for? What languages, what risk domains (payments, healthcare, consumer data, internal tools)? The tier classifications will flow from your actual risk profile, not a generic template. Go ahead — describe your context.

Module 2 — Test

15 questions · 80% to pass · covers all four lessons

1. The "fluency illusion" in AI code review refers to:

Correct. Fluency illusion is the core cognitive risk: smooth, confident prose reads as accurate even when it isn't.

The fluency illusion is the human cognitive bias of equating polished, confident-sounding text with factual accuracy.

2. The 2022 Stanford Perry et al. study showed that developers using AI assistants:

Correct. Perry et al. is a landmark study: AI users had more vulnerabilities and higher confidence simultaneously.

Perry et al. found AI-assisted developers introduced more vulnerabilities and were simultaneously more confident their code was secure — the fluency illusion in a controlled study.

3. A "calibration probe" in AI code review means:

Correct. Asking "How certain are you? What would I search to verify?" is a calibration probe — and the AI's response quality is itself diagnostic.

A calibration probe asks the AI to report its own uncertainty on a specific claim and provide a verification path. The quality of that response is itself a signal.

4. Why does AI-generated code fail most often at function boundaries rather than within function bodies?

Correct. Token-by-token local coherence optimization is strong within the function body; the calling context is more distant in the conditioning window.

LLMs generate token-by-token with strong local coherence. The calling context — what callers assume, what invariants hold at entry — has weaker signal, which is why boundaries are the failure point.

5. Amazon CodeWhisperer's internal 2023 study found that accepted suggestions had:

Correct. High syntax correctness + low integration correctness is the key pattern — it's why structural review can't stop at syntax.

The Amazon study found >90% syntactic correctness but significantly lower "correct integration" — the code worked in isolation but failed at system integration points.

6. The recommended structural review technique when examining AI-generated code is to:

Correct. Trace error paths first: AI training signal is strongest on happy paths, so quality degrades first on error paths.

Trace error paths first. AI training signal is concentrated on happy paths; error paths are where quality degrades and where the most critical structural failures hide.
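One way to picture an error-path-first read, as a sketch rather than a prescribed method: the function names and the choice of `ValueError` below are assumptions made for illustration. The first version is the happy-path shape a reviewer might wave through; the second is what results from asking "what happens when parsing fails?" before anything else.

```python
import json

def load_config_happy_path(path: str) -> dict:
    """Happy-path shape: the file handle leaks if json.load raises."""
    f = open(path)
    data = json.load(f)  # a malformed file raises here, skipping close()
    f.close()
    return data

def load_config_reviewed(path: str) -> dict:
    """Error path traced first: the handle closes on every exit, and a
    malformed file surfaces as one predictable exception type."""
    with open(path) as f:
        try:
            return json.load(f)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid config in {path}") from exc
```

Both versions behave identically on valid input, which is why a review that only exercises the happy path cannot tell them apart.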

7. Type 1 hallucinations (fabricated identifiers) are considered low severity primarily because:

Correct. Type 1 errors are self-revealing — they crash loudly, making them the easiest to catch and cheapest to fix.

Type 1 hallucinations fail loudly at runtime with an AttributeError, ImportError, or NameError. They are immediately visible, unlike Type 3 and Type 5, which can run silently with wrong results.
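A small illustration of a fabricated identifier failing loudly. The function names are hypothetical; the only library fact relied on is that Python's `json` module provides `loads` and has no `parse` function (`parse` is the JavaScript spelling).

```python
import json

def parse_payload_hallucinated(text: str) -> dict:
    # Plausible-looking but fabricated: json.parse does not exist in Python.
    return json.parse(text)  # raises AttributeError on the first call

def parse_payload(text: str) -> dict:
    # The real standard-library call.
    return json.loads(text)
```

The first call to `parse_payload_hallucinated` raises immediately, so any test or manual run that touches it exposes the error.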

8. The 2023 Bar Lanyado research on npm package hallucinations demonstrated the active exploitation of which hallucination type?

Correct. This was Type 1 weaponized: the hallucinated package names were plausible (followed real naming patterns) and attackers pre-registered them to intercept installations.

This was Type 1 (fabricated identifiers) used as an attack vector. The AI assistant hallucinated plausible-but-nonexistent package names; attackers registered those names with malicious code.

9. Type 5 hallucinations (security-pattern confabulation) are the highest-priority review target because:

Correct. Silent failure, passing functional tests, and months-long detection windows make Type 5 the most dangerous in practice.

Type 5 errors are highest priority because they run correctly in all functional tests and only fail under adversarial conditions — giving them the longest mean detection time in the Snyk study.
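As one illustrative instance of a pattern that passes functional tests yet is weak under adversarial conditions, consider secret comparison. This is a sketch with hypothetical function names, not an example taken from the study cited above; `hmac.compare_digest` is the standard-library constant-time comparison.

```python
import hmac

def token_matches_naive(supplied: str, expected: str) -> bool:
    # Functionally correct, but == may return as soon as a mismatch is
    # found, which can leak timing information to an attacker.
    return supplied == expected

def token_matches(supplied: str, expected: str) -> bool:
    # Constant-time comparison; encoding to bytes avoids the ASCII-only
    # restriction that compare_digest places on str arguments.
    return hmac.compare_digest(supplied.encode(), expected.encode())
```

Every functional test returns the same result for both functions, so only a reviewer who knows to look for the comparison pattern will flag the first one.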

10. A Type 2 (stale API) hallucination is particularly dangerous in which scenario?

Correct. Local-old, production-new version mismatch is the scenario where Type 2 silently passes local development and breaks in CI or production.

The dangerous scenario for Type 2 is when local development uses an old library version (making the stale API work) while production runs the latest version (where the API was removed or renamed).
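A standard-library example of the same mismatch, offered as an illustration with hypothetical function names: `time.clock()` was deprecated in Python 3.3 and removed in Python 3.8, so code that calls it runs on an older local interpreter and fails on a newer one.

```python
import time

def timer_stale() -> float:
    # Works on Python 3.7 and earlier; AttributeError on Python 3.8+.
    return time.clock()

def timer_current() -> float:
    # time.perf_counter() is one of the documented replacements.
    return time.perf_counter()
```

Pinning the same interpreter and dependency versions in local development, CI, and production is what turns this from a silent local pass into an early failure.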

11. In the three-phase review protocol, Phase 1 (Risk Classification) should take approximately:

Correct. Risk classification is a quick, high-level categorization — security-touching, data-touching, external-interface-touching vs. internal utility vs. cosmetic.

Phase 1 is designed to be fast — approximately 30 seconds to classify the change by risk tier before reading the code. This classification determines how deeply to apply Phases 2 and 3.
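The quick classification step could be approximated with a path-based heuristic like the sketch below. Everything in it is an assumption made for illustration: the tier names, the keyword lists, and the `classify_change` function are hypothetical, and a real team would derive its own hints from its actual risk profile.

```python
from typing import List

# Hypothetical hints; replace with patterns drawn from your own codebase.
HIGH_RISK_HINTS = ("auth", "crypto", "payment", "migration", "api/")
COSMETIC_HINTS = ("docs/", "readme", ".css", ".md")

def classify_change(paths: List[str]) -> str:
    """Return 'high', 'standard', or 'cosmetic' for a list of changed paths."""
    lowered = [p.lower() for p in paths]
    # Any security-, data-, or interface-touching file raises the whole change.
    if any(hint in p for p in lowered for hint in HIGH_RISK_HINTS):
        return "high"
    # Cosmetic only if every changed file matches a cosmetic hint.
    if lowered and all(any(h in p for h in COSMETIC_HINTS) for p in lowered):
        return "cosmetic"
    return "standard"
```

For example, `classify_change(["src/auth/login.py", "docs/intro.md"])` yields `"high"`, because a single high-risk file sets the tier for the whole change. A heuristic like this only pre-sorts changes; the reviewer still confirms the tier before reading the code.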

12. Google DeepMind's tiered review protocol achieved a ~30% reduction in review time without increasing defects by:

Correct. Tiering concentrates deep review where risk is highest and permits faster review where risk is lower — achieving efficiency without sacrificing quality on critical code.

DeepMind's protocol applied mandatory secondary human review plus automated analysis to security-touching code, and an expedited single-reviewer pass to internal utilities — concentrating effort where it matters.

13. Embedding the review checklist directly in the pull-request template (rather than a separate document) is recommended because:

Correct. Inline checklists create visibility, auditability, and consistency — the three things that make protocols work at team scale.

Embedding the checklist in the PR template creates an audit trail, makes skipping steps visible, and helps new reviewers learn AI-specific review habits through the workflow itself.

14. "The throughput trap" in AI-assisted development refers to:

Correct. Faster generation → more PRs → review pressure → Phase 3 skipped → structural failures ship. Protecting Phase 3 time explicitly is the countermeasure.

The throughput trap is when faster AI code generation creates implicit pressure to review faster too — and Phase 3 (the most time-intensive structural check) is the first thing cut under pressure.

15. The "four structural review layers" framework covers contracts, error paths, resource lifecycle, and state assumptions. Which combination of layers most directly catches the failure pattern from Amazon's CodeWhisperer study?

Correct. The Amazon failures were multi-layer: error propagation (error paths), unclosed resources (resource lifecycle), and caller assumption mismatches (contracts + state assumptions). All four layers are needed.

The Amazon study findings spanned all four layers: error paths (wrong exception types propagated), resource lifecycle (handles left open), and contracts and state assumptions (incorrect caller assumptions). No single layer catches all of it.