Module 6 · Lesson 1

The Anatomy of an AI–Human Disagreement

Why AI tools and experienced engineers reach different conclusions — and what that tells you about both.
When your static analysis tool contradicts your senior engineer, which one is wrong?

In 2017, the Chromium project's automated tooling flagged a memory management pattern in V8 as a potential use-after-free vulnerability. A senior V8 engineer marked it WONTFIX, arguing the lifetime guarantees were enforced elsewhere. Months later, the pattern became the root cause of CVE-2017-5053, a remotely exploitable heap corruption bug. The engineer had domain knowledge that the tool lacked, but the tool had correctly identified the structural risk. Neither was simply wrong.

Four Root Causes of Disagreement

AI-assisted code audit tools — whether traditional static analyzers like Coverity and CodeQL or LLM-based reviewers — reach conclusions through pattern matching, dataflow analysis, and statistical inference. Human reviewers bring architectural intent, business context, and runtime intuition. The gaps between these modes of reasoning produce predictable categories of conflict.

Context Blindness

AI tools analyze what is written, not what is meant. A function that appears to double-free a pointer may be architecturally guaranteed to only execute one branch — but that guarantee lives in a design document, not the code.

Pattern Overfitting

Training data and rule sets encode past vulnerabilities. Novel idioms, domain-specific invariants, or intentionally unsafe-looking code (e.g., SIMD intrinsics, embedded firmware) will trigger false positives at high rates.

Human Overconfidence

The most dangerous case. Engineers who built a system believe they understand all its invariants. CVE-2014-0160 (Heartbleed) existed in OpenSSL for two years while expert maintainers reviewed the code — they simply didn't see what wasn't there.

Scope Mismatch

AI tools typically reason at file or function scope. Humans reason at system scope. A finding may be correct at the local level but irrelevant at the system level — or vice versa. Neither perspective is complete.

The False Positive Problem — Real Numbers

Microsoft's internal research, published in the 2019 SOUPS proceedings, measured developer responses to static analysis warnings across Windows components. Engineers marked approximately 52% of all static analysis warnings as false positives without investigating them — not because they were definitely wrong, but because alarm fatigue had eroded trust. Of those dismissed warnings, later audit found roughly 8% contained real issues that shipped.

This creates a structural problem: the act of disagreeing with AI tooling has a non-trivial false negative cost that is invisible at the moment of disagreement. Your team needs a framework for disagreement that is neither reflexive acceptance nor reflexive dismissal.
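
The hidden cost can be made concrete with the two figures quoted above. The following is a minimal arithmetic sketch: the two rates come from the lesson, while the batch size of 1,000 warnings and the function name are illustrative assumptions.

```python
# Rates quoted in the lesson; the batch size used below is an assumption.
DISMISSED_WITHOUT_INVESTIGATION = 0.52  # share of warnings dismissed uninvestigated
REAL_ISSUE_RATE_AMONG_DISMISSED = 0.08  # share of those later found to be real


def missed_issues(total_warnings: int) -> float:
    """Expected number of real issues hidden among uninvestigated dismissals."""
    dismissed = total_warnings * DISMISSED_WITHOUT_INVESTIGATION
    return dismissed * REAL_ISSUE_RATE_AMONG_DISMISSED


if __name__ == "__main__":
    # 1,000 warnings -> 520 dismissed -> about 41.6 real issues shipped
    print(round(missed_issues(1000), 1))
```

On these numbers, roughly 4% of all warnings (0.52 × 0.08) would be real issues that nobody looked at, which is why the cost stays invisible at the moment of dismissal.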

Key Insight

Disagreement between AI and human reviewers is not a failure state — it is a signal. The question is what kind of signal: a gap in tool context, a gap in human attention, or a genuine ambiguity that requires deeper investigation. Each type demands a different response.

Classifying the Disagreement Before Resolving It

The worst practice is to resolve disagreements by seniority: whoever has more authority wins. The correct practice is to classify the disagreement by type before deciding who, if anyone, is right.

  • Type A: Tool flags, human has architectural context that makes the finding irrelevant. Resolution: document the context, suppress with justification.
  • Type B: Tool flags, human disagrees but cannot articulate why. Resolution: treat as unresolved risk. Escalate or investigate further.
  • Type C: Tool does not flag, human suspects a problem. Resolution: human judgment should not be suppressed by tool silence. File manually.
  • Type D: Both agree there is a problem but disagree on severity. Resolution: use a pre-agreed scoring rubric (CVSS, internal scale) rather than negotiation.
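
The four types can be sketched as a small lookup, which may help when wiring the classification into review tooling. This is an illustrative sketch only: the function name, its boolean parameters, and the decision order are assumptions, not part of any real tool.

```python
from enum import Enum
from typing import Optional


class DisagreementType(Enum):
    A = "tool flags; human has architectural context"
    B = "tool flags; human disagrees but cannot articulate why"
    C = "tool silent; human suspects a problem"
    D = "both see a problem; severity disputed"


# Resolutions paraphrased from the lesson text.
RESOLUTION = {
    DisagreementType.A: "Document the context and suppress with justification.",
    DisagreementType.B: "Treat as unresolved risk; escalate or investigate.",
    DisagreementType.C: "File the finding manually.",
    DisagreementType.D: "Score with the pre-agreed rubric (e.g. CVSS).",
}


def classify(tool_flagged: bool, human_sees_problem: bool,
             human_has_documented_context: bool = False,
             severity_disputed: bool = False) -> Optional[DisagreementType]:
    """Map review facts to a lesson type; None means nothing to classify."""
    if tool_flagged and human_sees_problem:
        return DisagreementType.D if severity_disputed else None
    if tool_flagged:  # the human believes the finding is wrong
        if human_has_documented_context:
            return DisagreementType.A
        return DisagreementType.B
    if human_sees_problem:  # the tool said nothing
        return DisagreementType.C
    return None
```

Note that seniority is not an input: the only thing separating Type A from Type B here is whether the human's context is documented.
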
Documented Pattern

The 2022 GitHub Copilot security study (NYU/Stanford research, published at IEEE S&P) found that developers using AI code completion were no more likely to write secure code than unassisted developers — but were significantly more confident their code was secure. The AI's lack of a warning was being interpreted as a positive safety signal. This is the Type C failure mode at scale.

Quiz — Lesson 1

The Anatomy of an AI–Human Disagreement
The 2017 Chromium V8 use-after-free bug (CVE-2017-5053) is used in this lesson primarily to illustrate which point?
Correct. The lesson uses this case to show that the tool identified a structural risk correctly while the engineer had domain knowledge the tool lacked — neither was simply wrong, yet the vulnerability shipped.
Not quite. The case is about the complexity of disagreement, not a simple ranking of tools vs. humans.
Microsoft research found engineers marked ~52% of static analysis warnings as false positives without investigation. What was identified as the primary driver?
Correct. The lesson attributes the dismissal behavior to alarm fatigue — not tool quality or training gaps.
The lesson specifically identifies alarm fatigue as the driver, not tool quality or training issues.
A Type B disagreement is defined as: tool flags, human disagrees but cannot articulate why. What is the prescribed resolution?
Correct. When a human cannot articulate why a tool finding is wrong, that inability is itself a signal of unresolved risk requiring investigation, not dismissal.
The inability to articulate a justification is a red flag — the finding should be treated as unresolved, not dismissed in either direction.
The 2022 GitHub Copilot security study found developers using AI completion were more confident their code was secure despite no improvement in actual security. This exemplifies which failure mode?
Correct. Type C is when the tool does not flag something but a human should suspect a problem — the Copilot study showed developers were trusting tool silence as safety confirmation.
Type C covers the case where tool silence is misread as a safety endorsement. That is exactly what the Copilot study documented.

Lab 1 — Classifying Disagreements

Practice identifying disagreement types and prescribing the right resolution path.

Your Task

You will be presented with code audit scenarios where an AI tool and a human reviewer reach different conclusions. Classify each as Type A, B, C, or D, and describe what the correct resolution process should be.

The AI assistant will present scenarios, evaluate your classifications, and give feedback. Complete at least 3 exchanges to finish the lab.

Start by asking for your first scenario, or describe a disagreement you have encountered in practice.
Module 6 · Lesson 2

Building the Tiebreaker Protocol

How teams can establish structured, pre-agreed processes for resolving AI–human conflicts before they happen — not after.
If you wait until a disagreement to decide how to resolve it, you have already lost.

When Google's Project Zero team adopted automated vulnerability research tooling in 2014, they explicitly designed a disagreement protocol before the tools went live. Tool findings required a documented human response in one of three categories: Confirmed, Deferred (with a dated re-review commitment), or Suppressed (with a named engineer and written justification). The suppression rate and its correlation with later confirmed bugs became a quarterly metric. This structure — not the tool itself — is what kept the false negative rate measurable and accountable.

Why Pre-Agreement Matters

When a disagreement is resolved ad hoc, the resolution reflects the social dynamics of the moment: who is more senior, who is more confident, who has more time. These are not reliable proxies for correctness. Pre-agreed protocols replace social dynamics with process, and make the resolution auditable.

The NIST Secure Software Development Framework (SSDF, SP 800-218) includes a "Respond to Vulnerabilities" (RV) practice group covering how organizations identify, assess, prioritize, and remediate vulnerabilities. A tiebreaker protocol is one way to operationalize those practices when AI and human reviewers produce conflicting signals.

The Three Components of a Tiebreaker Protocol
  • Mandatory Classification: Every AI–human conflict must be categorized (using a type system like the one in Lesson 1 or your organization's equivalent) before any resolution action is taken. The category determines the response path.
  • Named Accountability: Every suppressed or deferred finding must name a specific engineer, not a team, who is accountable for that decision. This is a deliberate incentive-design choice: named accountability changes how carefully people think before dismissing a finding.
  • Temporal Commitment: Deferred findings require a concrete re-review date, not an open-ended "we'll look at it later." Google Project Zero's model used 90-day maximum deferrals. After 90 days, a deferred finding automatically escalated to the security lead.
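
The three components above can be sketched as a single decision record that checks itself. This is a hypothetical structure for illustration: the field names and status strings are assumptions, and only the 90-day ceiling is taken from the text.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

MAX_DEFERRAL = timedelta(days=90)  # ceiling quoted in the lesson


@dataclass
class Decision:
    finding_id: str
    classification: str   # lesson type, e.g. "A".."D"
    status: str           # "confirmed", "deferred", or "suppressed"
    engineer: str         # one named person, not a team alias
    decided_on: date
    justification: str = ""
    re_review_on: Optional[date] = None

    def problems(self) -> List[str]:
        """List every protocol rule this decision record breaks."""
        found = []
        if not self.classification.strip():
            found.append("missing classification")
        if self.status in ("deferred", "suppressed") and not self.engineer.strip():
            found.append("missing named engineer")
        if self.status == "suppressed" and not self.justification.strip():
            found.append("missing justification")
        if self.status == "deferred":
            if self.re_review_on is None:
                found.append("missing re-review date")
            elif self.re_review_on - self.decided_on > MAX_DEFERRAL:
                found.append("deferral exceeds 90 days")
        return found

    def needs_escalation(self, today: date) -> bool:
        """True once a deferred finding has been open longer than the ceiling."""
        return self.status == "deferred" and today - self.decided_on > MAX_DEFERRAL
```

A record with an empty `problems()` list satisfies all three components; `needs_escalation` is the check a scheduled job could run to hand overdue deferrals to the security lead.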

    Calibrating Escalation Thresholds

    Not every disagreement should escalate to the same level. Over-escalation creates noise that burns out reviewers; under-escalation lets risks slip through. The calibration should be based on two dimensions: severity of the finding (if confirmed) and confidence of the human's counter-argument.

    Finding Severity | Human Confidence          | Escalation Path
    -----------------|---------------------------|----------------------------------------------------------------------
    Critical / High  | Low (can't articulate)    | Immediate security lead review. No suppression permitted.
    Critical / High  | High (documented context) | Second engineer must independently confirm suppression justification.
    Medium           | Low                       | Deferral with 30-day re-review. Flagged in weekly team audit report.
    Medium           | High                      | Single-engineer suppression with written justification logged.
    Low / Info       | Any                       | Single-engineer decision. Justification optional but encouraged.
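
The matrix above translates directly into a lookup, so the path is determined by the two inputs rather than by who is in the room. This is a sketch only: the string keys and the function name are assumptions chosen for illustration.

```python
# Keys are (severity band, human confidence); values paraphrase the matrix.
ESCALATION = {
    ("critical_high", "low"): "Immediate security lead review; no suppression permitted.",
    ("critical_high", "high"): "Second engineer independently confirms the suppression justification.",
    ("medium", "low"): "Defer with 30-day re-review; flag in weekly team audit report.",
    ("medium", "high"): "Single-engineer suppression with written justification logged.",
}
LOW_INFO = "Single-engineer decision; justification optional but encouraged."


def escalation_path(severity: str, confidence: str) -> str:
    """Return the pre-agreed path for a severity and confidence pair."""
    severity = severity.lower()
    confidence = confidence.lower()
    if severity in ("low", "info"):
        return LOW_INFO  # confidence does not matter for this band
    band = "critical_high" if severity in ("critical", "high") else severity
    try:
        return ESCALATION[(band, confidence)]
    except KeyError:
        raise ValueError(f"no path defined for {severity!r} / {confidence!r}")
```

Unknown combinations raise an error instead of falling back to a default, so a gap in the matrix surfaces as a process question rather than a silent suppression.
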
    Real Failure Mode

    The Capital One breach (2019) involved misconfigured AWS security groups that multiple automated scanning tools had flagged at Medium severity. The flags were suppressed by individual engineers without second review. A tiebreaker protocol requiring peer confirmation on Medium suppressions in production infrastructure could plausibly have changed the outcome.

    Documenting the Protocol

    The protocol should live in three places simultaneously: your team wiki (for reference), your code review tooling (as a mandatory field when suppressing a finding), and your onboarding materials (so new engineers understand the system before they first encounter a disagreement). A protocol that exists only in documentation and not in tooling will not be followed consistently under deadline pressure.
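
One way to make the protocol a mandatory field in tooling is a pre-merge check that rejects incomplete inline suppressions. The sketch below assumes a hypothetical marker format, `audit:suppress rule=... type=... owner=... reason="..."`; it is not the syntax of any real analyzer, so adapt it to whatever your tool actually uses.

```python
import re
from typing import List

# Hypothetical marker format, invented for this example.
MARKER = re.compile(r"audit:suppress\b(?P<body>.*)")
FIELD = re.compile(r'(\w+)=("([^"]*)"|\S+)')
REQUIRED = ("rule", "type", "owner", "reason")


def check_line(line: str) -> List[str]:
    """Return the required fields that are missing or empty on a marker line."""
    match = MARKER.search(line)
    if match is None:
        return []  # not a suppression marker; nothing to enforce
    fields = {}
    for key, raw, quoted in FIELD.findall(match.group("body")):
        # Quoted values keep their inner text; bare values are used as-is.
        fields[key] = quoted if raw.startswith('"') else raw
    return [name for name in REQUIRED if not fields.get(name, "").strip()]
```

Run over the changed lines of a diff, a non-empty result for any line would fail the check, which keeps the rule in force even under deadline pressure.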

    Implementation Note

    GitHub's Security Lab team published their internal review escalation criteria in 2021 as part of their CodeQL documentation. Their key insight: escalation thresholds calibrated too conservatively (escalate everything) produced reviewer burnout within 6 weeks. Thresholds calibrated to severity × reproducibility sustained engagement across a 12-month period.

    Key Terms
    Suppression: A documented decision to close a tool finding as not applicable. Must include classification, named accountable engineer, and justification text.
    Deferral: A documented decision to delay resolution of a finding with a committed re-review date. Not a dismissal — the finding remains open.
    Escalation Threshold: The pre-agreed criteria that trigger moving a disagreement to a higher authority level. Should be defined by severity × confidence, not by social dynamics.

    Quiz — Lesson 2

    Building the Tiebreaker Protocol
    Google Project Zero's disagreement protocol classified human responses to tool findings into three categories. Which of the following correctly lists all three?
    Correct. The lesson specifically describes Project Zero's three required response categories: Confirmed, Deferred (with dated re-review), and Suppressed (with named engineer and justification).
    The lesson describes Project Zero's three categories as Confirmed, Deferred, and Suppressed.
    The Capital One 2019 breach is cited in this lesson as an example of what specific protocol failure?
    Correct. Multiple tools flagged the misconfiguration at Medium severity; individual engineers suppressed the flags without second review — exactly the gap a tiebreaker protocol with peer confirmation requirements would close.
    The lesson attributes the failure to individual suppression of Medium findings without peer review confirmation — not tool failure or over-escalation.
    According to the lesson, escalation thresholds calibrated "too conservatively" (escalate everything) produced what documented outcome at GitHub Security Lab?
    Correct. GitHub Security Lab found that over-escalation caused burnout in 6 weeks. Calibrating to severity × reproducibility sustained engagement over 12 months.
    The lesson states over-escalation caused reviewer burnout within 6 weeks — not improved quality or other outcomes.
    Where should a tiebreaker protocol live to be effective under deadline pressure, according to the lesson?
    Correct. The lesson specifies all three locations — wiki, tooling (as a mandatory field), and onboarding — because a protocol in documentation alone will not be followed consistently under pressure.
    Documentation alone is insufficient. The lesson requires the protocol to exist in tooling as a mandatory field to survive deadline pressure.

    Lab 2 — Designing Tiebreaker Protocols

    Draft and stress-test escalation protocols for your own team context.

    Your Task

    Work with the AI assistant to draft a tiebreaker protocol for a specific team context you describe. The assistant will probe your protocol for gaps, test it against realistic scenarios, and help you refine it.

    Describe your team's context (size, stack, tooling, compliance requirements if any) and the assistant will guide you through building a working protocol structure.

    Start by describing your team context, or ask the assistant to give you a hypothetical context to work with.
    Module 6 · Lesson 3

    When the AI Is Right and the Human Is Wrong

    The documented patterns of human override errors — and the cognitive traps that produce them.
    Why do experienced engineers confidently dismiss correct findings?

    CVE-2014-0160 — Heartbleed. The OpenSSL maintainers had reviewed the bounds-checking code in tls1_process_heartbeat() multiple times across two years. The missing bounds check on the payload length variable was structurally obvious to Codenomicon's automated fuzzing tools in 2014 but invisible to expert humans who had built the surrounding code. The experts' familiarity with intent — knowing what the code was supposed to do — prevented them from seeing what it actually did. This is not incompetence; it is a well-documented cognitive effect called the tunnel of expertise.

    The Cognitive Traps Behind Override Errors

    Human reviewers make systematic, predictable errors when overriding AI findings. Understanding these patterns does not prevent them completely, but it allows teams to build structural checks that compensate for them.

    Familiarity Bias

    Code you wrote or reviewed before is harder to see clearly now. The brain pattern-matches to memory rather than reading the actual text. Heartbleed is the canonical case. Mitigation: never be the sole reviewer of your own code, regardless of seniority.

    Intent Projection

    Reviewers see what the code is meant to do and unconsciously fill in gaps. A missing bounds check is invisible when the reviewer already knows bounds are supposed to be checked. Mitigation: ask "what does this code actually do" separately from "what should it do."

    Seniority Override

    Junior reviewers rarely push back on a senior's dismissal even when they have a legitimate concern. The 1996 Ariane 5 Flight 501 failure involved a software team that had raised concerns about operand overflow; management override dismissed them. Mitigation: anonymous escalation paths.

    Scope Tunnel

    A finding may be dismissed as "not a problem in this context" based on local reasoning that does not account for how the code is called. SQL injection vulnerabilities in internal APIs are routinely dismissed this way, until the API gets exposed. Mitigation: require an analysis of all callers before suppressing injection-class findings.

    Base Rates and Confidence Calibration

    The Veracode State of Software Security report (2023) analyzed suppression rates across 750,000+ code scans. High-severity findings suppressed by human review were later confirmed as real vulnerabilities at a rate of 23%. Medium-severity suppressions were confirmed at 31%. This is not a rounding error — nearly one in three human override decisions on medium-severity AI findings was incorrect.

    This should be the baseline assumption whenever a reviewer says "this is a false positive": on the figures above, there is roughly a 23–31% chance they are wrong, depending on severity. That probability does not mean the reviewer is bad at their job; it means the disagreement protocol must not rely on human confidence alone.
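
Applied to a team's own suppression counts, the base rates give a rough expected number of incorrect overrides. In this sketch the two rates are the ones quoted above, while the counts in the usage note are made-up inputs for illustration.

```python
# Later-confirmed rates quoted in the lesson, keyed by finding severity.
CONFIRMED_LATER = {"high": 0.23, "medium": 0.31}


def expected_wrong_overrides(suppressed_counts: dict) -> float:
    """Expected number of suppressions that will later prove to be real."""
    return sum(count * CONFIRMED_LATER[severity]
               for severity, count in suppressed_counts.items())
```

For example, a quarter with 20 high-severity and 100 medium-severity suppressions would be expected to contain about 35.6 real vulnerabilities (20 × 0.23 + 100 × 0.31), before any individual reviewer's confidence is considered.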

    Key Principle

    Confidence is not evidence. A reviewer who confidently dismisses a finding should trigger the same protocol as one who is uncertain. The protocol should route based on finding severity, not on how sure the reviewer sounds.

    Structural Safeguards

    Several engineering organizations have documented structural interventions that measurably reduce false negative rates from human overrides:

  • Blind Re-review: A second engineer reviews the finding and the code without seeing the first reviewer's conclusion. Used by Apple's Security Engineering and Architecture (SEAR) team for critical-severity findings. Reduces confirmation bias in the review chain.
  • Adversarial Mode: Before suppressing a finding, the reviewer must spend 15 minutes actively trying to construct an exploit scenario, not just arguing why it cannot be exploited. Mozilla's security team documented this practice in their 2019 audit standards.
  • Historical Suppression Audit: Quarterly review of suppressed findings from the prior quarter against the current known-vulnerability database. Amazon Web Services runs this as part of their internal security review cadence. If suppressed findings are later confirmed at a rate above your team's threshold, re-open them.
  • Anonymous Escalation: Any engineer can escalate a disagreement to the security lead without it being attributed to them in the normal review system. Reduces the social cost of junior engineers pushing back on senior decisions.
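
The historical suppression audit in the list above can be sketched as a quarterly job. Everything here is an assumption for illustration: matching on a location string is a simplification, and re-opening only the confirmed findings when the rate stays under the threshold is one possible policy, not something the cited organizations are known to do.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Suppressed:
    finding_id: str
    rule: str
    location: str  # e.g. "src/net/parser.c:parse_header"


def audit(suppressed: Iterable[Suppressed],
          known_vulnerable_locations: Set[str],
          threshold: float) -> Tuple[float, List[Suppressed]]:
    """Return (later-confirmed rate, findings to re-open)."""
    items = list(suppressed)
    confirmed = [s for s in items if s.location in known_vulnerable_locations]
    rate = len(confirmed) / len(items) if items else 0.0
    # Above the threshold, last quarter's suppressions are not trusted as a
    # group, so all are re-opened; otherwise only the confirmed ones are.
    reopen = items if rate > threshold else confirmed
    return rate, reopen
```

Tracking the returned rate quarter over quarter gives the team its own calibration figure instead of relying on industry averages.
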

    Calibration Data Point

    The 2021 NSA/CISA joint advisory on software security noted that organizations with formal suppression audit processes had measurably lower mean time to detect (MTTD) for injected vulnerabilities in penetration tests — average 4.2 days vs. 23.7 days in organizations without formal audit processes. The audit creates accountability that changes suppression behavior even before vulnerabilities are confirmed.

    Quiz — Lesson 3

    When the AI Is Right and the Human Is Wrong
    The Heartbleed bug (CVE-2014-0160) is used in this lesson to illustrate which cognitive trap?
    Correct. The lesson describes the "tunnel of expertise" — familiarity with the code's intent made the bounds-checking error invisible to the experts who had reviewed the same code repeatedly.
    The lesson attributes Heartbleed to familiarity with intent making the actual bug invisible — a combination of familiarity bias and intent projection.
    Veracode's 2023 analysis of 750,000+ scans found medium-severity suppressions were later confirmed as real vulnerabilities at what rate?
    Correct. 31% of medium-severity human suppressions were later confirmed as real vulnerabilities — roughly one in three override decisions was wrong.
    The lesson states medium-severity suppressions were confirmed as real at 31% — a critical figure for understanding how unreliable human confidence is as a sole decision criterion.
    Mozilla's security team documented which specific structural safeguard in their 2019 audit standards?
    Correct. Mozilla's 2019 audit standards required adversarial mode — actively trying to exploit before dismissing — not just arguing why exploitation isn't possible.
    The lesson attributes adversarial mode specifically to Mozilla's 2019 standards. Blind re-review is attributed to Apple SEAR; historical audit to AWS.
    According to the NSA/CISA 2021 advisory data cited, organizations with formal suppression audit processes had what mean time to detect (MTTD) for injected vulnerabilities?
    Correct. 4.2 days vs. 23.7 days — a roughly 5.6x improvement in detection speed attributable to the accountability created by formal suppression audit processes.
    The lesson states 4.2 days for organizations with formal suppression audits vs. 23.7 days without. The 23.7 days is the unstructured baseline, not the structured result.

    Lab 3 — Adversarial Mode Practice

    Practice the adversarial mindset: argue for the finding before you argue against it.

    Your Task

    The AI assistant will present a code snippet with a finding that a human reviewer has already dismissed as a false positive. Your job is to apply adversarial mode: spend time constructing the strongest possible argument that the finding IS real before deciding.

    The assistant will evaluate your adversarial argument and then discuss whether the original dismissal was warranted or whether your argument reveals a real risk.

    Ask for your first code scenario, or describe a real finding you have previously dismissed that you want to stress-test.
    Module 6 · Lesson 4

    When the Human Is Right and the AI Is Wrong

    Managing false positives without building a culture of dismissal — and how to feed disagreements back into tool improvement.
    Every suppressed false positive is either a safe outcome or a missed calibration opportunity. How do you tell which?

    When Facebook (now Meta) deployed Infer — their open-source static analyzer — across the main mobile codebase in 2015, the initial deployment produced thousands of findings per week. The engineering teams' response was to build a systematic false positive feedback loop, not to lower alert thresholds globally. Each dismissed finding required a category tag. Over 18 months, the tag data allowed the Infer team to retrain and tune the analyzer, reducing false positive rates by 60% without reducing true positive detection. The feedback loop turned disagreements into tool improvements.

    What Makes a Legitimate False Positive

    A false positive is only definitively a false positive when you can state, specifically, why the tool's analysis model does not apply to this code in this context. Vague dismissals ("this is fine," "we handle this elsewhere") are not false positives — they are unverified suppressions that carry ongoing risk.

    The four legitimate grounds for suppression are:

  • Demonstrated unreachability: The code path the tool analyzed is provably unreachable in production — not assumed unreachable, but proven by control flow analysis or architecture documentation.
  • External mitigation: The vulnerability class is mitigated by infrastructure outside the code (WAF rule, network segmentation, runtime sandbox) with documented evidence that the mitigation is active and monitored.
  • Tool model mismatch: The tool's analysis model does not apply to this language pattern or runtime (e.g., a C-style bounds check flagged in a context where the runtime provides its own guaranteed bounds enforcement). Requires specific technical documentation.
  • Accepted risk with compensating control: The risk is acknowledged, a compensating control is in place, and the residual risk is formally accepted by a named accountable person at the appropriate authority level.
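
The four grounds above can be enforced as a controlled vocabulary with a minimal evidence check. This is a sketch under stated assumptions: the first two entries in the vague-phrase list are quoted from the lesson, the other two are added for illustration, and an exact-match check like this only catches the most obvious non-answers.

```python
from enum import Enum
from typing import List, Optional


class Ground(Enum):
    UNREACHABLE = "demonstrated unreachability"
    EXTERNAL_MITIGATION = "external mitigation"
    MODEL_MISMATCH = "tool model mismatch"
    ACCEPTED_RISK = "accepted risk with compensating control"


# First two phrases come from the lesson; the last two are assumptions.
VAGUE = {"this is fine", "we handle this elsewhere", "false positive", "n/a"}


def review_suppression(ground: Optional[Ground], evidence: str,
                       approver: str = "") -> List[str]:
    """Return the reasons a suppression request should be sent back."""
    issues = []
    if ground is None:
        issues.append("no legitimate ground selected")
    text = evidence.strip().lower().rstrip(".")
    if not text or text in VAGUE:
        issues.append("evidence is missing or too vague to verify")
    if ground is Ground.ACCEPTED_RISK and not approver.strip():
        issues.append("accepted risk needs a named approver")
    return issues
```

A request that returns an empty list has stated which ground applies and why; anything else stays an unverified suppression.
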

    The Feedback Loop Architecture

    Meta's Infer deployment succeeded not because engineers suppressed findings wisely, but because suppression data was systematically fed back to the tool team. This requires four infrastructure choices that most teams skip:

    Structured Suppression Tags

    Every suppression must select a category from a controlled vocabulary — not free text. Categories map directly to tool improvement strategies: "unreachable path" maps to reachability analysis improvements; "model mismatch" maps to rule tuning.

    Suppression Rate Dashboards

    Track suppression rates by rule, by codebase area, and by engineer. High suppression rates on a specific rule are a signal that the rule is miscalibrated for your codebase. This is actionable data, not noise.

    Quarterly Rule Reviews

    Rules with suppression rates above a threshold (e.g., 70%) should be reviewed and either retuned or removed. Keeping high-suppression rules active trains engineers to dismiss findings globally.

    Tool Team Access

    The team that maintains your analysis tooling should have read access to anonymized suppression data. This is how Infer improved — the tool team could see which rules engineers found untrustworthy and why.
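
The dashboard and quarterly-review choices above reduce to one calculation: the suppression rate per rule, compared against a threshold. The sketch below uses the 70% example threshold from the text; the input shape, a list of (rule id, was suppressed) pairs, is an assumption about how findings might be exported.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

REVIEW_THRESHOLD = 0.70  # example threshold given in the lesson


def suppression_rates(findings: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Share of findings per rule that reviewers suppressed."""
    total, suppressed = Counter(), Counter()
    for rule, was_suppressed in findings:
        total[rule] += 1
        if was_suppressed:
            suppressed[rule] += 1
    return {rule: suppressed[rule] / total[rule] for rule in total}


def rules_to_review(findings: Iterable[Tuple[str, bool]],
                    threshold: float = REVIEW_THRESHOLD) -> List[str]:
    """Rules whose suppression rate exceeds the threshold, sorted by name."""
    rates = suppression_rates(findings)
    return sorted(rule for rule, rate in rates.items() if rate > threshold)
```

Grouping by codebase area or by engineer instead only requires changing the first element of each pair.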

    Anti-Pattern to Avoid

    Globally lowering tool sensitivity to reduce alert volume is the worst possible response to a high false positive rate. It eliminates the false positives and an unknown number of true positives simultaneously. You cannot measure what you have lost. Per-rule suppression with documented feedback is always preferable to sensitivity reduction.

    Communicating False Positive Decisions Upward

    Security and engineering leadership need visibility into disagreement patterns — not individual findings, but aggregate trends. A monthly report showing "tool X produced 340 findings; 218 confirmed, 89 suppressed with justification, 33 under investigation" gives leadership the data to assess tool investment and team process health without requiring them to review individual cases.
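
An aggregate line like the one quoted above can be generated from per-finding statuses, so the report never depends on someone tallying by hand. This is a minimal sketch; the status strings and the function name are assumptions.

```python
from collections import Counter
from typing import Iterable


def monthly_summary(tool: str, statuses: Iterable[str]) -> str:
    """Build the one-line leadership summary from per-finding statuses."""
    counts = Counter(statuses)
    total = sum(counts.values())
    return (f"{tool} produced {total} findings; "
            f"{counts['confirmed']} confirmed, "
            f"{counts['suppressed']} suppressed with justification, "
            f"{counts['investigating']} under investigation")
```

Because the total is derived from the same list as the parts, the parts always account for every finding, as in the lesson's example where 218 + 89 + 33 = 340.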

    The OWASP Software Assurance Maturity Model (SAMM) version 2.0, released in 2020, explicitly identifies this reporting capability as a Level 3 maturity indicator in its Code Review practice. Teams that report aggregate AI–human disagreement metrics externally (to CISO, board, or compliance auditors) demonstrate higher security process maturity than those that report only confirmed findings.

    Synthesis

    The goal across all four lessons in this module is not to make AI tools or human reviewers win every disagreement — it is to make every disagreement productive. Classified, documented, and fed back into process improvement, disagreements become the primary mechanism by which both human judgment and AI tooling improve over time. A team that treats every conflict as a data point will outperform a team that treats every conflict as an inconvenience.

    Key Terms
    False Positive: A finding the tool raised that does not represent an actual risk. Only legitimate when a specific, documented reason explains why the tool's model does not apply.
    Suppression Rate: The percentage of findings for a given rule or category that human reviewers dismiss. A meaningful metric for tool calibration, not just a volume indicator.
    Feedback Loop: The structured process by which suppression decisions and their categories are returned to tool maintainers to improve analyzer precision over time.

    Quiz — Lesson 4

    When the Human Is Right and the AI Is Wrong
    Meta's Infer deployment achieved a 60% reduction in false positive rates over 18 months through what mechanism?
    Correct. The lesson specifically describes how categorized suppression tags fed back to the Infer tool team enabled precision improvements — not sensitivity reduction or tool replacement.
    The lesson credits the improvement to the feedback loop — structured suppression data returned to the tool team — not to sensitivity changes or tool replacement.
    Which of the following is listed in the lesson as NOT a legitimate ground for suppression?
    Correct. The lesson explicitly calls out vague dismissals as "not false positives — they are unverified suppressions that carry ongoing risk." Legitimate grounds require specific, documented reasoning.
    The lesson explicitly states that vague dismissals like "this is fine" or "we handle this elsewhere" are not legitimate false positive classifications.
    The OWASP SAMM v2.0 (2020) identifies reporting aggregate AI–human disagreement metrics as what maturity indicator?
    Correct. SAMM v2.0 places this reporting capability at Level 3 in the Code Review practice — the highest maturity tier, distinguishing advanced programs from standard ones.
    The lesson states OWASP SAMM v2.0 identifies this as a Level 3 maturity indicator in the Code Review practice.
    Why does the lesson argue that globally lowering tool sensitivity is the "worst possible response" to a high false positive rate?
    Correct. Global sensitivity reduction silently removes both noise and signal. Unlike per-rule suppression with documented feedback, it provides no data about what was lost and cannot be audited or reversed with precision.
    The lesson's argument is about measurement: sensitivity reduction removes true positives along with false positives invisibly. You cannot know what you missed.

    Lab 4 — Building the Feedback Loop

    Design a suppression feedback system that turns disagreements into tool improvements.

    Your Task

    Work with the AI assistant to design a suppression feedback loop for a team using one or more static analysis or AI audit tools. You will define suppression tag taxonomies, dashboard metrics, and quarterly review processes.

    The assistant will challenge your design with realistic edge cases and help you identify gaps before implementation.

    Describe the audit tooling your team uses (or ask for a hypothetical), and we'll build the feedback loop together.

    Module 6 — Final Test

    When AI and Human Disagree · 15 questions · 80% to pass
    1. A Type C disagreement occurs when:
    Correct.
    Type C is defined as: tool does not flag, human suspects a problem. Human judgment should not be suppressed by tool silence.
    2. What percentage of dismissed static analysis warnings in Microsoft's Windows research later contained real issues?
    Correct. Of the ~52% dismissed without investigation, roughly 8% later contained real issues that shipped.
    The lesson states approximately 8% of dismissals without investigation contained real issues that shipped.
    3. The correct resolution for a Type D disagreement (both agree there is a problem but disagree on severity) is:
    Correct. Pre-agreed rubrics remove social dynamics from severity decisions.
    Type D should be resolved by pre-agreed rubric (CVSS or equivalent), not by seniority or default escalation.
    4. Google Project Zero's maximum deferral period before automatic escalation was:
    Correct. Project Zero used 90-day maximum deferrals before automatic escalation to the security lead.
    The lesson specifies 90 days as Project Zero's deferral limit before automatic escalation.
    5. For a Critical/High severity finding where the human reviewer has LOW confidence in their counter-argument, the prescribed escalation path is:
    Correct. Critical/High + low human confidence = immediate security lead review, no suppression permitted.
    The matrix prescribes immediate security lead review with no suppression permitted for this combination.
    6. The "tunnel of expertise" described in relation to Heartbleed refers to:
    Correct. The tunnel of expertise is the cognitive effect where knowing what code is supposed to do prevents seeing what it actually does.
    The tunnel of expertise is the documented phenomenon where deep familiarity with intent creates blindness to actual defects.
    7. Veracode's 2023 data showed high-severity suppressions were later confirmed as real vulnerabilities at what rate?
    Correct. High-severity suppressions were confirmed at 23%; medium-severity suppressions at 31%.
    High-severity suppressions were confirmed at 23%. Medium-severity was the higher 31% figure.
    8. Apple's Security Engineering and Architecture (SEAR) team uses blind re-review for critical-severity findings. "Blind" means:
    Correct. Blind re-review means the second engineer sees the code and finding but not the first reviewer's conclusion, reducing confirmation bias.
    Blind re-review means the second reviewer does not see the first reviewer's conclusion — they evaluate independently to reduce confirmation bias.
    9. Meta's Infer feedback loop required suppression tags from a controlled vocabulary rather than free text. The primary benefit of structured tags over free text is:
    Correct. "Unreachable path" maps to reachability analysis improvements; "model mismatch" maps to rule tuning — structured categories enable targeted tool improvement.
    The key benefit is that structured categories translate directly into specific analyzer improvement strategies, which free text does not.
    10. A tiebreaker protocol that exists only in the team wiki but not in code review tooling will likely fail because:
    Correct. The lesson states that protocols existing only in documentation and not in tooling will not be followed consistently under deadline pressure.
    The lesson's argument is behavioral: deadline pressure causes people to skip documentation-only processes. Tooling enforcement is required for consistency.
    11. The 2022 GitHub Copilot security study found that AI-assisted developers were no more secure than unassisted developers but were significantly more confident. This finding most directly argues for which process change?
    Correct. The study shows that tool silence is being misread as safety endorsement (Type C failure), requiring process changes that prevent this false inference.
    The study identifies the misreading of tool silence as a safety signal. The process fix is treating silence as neutral and maintaining independent review.
    12. Which structural safeguard requires engineers to actively try to construct an exploit before they are permitted to suppress a finding?
    Correct. Adversarial mode (documented by Mozilla) requires 15 minutes actively constructing exploit scenarios before suppression is permitted.
    Adversarial mode is the practice, documented by Mozilla, of requiring exploit construction attempts before suppression.
    13. "Accepted risk with compensating control" is listed as a legitimate suppression ground. What additional requirement must be met?
    Correct. Accepted risk requires a named accountable person at the appropriate authority level — not just an engineer's informal judgment.
    The lesson requires formal acceptance by a named accountable person at the appropriate authority level for this suppression category.
    14. A suppression rate of 70% or higher on a specific analysis rule should trigger what action, according to the lesson?
    Correct. Rules with ≥70% suppression rates should be reviewed for retuning or removal — keeping them active conditions engineers to dismiss findings broadly.
    The lesson prescribes review for retuning or removal — not automatic deletion or severity escalation.
    15. According to the module's synthesis, what is the primary goal of a mature AI–human disagreement process?
    Correct. The synthesis explicitly states: "A team that treats every conflict as a data point will outperform a team that treats every conflict as an inconvenience."
    The synthesis frames the goal as making every disagreement productive and using conflicts as data for improvement — not minimizing findings or defaulting to AI.