In 2017, the Chromium project's automated tooling flagged a memory management pattern in V8 as a potential use-after-free vulnerability. A senior V8 engineer marked it WONTFIX, arguing the lifetime guarantees were enforced elsewhere. Months later, the pattern became the root cause of CVE-2017-5053, a remotely exploitable heap corruption bug. The engineer had domain knowledge the tool lacked, but the tool had identified the structural risk correctly. Neither was simply wrong.
AI-assisted code audit tools — whether traditional static analyzers like Coverity and CodeQL or LLM-based reviewers — reach conclusions through pattern matching, dataflow analysis, and statistical inference. Human reviewers bring architectural intent, business context, and runtime intuition. The gaps between these modes of reasoning produce predictable categories of conflict.
Type A. AI tools analyze what is written, not what is meant. A function that appears to double-free a pointer may be architecturally guaranteed to execute only one branch, but that guarantee lives in a design document, not in the code.
Type B. Training data and rule sets encode past vulnerabilities. Novel idioms, domain-specific invariants, or intentionally unsafe-looking code (e.g., SIMD intrinsics, embedded firmware) will trigger false positives at high rates.
Type C. The most dangerous case: engineers who built a system believe they understand all its invariants. CVE-2014-0160 (Heartbleed) existed in OpenSSL for two years while expert maintainers reviewed the code. They did not see the check that was missing.
Type D. AI tools typically reason at file or function scope. Humans reason at system scope. A finding may be correct at the local level but irrelevant at the system level, or vice versa. Neither perspective is complete.
Microsoft's internal research, published in the 2019 SOUPS proceedings, measured developer responses to static analysis warnings across Windows components. Engineers marked approximately 52% of all static analysis warnings as false positives without investigating them — not because they were definitely wrong, but because alarm fatigue had eroded trust. Of those dismissed warnings, later audit found roughly 8% contained real issues that shipped.
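The two percentages compound, and a short calculation shows what they mean per batch of warnings. This is only a sketch that takes the quoted figures at face value; the batch size of 1,000 warnings is a hypothetical round number.

```python
# Rough arithmetic on the figures quoted above, taken at face value.
# The batch size of 1,000 warnings is a hypothetical round number.
from fractions import Fraction

total_warnings = 1000
dismissed_rate = Fraction(52, 100)    # dismissed without investigation
real_issue_rate = Fraction(8, 100)    # of dismissed warnings, later found real

dismissed = total_warnings * dismissed_rate
shipped_real_issues = dismissed * real_issue_rate
share_of_all_warnings = shipped_real_issues / total_warnings

print(f"dismissed without investigation: {float(dismissed):.0f}")
print(f"real issues among them: {float(shipped_real_issues):.1f}")
print(f"share of all warnings: {float(share_of_all_warnings):.2%}")
```

Under these assumptions, roughly 4% of all warnings correspond to a real issue that was dismissed unread, which is small per warning but adds up across a large codebase.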
This creates a structural problem: the act of disagreeing with AI tooling has a non-trivial false negative cost that is invisible at the moment of disagreement. Your team needs a framework for disagreement that is neither reflexive acceptance nor reflexive dismissal.
Disagreement between AI and human reviewers is not a failure state — it is a signal. The question is what kind of signal: a gap in tool context, a gap in human attention, or a genuine ambiguity that requires deeper investigation. Each type demands a different response.
The worst practice is to resolve disagreements by seniority: whoever has more authority wins. The correct practice is to classify the disagreement by type before deciding who, if anyone, is right.
A 2022 Stanford user study of AI code assistants (Perry et al., later published at ACM CCS 2023) found that participants with access to an AI assistant wrote less secure code than unassisted participants, yet were more likely to believe their code was secure. In effect, the absence of a warning from the AI was treated as a positive safety signal. This is the Type C failure mode at scale.
You will be presented with code audit scenarios where an AI tool and a human reviewer reach different conclusions. Classify each as Type A, B, C, or D, and describe what the correct resolution process should be.
The AI assistant will present scenarios, evaluate your classifications, and give feedback. Complete at least 3 exchanges to finish the lab.
When Google's Project Zero team adopted automated vulnerability research tooling in 2014, they explicitly designed a disagreement protocol before the tools went live. Tool findings required a documented human response in one of three categories: Confirmed, Deferred (with a dated re-review commitment), or Suppressed (with a named engineer and written justification). The suppression rate and its correlation with later confirmed bugs became a quarterly metric. This structure — not the tool itself — is what kept the false negative rate measurable and accountable.
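The three response categories described above can be enforced by tooling rather than by convention. The following is a minimal sketch, assuming a hypothetical `FindingResponse` record; the class, field names, and error messages are illustrative and are not taken from any real tool.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional


class Response(Enum):
    CONFIRMED = "confirmed"
    DEFERRED = "deferred"
    SUPPRESSED = "suppressed"


@dataclass(frozen=True)
class FindingResponse:
    """Hypothetical record of a human response to one tool finding."""
    finding_id: str
    response: Response
    engineer: Optional[str] = None
    justification: Optional[str] = None
    re_review_on: Optional[date] = None


def validation_errors(r: FindingResponse) -> List[str]:
    """Return the reasons a response is incomplete (empty list if valid)."""
    errors = []
    if r.response is Response.DEFERRED and r.re_review_on is None:
        errors.append("deferred findings need a dated re-review commitment")
    if r.response is Response.SUPPRESSED:
        if not r.engineer:
            errors.append("suppressed findings need a named engineer")
        if not (r.justification and r.justification.strip()):
            errors.append("suppressed findings need a written justification")
    return errors
```

A check like this could run when a reviewer closes a finding, so that an incomplete suppression is rejected at the moment it is attempted rather than discovered in a later audit.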
When a disagreement is resolved ad hoc, the resolution reflects the social dynamics of the moment: who is more senior, who is more confident, who has more time. These are not reliable proxies for correctness. Pre-agreed protocols replace social dynamics with process, and make the resolution auditable.
The NIST Secure Software Development Framework (SSDF, SP 800-218) includes a Respond to Vulnerabilities (RV) practice group, which calls for identified vulnerabilities to be assessed, prioritized, and remediated. A tiebreaker protocol is one mechanism for putting that practice into operation when AI and human reviewers produce conflicting signals.
Not every disagreement should escalate to the same level. Over-escalation creates noise that burns out reviewers; under-escalation lets risks slip through. The calibration should be based on two dimensions: severity of the finding (if confirmed) and confidence of the human's counter-argument.
| Finding Severity | Human Confidence | Escalation Path |
|---|---|---|
| Critical / High | Low (can't articulate) | Immediate security lead review. No suppression permitted. |
| Critical / High | High (documented context) | Second engineer must independently confirm suppression justification. |
| Medium | Low | Deferral with 30-day re-review. Flagged in weekly team audit report. |
| Medium | High | Single-engineer suppression with written justification logged. |
| Low / Info | Any | Single-engineer decision. Justification optional but encouraged. |
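
The table above is small enough to encode directly, which makes the routing testable and keeps it out of individual judgment. This is a minimal sketch; the enum names and the returned path identifiers are hypothetical labels chosen for illustration.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    INFO = "info"


class Confidence(Enum):
    LOW = "low"    # reviewer cannot articulate why the finding is wrong
    HIGH = "high"  # reviewer has documented context


def escalation_path(severity: Severity, confidence: Confidence) -> str:
    """Map (finding severity, human confidence) to an escalation path."""
    if severity in (Severity.CRITICAL, Severity.HIGH):
        if confidence is Confidence.LOW:
            return "security-lead-review"        # no suppression permitted
        return "second-engineer-confirmation"    # independent confirmation
    if severity is Severity.MEDIUM:
        if confidence is Confidence.LOW:
            return "defer-30-days"               # flagged in weekly audit report
        return "single-engineer-suppression-logged"
    return "single-engineer-decision"            # Low / Info, any confidence
```

For example, `escalation_path(Severity.HIGH, Confidence.LOW)` returns `"security-lead-review"`, matching the first row of the table.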
The Capital One breach (2019) involved a misconfigured web application firewall in the company's AWS environment, a class of misconfiguration that multiple automated scanning tools had flagged at Medium severity. The flags were suppressed by individual engineers without second review. A tiebreaker protocol requiring peer confirmation on Medium suppressions in production infrastructure would have forced a second look before those flags were closed.
The protocol should live in three places simultaneously: your team wiki (for reference), your code review tooling (as a mandatory field when suppressing a finding), and your onboarding materials (so new engineers understand the system before they first encounter a disagreement). A protocol that exists only in documentation and not in tooling will not be followed consistently under deadline pressure.
GitHub's Security Lab team published their internal review escalation criteria in 2021 as part of their CodeQL documentation. Their key insight: escalation thresholds calibrated too conservatively (escalate everything) produced reviewer burnout within 6 weeks. Thresholds calibrated to severity × reproducibility sustained engagement across a 12-month period.
Work with the AI assistant to draft a tiebreaker protocol for a specific team context you describe. The assistant will probe your protocol for gaps, test it against realistic scenarios, and help you refine it.
Describe your team's context (size, stack, tooling, compliance requirements if any) and the assistant will guide you through building a working protocol structure.
CVE-2014-0160: Heartbleed. The OpenSSL maintainers had reviewed the heartbeat handling code in tls1_process_heartbeat() multiple times across two years. The missing bounds check on the payload length variable was structurally obvious to Codenomicon's automated fuzzing tools in 2014 but invisible to expert humans who had built the surrounding code. The experts' familiarity with intent, knowing what the code was supposed to do, prevented them from seeing what it actually did. This is not incompetence; it is a well-documented cognitive effect, often described as expert blind spot or the curse of knowledge.
Human reviewers make systematic, predictable errors when overriding AI findings. Understanding these patterns does not prevent them completely, but it allows teams to build structural checks that compensate for them.
Code you wrote or reviewed before is harder to see clearly now. The brain pattern-matches to memory rather than reading the actual text. Heartbleed is the canonical case. Mitigation: never be the sole reviewer of your own code, regardless of seniority.
Reviewers see what the code is meant to do and unconsciously fill in gaps. A missing bounds check is invisible when the reviewer already knows bounds are supposed to be checked. Mitigation: ask "what does this code actually do" separately from "what should it do."
Junior reviewers rarely push back on a senior's dismissal even when they have a legitimate concern. The 1996 Ariane 5 Flight 501 failure involved a software team that had raised concerns about operand overflow; management override dismissed them. Mitigation: anonymous escalation paths.
A finding may be dismissed as "not a problem in this context" based on local reasoning that does not account for how the code is called. SQL injection vulnerabilities in internal APIs are routinely dismissed this way — until the API gets exposed. Mitigation: require callers analysis before suppression of injection-class findings.
The Veracode State of Software Security report (2023) analyzed suppression rates across 750,000+ code scans. High-severity findings suppressed by human review were later confirmed as real vulnerabilities at a rate of 23%. Medium-severity suppressions were confirmed at 31%. This is not a rounding error — nearly one in three human override decisions on medium-severity AI findings was incorrect.
This should be the baseline assumption when a reviewer says "this is a false positive" with medium confidence: there is roughly a 25–30% chance they are wrong. That probability does not mean the reviewer is bad at their job; it means the disagreement protocol must not rely on human confidence alone.
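To see what those rates imply for a team's own backlog, multiply them by the number of suppressions. This is a minimal sketch that takes the percentages quoted above at face value; the quarterly suppression counts are hypothetical.

```python
# Override error rates quoted above, as whole percentages.
OVERRIDE_ERROR_PERCENT = {"high": 23, "medium": 31}


def expected_incorrect_suppressions(suppressed_counts):
    """Expected number of suppressions that hide a real vulnerability."""
    return {
        severity: count * OVERRIDE_ERROR_PERCENT[severity] / 100
        for severity, count in suppressed_counts.items()
    }


# Hypothetical quarter: 40 high-severity and 100 medium-severity suppressions.
quarter = {"high": 40, "medium": 100}
for severity, expected in expected_incorrect_suppressions(quarter).items():
    print(f"{severity}: about {expected:.1f} suppressions likely hide a real issue")
```

The point of the exercise is scale: even a modest number of confident dismissals per quarter implies a steady stream of real issues being waved through.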
Confidence is not evidence. A reviewer who confidently dismisses a finding should trigger the same protocol as one who is uncertain. The protocol should route based on finding severity, not on how sure the reviewer sounds.
Several engineering organizations have documented structural interventions that measurably reduce false negative rates from human overrides:
The 2021 NSA/CISA joint advisory on software security noted that organizations with formal suppression audit processes had measurably lower mean time to detect (MTTD) for injected vulnerabilities in penetration tests — average 4.2 days vs. 23.7 days in organizations without formal audit processes. The audit creates accountability that changes suppression behavior even before vulnerabilities are confirmed.
The AI assistant will present a code snippet with a finding that a human reviewer has already dismissed as a false positive. Your job is to apply adversarial mode: spend time constructing the strongest possible argument that the finding IS real before deciding.
The assistant will evaluate your adversarial argument and then discuss whether the original dismissal was warranted or whether your argument reveals a real risk.
When Facebook (now Meta) deployed Infer — their open-source static analyzer — across the main mobile codebase in 2015, the initial deployment produced thousands of findings per week. The engineering teams' response was to build a systematic false positive feedback loop, not to lower alert thresholds globally. Each dismissed finding required a category tag. Over 18 months, the tag data allowed the Infer team to retrain and tune the analyzer, reducing false positive rates by 60% without reducing true positive detection. The feedback loop turned disagreements into tool improvements.
A false positive is only definitively a false positive when you can state, specifically, why the tool's analysis model does not apply to this code in this context. Vague dismissals ("this is fine," "we handle this elsewhere") are not false positives — they are unverified suppressions that carry ongoing risk.
The four legitimate grounds for suppression are:
Meta's Infer deployment succeeded not because engineers suppressed findings wisely, but because suppression data was systematically fed back to the tool team. This requires three infrastructure choices that most teams skip:
Every suppression must select a category from a controlled vocabulary — not free text. Categories map directly to tool improvement strategies: "unreachable path" maps to reachability analysis improvements; "model mismatch" maps to rule tuning.
Track suppression rates by rule, by codebase area, and by engineer. High suppression rates on a specific rule are a signal that the rule is miscalibrated for your codebase. This is actionable data, not noise.
Rules with suppression rates above a threshold (e.g., 70%) should be reviewed and either retuned or removed. Keeping high-suppression rules active trains engineers to dismiss findings globally.
The team that maintains your analysis tooling should have read access to anonymized suppression data. This is how Infer improved — the tool team could see which rules engineers found untrustworthy and why.
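The tagging and dashboard ideas above can be sketched in a few lines. This is an illustration under stated assumptions: only the two tag names mentioned above come from the text, while the rule identifiers and the input shape, pairs of `(rule_id, was_suppressed)`, are hypothetical.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class SuppressionTag(Enum):
    """Controlled vocabulary; free text is rejected by construction."""
    UNREACHABLE_PATH = "unreachable path"
    MODEL_MISMATCH = "model mismatch"
    # Any further categories are for the team to define.


def suppression_rates(findings: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the suppression rate per rule from (rule_id, was_suppressed) pairs."""
    total: Counter = Counter()
    suppressed: Counter = Counter()
    for rule_id, was_suppressed in findings:
        total[rule_id] += 1
        if was_suppressed:
            suppressed[rule_id] += 1
    return {rule: suppressed[rule] / total[rule] for rule in total}


def rules_needing_review(rates: Dict[str, float], threshold: float = 0.70) -> List[str]:
    """Rules whose suppression rate exceeds the threshold, sorted by name."""
    return sorted(rule for rule, rate in rates.items() if rate > threshold)
```

Looking up a tag with `SuppressionTag("model mismatch")` succeeds, while `SuppressionTag("this is fine")` raises `ValueError`, which is how the controlled vocabulary blocks vague dismissals. The same per-rule computation can be grouped by codebase area or by engineer.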
Globally lowering tool sensitivity to reduce alert volume is the worst possible response to a high false positive rate. It eliminates the false positives and an unknown number of true positives simultaneously. You cannot measure what you have lost. Per-rule suppression with documented feedback is always preferable to sensitivity reduction.
Security and engineering leadership need visibility into disagreement patterns — not individual findings, but aggregate trends. A monthly report showing "tool X produced 340 findings; 218 confirmed, 89 suppressed with justification, 33 under investigation" gives leadership the data to assess tool investment and team process health without requiring them to review individual cases.
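A report line of that shape can be generated directly from finding statuses. This is a minimal sketch; the status strings `"confirmed"`, `"suppressed"`, and `"investigating"` are hypothetical labels, and the counts reproduce the example above.

```python
from collections import Counter
from typing import Iterable


def monthly_summary(tool: str, statuses: Iterable[str]) -> str:
    """Summarize one tool's findings for the month as a single report line."""
    counts = Counter(statuses)
    return (
        f"{tool} produced {sum(counts.values())} findings; "
        f"{counts['confirmed']} confirmed, "
        f"{counts['suppressed']} suppressed with justification, "
        f"{counts['investigating']} under investigation"
    )


# Reproduces the example figures: 218 + 89 + 33 = 340 findings.
statuses = ["confirmed"] * 218 + ["suppressed"] * 89 + ["investigating"] * 33
print(monthly_summary("tool X", statuses))
```

Because the summary is derived from the same records engineers already create when responding to findings, it adds no reporting burden of its own.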
The OWASP Software Assurance Maturity Model (SAMM) version 2.0, released in 2020, explicitly identifies this reporting capability as a Level 3 maturity indicator in its Code Review practice. Teams that report aggregate AI–human disagreement metrics externally (to CISO, board, or compliance auditors) demonstrate higher security process maturity than those that report only confirmed findings.
The goal across all four lessons in this module is not to make AI tools or human reviewers win every disagreement — it is to make every disagreement productive. Classified, documented, and fed back into process improvement, disagreements become the primary mechanism by which both human judgment and AI tooling improve over time. A team that treats every conflict as a data point will outperform a team that treats every conflict as an inconvenience.
Work with the AI assistant to design a suppression feedback loop for a team using one or more static analysis or AI audit tools. You will define suppression tag taxonomies, dashboard metrics, and quarterly review processes.
The assistant will challenge your design with realistic edge cases and help you identify gaps before implementation.