In October 2021, Zillow announced it was shutting down Zillow Offers, its algorithmic home-buying unit, and writing down $569 million in losses. The company's pricing agent had been systematically overbidding on houses, sometimes by 20% above market value, because it trained on listing prices rather than realized sale prices, and because human reviewers had been progressively removed from the loop as the model hit its volume targets.
Zillow's CEO Rich Barton acknowledged in the earnings call that "the unpredictability in forecasting home prices far exceeds what we anticipated." But internal reporting later showed that field managers had flagged the model's overconfidence months earlier. Those flags had not been escalated. There was no formal mechanism to receive them. There had been no risk audit, only periodic business reviews focused on volume and margin.
Most organizations that deploy AI agents conduct some form of ongoing review: performance dashboards, weekly standups, quarterly business reviews. These are useful. They are not risk audits. A business review asks, "Is the agent hitting its targets?" A risk audit asks, "What could this agent do that we haven't anticipated, and who would know?"
The distinction matters because agent failures often occur in the gap between those two questions. Zillow's agent was hitting volume targets. Barton's team reviewed those targets every week. Nobody asked what the model would do in a cooling market, what it would do if it became a significant price-setter in local markets, or what signals from the field would indicate overbidding was systemic rather than incidental.
A risk audit is a structured, adversarial investigation of an agent system: its decision logic, its data inputs, its human oversight mechanisms, its failure modes, and its organizational accountability structures. It produces findings, not just metrics. It involves people outside the team that built and runs the agent. And it asks uncomfortable questions on purpose.
Across documented post-mortems, from Zillow to Amazon's discontinued recruiting AI to the Dutch childcare benefits algorithm, effective risk audits share four properties that distinguish them from performance reviews.
Security professionals have long distinguished between vulnerability scanning (automated, surface-level) and threat modeling (structured reasoning about who might cause harm, how, and with what consequences). A risk audit of an AI agent is closer to threat modeling than to automated scanning.
Before collecting any data, the auditor should articulate a set of threat scenarios: specific, plausible ways the agent could produce harm. For a customer-service agent, this might include: handling a customer in financial distress in a way that increases debt; providing medically relevant information without appropriate caveats; escalating a dispute in a manner that violates consumer protection regulations. For a procurement agent, it might include: concentrating vendor relationships in ways that create single points of failure; approving purchases outside policy without triggering review; generating false invoices at scale.
The Dutch SyRI case, in which an algorithmic welfare fraud detection system was struck down by a Dutch court in 2020, illustrates what happens when threat modeling is absent. The system combined data from seventeen government databases to generate fraud risk scores. No threat model had asked: what if legitimate citizens are systematically misclassified? What if the data combines in ways that discriminate by postal code? What if there is no human reviewer capable of explaining a score to a challenged citizen? All of these scenarios materialized. None had been formally anticipated.
A risk audit begins not with data collection but with imagination: who could be harmed, in what way, through what mechanism? Only once those scenarios are written down can you assess whether the agent's current design and oversight prevent them.
Before any audit begins, three scoping decisions must be made explicitly, because making them implicitly means someone else makes them for you, usually in ways that narrow the audit's usefulness.
Each lesson in this module adds one layer to your audit: L1 establishes the audit concept and scope, L2 covers the risk identification methodology, L3 examines oversight gap analysis, and L4 builds the findings report and remediation plan. By the end, you will have a full audit framework you can apply to a real agent in your organization.
Think of an AI agent currently in use at your organization, or one you are planning to deploy. This can be a customer service chatbot, a procurement automation tool, a hiring screener, a content recommendation system, or any other agent that takes actions or makes decisions. Work with the audit coach below to define a defensible audit scope for that agent.
Between 2014 and 2018, Amazon developed a machine learning tool to screen engineering job applications. The system was trained on ten years of submitted resumes. Because Amazon's engineering workforce was predominantly male, the model learned to penalize resumes that included the word "women's" (as in "women's chess club") and to downgrade graduates of all-women's colleges. The bias was not intentional. It was not in the requirements document. It was not visible in the model's aggregate accuracy metrics. It emerged from the interaction between training data composition and objective function, a category of risk that only systematic threat modeling would have surfaced.
Amazon's team discovered the problem in 2015, attempted to correct it through 2017, concluded it could not reliably prevent the model from finding proxy variables for gender, and quietly disbanded the team in 2018. The tool had operated for at least a year after the initial discovery before being shut down. Reuters reported the story in October 2018. There had been no external audit, no structured stakeholder review, and no formal threat inventory that would have flagged training data composition as a first-order risk.
The Amazon case illustrates why risk identification cannot rely on intuition or on reviewing what the team was asked to build. Risks emerge from system interactions: between training data and objectives, between model outputs and downstream processes, between agent automation and human judgment. A structured threat inventory must cover four domains explicitly.
Data risks. Training data composition, data drift, proxy variable encoding, distribution shift between training and deployment, missing data handling, and the values embedded in labeling decisions. Amazon's recruiting failure was entirely a data domain risk.
Model risks. Objective function misalignment, overconfidence in low-data regions, distributional sensitivity to input format, emergent behaviors at scale, and model behavior under adversarial inputs. Zillow's overbidding was partly a model risk: overconfidence in a rising market condition.
Integration risks. Feedback loops between agent outputs and future training data, automation of downstream processes that amplify errors, agent-to-agent interactions in multi-agent pipelines, and latency between detection and correction. The 2010 Flash Crash, partly attributable to interacting algorithmic trading agents, is a canonical integration risk event.
Governance risks. Absence of accountability for agent decisions, unclear escalation paths, misalignment between agent authority and human capacity to review, lack of audit trails, and organizational incentives that suppress risk reporting. This was Zillow's primary failure mode.
Microsoft's STRIDE framework, originally developed for software security, has been adapted by several AI governance teams as a structured threat enumeration method. The original acronym covers Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, and Elevation of Privilege. For AI agents, each category maps to distinct risk vectors.
Spoofing in agent contexts includes prompt injection (a user or upstream system feeding the agent instructions that override its guidelines), identity fraud in multi-agent pipelines, and falsified data provenance. Tampering includes training data poisoning, model weight manipulation in shared model registries, and adversarial perturbation of inputs. Repudiation covers agents that take consequential actions with no audit trail, a risk particularly acute in agentic systems that execute code or make financial transactions.
Information Disclosure includes model inversion attacks (inferring training data from model outputs), membership inference (determining whether specific individuals were in the training set), and data leakage through agent responses. Denial of Service in agentic systems includes resource exhaustion through adversarial input design and agent action loops. Elevation of Privilege, increasingly critical in 2024-2025 deployments, covers agents that acquire permissions beyond their initial grant, either by misinterpreting instructions or by being manipulated by external content.
The value of STRIDE is not that it covers every possible risk; it doesn't. The value is that it provides a systematic checklist that forces the audit team to consider risk categories they wouldn't generate through free-form brainstorming. Complement it with domain-specific checklists for your sector (financial services, healthcare, HR) and with stakeholder interviews that surface operational risks engineers don't see.
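The checklist idea above can be sketched as a simple lookup. This is a minimal illustration, not a standard artifact: the dictionary entries paraphrase the risk vectors named in the preceding paragraphs, and the names `STRIDE_AGENT_VECTORS` and `unreviewed_categories` are hypothetical.

```python
# Each STRIDE category mapped to the agent risk vectors named in the text.
# The structure and names are illustrative assumptions.
STRIDE_AGENT_VECTORS = {
    "Spoofing": ["prompt injection", "identity fraud in multi-agent pipelines",
                 "falsified data provenance"],
    "Tampering": ["training data poisoning", "model weight manipulation",
                  "adversarial input perturbation"],
    "Repudiation": ["consequential actions with no audit trail"],
    "Information Disclosure": ["model inversion", "membership inference",
                               "data leakage through agent responses"],
    "Denial of Service": ["resource exhaustion via adversarial inputs",
                          "agent action loops"],
    "Elevation of Privilege": ["permissions acquired beyond the initial grant"],
}


def unreviewed_categories(reviewed):
    """Return the STRIDE categories the audit team has not yet considered."""
    return [category for category in STRIDE_AGENT_VECTORS if category not in reviewed]


# Example: a team that has only brainstormed spoofing and tampering risks.
remaining = unreviewed_categories({"Spoofing", "Tampering"})
for category in remaining:
    print(category, "->", ", ".join(STRIDE_AGENT_VECTORS[category]))
```

The point of the sketch is the forcing function: the audit is not done with STRIDE until `remaining` is empty, whatever the team's intuitions about which categories matter.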
A complete threat inventory will contain more risks than any organization can address simultaneously. Prioritization requires a scoring framework. The standard approach, probability × impact, is necessary but insufficient for AI agents, because some agent failures are extremely difficult to reverse once they occur.
The Dutch SyRI case illustrates this asymmetry: the probability that any individual would be wrongly flagged was relatively low. But once flagged, individuals faced benefit suspension, debt collection, and reputational harm, outcomes that were difficult to reverse even when the error was acknowledged. A standard probability × impact matrix would have underweighted this risk. Adding a reversibility dimension changes the calculus: even low-probability harms that are hard to reverse warrant higher priority treatment.
For your risk inventory, use a three-factor rating: Probability (how likely is this scenario?), Impact (who is affected and how severely?), and Reversibility (if this occurs, can the harm be corrected?). Rate each on a 1-5 scale and compute a weighted score, with reversibility weighted at 1.5× because agent systems operate at a speed and scale that outpace human correction capacity.
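The text does not spell out the exact formula, so the sketch below makes two assumptions, both flagged in the comments: the three factors are multiplied together, and a reversibility rating of 5 means "hardest to reverse". The function name `composite_score` is hypothetical.

```python
REVERSIBILITY_WEIGHT = 1.5  # weighting stated in the text


def composite_score(probability: int, impact: int, reversibility: int) -> float:
    """Weighted three-factor score.

    Assumptions (not from the source): factors are combined by
    multiplication, and reversibility 5 = hardest to reverse.
    """
    for name, value in (("probability", probability),
                        ("impact", impact),
                        ("reversibility", reversibility)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return probability * impact * (REVERSIBILITY_WEIGHT * reversibility)


# A rare but nearly irreversible harm versus a likelier, easily corrected one.
rare_but_irreversible = composite_score(probability=1, impact=4, reversibility=5)
likely_but_fixable = composite_score(probability=3, impact=4, reversibility=1)
print(rare_but_irreversible, likely_but_fixable)  # 30.0 18.0
```

Under a plain probability × impact score the second risk would rank higher (12 versus 4); with the reversibility factor the ordering flips, which is the asymmetry the three-factor rating is meant to capture.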
The product of your risk identification work is a risk register: a table listing each identified risk, its domain (data/model/integration/governance), its probability/impact/reversibility scores, its composite priority, and the agent component it is associated with. A risk register with no entries is not a clean bill of health β it is evidence the audit was not performed seriously.
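One way to hold the register in code is a list of typed records. The field names, the sample entries, and the `prioritized` helper below are illustrative assumptions; the composite priority is taken as an already-computed input.

```python
from dataclasses import dataclass

DOMAINS = {"data", "model", "integration", "governance"}


@dataclass
class RiskEntry:
    risk_id: str
    description: str
    domain: str          # one of DOMAINS
    probability: int     # 1-5
    impact: int          # 1-5
    reversibility: int   # 1-5
    composite: float     # composite priority, computed elsewhere
    component: str       # agent component the risk is associated with

    def __post_init__(self):
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain}")


def prioritized(register):
    """Return entries sorted by composite priority, highest first."""
    if not register:
        # An empty register signals an audit that was not performed seriously.
        raise ValueError("risk register is empty")
    return sorted(register, key=lambda entry: entry.composite, reverse=True)


# Hypothetical entries for illustration only.
register = [
    RiskEntry("R-01", "proxy variables encode gender", "data", 3, 4, 3, 54.0, "ranking model"),
    RiskEntry("R-02", "no halt authority for deployments", "governance", 2, 5, 5, 75.0, "release process"),
]
ordered_ids = [entry.risk_id for entry in prioritized(register)]
print(ordered_ids)  # ['R-02', 'R-01']
```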
Documentation and technical review will not surface all material risks. Amazon's recruiting system bias was not in any requirements document. Zillow's overbidding was known to field managers but never formally captured. A systematic interview process with a defined stakeholder set is a required component of any rigorous risk identification.
Interview categories that consistently surface material risks: frontline staff who interact with the agent's outputs daily; affected populations who experience agent decisions without direct interaction; adjacent system owners whose systems consume agent outputs; compliance and legal teams who understand regulatory exposure; and customer-facing staff who hear complaints that don't reach engineering teams. In each interview, ask: "What would you change about this system if you had the authority?" and "When have you seen this system behave in a way that surprised or concerned you?"
Using the agent you defined in Lab 1 (or a new one), work with the risk identification coach to build a threat inventory covering all four risk domains: data, model, integration, and governance. You'll also practice scoring risks using the probability × impact × reversibility framework.
On June 1, 2009, Air France Flight 447 disappeared over the Atlantic Ocean with 228 people aboard. The Bureau d'Enquêtes et d'Analyses investigation, completed in 2012, identified a failure pattern that AI governance researchers have since adopted as a reference case: the pilots had operated the Airbus A330's fly-by-wire automation system for years without developing the manual flying skills to intervene effectively when automation failed. When the pitot tubes iced over and the autopilot disconnected, the crew had 4 minutes and 24 seconds to respond. They did not recognize the stall. They had never practiced recovering from it manually. The aircraft struck the ocean while descending at 10,912 feet per minute.
This is the phenomenon Lisanne Bainbridge described as the "Ironies of Automation" in her 1983 paper: the more reliable and capable the automated system, the less opportunity human operators have to maintain the skills and situational awareness needed to oversee it. The gap between nominal oversight and real oversight grows with automation quality.
Oversight gap analysis maps the distance between what an organization believes its oversight mechanisms achieve and what they actually achieve under realistic operating conditions. Across documented AI agent incidents, five gap categories appear repeatedly.
The most reliable method for measuring oversight gaps is red team simulation: a structured exercise in which a designated team attempts to produce harmful or unintended agent outputs, while a separate team measures whether the oversight mechanisms detect and respond appropriately. Red teaming was developed in military and intelligence contexts, adopted by cybersecurity, and is increasingly required by AI governance frameworks including the EU AI Act (Article 9 on risk management systems) and NIST AI RMF Govern 1.1.
A red team simulation for an AI agent oversight audit has three components. First, scenario design: the red team specifies a set of harm scenarios derived from the risk register built in L2, and designs agent inputs or operating conditions intended to produce those scenarios. Second, observation: the red team executes the scenarios while an independent observer tracks whether oversight mechanisms (alerts, dashboards, human review, escalation paths) detect the problem and in what timeframe. Third, gap documentation: the observer records each scenario where detection failed or was delayed beyond the acceptable response time, and categorizes the failure by oversight gap type.
In 2023, the UK's AI Safety Institute conducted structured evaluations of frontier models that effectively functioned as red team exercises: testing whether models would provide dangerous information under various framing conditions, and measuring whether model-level safeguards detected and blocked the attempts. The exercises revealed systematic gaps in refusal mechanisms that were not visible in standard benchmark evaluation.
Full red team simulations require dedicated time and personnel. For organizations without dedicated AI safety teams, a tabletop exercise, where stakeholders walk through harm scenarios and verbally trace what the detection and response process would be, can approximate red team findings at lower cost. It is less rigorous but far better than no simulation at all.
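The gap documentation step can be sketched as a filter over the observer's records. Everything here is a hypothetical illustration: the `ScenarioResult` fields, the scenario IDs, and the time thresholds are assumptions, and the gap type labels are the five used in this module.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScenarioResult:
    scenario_id: str
    risk_id: str                               # risk register entry the scenario tests
    detected_after_minutes: Optional[float]    # None means oversight never detected it
    acceptable_minutes: float                  # acceptable response time for this scenario
    gap_type: str                              # deskilling, volume, escalation, comprehension, authority


def document_gaps(results):
    """Return (scenario_id, gap_type, outcome) for every failed or late detection."""
    gaps = []
    for result in results:
        if result.detected_after_minutes is None:
            gaps.append((result.scenario_id, result.gap_type, "not detected"))
        elif result.detected_after_minutes > result.acceptable_minutes:
            gaps.append((result.scenario_id, result.gap_type, "detected late"))
    return gaps


# Hypothetical observer records from one exercise.
results = [
    ScenarioResult("S-1", "R-12", None, 30, "escalation"),
    ScenarioResult("S-2", "R-04", 45, 15, "volume"),
    ScenarioResult("S-3", "R-07", 5, 15, "authority"),
]
gaps = document_gaps(results)
print(gaps)
```

Scenario S-3 is detected within its window, so it produces no gap entry; only S-1 and S-2 carry forward into the oversight gap analysis.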
Oversight gap analysis requires measuring actual oversight activity, not documented oversight policy. These are typically very different numbers. Document the following for your agent system:
The oversight gap map is a visual representation of your agent's decision pipeline, annotated with the location and severity of each identified gap. Each gap should be labeled by type (deskilling, volume, escalation, comprehension, authority), rated by severity, and linked to the specific risk register entries it leaves unmitigated.
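Before drawing the map, it can help to hold the annotations as data. The sketch below is one possible representation, with hypothetical stage names, an assumed 1-5 severity scale, and a hypothetical `unmitigated_risks` helper that inverts the map to show where each risk register entry is exposed.

```python
from collections import defaultdict
from dataclasses import dataclass, field

GAP_TYPES = {"deskilling", "volume", "escalation", "comprehension", "authority"}


@dataclass
class GapAnnotation:
    stage: str                      # location in the agent's decision pipeline
    gap_type: str                   # one of GAP_TYPES
    severity: int                   # assumed 1-5 scale
    risk_ids: list = field(default_factory=list)  # register entries left unmitigated

    def __post_init__(self):
        if self.gap_type not in GAP_TYPES:
            raise ValueError(f"unknown gap type: {self.gap_type}")


def unmitigated_risks(gap_map):
    """Map each risk register ID to the pipeline stages where a gap exposes it."""
    exposure = defaultdict(list)
    for gap in gap_map:
        for risk_id in gap.risk_ids:
            exposure[risk_id].append(gap.stage)
    return dict(exposure)


# Hypothetical annotations for a customer service agent.
gap_map = [
    GapAnnotation("triage", "volume", 4, ["R-12"]),
    GapAnnotation("supervisor review", "escalation", 5, ["R-12", "R-03"]),
]
exposure = unmitigated_risks(gap_map)
print(exposure)
```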
In this lab, you'll conduct a simulated oversight gap analysis for your agent. Work with the coach to identify which of the five gap types (deskilling, volume, escalation, comprehension, authority) are present in your agent's current oversight structure, and how severe each is.
At 9:30 AM on August 1, 2012, Knight Capital Group's automated trading system began executing a series of erroneous equity orders, buying high and selling low at enormous speed. Within 45 minutes, the system had lost $440 million. The firm's pre-market technical team had identified an anomaly in the deployment (a legacy code component called "Power Peg" had been reactivated by accident) but they had no authority to halt trading without executive approval. The approval chain required four escalation steps. They were never all reached in time. Knight Capital was insolvent by the end of the day.
The Knight Capital post-mortem, later reviewed by the SEC in its market structure analysis, identified a critical finding that is relevant to every agent audit: the organization had a risk management committee that had reviewed agent deployment procedures, but the review had not specified a halt authority chain with defined time limits. The committee's finding was "deployment procedures require review." The remediation plan was "revise procedures." No one was named. No deadline was set. No one verified implementation before the next deployment. The report had been written. Nothing had changed.
Knight Capital illustrates the most common failure mode in audit reporting: findings are documented at a level of generality that produces no specific action, with no named owner and no verification mechanism. An effective findings report has seven components that prevent this failure.
The difference between findings that are acted upon and findings that are filed comes down to specificity and falsifiability. Compare these two findings from a hypothetical audit of a customer service agent:
Finding 7: The human review process for agent escalations should be strengthened to ensure appropriate oversight of high-risk interactions. Recommended action: review and update the escalation policy.
Finding 7 [HIGH: Escalation Gap]: Of 4,200 customer interactions coded as "financial hardship" by the agent in Q3, zero were reviewed by a human supervisor before the agent's recommended action was executed. This leaves risk register item R-12 (agent recommending debt collection contact to customers in distress) unmitigated. Remediation: All interactions tagged financial-hardship must trigger mandatory human review before agent action. Owner: VP Customer Experience. Deadline: 30 days. Verification: Auditor review of review logs showing 100% human review rate for tagged interactions over a 30-day period following implementation.
The second finding is falsifiable: a follow-up auditor can verify whether it has been remediated by checking whether the review logs show 100% human review of tagged interactions. The first finding cannot be verified: "review and update the policy" has no defined success condition.
Not all findings can be remediated simultaneously. The remediation plan must prioritize, and that prioritization must be transparent and defensible, not based on organizational convenience or political dynamics. Use a two-dimension prioritization: finding severity (from the risk register composite score) and remediation tractability (how quickly and reliably the gap can be closed).
Critical findings with high tractability (a halt authority chain that doesn't exist, which can be documented and assigned in days) must be addressed first regardless of other priorities. High-severity findings with low tractability, such as deskilling gaps that require months of training program development, require an interim mitigation: a temporary increase in human review rates or a scope limitation on the agent while the long-term remediation is designed. The report must document both the interim and the long-term remediation for every high-severity finding.
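Falsifiability means the verification criterion can be written as a check. The sketch below assumes a hypothetical log schema (a list of dicts with `tags` and `human_reviewed_before_action` keys); real review logs will look different, but the shape of the test is the same.

```python
def human_review_rate(interactions, tag="financial-hardship"):
    """Share of tagged interactions reviewed by a human before agent action.

    Returns None when no interaction carries the tag, so that an empty
    sample is never mistaken for a 100% review rate.
    """
    tagged = [i for i in interactions if tag in i["tags"]]
    if not tagged:
        return None
    reviewed = sum(1 for i in tagged if i["human_reviewed_before_action"])
    return reviewed / len(tagged)


def finding_remediated(interactions, tag="financial-hardship", required_rate=1.0):
    """True only if the verification criterion from the finding is met."""
    rate = human_review_rate(interactions, tag)
    return rate is not None and rate >= required_rate


# Hypothetical 30-day log sample: one tagged interaction slipped through unreviewed.
log_sample = [
    {"id": 1, "tags": ["financial-hardship"], "human_reviewed_before_action": True},
    {"id": 2, "tags": ["financial-hardship"], "human_reviewed_before_action": False},
    {"id": 3, "tags": ["billing"], "human_reviewed_before_action": False},
]
print(human_review_rate(log_sample))   # 0.5
print(finding_remediated(log_sample))  # False
```

No comparable function can be written for "review and update the escalation policy", which is a quick practical test of whether a finding is specific enough.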
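The two-dimension rule can be sketched as a sort plus a completeness check. The severity labels, the tie-breaking order beyond "critical and highly tractable first", and the dict schema are all assumptions made for illustration.

```python
# Lower rank sorts earlier. Labels and ordering are illustrative assumptions.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}
TRACTABILITY_RANK = {"high": 0, "low": 1}


def remediation_order(findings):
    """Order findings by severity, then by how quickly the gap can be closed."""
    return sorted(
        findings,
        key=lambda f: (SEVERITY_RANK[f["severity"]], TRACTABILITY_RANK[f["tractability"]]),
    )


def incomplete_plans(findings):
    """IDs of high-severity findings missing an interim or long-term remediation."""
    return [
        f["id"]
        for f in findings
        if SEVERITY_RANK[f["severity"]] <= SEVERITY_RANK["high"]
        and not (f.get("interim") and f.get("long_term"))
    ]


# Hypothetical findings for illustration.
findings = [
    {"id": "F-3", "severity": "high", "tractability": "low",
     "interim": None, "long_term": "reviewer training program"},
    {"id": "F-1", "severity": "critical", "tractability": "high",
     "interim": "manual halt procedure", "long_term": "documented halt authority chain"},
    {"id": "F-9", "severity": "medium", "tractability": "high"},
]
order = [f["id"] for f in remediation_order(findings)]
missing = incomplete_plans(findings)
print(order)    # ['F-1', 'F-3', 'F-9']
print(missing)  # ['F-3']
```

Here F-3 is flagged because it has a long-term remediation but no interim mitigation, which is exactly the omission the report is required to rule out.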
The UK's Algorithmic Transparency Recording Standard, published in 2021 and updated in 2023, requires public sector bodies using algorithmic tools to document exactly this structure: the finding, the interim measure, the long-term remediation, the named owner, and the review date. It is a useful template even for private sector organizations not legally subject to it.
A complete agent risk audit produces four documents: (1) the scoped audit mandate with stakeholder list, (2) the risk register, (3) the oversight gap map with narrative, and (4) the findings report with remediation plan. Together, these constitute the full audit record. They should be version-controlled, stored in a location accessible to compliance and legal teams, and referenced at every subsequent review of the agent system.
Audit findings that threaten existing investments, challenge team performance records, or recommend halting high-profile systems encounter organizational resistance. This is predictable and must be planned for. Three principles from documented successful audit communications apply here.
Lead with risk, not failure. Frame findings as forward-looking risk management, not backward-looking blame assignment. "This gap means that if X occurs, we will not detect it in time to prevent harm" is more actionable than "this team failed to build adequate oversight." The goal is remediation, not accountability theater.
Quantify where possible. Knight Capital lost $440 million in 45 minutes. Zillow wrote down $569 million. The Dutch government paid approximately €40 million in compensation related to the SyRI system. Risk quantification, even rough estimates, makes organizational investment in remediation legible as prudent financial management rather than unnecessary caution.
Propose, don't just flag. A findings report that identifies fifteen gaps and recommends "further review" for each will be shelved. A report that identifies fifteen gaps, ranks the top three for immediate action, provides specific remediation plans with resource estimates, and offers a verification mechanism gives decision-makers something to say yes or no to.
Using the risk register and oversight gap map from Labs 2 and 3, draft a findings report entry for your most significant audit finding. The coach will help you make it specific, falsifiable, and actionable, with a named owner, deadline, and verification criterion. You'll also draft a remediation plan entry covering both interim and long-term mitigation.