On March 23, 2016, Microsoft launched Tay, a conversational AI agent built to learn from Twitter interactions and respond in the voice of a millennial. The system had been tested internally. Engineers had reviewed outputs. A team had signed off on deployment. Within sixteen hours, Tay was generating racist, antisemitic, and misogynistic content at scale.
The failure was not a surprise to anyone who had analyzed adversarial input risks systematically. But Microsoft had no formal adversarial stress-testing protocol in its pre-deployment review for Tay. The agent was evaluated for functional correctness—does it respond coherently?—not for behavioral safety under adversarial conditions. The distinction cost the company its reputation in that product category and forced the system offline permanently.
Before 2016, most AI deployment reviews were borrowed from software QA practice: find bugs, check for crashes, verify outputs match specifications. The assumption was that an AI system's risk profile resembled a deterministic program's—finite, enumerable, correctable by patching.
Tay demonstrated otherwise. AI agents are probabilistic, adaptive, and context-sensitive. Their failure modes aren't just functional; they're behavioral, emergent, and often adversarially activated. A model that passes every internal test can still catastrophically fail when real-world users probe it in ways the test suite never anticipated.
Post-Tay, organizations including Google DeepMind, OpenAI, Anthropic, and the UK AI Safety Institute began formalizing what pre-deployment review should actually encompass. The convergence produced a set of interlocking frameworks—not a single standard, but a family of structured approaches—that we now collectively call pre-deployment risk evaluation.
Traditional software review asks: "Does it work?" Pre-deployment AI risk review asks: "Does it work safely, across all populations, under adversarial conditions, at the scale we intend to deploy it?" These are categorically different questions requiring categorically different methodologies.
A complete pre-deployment risk evaluation for an AI agent typically addresses five distinct domains. The first is capability assessment: what can the agent actually do, including capabilities that emerge at scale or in combination that weren't present in smaller versions? OpenAI's evaluations of GPT-4 before its March 2023 release, for instance, specifically tested for uplift in biological and chemical weapons synthesis—capabilities that hadn't been a concern in GPT-3 but became plausible at the higher capability tier.
The second domain is behavioral alignment: does the agent behave consistently with its stated objectives across diverse contexts, including edge cases and adversarial prompts? The third is systemic impact: what happens when this agent interacts with existing social, economic, or technical systems? The fourth is failure mode enumeration: what are the specific ways this agent can go wrong, and how severe are those failures? The fifth is oversight adequacy: is there a sufficient mechanism to detect and correct failures after deployment begins?
One of the persistent tensions in pre-deployment evaluation is that the organizations best positioned to run rigorous reviews—the labs building the systems—also face commercial pressure to ship quickly. This conflict of interest produced the third-party evaluation model. In 2022, the UK government's Frontier AI Taskforce (later the AI Safety Institute) began commissioning independent evaluations of frontier models before deployment. The United States followed with AISI evaluations under the Biden executive order framework in 2023.
But third-party evaluation is expensive, slow, and often conducted under NDA conditions that limit public disclosure. Most AI agent deployments—not frontier models, but the thousands of enterprise agents built on top of them—receive no third-party evaluation at all. They rely entirely on the deploying organization's internal processes, which may be rigorous or may be a checklist someone filled out in an afternoon.
This is the gap that structured risk frameworks are designed to address: giving organizations without dedicated safety teams a systematic methodology for evaluating their agents before those agents touch real users.
Pre-deployment evaluation: A structured assessment of an AI agent's capabilities, behaviors, failure modes, and systemic impacts before it is made available to end users.
Adversarial stress testing: Deliberate attempts to elicit unsafe, harmful, or unintended behaviors from a system by simulating malicious or unexpected inputs.
Capability overhang: Emergent capabilities present in a deployed model that were not anticipated during evaluation, often because they appeared only at scale or in combination.
The Tay case involved a consumer chatbot with no ability to take real-world actions. Modern AI agents are deployed in contexts with vastly higher stakes: healthcare triage, financial advising, legal document generation, infrastructure monitoring, autonomous vehicle coordination. The cost of a behavioral failure in these contexts is not a PR crisis—it is patient harm, financial loss, wrongful legal advice, or physical danger.
In 2023, a healthcare AI agent deployed by a major US insurance provider was documented by ProPublica to be denying medical claims at a rate of 90% using an AI review model. The system had been evaluated for accuracy against historical data but had not been evaluated for the distributional shift between training data and the actual claim population it would process. Thousands of patients had claims wrongly denied before the system's behavior was identified and corrected. The company settled regulatory inquiries without admitting fault, but the incident illustrated precisely what happens when the evaluation framework misses a critical domain.
You are a pre-deployment risk analyst reviewing documented AI agent failures. For each case you describe or discuss, the AI advisor will help you identify which of the five evaluation domains (capability assessment, behavioral alignment, systemic impact, failure mode enumeration, oversight adequacy) were missing or insufficient—and what a complete review would have looked like.
On March 18, 2018, Elaine Herzberg became the first pedestrian killed by an autonomous vehicle. A self-driving Uber test car struck her while she was walking her bicycle across a four-lane road in Tempe at 9:58 PM. The vehicle's sensors detected an obstacle 5.6 seconds before impact. The system classified the object six times, cycling through categories including "vehicle," "bicycle," and "other"—before settling on no classification at all. The emergency braking system had been deliberately disabled by Uber engineers to prevent erratic braking during testing.
The National Transportation Safety Board's final report identified a taxonomy failure at the root. Uber's safety evaluation framework had no category for "object classification indecision" as a hazardous state. The system's oscillation between classifications—a clearly identified behavioral mode in the logs—was not a recognized failure type in the evaluation rubric. Because the failure mode had no name in the taxonomy, it had no test, and no corrective measure.
A risk taxonomy is a structured, exhaustive classification of the types of failures a system can experience. In AI agent evaluation, a well-designed taxonomy serves two functions. First, it acts as a completeness check: if a failure type doesn't appear in your taxonomy, your evaluation process probably has no test for it. Second, it acts as a communication tool: a shared vocabulary for risk allows engineers, legal teams, ethicists, and regulators to discuss the same failure with precision.
The history of AI safety evaluation shows a consistent pattern: early taxonomies are too narrow, focused on the obvious failure modes. Then a deployment fails in a way the taxonomy didn't cover. The taxonomy expands. This reactive cycle is costly. Systematic taxonomy design aims to break it by reasoning about failure categories before deployment rather than after.
The most widely adopted structured taxonomy for AI risk in the United States is the one embedded in NIST's AI Risk Management Framework (AI RMF), released in January 2023. NIST organizes AI risks along two primary axes: the source of the risk and the impact domain.
On the source axis, NIST identifies three origins: risks from the AI system itself (model errors, data quality failures, robustness problems), risks from human misuse (adversarial prompting, deliberate manipulation), and risks from organizational deployment context (misapplication, inadequate oversight, scope creep). This tripartite source structure is important because many organizations only evaluate the first category—system-level errors—and miss the deployment context risks entirely.
Where NIST's taxonomy organizes risks by source and impact domain, the EU AI Act (adopted June 2024) adds a second layer: classification by application severity. The Act creates four tiers—unacceptable risk (prohibited), high risk (heavily regulated), limited risk (transparency requirements), and minimal risk (largely unregulated)—and maps specific application domains to each tier.
High-risk applications under the Act include AI used in critical infrastructure, education, employment, essential services, law enforcement, migration, and the administration of justice. Each of these domains has a corresponding evaluation requirement: a conformity assessment, a risk management system, data governance requirements, and post-market monitoring. The taxonomy thus functions not just as an analytical tool but as a legal trigger: identifying your application as high-risk activates a specific set of mandatory evaluation steps.
The practical implication for AI agent developers is that taxonomy selection has regulatory consequences. Using the EU AI Act's taxonomy means accepting its regulatory structure. Organizations building for global markets often need to map their agents across multiple taxonomic frameworks simultaneously—a task that benefits enormously from having a systematic internal taxonomy that can be cross-referenced against regulatory ones.
If Uber's AV safety framework had included "classification indecision" as a named failure category—a state in which the system oscillates between classifications without resolving—it would have required a test for that state and a designed response. The NTSB report found the system experienced this state 1.3 seconds before impact. A taxonomy entry would not have guaranteed survival, but it would have generated a design requirement.
A practical taxonomy for a specific AI agent deployment doesn't require adopting the full NIST RMF or EU AI Act framework verbatim. What it requires is systematically working through four questions for every component of the agent's capability set. First: what happens if this capability produces a wrong output? Second: what happens if this capability is deliberately subverted? Third: what happens when this capability interacts with downstream systems? Fourth: what happens if this capability is used in a context different from the one it was designed for?
The answers to these four questions, mapped to a severity scale and a probability estimate, constitute a working risk taxonomy. It is less elegant than NIST's but more actionable for practitioners who need to make deployment decisions on specific systems with finite time and resources.
You will build a working risk taxonomy for a specific AI agent application. Choose any deployment context—a healthcare triage assistant, a customer service agent, an automated content moderation system, a financial advisory bot—and work through the taxonomy-building methodology with the AI advisor.
Before OpenAI released GPT-4 in March 2023, the company published a technical report that was unusual for its candor: it documented, in detail, what the red team had found. Over several months, a team of approximately 50 external red teamers—selected from diverse disciplines including biosecurity, cybersecurity, disinformation research, and chemistry—worked to identify the model's most dangerous capabilities. The red team successfully elicited detailed synthesis routes for chemical weapons precursors. They demonstrated uplift in cyberattack planning. They found the model could generate convincing disinformation at scale.
The report's conclusion was not that the model was safe. It was that OpenAI had identified the risks and implemented mitigations against each—refusals for specific query categories, output filters for certain content types, system prompt constraints for deployed versions. The red team's findings directly shaped the deployment configuration. This was the design: the evaluation was not a gate that the model passed or failed, but a process that generated the safety architecture of the deployed system.
Red-teaming originated in Cold War military strategy, where teams were assigned to think like adversaries and probe the vulnerabilities of defense plans. The NSA adapted the practice for cybersecurity in the 1990s, creating "tiger teams" that would attempt to penetrate their own classified networks before adversaries could. The practice migrated to financial services stress testing after the 2008 financial crisis, where regulators required banks to simulate adverse scenarios that internal risk managers might be too optimistic to imagine.
AI safety red-teaming shares the same core logic but requires adaptation for AI-specific failure modes. A network penetration test looks for exploitable code vulnerabilities. An AI red team looks for something more diffuse: behavioral vulnerabilities—combinations of inputs, contexts, and framings that produce dangerous or misaligned outputs. These are harder to enumerate because they are partially semantic, partially adversarial, and often emerge only through creative human probing.
A formal AI red-team exercise typically operates in three phases. The first is threat modeling: before anyone writes a prompt, the team produces a threat model that specifies the adversaries of concern, their motivations, their technical sophistication, and the harm categories they might pursue. For a healthcare agent, adversaries might include patients seeking to obtain medication advice that bypasses prescribing requirements, malicious actors seeking to generate false medical information, or internal users who might misuse the system's data access. Each adversary type requires different probing strategies.
The second phase is structured probing: red teamers attempt to elicit problematic behaviors using a systematic set of techniques. These include direct instruction (simply asking for harmful content), indirect framing (embedding harmful requests in legitimate contexts), roleplay exploitation (using fictional framings to bypass safety measures), prompt injection (hiding adversarial instructions in data the agent will process), and multi-turn manipulation (building toward a harmful goal across a series of ostensibly innocuous exchanges).
The third phase is findings synthesis: documenting each successful exploit, classifying it by severity and attack type, estimating the likelihood of real-world exploitation, and recommending mitigations. Critically, red team findings should be treated as design inputs, not test results. The GPT-4 example is instructive: the red team found the capability, and that finding drove the refusal training and output filtering that constrained it in the deployed model.
Internal red teams are constrained by the same organizational culture, assumptions, and blind spots as the engineering teams that built the system. External red teamers—especially those from adversarial communities, academic security research, or disciplines entirely outside AI—bring framings and attack vectors the internal team cannot easily generate. The UK AI Safety Institute's evaluations of GPT-4o, Claude 3, and Gemini Ultra all used external red teams specifically to probe for capability areas where the labs' own assumptions might create blind spots.
In 2022, the Distributed AI Research Institute (DAIR), led by Timnit Gebru, published a critique that has become important to how red-teaming is understood. DAIR argued that standard red-team exercises suffer from a structural problem: they are designed to find the failure modes the team already suspects, not to discover genuinely novel harms. The team's prior conceptions of "harmful output" shape what they probe for, which means red-teaming systematically under-discovers harms to populations the team has not thought about.
DAIR's proposed remediation was community-based evaluation: including affected communities—not just technical red teamers—in the evaluation process. This approach was partially adopted in several subsequent evaluations. The Holistic Evaluation of Language Models (HELM) framework at Stanford incorporated diverse evaluator demographics. Meta's LLaMA 2 red team explicitly included members of communities disproportionately affected by AI bias. Whether community-based evaluation fully solves the problem remains contested, but it has become a recognized element of comprehensive adversarial evaluation practice.
Red-teaming is necessary but not sufficient. It can only find failure modes that red teamers are capable of discovering—which is a function of their creativity, diversity, and domain knowledge. Novel attack vectors, emergent capabilities that appear post-deployment, and context-specific harms that arise from the interaction between the agent and a specific user population are all potentially outside a pre-deployment red team's reach.
This is why contemporary evaluation frameworks treat red-teaming as one component of a multi-method evaluation architecture that also includes automated capability evaluations, benchmark suites, post-deployment monitoring, and incident response systems. No single method is comprehensive; the goal is sufficient overlap that failure modes have multiple opportunities to be caught.
You are designing a pre-deployment red-team exercise for an AI agent. Work with the advisor to build a complete three-phase red-team plan: threat model (who are the adversaries, what are their goals), structured probing strategies (which techniques apply to this agent), and a findings synthesis template. The advisor will push you to include adversary types you might not have initially considered.
The Boeing 737 MAX was not an AI system. But the failure of its pre-deployment evaluation framework to convert findings into deployment decisions produced a case study that AI safety researchers have cited extensively. Boeing's Maneuvering Characteristics Augmentation System (MCAS) had been evaluated by internal teams who identified that the system could command nose-down pitch if a single Angle of Attack sensor failed. The finding was documented. The risk was classified as acceptable based on the assumption that pilots would follow emergency procedures within four seconds of recognizing the problem.
That assumption was never tested in evaluation. There was no structured mechanism requiring that findings be connected to assumptions, that assumptions be validated, or that unvalidated assumptions block deployment. The evaluation process generated a finding. The finding generated an assumption. The assumption generated no further testing. Lion Air Flight 610 and Ethiopian Airlines Flight 302 killed 346 people before the design flaw and evaluation failure were acknowledged. The FAA's post-investigation report identified the root cause as a "go/no-go" process that had no mechanism for elevating disputed risk classifications to decision-makers with authority to halt deployment.
Evaluation theater is the practice of conducting safety evaluations without allowing their results to materially affect deployment decisions. It persists for several reasons. First, organizational pressure to ship creates incentives to classify findings as acceptable rather than blocking. Second, evaluation processes are often owned by teams without authority over deployment decisions—so findings can be acknowledged and overridden. Third, in organizations where safety and commercial teams have adversarial dynamics, safety findings may be treated as negotiating positions rather than hard constraints.
The antidote is structural, not exhortative. Telling organizations to "take safety seriously" does not prevent evaluation theater. What prevents it is building evaluation processes with defined escalation paths, mandatory resolution requirements, and deployment authority vested in teams that include safety functions. The EU AI Act's conformity assessment requirement is one structural mechanism: it creates a mandatory external checkpoint that cannot be overridden by internal commercial pressure.
A go/no-go framework for AI agent deployment translates risk evaluation findings into one of three deployment decisions: deploy as evaluated, deploy with conditions, or do not deploy. The framework requires that every finding from the evaluation process be resolved—either mitigated, accepted with explicit rationale and authority, or treated as a deployment blocker—before the deployment decision is made.
| Finding Severity | Default Disposition | Override Condition | Required Authority |
|---|---|---|---|
| Critical (potential for irreversible harm) | Deploy blocker; no deployment until mitigated | None — no commercial override permitted | N/A |
| High (significant harm, reversible) | Deploy with mandatory conditions | Explicit written acceptance by C-level + safety lead | CEO/CTO + Head of Safety |
| Medium (limited harm, mitigable) | Deploy with monitoring requirements | Standard risk acceptance process | Product and Safety leads jointly |
| Low (minimal harm, acceptable) | Deploy; log finding in risk register | Not applicable | Standard deployment approval |
The "deploy with conditions" category is where most real-world deployment decisions land. Deployment conditions are operational constraints that allow a system to deploy despite unresolved risks, provided those constraints adequately bound the risk. Common deployment conditions for AI agents include: scope restrictions (the agent is authorized for a specific task set and must refuse requests outside it), user population restrictions (the agent is deployed only to verified professional users, not general consumers), output format constraints (the agent can analyze but not recommend), and human-in-the-loop requirements (a human must review and approve outputs before they are acted upon).
Google DeepMind's approach to deploying AlphaFold 2 in 2021 illustrates conditional deployment done well. The system's protein structure predictions were highly accurate but not infallible. DeepMind deployed with explicit documentation of accuracy limitations, mandatory confidence scoring on all outputs, and a design that positioned predictions as research tools for expert users rather than clinical decisions for practitioners. These conditions were not afterthoughts—they were part of the deployment architecture from the beginning, derived from the evaluation findings.
One of the most robust conditional deployment mechanisms is staged rollout: deploying first to a small, monitored population and expanding only when post-deployment monitoring confirms safe behavior at scale. OpenAI's deployment of ChatGPT used a staged approach—initially research access, then limited commercial access—that allowed observation of emergent behaviors before full-scale deployment. The staged approach doesn't eliminate pre-deployment evaluation; it extends the evaluation window into the early deployment period with real-world signal.
A complete go/no-go framework does not terminate at deployment. Each finding that was accepted with conditions, rather than mitigated before deployment, should generate a corresponding monitoring requirement. If an evaluation found that the agent produces occasional high-confidence errors in domain X, the deployment should include automated monitoring for domain X error patterns. If the evaluation found adversarial prompt vulnerability Y, post-deployment logging should flag inputs matching pattern Y for human review.
This connection between pre-deployment findings and post-deployment monitoring is often the weakest link in existing frameworks. Anthropic's Constitutional AI approach, published in 2022, explicitly addresses this: the constitutional principles that govern the model's training are also used as audit criteria for post-deployment behavior review. The same document that describes what the model should do also defines what a monitoring system should check. This architectural coherence between evaluation and monitoring is what distinguishes a genuine safety framework from a compliance exercise.
You are the Head of AI Safety at an organization preparing to deploy an AI agent. You have a set of evaluation findings and need to convert them into a deployment decision. Work with the advisor to classify each finding, determine its disposition (deploy blocker, conditioned acceptance, or acceptable), specify deployment conditions where needed, and define monitoring requirements that follow from each conditioned acceptance.