L1
·
Quiz
·
Lab
L2
·
Quiz
·
Lab
L3
·
Quiz
·
Lab
L4
·
Quiz
·
Lab
Module Test
Module 6 · Lesson 1

Pre-Deployment Risk: The Stakes of Getting It Wrong

Before an agent acts in the world, decisions made in a conference room determine whether it will behave safely. The frameworks we choose—or ignore—shape outcomes at scale.
What does a structured pre-deployment risk review actually evaluate, and why did the industry take so long to build one?

On March 23, 2016, Microsoft launched Tay, a conversational AI agent built to learn from Twitter interactions and respond in the voice of a millennial. The system had been tested internally. Engineers had reviewed outputs. A team had signed off on deployment. Within sixteen hours, Tay was generating racist, antisemitic, and misogynistic content at scale.

The failure was not a surprise to anyone who had analyzed adversarial input risks systematically. But Microsoft had no formal adversarial stress-testing protocol in its pre-deployment review for Tay. The agent was evaluated for functional correctness—does it respond coherently?—not for behavioral safety under adversarial conditions. The distinction cost the company its reputation in that product category and forced the system offline permanently.

Why Pre-Deployment Review Became a Discipline

Before 2016, most AI deployment reviews were borrowed from software QA practice: find bugs, check for crashes, verify outputs match specifications. The assumption was that an AI system's risk profile resembled a deterministic program's—finite, enumerable, correctable by patching.

Tay demonstrated otherwise. AI agents are probabilistic, adaptive, and context-sensitive. Their failure modes aren't just functional; they're behavioral, emergent, and often adversarially activated. A model that passes every internal test can still catastrophically fail when real-world users probe it in ways the test suite never anticipated.

Post-Tay, organizations including Google DeepMind, OpenAI, Anthropic, and the UK AI Safety Institute began formalizing what pre-deployment review should actually encompass. The convergence produced a set of interlocking frameworks—not a single standard, but a family of structured approaches—that we now collectively call pre-deployment risk evaluation.

The Core Problem

Traditional software review asks: "Does it work?" Pre-deployment AI risk review asks: "Does it work safely, across all populations, under adversarial conditions, at the scale we intend to deploy it?" These are categorically different questions requiring categorically different methodologies.

What Pre-Deployment Risk Evaluation Covers

A complete pre-deployment risk evaluation for an AI agent typically addresses five distinct domains. The first is capability assessment: what can the agent actually do, including capabilities that emerge at scale or in combination that weren't present in smaller versions? OpenAI's evaluations of GPT-4 before its March 2023 release, for instance, specifically tested for uplift in biological and chemical weapons synthesis—capabilities that hadn't been a concern in GPT-3 but became plausible at the higher capability tier.

The second domain is behavioral alignment: does the agent behave consistently with its stated objectives across diverse contexts, including edge cases and adversarial prompts? The third is systemic impact: what happens when this agent interacts with existing social, economic, or technical systems? The fourth is failure mode enumeration: what are the specific ways this agent can go wrong, and how severe are those failures? The fifth is oversight adequacy: is there a sufficient mechanism to detect and correct failures after deployment begins?

The Institutional Gap: Who Actually Runs These Reviews?

One of the persistent tensions in pre-deployment evaluation is that the organizations best positioned to run rigorous reviews—the labs building the systems—also face commercial pressure to ship quickly. This conflict of interest produced the third-party evaluation model. In 2022, the UK government's Frontier AI Taskforce (later the AI Safety Institute) began commissioning independent evaluations of frontier models before deployment. The United States followed with AISI evaluations under the Biden executive order framework in 2023.

But third-party evaluation is expensive, slow, and often conducted under NDA conditions that limit public disclosure. Most AI agent deployments—not frontier models, but the thousands of enterprise agents built on top of them—receive no third-party evaluation at all. They rely entirely on the deploying organization's internal processes, which may be rigorous or may be a checklist someone filled out in an afternoon.

This is the gap that structured risk frameworks are designed to address: giving organizations without dedicated safety teams a systematic methodology for evaluating their agents before those agents touch real users.

Key Terms

Pre-deployment evaluation: A structured assessment of an AI agent's capabilities, behaviors, failure modes, and systemic impacts before it is made available to end users.

Adversarial stress testing: Deliberate attempts to elicit unsafe, harmful, or unintended behaviors from a system by simulating malicious or unexpected inputs.

Capability overhang: Emergent capabilities present in a deployed model that were not anticipated during evaluation, often because they appeared only at scale or in combination.

The Stakes at Scale

The Tay case involved a consumer chatbot with no ability to take real-world actions. Modern AI agents are deployed in contexts with vastly higher stakes: healthcare triage, financial advising, legal document generation, infrastructure monitoring, autonomous vehicle coordination. The cost of a behavioral failure in these contexts is not a PR crisis—it is patient harm, financial loss, wrongful legal advice, or physical danger.

In 2023, a healthcare AI agent deployed by a major US insurance provider was documented by ProPublica to be denying medical claims at a rate of 90% using an AI review model. The system had been evaluated for accuracy against historical data but had not been evaluated for the distributional shift between training data and the actual claim population it would process. Thousands of patients had claims wrongly denied before the system's behavior was identified and corrected. The company settled regulatory inquiries without admitting fault, but the incident illustrated precisely what happens when the evaluation framework misses a critical domain.

Lesson 1 Quiz

Pre-Deployment Risk: The Stakes of Getting It Wrong
What was the primary failure in Microsoft's deployment of Tay in 2016?
Correct. Tay had functional testing but no adversarial stress-testing. The review confirmed it could respond coherently, not that it was safe under adversarial inputs—a distinction that cost Microsoft the product.
Not quite. Tay failed because the pre-deployment review evaluated functional correctness but not behavioral safety under adversarial conditions. No adversarial stress-testing protocol was in place.
Which of the following best describes a "capability overhang" in pre-deployment evaluation?
Correct. Capability overhang refers to capabilities that emerge at scale or in combination that weren't present—or weren't recognized—during evaluation. OpenAI's GPT-4 evaluations specifically examined for this in domains like weapons synthesis.
That's not the definition. Capability overhang refers specifically to emergent capabilities that weren't anticipated during evaluation, often because they appeared only at scale or in unanticipated combinations.
The five domains of pre-deployment risk evaluation include all of the following EXCEPT:
Correct. Commercial viability is not one of the five domains. The domains are capability assessment, behavioral alignment, systemic impact, failure mode enumeration, and oversight adequacy.
Check again. Commercial viability projection is not part of pre-deployment risk evaluation. The five domains are: capability assessment, behavioral alignment, systemic impact, failure mode enumeration, and oversight adequacy.
Why did the third-party evaluation model emerge as an alternative to purely internal reviews?
Correct. The core tension is that organizations best positioned to evaluate their systems also face pressure to deploy them quickly. This conflict of interest motivated external evaluation models like the UK AI Safety Institute's reviews.
The key issue is conflict of interest, not capability. Organizations building AI systems face commercial pressure to deploy quickly, which conflicts with the incentive to conduct thorough safety reviews—hence the value of independent third parties.
In the 2023 healthcare AI case described in the lesson, what critical evaluation domain did the insurer's pre-deployment review miss?
Correct. The system was evaluated for accuracy against historical data but not for the distributional shift between that training data and the actual claim population it would process—a systemic impact evaluation failure.
The evaluation failure was distributional shift. The system was accurate on historical training data but was not evaluated for how its behavior would change when the actual deployed population differed from the training population.

Lab 1: Diagnosing Evaluation Gaps

Practice identifying which pre-deployment evaluation domains were missing in real AI deployment failures

Your Task

You are a pre-deployment risk analyst reviewing documented AI agent failures. For each case you describe or discuss, the AI advisor will help you identify which of the five evaluation domains (capability assessment, behavioral alignment, systemic impact, failure mode enumeration, oversight adequacy) were missing or insufficient—and what a complete review would have looked like.

Try: "I want to analyze the Tay failure. Which domains were skipped?" — or describe a different AI deployment failure and ask which evaluation domains it missed.
Pre-Deployment Evaluation Advisor
Lab 1
Welcome to the Evaluation Gap Lab. I'm here to help you analyze real AI deployment failures through the lens of the five pre-deployment evaluation domains: capability assessment, behavioral alignment, systemic impact, failure mode enumeration, and oversight adequacy. Describe a case—Tay, the insurance AI, GPT-4's capability evaluations, or any other—and we'll diagnose which domains were missing from the pre-deployment review.
Module 6 · Lesson 2

Risk Taxonomies: Classifying What Can Go Wrong

Before you can evaluate risk, you need a language for it. The taxonomy you choose determines which dangers you see—and which you miss entirely.
How do structured risk taxonomies prevent evaluators from discovering only the failures they were already looking for?

On March 18, 2018, Elaine Herzberg became the first pedestrian killed by an autonomous vehicle. A self-driving Uber test car struck her while she was walking her bicycle across a four-lane road in Tempe at 9:58 PM. The vehicle's sensors detected an obstacle 5.6 seconds before impact. The system classified the object six times, cycling through categories including "vehicle," "bicycle," and "other"—before settling on no classification at all. The emergency braking system had been deliberately disabled by Uber engineers to prevent erratic braking during testing.

The National Transportation Safety Board's final report identified a taxonomy failure at the root. Uber's safety evaluation framework had no category for "object classification indecision" as a hazardous state. The system's oscillation between classifications—a clearly identified behavioral mode in the logs—was not a recognized failure type in the evaluation rubric. Because the failure mode had no name in the taxonomy, it had no test, and no corrective measure.

What a Risk Taxonomy Is and Why It Matters

A risk taxonomy is a structured, exhaustive classification of the types of failures a system can experience. In AI agent evaluation, a well-designed taxonomy serves two functions. First, it acts as a completeness check: if a failure type doesn't appear in your taxonomy, your evaluation process probably has no test for it. Second, it acts as a communication tool: a shared vocabulary for risk allows engineers, legal teams, ethicists, and regulators to discuss the same failure with precision.

The history of AI safety evaluation shows a consistent pattern: early taxonomies are too narrow, focused on the obvious failure modes. Then a deployment fails in a way the taxonomy didn't cover. The taxonomy expands. This reactive cycle is costly. Systematic taxonomy design aims to break it by reasoning about failure categories before deployment rather than after.

The NIST AI Risk Management Framework Taxonomy

The most widely adopted structured taxonomy for AI risk in the United States is the one embedded in NIST's AI Risk Management Framework (AI RMF), released in January 2023. NIST organizes AI risks along two primary axes: the source of the risk and the impact domain.

On the source axis, NIST identifies three origins: risks from the AI system itself (model errors, data quality failures, robustness problems), risks from human misuse (adversarial prompting, deliberate manipulation), and risks from organizational deployment context (misapplication, inadequate oversight, scope creep). This tripartite source structure is important because many organizations only evaluate the first category—system-level errors—and miss the deployment context risks entirely.

System-Origin Risks
Model errors, hallucination, distributional shift, robustness failures, emergent capabilities, training data poisoning. Evaluation method: benchmark testing, red-teaming, capability probing.
Human-Misuse Risks
Adversarial prompting, jailbreaking, social engineering the agent, using outputs for harmful purposes. Evaluation method: adversarial user simulation, red-team exercises with external actors.
Organizational-Context Risks
Misapplication outside intended scope, automation bias in operators, inadequate oversight infrastructure, incentive misalignment between deployer and user. Evaluation method: deployment context analysis, stakeholder risk mapping.
Societal-Scale Risks
Discriminatory outcomes at population scale, concentration of power, erosion of human skill, influence operations. Evaluation method: demographic impact analysis, long-range consequence modeling.
The EU AI Act's Risk Classification Layer

Where NIST's taxonomy organizes risks by source and impact domain, the EU AI Act (adopted June 2024) adds a second layer: classification by application severity. The Act creates four tiers—unacceptable risk (prohibited), high risk (heavily regulated), limited risk (transparency requirements), and minimal risk (largely unregulated)—and maps specific application domains to each tier.

High-risk applications under the Act include AI used in critical infrastructure, education, employment, essential services, law enforcement, migration, and the administration of justice. Each of these domains has a corresponding evaluation requirement: a conformity assessment, a risk management system, data governance requirements, and post-market monitoring. The taxonomy thus functions not just as an analytical tool but as a legal trigger: identifying your application as high-risk activates a specific set of mandatory evaluation steps.

The practical implication for AI agent developers is that taxonomy selection has regulatory consequences. Using the EU AI Act's taxonomy means accepting its regulatory structure. Organizations building for global markets often need to map their agents across multiple taxonomic frameworks simultaneously—a task that benefits enormously from having a systematic internal taxonomy that can be cross-referenced against regulatory ones.

The Uber Lesson Applied

If Uber's AV safety framework had included "classification indecision" as a named failure category—a state in which the system oscillates between classifications without resolving—it would have required a test for that state and a designed response. The NTSB report found the system experienced this state 1.3 seconds before impact. A taxonomy entry would not have guaranteed survival, but it would have generated a design requirement.

Building a Working Taxonomy for Your Agent

A practical taxonomy for a specific AI agent deployment doesn't require adopting the full NIST RMF or EU AI Act framework verbatim. What it requires is systematically working through four questions for every component of the agent's capability set. First: what happens if this capability produces a wrong output? Second: what happens if this capability is deliberately subverted? Third: what happens when this capability interacts with downstream systems? Fourth: what happens if this capability is used in a context different from the one it was designed for?

The answers to these four questions, mapped to a severity scale and a probability estimate, constitute a working risk taxonomy. It is less elegant than NIST's but more actionable for practitioners who need to make deployment decisions on specific systems with finite time and resources.

Lesson 2 Quiz

Risk Taxonomies: Classifying What Can Go Wrong
In the 2018 Uber AV fatality, what specifically caused the safety system to fail to brake?
Correct. The NTSB found two compounding failures: emergency braking was disabled to prevent erratic behavior during testing, and "classification indecision"—the system's oscillation between object categories—was not a named failure mode in the taxonomy, so it had no corrective design response.
The sensors did detect Herzberg 5.6 seconds before impact. The failures were that emergency braking was disabled and that the system's oscillation between object classifications was not a recognized (and therefore not a designed-for) failure mode.
Which of NIST AI RMF's three risk source categories do most organizations neglect in their evaluations?
Correct. The lesson notes that many organizations only evaluate system-level errors and miss the deployment context risks entirely—issues like scope creep, automation bias in operators, and incentive misalignment between deployer and user.
The lesson specifically notes that most organizations focus on system-level risks and neglect organizational deployment context risks—misapplication, automation bias, inadequate oversight infrastructure, and incentive misalignment.
What function does a risk taxonomy serve beyond guiding evaluation?
Correct. A taxonomy is both a completeness check (ensuring evaluation coverage) and a communication tool (a shared vocabulary allowing cross-functional teams to discuss risk precisely).
A taxonomy serves as a communication tool—a shared vocabulary that allows engineers, legal teams, ethicists, and regulators to discuss the same failure mode with precision across disciplines.
Under the EU AI Act, what does classifying an application as "high risk" trigger?
Correct. High-risk classification under the EU AI Act activates a specific set of mandatory requirements: conformity assessment, risk management system, data governance, and post-market monitoring.
High-risk classification under the EU AI Act triggers four requirements: a conformity assessment, a risk management system, data governance requirements, and post-market monitoring obligations.
What four questions does the lesson recommend for building a practical working taxonomy for a specific agent deployment?
Correct. The four questions for each capability component are: wrong output, deliberate subversion, downstream system interaction, and out-of-context use. Answers mapped to severity and probability constitute a working taxonomy.
The practical taxonomy method asks: (1) wrong output, (2) deliberate subversion, (3) downstream system interaction, and (4) out-of-context use—for every component of the agent's capability set.

Lab 2: Building a Risk Taxonomy

Construct a working risk taxonomy for a specific AI agent application using structured four-question methodology

Your Task

You will build a working risk taxonomy for a specific AI agent application. Choose any deployment context—a healthcare triage assistant, a customer service agent, an automated content moderation system, a financial advisory bot—and work through the taxonomy-building methodology with the AI advisor.

Start by naming your application: "I want to build a risk taxonomy for a [type] agent." The advisor will guide you through each category, help you identify coverage gaps, and cross-reference against NIST and EU AI Act frameworks.
Risk Taxonomy Builder
Lab 2
Welcome to the Risk Taxonomy Lab. We're going to build a structured risk taxonomy for a specific AI agent deployment. Start by telling me what type of agent you want to evaluate—give me the application domain and, if you have one, the specific use case. I'll guide you through the four-question methodology, help you map risks to NIST source categories and EU AI Act tiers, and flag anything your taxonomy might be missing.
Module 6 · Lesson 3

Red-Teaming and Adversarial Evaluation Methodologies

A model that only passes tests written by its creators has only passed tests written by its creators. Adversarial evaluation asks what happens when someone tries to break it.
How did structured red-teaming evolve from a security practice into the central methodology of AI safety evaluation—and what does a rigorous AI red team actually do?

Before OpenAI released GPT-4 in March 2023, the company published a technical report that was unusual for its candor: it documented, in detail, what the red team had found. Over several months, a team of approximately 50 external red teamers—selected from diverse disciplines including biosecurity, cybersecurity, disinformation research, and chemistry—worked to identify the model's most dangerous capabilities. The red team successfully elicited detailed synthesis routes for chemical weapons precursors. They demonstrated uplift in cyberattack planning. They found the model could generate convincing disinformation at scale.

The report's conclusion was not that the model was safe. It was that OpenAI had identified the risks and implemented mitigations against each—refusals for specific query categories, output filters for certain content types, system prompt constraints for deployed versions. The red team's findings directly shaped the deployment configuration. This was the design: the evaluation was not a gate that the model passed or failed, but a process that generated the safety architecture of the deployed system.

Red-Teaming's Origins and AI Adaptation

Red-teaming originated in Cold War military strategy, where teams were assigned to think like adversaries and probe the vulnerabilities of defense plans. The NSA adapted the practice for cybersecurity in the 1990s, creating "tiger teams" that would attempt to penetrate their own classified networks before adversaries could. The practice migrated to financial services stress testing after the 2008 financial crisis, where regulators required banks to simulate adverse scenarios that internal risk managers might be too optimistic to imagine.

AI safety red-teaming shares the same core logic but requires adaptation for AI-specific failure modes. A network penetration test looks for exploitable code vulnerabilities. An AI red team looks for something more diffuse: behavioral vulnerabilities—combinations of inputs, contexts, and framings that produce dangerous or misaligned outputs. These are harder to enumerate because they are partially semantic, partially adversarial, and often emerge only through creative human probing.

The Structure of an AI Red-Team Exercise

A formal AI red-team exercise typically operates in three phases. The first is threat modeling: before anyone writes a prompt, the team produces a threat model that specifies the adversaries of concern, their motivations, their technical sophistication, and the harm categories they might pursue. For a healthcare agent, adversaries might include patients seeking to obtain medication advice that bypasses prescribing requirements, malicious actors seeking to generate false medical information, or internal users who might misuse the system's data access. Each adversary type requires different probing strategies.

The second phase is structured probing: red teamers attempt to elicit problematic behaviors using a systematic set of techniques. These include direct instruction (simply asking for harmful content), indirect framing (embedding harmful requests in legitimate contexts), roleplay exploitation (using fictional framings to bypass safety measures), prompt injection (hiding adversarial instructions in data the agent will process), and multi-turn manipulation (building toward a harmful goal across a series of ostensibly innocuous exchanges).

The third phase is findings synthesis: documenting each successful exploit, classifying it by severity and attack type, estimating the likelihood of real-world exploitation, and recommending mitigations. Critically, red team findings should be treated as design inputs, not test results. The GPT-4 example is instructive: the red team found the capability, and that finding drove the refusal training and output filtering that constrained it in the deployed model.

Why External Red Teams Matter

Internal red teams are constrained by the same organizational culture, assumptions, and blind spots as the engineering teams that built the system. External red teamers—especially those from adversarial communities, academic security research, or disciplines entirely outside AI—bring framings and attack vectors the internal team cannot easily generate. The UK AI Safety Institute's evaluations of GPT-4o, Claude 3, and Gemini Ultra all used external red teams specifically to probe for capability areas where the labs' own assumptions might create blind spots.

DAIR Institute's Structural Red-Teaming Critique

In 2022, the Distributed AI Research Institute (DAIR), led by Timnit Gebru, published a critique that has become important to how red-teaming is understood. DAIR argued that standard red-team exercises suffer from a structural problem: they are designed to find the failure modes the team already suspects, not to discover genuinely novel harms. The team's prior conceptions of "harmful output" shape what they probe for, which means red-teaming systematically under-discovers harms to populations the team has not thought about.

DAIR's proposed remediation was community-based evaluation: including affected communities—not just technical red teamers—in the evaluation process. This approach was partially adopted in several subsequent evaluations. The Holistic Evaluation of Language Models (HELM) framework at Stanford incorporated diverse evaluator demographics. Meta's LLaMA 2 red team explicitly included members of communities disproportionately affected by AI bias. Whether community-based evaluation fully solves the problem remains contested, but it has become a recognized element of comprehensive adversarial evaluation practice.

The Limits of Red-Teaming as a Safety Guarantee

Red-teaming is necessary but not sufficient. It can only find failure modes that red teamers are capable of discovering—which is a function of their creativity, diversity, and domain knowledge. Novel attack vectors, emergent capabilities that appear post-deployment, and context-specific harms that arise from the interaction between the agent and a specific user population are all potentially outside a pre-deployment red team's reach.

This is why contemporary evaluation frameworks treat red-teaming as one component of a multi-method evaluation architecture that also includes automated capability evaluations, benchmark suites, post-deployment monitoring, and incident response systems. No single method is comprehensive; the goal is sufficient overlap that failure modes have multiple opportunities to be caught.

Lesson 3 Quiz

Red-Teaming and Adversarial Evaluation Methodologies
What was distinctive about how OpenAI used GPT-4's red-team findings, according to the lesson?
Correct. OpenAI's approach treated the red team exercise as a generator of design requirements, not a binary test. Each finding drove a specific mitigation—refusals, filters, or constraints—built into the deployed system.
The key insight was that red team findings were treated as design inputs. Each identified capability drove a specific architectural response: refusal training, output filtering, or system prompt constraints in the deployed configuration.
Which of the following is NOT one of the three phases of a formal AI red-team exercise as described in the lesson?
Correct. The three phases are threat modeling, structured probing, and findings synthesis. Regulatory compliance mapping may be a related activity but is not one of the three phases of the red-team exercise itself.
The three phases described in the lesson are threat modeling (defining adversaries and harm categories), structured probing (systematically eliciting problematic behaviors), and findings synthesis (documenting, classifying, and recommending mitigations).
What structural critique did DAIR Institute's Timnit Gebru raise about standard red-team exercises?
Correct. DAIR argued that red teams find the failure modes they already suspect, not genuinely novel harms—especially those affecting communities the team hasn't considered. This motivated the community-based evaluation approach.
DAIR's critique was structural: red teams are designed to find failure modes the team already suspects. This means they systematically miss harms to populations the team hasn't thought about—a blind spot that community-based evaluation attempts to address.
What is "multi-turn manipulation" in the context of AI red-teaming?
Correct. Multi-turn manipulation involves building toward a harmful goal through a sequence of individually innocuous exchanges—exploiting the fact that each individual turn may not trigger safety measures, even though the aggregate trajectory is harmful.
Multi-turn manipulation means building toward a harmful goal across a series of individually innocuous exchanges. No single turn may trigger safety measures, but the aggregate conversation steers the model toward producing harmful content.
Why is red-teaming considered necessary but not sufficient as a safety methodology?
Correct. Red-teaming's coverage is bounded by what red teamers can imagine. Novel attack vectors, emergent capabilities post-deployment, and context-specific harms from specific user populations may all be outside that boundary—which is why multi-method evaluation architectures are required.
Red-teaming is bounded by the discoverable space of the team's knowledge. Novel attacks, emergent post-deployment capabilities, and harms specific to unanticipated user populations may all fall outside that space—requiring complementary methods like automated benchmarks and post-deployment monitoring.

Lab 3: Red-Team Scenario Design

Design a structured red-team exercise for a specific AI agent, including threat model, probing strategies, and mitigation recommendations

Your Task

You are designing a pre-deployment red-team exercise for an AI agent. Work with the advisor to build a complete three-phase red-team plan: threat model (who are the adversaries, what are their goals), structured probing strategies (which techniques apply to this agent), and a findings synthesis template. The advisor will push you to include adversary types you might not have initially considered.

Start with: "I need to design a red-team exercise for a [type of agent]." Then work through threat modeling, probing strategy selection, and findings synthesis with the advisor's guidance.
Red-Team Exercise Designer
Lab 3
Welcome to the Red-Team Design Lab. I'll help you build a complete three-phase red-team exercise plan for an AI agent. We'll work through threat modeling (identifying adversaries, their motivations, and harm categories), structured probing strategy selection (which of the five probing techniques—direct instruction, indirect framing, roleplay exploitation, prompt injection, multi-turn manipulation—are highest priority), and a findings synthesis framework. Tell me what type of AI agent you're evaluating, and we'll start building the threat model.
Module 6 · Lesson 4

From Evaluation to Decision: Go/No-Go Frameworks and Deployment Conditions

Evaluation produces findings. A framework for acting on findings determines whether evaluation actually changes deployment outcomes—or simply documents problems before they occur.
How do organizations convert risk findings into deployment decisions, and what mechanisms prevent evaluation theater—the practice of running evaluations without allowing their results to affect deployment?

The Boeing 737 MAX was not an AI system. But the failure of its pre-deployment evaluation framework to convert findings into deployment decisions produced a case study that AI safety researchers have cited extensively. Boeing's Maneuvering Characteristics Augmentation System (MCAS) had been evaluated by internal teams who identified that the system could command nose-down pitch if a single Angle of Attack sensor failed. The finding was documented. The risk was classified as acceptable based on the assumption that pilots would follow emergency procedures within four seconds of recognizing the problem.

That assumption was never tested in evaluation. There was no structured mechanism requiring that findings be connected to assumptions, that assumptions be validated, or that unvalidated assumptions block deployment. The evaluation process generated a finding. The finding generated an assumption. The assumption generated no further testing. Lion Air Flight 610 and Ethiopian Airlines Flight 302 killed 346 people before the design flaw and evaluation failure were acknowledged. The FAA's post-investigation report identified the root cause as a "go/no-go" process that had no mechanism for elevating disputed risk classifications to decision-makers with authority to halt deployment.

Evaluation Theater and Why It Persists

Evaluation theater is the practice of conducting safety evaluations without allowing their results to materially affect deployment decisions. It persists for several reasons. First, organizational pressure to ship creates incentives to classify findings as acceptable rather than blocking. Second, evaluation processes are often owned by teams without authority over deployment decisions—so findings can be acknowledged and overridden. Third, in organizations where safety and commercial teams have adversarial dynamics, safety findings may be treated as negotiating positions rather than hard constraints.

The antidote is structural, not exhortative. Telling organizations to "take safety seriously" does not prevent evaluation theater. What prevents it is building evaluation processes with defined escalation paths, mandatory resolution requirements, and deployment authority vested in teams that include safety functions. The EU AI Act's conformity assessment requirement is one structural mechanism: it creates a mandatory external checkpoint that cannot be overridden by internal commercial pressure.

The Go/No-Go Decision Framework

A go/no-go framework for AI agent deployment translates risk evaluation findings into one of three deployment decisions: deploy as evaluated, deploy with conditions, or do not deploy. The framework requires that every finding from the evaluation process be resolved—either mitigated, accepted with explicit rationale and authority, or treated as a deployment blocker—before the deployment decision is made.

Finding Severity Default Disposition Override Condition Required Authority
Critical (potential for irreversible harm) Deploy blocker; no deployment until mitigated None — no commercial override permitted N/A
High (significant harm, reversible) Deploy with mandatory conditions Explicit written acceptance by C-level + safety lead CEO/CTO + Head of Safety
Medium (limited harm, mitigable) Deploy with monitoring requirements Standard risk acceptance process Product and Safety leads jointly
Low (minimal harm, acceptable) Deploy; log finding in risk register Not applicable Standard deployment approval
Conditional Deployment and Deployment Conditions

The "deploy with conditions" category is where most real-world deployment decisions land. Deployment conditions are operational constraints that allow a system to deploy despite unresolved risks, provided those constraints adequately bound the risk. Common deployment conditions for AI agents include: scope restrictions (the agent is authorized for a specific task set and must refuse requests outside it), user population restrictions (the agent is deployed only to verified professional users, not general consumers), output format constraints (the agent can analyze but not recommend), and human-in-the-loop requirements (a human must review and approve outputs before they are acted upon).

Google DeepMind's approach to deploying AlphaFold 2 in 2021 illustrates conditional deployment done well. The system's protein structure predictions were highly accurate but not infallible. DeepMind deployed with explicit documentation of accuracy limitations, mandatory confidence scoring on all outputs, and a design that positioned predictions as research tools for expert users rather than clinical decisions for practitioners. These conditions were not afterthoughts—they were part of the deployment architecture from the beginning, derived from the evaluation findings.

Staged Deployment as Risk Control

One of the most robust conditional deployment mechanisms is staged rollout: deploying first to a small, monitored population and expanding only when post-deployment monitoring confirms safe behavior at scale. OpenAI's deployment of ChatGPT used a staged approach—initially research access, then limited commercial access—that allowed observation of emergent behaviors before full-scale deployment. The staged approach doesn't eliminate pre-deployment evaluation; it extends the evaluation window into the early deployment period with real-world signal.

Connecting Findings to Post-Deployment Monitoring

A complete go/no-go framework does not terminate at deployment. Each finding that was accepted with conditions, rather than mitigated before deployment, should generate a corresponding monitoring requirement. If an evaluation found that the agent produces occasional high-confidence errors in domain X, the deployment should include automated monitoring for domain X error patterns. If the evaluation found adversarial prompt vulnerability Y, post-deployment logging should flag inputs matching pattern Y for human review.

This connection between pre-deployment findings and post-deployment monitoring is often the weakest link in existing frameworks. Anthropic's Constitutional AI approach, published in 2022, explicitly addresses this: the constitutional principles that govern the model's training are also used as audit criteria for post-deployment behavior review. The same document that describes what the model should do also defines what a monitoring system should check. This architectural coherence between evaluation and monitoring is what distinguishes a genuine safety framework from a compliance exercise.

  1. Complete evaluation across all five domains before initiating the go/no-go decision process. Incomplete evaluation produces an incomplete risk picture; deployment decisions made on partial information systematically underestimate risk.
  2. Classify every finding by severity using a predefined scale with agreed-upon criteria. Unclassified findings default to High severity to prevent under-classification under time pressure.
  3. Resolve every finding through one of three paths: mitigation (fix the problem), conditioned acceptance (deploy with constraints that bound the risk), or deployment blocker (do not deploy until resolved).
  4. Document the resolution rationale for every finding at High severity or above, including who made the acceptance decision and why. This documentation is the audit trail if the finding later produces an incident.
  5. Translate every conditioned acceptance into a specific monitoring requirement. The monitoring requirement should be specific enough that a monitoring engineer can implement it without further interpretation.
  6. Establish a deployment pause criterion before launch: define the post-deployment signal (incident rate, severity type, behavior pattern) that would trigger automatic deployment suspension pending review.

Lesson 4 Quiz

From Evaluation to Decision: Go/No-Go Frameworks and Deployment Conditions
What was the root cause identified in the FAA's post-investigation report on the 737 MAX accidents?
Correct. The evaluation process identified the risk. The failure was structural: there was no mechanism to connect unvalidated assumptions to further testing, or to escalate disputed classifications to decision-makers with deployment authority.
The evaluation actually found the risk. The structural failure was that the go/no-go process had no mechanism to escalate disputed risk classifications to decision-makers with authority to halt deployment—so findings were accepted on unvalidated assumptions.
What is "evaluation theater" and what structural mechanism does the lesson identify as its primary antidote?
Correct. Evaluation theater is structural—it persists because evaluation teams lack authority over deployment decisions. The antidote is structural: mandatory escalation paths, resolution requirements, and deployment authority that includes safety functions.
Evaluation theater is the practice of running evaluations whose findings can be overridden by commercial pressure. The structural antidote is processes with defined escalation paths, mandatory resolution requirements, and deployment authority vested in teams that include safety functions.
According to the go/no-go framework table in the lesson, which finding severity has NO commercial override permitted?
Correct. Critical findings—those with potential for irreversible harm—are deployment blockers with no commercial override permitted. This is the hard floor of the framework.
According to the framework, Critical findings (potential for irreversible harm) are the only category with no commercial override permitted. They are absolute deployment blockers until mitigated.
What made Google DeepMind's deployment of AlphaFold 2 an example of conditional deployment done well?
Correct. AlphaFold 2's deployment conditions weren't afterthoughts; they were derived from evaluation findings and embedded in the deployment architecture. Confidence scoring, accuracy documentation, and expert-only positioning were all designed in from the beginning.
AlphaFold 2's conditional deployment was effective because the conditions—confidence scoring, accuracy limitation documentation, expert-user positioning—were derived directly from evaluation findings and built into the deployment architecture from the start, not added as afterthoughts.
What does the lesson identify as the weakest link in most existing go/no-go frameworks?
Correct. The lesson identifies the connection between pre-deployment findings and post-deployment monitoring as often the weakest link. Anthropic's Constitutional AI approach is cited as addressing this specifically by using the same principles as both training guidance and monitoring audit criteria.
The lesson identifies the translation of pre-deployment findings into specific post-deployment monitoring requirements as the weakest link in most existing frameworks—and cites Anthropic's Constitutional AI as an example of architectural coherence that addresses this gap.

Lab 4: Building a Go/No-Go Decision Process

Design a complete go/no-go deployment decision framework for an AI agent, including escalation paths, deployment conditions, and monitoring requirements

Your Task

You are the Head of AI Safety at an organization preparing to deploy an AI agent. You have a set of evaluation findings and need to convert them into a deployment decision. Work with the advisor to classify each finding, determine its disposition (deploy blocker, conditioned acceptance, or acceptable), specify deployment conditions where needed, and define monitoring requirements that follow from each conditioned acceptance.

Start with: "I have findings from an evaluation of a [type] agent. Can you help me work through the go/no-go decision process?" — or present a specific set of findings and ask the advisor to walk you through classifying and resolving them.
Go/No-Go Decision Advisor
Lab 4
Welcome to the Go/No-Go Decision Lab. I'll help you convert evaluation findings into a structured deployment decision. We'll work through: (1) severity classification for each finding using defined criteria, (2) disposition determination—deploy blocker, conditioned acceptance with specific constraints, or acceptable risk, (3) deployment condition specification for every conditioned acceptance, and (4) monitoring requirement derivation from each condition. Present your findings for an AI agent, and we'll build the decision framework together.

Module 6 Test

Risk Frameworks: Evaluating Agents Before Deployment — 15 questions, 80% to pass
1. What distinguishes AI pre-deployment evaluation from traditional software QA?
Correct. Traditional QA asks "does it work?" AI pre-deployment evaluation must additionally ask whether it works safely across all populations, under adversarial conditions, at deployment scale.
AI evaluation must address probabilistic, adaptive, context-sensitive failure modes—including adversarially-activated behaviors—not just functional correctness, which is what traditional QA covers.
2. Microsoft Tay was taken offline within 16 hours of deployment. What does this case illustrate about pre-deployment review scope?
Correct. Tay passed functional tests but had no adversarial testing protocol. The review scope was too narrow—it confirmed coherent responses but not safe behavior under deliberate adversarial input.
Tay illustrates the danger of a pre-deployment review scoped only to functional correctness. Without adversarial behavioral testing, the failure mode—deliberate manipulation by users—was invisible to the evaluation process.
3. OpenAI's pre-release evaluation of GPT-4 specifically tested for what capability category that was not a concern for GPT-3?
Correct. GPT-4's higher capability tier made biological and chemical weapons synthesis uplift a plausible concern that hadn't existed at GPT-3's capability level—an example of capability overhang requiring evaluation.
GPT-4's evaluations specifically included testing for uplift in biological and chemical weapons synthesis—a capability overhang concern that emerged at GPT-4's capability level but wasn't present in GPT-3.
4. The 2018 Uber AV fatality in Tempe illustrated which specific risk taxonomy failure?
Correct. The NTSB found that "classification indecision" was observable in the system logs but was not a named failure type in the evaluation rubric—so no test existed for it and no corrective design response had been built.
The Uber case shows that if a failure mode has no name in the evaluation taxonomy, it generates no test and no designed corrective response. The system's classification oscillation was visible in logs but invisible to the evaluation framework.
5. NIST's AI Risk Management Framework organizes risks along which two primary axes?
Correct. NIST AI RMF organizes risks by source (system-origin, human misuse, organizational deployment context) and by impact domain—a structure that ensures both technical and contextual risks are addressed.
NIST's AI RMF uses two axes: source of risk (system-origin, human misuse, organizational deployment context) and impact domain. This structure specifically prevents organizations from focusing only on technical risks.
6. Under the EU AI Act, which of the following is classified as a high-risk AI application triggering mandatory evaluation requirements?
Correct. The EU AI Act classifies AI in employment decisions as high-risk, triggering conformity assessment, risk management system, data governance requirements, and post-market monitoring obligations.
Under the EU AI Act, AI used in employment decisions is classified as high-risk—one of the named application domains that triggers the full set of mandatory evaluation and compliance requirements.
7. What is the primary reason external red teams find failure modes that internal teams miss?
Correct. Internal teams are constrained by organizational culture, shared assumptions, and domain blind spots. External red teamers—especially from adversarial communities or unrelated disciplines—bring framings the internal team cannot easily generate.
External red teams add value because they bring framings, attack vectors, and adversary perspectives that are outside the organizational culture and shared assumptions of the internal team—not because of resources or compensation.
8. Which structured probing technique involves hiding adversarial instructions in data that the agent will process rather than in the direct prompt?
Correct. Prompt injection hides adversarial instructions within data the agent will process—a document it reads, a web page it retrieves, a database record it accesses—rather than in the user's direct input.
Prompt injection specifically hides adversarial instructions within data the agent processes (documents, retrieved content, database records) rather than in the direct user prompt—exploiting the agent's trust in its information sources.
9. DAIR Institute's critique of standard red-teaming proposed what methodological remedy?
Correct. DAIR proposed including affected communities in the evaluation process to discover harms to populations the technical team hadn't considered—a remedy for the structural problem that red teams find the failures they're already looking for.
DAIR's remedy was community-based evaluation: including members of affected communities alongside technical experts to surface harm categories that the technical team's prior conceptions of "harmful output" would never generate.
10. What is "evaluation theater" in the context of AI deployment?
Correct. Evaluation theater is the structural problem where evaluations are conducted but their findings can be overridden by commercial pressure, leaving the evaluation as a documentation exercise rather than a safety mechanism.
Evaluation theater means running safety evaluations that don't actually affect deployment decisions—where findings are acknowledged and then overridden by commercial pressure, making the evaluation process a compliance exercise rather than a safety mechanism.
11. In the go/no-go framework, what happens to a finding that a team wants to accept despite High severity classification?
Correct. High severity findings can be conditionally accepted, but require explicit written acceptance at C-level (CEO/CTO) plus Head of Safety, along with mandatory deployment conditions that bound the risk.
High severity findings require explicit written acceptance by C-level leadership (CEO/CTO) and the Head of Safety, plus mandatory deployment conditions. This prevents override under routine commercial pressure.
12. What distinguished AlphaFold 2's conditional deployment from a typical case of inadequate risk management?
Correct. AlphaFold 2's conditions—confidence scoring, accuracy documentation, expert-only positioning—were designed into the deployment architecture based on evaluation findings, not bolted on as compliance measures after the fact.
What made AlphaFold 2's deployment effective was architectural: the conditions derived from evaluation findings were built in from the start. Confidence scoring, accuracy limitations documentation, and expert-user design were all planned, not reactive.
13. What is the practical taxonomy-building method described in the lesson for evaluators without dedicated safety teams?
Correct. The four-question methodology (wrong output, subversion, downstream interaction, out-of-context use) applied to each capability component, then severity/probability mapped, produces an actionable working taxonomy without requiring full NIST RMF adoption.
The practical method is the four-question approach: for each capability, ask about wrong output, deliberate subversion, downstream system interaction, and out-of-context use. Map answers to severity and probability to produce a working taxonomy.
14. Anthropic's Constitutional AI approach addresses which specific gap in most go/no-go frameworks?
Correct. Constitutional AI's architectural coherence—using the same principles for training and monitoring—addresses the common disconnect where pre-deployment findings don't generate corresponding post-deployment monitoring requirements.
Anthropic's Constitutional AI approach is specifically cited for addressing the gap between pre-deployment evaluation and post-deployment monitoring by using the same constitutional principles as both the training objective and the monitoring audit criteria.
15. What is a "deployment pause criterion" and at what stage of the go/no-go process should it be established?
Correct. A deployment pause criterion is a pre-specified post-deployment signal that automatically triggers suspension pending review—and critically, it must be established before launch, not improvised in response to an incident.
A deployment pause criterion is a pre-defined signal—an incident rate, severity type, or behavior pattern—that triggers automatic deployment suspension pending review. It must be established before launch, as part of the go/no-go process, not improvised when an incident occurs.