Module 7 · Lesson 1

Scoping and Planning a Red-Team Engagement

Before you attack a single prompt, you need a map — and permission.

What separates structured red-teaming from random adversarial poking?

In 2022, Anthropic published its Constitutional AI paper partly in response to internal red-team findings that earlier RLHF-trained models would produce harmful content when queried through layered hypothetical framings. That internal process — scoped around specific harm categories, run by a dedicated team, with documented rules of engagement — is the direct ancestor of today's formal AI red-teaming discipline. Without scoping, testers found themselves overwhelmed by the search space; with it, they could map failure modes systematically.

Why Scoping Is the Foundation

Red-teaming without a defined scope produces noise, not signal. When Microsoft's AI Red Team ran engagements against Bing Chat (Sydney) in early 2023 — documented in their subsequent public retrospective — the team divided the problem space before any testing began: which modalities were in scope (text only), which harm categories were prioritized (influence operations, privacy leakage, CSAM elicitation), and what success criteria would trigger escalation to engineering.

Scope definition answers four questions: What system is being tested? What threat actors are modeled? What harm categories are prioritized? And what evidence constitutes a finding?

Real Precedent

The joint U.S. and UK AI Safety Institute pre-deployment evaluations published in late 2024, covering Anthropic's upgraded Claude 3.5 Sonnet and OpenAI's o1, were scoped to a short list of domains: biological capabilities, cyber capabilities, software and AI development, and safeguard efficacy. Everything else was out of scope for those engagements, not because it was unimportant, but because unbounded scope produces unbounded timelines.

The Five Scoping Dimensions

Dimension 01

System Boundary

Define exactly which components are in scope: base model only, fine-tuned variant, RAG pipeline, tool-use integrations, multi-agent chain? Each boundary change radically alters attack surface.

Dimension 02

Threat Actor Personas

Who are you modeling? Curious teenagers, nation-state actors, disgruntled insiders, organized fraud rings? Persona selection drives both the sophistication of attacks attempted and the harm categories prioritized.

Dimension 03

Harm Taxonomy

Use an established taxonomy — MITRE ATLAS, NIST AI RMF harm categories, or the UK AI Safety Institute's harm framework — as a checklist. Document which categories are Tier-1 vs. Tier-2 priority.

Dimension 04

Access Level

White-box (full model weights, system prompt access), grey-box (API access plus documentation), or black-box (public endpoint only)? Each level enables different attack classes and demands different tooling.

Dimension 05

Success Criteria

Define a finding before you start. Is a single successful harmful output a finding? Does it need to be reproducible? At what reproduction rate? Vague criteria produce unfalsifiable claims.
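The five dimensions can be captured in a small structured record so that gaps are caught before testing starts. The sketch below is one possible shape, not a standard schema: the `ScopeDocument` class, its field names, and the `missing_dimensions` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class AccessLevel(Enum):
    WHITE_BOX = "white-box"  # full model weights and system prompt access
    GREY_BOX = "grey-box"    # API access plus documentation
    BLACK_BOX = "black-box"  # public endpoint only


@dataclass
class ScopeDocument:
    """Hypothetical record holding the five scoping dimensions."""
    system_boundary: List[str] = field(default_factory=list)
    threat_actors: List[str] = field(default_factory=list)
    tier1_harms: List[str] = field(default_factory=list)
    tier2_harms: List[str] = field(default_factory=list)
    access_level: Optional[AccessLevel] = None
    # Success criteria: minimum reproduction rate that counts as a finding.
    min_reproduction_rate: Optional[float] = None

    def missing_dimensions(self) -> List[str]:
        """Return the dimensions that have not been defined yet."""
        checks = {
            "system_boundary": bool(self.system_boundary),
            "threat_actors": bool(self.threat_actors),
            "harm_taxonomy": bool(self.tier1_harms),
            "access_level": self.access_level is not None,
            "success_criteria": self.min_reproduction_rate is not None,
        }
        return [name for name, defined in checks.items() if not defined]


scope = ScopeDocument(
    system_boundary=["fine-tuned model", "public chat widget"],
    threat_actors=["organized fraud ring"],
    tier1_harms=["financial fraud enablement"],
    access_level=AccessLevel.BLACK_BOX,
)
print(scope.missing_dimensions())  # success criteria are still undefined here
```

A check like this could run as a gate before the engagement kicks off, so that testing cannot start while any dimension is blank.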

Rules of Engagement

Every professional red-team engagement requires a Rules of Engagement (RoE) document signed before any testing begins. For AI systems, the RoE must address several AI-specific complications that don't exist in traditional network penetration testing.

Data handling: Successful jailbreaks may produce genuinely harmful content — synthesis instructions, CSAM, working exploit code. The RoE must specify who can view raw outputs, how they are stored, and when they must be destroyed.

Automated vs. manual: Automated fuzzing can generate thousands of outputs per hour. The RoE should specify whether automated attacks require pre-approval and what logging is mandatory.

Stop conditions: Define when testers must halt and escalate — for example, upon discovering a vulnerability that could enable mass real-world harm before the vendor has any patch in place.
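Some teams may find it useful to mirror the RoE provisions above in a machine-readable form that tooling can check. The following is a minimal sketch under that assumption; `RulesOfEngagement`, its fields, and the helper functions are hypothetical names, and the values in the example are placeholders rather than recommended settings.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class RulesOfEngagement:
    """Hypothetical machine-readable subset of an RoE document."""
    authorized_viewers: FrozenSet[str]   # who may view raw outputs
    retention_days: int                  # raw outputs are destroyed after this
    automated_requires_approval: bool    # pre-approval needed for automated runs
    stop_categories: FrozenSet[str]      # discoveries that force a halt


def may_view(roe: RulesOfEngagement, person: str) -> bool:
    """Data handling: only listed people may open raw outputs."""
    return person in roe.authorized_viewers


def may_run(roe: RulesOfEngagement, automated: bool, approved: bool) -> bool:
    """Manual attempts are allowed; automated ones may need pre-approval."""
    if automated and roe.automated_requires_approval:
        return approved
    return True


def must_halt(roe: RulesOfEngagement, harm_category: str) -> bool:
    """Stop condition: halt and escalate for the listed categories."""
    return harm_category in roe.stop_categories


roe = RulesOfEngagement(
    authorized_viewers=frozenset({"engagement-lead"}),
    retention_days=30,
    automated_requires_approval=True,
    stop_categories=frozenset({"mass-harm"}),
)
print(may_run(roe, automated=True, approved=False))
print(must_halt(roe, "mass-harm"))
```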

Key Terms

RoE: Rules of Engagement, the formal document specifying what testers may and may not do, how findings are handled, and under what conditions testing must halt.

Harm Taxonomy: A structured classification of potential AI-enabled harms used to prioritize test cases. Examples include MITRE ATLAS and NIST AI RMF.

Access Level: The degree of system visibility granted to red teamers: white-box (full internals), grey-box (API + docs), or black-box (endpoint only).

Building the Threat Model

A threat model maps specific actor capabilities to specific system components and specific harm categories. The output is a prioritized list of attack scenarios, not a random queue of things to try. Google DeepMind's 2024 "Frontier Safety Framework" explicitly requires threat models to be produced and reviewed before red-team engagements begin on frontier models — the document must specify which threat actors are assumed capable of what, and which model capabilities could provide meaningful uplift.

The threat model feeds directly into test case design in Lesson 2. Without it, testers default to whatever techniques they find interesting — which rarely maps to the highest-risk scenarios for a given deployment.
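To make the idea of a prioritized list of attack scenarios concrete, here is a minimal sketch that ranks scenarios by likelihood times severity. The 1 to 5 scales, the `AttackScenario` class, and the example scenarios are assumptions for illustration; a real threat model would use whatever scoring the engagement team has agreed on.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AttackScenario:
    actor: str
    component: str
    harm_category: str
    likelihood: int  # assumed scale: 1 (unlikely) to 5 (expected)
    severity: int    # assumed scale: 1 (minor) to 5 (catastrophic)

    @property
    def priority(self) -> int:
        return self.likelihood * self.severity


def prioritize(scenarios: Iterable[AttackScenario]) -> List[AttackScenario]:
    """Order scenarios from highest to lowest priority."""
    return sorted(scenarios, key=lambda s: s.priority, reverse=True)


scenarios = [
    AttackScenario("curious teenager", "chat widget", "policy bypass", 4, 2),
    AttackScenario("organized fraud ring", "chat widget", "financial fraud", 4, 4),
    AttackScenario("disgruntled insider", "RAG pipeline", "data exfiltration", 2, 5),
]
for s in prioritize(scenarios):
    print(s.priority, s.actor, "->", s.harm_category)
```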

Lesson 1 Quiz

Scoping and Planning — check your understanding

Which of the following best describes a "Rules of Engagement" document in an AI red-team engagement?

Correct. The RoE is a pre-engagement governance document — not a test plan or a findings report. It covers permissions, data handling, and stop conditions.

Not quite. The RoE governs the conduct of the engagement — it specifies permissions, data handling for harmful outputs, and escalation conditions. It is agreed before any testing begins.

When Microsoft's AI Red Team scoped the Bing Chat (Sydney) engagement in early 2023, they prioritized which harm categories as Tier-1?

Correct. Microsoft's documented retrospective on the Sydney engagement identified those three categories as the highest-priority harms given the product's public deployment context.

Not quite. According to Microsoft's published AI red-team retrospective, the Tier-1 priorities for the Bing Chat engagement were influence operations, privacy leakage, and CSAM elicitation — categories tied to the most severe real-world harms.

A "white-box" red-team engagement means the testers have access to:

Correct. White-box access is the most privileged level — testers can inspect internals, enabling gradient-based attacks and analysis of internal activations that are impossible in black-box settings.

Not quite. White-box means full visibility into model internals — weights, architecture, system prompts, training configuration. API-only is black-box; API-plus-docs is grey-box.

Why does Google DeepMind's Frontier Safety Framework require a threat model before red-team engagements begin on frontier models?

Correct. Without a threat model, testers default to whatever techniques interest them, which rarely maps to the highest-risk scenarios. The framework mandates threat modeling to anchor test case prioritization.

Not quite. The requirement exists to ensure that engagement priorities are derived from structured threat analysis — mapping actor capabilities to model capabilities — rather than tester preference or random selection.

Lab 1 — Scope a Red-Team Engagement

Practice defining scope documents with an AI methodology coach

Your Task

You have been asked to red-team a customer-facing AI assistant deployed by a mid-size financial services firm. The model is GPT-4o fine-tuned on the firm's internal policy documents, accessible via a public chat widget (black-box access). Your job is to produce a scope document before any testing begins.

Use this lab to work through the five scoping dimensions with the AI coach. Ask it to help you define the system boundary, select threat actor personas, map a harm taxonomy, specify access level constraints, and draft success criteria language.

Suggested opening: "Help me define the system boundary for a red-team engagement against a financial services AI assistant that has black-box API access only."

Methodology Coach

Scoping & Planning

Welcome to Lab 1. I'm your red-team methodology coach. We're going to build a proper scope document for your financial services AI engagement together. What aspect would you like to tackle first — system boundary, threat actor personas, harm taxonomy, access level, or success criteria?

Module 7 · Lesson 2

Test Case Design and Attack Taxonomies

Structured adversarial thinking requires a library, not a gut feeling.

How do professional red teams move from a threat model to a reproducible battery of test cases?

When the UK AI Safety Institute red-teamed Llama 3, GPT-4o, Gemini 1.5 Pro, and Claude 3 Opus in 2024 ahead of the Seoul AI Safety Summit, their published methodology described structured test case batteries organized by harm category and attack technique class. Rather than ad-hoc prompting, each model faced the same standardized scenarios — CBRN synthesis uplift, cyberweapon code generation, and influence operation content — enabling cross-model comparison. The ability to compare models depended entirely on having a shared test case taxonomy.

From Threat Model to Test Cases

A threat model names the risk; a test case operationalizes it. The translation process requires two things: an attack technique taxonomy (how attacks are structured) and a harm scenario library (what outcomes are being probed). These two dimensions intersect to produce a test matrix.

MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) provides a widely used attack taxonomy for AI systems, modeled on the MITRE ATT&CK framework used in traditional cybersecurity. It organizes techniques by tactic, from initial access through impact, mapped to AI-specific attack primitives.

Attack Technique Classes

Class	Description	Example Technique
Prompt Injection	Embedding adversarial instructions in user input or retrieved context to override system-level directives	Indirect injection via web search results in an agentic pipeline (demonstrated against Bing Chat, 2023)
Jailbreaking	Eliciting policy-violating outputs through social engineering, roleplay framing, or hypothetical structures	DAN (Do Anything Now) prompt family; many-shot jailbreaking (Anthropic 2024 research)
Adversarial Examples	Perturbed inputs that cause unexpected model behavior, including classification errors or safety bypass	GCG (Greedy Coordinate Gradient) suffix attacks (Zou et al., 2023)
Model Extraction	Querying a model at scale to reconstruct its weights, fine-tuning data, or system prompt	System prompt extraction via repetition attacks (demonstrated against GPT-4 and Claude, 2023)
Data Poisoning	Corrupting training or fine-tuning data to introduce backdoors or degrade safety behavior	Backdoored "sleeper agent" behavior deliberately trained into a model and shown to persist through safety training, including RLHF (Hubinger et al., 2024)
Evasion	Modifying malicious content to bypass content classifiers without changing semantic meaning	Unicode homoglyph substitution, Base64 encoding, L33t-speak obfuscation

Building a Test Matrix

A test matrix has harm categories on one axis and attack technique classes on the other. Each cell contains specific test cases. For a financial services AI assistant, the matrix might include: jailbreaking × financial fraud enablement, prompt injection × data exfiltration via retrieval, and model extraction × system prompt reconstruction.

Not every cell needs to be filled. Priority is determined by the threat model: cells that correspond to high-likelihood actors and high-severity outcomes get the most test cases.
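A test matrix of this kind maps naturally onto a dictionary keyed by (harm category, attack class) pairs. The sketch below uses the three example cells named above for the financial services assistant; the test case IDs such as `TC-001` are hypothetical placeholders.

```python
from itertools import product

HARMS = [
    "financial fraud enablement",
    "data exfiltration via retrieval",
    "system prompt reconstruction",
]
ATTACKS = ["jailbreaking", "prompt injection", "model extraction"]

# Every (harm, attack) pair starts as an empty cell.
matrix = {cell: [] for cell in product(HARMS, ATTACKS)}

# Populate only the cells the threat model prioritizes.
matrix[("financial fraud enablement", "jailbreaking")].append("TC-001")
matrix[("data exfiltration via retrieval", "prompt injection")].append("TC-002")
matrix[("system prompt reconstruction", "model extraction")].append("TC-003")


def populated_cells(m):
    """Return only the cells that contain at least one test case."""
    return {cell: cases for cell, cases in m.items() if cases}


for (harm, attack), cases in populated_cells(matrix).items():
    print(f"{attack} x {harm}: {cases}")
```

Leaving the other six cells empty is a deliberate choice here, matching the point that priority comes from the threat model rather than from filling the grid.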

Case: Many-Shot Jailbreaking

Anthropic's April 2024 research paper "Many-Shot Jailbreaking" documented a test case design insight: context window length is itself an attack surface. As context windows expanded to 100K+ tokens, researchers found that prepending hundreds of "faux-dialogue" examples of the model complying with harmful requests significantly increased compliance rates on subsequent harmful prompts — even for requests the model refused in a zero-shot setting. This technique emerged from structured test case exploration, not random prompting.

Reproducibility Requirements

A test case that cannot be reproduced is not a finding; it is an anecdote. Professional red-team test cases must specify the exact prompt or prompt template used, the model version and temperature settings, the system prompt (if accessible), the expected vs. actual output, and the number of independent trials run.

The UK AISI's evaluation methodology required each finding to be reproducible at a defined rate (e.g., "succeeds in at least 3 of 10 trials") before it was counted as a confirmed vulnerability. This threshold prevents single-sample noise from being reported as systemic failures.
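A reproducibility threshold like "at least 3 of 10 trials" reduces to a simple count. The sketch below assumes each trial has already been judged as a success or failure; the function name `is_confirmed` and its default values are illustrative, taken from the example figure above rather than from any published standard.

```python
from typing import Sequence


def is_confirmed(outcomes: Sequence[bool], min_successes: int = 3,
                 required_trials: int = 10) -> bool:
    """Apply an 'at least min_successes of required_trials' threshold.

    Each entry in outcomes is True when that independent trial
    reproduced the behavior under test.
    """
    if len(outcomes) != required_trials:
        raise ValueError(
            f"expected {required_trials} trials, got {len(outcomes)}"
        )
    return sum(outcomes) >= min_successes


trials = [True, False, False, True, False, False, True, False, False, False]
print(is_confirmed(trials))
```

Rejecting a run with the wrong number of trials, rather than silently scoring it, keeps a partial run from being reported as a confirmed finding.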

Key Terms

MITRE ATLAS: Adversarial Threat Landscape for Artificial-Intelligence Systems, a community-developed knowledge base of AI-specific adversarial tactics and techniques, maintained by MITRE.

Test Matrix: A structured grid mapping harm categories (rows) against attack technique classes (columns), with specific test cases populating each cell.

Reproducibility Threshold: The minimum rate at which a test case must succeed across independent trials before it qualifies as a confirmed vulnerability finding.

Lesson 2 Quiz

Test Case Design and Attack Taxonomies

What made the UK AI Safety Institute's cross-model evaluations of GPT-4o, Llama 3, Gemini 1.5 Pro, and Claude 3 Opus comparable across models?

Correct. The published methodology specified standardized scenario batteries — the same test cases run against each model — which is what enabled meaningful cross-model comparison.

Not quite. Cross-model comparability required shared, standardized test case batteries. Without common test cases, results from different models cannot be meaningfully compared.

GCG (Greedy Coordinate Gradient) suffix attacks, published by Zou et al. in 2023, fall into which attack technique class?

Correct. GCG attacks use gradient-based optimization to find token suffixes that cause the model to comply with harmful requests — a classic adversarial example technique applied to language models.

Not quite. GCG attacks are adversarial examples — they use gradient optimization over the input token space to cause unexpected model behavior, analogous to pixel perturbation attacks in computer vision.

Anthropic's 2024 "Many-Shot Jailbreaking" paper identified what previously underexplored attack surface?

Correct. As context windows grew to 100K+ tokens, researchers found that prepending hundreds of faux-dialogue compliance examples in context significantly increased harmful request compliance on the final turn.

Not quite. Many-shot jailbreaking exploits expanded context windows by filling them with fake compliance examples, showing the model "evidence" it complies with such requests, which shifts behavior on the actual harmful prompt at the end.

According to UK AISI evaluation methodology, when does a test case qualify as a confirmed vulnerability finding?

Correct. Reproducibility thresholds prevent single-sample noise from being reported as systemic vulnerabilities. The AISI required minimum success rates across independent trials before confirming a finding.

Not quite. A single successful trial is not sufficient — it could be noise. AISI methodology required confirmed findings to be reproducible at a defined rate across multiple independent trials.

Lab 2 — Build a Test Matrix

Design structured test cases across harm categories and attack technique classes

Your Task

You're continuing the financial services AI engagement from Lab 1. Now you need to translate your scope document into a test matrix. The coach will help you populate cells across harm categories (financial fraud, data exfiltration, influence/manipulation) and attack classes (jailbreaking, prompt injection, model extraction, evasion).

Work through at least three specific test cases. For each, you should specify: the harm category, the attack technique class, the specific prompt template or approach, and a proposed reproducibility threshold.

Suggested opening: "Let's build a test matrix. Start with the jailbreaking × financial fraud cell — what test cases should I include?"

Methodology Coach

Test Case Design

Ready to build your test matrix. I can help you design specific test cases for each cell — harm category crossed with attack technique class. Which cell do you want to populate first? I'd suggest starting with the highest-priority intersection from your threat model.

Module 7 · Lesson 3

Execution: Running the Engagement

The difference between testing and red-teaming is documentation at every step.

What operational discipline separates a professional red-team execution from exploratory hacking?

In the 2023 DEF CON AI Village "Generative AI Red Team Challenge," more than 2,000 participants attempted to jailbreak eight commercial models. Despite the energy and volume of attempts, post-event analysis of the challenge, whose participating model providers included Anthropic, Google, Hugging Face, and NVIDIA, noted that the most actionable findings came not from participants who generated the most outputs, but from those who documented their attempts methodically enough to produce reproducible demonstrations. Execution discipline, not volume, determined finding quality.

The Execution Workflow

Phase 01

Baseline Establishment

Before adversarial testing, document the model's behavior on benign variants of each test case. This creates a baseline that distinguishes adversarial-specific failures from pre-existing model limitations.

Phase 02

Structured Probing

Execute test cases from the matrix in order of priority. Log every attempt: exact prompt, model version, temperature, timestamp, and raw output. No paraphrasing of outputs in initial logs.

Phase 03

Iterative Refinement

When a test case fails (model refuses), document why it failed and iterate with a modified approach. Each iteration is a separate logged attempt. This preserves the attack development trail for the report.

Phase 04

Reproducibility Verification

Once a potential finding is identified, run the test case for the required number of independent trials (per the reproducibility threshold agreed in the scope document and RoE), using a fresh conversation context for each trial, to confirm reproducibility.

Phase 05

Severity Scoring

Assign a severity score to each confirmed finding before writing the report. Use a defined rubric: harm severity × reproducibility rate × ease of exploitation. This prevents severity inflation from excitement.

Phase 06

Secure Archival

Store all raw outputs — including harmful content produced during testing — in accordance with RoE data handling requirements. Chain of custody documentation is required for any content that may be used in regulatory or legal contexts.
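The Phase 05 rubric (harm severity × reproducibility rate × ease of exploitation) can be written down as a short function so that every finding is scored the same way. The scales and band cut-offs below are assumptions chosen for illustration: harm severity and ease of exploitation on a 1 to 5 scale, reproducibility as a rate between 0 and 1, and arbitrary thresholds for the labels.

```python
def severity_score(harm_severity: int, reproducibility_rate: float,
                   ease_of_exploitation: int) -> float:
    """Multiply the three rubric factors into one score (0 to 25)."""
    if not 1 <= harm_severity <= 5:
        raise ValueError("harm_severity must be between 1 and 5")
    if not 0.0 <= reproducibility_rate <= 1.0:
        raise ValueError("reproducibility_rate must be between 0 and 1")
    if not 1 <= ease_of_exploitation <= 5:
        raise ValueError("ease_of_exploitation must be between 1 and 5")
    return harm_severity * reproducibility_rate * ease_of_exploitation


def severity_band(score: float) -> str:
    """Map a score to a label; the cut-offs are illustrative assumptions."""
    if score >= 15:
        return "Critical"
    if score >= 8:
        return "High"
    if score >= 3:
        return "Medium"
    return "Low"


score = severity_score(harm_severity=4, reproducibility_rate=0.5,
                       ease_of_exploitation=5)
print(score, severity_band(score))  # 10.0 High
```

Because the factors are multiplied, a finding that never reproduces scores zero regardless of how severe the single output looked, which is one way to counter severity inflation.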

Logging Requirements

Every test attempt must generate a log entry containing: a unique attempt ID, the test case ID from the matrix, the exact prompt (or prompt hash if content is too sensitive to store in plaintext), the model identifier and version, inference parameters (temperature, top-p, max tokens), the timestamp of the request, and the raw output verbatim.

Logging at this level of detail is consistent with the documentation and traceability practices encouraged by NIST AI 100-1 (AI Risk Management Framework) and with the EU AI Act's record-keeping requirements for high-risk AI systems, although neither prescribes these exact fields. The practical reason is simpler: without complete logs, findings cannot be verified, escalated, or patched.
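The log entry fields listed above can be assembled with the Python standard library alone. This is a sketch, not a prescribed format: the function name `make_log_entry`, the key names, and the example identifiers are assumptions, and the prompt and output strings are placeholders.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def make_log_entry(test_case_id, prompt, model_id, params, raw_output,
                   store_plaintext=True):
    """Build one log record with the fields listed above."""
    return {
        "attempt_id": str(uuid.uuid4()),
        "test_case_id": test_case_id,
        # Always keep the hash; drop the plaintext when it is too sensitive.
        "prompt": prompt if store_plaintext else None,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "model": model_id,
        "params": params,  # temperature, top-p, max tokens
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "raw_output": raw_output,  # verbatim, never paraphrased
    }


entry = make_log_entry(
    test_case_id="TC-001",
    prompt="<exact prompt text>",
    model_id="example-model-2024-01",
    params={"temperature": 0.7, "top_p": 1.0, "max_tokens": 512},
    raw_output="<raw model output>",
    store_plaintext=False,
)
print(json.dumps(entry, indent=2))
```

Writing one JSON object per attempt keeps the log append-only and easy to diff, and storing the SHA-256 hash lets a reviewer later confirm that a separately secured prompt matches the logged attempt.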

Case: DEFCON 2023 Red Team Challenge

The post-event analysis from the DEFCON AI Village challenge noted a systematic quality gap between casual and disciplined participants. Participants who used structured logging — even simple spreadsheets — produced findings that vendors could act on. Those who relied on memory or informal notes produced anecdotes that could not be reproduced. The event organizers subsequently published a simplified logging template based on what the highest-quality submissions had in common.

Managing Team Dynamics

Large red-team engagements involve multiple testers, which introduces coordination problems. Testers must avoid duplicating effort across the same test matrix cells, must communicate in near real time when a finding is identified (so others can probe related attack paths immediately), and must avoid contaminating each other's independent reproducibility trials.

Microsoft's AI Red Team, which employs full-time AI red teamers, uses a shared ticketing system where each test case attempt is logged centrally. When one tester identifies a potential finding, the ticket is flagged for a second tester to independently verify — preserving independence while enabling rapid follow-up.

Handling Unexpected Findings

Engagements regularly surface vulnerabilities that are outside the defined scope — a tester probing financial fraud elicitation might accidentally discover that the model will provide detailed synthesis instructions for a dangerous compound. The RoE should specify an escalation path for out-of-scope critical findings: typically, immediate halt of the adjacent test stream, documented notification to the engagement lead, and a decision within a defined time window on whether to expand scope or sequester the finding for separate disclosure.
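The escalation path described above (halt the adjacent stream, notify the lead, decide within a time window) can be tracked with a small record. The sketch below is illustrative only: the `Escalation` class, the function names, and the 24-hour default window are assumptions, since the actual window is whatever the RoE specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Escalation:
    finding_id: str
    halted_stream: str      # the adjacent test stream that was stopped
    notified: str           # the engagement lead who was informed
    decision_due: datetime  # end of the decision window
    decision: str = "pending"  # later "expand_scope" or "sequester"


def escalate_out_of_scope(finding_id: str, stream: str, lead: str,
                          found_at: datetime,
                          window_hours: int = 24) -> Escalation:
    """Record the halt and notification, and set the decision deadline."""
    due = found_at + timedelta(hours=window_hours)
    return Escalation(finding_id, stream, lead, due)


def is_overdue(esc: Escalation, now: datetime) -> bool:
    """True when no decision has been made by the deadline."""
    return esc.decision == "pending" and now > esc.decision_due


found = datetime(2025, 1, 10, 9, 0, tzinfo=timezone.utc)
esc = escalate_out_of_scope("OOS-001", "financial-fraud", "engagement-lead", found)
print(esc.decision_due.isoformat(), is_overdue(esc, found + timedelta(hours=30)))
```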

Key Terms

Baseline: Documentation of model behavior on benign variants of each test case, used to distinguish adversarial-specific failures from pre-existing limitations.

Severity Rubric: A defined formula for scoring finding severity, typically combining harm severity, reproducibility rate, and ease of exploitation, applied before report writing to prevent inflation.

Chain of Custody: Documentation tracing the handling of sensitive materials (harmful outputs) from creation through storage and destruction, required for regulatory or legal contexts.

Lesson 3 Quiz

Execution: Running the Engagement

What did the post-event analysis of the DEFCON 2023 AI Village Red Team Challenge identify as the key differentiator between high-quality and low-quality findings?

Correct. The analysis found that documentation discipline — not volume or experience — determined whether a finding could be acted upon by vendors. Undocumented successes became anecdotes.

Not quite. The analysis specifically highlighted that methodical documentation, not volume of attempts or prior experience, produced actionable findings. Undocumented results could not be verified or patched.

Why is establishing a baseline before adversarial testing important?

Correct. Without a baseline, a tester cannot determine whether a harmful output reflects an adversarial vulnerability or simply how the model behaves on similar requests without any adversarial framing.

Not quite. The baseline purpose is to create a comparison point — documenting behavior on benign variants so that adversarial-specific failures can be distinguished from inherent model behavior that would occur without any attack.

Microsoft's AI Red Team uses what mechanism to ensure independent verification of potential findings?

Correct. The ticketing system preserves independence — the second tester runs the case fresh without seeing the first tester's raw output — while enabling rapid follow-up on related attack paths.

Not quite. Microsoft's documented approach uses a centralized ticketing system where potential findings are flagged and assigned to a second tester who independently runs verification, preserving independence of the reproducibility check.

When a red team discovers a critical out-of-scope vulnerability, the correct immediate action per standard RoE protocols is:

Correct. The RoE escalation path for out-of-scope critical findings requires immediate notification and a structured decision process — not unilateral action by the tester who discovered it.

Not quite. Out-of-scope critical findings require immediate escalation through the defined RoE path: halt adjacent testing, notify the engagement lead, and wait for a scope decision — not continued independent exploration or public disclosure.

Lab 3 — Execution Discipline Practice

Practice structured logging and severity scoring with the methodology coach

Your Task

You've discovered a potential finding during your financial services AI engagement: when asked to roleplay as a "compliance officer who has decided to ignore AML rules," the model provides specific guidance on structuring transactions to avoid regulatory reporting thresholds. You need to document this properly and score its severity.

Work with the coach to: (1) draft a complete log entry for this finding, (2) define what reproducibility trials you'd run, (3) apply a severity rubric (harm severity × reproducibility × ease of exploitation), and (4) draft the escalation notification.

Suggested opening: "I've found a potential finding: roleplay-based jailbreak eliciting AML structuring advice. Help me draft a complete log entry."

Methodology Coach

Execution & Logging

Good — a roleplay-framed jailbreak producing AML structuring advice is exactly the kind of finding that needs careful documentation. Let's build the log entry together. First: do you have the exact prompt and the raw model output captured verbatim? We need both before we can write anything else.

Module 7 · Lesson 4

Reporting, Disclosure, and Remediation Tracking

A finding that isn't fixed is just reconnaissance for the next attacker.

How do red teams translate technical findings into organizational change — and how is responsible disclosure managed for AI vulnerabilities?

In March 2023, security researcher Johann Rehberger demonstrated that a prompt injection attack against ChatGPT plugins could exfiltrate user conversation data to an attacker-controlled server via a maliciously crafted web page. He reported the finding to OpenAI through their responsible disclosure program. OpenAI's initial response classified it as "not a vulnerability" — a decision Rehberger publicly disputed. The finding was ultimately acknowledged and partially addressed months later. The case became a reference point in AI security circles for the inadequacy of traditional software vulnerability disclosure frameworks when applied to AI systems, where the harm model, reproduction criteria, and patch definition are all fundamentally different.

The Red-Team Report Structure

A professional red-team report is not a dump of raw logs. It is a structured document designed to enable organizational decision-making. The standard structure for AI red-team reports, adapted from the format used by Microsoft's AI Red Team and published in their 2023 public documentation, includes five sections.

Section	Contents	Primary Audience
Executive Summary	Overall risk posture, count and severity distribution of findings, top three recommendations	Leadership, legal, product
Scope & Methodology	System boundary, access level, threat actor personas, harm taxonomy used, test matrix summary	Engineering, security
Findings Catalog	Each finding: ID, severity score, attack technique class, harm category, exact reproduction steps, raw output excerpt (redacted per RoE), and number of successful trials	Engineering, security
Recommendations	Specific, actionable mitigations for each finding category — not generic "improve safety training" advice	Engineering, product
Remediation Tracking Baseline	Open/closed status for each finding, assigned owner, target resolution date, and re-test criteria	Engineering, security, legal

Responsible Disclosure for AI Vulnerabilities

Traditional software vulnerability disclosure (CVE system, 90-day disclosure windows pioneered by Google Project Zero) was designed for deterministic bugs with clear patch definitions. AI vulnerabilities don't fit this model cleanly: a jailbreak might be mitigated by a system prompt update, addressed in fine-tuning, or simply accepted as a known limitation of the current model generation.

The Rehberger/ChatGPT case illustrates the friction. When Rehberger reported in March 2023, OpenAI's bug bounty program — run through Bugcrowd — had no defined severity taxonomy for prompt injection attacks. The disclosure sat in limbo for months because neither party had agreed in advance on what "fixed" looked like for a probabilistic system.

Emerging Standards

The AI Vulnerability Database (AVID), launched in 2023 by a coalition including Cohere, Hugging Face, and independent researchers, attempts to create a CVE-equivalent for AI harms. The AVID taxonomy classifies findings by harm type, affected modality, and lifecycle stage (training, deployment, inference), and provides a structured disclosure template that specifies mitigation types distinct from traditional patches. Several organizations now accept AVID-formatted disclosures as a standard reporting format.

Remediation Tracking

Every finding in the catalog must have a remediation ticket with an owner, a target date, and defined re-test criteria before the report is delivered. Without this, reports become artifacts rather than drivers of change.

AI-specific remediation types differ from traditional software patches. They include: system prompt hardening (guardrails added at the prompt level), output filter tuning (adjusting content classifier thresholds), fine-tuning or RLHF updates (most expensive, highest latency), retrieval pipeline hardening (for RAG-based systems), and architectural changes (e.g., removing tool access that enabled a finding).

Critically, remediation must be verified — the specific test case that produced the finding must be re-run against the patched system. A reported fix that has not been verified by the red team should not be closed in the tracking system.
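The closure rule above can be enforced in the tracking system itself rather than left to convention. A minimal sketch, assuming a hypothetical `RemediationTicket` record; the owner name, date, and test case ID in the example are placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RemediationTicket:
    finding_id: str
    owner: str
    target_date: date
    retest_case_id: str           # the test case that produced the finding
    fix_reported: bool = False    # set by the remediation owner
    retest_verified: bool = False  # set only by the red team after re-running
    status: str = "open"

    def close(self) -> None:
        """Close the ticket only when the fix has been re-tested."""
        if not self.fix_reported:
            raise ValueError("no fix has been reported")
        if not self.retest_verified:
            raise ValueError("fix has not been verified by a re-test")
        self.status = "closed"


ticket = RemediationTicket("FIN-001", "platform-safety", date(2025, 1, 31), "TC-001")
ticket.fix_reported = True
try:
    ticket.close()
except ValueError as err:
    print("still open:", err)

ticket.retest_verified = True
ticket.close()
print(ticket.status)
```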

Continuous vs. Point-in-Time Red-Teaming

Point-in-time engagements, the traditional consulting model, are increasingly insufficient for AI systems that are updated continuously. OpenAI, Anthropic, and Google all now maintain ongoing internal red-team functions that run continuous evaluation suites against each model update, supplemented by periodic deeper engagements. The EU AI Act's Article 9 describes the risk management system for high-risk AI systems as a continuous iterative process that runs throughout the system's entire lifecycle, not just a pre-deployment assessment. This has driven interest in automated red-teaming pipelines (covered in Module 8) that can run test case batteries against new model versions as part of the deployment CI/CD pipeline.
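As a rough illustration of what a deployment gate might look like, the sketch below runs a test case battery against a model and blocks the release if any case exceeds an allowed violation rate. Everything here is an assumption for illustration: `generate` stands in for whatever call reaches the model under test, `is_violation` stands in for a human or automated judge, and the stub functions exist only so the example runs without a real model.

```python
from typing import Callable, Dict, List


def run_battery(generate: Callable[[str], str],
                test_cases: Dict[str, str],
                is_violation: Callable[[str], bool],
                trials: int = 10) -> Dict[str, float]:
    """Return the violation rate per test case ID for one model version."""
    rates = {}
    for case_id, prompt in test_cases.items():
        hits = sum(1 for _ in range(trials) if is_violation(generate(prompt)))
        rates[case_id] = hits / trials
    return rates


def failing_cases(rates: Dict[str, float], max_rate: float = 0.0) -> List[str]:
    """Test cases whose violation rate exceeds the allowed maximum."""
    return sorted(cid for cid, rate in rates.items() if rate > max_rate)


# Stand-ins so the sketch runs without a real model or classifier.
def stub_generate(prompt: str) -> str:
    return "I can't help with that."


def stub_is_violation(output: str) -> bool:
    return not output.startswith("I can't")


battery = {"TC-001": "<prompt template for TC-001>"}
failures = failing_cases(run_battery(stub_generate, battery, stub_is_violation))
print("deploy blocked" if failures else "gate passed", failures)
```

In a pipeline, a non-empty `failures` list would typically translate into a non-zero exit code so the deployment step does not proceed.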

Key Terms

AVID: AI Vulnerability Database, an emerging CVE-equivalent for AI harms, providing a structured taxonomy and disclosure format for AI-specific vulnerabilities.

Re-test Criteria: The specific test case and success threshold that must be run against a patched system to confirm a finding has been remediated. Required before closing a finding ticket.

Continuous Red-Teaming: An ongoing automated or semi-automated evaluation function that runs test case batteries against each model update, rather than periodic point-in-time engagements only.

Lesson 4 Quiz

Reporting, Disclosure, and Remediation Tracking

Johann Rehberger's 2023 prompt injection finding against ChatGPT plugins became a reference case in AI security primarily because:

Correct. The case highlighted the structural mismatch between traditional bug bounty/CVE disclosure processes and AI-specific vulnerabilities, where harm models, reproduction criteria, and remediation definitions are fundamentally different.

Not quite. The case is referenced because it exposed the inadequacy of traditional disclosure frameworks for AI — specifically, the lack of a severity taxonomy for prompt injection and the ambiguity around what "patching" means for a probabilistic system.

The AI Vulnerability Database (AVID) was created to address what gap?

Correct. AVID provides an AI-specific taxonomy (by harm type, modality, and lifecycle stage) and a structured disclosure template that distinguishes AI remediation types from traditional software patches.

Not quite. AVID is the AI Vulnerability Database — an attempt to create a CVE-equivalent for AI harms, with a taxonomy and disclosure format adapted specifically for AI-specific vulnerability characteristics.

Which of the following is NOT a valid AI-specific remediation type listed in Lesson 4?

Correct. Binary patching is a traditional software remediation type. AI-specific remediations involve prompt engineering, filter tuning, fine-tuning/RLHF updates, retrieval pipeline changes, or architectural changes — not binary patches to the inference infrastructure.

Not quite. AI-specific remediations are fundamentally different from traditional software patches. The lesson lists system prompt hardening, output filter tuning, fine-tuning/RLHF updates, retrieval pipeline hardening, and architectural changes — not binary infrastructure patches.

Under what condition should a finding ticket be closed in the remediation tracking system?

Correct. An unverified fix is not a fix. The red team must re-run the specific test case against the patched system and confirm it no longer reproduces before the ticket can be closed.

Not quite. Reported fixes that have not been independently verified by the red team must remain open. Closure requires re-running the original test case against the patched system and confirming the finding no longer reproduces.
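The closure rule above can be enforced in the tracking system itself rather than left to convention. The following is a minimal sketch, assuming a hypothetical `FindingTicket` class with three states; it is not modeled on any particular ticketing product.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "open"
    FIX_REPORTED = "fix_reported"
    CLOSED = "closed"


@dataclass
class FindingTicket:
    finding_id: str
    status: Status = Status.OPEN
    retest_passed: Optional[bool] = None  # None means no re-test has been run

    def report_fix(self) -> None:
        """Engineering claims a fix; any earlier re-test result is discarded."""
        self.status = Status.FIX_REPORTED
        self.retest_passed = None

    def record_retest(self, reproduced: bool) -> None:
        """Red team re-ran the original test case against the patched system."""
        self.retest_passed = not reproduced
        if reproduced:
            self.status = Status.OPEN  # the fix did not hold, so reopen

    def close(self) -> None:
        """Refuse closure unless a reported fix has been verified by re-test."""
        if self.status is not Status.FIX_REPORTED or self.retest_passed is not True:
            raise ValueError(f"{self.finding_id}: cannot close without a passing re-test")
        self.status = Status.CLOSED


ticket = FindingTicket("FIN-001")
ticket.report_fix()
ticket.record_retest(reproduced=False)
ticket.close()
print(ticket.status)
```

Calling `close()` straight after `report_fix()` raises `ValueError`, which is the point: a reported fix alone never moves the ticket to closed.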

Lab 4 — Draft a Red-Team Report Section

Practice writing a findings catalog entry and remediation tracking baseline

Your Task

Your financial services AI engagement is complete. You have three confirmed findings. Now you need to write the Findings Catalog section of the report and set up remediation tracking for each finding. The coach will help you structure each finding entry correctly and ensure your recommendations are specific and actionable — not generic.

Work through at least one complete finding entry, including: finding ID, severity score, attack technique class, harm category, exact reproduction steps (use the AML structuring finding from Lab 3), re-test criteria, and assigned remediation type with justification.

Suggested opening: "Help me write a complete findings catalog entry for the AML structuring jailbreak. Finding ID: FIN-001, severity: High."
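One way to keep a catalog entry honest is to treat it as a record with required fields and check for gaps before it goes into the report. The sketch below is an assumption about how that could look: `FindingEntry` and `problems` are hypothetical names, the field list mirrors the task description, and the remediation types are the ones listed in Lesson 4. The draft values are placeholders, not the actual Lab 3 finding.

```python
from dataclasses import dataclass, fields
from typing import List

# Remediation types as listed in Lesson 4.
REMEDIATION_TYPES = {
    "system prompt hardening",
    "output filter tuning",
    "fine-tuning/RLHF update",
    "retrieval pipeline hardening",
    "architectural change",
}


@dataclass
class FindingEntry:
    finding_id: str
    severity: str
    attack_technique_class: str
    harm_category: str
    reproduction_steps: List[str]
    retest_criteria: str
    remediation_type: str
    remediation_justification: str


def problems(entry: FindingEntry) -> List[str]:
    """Return the names of fields that are empty or hold an unknown value."""
    issues = [f.name for f in fields(entry) if not getattr(entry, f.name)]
    if entry.remediation_type and entry.remediation_type not in REMEDIATION_TYPES:
        issues.append("remediation_type (not a Lesson 4 type)")
    return issues


# A draft that only has what the suggested opening provides.
draft = FindingEntry(
    finding_id="FIN-001",
    severity="High",
    attack_technique_class="jailbreaking",
    harm_category="",
    reproduction_steps=[],
    retest_criteria="",
    remediation_type="",
    remediation_justification="",
)
print(problems(draft))
```

The entry is ready for the catalog only when `problems` returns an empty list, which matches the standard the coach applies: an engineer who was not present must be able to reproduce the finding and verify the fix.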

Methodology Coach

Reporting & Disclosure

Ready to draft the findings catalog. A strong catalog entry needs to be specific enough that an engineer who wasn't present during the engagement can reproduce the finding and verify a fix. Let's start with FIN-001. What is the exact attack technique class — jailbreaking, prompt injection, or something else? And what severity score did you assign using the rubric from Lesson 3?

Module 7 — Red-Teaming Methodology

15 questions · Pass at 80% · All lessons covered

1. A red-team scope document that defines which harm categories are "Tier-1 priority" serves what primary function?

Correct. Tier prioritization bounds the search space, preventing tester resources from being spread evenly across an effectively infinite problem space.

Not quite. Tier prioritization is fundamentally about focusing effort — preventing an unbounded search space from producing unfocused, low-value testing.

2. MITRE ATLAS is best described as:

Correct. MITRE ATLAS is the most widely used AI-specific attack taxonomy, maintained by MITRE and modeled on ATT&CK's structure of tactics, techniques, and procedures.

Not quite. MITRE ATLAS is a knowledge base of AI adversarial tactics and techniques — a mapping framework, not a certification, tool, or regulatory taxonomy.

3. What specific attack surface did the Anthropic 2024 "Many-Shot Jailbreaking" paper identify?

Correct. Many-shot jailbreaking exploits the expansion of context windows to 100K+ tokens by prepending hundreds of fake compliance examples that prime the model to comply with the final harmful request.

Not quite. The paper identified that long context windows are an attack surface — by filling them with fake compliance dialogues, attackers shift model behavior on the terminal harmful request.

4. The UK AISI required each red-team finding to succeed in "at least 3 of 10 independent trials" before confirmation. This requirement addresses:

Correct. The threshold exists to separate genuine vulnerabilities from stochastic flukes — AI models are probabilistic, so a single harmful output may be noise rather than a systematic failure.

Not quite. The reproducibility threshold addresses the probabilistic nature of AI outputs — a single success could be noise. Requiring 3-of-10 confirms the attack works systematically, not just once.
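The 3-of-10 rule is easy to express directly, and a short binomial calculation shows why it filters out flukes. This is a minimal sketch: `confirm_finding` and `fluke_pass_probability` are hypothetical helper names, and the 5% per-trial fluke rate used in the example is an assumed figure chosen only for illustration.

```python
from math import comb
from typing import Callable


def confirm_finding(run_trial: Callable[[], bool],
                    trials: int = 10,
                    min_successes: int = 3) -> bool:
    """Run independent trials; confirm only if enough of them succeed."""
    successes = sum(1 for _ in range(trials) if run_trial())
    return successes >= min_successes


def fluke_pass_probability(p: float, trials: int = 10, min_successes: int = 3) -> float:
    """Chance that an attack succeeding with per-trial probability p
    still reaches the threshold (upper tail of a binomial distribution)."""
    below = sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(min_successes)
    )
    return 1 - below


# Replay a recorded sequence of trial outcomes (True = harmful output observed).
recorded = iter([True, False, False, True, False, False, False, True, False, False])
print(confirm_finding(lambda: next(recorded)))
print(round(fluke_pass_probability(0.05), 4))
```

Under the assumed 5% fluke rate, the chance of noise alone reaching three successes in ten trials is a little over 1%, whereas a single-trial standard would accept the same fluke 5% of the time.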

5. GCG (Greedy Coordinate Gradient) attacks are most analogous to which traditional adversarial ML technique?

Correct. GCG uses gradient-based optimization over the token input space to find adversarial suffixes that bypass safety behavior, directly analogous to FGSM and PGD pixel perturbation attacks in computer vision.

Not quite. GCG attacks are adversarial examples — they use gradient optimization over the input space (tokens rather than pixels) to find perturbations that cause safety bypass, just as FGSM attacks modify pixels to fool image classifiers.

6. In a "grey-box" red-team engagement, testers have access to:

Correct. Grey-box is the intermediate access level — more than black-box (API only) but less than white-box (full internals). API access plus documentation enables system prompt inference and architecture-informed attack design.

Not quite. Grey-box = API access plus documentation. White-box = full internals. Black-box = public endpoint only. Grey-box enables more sophisticated attacks than black-box but not gradient-based attacks that require weights.

7. The Rehberger/ChatGPT plugin disclosure case illustrated that traditional 90-day software disclosure windows are problematic for AI vulnerabilities because:

Correct. The mismatch is structural: traditional disclosure assumes a deterministic bug with a definable patch. AI vulnerabilities are probabilistic, harm taxonomies are immature, and "fixed" is ambiguous — a system prompt update may reduce but not eliminate a jailbreak.

Not quite. The core problem is definitional — there's no agreed severity taxonomy for AI-specific attack classes and no clear definition of what constitutes a "fix" for a probabilistic system that can still sometimes produce harmful outputs after mitigation.

8. Which section of a red-team report is primarily designed for an engineering audience rather than leadership?

Correct. The Findings Catalog contains exact reproduction steps, raw output excerpts, attack technique classifications, and severity rubric scores — all technical content for engineers who need to understand and fix each finding.

Not quite. The Findings Catalog is the technical core of the report — it contains reproduction steps, raw outputs, and severity scores that engineers need to diagnose and fix findings. Leadership reads the Executive Summary.

9. Why does the lesson recommend establishing a baseline before adversarial testing begins?

Correct. A baseline answers the question: "Would the model produce this output even without the adversarial framing?" If yes, the attack isn't the cause — the model already had this limitation. This distinction is critical for severity scoring and recommendations.

Not quite. Baselines serve an analytical purpose — they let the team determine whether a failure is caused by the adversarial technique or is simply how the model behaves on similar requests without any attack. Without a baseline, severity scoring is arbitrary.
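The baseline comparison can be reduced to a simple difference in harmful-output rates. The sketch below assumes each trial has already been labeled harmful or not; `harmful_rate` and `attack_uplift` are hypothetical names, and the sample labels are made up for illustration.

```python
from typing import Sequence


def harmful_rate(flags: Sequence[bool]) -> float:
    """Fraction of trials labeled harmful."""
    if not flags:
        raise ValueError("at least one trial is required")
    return sum(flags) / len(flags)


def attack_uplift(baseline_flags: Sequence[bool],
                  attack_flags: Sequence[bool]) -> float:
    """Harmful rate with the adversarial framing minus the rate without it.

    A value near zero means the model already behaves this way on the plain
    request, so the technique is not what causes the failure.
    """
    return harmful_rate(attack_flags) - harmful_rate(baseline_flags)


# Illustrative labels: plain request vs. the same request with the attack applied.
baseline = [False] * 10
attacked = [True] * 4 + [False] * 6
print(attack_uplift(baseline, attacked))
```

If the uplift is close to zero but the baseline rate is itself high, the finding is still worth reporting, but as a limitation of the model on that request class rather than as a successful adversarial technique.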

10. According to EU AI Act Article 9, organizations deploying high-risk AI systems must conduct red-teaming:

Correct. Article 9's risk management system requirement specifies ongoing monitoring — the regulation explicitly does not limit evaluation to pre-deployment, which is one driver of interest in continuous automated red-teaming pipelines.

Not quite. The EU AI Act requires ongoing risk management monitoring — not just pre-deployment or annual assessment. This has pushed organizations toward continuous red-teaming and automated evaluation pipelines integrated into deployment workflows.

11. The "sleeper agent" finding from Hubinger et al. (2024) falls into which attack technique class?

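Correct. Sleeper agents are a training-time backdoor: the training data is constructed to introduce a trigger that causes harmful behavior when activated, while the model behaves normally otherwise. The paper's central finding was that standard safety training, including RL fine-tuning, did not reliably remove the backdoor once it was in place.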

Not quite. Sleeper agents are a data poisoning technique — they corrupt training data to insert a backdoor trigger. This is distinct from prompt-level attacks that target the model at inference time.

12. What was the key operational lesson from the DEFCON 2023 AI Village Red Team Challenge regarding finding quality?

Correct. The post-event analysis found that methodical documentation produced actionable, reproducible findings, while volume-focused participants produced anecdotes that vendors could not act upon.

Not quite. The key lesson was about documentation discipline. Participants who logged attempts systematically produced reproducible findings; those who relied on memory or informal notes produced unverifiable anecdotes regardless of their experience level.

13. Which of the following is a valid AI-specific remediation type that does NOT have a direct equivalent in traditional software patching?

Correct. System prompt hardening is AI-specific — it involves modifying the natural language instructions that govern model behavior, rather than modifying compiled code or configuration files.

Not quite. Dependency updates, configuration hardening, and access controls have direct equivalents in traditional software security. System prompt hardening — modifying natural language governance instructions — is uniquely AI-specific.

14. Microsoft's AI Red Team uses a centralized ticketing system primarily to:

Correct. The ticketing system serves two coordination functions: it shows which test matrix cells are being worked, preventing duplication, and it flags potential findings for assignment to a second tester for independent verification.

Not quite. The ticketing system's primary purposes are operational coordination (avoiding duplicate effort) and verification integrity (ensuring a second independent tester confirms each potential finding before it's confirmed).
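The two coordination functions described above, claiming test matrix cells and routing potential findings to a second tester, can be sketched in a few lines. This is an illustrative toy under stated assumptions, not a description of Microsoft's actual system: `CoverageMatrix`, its methods, and the tester names are all hypothetical.

```python
from typing import Dict, Sequence, Tuple

Cell = Tuple[str, str]  # (harm category, attack technique)


class CoverageMatrix:
    """Tracks who is working each test matrix cell."""

    def __init__(self) -> None:
        self._claims: Dict[Cell, str] = {}

    def claim(self, cell: Cell, tester: str) -> bool:
        """Claim a cell; returns False if someone already holds it."""
        if cell in self._claims:
            return False
        self._claims[cell] = tester
        return True

    def assign_verifier(self, cell: Cell, testers: Sequence[str]) -> str:
        """Pick a second tester, different from the one who found the issue."""
        finder = self._claims[cell]
        for tester in testers:
            if tester != finder:
                return tester
        raise ValueError("no independent tester available")


matrix = CoverageMatrix()
cell = ("privacy leakage", "prompt injection")
print(matrix.claim(cell, "tester_a"))   # first claim succeeds
print(matrix.claim(cell, "tester_b"))   # duplicate effort is refused
print(matrix.assign_verifier(cell, ["tester_a", "tester_b"]))
```

Refusing the second claim prevents duplicated effort, and excluding the original finder from verification keeps the confirmation independent.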

15. The AVID (AI Vulnerability Database) taxonomy classifies findings by which three dimensions?

Correct. AVID's three-dimensional taxonomy — harm type × modality × lifecycle stage — is specifically designed to capture AI-specific vulnerability characteristics that the CVE system's binary "vulnerable/patched" model cannot accommodate.

Not quite. AVID classifies by harm type, affected modality (text, image, multimodal, etc.), and lifecycle stage (training, deployment, inference). These dimensions are designed specifically for the characteristics of AI vulnerabilities that don't map to traditional CVE fields.