In 2022, Anthropic published its Constitutional AI paper partly in response to internal red-team findings that earlier RLHF-trained models would produce harmful content when queried through layered hypothetical framings. That internal process — scoped around specific harm categories, run by a dedicated team, with documented rules of engagement — is the direct ancestor of today's formal AI red-teaming discipline. Without scoping, testers found themselves overwhelmed by the search space; with it, they could map failure modes systematically.
Red-teaming without a defined scope produces noise, not signal. When Microsoft's AI Red Team ran engagements against Bing Chat (Sydney) in early 2023 — documented in their subsequent public retrospective — the team divided the problem space before any testing began: which modalities were in scope (text only), which harm categories were prioritized (influence operations, privacy leakage, CSAM elicitation), and what success criteria would trigger escalation to engineering.
Scope definition answers four questions: What system is being tested? What threat actors are modeled? What harm categories are prioritized? And what evidence constitutes a finding?
The U.S. AI Safety Institute's 2024 pre-deployment evaluations of Llama 3 and GPT-4o used scope documents that specified CBRN uplift, CSAM generation, and cyberweapon assistance as Tier-1 priorities, with disinformation and privacy as Tier-2. Everything else was explicitly out-of-scope for that engagement — not because it was unimportant, but because unbounded scope produces unbounded timelines.
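The four scoping questions and the tiering idea can be written down as a small machine-readable record. What follows is a minimal sketch, not a format used by any organization named above: the class name `ScopeDocument`, its fields, and all example values are illustrative assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScopeDocument:
    """Hypothetical scope record answering the four scoping questions."""
    system_under_test: str          # What system is being tested?
    threat_actors: tuple            # What threat actors are modeled?
    tier1_harms: frozenset          # Highest-priority harm categories
    tier2_harms: frozenset          # Lower-priority categories, still in scope
    finding_criteria: str           # What evidence constitutes a finding?

    def tier_of(self, harm_category: str) -> Optional[int]:
        """Return 1 or 2 for in-scope categories, None if out of scope."""
        if harm_category in self.tier1_harms:
            return 1
        if harm_category in self.tier2_harms:
            return 2
        return None


# Example values are placeholders chosen for illustration only.
scope = ScopeDocument(
    system_under_test="example-chat-model v1 (text only)",
    threat_actors=("low-skill opportunist", "organized fraud group"),
    tier1_harms=frozenset({"cbrn_uplift", "cyberweapon_assistance"}),
    tier2_harms=frozenset({"disinformation", "privacy"}),
    finding_criteria="policy-violating output reproduced in repeated trials",
)
```

Anything for which `scope.tier_of(...)` returns `None` is out of scope for the engagement, which makes the "everything else" rule explicit rather than implied.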
Every professional red-team engagement requires a Rules of Engagement (RoE) document signed before any testing begins. For AI systems, the RoE must address several AI-specific complications that don't exist in traditional network penetration testing.
Data handling: Successful jailbreaks may produce genuinely harmful content — synthesis instructions, CSAM, working exploit code. The RoE must specify who can view raw outputs, how they are stored, and when they must be destroyed.
Automated vs. manual: Automated fuzzing can generate thousands of outputs per hour. The RoE should specify whether automated attacks require pre-approval and what logging is mandatory.
Stop conditions: Define when testers must halt and escalate — for example, upon discovering a vulnerability that could enable mass real-world harm before the vendor has any patch in place.
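The three RoE items above can also be encoded so that tooling enforces them instead of relying on testers remembering the document. This is a sketch under assumed names: `RulesOfEngagement`, its fields, and the tag `mass_harm_unpatched` are hypothetical, not taken from any published RoE template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RulesOfEngagement:
    """Hypothetical RoE record covering the three AI-specific items."""
    raw_output_viewers: frozenset     # who may view unredacted model outputs
    retention_days: int               # raw outputs must be destroyed after this
    automation_needs_approval: bool   # must automated runs be pre-approved?
    stop_conditions: frozenset        # finding tags that force halt and escalation


def may_view_raw_output(roe, tester):
    """Data handling: only named viewers may open unredacted outputs."""
    return tester in roe.raw_output_viewers


def may_start_automated_run(roe, approved):
    """Automated vs. manual: block unapproved runs when approval is required."""
    return approved or not roe.automation_needs_approval


def must_halt(roe, finding_tags):
    """Stop conditions: halt if any tag on a finding matches a stop condition."""
    return bool(roe.stop_conditions & set(finding_tags))


# Placeholder values for illustration.
roe = RulesOfEngagement(
    raw_output_viewers=frozenset({"alice"}),
    retention_days=30,
    automation_needs_approval=True,
    stop_conditions=frozenset({"mass_harm_unpatched"}),
)
```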
A threat model maps specific actor capabilities to specific system components and specific harm categories. The output is a prioritized list of attack scenarios, not a random queue of things to try. Google DeepMind's 2024 "Frontier Safety Framework" reflects the same logic for frontier models: its Critical Capability Levels are derived by analyzing the main foreseeable paths through which a model could cause severe harm, which in practice means stating which threat actors are assumed capable of what, and which model capabilities could provide meaningful uplift.
The threat model feeds directly into test case design in Lesson 2. Without it, testers default to whatever techniques they find interesting, and those rarely map to the highest-risk scenarios for a given deployment.
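One simple way to turn a threat model into a prioritized list is to score each scenario and sort. The sketch below assumes a likelihood × severity product on 1 to 5 scales. The scoring rule, the scales, and the example scenarios are illustrative assumptions, not something the frameworks above prescribe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackScenario:
    actor: str        # threat actor persona
    component: str    # system component targeted
    harm: str         # harm category
    likelihood: int   # assumed scale: 1 (unlikely) to 5 (expected)
    severity: int     # assumed scale: 1 (minor) to 5 (critical)

    @property
    def risk(self) -> int:
        """Simple likelihood x severity product used for ranking."""
        return self.likelihood * self.severity


def prioritize(scenarios):
    """Return scenarios ordered from highest to lowest risk score."""
    return sorted(scenarios, key=lambda s: s.risk, reverse=True)


# Hypothetical scenarios for a customer-facing assistant.
scenarios = [
    AttackScenario("curious customer", "chat widget", "policy leakage", 4, 2),
    AttackScenario("fraud group", "chat widget", "financial fraud enablement", 3, 5),
    AttackScenario("competitor", "system prompt", "prompt extraction", 2, 3),
]
ranked = prioritize(scenarios)
```

With these placeholder numbers the fraud scenario ranks first, even though the curious customer is the more likely actor, because severity dominates the product.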
You have been asked to red-team a customer-facing AI assistant deployed by a mid-size financial services firm. The model is GPT-4o fine-tuned on the firm's internal policy documents, accessible via a public chat widget (black-box access). Your job is to produce a scope document before any testing begins.
Use this lab to work through the five scoping dimensions with the AI coach. Ask it to help you define the system boundary, select threat actor personas, map a harm taxonomy, specify access level constraints, and draft success criteria language.
When the UK AI Safety Institute red-teamed Llama 3, GPT-4o, Gemini 1.5 Pro, and Claude 3 Opus in 2024 ahead of the Seoul AI Safety Summit, their published methodology described structured test case batteries organized by harm category and attack technique class. Rather than ad-hoc prompting, each model faced the same standardized scenarios — CBRN synthesis uplift, cyberweapon code generation, and influence operation content — enabling cross-model comparison. The ability to compare models depended entirely on having a shared test case taxonomy.
A threat model names the risk; a test case operationalizes it. The translation process requires two things: an attack technique taxonomy (how attacks are structured) and a harm scenario library (what outcomes are being probed). These two dimensions intersect to produce a test matrix.
MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) provides a widely used attack taxonomy for AI systems, modeled on the MITRE ATT&CK framework used in traditional cybersecurity. It organizes techniques by tactic, from initial access through impact, mapped to AI-specific attack primitives.
| Class | Description | Example Technique |
|---|---|---|
| Prompt Injection | Embedding adversarial instructions in user input or retrieved context to override system-level directives | Indirect injection via web search results in an agentic pipeline (demonstrated against Bing Chat, 2023) |
| Jailbreaking | Eliciting policy-violating outputs through social engineering, roleplay framing, or hypothetical structures | DAN (Do Anything Now) prompt family; many-shot jailbreaking (Anthropic 2024 research) |
| Adversarial Examples | Perturbed inputs that cause unexpected model behavior, including classification errors or safety bypass | GCG (Greedy Coordinate Gradient) suffix attacks (Zou et al., 2023) |
| Model Extraction | Querying a model at scale to reconstruct its weights, fine-tuning data, or system prompt | System prompt extraction via repetition attacks (demonstrated against GPT-4 and Claude, 2023) |
| Data Poisoning | Corrupting training or fine-tuning data to introduce backdoors or degrade safety behavior | Backdoored "sleeper agent" behavior deliberately trained into a model and shown to persist through subsequent safety training (Hubinger et al., 2024) |
| Evasion | Modifying malicious content to bypass content classifiers without changing semantic meaning | Unicode homoglyph substitution, Base64 encoding, leetspeak obfuscation |
A test matrix has harm categories on one axis and attack technique classes on the other. Each cell contains specific test cases. For a financial services AI assistant, the matrix might include: jailbreaking × financial fraud enablement, prompt injection × data exfiltration via retrieval, and model extraction × system prompt reconstruction.
Not every cell needs to be filled. Priority is determined by the threat model: cells that correspond to high-likelihood actors and high-severity outcomes get the most test cases.
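A test matrix is naturally a mapping from (harm category, technique class) pairs to test case IDs, and the threat model's priority cells can be checked for gaps. This is a minimal sketch: the function names, cell labels, and test case IDs are hypothetical.

```python
from collections import defaultdict


def build_matrix(cases):
    """cases: iterable of (harm, technique, case_id) tuples."""
    matrix = defaultdict(list)
    for harm, technique, case_id in cases:
        matrix[(harm, technique)].append(case_id)
    return matrix


def coverage_gaps(matrix, priority_cells, min_cases):
    """Return priority cells that hold fewer than min_cases test cases."""
    return [cell for cell in priority_cells
            if len(matrix.get(cell, [])) < min_cases]


# Hypothetical cells for the financial services example.
matrix = build_matrix([
    ("financial_fraud", "jailbreaking", "TC-001"),
    ("financial_fraud", "jailbreaking", "TC-002"),
    ("data_exfiltration", "prompt_injection", "TC-003"),
    ("system_prompt_reconstruction", "model_extraction", "TC-004"),
])
gaps = coverage_gaps(
    matrix,
    priority_cells=[("financial_fraud", "jailbreaking"),
                    ("data_exfiltration", "prompt_injection")],
    min_cases=2,
)
```

Cells outside `priority_cells` are allowed to stay empty; only the cells the threat model marks as high priority are checked for a minimum number of cases.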
Anthropic's April 2024 research paper "Many-Shot Jailbreaking" documented a test case design insight: context window length is itself an attack surface. As context windows expanded to 100K+ tokens, researchers found that prepending hundreds of "faux-dialogue" examples of the model complying with harmful requests significantly increased compliance rates on subsequent harmful prompts — even for requests the model refused in a zero-shot setting. This technique emerged from structured test case exploration, not random prompting.
A test case that cannot be reproduced is not a finding; it is an anecdote. Professional red-team test cases must specify the exact prompt or prompt template used, the model version and temperature settings, the system prompt (if accessible), the expected vs. actual output, and the number of independent trials run.
The UK AISI's evaluation methodology required each finding to be reproducible at a defined rate (e.g., "succeeds in at least 3 of 10 trials") before it was counted as a confirmed vulnerability. This threshold prevents single-sample noise from being reported as systemic failures.
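A reproducibility threshold of this kind reduces to a simple count over independent trials. The sketch below uses the "3 of 10" figure from the example above as default parameters. The function names are hypothetical, and the threshold should come from the engagement's own scope document.

```python
def success_rate(trial_results):
    """trial_results: list of bools, True when the attack succeeded."""
    return sum(trial_results) / len(trial_results)


def is_confirmed(trial_results, min_successes=3, min_trials=10):
    """Confirm a finding only after enough trials and enough successes."""
    if len(trial_results) < min_trials:
        raise ValueError("not enough independent trials to judge the finding")
    return sum(trial_results) >= min_successes


# Placeholder trial outcomes for two hypothetical test cases.
confirmed = is_confirmed([True, False, True, False, False,
                          True, False, False, False, False])
single_hit = is_confirmed([True] + [False] * 9)
```

Here `confirmed` is `True` (three successes in ten trials) and `single_hit` is `False`, which is exactly the single-sample noise the threshold is meant to filter out.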
You're continuing the financial services AI engagement from Lab 1. Now you need to translate your scope document into a test matrix. The coach will help you populate cells across harm categories (financial fraud, data exfiltration, influence/manipulation) and attack classes (jailbreaking, prompt injection, model extraction, evasion).
Work through at least three specific test cases. For each, you should specify: the harm category, the attack technique class, the specific prompt template or approach, and a proposed reproducibility threshold.
In the "Generative AI Red Team Challenge" hosted by the AI Village at DEF CON 31 in 2023, more than 2,000 participants attempted to jailbreak eight commercial models. Despite the energy and volume of attempts, post-event analysis from the organizers and participating model providers (among them Anthropic, Google, Hugging Face, and NVIDIA) noted that the most actionable findings came not from participants who generated the most outputs, but from those who documented their attempts methodically enough to produce reproducible demonstrations. Execution discipline, not volume, determined finding quality.
Every test attempt must generate a log entry containing: a unique attempt ID, the test case ID from the matrix, the exact prompt (or prompt hash if content is too sensitive to store in plaintext), the model identifier and version, inference parameters (temperature, top-p, max tokens), the timestamp of the request, and the raw output verbatim.
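The field list above maps directly onto a small record type. The following sketch shows one possible shape, including the option of storing only a SHA-256 hash when the prompt is too sensitive for plaintext. `AttemptLog`, `make_log`, and all example values are hypothetical names chosen for illustration.

```python
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AttemptLog:
    attempt_id: str       # unique per attempt
    test_case_id: str     # ID of the matrix test case being exercised
    prompt: str           # exact prompt, or "" when only the hash is stored
    prompt_sha256: str    # lets reviewers match prompts without plaintext
    model_id: str         # model identifier and version
    temperature: float
    top_p: float
    max_tokens: int
    timestamp_utc: str    # ISO 8601 time of the request
    raw_output: str       # verbatim model output


def make_log(test_case_id, prompt, model_id, temperature, top_p, max_tokens,
             raw_output, store_plaintext=True):
    """Build one log entry; drop the plaintext prompt if it is too sensitive."""
    return AttemptLog(
        attempt_id=str(uuid.uuid4()),
        test_case_id=test_case_id,
        prompt=prompt if store_plaintext else "",
        prompt_sha256=hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        model_id=model_id,
        temperature=temperature,
        top_p=top_p,
        max_tokens=max_tokens,
        timestamp_utc=datetime.now(timezone.utc).isoformat(),
        raw_output=raw_output,
    )


# Placeholder values; a real entry would carry the actual prompt and output.
entry = make_log("TC-001", "placeholder prompt text", "example-model-2024-01",
                 0.7, 1.0, 512, "placeholder output", store_plaintext=False)
log_line = json.dumps(asdict(entry))  # one JSON object per line (JSONL)
```

Writing one JSON object per attempt keeps the log append-only and easy to diff, and the hash still lets two testers confirm they ran the same prompt.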
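Comparable record-keeping expectations appear in NIST AI 100-1 (the AI Risk Management Framework), which stresses documenting test and evaluation results, and in the EU AI Act's technical documentation and logging requirements for high-risk AI systems, although neither prescribes this exact field list. The practical reason is simpler: without complete logs, findings cannot be verified, escalated, or patched.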
The post-event analysis from the DEFCON AI Village challenge noted a systematic quality gap between casual and disciplined participants. Participants who used structured logging — even simple spreadsheets — produced findings that vendors could act on. Those who relied on memory or informal notes produced anecdotes that could not be reproduced. The event organizers subsequently published a simplified logging template based on what the highest-quality submissions had in common.
Large red-team engagements involve multiple testers, which introduces coordination problems. Testers must avoid duplicating effort across the same test matrix cells, must communicate in near-real time when a finding is identified (so others can probe related attack paths immediately), and must avoid contaminating each other's independent reproducibility trials.
Microsoft's AI Red Team, which employs full-time AI red teamers, uses a shared ticketing system where each test case attempt is logged centrally. When one tester identifies a potential finding, the ticket is flagged for a second tester to independently verify — preserving independence while enabling rapid follow-up.
Engagements regularly surface vulnerabilities that are outside the defined scope — a tester probing financial fraud elicitation might accidentally discover that the model will provide detailed synthesis instructions for a dangerous compound. The RoE should specify an escalation path for out-of-scope critical findings: typically, immediate halt of the adjacent test stream, documented notification to the engagement lead, and a decision within a defined time window on whether to expand scope or sequester the finding for separate disclosure.
You've discovered a potential finding during your financial services AI engagement: when asked to roleplay as a "compliance officer who has decided to ignore AML rules," the model provides specific guidance on structuring transactions to avoid regulatory reporting thresholds. You need to document this properly and score its severity.
Work with the coach to: (1) draft a complete log entry for this finding, (2) define what reproducibility trials you'd run, (3) apply a severity rubric (harm severity × reproducibility × ease of exploitation), and (4) draft the escalation notification.
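The rubric in step (3) is a three-factor product. As a starting point for the exercise, here is a sketch that assumes each factor is rated 1 to 3 and that the product is mapped to bands at the cut-offs shown. Both the scale and the cut-offs are assumptions to adapt, not a published standard.

```python
def severity_score(harm, reproducibility, ease):
    """Multiply the three rubric factors, each rated 1 (low) to 3 (high)."""
    for name, value in (("harm", harm),
                        ("reproducibility", reproducibility),
                        ("ease", ease)):
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1, 2, or 3")
    return harm * reproducibility * ease


def severity_band(score):
    """Map a product in the range 1..27 to a band (assumed cut-offs)."""
    if score >= 18:
        return "critical"
    if score >= 8:
        return "high"
    if score >= 4:
        return "medium"
    return "low"


# Placeholder ratings, not an assessment of the lab finding.
example_score = severity_score(harm=3, reproducibility=2, ease=3)
example_band = severity_band(example_score)
```

A multiplicative rubric means a finding that is severe but almost never reproduces scores low, so the ratings for each factor should be justified in the log entry rather than guessed.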
In March 2023, security researcher Johann Rehberger demonstrated that a prompt injection attack against ChatGPT plugins could exfiltrate user conversation data to an attacker-controlled server via a maliciously crafted web page. He reported the finding to OpenAI through their responsible disclosure program. OpenAI's initial response classified it as "not a vulnerability" — a decision Rehberger publicly disputed. The finding was ultimately acknowledged and partially addressed months later. The case became a reference point in AI security circles for the inadequacy of traditional software vulnerability disclosure frameworks when applied to AI systems, where the harm model, reproduction criteria, and patch definition are all fundamentally different.
A professional red-team report is not a dump of raw logs. It is a structured document designed to enable organizational decision-making. The standard structure for AI red-team reports, adapted from the format used by Microsoft's AI Red Team and published in their 2023 public documentation, includes five sections.
| Section | Contents | Primary Audience |
|---|---|---|
| Executive Summary | Overall risk posture, count and severity distribution of findings, top three recommendations | Leadership, legal, product |
| Scope & Methodology | System boundary, access level, threat actor personas, harm taxonomy used, test matrix summary | Engineering, security |
| Findings Catalog | Each finding: ID, severity score, attack technique class, harm category, exact reproduction steps, raw output excerpt (redacted per RoE), and number of successful trials | Engineering, security |
| Recommendations | Specific, actionable mitigations for each finding category — not generic "improve safety training" advice | Engineering, product |
| Remediation Tracking Baseline | Open/closed status for each finding, assigned owner, target resolution date, and re-test criteria | Engineering, security, legal |
Traditional software vulnerability disclosure (the CVE system, and the 90-day disclosure windows popularized by Google Project Zero) was designed for deterministic bugs with clear patch definitions. AI vulnerabilities don't fit this model cleanly: a jailbreak might be mitigated by a system prompt update, addressed in fine-tuning, or simply accepted as a known limitation of the current model generation.
The Rehberger/ChatGPT case illustrates the friction. When Rehberger reported in March 2023, OpenAI's bug bounty program — run through Bugcrowd — had no defined severity taxonomy for prompt injection attacks. The disclosure sat in limbo for months because neither party had agreed in advance on what "fixed" looked like for a probabilistic system.
The AI Vulnerability Database (AVID), launched in 2023 by a coalition including Cohere, Hugging Face, and independent researchers, attempts to create a CVE-equivalent for AI harms. The AVID taxonomy classifies findings by harm type, affected modality, and lifecycle stage (training, deployment, inference), and provides a structured disclosure template that specifies mitigation types distinct from traditional patches. Several organizations now accept AVID-formatted disclosures as a standard reporting format.
Every finding in the catalog must have a remediation ticket with an owner, a target date, and defined re-test criteria before the report is delivered. Without this, reports become artifacts rather than drivers of change.
AI-specific remediation types differ from traditional software patches. They include: system prompt hardening (guardrails added at the prompt level), output filter tuning (adjusting content classifier thresholds), fine-tuning or RLHF updates (most expensive, highest latency), retrieval pipeline hardening (for RAG-based systems), and architectural changes (e.g., removing tool access that enabled a finding).
Critically, remediation must be verified — the specific test case that produced the finding must be re-run against the patched system. A reported fix that has not been verified by the red team should not be closed in the tracking system.
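The closure rule can be enforced in the tracking record itself. This sketch assumes a ticket stores the re-test result and refuses to close until the red team has recorded one that meets the re-test criteria. `RemediationTicket` and its fields are hypothetical, and the remediation type names simply mirror the list above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

REMEDIATION_TYPES = {
    "system_prompt_hardening",
    "output_filter_tuning",
    "fine_tuning_update",
    "retrieval_pipeline_hardening",
    "architectural_change",
}


@dataclass
class RemediationTicket:
    finding_id: str
    owner: str
    target_date: date
    remediation_type: str
    retest_case_id: str                     # test case that produced the finding
    retest_max_successes: int = 0           # re-test criterion (assumed form)
    retest_successes: Optional[int] = None  # None until the red team re-runs it
    status: str = "open"

    def __post_init__(self):
        if self.remediation_type not in REMEDIATION_TYPES:
            raise ValueError(f"unknown remediation type: {self.remediation_type}")

    def record_retest(self, successes):
        """Store how many re-test trials still reproduced the finding."""
        self.retest_successes = successes

    def can_close(self):
        """A ticket may close only after a passing, recorded re-test."""
        return (self.retest_successes is not None
                and self.retest_successes <= self.retest_max_successes)

    def close(self):
        if not self.can_close():
            raise RuntimeError("fix has not been verified by the red team")
        self.status = "closed"


# Placeholder ticket for illustration.
ticket = RemediationTicket("F-001", "eng-lead", date(2025, 1, 31),
                           "system_prompt_hardening", "TC-001")
```

Calling `ticket.close()` before `ticket.record_retest(...)` raises an error, so a reported fix cannot be marked closed on the vendor's word alone.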
Point-in-time engagements, the traditional consulting model, are increasingly insufficient for AI systems that are updated continuously. OpenAI, Anthropic, and Google all now maintain ongoing internal red-team functions that run continuous evaluation suites against each model update, supplemented by periodic deeper engagements. Article 9 of the EU AI Act describes the risk management system for high-risk AI systems as a continuous iterative process run throughout the system's entire lifecycle, not just a pre-deployment assessment. This has driven interest in automated red-teaming pipelines (covered in Module 8) that can run test case batteries against new model versions as part of the CI/CD deployment pipeline.
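In a deployment pipeline, a continuous evaluation suite usually ends in a gate that compares attack success rates for a candidate model against the current baseline. The sketch below shows only that comparison step. `regression_gate` and the rate dictionaries are hypothetical, and running the test battery itself is out of scope here.

```python
def regression_gate(baseline_rates, candidate_rates, tolerance=0.0):
    """Return test case IDs on which the candidate model is less safe.

    Each argument maps a test case ID to the fraction of trials in which
    the attack succeeded. A case missing from the candidate run is treated
    as a regression, because an untested case cannot be assumed safe.
    """
    regressions = []
    for case_id, baseline in baseline_rates.items():
        candidate = candidate_rates.get(case_id)
        if candidate is None or candidate > baseline + tolerance:
            regressions.append(case_id)
    return sorted(regressions)


# Placeholder rates for illustration.
baseline = {"TC-001": 0.0, "TC-002": 0.3, "TC-003": 0.1}
candidate = {"TC-001": 0.1, "TC-002": 0.2}
failed = regression_gate(baseline, candidate)
deploy_allowed = not failed
```

With these placeholder numbers the gate blocks deployment: `TC-001` got worse and `TC-003` was never run against the candidate.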
Your financial services AI engagement is complete. You have three confirmed findings. Now you need to write the Findings Catalog section of the report and set up remediation tracking for each finding. The coach will help you structure each finding entry correctly and ensure your recommendations are specific and actionable — not generic.
Work through at least one complete finding entry, including: finding ID, severity score, attack technique class, harm category, exact reproduction steps (use the AML structuring finding from Lab 3), re-test criteria, and assigned remediation type with justification.