AI Security and Red-Teaming · Introduction

Every new platform attracts new attackers. Security did not lag the internet by accident.

AI systems have novel attack surfaces. Defenders who started two years ago are already behind. This course is how you catch up.

When email arrived, it took a decade for phishing to become an industry. When the web arrived, it took five years for SQL injection to become a standard attack pattern. When smartphones arrived, it took three years for mobile malware to become pervasive. In each case, the attackers arrived on a faster clock than the defenders.

AI is repeating the pattern, but faster again. Prompt injection, model extraction, data poisoning, jailbreaks, adversarial examples, supply-chain attacks on model weights, deepfake social engineering — every category has gone from academic paper to active exploitation in the past three years, and the next category is probably already out there.

This course is an offensive-and-defensive guide to AI security. It covers the full taxonomy of AI-specific attacks, how to red-team an AI system before attackers do, how to build monitoring and detection for prompt injection and data exfiltration, how to think about model-weight supply-chain security, how to run a responsible-disclosure process, and the specific defensive patterns that actually slow attackers down. It's a security course with AI-specific content, not a general security course with AI as an example.

If you finish every module, here's who you become:

You'll understand the full taxonomy of AI-specific attacks — prompt injection, model extraction, data poisoning, jailbreaks, and supply-chain threats on model weights.
You'll be able to structure and run a red-team exercise against an AI system before attackers do, including scoping, technique selection, and responsible-disclosure reporting.
You'll recognize why standard security tooling — CVEs, NIST databases, WAFs — is blind to AI-native vulnerabilities and know what to use instead.
You'll know how to detect and block both direct and indirect prompt injection in production systems, including RAG pipelines where poisoned retrieval is the attack vector.
You'll become the person on a security or AI team who can map an unfamiliar AI system's attack surface and articulate its specific exposure to adversaries.
You'll be able to design layered defenses for production AI applications — monitoring, input-output filtering, weight provenance checks, and exfiltration controls — rather than retrofitting general security patterns.
You'll think like an attacker and a defender simultaneously, which is the only posture that keeps pace with a threat landscape moving faster than the tooling.

Lesson 1 · AI Security & Red-Teaming · Module 1

Why AI Systems Are Attack Surfaces

Traditional software gets patched. AI systems learn, generalize — and fail in ways their creators never anticipated.

What fundamentally separates AI security threats from conventional software vulnerabilities?

When Microsoft integrated GPT-4 into Bing, security researcher Kevin Liu discovered within days that he could instruct the chatbot to ignore its system prompt entirely — simply by asking it to reveal its "initial instructions." The underlying model had been fine-tuned to follow a confidential directive called Sydney. The directive was exposed. Microsoft had shipped a product with a novel class of vulnerability that had no CVE number, no patch cadence, and no prior art in the NIST vulnerability database.

This was not a buffer overflow. It was not an injection attack on a database. It was a prompt injection — a class of flaw native to AI systems and invisible to every security tool the team had deployed.

The New Attack Surface

Classical software security operates on a deterministic model: a function accepts defined inputs and produces defined outputs. Security teams reason about state machines, memory boundaries, and protocol parsers. Vulnerabilities are discrete, reproducible, and patchable.

AI systems — particularly large language models and neural classifiers — are probabilistic. They approximate a function learned from billions of data points. That function is never fully auditable. It has no source code in the traditional sense. Its behavior emerges from weights, not logic. This creates attack surfaces that did not exist before 2017.

Security practitioners must now reason about three distinct layers: the model itself (weights, architecture, training data), the inference pipeline (APIs, context windows, tool calls), and the deployment environment (who can send what inputs, how outputs are consumed downstream).

A Taxonomy of AI-Native Threats

Category 01

Prompt Injection

Attacker-controlled text overrides developer-supplied instructions. First documented systematically by Riley Goodside in September 2022; weaponized against Bing Chat in February 2023.

Category 02

Training Data Poisoning

Malicious data injected into training corpora causes the model to learn attacker-desired behaviors. Demonstrated by Carlini et al. (2021) against GPT-2 fine-tuning pipelines.

Category 03

Model Extraction

Adversary queries a black-box model to reconstruct a functional replica, stealing intellectual property. Demonstrated against commercial ML APIs by Tramèr et al. (2016).

Category 04

Adversarial Examples

Imperceptible perturbations cause misclassification. Goodfellow et al.'s 2014 "Fast Gradient Sign Method" achieved near-100% attack success on ImageNet classifiers.

Category 05

Membership Inference

Adversary determines whether a specific record was in training data, enabling privacy violations. Shokri et al. (2017) achieved up to 84% accuracy against commercial classifiers.

Category 06

Model Inversion

Attacker reconstructs sensitive training data from model outputs. Fredrikson et al. (2015) recovered patient facial images from a medical model's confidence scores.

Why Classical Defenses Fall Short

A Web Application Firewall inspects HTTP requests against known malicious patterns. A prompt injection payload looks, syntactically, like a perfectly valid natural-language sentence. There is no byte sequence to blocklist. The malice is semantic, not syntactic.

Static analysis tools parse source code. An LLM's "source code" — its weights — is a 70-billion-parameter float array. No static analyzer reads it meaningfully. Dynamic fuzzing tools generate structured inputs to trigger crashes; AI systems rarely crash — they silently produce wrong, harmful, or attacker-controlled outputs instead.

This is why the security industry coined the term AI red-teaming: structured adversarial testing by human experts who reason about semantic intent, not syntactic signatures. In 2023, NIST published its AI Risk Management Framework (AI RMF) explicitly naming red-teaming as a required mitigation practice. DARPA, CISA, and the UK's NCSC have each issued analogous guidance since.

Documented Case — Samsung, April 2023

Three separate Samsung semiconductor engineers inadvertently uploaded proprietary source code, meeting notes, and hardware test data to ChatGPT within a single month. The data was potentially used as training material. Samsung subsequently banned generative AI tools internally. The incident required no exploit — only normal product use. The attack surface was the deployment decision itself.

Key Terms

Threat ModelA structured enumeration of assets, adversaries, attack vectors, and mitigations relevant to a given system. In AI contexts, must explicitly address model, pipeline, and deployment layers.

Attack SurfaceAll points where an adversary can interact with a system to influence its behavior. For AI systems, includes training data, inference endpoints, system prompts, output consumers, and plugin/tool integrations.

AI Red-TeamingStructured adversarial testing of AI systems by humans simulating attacker behaviors, distinct from automated fuzz testing because semantic intent must be evaluated.

Probabilistic FailureFailure mode unique to ML systems where the same input may produce different outputs across runs, and boundary conditions cannot be exhaustively enumerated.

Core Principle

AI security is not a subset of application security. It is a parallel discipline with overlapping tools but fundamentally different threat classes. Practitioners who treat LLM deployments as "just another web app" will miss the most dangerous attack vectors every time.

Lesson 1 Quiz

Why AI Systems Are Attack Surfaces · 3 questions

1. Which characteristic of AI systems most fundamentally distinguishes their failure modes from classical software bugs?

Correct. AI systems approximate a learned function; their failure modes are semantic and probabilistic, not syntactic and deterministic — which is why classical security tooling misses them.

Not quite. The key distinction is the probabilistic, weight-based nature of AI behavior, which makes failure modes qualitatively different from code bugs.

2. The Samsung data leak of April 2023 is cited as an example of which type of AI security risk?

Correct. No exploit was needed. Engineers used the product as intended and inadvertently leaked proprietary data. This illustrates that AI attack surfaces include the deployment context, not just the model.

Incorrect. The Samsung incident required no technical exploit — engineers used ChatGPT normally and exposed confidential data. The vulnerability was the deployment decision.

3. Why do Web Application Firewalls fail to detect prompt injection attacks?

Correct. The malice in a prompt injection is semantic — it lies in meaning, not structure. A WAF blocking on byte signatures has no mechanism to distinguish a legitimate instruction from an attacker override.

Not correct. The core problem is semantic vs. syntactic: prompt injections look like normal text. No signature-based tool can reliably detect them.

Lab 1 — Mapping the AI Attack Surface

Interactive threat modeling with an AI security analyst · Complete 3 exchanges to finish

Objective

You are a security architect reviewing a new LLM-powered customer support chatbot before production deployment. Work with the AI analyst to build a structured threat model: identify assets, enumerate attack vectors, and classify threat categories.

Start by describing the system you want to threat-model, or ask the analyst to walk you through the STRIDE framework as applied to an AI deployment. Try: "Walk me through how STRIDE maps to an LLM-based customer support bot."

AI Security Analyst

Threat Modeling

Ready to threat-model your AI deployment. Tell me about the system — what model is it built on, what data does it access, and who are its users? Or ask me to apply a standard framework like STRIDE or LINDDUN to a typical LLM chatbot scenario.

Lesson 2 · AI Security & Red-Teaming · Module 1

Adversaries, Assets, and Motivations

A threat model without an adversary model is just a wishlist. Who actually attacks AI systems — and what do they want?

How do adversary motivations shape which AI attack vectors become operationally relevant?

Within 96 hours of Microsoft's Bing Chat launch, Stanford researcher Marvin von Hagen obtained the full text of the model's confidential system prompt — code-named "Sydney" — by asking the chatbot to roleplay a developer session. The exposed prompt revealed business constraints, behavioral guardrails, and Microsoft's operational guidelines. Von Hagen published the document on Twitter. Microsoft had not anticipated that a curious researcher, with no malicious intent and no technical exploit, could extract a document they had explicitly instructed the model to keep secret.

The adversary here was not a nation-state. It was a graduate student with a browser tab and an afternoon.

Adversary Taxonomy for AI Systems

Classical adversary taxonomies (script kiddie → hacktivist → organized crime → APT) map reasonably well to AI systems, but the barrier to entry for AI attacks is dramatically lower. Prompt injection requires no programming knowledge. Jailbreaking requires pattern recognition and persistence, not technical skill. This expands the effective threat population.

Security teams should model adversaries across two axes: capability (what technical resources and expertise they possess) and motivation (what outcome they are trying to achieve). Motivation determines which attack classes are operationally relevant for a given deployment.

Adversary Type 01

Curious Users

Motivation: exploration, social proof, notoriety. Low capability; high volume. Discovered Sydney prompt, Bing's hidden instructions, and dozens of other system prompts through social sharing of effective techniques.

Adversary Type 02

Financially Motivated Actors

Motivation: fraud, IP theft, competitive intelligence. Medium capability. Demonstrated: model extraction attacks against AWS SageMaker endpoints; LLM-generated phishing at scale (2023 WormGPT operations).

Adversary Type 03

Ideological / Hacktivist

Motivation: disruption, reputational damage, forcing policy change. Variable capability. Target: AI content moderation systems to suppress or amplify specific content. Active in 2022–2023 content filter bypass campaigns.

Adversary Type 04

Nation-State / APT

Motivation: intelligence collection, supply chain compromise, PSYOP. High capability. CISA AA23-347A (December 2023) documented APT targeting of ML pipelines for training data poisoning and model theft.

Adversary Type 05

Insider Threats

Motivation: sabotage, IP exfiltration, competitive advantage. High access, variable intent. Google vs. Linwei Ding (2024): engineer allegedly exfiltrated 500+ files of proprietary AI infrastructure source code to Chinese startups.

Adversary Type 06

Automated / AI-Driven Attackers

Motivation: scale, persistence, cost reduction. Emerging threat class. Demonstrated: LLM-automated red-teaming tools (PentestGPT, 2023) that autonomously probe AI system defenses at machine speed.

Asset Enumeration in AI Systems

A complete threat model requires explicit enumeration of what adversaries are trying to obtain or damage. AI deployments have assets that do not appear in traditional system inventories.

Model WeightsThe trained parameters of the model. Represent enormous R&D investment. Extraction enables competitors to replicate capabilities without training costs. Meta's LLaMA weights leaked on 4chan in March 2023 within days of restricted release.

System PromptsDeveloper-supplied instructions that configure model behavior. Extraction reveals business logic, safety bypasses, and operational constraints. Treated as trade secrets by major AI providers.

Training DataThe corpus used to train or fine-tune the model. May contain PII, trade secrets, or copyrighted material. Extractable via membership inference or model inversion attacks.

Inference AccessThe ability to query the model. Valuable for generating malicious content, bypassing content policies, or conducting model extraction. API key theft grants full inference access.

Model BehaviorThe decision function itself — how the model classifies, responds, or acts. Adversaries may wish to corrupt this (poisoning) or predict it (evasion) to achieve reliable malicious outcomes.

Documented Case — WormGPT, July 2023

A threat actor advertised "WormGPT" on underground forums — a fine-tuned LLM with safety guardrails removed, marketed for generating phishing emails and malware. SlashNext researchers purchased access and confirmed the tool generated "disturbingly persuasive" business email compromise content. The adversary motivation was financial; the attack asset was unrestricted inference access. Pricing was $60/month — the barrier to entry for LLM-enabled fraud.

Mapping Motivation to Attack Vector

Not every threat applies equally to every deployment. A medical AI classifying X-rays faces different threats than a customer support chatbot. Effective threat modeling requires matching adversary motivation to the specific assets and capabilities of the target system.

An adversary motivated by financial fraud against a banking AI will focus on evasion attacks — crafting inputs that cause the model to approve fraudulent transactions. An adversary motivated by competitive intelligence against the same system will focus on model extraction — reconstructing the scoring function to build a competing product. Same system, different threat actors, different attack classes, different mitigations.

Framework Principle

Begin every AI security engagement by asking: who benefits if this system fails, and how? The answer constrains the threat space from "all possible attacks" to "attacks worth defending against given this adversary population and their capabilities."

Lesson 2 Quiz

Adversaries, Assets, and Motivations · 3 questions

1. Marvin von Hagen's extraction of Microsoft's "Sydney" system prompt is most accurately categorized as which adversary type?

Correct. Von Hagen was a Stanford researcher using a browser, no technical exploit, and published findings for notoriety. Classic curious user / low-capability adversary profile.

Incorrect. Von Hagen required no special access, no malicious intent, and no technical expertise — the textbook definition of a curious user adversary.

2. A financially motivated adversary targeting a credit-scoring AI model would most likely prioritize which attack?

Correct. Financial motivation with a scoring system points directly to evasion — getting the model to approve what it should deny. The other attacks require more effort for less direct financial return.

Not quite. Map motivation to payoff: for fraud, the direct payoff is approval of fraudulent transactions — achieved through evasion attacks that craft borderline inputs.

3. What made WormGPT (July 2023) significant from a threat modeling perspective?

Correct. WormGPT's significance was economic: it commoditized LLM-enabled fraud. The threat population for business email compromise expanded to anyone with $60 and a forum account.

Incorrect. WormGPT's impact was on the economics of adversary capability — it made LLM-powered phishing cheap and accessible, not technically sophisticated.

Lab 2 — Adversary Profile Workshop

Build adversary profiles for a real AI deployment · Complete 3 exchanges to finish

Objective

You are red-teaming a healthcare AI system that uses an LLM to assist clinicians with diagnosis suggestions and accesses patient records via tool calls. Work with the analyst to enumerate adversary profiles, map motivations to specific assets, and identify which adversary is highest priority.

Try: "Give me the top three adversary profiles for a clinical AI that has access to patient records and makes diagnosis recommendations." Or drill into a specific adversary type and ask what attacks they would prioritize.

AI Security Analyst

Adversary Profiling

Let's build adversary profiles for your clinical AI system. Describe the deployment — what model powers it, what data it can access, who has query access, and what business outcomes depend on its recommendations. Then we'll enumerate who would want to attack it and why.

Lesson 3 · AI Security & Red-Teaming · Module 1

Regulatory and Compliance Landscape

Governments have noticed. What legal and regulatory obligations now attach to AI security — and how do they interact with technical threat models?

Which regulatory frameworks impose explicit security obligations on AI systems, and what do they actually require in practice?

President Biden's Executive Order 14110 on Safe, Secure, and Trustworthy Artificial Intelligence, issued October 30, 2023, included a directive that developers of the most powerful AI models must share safety test results with the federal government before public deployment. The order invoked the Defense Production Act. For the first time, AI red-teaming results became a potential legal disclosure obligation — not merely a best-practice recommendation. NIST was tasked with defining what "safety testing" meant.

Security practitioners who had been doing red-teaming as an internal engineering discipline suddenly found themselves operating in a regulatory environment.

The Regulatory Stack

AI security obligations now arrive from multiple regulatory layers simultaneously. A single enterprise deploying an LLM in a regulated sector may face obligations under five or more overlapping frameworks. Understanding which framework controls which requirement — and where they conflict — is now a core security competency.

2018

GDPR (EU): Article 22 restricts solely automated decision-making affecting individuals. Indirectly covers ML systems making credit, hiring, or law enforcement decisions. Requires human oversight mechanisms that red-teaming must validate.

2021

NIST AI RMF (Draft): First comprehensive US framework mapping AI risks to organizational controls. Govern, Map, Measure, Manage structure. Red-teaming explicitly named under "Measure" function. Published final January 2023.

2023 Mar

EU AI Act (Political Agreement): Risk-tiered regulation. "High-risk" AI systems (medical devices, critical infrastructure, biometric identification) require mandatory conformity assessments including adversarial testing. Prohibits certain AI applications outright. Took effect August 2024.

2023 Oct

Biden EO 14110: Requires developers of dual-use foundation models (above compute thresholds) to report red-team results to NIST. CISA designated as sector risk management agency for AI security incidents. Reversed by subsequent administration January 2025 — regulatory state in flux.

2024

SEC AI Disclosure Guidance: Registrants must disclose material cybersecurity incidents within 4 days. AI system compromises producing material financial impact now trigger SEC disclosure obligations — first applied to AI systems in 2024 enforcement actions.

2024

HIPAA AI Guidance (HHS): Clarification that AI tools processing PHI are covered entities or business associates. Security Rule applies to model weights containing patient data, inference logs, and training datasets derived from EHR systems.

What Regulations Actually Require

Regulatory language is often abstract; translation to technical requirements is the practitioner's job. Three requirements appear across most frameworks in some form:

1. Risk Assessment: Structured identification of how the AI system can fail or be misused, with proportionate documentation. Maps to threat modeling. The EU AI Act requires this for all high-risk systems before deployment.

2. Testing and Validation: Evidence that the system performs as claimed under adversarial conditions. "Adversarial testing" appears explicitly in NIST AI RMF (Measure 2.5), EU AI Act conformity assessments, and Biden EO 14110 red-team reporting requirements.

3. Incident Response and Disclosure: Mechanisms for detecting, containing, and reporting AI security incidents. The EU AI Act requires serious incident reporting to national authorities. SEC requires 8-K disclosure for material cybersecurity incidents.

Documented Case — FTC vs. Rite Aid, December 2023

The FTC banned Rite Aid from using facial recognition AI for five years after finding the system incorrectly flagged customers — disproportionately people of color — as shoplifters, leading to false accusations and humiliating confrontations. The FTC's complaint cited failure to adequately test the system before deployment and failure to maintain human oversight. This was a regulatory enforcement action predicated on inadequate AI security and testing practices — the first such FTC action specifically targeting AI system validation failures.

The Compliance–Security Gap

Meeting regulatory requirements and actually securing an AI system are not the same thing. SOC 2 Type II certification does not assess prompt injection resistance. ISO 27001 certification does not evaluate training data poisoning vectors. HIPAA compliance does not require adversarial testing of clinical AI recommendations.

Security practitioners must maintain two parallel workstreams: the compliance documentation that satisfies auditors, and the technical red-teaming that actually finds exploitable vulnerabilities. Conflating them produces organizations that are auditably compliant and operationally compromised.

EU AI Act Risk TiersUnacceptable risk (banned) → High risk (mandatory conformity assessment) → Limited risk (transparency obligations) → Minimal risk (voluntary codes). Most enterprise LLM deployments fall in limited or high risk depending on use case.

NIST AI RMFFour-function framework: Govern (policy), Map (context), Measure (assess), Manage (respond). The primary US voluntary framework for AI risk management; basis for federal agency AI security requirements post-EO 14110.

Dual-Use Foundation ModelAn AI model trained on broad data, generally capable, and deployable for both beneficial and potentially harmful purposes. Biden EO 14110 defined these by compute threshold (10^26 FLOPS for training). The class subject to the most stringent federal oversight.

Practitioner Note

Regulatory frameworks define obligations; they do not define security. An AI system that passes every required conformity assessment can still be catastrophically vulnerable to prompt injection, model extraction, or training data poisoning. Red-teaming must go beyond what regulators require — because attackers certainly will.

Lesson 3 Quiz

Regulatory and Compliance Landscape · 3 questions

1. Under the EU AI Act, which category of AI system requires mandatory conformity assessment including adversarial testing before deployment?

Correct. The EU AI Act's high-risk tier — covering AI in healthcare, critical infrastructure, biometric systems, and employment decisions — requires conformity assessment including adversarial testing before deployment.

Incorrect. Only high-risk AI systems require mandatory conformity assessment. Lower-risk tiers face transparency obligations or voluntary codes, not mandatory adversarial testing.

2. The FTC's 2023 action against Rite Aid's facial recognition AI is significant for AI security practitioners because it demonstrated:

Correct. The FTC framed inadequate testing and absent oversight as unfair business practices — making AI security validation failures a regulatory enforcement matter, not merely a technical or reputational one.

Incorrect. The FTC did not ban facial recognition categorically. It penalized Rite Aid for deploying a system without adequate testing and without human oversight mechanisms — a security and governance failure, not a technology prohibition.

3. Why is there a "compliance–security gap" in AI regulation?

Correct. SOC 2 and ISO 27001 assess general information security controls. They have no evaluation criteria for prompt injection resistance, model extraction defenses, or adversarial robustness — the attack classes most relevant to AI systems.

Not correct. The gap exists because existing compliance frameworks were designed before AI-native threat classes emerged. An organization can be fully compliant and still be wide open to AI-specific attacks.

Lab 3 — Regulatory Gap Analysis

Map compliance obligations to actual security requirements · Complete 3 exchanges to finish

Objective

Your organization is deploying an LLM-based hiring screening tool in the EU. The CISO believes that existing SOC 2 Type II certification and GDPR compliance cover all AI security obligations. Work with the analyst to identify gaps — what the certifications miss, which EU AI Act requirements apply, and what additional testing is required.

Try: "Our hiring AI is SOC 2 certified and GDPR compliant. Does the EU AI Act impose additional security requirements on us?" Or ask: "What specific adversarial tests does the EU AI Act conformity assessment require for a high-risk hiring system?"

AI Compliance Analyst

Regulatory Gap Analysis

Let's analyze your regulatory obligations. An LLM-based hiring tool in the EU is almost certainly high-risk under the EU AI Act — employment and recruitment AI is explicitly listed in Annex III. Tell me about your current compliance posture and I'll map the gaps between what you have and what the Act requires.

Lesson 4 · AI Security & Red-Teaming · Module 1

Building an AI Threat Model: The MAESTRO Framework

Threat modeling AI systems requires a structured method. Here is the one the field has converged on — and how to apply it to a real deployment.

How do you construct a complete, actionable threat model for an AI system that covers all three layers — model, pipeline, and deployment?

Air Canada's LLM-powered customer support chatbot told a grieving passenger that he could book a bereavement fare after travel and claim a refund retroactively — a policy that does not exist. When the passenger presented the chatbot's assurance in court, Air Canada argued the chatbot was a "separate legal entity" responsible for its own statements. The British Columbia Civil Resolution Tribunal rejected this argument. Air Canada was ordered to honor the fare.

A complete threat model of this system would have identified hallucination-as-liability as an asset under threat: the company's legal and financial obligations were the asset, and the model's tendency to confabulate confident but false policy information was the attack vector. No adversary required. The threat was the system's normal operation.

Why Existing Frameworks Are Insufficient

STRIDE (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege) was developed by Microsoft in 1999 for traditional software systems. It maps reasonably to AI systems in some dimensions but misses critical AI-specific threats: training data integrity, model behavior manipulation, and emergent capability risks.

PASTA, DREAD, and LINDDUN each have analogous gaps. The security community has responded by developing AI-specific extensions. The most practically adopted is the MAESTRO framework, developed through contributions from MITRE ATLAS, the OWASP Top 10 for LLMs, and NIST AI RMF mapping work.

The MAESTRO Layers

MAESTRO structures AI threat modeling across seven interdependent layers. Each layer has distinct assets, adversary access points, and relevant attack classes.

M — Model Layer

Architecture & Weights

Threats: model extraction, weight theft, adversarial example generation. Assets: trained parameters, architecture IP. Control: access controls on weight storage, API rate limiting, output watermarking.

A — Agent Layer

Autonomous Behaviors

Threats: goal misalignment, prompt injection via tool outputs, action space manipulation. Assets: downstream systems the agent can modify. Control: minimal privilege, action approval gates, rollback capabilities.

E — Embedding Layer

Vector Stores & RAG

Threats: data poisoning of vector databases, embedding inversion, cross-user context contamination. Assets: proprietary knowledge bases. Control: content validation pipelines, namespace isolation, embedding access controls.

S — Supply Chain Layer

Third-Party Models & Data

Threats: backdoored pretrained models, poisoned fine-tuning datasets, malicious plugins. Assets: model integrity, behavior predictability. Control: model provenance verification, dataset audits, plugin sandboxing.

T — Training Layer

Data & Fine-Tuning

Threats: training data poisoning, backdoor triggers, membership inference. Assets: training data privacy, model behavior integrity. Control: data provenance tracking, differential privacy, anomaly detection in training data.

R — Runtime Layer

Inference & API

Threats: prompt injection, jailbreaking, denial of service, context window manipulation. Assets: output integrity, API availability. Control: input validation, output filtering, rate limiting, context isolation.

O — Output Layer

Downstream Consumption

Threats: hallucination-as-liability (Air Canada), insecure code generation executed downstream, PII leakage in outputs. Assets: consumer trust, legal exposure. Control: output validation, human-in-the-loop for high-stakes decisions, attribution metadata.

Applying MAESTRO: A Worked Example

Consider an enterprise RAG (Retrieval-Augmented Generation) system that allows employees to query internal documentation via natural language, with the LLM having access to HR, legal, and financial document stores.

M (Model): The underlying LLM weights are hosted by a third-party provider (e.g., Azure OpenAI). Threat: provider-side model substitution or weight leakage. Mitigation: contractual SLAs on model integrity, output monitoring.

A (Agent): The system has no autonomous action capabilities in this version — low risk at this layer currently, but must be re-evaluated if agentic features are added.

E (Embedding): Three separate vector stores (HR, Legal, Finance) are queried. Threat: cross-namespace leakage — an employee query retrieves documents from a store they lack authorization to access. Mitigation: namespace-level access control mapped to user role, enforced before embedding similarity search.

S (Supply Chain): The embedding model (e.g., text-embedding-3-large) was downloaded from a third-party hub. Threat: backdoored embedding model that causes adversarial documents to rank highly. Mitigation: hash verification against provider-signed checksums.

T (Training): No fine-tuning in initial deployment. Risk deferred. Flag for re-assessment if custom fine-tuning is added.

R (Runtime): Users submit natural-language queries. Threat: indirect prompt injection — an attacker embeds instructions in a document in the store, which is retrieved and executed by the LLM. Mitigation: output validation layer that strips instruction-following patterns; source attribution in every response.

O (Output): Responses are displayed to employees and may influence HR decisions. Threat: hallucinated policy guidance acted upon by managers. Mitigation: mandatory source citation in all outputs, explicit disclaimer for HR/legal queries, human review gate for decisions above defined impact threshold.

MITRE ATLAS — Documented AI Attacks

MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) is the AI analogue of ATT&CK. It catalogs documented real-world attacks against ML systems with tactic/technique/procedure (TTP) mapping. As of 2024, ATLAS documents over 100 adversarial techniques across 14 tactic categories. Every MAESTRO layer maps to ATLAS tactics. ATLAS is freely available at atlas.mitre.org and should be the reference taxonomy for AI threat modeling.

Threat Model Outputs

A complete AI threat model produces four deliverables: (1) an asset register enumerating model, data, pipeline, and output assets; (2) an adversary profile matrix mapping threat actors to motivation and capability; (3) a threat register enumerating attack scenarios with MAESTRO layer and ATLAS TTP reference; and (4) a control mapping documenting current mitigations, gaps, and residual risk.

This documentation becomes the input to red-team scope definition — the topic of Module 2. Red-teamers test what the threat model says is at risk. Without the threat model, red-teaming is unfocused and unlikely to find the most consequential vulnerabilities before attackers do.

Module 1 Summary

AI systems are probabilistic, emergent, and opaque in ways that classical security tooling cannot address. Their attack surfaces span model weights, training data, inference pipelines, and downstream output consumers. Adversaries range from curious researchers to nation-states, and motivation determines which attack class is operationally relevant. Regulatory obligations are multiplying but remain behind the technical threat frontier. Structured threat modeling — layer by layer, adversary by adversary — is the foundation on which all AI red-teaming is built.

Lesson 4 Quiz

Building an AI Threat Model: MAESTRO · 3 questions

1. In the MAESTRO framework, indirect prompt injection via a retrieved document in a RAG system falls under which layer?

Correct. Indirect prompt injection occurs during inference — the Runtime Layer — when the model processes both user input and retrieved content that may contain attacker-controlled instructions.

Incorrect. Prompt injection, including the indirect variant via RAG retrieval, occurs at inference time — the Runtime Layer — where the model processes inputs and produces outputs.

2. The Air Canada chatbot case illustrates which threat in the MAESTRO framework?

Correct. Air Canada illustrates the Output Layer threat of hallucination-as-liability. The model's fabricated policy statement was acted upon by a customer and enforced by a court — a downstream consequence of unvalidated model output.

Incorrect. The Air Canada case is an Output Layer issue: the chatbot's hallucinated response was consumed downstream by a customer and a court. The threat is hallucination-as-liability — a risk in the output consumption layer.

3. What is MITRE ATLAS and why is it the recommended reference taxonomy for AI threat modeling?

Correct. ATLAS provides documented, real-world TTPs for AI attacks — the same structured adversarial knowledge that ATT&CK provides for traditional cyber attacks. It gives threat models a grounded reference base rather than theoretical attack speculation.

Incorrect. MITRE ATLAS (atlas.mitre.org) is the AI-specific extension of the ATT&CK framework — documenting real adversarial techniques against ML systems with the same tactic/technique/procedure structure that makes ATT&CK useful for traditional threat modeling.

Lab 4 — MAESTRO Threat Model Construction

Build a layer-by-layer threat model for a real AI deployment · Complete 3 exchanges to finish

Objective

Apply the MAESTRO framework to a complex AI deployment: an agentic LLM system that assists financial analysts, has access to live market data APIs, can execute trades within pre-approved parameters, and stores conversation history in a vector database for personalization. Work through each MAESTRO layer with the analyst.

Try: "Walk me through the MAESTRO Agent Layer threats for a financial AI that can execute trades." Or: "What Supply Chain threats apply to a system using a third-party embedding model and a commercial LLM API?" Pick any layer and go deep.

AI Threat Modeling Analyst

MAESTRO Framework

This is a high-stakes deployment — an agentic system with real-world financial execution authority. Let's build the threat model layer by layer. Which MAESTRO layer do you want to start with? I'd suggest beginning with the Agent Layer, since autonomous trade execution is the highest-consequence capability. Or tell me which layer concerns you most.

Module 1 Test

The AI Security Threat Model · 15 questions · Pass at 80%

1. Which property of AI systems makes their failure modes qualitatively different from classical software bugs?

Correct.

Incorrect. The key distinction is probabilistic, weight-based behavior versus deterministic code logic.

2. Riley Goodside's September 2022 work and the Bing Chat incident of February 2023 both relate to which attack class?

Correct. Goodside documented prompt injection systematically; Bing Chat demonstrated it against a production system.

Incorrect. Both cases involve prompt injection — attacker-controlled text overriding developer instructions.

3. Tramèr et al. (2016) demonstrated which AI attack against commercial ML APIs?

Correct. Tramèr et al. demonstrated model extraction — querying a black-box API to steal the model's decision function.

Incorrect. Tramèr et al. demonstrated model extraction. Shokri et al. demonstrated membership inference; Fredrikson demonstrated model inversion.

4. The Samsung data leak of April 2023 required which technical exploit?

Correct. No exploit was needed. The attack surface was the deployment decision — using ChatGPT for work that involved proprietary data.

Incorrect. Samsung's leak required no technical exploit. Engineers used ChatGPT normally; the vulnerability was the deployment context.

5. What two axes should define adversary classification in an AI threat model?

Correct. Capability determines what attacks are feasible; motivation determines which are operationally relevant for the specific deployment.

Incorrect. The two axes for adversary classification are capability (what they can do) and motivation (what outcome they want) — which together determine which attacks are relevant.

6. WormGPT's primary significance for threat modeling was:

Correct. WormGPT's impact was economic — it made LLM-powered phishing accessible to anyone, not just technically sophisticated actors.

Incorrect. WormGPT's significance was the commoditization of LLM-enabled fraud — lowering the capability bar for financially motivated adversaries to the price of a forum subscription.

7. Meta's LLaMA model weights leaked publicly in March 2023. This is an example of which asset being compromised?

Correct. LLaMA's weights — the trained parameters — were the asset. Once leaked, anyone could run the model locally and circumvent Meta's intended access controls.

Incorrect. The leaked asset was the model weights — LLaMA's trained parameters — enabling anyone to run the model without Meta's authorization or safeguards.

8. Under the EU AI Act, an LLM-based hiring screening tool is classified as:

Correct. Employment and recruitment AI is explicitly listed in EU AI Act Annex III as a high-risk application category requiring conformity assessment before deployment.

Incorrect. Hiring/recruitment AI is explicitly named in EU AI Act Annex III as high-risk, requiring mandatory conformity assessment including adversarial testing.

9. The FTC's action against Rite Aid (December 2023) established which principle relevant to AI security?

Correct. The FTC framed Rite Aid's failure to test the system and implement human oversight as an unfair business practice — making AI security validation a regulatory enforcement matter.

Incorrect. The FTC penalized Rite Aid for deploying without adequate testing and human oversight — making pre-deployment AI security validation an enforcement issue, not merely a technical best practice.

10. Why does achieving SOC 2 Type II certification not adequately address AI security requirements?

Correct. SOC 2 was designed before AI-native threat classes existed. It has no controls or evaluation criteria for prompt injection, adversarial examples, model extraction, or training data poisoning.

Incorrect. The gap is substantive: SOC 2 controls were designed for traditional IT security and have no criteria for evaluating AI-specific attack vectors.

11. In the MAESTRO framework, cross-namespace leakage in a RAG vector database falls under which layer?

Correct. The Embedding Layer covers vector stores and RAG systems. Cross-namespace leakage — where a query retrieves documents from an unauthorized namespace — is an Embedding Layer threat.

Incorrect. Vector database security, including namespace isolation and cross-namespace leakage, is addressed in the Embedding (E) Layer of MAESTRO.

12. A backdoored pretrained model downloaded from a third-party hub is a threat in which MAESTRO layer?

Correct. Third-party model integrity — including the risk of backdoored pretrained models from external hubs — is a Supply Chain Layer threat in MAESTRO.

Incorrect. Risks from third-party model sources, plugins, and datasets are Supply Chain (S) Layer threats in MAESTRO.

13. The Air Canada chatbot hallucination case illustrates which MAESTRO layer threat?

Correct. The Air Canada case is an Output Layer threat — model-generated content was consumed by a customer and enforced by a court, creating financial liability from unvalidated output.

Incorrect. Air Canada illustrates the Output Layer threat of hallucination-as-liability — where downstream consumption of a model's fabricated response created legal and financial consequences.

14. What is MITRE ATLAS and how does it relate to traditional MITRE ATT&CK?

Correct. ATLAS applies ATT&CK's tactic/technique/procedure structure to AI-specific adversarial techniques, grounding AI threat models in documented real-world attack evidence.

Incorrect. MITRE ATLAS is the AI-specific extension of ATT&CK — same TTP structure, applied to documented adversarial attacks against ML systems. Available at atlas.mitre.org.

15. What are the four deliverables of a complete AI threat model as described in Lesson 4?

Correct. The four outputs are: asset register, adversary profile matrix, threat register with MAESTRO layer and ATLAS TTP references, and control mapping documenting mitigations, gaps, and residual risk.

Incorrect. The four deliverables of an AI threat model are: (1) asset register, (2) adversary profile matrix, (3) threat register with MAESTRO/ATLAS references, and (4) control mapping with residual risk assessment.