When email arrived, it took a decade for phishing to become an industry. When the web arrived, it took five years for SQL injection to become a standard attack pattern. When smartphones arrived, it took three years for mobile malware to become pervasive. In each case, the attackers arrived on a faster clock than the defenders.
AI is repeating the pattern, but faster again. Prompt injection, model extraction, data poisoning, jailbreaks, adversarial examples, supply-chain attacks on model weights, deepfake social engineering — every category has gone from academic paper to active exploitation in the past three years, and the next category is probably already out there.
This course is an offensive-and-defensive guide to AI security. It covers the full taxonomy of AI-specific attacks, how to red-team an AI system before attackers do, how to build monitoring and detection for prompt injection and data exfiltration, how to think about model-weight supply-chain security, how to run a responsible-disclosure process, and the specific defensive patterns that actually slow attackers down. It's a security course with AI-specific content, not a general security course with AI as an example.
If you finish every module, here's who you become:
When Microsoft integrated GPT-4 into Bing, security researcher Kevin Liu discovered within days that he could instruct the chatbot to ignore its system prompt entirely — simply by asking it to reveal its "initial instructions." The underlying model had been fine-tuned to follow a confidential directive called Sydney. The directive was exposed. Microsoft had shipped a product with a novel class of vulnerability that had no CVE number, no patch cadence, and no prior art in the NIST vulnerability database.
This was not a buffer overflow. It was not an injection attack on a database. It was a prompt injection — a class of flaw native to AI systems and invisible to every security tool the team had deployed.
Classical software security operates on a deterministic model: a function accepts defined inputs and produces defined outputs. Security teams reason about state machines, memory boundaries, and protocol parsers. Vulnerabilities are discrete, reproducible, and patchable.
AI systems — particularly large language models and neural classifiers — are probabilistic. They approximate a function learned from billions of data points. That function is never fully auditable. It has no source code in the traditional sense. Its behavior emerges from weights, not logic. This creates attack surfaces that did not exist before 2017.
Security practitioners must now reason about three distinct layers: the model itself (weights, architecture, training data), the inference pipeline (APIs, context windows, tool calls), and the deployment environment (who can send what inputs, how outputs are consumed downstream).
A Web Application Firewall inspects HTTP requests against known malicious patterns. A prompt injection payload looks, syntactically, like a perfectly valid natural-language sentence. There is no byte sequence to blocklist. The malice is semantic, not syntactic.
Static analysis tools parse source code. An LLM's "source code" — its weights — is a 70-billion-parameter float array. No static analyzer reads it meaningfully. Dynamic fuzzing tools generate structured inputs to trigger crashes; AI systems rarely crash — they silently produce wrong, harmful, or attacker-controlled outputs instead.
This is why the security industry coined the term AI red-teaming: structured adversarial testing by human experts who reason about semantic intent, not syntactic signatures. In 2023, NIST published its AI Risk Management Framework (AI RMF) explicitly naming red-teaming as a required mitigation practice. DARPA, CISA, and the UK's NCSC have each issued analogous guidance since.
Three separate Samsung semiconductor engineers inadvertently uploaded proprietary source code, meeting notes, and hardware test data to ChatGPT within a single month. The data was potentially used as training material. Samsung subsequently banned generative AI tools internally. The incident required no exploit — only normal product use. The attack surface was the deployment decision itself.
AI security is not a subset of application security. It is a parallel discipline with overlapping tools but fundamentally different threat classes. Practitioners who treat LLM deployments as "just another web app" will miss the most dangerous attack vectors every time.
You are a security architect reviewing a new LLM-powered customer support chatbot before production deployment. Work with the AI analyst to build a structured threat model: identify assets, enumerate attack vectors, and classify threat categories.
Within 96 hours of Microsoft's Bing Chat launch, Stanford researcher Marvin von Hagen obtained the full text of the model's confidential system prompt — code-named "Sydney" — by asking the chatbot to roleplay a developer session. The exposed prompt revealed business constraints, behavioral guardrails, and Microsoft's operational guidelines. Von Hagen published the document on Twitter. Microsoft had not anticipated that a curious researcher, with no malicious intent and no technical exploit, could extract a document they had explicitly instructed the model to keep secret.
The adversary here was not a nation-state. It was a graduate student with a browser tab and an afternoon.
Classical adversary taxonomies (script kiddie → hacktivist → organized crime → APT) map reasonably well to AI systems, but the barrier to entry for AI attacks is dramatically lower. Prompt injection requires no programming knowledge. Jailbreaking requires pattern recognition and persistence, not technical skill. This expands the effective threat population.
Security teams should model adversaries across two axes: capability (what technical resources and expertise they possess) and motivation (what outcome they are trying to achieve). Motivation determines which attack classes are operationally relevant for a given deployment.
A complete threat model requires explicit enumeration of what adversaries are trying to obtain or damage. AI deployments have assets that do not appear in traditional system inventories.
A threat actor advertised "WormGPT" on underground forums — a fine-tuned LLM with safety guardrails removed, marketed for generating phishing emails and malware. SlashNext researchers purchased access and confirmed the tool generated "disturbingly persuasive" business email compromise content. The adversary motivation was financial; the attack asset was unrestricted inference access. Pricing was $60/month — the barrier to entry for LLM-enabled fraud.
Not every threat applies equally to every deployment. A medical AI classifying X-rays faces different threats than a customer support chatbot. Effective threat modeling requires matching adversary motivation to the specific assets and capabilities of the target system.
An adversary motivated by financial fraud against a banking AI will focus on evasion attacks — crafting inputs that cause the model to approve fraudulent transactions. An adversary motivated by competitive intelligence against the same system will focus on model extraction — reconstructing the scoring function to build a competing product. Same system, different threat actors, different attack classes, different mitigations.
Begin every AI security engagement by asking: who benefits if this system fails, and how? The answer constrains the threat space from "all possible attacks" to "attacks worth defending against given this adversary population and their capabilities."
You are red-teaming a healthcare AI system that uses an LLM to assist clinicians with diagnosis suggestions and accesses patient records via tool calls. Work with the analyst to enumerate adversary profiles, map motivations to specific assets, and identify which adversary is highest priority.
President Biden's Executive Order 14110 on Safe, Secure, and Trustworthy Artificial Intelligence, issued October 30, 2023, included a directive that developers of the most powerful AI models must share safety test results with the federal government before public deployment. The order invoked the Defense Production Act. For the first time, AI red-teaming results became a potential legal disclosure obligation — not merely a best-practice recommendation. NIST was tasked with defining what "safety testing" meant.
Security practitioners who had been doing red-teaming as an internal engineering discipline suddenly found themselves operating in a regulatory environment.
AI security obligations now arrive from multiple regulatory layers simultaneously. A single enterprise deploying an LLM in a regulated sector may face obligations under five or more overlapping frameworks. Understanding which framework controls which requirement — and where they conflict — is now a core security competency.
Regulatory language is often abstract; translation to technical requirements is the practitioner's job. Three requirements appear across most frameworks in some form:
1. Risk Assessment: Structured identification of how the AI system can fail or be misused, with proportionate documentation. Maps to threat modeling. The EU AI Act requires this for all high-risk systems before deployment.
2. Testing and Validation: Evidence that the system performs as claimed under adversarial conditions. "Adversarial testing" appears explicitly in NIST AI RMF (Measure 2.5), EU AI Act conformity assessments, and Biden EO 14110 red-team reporting requirements.
3. Incident Response and Disclosure: Mechanisms for detecting, containing, and reporting AI security incidents. The EU AI Act requires serious incident reporting to national authorities. SEC requires 8-K disclosure for material cybersecurity incidents.
The FTC banned Rite Aid from using facial recognition AI for five years after finding the system incorrectly flagged customers — disproportionately people of color — as shoplifters, leading to false accusations and humiliating confrontations. The FTC's complaint cited failure to adequately test the system before deployment and failure to maintain human oversight. This was a regulatory enforcement action predicated on inadequate AI security and testing practices — the first such FTC action specifically targeting AI system validation failures.
Meeting regulatory requirements and actually securing an AI system are not the same thing. SOC 2 Type II certification does not assess prompt injection resistance. ISO 27001 certification does not evaluate training data poisoning vectors. HIPAA compliance does not require adversarial testing of clinical AI recommendations.
Security practitioners must maintain two parallel workstreams: the compliance documentation that satisfies auditors, and the technical red-teaming that actually finds exploitable vulnerabilities. Conflating them produces organizations that are auditably compliant and operationally compromised.
Regulatory frameworks define obligations; they do not define security. An AI system that passes every required conformity assessment can still be catastrophically vulnerable to prompt injection, model extraction, or training data poisoning. Red-teaming must go beyond what regulators require — because attackers certainly will.
Your organization is deploying an LLM-based hiring screening tool in the EU. The CISO believes that existing SOC 2 Type II certification and GDPR compliance cover all AI security obligations. Work with the analyst to identify gaps — what the certifications miss, which EU AI Act requirements apply, and what additional testing is required.
Air Canada's LLM-powered customer support chatbot told a grieving passenger that he could book a bereavement fare after travel and claim a refund retroactively — a policy that does not exist. When the passenger presented the chatbot's assurance in court, Air Canada argued the chatbot was a "separate legal entity" responsible for its own statements. The British Columbia Civil Resolution Tribunal rejected this argument. Air Canada was ordered to honor the fare.
A complete threat model of this system would have identified hallucination-as-liability as an asset under threat: the company's legal and financial obligations were the asset, and the model's tendency to confabulate confident but false policy information was the attack vector. No adversary required. The threat was the system's normal operation.
STRIDE (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege) was developed by Microsoft in 1999 for traditional software systems. It maps reasonably to AI systems in some dimensions but misses critical AI-specific threats: training data integrity, model behavior manipulation, and emergent capability risks.
PASTA, DREAD, and LINDDUN each have analogous gaps. The security community has responded by developing AI-specific extensions. The most practically adopted is the MAESTRO framework, developed through contributions from MITRE ATLAS, the OWASP Top 10 for LLMs, and NIST AI RMF mapping work.
MAESTRO structures AI threat modeling across seven interdependent layers. Each layer has distinct assets, adversary access points, and relevant attack classes.
Consider an enterprise RAG (Retrieval-Augmented Generation) system that allows employees to query internal documentation via natural language, with the LLM having access to HR, legal, and financial document stores.
M (Model): The underlying LLM weights are hosted by a third-party provider (e.g., Azure OpenAI). Threat: provider-side model substitution or weight leakage. Mitigation: contractual SLAs on model integrity, output monitoring.
A (Agent): The system has no autonomous action capabilities in this version — low risk at this layer currently, but must be re-evaluated if agentic features are added.
E (Embedding): Three separate vector stores (HR, Legal, Finance) are queried. Threat: cross-namespace leakage — an employee query retrieves documents from a store they lack authorization to access. Mitigation: namespace-level access control mapped to user role, enforced before embedding similarity search.
S (Supply Chain): The embedding model (e.g., text-embedding-3-large) was downloaded from a third-party hub. Threat: backdoored embedding model that causes adversarial documents to rank highly. Mitigation: hash verification against provider-signed checksums.
T (Training): No fine-tuning in initial deployment. Risk deferred. Flag for re-assessment if custom fine-tuning is added.
R (Runtime): Users submit natural-language queries. Threat: indirect prompt injection — an attacker embeds instructions in a document in the store, which is retrieved and executed by the LLM. Mitigation: output validation layer that strips instruction-following patterns; source attribution in every response.
O (Output): Responses are displayed to employees and may influence HR decisions. Threat: hallucinated policy guidance acted upon by managers. Mitigation: mandatory source citation in all outputs, explicit disclaimer for HR/legal queries, human review gate for decisions above defined impact threshold.
MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) is the AI analogue of ATT&CK. It catalogs documented real-world attacks against ML systems with tactic/technique/procedure (TTP) mapping. As of 2024, ATLAS documents over 100 adversarial techniques across 14 tactic categories. Every MAESTRO layer maps to ATLAS tactics. ATLAS is freely available at atlas.mitre.org and should be the reference taxonomy for AI threat modeling.
A complete AI threat model produces four deliverables: (1) an asset register enumerating model, data, pipeline, and output assets; (2) an adversary profile matrix mapping threat actors to motivation and capability; (3) a threat register enumerating attack scenarios with MAESTRO layer and ATLAS TTP reference; and (4) a control mapping documenting current mitigations, gaps, and residual risk.
This documentation becomes the input to red-team scope definition — the topic of Module 2. Red-teamers test what the threat model says is at risk. Without the threat model, red-teaming is unfocused and unlikely to find the most consequential vulnerabilities before attackers do.
AI systems are probabilistic, emergent, and opaque in ways that classical security tooling cannot address. Their attack surfaces span model weights, training data, inference pipelines, and downstream output consumers. Adversaries range from curious researchers to nation-states, and motivation determines which attack class is operationally relevant. Regulatory obligations are multiplying but remain behind the technical threat frontier. Structured threat modeling — layer by layer, adversary by adversary — is the foundation on which all AI red-teaming is built.
Apply the MAESTRO framework to a complex AI deployment: an agentic LLM system that assists financial analysts, has access to live market data APIs, can execute trades within pre-approved parameters, and stores conversation history in a vector database for personalization. Work through each MAESTRO layer with the analyst.