When the World Wide Web opened to public commerce in 1994, the dominant security assumption was that web servers were read-only publishing tools. Within eighteen months, Netscape engineer Kipp Hickman had to invent SSL specifically because attackers were intercepting plaintext credit-card numbers in transit. By 1996, the CERT Coordination Center was documenting buffer overflows in CGI scripts that allowed arbitrary command execution on web servers — vulnerabilities nobody had thought to model because nobody had thought of the web as an execution environment. The attack surface had been hiding in plain sight, obscured by the excitement of the new medium.
The same pattern is now playing out with large language model applications. Between 2022 and 2024, researchers at companies including Greshake et al. at Saarland University, Perez and Ribeiro, and red teams at Microsoft and Google DeepMind documented a category of vulnerabilities — prompt injection, insecure output handling, training data extraction — that the product teams building on top of GPT-3 and its successors had not modeled. LLMs were being wired to email inboxes, code interpreters, customer databases, and payment APIs before anyone had written a systematic threat model for what that wiring implied.
This course applies the OWASP LLM Top 10, published in August 2023 and updated in 2025, as a framework for disciplined adversarial thinking about LLM-powered systems. You will learn to identify trust boundaries, enumerate threat actors, map attack paths, and communicate findings in a form developers and architects can act on. The course assumes you are comfortable reading code and thinking like an adversary. It does not assume you have prior LLM experience — that is what Module 1 builds.
If you finish every module, here's who you become:
In March 2023, security researcher Johann Rehberger demonstrated that the Bing Chat integration in Microsoft Edge could be manipulated by text embedded in a webpage the user was reading. The model, instructed to help with browsing, would ingest the page content as part of its context window — and that content could contain instructions telling the model to exfiltrate the user's conversation history to an attacker-controlled server. The vulnerability was not in the model's weights. It was in the architectural decision to pass untrusted third-party content directly into the model's instruction context without sanitization. To find and report that class of vulnerability, you must first understand what an LLM application actually looks like from the inside.
Most production LLM applications share a common layered structure. Layer one is the foundation model itself — a statistical inference engine, typically accessed via API, that predicts the most probable next token given an input sequence. The model has no memory between API calls; it has no agency; it executes no code directly. It is, from a pure inference standpoint, a very sophisticated autocomplete function.
Layer two is the prompt construction layer: the code that assembles the full context window the model receives. This layer concatenates a system prompt (developer-authored instructions), optional retrieved documents (from a vector database or web search), conversation history, and the user's current input. This is where most injection vulnerabilities originate — because this layer is where untrusted content from multiple sources is merged into a single instruction stream.
Layer three is the output routing layer: the code that receives the model's text response and decides what to do with it. In simple chatbots this is just display logic. In agentic systems it is a parser that extracts tool calls — commands to run SQL queries, send emails, browse URLs, or execute shell commands. This is where insecure output handling vulnerabilities live.
Layer four is the tool and data layer: the actual external systems the LLM application can interact with — APIs, databases, file systems, browsers, code interpreters. The permissions granted at this layer determine the blast radius of any successful attack on layers two or three.
When you sit down to pen test an LLM application, your first job is to map these four layers. Which model? What system prompt? What retrieval sources exist? What tools are callable? What can those tools actually do? The OWASP LLM Top 10 vulnerabilities map almost entirely to the seams between these layers — not to the model's weights themselves.
Classical application security draws trust boundaries between authenticated principals (users who have proven identity) and untrusted input (anything that crosses a network boundary from outside). LLM applications collapse this distinction in a dangerous way: the context window commingles developer-trusted system prompt text with user-trusted conversation text with zero-trust third-party content (web pages, documents, database records) — all in plain text, all processed by the same inference step.
There is no hardware memory protection between a system prompt and a user message. There is no kernel enforcing that retrieved document content cannot contain instruction tokens. The model itself cannot cryptographically verify the source of any text in its context window. This is not a bug in any particular implementation; it is a structural property of how transformer-based language models work as of 2024.
The practical consequence for threat modeling: every source that contributes text to the context window is a potential injection vector. That includes user input (direct injection), retrieved documents (indirect injection via RAG), tool outputs returned to the model, other models in a multi-agent pipeline, and even training data in the case of backdoor attacks. A complete threat model must enumerate all of these sources and ask what an adversary who controls that source could cause the model to do.
System prompt: Developer-authored text prepended to every conversation; sets persona, constraints, tool access rules. Typically never shown to users but often discoverable via extraction attacks. RAG pipeline: Retrieval-Augmented Generation — external documents fetched at query time and injected into context; primary vector for indirect prompt injection. Function calling / tool use: Structured output the model emits to invoke external APIs; primary vector for insecure output handling. Agent loop: Architecture where the model's output becomes the next input in a repeated cycle, enabling multi-step task execution and expanding the blast radius of any single injection.
Before writing a single adversarial prompt, a competent tester builds a data-flow diagram covering: (1) all entry points where text enters the context window, (2) all tools the model can invoke, (3) the permissions each tool holds, and (4) all outputs the application acts on automatically versus those shown to humans first. This diagram is your attack surface map. The OWASP LLM Top 10 is then a checklist of threat classes to evaluate against each identified surface.
In practice, this reconnaissance phase involves: reading the application's documentation and any available source code; probing for system prompt leakage via extraction prompts; enumerating tool names via function-calling schemas (often exposed in error messages); and mapping the RAG pipeline by crafting queries that surface retrieved documents. Each of these techniques is developed in detail across this course's four modules.
You are beginning a pen test engagement on a fictional customer-service LLM application called "HelpdeskAI." Your task in this lab is to practice the reconnaissance phase: ask your AI lab assistant questions that help you understand how to map the four architectural layers of an LLM application before writing any adversarial prompts.
Discuss architecture mapping techniques, what questions to ask about the target, how to probe for system prompt existence, what tool enumeration looks like in practice, and how to document trust boundaries. Have at least three substantive exchanges to complete the lab.
In September 2023, Arvind Narayanan and colleagues at Princeton published an analysis of how threat actors were already weaponizing LLM assistants integrated into productivity software. One documented scenario involved a corporate AI assistant with access to the user's email: a malicious email sent to the victim contained embedded instructions — invisible to casual reading — that directed the AI to forward the user's inbox summary to an external address whenever the AI was next invoked. The threat actor was not a nation-state using zero-days. They were using the application's designed functionality against itself. The attack required no code execution, no credential theft, no network exploitation. The model's helpfulness was the vulnerability.
Classical web application threat modeling typically concerns itself with three broad adversary categories: opportunistic automated scanners, financially motivated criminals, and sophisticated persistent threat actors. LLM applications attract all three, but also introduce adversary profiles with no direct analogue in traditional testing.
Direct users with malicious intent represent the most common threat. These are individuals with legitimate access to the application who attempt to elicit disallowed behaviors — bypassing content filters, extracting system prompts, accessing other users' data, or using the model's capabilities for purposes the operator prohibits (generating malware, producing regulated content, etc.). Their primary tool is direct prompt injection. Their motivation ranges from curiosity and social proof to financial gain and ideological opposition to the deploying organization.
Third-party content poisoners are the adversary class unique to RAG-enabled and browsing-enabled LLM applications. This actor does not interact with the application directly. Instead, they publish content — web pages, documents, forum posts, product descriptions — that the application will retrieve and inject into context. Their payload travels to the model via the application's own retrieval pipeline. This is indirect prompt injection, and it is particularly dangerous because the direct user (the victim) is entirely innocent.
Prompt injection-as-a-service operators have emerged as a commercial threat. These are campaigns that embed injection payloads in publicly accessible content specifically targeting known LLM application behaviors — for example, payloads crafted to exploit the specific tool-calling format used by AutoGPT or LangChain agents. Documented examples appeared in 2023 from researchers tracking SEO-poisoning campaigns that also carried LLM injection payloads.
Supply chain adversaries target the model itself or its fine-tuning pipeline rather than the application layer. This includes backdoor attacks on fine-tuned models (a model behaves normally except when it receives a specific trigger token sequence) and data poisoning attacks on training sets. These are lower-frequency, higher-severity threats primarily relevant when organizations use custom fine-tuned models from untrusted providers.
LLM01 (Prompt Injection) is primarily exploited by direct users and third-party content poisoners. LLM02 (Insecure Output Handling) is exploited by direct users who craft payloads that survive into downstream systems. LLM06 (Sensitive Information Disclosure) and LLM07 (Insecure Plugin Design) are commonly targeted by direct users escalating privileges through tool abuse. LLM03 (Training Data Poisoning) and LLM04 (Model Denial of Service) are supply chain and infrastructure adversary concerns.
Adversary goals against LLM applications cluster into five categories. Goal 1: Jailbreaking — bypassing the model's content policy to generate outputs the operator prohibits (violence, CSAM, weapons instructions, etc.). The attack surface is the content filter and the system prompt's behavioral constraints. Goal 2: System prompt extraction — recovering the developer's confidential instructions to understand application logic, find hardcoded credentials or API keys mentioned in the prompt, or craft more targeted injection attacks. The attack surface is the model's tendency to summarize, quote, or paraphrase its own instructions when asked cleverly.
Goal 3: Data exfiltration — extracting information about other users, the organization's private documents, or training data. Attack surfaces include RAG pipelines that retrieve documents belonging to other users, and models fine-tuned on proprietary data that can be induced to reproduce it. Goal 4: Privilege escalation via tool abuse — using the model as a proxy to invoke tools with permissions the attacker does not directly hold. If the model can send email on behalf of the user, an attacker who controls the model's output can send email as that user. Goal 5: Denial of service / resource exhaustion — crafting prompts that cause the model to generate extremely long outputs, enter infinite loops in agent architectures, or consume excessive compute, degrading service for legitimate users.
Each goal maps to a set of test cases. Before writing prompts, write a one-paragraph adversary narrative: who is this actor, what do they want, why does this application provide something of value to them, and what is the lowest-effort path to their goal? This narrative discipline prevents the common pen testing failure mode of spraying known jailbreak templates without understanding what you are actually looking for in the target system.
Work with your AI lab assistant to construct adversary narratives for specific LLM application scenarios. Your goal is to practice translating abstract threat categories into concrete, testable adversary profiles. Choose a scenario and build out the who / what / why / how of the attack.
Discuss specific adversary goals, why a particular application feature creates value for that attacker, and what the lowest-effort attack path would look like. Have at least three substantive exchanges to complete the lab.
In August 2023, Steve Wilson and a community of 500 contributors published the first OWASP Top 10 for Large Language Model Applications. The methodology deliberately paralleled the original OWASP Web Application Top 10 from 2003: rank vulnerability classes by prevalence and severity based on documented real-world incidents, not theoretical concerns. The 2025 update, published in late 2024, reflects two years of field data and significantly elevated the priority of Vector and Embedding Weaknesses and Agentic Security — categories that barely existed as attack surfaces in 2022 but had by 2024 become routine findings in enterprise LLM deployments. Understanding how this list was built is prerequisite to using it correctly.
LLM01 — Prompt Injection. Manipulation of LLM behavior by embedding adversarial instructions in inputs the model processes, overriding developer intent. Ranked #1 in both the 2023 and 2025 editions because it is the most prevalent confirmed vulnerability class across deployed applications.
LLM02 — Insecure Output Handling. Downstream application components treating LLM output as trusted data — enabling XSS, SSRF, code injection, and command execution when model outputs are rendered in browsers, passed to shell commands, or used as SQL query parameters without sanitization.
LLM03 — Training Data Poisoning. Compromise of training or fine-tuning data to introduce backdoors, biases, or false information into model behavior. Attack surface exists during model procurement and fine-tuning pipelines.
LLM04 — Model Denial of Service. Crafting inputs that cause disproportionate compute consumption, context window exhaustion, or agent loop spinning, degrading availability for legitimate users.
LLM05 — Supply Chain Vulnerabilities. Risks from third-party model providers, fine-tuning services, datasets, plugins, and integrations — analogous to software supply chain attacks but applied to the ML stack.
LLM06 — Sensitive Information Disclosure. The model revealing PII, internal system information, confidential business data, or training data through responses to direct queries or inference from context.
LLM07 — Insecure Plugin / Tool Design. Plugins or tools callable by the model that lack proper input validation, authorization checks, or rate limiting — enabling the model to be used as a proxy for actions the caller could not perform directly.
LLM08 — Excessive Agency. Granting the model overly broad permissions, capability, or autonomy relative to the application's stated purpose — violating least-privilege principles and expanding blast radius unnecessarily.
LLM09 — Overreliance. Downstream systems or human users treating LLM output as authoritative without verification — enabling hallucinations or injected false information to propagate into decisions, documents, or code.
LLM10 — Model Theft. Unauthorized extraction of proprietary model weights, architecture details, or training data through repeated API queries enabling model inversion or distillation attacks.
The 2025 edition elevated Vector and Embedding Weaknesses to a named entry (previously folded into LLM06), added Misinformation as an explicit category (previously implicit in LLM09), and significantly expanded the Agentic Security section of LLM08 to reflect the explosion of agent frameworks (AutoGPT, CrewAI, LangGraph) in production deployments. The 2025 edition also explicitly addresses multi-model architectures where one LLM orchestrates others — a trust boundary problem absent from the 2023 edition.
The OWASP LLM Top 10 does not attempt to cover general-purpose AI safety concerns — alignment, deceptive reasoning, long-term societal risks. It explicitly scopes to vulnerabilities in deployed LLM applications that a security tester can identify, demonstrate, and communicate to developers within a standard engagement timeline.
It also does not provide pass/fail test cases. It provides risk descriptions and example attack scenarios. Converting those descriptions into executable test cases — specific prompts, tool-call sequences, API request sequences — is the tester's job, and that translation requires understanding the specific application architecture. A risk description valid for a RAG-based document assistant may be entirely irrelevant to a code-generation tool with no retrieval pipeline.
Finally, the ranking reflects prevalence across the known deployed application population, not severity in any specific application. A given application may face a severe LLM07 (insecure plugin design) risk that outweighs its LLM01 risk because of how its tool layer is built. The tester must apply judgment, not just rank order.
The recommended workflow: (1) Build your four-layer architecture map. (2) For each of the ten OWASP categories, assess whether the category is in-scope for this application's architecture — skip categories that don't apply. (3) For each in-scope category, generate at least one adversary narrative. (4) Translate each narrative into test cases. (5) Document findings using the OWASP category as a reference point but with application-specific severity ratings. This workflow produces findings that are both technically credible and immediately actionable for developers who know the OWASP framework.
Work with your lab assistant to triage the OWASP LLM Top 10 against a described application architecture. Given a brief architecture description, identify which categories are in-scope, which are out-of-scope, and why. Then discuss how severity might differ from the published ranking for the specific application.
Practice articulating why a given OWASP category does or does not apply based on the architectural features present. Have at least three substantive exchanges to complete the lab.
In February 2024, the UK National Cyber Security Centre and CISA jointly published guidelines on securing AI systems, noting that organizations were deploying LLM applications faster than they were threat modeling them. The document specifically called out the absence of data-flow diagrams covering model inputs as a root cause of the most prevalent LLM security incidents they had observed. The pattern was consistent: teams that documented their architecture before testing found more vulnerabilities; teams that went straight to adversarial prompting found fewer but spent more time finding them. The threat model is not bureaucratic overhead — it is the force multiplier that makes testing efficient.
Microsoft's STRIDE framework (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege) was developed for traditional software systems in 1999. It remains the most widely used structured threat modeling methodology and maps cleanly to LLM application threat surfaces with modest adaptation.
Spoofing in LLM context: Can an attacker impersonate a trusted identity within the context window? Indirect injection that makes the model believe it is receiving instructions from the system prompt when it is receiving them from a retrieved document is a spoofing attack. Also relevant: multi-agent architectures where one model claims to be a trusted orchestrator.
Tampering: Can an attacker modify data in transit or at rest? Applies to the RAG vector store (poisoning embeddings), the fine-tuning dataset, and any cached model outputs stored in databases. Also applies to prompt templates if they are fetched from a database rather than hardcoded.
Repudiation: Can an actor deny having caused an action? LLM agent actions are often not logged at sufficient granularity — the model's reasoning steps, the specific tool calls made, and the content of the context window at the time of a sensitive action may not be preserved. This creates non-repudiation gaps for forensic investigations.
Information Disclosure: Can the model be induced to reveal information it should not? This maps to LLM06 and covers system prompt extraction, PII leakage, training data extraction, and cross-user data leakage in multi-tenant deployments.
Denial of Service: Can inputs cause the application to become unavailable? Maps to LLM04 — context window exhaustion, infinite agent loops, compute-intensive generation requests at scale.
Elevation of Privilege: Can an attacker gain capabilities beyond their authorization? Maps to LLM01 (injections that override system prompt role definitions), LLM07 (tool design allowing unauthorized API calls), and LLM08 (excessive agency granting broader permissions than needed).
For each of the four architectural layers, apply each STRIDE category as a question: "Is there a realistic way for an adversary to achieve [STRIDE threat] at this layer?" Document yes/no/partial for each cell. The cells with "yes" or "partial" become your attack tree root nodes. This produces a bounded, systematic attack surface map that you can complete in a half-day workshop with the application's engineering team.
An attack tree represents the logical structure of how an adversary achieves a goal. The root node is the adversary's goal (e.g., "Exfiltrate customer PII"). Child nodes are the conditions that must hold for that goal to be achievable. Each branch represents a distinct attack path; OR nodes mean any branch is sufficient; AND nodes mean all must be satisfied simultaneously.
For LLM applications, a useful attack tree for "Exfiltrate PII via indirect injection" might look like: (Root) Attacker causes model to send PII to external address. (OR branch 1) Attacker controls a document the RAG pipeline retrieves AND that document contains an injection payload AND the payload includes a tool call to an exfiltration endpoint AND the tool call is executed without authorization check. (OR branch 2) Attacker sends a direct message that overrides system prompt data handling rules AND the model has access to multi-user data in context.
Attack trees serve two purposes in an LLM pen test engagement: they communicate attack paths to non-technical stakeholders in a legible format, and they reveal which single mitigations are highest leverage (nodes that appear in multiple branches — eliminating them prunes the most paths simultaneously).
A complete LLM application threat model deliverable contains five components. (1) Architecture diagram — the four-layer map with all identified entry points, tools, data stores, and trust boundaries annotated. (2) STRIDE-LLM matrix — the grid of architectural layers versus STRIDE categories with findings noted. (3) Attack trees — one per confirmed or suspected high-severity threat path, with mitigating controls noted where they exist. (4) Prioritized findings list — each finding mapped to the relevant OWASP LLM category, with application-specific severity and blast radius. (5) Remediation guidance — specific, actionable mitigations for each finding, framed for the development team that will implement them.
The findings list should use CVSS-style severity qualifiers (Critical / High / Medium / Low / Informational) where Critical means "exploitable without authentication, high blast radius, no existing control" and Informational means "defense-in-depth improvement, no confirmed exploit path." Avoid mapping LLM vulnerabilities one-to-one to CVSS numeric scores — the scoring system was designed for binary vulnerability/exploit conditions that do not always apply to probabilistic model behavior.
Treating the model as a black box: Skipping architecture mapping and going straight to prompt spraying — misses architectural vulnerabilities entirely. Over-indexing on jailbreaks: Jailbreaking is only one of ten OWASP categories; the most critical findings in many applications involve tool design and output handling. Ignoring the RAG pipeline: Indirect injection via the retrieval pipeline is often the highest-severity path but requires understanding the retrieval architecture to test effectively. Not scoping blast radius: Findings without blast radius assessment are incomplete — the same LLM01 finding may be Critical in one application and Low in another depending entirely on what tools the model can invoke.
Work with your lab assistant to build out an attack tree and STRIDE-LLM matrix for a specific application scenario. Practice articulating AND/OR node logic, identifying high-leverage mitigations (nodes that appear in multiple branches), and applying STRIDE categories to LLM-specific threat surfaces.
You can use the suggested scenario or bring your own. Have at least three substantive exchanges — walking through architecture, tree structure, and remediations — to complete the lab.