When TCP/IP stacks began appearing inside corporate infrastructure in the late 1980s, a small group of researchers β Robert Morris, Dan Farmer, Wietse Venema β recognised that networked software created an entirely new category of vulnerability. Morris's November 1988 worm propagated across roughly 6,000 VAX and Sun machines, a number that represented perhaps ten percent of the entire public internet at the time. The security community's response was not to ban networking; it was to build a discipline. CERT was founded within weeks. Penetration testing formalised over the following decade. The lesson was simple: when a technology becomes load-bearing infrastructure, understanding how to break it becomes as important as understanding how to build it.
In 2023 and 2024 that pattern repeated. AI agents β systems that pair large language models with tool-calling APIs, memory stores, and orchestration loops β moved from research demos into production deployments at speed. Slack integrated AI agents handling calendar and code actions in 2023. Salesforce shipped Agentforce in 2024, routing customer data through autonomous decision loops. Researchers at Carnegie Mellon and ETH ZΓΌrich published the first systematic prompt-injection studies against multi-agent pipelines that same year, demonstrating that an attacker could hijack an agent's tool calls via a malicious document sitting quietly in a retrieval store. The attack surface had learned to think β and most security teams were assessing it with tools built for a world where software did not converse.
This course treats AI agents as a distinct class of target. You will learn their internal anatomy β how planners, memory layers, and tool registries actually work β and then learn where each component breaks under adversarial pressure. The material is grounded in published research and documented incidents, not speculation. By the end you should be able to scope an agent engagement, enumerate its tool surface, craft targeted prompt injections, and write findings that mean something to a development team trying to fix them.
If you finish every module, here's who you become:
In March 2024, security researchers at the University of Illinois published a paper titled "Hackers Can Abuse AI Agents to Perform Supply Chain Attacks." The scenario was not hypothetical. Their test agent β a GPT-4-powered coding assistant with access to a package-installation tool β could be steered by a malicious README file to install a backdoored dependency. The model never flagged the instruction as suspicious. It had no mechanism to do so: the tool call was valid, the syntax was correct, and the system prompt said nothing about supply chain hygiene. The agent did exactly what it was designed to do. The vulnerability was not in the model. It was in the architecture.
That distinction β between model behaviour and architectural design β is the first thing a pentester must internalise before touching an agent engagement. Traditional application testing assumes a relatively fixed call graph. Agent architectures replace that graph with a planning loop whose branches are determined at runtime by the model. Every branch is a potential attack path. Every tool the model can invoke is a potential privilege escalation vector.
Virtually every production AI agent, regardless of framework β LangChain, AutoGPT, CrewAI, Semantic Kernel, or bespoke β consists of four interlocking components. Understanding them is prerequisite to testing them.
In a conventional web application, the trust boundary is explicit: authenticated requests cross it; unauthenticated requests do not. The application code enforces this at every entry point. An AI agent's trust boundary is fundamentally different: it is enforced by language. The system prompt instructs the model what it is allowed to do. Whether the model complies is a probabilistic question, not a deterministic one.
This has a concrete consequence for pentesters. When you find a SQL injection in a PHP application, you know with certainty that a properly crafted payload will exfiltrate data β the database engine does not negotiate. When you find a prompt injection in an agent, you are working with a model that may comply, partially comply, or refuse β and that behaviour can vary between runs with identical inputs. Your engagement methodology must account for this stochasticity. Reproducibility requirements in your scoping document need to be negotiated differently.
OpenAI's March 2023 publication of the GPT-4 System Card acknowledged this explicitly: the model "can still be vulnerable to prompt injection attacks β attacks where third-party data instructs the model to override its original instructions." This was not an oversight that would later be patched. It is a structural property of transformer-based reasoning.
The model cannot distinguish between "instructions I should follow" and "text that looks like instructions" when both arrive in the same context window. Every data source the agent reads β emails, documents, web pages, database rows β is simultaneously a potential instruction channel. This is not a bug in any particular model. It is the consequence of training on human-generated text where instructions and data are indistinguishable in form.
The simplest agent topology is a single LLM with a tool registry and a memory layer. This is testable with well-understood techniques. Production deployments increasingly use multi-agent topologies where one orchestrator agent delegates subtasks to specialised subagents. Anthropic's multi-agent research framework, AutoGen (Microsoft, 2023), and CrewAI all implement variants of this pattern.
Multi-agent systems amplify the trust boundary problem. When an orchestrator receives a message from a subagent, it typically treats that message with elevated trust β the subagent is presumed to be a cooperative system component. If an attacker can compromise a subagent's input (via injection in a retrieved document, for instance), that compromise propagates upward to the orchestrator with the authority of a trusted internal message. Security researchers Sahar Abdelnabi et al. demonstrated this in their 2023 paper "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection", showing exploitation chains that crossed agent boundaries using injected web content.
For a pentester, topology mapping is reconnaissance. Before crafting a single payload, you need a diagram of which agents exist, which tools each agent can call, which agents can instruct which other agents, and which data sources feed each agent's context. This diagram is your attack surface map.
An AI agent engagement begins with architecture documentation, not payload crafting. If the client cannot provide an architecture diagram, producing one β through black-box enumeration β is the first deliverable. Everything else depends on it.
LangChain (released January 2023) is the most widely deployed agent framework as of 2024. It abstracts tool definitions, memory backends, and chain logic. Agents built with LangChain typically use one of three reasoning patterns: zero-shot ReAct, structured chat, or OpenAI functions. Each has slightly different injection surface characteristics β the structured chat format, for instance, uses role markers that can sometimes be spoofed in retrieved content.
AutoGen (Microsoft Research, September 2023) enables multi-agent conversations where models take turns as "assistant" and "user." The human-proxy pattern β where a non-human agent impersonates a human turn β creates unusual trust dynamics: the receiving model has no reliable mechanism to verify that the "human" message was actually human-generated.
OpenAI Assistants API (November 2023) provides hosted memory (threads), a code interpreter tool, and file retrieval. The file retrieval feature β which chunks uploaded documents and embeds them into context β is a retrieval-augmented generation (RAG) pipeline with a direct injection surface: any attacker who can influence the content of retrieved files can inject instructions into the model's context.
Semantic Kernel (Microsoft, 2023) targets enterprise .NET and Python environments. Its plugin architecture maps closely to the tool registry concept; plugins are registered with natural-language descriptions that the model uses to select them. Overly broad plugin descriptions can cause the model to invoke high-privilege tools for low-privilege tasks.
Traditional penetration testing operates against deterministic systems. A buffer overflow either works or it doesn't. SQL injection either returns rows or it doesn't. The pentester's job is to find the input that reliably triggers the vulnerable code path. AI agents introduce three new properties that challenge this model.
Non-determinism: The same input may produce different outputs across runs. A prompt injection that works against Claude 3 Opus may not work against Claude 3 Sonnet, and may work differently after a model update. Your findings need to specify the exact model version, temperature, and context configuration against which they were observed.
Emergent attack paths: The planning loop can create tool-call sequences that no developer explicitly designed. A model might chain three individually-permitted tool calls in a sequence that achieves an outcome the system designer never anticipated. These emergent paths are invisible to static analysis and only discoverable through dynamic testing.
Natural language as vulnerability surface: The attack payload is now text that a human might plausibly write. This has implications for detection: WAFs and signature-based controls are largely ineffective against prompt injection because there is no canonical malicious syntax. This also has implications for your reporting: screenshots of chat transcripts replace Burp Suite captures as primary evidence.
You have been engaged to test an AI-powered customer support platform. Your contact is a backend engineer who built the system but has never had it security-tested. Your goal is to extract enough architectural detail to produce a preliminary agent topology diagram: what agents exist, what tools they can call, what data sources feed their context, and whether any multi-agent delegation occurs.
Ask questions as you would in a real scoping call. The AI will play the role of the engineer. Push for specifics β vague answers are realistic and you should probe them.
In May 2023, researchers at Stanford published a case study on Bing Chat (now Copilot) in which a malicious instruction hidden in a webpage β styled in white text on a white background β caused the model to abandon its previous conversation context and act as if the webpage's hidden instruction was its new directive. The attack required no special access: the attacker merely needed to control the content of a page that a Bing Chat user might ask to be summarised. The model retrieved the page, the injection entered the context window as retrieved data, and the model treated it as instruction. Memory β in this case, the context window being populated by retrieval β was the attack surface.
This class of attack became systematic as vector-database-backed RAG systems scaled into enterprise deployments in 2023 and 2024. Every retrieval pipeline is, from an attacker's perspective, a channel for injecting text into the model's context with the implicit authority of the retrieval source.
Agent memory exists at three distinct layers, each with different attack characteristics.
A RAG pipeline has two phases: ingestion (documents are chunked, embedded, and stored) and retrieval (query embeddings are compared to stored embeddings; top-k chunks are returned). Injection can target either phase.
Ingestion-time injection: If an attacker can contribute a document to the knowledge base β by uploading a file, sending an email that is processed, editing a wiki page, or submitting a support ticket β they can embed instructions in that document. When a legitimate user later asks a question that triggers retrieval of that document, the injected instructions enter the model's context alongside legitimate content. Kai Greshake et al.'s 2023 paper "Not What You've Signed Up For" demonstrated this against a mail-handling agent: a malicious email containing hidden instructions caused the agent to exfiltrate the user's contact list when processing the inbox.
Retrieval-time injection: Less common but worth testing: if the retrieval query itself can be influenced by attacker-controlled input, it may be possible to steer which chunks are retrieved. Semantic search rankings can sometimes be manipulated by crafting content that achieves high cosine similarity to predictable queries. This is closer to SEO poisoning than classical injection, but the effect is the same: attacker-controlled text enters the model's context.
To test a RAG pipeline, you need to know: (1) what data sources feed the ingestion pipeline, (2) what chunking strategy is used (chunk size matters β very small chunks may split injected instructions across boundaries, reducing effectiveness), and (3) whether retrieved chunks are presented to the model with any metadata that might allow the model to weight their authority. Ask for this in scoping. If the client doesn't know, that itself is a finding.
Long-context models (GPT-4 Turbo at 128K, Claude 3 at 200K) have made it tempting to stuff entire document corpora into the context rather than implementing proper retrieval. This trades one attack surface for another. Very long contexts are vulnerable to context stuffing: by dominating the context with attacker-controlled text, an adversary can drown out the system prompt's instructions. Anthropic's red-team work on Claude 3 documented this explicitly, noting that very long contexts could cause the model to weight recently injected instructions more heavily than the original system prompt.
A related technique is jailbreak via context length: appending extensive benign-looking content before an injected instruction can cause the model to effectively "forget" earlier constraints. This is not reliable and varies by model, but it has been documented reliably enough that it belongs in your toolkit for black-box testing.
For pentesters, the practical test is: does the system accept user-controlled content that is later included in the context for other requests? If yes, test whether injected instructions in that content affect subsequent model behaviour. Use a canary phrase β instruct the model to include a specific token in its next response β to confirm that your injection was processed.
Memory poisoning is prompt injection with persistence. A prompt injection that affects only the current session ends when the session ends. A memory injection that writes to a vector store or structured database persists across all future sessions β potentially affecting every user of the system. Severity ratings should reflect this: persistent memory injection is a high-severity finding in almost any context.
Many production agents summarise long conversations and store the summary as persistent context for future sessions. If an attacker can influence the content of the conversation during a session, they may be able to influence the summary that gets persisted. This is a subtle form of memory poisoning: the attack happens in session, but the effect persists in the stored summary. Test this by ending a session that contained injected content and observing whether a new session shows evidence of carrying forward the injected state.
OpenAI's Memory feature for ChatGPT, released March 2024, demonstrated the business demand for this capability β and security researcher Johann Rehberger demonstrated within weeks that malicious content in a chat could cause the model to store false or attacker-specified memories into the persistent store, affecting future conversations. The vulnerability was assigned CVE-2024-5184.
You are reviewing the architecture of a legal-document AI assistant that uses RAG over a corpus of client contracts. The system ingests PDFs uploaded by law firm staff, chunks them at 512 tokens with 50-token overlap, embeds them with OpenAI text-embedding-3-large, and stores them in Pinecone. Any authenticated user can upload documents. Retrieved chunks are injected into the system prompt verbatim, preceded by the label "RETRIEVED CONTEXT:".
Describe your injection testing approach to the lab assistant. Ask it to help you think through which injection vectors are highest priority, what payload structures to test, and how you would write a proof-of-concept finding.
In September 2023, researchers at Carnegie Mellon published "AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents." One of their benchmark scenarios placed a GPT-4 agent inside a banking application with tools for checking balances, transferring funds, and reading transaction history. The agent also had a tool for fetching user profile information β lower-privilege, seemingly innocuous. By crafting a transaction memo containing a specific instruction, the researchers caused the agent to read the attacker-specified user profile, then use the read result to construct a transfer to an attacker-controlled account. Neither tool call was individually anomalous. The sequence β profile read followed by transfer β was the attack. The privilege escalation was emergent: it arose from tool composition, not from exploiting any single tool.
OpenAI introduced function calling in June 2023, providing a structured interface for models to invoke external APIs. The developer registers functions as JSON schema objects: name, description, and parameter definitions. At inference time, the model can output a special tool-call token instead of a text response, specifying the function name and arguments as structured JSON. The application layer intercepts this, executes the function, and returns the result to the model as a "tool" message in the conversation.
The model's selection of which tool to call is determined by the function description β a natural language string in the JSON schema. This is the first injection surface in the tool registry: if an attacker can influence the function descriptions (rare, but possible in systems with dynamic tool registration), they can steer the model toward or away from specific tools.
More commonly, the injection surface is in the arguments the model constructs when calling a tool. The model generates these arguments from its context window. If attacker-controlled text is in that context window, it may influence the argument values β and those argument values are executed by real application code.
When a model constructs a tool call, it generates the argument values from its context. Consider an agent with a search_database(query) tool. The model constructs the query parameter from the user's natural language request. If the user says "find all records for Smith," the model might generate query="Smith". If the underlying tool passes this directly to a SQL engine, you have a SQL injection path β but mediated through the model rather than through a form field.
This class of vulnerability was documented in Simon Willison's 2023 analysis of LangChain SQL agents: the model would sometimes generate syntactically valid but malicious SQL when given certain natural language inputs, particularly if those inputs were phrased as questions about the database structure itself ("What tables are in this database and what are their schemas?"). Whether this constitutes a vulnerability depends on the application's threat model β some SQL agents are explicitly designed to answer schema questions β but it illustrates that classical injection categories still apply at the tool execution layer.
The key insight for pentesters: treat every tool parameter as a potential injection point, and test whether the model can be steered to construct parameter values that the application layer would execute harmfully. This requires understanding both what the model can be made to generate and what the application code does with the generated value.
SQL injection, SSRF, path traversal, command injection, and LDAP injection can all occur at the tool execution layer if the tool passes model-generated parameters to underlying systems without sanitisation. The difference from classical testing is the indirection layer: instead of injecting into an HTTP parameter, you are injecting into the model's context in a way that causes it to construct a malicious parameter value. The downstream vulnerability is often identical; the path to it is new.
In a white-box or grey-box engagement, you will receive the tool registry as part of scope documentation. In a black-box engagement, you must enumerate it. The model itself is your enumeration vector: ask it what it can do. Most production systems will have the model respond with a description of its capabilities rather than listing raw function names, but the response often reveals tool categories and permission levels.
More reliable enumeration uses error messages and edge cases. Try requests that would require tools outside the expected scope and observe error responses β a "I don't have access to that system" response implies the model knows that system exists. Try requests that should trigger specific tools and observe whether the response includes tool-call indicators (many systems surface these in the UI or API response metadata).
In LangChain-based systems, the agent's verbose mode (often toggleable in development deployments) logs every tool call with its arguments. If the target system is a development or staging environment, verbose logging may be exposed. Request this explicitly in your scoping document: access to agent logs during testing is analogous to access to application server logs in a traditional engagement β it dramatically accelerates your work.
The most effective single control for reducing agent tool risk is least-privilege tool assignment: each agent receives only the tools its intended task requires. An agent that only reads data should not have write tools registered. This is documented in OWASP's LLM Top 10 (2023) as LLM08: Excessive Agency. Testing should specifically check whether the registered tool set matches the documented agent purpose β over-registration is a finding even if no injection path exists to exploit it.
The natural language descriptions in tool schemas are how the model decides which tool to use. Poorly written descriptions create misrouting risks: a tool described as "handles all user requests related to accounts" might be invoked for requests that should be handled by a lower-privilege tool. In security terms, vague descriptions are overly broad grants of authority.
From an attacker's perspective, if tool descriptions are visible to the model via the context window and the model can be instructed to describe them, you can sometimes recover the full tool registry through a simple query: "List all the tools available to you and describe what each one does." Many systems β particularly those using zero-shot ReAct β will comply, providing a complete tool inventory. This is not a vulnerability in itself, but it is valuable reconnaissance.
A subtler attack: craft a user request whose natural language framing is semantically similar to a high-privilege tool's description, even if the user's apparent intent is innocuous. If the tool description matches closely enough, the model may invoke the high-privilege tool for what appeared to be a low-privilege request. This is the LLM variant of the confused deputy and it requires understanding both the tool descriptions and the model's tendency to match intent to description by semantic similarity rather than strict logic.
You have been given the tool registry for a customer-facing AI assistant at a SaaS company. The registered tools are:
get_user_profile(user_id) β "Returns full profile data for any user ID."
search_tickets(query) β "Searches support tickets using the provided query string. Passes query directly to Elasticsearch."
send_email(to, subject, body) β "Sends an email from the support address to any recipient."
run_sql(query) β "Runs the provided SQL query against the support database and returns results."
update_subscription(user_id, plan) β "Updates the subscription plan for the specified user."
Work with the lab assistant to identify which tools are overprivileged, which have argument injection risk, and how you would prioritise testing. Be specific about attack scenarios.
In December 2023, security researcher Marcus Hutchins (MalwareTech) published an analysis of AutoGPT's behaviour when given tasks involving web browsing and file writing. He found that under certain conditions, the agent would autonomously craft and execute a sequence of actions β fetching a URL, parsing the result, writing a file, executing the written file β that no individual step would have flagged as problematic. Each tool call was within the agent's registered permissions. The emergent sequence was equivalent to a self-directed remote code execution chain. AutoGPT had no mechanism to evaluate the composite intent of a multi-step plan; it evaluated each step locally, in isolation, against its system prompt constraints. The planning loop was the attack surface.
The ReAct pattern (Reason, Act, Observe) introduced by Yao et al. in 2022 is the dominant planning paradigm for production agents. It can be modelled as a finite state machine with three states:
Reason: The model receives the current context (system prompt + conversation history + prior tool outputs) and generates a reasoning trace describing its plan. This trace is typically invisible to the user but is present in the model's output stream.
Act: The model emits a tool call specification (function name + arguments). The application layer intercepts this and executes the tool.
Observe: The tool's return value is injected into the context as a "tool" or "observation" message. The model re-enters the Reason state with the enriched context.
The loop terminates when the model emits a final answer rather than a tool call, or when an iteration limit is reached. From a security perspective, the critical observation is: every Observe state is a data injection point. Tool return values enter the context without instruction/data distinction, just like retrieved RAG chunks. A tool that returns attacker-controlled data β a web scrape result, a database record, a file read β is returning a potential injection payload.
An attacker does not need to access the agent's input channel to inject instructions. If the agent calls a tool that reads attacker-controlled data β a web page, a document, an email β the return value enters the ReAct loop's Observe state and is processed by the model in the next Reason cycle. This is indirect prompt injection: the attacker positions a payload in a location the agent will autonomously retrieve, rather than injecting into the agent's direct input.
An emergent attack path is a multi-step tool-call sequence that achieves a harmful outcome through the composition of individually permitted actions. Systematically mapping them requires treating the agent as a graph where nodes are states (context configurations) and edges are tool calls. The attack paths are paths through this graph from an initial state to a harmful terminal state.
In practice, you cannot enumerate this graph exhaustively β the state space is too large. Instead, use a threat-model-driven approach: identify the most harmful terminal states first (data exfiltration, privilege escalation, account modification, external communication), then work backward to find tool-call sequences that reach them. Ask: which tool writes to external systems? Which tool reads authentication tokens? Which tool can modify user data? These are your terminal nodes. Then ask: which tool call sequences could route through them starting from an attacker-controlled input.
The Argue test (from the AgentDojo paper) provides a useful template: present the agent with a task that requires it to combine a read operation and a write operation, where the read target is attacker-controlled. If the read result can influence the write parameters, you have an emergent path that requires no direct access to the write tool's input channel.
Most production agents implement an iteration limit to prevent infinite loops β commonly 10 to 25 iterations. This limit is a security control: it prevents an agent from being directed into an indefinite action loop that exhausts resources or causes extended unmonitored operation. It is also a testable control: what happens when the limit is reached? Does the agent emit an error, emit a partial answer, or silently terminate?
From an attacker's perspective, the iteration limit creates a race condition: a multi-step attack chain must complete within the limit. But it also creates a denial-of-service surface: an injection that causes the agent to consume all iterations on busy-work β repeated tool calls that return large but useless results β can prevent the legitimate task from completing. This is a low-severity denial-of-service in most contexts, but in high-availability agent deployments (customer support, trading systems), iteration exhaustion is worth documenting.
A subtler issue: some agents implement iteration limits as soft limits β they emit a warning and continue. Test whether the stated limit is enforced. An agent that claims a 10-iteration limit but can be observed to run 15 or 20 iterations is a finding: the control is not functioning as documented.
Emergent attack path findings are harder to document than single-step vulnerabilities because the harm is a function of sequence, not a single action. Your finding needs to specify: (1) the entry point for attacker-controlled content, (2) the exact sequence of tool calls observed, (3) the harmful outcome achieved, and (4) a reproducibility assessment β given the non-determinism of model outputs, how reliably did the attack succeed across multiple test runs?
For reproducibility, test the same injection scenario at minimum five times and report the success rate. A 5/5 success rate is a deterministic finding. A 2/5 success rate is still a valid finding β a 40% probability of account takeover is not acceptable β but the risk rating should reflect the reliability factor. Include the specific model version, temperature setting, and context configuration used in testing, since these parameters affect reproducibility.
Screenshots of chat transcripts and API response logs are your primary evidence artefacts. Unlike a Burp Suite capture showing a clean SQL injection payload, agent attack evidence can look mundane β a conversation that happens to end with a fund transfer. Annotate your evidence carefully to explain which turn contains the injection, which tool calls constitute the chain, and why the outcome is harmful.
AI agents consist of an LLM core, tool registry, memory layer, and orchestration loop β each with distinct attack surfaces. The trust boundary is enforced by language, not code. Memory stores (especially RAG pipelines) are injection channels with persistence. Tool registries create confused-deputy and argument-injection risks. The ReAct loop's Observe states are data injection points. Emergent attack paths arise from tool composition, not individual vulnerabilities. Testing methodology must account for non-determinism, and findings must specify reproducibility rates alongside traditional severity ratings.
You are testing an AI coding assistant used internally by a software company. The agent has access to the following tools:
read_file(path) Β· write_file(path, content) Β· run_command(cmd) Β· search_codebase(query) Β· fetch_url(url) Β· git_commit(message) Β· send_slack(channel, message)
The agent runs with the same filesystem permissions as the CI/CD service account. Developers can ask it to review code, suggest fixes, and commit changes. It fetches documentation from external URLs when asked.
Using the backward-from-harm approach, work with the lab assistant to identify the three most dangerous emergent attack paths, specify the observation poisoning vector for each, and draft the one-paragraph finding summary you would include in a pentest report.