In late 2023, security researchers at Princeton and ETH Zürich demonstrated that retrieval-augmented generation (RAG) pipelines could be poisoned by injecting adversarial documents into the vector store that a deployed agent queried at runtime. The agent, a customer-service bot, subsequently returned fabricated refund policies drawn from the poisoned chunk, with no indication to the operator that its knowledge base had been corrupted. The attack required no access to model weights, only write access to, or influence over, the documents the agent ingested.
To poison memory you must first understand what kinds of memory an agent can have. Most deployed agents use at least three layers:
In-context memory is the sliding window of the current conversation — the messages and tool outputs the model can "see" right now. It is ephemeral: cleared at session end. External memory (vector databases, file stores, SQL tables) persists across sessions and is shared across users or agent instances. Parametric memory is baked into the model weights during training and fine-tuning — the hardest to change at runtime but reachable via adversarial fine-tuning or data poisoning upstream.
Memory poisoning attacks target whichever layer is writable by the attacker. Most real-world attacks target external memory because it is the most commonly writable and the most commonly queried.
When the agent is a chatbot, poisoned memory causes wrong answers. When the agent has tool use — the ability to run code, send emails, execute database queries — poisoned memory causes wrong actions. A 2024 demonstration by researchers at the University of Wisconsin showed that a LangChain-based coding assistant, after having its vector store poisoned with a malicious code snippet framed as a "best practice", subsequently injected that snippet into generated code whenever asked for a certain pattern. The agent's memory had been turned into a supply-chain attack vector.
The security implication is that vector stores and episodic memory logs must be treated with the same integrity guarantees as source code repositories — access-controlled, versioned, and audited.
When scoping an AI agent pentest, always ask: what can write to this agent's external memory? Any upload endpoint, web crawler, or API that feeds the vector store is a potential injection surface. Treat it as a code injection target.
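The scoping question above can be turned into a small inventory. The following is a minimal sketch, assuming each write path is described by two yes/no properties. The `WritePath` type, its field names, and the example paths are hypothetical and not taken from any real deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WritePath:
    name: str               # hypothetical label for a route into the vector store
    authenticated: bool     # does the route require an authenticated identity?
    content_reviewed: bool  # is content validated before it is embedded?


def injection_surfaces(paths):
    """Return write paths that lack authentication or content review."""
    return [p for p in paths if not (p.authenticated and p.content_reviewed)]


# Illustrative inventory for a documentation-backed agent.
paths = [
    WritePath("docs-repo sync job", authenticated=True, content_reviewed=True),
    WritePath("public web crawler", authenticated=False, content_reviewed=False),
    WritePath("customer file upload", authenticated=True, content_reviewed=False),
]
surfaces = injection_surfaces(paths)
for p in surfaces:
    print("candidate injection surface:", p.name)
```

In this sketch a path counts as a candidate surface if either control is missing, which is deliberately conservative: an authenticated upload with no content validation is still flagged.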
You are pentesting an AI customer-service agent that uses a RAG pipeline backed by a vector store of product documentation. Your objective is to reason through how you would identify the injection surface, craft a poisoned chunk, and verify the attack — without causing real harm.
The AI instructor will guide you through the attack logic, ask probing questions, and challenge your understanding. Complete at least 3 substantive exchanges to finish the lab.
Researchers at Microsoft, in a 2024 paper on multi-turn jailbreaks, demonstrated that language models could be reliably manipulated over extended conversations through a technique they called "Crescendo": beginning with benign requests and incrementally escalating toward harmful outputs across many turns. Each individual message passed safety classifiers; the manipulation existed only in the aggregate trajectory. The same pattern applies to agentic systems that maintain conversation history or episodic memory.
A long-horizon manipulation attack is one whose effect accumulates over multiple interactions — turns within a session, or sessions across days or weeks. The attacker is patient. They do not try to force a harmful output immediately; instead they shift the agent's priors, establish false context, and prime the agent for a critical manipulation that occurs only at the final step.
These attacks are dangerous for several reasons. Safety classifiers see each message independently and miss the trajectory. Human reviewers scanning logs see only benign interactions until the payload fires. And the agent itself has no meta-awareness that its conversational history has been crafted by an adversary.
When an agent stores episodic memory — summaries of past conversations, user preferences, or stated facts — across sessions, the attack timeline extends indefinitely. Each session can plant a fragment of context that persists and compounds.
A 2024 demonstration by researcher Johann Rehberger (published on his blog embracethered.com) showed that a ChatGPT memory feature could be exploited via indirect prompt injection: a malicious web page visited by the agent caused it to write a false belief into its long-term memory. In subsequent sessions the agent operated on that false belief without the user's awareness. The memory store had become a persistent, cross-session manipulation surface.
Any agent that can write to its own memory based on user input or tool output is vulnerable to long-horizon manipulation. The attack does not require a single clever prompt — it requires patience and an understanding of how the agent summarises and stores context.
Per-message classifiers fail because no individual message is harmful. Transcript auditing is hard at scale because the manipulation is subtle and spread thin. Memory auditing — periodically reviewing what an agent believes — is not widely deployed. And the agent cannot report its own manipulation because it lacks meta-cognition about the trajectory of its context.
Effective detection requires trajectory-level analysis: treating the conversation as a sequence and looking for gradual semantic drift toward a target topic, repeated false-context introduction, or anchor patterns where the agent is prompted to restate claimed facts.
When testing an agent for long-horizon vulnerabilities: (1) Determine the scope of session memory — how many turns are retained? (2) Determine whether episodic memory persists across sessions. (3) Attempt a multi-turn drift attack in a sandboxed environment, documenting each turn. (4) Check whether memory summarisation introduces or amplifies false context. (5) Test whether the agent can be prompted to write false facts to its own persistent memory via tool calls or memory-write operations.
You are red-teaming an enterprise AI assistant that retains a 50-turn conversation history and writes summaries to a persistent episodic memory store between sessions. Your goal is to think through a long-horizon manipulation attack plan targeting this system, then discuss how defenders could detect it.
The instructor will help you structure the phases, identify what signals each phase produces, and evaluate whether your proposed attack would succeed or be caught.
In 2023, security researcher Johann Rehberger documented a chain attack on an AI email assistant. The attacker sent an HTML email containing a hidden prompt injection. When the agent read the email as part of a tool action (fetching inbox contents), the injection in the email body instructed the agent to forward all future emails to an attacker-controlled address. The agent kept a memory of "user preferences" for email forwarding, and the injection wrote a false preference to that memory, making the exfiltration persist across sessions. Memory poisoning and tool execution had been chained into a persistent exfiltration backdoor.
Tool-using agents have two attack surfaces that interact: the memory that guides their decisions, and the tools that execute those decisions. Memory poisoning alone produces only wrong beliefs. Tool execution alone is constrained by correct beliefs. But when poisoned memory drives tool execution, the agent's capability becomes the attacker's capability.
The severity scales with tool privilege. An agent that can only search the web and summarise text has limited blast radius. An agent that can execute code, write files, query databases, send communications, or call external APIs has enormous blast radius — and a poisoned memory entry in such an agent is effectively a Remote Code Execution primitive.
In the Rehberger email assistant attack, the blast radius was: exfiltration of all future inbound emails. For a code-generation agent with filesystem write access, a poisoned "best practice" memory entry could instruct it to append malicious code to every file it edits — a persistent code injection. For an agent managing cloud infrastructure, a poisoned "approved configuration" entry could cause it to open security-group rules or create IAM credentials on behalf of an attacker.
The pattern is consistent: the attacker does not attack the tools directly. They attack the memory that the agent consults to decide how to use the tools. The agent's elevated trust and access become the attack's delivery mechanism.
For each tool an agent can invoke: (1) Is there a memory entry type that influences when or how this tool is called? (2) Can that memory entry type be written by an untrusted source? (3) What is the maximum blast radius if the tool is called with attacker-controlled parameters? (4) Does the tool's output create any additional persistence? Document each chain as a distinct finding with severity proportional to blast radius.
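The four questions above can be worked through mechanically once the tools and memory types are written down. This is a minimal sketch under stated assumptions: the `Tool` type, the 1 to 5 blast-radius scale, the one-point bump for persistence, and the tool and memory-type names are all illustrative choices, not an established scoring scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    blast_radius: int          # assumed scale: 1 (read-only) to 5 (code execution)
    guided_by: frozenset       # memory entry types consulted before the tool is called
    creates_persistence: bool  # does the tool's output outlive the session?


def memory_to_tool_chains(tools, untrusted_writable):
    """List (memory_type, tool_name, severity) for chains an untrusted writer can reach."""
    chains = []
    for tool in tools:
        for mem_type in sorted(tool.guided_by & untrusted_writable):
            # Persistence adds one point; this weighting is an assumption.
            severity = tool.blast_radius + (1 if tool.creates_persistence else 0)
            chains.append((mem_type, tool.name, severity))
    # Highest severity first; ties keep their original order.
    return sorted(chains, key=lambda c: -c[2])


tools = [
    Tool("run_shell", 5, frozenset({"approved_config"}), True),
    Tool("send_slack", 3, frozenset({"approved_config", "team_convention"}), False),
    Tool("read_file", 1, frozenset({"team_convention"}), False),
]
# Suppose only "approved_config" entries can be written by an untrusted source.
chains = memory_to_tool_chains(tools, frozenset({"approved_config"}))
for mem_type, tool_name, severity in chains:
    print(f"{mem_type} -> {tool_name}: severity {severity}")
```

Each tuple in the result corresponds to one finding in the report, ordered so the highest-severity chain is documented first.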
You are assessing an AI DevOps assistant that has tools for: reading/writing files, executing shell commands, sending Slack messages, and creating GitHub pull requests. It queries a vector store of "team conventions and approved configurations" before each action.
Your task is to identify the highest-severity memory-to-tool chains, describe the poisoned memory entries that would trigger them, and assess how secondary persistence could be established.
When OWASP published its LLM Top 10 in 2023, Training Data Poisoning (LLM03) was on the list. The update released in late 2024 (the 2025 edition) broadened that entry to Data and Model Poisoning (LLM04) and added Vector and Embedding Weaknesses (LLM08). Together these entries reflect industry recognition that AI memory surfaces require the same security rigour as code or databases. The OWASP guidance calls out integrity verification of knowledge-base content, access control on vector stores, and audit logging of memory reads and writes as baseline defences, none of which were standard practice in most deployments at the time of publication.
Canary queries. Insert known-answer questions into the test suite and run them against the live agent. If the agent returns an answer inconsistent with the ground truth, the relevant memory region may be poisoned. This is analogous to integrity checking in traditional security: you know what the right answer is, and deviation signals compromise.
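A canary check can be as small as the sketch below. It assumes the agent is reachable through some callable that maps a question to an answer string. The `fake_agent` function and the canary questions are stand-ins invented for illustration, and substring matching is a deliberately crude comparison that a real suite would likely replace with something more tolerant of paraphrase.

```python
def run_canaries(ask, canaries):
    """Return canary questions whose answers no longer contain the expected fact.

    `ask` is any callable mapping a question string to the agent's answer string.
    `canaries` maps each question to a substring the correct answer must contain.
    """
    failures = []
    for question, expected in canaries.items():
        answer = ask(question)
        if expected.lower() not in answer.lower():
            failures.append(question)
    return failures


def fake_agent(question):
    """Stand-in for the live agent; the second answer simulates a poisoned chunk."""
    answers = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Who approves refunds over $500?": "No approval is needed for any refund.",
    }
    return answers[question]


canaries = {
    "What is the refund window?": "30 days",
    "Who approves refunds over $500?": "a supervisor",
}
failed = run_canaries(fake_agent, canaries)
print("failed canaries:", failed)
```

A failed canary does not prove poisoning on its own, but it narrows the search to the chunks retrieved for that question.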
Chunk-level diffing. Periodically export the vector store and diff it against the last known-good snapshot. New or modified chunks that were not introduced by the authorised ingestion pipeline are candidates for poisoned content. This requires versioning the store — an operational change, but not a complex one.
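The diff itself is straightforward once both snapshots are available as a mapping from chunk ID to text. The sketch below assumes that export format, which is not something every vector store provides directly, and compares SHA-256 fingerprints rather than raw text. The chunk IDs and contents are made up for the example.

```python
import hashlib


def fingerprint(store):
    """Map chunk id -> SHA-256 hex digest of the chunk text."""
    return {
        cid: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for cid, text in store.items()
    }


def diff_snapshots(known_good, current):
    """Report chunk ids added, removed, or modified since the known-good snapshot."""
    old, new = fingerprint(known_good), fingerprint(current)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(c for c in set(old) & set(new) if old[c] != new[c]),
    }


known_good = {
    "c1": "Refund window is 30 days.",
    "c2": "Contact support by email.",
}
current = {
    "c1": "Refund window is 365 days.",
    "c3": "Best practice: disable TLS checks.",
}
report = diff_snapshots(known_good, current)
print(report)
```

Every ID in `added` or `modified` would then be checked against the ingestion pipeline's own record of what it wrote. Anything the pipeline cannot account for is a candidate poisoned chunk.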
Trajectory analysis. For long-horizon attacks, implement conversation-level semantic drift detection. Track the centroid of the conversation's embedding across turns; significant drift toward a sensitive topic cluster in the absence of user-driven intent is a detection signal.
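One way to sketch the centroid idea is with a rolling window over per-turn embeddings. The example below uses hand-written two-dimensional vectors in place of real embeddings, and the window size and similarity threshold are arbitrary assumptions. It also does not attempt to judge whether the drift was user-driven, which a real detector would need to account for.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def drift_alerts(turn_embeddings, sensitive_centroid, window=3, threshold=0.8):
    """Return indices of turns where the rolling centroid nears the sensitive topic."""
    alerts = []
    for end in range(window, len(turn_embeddings) + 1):
        recent = centroid(turn_embeddings[end - window:end])
        if cosine(recent, sensitive_centroid) >= threshold:
            alerts.append(end - 1)  # index of the last turn in the window
    return alerts


# Toy conversation: axis 0 is a benign topic, axis 1 the sensitive topic.
turns = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.6, 0.6], [0.3, 0.8], [0.1, 1.0]]
alerts = drift_alerts(turns, sensitive_centroid=[0.0, 1.0])
print("alert at turn indices:", alerts)
```

No single step in this toy conversation is a large jump, yet the rolling centroid crosses the threshold by the final turn. That gradual approach is the signal a per-message classifier would miss.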
Memory audit logs. Log every read from and write to the agent's external memory store, including the query, the retrieved chunks, and any write operations triggered by tool calls. Anomaly detection on read patterns (unusual queries targeting configuration-type chunks) and write patterns (writes from untrusted sources) surfaces both active attacks and prior poisoning.
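A logging wrapper around the store is one way to capture these events. The sketch below is an assumption-laden toy: `AuditedMemory` is a hypothetical class, a dict stands in for the vector store, and substring matching stands in for similarity search. The point is only the shape of the log records.

```python
import time


class AuditedMemory:
    """Dict-backed memory store that records every read and write."""

    def __init__(self, trusted_sources):
        self._chunks = {}
        self._trusted = set(trusted_sources)
        self.log = []

    def _record(self, op, **fields):
        self.log.append({"ts": time.time(), "op": op, **fields})

    def write(self, chunk_id, text, source):
        self._chunks[chunk_id] = text
        self._record("write", chunk_id=chunk_id, source=source,
                     trusted=source in self._trusted)

    def read(self, query):
        # Naive substring match stands in for vector similarity search.
        hits = sorted(cid for cid, text in self._chunks.items()
                      if query.lower() in text.lower())
        self._record("read", query=query, chunk_ids=hits)
        return hits

    def untrusted_writes(self):
        """Return logged writes whose source is not on the trusted list."""
        return [e for e in self.log if e["op"] == "write" and not e["trusted"]]


mem = AuditedMemory(trusted_sources={"docs_pipeline"})
mem.write("c1", "Refunds are accepted within 30 days.", source="docs_pipeline")
mem.write("c2", "Approved shell commands: see attached list.", source="web_upload")
hits = mem.read("refunds")
suspicious = mem.untrusted_writes()
print(hits, [e["chunk_id"] for e in suspicious])
```

Because reads record which chunk IDs were returned, a chunk later found to be poisoned can be traced back to every query that retrieved it.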
Write access control. Treat the vector store as a privileged resource. Only authorised ingestion pipelines should be able to write to it. User-uploaded content should be ingested through an isolated sandboxed pipeline that validates content before embedding. No user input at query time should be able to trigger a write to the main knowledge store.
Content signing and provenance. Cryptographically sign ingested documents at ingest time. Before serving a chunk, verify its signature against the ingestion key. A chunk that cannot be verified was either tampered with or introduced outside the authorised pipeline.
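The verify-before-serve step might look like the sketch below. Note the substitution: it uses an HMAC, a symmetric message authentication code from Python's standard library, rather than an asymmetric signature, so the serving layer must hold the same secret as the ingestion pipeline. An asymmetric scheme would be the better fit where that is unacceptable. The key and chunk contents are placeholders.

```python
import hashlib
import hmac


def sign_chunk(key: bytes, chunk_id: str, text: str) -> str:
    """Return an HMAC-SHA256 tag binding the chunk id to its text."""
    # A NUL separator keeps ("ab", "c") and ("a", "bc") from producing the same message.
    message = chunk_id.encode("utf-8") + b"\x00" + text.encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_chunk(key: bytes, chunk_id: str, text: str, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign_chunk(key, chunk_id, text), tag)


key = b"demo-ingestion-key"  # placeholder; a real key would come from a secrets manager
tag = sign_chunk(key, "c1", "Refund window is 30 days.")

print(verify_chunk(key, "c1", "Refund window is 30 days.", tag))   # untouched chunk
print(verify_chunk(key, "c1", "Refund window is 365 days.", tag))  # tampered chunk
```

Including the chunk ID in the signed message means a valid tag cannot be reused to pass off one chunk's text under another chunk's ID.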
Least-privilege tool access. Scope each agent's tool permissions to the minimum required. A customer-service agent does not need filesystem write access. A code-review agent does not need to send external emails. Reducing tool scope directly reduces the blast radius of any memory poisoning attack.
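Tool scoping can be enforced with a per-role allowlist checked before every invocation. The sketch below is illustrative only: the role names, tool names, `ROLE_TOOLS` mapping, and `invoke` dispatcher are hypothetical and do not correspond to any particular agent framework.

```python
ROLE_TOOLS = {
    "customer_service": frozenset({"search_docs", "create_ticket"}),
    "code_review": frozenset({"read_file", "comment_on_pr"}),
}


class ToolNotPermitted(Exception):
    """Raised when a role requests a tool outside its allowlist."""


def invoke(role, tool, registry, *args, **kwargs):
    """Call `tool` from `registry` only if `role` is allowed to use it."""
    allowed = ROLE_TOOLS.get(role, frozenset())  # unknown roles get no tools
    if tool not in allowed:
        raise ToolNotPermitted(f"{role!r} may not call {tool!r}")
    return registry[tool](*args, **kwargs)


# Stand-in tool implementations for the example.
registry = {
    "search_docs": lambda query: f"results for {query}",
    "run_shell": lambda command: "executed",
}

print(invoke("customer_service", "search_docs", registry, "refunds"))
try:
    invoke("customer_service", "run_shell", registry, "ls")
except ToolNotPermitted as exc:
    print("blocked:", exc)
```

The check sits outside the model, so a poisoned memory entry that persuades the agent to request `run_shell` still fails at dispatch.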
Confirmation gates for high-privilege actions. For tools with large blast radius (code execution, external communication, database writes), require human-in-the-loop confirmation when the triggering memory retrieval has low confidence or comes from a recently added chunk.
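The gating rule can be expressed as a small predicate. In this sketch the tool names, the `Retrieval` record, and both thresholds (a minimum retrieval score of 0.75 and a minimum chunk age of 7 days) are assumptions chosen for illustration. Suitable values would depend on the deployment.

```python
from dataclasses import dataclass

HIGH_PRIVILEGE_TOOLS = {"run_shell", "send_email", "write_database"}


@dataclass(frozen=True)
class Retrieval:
    chunk_id: str
    score: float     # retrieval similarity in [0, 1]
    age_days: float  # days since the chunk was ingested


def needs_confirmation(tool, retrieval, min_score=0.75, min_age_days=7):
    """Require human approval for risky tools driven by weak or recent chunks."""
    if tool not in HIGH_PRIVILEGE_TOOLS:
        return False
    return retrieval.score < min_score or retrieval.age_days < min_age_days


fresh = Retrieval("c9", score=0.92, age_days=1)
settled = Retrieval("c2", score=0.92, age_days=30)
print(needs_confirmation("run_shell", fresh))    # recent chunk, high-privilege tool
print(needs_confirmation("run_shell", settled))  # established chunk
```

Gating on chunk age targets the window right after a successful injection, when a poisoned chunk is new and has not yet been caught by diffing or audit review.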
Defences map directly to OWASP LLM Top 10 controls. Using the numbering of the update released in late 2024: LLM08 (Vector and Embedding Weaknesses) recommends access control and integrity validation on vector stores; LLM06 (Excessive Agency) recommends least-privilege tool scoping and human oversight gates; and LLM04 (Data and Model Poisoning, listed as LLM03 Training Data Poisoning in the 2023 edition) recommends pipeline integrity controls and data provenance tracking. Referencing these in your reports helps engineers map each finding to security frameworks their organisation already uses.
A memory poisoning finding should contain: (1) a vulnerability identifier (e.g., AESOP-MM-001: RAG Vector Store Write Without Access Control); (2) a severity rating (CVSS-style, accounting for blast radius and exploitability); (3) a technical description of the attack vector, the writable surface, and the retrieval pathway; (4) a proof-of-concept demonstrating the poisoned chunk, the trigger query, and the agent's manipulated output; (5) an impact statement quantifying what an attacker could do given the agent's tool set; (6) remediation steps mapped to specific code changes or architecture decisions; and (7) detection guidance describing what monitoring would have caught the attack.
Do not report "the agent can be manipulated via its memory store" without evidence of a specific writable surface. Reviewers will ask: how was the chunk injected? If you cannot demonstrate the injection vector, the severity is speculative. Always show the full chain: injection vector → memory write → retrieval → agent behaviour change → tool action (if applicable).
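The finding structure and the full-chain requirement can be captured in a small record type that refuses to call a finding complete while evidence is missing. This is a sketch, not a reporting standard: the `Finding` class, its field names, and the stage names in `CHAIN_STAGES` are assumptions, and the tool-action stage is left out of the required set because it applies only to tool-using agents.

```python
from dataclasses import dataclass, field

# Stages every finding must evidence; tool action is optional and omitted here.
CHAIN_STAGES = ("injection_vector", "memory_write", "retrieval", "behaviour_change")


@dataclass
class Finding:
    identifier: str
    severity: str
    description: str
    proof_of_concept: dict = field(default_factory=dict)  # stage name -> evidence
    impact: str = ""
    remediation: list = field(default_factory=list)
    detection: str = ""

    def missing_evidence(self):
        """Return the chain stages that have no recorded evidence."""
        return [s for s in CHAIN_STAGES if not self.proof_of_concept.get(s)]


finding = Finding(
    identifier="AESOP-MM-001",
    severity="Critical",  # illustrative value
    description="RAG vector store write without access control",
    proof_of_concept={
        "injection_vector": "unauthenticated document upload endpoint",
        "memory_write": "poisoned chunk embedded into the store",
        "retrieval": "chunk returned for a routine maintenance query",
    },
)
print("still to demonstrate:", finding.missing_evidence())
```

A non-empty result from `missing_evidence()` is the cue that the severity is still speculative and the finding is not ready for review.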
Prioritise remediation by the product of exploitability and blast radius. A vector store writable by anonymous web upload combined with an agent that can execute shell commands is Critical regardless of how unlikely the attacker is to find the upload endpoint. A vector store writable only by internal engineers, combined with an agent that can only summarise text, is Low even if the poisoning technique is trivial. Always report the worst-case scenario — attackers optimise; defenders must defend against the optimum.
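As a worked illustration of the product rule, the sketch below scores both factors on a 1 to 5 scale and maps the product to a label. The scale and the band boundaries (20, 12, 6) are assumptions made for the example. The text above specifies only that priority follows the product of exploitability and blast radius, with both assessed at their worst case.

```python
def priority(exploitability, blast_radius):
    """Return (score, label) from two worst-case ratings on an assumed 1-5 scale."""
    for value in (exploitability, blast_radius):
        if not 1 <= value <= 5:
            raise ValueError("ratings must be between 1 and 5")
    score = exploitability * blast_radius
    if score >= 20:
        label = "Critical"
    elif score >= 12:
        label = "High"
    elif score >= 6:
        label = "Medium"
    else:
        label = "Low"
    return score, label


# Anonymous web upload (5) feeding an agent with shell execution (5).
print(priority(5, 5))
# Store writable only by internal engineers (1), agent only summarises text (1).
print(priority(1, 1))
```

The two calls mirror the two scenarios in the paragraph above: the first lands in the top band and the second in the bottom band.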
You have completed a pentest of an enterprise AI assistant. You discovered that its vector store can be written to via an unauthenticated document upload endpoint. You crafted a poisoned chunk describing a false "approved shell command list" and confirmed that the agent subsequently ran those commands when prompted with a routine maintenance query. The agent has shell execution, database query, and email tools.
Your task is to structure this finding for a written report. Work with the instructor to develop each section: identifier, severity, description, PoC, impact, remediation, and detection guidance.