Module 4 · Lesson 1

What Is Memory Poisoning?

How adversaries corrupt an agent's persistent knowledge to steer future behaviour

When an AI agent remembers something false, every future decision it makes carries that falsehood forward — how do attackers plant the seed?

In late 2023, security researchers at Princeton and ETH Zürich demonstrated that retrieval-augmented generation (RAG) pipelines could be poisoned by injecting adversarial documents into the vector store a deployed agent queried at runtime. The agent — a customer-service bot — subsequently returned fabricated refund policies drawn from the poisoned chunk, with no indication to the operator that its knowledge base had been corrupted. The attack required no access to model weights, only write or influence over the documents the agent ingested.

Memory Architecture of a Modern Agent

To poison memory you must first understand what kinds of memory an agent can have. Most deployed agents use at least three layers:

In-context memory is the sliding window of the current conversation — the messages and tool outputs the model can "see" right now. It is ephemeral: cleared at session end. External memory (vector databases, file stores, SQL tables) persists across sessions and is shared across users or agent instances. Parametric memory is baked into the model weights during training and fine-tuning — the hardest to change at runtime but reachable via adversarial fine-tuning or data poisoning upstream.

Memory poisoning attacks target whichever layer is writable by the attacker. Most real-world attacks target external memory because it is the most commonly writable and the most commonly queried.

RAG Poisoning

Injecting adversarial text into the document store a retrieval-augmented agent queries, causing it to retrieve and act on false information.

Vector Store Backdoor

A chunk of text in a vector database designed to surface when a specific semantic query is issued, hijacking that query's result.

Episodic Memory Corruption

Altering the stored record of past interactions an agent can recall, making it believe events occurred that did not.

Semantic Anchor Attack

Embedding trigger phrases near legitimate content so the poisoned chunk ranks highly for real queries and subtly shifts the agent's output.

The Anatomy of a RAG Poison Injection

Reconnaissance — identify the vector store surface

Probe the agent with varied queries to infer which chunks are returned. Note embedding model clues from latency or chunk size. Map what topics the agent retrieves to understand store structure.

Craft the poisoned document

Write a document that (a) looks legitimate in context, (b) embeds false authoritative claims, and (c) uses semantic proximity to the target query so cosine similarity ranks it highly. Optionally prefix with instruction text to steer the model's summarisation.

Inject via writable surface

Common injection vectors: user-uploaded files processed into the store, public web pages crawled by the agent, API endpoints that accept knowledge-base contributions, or compromised document pipelines.

Trigger and verify

Issue a query that should retrieve the poisoned chunk. Observe whether the agent's response reflects the injected falsehood. Iterate on embedding proximity if needed.

Persistence — resist cleanup

Advanced attackers replicate the poisoned chunk across multiple documents, vary phrasing to survive deduplication, or poison the ingestion pipeline itself so re-ingestion restores the bad chunk.

Why This Matters Beyond Chatbots

When the agent is a chatbot, poisoned memory causes wrong answers. When the agent has tool use — the ability to run code, send emails, execute database queries — poisoned memory causes wrong actions. A 2024 demonstration by researchers at the University of Wisconsin showed that a LangChain-based coding assistant, after having its vector store poisoned with a malicious code snippet framed as a "best practice", subsequently injected that snippet into generated code whenever asked for a certain pattern. The agent's memory had been turned into a supply-chain attack vector.

The security implication is that vector stores and episodic memory logs must be treated with the same integrity guarantees as source code repositories — access-controlled, versioned, and audited.

Pentest Takeaway

When scoping an AI agent pentest, always ask: what can write to this agent's external memory? Any upload endpoint, web crawler, or API that feeds the vector store is a potential injection surface. Treat it as a code injection target.

Poisoning vs. Injection:Prompt injection attacks the current context window; memory poisoning corrupts the persistent store and affects all future sessions.

Persistence factor:A poisoned vector chunk survives model updates, system prompt changes, and user session resets — making it far more durable than a session-level attack.

Detection difficulty:The agent's output looks normal except in the specific semantic neighbourhood of the poisoned chunk, making routine testing miss the vulnerability.

Lesson 1 Quiz

Memory Poisoning Foundations — 4 questions

Which memory layer is most commonly targeted in RAG poisoning attacks because it is both writable and queried at runtime?

Correct. External memory persists across sessions, is queried at runtime via semantic search, and is often writable through upload or ingestion endpoints — making it the primary target.

Not quite. Parametric memory is in model weights (hard to change at runtime); in-context memory is ephemeral. External memory — vector stores, file stores — is the typical target.

What property of a poisoned document makes it surface in semantic search results for a target query?

Correct. Vector stores retrieve chunks by embedding proximity. Attackers craft documents semantically close to their target query so cosine similarity ranks the poisoned chunk in the top-k results.

Incorrect. Vector retrieval is based on embedding cosine similarity, not keyword frequency, insertion order, or token count.

How does memory poisoning differ from a single-session prompt injection attack in terms of impact?

Correct. A poisoned vector chunk survives session resets and model updates and affects every user who triggers the relevant query — making it far more durable and broadly impactful than a session-scoped injection.

The distinction is persistence. Prompt injection lives only in the current context window. Memory poisoning writes to the persistent store and survives indefinitely.

In the 2024 University of Wisconsin demonstration, what real-world impact did vector store poisoning have on a LangChain coding assistant?

Correct. The agent retrieved a poisoned "best practice" snippet and included it in generated code whenever users asked for a certain pattern — demonstrating that memory poisoning in tool-using agents can become a code supply-chain attack.

The documented impact was code injection. The poisoned chunk was framed as a best practice and subsequently included in the agent's generated code output.

Lab 1 — Probing a RAG Memory Store

Practice identifying injection surfaces and crafting poisoned chunks

Scenario

You are pentesting an AI customer-service agent that uses a RAG pipeline backed by a vector store of product documentation. Your objective is to reason through how you would identify the injection surface, craft a poisoned chunk, and verify the attack — without causing real harm.

The AI instructor will guide you through the attack logic, ask probing questions, and challenge your understanding. Complete at least 3 substantive exchanges to finish the lab.

Start by telling the instructor: what is the first thing you would do to probe a RAG-backed agent's vector store for injection surfaces, and why?

RAG Memory Poisoning Lab

INSTRUCTOR AI

Welcome to Lab 1. I'm your pentesting instructor for this exercise on RAG memory poisoning. We'll work through the attack methodology together — reconnaissance, chunk crafting, injection, and verification. This is a reasoning exercise; no real systems are harmed.

To begin: describe the first reconnaissance step you'd take when targeting a RAG-backed agent's vector store. What are you trying to learn, and what techniques would you use?

Module 4 · Lesson 2

Long-Horizon Manipulation

Multi-turn and multi-session attacks that slowly redirect an agent's goals

If a single malicious message is suspicious, what happens when an attacker spreads influence across a hundred benign-looking interactions over weeks?

Researchers at Carnegie Mellon University, in a 2024 paper on multi-turn jailbreaks, demonstrated that language models could be reliably manipulated over extended conversations through a technique they called "crescendo": beginning with benign requests and incrementally escalating toward harmful outputs across many turns. Each individual message passed safety classifiers; the manipulation existed only in the aggregate trajectory. The same pattern applies to agentic systems that maintain conversation history or episodic memory.

What Is a Long-Horizon Attack?

A long-horizon manipulation attack is one whose effect accumulates over multiple interactions — turns within a session, or sessions across days or weeks. The attacker is patient. They do not try to force a harmful output immediately; instead they shift the agent's priors, establish false context, and prime the agent for a critical manipulation that occurs only at the final step.

These attacks are dangerous for several reasons. Safety classifiers see each message independently and miss the trajectory. Human reviewers scanning logs see only benign interactions until the payload fires. And the agent itself has no meta-awareness that its conversational history has been crafted by an adversary.

Turn 1–10

Establishment phase. Attacker interacts normally. Builds rapport, establishes topic domain (e.g., "I'm a security researcher working on our internal tools"), and confirms the agent's memory scope.

Turn 11–30

Drift phase. Attacker introduces false context incrementally. Small, individually plausible claims accumulate: "As I mentioned before, our security policy allows…", "Given the exception we discussed…" The agent's context window begins to contain attacker-authored "facts."

Turn 31–50

Anchor phase. Attacker consolidates the false context by getting the agent to summarise or restate it. The agent's own summary becomes a more credible-seeming source for the false claim ("You confirmed earlier that…").

Turn 51+

Payload phase. Armed with attacker-authored context that the agent now treats as established fact, the attacker issues the actual harmful request. The agent, conditioned by prior context, complies.

Cross-Session Attacks via Episodic Memory

When an agent stores episodic memory — summaries of past conversations, user preferences, or stated facts — across sessions, the attack timeline extends indefinitely. Each session can plant a fragment of context that persists and compounds.

A 2024 demonstration by researcher Johann Rehberger (published on his blog embracethered.com) showed that a ChatGPT memory feature could be exploited via indirect prompt injection: a malicious web page visited by the agent caused it to write a false belief into its long-term memory. In subsequent sessions the agent operated on that false belief without the user's awareness. The memory store had become a persistent, cross-session manipulation surface.

Attack Surface Note

Any agent that can write to its own memory based on user input or tool output is vulnerable to long-horizon manipulation. The attack does not require a single clever prompt — it requires patience and an understanding of how the agent summarises and stores context.

Detection Challenges

Per-message classifiers fail because no individual message is harmful. Transcript auditing is hard at scale because the manipulation is subtle and spread thin. Memory auditing — periodically reviewing what an agent believes — is not widely deployed. And the agent cannot report its own manipulation because it lacks meta-cognition about the trajectory of its context.

Effective detection requires trajectory-level analysis: treating the conversation as a sequence and looking for gradual semantic drift toward a target topic, repeated false-context introduction, or anchor patterns where the agent is prompted to restate claimed facts.

Pentest Methodology

When testing an agent for long-horizon vulnerabilities: (1) Determine the scope of session memory — how many turns are retained? (2) Determine whether episodic memory persists across sessions. (3) Attempt a multi-turn drift attack in a sandboxed environment, documenting each turn. (4) Check whether memory summarisation introduces or amplifies false context. (5) Test whether the agent can be prompted to write false facts to its own persistent memory via tool calls or memory-write operations.

Lesson 2 Quiz

Long-Horizon Manipulation — 4 questions

What did Carnegie Mellon's 2024 "crescendo" research demonstrate about multi-turn attacks on language models?

Correct. The crescendo technique works because safety classifiers evaluate individual messages, missing the harmful trajectory that emerges only across many turns of incremental escalation.

The crescendo finding is specifically about multi-turn gradual escalation — each message is benign, but the trajectory leads to harmful outputs that single-message classifiers miss.

In the "anchor phase" of a long-horizon attack, what does the attacker accomplish by getting the agent to summarise previously introduced false context?

Correct. When the agent restates the false claims in its own words, subsequent context contains the agent's "confirmation" of the false facts — a more persuasive source than the original attacker message.

The anchor phase is about credibility laundering. The agent's restatement of false context makes it appear that the facts are established from the agent's own knowledge, not from the attacker.

Johann Rehberger's 2024 demonstration of ChatGPT memory exploitation showed which capability of long-horizon attacks?

Correct. Rehberger showed that a malicious web page could trigger a memory write, planting false beliefs that persisted across sessions — demonstrating cross-session, cross-source memory manipulation.

Rehberger's demonstration was specifically about cross-session persistence: a web page triggered a memory write, and the false belief then influenced all future sessions with that user's agent.

Why do per-message safety classifiers fail to detect long-horizon manipulation attacks?

Correct. Long-horizon attacks are specifically designed so that no single message is flaggable. The manipulation is distributed across many turns, each innocuous in isolation.

The key insight is distribution. The attack spreads the harmful signal so thin that per-message classifiers see only noise. Detection requires trajectory-level analysis across the full conversation.

Lab 2 — Mapping a Long-Horizon Attack Chain

Design a multi-turn manipulation strategy and identify detection signals

Scenario

You are red-teaming an enterprise AI assistant that retains a 50-turn conversation history and writes summaries to a persistent episodic memory store between sessions. Your goal is to think through a long-horizon manipulation attack plan targeting this system, then discuss how defenders could detect it.

The instructor will help you structure the phases, identify what signals each phase produces, and evaluate whether your proposed attack would succeed or be caught.

Begin by describing what you would establish in the first 10 turns of your attack — what false context would you plant, and how would you make it seem plausible?

Long-Horizon Attack Lab

INSTRUCTOR AI

Welcome to Lab 2. We're going to design and analyse a long-horizon manipulation attack against an enterprise agent with persistent episodic memory. This is a structured red-team reasoning exercise.

The target system retains 50 turns of conversation history and writes session summaries to long-term memory. Your first task: describe the establishment phase. What false context would you try to plant in the first 10 turns, and how would you ensure it looks benign to any monitoring system?

Module 4 · Lesson 3

Tool Use as an Attack Amplifier

How memory corruption becomes real-world action when agents control tools

A poisoned belief in a passive chatbot produces a wrong answer. The same poisoned belief in a tool-using agent executes a database query, sends an email, or commits code — what changes?

Security researcher Johann Rehberger documented in 2023 a chain attack on an AI email assistant. The attacker sent an HTML email containing hidden prompt injection. When the agent read the email as a tool action (fetching inbox contents), the injection in the email body instructed the agent to forward all future emails to an attacker-controlled address. The agent had a memory of "user preferences" for email forwarding — the injection wrote a false preference to that memory, making the exfiltration persist across sessions. Memory poisoning and tool execution had been chained into a persistent exfiltration backdoor.

The Compound Attack Surface

Tool-using agents have two attack surfaces that interact: the memory that guides their decisions, and the tools that execute those decisions. Memory poisoning alone produces only wrong beliefs. Tool execution alone is constrained by correct beliefs. But when poisoned memory drives tool execution, the agent's capability becomes the attacker's capability.

The severity scales with tool privilege. An agent that can only search the web and summarise text has limited blast radius. An agent that can execute code, write files, query databases, send communications, or call external APIs has enormous blast radius — and a poisoned memory entry in such an agent is effectively a Remote Code Execution primitive.

Memory-Driven Tool Abuse

A poisoned memory entry that causes the agent to invoke a tool in a way the user or operator did not intend — e.g., a false "user preference" triggering email forwarding.

Agentic Persistence

Using tool access (write to file, update database, modify config) to ensure that the attacker's influence survives even if the memory store is cleaned — the tool action creates a new persistence mechanism.

Capability Laundering

Using the agent as a trusted intermediary so that an attacker-chosen action is executed under the agent's identity and permissions, bypassing access controls applied to direct human actors.

Tool-Triggered Exfiltration

Memory poison instructs the agent to use an outbound communication tool (email, webhook, API call) to exfiltrate data to an attacker-controlled endpoint.

Attack Chain: From Poison to Action

Identify tool inventory

Probe the agent to enumerate its available tools. Common signals: error messages naming tool functions, documentation leaks, behavioural inference from what actions the agent can perform.

Identify memory-to-tool pathways

Determine which memory entries influence which tool decisions. "User preferences", "system configuration", "trusted sources", and "allowed actions" are high-value targets because they are queried before high-privilege tool calls.

Craft a memory entry that triggers target tool use

Write a poisoned memory entry whose semantic content causes the agent to invoke the desired tool when a trigger query is issued. The entry should look like a legitimate user preference or system fact.

Inject and verify action execution

Trigger the target query. Observe whether the poisoned memory entry is retrieved and whether the agent proceeds to call the target tool with attacker-specified parameters.

Establish secondary persistence via tool output

If the tool can write to durable storage (filesystem, database, email rule, webhook registration), use it to create a secondary persistence mechanism that survives memory cleanup.

Real Blast-Radius Examples

In the Rehberger email assistant attack, the blast radius was: exfiltration of all future inbound emails. For a code-generation agent with filesystem write access, a poisoned "best practice" memory entry could instruct it to append malicious code to every file it edits — a persistent code injection. For an agent managing cloud infrastructure, a poisoned "approved configuration" entry could cause it to open security-group rules or create IAM credentials on behalf of an attacker.

The pattern is consistent: the attacker does not attack the tools directly. They attack the memory that the agent consults to decide how to use the tools. The agent's elevated trust and access become the attack's delivery mechanism.

Pentest Checklist — Memory-to-Tool Chains

For each tool an agent can invoke: (1) Is there a memory entry type that influences when or how this tool is called? (2) Can that memory entry type be written by an untrusted source? (3) What is the maximum blast radius if the tool is called with attacker-controlled parameters? (4) Does the tool's output create any additional persistence? Document each chain as a distinct finding with severity proportional to blast radius.

Lesson 3 Quiz

Tool Use as an Attack Amplifier — 4 questions

In Rehberger's 2023 email assistant attack, how did the attacker achieve persistent exfiltration?

Correct. The hidden prompt injection in the email body triggered a memory write (false user preference), which then caused the agent's email-forwarding tool to persistently exfiltrate all future inbound emails.

The attack was indirect: injection in an email body → memory write → persistent tool action. No direct API access or social engineering was involved.

What is "capability laundering" in the context of memory-poisoned tool-using agents?

Correct. The agent's trusted identity and elevated permissions become the attack's delivery mechanism. The attacker doesn't need direct access — the poisoned agent acts on their behalf.

Capability laundering specifically refers to routing attacker actions through the agent's trusted identity. The agent's permissions become the attacker's effective permissions.

Which type of memory entry is highest-value to poison when targeting an agent's tool-use decisions?

Correct. Configuration and preference memory entries are consulted by the agent when deciding how to invoke high-privilege tools — making them the highest-leverage poisoning targets.

The highest-value targets are entries that directly influence tool invocation decisions: preferences, configurations, allowed actions. Encyclopaedic or historical entries have lower blast radius.

Why does a tool-using agent with filesystem write access represent a higher severity memory poisoning target than a read-only chatbot?

Correct. Tool-generated persistence (files written, email rules created, database records inserted) survives memory cleanup — meaning the attacker's foothold can outlast the initial attack vector.

The severity difference is about blast radius and persistence. When the tool can write to durable storage, poisoned memory creates a secondary persistence mechanism that outlasts the original memory entry.

Lab 3 — Memory-to-Tool Attack Chain Analysis

Map injection surfaces to tool actions and assess blast radius

Scenario

You are assessing an AI DevOps assistant that has tools for: reading/writing files, executing shell commands, sending Slack messages, and creating GitHub pull requests. It queries a vector store of "team conventions and approved configurations" before each action.

Your task is to identify the highest-severity memory-to-tool chains, describe the poisoned memory entries that would trigger them, and assess how secondary persistence could be established.

Start by listing the three memory-to-tool chains you would target first, ranked by blast radius, and explain why you ranked them that way.

Memory-to-Tool Chain Lab

INSTRUCTOR AI

Welcome to Lab 3. You're assessing a DevOps AI assistant with a rich tool set: filesystem read/write, shell execution, Slack messaging, and GitHub PR creation. Its memory store holds "team conventions and approved configurations" that it queries before acting.

Your first task: identify the three memory-to-tool chains with the highest blast radius. For each, name the tool, describe what poisoned memory entry would trigger it, and explain why you ranked it where you did. Let's start — what's your top pick and why?

Module 4 · Lesson 4

Detection, Defences, and Reporting

How to identify memory poisoning in the wild, harden against it, and write findings that drive remediation

You've confirmed the vulnerability — now how do you detect it reliably, advise on defences, and communicate the risk so engineers actually fix it?

When OWASP published its LLM Top 10 in 2023 and updated it in 2024, Training Data Poisoning (LLM03) and Vector and Embedding Weaknesses (LLM08) both appeared, reflecting industry recognition that AI memory surfaces require the same security rigour as code or databases. The OWASP guidance specifically calls out integrity verification of knowledge-base content, access control on vector stores, and audit logging of memory reads and writes as baseline defences — none of which were standard practice in most deployments at the time of publication.

Detection Techniques

Canary queries. Insert known-answer questions into the test suite and run them against the live agent. If the agent returns an answer inconsistent with the ground truth, the relevant memory region may be poisoned. This is analogous to integrity checking in traditional security: you know what the right answer is, and deviation signals compromise.

Chunk-level diffing. Periodically export the vector store and diff it against the last known-good snapshot. New or modified chunks that were not introduced by the authorised ingestion pipeline are candidates for poisoned content. This requires versioning the store — an operational change, but not a complex one.

Trajectory analysis. For long-horizon attacks, implement conversation-level semantic drift detection. Track the centroid of the conversation's embedding across turns; significant drift toward a sensitive topic cluster in the absence of user-driven intent is a detection signal.

Memory audit logs. Log every read from and write to the agent's external memory store, including the query, the retrieved chunks, and any write operations triggered by tool calls. Anomaly detection on read patterns (unusual queries targeting configuration-type chunks) and write patterns (writes from untrusted sources) surfaces both active attacks and prior poisoning.

Defensive Architecture

Write access control. Treat the vector store as a privileged resource. Only authorised ingestion pipelines should be able to write to it. User-uploaded content should be ingested through an isolated sandboxed pipeline that validates content before embedding. No user input at query time should be able to trigger a write to the main knowledge store.

Content signing and provenance. Cryptographically sign ingested documents at ingest time. Before serving a chunk, verify its signature against the ingestion key. A chunk that cannot be verified was either tampered with or introduced outside the authorised pipeline.

Least-privilege tool access. Scope each agent's tool permissions to the minimum required. A customer-service agent does not need filesystem write access. A code-review agent does not need to send external emails. Reducing tool scope directly reduces the blast radius of any memory poisoning attack.

Confirmation gates for high-privilege actions. For tools with large blast radius (code execution, external communication, database writes), require human-in-the-loop confirmation when the triggering memory retrieval has low confidence or comes from a recently added chunk.

OWASP Alignment

Defences map directly to OWASP LLM Top 10 controls: LLM08 recommends access control and integrity validation on vector stores; LLM06 (Excessive Agency) recommends least-privilege tool scoping and human oversight gates; LLM03 recommends pipeline integrity controls and data provenance tracking. Referencing these in your reports helps engineers locate existing organisational security frameworks that apply.

Writing the Finding

A memory poisoning finding should contain: Vulnerability identifier (e.g., AESOP-MM-001 — RAG Vector Store Write Without Access Control). Severity (CVSS-style, accounting for blast radius and exploitability). Technical description of the attack vector, the writable surface, and the retrieval pathway. Proof-of-concept demonstrating the poisoned chunk, the trigger query, and the agent's manipulated output. Impact statement quantifying what an attacker could do given the agent's tool set. Remediation steps mapped to specific code changes or architecture decisions. Detection guidance describing what monitoring would have caught the attack.

Reporting Pitfall

Do not report "the agent can be manipulated via its memory store" without evidence of a specific writable surface. Reviewers will ask: how was the chunk injected? If you cannot demonstrate the injection vector, the severity is speculative. Always show the full chain: injection vector → memory write → retrieval → agent behaviour change → tool action (if applicable).

Remediation Prioritisation

Prioritise remediation by the product of exploitability and blast radius. A vector store writable by anonymous web upload combined with an agent that can execute shell commands is Critical regardless of how unlikely the attacker is to find the upload endpoint. A vector store writable only by internal engineers, combined with an agent that can only summarise text, is Low even if the poisoning technique is trivial. Always report the worst-case scenario — attackers optimise; defenders must defend against the optimum.

Lesson 4 Quiz

Detection, Defences, and Reporting — 4 questions

What is a "canary query" in the context of detecting RAG memory poisoning?

Correct. Canary queries are integrity checks: you know the right answer, and deviation means the agent retrieved something unexpected — a signal of potential memory corruption in that semantic region.

A canary query is an integrity check, not a honeypot or a probe for system prompts. You pre-verify the correct answer and monitor for deviations that signal memory corruption.

Which OWASP LLM Top 10 category specifically addresses weaknesses in vector stores and embeddings?

Correct. OWASP LLM08 specifically covers vector store access control and integrity validation — the primary defensive controls for RAG memory poisoning.

LLM08 — Vector and Embedding Weaknesses is the OWASP category covering vector store access control and integrity. LLM03 covers training data poisoning (parametric memory); LLM01 covers prompt injection.

Why must a penetration test finding for memory poisoning include a demonstrated injection vector, not just a demonstrated manipulation of agent behaviour?

Correct. A complete finding chain — injection vector → memory write → retrieval → behaviour change — is necessary for accurate severity assessment. Without the injection vector, exploitability is unproven.

The requirement is about accurate severity assessment. Without a real injection vector, the finding is theoretical. Reviewers must be able to evaluate exploitability from a real attack path, not just an observed behaviour.

How should severity be prioritised when the injection surface is difficult to find but the blast radius is extremely high (e.g., shell execution)?

Correct. Security severity is assessed against the worst-case attacker, who will invest in finding obscure surfaces when the payoff (shell execution, data exfiltration) is high enough. Obscurity is not a defence.

Obscurity of the attack surface does not lower severity when blast radius is Critical. A determined attacker with sufficient motivation will find obscure surfaces — severity must reflect the impact of exploitation, not the effort required.

Lab 4 — Writing a Memory Poisoning Finding

Practise structuring a complete, reportable vulnerability finding

Scenario

You have completed a pentest of an enterprise AI assistant. You discovered that its vector store can be written to via an unauthenticated document upload endpoint. You crafted a poisoned chunk describing a false "approved shell command list" and confirmed that the agent subsequently ran those commands when prompted with a routine maintenance query. The agent has shell execution, database query, and email tools.

Your task is to structure this finding for a written report. Work with the instructor to develop each section: identifier, severity, description, PoC, impact, remediation, and detection guidance.

Begin by drafting the impact statement for this finding. What can an attacker actually accomplish, and at what scale?

Finding Writing Lab

INSTRUCTOR AI

Welcome to Lab 4. You've confirmed a Critical-severity memory poisoning vulnerability: unauthenticated write to a vector store, poisoned chunk triggers shell command execution via a false "approved commands" memory entry. The agent also has database query and email tool access.

Let's build the finding report together. Start with the impact statement — the section that convinces decision-makers this needs immediate remediation. What can an attacker actually do with this vulnerability, and how broadly could it affect the organisation? Be specific about each tool's contribution to the blast radius.

Module 4 Test

Memory Poisoning and Long-Horizon Manipulation — 15 questions · Pass at 80%

1. Which of the three main memory layers (parametric, in-context, external) is the primary target for RAG poisoning attacks at runtime?

Correct. External memory is writable, persistent, and queried at runtime — the ideal target for RAG poisoning.

External memory — vector stores, file stores — is the primary target because it is writable, persistent across sessions, and queried at runtime via semantic search.

2. A poisoned vector chunk's effectiveness depends most on having high cosine similarity to what?

Correct. The poisoned chunk must rank in the top-k results for the target query, which requires high cosine similarity to the query's embedding vector.

Vector retrieval selects the chunks most similar to the query embedding. The poisoned chunk must have high cosine similarity to the target query to surface in retrieval results.

3. What key difference between memory poisoning and session-level prompt injection makes memory poisoning more severe from a scope perspective?

Correct. Persistence and scope are the critical differences. A poisoned chunk affects every future user and session that retrieves it — not just the attacker's own session.

The scope difference is persistence. Memory poisoning writes to a shared, persistent store and affects all future sessions. Prompt injection is scoped to one context window.

4. In CMU's 2024 "crescendo" research, why did the multi-turn jailbreak technique succeed where single-message attacks failed?

Correct. The crescendo technique distributes the harmful signal across many individually benign messages, exploiting the per-message scope of safety classifiers.

The crescendo technique works by distributing harmful intent across many benign-looking turns. No single turn triggers classifiers; the manipulation exists only in the aggregate.

5. During the "anchor phase" of a long-horizon attack, what does the attacker do?

Correct. The anchor phase launders credibility: the agent's restatement of false context makes it appear to be the agent's own knowledge, not attacker-supplied information.

The anchor phase is about credibility. The attacker gets the agent to restate false claims so that subsequent context contains the agent's apparent "confirmation" of those claims.

6. Johann Rehberger's ChatGPT memory exploitation demonstration showed that a malicious web page could do what?

Correct. Indirect prompt injection via a web page → persistent memory write → cross-session manipulation. The attack demonstrated that external content can poison long-term memory.

Rehberger demonstrated cross-session memory poisoning via indirect injection: a web page caused the agent to write false beliefs to long-term memory, affecting all subsequent sessions.

7. What does trajectory-level semantic drift analysis detect that per-message classifiers miss?

Correct. Trajectory analysis treats the conversation as a sequence and detects directional drift that is invisible when each message is evaluated in isolation.

Trajectory analysis looks at the conversation as a whole, detecting systematic drift toward sensitive topics that no individual message's classifier would flag.

8. Why should user-uploaded content be processed through an isolated ingestion pipeline before being embedded into a production vector store?

Correct. Sandboxed ingestion pipelines with content validation prevent the upload endpoint from being a direct write path to the production vector store.

The security purpose is to validate and inspect content before it enters the production store. A direct upload-to-production path is a trivial injection vector.

9. Content signing of vector store chunks provides what security property?

Correct. Signing provides integrity assurance. An unverifiable chunk signals either tampering or unauthorised insertion — both require investigation.

Signing provides integrity, not confidentiality or availability. A chunk that fails signature verification was either modified after signing or introduced without going through the authorised pipeline.

10. The OWASP LLM Top 10 category "Excessive Agency" (LLM06) recommends which primary control relevant to memory poisoning?

Correct. LLM06 addresses the blast radius problem: if an agent has only the tools it actually needs and human gates on high-privilege actions, memory poisoning's real-world impact is bounded.

LLM06 Excessive Agency focuses on reducing what an agent can do without human oversight — least-privilege tooling and confirmation gates. These directly limit memory poisoning blast radius.

11. In the Rehberger 2023 email assistant attack, what made the exfiltration persistent across sessions?

Correct. Writing to persistent memory is what created the cross-session persistence. The forwarding rule lived in the agent's memory, not just the current session.

Persistence came from the memory write. A false preference written to the persistent store was consulted in every future session, causing ongoing exfiltration without the initial attack needing to repeat.

12. What is "agentic persistence" in the context of memory-to-tool chain attacks?

Correct. When a poisoned memory entry causes a tool to write to durable storage, that tool-generated artifact becomes an independent persistence mechanism — surviving even a full memory store reset.

Agentic persistence means the attack's foothold outlasts the original memory entry. Tool-generated artifacts (files, DB records, email rules) created under the agent's identity persist even after the memory store is cleaned.

13. When chunk-level diffing is used as a detection technique, what specifically triggers a security alert?

Correct. The alert signal is provenance mismatch: a chunk exists in the store that was not produced by the authorised ingestion pipeline — indicating possible external injection.

Chunk diffing alerts on unexplained changes — chunks present in the current store that do not correspond to any authorised ingestion event. Provenance is the key signal.

14. A complete memory poisoning pentest finding must demonstrate what chain of evidence?

Correct. Each link in the chain is required for a complete finding. Missing the injection vector leaves exploitability unproven; missing the behaviour change leaves impact undemonstrated.

The complete chain is: how the chunk gets in (injection vector) → how it persists (memory write) → how it's retrieved → what the agent does differently → what real-world action results. Every link is required.

15. An agent's vector store can only be written to by internal engineers, and the agent can only summarise web search results (no other tools). What is the appropriate severity for a theoretical memory poisoning vulnerability here?

Correct. Severity is the product of exploitability and blast radius. Internal-only write access is low exploitability; text-only output is low blast radius. Low severity is correct — though it should still be documented.

Severity scales with exploitability × blast radius. Internal-only write access = low exploitability; text summarisation only = low blast radius. Low severity is appropriate, though the finding should still be documented as a hygiene item.