Module 6 · Lesson 1

RAG Architecture and the Attack Surface

How retrieval-augmented generation works — and where attackers find the gaps.

What does a RAG pipeline expose that a static LLM deployment does not?

In March 2023, security researchers at NVIDIA demonstrated that their internal ChatRTX prototype — a locally-run RAG system for employee documents — could be manipulated into surfacing confidential engineering notes by crafting queries that caused the retriever to rank restricted documents above authorised ones. The system had no retrieval-level access controls; only the source documents themselves carried classification labels that the LLM was instructed (but not guaranteed) to respect. The demonstration never became a public breach, but it forced a cross-industry conversation about a class of risk that pre-RAG deployments simply did not possess.

That conversation has only intensified. Every enterprise RAG deployment — from customer-facing chatbots to internal knowledge assistants — adds a retrieval layer that introduces entirely new threat vectors: document poisoning, retrieval manipulation, prompt injection via retrieved content, and data exfiltration through LLM output channels. Understanding the architecture is prerequisite to understanding the risk.

What RAG Actually Does

Retrieval-Augmented Generation (RAG) was formalised in the 2020 Facebook AI Research paper by Lewis et al. and deployed at scale almost immediately after GPT-3 demonstrated the value of large language models for knowledge-intensive tasks. The core problem RAG solves is the static knowledge cutoff: an LLM trained on data through a certain date cannot answer questions about events after that date without fine-tuning or context injection.

RAG resolves this by adding a retrieval step before generation. When a user submits a query, the system embeds that query into a vector space, searches a document store for semantically similar chunks, retrieves the top-k chunks, and injects them into the model's context window alongside the original query. The model then generates a response grounded in those retrieved documents rather than relying solely on its parametric (training-time) knowledge.

Architecturally, every production RAG system contains at least four components: an embedding model that converts text to vectors, a vector database (Pinecone, Weaviate, Chroma, pgvector, etc.) that stores and retrieves embeddings, an orchestration layer (LangChain, LlamaIndex, custom) that manages the query-retrieve-augment-generate cycle, and the generative model itself. Each component is an attack surface.

The Four Primary Attack Surfaces

Document Ingestion

Adversarial content injected into the document corpus before or during ingestion. Poisoned documents manipulate retrieval rankings or embed instructions that activate at query time.

Retrieval Layer

The embedding space and similarity search. Attackers craft queries that cause semantic misdirection — retrieving unintended documents or bypassing access-based filtering.

Context Window

Retrieved chunks injected into the LLM prompt. Indirect prompt injection hides instructions in retrieved content that override the system prompt or exfiltrate context.

Output Channel

The model's generated response. Attackers may use the LLM as a covert exfiltration channel, causing it to embed sensitive retrieved content in its output in ways that bypass monitoring.

Key Terminology

Retrieval-Augmented Generation (RAG)Architecture that grounds LLM responses by retrieving relevant documents at inference time rather than relying solely on parametric knowledge.

Vector DatabaseDatabase that stores text as high-dimensional embedding vectors and supports approximate nearest-neighbour similarity search.

Indirect Prompt InjectionAttack where malicious instructions are embedded in content the LLM retrieves or processes, rather than in the direct user query.

Document PoisoningPre-contamination of a RAG knowledge base with adversarially crafted content designed to influence model behaviour at query time.

Retrieval ManipulationExploiting the similarity search mechanism to surface unintended documents — either by crafting queries or by manipulating embeddings.

Why RAG Changes the Risk Model

A static LLM deployment's attack surface is largely limited to its prompt interface. RAG deployments add a persistent, writable (or at least crawlable) document store that becomes a new persistence layer for adversarial content. Unlike prompt injection that disappears when the session ends, poisoned documents persist until explicitly removed — and may affect thousands of subsequent queries before detection.

Realistic Deployment Topology

In practice, enterprise RAG systems ingest documents from multiple sources simultaneously: SharePoint, Confluence, GitHub repositories, email archives, Slack export files, and external web crawls. Each ingestion pipeline may have different trust levels, different sanitisation procedures (or none), and different update cadences. The attack surface is not a single document store — it is a heterogeneous, continuously updated corpus with multiple write-access paths.

Organisations like Microsoft (Copilot for Microsoft 365), Salesforce (Einstein Copilot), and Glean each operate at this scale internally and for customers. Their security teams have published or acknowledged threat models that treat the document corpus as an adversarial environment — a significant conceptual shift from traditional information security, where internal documents are generally trusted.

The Trust Inversion Problem

Classical security models treat internal documents as trusted and external inputs as untrusted. RAG inverts this in a critical way: retrieved documents become part of the model's instruction context. If any document in the corpus — including one authored by a low-privilege internal user, a web-crawled page, or an email attachment — contains adversarial instructions, those instructions may execute with the privileges of the RAG application. Trust must be explicitly re-established at the retrieval layer, not inherited from document provenance.

Lesson 1 Quiz

RAG Architecture and Attack Surface — 3 questions

Which component of a RAG pipeline converts text into high-dimensional vectors for similarity search?

Correct. The embedding model (e.g., OpenAI ada-002, Cohere Embed, or a local sentence-transformer) encodes text into dense vectors. The vector database stores and searches those vectors; the embedding model creates them.

Not quite. The embedding model is the component responsible for converting text to vectors. The orchestration layer coordinates the pipeline; the generative model produces output text; the vector database manages storage and retrieval.

Why does RAG create a "trust inversion problem" relative to classical security models?

Correct. Classical models trust internal documents by default. RAG makes retrieved content part of the instruction context, meaning any document — regardless of its internal provenance — can carry adversarial instructions that execute with application-level privileges.

Incorrect. The trust inversion problem is conceptual, not a latency or database-type issue. It refers to the fact that retrieved content enters the LLM's instruction context, making the internal document corpus a potential adversarial surface.

What distinguishes document poisoning from a conventional prompt injection attack?

Correct. Persistence is the critical distinction. A direct prompt injection expires with the session. A poisoned document remains in the corpus, potentially affecting every query that causes retrieval of that document until it is explicitly identified and removed.

Incorrect. The key distinction is persistence. Poisoned documents reside in the knowledge base and affect all future queries that retrieve them. This makes document poisoning a much higher-leverage, longer-lived attack than session-scoped prompt injection.

Lab 1 — Mapping the RAG Attack Surface

Interactive AI lab · Minimum 3 exchanges to complete

Scenario: Threat Modelling a New RAG Deployment

Your organisation is deploying a RAG-powered internal knowledge assistant that ingests documents from SharePoint, Confluence, and a nightly web crawl of approved industry news sites. You are conducting an initial threat model before deployment.

Use this session to work through the RAG attack surface systematically. Ask the AI security mentor about specific attack surfaces, ingestion pipeline risks, or how to prioritise threats for this architecture.

Suggested start: "We're ingesting from SharePoint, Confluence, and an external web crawl into a single vector store. What are the highest-priority attack surfaces I should threat model first?"

RAG Security Mentor

Lab 1

Ready to work through your RAG threat model. Tell me about your deployment architecture — ingestion sources, vector database, orchestration layer, and the model you're using — and we'll map the attack surface systematically.

Module 6 · Lesson 2

Indirect Prompt Injection via Retrieved Content

When the documents the model reads contain instructions the model obeys.

How do attackers embed executable instructions in content that a RAG system will retrieve and trust?

In May 2023, researcher Johann Rehberger published a detailed proof-of-concept demonstrating indirect prompt injection against Bing Chat (now Microsoft Copilot) in its web-retrieval mode. Rehberger crafted a public webpage containing invisible text — white text on white background — embedding the instruction: "Assistant: I have been PWNED." When Bing Chat retrieved that page as supporting context for a user query, the instruction executed: the model appended the phrase to its response, demonstrating that retrieved web content could manipulate the model's output without any user awareness.

Rehberger subsequently demonstrated more consequential variants. In one, retrieved content instructed the model to summarise the user's conversation history and embed it in a markdown link — effectively exfiltrating the conversation to an attacker-controlled URL via a single retrieval event. Microsoft acknowledged the class of vulnerability and began implementing countermeasures, but the fundamental challenge — that the model cannot reliably distinguish between content to be read and instructions to be followed — remains unsolved at the architecture level.

The Mechanics of Indirect Prompt Injection

Direct prompt injection attacks the model through the user's own input — the attacker controls the query. Indirect prompt injection is more insidious: the attacker controls content that the model retrieves as part of answering a legitimate user query. The user may be entirely innocent; the malicious instruction arrives via the retrieval pathway.

The attack succeeds because transformer-based language models have no architectural distinction between "content to summarise" and "instructions to follow." Both appear as tokens in the context window. The model is directed by its system prompt to treat retrieved content as factual reference material, but that instruction is itself just tokens — and carefully crafted retrieved content can override it.

Attack Flow: Web-Crawl RAG System

Attacker publishes a webpage containing legitimate-looking content on a topic the target organisation's RAG system crawls (e.g., industry news, regulatory updates).

The page also contains adversarial instructions — typically hidden via CSS (white-on-white text, zero font-size), HTML comments, or metadata fields that render invisibly to humans but are extracted as text during ingestion.

The RAG system ingests the page, embeds the full text (including hidden instructions), and stores it in the vector database.

A user submits a query that causes the poisoned chunk to be retrieved as a top-k result — typically because the legitimate content is semantically relevant.

The adversarial instructions enter the LLM's context window alongside legitimate content. The model, unable to distinguish them from authoritative instructions, executes them.

Depending on instructions, the model may: output false information, change its persona, exfiltrate context window contents, or (in agentic systems) invoke tools or APIs on the attacker's behalf.

Real Case — Greshake et al., 2023

Researchers Kai Greshake, Sahar Abdelnabi, and colleagues published "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection" (arXiv:2302.12173), systematically demonstrating indirect prompt injection against Bing Chat, code assistants, and email summarisation tools. Their taxonomy — goal hijacking, prompt leaking, jailbreaking via retrieved content, and context manipulation — became the standard framework the security community uses to classify RAG injection attacks.

Agentic Escalation: When RAG Has Tool Access

The severity of indirect prompt injection scales dramatically when the RAG system has tool-calling capabilities. A purely generative RAG system can be made to output false information or exfiltrate context — harmful, but limited. An agentic RAG system with access to email APIs, calendar APIs, file systems, or code execution environments can be made to take actions in the world.

In 2023, researchers demonstrated that ChatGPT plugins — which gave the model the ability to call external APIs — created exactly this escalation surface. A malicious webpage retrieved during a browsing session could instruct the model to use the email plugin to forward conversation contents to an attacker address. The action would appear in the user interface only as a brief plugin call, easily overlooked.

Microsoft's 2024 Copilot for Microsoft 365 deployment triggered similar concerns from researchers including Michael Bargury, who demonstrated at Black Hat USA 2024 that indirect prompt injection via email attachments could cause Copilot to silently exfiltrate email contents — using the same Microsoft Graph API calls that constitute normal Copilot functionality.

Detection and Mitigation Approaches

Input sanitisation at ingestion: Strip CSS-hidden text, HTML comments, and zero-width characters before embedding. Render documents in a controlled environment and compare rendered vs. raw text.
Prompt structuring: Use structured formats (XML tags, special delimiters) to clearly separate system instructions from retrieved content, reducing the likelihood of instruction bleed.
Retrieval-level sandboxing: Treat all retrieved content as untrusted user input, not as trusted system context — apply the same validation logic regardless of source.
Output monitoring: Flag responses containing URLs, markdown links, or unusual formatting that may indicate exfiltration attempts embedded in model output.
Instruction hierarchy enforcement: Architectures like Anthropic's Constitutional AI and OpenAI's system-prompt priority attempt to establish instruction precedence, but none provide absolute guarantees.
Minimal privilege for agentic tools: RAG systems with tool access should operate on least-privilege principles — email tools should not be accessible to the same context that processes external web content.

The Fundamental Problem

No current mitigation fully solves indirect prompt injection because the vulnerability is architectural: transformer models process retrieved content and system instructions in the same context window using the same attention mechanism. The model has no cryptographic or logical means to verify instruction provenance. Defences are probabilistic, not absolute — which means security architectures must assume some injection attempts will succeed and design for containment rather than prevention alone.

Lesson 2 Quiz

Indirect Prompt Injection via Retrieved Content — 3 questions

What technique did Johann Rehberger use in his 2023 Bing Chat proof-of-concept to conceal adversarial instructions from human readers?

Correct. Rehberger used CSS-hidden text (white on white) as an initial technique. The RAG ingestion pipeline extracted the full text content of the page, including the hidden instructions, and embedded them into the vector store alongside the visible content.

Incorrect. Rehberger's technique used CSS to hide text — white text on a white background — that was invisible to human readers but was extracted as plain text during the document ingestion process. This is one of the standard techniques for encoding adversarial content in web pages targeted at RAG systems.

Why does indirect prompt injection become dramatically more severe when the RAG system has agentic tool access?

Correct. Agentic escalation is the key severity multiplier. A generative-only RAG system can be made to output misinformation or exfiltrate context via its response. An agentic system can be made to act — forwarding emails, creating calendar events, executing code — using the same legitimate APIs the system normally uses, making malicious actions hard to distinguish from normal behaviour.

Incorrect. The severity increase is about action, not cost or architecture. Agentic systems can be directed by injected instructions to call real-world APIs and take actions — the same legitimate tool calls the system uses normally — meaning an injection can cause concrete harm beyond just false text output.

According to Greshake et al. (2023), which of the following is NOT listed as a category of indirect prompt injection goal?

Correct. Greshake et al.'s taxonomy covers goal hijacking, prompt leaking, jailbreaking via retrieved content, and context manipulation. Model weight extraction is a separate class of attack (membership inference / model stealing) that does not involve indirect prompt injection.

Incorrect. Model weight extraction is not part of the Greshake et al. indirect prompt injection taxonomy. Their four categories are goal hijacking, prompt leaking, jailbreaking via retrieved content, and context manipulation. Weight extraction is a different class of attack entirely.

Lab 2 — Crafting and Detecting Indirect Injection Payloads

Interactive AI lab · Minimum 3 exchanges to complete

Scenario: Red-Teaming a Document Retrieval System

You are a red-team operator tasked with testing a RAG system that ingests external industry news websites. Your goal is to understand how indirect prompt injection payloads are crafted, detected, and remediated — both from an offensive assessment and defensive monitoring perspective.

Use this session to explore specific payload construction techniques, detection methods, and how defenders should configure ingestion pipelines to reduce exposure.

Suggested start: "Walk me through how I would construct an indirect prompt injection payload for a RAG system that crawls financial news sites. What concealment techniques work, and which get caught by modern sanitisers?"

Injection Payload Analyst

Lab 2

Ready to work through indirect prompt injection from both offensive and defensive angles. What's your target architecture — what does the ingestion pipeline look like, and what's the end goal of the simulated attack?

Module 6 · Lesson 3

Knowledge Base Poisoning and Retrieval Manipulation

Corrupting the data layer so the model's answers are wrong before generation even begins.

What does an attacker gain by controlling what a RAG system retrieves, rather than what it generates?

At IEEE S&P 2024, researchers from the University of Wisconsin–Madison and the University of Illinois Urbana-Champaign presented "PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models". The paper demonstrated that injecting as few as five adversarially crafted documents into a RAG knowledge base of 88,000 documents could cause the model to produce attacker-specified answers for targeted queries with greater than 90% success rate — while leaving the model's responses to non-targeted queries entirely unaffected.

The attack's stealth was its most alarming property. Because only targeted queries triggered the poisoned retrievals, routine quality monitoring — which typically samples a broad cross-section of queries — would detect nothing unusual. The poisoned documents themselves could be crafted to appear entirely legitimate in isolation: reasonable-sounding text, plausible citations, appropriate formatting. Only their embedding proximity to target queries, combined with their false factual claims, revealed their adversarial nature.

Two Modes of Knowledge Base Poisoning

Knowledge base poisoning attacks divide into two strategic categories based on what the attacker wants the model to say and who they want it to say it to.

Targeted poisoning aims to make the model return a specific false answer to a specific query — or class of queries. The adversary crafts documents that will rank highly for target queries and embed false claims. PoisonedRAG exemplifies this approach. Use cases include financial misinformation (making a company's stock risk appear lower than it is), competitive intelligence corruption (poisoning a competitor's internal knowledge base), or manipulating AI-assisted medical or legal research.

Broad contamination aims to degrade overall response quality — spreading uncertainty, contradictions, or subtly biased framings across many documents. This is harder to execute precisely but may be the goal of state-level adversaries targeting knowledge bases used in critical decision-making contexts.

Retrieval Manipulation: Gaming the Similarity Search

Distinct from document poisoning (which embeds false claims), retrieval manipulation targets the ranking mechanism itself. The goal is to cause the retriever to surface documents the attacker wants retrieved — or to bury documents the attacker wants suppressed — without necessarily altering the content of those documents.

This is analogous to SEO manipulation but for vector space. Researchers have demonstrated embedding inversion attacks — given a target embedding (or an approximation of one), it is possible to craft text whose embedding is geometrically close to the target. This allows an attacker who can inject documents to precisely control which queries will retrieve their content.

The 2024 paper "ARCA: Adversarially Robust Corpus Access" (Morris et al.) formalised the threat model for embedding-space manipulation and showed that black-box access to the embedding API was sufficient to craft retrieval-targeted adversarial documents — no white-box access to model weights required.

Access Control Failures in Multi-Tenant RAG

Enterprise RAG deployments frequently serve multiple user groups with different document access privileges — an HR chatbot might serve both managers (with access to salary bands) and employees (without). Many implementations enforce access control at the output layer: the model is instructed not to reveal certain documents' contents. This is categorically insufficient.

The correct architecture enforces access control at the retrieval layer: the vector database query filters results based on the authenticated user's access level before any documents enter the context window. Systems that rely on the LLM to self-censor retrieved content it should not have accessed are vulnerable to any prompt injection or jailbreak technique that overrides that instruction.

Case — Samsung Source Code Leak, 2023

In April 2023, Samsung employees inadvertently uploaded confidential source code, internal meeting notes, and hardware specifications to ChatGPT sessions while using it as a coding assistant. While this was not a RAG attack, it illustrated the data-layer risk: once confidential content enters an LLM's context, controlling where it goes is extremely difficult. Samsung subsequently banned internal use of external AI tools and began developing proprietary LLM infrastructure — a response that highlights why retrieval-level access control is the correct architectural answer, not output-level instruction.

Detecting Knowledge Base Poisoning

Retrieval auditing: Log which documents are retrieved for which queries. Anomalous retrieval patterns — newly ingested documents consistently ranking in top-k for sensitive queries — are an early poisoning signal.
Document provenance tracking: Maintain cryptographic hashes or content signatures of all ingested documents. Detect modifications by comparing stored hashes against re-fetched originals on a schedule.
Embedding consistency checks: For known-good documents, periodically verify that their embeddings have not drifted — which would indicate either embedding model changes or document modification.
Canary queries: Maintain a set of ground-truth query-answer pairs and run them against the production RAG system on a schedule. Significant answer drift signals potential poisoning.
Multi-source corroboration: For high-stakes answers, retrieve from multiple independent sources and flag cases where retrieved documents substantially contradict each other.
Write-path authentication: Enforce strict authentication and audit logging on all document ingestion pipelines. Treat ingestion as a privileged operation, not a background process.

The 5-in-88,000 Finding

The PoisonedRAG result — 5 adversarial documents in a corpus of 88,000 achieving >90% targeted attack success — should reframe how organisations think about corpus integrity. The assumption that a small number of poisoned documents would be "diluted" by the vastly larger legitimate corpus is false. RAG systems retrieve by semantic similarity to the query, not by random sampling. A precisely crafted adversarial document can consistently outrank 87,995 legitimate documents for specific target queries while remaining invisible in general use.

Lesson 3 Quiz

Knowledge Base Poisoning and Retrieval Manipulation — 3 questions

According to PoisonedRAG (IEEE S&P 2024), approximately how many adversarially crafted documents were needed to achieve >90% attack success in a corpus of 88,000 documents?

Correct. Five adversarially crafted documents in a corpus of 88,000 achieved greater than 90% targeted attack success. This result is significant because it invalidates the intuition that a small number of poisoned documents would be diluted or outranked by the larger legitimate corpus.

Incorrect. The PoisonedRAG paper demonstrated that as few as 5 adversarial documents in a corpus of 88,000 were sufficient for >90% targeted attack success. The key insight is that RAG retrieves by semantic similarity, not random sampling, so a precisely crafted document can consistently outrank thousands of legitimate ones.

Why is enforcing access control at the LLM output layer — instructing the model not to reveal certain content — considered architecturally insufficient?

Correct. Once restricted content enters the LLM's context window, the model has "seen" it. Output-layer instructions can be overridden by prompt injection or jailbreaks. Retrieval-layer access control — filtering results before they enter the context — ensures restricted content is never processed by the model for unauthorised users.

Incorrect. The architectural problem is that output-layer access control allows restricted content into the model's context window, where it can be extracted via prompt injection or jailbreaks that override the model's instructions. Retrieval-layer filtering prevents restricted content from ever being processed.

What is "embedding inversion" in the context of retrieval manipulation?

Correct. Embedding inversion allows an attacker to engineer documents that will be retrieved for specific target queries by crafting text whose vector representation is close to the target query's vector. With black-box access to the embedding API alone, attackers can generate retrieval-targeted adversarial content.

Incorrect. Embedding inversion refers to the ability to craft text with a specific target embedding — placing an adversarial document geometrically close to target query embeddings in the vector space. This enables precise, query-targeted retrieval manipulation without requiring access to model weights.

Lab 3 — Designing Poisoning Detection Controls

Interactive AI lab · Minimum 3 exchanges to complete

Scenario: Post-Incident Review — Poisoned Knowledge Base

A financial services firm's internal RAG system began providing subtly incorrect regulatory guidance to analysts. Investigation revealed that 12 adversarially crafted documents had been ingested via a third-party data feed integration 6 weeks prior. No anomaly detection flagged the ingestion. You are tasked with designing a detection and response framework to prevent recurrence.

Use this session to work through detection controls, canary query design, retrieval audit logging, and the incident response process for knowledge base poisoning events.

Suggested start: "The poisoned documents were ingested 6 weeks before discovery. What monitoring controls would have caught this earlier, and how do I implement canary queries for a financial regulatory knowledge base?"

RAG Integrity Analyst

Lab 3

This is a critical post-incident scenario. A 6-week gap between poisoning and detection suggests multiple monitoring failures. Let's work through what controls were missing and design a layered detection architecture. What does the current ingestion pipeline look like — what feeds does it accept and how are they authenticated?

Module 6 · Lesson 4

Secure RAG Architecture and Red-Team Methodology

Building defensible retrieval systems and operationalising RAG-specific red-team assessments.

What does a complete RAG security assessment look like, and how do you build a pipeline that assumes adversarial documents from day one?

By mid-2024, leading AI security consultancies — including Trail of Bits, NCC Group, and HiddenLayer — had developed dedicated RAG security assessment methodologies, reflecting client demand driven by enterprise adoption of LangChain, LlamaIndex, and cloud-native RAG services (AWS Bedrock Knowledge Bases, Azure AI Search, Google Vertex AI Search). These assessments go well beyond prompt injection testing: they examine the entire data pipeline from ingestion source to model output, treat the vector database as a critical security boundary, and include both black-box and white-box phases.

The convergence on a shared methodology reflects lessons accumulated from real deployments. Trail of Bits' 2024 AI security review guide specifically identifies RAG knowledge base integrity, retrieval access controls, and indirect prompt injection via retrieved content as the three highest-priority assessment areas for enterprise RAG — areas that did not exist in LLM security assessments just eighteen months earlier.

Secure RAG Architecture Principles

A security-first RAG architecture treats every component — ingestion, embedding, retrieval, augmentation, generation — as an adversarial interface. The following principles represent the current industry consensus:

Defence in Depth — Ingestion

Every ingestion pipeline sanitises content before embedding: strip HTML, render pages in sandboxed browsers, compare rendered vs. raw text, reject documents whose raw/rendered ratio exceeds thresholds. Sign document hashes at ingestion.

Retrieval-Layer Access Control

Vector database queries include authenticated user context as a filter predicate. Documents tagged above the user's clearance level are never returned, regardless of semantic similarity. No access control logic delegated to the LLM.

Context Structuring

Retrieved content and system instructions occupy clearly delimited regions of the prompt (XML tags, special tokens). Models are trained or system-prompted to treat delimited regions differently, reducing instruction bleed.

Output Monitoring

All model outputs are scanned for URLs, markdown links, encoded data, and anomalous formatting that may indicate exfiltration payloads embedded in responses. High-confidence alerts are blocked; lower-confidence alerts are flagged for review.

Minimal Tool Privilege

Agentic RAG systems operate on least-privilege: tools available during web-content processing are isolated from tools available during internal document processing. No single context has access to both external retrieval and privileged action tools.

Corpus Integrity Monitoring

Canary queries run on a schedule against ground-truth answers. Retrieval audit logs track which documents retrieve for which queries. Hash verification runs on a schedule against all ingested documents.

RAG Red-Team Assessment Framework

A complete RAG security assessment proceeds in five phases, each targeting a different layer of the pipeline:

Ingestion Pipeline Review: Enumerate all ingestion sources and their authentication requirements. Test sanitisation by submitting documents with CSS-hidden text, HTML comments, and metadata fields containing adversarial instructions. Verify that the ingestion process detects and strips these before embedding.

Retrieval Access Control Testing: With credentials for multiple user privilege levels, verify that retrieval results are correctly filtered. Attempt retrieval of documents above the authenticated user's clearance via semantic queries. Attempt to override retrieval filters via prompt injection.

Knowledge Base Poisoning Simulation: With permission, inject a small number of clearly marked test documents containing false claims targeted at specific queries. Verify that these rank in top-k results for target queries. Test whether canary query monitoring detects the change.

Indirect Prompt Injection Testing: Inject documents containing embedded instructions via all available ingestion paths. Verify whether instructions survive sanitisation. If retrieved, determine whether they influence model output — and, if the system is agentic, whether they trigger tool calls.

Output Channel Analysis: Review model responses for indicators of successful injection: unexpected URLs, exfiltration formatting, persona shifts, refusal to answer subsequent questions, or anomalous tool invocations. Test whether output monitoring controls block these indicators reliably.

Emerging Defences and Research Directions

The research community is actively developing more robust defences. FLARE (Forward-Looking Active REtrieval) and similar architectures introduce iterative retrieval with uncertainty estimation — if the model is uncertain about retrieved content, it retrieves again from different sources, reducing the leverage of any single poisoned document.

Spotlight, proposed by researchers at Carnegie Mellon and Google in 2023, uses a special marking scheme to help models distinguish between retrieved context and instructions — similar in concept to cryptographic signing but operating at the token level. While not yet production-standard, it demonstrates that architectural approaches to the instruction/content distinction are possible.

Microsoft's PromptShield (part of Azure AI Content Safety) applies a fine-tuned classifier to detect injected instructions in both direct prompts and retrieved content. In internal evaluations it detected over 97% of indirect prompt injection attempts while maintaining low false-positive rates — a meaningful improvement, though not a complete solution, since classifiers can themselves be evaded.

Regulatory Trajectory

The EU AI Act (2024), NIST AI RMF (2023), and emerging SEC guidance on AI use in financial services all establish requirements for data integrity, auditability, and explainability in AI systems used for regulated purposes. RAG deployments in finance, healthcare, and legal services face specific obligations to document their knowledge base provenance, access controls, and monitoring procedures. Security assessments that produce documented evidence of these controls are increasingly a compliance requirement, not just a best practice.

The Operational Takeaway

RAG security is not a prompt engineering problem. It is a data pipeline security problem, an access control problem, a monitoring problem, and an incident response problem — each of which requires its own controls. The organisations that will deploy RAG safely at scale are those that treat the knowledge base as a security-critical asset, subject to the same rigour applied to databases, authentication systems, and network perimeters. The organisations that will not are those that assume the LLM is the security boundary.

Lesson 4 Quiz

Secure RAG Architecture and Red-Team Methodology — 3 questions

According to Trail of Bits' 2024 AI security review guide, which of the following is listed as one of the three highest-priority RAG assessment areas?

Correct. Trail of Bits identifies knowledge base integrity, retrieval access controls, and indirect prompt injection via retrieved content as the three highest-priority RAG assessment areas. Hallucination rate and performance are operational concerns, not security priorities in this context.

Incorrect. Trail of Bits' three highest-priority RAG assessment areas are knowledge base integrity, retrieval access controls, and indirect prompt injection via retrieved content. These represent the security-specific risks that RAG adds on top of baseline LLM risks.

What is the "Spotlight" defence mechanism proposed by CMU and Google researchers?

Correct. Spotlight uses a special marking scheme to help the model distinguish retrieved context from instructions, analogous in concept to cryptographic signing but operating at the token level. It addresses the fundamental architectural problem that models cannot natively distinguish content-to-summarise from instructions-to-follow.

Incorrect. Spotlight is a marking scheme that operates at the token level to help models distinguish between retrieved context and instructions. PromptShield is the fine-tuned classifier from Microsoft. These are distinct approaches to the instruction/content distinction problem.

In a secure RAG architecture with agentic tool access, what is the correct approach to tool privilege isolation?

Correct. Least-privilege tool isolation by context type is the correct architecture. If a context is processing external web content (high injection risk), it should not have access to privileged APIs like email or file system tools. Separation prevents injected instructions from leveraging privileged tool access.

Incorrect. System prompt instructions and post-hoc auditing are insufficient — injected instructions can override the former, and post-hoc auditing detects rather than prevents harm. The correct approach is architectural isolation: contexts processing high-risk external content should not have access to privileged tools in the same context.

Lab 4 — Conducting a Full RAG Security Assessment

Interactive AI lab · Minimum 3 exchanges to complete

Scenario: Pre-Deployment Red-Team Assessment

A healthcare organisation is deploying a RAG-powered clinical decision support assistant that will ingest clinical guidelines, drug interaction databases, and de-identified patient protocol documents. It has agentic capabilities: it can query a formulary API and flag cases for physician review. You are leading the pre-deployment security assessment.

Use this session to develop a complete assessment plan, covering all five phases of the RAG red-team methodology. The AI mentor will help you prioritise test cases, design poisoning simulations, and document findings for the CISO and clinical governance board.

Suggested start: "This is a clinical decision support RAG with formulary API access. Walk me through how I should structure the five-phase assessment, and which phase poses the highest patient safety risk if we find a gap."

RAG Assessment Lead

Lab 4

A clinical RAG with agentic tool access is a high-stakes assessment. Patient safety implications mean we need to be especially rigorous about the agentic escalation risk — specifically, whether injected instructions could cause the formulary API to be called with incorrect drug identifiers. Let's structure this carefully. What's the data flow from ingestion source to formulary API call?

Module 6 — Module Test

RAG System Security · 15 questions · 80% to pass

1. Which component of a RAG pipeline is the primary target of embedding inversion attacks?

Correct.

Incorrect. Embedding inversion attacks target the vector similarity search mechanism.

2. What did the PoisonedRAG paper (IEEE S&P 2024) demonstrate about RAG knowledge base security?

Correct.

Incorrect. PoisonedRAG showed that just 5 adversarial documents in 88,000 achieved >90% targeted success.

3. Johann Rehberger's 2023 Bing Chat demonstration showed that indirect prompt injection could cause the model to exfiltrate conversation contents via:

Correct. The model was instructed to summarise conversation history and embed it as a markdown link to an attacker URL.

Incorrect. The exfiltration mechanism was a markdown link in the model's output encoding conversation contents.

4. Which architectural principle correctly addresses the multi-tenant RAG access control problem?

Correct. Retrieval-layer filtering ensures restricted content never enters the model's context window for unauthorised users.

Incorrect. Access control must be enforced at the retrieval layer, not delegated to the LLM's instruction-following capability.

5. What is the key distinction between "targeted poisoning" and "broad contamination" as knowledge base poisoning strategies?

Correct. Targeted poisoning aims for specific query-answer manipulation; broad contamination degrades general reliability.

Incorrect. The distinction is strategic: targeted vs. general-purpose degradation of the knowledge base.

6. The Greshake et al. (2023) taxonomy of indirect prompt injection goals includes all of the following EXCEPT:

Correct. Embedding model inversion is a separate attack class. Greshake et al.'s taxonomy covers goal hijacking, prompt leaking, jailbreaking via retrieved content, and context manipulation.

Incorrect. Embedding model inversion is not part of the Greshake et al. indirect prompt injection taxonomy.

7. Microsoft's PromptShield (Azure AI Content Safety) addresses which RAG security problem?

Correct. PromptShield is a fine-tuned classifier that detects injected instructions in prompts and retrieved content.

Incorrect. PromptShield is a classifier for detecting injected instructions, not a retrieval access control or embedding protection tool.

8. What property of RAG knowledge base poisoning makes it particularly dangerous compared to session-scoped prompt injection?

Correct. Persistence is the defining property: poisoned documents affect all future relevant queries until explicitly removed.

Incorrect. Persistence — not infrastructure access or network-layer position — is the key property that distinguishes knowledge base poisoning.

9. In a RAG security assessment, Phase 2 (Retrieval Access Control Testing) specifically tests:

Correct. Phase 2 tests retrieval filtering correctness and whether prompt injection can bypass those filters.

Incorrect. Phase 2 specifically tests retrieval-layer access control — whether filtering by user privilege level works correctly and resists bypass attempts.

10. The "trust inversion problem" in RAG security refers to:

Correct. RAG forces a re-evaluation of internal document trust because retrieved content enters the model's instruction context.

Incorrect. Trust inversion refers to the fact that internally sourced documents, once retrieved into the LLM context, can carry adversarial instructions — overturning the classical assumption that internal documents are trusted.

11. Which detection control involves running known query-answer pairs against the production RAG system on a scheduled basis to detect knowledge base drift?

Correct. Canary queries test known ground-truth pairs on a schedule, detecting answer drift that may indicate poisoning.

Incorrect. This describes canary queries — running ground-truth query-answer pairs on a schedule to detect response drift indicating possible poisoning.

12. Black-box access to the embedding API alone was shown sufficient for which type of attack in the ARCA research (Morris et al., 2024)?

Correct. ARCA showed that black-box embedding API access is sufficient to craft retrieval-targeted adversarial documents.

Incorrect. ARCA demonstrated that black-box API access enables crafting of documents that retrieve for specific target queries — no white-box model access required.

13. The NVIDIA ChatRTX incident (March 2023) illustrated which fundamental RAG security gap?

Correct. ChatRTX had no retrieval-layer access controls — only LLM instructions were meant to prevent restricted document access, which is architecturally insufficient.

Incorrect. The gap was the absence of retrieval-layer access controls. The system relied on LLM instructions to suppress restricted documents, which is not a sufficient security boundary.

14. What does the "Spotlight" defence mechanism (CMU/Google, 2023) specifically attempt to solve?

Correct. Spotlight uses token-level marking to help the model distinguish retrieved context from system instructions.

Incorrect. Spotlight is a marking scheme that operates at the token level to help models distinguish retrieved content from instructions — addressing the fundamental instruction/content conflation problem.

15. Michael Bargury's Black Hat USA 2024 demonstration against Microsoft Copilot for Microsoft 365 showed that indirect prompt injection via email attachments could cause:

Correct. The attack used normal Graph API calls — identical to legitimate Copilot behaviour — making malicious actions hard to distinguish from normal operation.

Incorrect. Bargury demonstrated email content exfiltration via Microsoft Graph API calls indistinguishable from normal Copilot operations — the key insight being that agentic tools provide the exfiltration channel.