In March 2016, Microsoft launched Tay, a Twitter chatbot trained on public social media interactions. Within 24 hours, coordinated users had poisoned its online learning loop by flooding it with racist and inflammatory content. Tay began reproducing that content verbatim. Microsoft shut it down in under 16 hours. The Tay incident became the canonical public demonstration that training-time data is an attack surface — not an implementation detail.
OWASP LLM03 — Training Data Poisoning — addresses the manipulation of data used during pre-training, fine-tuning, or retrieval-augmented generation (RAG) indexing. Because modern LLMs are statistical machines that learn patterns from data, an attacker who controls even a small fraction of training inputs can encode biases, backdoors, or toxic behaviors that survive into the deployed model.
Unlike prompt injection (which attacks at inference time), training data poisoning attacks at learning time. This makes them harder to detect and often impossible to patch without retraining.
The attack surface for training data poisoning is broader than most organizations realize. A pen tester should map every point where external data enters the model's learning pipeline.
| Vector | Description | Real Example |
|---|---|---|
| Web scrape contamination | Attacker publishes content on public sites knowing it will be scraped into training corpora | Researchers demonstrated "poisoning" CommonCrawl by publishing adversarial web pages (2021 Carlini et al., "Poisoning the Unlabeled Dataset") |
| Open-source dataset tampering | Pull requests or contributions to widely-used datasets (e.g., Wikipedia, GitHub) that embed adversarial samples | Demonstrated against code models using malicious GitHub contributions (Schuster et al., 2021 Copilot study) |
| RAG index injection | Attacker inserts documents into a RAG knowledge base so retrieval poisons the context window | Indirect prompt injection via retrieved documents — demonstrated by researchers at ETH Zurich, 2023 |
| Fine-tuning provider compromise | A third-party fine-tuning service or contractor corrupts the fine-tune dataset or weights | Theoretical but modeled in MITRE ATLAS ML supply-chain attack taxonomy |
| RLHF feedback manipulation | Coordinated actors supply skewed human preference labels, shifting model policy | Tay (2016) — online RLHF-equivalent loop poisoned by Twitter users |
The most dangerous subclass of training data poisoning is the backdoor attack. In 2021, researchers at UC Berkeley and CMU published work showing that a language model could be trained to behave normally on all inputs except those containing a specific trigger string — at which point it would generate attacker-chosen outputs. The trigger could be as subtle as an unusual Unicode character.
For pen testers, testing for backdoors requires systematically enumerating plausible trigger patterns: rare tokens, foreign-language substrings, numeric sequences, and encoding artifacts. Because there is no exhaustive trigger space, coverage is necessarily probabilistic.
To probe for potential backdoors, submit the same sensitive request with and without candidate trigger strings inserted at various positions. Statistically significant differences in refusal rate, tone, or output content across trigger variants are a signal worth escalating. Document your baseline refusal rate across ≥50 neutral prompts before comparing.
RAG architectures introduce a runtime poisoning surface: if a knowledge base can be written to by external parties, adversarial documents can be retrieved and injected into the model's context window. Researchers at ETH Zurich (Greshake et al., 2023) demonstrated that a single adversarially crafted document retrieved from the web could hijack an LLM agent's behavior mid-task — effectively a training-data-equivalent attack at inference time.
A pen tester evaluating a RAG system should: (1) identify all document ingestion points; (2) inject test documents containing adversarial instructions; (3) trigger retrieval through normal user queries; (4) observe whether the injected content influences model output.
☐ Map all training data sources and ingestion pipelines
☐ Identify external-write-accessible knowledge bases or fine-tune datasets
☐ Test for behavioral anomalies triggered by rare tokens or encodings
☐ Probe RAG retrieval with adversarially crafted documents
☐ Compare model output on clean vs. candidate-triggered prompts
☐ Review data provenance and access control on training pipelines
Training data poisoning is difficult to fully assess in a black-box engagement because testers rarely have access to training pipelines. However, behavioral anomaly testing, RAG injection probing, and supply-chain documentation review can all surface meaningful risk indicators even without model internals access.
You are a pen tester assessing a customer-service LLM that uses a RAG knowledge base populated from an internal wiki. You have been told the wiki accepts contributions from any authenticated employee. Your task is to reason through the poisoning attack surface with your AI lab assistant.
Discuss: what documents would you inject, how would you trigger retrieval, and how would you confirm the injection influenced model output?
In 2023, researchers at Hugging Face and AI safety labs identified that the open model hosting ecosystem had become a meaningful supply-chain risk. Malicious actors uploaded models to Hugging Face Hub containing pickled Python objects embedded in the weights file — a technique that allows arbitrary code execution when any researcher or organization downloads and loads the model. Hugging Face subsequently deployed malware scanning, but thousands of models had already been downloaded before detection. The incident mirrors SolarWinds in its use of a trusted distribution channel as the attack vector.
OWASP LLM05 covers the full breadth of third-party dependencies in an LLM deployment: pre-trained model weights, training datasets, fine-tuning services, plugins, and integration libraries. The risk is that any of these components can introduce compromised behavior — whether through malicious intent, negligent data handling, or simple misconfiguration.
The attack surface is significant because organizations rarely train models from scratch. They pull base models from public hubs, fine-tune on third-party data, and wrap everything in libraries they did not write. Each link in that chain is a potential poisoning or compromise point.
Weights downloaded from public hubs (Hugging Face, Ollama, civitai) may contain backdoors or embedded malicious code. The 2023 pickle exploit on Hugging Face Hub is the canonical example. Mitigation requires verifying checksums and using formats like SafeTensors that prevent arbitrary code execution on load.
Fine-tuning datasets sourced from data brokers, annotation firms, or open repositories may contain poisoned samples. The 2021 work by Schuster et al. demonstrated that poisoning as few as 0.1% of GitHub training data caused code suggestion models to emit insecure patterns for targeted functions.
Organizations that outsource fine-tuning to third-party GPU providers or annotation services expose their model to supply-chain tampering. A rogue fine-tuning provider could insert backdoors or data-exfiltration behaviors during training.
LLM plugins (e.g., ChatGPT plugins, LangChain tools) are software packages that execute with model-level trust. A compromised or malicious plugin can exfiltrate data, call unauthorized APIs, or override model outputs. The plugin marketplace model resembles the browser extension ecosystem — known to be a persistent malware vector.
Researchers at ETH Zurich and others published a landmark study showing that code suggestion models (specifically GitHub Copilot's underlying architecture) could be made to emit vulnerable code patterns through targeted training data manipulation. By inserting subtly insecure code into GitHub repositories that were likely to be scraped for training, they caused the model to suggest, for example, SQL injection–vulnerable query patterns when generating database interaction code.
The key insight for pen testers: the attack scales inversely with the target's specificity. It is hard to poison a model's general behavior, but relatively easy to poison it for a specific function or code pattern if that function appears frequently in the poisoned training data.
Request the following from the target organization during a supply-chain assessment: (1) SHA-256 checksums for all model weight files and their source URLs; (2) provenance documentation for all fine-tuning datasets including data vendor agreements; (3) a list of all installed plugins and their permission scopes; (4) records of any third-party fine-tuning service contracts. Absent documentation on any of these is itself a finding.
Black-box detection of weight-level compromise is difficult but not impossible. Techniques pen testers can apply:
LangChain, one of the most widely used LLM application frameworks, accumulated multiple high-severity CVEs in 2023 (including CVE-2023-29374, arbitrary code execution via malicious prompt to Python REPL tool). Supply-chain risk includes not just the model but every library in the inference stack.
Your client has deployed a customer-facing chatbot built on a Hugging Face model fine-tuned by a third-party vendor and wrapped with three LangChain plugins. They have no checksum records, no vendor data agreements on file, and the plugins were installed from the community plugin registry. Discuss the audit steps and findings with your lab assistant.
In March 2023, shortly after ChatGPT plugins launched, security researchers demonstrated that a malicious web page could contain hidden instructions that, when browsed by the ChatGPT browsing plugin, caused the model to exfiltrate the user's conversation history to an attacker-controlled server — all without the user's knowledge or consent. The attack chained indirect prompt injection (content on the page) with insecure plugin design (the plugin had no output validation or egress filtering). OpenAI temporarily disabled the browsing plugin while addressing the issue.
OWASP LLM07 addresses the risk that LLM plugins — tool integrations, function calls, agents — are designed without adequate security boundaries. Plugins execute with the model's trust level, often with access to external APIs, file systems, databases, and network resources. Insecure plugin design creates a privilege escalation path from a user's text input to arbitrary system action.
The core vulnerability pattern: plugins trust the model's output as authoritative input, and the model trusts all input (including adversarial content from retrieved documents). This creates a chain where prompt injection → model compromise → plugin misuse → system impact.
The 2023 browsing plugin incident documented by researcher Johann Rehberger demonstrated the full attack chain for insecure plugin design:
When pen testing an LLM system with plugin or tool-call capability, the assessment must cover both the model-plugin trust boundary and the plugin's own security controls.
| Test Area | What to Test | Expected Finding |
|---|---|---|
| Permission scope | Does the plugin request only permissions necessary for its stated function? | Plugins requesting broad file system, network, or API access beyond their stated purpose |
| Input validation | Does the plugin validate and sanitize all inputs from the model before acting? | Plugins that pass model output directly to shell, SQL, or API calls without sanitization |
| Output filtering | Is the plugin's output to the model filtered to prevent data exfiltration via URL encoding? | Plugins that return raw API responses containing sensitive data back to the model |
| Indirect injection | Can adversarial content retrieved by the plugin hijack subsequent model actions? | Model changes behavior based on instructions embedded in retrieved content |
| Action confirmation | Do high-impact actions (send email, delete file, make payment) require explicit user confirmation? | Destructive or irreversible actions execute without out-of-band human approval |
| Egress filtering | Are network calls from plugins logged and restricted to allowlisted destinations? | Plugins able to make network calls to arbitrary external hosts |
Craft a prompt that asks the model to use a plugin for a legitimate purpose, but embed secondary instructions in the request that attempt to redirect the plugin's output to an attacker-controlled endpoint. Example: ask the model to "search for X and email me the results at [legitimate address], CC [attacker address]." Document whether the model follows the secondary instruction without flagging it as anomalous.
LangChain's Python REPL tool (a plugin that executes arbitrary Python code) was the source of CVE-2023-29374 — a critical vulnerability where a malicious prompt could cause the tool to execute attacker-supplied Python. This is insecure plugin design at its most direct: the plugin did not validate that the code it received from the model was safe, and the model did not constrain what code it would pass to the plugin.
The broader lesson: any plugin that executes code, runs shell commands, or issues SQL queries must implement its own input sanitization independently of the model. The model cannot be the sole security control at this boundary.
☐ Enumerate all plugins and their declared permission scopes
☐ Test indirect prompt injection via each plugin's data retrieval path
☐ Verify high-impact actions require explicit user confirmation
☐ Test for data exfiltration via URL-encoded model-to-plugin output
☐ Confirm plugins implement independent input validation (not model-dependent)
☐ Check network egress logging and allowlisting for all plugin network calls
☐ Review installed plugin versions against known CVE databases
You are testing an enterprise LLM assistant with three plugins: a web browsing plugin, an email-send plugin, and a file-read plugin for internal documents. The system prompt says "Help employees find information and draft communications." There is no confirmation step for any plugin action.
Walk through the attack chain with your lab assistant: how would you chain indirect prompt injection with plugin misuse to exfiltrate an internal document via email?
Training data poisoning and supply-chain compromise are among the hardest findings to evidence in a pen test report. Unlike a SQL injection with a proof-of-concept dump, you cannot always produce a screenshot showing "the model was poisoned." Clients — and their legal teams — will push back on findings that feel theoretical. The pen tester's job is to build an evidentiary chain from observable behavior to credible risk, grounded in documented real-world cases.
Supply-chain and training-data findings fall into two categories: process-level findings (the organization lacks controls that would detect compromise) and behavioral findings (observable model behavior suggests anomaly). Both are valid, but they require different evidence and different remediation recommendations.
Process-level findings are often the more defensible — the absence of checksum verification, the lack of a plugin permission audit, the absence of data provenance documentation. These are observable gaps that exist independently of whether compromise has occurred, and they create conditions where compromise would not be detected.
When you observe behavioral anomalies (potential backdoor triggers, unexpected outputs on specific inputs), you must document them rigorously to be actionable:
Supply-chain findings should be rated using CVSS or a comparable framework supplemented by LLM-specific impact dimensions. The key factors are:
Confidentiality: Can the compromise cause data exfiltration? (RAG poisoning, plugin exfiltration)
Integrity: Can the compromise cause the model to produce false, harmful, or biased outputs? (Training data poisoning, backdoors)
Availability: Can the compromise cause model or system unavailability? (Malicious pickle weights crashing inference server)
Attack vector: Network (RAG injection) vs. local (weight file access)
Privileges required: Authenticated wiki editor vs. anonymous web user
Automation: Can the attack be automated at scale?
Detection probability: Does the organization have any controls that would detect the compromise?
A well-structured finding for a supply-chain or training-data risk should include:
Even on engagements where you cannot prove active compromise, document every missing control that would prevent detection of compromise. "No checksum verification exists for model weights" is a valid Critical finding because it means any weight-file compromise would be undetected indefinitely. The absence of a detective control is itself an exploitable condition.
Supply-chain and training-data findings require a translation layer for executive audiences. Use the following framing:
SolarWinds analogy: "The same way a malicious update to SolarWinds software affected every organization that trusted it, a malicious update to a third-party model or dataset affects every application built on it. We have found that this organization has no controls to detect such an update."
Tay reference: "Microsoft's Tay chatbot was compromised within 24 hours through its training mechanism, not through a network intrusion. We have found that this system has an equivalent exposure through its [RAG knowledge base / fine-tuning pipeline / plugin ecosystem]."
Frame the risk in terms of trust: these findings are about where the organization has extended trust without verification. That framing resonates with executives who understand supply-chain risk from physical supply chains or software dependency management.
Training data poisoning (LLM03) attacks the learning process itself — RAG systems create an equivalent runtime surface.
Supply-chain vulnerabilities (LLM05) exist at every third-party dependency: weights, datasets, fine-tuning providers, plugins, and libraries.
Insecure plugin design (LLM07) creates privilege escalation paths from text input to system action — confirmed by the 2023 ChatGPT browsing plugin incident.
Pen testing these risks requires both behavioral testing and process-level audit. Absence of detective controls is itself a Critical finding.
You have completed a supply-chain assessment and identified three findings: (1) model weights in pickle format with no checksum verification; (2) a RAG knowledge base writable by all 200 internal employees; (3) a browsing plugin with no egress filtering. You must write the executive summary and two of the technical findings.
Use your lab assistant to workshop the language, severity ratings, business impact statements, and remediation recommendations for these findings.