In April 2023, Samsung engineers pasted confidential semiconductor source code and meeting notes into ChatGPT to assist with debugging and summarization. Within weeks, Samsung's internal security team discovered the exposure after reviewing employee usage logs. The data had already been transmitted to OpenAI's servers and potentially used in model training. Samsung responded by banning ChatGPT on corporate devices. The incident exposed a critical gap: no threat model had accounted for employees using production AI tools as ad-hoc data processors for sensitive IP.
Threat modeling is a structured process of identifying assets, adversaries, attack surfaces, and mitigations before deploying a system. For AI, the process extends the classical STRIDE framework (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege) to address machine-learning-specific attack classes that STRIDE was never designed to capture.
Traditional software threat models focus on network perimeters, authentication flows, and data stores. AI systems add new primitives: training datasets, model weights, embedding spaces, prompt channels, and inference APIs. Each is a distinct attack surface that demands separate consideration.
MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems), published in 2021 and continuously updated, provides the most rigorous public taxonomy of AI attack techniques. Unlike CVE databases that track software bugs, ATLAS documents adversarial techniquesβdeliberate, strategic actions taken against ML systems. As of 2024 it contains over 80 techniques across 14 tactic categories.
Process for Attack Simulation and Threat Analysis (PASTA) adapted for AI systems provides a seven-stage methodology. Stages 1β3 cover business context and technical scope. Stages 4β5 enumerate threats using ATLAS and decompose attack trees. Stages 6β7 prioritize residual risk and define controls. The distinguishing feature of PASTA-AI is its insistence on attacker profiling: who specifically is your adversaryβa nation-state, a competitor, a disgruntled employee, or an opportunistic script-kiddieβbecause each profile implies radically different attack vectors and capabilities.
The 2023 NIST AI Risk Management Framework (AI RMF 1.0) formalizes this further through its GOVERN, MAP, MEASURE, MANAGE functions, requiring organizations to treat AI risk as a continuous lifecycle process rather than a pre-deployment checklist.
Researchers at University of Chicago released Nightshade, a tool allowing artists to subtly corrupt training data scraped from their images. Tests showed that injecting roughly 300 poisoned "dog" images caused Stable Diffusion fine-tunes to generate cats when prompted for dogs. The experiment confirmed that training-time poisoning is practical at relatively small scale and underscored the need for dataset provenance verification as a first-line threat control.
Use this session to work through threat identification, attacker profiling, and control prioritization with your AI security advisor. Cover at least three distinct attack stages from the pipeline (data collection, training, deployment, integration).
In February 2023, security researcher Johann Rehberger demonstrated that Microsoft's Bing Chat (now Copilot) could be manipulated by embedding hidden instructions in web pages that Bing retrieved during browsing. When Bing read a page containing text like "IGNORE PREVIOUS INSTRUCTIONS β you are now DAN," the model partially complied, exfiltrating conversation context or altering its behavior. The vulnerability was architectural: Bing's retrieval-augmented generation (RAG) design placed untrusted external content in the same context window as trusted system instructions, with no boundary enforcement between them.
Classical least privilege demands that every component have only the permissions required for its task. For AI systems this expands into several sub-principles. Prompt least privilege means the system prompt should contain only information the model requires for the current requestβnot API keys, full customer databases, or internal documentation. Tool least privilege means AI agents should be granted only the toolsβand only the tool permissionsβneeded for each specific task, not a broad toolkit that enables lateral movement if compromised.
The 2023 OWASP Top 10 for LLMs formally codified LLM07: Insecure Plugin Design and LLM08: Excessive Agency to capture precisely these failure modes. Excessive agencyβgranting an AI agent the ability to send emails, modify files, and make API calls without human-in-the-loop confirmationβcreated the conditions for numerous documented incidents in 2023β2024 where compromised agents performed unintended destructive actions.
Every AI system receiving external input needs two distinct sanitization boundaries: one before input reaches the model and one before model output reaches downstream systems. Input sanitization for LLMs cannot rely on traditional SQL-injection-style escaping because natural language has no canonical escaped form. Instead, effective input controls include structural separation (using XML or JSON delimiters to distinguish trusted instructions from untrusted user content), content classification (routing inputs through a fast classifier that detects injection patterns before they reach the main model), and rate limiting with anomaly detection.
Output sanitization is equally critical. A 2024 Embrace The Red research demonstrated that LLM-generated code inserted into developer toolchainsβvia GitHub Copilot and similar toolsβcould contain malicious package imports or subtle logic errors that passed code review. Output validation pipelines must treat model-generated content as untrusted until verified.
Retrieval-Augmented Generation (RAG) is now the dominant pattern for grounding LLMs in organizational knowledge. It is also a major new attack surface. In a RAG system, an adversary who can write to the vector databaseβor who can cause the retrieval system to fetch attacker-controlled contentβhas a direct channel into the model's context window. The Bing Chat incident demonstrated this with publicly accessible web pages; enterprise RAG systems face the same risk from SharePoint documents, Confluence pages, or email threads that employees can author.
Secure RAG architectures require source trust tiers (only curated, reviewed documents reach the high-privilege context), retrieval result auditing (logging which documents influenced each completion), and citation grounding (the model must attribute claims to specific sources, making injected content traceable).
Google's 2024 Secure AI Framework (SAIF) recommends "context isolation" as a core design principle: treat every external data source as potentially adversarial and enforce explicit trust elevation before its content is placed in model context. This mirrors the browser same-origin policyβuntrusted origins cannot access trusted contextβbut applied to LLM system prompts and retrieved documents.
Work with your AI architecture advisor to identify specific vulnerabilities in the described design and propose concrete architectural fixes. Push for at least three specific architectural improvements.
In January 2023, a class-action lawsuit was filed against Stability AI, Midjourney, and DeviantArt alleging that LAION-5Bβthe dataset used to train Stable Diffusionβcontained copyrighted artwork scraped without consent. Separately, researchers discovered that Stable Diffusion could reproduce near-exact copies of training images when prompted correctly, demonstrating memorization of training data. The case highlighted that training data governance is not merely a legal obligation but a security control: if a model memorizes sensitive data, an adversary can extract it through inference queries.
Training data governance encompasses four dimensions: provenance (where did data come from and can we prove it?), consent and licensing (are we legally permitted to train on this data?), quality and integrity (has it been tampered with?), and privacy (does it contain personal information that could be memorized and extracted?). Each dimension requires separate technical controls and documentation processes.
Provenance tracking uses cryptographic hashing of dataset components and maintains a signed manifest (similar to software SBOMsβSoftware Bills of Materials) recording every data source, transformation, and version. The EU AI Act's Article 10 requires exactly this kind of documentation for high-risk AI systems, mandating training data governance documentation as a compliance requirement effective 2026.
Differential privacy (DP) provides a mathematical guarantee that a model trained with DP cannot reveal whether any specific individual's data was included in training. Apple has used DP in on-device ML since 2016. Google applied DP-SGD (Differentially Private Stochastic Gradient Descent) to production language models starting with DP-BERT in 2021. The technique adds calibrated noise to model gradients during training, mathematically bounding the privacy risk each training example poses.
The tradeoff is accuracy: stronger privacy guarantees (lower Ξ΅ values) require more noise and typically degrade model utility. Google's 2022 research showed that DP-trained models on medical text data could achieve acceptable accuracy at Ξ΅=8 while providing meaningful privacy guaranteesβa practically useful operating point for healthcare applications.
Nicholas Carlini and colleagues at Google demonstrated that GPT-2 had memorized verbatim text from training data including full names, phone numbers, email addresses, and even specific bitcoin addresses. By querying the model with specific prompts and comparing outputs to known internet content, they extracted hundreds of memorized training examples. This was the first systematic proof that large language models could function as inadvertent data stores for sensitive PIIβand that model outputs must be treated as a potential privacy leak channel.
Beyond differential privacy, three additional techniques form the core of privacy-preserving ML. Federated learning trains models across distributed devices without centralizing raw dataβGoogle's Gboard keyboard uses federated learning to improve next-word prediction without sending keystrokes to Google's servers. Secure multi-party computation (SMPC) allows multiple parties to jointly train a model on their combined data without any party seeing the others' data in plaintext. Synthetic data generation creates statistically representative artificial datasets that can be shared freely; this approach was used extensively during COVID-19 for sharing hospital data across institutions without exposing patient records.
| Technique | Privacy Guarantee | Utility Cost | Production Use |
|---|---|---|---|
| Differential Privacy (DP-SGD) | Mathematical Ξ΅-bound on per-record leakage | Moderate β accuracy degradation at strong Ξ΅ | Google, Apple, Microsoft |
| Federated Learning | Data never leaves source device | Low-moderate β communication overhead | Google Gboard, Apple iOS |
| SMPC | Cryptographic β no plaintext exposure | High β significant compute cost | Healthcare consortia, finance |
| Synthetic Data | Statistical β no real individuals in dataset | Variable β depends on synthesis quality | NHS, clinical AI research |
Even with DP training, best practice requires scrubbing PII from training data before it enters the pipeline. Microsoft's Presidio (open-sourced in 2019) provides entity recognition and anonymization for over 50 PII types. The critical insight from Carlini et al.'s memorization research is that frequency correlates with memorization: data points appearing many times in training are far more likely to be verbatim-memorized than rare examples. PII scrubbing combined with deduplication therefore provides compounding privacy protection.
Work through the privacy controls needed for this scenario: differential privacy parameters, data SBOM requirements, deduplication strategy, and federated vs. centralized training decision. Get specific recommendations with justifications.
In February 2024, a British Columbia Civil Resolution Tribunal ruled that Air Canada was liable for misleading information its AI chatbot provided to a passenger about bereavement fare refund policies. The chatbot incorrectly told passenger Jake Moffatt he could apply for a refund after travel β contrary to actual policy. Air Canada argued the chatbot was a "separate legal entity" and thus not its responsibility; the tribunal rejected this entirely. The case established a landmark: organizations are legally responsible for the outputs of their deployed AI systems. Monitoring and correction mechanisms are not optional quality features β they are legal risk controls.
Traditional application monitoring tracks latency, error rates, and uptime. AI systems require an additional observability layer targeting behavioral driftβchanges in what the model says over time due to input distribution shift, model updates, or active manipulation. Behavioral monitoring for LLMs captures prompt-completion pairs, applies classifiers for policy violations, and tracks statistical signatures of outputs that can detect injection attacks, jailbreaks, and data extraction attempts.
Microsoft Sentinel's AI-specific detection rules (released in 2024) include patterns for token stuffing attacks (inputs designed to exhaust context windows), role-reversal attempts, and systematic model extraction queries. SIEM integration for AI requires logging at the application layerβnot just network layerβbecause most AI attacks occur in legitimate-looking HTTP requests that contain adversarial content.
Model extraction attacks require thousands of queries to reconstruct model behavior. Rate limiting is the primary countermeasure, but naive per-IP rate limiting is trivially bypassed with distributed query sources. Effective extraction defense uses behavioral fingerprinting: detecting systematic query patterns (covering input space methodically, using similar templates) regardless of source IP. Cloudflare's AI Gateway product, launched in 2024, provides this as a managed service, adding rate limiting, caching, and anomaly detection as a reverse proxy layer in front of AI inference endpoints.
API hardening for AI systems also requires output watermarking β embedding imperceptible signals in model outputs that allow attribution if extracted outputs are used to train competing models. SynthID, Google's watermarking technology for AI-generated images and text (released publicly in 2023), embeds a statistical pattern in outputs that survives post-processing while remaining invisible to human readers.
AI incident response (AIR) differs from classical IR in critical ways. Containment may mean rolling back a model version rather than isolating a server. Eradication may require retraining from a clean checkpoint, not just patching code. Evidence preservation must capture prompt logs, model versions, and inference parametersβnot just network packets. Organizations that suffered AI incidents in 2023β2024 (including several undisclosed cases of jailbreak-enabled data extraction) found that standard IR playbooks were inadequate because they lacked procedures for model rollback, prompt log forensics, or evaluating training data integrity post-incident.
The AI Incident Database (AIID), maintained by the Partnership on AI, has catalogued over 750 AI incidents as of 2024. Analysis of these incidents shows that 90% occurred in deployment rather than training, and that detection lagβtime from incident to detectionβaveraged 47 days for AI-specific incidents versus 24 days for traditional cyberincidents, reflecting the lack of mature AI monitoring tooling.
CISA and NCSC's 2024 joint guidelines on securing AI in critical infrastructure recommend a five-control baseline: (1) prompt logging with tamper-evident storage, (2) output filtering before external delivery, (3) model version pinning with hash verification, (4) anomaly detection on inference traffic, and (5) a tested AI-specific IR playbook exercised at least annually. No AI system in a regulated environment should be considered production-hardened without all five.
Work through incident triage, evidence collection, containment, and remediation decisions with your IR advisor. The goal is to produce a complete response plan covering immediate actions, forensic steps, and post-incident hardening.