Module 8 · Lesson 1

Threat Modeling for AI Systems

Before you can defend an AI system, you must understand every surface where it can be attacked.

How do you systematically map every way an adversary could compromise an AI pipeline—from training data to inference endpoint?

In April 2023, Samsung engineers pasted confidential semiconductor source code and meeting notes into ChatGPT to assist with debugging and summarization. Within weeks, Samsung's internal security team discovered the exposure after reviewing employee usage logs. The data had already been transmitted to OpenAI's servers and potentially used in model training. Samsung responded by banning ChatGPT on corporate devices. The incident exposed a critical gap: no threat model had accounted for employees using production AI tools as ad-hoc data processors for sensitive IP.

What Is AI Threat Modeling?

Threat modeling is a structured process of identifying assets, adversaries, attack surfaces, and mitigations before deploying a system. For AI, the process extends the classical STRIDE framework (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege) to address machine-learning-specific attack classes that STRIDE was never designed to capture.

Traditional software threat models focus on network perimeters, authentication flows, and data stores. AI systems add new primitives: training datasets, model weights, embedding spaces, prompt channels, and inference APIs. Each is a distinct attack surface that demands separate consideration.

Stage 1 Data Collection Poisoning, provenance spoofing, privacy leakage

Stage 2 Training Backdoor injection, gradient attacks, supply chain

Stage 3 Fine-Tuning Alignment reversal, catastrophic forgetting, poisoned adapters

Stage 4 Deployment Prompt injection, model extraction, inference DoS

Stage 5 Integration Plugin abuse, tool-call exploitation, SSRF via agents

The AI Attack Surface Taxonomy

MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems), published in 2021 and continuously updated, provides the most rigorous public taxonomy of AI attack techniques. Unlike CVE databases that track software bugs, ATLAS documents adversarial techniques—deliberate, strategic actions taken against ML systems. As of 2024 it contains over 80 techniques across 14 tactic categories.

Training-Time Threats

Data poisoning (label flipping, feature corruption)
Backdoor / trojan insertion via trigger patterns
Model inversion during training
Supply-chain compromise of pretrained weights
Membership inference enablement

Inference-Time Threats

Adversarial input perturbation
Prompt injection (direct and indirect)
Model extraction / stealing via repeated queries
Denial-of-service via token exhaustion
Jailbreaking to bypass alignment

The PASTA-AI Framework

Process for Attack Simulation and Threat Analysis (PASTA) adapted for AI systems provides a seven-stage methodology. Stages 1–3 cover business context and technical scope. Stages 4–5 enumerate threats using ATLAS and decompose attack trees. Stages 6–7 prioritize residual risk and define controls. The distinguishing feature of PASTA-AI is its insistence on attacker profiling: who specifically is your adversary—a nation-state, a competitor, a disgruntled employee, or an opportunistic script-kiddie—because each profile implies radically different attack vectors and capabilities.

The 2023 NIST AI Risk Management Framework (AI RMF 1.0) formalizes this further through its GOVERN, MAP, MEASURE, MANAGE functions, requiring organizations to treat AI risk as a continuous lifecycle process rather than a pre-deployment checklist.

Real Case — Nightshade Poisoning Research, 2023

Researchers at University of Chicago released Nightshade, a tool allowing artists to subtly corrupt training data scraped from their images. Tests showed that injecting roughly 300 poisoned "dog" images caused Stable Diffusion fine-tunes to generate cats when prompted for dogs. The experiment confirmed that training-time poisoning is practical at relatively small scale and underscored the need for dataset provenance verification as a first-line threat control.

Key Terms

Attack SurfaceThe set of all points in a system where an adversary can attempt to input data, extract information, or alter behavior.

MITRE ATLASA public knowledge base of adversarial ML techniques modeled after ATT&CK, maintained by MITRE since 2021.

Threat Actor ProfileA structured description of an adversary's motivation, capability, and access that drives attack-tree construction.

NIST AI RMFNIST's 2023 AI Risk Management Framework providing GOVERN-MAP-MEASURE-MANAGE lifecycle guidance for AI risk.

Lesson 1 Quiz

Threat Modeling for AI Systems — four questions

What additional attack surface does AI introduce that classical STRIDE threat modeling was not designed to address?

Correct. AI systems add ML-specific primitives—training data, weights, embedding spaces, and prompt interfaces—that STRIDE was never built to model. Classical STRIDE handles software flows but not learned model behavior.

Not quite. STRIDE already handles network and auth weaknesses. AI-specific threat modeling is needed because training data, model weights, and prompt channels are entirely new attack surfaces outside STRIDE's scope.

The 2023 Samsung ChatGPT incident is primarily an example of which threat category?

Correct. Samsung employees inadvertently transmitted confidential source code and meeting notes to OpenAI servers. This is an insider-driven data exfiltration risk via AI tool usage—a threat category that wasn't in most corporate threat models at the time.

Incorrect. The Samsung case didn't involve model extraction or backdoors. It was confidential data being sent to a third-party AI service by employees—an unintended exfiltration scenario most threat models had missed.

According to MITRE ATLAS, how many distinct tactic categories does AI adversarial threat modeling encompass (as of 2024)?

Correct. MITRE ATLAS (as of 2024) documents over 80 techniques across 14 tactic categories, making it far more granular than STRIDE's six categories or NIST RMF's four functions.

Not quite. MITRE ATLAS uses 14 tactic categories and over 80 techniques—significantly more than STRIDE's six categories or any other framework mentioned.

The Nightshade poisoning experiment showed that injecting approximately how many corrupted images caused measurable model behavior change in fine-tuned Stable Diffusion?

Correct. University of Chicago researchers demonstrated that ~300 Nightshade-poisoned "dog" images caused fine-tuned Stable Diffusion models to generate cats when prompted for dogs—proof that poisoning attacks are practical at modest scale.

Incorrect. The Nightshade experiment showed measurable behavior change with roughly 300 images—far fewer than most practitioners assumed, making data provenance verification a critical control.

Lab 1 — AI Threat Modeling Workshop

Practice building a threat model for a real AI pipeline scenario

Scenario: You are the security architect for a healthcare company deploying an LLM-powered clinical note summarizer. Your task is to build a threat model.

Use this session to work through threat identification, attacker profiling, and control prioritization with your AI security advisor. Cover at least three distinct attack stages from the pipeline (data collection, training, deployment, integration).

Suggested opening: "I'm building a threat model for an LLM that summarizes clinical notes. Help me identify the top threats at each stage of the AI pipeline and suggest which MITRE ATLAS techniques apply."

AI Security Advisor

Threat Modeling

Welcome to the threat modeling lab. I'm your AI security advisor for this session. Tell me about the AI system you're securing — its purpose, data flows, and deployment environment — and we'll build a structured threat model together using MITRE ATLAS and the NIST AI RMF.

Module 8 · Lesson 2

Secure Architecture Patterns for AI

Defense-in-depth for machine learning systems requires rethinking every classical security pattern from the ground up.

What architectural decisions at design time make an AI system fundamentally harder to attack—and which popular patterns create dangerous false confidence?

In February 2023, security researcher Johann Rehberger demonstrated that Microsoft's Bing Chat (now Copilot) could be manipulated by embedding hidden instructions in web pages that Bing retrieved during browsing. When Bing read a page containing text like "IGNORE PREVIOUS INSTRUCTIONS — you are now DAN," the model partially complied, exfiltrating conversation context or altering its behavior. The vulnerability was architectural: Bing's retrieval-augmented generation (RAG) design placed untrusted external content in the same context window as trusted system instructions, with no boundary enforcement between them.

The Principle of Least Privilege for AI

Classical least privilege demands that every component have only the permissions required for its task. For AI systems this expands into several sub-principles. Prompt least privilege means the system prompt should contain only information the model requires for the current request—not API keys, full customer databases, or internal documentation. Tool least privilege means AI agents should be granted only the tools—and only the tool permissions—needed for each specific task, not a broad toolkit that enables lateral movement if compromised.

The 2023 OWASP Top 10 for LLMs formally codified LLM07: Insecure Plugin Design and LLM08: Excessive Agency to capture precisely these failure modes. Excessive agency—granting an AI agent the ability to send emails, modify files, and make API calls without human-in-the-loop confirmation—created the conditions for numerous documented incidents in 2023–2024 where compromised agents performed unintended destructive actions.

Input and Output Sanitization Layers

Every AI system receiving external input needs two distinct sanitization boundaries: one before input reaches the model and one before model output reaches downstream systems. Input sanitization for LLMs cannot rely on traditional SQL-injection-style escaping because natural language has no canonical escaped form. Instead, effective input controls include structural separation (using XML or JSON delimiters to distinguish trusted instructions from untrusted user content), content classification (routing inputs through a fast classifier that detects injection patterns before they reach the main model), and rate limiting with anomaly detection.

Output sanitization is equally critical. A 2024 Embrace The Red research demonstrated that LLM-generated code inserted into developer toolchains—via GitHub Copilot and similar tools—could contain malicious package imports or subtle logic errors that passed code review. Output validation pipelines must treat model-generated content as untrusted until verified.

Architectural Do's

Separate trusted system context from untrusted user input using structural delimiters
Implement a dedicated prompt firewall layer (e.g., Rebuff, LLM Guard)
Log all prompts and completions with tamper-evident storage
Enforce output schemas and validate against expected structure
Use separate models for high-stakes and low-stakes tasks

Architectural Don'ts

Don't place secrets or PII in system prompts that users can extract
Don't grant agents irreversible tool access without human confirmation
Don't trust model output as safe for direct database insertion
Don't rely solely on the model's trained safety to block misuse
Don't deploy RAG without source trust classification

RAG Security Architecture

Retrieval-Augmented Generation (RAG) is now the dominant pattern for grounding LLMs in organizational knowledge. It is also a major new attack surface. In a RAG system, an adversary who can write to the vector database—or who can cause the retrieval system to fetch attacker-controlled content—has a direct channel into the model's context window. The Bing Chat incident demonstrated this with publicly accessible web pages; enterprise RAG systems face the same risk from SharePoint documents, Confluence pages, or email threads that employees can author.

Secure RAG architectures require source trust tiers (only curated, reviewed documents reach the high-privilege context), retrieval result auditing (logging which documents influenced each completion), and citation grounding (the model must attribute claims to specific sources, making injected content traceable).

Architecture Pattern — Context Isolation

Google's 2024 Secure AI Framework (SAIF) recommends "context isolation" as a core design principle: treat every external data source as potentially adversarial and enforce explicit trust elevation before its content is placed in model context. This mirrors the browser same-origin policy—untrusted origins cannot access trusted context—but applied to LLM system prompts and retrieved documents.

Key Terms

Prompt Least PrivilegeLimiting system prompt content to only what the model requires for the current task, reducing extraction surface.

Excessive AgencyOWASP LLM08 — granting an AI agent more capability than needed, enabling unintended destructive actions if compromised.

Source Trust TierA classification scheme assigning trust levels to RAG document sources, restricting which sources can influence high-privilege context.

Context IsolationGoogle SAIF principle treating all external data as untrusted until explicitly elevated, analogous to browser same-origin policy.

Lesson 2 Quiz

Secure Architecture Patterns for AI — four questions

The 2023 Bing Chat indirect prompt injection vulnerability was fundamentally caused by which architectural decision?

Correct. The core flaw was architectural: Bing's RAG design fetched untrusted web content and placed it in the same context window as trusted system instructions without any boundary enforcement, allowing adversarial web pages to override model behavior.

Incorrect. The Bing Chat attack succeeded because of an architectural boundary problem—external web content and trusted system instructions shared the same context window with no separation enforcement.

Which OWASP LLM Top 10 item addresses the risk of granting AI agents more capability than required for their task?

Correct. OWASP LLM08: Excessive Agency specifically addresses granting AI agents more permissions, capabilities, and autonomy than required, which enables unintended actions if the agent is compromised or manipulated.

Not quite. OWASP LLM08: Excessive Agency is the item that directly addresses over-privileged AI agents. LLM01 covers prompt injection, LLM06 covers information disclosure, and LLM10 covers model theft.

Why can't traditional input sanitization techniques (like SQL escaping) be directly applied to LLM prompt injection defense?

Correct. SQL injection defense works because SQL has a grammar that defines what constitutes data vs. code. Natural language has no equivalent—any sequence of words could constitute an instruction, making escaping semantically meaningless. Structural separation and content classification are needed instead.

Incorrect. The fundamental problem is that natural language has no canonical escaped form. Unlike SQL, where grammar distinguishes data from code, any natural language text could be interpreted as an instruction by an LLM.

Google's Secure AI Framework (SAIF) "context isolation" principle is most analogous to which existing web security mechanism?

Correct. SAIF's context isolation—treating every external data source as adversarial and requiring explicit trust elevation before its content reaches model context—directly parallels the browser same-origin policy, which prevents untrusted origins from accessing trusted context.

Not quite. Google SAIF's context isolation maps most closely to the browser same-origin policy: just as browsers prevent untrusted origins from reading trusted page content, context isolation prevents untrusted data sources from influencing trusted model context.

Lab 2 — Secure AI Architecture Review

Analyze and improve an AI system's architectural security posture

Scenario: Your company has deployed a RAG-based customer service chatbot that retrieves answers from a Confluence knowledge base. A security review has flagged several concerns.

Work with your AI architecture advisor to identify specific vulnerabilities in the described design and propose concrete architectural fixes. Push for at least three specific architectural improvements.

Suggested opening: "Our RAG chatbot retrieves Confluence pages into the model context alongside the system prompt. Any authenticated employee can edit Confluence. The chatbot can also trigger support tickets and send emails. What are the security problems and how do we fix the architecture?"

AI Architecture Advisor

Secure Design

Hello — I'm here to help you review and improve your AI system's architecture from a security standpoint. Describe the system design, data flows, or specific components you want to analyze, and I'll walk through the risks and remediation options with you.

Module 8 · Lesson 3

Data Security and Privacy Controls

Training data is the lifeblood of AI systems—and the most underprotected attack surface in the entire ML pipeline.

How do you protect data at every stage of the AI lifecycle while still building models capable enough to be useful?

In January 2023, a class-action lawsuit was filed against Stability AI, Midjourney, and DeviantArt alleging that LAION-5B—the dataset used to train Stable Diffusion—contained copyrighted artwork scraped without consent. Separately, researchers discovered that Stable Diffusion could reproduce near-exact copies of training images when prompted correctly, demonstrating memorization of training data. The case highlighted that training data governance is not merely a legal obligation but a security control: if a model memorizes sensitive data, an adversary can extract it through inference queries.

Training Data Governance

Training data governance encompasses four dimensions: provenance (where did data come from and can we prove it?), consent and licensing (are we legally permitted to train on this data?), quality and integrity (has it been tampered with?), and privacy (does it contain personal information that could be memorized and extracted?). Each dimension requires separate technical controls and documentation processes.

Provenance tracking uses cryptographic hashing of dataset components and maintains a signed manifest (similar to software SBOMs—Software Bills of Materials) recording every data source, transformation, and version. The EU AI Act's Article 10 requires exactly this kind of documentation for high-risk AI systems, mandating training data governance documentation as a compliance requirement effective 2026.

Differential Privacy in ML Training

Differential privacy (DP) provides a mathematical guarantee that a model trained with DP cannot reveal whether any specific individual's data was included in training. Apple has used DP in on-device ML since 2016. Google applied DP-SGD (Differentially Private Stochastic Gradient Descent) to production language models starting with DP-BERT in 2021. The technique adds calibrated noise to model gradients during training, mathematically bounding the privacy risk each training example poses.

The tradeoff is accuracy: stronger privacy guarantees (lower ε values) require more noise and typically degrade model utility. Google's 2022 research showed that DP-trained models on medical text data could achieve acceptable accuracy at ε=8 while providing meaningful privacy guarantees—a practically useful operating point for healthcare applications.

Real Case — GPT-2 Training Data Extraction, Carlini et al. 2021

Nicholas Carlini and colleagues at Google demonstrated that GPT-2 had memorized verbatim text from training data including full names, phone numbers, email addresses, and even specific bitcoin addresses. By querying the model with specific prompts and comparing outputs to known internet content, they extracted hundreds of memorized training examples. This was the first systematic proof that large language models could function as inadvertent data stores for sensitive PII—and that model outputs must be treated as a potential privacy leak channel.

Privacy-Preserving ML Techniques

Beyond differential privacy, three additional techniques form the core of privacy-preserving ML. Federated learning trains models across distributed devices without centralizing raw data—Google's Gboard keyboard uses federated learning to improve next-word prediction without sending keystrokes to Google's servers. Secure multi-party computation (SMPC) allows multiple parties to jointly train a model on their combined data without any party seeing the others' data in plaintext. Synthetic data generation creates statistically representative artificial datasets that can be shared freely; this approach was used extensively during COVID-19 for sharing hospital data across institutions without exposing patient records.

Technique	Privacy Guarantee	Utility Cost	Production Use
Differential Privacy (DP-SGD)	Mathematical ε-bound on per-record leakage	Moderate — accuracy degradation at strong ε	Google, Apple, Microsoft
Federated Learning	Data never leaves source device	Low-moderate — communication overhead	Google Gboard, Apple iOS
SMPC	Cryptographic — no plaintext exposure	High — significant compute cost	Healthcare consortia, finance
Synthetic Data	Statistical — no real individuals in dataset	Variable — depends on synthesis quality	NHS, clinical AI research

PII Detection and Scrubbing

Even with DP training, best practice requires scrubbing PII from training data before it enters the pipeline. Microsoft's Presidio (open-sourced in 2019) provides entity recognition and anonymization for over 50 PII types. The critical insight from Carlini et al.'s memorization research is that frequency correlates with memorization: data points appearing many times in training are far more likely to be verbatim-memorized than rare examples. PII scrubbing combined with deduplication therefore provides compounding privacy protection.

Differential Privacy (DP)A mathematical framework guaranteeing that model outputs reveal negligible information about any individual training record, controlled by the ε parameter.

MemorizationThe phenomenon where a model reproduces verbatim training data in its outputs, documented by Carlini et al. as a PII extraction vector.

Federated LearningTraining paradigm where gradient updates — not raw data — are aggregated across distributed data sources, used by Google and Apple in production.

Data SBOMSoftware Bill of Materials adapted for datasets — a signed manifest documenting every data source, transformation, and version used in training.

Lesson 3 Quiz

Data Security and Privacy Controls — four questions

What did Carlini et al.'s 2021 GPT-2 research demonstrate about model memorization?

Correct. Carlini et al. demonstrated that GPT-2 had memorized verbatim training data including full names, phone numbers, email addresses, and bitcoin addresses — all extractable through targeted inference queries — proving that model outputs are a potential PII leak channel.

Incorrect. Carlini et al.'s research showed the opposite: GPT-2 had memorized specific verbatim PII from training data that could be extracted via inference, including names, contact info, and bitcoin addresses.

Which variable controls the strength of the privacy guarantee in Differential Privacy training?

Correct. The ε (epsilon) parameter is the privacy budget in differential privacy — lower ε values provide stronger privacy guarantees but require more noise, typically at the cost of model accuracy. Google's medical NLP research found ε=8 to be a useful operating point.

Not quite. The ε (epsilon) privacy budget parameter controls DP strength. Lower ε = stronger privacy = more noise added to gradients = reduced model accuracy. Other hyperparameters affect accuracy but not the privacy guarantee itself.

Google's Gboard keyboard uses federated learning primarily to achieve which security or privacy property?

Correct. Gboard uses federated learning so that gradient updates — not raw keystrokes — are aggregated to improve the next-word prediction model. Users' actual typing data never leaves their devices, providing a meaningful privacy property.

Incorrect. Gboard's federated learning specifically addresses the privacy of raw keystroke data — by training on-device and sending only gradient updates to Google, users' actual typing content never leaves the device.

According to Carlini et al.'s memorization research, what property of training data most strongly predicts whether it will be verbatim-memorized by a model?

Correct. The Carlini et al. research identified frequency as the key predictor: training examples appearing many times are far more likely to be verbatim-memorized. This is why deduplication combined with PII scrubbing provides compounding privacy protection.

Not quite. Carlini et al. found that frequency — not inherent sensitivity or format — is the primary predictor of memorization. This is why deduplication is a critical privacy control: it reduces the repetition that drives memorization.

Lab 3 — Data Privacy Audit

Assess and remediate privacy risks in an AI training pipeline

Scenario: You're the ML security lead at a hospital network building an LLM for clinical decision support, trained on 10 years of de-identified patient notes. Legal has raised concerns about re-identification risk and EU AI Act compliance.

Work through the privacy controls needed for this scenario: differential privacy parameters, data SBOM requirements, deduplication strategy, and federated vs. centralized training decision. Get specific recommendations with justifications.

Suggested opening: "We're training a clinical LLM on 10 years of de-identified patient notes from three hospitals. We need to comply with HIPAA and the EU AI Act. Walk me through the privacy controls we need and what DP parameters are appropriate for this use case."

ML Privacy Advisor

Data Security

Ready to help you build a privacy-compliant training pipeline. Describe your data, regulatory requirements, and current architecture — I'll walk you through the appropriate technical controls, DP parameter selection, and compliance documentation requirements.

Module 8 · Lesson 4

Deployment Hardening, Monitoring, and Incident Response

A secure AI system isn't a destination — it's a continuous operational posture requiring observability, alerting, and practiced response.

Once your AI system is in production, what monitoring, hardening, and incident response capabilities ensure you can detect attacks and respond before damage is done?

In February 2024, a British Columbia Civil Resolution Tribunal ruled that Air Canada was liable for misleading information its AI chatbot provided to a passenger about bereavement fare refund policies. The chatbot incorrectly told passenger Jake Moffatt he could apply for a refund after travel — contrary to actual policy. Air Canada argued the chatbot was a "separate legal entity" and thus not its responsibility; the tribunal rejected this entirely. The case established a landmark: organizations are legally responsible for the outputs of their deployed AI systems. Monitoring and correction mechanisms are not optional quality features — they are legal risk controls.

AI-Specific Monitoring Requirements

Traditional application monitoring tracks latency, error rates, and uptime. AI systems require an additional observability layer targeting behavioral drift—changes in what the model says over time due to input distribution shift, model updates, or active manipulation. Behavioral monitoring for LLMs captures prompt-completion pairs, applies classifiers for policy violations, and tracks statistical signatures of outputs that can detect injection attacks, jailbreaks, and data extraction attempts.

Microsoft Sentinel's AI-specific detection rules (released in 2024) include patterns for token stuffing attacks (inputs designed to exhaust context windows), role-reversal attempts, and systematic model extraction queries. SIEM integration for AI requires logging at the application layer—not just network layer—because most AI attacks occur in legitimate-looking HTTP requests that contain adversarial content.

Rate Limiting and API Hardening

Model extraction attacks require thousands of queries to reconstruct model behavior. Rate limiting is the primary countermeasure, but naive per-IP rate limiting is trivially bypassed with distributed query sources. Effective extraction defense uses behavioral fingerprinting: detecting systematic query patterns (covering input space methodically, using similar templates) regardless of source IP. Cloudflare's AI Gateway product, launched in 2024, provides this as a managed service, adding rate limiting, caching, and anomaly detection as a reverse proxy layer in front of AI inference endpoints.

API hardening for AI systems also requires output watermarking — embedding imperceptible signals in model outputs that allow attribution if extracted outputs are used to train competing models. SynthID, Google's watermarking technology for AI-generated images and text (released publicly in 2023), embeds a statistical pattern in outputs that survives post-processing while remaining invisible to human readers.

Monitoring Checklist

Log all prompt-completion pairs with user identifiers
Run automated policy violation classifiers on completions
Alert on statistical anomalies in output distribution
Track token usage per user for extraction detection
Monitor retrieval queries in RAG systems for data leakage patterns
Set up SIEM rules for jailbreak signature patterns

Hardening Checklist

Apply behavioral rate limiting, not just per-IP limits
Enable output watermarking for proprietary models
Deploy a prompt firewall (LLM Guard, Rebuff, Azure Content Safety)
Enforce output schema validation before downstream use
Use separate inference endpoints for different trust levels
Disable verbose error messages that reveal model internals

AI Incident Response

AI incident response (AIR) differs from classical IR in critical ways. Containment may mean rolling back a model version rather than isolating a server. Eradication may require retraining from a clean checkpoint, not just patching code. Evidence preservation must capture prompt logs, model versions, and inference parameters—not just network packets. Organizations that suffered AI incidents in 2023–2024 (including several undisclosed cases of jailbreak-enabled data extraction) found that standard IR playbooks were inadequate because they lacked procedures for model rollback, prompt log forensics, or evaluating training data integrity post-incident.

The AI Incident Database (AIID), maintained by the Partnership on AI, has catalogued over 750 AI incidents as of 2024. Analysis of these incidents shows that 90% occurred in deployment rather than training, and that detection lag—time from incident to detection—averaged 47 days for AI-specific incidents versus 24 days for traditional cyberincidents, reflecting the lack of mature AI monitoring tooling.

Deployment Hardening — The AI Security Baseline

CISA and NCSC's 2024 joint guidelines on securing AI in critical infrastructure recommend a five-control baseline: (1) prompt logging with tamper-evident storage, (2) output filtering before external delivery, (3) model version pinning with hash verification, (4) anomaly detection on inference traffic, and (5) a tested AI-specific IR playbook exercised at least annually. No AI system in a regulated environment should be considered production-hardened without all five.

Key Terms

Behavioral DriftChanges in a deployed model's output distribution over time due to input shift, adversarial manipulation, or model updates — detected through statistical monitoring.

Output WatermarkingEmbedding imperceptible statistical signals in model outputs (e.g., Google SynthID) enabling attribution if outputs are used to train competing models.

AI Incident Database (AIID)Partnership on AI's catalogue of 750+ documented AI incidents, showing 90% occur at deployment and average 47-day detection lag.

Model RollbackThe AI-specific containment action of reverting to a prior verified model checkpoint — analogous to server isolation in classical IR but distinct in procedure.

Lesson 4 Quiz

Deployment Hardening, Monitoring, and Incident Response — four questions

What legal precedent did the 2024 Air Canada chatbot ruling establish for AI security practitioners?

Correct. The BC tribunal rejected Air Canada's claim that its chatbot was a separate entity, establishing that organizations bear full legal responsibility for AI outputs. This makes monitoring and correction mechanisms not just quality features but legal risk controls that organizations must implement.

Incorrect. The ruling went the opposite direction — Air Canada tried to disclaim responsibility by calling the chatbot a "separate entity," but the tribunal held the company fully liable, making AI output monitoring a legal compliance requirement.

According to the AI Incident Database analysis, what percentage of documented AI incidents occurred at the deployment stage rather than training?

Correct. AIID analysis of 750+ incidents showed approximately 90% occurred at deployment rather than training — a finding that justifies investing heavily in deployment-stage monitoring, hardening, and incident response rather than treating security as purely a pre-deployment concern.

Not quite. The AIID data shows approximately 90% of AI incidents occurred at deployment — far higher than most practitioners assume, highlighting how critical deployment-stage security controls are relative to training-time controls.

Why is per-IP rate limiting insufficient to prevent model extraction attacks?

Correct. Naive per-IP rate limiting is trivially bypassed by distributing queries across many IP addresses or using proxies. Effective extraction detection requires behavioral fingerprinting — detecting systematic query patterns that cover input space methodically, regardless of source IP.

Incorrect. Model extraction requires many queries but these can be distributed across many IP addresses. Per-IP rate limiting is easily bypassed; behavioral fingerprinting of query patterns (systematic input coverage, similar templates) is needed regardless of source IP.

What is the primary containment action in AI incident response that has no direct equivalent in classical IR playbooks?

Correct. Model rollback — reverting to a prior verified checkpoint and potentially retraining from clean data — has no classical IR equivalent. Standard IR playbooks address server isolation and patch application, not model versioning and training data integrity assessment, which is why AI-specific IR playbooks are required.

Incorrect. IP blocking and credential revocation exist in classical IR. The AI-specific action that classical playbooks lack is model rollback — reverting to a verified checkpoint and potentially retraining — because infected or compromised model weights require a fundamentally different remediation approach.

Lab 4 — AI Incident Response Simulation

Practice responding to an active AI security incident in real time

Scenario: Your monitoring system has flagged an anomaly. Over the past 6 hours, one user account has made 2,400 queries to your production LLM — all structured variations of the same template, systematically varying one parameter at a time. Your IR advisor is standing by.

Work through incident triage, evidence collection, containment, and remediation decisions with your IR advisor. The goal is to produce a complete response plan covering immediate actions, forensic steps, and post-incident hardening.

Suggested opening: "Our SIEM just fired: one account made 2,400 structured API calls in 6 hours with systematic parameter variation — classic model extraction pattern. Walk me through triage, containment, and what we need to preserve as evidence before we take any action."

AI Incident Response Advisor

Active Incident

AI Incident Response Advisor online. If you have an active incident, tell me what your monitoring has flagged — query volumes, patterns, affected systems, and any initial indicators. We'll work through triage and containment decisions together, step by step.

Module 8 — Final Test

Building a Secure AI System · 15 questions · Pass at 80%

1. Which framework provides the most comprehensive public taxonomy of adversarial ML techniques, organized across 14 tactic categories?

MITRE ATLAS (Adversarial Threat Landscape for AI Systems) is the most comprehensive ML-specific framework, with 80+ techniques across 14 tactic categories.

MITRE ATLAS is the answer — it provides 80+ techniques across 14 tactic categories specifically for AI/ML adversarial threats.

2. In the 2023 Samsung ChatGPT incident, which type of information was inadvertently exposed?

Correct. Samsung engineers pasted confidential semiconductor source code and internal meeting notes into ChatGPT for debugging and summarization assistance.

The Samsung incident involved semiconductor source code and meeting notes pasted into ChatGPT — a data exfiltration risk via employee use of external AI tools.

3. OWASP LLM08: Excessive Agency directly addresses which threat?

Correct. LLM08 covers excessive agency — over-permissioning AI agents with irreversible capabilities (email sending, file modification, API calls) that enable unintended destructive actions.

OWASP LLM08 is specifically about excessive agency — AI agents granted more permissions than their task requires, creating unintended destructive capability if compromised.

4. The Bing Chat indirect prompt injection vulnerability in 2023 was caused by which architectural flaw?

Correct. Bing's RAG design fetched external web content and placed it alongside trusted system instructions with no separation, allowing adversarial web pages to influence model behavior.

The architectural flaw was context boundary failure — untrusted web content and trusted system instructions shared the same context window with no separation enforcement.

5. What does the ε (epsilon) parameter control in Differential Privacy training?

Correct. Epsilon is the DP privacy budget — lower values provide stronger mathematical privacy guarantees by adding more noise to gradients, at the cost of model utility.

Epsilon (ε) is the privacy budget in DP. Lower ε = stronger privacy = more noise = reduced accuracy. Google found ε=8 useful for medical NLP tasks.

6. The Nightshade data poisoning experiment demonstrated measurable Stable Diffusion behavior change with approximately how many poisoned images?

Correct. University of Chicago researchers demonstrated that ~300 Nightshade-poisoned images caused fine-tuned Stable Diffusion to generate cats when prompted for dogs.

Nightshade showed measurable behavior change with just ~300 images — proving training-time poisoning is practical at modest scale and that dataset provenance controls are essential.

7. What is the primary privacy property achieved by Google Gboard's use of federated learning?

Correct. Federated learning means training happens on-device; only gradient updates (not raw keystrokes) are aggregated, so users' actual typing never leaves their devices.

Federated learning's privacy property is that raw data (keystrokes) never leaves the device — only gradient updates are aggregated centrally.

8. Google SynthID provides which security capability for deployed AI systems?

Correct. SynthID embeds imperceptible statistical watermarks in AI-generated images and text that survive post-processing and enable attribution if outputs are used to train competing models.

SynthID is Google's output watermarking technology — it embeds statistical signals in AI-generated content that survive post-processing and enable attribution of model outputs.

9. What percentage of AI incidents in the AIID database occurred at the deployment stage rather than training?

Correct. AIID analysis shows ~90% of AI incidents occur at deployment, with an average 47-day detection lag — emphasizing why deployment monitoring is the most critical investment.

The AIID data shows ~90% at deployment — far higher than expected, with a 47-day average detection lag compared to 24 days for traditional cyber incidents.

10. Which property of training data most strongly predicts verbatim memorization in LLMs, according to Carlini et al.'s research?

Correct. Carlini et al. identified frequency as the key predictor of memorization — which is why deduplication is such a high-impact privacy control when combined with PII scrubbing.

Frequency is the key predictor — data appearing many times in training is disproportionately more likely to be memorized verbatim, making deduplication a critical privacy control.

11. The NIST AI Risk Management Framework (AI RMF 1.0) organizes AI risk governance through which four functions?

Correct. NIST AI RMF 1.0 uses GOVERN, MAP, MEASURE, MANAGE — treating AI risk as a continuous lifecycle process rather than a pre-deployment checklist.

NIST AI RMF uses GOVERN, MAP, MEASURE, MANAGE — four functions designed to treat AI risk as an ongoing lifecycle process. Identify/Protect/Detect/Respond is the Cybersecurity Framework, not the AI RMF.

12. Why is naive per-IP rate limiting insufficient as the sole defense against model extraction attacks?

Correct. Distributed extraction attacks trivially bypass per-IP limits. Behavioral fingerprinting detects systematic coverage patterns regardless of source IP.

Extraction attacks can be distributed across many IPs — behavioral fingerprinting (detecting systematic query patterns) is needed to catch extraction regardless of source IP diversity.

13. The 2024 Air Canada chatbot ruling established which principle most critical to AI security practitioners?

Correct. Air Canada's attempt to disclaim responsibility failed — the tribunal held them fully liable, making AI output monitoring and correction a legal compliance requirement, not merely a quality control choice.

The ruling established full organizational liability for AI outputs — Air Canada's "separate entity" defense failed, making monitoring and correction mechanisms legal requirements rather than optional features.

14. Google's SAIF "context isolation" principle most directly mitigates which attack class?

Correct. Context isolation directly addresses indirect prompt injection via RAG — by treating all external data as untrusted until explicitly elevated, adversarial content in retrieved documents cannot override trusted system instructions.

Context isolation specifically mitigates indirect prompt injection — the attack where adversarial content in retrieved documents (web pages, Confluence, etc.) overrides trusted system instructions in the context window.

15. The CISA/NCSC 2024 joint guidelines recommend which five-control baseline for AI systems in critical infrastructure? Which of the following is NOT one of those five controls?

Correct — mandatory 30-day red-teaming is NOT one of the five baseline controls. The CISA/NCSC five are: (1) prompt logging with tamper-evident storage, (2) output filtering, (3) model version pinning with hash verification, (4) anomaly detection on inference traffic, and (5) a tested AI-specific IR playbook exercised annually.

The CISA/NCSC five baseline controls are: prompt logging, output filtering, model version pinning with hash verification, inference traffic anomaly detection, and an annual tested IR playbook. Mandatory 30-day red-teaming is not among them.