Module 5 · Lesson 1

Model Inversion: Reconstructing What Models Memorized

When you query a model enough times, you can sometimes reconstruct the data it was trained on — faces, medical records, personal text.

How does querying a black-box model reveal its training data?

In 2014, Matt Fredrikson and colleagues published a model inversion attack against a pharmacogenetics model, one that recommends warfarin dosages based on patient features. Given the model and basic demographic information about a patient, they could infer that patient's genetic markers more accurately than population statistics alone would allow. In 2015, Fredrikson, Somesh Jha, and Thomas Ristenpart extended the idea to models that expose confidence scores: by optimizing inputs toward maximum confidence, they reconstructed recognizable facial images of individuals whose photos were used to train a facial recognition model.

What Is Model Inversion?

Model inversion is a class of attack in which an adversary uses a model's outputs — predictions, confidence scores, or logits — to infer sensitive information about the training data. The attacker does not need access to model weights or training data directly. Access to the API is sufficient.

The core intuition: a model trained on data D learns a function that compresses and encodes information from D into its parameters. Some of that information can be "decoded" by systematically probing the model and inverting its learned mapping. The richer the model's outputs (logits vs. hard labels), the easier inversion becomes.

White-Box Inversion

Attacker has gradient access. Can directly optimize an input x to maximize P(class c | x) using backpropagation through the model. Fredrikson et al.'s original facial reconstruction used this path against a known architecture.

Black-Box Inversion

Attacker only sees output probabilities. Uses zeroth-order optimization — genetic algorithms, Bayesian optimization, or surrogate models — to reconstruct inputs. Slower but applicable to any hosted model.

The Pharmacogenetics Case in Detail

The warfarin dosage model used by Fredrikson et al. was trained on real patient records including age, weight, genotype markers (CYP2C9, VKORC1), and other clinical features. The model output a predicted dosage. By treating confidence as a loss function and minimizing over input feature space, the researchers could recover feature combinations that the model associated with specific individuals.

Key finding: Even when the model output only a continuous dosage value (not probabilities), enough signal remained to recover approximate training records. This challenged the assumption that limiting output granularity provides meaningful protection.

Attack Mechanism

Inversion attacks solve an optimization problem: find input x* such that model(x*) ≈ target output, or find x* that maximizes the confidence for a target class. With sufficient queries, the optimizer converges on inputs that resemble real training examples — because the model has overfit to them.
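The optimization above can be sketched for the white-box case, assuming a toy logistic model whose weights `W` and bias `B` are known to the attacker (both values are hypothetical) and inputs constrained to the range [0, 1]. The sketch performs projected gradient ascent on the log-confidence; for such a simple model it recovers a class prototype rather than any individual record.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def confidence(x, w, b):
    """P(target class | x) for a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def invert_white_box(w, b, steps=200, lr=0.5):
    """Projected gradient ascent on log P(target | x), with x kept in [0, 1]."""
    x = [0.5] * len(w)
    for _ in range(steps):
        p = confidence(x, w, b)
        # d/dx log sigmoid(w.x + b) = (1 - p) * w
        x = [min(1.0, max(0.0, xi + lr * (1.0 - p) * wi)) for xi, wi in zip(x, w)]
    return x

W = [2.0, -1.5, 0.5, -3.0]  # hypothetical weights
B = -0.5                    # hypothetical bias
x_star = invert_white_box(W, B)
```

Features with positive weight are pushed to the top of the range and features with negative weight to the bottom, which is the input this model is most confident about.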

Generative Model Inversion (2019 Onward)

By 2019, researchers had developed GAN-based model inversion, notably Generative Model Inversion (GMI). Instead of optimizing a raw image pixel by pixel, as Fredrikson et al.'s MI-FACE algorithm did, attackers trained a generative model to produce realistic images that maximize the target classifier's confidence. This dramatically improved reconstruction quality and reduced query counts.
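A rough sketch of why a generative prior helps, under heavy simplification: `generator` is a hypothetical stand-in for a pretrained GAN generator, `target_confidence` is a toy classifier, and random search replaces the gradient-based latent optimization that published attacks typically use. The point is that the search runs over the low-dimensional latent `z`, so every candidate is something the generator can actually produce.

```python
import math
import random

def generator(z):
    """Hypothetical generator: maps a 2-D latent code to a 4-D 'image'."""
    return [math.tanh(z[0]), math.tanh(z[1]),
            math.tanh(z[0] + z[1]), math.tanh(z[0] - z[1])]

def target_confidence(x):
    """Toy target classifier: confidence peaks at a hidden class prototype."""
    prototype = generator([0.7, -0.4])  # on the generator's manifold by construction
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, prototype)))

def invert_in_latent_space(steps=1500, step_size=0.1, seed=0):
    """Search over z (2 dimensions) instead of over raw pixels (4 dimensions)."""
    rng = random.Random(seed)
    z = [0.0, 0.0]
    best = target_confidence(generator(z))
    for _ in range(steps):
        cand = [zi + rng.gauss(0.0, step_size) for zi in z]
        score = target_confidence(generator(cand))
        if score > best:
            z, best = cand, score
    return z, generator(z), best

z_star, image, score = invert_in_latent_space()
```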

The 2021 paper "Variational Model Inversion Attacks" by Wang et al. demonstrated recovery of realistic, diverse, training-set-like faces from facial recognition classifiers by framing inversion as variational inference over a generator's latent space. The 2022 paper "Plug & Play Attacks" by Struppek et al. showed that a single pre-trained generative model, even one trained on an unrelated dataset, could be repurposed as a strong inversion prior, so the attacker no longer needs to train a target-specific GAN.

Why This Matters for Privacy

Healthcare AI systems, biometric systems, and HR screening tools trained on sensitive personal data are all potentially vulnerable. The EU AI Act and GDPR both impose obligations around training data; model inversion may constitute unauthorized processing of personal data even when no training set is directly accessed.

Key Terms

Model Inversion: Reconstructing training data or sensitive input features by querying a model's predictions or confidence scores.

Confidence Score Oracle: A model API that returns probability vectors, enabling gradient-free optimization of adversarial inputs.

GAN-Based Inversion: Using a generative adversarial network as a prior to produce realistic reconstructions that maximize classifier confidence for a target class.

Privacy Leakage: The unintended disclosure of sensitive training data attributes through a model's learned parameters or outputs.

Lesson 1 Quiz — Model Inversion

Three questions · Select the best answer

1. In Fredrikson et al.'s 2015 model inversion attack, what signal did the attacker exploit to reconstruct training data?

Correct. Fredrikson et al. treated the model's output confidence as a loss function and optimized inputs to reconstruct training features — no weight access was required.

Not quite. The attack operated entirely through the model's prediction API, using output confidence to guide an optimization procedure toward training-data-like inputs.

2. GAN-based model inversion attacks (e.g., GMI, Plug & Play Attacks) improved over pixel-optimization inversion primarily by:

Correct. A generative prior (GAN or diffusion model) ensures that optimization stays within the distribution of realistic images, dramatically improving reconstruction quality and reducing required queries.

Not quite. GAN-based inversion constrains the search space using a learned generative prior, producing realistic outputs without requiring access to model internals.

3. Which of the following model output formats provides the MOST information to a model inversion attacker?

Correct. Full probability vectors (logits/softmax) provide a continuous, differentiable signal that enables gradient-based or zeroth-order optimization toward training-data-like inputs.

Not quite. Full softmax probability vectors over all classes give the richest signal: every dimension changes continuously as the input is perturbed, which is exactly the feedback an optimizer needs.

Lab 1 — Model Inversion Threat Modeling

Interactive AI lab · Minimum 3 exchanges to complete

Scenario: Healthcare Classifier Risk Assessment

A hospital is deploying a machine learning model that predicts patient readmission risk. The model is accessed via an API that returns confidence scores. Your task is to think through model inversion risks with the AI assistant.

Start by asking: "What model inversion risks does a readmission prediction API with confidence scores expose?" — then explore mitigations, attacker capabilities, and real-world precedents.

Module 5 · Lesson 2

Model Extraction: Stealing the Model Itself

With enough queries, an adversary can train a functionally equivalent copy of a proprietary model — at a fraction of the original training cost.

How do you steal a model you can only query, not inspect?

In 2016, Tramèr et al. published the foundational model extraction paper, demonstrating that classifiers from BigML and Amazon ML could be extracted with high fidelity using only API queries, sometimes with fewer than 5,000 queries. The stolen models matched the decision boundaries of the originals to within a few percentage points. In 2020, Krishna et al. extended this to large NLP models, showing that BERT-based models fine-tuned for tasks such as sentiment classification and question answering could be extracted using task-agnostic, even nonsensical, queries, with the extracted model approaching the victim's performance on held-out data.

In 2024, researchers published "Stealing Part of a Production Language Model," demonstrating extraction of the final embedding projection layer of OpenAI's ada and babbage language models through ordinary API access, which also revealed each model's hidden dimension.

The Mechanics of Model Extraction

Model extraction (also called model stealing or model cloning) transforms a black-box API into a local substitute model. The attacker submits a set of inputs — natural, synthetic, or adversarially chosen — records the model's outputs, and uses those input-output pairs as a training dataset for a surrogate model.

The critical insight from Tramèr et al.: for many model classes, the decision boundary can be recovered with dramatically fewer samples than the original training set. A model that required millions of labeled training examples can sometimes be cloned using tens of thousands of API calls because the attacker is learning from the model's outputs, not noisy ground-truth labels.
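The extraction loop described above can be sketched as follows, assuming a toy two-feature logistic victim. `victim_api` and its weights are hypothetical; the attacker code only calls it and never reads its internals. The surrogate is trained on the victim's returned probabilities (soft labels) with plain gradient descent.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def victim_api(x):
    """Hypothetical black-box API returning P(class 1 | x). Weights are hidden."""
    return sigmoid(1.5 * x[0] - 2.0 * x[1] + 0.3)

def extract_surrogate(query, dim, n_queries=500, epochs=300, lr=0.5, seed=0):
    """Query the victim, then fit a logistic surrogate to its soft labels."""
    rng = random.Random(seed)
    inputs = [[rng.uniform(-2, 2) for _ in range(dim)] for _ in range(n_queries)]
    soft_labels = [query(x) for x in inputs]  # the attacker's "training set"
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(inputs, soft_labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n_queries for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n_queries
    return w, b

def label_agreement(query, w, b, dim, n=300, seed=1):
    """Fraction of fresh random inputs on which surrogate and victim agree."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(n):
        x = [rng.uniform(-2, 2) for _ in range(dim)]
        surrogate_label = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5
        agree += (query(x) >= 0.5) == surrogate_label
    return agree / n

w_hat, b_hat = extract_surrogate(victim_api, dim=2)
```

Because the soft labels come from the victim itself, the surrogate's optimum is the victim's own decision function, which is why far fewer samples are needed than for training from noisy ground truth.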

Query Strategy Matters

Random queries are wasteful. Adaptive extraction strategies select queries near decision boundaries (where outputs are most informative), use active learning to prioritize uncertain regions, or leverage unlabeled in-domain data as a query set. Krishna et al.'s BERT extraction used random word sequences and task-agnostic Wikipedia passages as the query corpus.
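One way to picture boundary-focused querying is uncertainty sampling: from a pool of candidate inputs, spend the query budget on the points where the attacker's current surrogate is least sure. The sketch below assumes a logistic surrogate; the function name and parameters are illustrative, not taken from any cited paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def select_boundary_queries(pool, surrogate_w, surrogate_b, budget):
    """Return the `budget` pool points whose surrogate confidence is closest to 0.5."""
    def distance_from_boundary(x):
        p = sigmoid(sum(wi * xi for wi, xi in zip(surrogate_w, x)) + surrogate_b)
        return abs(p - 0.5)
    return sorted(pool, key=distance_from_boundary)[:budget]
```

In an adaptive attack this selection would alternate with retraining the surrogate on the newly labeled points.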

Taxonomy of Extraction Attacks

Attack Type	Query Strategy	Target Model	Fidelity
Equation-Solving	Solve for parameters analytically	Linear / logistic models	Exact for simple models
Path-Finding	Random walk across decision boundaries	Decision trees, shallow classifiers	Near-exact structure
Knowledge Distillation	In-domain or synthetic queries	Deep neural networks	High functional similarity
Projection-Layer Extraction	Logit-bias and log-probability queries	Production language models	Partial: final projection layer, up to a linear transformation
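The equation-solving row can be made concrete for logistic regression. Assuming the API returns an exact confidence value, applying the logit function to each response yields a linear equation in the unknown parameters, so a model with d features can be recovered with d + 1 queries. `victim_api` and its weights are hypothetical.

```python
import math

def logit(p):
    """Inverse of the sigmoid function."""
    return math.log(p / (1.0 - p))

def victim_api(x):
    """Hypothetical logistic-regression API; weights are hidden from the attacker."""
    w, b = [0.7, -1.2, 2.5], 0.4
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

def equation_solving_attack(query, dim):
    """Recover (weights, bias) of a logistic model using dim + 1 queries."""
    bias = logit(query([0.0] * dim))  # logit(f(0)) = b
    weights = []
    for i in range(dim):
        unit = [0.0] * dim
        unit[i] = 1.0
        weights.append(logit(query(unit)) - bias)  # logit(f(e_i)) = w_i + b
    return weights, bias

stolen_w, stolen_b = equation_solving_attack(victim_api, dim=3)
```

Rounding or truncating the returned confidences breaks the exactness of this recovery, which is one motivation for the output-restriction defenses covered later in the module.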

The OpenAI Projection-Layer Extraction Case

In 2024, a team from Google DeepMind and other institutions published "Stealing Part of a Production Language Model." They showed that by querying a language model API that exposes log-probabilities and accepts a logit bias, and then analyzing the collected logit vectors with linear algebra, they could recover the model's final embedding projection layer (up to a linear transformation), including the exact hidden dimension of the model.

Cost: The attack required under $20 in API calls to recover the full projection matrix of OpenAI's ada and babbage models, which cost far more to train. This cost asymmetry is the core economic threat of extraction attacks: a competitor or adversary can approximate parts of expensive proprietary models at negligible cost.

IP and Legal Dimensions

Model extraction has active legal dimensions. OpenAI's terms of use prohibit using output from its services to develop models that compete with OpenAI. In 2023, separate litigation explored whether an extracted model trained on API outputs constitutes a derivative work. Courts have not yet settled these questions definitively.

Key Terms

Model Extraction: Training a surrogate model using input-output pairs obtained by querying a target model's API, producing a functionally similar clone.

Surrogate Model: A locally trained model that approximates the behavior of a target black-box model for attacker purposes.

Query Budget: The number of API calls available to an attacker; a key constraint that determines extraction fidelity and attack strategy.

Fidelity vs. Accuracy: Fidelity measures how often the surrogate and victim agree; accuracy measures how often each is correct. High-fidelity extraction is possible even if the surrogate is less accurate on some tasks.
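The fidelity/accuracy distinction is easy to compute. The sketch below uses made-up prediction lists purely for illustration.

```python
def accuracy(predictions, ground_truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predictions, ground_truth)) / len(ground_truth)

def fidelity(surrogate_predictions, victim_predictions):
    """Fraction of inputs on which surrogate and victim agree, right or wrong."""
    pairs = zip(surrogate_predictions, victim_predictions)
    return sum(s == v for s, v in pairs) / len(victim_predictions)

# Illustrative data only.
labels    = [1, 0, 1, 1, 0, 0, 1, 0]
victim    = [1, 0, 1, 0, 0, 1, 1, 0]
surrogate = [1, 0, 1, 0, 0, 1, 0, 0]
```

Here the surrogate reproduces the victim's two mistakes, so its fidelity (0.875) is higher than its own accuracy (0.625).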

Lesson 2 Quiz — Model Extraction

Three questions · Select the best answer

1. Tramèr et al.'s 2016 model extraction research demonstrated that commercial ML APIs could be cloned primarily because:

Correct. The fundamental insight is that a surrogate trained on model outputs learns from near-noiseless "soft labels" that carry more information about the decision boundary than human annotations, requiring dramatically fewer samples.

Not quite. APIs return predictions, not weights. The efficiency advantage comes from learning from model outputs as near-noiseless soft labels, which encode more information than original noisy training annotations.

2. In the "Stealing Part of a Production Language Model" attack against OpenAI's production models, the attacker's primary method was:

Correct. By collecting logit vectors through the API's log-probability and logit-bias features and analyzing them with linear algebra, the researchers could recover the final projection layer, including the exact hidden dimension, for under $20 in API costs.

Not quite. The attack used carefully structured queries and analysis of the returned log-probabilities to reverse-engineer the final layer, not any exploitation of API bugs or training data access.

3. What distinguishes "fidelity" from "accuracy" in the context of model extraction?

Correct. An attacker cares primarily about fidelity — matching the victim's behavior — because a high-fidelity surrogate enables transfer attacks, bypass testing, and IP theft even if both models have identical failure modes.

Not quite. The key distinction: fidelity measures how often surrogate and victim agree with each other (regardless of ground truth). High fidelity is what enables the attacker to transfer adversarial examples and test evasion strategies locally.

Lab 2 — Model Extraction Query Strategy

Interactive AI lab · Minimum 3 exchanges to complete

Scenario: Red-Teaming a Fraud Detection API

A financial institution runs a proprietary fraud detection model as a REST API returning confidence scores. As a red-team consultant, you're tasked with assessing extraction risk and designing a responsible disclosure briefing.

Begin by asking: "What query strategy would be most efficient for extracting a fraud detection classifier, and how should I quantify extraction risk in a red-team report?"

Module 5 · Lesson 3

Membership Inference: Did Your Data Train This Model?

Membership inference attacks determine whether a specific data record was part of a model's training set — a privacy breach with legal and ethical implications.

Can an attacker determine that your medical record trained a hospital's AI system?

In 2017, Shokri, Stronati, Song, and Shmatikov published the first systematic study of membership inference against machine learning models. By training shadow models that mimic the target, they showed that the target model's confidence on training members was statistically distinguishable from its confidence on non-members, achieving attack AUCs above 0.9 for overfitted models. In 2022, Carlini et al. published "Membership Inference Attacks from First Principles," which argued that attacks should be judged by their true positive rate at very low false positive rates and demonstrated high-confidence per-sample membership inference on image classifiers and a GPT-2 language model.

The Core Observation

Machine learning models tend to exhibit overfitting — they fit training data more precisely than held-out data. This manifests as higher confidence scores on training examples than on similar non-training examples. An attacker who can observe this confidence gap can train a binary classifier (the attack model) to distinguish members from non-members.

Carlini et al.'s 2022 formalization uses the likelihood ratio test: for a sample x, compare the target model's loss on x against the loss distribution of models trained without x. If the loss is anomalously low, x was likely a training member. This approach achieves high precision at very low false positive rates — meaning individual records can be identified with high confidence.

Shadow Model Attack

Attacker trains multiple "shadow" models on datasets of known membership, learns the statistical signature of membership in confidence outputs, then applies this to the target model. Requires a distribution-similar dataset to train shadows.
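A deliberately simplified sketch of the shadow-model idea: instead of training a full attack classifier on probability vectors, it learns a single confidence threshold from shadow-model outputs whose membership is known, then applies that threshold to the target. All numbers are illustrative.

```python
def fit_threshold(shadow_confidences, shadow_is_member):
    """Pick the confidence threshold that best separates shadow members from non-members."""
    best_t, best_acc = 0.0, 0.0
    for t in sorted(set(shadow_confidences)):
        correct = sum((c >= t) == m for c, m in zip(shadow_confidences, shadow_is_member))
        acc = correct / len(shadow_confidences)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def infer_membership(target_confidence, threshold):
    """Predict 'member' when the target model is at least this confident."""
    return target_confidence >= threshold

# Confidence of shadow models on records with known membership (made-up values).
shadow_conf   = [0.99, 0.97, 0.95, 0.90, 0.80, 0.72, 0.65, 0.93]
shadow_member = [True, True, True, True, False, False, False, False]
threshold = fit_threshold(shadow_conf, shadow_member)
```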

Likelihood Ratio Test

Compare target loss L(x) against a reference distribution of losses from models that did not train on x. If L(x) is anomalously low, infer membership. Requires only a few reference models — often achievable with public checkpoints.
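A minimal sketch of the one-sided test described above, assuming the reference losses are roughly Gaussian (a modeling assumption made here for simplicity). The function name and sample values are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def membership_score(target_loss, reference_losses):
    """Probability of a loss this low if x had NOT been in the training set.

    Small values mean the target loss is anomalously low, suggesting membership.
    """
    out_dist = NormalDist(mean(reference_losses), stdev(reference_losses))
    return out_dist.cdf(target_loss)

# Losses on x from reference models trained without x (made-up values).
reference = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.1]
```

A target loss of 0.4 against these references gives a vanishingly small score, while a loss of 2.0 is unremarkable. Choosing the decision threshold on this score sets the attack's false positive rate.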

GPT-2 and LLM Membership Inference

Carlini et al.'s 2021 paper "Extracting Training Data from Large Language Models" (USENIX Security 2021) showed that GPT-2 had memorized verbatim passages from its training corpus. By generating text at low temperature and scoring outputs against known internet text, they identified hundreds of memorized sequences including personally identifiable information, private contact details, and copyrighted content.

A follow-up by Carlini et al., "Quantifying Memorization Across Neural Language Models" (ICLR 2023), found that larger models memorize more. It reports that the 6-billion-parameter GPT-J model memorizes at least 1% of its training data, and that memorization also grows with how often an example is duplicated and with the length of the prompting context. This relationship between model scale and memorization has direct implications for the safety of training on scraped internet data.

GDPR Right to Erasure Implications

If a data subject invokes GDPR Article 17 (right to erasure), the responsible party must demonstrate the record is no longer processed. Membership inference attacks could, in principle, be used to verify that a record has been removed — or to prove it has not. Machine unlearning research directly addresses this problem.

Factors That Amplify Membership Inference Risk

Overfitting: Models with high train/test accuracy gaps are most vulnerable. Regularization (dropout, weight decay, early stopping) reduces the confidence gap and therefore the attack's signal.

Small, unique training sets: Models trained on rare or unique data (specific individuals' records, rare medical conditions) show stronger memorization of those records.

Confidence score exposure: APIs that return full probability vectors provide far more signal than hard-label APIs. Restricting output to top-k classes or hard labels significantly degrades attack AUC.

Real-World Stakes

A 2023 study by researchers at ETH Zurich demonstrated membership inference against a clinical NLP model trained on hospital discharge notes. They achieved per-patient membership inference with AUC above 0.85, effectively allowing identification of patients whose notes trained the system — without access to any hospital data system.

Key Terms

Membership Inference: An attack that determines whether a specific data record was included in a model's training set by analyzing model output statistics.

Shadow Model: A surrogate model trained by the attacker on data of known membership, used to learn the statistical signature of training-set membership.

Likelihood Ratio Test: A statistical method comparing a sample's loss against a reference distribution to infer whether it was in the training set.

Memorization: The phenomenon in which a model stores and can reproduce specific training examples, as distinct from generalization.

Lesson 3 Quiz — Membership Inference

Three questions · Select the best answer

1. The shadow model attack (Shokri et al. 2017) succeeds primarily because:

Correct. The fundamental signal is the confidence gap caused by overfitting — training examples elicit higher, more peaked confidence distributions than similar non-training examples. The shadow model learns to classify this difference.

Not quite. Shadow models are trained by the attacker on data of known membership status, learning the statistical signature of membership in model outputs — which is caused by overfitting.

2. Carlini et al.'s 2021 extraction of memorized data from GPT-2 demonstrated that LLMs can memorize:

Correct. GPT-2 memorized and reproduced verbatim sequences containing PII, private contact details, and copyrighted text, generated via low-temperature sampling and verified against GPT-2's WebText training data.

Not quite. Carlini et al. demonstrated verbatim extraction of specific training sequences including personally identifiable information such as names, phone numbers, and email addresses, as well as copyrighted passages.

3. Which defensive measure most directly reduces membership inference attack success?

Correct. Differential privacy (DP-SGD) provides formal guarantees on how much any single training record can influence model parameters, directly bounding the information leakage that membership inference attacks exploit.

Not quite. Differential privacy training (DP-SGD) is the most principled defense — it adds calibrated gradient noise that mathematically bounds the per-sample influence on model parameters, directly reducing the confidence gap that membership inference exploits.

Lab 3 — Membership Inference Risk Analysis

Interactive AI lab · Minimum 3 exchanges to complete

Scenario: Clinical NLP Model Audit

A healthcare AI company is auditing a clinical text classification model trained on hospital discharge notes. A regulator has asked whether membership inference could expose individual patients. You are the AI security consultant preparing the technical brief.

Start with: "How would I quantify membership inference risk for a clinical NLP model, and what privacy guarantees can I offer to the regulator?"

Module 5 · Lesson 4

Defenses: Differential Privacy, Output Restriction, and Watermarking

Practical defenses against inversion and extraction exist — but each involves tradeoffs between security, utility, and cost.

What defenses actually work, at what cost, and how do attackers adapt?

In 2017, Apple published details of deploying local differential privacy for emoji and QuickType keyboard learning, one of the first large-scale DP deployments. Google had open-sourced its RAPPOR system for differentially private telemetry collection in 2014. In 2019, Google released TensorFlow Privacy, providing DP-SGD implementations that have since been used to train models for clinical and financial applications where membership inference risk is legally significant. These deployments demonstrated that DP training at scale is operationally feasible but comes with measurable accuracy costs: Google's published work noted 2–4% accuracy reduction on standard benchmarks at epsilon values that provide meaningful privacy.

Defense 1: Differential Privacy (DP-SGD)

Differential privacy, formalized by Dwork et al. in 2006, provides a mathematical guarantee: a mechanism M satisfies (ε, δ)-DP if adding or removing any single training record changes the probability of any set of outputs by at most a multiplicative factor of e^ε, plus a small additive slack δ. In the training context, DP-SGD clips per-sample gradients to bound their individual influence, then adds calibrated Gaussian noise before each gradient update.
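For reference, the standard formal statement of this guarantee is given below. The symbols D and D' (any two datasets differing in a single record) and S (any set of possible outputs) are introduced here for illustration.

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S] + \delta
```

Setting δ = 0 recovers pure ε-differential privacy.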

What it buys: A formal upper bound on membership inference attack advantage — regardless of attacker sophistication, the attacker cannot distinguish whether a specific record was in the training set beyond the DP guarantee. Cost: Accuracy degradation (1–5% typical), significantly increased training time, and hyperparameter sensitivity. Lower ε (stronger privacy) = larger accuracy hit.
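The two DP-SGD operations, clipping and noising, can be sketched for a single update step. This is an illustrative sketch, not a library implementation: parameter names such as `clip_norm` and `noise_multiplier` are chosen here, and translating a noise multiplier into an (ε, δ) guarantee requires a separate privacy accountant that is not shown.

```python
import math
import random

def dp_sgd_step(params, per_sample_grads, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD update: clip each per-sample gradient, sum, add noise, average."""
    dim = len(params)
    summed = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(dim):
            summed[i] += grad[i] * scale
    # Gaussian noise with standard deviation proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_sample_grads)
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]
```

Clipping bounds how far any one record can move the parameters, and the noise scale is tied to that same bound, which is what makes a formal guarantee possible.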

Epsilon Values in Practice

Published DP deployments typically report ε = 1 to ε = 10. ε = 1 is considered "strong" privacy (academic standard); ε = 8–10 is more common in production where utility is prioritized. Apple's emoji DP uses ε = 4. The US Census Bureau's 2020 decennial census redistricting data used ε ≈ 17.14 for its person-level tables after public debate about the utility/privacy tradeoff.

Defense 2: Output Restriction and Confidence Truncation

Restricting model outputs is a practical, low-cost mitigation. Returning only hard labels (no probabilities) degrades inversion and extraction attack quality significantly. Returning only top-1 or top-3 class predictions rather than full softmax vectors limits the information available per query. Rounding confidence scores to two decimal places reduces the precision available to gradient estimators.
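These restrictions are simple to apply at the API boundary. A minimal sketch, with hypothetical function names, that wraps a full probability vector before it is returned to the caller:

```python
def restrict_output(probabilities, top_k=1, decimals=2):
    """Return only the top_k (class_index, rounded_confidence) pairs."""
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return [(index, round(p, decimals)) for index, p in ranked[:top_k]]

def hard_label(probabilities):
    """Return only the index of the most likely class, with no confidence."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)
```

For example, `restrict_output([0.0712, 0.8531, 0.0757], top_k=2)` yields `[(1, 0.85), (2, 0.08)]`, hiding both the third class and the low-order digits.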

Limitation: even hard-label APIs leak some signal. Choquette-Choo et al. (2021) showed that label-only membership inference attacks achieve meaningful AUC against hard-label classifiers through a large number of queries. Output restriction is a speed bump, not a wall — but it meaningfully raises attacker cost and reduces fidelity of extracted surrogates.

Defense 3: Model Watermarking

Model watermarking embeds detectable, persistent signatures into a model's behavior — typically by training on a small set of backdoor "trigger" examples whose outputs encode an identifying pattern. If a suspected stolen model is queried on the trigger set and reproduces the watermark outputs, ownership can be asserted.
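Ownership verification then reduces to measuring how often a suspect model reproduces the secret trigger labels. The sketch below uses toy integer inputs and hypothetical models; the 0.9 threshold is an arbitrary illustrative choice, and a real claim would need a statistical argument about false positives.

```python
# Secret trigger set: (trigger input, deliberately unusual label). Illustrative values.
TRIGGERS = [(7, 0), (13, 2), (42, 3), (99, 1)]

def watermarked_model(x):
    """Toy model that memorized the trigger labels on top of its normal behavior."""
    return dict(TRIGGERS).get(x, x % 4)

def independent_model(x):
    """Toy model trained without the triggers."""
    return x % 4

def watermark_match_rate(suspect_model, trigger_set):
    """Fraction of triggers on which the suspect reproduces the secret label."""
    hits = sum(suspect_model(x) == label for x, label in trigger_set)
    return hits / len(trigger_set)

def claims_ownership(match_rate, threshold=0.9):
    return match_rate >= threshold
```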

In 2018, Adi et al. demonstrated watermarking of deep neural networks by fine-tuning on abstract trigger images with memorized labels. Zhang et al. (2018) and subsequent work showed the watermarks survive model compression, fine-tuning, and pruning to varying degrees. Limitations: Sophisticated attackers can remove watermarks via model fine-tuning or distillation on unlabeled data. 2021–2023 research showed that most existing watermarks can be removed or overwritten with modest computational budget. Active research continues on robust, watermark-aware training.

Defense 4: Query Rate Limiting and Anomaly Detection

Operational defenses (monitoring API query volume, detecting sequential boundary-probing patterns, requiring authentication for high-confidence queries) add friction without modifying the model. Major cloud ML providers (AWS SageMaker, Google Vertex AI, Azure ML) offer request logging and monitoring that can support this kind of detection. These do not prevent determined, patient attackers but significantly raise cost and risk of detection.
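As a sketch of the simplest operational control, a per-client sliding-window query limit might look like the following. The class name and parameters are hypothetical, and timestamps are passed in explicitly so the behavior is easy to test.

```python
from collections import deque

class QueryMonitor:
    """Allow at most max_queries per client within a sliding time window."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = {}  # client_id -> deque of accepted query timestamps

    def allow(self, client_id, now):
        timestamps = self._history.setdefault(client_id, deque())
        # Drop timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_queries:
            return False  # over budget: reject or flag for review
        timestamps.append(now)
        return True
```

A patient attacker can stay under any fixed limit, so in practice this kind of counter is a signal to combine with pattern-based detection rather than a complete defense.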

The Arms Race: Adaptive Attacks

Every defense spawns an adaptive attack. DP-SGD? Carlini et al. showed that membership inference against DP models remains feasible at high ε values. Output restriction? Label-only attacks work with more queries. Watermarking? Distillation and fine-tuning can remove most watermarks. Anomaly detection? Attackers can throttle query rates and use residential proxy networks.

The practical conclusion: defenses are most effective in combination — DP training + output restriction + rate limiting + watermarking — and should be selected based on the threat model's attacker budget and sophistication. No single defense provides complete protection against a well-resourced, adaptive adversary.

Key Terms

DP-SGD: Differentially private stochastic gradient descent, which clips per-sample gradients and adds Gaussian noise to provide formal (ε, δ)-differential privacy guarantees during training.

Epsilon (ε): The privacy budget in differential privacy; smaller ε means stronger privacy and typically more utility loss.

Model Watermarking: Embedding persistent, detectable behavioral signatures in a model to enable ownership verification of extracted surrogates.

Adaptive Attack: An attack that is specifically designed knowing the defense in use, representing the worst-case security evaluation scenario.

Lesson 4 Quiz — Defenses

Three questions · Select the best answer

1. In DP-SGD, what two operations are applied to protect individual training record privacy?

Correct. DP-SGD clips each training sample's gradient to a maximum L2 norm (bounding individual influence), then adds Gaussian noise scaled to the sensitivity and desired ε — ensuring no single record can shift the model's parameters beyond the privacy budget.

Not quite. DP-SGD works by (1) clipping per-sample gradients to bound maximum individual influence, and (2) adding calibrated Gaussian noise to each gradient update — the combination provides formal (ε, δ)-differential privacy guarantees.

2. Model watermarking as a defense against extraction is primarily limited by:

Correct. Research from 2021–2023 demonstrated that most existing watermarking schemes can be defeated by adversarial fine-tuning on unlabeled data or knowledge distillation, which washes out the trigger-response associations without requiring knowledge of which trigger images were used.

Not quite. The core limitation of current watermarking schemes is their vulnerability to removal via fine-tuning or distillation — an attacker with modest compute can often wash out watermark signatures without access to the trigger set.

3. What is the main limitation of output restriction (returning hard labels instead of confidence scores) as a defense against membership inference?

Correct. Choquette-Choo et al. (2021) demonstrated label-only membership inference attacks that use input perturbation and decision boundary proximity to infer membership with higher query counts but still meaningful success. Output restriction raises cost but does not eliminate the vulnerability.

Not quite. Choquette-Choo et al. showed that label-only membership inference remains feasible — by probing the model with augmented/perturbed versions of a target sample and analyzing label consistency patterns, attackers can still infer membership without confidence scores.

Lab 4 — Designing a Defense Stack

Interactive AI lab · Minimum 3 exchanges to complete

Scenario: Securing a Fintech Credit Scoring API

A fintech company exposes a credit scoring model via API. The model was trained on proprietary customer behavioral data. Leadership wants a defense strategy against model extraction, membership inference, and inversion — with acceptable accuracy tradeoffs documented.

Start by asking: "How should I prioritize and layer defenses — DP training, output restriction, watermarking, and query rate limiting — for a credit scoring API where both privacy and accuracy are critical?"

Module 5 Test — Model Inversion and Extraction

15 questions · Score 80% or higher to pass · All lessons covered

1. Fredrikson et al.'s model inversion attack against a pharmacogenetics model demonstrated that:

Correct.

Review L1: Fredrikson et al. operated only through the API, using confidence as an optimization objective.

2. GAN-based model inversion improves over pixel-optimization attacks primarily by:

Correct.

Review L1: Generative priors constrain optimization to realistic images, dramatically improving reconstruction quality.

3. Which API output format provides the richest signal for model inversion attacks?

Correct.

Review L1: Full probability vectors provide a continuous, differentiable signal for gradient-based optimization.

4. Tramèr et al.'s 2016 model extraction paper found that commercial ML APIs could be cloned because:

Correct.

Review L2: Soft label outputs carry more information about the decision boundary than hard labels, enabling efficient decision boundary approximation.

5. In the "Stealing Part of a Production Language Model" paper, researchers extracted the final projection layer of OpenAI's ada and babbage models at a cost of:

Correct. The extreme cost asymmetry, under $20 to recover part of a model that cost far more to train, is the core economic threat of model extraction.

Review L2: The attack cost under $20 in API calls, illustrating extreme cost asymmetry versus the victim's training investment.

6. Model extraction "fidelity" measures:

Correct.

Review L2: Fidelity is surrogate-victim agreement — an attacker needs high fidelity to enable transfer attacks, even if both models fail on some examples.

7. The shadow model membership inference attack (Shokri et al. 2017) exploits:

Correct.

Review L3: Overfitting causes training members to receive higher, more peaked confidence distributions — the shadow model learns to classify this pattern.

8. Carlini et al.'s 2022 "Membership Inference Attacks from First Principles" used which statistical framework?

Correct.

Review L3: The likelihood ratio test compares a sample's loss against a reference distribution from models trained without it — achieving high per-sample precision at low false positive rates.

9. Carlini et al.'s 2021 GPT-2 memorization study found that verbatim extraction was possible for passages including:

Correct.

Review L3: Carlini et al. extracted PII, contact details, and copyrighted text, verified against GPT-2's WebText training data, via low-temperature generation.

10. DP-SGD protects individual training records by:

Correct.

Review L4: DP-SGD clips gradients (bounding per-sample influence) then adds noise — the two operations together provide (ε, δ)-differential privacy.

11. Apple's production deployment of differential privacy for keyboard learning used an epsilon value of approximately:

Correct. Apple's emoji DP deployment used ε = 4, a pragmatic production value between the academic standard of ε = 1 and the near-meaningless ε > 10 range.

Review L4: Apple used ε = 4 for its emoji DP deployment, a practical balance between privacy and utility.

12. Model watermarking schemes are primarily vulnerable to removal via:

Correct.

Review L4: Fine-tuning and distillation on unlabeled data washes out watermark trigger-response associations — no knowledge of the trigger set required.

13. Choquette-Choo et al.'s (2021) label-only membership inference attack demonstrated that:

Correct.

Review L4: Output restriction raises the bar but does not eliminate the attack — label-only membership inference uses boundary proximity signals from augmented/perturbed queries.

14. Which factor MOST directly amplifies membership inference attack success against a deployed model?

Correct.

Review L3: Overfitting creates the confidence gap that membership inference exploits. Well-regularized models with small train/test gaps are significantly more resistant.

15. Which combination of defenses provides the strongest practical protection against the full range of inversion, extraction, and membership inference attacks?

Correct. No single defense fully addresses the adaptive adversary. Layered defenses raise attacker cost at each stage, forcing tradeoffs between query budget, extraction fidelity, and detection risk.

Review L4: Adaptive attackers can defeat any single defense. Layering DP training, output restriction, rate limiting, and watermarking forces the attacker to overcome multiple independent barriers.