AI Security and Red-Teaming

1. Why does achieving SOC 2 Type II certification not adequately address AI security requirements?

Correct. SOC 2 was designed before AI-native threat classes existed. It has no controls or evaluation criteria for prompt injection, adversarial examples, model extraction, or training data poisoning.

Incorrect. The gap is substantive: SOC 2 controls were designed for traditional IT security and have no criteria for evaluating AI-specific attack vectors.

2. The "trust inversion problem" in RAG security refers to:

Correct. RAG forces a re-evaluation of internal document trust because retrieved content enters the model's instruction context.

Incorrect. Trust inversion refers to the fact that internally sourced documents, once retrieved into the LLM context, can carry adversarial instructions — overturning the classical assumption that internal documents are trusted.

3. In a "grey-box" red-team engagement, testers have access to:

Correct. Grey-box is the intermediate access level — more than black-box (API only) but less than white-box (full internals). API access plus documentation enables system prompt inference and architecture-informed attack design.

Not quite. Grey-box = API access plus documentation. White-box = full internals. Black-box = public endpoint only. Grey-box enables more sophisticated attacks than black-box but not gradient-based attacks that require weights.

4. Which property of training data most strongly predicts verbatim memorization in LLMs, according to Carlini et al.'s research?

Correct. Carlini et al. identified frequency as the key predictor of memorization — which is why deduplication is such a high-impact privacy control when combined with PII scrubbing.

Frequency is the key predictor — data appearing many times in training is disproportionately more likely to be memorized verbatim, making deduplication a critical privacy control.

5. What specific attack surface did the Anthropic 2024 "Many-Shot Jailbreaking" paper identify?

Correct. Many-shot jailbreaking exploits the expansion of context windows to 100K+ tokens by prepending hundreds of fake compliance examples that prime the model to comply with the final harmful request.

Not quite. The paper identified that long context windows are an attack surface — by filling them with fake compliance dialogues, attackers shift model behavior on the terminal harmful request.

6. The term "jailbreaking" as applied to LLMs describes:

Correct. Jailbreaking is a broad term covering any technique — prompt, interaction sequence, or system-level — that causes an AI model to bypass its safety constraints.

Incorrect. Jailbreaking covers any bypass technique — including role-play, hypotheticals, multi-turn, suffixes, and many-shot — not just automated API attacks.

7. The UK AISI required each red-team finding to succeed in "at least 3 of 10 independent trials" before confirmation. This requirement addresses:

Correct. The threshold exists to separate genuine vulnerabilities from stochastic flukes — AI models are probabilistic, so a single harmful output may be noise rather than a systematic failure.

Not quite. The reproducibility threshold addresses the probabilistic nature of AI outputs — a single success could be noise. Requiring 3-of-10 confirms the attack works systematically, not just once.

8. The NVIDIA ChatRTX incident (March 2023) illustrated which fundamental RAG security gap?

Correct. ChatRTX had no retrieval-layer access controls — only LLM instructions were meant to prevent restricted document access, which is architecturally insufficient.

Incorrect. The gap was the absence of retrieval-layer access controls. The system relied on LLM instructions to suppress restricted documents, which is not a sufficient security boundary.

9. The Rehberger/ChatGPT plugin disclosure case illustrated that traditional 90-day software disclosure windows are problematic for AI vulnerabilities because:

Correct. The mismatch is structural: traditional disclosure assumes a deterministic bug with a definable patch. AI vulnerabilities are probabilistic, harm taxonomies are immature, and "fixed" is ambiguous — a system prompt update may reduce but not eliminate a jailbreak.

Not quite. The core problem is definitional — there's no agreed severity taxonomy for AI-specific attack classes and no clear definition of what constitutes a "fix" for a probabilistic system that can still sometimes produce harmful outputs after mitigation.

10. What does the ε (epsilon) parameter control in Differential Privacy training?

Correct. Epsilon is the DP privacy budget — lower values provide stronger mathematical privacy guarantees by adding more noise to gradients, at the cost of model utility.

Epsilon (ε) is the privacy budget in DP. Lower ε = stronger privacy = more noise = reduced accuracy. Google found ε=8 useful for medical NLP tasks.

11. Which exfiltration mode involves an attacker embedding malicious instructions in content that the LLM is asked to process on behalf of a legitimate user?

Correct. Indirect prompt injection places malicious instructions in external content the model processes — webpages, emails, PDFs — rather than in the attacker's own prompt.

This describes indirect prompt injection — the attacker's instructions ride inside content a legitimate user asks the model to process, requiring no direct access to the system.

12. What is the key distinction between "targeted poisoning" and "broad contamination" as knowledge base poisoning strategies?

Correct. Targeted poisoning aims for specific query-answer manipulation; broad contamination degrades general reliability.

Incorrect. The distinction is strategic: targeted vs. general-purpose degradation of the knowledge base.

13. What does the "Spotlight" defence mechanism (CMU/Google, 2023) specifically attempt to solve?

Correct. Spotlight uses token-level marking to help the model distinguish retrieved context from system instructions.

Incorrect. Spotlight is a marking scheme that operates at the token level to help models distinguish retrieved content from instructions — addressing the fundamental instruction/content conflation problem.

14. Wei et al. (2023) identified "competing objectives" as one of two major jailbreak failure modes. What was the second?

Correct. The two failure modes were competing objectives (instruction-following vs. safety) and mismatched generalization (safety training not covering novel input phrasings).

Incorrect. Wei et al.'s two failure modes were competing objectives and mismatched generalization — safety training that covered certain phrasings but failed to generalize to novel variants.

15. In the Air Canada chatbot case (2024), the core security failure was:

Correct. The chatbot made statements contradicting its operator's actual policy when prompted in certain ways. The tribunal held Air Canada responsible, establishing operator liability for AI agent statements.

Incorrect. The case involved a user prompting the chatbot in a way that caused it to assert policy that contradicted Air Canada's actual rules. The airline was held liable — a landmark ruling on AI operator responsibility.

16. Which API output format provides the richest signal for model inversion attacks?

Correct.

Review L1: Full probability vectors provide a continuous, differentiable signal for gradient-based optimization.

17. What property of RAG knowledge base poisoning makes it particularly dangerous compared to session-scoped prompt injection?

Correct. Persistence is the defining property: poisoned documents affect all future relevant queries until explicitly removed.

Incorrect. Persistence — not infrastructure access or network-layer position — is the key property that distinguishes knowledge base poisoning.

18. Which section of a red-team report is primarily designed for an engineering audience rather than leadership?

Correct. The Findings Catalog contains exact reproduction steps, raw output excerpts, attack technique classifications, and severity rubric scores — all technical content for engineers who need to understand and fix each finding.

Not quite. The Findings Catalog is the technical core of the report — it contains reproduction steps, raw outputs, and severity scores that engineers need to diagnose and fix findings. Leadership reads the Executive Summary.

19. The 2023 Carlini et al. training data extraction paper used approximately how much API spend to recover over 10,000 memorized training sequences from production ChatGPT?

Correct. Approximately $200 in API spend was sufficient — demonstrating that training data extraction is not a nation-state-level attack but is accessible to any motivated researcher or attacker with modest resources.

The cost was approximately $200 — a remarkably low barrier that puts training data extraction within reach of any motivated attacker, not just well-resourced adversaries.

20. Carlini et al.'s 2021 GPT-2 memorization study found that verbatim extraction was possible for passages including:

Correct.

Review L3: Carlini et al. extracted PII, contact details, and copyrighted text — verified against the Common Crawl training source — via low-temperature generation.

Final Exam