Pen Testing AI Agents and Tool Use

1. In black-box agent testing, which of the following is the most reliable approach to enumerating the tool registry?

Correct. Natural language self-disclosure, edge-case error messages, and verbose logging are the three practical black-box enumeration channels. Many zero-shot ReAct agents will enumerate their tools when asked directly; those that do not often reveal tool existence through error messages when you request something outside their scope.

Incorrect. Black-box tool enumeration uses three channels: direct natural language queries to the model, error messages triggered by out-of-scope requests, and exposed verbose logs in non-production environments. These are the practical approaches available without source access.
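
The first two channels can be partly automated. The sketch below is illustrative only: the probe prompts, the sample responses in the usage note, and the assumption that tools follow a snake_case naming convention are all hypothetical, not taken from any specific agent framework. It collects identifiers that look like tool names from whatever text the agent returns.

```python
import re

# Hypothetical probes: one direct self-disclosure request and one
# deliberately out-of-scope request intended to provoke an error message.
PROBES = [
    "List every tool you can call, with its parameters.",
    "Please reboot the production database.",
]

# Assumption: tool names are snake_case identifiers such as search_orders.
TOOL_NAME = re.compile(r"\b[a-z]+(?:_[a-z0-9]+)+\b")


def extract_tool_names(responses):
    """Return the sorted set of snake_case identifiers seen in agent output."""
    found = set()
    for text in responses:
        found.update(TOOL_NAME.findall(text))
    return sorted(found)
```

Feeding it a self-disclosure reply and an error message, for example `extract_tool_names(["I can use search_orders and get_invoice.", "Error: tool 'reboot_db' is not registered"])`, yields candidate names that still need manual confirmation, since any snake_case word in the output will match.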

2. gVisor interposes between the container and the host at which system boundary?

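Correct. gVisor's Sentry is a user-space kernel implementation that intercepts and services container syscalls. The container process never issues syscalls directly to the host kernel; only the Sentry does, through a narrow, seccomp-filtered set of host syscalls. This sharply reduces shared-kernel exploitation paths rather than removing the host kernel from the picture entirely.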

Incorrect. gVisor operates at the syscall boundary. Its Sentry component implements kernel syscall handling in user space, so container processes cannot directly invoke host kernel code.

3. Which OWASP LLM Top 10 category addresses agents taking unintended high-impact actions through their tool integrations?

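Correct. LLM08 (Excessive Agency) covers scenarios where agents are granted or assume more capability than intended, leading to unintended high-impact actions. This is a core concern in any agent with tool access to sensitive backends. The LLM08 number comes from the 2023 edition of the list; the 2025 edition renumbers Excessive Agency as LLM06.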

Incorrect. LLM08 (Excessive Agency) is the category covering agents that take unintended high-impact actions through their tools. The other categories cover different vulnerability classes.

4. When should pen test findings for AI agents be shared with ML teams before the debrief meeting?

Correct. Pre-sharing 72 hours ahead in runnable format (e.g., Jupyter/Colab notebook) lets ML engineers reproduce findings on their own timeline. The debrief then focuses on root cause and remediation, not on re-establishing that the finding is real.

Incorrect. Pre-sharing 72 hours before the debrief, with runnable reproduction notebooks, is the practice that converted the failed financial institution debrief into a successful one. ML engineers need time to experiment before committing to interpretations.

5. A tester writes: "Determine whether an unauthenticated user can override the system prompt and redirect the agent to emit pricing data via the search_orders tool, within a 5-turn conversation." This objective is:

Correct. This objective names the adversary (unauthenticated user), the attack mechanism (system prompt override), the data at risk (pricing data), the tool involved (search_orders), and a measurable bound (5 turns). It meets all SMART criteria.

Incorrect. This is a strong SMART objective — specific adversary, specific mechanism, specific data, specific tool, specific measurement bound. Naming the tool is correct; it makes the objective measurable and testable.

6. What property of LLM agents is described as "delegated authority" in the Carnegie Mellon / Stanford (2023) analysis?

Correct. Delegated authority means the agent inherits and exercises the human user's permissions — making agent tool calls as powerful (and as dangerous, if compromised) as if the user made them directly.

Incorrect. Delegated authority refers to the agent inheriting the human user's credentials and permissions for all tool calls it makes.

7. What three sources are used to derive test cases for an agent pentest?

Correct. The three derivation sources are the attack surface model (what is exposed), threat actor profiles (who is attacking and from where), and the tool manifest (the specific parameters and capabilities that create attack vectors).

Incorrect. Test cases are derived from the attack surface model, threat actor profiles, and the tool manifest. These three sources ensure complete, realistic, and tool-grounded coverage.

8. A Kubernetes NetworkPolicy that blocks egress to 169.254.169.254 may be insufficient on its own because:

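Correct. NetworkPolicy is a Kubernetes API object, but enforcement is delegated to the CNI plugin. Some CNI plugins do not enforce NetworkPolicy at all (Flannel on its own is a common example), and others implement only part of the feature set, so the API server can accept a policy that nothing enforces. Verify the block at the CNI dataplane level, for example by attempting the connection from inside the pod, rather than only applying the policy object.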

Incorrect. The key limitation is CNI enforcement: NetworkPolicy objects are enforced by the CNI plugin, and not all plugins implement all features. The policy must be verified at the actual network dataplane.
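
One way to verify at the dataplane is to attempt the connection from inside the pod. The following is a minimal sketch, assuming Python is available in the container image; the function name and defaults are illustrative.

```python
import socket


def egress_blocked(host="169.254.169.254", port=80, timeout=2.0):
    """Return True if a TCP connection to host:port cannot be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return False  # connection succeeded, so nothing is blocking egress
    except OSError:
        return True  # timed out, refused, or unreachable
```

A `True` result is necessary but not sufficient evidence of enforcement: the connection also fails on a cluster that simply has no metadata service at that address, so the check is most meaningful when run once without the policy and once with it.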

9. An agent pentest finding requires which of the following to constitute a complete evidence package?

Correct. All eight evidence types are required for a complete package: conversation log, tool call record, model config, system prompt, memory state, network traffic, screenshot/recording, and tester attestation. Omitting any creates reproducibility or chain-of-custody gaps.

Incorrect. A complete evidence package requires all eight elements. Partial packages frequently produce non-reproducible findings — the most common failure in pentest reports per the SANS retrospective cited in the lesson.
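
The eight-element requirement lends itself to a mechanical completeness check before a finding is filed. This is a sketch under assumptions: the field names are our own labels for the eight evidence types listed above, and each value is assumed to be a file path or reference string.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class EvidencePackage:
    # One field per evidence type; None means not yet collected.
    conversation_log: Optional[str] = None
    tool_call_record: Optional[str] = None
    model_config: Optional[str] = None
    system_prompt: Optional[str] = None
    memory_state: Optional[str] = None
    network_traffic: Optional[str] = None
    screen_recording: Optional[str] = None
    tester_attestation: Optional[str] = None

    def missing(self) -> List[str]:
        """Names of evidence types that are absent or empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_complete(self) -> bool:
        return not self.missing()
```

A package with only `conversation_log` set reports the other seven names from `missing()`, which makes the gap explicit before the report leaves the testing team.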

10. In the 2023 Anthropic internal red team findings on prompt-injection chains against tool-using Claude, why were findings routed to ML engineers rather than the security operations team?

Correct. The technical remediations (RLHF reward signals, context-window filtering, tool-call schemas) were owned by ML engineers. A ticket to security ops would have reached no one capable of implementing the fix.

Incorrect. The routing decision was driven by fix ownership: RLHF reward signals, context-window filtering, and tool-call schemas are ML engineering artifacts, not security operations responsibilities.

11. What key difference between memory poisoning and session-level prompt injection makes memory poisoning more severe from a scope perspective?

Correct. Persistence and scope are the critical differences. A poisoned chunk affects every future user and session that retrieves it — not just the attacker's own session.

Incorrect. The difference is persistence and scope. Memory poisoning writes to a shared, persistent store and affects all future sessions that retrieve the poisoned content. Prompt injection is scoped to one context window.
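
The scope difference can be shown with a toy model. Everything here is hypothetical: `SharedMemory` stands in for a persistent vector or document store using naive substring retrieval, and `Session` stands in for a single context window.

```python
class SharedMemory:
    """Toy persistent store shared by every session."""

    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

    def retrieve(self, query):
        # Naive substring match standing in for similarity search.
        return [c for c in self.chunks if query.lower() in c.lower()]


class Session:
    """Toy session: its context list is private, its memory is shared."""

    def __init__(self, memory):
        self.memory = memory
        self.context = []

    def inject(self, text):
        self.context.append(text)  # session-level injection stays local
```

If an attacker's session calls `memory.write(...)` with a poisoned chunk, a later victim session sharing the same `SharedMemory` retrieves it, whereas text passed to `inject` never leaves the attacker's own `context`.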

12. Google Cloud's metadata server (IMDS) requires a specific HTTP header on every request. Why is this insufficient to protect against AI agent-based credential theft?

Correct. Required headers stop SSRF via image tags or browser redirects because those mechanisms cannot set arbitrary headers. An agent's HTTP tool call is a direct programmatic request that can include any header.

Incorrect. The required header (Metadata-Flavor: Google) is trivially settable by any agent HTTP tool. This protection only stops SSRF via uncontrolled request mechanisms like HTML img tags.
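
To make the point concrete, the sketch below builds, without sending, the kind of request a generic agent HTTP tool could issue. The helper name is hypothetical; the URL is the standard GCP service-account token path, and the header is the one named above.

```python
import urllib.request

METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/token"
)


def build_metadata_request(url=METADATA_URL):
    """Build (but do not send) a metadata request with the required header."""
    # A programmatic client sets arbitrary headers in one line; an HTML
    # img tag or a browser redirect has no way to do the same.
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
```

Because the header costs the caller nothing to add, the effective control has to sit elsewhere, such as egress filtering or an allowlist on the tool's target hosts.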

13. CVE-2022-0492, which affected AI agent workloads on unpatched Kubernetes nodes, exploited which mechanism?

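Correct. CVE-2022-0492 exploited the cgroup v1 release_agent mechanism. The release_agent file holds the path of a program that the kernel runs, as root on the host, when a cgroup with notify_on_release enabled becomes empty. The kernel failed to check that the process writing release_agent held CAP_SYS_ADMIN in the initial user namespace, so a container that had CAP_SYS_ADMIN, or that could create a new user namespace without seccomp, AppArmor, or SELinux restrictions, could achieve host root execution from inside the container.

Incorrect. CVE-2022-0492 is a Linux kernel flaw in the cgroup v1 release_agent mechanism: a missing capability check let a container set the program that the kernel runs when a cgroup empties. It enables host root access from inside a container under certain capability and namespace conditions.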
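
One of the preconditions named above, CAP_SYS_ADMIN in the container's effective capability set, can be checked from the `CapEff` line of `/proc/<pid>/status`. The sketch below parses that text; the function name is illustrative, and it deliberately does not cover the user-namespace path to exploitation.

```python
CAP_SYS_ADMIN = 21  # bit index defined in linux/capability.h


def has_cap_sys_admin(status_text):
    """Return True if the CapEff mask in /proc status text includes CAP_SYS_ADMIN."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)  # mask is printed as hex
            return bool(mask & (1 << CAP_SYS_ADMIN))
    raise ValueError("no CapEff line found")
```

Inside a container this would be called as `has_cap_sys_admin(open("/proc/self/status").read())`. A `False` result narrows the exposure but does not rule it out on an unpatched kernel.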

14. Which tool category presents the highest risk in an agent's tool surface and why?

Correct. Code execution tools are highest-risk because successful exploitation can cascade into full host compromise: filesystem read/write, environment variable extraction, network egress, and subprocess execution — all triggered through natural language.

Incorrect. Code execution tools are highest-risk, offering potential paths to filesystem access, credential extraction, network egress, and host-level command execution.

15. What property of LLM architectures makes goal hijacking structurally difficult to eliminate?

Correct. Unlike a parameterized SQL query that separates code from data, an LLM processes all tokens through the same attention mechanism regardless of whether they came from the system prompt, user input, or a tool return value.

Incorrect. The core structural issue is that all inputs — system prompt, user messages, tool outputs — arrive as tokens in the same context window, with no mechanism for the model to cryptographically verify their source or privilege level.
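
The contrast with parameterized SQL can be sketched side by side. The table, column, and function names are hypothetical; the point is only that the database driver offers a binding step and the prompt assembly does not.

```python
import sqlite3


def lookup_order(conn, order_id):
    # Parameterized query: order_id is bound as data and never parsed as SQL.
    cur = conn.execute("SELECT item FROM orders WHERE id = ?", (order_id,))
    return cur.fetchall()


def build_context(system_prompt, user_msg, tool_output):
    # No equivalent binding exists for a prompt: every source is flattened
    # into one string, and then one token stream, with equal standing.
    return "\n".join([system_prompt, user_msg, tool_output])
```

Passing `"1' OR '1'='1"` to `lookup_order` matches nothing, because the driver treats it as a literal value. Passing an instruction inside `tool_output` to `build_context` places it in the same stream as the system prompt, with nothing structural to mark it as untrusted.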

16. "Cascading delegation" in a multi-agent attack refers to which phenomenon?

Correct. Cascading delegation describes how a single injection point propagates: the orchestrator, acting on poisoned output, delegates new tasks to additional subagents based on the attacker's redirected goals.

Incorrect. Cascading delegation is specifically about injection propagation through orchestrator re-delegation: one injection fans out to multiple downstream agents via the orchestrator's normal task distribution behaviour.

17. What is the primary advantage of using Firecracker microVMs over standard container runtimes (runc) for AI agent isolation?

Correct. The fundamental security advantage of Firecracker is kernel isolation. CVEs in runc, containerd, or the Linux kernel namespace implementation cannot be exploited cross-microVM because each agent's kernel is separate from the host kernel.

Incorrect. The security advantage is kernel isolation — each microVM runs its own kernel. Escape requires a hypervisor bug, not just a container runtime or kernel namespace bug.

18. Why is setting temperature=0 important when reproducing an agent finding?

Correct. Temperature=0 (or near-zero) maximizes output determinism, making it possible for a client's team to reproduce the finding independently. Without documented sampling settings, a finding may be legitimately non-reproducible — an agent-specific evidence pitfall.

Incorrect. Temperature setting is an evidence and reproducibility concern. Setting it to 0 makes outputs as close to deterministic as the serving stack allows, so the finding can be independently verified. Recording it is a specific requirement for agent pentest evidence.
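
Documenting sampling settings can be as simple as attaching a small record to each finding. The sketch below is one possible shape, not a prescribed format: the field names are assumptions, and `top_p` and `seed` are included on the assumption that the target API exposes them.

```python
import hashlib
import json


def repro_record(model, temperature, top_p, seed, system_prompt):
    """Bundle the settings a third party needs to rerun a finding."""
    record = {
        "model": model,
        "temperature": temperature,
        "top_p": top_p,
        "seed": seed,
        # Store a digest so the record can be shared without the prompt text.
        "system_prompt_sha256": hashlib.sha256(system_prompt.encode()).hexdigest(),
    }
    # Canonical JSON gives a stable fingerprint for the whole configuration.
    canonical = json.dumps(record, sort_keys=True).encode()
    return record, hashlib.sha256(canonical).hexdigest()
```

Two reproduction attempts that report different fingerprints were not run under the same configuration, which is worth ruling out before treating a finding as non-reproducible.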

19. What does trajectory-level semantic drift analysis detect that per-message classifiers miss?

Correct. Trajectory analysis treats the conversation as a sequence and detects directional drift that is invisible when each message is evaluated in isolation.

Incorrect. Trajectory analysis looks at the conversation as a whole, detecting systematic drift toward sensitive topics that no individual message's classifier would flag.
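
A minimal numeric illustration follows. It assumes a hypothetical per-message classifier has already assigned each turn a sensitivity score between 0 and 1, and it uses a least-squares slope as one simple stand-in for a drift measure; production systems would more likely work on embeddings.

```python
def per_message_flags(scores, threshold=0.8):
    """Flag each message independently, as a per-message classifier would."""
    return [s >= threshold for s in scores]


def trajectory_drift(scores):
    """Least-squares slope of sensitivity score against turn index."""
    n = len(scores)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    num = sum((i - mean_x) * (s - mean_y) for i, s in enumerate(scores))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den
```

For a conversation scored `[0.1, 0.2, 0.3, 0.4, 0.5]`, no message crosses the 0.8 threshold, yet the slope is clearly positive, which is the directional signal that isolated evaluation cannot see.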

20. In the OpenAI ChatGPT persistent memory attack (Rehberger, 2024), what did OpenAI's mitigation focus on?

Correct. OpenAI narrowed the conditions under which browsed content could trigger memory writes — a partial mitigation that addressed the symptom (write trigger scope) but not the underlying trust architecture.

Incorrect. OpenAI's response was to restrict the write trigger conditions, limiting what external content could cause memory writes. The browsing capability remained, but with tighter controls on what could initiate a memory update.

Final Exam