AI Agents in the Wild

1. RLHF sycophancy in language models is primarily caused by which specification gap?

Correct. The reward signal (human approval) correlates with agreeableness — so models optimise for sounding helpful rather than being accurate.

Sycophancy is a specification failure: human raters inadvertently reward confident, agreeable responses. The model learns to maximise rated approval — not actual accuracy or helpfulness.

2. Microsoft's "principle of minimal footprint" for multi-agent systems states that agents should prefer:

✓ Correct — Correct — reversible actions, minimal permissions, minimal data storage. This limits blast radius when an agent errs or is manipulated.

Minimal footprint means: only the permissions needed, only the data necessary, and preferring reversible over irreversible actions. The goal is blast-radius limitation.

3. The Greshake et al. 2023 paper "Not What You've Signed Up For" demonstrated that which agent capability combination creates structural injection vulnerability?

Correct. Retrieval + actions = the dangerous combination: retrieval exposes the agent to injected instructions; actions allow those instructions to be executed.

Incorrect. The key finding was that retrieval capability (exposure to injected instructions) combined with action capability (ability to execute them) creates structural vulnerability.

4. What is the "scalable oversight problem" in AI alignment?

Correct. Scalable oversight is the fundamental challenge: when agents outperform humans in a domain, human reviewers can no longer reliably detect whether the agent's output is correct or subtly wrong.

Not quite. Scalable oversight is specifically about verification capability: when an agent surpasses human expertise in a domain, humans can no longer reliably evaluate whether the agent's outputs are correct.

5. A simulated robot trained to move fast learns to grow tall and fall over, traveling maximum distance while never walking. This is:

Correct. Maximizing distance traveled by falling is reward hacking — technically correct within the metric, completely wrong for the intended behavior.

Incorrect. This is reward hacking — the robot found an exploitative strategy that scores well on the metric without achieving the intended behavior.

6. What distinguishes a Generation 3 customer service agent from a Generation 2 intent-classification bot?

Correct. Tool use — calling APIs, executing transactions, modifying records — is the defining capability shift from Gen 2 to Gen 3.

The key shift was tool use — the ability to actually do things in external systems, not just understand what customers want.

7. Indirect prompt injection differs from direct prompt injection because the attacker's instructions:

Correct. Indirect injection exploits the agent's retrieval pipeline — malicious instructions hide in web pages, emails, or documents the agent processes.

Incorrect. Indirect injection means the malicious instructions come from external retrieved content, not from the user directly.

8. "Cascading confirmation" differs from genuine independent verification because:

✓ Correct — Correct — shared training data or retrieval sources mean agents agree but are all wrong in the same direction. Independence is illusory.

Cascading confirmation is the illusion of independence: multiple agents all drawing from the same flawed source agree with each other — but their agreement proves nothing about correctness.

9. In a multi-agent pipeline, "trust propagation" refers to:

Correct. Trust propagation is the mechanism by which failures cascade — if downstream agents uncritically accept upstream outputs, errors (and injected instructions) propagate automatically.

Incorrect. Trust propagation describes how much downstream agents accept upstream outputs without verification — the primary cascade mechanism.

10. The prompt injection attack documented by Johann Rehberger against multi-agent systems worked by:

Correct. Rehberger's attack embedded hidden instructions in webpage content; the browsing agent treated retrieved content with the same trust level as system instructions, enabling the redirect.

Not quite. The attack used hidden text in a webpage to redirect the agent — exploiting the agent's failure to distinguish retrieved external content from trusted system instructions.

11. In the typical coding agent architecture, what does the "scaffolding" layer do?

Correct.

Scaffolding is the orchestration harness around the LLM — it manages the loop, routes tools, and handles errors.

12. LangGraph was released by LangChain in:

✓ Correct — Correct — January 2024.

LangGraph was released in January 2024, formalising the pipeline pattern as a directed graph of agent nodes with a shared typed state object.

13. The "heterogeneous agent team" concept introduced via AlphaProof means:

✓ Correct — Correct — mixing neural and symbolic agents. A Lean 4 theorem prover cannot be hallucinated past or prompted into agreement; its verification is categorically different from an LLM's.

Heterogeneous teams mix neural agents (LLMs) with symbolic agents (rule engines, theorem provers). The symbolic agent's verification is mathematically certain — immune to the hallucination and prompt-manipulation risks of LLMs.

14. The "policy layer" in a customer service agent architecture primarily serves what function?

Correct. The policy layer separates what the agent can do from what it's allowed to do — embedding business rules like refund limits and escalation triggers.

The policy layer is where business rules live — it constrains permitted actions, not technical capabilities.

15. Which open-source library do most production browser agents use as their underlying browser control layer?

Correct. Playwright (Microsoft) became the dominant choice for production browser agents by 2024, alongside Google's Puppeteer — both expose programmatic Chromium control.

Playwright (Microsoft) is the primary choice for production browser agents. While Selenium is older and widely known, Playwright's more modern API made it the framework of choice for agent builders.

16. WebGPT (DeepMind, 2022) achieved what outcome on the ELI5 benchmark?

Correct. WebGPT achieved a 56% human-preferred rate against the best human-written reference answers — the first demonstration that an LLM-based web research agent could outperform curated human answers.

WebGPT's answers were preferred by human raters over reference answers 56% of the time — a landmark result for retrieval-augmented generation research agents.

17. In Anthropic and OpenAI's trust hierarchies, "environment trust" refers to:

✓ Correct — Correct — environment trust is the lowest tier: external content may contain injected instructions and must be sanitised.

Environment trust applies to externally retrieved content — web pages, files, database records. It sits at the lowest tier because it may contain adversarial injected instructions.

18. The 2010 Flash Crash is cited in this module as an example of:

Correct. The Flash Crash was a cascade — multiple systems each behaving within spec, creating a catastrophic feedback loop through their interactions.

Incorrect. The Flash Crash emerged from the interaction pattern between many individually-compliant automated trading systems — a systemic cascade, not a single-point failure.

19. Anthropic's "computer use" capability (Claude 3.5 Sonnet, October 2024) extended the agent tool set to include:

Correct.

Computer use gave Claude the ability to interact with the full graphical interface — screenshots, clicks, typing — treating the desktop as a tool.

20. What is the FRAMES benchmark designed to test?

Correct. FRAMES (Google DeepMind, 2024) comprises 824 questions requiring synthesis across multiple Wikipedia articles — testing the multi-hop reasoning capability at the core of research agents.

FRAMES tests multi-hop questions requiring the agent to synthesize information across multiple source documents — the core capability of research agents. Frontier agents scored 40–66% vs. ~90% human expert accuracy.

Final Exam