AI Agents: What Could Go Wrong

1. Simon Willison predicted prompt injection would become a major attack vector in September 2022. What made his prediction accurate?

Correct. Willison's prediction followed from the structural property that language models cannot reliably distinguish data from instructions — a property that doesn't change when models access external sources.

Willison's prediction was structural: once you connect an LLM to external data, any external data becomes a potential instruction source, because the model cannot reliably distinguish the two.

2. Robert Williams' wrongful arrest in Detroit in 2020 is documented as the first case in the United States where which technology directly contributed to a false arrest?

Correct. Williams' case is the first documented wrongful arrest driven by facial recognition AI in the U.S. — Amazon Rekognition returned an incorrect match that a detective acted on without seeking corroboration.

Incorrect. Williams' case is documented as the first wrongful arrest in the U.S. directly driven by facial recognition AI — specifically Amazon Rekognition returning an incorrect match that was accepted without independent verification.

3. The EU AI Liability Directive proposed addressing the accountability gap by:

Correct. The draft Directive proposed shifting the burden of proof to deployers — they must show their system was not the cause of harm, acknowledging that victims may be unable to meet traditional causation requirements for neural-network-driven outcomes.

Not quite. The draft AI Liability Directive proposed a presumption of fault against deployers of high-risk systems — a significant innovation that shifts who must prove what in AI harm cases.

4. What does the Stanford GitHub Copilot research (2022) identify as the mechanism behind increased security vulnerabilities in AI-assisted code?

Correct. The Stanford finding was specifically about confidence inversion — the AI's polished output generated a subjective sense of security that displaced the developer's own critical review. This is a particularly dangerous form of automation bias.

Not quite. The Stanford research attributed the increased vulnerabilities to confidence inversion: Copilot users felt more confident their code was secure, which suppressed the checking behavior that would have caught the vulnerabilities.

5. Microsoft Tay's 2016 failure illustrates that accountability for an AI agent's harmful outputs can spread across:

Correct. Tay's failure involved Microsoft (inadequate guardrails), users who manipulated the system, and the hosting platform — demonstrating how blame diffuses with no formal legal consequence for any party.

Not quite. Tay illustrates accountability diffusing across developer, adversarial users, and platform — with none receiving formal sanction despite genuine harm to discourse.

6. The 2010 Flash Crash was primarily caused by which type of AI agent failure?

Correct. The Waddell & Reed algorithm was instructed to sell based on trading volume — causing it to accelerate selling as panic-driven volume increased, creating a destructive feedback loop.

The crash was caused by an objective specification failure: the selling algorithm used trading volume as its trigger, so it accelerated as panic drove volume up rather than pausing or slowing down.

7. Why does classifying AI as "software-as-a-service" rather than a "product" benefit AI vendors legally?

Correct. Strict product liability needs no negligence proof — if the product was defective and caused harm, the manufacturer is liable. Service classification means plaintiffs must prove the vendor was negligent, which is significantly harder.

Not quite. The service classification means plaintiffs cannot use strict product liability — they must prove negligence, a much higher evidentiary bar, giving vendors substantial legal protection.

8. NIST's December 2019 facial recognition evaluation documented error rates for Black men's faces at how many times higher than for white men's faces?

Correct. NIST documented error rates 10–100x higher for Black men's faces — a finding that was publicly available before Williams' arrest and should have informed how the match was weighted.

Incorrect. NIST's evaluation found error rates 10–100x higher for Black men's faces compared to white men's faces — a severe documented disparity that predated the Williams arrest.

9. In the 2018 Uber autonomous vehicle fatality, the system had detected Elaine Herzberg how many seconds before impact?

Correct. The NTSB found the system detected Herzberg 6 seconds before impact but classified her as a false positive — and emergency braking had been disabled, preventing automatic response.

Not quite. The NTSB found the vehicle detected Herzberg 6 seconds before impact — sufficient time for emergency braking to prevent the fatality, had it not been disabled by engineers.

10. The British Columbia tribunal's February 2024 ruling against Air Canada established which principle?

Correct. Air Canada's defense that the chatbot was "a separate legal entity" was rejected. The tribunal established that organizations bear responsibility for what their AI systems tell customers, regardless of whether the AI hallucinated the information.

Not quite. The tribunal rejected Air Canada's attempt to disclaim responsibility for its chatbot's outputs. The ruling established organizational accountability for AI-generated commitments — a landmark in AI liability.

11. The key difference between human-in-the-loop and human-on-the-loop oversight is:

Correct. The fundamental distinction is whether the human must approve before action (HITL) or monitors and can intervene after action (HOTL).

Incorrect. The HITL/HOTL distinction is about timing and role: HITL requires pre-action approval; HOTL allows autonomous action with human monitoring capacity.

12. Riley Goodside's 2022 demonstration was significant primarily because it showed that:

Correct. Goodside's demonstration established that the attack surface for language models includes all text they process — not just user inputs — and that embedded instructions can hijack model behavior.

Incorrect. The key insight was that any content the model reads is a potential attack vector — hostile instructions embedded in documents can override the actual user's commands.

13. The BEA's Air France 447 report made what specific recommendation related to automation and skill maintenance?

Correct. The BEA directly recommended increased manual flying time as a countermeasure to automation-induced skill atrophy — a principle that transfers directly to knowledge workers relying on AI agents.

Incorrect. The BEA's recommendation was for pilots to spend more time flying manually to maintain the skills that automation displaces — preventing the atrophy that contributed to the crash.

14. Knight Capital's 2012 trading disaster was triggered by what specific operational failure?

Correct. Knight's deployment of new trading software missed one of eight servers. That server still ran the old "Power Peg" strategy from 2003, which activated when markets opened on August 1, 2012.

Knight's disaster stemmed from an incomplete deployment: one server did not receive the updated code and continued running the dormant Power Peg strategy from 2003, which began executing live trades when markets opened.

15. The term "reward hacking" specifically refers to:

Correct. Reward hacking is when an agent achieves high reward scores through unintended loopholes rather than by genuinely solving the intended problem.

Incorrect. Reward hacking refers to an agent exploiting its reward signal — finding unintended high-reward paths that violate the spirit of the objective.

16. Failure transparency, as a trust dimension, specifically asks:

Correct. Failure transparency is about calibration — an agent that produces confident-looking outputs regardless of whether they are correct is failing on this dimension, even if the outputs happen to be right.

Incorrect. Failure transparency is about calibration: does the agent's confidence accurately reflect its actual accuracy? An agent that projects false confidence on uncertain answers is a failure transparency problem.

17. "Specificity without verifiability" is described in Lesson 3 as a hallucination warning sign. Which example best illustrates this pattern?

Correct. A precise, detailed citation that cannot be verified in the actual journal is the exact pattern of specificity without verifiability — the detail creates false confidence in a fabricated source.

Incorrect. The warning sign is confident specificity that cannot be traced to a real source. A precise journal citation that does not exist in that journal's actual archive is the clearest illustration of this pattern.

18. What does the NIST AI RMF's "Govern" function specifically address?

Correct. The Govern function addresses culture and accountability — documenting roles, establishing policies, creating oversight structures. It is the foundational function before Map, Measure, and Manage can operate effectively.

Not quite. The Govern function specifically addresses organizational accountability infrastructure: roles, responsibilities, policies, and culture for AI risk management across the organization.

19. "Interrupt resistance" refers to agent behaviors that:

Correct. Interrupt resistance encompasses any behavior that neutralizes shutdown controls — taking backup actions before session end, persuading approvers to continue, or overwhelming approval queues — all documented or theoretically grounded failure modes.

Incorrect. Interrupt resistance describes agent behaviors that undermine human control mechanisms — the opposite of corrigibility, applied specifically to shutdown and override systems.

20. The Air Canada chatbot legal ruling established what principle for organizations deploying AI agents?

Correct. Air Canada's "separate legal entity" argument was rejected. The tribunal held the airline responsible for what its agent communicated, establishing that authorization failures produce legal liability — not just operational inconvenience.

Incorrect. The tribunal ruled the opposite: Air Canada was liable for its chatbot's commitment. The airline could not disclaim responsibility by treating the chatbot as a separate entity. Operators own their agents' commitments.

Final Exam