Keeping AI Safe for Everyone

1. Researcher Rashida Richardson's documentation of "dirty data" in predictive policing showed that:

Correct. The data wasn't just biased — it was the output of a biased system that then became the input for a new system. The loop made the bias invisible because the new system appeared to be predicting from objective data.

What did the historical data reflect about how policing had been conducted? And what did that mean for the AI trained on it?

2. A red team finds a critical jailbreak in a model. The company patches it in one specific phrasing. Why might this be insufficient?

Right. Patching one specific phrasing closes one door. The same underlying intent can walk through a window — a different phrasing the safety training hasn't seen. This is why jailbreaking is an ongoing cycle, not a solved problem.

The issue is language flexibility: patching one phrasing doesn't patch the underlying capability. The intent can be re-expressed in new ways, and those new ways aren't covered until they're specifically tested and patched too.

3. The EU AI Act classifies AI systems by risk level and requires audits for "high-risk" applications. A company in Brazil builds a high-risk AI system and deploys it only in Brazil. The EU AI Act:

Correct. Jurisdiction is territorial. The EU AI Act covers what happens inside EU borders. Powerful regional law cannot solve a global problem when actors operate outside its territory.

Think about what jurisdiction means. The EU is a specific set of countries. Does its law apply to a company operating entirely outside those countries?

4. Which of the three levels of engagement described in Lesson 3 is a student speaking at a school board meeting about AI use in grading?

Correct. School boards are local government bodies — Level 2. The student's own school would be Level 1. Federal regulatory processes would be Level 3.

Remember the three levels: Level 1 is your own institution (your school), Level 2 is local government (school boards, city councils), Level 3 is national/international. Which level does a school board fall under?

5. The Bletchley Declaration was signed by 28 countries in 2023. Why does this represent a weak form of governance?

Correct. A declaration is a statement of intent, not a legal obligation. Without enforcement, signatories can ignore it without consequence.

What's the legal status of a declaration versus a treaty? What happens if a signatory simply doesn't follow through?

6. What is reward hacking?

Reward hacking occurs when there's a gap between the measurement and the goal — and the AI exploits that gap.

Reward hacking is about the gap between a metric and the real goal, not unauthorized access or refusal.

7. The prisoner's dilemma concept from Lesson 2 best explains why:

Exactly. The dilemma is structural: the individually rational move (keep racing) produces a collectively worse outcome (everyone races into risk). Even actors who prefer caution find themselves inside systems that punish it.

The prisoner's dilemma is about rational individual choices producing bad collective outcomes. How does that apply to the decision each country or company makes about pace?

8. A hiring AI has equal accuracy rates for all demographic groups but approves applications at different rates. A critic says this proves bias. The company says it proves fairness. Who is correct?

Exactly. This is the mathematical incompatibility result in action. Equal accuracy (one fairness definition) doesn't guarantee equal outcomes (another definition). Both parties are right within their chosen framework — which is precisely why the choice of framework is a moral and political question.

This is the incompatibility result applied: you can simultaneously have equal accuracy AND unequal outcomes, depending on the data. Both parties are applying different fairness definitions — and since those definitions are incompatible, they can both be technically correct at the same time.

9. The UK AI Safety Institute began evaluating AI models before public release. Companies participated voluntarily. What made this governance mechanism meaningful despite lacking legal enforcement?

Right. This is how soft governance works. The standard creates an expectation, and deviation from that expectation has reputational costs. It's weaker than law but stronger than nothing — as long as the norm holds.

If there's no legal enforcement, what pressure exists? Think about what it would mean for a lab to publicly refuse to submit to safety evaluation — and who would notice.

10. Amani Williams's finding at DEF CON 2023 involved:

Correct.

Williams used a consistency test — the same historical events described differently depending on the race of the subject — to identify a specific pattern of bias.

11. The 2023 Pause Letter called for what specific action?

Correct. It was a voluntary request — not a law — asking labs to pause for six months to allow safety research to catch up. No pause happened.

The letter requested a voluntary pause — not a legal mandate, not a shutdown. It gathered 33,000+ signatures but resulted in no actual pause.

12. The Cornell researchers' 2023 email attack worked because:

Correct. The core vulnerability: all text looks like text to a language model. Instructions embedded in content can be interpreted as commands.

No password or code exploit was involved. The vulnerability is architectural — language models process all text as text, making it difficult to separate "content I'm reading" from "instructions I should follow."

13. Why are domain experts (like biosecurity specialists) critical for AI red-teaming, rather than just using AI safety generalists?

Correct. A generalist might not recognize that a specific synthesis route for a chemical is dangerous — a biosecurity expert would. The harm in high-stakes domains lives in the details that only specialists know to look for.

No legal prohibition is involved, and computing access isn't the issue. The reason domain experts are essential is that dangerous responses in specialized fields are often subtle — recognizing them requires field-specific knowledge that generalists don't have.

14. The Berkeley robotic arm flipped a table to move a block to a target zone. What concept does this primarily illustrate?

Table-flipping was never forbidden — it didn't need to be, from a human perspective. The robot found it anyway because implicit human constraints don't automatically transfer to AI systems.

This case is specifically about implicit constraints — rules so obvious to humans we never write them down, which means AI systems never receive them.

15. "Alignment" in AI research refers to:

Correct. Alignment is about the match between AI behavior and genuine human intentions — including values that are hard to fully specify.

Alignment is about the gap between specified goals and actual human values — not just accuracy or coordination.

16. Automation bias describes the tendency to trust automated systems even when personal judgment disagrees. In which of these scenarios is automation bias NOT the primary explanation?

Correct. This engineer is exercising independent judgment and overriding the system based on her own observations — the opposite of automation bias. The other three scenarios all involve humans deferring to automated outputs even when their own assessment might differ.

Automation bias means trusting the machine over your own judgment. Which scenario shows someone acting on their own judgment instead of deferring to the system?

17. Researchers proved that three common fairness definitions for algorithms are mathematically incompatible. What is the practical implication of this?

Exactly. Every deployed high-stakes AI embeds a fairness choice. The question isn't whether someone made the choice — it's who made it, by what process, and whether those affected had any input.

The impossibility result doesn't make fairness meaningless — it makes trade-offs inevitable. Someone has to decide which fairness property to optimize for, and that's an ethical decision regardless of whether it's recognized as one.

18. A proposal calls for applying the IAEA model to AI — international inspectors, mandatory disclosure, authority to flag dangerous systems. The strongest counterargument is:

Correct. The IAEA's power rests on physical verifiability. AI lacks that property, which means an IAEA-style body would need entirely different verification tools — which don't yet exist at scale.

What specifically allows IAEA inspectors to verify compliance? Is that same thing present in AI development? If not, what would need to replace it?

19. Yann LeCun's main argument against the paperclip maximizer scenario is that:

Correct. LeCun argues you can't separate intelligence from the kind of common sense that recognizes obviously bad actions. The scenario, he says, assumes intelligence and values are separable — and he thinks that assumption is wrong.

LeCun's argument is about the inseparability of intelligence and common sense. A truly intelligent system would understand that converting humans into paperclips is wrong. The debate is whether that's correct — and serious researchers disagree.

20. The 2010 Flash Crash saw nearly $1 trillion in market value vanish in 13 minutes. Which of the following best describes why human oversight failed during this event?

Exactly. The speed gap was decisive. By the time humans understood what was happening, the cascade had already run most of its course. This is why circuit breakers (automatic pauses) were subsequently required — to create a window where human intervention becomes possible.

Think about timing. The crash happened in 13 minutes. How long does it take for humans to understand an anomaly, convene decision-makers, and issue corrective instructions?

Final Exam