Module 7 · Lesson 1

The Researchers Who Saw It Coming

Before there were headlines, there were a handful of people taking the problem seriously enough to build their careers around it.

Who decided AI safety was worth working on — and what pushed them to start?

On May 30, 2023, Ilya Sutskever, co-founder of OpenAI and one of the most cited AI researchers alive, put his name to a public statement. The statement was a single sentence. It said, in plain language, that mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.

Over 350 AI scientists, executives, and engineers signed it on the same day. Many of them had spent years building the very systems they were now warning about. The statement was published by the Center for AI Safety, a small research nonprofit in San Francisco that most people outside the field had never heard of.

Reporters called it shocking. But to the researchers who had been working on AI safety for a decade, it wasn't shocking at all. It was, if anything, overdue. The question is: who had been working on this before it became a headline — and why did they start?

The Field That Almost Wasn't

In the early 2000s, worrying about AI safety was considered eccentric, maybe even embarrassing. AI systems at the time were narrow: a computer had beaten the world chess champion in 1997, but machines could still barely recognize photographs or hold a conversation. Talking about existential risk from AI was the kind of thing that got researchers laughed out of conferences.

Then in 2000, a philosopher named Nick Bostrom at Oxford University published a paper asking a question almost nobody else was asking: what happens when AI systems become significantly more capable than humans? Not science fiction AI — real systems, built the same way the existing ones were built, just much better. He called the scenario he was worried about a superintelligence, meaning an AI whose reasoning ability exceeds the best human experts across most or all domains.

Most people dismissed it. A handful didn't.

In 2000, a programmer and writer named Eliezer Yudkowsky co-founded the Singularity Institute for Artificial Intelligence, later renamed the Machine Intelligence Research Institute (MIRI) and now based in Berkeley, California. It became one of the first organizations in history dedicated specifically to making sure future AI systems wouldn't cause catastrophic harm. He was 20 years old. He had no PhD. He had been writing about AI risk online since 1996, when the web was barely a few years old.

What made Yudkowsky take it seriously when almost nobody else did? He had thought through a specific scenario: an AI that is given a goal — almost any goal — and is sufficiently capable might pursue that goal in ways that humans don't anticipate, don't want, and can't stop. The problem wasn't that AI would be evil. The problem was that it might be indifferent to human welfare while being extremely effective at achieving something else. He called this misaligned goals — the AI optimizes hard for the wrong thing.

Misaligned goals: When an AI system pursues an objective that doesn't match what humans actually want, often because the goal was poorly specified in the first place.

The Moment the Field Became Real

For years, MIRI operated on small donations, and its writing was read mostly by people who found it through online forums. That changed in 2014, when Nick Bostrom published a book called Superintelligence. It became a bestseller. Elon Musk tweeted that it was worth reading and that we need to be "super careful with AI." Bill Gates publicly recommended it. Suddenly, AI safety was in the news.

That same year, 2014, a group of researchers at Google, most of them working on machine learning, started asking internal questions about what would happen if their systems scaled up significantly. In academia, one of the most prominent people raising similar concerns was Stuart Russell, a computer science professor at UC Berkeley and co-author of the standard university textbook on AI. Russell wasn't a fringe figure. He was about as mainstream as AI research gets. And he started saying publicly that the field needed to think seriously about whether AI systems would actually do what humans wanted them to do.

In 2016, Russell founded the Center for Human-Compatible AI (CHAI) at Berkeley, specifically to work on the technical problem of making AI systems that reliably understand and pursue human values, not just the specific goals written into their code.

A few months before that, in December 2015, Elon Musk, Sam Altman, and several top researchers had co-founded OpenAI, explicitly framing it as a safety-first AI lab. The stated reason: if powerful AI was coming regardless, it was better to have safety-focused researchers at the frontier than to leave the field to organizations that didn't think about the risks.

That logic — you can't stop it, so you'd better be the one doing it carefully — is one of the central tensions in AI safety that you'll see come up again and again.

Ethical Tension

If you believe a dangerous technology is inevitable, does building it yourself — while trying to be careful — actually make the world safer? Or does it just accelerate the timeline? There is no clean answer to this. Some of the people who co-founded OpenAI in 2015 later left, saying the lab had become what it was created to prevent.

You Now See What Most People Miss

Most people think AI safety research started recently — maybe 2022 or 2023, when ChatGPT became famous and politicians started making speeches about it. You now know that the foundational work started decades earlier, built by a small number of people who were dismissed, then gradually taken seriously, then suddenly treated as prophets.

That history matters. It tells you that the organizations working on AI safety today didn't appear out of nowhere. They have intellectual genealogies — specific arguments, specific people, specific moments when the field shifted. MIRI, CHAI, OpenAI's safety team, the Center for AI Safety: these aren't interchangeable. They have different philosophies, different methods, and sometimes disagree sharply with each other.

Knowing where each organization came from helps you evaluate what they say — and what they might be missing.

You Can Now See This

When a tech CEO says their company "takes safety seriously," you can now ask: what organization did their safety team come from? What specific problem are they trying to solve? Are they focused on misaligned goals, on near-term harms, or on longer-term risks? The answers tell you far more than the headline does.

Module 7 · Quiz 1

The Researchers Who Saw It Coming

5 questions · Choose the best answer for each

1. Eliezer Yudkowsky founded the organization that became MIRI to address a specific concern. What was it?

Correct. Yudkowsky's core argument was misaligned goals — an AI optimizing hard for the wrong objective, not a malicious one.

Not quite. The key insight was that harm doesn't require malice — just a powerful system pursuing the wrong goal relentlessly.

2. Stuart Russell founded the Center for Human-Compatible AI at Berkeley. What specific technical problem was CHAI designed to solve?

Exactly right. CHAI's founding problem was value alignment — the gap between what we write into AI code and what we actually want.

CHAI's focus is specifically on the value alignment problem — the gap between stated goals and actual human values.

3. A new AI lab claims it is "safety-focused" because powerful AI is coming and it's better to develop it carefully than to leave it to less careful organizations. What is the core tension in this argument?

Yes. This is the exact tension that caused some OpenAI co-founders to later leave — the worry that "building it carefully" still means building it faster.

The tension is about acceleration: does participating in the race, even carefully, make the outcome safer — or just sooner?

4. The 2023 statement on AI extinction risk was signed by hundreds of AI researchers. Why would this statement carry more weight than, say, a statement from a group of philosophers or politicians?

Right. Credibility here comes from proximity — these are the people who know the systems from the inside.

Think about who is signing: these are the builders, not outside observers. Their technical proximity to the systems is what gives the statement weight.

5. Nick Bostrom's 2014 book Superintelligence helped move AI safety into mainstream conversation. Before that book, what was the main reason AI safety researchers weren't taken seriously?

Correct. The systems of the early 2000s were narrow and limited, so warning about existential risk looked absurd to most of the field at the time.

The issue was that the gap between current AI and dangerous AI seemed enormous — making the warnings sound like science fiction to mainstream researchers.

Module 7 · Lab 1

Mapping the Pioneers

You're building a briefing on early AI safety history. Your research partner will push back.

Your Assignment

You've been asked to write a one-paragraph briefing on why AI safety research started when it did and who drove it. Your AI research partner has read the same material — but will challenge your framing, ask for evidence, and point out what you might be getting wrong.

Your job is to defend a position, not just summarize facts. Take a stance: was the early AI safety movement too early, too late, or well-timed? Who deserves the most credit for making it a legitimate field?

Start by stating your position. Example: "I think Yudkowsky deserves the most credit because..." — then I'll push back on whatever you say.

Research Partner

AI Safety History

Alright — you're writing a briefing on the early AI safety field. I've read the same sources. Give me your take: who was most important in making AI safety a real field, and why? Don't just list names — pick one and defend it. I'll tell you what you're missing.

Module 7 · Lesson 2

The Labs Inside the Labs

Some of the most powerful AI companies in the world have entire teams whose job is to slow down or limit what those companies build.

How do safety teams inside AI companies actually work — and do they have any real power?

On November 17, 2023, the board of directors of OpenAI fired Sam Altman, the company's CEO. The reason, they said, was that he had not been "consistently candid" with the board. What exactly he had withheld, they did not say publicly.

Within days, more than 700 of OpenAI's roughly 770 employees signed a letter threatening to quit unless the board reversed its decision. Several members of OpenAI's own safety team signed the letter too. Five days after the firing, Altman was reinstated. Most of the board members who had fired him stepped down.

The episode revealed something that most people outside the industry didn't know: OpenAI's structure was specifically designed so that a safety-focused nonprofit board could override commercial decisions. That structure had just failed its first real test, or succeeded in it, depending on who you ask. Either way, a safety mechanism built into the company had been used and then effectively dismantled, all in under a week.

What "Internal Safety Team" Actually Means

Every major AI lab — OpenAI, Google DeepMind, Anthropic, Meta AI — now has some version of a dedicated safety team. But the word "safety" can mean very different things depending on which lab you're talking about.

At Anthropic, founded in 2021 by Dario Amodei, Daniela Amodei, and several colleagues who left OpenAI specifically over safety disagreements, nearly the entire company is framed around safety research. Anthropic doesn't just have a safety team — it argues that safety and capability research are the same work. Their AI system, Claude, is designed from the ground up using a method called Constitutional AI, where the AI is trained to reason about its own behavior against a set of principles, not just to follow rules. Anthropic's stated mission is to be the company that figures out how to make powerful AI systems that don't harm people — while building some of the most powerful AI systems in existence. The tension there is the same one from Lesson 1.

At Google DeepMind, safety work is distributed across multiple teams. Some focus on near-term harms, such as bias in language models or dangerous misuse. Others work on longer-term alignment problems: what happens as systems become more capable? Researchers on that side have included Geoffrey Irving, who led an alignment team there, and Jan Leike, who later moved to OpenAI and then to Anthropic.

At Meta AI, the philosophy is different again. Yann LeCun, Meta's chief AI scientist, has publicly argued that current AI systems — including large language models — are not on a path toward dangerous superintelligence at all, and that treating them as if they are is a distraction from more immediate, tractable problems like bias and misinformation. This puts Meta's safety philosophy in direct conflict with the approaches at Anthropic and OpenAI.

Constitutional AI: A training method developed by Anthropic where an AI system is taught to evaluate its own responses against a set of written principles, essentially learning to critique itself before responding.

The Power Problem

Here is the structural problem with internal safety teams: they exist inside companies whose financial survival depends on shipping products. A safety team that says "we need to delay this release by six months" is costing the company money, competitive position, and possibly the ability to raise funding. The safety team doesn't control the budget. It doesn't set the product roadmap. It advises.

In May 2024, Jan Leike, who co-led the Superalignment team at OpenAI, a team explicitly created to solve the long-term alignment problem, resigned. In his resignation post on X (formerly Twitter), he was unusually blunt, writing that "safety culture and processes have taken a backseat to shiny products." He said he had been disagreeing with OpenAI's leadership about the company's core priorities for some time, and that he no longer believed the company was on the right path.

Just days earlier, Ilya Sutskever, the same co-founder who signed the AI safety statement in 2023, had also announced he was leaving the company. Both departures were significant: these were the people who were supposed to be the internal check on OpenAI's development speed.

This is a pattern worth watching. Safety teams get founded with real authority, then gradually find that authority eroded as commercial pressures grow. It has happened at OpenAI. A useful name for the pattern is safety normalization: safety becomes a brand and a talking point rather than an actual constraint on decisions.

Ethical Question — No Clean Answer

If a safety researcher stays at a company they believe is making dangerous decisions, they might slow things down slightly from the inside. If they leave, they lose all influence over what the company does next. Which choice is more ethical — staying in a compromised position, or leaving with your integrity intact? This question has no clean answer. People who work in AI safety disagree about it bitterly.

What Anthropic Did Differently — and Why It Matters

When the Amodei siblings and their colleagues left OpenAI in 2021, they didn't just start a new lab. They structured it differently. Anthropic is a public benefit corporation — a legal structure in the US that requires the company to consider public benefit alongside profit, not just profit. It also has a Long-Term Benefit Trust that is supposed to hold the company accountable to its safety mission even if investors push for faster, less careful development.

Whether that legal structure actually constrains behavior in practice is an open question, one that will probably be tested in the next few years as AI systems become more capable and the financial stakes get higher. But the structure itself reflects the same lesson that OpenAI's board crisis later made vivid: if you want safety to be a real constraint, it has to be baked into the company's legal DNA, not just its culture.

You now understand something that even most journalists covering AI don't clearly grasp: the difference between a company that has a safety team and a company that has built safety into its governance structure is enormous. One is a department. The other is a legal obligation.

You Can Now See This

Next time you read that an AI company "prioritizes safety," look up how they are legally structured. Is safety embedded in their charter, their board, their investor agreements — or is it just a team name on an org chart? The answer tells you far more than their press releases do.

Module 7 · Quiz 2

The Labs Inside the Labs

5 questions · Apply what you know about internal safety structures

1. In November 2023, OpenAI's board fired Sam Altman — then reinstated him five days later. What did this episode reveal about OpenAI's safety governance structure?

Correct. The episode showed that safety governance structures can be undone by commercial and social pressure even when they operate exactly as designed.

The key lesson was about structural fragility — the safety mechanism worked briefly, then was overwhelmed by other pressures.

2. Jan Leike resigned from OpenAI's Superalignment team in May 2024, saying "safety culture and processes have taken a backseat to shiny products." What does this illustrate about internal safety teams?

Yes — this is what the lesson called "safety normalization": safety becomes a brand rather than an actual constraint.

The pattern described is safety normalization — where safety teams gradually lose decision-making power as commercial incentives dominate.

3. A technology startup tells investors: "We have a dedicated AI ethics and safety team of 15 people." Based on what you've learned, what is the most important follow-up question?

Exactly. A safety team that can only advise — but not block or delay — is structurally limited in its ability to prevent harm.

The most important question is about power, not size or output. Can they actually stop something? Or just write reports about it?

4. Yann LeCun at Meta AI disagrees with the approach taken by Anthropic and OpenAI's safety team. What is the core of his disagreement?

Correct. LeCun's position is that the catastrophic AI risk framing is wrong about what current systems actually are — not that safety doesn't matter at all.

LeCun's argument is specifically about whether current AI trajectories actually lead to dangerous superintelligence — he thinks they don't, and that the focus should shift accordingly.

5. Anthropic is structured as a "public benefit corporation" rather than a standard company. Why does this matter for AI safety specifically?

Right. The lesson's key insight: safety as a legal obligation is much more durable than safety as a team name or company culture.

The significance is structural: a legal requirement is harder to erode under pressure than a cultural commitment or an organizational chart entry.

Module 7 · Lab 2

The Safety Audit

You're auditing a fictional AI company's safety structure. Defend your conclusions.

Scenario

You're an independent auditor reviewing a fictional AI company called NovaMind. NovaMind has: a 20-person safety team, a Chief Safety Officer who reports to the CEO, a safety review process for new products, and a company culture document that says "safety first." But the CSO has no veto power over product releases, and safety team members are evaluated partly on product shipping metrics.

Your partner has reviewed the same materials. State your overall assessment: is NovaMind's safety structure genuine or mostly cosmetic? Then defend it as I push back.

Give your verdict on NovaMind's safety structure. Is it real or window dressing? Why?

Audit Partner

Safety Structure Analysis

I've reviewed the same NovaMind briefing you have. Before you give your verdict — I want to flag that a 20-person safety team is actually larger than what many real AI companies have. Does that change your read? Give me your overall assessment first, then we'll get into the details.

Module 7 · Lesson 3

The Nonprofits and the Academics

Not everyone working on AI safety works for an AI company. Some of the most important voices work precisely because they don't.

What can researchers outside the big labs actually accomplish — and why does independence matter?

On May 16, 2023, Sam Altman sat before the US Senate Judiciary Committee. Senators on both sides of the aisle asked him about AI risks. Altman agreed with many of their concerns. He suggested the government should create a new agency to regulate AI. He said OpenAI wanted oversight. The hearing was widely covered.

What got less coverage: two months later, on July 25, 2023, a subcommittee of the same Senate committee heard testimony from Yoshua Bengio, one of the three researchers who won the 2018 Turing Award, often called the Nobel Prize of computing, for the deep learning breakthroughs that made modern AI possible. Bengio had come to Washington because he had recently started speaking publicly about something he found deeply uncomfortable: he believed the systems he helped invent were now potentially dangerous, and that the people building them commercially had too much financial incentive to minimize those risks.

Within a year, Bengio would become one of the most prominent voices calling for an international treaty on AI development — a binding agreement between governments, like the treaties on nuclear weapons and chemical weapons. That call came not from inside a lab, but from a university professor who had nothing to sell.

Why Independence Changes What You Can Say

There's a structural reason why some of the bluntest warnings about AI risk come from academic researchers and nonprofit organizations rather than from the companies building the systems: independence means you don't have a product to protect.

When a researcher at Anthropic says a system might be dangerous, they are also saying something about a system their employer sells. When a researcher at a university says the same thing, they have nothing to lose financially. That's not to say company researchers can't be honest — many are — but the incentive structure is different, and incentive structures shape what gets said out loud.

The academic and nonprofit side of AI safety includes several distinct types of organizations. The Center for AI Safety (CAIS), run by Dan Hendrycks, focuses on specific technical risks and published the 2023 statement on extinction risk. It operates on donations and has no commercial AI products. The Future of Life Institute (FLI), founded in 2014 by physicist Max Tegmark and others, works on policy and published the 2023 open letter calling for a six-month pause in advanced AI development — a letter signed by over 30,000 people including Elon Musk and Steve Wozniak.

The AI Now Institute takes a very different approach. Founded in 2017 by Kate Crawford and Meredith Whittaker at New York University, it focuses not on hypothetical future superintelligence but on AI systems causing harm right now: surveillance, hiring discrimination, welfare systems that deny benefits based on faulty algorithms, and the concentration of AI power in a small number of companies.

These organizations are sometimes in conflict with each other, not just with AI companies. Whether you should focus on near-term concrete harms or longer-term speculative catastrophes is one of the most significant debates in the AI safety field.

Near-term vs. long-term safety: Near-term safety focuses on harms happening today: bias, surveillance, misuse. Long-term safety focuses on risks from future, more capable systems: misaligned goals, loss of human control. Researchers in both camps sometimes argue that the other camp is wasting resources on the wrong problem.

The Pause Letter and Its Critics

In March 2023, the Future of Life Institute published an open letter calling for a six-month pause in training AI systems more powerful than GPT-4. The letter argued that no one — not the labs, not governments, not the public — was prepared to handle what came next, and that a pause would give society time to catch up.

The letter was signed by many prominent AI researchers and public figures. It was also criticized, sharply, by many others — including some who deeply care about AI safety.

Timnit Gebru, a researcher who was forced out of Google's AI ethics team in 2020 after co-authoring a paper about the risks of large language models, argued that the pause letter was a distraction. In her view, focusing on hypothetical future superintelligence allowed companies to avoid accountability for the harms their current systems were causing today. She founded the DAIR Institute (Distributed AI Research Institute) specifically to do AI research that isn't funded by the big labs — and to focus on communities most affected by AI systems right now, not on scenarios that might happen in 20 years.

This is the kind of real, substantive disagreement that exists inside the AI safety world — not just between safety researchers and AI companies, but among people who all agree AI can cause serious harm. What kind of harm, and when, and to whom — those questions divide the field.

Ethical Question — No Clean Answer

If you had to allocate $100 million in AI safety research funding, how much would you direct toward near-term harms — bias, surveillance, algorithmic discrimination affecting people today — versus long-term alignment research aimed at preventing catastrophic risks from future, more capable systems? There is no objectively correct answer. Thoughtful, informed people argue both ways. What does your reasoning tell you?

The Turing Award Problem

Yoshua Bengio is not the only Turing Award winner who became a prominent voice on AI safety. Geoffrey Hinton, whose deep learning research was as foundational as Bengio's, left Google in May 2023. His stated reason: he wanted to speak freely about AI risks without worrying about how it reflected on his employer. He said publicly that he regretted some of his life's work — that he had contributed to building something he now worried about.

The fact that the people who helped build modern AI are now among its loudest critics is not a contradiction. It reflects something important: understanding a technology deeply enough to build it is exactly what generates the most credible fears about where it's going. These researchers aren't worried about science fiction. They are worried about specific mechanisms they understand firsthand.

You now understand something most people in any general conversation about AI don't: the AI safety field isn't a monolith. It contains Turing Award winners who helped build the technology and now warn about it, nonprofit researchers focused on discrimination and surveillance happening right now, policy advocates pushing for international treaties, and academics who believe the whole catastrophic-risk framing is overblown. Understanding which type of voice is speaking — and why — is the first step to evaluating anything they say.

You Can Now See This

When a headline says "AI safety researchers warn of risks," you can now ask: which kind? Near-term or long-term? Academic or nonprofit? Former industry, or always independent? A warning from Timnit Gebru about algorithmic bias in welfare systems and a warning from Yoshua Bengio about AI treaties are both from people who deeply understand AI — but they are pointing at very different problems.

Module 7 · Quiz 3

The Nonprofits and the Academics

5 questions · Think about independence, incentives, and what different researchers are actually worried about

1. Why did Geoffrey Hinton leave Google in May 2023, according to his own stated reasoning?

Correct. Hinton explicitly said independence was the reason — he couldn't speak freely about risks while his words could affect a commercial employer.

Hinton's own explanation was about freedom of speech — he couldn't say what he thought while employed by a company that sells AI systems.

2. Timnit Gebru criticized the Future of Life Institute's 2023 pause letter. What was the core of her argument?

Yes — Gebru's argument is that the catastrophic-future framing is a distraction from near-term, concrete harms affecting real people today.

Gebru's critique is about what the letter's framing ignores, not about the length of the pause. It lets companies off the hook for current harms.

3. A researcher who helped build the technology they now warn about is sometimes dismissed as hypocritical. Based on what you've learned, why might their warnings actually be more credible than someone who never worked in the field?

Exactly. Building something deeply is what generates the most credible fears — you know what's actually inside the system, not just what it looks like from outside.

The credibility comes from technical depth, not credentials. Knowing the internals of a system firsthand is what makes a warning grounded rather than speculative.

4. The AI Now Institute and the Center for AI Safety both work on AI safety. What is the most significant difference between their approaches?

Correct — this is the near-term vs. long-term divide that runs through the whole field.

The key distinction is temporal: AI Now works on harms happening now; Center for AI Safety focuses on what more powerful future systems might do.

5. Yoshua Bengio has called for an international treaty on AI development, similar to treaties on nuclear or chemical weapons. What makes this call notable given his background?

Yes. The weight of the call comes from the source — a Turing Award winner whose own research is foundational to the systems he's now asking governments to regulate.

The significance is who is saying it: one of the inventors of the underlying technology is now calling for international constraints on that same technology.

Module 7 · Lab 3

Near-Term vs. Long-Term

You're advising a foundation on where to put $10 million in AI safety funding. Defend your allocation.

The Decision

A fictional philanthropic foundation has $10 million to allocate to AI safety research. They can split it any way they want between two priorities: (A) near-term harms — bias in hiring algorithms, surveillance systems, AI used in welfare decisions — or (B) long-term alignment — research on making future, more capable AI systems reliably pursue human values.

Your partner will push back on whatever split you propose. You need to defend the reasoning, not just pick a number.

State your proposed split (e.g., "70% near-term, 30% long-term") and your core reason. Then I'll challenge it.

Foundation Advisor

Funding Strategy

I've seen a lot of funding proposals in this space. Most people default to one camp without really thinking through the tradeoffs. Before you give me your split — what's your underlying theory? Do you think near-term harms and long-term risks are connected, or are they basically separate problems requiring separate strategies?

Module 7 · Lesson 4

Governments Enter the Room

For most of AI's history, safety was a conversation between researchers. Then governments decided they had something to say about it too.

What happens when governments try to regulate AI safety — and who actually writes the rules?

On November 1, 2023, representatives from 28 countries gathered at Bletchley Park — the site where British codebreakers, including Alan Turing, cracked Nazi communications during World War II. The location was not accidental. The British government, which organized the event, wanted to signal that this moment was historic: the first international summit on AI safety.

During the two-day summit, all 28 countries, including the United States and China, along with the European Union, agreed to what became known as the Bletchley Declaration. It acknowledged that AI "presents enormous global opportunities" but also the potential for "serious, even catastrophic, harm" and called for international cooperation on safety. It was widely noted as a rare case of China and the US endorsing the same statement on AI.

What the declaration didn't do: set any binding rules, create any enforcement mechanism, or require any company to do anything differently. It was a statement of shared concern, not a shared set of actions. Critics called it a photo opportunity. Supporters called it the beginning of something. Both, in different ways, were right.

The EU Goes First

While individual countries debated and convened summits, the European Union moved faster than anyone else to actually write law. In March 2024, the European Parliament passed the EU AI Act — the first comprehensive law governing AI anywhere in the world.

The Act works by assigning AI systems to risk categories. Systems in the highest-risk category — things that could affect people's fundamental rights, like AI used in criminal sentencing, hiring, or critical infrastructure — face the strictest requirements: mandatory testing, documentation, human oversight, and the right for people affected by AI decisions to seek explanation and appeal. Systems with lower risks face lighter requirements. Purely recreational AI faces almost no restrictions.

The Act also bans some uses of AI entirely in Europe: real-time biometric surveillance in public spaces (with narrow exceptions), AI systems that exploit psychological vulnerabilities to manipulate behavior, and social scoring systems like the ones used in parts of China.

Who wrote the draft legislation? EU officials and technical experts did the drafting, but AI company lobbyists had heavy input. OpenAI, Google, and other major AI companies spent significant resources trying to weaken parts of the Act, and some provisions were significantly watered down between the first draft and the final vote. This is not a conspiracy; it is how legislation almost always works. But it means the final law reflects what the industry would accept as much as what safety researchers recommended.

EU AI Act: A law passed by the European Union in 2024 that regulates AI systems based on the level of risk they pose, with the strictest requirements for AI used in high-stakes decisions about people's lives.

The US Takes a Different Path

In the United States, there is no comprehensive AI law as of 2024. Instead, President Biden issued an executive order on AI in October 2023, which required companies developing the most powerful AI systems to share safety testing results with the government before public release. It also created new federal standards for AI used by the government itself and directed federal agencies to consider AI risks in their existing regulatory frameworks.

An executive order is not a law: a future president can revoke it. That is what happened in January 2025, when the incoming administration rescinded the Biden AI executive order. This illustrates a core problem with the US approach: unless Congress passes actual legislation, AI governance depends on who is in the White House and can change with each administration.

China's approach serves as a reference point for a different reason. China has implemented real, binding AI regulations, but they are aimed primarily at controlling the use of AI for political speech and content that challenges the government, rather than at protecting individuals from AI harm. China's regulations are comprehensive and enforced, but their purpose is fundamentally different from the EU's. What looks like "AI safety regulation" from the outside can mean very different things depending on what, and whom, the government is trying to protect.

The US, the EU, and China represent three genuinely different theories of what AI governance should do: the US preference for voluntary commitments and industry self-regulation; the EU preference for rights-based legal frameworks that protect individuals; and China's government-control model that regulates AI use primarily in the interest of political stability. None of these is simply "safe" or "unsafe." Each reflects different values about what kind of power is most dangerous.

Ethical Question — No Clean Answer

If the EU AI Act restricts certain high-risk AI applications in Europe but those same applications are freely available in countries without such laws, has the regulation actually made anyone safer? Or has it just moved the risk to places with less protection? Shifting an activity to wherever the rules are weakest is called regulatory arbitrage, and it is one of the hardest problems in international AI governance. There is no clean answer.

What This Means for You

Here is the full picture you now have: AI safety is being worked on simultaneously by a small number of pioneering researchers who started in the early 2000s; by dedicated teams inside commercial AI labs whose power is structurally limited; by independent nonprofits and academics who can speak more freely precisely because they don't have products to sell; and now by governments, some of which are writing real law and some of which are issuing press releases.

These groups don't agree on what the problem is, don't agree on what the solutions are, and sometimes work actively against each other. The AI lab safety team and the independent nonprofit researcher may both use the phrase "AI safety" and mean entirely different things by it. A government regulation that one researcher calls a landmark achievement, another calls a captured compromise that protects industry more than people.

You are now equipped to navigate that landscape. When you see a headline about AI safety — a new regulation, a researcher resigning, a company announcing a safety commitment, a government signing a declaration — you can ask the right questions: Who has structural independence here? Who has financial incentives to minimize the risk? What specific harm are they trying to prevent? Does their proposed solution actually have enforcement power?

Most people reading that same headline will take it at face value. You won't have to.

You Can Now See This

Every AI safety announcement — from a lab, a government, a nonprofit — is made by people with specific interests, specific theories about what is dangerous, and specific amounts of actual power to do anything about it. Knowing the difference between these groups is the core skill for reading the AI safety landscape as it actually is, not as the press releases describe it.

Module 7 · Quiz 4

Governments Enter the Room

5 questions · Apply the concepts of governance, enforcement, and regulatory design

1. The Bletchley Declaration of November 2023 was described by critics as "a photo opportunity." What was the most significant limitation that justified this criticism?

Correct. A declaration without enforcement is a statement of intention, not a constraint on behavior.

The key limitation was structural: without binding rules or enforcement, the declaration couldn't actually require any change in how AI is built or deployed.

2. The EU AI Act assigns AI systems to risk categories. Which of the following would most likely fall into the highest-risk category under this framework?

Yes. The highest-risk category is for AI that affects fundamental rights — criminal sentencing is exactly the kind of high-stakes, irreversible decision the Act targets.

The highest-risk category covers AI used in high-stakes decisions about people's rights and lives, such as sentencing, hiring, and benefits. Entertainment AI falls much lower.

3. Why is an executive order a weaker form of AI governance than a law passed by Congress?

Exactly right. The Biden AI executive order's provisions were largely reversed after the next administration took office — illustrating precisely this vulnerability.

The core weakness is durability: what one president orders, the next can undo. A law requires Congress to pass a new law to reverse it.

4. China's AI regulations are comprehensive and enforced, but critics say they serve a different purpose than European AI regulations. What is that difference?

Correct. "AI safety" means different things: protecting individuals' rights in the EU framing; protecting government stability in China's framing. Both are real regulation — with very different values embedded in them.

The key insight is that "AI safety regulation" can mean protecting people from AI harm, or protecting governments from political disruption. These are very different goals dressed in similar language.

5. A country passes a strict AI safety law that bans certain high-risk applications. A company moves those applications to a country with no such law and continues operating them. What problem does this illustrate?

Yes — regulatory arbitrage is one of the central challenges of international AI governance, and it's why many researchers argue that national regulations alone are insufficient.

This is regulatory arbitrage: national laws can shift where harmful AI operates without actually reducing global exposure to harm. It's a core argument for why international agreements matter.

Module 7 · Lab 4

Write the Rule

You're drafting one provision of an AI safety law. Your partner is a critic who will find every flaw.

Your Role

A fictional international body has asked for a draft provision — one specific rule — that should be included in a global AI safety agreement. You decide what the rule is, who it applies to, and what the consequences are for breaking it. It should address a real risk from what you've learned in this module.

Your partner is a policy critic who has seen a lot of AI regulations watered down or fail in practice. They will ask: who enforces this? What stops companies from gaming it? Does it actually prevent the harm it claims to?

State your draft rule. Be specific: what does it require, who does it apply to, and what happens if someone violates it?

Policy Critic

Global AI Governance

I've reviewed a lot of AI policy proposals, and most of them have the same problem: they sound rigorous but have no real teeth. I'm going to pressure-test whatever you write. Go ahead — state your draft provision. What's the rule, who does it bind, and what's the penalty?

Module 7 · Module Test

Who Is Working on AI Safety

15 questions · Score 80% or higher to pass · Reason, don't just recall

1. Eliezer Yudkowsky founded MIRI in 2004 without a PhD, drawing on years of online writing about AI risk. What does this tell us about how the AI safety field developed in its early years?

Yes — the field's early shape was defined by who took it seriously first, and that was people outside the mainstream, partly because the mainstream had dismissed the problem.

The key insight is that the initial agenda was set by outsiders precisely because insiders weren't engaged — which shaped what questions the field asked first.

2. Stuart Russell co-founded CHAI at Berkeley specifically to work on which technical problem?

Correct — the value alignment problem: bridging the gap between what we specify and what we actually want.

CHAI's founding problem is value alignment — the difficulty of encoding what humans actually care about, rather than just the metrics we can easily measure.

3. In November 2023, OpenAI's board fired Sam Altman; he was reinstated five days later. What does this episode most directly demonstrate about the relationship between safety governance and commercial pressure at AI labs?

Exactly — the mechanism worked; the structure was overwhelmed anyway. That's the lesson.

The episode shows that formal safety governance structures can be designed well and still be overridden by the combined pressure of employees, investors, and market forces.

4. Anthropic was structured as a "public benefit corporation" rather than a conventional for-profit corporation. What specific risk from OpenAI's history was this structure designed to prevent?

Yes. Legal structure was chosen specifically because culture proved insufficient — what can be eroded culturally requires legal protection.

The lesson from OpenAI was that safety as culture or org structure can be overridden. Anthropic tried to make safety a legal obligation that couldn't be simply voted away.

5. Jan Leike resigned from OpenAI's Superalignment team in 2024, saying that safety culture and processes had "taken a backseat to shiny products." Which phenomenon does this most accurately illustrate?

Correct. Safety normalization: the label stays, the constraint disappears.

This pattern is called safety normalization — over time, "safety" becomes part of the brand without functioning as an actual limit on what gets built or shipped.

6. Yann LeCun at Meta AI holds a position that directly conflicts with the approach taken by Anthropic and OpenAI's safety team. What is the substance of his disagreement?

Yes — LeCun's position is a genuine technical disagreement about what current AI systems actually are and where they're heading.

LeCun's core argument is about the trajectory of current systems: he believes the superintelligence risk framing is wrong about the technology, not just that it's being addressed poorly.

7. Geoffrey Hinton left Google in 2023, saying he wanted to speak freely about AI risks. Apply this to a new scenario: a senior researcher at a major AI lab privately believes their company's flagship product poses serious risks but stays quiet publicly. What structural factor most likely explains their silence?

Correct — this is exactly the structural problem Hinton identified and resolved by leaving. Employment creates a conflict of interest between candor and loyalty.

The structural explanation is incentive conflict: the company's commercial interests and the researcher's honest assessment pull in opposite directions, and employment ties the researcher to the company's side.

8. The AI Now Institute and the Center for AI Safety both work under the umbrella of "AI safety." What is the most fundamental difference in their focus?

Yes — the temporal divide: one is focused on harms that exist now, the other on harms that could exist later.

The fundamental difference is temporal and demographic: AI Now addresses current, documented harms to real people; the Center for AI Safety addresses potential future harms from systems not yet built.

9. Timnit Gebru founded the DAIR Institute after being forced out of Google's AI ethics team. What was the core motivation behind founding an entirely independent research institute rather than joining another company?

Correct — financial independence was the explicit goal. Who funds research shapes what questions get asked and what conclusions get published.

The founding logic of DAIR is independence from industry funding, enabling research that follows where the evidence leads rather than what funders want to hear.

10. The Bletchley Declaration of 2023 was notable for being signed by both the US and China. But critics said it changed nothing. Apply the concept of "enforcement mechanism" to explain why both things can be true simultaneously.

Yes — a statement of shared concern is diplomatically meaningful but behaviorally inert without enforcement. Both are true at the same time.

The key distinction is between expressing a shared concern (significant diplomatically) and creating an obligation that changes behavior (requires enforcement). A declaration can be both historically notable and operationally empty.

11. The EU AI Act bans real-time biometric surveillance in public spaces. An AI company moves its public surveillance product to a country without such a ban and continues selling it globally. What does this scenario illustrate, and what would be required to actually address it?

Correct. National regulation can protect a population within a jurisdiction; only international coordination can close the gaps companies exploit.

This is the regulatory arbitrage problem: national regulation protects the regulated jurisdiction but doesn't eliminate the harm globally if companies can simply operate from elsewhere.

12. China has comprehensive, enforced AI regulations. The EU also has comprehensive, enforced AI regulations. What is the key difference that makes it misleading to call both "AI safety regulation"?

Exactly. "Safety" in each framework means protecting different things and different people. The same word covers very different values.

The fundamental difference is what — and who — is being protected. "Safety" in EU law protects individuals from AI power; "safety" in China's law protects the government from political challenge. Same word, opposite implications.

13. A 12-year-old asks you: "If so many important people are working on AI safety, why are we still worried?" Based on everything in this module, what is the most accurate and complete answer?

Yes. People working on it, disagreeing about it, and facing structural limits on their power — that's the accurate picture.

The honest answer covers all three dimensions: disagreement about the problem, variation in actual power, and structural pressures that limit even well-intentioned safety work.

14. Yoshua Bengio and Timnit Gebru are both credible, independent AI safety voices. If they publicly disagree about AI policy, what is the best approach to figuring out whose position is more sound?

Correct. Evaluating any expert requires understanding their specific claim, their evidence, their proposed solution, and their incentive structure.

Evaluating expert disagreement requires going deeper than credentials: what exactly is each person worried about? What evidence do they cite? Do their solutions follow logically? Do they have interests that might color their view?

15. You read a news headline: "Major AI Lab Announces New Safety Commitment." Based on everything in this module, which question is MOST important to ask first?

Yes. Commitment with no binding power is a press release. The question of enforcement is the question that separates real safety governance from branding.

The first question is always about power and enforceability. A safety commitment that can be overridden when it becomes commercially inconvenient is not actually a constraint on behavior.