Keeping AI Under Control · Introduction

The Most Powerful Tool in History Has No Off Switch

AI is already making decisions about your life. This course teaches you to see how — and why that matters more than most adults realize.

In November 2022, a company called OpenAI released a chatbot called ChatGPT. Within five days, one million people were using it. Within two months, one hundred million. No technology in history had ever spread that fast — not television, not the internet, not the smartphone. Schools started banning it. Teachers panicked. Some students used it to write essays they didn't write. Others used it to understand things their teachers hadn't explained clearly enough. The same tool, the same week, doing completely opposite things.

Here's what almost nobody talked about in all that noise: this wasn't really about cheating or chatbots. It was about something much older. Every powerful technology — electricity in the 1880s, the printing press in the 1440s, nuclear energy in the 1940s — arrived before anyone had figured out how to control it. People celebrated the power before understanding the cost. And then, sometimes years later, sometimes decades later, they had to go back and fix what they'd broken. With AI, we're living in that first window — the one before the fixing. And unlike electricity, this technology can make decisions.

This course won't give you all the answers, because nobody has all the answers yet. What it will give you is a way of thinking: a set of questions to ask when someone tells you AI is going to solve everything, or destroy everything, or is totally fine, or is the end of the world. After four lessons, you'll know what most adults in the room don't — and that's not an exaggeration. Most of the people making decisions about AI right now learned everything they know from headlines. You're going to do better than that.

Keeping AI Under Control · Lesson 1 of 4

When the Algorithm Got It Wrong and Nobody Noticed

AI failures aren't always dramatic. Sometimes they're quiet, routine, and affect thousands of people before anyone asks a single question.

If a machine makes a mistake that a human would have caught, who is responsible?

In 2017, a teacher in Houston, Texas named Carolyn Blankenship was rated one of the worst teachers in her district. Her score came from a system called EVAAS — the Education Value-Added Assessment System — an algorithm designed to measure how much students improved because of a specific teacher. Blankenship had taught for over a decade. Her students liked her. Her principal liked her. But the score said she was failing, and under Houston's rules, a low score meant dismissal. She was fired.

When she and other teachers sued the school district, they asked a simple question: how does the algorithm actually work? The district's answer was remarkable. They said the formula was proprietary (owned by a private company) and could not be shared, even in court. A federal judge would eventually rule that this secrecy could violate the teachers' constitutional right to due process. But here is what stayed true: hundreds of teachers had already been evaluated, hired, fired, or denied raises based on a calculation that nobody in the school district, not the superintendent, not the school board, could fully explain. They had handed a decision that changed people's lives to a machine, and then forgotten to check whether the machine was right.

That's not a story from science fiction. That's a real city, real people, and a real algorithm — one still used in some form in school districts across the United States today.

What Is an Algorithm, Really?

An algorithm is just a set of instructions. A recipe is an algorithm. "If it's raining, take an umbrella" is an algorithm. The word sounds technical, but the idea is ancient — humans have been writing rules for decisions since before computers existed.

What changed in the past thirty years is scale. A human judge applies a rule to one case at a time, slowly, with the ability to notice when something feels off. An algorithm applies a rule to ten thousand cases in the time it takes you to read this sentence — with no ability to feel anything at all. That speed is powerful. It is also dangerous, because mistakes travel at the same speed as correct answers.
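To see how a mistake can travel at the same speed as a correct answer, here is a minimal Python sketch. Everything in it is invented for illustration: the loan rule, the income threshold, and the 10,000 applicants are assumptions, not a real system. The point is only that one flawed rule, applied automatically, repeats the same error in every case it touches.

```python
# Hypothetical rule: approve a loan if income is at least 30,000.
# Assumed bug for this example: incomes are stored in thousands
# (45 means 45,000), but the rule compares against 30000.
def approve_loan(income_in_thousands):
    return income_in_thousands >= 30000  # wrong unit: should be >= 30


# The rule as it was meant to be written.
def approve_loan_fixed(income_in_thousands):
    return income_in_thousands >= 30


# 10,000 made-up applicants with incomes from 20k to 119k.
applicants = [20 + (i % 100) for i in range(10_000)]

# The flawed rule processes every applicant without hesitation.
decisions = [approve_loan(income) for income in applicants]
rejected = decisions.count(False)

# Count the people who should have been approved but were not.
wrongly_rejected = sum(
    1
    for income in applicants
    if approve_loan_fixed(income) and not approve_loan(income)
)

print(f"Applicants processed: {len(applicants)}")      # 10000
print(f"Rejected by the flawed rule: {rejected}")      # 10000
print(f"Rejected by mistake: {wrongly_rejected}")      # 9000
```

A human clerk would probably notice something was off after rejecting the tenth applicant in a row. The loop never notices, because noticing is not one of its instructions.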

When we talk about AI today, we're usually talking about a special kind of algorithm called a machine learning model. Instead of a human writing every rule, the machine studies millions of examples and figures out its own rules. This is why these systems can do things that seem almost magical — recognizing faces, translating languages, writing passable college essays. But it also means that even the people who built the system often can't tell you exactly why it made a particular decision. The rules are buried inside millions of mathematical calculations nobody can read.

Algorithm — A step-by-step set of instructions for making a decision or solving a problem. In AI, these instructions are often learned from data rather than written by hand.

Machine learning — A type of AI where the computer finds its own patterns in data, rather than following rules a programmer wrote explicitly.

Proprietary — Privately owned. In the context of algorithms, it means the formula is kept secret because the company that built it considers it valuable intellectual property.

Why It's Everyone's Problem — Not Just Experts'

Here's the thing most people get wrong about AI safety: they think it's a technical problem, which means only technical people need to care about it. That's like saying only car engineers need to care about traffic laws. The engineers build the cars. Everyone else decides how they're used, where they're used, and what happens when they crash.

Algorithms are making consequential decisions right now — decisions that affect whether you get a loan, whether you get flagged as a shoplifter walking into a store, whether a social media feed shows you content that makes you feel worthless, or content that helps you understand yourself. These decisions are made by systems built by a fairly small number of engineers, often without the input of the people most affected by them.

In 2018, Reuters reported that Amazon had scrapped an AI recruiting tool the company had been developing for years. The tool was supposed to identify the best job candidates from thousands of résumés. The problem: it had learned from Amazon's historical hiring data, and Amazon had historically hired mostly men. So the algorithm taught itself that being a woman was a negative signal. It penalized résumés that included the word "women's" (as in "women's chess club") and downgraded graduates of all-women's colleges. Amazon shut it down before deploying it, but the tool had been in development for years before anyone noticed. It took a human to catch what the machine had quietly decided.
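How can a system "teach itself" a bias nobody typed in? The sketch below is a deliberately tiny Python illustration, not a reconstruction of Amazon's tool. The résumé records, the scoring method, and the function names are all assumptions made up for this example. The learner simply counts which words appeared on past résumés that were hired versus rejected.

```python
from collections import Counter

# Entirely made-up history: (words on the resume, was the person hired?).
# The past decisions are skewed against resumes that mention "women's".
history = [
    (["chess", "club", "captain"], True),
    (["software", "engineer"], True),
    (["women's", "chess", "club", "captain"], False),
    (["women's", "coding", "society"], False),
    (["software", "engineer", "women's", "college"], False),
    (["coding", "society"], True),
]


def learn_word_scores(records):
    """Score each word: +1 for every hired resume it appears on, -1 for every rejected one."""
    scores = Counter()
    for words, hired in records:
        for word in set(words):
            scores[word] += 1 if hired else -1
    return scores


def score_resume(words, scores):
    """Add up the learned scores of the words on a new resume."""
    return sum(scores[word] for word in set(words))


scores = learn_word_scores(history)

# Two new resumes that differ by a single word.
resume_a = ["chess", "club", "captain"]
resume_b = ["women's", "chess", "club", "captain"]

print("Score learned for the word women's:", scores["women's"])  # -3
print("Resume A score:", score_resume(resume_a, scores))         # 0
print("Resume B score:", score_resume(resume_b, scores))         # -3
```

Nobody wrote a rule saying "penalize women." The penalty came out of the counting, because the history it counted was already lopsided.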

Pause and Consider

Amazon's recruiting AI wasn't programmed to discriminate. It learned to discriminate from real data about how real humans had actually behaved. If the humans were biased, and the AI learned from the humans — who is responsible for the bias the AI produced?

This is the kind of question AI safety asks. Not "how do we build faster computers?" but "how do we make sure the decisions these systems make are fair, explainable, and correctable?" Those questions don't require a computer science degree. They require the ability to think clearly about power, fairness, and accountability — things you can absolutely do right now.

The Three Gaps That Create Risk

AI safety researchers — the people whose entire job is thinking about what goes wrong — have identified a pattern that shows up again and again in cases like EVAAS and Amazon's recruiting tool. They call it different things, but here's a simple way to see it: there are three gaps that, when they open up, create real danger.

The Transparency Gap. When the people affected by a decision can't understand how it was made, they can't challenge it. Carolyn Blankenship couldn't appeal her firing because the algorithm's formula was hidden. Transparency means being able to look at the reasoning — not just the result.

The Accountability Gap. When something goes wrong with an AI system, it's often unclear who is responsible. Is it the engineer who wrote the code? The manager who deployed it? The company that sold it? The city that used it? Everyone points to someone else, and the person who was harmed is left with nobody to hold accountable.

The Feedback Gap. Human systems learn from their mistakes partly because the people making decisions feel the consequences of getting it wrong. A judge who makes a bad ruling might face public criticism. A teacher who fails a class that clearly didn't understand the material gets parent complaints and tries a new approach next year. An algorithm doesn't feel anything. It will make the same mistake in case 10,000 that it made in case 1, unless a human stops it.

You Now See What Most People Miss

The next time you hear about an AI system making a decision that affected real people — in hiring, policing, healthcare, education — you can now ask three specific questions: Can the people affected understand why the decision was made? Is there someone actually responsible if it goes wrong? And is there any mechanism to catch and correct mistakes? Most news coverage never gets this far. You just did.

The Ethical Question Nobody Can Fully Answer

Here is the genuine hard question that sits underneath this entire lesson — and underneath a lot of AI safety research. Consider this: human decision-makers are also biased, inconsistent, and sometimes corrupt. A human judge might give harsher sentences to defendants who remind them of someone they dislike. A human hiring manager might unconsciously prefer candidates who went to the same school they did. A human teacher might grade essays more harshly on a Monday morning when they're tired.

Studies have repeatedly found that AI systems — even flawed ones — are sometimes more consistent than humans making the same decisions. If you replace a biased human judge with a flawed-but-less-biased algorithm, have you made things better or worse?

There is no clean answer. A system that's statistically more fair can still be completely unaccountable — and unaccountability is its own kind of harm. A human who is inconsistent can also be persuaded, appealed to, moved by a story, or corrected in the moment. An algorithm cannot. We are trading one set of problems for a different set of problems, and we haven't fully agreed on which problems are worse.

That tension doesn't go away. It'll come up in every lesson of this course. Sit with it. The people who are most certain they've resolved it are usually the people who've thought about it the least.

Lesson 1 Quiz

5 questions · Tests reasoning, not just recall

1. Houston's EVAAS system was problematic primarily because it was:

Right. The core problem wasn't that it got things wrong (though it may have) — it was that no one could check whether it was right or wrong. Hidden reasoning = no accountability.

Not quite. The central issue was that the algorithm's formula was proprietary — kept secret — so teachers couldn't challenge decisions that affected their careers. Hidden reasoning is the problem.

2. Amazon's recruiting AI learned to penalize women's résumés. This happened because:

Exactly. The AI wasn't told to discriminate — it learned what "a good hire at Amazon" looked like from history, and history was biased. The machine reflected the humans who came before it.

This is a key concept to revisit. The AI wasn't intentionally programmed to be biased — it learned bias from real historical data about past hiring decisions, which had been made mostly by and for men.

3. A city uses an AI to predict which neighborhoods need more police patrols. The AI was trained on arrest data from the past ten years. A civil rights lawyer argues this is dangerous. Which of the following best explains the lawyer's concern?

Correct reasoning. This is called a feedback loop: more patrols → more arrests → more data → AI predicts more patrols needed in the same place. Past inequality gets locked in and amplified.

Think about what the training data actually represents. If police historically patrolled some neighborhoods more than others — for non-crime reasons — that pattern gets baked into the AI's predictions as if it were objective truth.

4. The "Feedback Gap" in AI systems means:

Yes. Human decision-makers face social, emotional, and professional consequences when they get things wrong. Algorithms feel nothing. They'll repeat error 10,000 exactly like error 1 unless a human intervenes.

The Feedback Gap is about correction, not data or timing. An algorithm has no mechanism to learn from the harm it causes in real deployment — it doesn't experience the consequences of being wrong.

5. Research shows some AI systems are statistically more consistent than human decision-makers. This means AI is therefore a safer choice for high-stakes decisions, such as criminal sentencing. Do you agree?

Exactly right. Consistency is one value among many. A consistently wrong, consistently unaccountable, consistently unexplainable system can cause more harm than an inconsistent human who can at least be challenged, appealed to, and corrected in the moment.

This is a common oversimplification. Consistency is valuable, but it doesn't automatically mean fairness or safety. An algorithm can consistently apply a flawed rule across thousands of cases — with no human able to appeal the decision or explain the reasoning.

Lab 1: The Accountability Audit

You are an investigative journalist. Your source has leaked details of a school district's AI grading system.

Your Assignment

A school district in your city just announced it will use an AI system to automatically grade student essays. The company that built it says it's faster, more consistent, and removes teacher bias. The school board approved it unanimously without a public vote. You have three questions to answer before your article publishes.

Start by telling your AI source what concerns you most about this scenario, then ask your first hard question. Your source won't just agree with you. Expect pushback.

Source: Alex Rivera

Investigative Partner

You got the leak. I've seen the contract — the district signed a five-year deal with EdScore Analytics. The algorithm grades essays on a 0–100 scale and the scores count for 40% of a student's final grade. The company won't release the rubric. What's your first question for your story?

Keeping AI Under Control · Lesson 2 of 4

The Machine That Couldn't Be Corrected

Some AI failures aren't about the algorithm being wrong — they're about humans being too confident to check it.

When should a human be required to make a final decision, even if an AI is more accurate?

On April 10, 2010, a Polish Air Force Tu-154 aircraft approached Smolensk North Airport in Russia carrying Polish President Lech Kaczyński and 95 other passengers and crew. The weather was terrible — dense fog, near-zero visibility. Air traffic controllers on the ground repeatedly told the crew to divert to another airport. The crew acknowledged the warnings. They also had onboard navigation systems giving them data about altitude and descent rate. And yet the plane continued its approach, descending too quickly in conditions it couldn't navigate. It clipped treetops and crashed. All 96 people aboard died.

Investigators later found something that has haunted aviation safety researchers ever since: the crew had become so accustomed to trusting their automated systems that when those systems gave ambiguous readings — and when the air traffic controllers gave clear human warnings — they weighted the machine's data more than the human's voice. This isn't exactly an AI story; the navigation systems weren't AI. But it illustrates a pattern that AI researchers call automation bias: the tendency for humans to over-rely on automated systems, even when better information is available from non-automated sources.

In the years since, automation bias has been documented not just in aviation but in medicine, law enforcement, financial trading, and content moderation. The pattern is consistent: once a machine is in the loop, humans tend to defer to it even when they shouldn't. This isn't stupidity. It's a well-documented feature of human psychology — and it's one of the central challenges of deploying AI in high-stakes environments.

What Automation Bias Actually Looks Like

In 2015, a study published in the journal Computers in Human Behavior gave participants a task: sort images into categories, with the help of an AI assistant that would recommend a category for each image. Sometimes the AI was right. Sometimes it was obviously wrong. The researchers found that when the AI was wrong, participants still followed its recommendation about 74% of the time — even when the correct answer was easy to see with their own eyes.

This is what automation bias looks like in practice: not someone blindly following a machine off a cliff, but someone who has a perfectly good instinct, looks at the machine's answer, and quietly decides the machine probably knows better. It happens to smart people. It happens to trained experts. It happens more often when people are tired, when they're under time pressure, or when they trust the general reputation of the system — even if this specific output is wrong.

For AI safety, this creates a specific problem: we build AI systems to assist human decision-making, but the presence of the AI can actually reduce the quality of human decision-making. The human becomes a rubber stamp on the machine's output rather than an independent check on it.

Automation bias — The tendency to over-rely on automated systems, accepting their output without sufficient critical evaluation, even when evidence suggests the automated output may be wrong.

Human-in-the-loop — A design principle where a human must review and approve an AI's recommendation before it takes effect. The goal is to preserve human judgment — but automation bias can undermine this if the human defers anyway.

Healthcare: The Stakes Get Real

In January 2020, a research team at Google Health published a paper in Nature describing an AI system that could detect breast cancer in mammograms more accurately than radiologists, reducing false negatives by 9.4% in its U.S. test data. The result was celebrated widely. Headlines declared that AI would "beat doctors" at reading scans. The research was real and significant.

What received far less coverage: a follow-up analysis found that when radiologists worked alongside the AI — the "human-in-the-loop" condition that sounds like the responsible approach — some of their diagnostic accuracy actually decreased. The presence of the AI's recommendation changed how they looked at the image. Instead of starting with a fresh examination, they were now (consciously or not) confirming or questioning what the machine had already told them. Their thinking was anchored to the AI's output.

This is one of the most consequential unsolved problems in AI deployment. The intuitive solution to AI error — "just put a human in the loop" — doesn't always work if the human-AI combination produces worse outcomes than either would alone. And yet removing the human from the loop entirely creates the accountability problem from Lesson 1. Both options have serious costs.

The Ethical Tension

If an AI system is provably more accurate at detecting cancer than a radiologist working alone, but combining AI with a radiologist produces worse outcomes than the AI alone — is it ethical to require human approval of every AI diagnosis? You're not required to have an answer. You are required to take the question seriously.

Who Controls the Off Switch?

On May 6, 2010, the U.S. stock market experienced what became known as the Flash Crash. In about 36 minutes, major U.S. stock indices dropped nearly 10%, erasing almost a trillion dollars in market value, before recovering almost as fast. The cause: automated trading algorithms, each reacting to the others' behavior, created a self-amplifying cascade that no single human had triggered and no single human could stop. The market eventually corrected itself, but for those 36 minutes, it operated in a state that no human had designed and no human was in control of.

Since then, regulators have added new and tighter "circuit breakers": rules that automatically pause trading if prices move too fast. This is a human-designed constraint on algorithmic behavior, a hard limit that says, "we don't care what the algorithm wants to do, it stops at this threshold." It's imperfect. It doesn't catch every problem. But it's an example of a principle that AI safety researchers now argue is essential: corrigibility.
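A circuit breaker is simple enough to sketch in a few lines of Python. This is an illustration of the idea only: the 7% limit, the price list, and the function name are assumptions chosen for the example, not the actual rules any exchange uses.

```python
# Hypothetical limit: pause trading once the price falls more than 7%
# below the opening price. The number is illustrative only.
MAX_DROP = 0.07


def run_market(opening_price, prices):
    """Accept prices in order; halt at the first one that breaches the limit."""
    accepted = []
    for price in prices:
        drop = (opening_price - price) / opening_price
        if drop > MAX_DROP:
            # The hard limit wins, whatever the trading algorithms
            # would have done next.
            return accepted, True
        accepted.append(price)
    return accepted, False


# A made-up price slide that starts slowly and then accelerates.
prices = [100.0, 99.0, 97.5, 95.0, 92.0, 85.0, 70.0]
accepted, halted = run_market(100.0, prices)

print("Prices accepted before the halt:", accepted)  # [100.0, 99.0, 97.5, 95.0]
print("Trading halted:", halted)                     # True
```

Notice what the check does not do: it doesn't try to understand why prices are falling or which algorithm started it. It only enforces a boundary that humans chose in advance, which is why the later prices in the list are never processed.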

Corrigibility — The property of an AI system being correctable, interruptible, and shut-downable by humans. A corrigible system doesn't resist being turned off or modified. This sounds obvious — but it becomes complex as AI systems become more capable.

The question of who holds the off switch — and whether they can actually use it — becomes more urgent as AI systems are deployed in faster, more interconnected environments. A human can decide not to push the button. The question AI safety researchers are now asking is: can the AI system decide not to let you push it?

What You Now Understand That Changes How You Read the News

Every time you see a headline about AI being "more accurate than doctors" or "better than human experts," you now know to ask a second question: What happens to human judgment when the AI is in the room? The technology's accuracy in a controlled test and its effect on real-world decision-making can be completely different things.

Lesson 2 Quiz

5 questions · Apply the concepts to new situations

1. Automation bias means that people with AI assistance tend to:

Correct. Automation bias is the systematic tendency to over-trust automated outputs — not because people are careless, but because human psychology naturally defers to systems that generally work well.

Automation bias is specifically about over-trusting AI output, not ignoring it. People tend to follow the AI's recommendation even when their own judgment or available evidence points elsewhere.

2. The 2010 Flash Crash illustrates which core AI safety concern?

Yes. The Flash Crash wasn't caused by one bad algorithm — it was caused by many algorithms reacting to each other in a loop that humans couldn't interrupt quickly enough. This is an emergent behavior problem.

The Flash Crash is specifically about loss of control in interconnected systems — algorithms reacting to each other faster than humans could intervene, producing an outcome nobody intended and nobody could stop for 36 minutes.

3. A hospital uses an AI to recommend medication dosages. Doctors are required to review and approve every recommendation before it's administered. Based on Lesson 2, what is the most significant risk with this design?

Exactly. This is automation bias in a medical context. "Human-in-the-loop" only works as a safeguard if the human is truly exercising independent judgment — which the presence of an AI recommendation can actually undermine.

The lesson's key insight applies here: having a human in the loop doesn't guarantee independent judgment. Automation bias means the doctor may defer to the AI's suggestion, making the review process less effective than it appears.

4. "Corrigibility" in an AI system refers to:

Correct. Corrigibility is about human control: can we change this system's behavior, interrupt it, or shut it down when needed? As AI becomes more capable, ensuring systems remain corrigible becomes a central technical and policy challenge.

Corrigibility is about human ability to control and correct AI systems — not the AI correcting itself. The key question is whether humans can override, modify, or stop the system when its behavior needs to change.

5. Studies show that AI-assisted radiologists sometimes perform worse than AI alone when detecting cancer. The best policy response to this finding would be:

Yes. This is the nuanced response: neither reject AI nor remove humans, but study the interaction carefully and redesign it. AI safety is often about the system design — how humans and machines work together — not just about the AI in isolation.

The finding doesn't suggest removing AI or removing humans — it suggests the specific way humans and AI currently interact needs redesigning. How a human receives and processes AI information affects the quality of the combined outcome.

Lab 2: The Design Audit

You're a policy advisor. A hospital wants your recommendation on how to deploy their new AI diagnostic tool.

Your Role

Valley General Hospital is about to deploy an AI system that reads X-rays to flag potential fractures. It's 94% accurate — better than any radiologist on staff. The hospital's CTO wants to use it to pre-screen all X-rays before radiologists review them. You've been asked for a recommendation. Your AI advisor has seen the implementation plan.

Tell your advisor what safeguards you think the hospital needs — then defend your reasoning. Your advisor will push back and raise complications you may not have considered.

Advisor: Dr. Priya Nair

Systems Design Critic

I've looked at the implementation plan. The AI flags suspicious X-rays and puts them at the top of the radiologist's queue. X-rays it clears go straight to the report as "no significant findings" — radiologist never sees them. That's 60% of the daily volume. What's your first concern?

Keeping AI Under Control · Lesson 3 of 4

Whose Values Did We Teach It?

Every AI system encodes choices. The question is whether those choices were made deliberately — or left to chance.

If an AI is trained by a small group of people in one country, and then deployed to the entire world, whose values does it actually represent?

In 2019, Frances Haugen joined Facebook as a product manager, and over the next two years she collected internal research documents that the company had kept private. She wouldn't go public until 2021, but what she found dated back to at least 2016: Facebook's own internal research showed that its algorithm (the system that decided which posts, videos, and articles appeared in users' feeds) was specifically optimizing for engagement. Engagement meant likes, comments, shares, and time spent on the platform. And Facebook's researchers had found that content causing anger, outrage, and fear generated significantly more engagement than content causing happiness or calm. The algorithm hadn't been told to make people angry. It had simply learned, through millions of data points, that anger kept people scrolling.

What Haugen's documents revealed — and what made this a landmark moment in AI history — was that Facebook's engineers had identified this problem as early as 2018, and proposed a fix. A team built a system that would reduce "problematic content" in feeds. Senior leadership rejected it. The reason given in internal documents: the fix would reduce engagement metrics, which would reduce advertising revenue. A deliberate choice was made to keep an algorithm that the company's own researchers believed was causing social harm — because changing it would cost money.

This is a different kind of AI problem than the ones in Lessons 1 and 2. The algorithm wasn't secretly broken. It was doing exactly what it was designed to do. The values encoded in it — maximize engagement above all else — were a choice. And that choice had consequences felt by hundreds of millions of people who never agreed to it.

Values Are Embedded Whether You Intend Them or Not

Every AI system makes tradeoffs. When you decide what to optimize for — what to measure, what counts as "good," what counts as "bad" — you are making a values choice. Sometimes this is obvious: a hiring algorithm that prioritizes candidates with Ivy League degrees encodes a value that Ivy League education predicts job performance. Sometimes it's hidden: a search algorithm that shows you results matching your past behavior encodes a value that past preferences should determine future information — a choice that limits what you're exposed to.

The technical term for what you measure and optimize is the objective function. Facebook's objective function was engagement. Amazon's early recruiting tool's objective function was similarity to past successful hires. EVAAS's objective function was student score improvement attributable to a single teacher. In each case, the objective function captured something real — but it missed things that turned out to matter enormously.

Objective function — The mathematical goal an AI system is trying to maximize or minimize. It defines what "success" means for the algorithm. Getting this wrong — or right for the wrong reasons — is one of the central challenges of AI design.

Value alignment — The problem of ensuring an AI system's objectives actually match what humans genuinely want — not just what was easy to measure. Often, what's easy to measure and what we actually care about are different things.

AI researchers call this the alignment problem. It applies to everything from social media feeds to self-driving cars to AI assistants: the difference between what we tell an AI to optimize and what we actually want is often a gap where serious harm can enter.
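Here is a small Python sketch of how much the choice of objective function matters. The posts, the numbers, and the "reported_wellbeing" measurement are all hypothetical, invented for this example; no real platform's ranking system is being described. The same three posts are ranked twice, and the only thing that changes is what counts as "success."

```python
# Made-up posts with made-up numbers. Both measurements are assumptions
# for illustration, not real platform metrics.
posts = [
    {"title": "Outrage thread", "predicted_engagement": 0.9, "reported_wellbeing": 0.2},
    {"title": "Friend's good news", "predicted_engagement": 0.5, "reported_wellbeing": 0.9},
    {"title": "How-to video", "predicted_engagement": 0.6, "reported_wellbeing": 0.7},
]


def engagement_objective(post):
    """Success means engagement and nothing else."""
    return post["predicted_engagement"]


def blended_objective(post, wellbeing_weight=0.5):
    """Success is a weighted mix of engagement and how people say they feel."""
    return (
        (1 - wellbeing_weight) * post["predicted_engagement"]
        + wellbeing_weight * post["reported_wellbeing"]
    )


def rank(posts, objective):
    """Order the feed from highest objective score to lowest."""
    return [p["title"] for p in sorted(posts, key=objective, reverse=True)]


by_engagement = rank(posts, engagement_objective)
by_blend = rank(posts, blended_objective)

print("Feed when optimizing engagement:", by_engagement)
# ['Outrage thread', 'How-to video', "Friend's good news"]
print("Feed when optimizing the blend: ", by_blend)
# ["Friend's good news", 'How-to video', 'Outrage thread']
```

The ranking code is identical in both runs. The values live in the objective, including the choice of the 0.5 weight, which someone has to pick and defend.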

The Geography of AI Development

In 2023, a study by the AI Now Institute found that the five largest AI research laboratories in the world (OpenAI, Google DeepMind, Anthropic, Meta AI, and Microsoft Research) were all based in the United States or owned by U.S. companies, with most clustered in the San Francisco Bay Area. The people who build the most widely deployed AI systems in the world represent a narrow demographic slice: predominantly American, predominantly male, predominantly from affluent technical backgrounds.

This matters because when you build something, your assumptions about the world — what's normal, what's harmful, what's funny, what's offensive, what a "good outcome" looks like — get embedded in it. In 2015, Google Photos' image recognition system labeled photos of Black users with the word "gorillas." This wasn't malicious. It was the product of training data that underrepresented people of color, built by a team that didn't catch the problem before deployment. Google's fix was to remove the category "gorilla" from the app's labels entirely — a solution that held as recently as 2023.

A 2019 paper published in Science found that a healthcare algorithm used by major U.S. hospital systems was systematically giving lower priority scores to Black patients than to equally sick white patients. The algorithm hadn't used race as a variable. It had used historical healthcare spending as a proxy for medical need — and because Black patients had historically received less healthcare spending (due to systemic inequities in the healthcare system), the algorithm interpreted this as lower medical need. A proxy for need that reflected past discrimination was being used to allocate future care.
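The proxy problem can be shown with three invented patients in Python. This is not the algorithm from the study; the patient records, the field names, and the two scoring rules are assumptions for illustration. Patients A and B are equally sick, but B has received less care in the past, so less money has been spent on B.

```python
# Invented patients. "chronic_conditions" stands in for real medical need;
# "past_spending" is the proxy that the first scoring rule looks at.
patients = [
    {"name": "Patient A", "chronic_conditions": 4, "past_spending": 12_000},
    {"name": "Patient B", "chronic_conditions": 4, "past_spending": 7_000},
    {"name": "Patient C", "chronic_conditions": 1, "past_spending": 8_000},
]


def priority_by_spending(patient):
    """Proxy rule: more past spending is read as more need."""
    return patient["past_spending"]


def priority_by_conditions(patient):
    """Direct rule: more chronic conditions means more need."""
    return patient["chronic_conditions"]


def top_priority(patients, score, slots=2):
    """Return the names of the patients who get the limited extra-care slots."""
    ranked = sorted(patients, key=score, reverse=True)
    return [p["name"] for p in ranked[:slots]]


print("Using spending as a proxy:", top_priority(patients, priority_by_spending))
# ['Patient A', 'Patient C']
print("Using health directly:    ", top_priority(patients, priority_by_conditions))
# ['Patient A', 'Patient B']
```

Race appears nowhere in this code, just as it did not appear in the real algorithm's inputs. If one group has historically had less spent on its care, the spending rule passes that gap straight through into who gets help next.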

The Hard Question

If an AI system is built by people who genuinely did not intend to cause harm, but the system causes harm anyway because of whose data it was trained on and whose experiences were missing from its development — who is responsible for fixing it? And more importantly: who gets to decide what "fixed" looks like?

What Oversight Actually Looks Like

In 2021, the European Union proposed the Artificial Intelligence Act — the first major attempt by a government to regulate AI systems based on the level of risk they pose. The Act categorizes AI applications into risk tiers. Systems used in critical infrastructure, hiring, healthcare, or law enforcement are classified as "high risk" and must meet strict transparency, testing, and human-oversight requirements before deployment. Systems that pose unacceptable risk — like real-time public biometric surveillance or AI that manipulates people using psychological techniques — are prohibited entirely.

The AI Act became law in 2024, with its requirements phasing in over the following years. It's widely regarded as the most comprehensive legal framework for AI so far, and it serves as a template other governments are now studying. It's imperfect: experts debate whether the risk categories are drawn correctly, whether enforcement is feasible, and whether it will slow beneficial AI development. But it represents a concrete answer to a concrete question: who makes the rules about whose values get encoded in AI systems that affect everyone?

The answer the EU chose: elected governments, not private companies. That's a values choice too — and not everyone agrees with it. Some argue that government regulation stifles innovation. Others argue that leaving the rules to companies is like asking banks to write their own banking regulations. Both arguments have merit. The tension between them is the central political debate about AI happening in legislatures and boardrooms right now.

Knowing This Changes What You See

You can now read debates about AI regulation — which will dominate policy news for the rest of your life — and understand what's actually at stake. Not just "is AI dangerous?" but: Who defines what harm means? Who verifies that AI systems meet those standards? And what mechanisms exist to correct course when they don't? These are questions of democratic governance, not just technology.

Lesson 3 Quiz

5 questions · Apply value-alignment concepts to real scenarios

1. Facebook's algorithm was designed to maximize engagement. According to its own internal research, this primarily resulted in:

Correct. The algorithm wasn't programmed to promote anger — it learned that anger-generating content produced more likes, comments, and shares. The objective function (maximize engagement) produced a side effect the company's own researchers identified as harmful.

Facebook's internal research found that outrage and anger generated more engagement (comments, shares, time on platform) than calm or positive content. The algorithm rewarded what worked, and anger worked — regardless of its social effects.

2. A healthcare algorithm that used "past spending" as a proxy for medical need ended up giving lower scores to Black patients. This happened because:

Yes. This is a key concept: using a proxy (spending) for a real thing (need) can import historical injustice into an algorithm, even without any discriminatory intent. The algorithm learned from a world that was already unequal.

No intentional discrimination was involved. The issue was using historical healthcare spending as a proxy for medical need — but because Black patients had historically received less spending due to systemic inequities, the algorithm treated that history as objective data about need.

3. An AI assistant for writing is trained primarily on English-language text from the United States. A school in Brazil begins using it for all student assignments. What value-alignment concern does this most directly raise?

Correct application of the concept. What counts as "good writing," relevant examples, appropriate tone, and even which topics feel natural to discuss are culturally embedded. An AI trained in one cultural context carries those assumptions into every other context it's deployed in.

This is a values-geography problem. The AI was built by people with specific cultural assumptions, and those assumptions are embedded in what it considers "good," "normal," or "appropriate." Language is only part of this — the deeper issue is whose conception of good writing gets treated as the standard.

4. The EU's AI Act classifies AI applications into risk tiers. Which of the following is the best justification for this approach?

Exactly. A music recommendation algorithm making a bad suggestion has very different consequences than a bail-risk algorithm making a bad prediction. Proportional regulation tries to match the level of scrutiny to the level of potential harm.

Risk-tiered regulation is based on proportionality: higher stakes require higher scrutiny. An AI that suggests movies doesn't need the same oversight as an AI that determines medical treatment or criminal sentencing. The costs of errors are vastly different.

5. Someone argues: "AI companies should write their own safety standards — they understand the technology better than any government could." Someone else argues: "Companies should not regulate themselves — that's like banks writing their own banking laws." Which position is more consistent with the concept of value alignment as taught in this lesson?

Correct. Value alignment is precisely about who decides what AI systems should optimize for. If those decisions are made only by companies with financial interests in particular outcomes, the values encoded in AI systems may not reflect the interests of the people those systems affect.

Value alignment teaches us that AI design encodes value choices — and those choices affect everyone. Leaving value decisions entirely to profit-motivated companies creates the same conflict of interest as any self-regulatory system. Technical expertise is valuable input, but not the only input that should matter.

Lab 3: The Values Interrogation

You're on the ethics board reviewing a new AI content moderation system before it launches globally.

Your Role

A major social media platform is about to deploy an AI content moderation system in 50 countries. It flags and removes posts that violate community standards. The training data came primarily from English-language content moderated by teams in California. The platform wants your board's sign-off before global rollout. Your AI colleague has reviewed the technical documentation.

Tell your colleague what questions you'd require answered before signing off — and make a case for the most important one. Expect pushback: your colleague thinks the platform needs to move fast and these concerns can be addressed post-launch.

Colleague: Mia Osei

Ethics Board Member

I've read the technical docs. The system has 91% accuracy on English content in California testing. They have no accuracy data for Arabic, Yoruba, or Tagalog. The platform's VP says they'll collect that data after launch and iterate. The board has 48 hours to respond. What's your position?

Keeping AI Under Control · Lesson 4 of 4

What Keeping Control Actually Requires

Safety isn't a setting you switch on. It's a set of practices, institutions, and choices made — or not made — every day.

If the people building the most powerful AI systems in the world believe those systems could be dangerous, why are they still building them?

In March 2023, an open letter was published and signed by over 1,000 AI researchers, engineers, and technology executives, including Elon Musk, Steve Wozniak, and Stuart Russell, one of the world's leading AI scientists. The letter called for a six-month pause in the training of AI systems more powerful than GPT-4, the system OpenAI had just released. The letter said, plainly: "AI systems with human-competitive intelligence can pose profound risks to society and humanity." It went on to describe AI labs as "locked in an out-of-control race to develop and deploy ever more powerful digital minds that no one, not even their creators, can understand, predict, or reliably control."

The pause never happened. Not for six months, not for six weeks. Development continued. The same companies whose employees signed the letter kept building. New systems were released. OpenAI, Anthropic, Google, and Meta all shipped increasingly capable models throughout 2023 and 2024. Some of the people who signed the letter went on to build or fund the very systems they'd warned about. Sam Altman, the CEO of OpenAI whose company had just released GPT-4, did not sign the letter — but had said publicly, in multiple interviews, that he believed he was building one of the most potentially dangerous technologies in human history and was doing it anyway.

This is not hypocrisy, exactly. Or at least, it's not just hypocrisy. It's a real strategic argument: if powerful AI is coming regardless, it's better to have safety-focused labs at the frontier than to cede that ground to developers less focused on safety. You can agree or disagree with that argument — but you need to know it exists, because it shapes almost every major decision in AI development right now.

The Race Dynamic and Why It's Hard to Stop

There's a concept in game theory called a prisoner's dilemma — and it describes the AI development situation almost perfectly. Imagine two AI companies. Both would be better off if they both slowed down and focused on safety. But if only one slows down while the other keeps building, the one that slowed down loses market share, loses talent, loses investment, and eventually loses relevance. The rational move for each company, individually, is to keep building — even if both companies collectively would prefer a world where everyone slowed down.
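The two-company story above can be written out as a tiny payoff table. The numbers are invented for illustration (higher means a better result for that company); only their order matters. The names `PAYOFFS` and `best_response_for_a` are hypothetical.

```python
# Hypothetical payoffs as (company_a_score, company_b_score); higher is better.
PAYOFFS = {
    ("slow", "slow"): (3, 3),  # both focus on safety: good for both
    ("slow", "race"): (0, 4),  # A slows down alone and falls behind
    ("race", "slow"): (4, 0),  # A races ahead while B slows down
    ("race", "race"): (1, 1),  # both race: worse for both than both slowing
}
CHOICES = ("slow", "race")


def best_response_for_a(b_choice):
    """Return the choice that gives company A the highest score, given B's choice."""
    return max(CHOICES, key=lambda a_choice: PAYOFFS[(a_choice, b_choice)][0])


for b_choice in CHOICES:
    print(f"If B chooses {b_choice}, A does best to {best_response_for_a(b_choice)}")

both_slow = PAYOFFS[("slow", "slow")]
both_race = PAYOFFS[("race", "race")]
print("Both slow:", both_slow, "| Both race:", both_race)
```

Whatever B does, A scores higher by racing, and the table is symmetric, so B reasons the same way. They end up at (1, 1) even though (3, 3) was available to them.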

This is why voluntary commitments to safety are considered insufficient by many policy researchers. A company that genuinely wants to be responsible still faces enormous competitive pressure to keep pace. The solution most frequently proposed: external rules that apply to everyone, so that slowing down for safety doesn't mean falling behind. Which brings us back to the question of governance.

Race dynamics — The competitive pressure in AI development where each actor believes they must move fast to avoid being overtaken, even when all actors would collectively prefer a slower, more careful pace.

Governance — The systems of rules, norms, institutions, and enforcement mechanisms that shape how powerful technologies are developed and deployed. For AI, governance is still being built — in real time, by real people making decisions right now.

What Safety Practices Actually Look Like

AI safety isn't abstract. There are specific, concrete practices that safety-focused organizations use, and specific, concrete ways those practices get cut when they're expensive or inconvenient. Here are three that are actively debated across major AI labs:

Red-teaming. Before deploying an AI system, safety teams deliberately try to break it: asking it harmful questions, looking for ways to make it produce dangerous outputs, probing for failure modes. In 2022, Anthropic published research describing large-scale red-teaming of its own language models. The practice is now standard at major labs, but it's expensive and time-consuming, and the pressure to ship faster constantly competes with the time needed to red-team thoroughly.

Staged rollouts. Rather than releasing a new AI system to the entire world at once, staged rollouts deploy first to a small group, watch for problems, and scale up only when the system behaves as expected. This is how responsible software engineers approach high-stakes deployments. It is not how every AI system has been deployed — the pressure for large launch-day numbers often wins.

Independent auditing. Third-party organizations — not the company that built the AI — review the system for bias, safety failures, and unexpected behaviors before or after deployment. This is analogous to financial auditing: companies don't grade their own earnings reports. AI auditing is still nascent; there's no standard framework, no universal requirement, and no agreed-upon credentials for who is qualified to do it. This is one of the most active areas of AI policy work happening right now.
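Of the three practices above, the staged rollout is the easiest to picture as a procedure. Here is a minimal sketch, assuming a made-up list of stages, a made-up 2% error limit, and a hypothetical `error_rate_at` function standing in for whatever monitoring a real team would use.

```python
def staged_rollout(stages, error_rate_at, max_error_rate=0.02):
    """Expand exposure one stage at a time; stop at the first unsafe stage.

    stages: fractions of users to expose, in increasing order.
    error_rate_at: hypothetical monitoring hook that returns the error rate
        observed when the system is exposed to that fraction of users.
    Returns the largest fraction that was reached without exceeding the limit.
    """
    reached = 0.0
    for fraction in stages:
        observed = error_rate_at(fraction)
        if observed > max_error_rate:
            break  # problem found: do not expand further
        reached = fraction
    return reached


STAGES = [0.01, 0.10, 0.50, 1.00]  # 1% of users, then 10%, 50%, everyone


def healthy(fraction):
    """A system whose error rate stays low at every scale."""
    return 0.01


def breaks_at_scale(fraction):
    """A system that looks fine in small tests but fails with more users."""
    return 0.01 if fraction <= 0.10 else 0.05


print(staged_rollout(STAGES, healthy))          # 1.0
print(staged_rollout(STAGES, breaks_at_scale))  # 0.1
```

The second system would have reached everyone on launch day under an all-at-once release. With stages, its problem is caught after reaching 10% of users.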

Where the Decisions Are Being Made

AI governance decisions are being made right now — in the EU Parliament, in the U.S. Congress, in standard-setting bodies like NIST (National Institute of Standards and Technology), and in private meetings between AI company CEOs and government officials. The outcomes of those decisions will shape what AI systems are built, how they're tested, and who is allowed to challenge them when they fail. These are not inevitable outcomes. They are choices being made by specific people, and they can be influenced by an informed public.

What You Can Actually Do With This

Here is an honest answer to the question most people have after a course like this: what is a 12-year-old supposed to do about AI safety? The answer is not "nothing" — but it's also not "go write your representative a letter" (though that's not a terrible idea). The answer is more immediate and more practical.

First: notice. You now have vocabulary and frameworks that most adults don't have. When AI appears in news coverage — and it appears constantly — you can ask the right questions. Is this system transparent? Who is accountable when it fails? Does it apply equally to all groups? Who decided what it was optimizing for? Does any independent body review it? These questions are not rhetorical. They are the scaffolding of responsible AI deployment, and asking them is itself a form of participation in how AI governance develops.

Second: resist simplification. AI debates get flattened into "AI good" vs. "AI bad." You now know that's useless. The real questions are specific: which system, doing what, with what safeguards, governed by whom, correctable how? The people most likely to shape AI governance well are the ones who resist the urge to pick a team and instead insist on the specific, complicated, uncomfortable questions.

Third: take the long view. Every technology that humans have struggled to control — nuclear weapons, social media, financial derivatives — eventually got some form of governance, some set of rules, some set of institutions. The governance was always imperfect, often late, and frequently contested. But it happened. AI will be no different. The question is whether the governance that emerges is shaped by people who actually understand the problem — or just by whoever was loudest or richest in the early years. You now understand the problem. That is a real advantage.

What You've Built Over These Four Lessons

You started with a teacher who got fired by an algorithm she couldn't question. You now understand why that happened — and can name the specific gaps (transparency, accountability, feedback) that allowed it. You understand why having a human in the loop isn't automatically the answer. You understand that AI systems encode values, and those values reflect whoever built them. And you understand that keeping AI under control requires institutions, governance, and practices — not just good intentions from engineers. Most people who work in AI policy came to these ideas after years of technical or legal training. You got there in four lessons. That's a starting point, not an endpoint — but it's a real one.

Lesson 4 Quiz

5 questions · Synthesize and apply concepts from the full module

1. The 2023 open letter calling for a pause in AI development was signed by over 1,000 researchers — and then largely ignored by the industry. This best illustrates:

Correct. The letter's failure to produce a pause illustrates the prisoner's dilemma: each lab faces pressure to keep building even if all labs would collectively prefer a pause. Voluntary agreements don't solve collective action problems — that's what external governance is for.

The letter's signatories genuinely believed what they wrote. The problem was structural: competitive race dynamics meant that any company that paused unilaterally would lose ground to those that didn't. This is exactly the problem external governance is designed to solve.

2. "Red-teaming" an AI system before deployment means:

Right. Red-teaming is adversarial testing: you try to break the system before users encounter its failure modes. It's proactive safety work — and it competes with time and cost pressures to ship faster.

Red-teaming is pre-deployment adversarial testing — actively attempting to find what can go wrong before the system is released to the public. It's a proactive safety practice, not a reactive or competitive one.

3. An AI company argues: "We're building powerful AI because if we don't, a less safety-focused company will, and that's worse for everyone." This argument:

Exactly. This argument — sometimes called the "if not us, then someone worse" argument — has genuine strategic merit and is genuinely self-serving at the same time. Good critical thinking means holding both truths at once rather than dismissing the argument or accepting it uncritically.

This is a case where a single position is incomplete. The argument has real logic — being at the frontier with a safety focus may be better than ceding it. But it also conveniently aligns with competitive incentives that exist regardless of safety. Both things are true simultaneously.

4. Independent AI auditing is compared to financial auditing. The most important parallel between the two is:

Correct. The core principle is conflict of interest. Companies that profit from their AI systems have structural incentives to minimize findings of harm. Independent review — by parties without those incentives — is the mechanism that makes oversight credible.

The parallel is about independence from conflicts of interest. Just as companies don't grade their own financial performance, they shouldn't be the sole evaluators of whether their AI systems are safe — because they have incentives that may not align with honest assessment.

5. A classmate says: "AI safety is just a problem for engineers and governments — there's nothing regular people can do." Based on this module, the best response is:

Well reasoned. Governance of powerful technologies has always required an informed citizenry — not just experts. The three things this lesson named — noticing, resisting simplification, and taking the long view — are contributions anyone with these frameworks can make, regardless of technical background.

Technical expertise is valuable but not sufficient. AI safety is fundamentally about values, power, and accountability — domains where broad public understanding and participation are essential. People who understand these concepts can shape governance conversations, journalism, public opinion, and policy — even without writing a line of code.

Lab 4: The Governance Debate

You're testifying before a city council deciding whether to allow AI in public school grading, hiring, and policing — all at once.

Your Role

Your city's council is voting next week on a proposal to allow AI systems across three domains simultaneously: automated essay grading in public schools, AI-assisted hiring for city jobs, and predictive policing software. The AI company says all three systems have passed internal testing. No independent audit has been done. You have three minutes to give your testimony. Your debate partner represents the company and believes all three should be approved.

Start by taking a position — approve all, reject all, or something more specific — and give your reasoning. Your partner will challenge you. Use what you learned across all four lessons.

Partner: Jordan Whitfield

Company Representative

I'll go first: our systems have 90%+ accuracy in internal testing across all three applications, we've deployed similar tools in seven other cities without major incident, and delaying approval costs the city an estimated $2M in efficiency savings per year. The technology is ready. What's your position?

Module Test

15 questions · Covers all four lessons · Pass at 80% or above

1. The Houston EVAAS teacher evaluation system was most problematic because:

Correct. Hidden reasoning prevents accountability — the core lesson of the EVAAS case.

The central issue was that the algorithm's formula was proprietary and could not be examined even in court, making it impossible for affected teachers to challenge its decisions.

2. Amazon's AI recruiting tool penalized women's résumés. This was primarily a result of:

Correct. The AI learned what "a good Amazon hire" had historically looked like — and that history was biased.

No intentional discrimination occurred. The AI learned patterns from historical hiring decisions that had been made predominantly by and for men.

3. Which of the following best describes the "Accountability Gap" in AI systems?

Correct. The Accountability Gap is the absence of a clear responsible party when AI systems cause harm.

The Accountability Gap is about diffused responsibility — when something goes wrong, no single party clearly owns the outcome, leaving harmed parties without recourse.

4. Automation bias refers to:

Correct. Automation bias is a documented human cognitive tendency — not a programming error in the AI itself.

Automation bias is about human psychology: people systematically over-trust automated outputs, even when better information is available from other sources.

5. The 2010 Flash Crash most directly demonstrated the danger of:

Correct. The Flash Crash was an emergent phenomenon: algorithms reacting to each other in a feedback loop that spiraled beyond any single point of control.

The Flash Crash resulted from many algorithms interacting in ways no human had anticipated, producing a collective behavior that no individual algorithm had been designed to create.

6. The concept of "corrigibility" in AI safety means a system is:

Correct. Corrigibility is about human control — can we intervene, change, or stop the system when needed?

Corrigibility refers to the property of remaining under human control — specifically, being modifiable, interruptible, and shut-downable when needed.

7. Facebook's internal research found that its engagement-optimizing algorithm promoted content that caused anger. The company's response was to:

Correct. Senior leadership rejected the fix, as documented in internal communications, because it would reduce engagement. This is a clear case of a deliberate values choice prioritizing engagement over reducing social harm.

According to documents revealed by Frances Haugen, Facebook's leadership rejected a proposed fix because it would reduce engagement metrics. A business decision overrode the safety recommendation.

8. An "objective function" in AI design refers to:

Correct. The objective function defines what the AI is optimizing for — and getting this wrong, or right for the wrong reasons, is a central challenge in AI safety.

The objective function is the mathematical definition of "success" for an AI system. It determines what the system tries to maximize — and misalignments between what's easy to measure and what we actually care about are a major source of AI problems.

9. The EU Artificial Intelligence Act classifies AI applications by risk tier. The primary purpose of this approach is to:

Correct. Risk-tiered regulation is proportional: higher stakes require higher scrutiny. It's an attempt to impose governance without prohibiting beneficial applications.

Risk-tiering applies more stringent oversight requirements to applications where errors cause more serious harm — like healthcare and criminal justice — while allowing lower-stakes applications to operate with lighter requirements.

10. A healthcare algorithm gave lower priority scores to Black patients than to equally sick white patients — without using race as a variable. This happened because:

Correct. This is the proxy problem: using a measurable variable as a stand-in for something you actually care about can import historical injustice, even when the intent is entirely neutral.

The algorithm used spending as a proxy for need. Because Black patients had historically received less healthcare spending due to systemic inequities, the algorithm interpreted lower spending as lower need — amplifying historical injustice through an apparently neutral variable.

11. The "race dynamic" in AI development is best described as a situation where:

Correct. Race dynamics create a collective action problem — everyone would prefer the safer outcome, but individual incentives push each actor toward speed over caution.

Race dynamics describe a prisoner's dilemma structure: any single company that slows down loses ground to competitors who don't, creating incentives that push the entire field faster than any individual actor might prefer.

12. Google Photos labeled photos of Black users with a harmful racial slur in 2015. This happened primarily because:

Correct. No malicious intent — but unrepresentative training data plus insufficient pre-deployment testing produced a deeply harmful outcome at scale.

The failure resulted from training data that underrepresented people of color, combined with a development process that didn't catch the problem before deployment. No intentional harm — but real harm resulted.

13. "Staged rollout" as a safety practice means:

Correct. Staged rollout is a risk management approach: limit initial exposure, observe real-world behavior, and scale only when confidence is established.

Staged rollout is about cautious deployment sequencing: start small, monitor closely, expand only after confirming the system behaves as expected in real conditions.

14. When Google Health's AI detected breast cancer more accurately than radiologists alone, but human-AI teams performed worse than AI alone in some studies, the best implication for policy is:

Correct. The finding doesn't argue for removing humans or AI — it argues for studying and redesigning human-AI interaction to ensure the combination actually outperforms either alone.

The finding calls for redesigning how humans receive and process AI recommendations — not eliminating either human oversight or AI assistance. The quality of human-AI collaboration depends heavily on how the interaction is structured.

15. A student who has completed this module would be best equipped to do which of the following when reading a news story about a new AI deployment?

Correct. These four questions — transparency, accountability, value alignment, and independent oversight — are the practical analytical tools this module built. They apply to any AI deployment in any domain.

The core skill this module develops is a framework for evaluating AI deployments: who can see the reasoning, who is responsible for failures, what is the system optimizing for, and is there independent verification of its safety claims?