Keeping AI Safe for Everyone · Introduction

The Most Powerful Tool Ever Built Doesn't Have an Off Switch

AI systems are already making decisions that affect your life — this course is about who's responsible for making sure they don't make those decisions badly.

Molly Russell was 14 years old when she died in the UK in November 2017. After her death, her father found out that Instagram and Pinterest had been recommending content about self-harm and depression to her over and over, algorithmically, without any human deciding to do so, right up until she died. The inquest concluded in 2022 that the negative effects of online content had contributed to her death. No one programmed those systems to hurt anyone. The people who built them almost certainly never imagined this outcome. The algorithm was just doing what it was optimized to do: keep a user engaged.

That story isn't ancient history. It happened because of a specific set of choices — choices about what an AI system should optimize for, who should oversee it, and what guardrails should exist. Those choices were made by engineers and executives, many of them well-meaning, and the consequences landed on a family that had nothing to do with any of it. That gap — between who makes the decisions about AI and who lives with the results — is exactly what this course is about.

By the end of this module, you'll be able to read any headline about AI and immediately ask the right questions: What was this system optimizing for? Who decided that? Who didn't have a say? Those aren't just abstract questions. They're the difference between AI that serves everyone and AI that quietly works against most of us. You won't leave here with all the answers. But you'll have the framework — and most people making these decisions professionally don't even have that yet.

If you finish every module, here's who you become:

You'll understand what AI alignment actually means — not as jargon, but as a concrete design problem with real stakes.
You'll be able to read any AI headline and immediately ask: what was this system optimizing for, and who decided that?
You'll know the difference between reward hacking, misalignment, and existential risk — and why conflating them leads to bad decisions.
You'll recognize what meaningful human oversight looks like in AI systems, and what it looks like when it's missing.
You'll become someone who can evaluate AI governance proposals — from corporate policies to international frameworks — without being misled by either hype or dismissal.
You'll understand who currently holds power over global AI safety decisions and where ordinary people do — and don't — have leverage.
You'll leave with a framework for thinking about AI risk that most working professionals in this field still don't have.

Keeping AI Safe for Everyone · Lesson 1 of 4

When the Robot Doesn't Know It's Wrong

AI systems fail not because they're evil — but because no one defined "good" clearly enough.

If a machine does exactly what you told it to do and someone gets hurt, who is responsible?

Starting around 2014, Amazon built an AI hiring tool designed to save time by automatically screening job applicants. The company fed it ten years of résumés from people who had been hired, and the system learned from those. By 2015, Amazon's own engineers had discovered that the tool was consistently downgrading résumés that included the word "women's" (as in "women's chess club" or "women's college"). It was also penalizing graduates of two all-women's colleges. Amazon shut the project down, and the story became public in 2018. The system had done exactly what it was designed to do: learn from historical hiring data. But that data reflected a decade of a male-dominated tech industry. The AI learned the bias and amplified it. Nobody programmed it to discriminate. It figured that out on its own.

That case is now one of the most studied examples in AI safety. And the uncomfortable question it raises is: what were they actually trying to build? A "good" hiring tool, obviously. But what does "good" mean when you train a system on data that reflects an unfair world?

What Is AI Safety, Actually?

Most people hear "AI safety" and picture science-fiction: a robot uprising, a computer that decides to destroy humanity. That's not what this course is about — or at least, it's not the part that matters for the next ten years of your life.

AI safety is the field of making sure AI systems do what we actually want them to do, without causing harm in the process. That sounds obvious. It turns out to be incredibly hard.

The Amazon hiring tool wasn't unsafe because it was powerful. It was unsafe because the people who built it forgot to ask a crucial question: "If we train this on historical data, what values will it absorb?" They defined "good" as "similar to past successful hires" — and that definition quietly contained a decade of inequality.
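To see how a definition like "similar to past successful hires" can absorb a bias nobody wrote down, here is a minimal sketch. It is not Amazon's system: the résumé data is invented, and `learn_word_scores` and `score_resume` are hypothetical helpers that simply score each word by how often past résumés containing it led to a hire.

```python
from collections import Counter

# Hypothetical, invented history: (words on the résumé, was the person hired?)
HISTORY = [
    ({"python", "chess"}, True),
    ({"python", "robotics"}, True),
    ({"java", "chess"}, True),
    ({"python", "women's", "chess"}, False),
    ({"java", "women's", "robotics"}, False),
    ({"java", "robotics"}, True),
]

def learn_word_scores(history):
    """Score each word by the hire rate of past résumés that contained it."""
    seen, hired = Counter(), Counter()
    for words, was_hired in history:
        for word in words:
            seen[word] += 1
            if was_hired:
                hired[word] += 1
    return {word: hired[word] / seen[word] for word in seen}

def score_resume(words, word_scores):
    """Average the learned scores of the words the model recognizes."""
    known = [word_scores[w] for w in words if w in word_scores]
    return sum(known) / len(known) if known else 0.0

WORD_SCORES = learn_word_scores(HISTORY)

# Two résumés that differ by a single word.
without_word = score_resume({"python", "chess"}, WORD_SCORES)
with_word = score_resume({"python", "chess", "women's"}, WORD_SCORES)
print(round(without_word, 2), round(with_word, 2))
```

No line of this code mentions gender. The lower score for the second résumé comes entirely from the invented history, which is the point: the rule was learned, not programmed.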

AI Safety: The work of making sure AI systems behave as intended, don't cause unintended harm, and keep humans able to understand and correct them when they go wrong.

Bias (in AI): When an AI system treats different groups of people unequally, often because the data it learned from reflected past inequality rather than because anyone programmed it to discriminate.

Here's something worth holding onto: most AI safety problems are not technical problems at their core. They're value problems. The engineers knew how to build the system. They didn't know — or didn't ask — what the system should actually care about.

The Optimization Trap

Every AI system is built around a goal — a thing it's trying to maximize or minimize. In machine learning, this is called the objective function (or sometimes the "reward"). The system gets better and better at achieving that goal. The problem: the goal you write down is almost never exactly what you wanted.

Think about YouTube's recommendation algorithm. Before 2019, it was optimized to maximize watch time — keep users watching as long as possible. The system got extraordinarily good at this. It also started routing huge numbers of people toward increasingly extreme content, because extreme content kept people watching. Zeynep Tufekci, a sociologist at the University of North Carolina, wrote about this in 2018, describing how watching moderate political videos would lead the algorithm to progressively more radical content. YouTube modified the algorithm in 2019. But for years, the system did exactly what it was told — maximize watch time — and the side effects spread across political discourse worldwide.

The Core Problem

You can't just tell an AI "be helpful." You have to define helpful so precisely that the system can't find a shortcut that technically satisfies your definition while violating your intent. This is called the alignment problem — aligning what the AI optimizes for with what humans actually value.
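The gap between the goal you write down and the goal you meant can be shown in a few lines. This is a toy sketch, not any platform's real recommender: the catalog, the numbers, the `extremeness` field, and the penalty size are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Video:
    title: str
    expected_watch_minutes: float  # the only thing the first objective looks at
    extremeness: float             # 0.0 (mild) to 1.0 (extreme); hypothetical label

# Hypothetical catalog with invented numbers.
CATALOG = [
    Video("Local news recap", 4.0, 0.1),
    Video("Policy explainer", 6.0, 0.2),
    Video("Outrage compilation", 11.0, 0.9),
]

def recommend(catalog, objective):
    """Return the video that scores highest under the given objective."""
    return max(catalog, key=objective)

def watch_time_only(video):
    # The written-down goal: maximize minutes watched, nothing else.
    return video.expected_watch_minutes

def watch_time_with_penalty(video, penalty_minutes=10.0):
    # One possible (still imperfect) attempt to write the intent into the goal.
    return video.expected_watch_minutes - penalty_minutes * video.extremeness

top_by_watch_time = recommend(CATALOG, watch_time_only)
top_with_penalty = recommend(CATALOG, watch_time_with_penalty)
print(top_by_watch_time.title)
print(top_with_penalty.title)
```

With these numbers the first objective picks "Outrage compilation" and the second picks "Policy explainer". Notice that someone had to choose the penalty of 10 minutes. That choice is a value judgment, not a technical fact.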

This matters to you right now because every platform you use — every feed, every recommendation, every autocomplete — is an optimization system. Once you understand this, you start noticing whose interests are baked into those objectives. Usually it's the company's revenue. Not yours.

Why This Is Everyone's Problem

Here's the part that most people miss when they hear "AI safety": it's not a problem for engineers to solve in private and then present to the rest of us as a solved product. The decisions that shape AI systems — what they optimize for, whose interests they serve, what harms are acceptable — are fundamentally political and ethical decisions. They belong to everyone.

But right now, they're mostly being made by a small number of people at a small number of companies, most of them in a few ZIP codes in California and Washington state. The rest of the world is downstream of those decisions.

In 2016, the investigative news organization ProPublica published an analysis of a software tool called COMPAS, used by judges in US courts to predict whether a defendant would commit another crime. The tool assigned a "risk score" that influenced bail and sentencing decisions. ProPublica found that the algorithm was nearly twice as likely to falsely flag Black defendants as high risk compared to white defendants. The company that built COMPAS, Northpointe, disputed the analysis and defended its methodology, arguing that a given score predicted reoffending about equally well for both groups. The disagreement revealed something important: there is no single mathematical definition of "fair." Several reasonable definitions cannot all be satisfied at once when the groups being compared have different reoffense rates in the data. Someone had to choose which one COMPAS used, and they chose without the affected communities having any say.
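Here is a small worked example of how two fairness definitions can pull apart. The counts are invented for illustration and are not COMPAS data; `rates` is a hypothetical helper. The two groups differ only in how often people in each group reoffend in the (invented) records.

```python
def rates(tp, fp, fn, tn):
    """Return three fairness-relevant rates from confusion-matrix counts.

    tp: flagged and reoffended      fp: flagged but did not reoffend
    fn: not flagged but reoffended  tn: not flagged and did not reoffend
    """
    return {
        "precision": tp / (tp + fp),             # of those flagged, share who reoffended
        "false_positive_rate": fp / (fp + tn),   # of those who did not reoffend, share flagged
        "false_negative_rate": fn / (fn + tp),   # of those who reoffended, share not flagged
    }

# Invented counts: 100 people per group, with different underlying reoffense rates.
group_a = rates(tp=45, fp=15, fn=15, tn=25)   # 60 of 100 reoffend
group_b = rates(tp=15, fp=5, fn=15, tn=65)    # 30 of 100 reoffend

for name, group in (("A", group_a), ("B", group_b)):
    print(name, {key: round(value, 3) for key, value in group.items()})
```

In both groups, 75% of the people flagged as high risk go on to reoffend, so by one definition the tool treats the groups the same. But a person in group A who would not reoffend is flagged 37.5% of the time, compared with about 7% in group B. Which of those two measurements counts as "fair" is not something the math can decide for you.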

You Now See What Most People Miss

When someone says an AI system is "objective" or "unbiased," you now know to ask: objective according to what definition? Unbiased by whose measurement? The choice of objective is a value judgment, and value judgments are never neutral. Knowing this changes how you read every headline about AI.

This doesn't mean AI is always bad, or that the people building these systems are villains. Most of them are trying to build useful things. But "trying to be useful" and "actually understanding the consequences of your choices" are two different things. The gap between them is where AI safety work happens.

The Question That Has No Clean Answer

COMPAS made predictions about real people. Some of those predictions were wrong. In some cases, someone spent more time in jail because an algorithm flagged them as high risk when they weren't. In other cases, someone was released who went on to commit another crime.

Here's the uncomfortable truth: human judges make these kinds of errors too — and research consistently shows that human judges also have racial biases in their sentencing. So the question isn't "should we use the algorithm or the human?" The question is: which type of error are we more willing to accept, and from whom?

Ethical Question — No Clean Answer

If an AI system and a human judge make the same type of error at the same rate, but the AI's error is documented and traceable and the human's error is invisible — is the AI safer? Or does transparency make the harm feel more deliberate? Who should be accountable when an algorithm is wrong?

Sit with that. It's not a trick question with a hidden right answer. The people who study AI safety professionally disagree about it. The reason it matters to you — right now, at your age — is that these systems are being deployed in schools, courts, hospitals, and hiring processes. By the time you're an adult, decisions shaped by AI will be woven into nearly every institution you interact with. The people deciding how those systems work need to hear from more than just engineers and executives.

That's why AI safety is everyone's problem. Not just because you'll be affected by it. Because the conversation about how to do it right is happening now, and most seats at that table are empty.

Lesson 1 Quiz

Four questions — reasoning counts more than memory.

Amazon's AI hiring tool discriminated against women primarily because:

Exactly. The system learned from a decade of a male-dominated industry. It wasn't programmed to discriminate — it absorbed that pattern from the training data. This is what makes AI bias hard to catch: it doesn't look like a bug. It looks like the system working correctly.

Not quite. The problem wasn't a deliberate choice or a technical error — it was something more subtle. What was the system learning from?

YouTube's pre-2019 recommendation algorithm kept routing users toward extreme content. This was primarily a result of:

Correct. The algorithm was doing exactly what it was told — maximize watch time. Extreme content is engaging, so the system learned to promote it. This is the optimization trap: you define the goal, and the system finds every path to that goal, including ones you never imagined.

Think about the concept of the optimization trap from the lesson. What was the algorithm actually trying to maximize — and what side effect did that create?

A city uses an AI system to decide which neighborhoods get more frequent road maintenance. The algorithm uses historical repair request data. Residents of lower-income neighborhoods historically called the city complaint line less often — not because their roads were better, but because they had less confidence the calls would help. The AI now schedules less maintenance in those neighborhoods. This is an example of:

Yes. This is a real pattern — researchers have documented it in predictive policing, healthcare resource allocation, and infrastructure maintenance. The data reflects who historically had power to make demands, not who actually had the greatest need. An AI trained on that data will perpetuate the gap while appearing neutral.

Consider: what does the historical data actually measure? Does "fewer repair requests" mean fewer road problems — or could it mean something else about those communities?

The COMPAS case revealed that there is no single mathematical definition of "fair." Why does this matter for AI safety?

Exactly right. The choice of which definition of fairness to embed in a system is a political and ethical decision disguised as a technical one. When engineers say "we chose the most mathematically rigorous definition of fairness," they're still making a values choice — just one that sounds neutral. Recognizing this is one of the most important things you can do when evaluating any AI system.

Think about what the lesson said about different definitions of fairness being mathematically incompatible. If you have to choose between them, what kind of choice is that — technical or value-based?

Lab 1: The Objective Auditor

You're auditing a real-world AI system. Your job is to find the gaps between what it optimizes for and what people actually need.

Your Role: Independent AI Auditor

A school district has deployed an AI tool that flags students as "at risk of dropping out" based on attendance records, grade trends, and disciplinary history. The district says it's using the tool to help students get support earlier. But several parents and teachers are raising concerns.

Your job is to investigate. Talk to REMI — the AI assistant below — who has analyzed the system's documentation. Challenge REMI's reasoning, push back on assumptions, and work out what the real risks are.

Start by telling REMI what you think the most obvious problem with this system might be. Then dig deeper — REMI will push back and complicate your thinking.

REMI — AI Analysis Peer Lab 1

I've read the documentation on the dropout-prediction tool. It uses three years of historical data from this district — attendance, grades, and disciplinary records. The district says the goal is "early intervention." Before you tell me what you think the problem is, let me ask you something: what do you think the system is actually optimizing for? Not what the district says it does — what is it mathematically maximizing or minimizing?

Keeping AI Safe for Everyone · Lesson 2 of 4

Who Decides What the Machine Learns?

The people who build AI systems make choices — about data, goals, and tradeoffs — that affect everyone. Most of those choices are invisible.

If you can't see the rules an AI follows, how do you know whether to trust it?

In January 2012, researchers working with Facebook secretly altered the News Feeds of 689,003 users without their knowledge. For one week, some users saw more positive posts than usual; others saw more negative ones. The goal was to find out whether emotional content was contagious — whether seeing sad posts made you post sad things. The study was published in the journal PNAS in June 2014. When the public found out, the reaction was immediate and furious. People were horrified that a company had deliberately manipulated their emotional environment as an experiment, with no consent, no warning, and no way to opt out. Facebook's response, essentially, was: you agreed to our terms of service. Adam Kramer, the researcher who led the study, later wrote on Facebook that he was "deeply sorry" for the distress it caused. The lead academic researcher, Jeffrey Hancock of Cornell University, defended the work as important and ethical. Neither apology nor defense resolved the core question: who gave Facebook the right to decide what emotional content 700,000 people would see?

The Data Problem

Every AI system learns from data. The data is collected by someone, curated by someone, and labeled by someone. Each of those steps involves choices — and choices reflect values, priorities, and blind spots.

Take image recognition. In 2015, Google Photos launched a feature that automatically organized photos by what was in them — faces, places, objects. A Black software developer named Jacky Alcine discovered that the system had labeled photos of him and his friend as "gorillas." Google apologized and removed the label. But the fix, as reported by Wired in 2018, wasn't to train the system better on Black faces — it was to block the gorilla category entirely. The system still struggled with dark-skinned faces. The problem wasn't solved; it was hidden.

Training Data: The collection of examples an AI system learns from. If the training data is unrepresentative, incomplete, or reflects historical bias, the AI will learn those flaws as if they were facts about the world.

The reason this matters isn't just one bad label on one photo app. Image recognition is used in facial recognition software deployed by law enforcement. In 2020, Robert Williams, a Black man in Detroit, was wrongfully arrested after a facial recognition system misidentified him from a grainy surveillance video. He was handcuffed in front of his children. The system had been trained on a dataset that significantly underrepresented dark-skinned faces, so it performed worse on them. Nobody who assembled that training dataset decided to make a system that would wrongfully arrest people. But their choices got Robert Williams handcuffed on his front lawn.

Invisible Choices, Real Consequences

When a doctor makes a judgment call and it turns out to be wrong, there's usually a trail of reasoning. You can examine it. You can ask the doctor to explain. When an AI system makes a wrong call, there's often no trail at all — just an output. This is called the black box problem.

Black Box: An AI system where you can see the inputs (what goes in) and the outputs (what comes out), but not the reasoning in between. Most large neural networks are black boxes: even their creators can't always explain a specific decision.

In 2019, a study published in Science magazine analyzed a healthcare algorithm used across the US to decide which patients needed extra medical care. Researchers Ziad Obermeyer and colleagues found the algorithm was significantly less likely to flag Black patients as needing care than equally sick white patients. The company that made it — Optum — had not intentionally built a racist system. The algorithm used healthcare costs as a proxy for healthcare needs. But because of systemic inequality in the US healthcare system, Black patients with the same level of illness historically generated lower healthcare costs (in part because they had less access to care). The algorithm interpreted "lower costs" as "healthier." The effect: sicker Black patients were being passed over for care they needed.

The researchers estimated that algorithms of this kind are applied to roughly 200 million people in the US each year. Nobody who designed this one had made a deliberate choice to disadvantage Black patients. They'd made a data choice that seemed reasonable in isolation and turned out to have enormous consequences.
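A minimal sketch makes the proxy problem concrete. This is not the real algorithm: the patients and numbers are invented, `top_k` is a hypothetical helper, and a count of chronic conditions stands in for "how sick someone actually is."

```python
# Hypothetical patients with invented numbers.
# Patients A and B are equally sick, but A has had less access to care,
# so less money has been spent on A.
PATIENTS = [
    {"name": "Patient A", "chronic_conditions": 5, "annual_cost": 4000},
    {"name": "Patient B", "chronic_conditions": 5, "annual_cost": 9000},
    {"name": "Patient C", "chronic_conditions": 2, "annual_cost": 6000},
    {"name": "Patient D", "chronic_conditions": 1, "annual_cost": 2000},
]

def top_k(patients, key, k=2):
    """Names of the k patients ranked highest on the given field."""
    ranked = sorted(patients, key=lambda p: p[key], reverse=True)
    return [p["name"] for p in ranked[:k]]

flagged_by_cost = top_k(PATIENTS, "annual_cost")          # the proxy
flagged_by_need = top_k(PATIENTS, "chronic_conditions")   # a more direct measure
print("Flagged using cost:", flagged_by_cost)
print("Flagged using illness:", flagged_by_need)
```

Ranking by cost flags Patients B and C and skips Patient A, who is just as sick as B. Ranking by illness flags A and B. The code is identical in both cases. Only the column it was pointed at changed.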

What You Now Understand

Every time someone says "the algorithm is neutral," you can ask: neutral to what? Every dataset embeds the world as it was, not the world as it should be. Every proxy measure (like "cost" for "health need") reflects an assumption. And most of the people affected by these systems never got to review those assumptions.

The Consent Gap

Back to Facebook's 2012 experiment. Here's what makes it particularly instructive for thinking about AI safety: the researchers didn't think they were doing anything wrong. They were studying real behavior in a real environment. They thought "terms of service" constituted consent. And in a narrow legal sense, maybe it did.

But there's a difference between legal consent and meaningful consent. When you click "I agree" on a 40-page terms-of-service document you haven't read (and that no one expects you to read), you haven't made a real choice. You've performed a ritual that provides legal cover for the company. The company knows this. You implicitly know this. And yet the ritual continues because it's convenient for everyone with power and inconvenient for everyone without it.

This pattern repeats throughout AI development. Data used to train large AI systems is often scraped from the internet — from photos people posted, text people wrote, art people made — without asking whether those people consented to train a commercial AI product. In 2023, artists including Sarah Andersen, Kelly McKernan, and Karla Ortiz filed a lawsuit against image-generation AI companies, arguing that their artwork had been used to train systems without consent or compensation. The legal questions are unresolved. The ethical question — whether you should be able to train a for-profit system on someone's creative work without asking — is one where reasonable people disagree intensely.

Ethical Question — No Clean Answer

If an AI company trains on publicly available data — photos, writing, art that people chose to put on the internet — does that make it ethically acceptable? People made things public for one purpose; those things are now being used for a different purpose entirely. Is "it was public" the same as "it was available for any use"? And if not, how do you draw the line?

These questions are being argued in courts and legislatures right now. By the time you're working in any industry — not just tech — they will have shaped what AI systems exist, what they can do, and who owns the benefit of that. You don't have to wait until you're an adult to have an opinion about them. The decisions are being made now.

Lesson 2 Quiz

Apply the concepts — don't just recall them.

The healthcare algorithm studied by Obermeyer et al. used "healthcare costs" as a proxy for "healthcare need." The main problem with this choice was:

Correct. The proxy measure looked reasonable in isolation — costs are related to care — but it absorbed the inequality of the system that generated those costs. This is why choosing how to measure a goal is as important as choosing the goal itself.

Consider: why would two equally sick patients generate different costs? What does that tell us about what "cost" actually measures?

Google Photos "fixed" the gorilla label problem by blocking the gorilla category entirely rather than improving the system's performance on dark-skinned faces. What does this tell us about how AI companies sometimes respond to bias?

Yes. Hiding a problem is not the same as fixing it. The system still performed worse on dark-skinned faces — it just no longer produced the label that made that visible. This is a pattern worth watching for in AI systems generally: when you can't see the error, you can't push for it to be corrected.

Think about the difference between making a problem invisible and actually solving it. What changed about the system's underlying performance on dark-skinned faces?

A streaming service uses listening data from millions of users to train a music recommendation AI. A musician releases an album. The AI learns from this data and starts recommending the musician's style to users — but never recommends the musician directly, just songs that "sound like" their style. The musician earns nothing from this. Is this an AI safety issue, an ethics issue, a legal issue, or none of these?

Good reasoning. Real-world AI problems rarely fall into one clean category. This scenario involves questions about what data is ethically usable to train systems, what the legal framework should be, and whether AI systems that extract and recombine human creative work without credit or compensation are operating safely and fairly. These domains overlap.

Think about all three dimensions separately: Does it raise questions about fairness and consent? Does it have legal dimensions? Could it be considered an AI safety issue if systems routinely extract value from creators without accountability?

Facebook's 2012 emotional contagion experiment used "terms of service" as legal cover for conducting psychological research without explicit informed consent. The broader lesson for AI systems is:

Exactly. The consent gap is one of the most persistent problems in AI development. "Technically legal" and "genuinely agreed to" are not the same thing. This is important not just for experiments — it's relevant everywhere data is collected to train AI systems, from voice assistants to facial recognition to large language models.

Consider the difference between clicking a box you haven't read and genuinely deciding to participate in something. Is one of those real consent? And if not — what does that mean for the AI systems built on that data?

Lab 2: The Proxy Problem

A company claims its AI system is fair. Your job is to find the proxy measure that isn't.

Your Role: Fairness Investigator

A health insurance company has built an AI to predict which policyholders will need expensive medical care in the next year, so they can offer "wellness programs" proactively. The company says the system uses only objective medical data: number of prior doctor visits, number of prior prescriptions, and prior emergency room visits.

REMI has access to background data on this system. Start by identifying which of those three inputs you think might be a problematic proxy — and why. REMI will test your reasoning and push you toward what the evidence actually shows.

Which of the three inputs — doctor visits, prescriptions, or ER visits — is most likely to embed inequality into the system's predictions? Make a case for your choice before REMI responds.

REMI — AI Analysis Peer Lab 2

I've reviewed the system specs. Three inputs: prior doctor visits, prior prescriptions, and prior ER visits. The company says these are purely medical. Before I tell you what the research shows, I want to hear your reasoning. Which one do you think is the most problematic proxy, and what's your argument? Don't just guess — walk me through it.

Keeping AI Safe for Everyone · Lesson 3 of 4

The Feedback Loop: When AI Makes the Problem Worse

Some AI systems don't just reflect inequality — they actively generate more of it over time.

If an AI system makes a decision that changes the world it's predicting, is it measuring reality or creating it?

In 2013, the Chicago Police Department began using an algorithm called the Strategic Subject List — also called the "heat list" — to predict which individuals were most likely to be involved in gun violence, either as perpetrators or victims. The list was meant to be used for "outreach." But what actually happened, as documented by reporting from the Chicago Tribune and later by academic researchers, was that people on the list were often subjected to increased police surveillance and stops. Being on the list increased your chances of being stopped by police. Being stopped by police increased the likelihood of an arrest — even for minor infractions. Arrests generated more data, which fed back into the algorithm and pushed people higher on the list. Some people on the list had been placed there partly because of prior police contact — contact that was itself a product of being over-policed. The algorithm was, in part, predicting the consequences of its own existence.

Chicago discontinued the Strategic Subject List in 2019. But the pattern it demonstrated — an AI system that shapes the behavior of people and institutions, which then generates the data the AI uses to make future predictions — is everywhere. It has a name: a feedback loop.

How Feedback Loops Work

A feedback loop happens when an AI system's outputs influence the real world, and that changed world generates new data, which the AI uses to update its predictions. In a closed loop, the AI can end up amplifying whatever pattern it started with — whether that pattern was accurate or not.

Feedback Loop: When an AI system's decisions affect the data that will be used to train or update future versions of the same system, causing predictions to reinforce themselves over time, regardless of whether the original prediction was accurate.

Here's a simple version: imagine an AI that predicts which students will need tutoring. It flags certain students. Those students get tutoring. Their grades improve. The AI notes that its predictions were accurate — those students did struggle initially. But what about students who weren't flagged? They didn't get tutoring. Their grades stayed flat. The AI interprets this as confirmation that they didn't need help. The result: the AI learns to route resources to whoever it already thought needed resources, and systematically overlooks everyone else.
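The tutoring example can be run as a tiny simulation. Everything here is an assumption made for illustration: the students, the starting scores, the two tutoring slots, and the update rule in `run_rounds`, which treats improved grades as "prediction confirmed" and flat grades as "didn't need help," just as the paragraph above describes.

```python
# Hypothetical students: (name, initial score from the AI, truly needs help?)
STUDENTS = [
    ("Ana", 0.6, True),
    ("Ben", 0.5, False),
    ("Cleo", 0.4, True),
    ("Dev", 0.3, True),
]

def run_rounds(students, slots=2, rounds=5, step=0.1):
    """Simulate the loop: the AI only ever updates on data its own choices produced."""
    scores = {name: score for name, score, _ in students}
    flagged_history = []
    for _ in range(rounds):
        # Flag the students with the highest current scores.
        flagged = sorted(scores, key=scores.get, reverse=True)[:slots]
        flagged_history.append(set(flagged))
        for name in scores:
            if name in flagged:
                scores[name] += step   # tutored, grades rose: read as "prediction confirmed"
            else:
                scores[name] -= step   # no tutoring, grades flat: read as "did not need help"
    return scores, flagged_history

final_scores, history = run_rounds(STUDENTS)
print(history)
print({name: round(score, 2) for name, score in final_scores.items()})
```

The same two students are flagged in every round, and the gap between the groups widens each time. Look at what never appears in the update rule: the third column, whether a student truly needs help. Cleo and Dev need help and are never flagged. Ben doesn't and always is.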

Now scale that up to criminal justice, hiring, lending, or healthcare. The same dynamic applies. And in each case, the people on the wrong end of the feedback loop often have no way to know the loop exists, let alone opt out of it.

Real Instance: Predictive Policing

Researcher Rashida Richardson and colleagues published a 2019 study documenting what they called "dirty data" in predictive policing systems. They found that cities including New Orleans, Los Angeles, and Chicago used historical crime data to train their AI tools — but that data had been generated by biased policing practices to begin with. The AI then directed more policing resources to the same areas, generating more arrests, which validated the original predictions. The loop was self-sustaining.

Recommendation Systems and Radicalization

Feedback loops aren't only about policing. Every major recommendation system runs on them.

When you watch a YouTube video, the algorithm notes your engagement. When it recommends something similar and you watch that too, the system strengthens the connection between those categories and your profile. If you watch one video skeptical of climate science — not because you believe it, maybe just because the thumbnail was interesting — the algorithm may begin recommending more. Not because it wants to radicalize you. Because engagement creates a positive feedback signal, and clusters of related content tend to generate more engagement than random recommendations.

Researcher Manoel Horta Ribeiro and colleagues at EPFL published a 2020 study in the proceedings of the ACM Conference on Fairness, Accountability, and Transparency tracking what they called "the alt-right pipeline": a pattern in which users who engaged with mainstream conservative content on YouTube were, over time, increasingly recommended content from more extreme channels. The paper found significant migration between channels of escalating radicalism, consistent with the algorithmic recommendation structure.

YouTube disputes the characterization and has modified its recommendation algorithm repeatedly since 2019. But the underlying dynamic — a recommendation system that learns from engagement and therefore tends to push people toward more extreme versions of whatever they've already engaged with — is a structural feature, not a one-time bug. You can patch the specific channels. The incentive structure that produces the pattern remains.

The Institutional Stakes

This is where AI safety becomes a policy question, not just an engineering one. What rules should govern the feedback loops in widely used recommendation systems? Should companies be required to disclose what their algorithms optimize for? Should users have a right to see why they're being recommended something? These decisions are being made in the EU, the US Congress, and the UN right now, and they're shaping the information environment you live in.

The Measurement Problem, Again

Feedback loops are worst when the AI is measuring something that its own behavior changes. In economics, there's a concept called Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." The moment you start optimizing for something, you change the behavior of the thing you're measuring.

In 2020, the UK government used an algorithm to assign students' A-level exam grades after COVID-19 cancelled in-person exams. The algorithm used school-level historical performance to moderate individual teacher-predicted grades. Students at historically lower-performing schools, disproportionately working-class and minority students, were the most likely to have their grades downgraded. Students at private schools, where small classes meant teacher predictions were often left largely intact, fared noticeably better. The outcry was immediate. Boris Johnson's government abandoned the algorithm within days and reverted to teacher-assessed grades. But thousands of students had already been rejected from university places on the basis of the algorithmic grades.

The algorithm had been measuring "what grades this school's students historically get" and using that to predict "what grade this specific student deserves." Those are not the same thing. Every individual student was being assessed not on their own work but on the aggregate history of their institution. The measurement wasn't measuring what it claimed to measure.
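A heavily simplified sketch makes the proxy error concrete. This is not the real grading model. It assumes, for illustration only, that teachers rank students from strongest to weakest and that grades are then handed out so the class matches the school's past results. The student names and both school histories are invented.

```python
def moderate_grades(ranked_students, historical_distribution):
    """Hand out grades by rank so the class matches the school's history.

    ranked_students: names ordered strongest to weakest (teacher ranking).
    historical_distribution: (grade, fraction) pairs, best grade first.
    """
    n = len(ranked_students)
    grades = {}
    position = 0
    cumulative = 0.0
    for grade, fraction in historical_distribution:
        cumulative += fraction
        cutoff = round(cumulative * n)  # how many students may have this grade or better
        while position < cutoff and position < n:
            grades[ranked_students[position]] = grade
            position += 1
    # Anyone left over receives the lowest grade in the distribution.
    lowest_grade = historical_distribution[-1][0]
    for name in ranked_students[position:]:
        grades[name] = lowest_grade
    return grades


students = ["Asha", "Ben", "Chloe", "Dev", "Ella"]       # same five students
school_x = [("A", 0.0), ("B", 0.4), ("C", 0.6)]          # no A grades in past years
school_y = [("A", 0.4), ("B", 0.4), ("C", 0.2)]          # plenty of A grades

print(moderate_grades(students, school_x)["Asha"])
print(moderate_grades(students, school_y)["Asha"])
```

Asha's own work never appears anywhere in the function. The only inputs are her rank and her school's past, so the same top-ranked student ends up with a different grade depending on which building she studies in.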

Ethical Question — No Clean Answer

If human teachers also have biases — and research shows they do, rating students differently based on race and socioeconomic status — is an algorithm that's at least consistent more fair? Even if its consistency encodes historical disadvantage? Is predictable bias better or worse than unpredictable bias? And who has the right to make that call on behalf of students?

You now know how to look at any AI system and ask: does this system's output change the data it uses to make future predictions? If yes, what pattern is it reinforcing? And who benefits from that reinforcement — and who doesn't? Most people using these systems never think to ask those questions. You do now.
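Those questions can be tried out on a toy model. The sketch below assumes two areas with exactly the same underlying rate of incidents, a fixed number of patrols, and a system that sends patrols wherever past records are highest. Every number is invented, and no real system is being reproduced.

```python
def run_loop(recorded, true_rate=0.1, patrols=100, years=10):
    """Toy loop: records decide patrols, and patrols create records.

    recorded: starting incident counts for each area (historical data).
    true_rate: incidents found per patrol, identical in every area.
    Returns the first area's share of all records after each year.
    """
    recorded = list(recorded)
    shares = []
    for _ in range(years):
        total = sum(recorded)
        # Decision: allocate patrols in proportion to past records.
        allocation = [patrols * r / total for r in recorded]
        # World: the true rate is the same everywhere, so new records
        # reflect only where the patrols were sent.
        new_records = [a * true_rate for a in allocation]
        # Data: the new records join the history the system learns from.
        recorded = [r + n for r, n in zip(recorded, new_records)]
        shares.append(recorded[0] / sum(recorded))
    return shares


for year, share in enumerate(run_loop([60, 40]), start=1):
    print(f"year {year}: area A holds {share:.0%} of all records")
```

Area A starts with more records for historical reasons, and the data keeps "confirming" that split year after year, even though the two areas are identical. The system looks accurate only because it is grading its own homework.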

Lesson 3 Quiz

Feedback loops and self-fulfilling predictions.

Chicago's Strategic Subject List was criticized partly because people on the list experienced increased police stops, which generated more arrest data, which pushed them higher on the list. This is an example of:

Correct. The loop was: prediction → increased surveillance → more arrests → more data → higher prediction score. The AI didn't need to be wrong to cause harm — it could self-validate even if the original predictions were inaccurate, because its existence changed the behavior it was measuring.

Think about what happened after people were placed on the list. Did the prediction change what happened to them? Did what happened to them then feed back into the algorithm?

A bank uses an AI to decide who gets approved for loans. The AI was trained on ten years of historical loan data. People in certain zip codes historically got fewer loans approved (partly due to past discriminatory lending). The AI learns this pattern and continues to approve fewer loans in those areas. Over the next decade, less investment flows into those areas. Property values stagnate. The bank's AI uses this as further evidence that the area is "high risk." This scenario best illustrates:

Exactly right. This is a real practice called "redlining" when done by humans — denying services to neighborhoods based on racial composition. When an AI learns from that history and replicates it, it doesn't matter whether the engineers intended discrimination. The effect is the same. The feedback loop is what makes it self-perpetuating rather than a one-time error.

Trace the loop carefully. What does the AI's decision do to the real world? What does that real-world change do to the data the AI will see in ten years? Where does that cycle end?

The UK government's 2020 A-level algorithm downgraded students from historically lower-performing schools. The core flaw in the algorithm's logic was:

Yes. Measuring "what does this school's student body historically score" and applying that to "what does this individual student deserve" is a proxy error. Individual ability and school-level historical averages are different things. The algorithm was encoding group history onto individuals — which is almost always both inaccurate and unfair.

What was the algorithm actually measuring? Was it measuring anything about the individual student's abilities, or was it measuring something else entirely?

You're building a content recommendation system for a news app. You want to avoid the radicalization feedback loop described in the lesson. Which of the following design choices would most directly address the structural problem?

Good reasoning. Warning labels and manual review address symptoms without changing the underlying incentive. Banning categories removes content but doesn't fix the loop. The structural solution is to change what the algorithm optimizes for — because the feedback loop is a direct consequence of optimizing for engagement. Change the objective, and you change what the system learns to amplify.

Think about where the loop starts. The feedback loop exists because of what the algorithm is optimizing for. Which of these options changes the objective itself — rather than just patching a symptom?

Lab 3: Loop Detective

Find the feedback loop — then figure out how to break it.

Your Role: Systems Analyst

A large urban school district uses an AI to allocate tutoring resources. The system monitors students' grades weekly and flags those who are falling behind for extra support. Teachers see the flags and spend more time with flagged students. End-of-year scores improve for flagged students. The district reports the system is a success.

But a researcher has noticed something: students who were not flagged in September — perhaps because their early grades were average, not failing — received no additional support. By June, the gap between flagged and unflagged students had grown, and the AI interpreted this as confirmation that its original risk assessments were accurate.

Work with REMI to analyze this loop: where does it start, what does it reinforce, and what would you actually change to break it?

Start by mapping the feedback loop: what is the input, what decision does the AI make, how does that decision change the world, and how does that changed world feed back into future decisions?

REMI — AI Analysis Peer Lab 3

Alright — walk me through the loop. Don't just name it. Map it out: what's the input signal, what does the AI decide, what happens to real students as a result, and how does that outcome show up in the data the AI will see next year? Then tell me: what's the hidden assumption the system makes that's causing the problem?

Keeping AI Safe for Everyone · Lesson 4 of 4

Who Gets to Fix It?

AI safety isn't just about finding problems — it's about building systems where humans can see, understand, and correct mistakes before they become irreversible.

If the people most affected by an AI system have no way to challenge it, can it ever be considered safe?

Between 2013 and 2019, the Dutch government ran an automated fraud detection system called SyRI (System Risk Indication) that flagged citizens for potential welfare fraud. The system analyzed data from multiple government sources (tax records, utility bills, residency registration) and assigned risk scores to individuals. People in low-income neighborhoods, many of them immigrants or from ethnic minority backgrounds, were disproportionately flagged. If you were flagged, you could be subjected to intensive investigation before any fraud was proven. In a closely related scandal, risk profiling by the Dutch tax authority led to thousands of families being wrongly accused of childcare benefit fraud and ordered to repay money they were legally entitled to. Some lost their homes. In February 2020, a court ruled SyRI illegal, finding it violated human rights. In January 2021, Prime Minister Mark Rutte's government resigned over the benefits scandal. But the damage to tens of thousands of families had been accumulating for years. No one had been watching.

The Oversight Problem

The Netherlands scandal illustrates something AI safety researchers call the human oversight problem: as AI systems make more decisions faster, the ability of humans to monitor and correct those decisions can break down. Not because anyone decided to stop watching — but because the volume and speed of automated decisions outstrips human capacity to review them.

Human Oversight The ability of humans to monitor what an AI system is doing, understand why it made a particular decision, and intervene to correct or override it. When oversight breaks down, errors can compound for years before anyone notices.

SyRI ran for six years before a court stopped it. During that time, thousands of families were harmed. The people who were flagged had no right to see their risk score, no right to know which data points generated it, and no effective way to challenge it. The system was a black box used by the government against its own citizens, with no meaningful appeal process.

This is what happens when oversight fails at the institutional level. And it's not hypothetical — it happened in a wealthy, democratic country with a functioning legal system. The legal system eventually caught up. By then, the government had fallen and the harm was done.

The Scale Problem

A human bureaucrat making a mistake about your benefits affects one family. An automated system making the same mistake systematically affects thousands of families simultaneously, without any single error being visible enough to trigger review. Scale transforms individual mistakes into systemic harm.

Transparency, Explainability, and Accountability

AI safety researchers often talk about three properties that AI systems used for consequential decisions need to have. Each one matters, and they're related but distinct:

Transparency Being able to see what data an AI system uses, what it was trained on, and what it optimizes for. Transparency doesn't mean you understand every calculation — it means the basic facts about the system are available for scrutiny.

Explainability Being able to understand why the AI made a specific decision in a specific case. "The algorithm flagged this person for fraud" is not explainability. "The algorithm flagged this person because their utility bills were paid from a different address than their registered residence" is.

Accountability The existence of a clear path to challenge an AI decision and have a human review it. Accountability means someone — a person, not just the algorithm — is responsible for the outcome and can be asked to justify it.
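The three properties can also be read as requirements on what a system hands back with every decision. Here is a minimal sketch, assuming a simple points-based rule list. The rules, field names, threshold, and contact address are all hypothetical and do not describe any real fraud system.

```python
from dataclasses import dataclass, field

# Hypothetical rules: (record field, points, plain-language reason).
RULES = [
    ("utility_address_mismatch", 2,
     "Utility bills are paid from a different address than the registered residence"),
    ("unreported_income_change", 2,
     "Reported income differs from employer records"),
    ("recent_move", 1,
     "Registered address changed in the last 12 months"),
]


@dataclass
class Decision:
    flagged: bool
    score: int
    data_used: list = field(default_factory=list)      # transparency
    reasons: list = field(default_factory=list)        # explainability
    appeal_contact: str = "human-review@example.org"   # accountability (placeholder)


def assess(record, threshold=3):
    """Score a record and keep the reasons alongside the result."""
    score = 0
    reasons = []
    for key, points, reason in RULES:
        if record.get(key):
            score += points
            reasons.append(reason)
    return Decision(
        flagged=score >= threshold,
        score=score,
        data_used=[key for key, _, _ in RULES],
        reasons=reasons,
    )


decision = assess({"utility_address_mismatch": True, "recent_move": True})
print("Flagged:", decision.flagged)
for reason in decision.reasons:
    print("-", reason)
print("To challenge this decision, contact:", decision.appeal_contact)
```

A black-box system would return only the first line of that output. The design choice here is that a decision is not allowed to exist without its reasons and a route to a human attached to it.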

The European Union's General Data Protection Regulation (GDPR), which took effect in 2018, established what is often called a right to explanation for automated decisions. Under EU law, if a company makes a significant decision about you using only an algorithm, you have a right to ask for human review and for meaningful information about the logic involved. This is one of the first major regulatory frameworks to try to enforce explainability. How far that right extends, and whether it works in practice, is still being tested. But it establishes the principle: consequential automated decisions need to be explainable, or they shouldn't be made.

In the US, no equivalent national standard exists, though the Consumer Financial Protection Bureau has published guidance requiring that people denied credit must receive a specific reason — meaning lenders using AI for credit decisions have to be able to produce an explanation. This doesn't solve the black box problem for most AI applications. But it shows that explainability is achievable when regulators require it.

Who Gets a Voice?

Here's the piece of AI safety that gets the least attention in mainstream coverage: the question of who gets to participate in deciding how AI systems are built in the first place.

In December 2020, Google dismissed AI ethics researcher Timnit Gebru, one of the most prominent researchers on AI bias, after a dispute over a research paper she had co-authored that was critical of large language models (the technology underlying systems like ChatGPT). Gebru says she was fired; Google has said it accepted her resignation. The incident sparked enormous controversy in the AI community. Gebru had been one of few Black women in a senior AI research role at a major tech company. Her research focused on the ways AI systems could encode and amplify harm for marginalized communities, which happened to be the communities most often missing from the rooms where AI systems were being designed.

This isn't just about one person. It reflects a structural issue: the people who have the most to lose from AI systems that embed bias or fail to account for marginalization are also the least represented in the rooms where those systems are built. This isn't a coincidence. It's a self-reinforcing pattern with real consequences for what gets built, what gets questioned, and what gets ignored.

You Now Have the Framework

You've covered the full picture now: AI systems fail not just because of technical errors, but because of choices about objectives, data, measurement, feedback, oversight, and voice. Every one of those failure modes is a place where more people — including people your age — should be asking questions. The field of AI safety is still being defined. The norms around transparency and accountability are being established right now. You are not too young to have informed opinions about this. You're exactly the right age to be developing them.

Ethical Question — No Clean Answer

If you build an AI system that causes harm to a group of people, and you didn't intend that harm, and you were following standard industry practices — what is your responsibility? Is "I didn't know" a valid defense when the tools to anticipate the harm existed? Does it matter whether the people most likely to be harmed were given a chance to warn you? Who bears responsibility when an institution adopts an AI system whose flaws were documented and knowable — the company that built it, the institution that deployed it, or the regulators who allowed it?

That question has no clean answer. But it's the right question to be asking about every consequential AI deployment you encounter for the rest of your life. The fact that you're asking it at all puts you ahead of most people making decisions about these systems professionally. Use that.

Lesson 4 Quiz

Oversight, accountability, and who gets a seat at the table.

The Dutch SyRI scandal is particularly relevant to AI safety because:

Correct. The key features of the SyRI case are: six years of operation, thousands of families harmed, no effective individual appeal mechanism, and a black-box system that affected people couldn't scrutinize or challenge. This is what the failure of human oversight looks like at scale in a real democratic country.

Focus on what made the harm so extensive and long-lasting. What was missing that could have stopped or limited it earlier?

The EU's GDPR includes a "right to explanation" for automated decisions. What problem in AI safety is this most directly designed to address?

Yes. The right to explanation is specifically about explainability and accountability — giving people a path to understand why a decision was made and to challenge it with a human reviewer. It doesn't fix what the AI optimizes for or how it was trained. But it creates a minimum standard for consequential automated decisions: they must be explainable or they can't be made.

Think about what the GDPR right to explanation actually gives someone. What can they do with it that they couldn't do before? Which of the AI safety problems does that address most directly?

A hospital uses an AI system to prioritize which patients receive follow-up calls from nurses. The system was built without input from community health advocates or patients from the communities the hospital primarily serves. Three years after deployment, a review finds the system is systematically deprioritizing elderly patients who speak English as a second language — because it weights phone call response rates as a signal of patient engagement, and those patients are less likely to answer unknown numbers. The hospital's administrators say they had no idea. This scenario most directly illustrates:

Exactly right. Phone response rate is a reasonable-sounding proxy for engagement — unless you're designing for a community where that proxy doesn't hold. The people who would have flagged this as a problem were not in the room when the system was designed. And without ongoing oversight, it ran for three years. This is why representation in AI design and continuous monitoring are both components of AI safety — not extras.

What would have changed if patients from those communities had reviewed the system design? What would have changed if the hospital had an ongoing monitoring process? Both of those gaps contributed to this outcome.

Based on everything in this module: which of the following is the most accurate summary of why AI safety is "everyone's problem," not just an engineering challenge?

Yes. This is the core argument of the whole module. AI safety failures are rarely purely technical. They're failures about values — what gets optimized, whose data counts, whose definition of fairness wins, who has appeal rights, who was in the room. Those are political and ethical decisions. They belong to everyone affected by them. Right now, they're mostly made by a small group. That's the problem — and understanding it is the beginning of being able to do something about it.

Think back through the four lessons: Amazon's hiring tool, Google Photos, COMPAS, YouTube, Chicago's heat list, SyRI, the UK grades algorithm. In each case, what kind of failure was at the root? Was it primarily technical, or primarily about values and decisions?

Lab 4: Design the Safeguard

You're on an AI oversight board. A system is going live next month. You have to decide what safeguards it needs before it does.

Your Role: Oversight Board Member

A city government plans to deploy an AI system that recommends which families should receive priority access to affordable housing. The system uses income data, employment history, family size, and current housing conditions. The city says it will make the process "faster and fairer" than the current system, where decisions are made by individual caseworkers and wait times exceed two years.

You're on the independent oversight board that must approve or reject the deployment — or approve it with conditions. You have one conversation with REMI, who has studied comparable systems deployed in other cities. You need to decide: what conditions, if any, do you require before this system goes live?

Start by telling REMI the single safeguard you think is most critical — the one without which you would block deployment entirely. Be specific about what it requires and why the absence of it would make the system unsafe.

REMI — AI Analysis Peer Lab 4

I've reviewed four comparable deployments in other cities. Two were discontinued within 18 months due to community pushback. One is still running and considered mostly successful. One caused a civil rights lawsuit that's still in court. I'll share details on any of those as we go. But first — what's your non-negotiable? The one thing you'd block the whole system over if the city refused to include it. Tell me what it is and make the case for why that specific thing is the critical line.

Module Test

15 questions across all four lessons. 80% to pass.

1. Amazon's AI hiring tool learned to discriminate against women primarily because:

Correct. The system learned from a decade of biased historical data. This is what makes training data problems so dangerous — they look like facts about the world.

Think about what the system was learning from and what that data reflected.

2. The "alignment problem" in AI refers to:

Yes. Alignment is about the gap between the objective you write down and what you actually meant.

Think about the YouTube example — the system achieved its stated objective perfectly. What was the problem?

3. In the COMPAS case, the finding that different definitions of fairness are mathematically incompatible means:

Correct. Fairness is not a neutral technical concept. Choosing a definition is choosing whose interests to prioritize.

Consider: if two definitions of fairness are equally mathematically valid but produce different results for different groups, who decides which one to use? Is that a technical choice?

4. Google Photos' response to the gorilla-labeling incident — blocking the gorilla category entirely — is best characterized as:

Yes. Hiding an error is not fixing it. The system still performed worse on dark-skinned faces — users just couldn't see that anymore.

What changed about how the system performed on dark-skinned faces? What changed about what users could observe?

5. Facebook's 2012 emotional contagion study revealed a gap between:

Exactly. The consent gap is one of the most persistent issues in AI development — it affects not just experiments but all data collection used to train AI systems.

What did users technically agree to? What had they not genuinely chosen to participate in?

6. The healthcare algorithm studied by Obermeyer et al. used healthcare costs as a proxy for healthcare need. A city government using a similar approach builds an AI to allocate park maintenance resources and uses "number of maintenance requests submitted" as a proxy for "need for maintenance." Based on what you know, what is likely to go wrong?

Correct reasoning. "Requests submitted" doesn't measure maintenance need — it measures willingness and ability to request service. Those things are correlated with trust in government and historical responsiveness, which track along socioeconomic and racial lines. This is a real phenomenon documented in US city resource allocation research.

Think about what "number of requests submitted" actually measures. Does it measure how much maintenance is needed, or does it measure something else?

7. A feedback loop in an AI system occurs when:

Yes. The defining feature of a feedback loop is that the AI's outputs influence the world, and that changed world generates new inputs — which can make original patterns amplify regardless of their accuracy.

Think about Chicago's heat list. What did the list do to the people on it? What did that do to their data going forward?

8. Chicago's Strategic Subject List was discontinued in 2019. The most significant AI safety lesson from this case is:

Yes. The loop was self-validating: the prediction changed what happened to people, and what happened to people fed back into the prediction. The system could look accurate while causing the very behavior it was predicting.

Think about whether the system needed to be inaccurate to cause harm. What if it predicted crime risk and then the surveillance it triggered actually led to more documented crime in those areas?

9. The UK A-level grading algorithm downgraded students from historically lower-performing schools. Goodhart's Law — "when a measure becomes a target, it ceases to be a good measure" — applies here because:

Yes. The moment school-level history became the target for individual grade calculation, the measure stopped tracking what it was supposed to measure — individual student performance. Every individual student was averaged into their school's past, which is not what grades are supposed to represent.

What was the algorithm trying to measure? What did it actually end up measuring once it used school averages as the target?

10. Researcher Rashida Richardson's documentation of "dirty data" in predictive policing showed that:

Correct. The data wasn't just biased — it was the output of a biased system that then became the input for a new system. The loop made the bias invisible because the new system appeared to be predicting from objective data.

What did the historical data reflect about how policing had been conducted? And what did that mean for the AI trained on it?

11. The Dutch SyRI system ran for six years harming thousands of families. Which combination of AI safety failures best explains why the harm went undetected and unchallenged for so long?

Exactly right. All three failures contributed. Transparency failure: people didn't know they were on a list or why. Explainability failure: no specific reasoning was available. Accountability failure: no realistic path to challenge the decision. Remove any one of these failures and the harm would likely have been caught sooner.

What would have had to be different for affected families to challenge the system? What information would they have needed? What process would they have needed access to?

12. The EU's GDPR "right to explanation" for automated decisions is a regulatory response to which AI safety problem?

Yes. The right to explanation is specifically about explainability and accountability. It gives people a minimum right: know why this decision was made about you, and have access to human review. It doesn't fix the other problems — but it creates a check on the most direct individual harms of black-box decision systems.

What does the right to explanation actually give someone? What does that help them do that they couldn't do with a black-box decision?

13. Timnit Gebru's dismissal from Google in December 2020 is relevant to AI safety because:

Yes. The structural point matters: representation in AI design isn't just ethically desirable — it's an epistemic necessity. The people most likely to spot certain harms are the people most likely to have experienced the conditions that generate them. If those people aren't in the room, or aren't protected when they speak, critical risks go unidentified.

Think about who is most likely to recognize AI harms experienced by specific communities. What happens to AI safety if those perspectives are systematically excluded from design teams?

14. A social media company builds an AI that recommends friends to connect with. After two years, data shows users are ending up in highly homogeneous social networks — they mostly see and connect with people who share their demographics, beliefs, and geographic location. The AI was never designed to do this. Which explanation is most likely correct?

Good analysis. This is the optimization trap meeting the feedback loop. The AI learns that similar recommendations get accepted more often, so it recommends more similar people, which generates more acceptances, which reinforces the pattern. No bug, no deliberate design — just an optimization objective producing an outcome no one explicitly intended.

Think about what the AI is optimizing for. What behavior does maximizing "accepted connections" actually reward? And how does that create a feedback loop?

15. Which of the following most accurately describes why AI safety is a problem for everyone — not just for engineers or policymakers?

Yes. This is the core argument. AI safety failures aren't primarily technical — they're failures of values, representation, and accountability. The decisions being made about these systems belong to all of us. The gap between who makes them and who lives with the consequences is the problem this field is trying to close.

Trace through every case in this module. In each one, was the root failure primarily technical, or primarily about whose values and interests were baked in — and who didn't have a say?