In November 2022, a chatbot called ChatGPT launched to the public. Within five days, it had a million users. Within two months, it had a hundred million, making it the fastest-growing consumer application in history up to that point. Students used it to write essays. Doctors used it to summarize research. Some school districts banned it immediately. Others scrambled to figure out what to do. And most people, including most adults, had no real idea how it actually worked. They just knew it was everywhere, and it felt different from anything before.
That pattern has happened before. When electricity arrived in homes in the 1880s, most people treated it like magic and got nervous about it for decades. The ones who understood even the basics (that electricity flows through circuits, that it can be controlled, that it obeys rules) had a completely different relationship to it. They weren't afraid; they were curious. They weren't helpless; they had leverage. The same thing is happening right now with AI, and you're living through the early years of it.
This course won't make you an AI engineer. What it will do is take you inside the actual process: how machines learn from data, why they sometimes fail spectacularly, what it costs to build them, and who gets to decide what they do. You'll finish this course seeing things in news stories, in apps, and in conversations about AI that most people, including most adults, completely miss. That's not an exaggeration. It's just what happens when you understand the loop instead of just watching the output.
In 1994, engineers at a company called HNC Software did something that banks thought was impossible: they built a system that could read a credit card transaction (just the amount, the location, and the time) and guess, within seconds, whether it was fraud. No human reviewed it. No rule book said "transactions over $500 in a foreign country are suspicious." The system had simply been shown hundreds of thousands of past transactions, told which ones turned out to be fraud, and left to find patterns on its own.
It was called Falcon, and it worked. By the mid-1990s, Falcon was processing roughly 500 million transactions per year across major banks. It caught fraud that rule-based systems missed completely. But it also occasionally blocked legitimate purchases: a traveler in Tokyo, a college student buying expensive textbooks. The system had learned to be suspicious in ways its creators didn't fully anticipate and couldn't always explain. The engineers at HNC had not programmed those suspicions. The machine had found them itself, buried in the data.
That story contains almost everything you need to understand about how machine learning works. A problem. A pile of labeled examples. A system that searches for patterns. And consequences, good and bad, that nobody fully predicted. The rest of this lesson unpacks how that loop actually runs, step by step.
Before any machine can learn anything, it needs examples. Thousands of them, usually millions. These aren't just numbers; they're labeled examples, which means each one comes with an answer attached. In Falcon's case, each transaction came with a label: fraud or not fraud. That label is the thing the machine is trying to learn to predict.
Think of it like studying for a test by reading old tests with the answer key included. You're not memorizing the specific questions; you're trying to absorb the pattern that determines which answers are right. The machine is doing the same thing, just with numbers instead of words, and at a scale no human could manage.
Here's the first thing that should give you pause: the labels are created by humans. Someone decided which past transactions were fraud. Someone decided which emails were spam. That human judgment, with all its inconsistencies and biases, gets baked into the training data and therefore into the machine's learning. The machine doesn't know this. It just trusts the labels it's given.
Once you have labeled data, you need a model: a mathematical structure that takes an input (a transaction, an image, a sentence) and produces an output (a prediction). At the start of training, the model knows nothing. Its predictions are essentially random. It looks at a transaction and guesses: fraud. It's wrong. What happens next is the entire engine of machine learning.
The model receives a signal: you were wrong, and here's how wrong. In machine learning, this signal is called the loss: a number that measures the gap between the model's guess and the actual correct answer. A big loss means a terrible guess. A small loss means the guess was close. The model's job, over thousands or millions of training steps, is to make that loss number as small as possible.
To reduce the loss, the model adjusts its internal settings: millions of tiny numerical dials called parameters or weights. Nudge this weight up, that one down, and check whether the loss gets smaller. Do this millions of times, across millions of examples, and the model gradually gets better. This adjustment process has a name: gradient descent. It sounds complicated, but the intuition is simple: if you're trying to find the lowest point in a valley while blindfolded, you feel the slope under your feet and take a small step downhill. Repeat until you can't go any lower.
Make a prediction → compare it to the correct answer → measure how wrong you were (loss) → adjust your settings slightly to be less wrong next time → repeat. That's the core loop. Everything else in machine learning, all the complexity and all the jargon, is a variation on or elaboration of this.
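The core loop can be sketched in a few lines of Python. This is a minimal illustration, not how Falcon or any production system was built: it assumes a model with a single weight, a made-up dataset that follows y = 2x, a squared-error loss, and a hand-picked learning rate. All names here are hypothetical.

```python
# Toy labeled data: each pair is (input, correct answer), following y = 2x.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]


def mean_squared_loss(weight, data):
    """Average squared gap between the model's guesses and the answers."""
    return sum((weight * x - y) ** 2 for x, y in data) / len(data)


def loss_slope(weight, data):
    """Slope (gradient) of the loss with respect to the single weight."""
    return sum(2 * (weight * x - y) * x for x, y in data) / len(data)


def train(data, learning_rate=0.01, steps=500):
    weight = 0.0  # the model starts out knowing nothing
    history = []
    for _ in range(steps):
        history.append(mean_squared_loss(weight, data))      # measure how wrong
        weight -= learning_rate * loss_slope(weight, data)   # small step downhill
    return weight, history


weight, history = train(examples)
print(f"learned weight: {weight:.3f}")
print(f"loss at start: {history[0]:.3f}, loss at end: {history[-1]:.6f}")
```

Real models repeat the same step with millions of weights instead of one, but the shape of the loop is the same: predict, measure the loss, adjust, repeat.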
Here's a trap that every machine learning system can fall into: memorization. If you study old tests long enough, you can memorize the specific questions rather than understanding the underlying concept. A model can do the same thing: it can learn to produce the right answer for every example in its training data without learning anything that generalizes to new situations.
In 2016, researchers at Google published a paper demonstrating this dramatically. They trained a deep neural network on a dataset where the labels were assigned completely at random: images of cats labeled as trucks, dogs labeled as planes. The model eventually reached nearly 100% accuracy on the training data. It had memorized garbage perfectly. But it couldn't predict anything outside that training set. This failure mode is called overfitting.
To catch overfitting, engineers hold back a portion of their data and never show it to the model during training. This is called the test set. After training is complete, they evaluate the model on the test set: examples it has never seen before. If performance is good on training data but bad on test data, the model overfit. If it holds up, the model has actually learned something real.
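A small sketch makes the train/test gap concrete. It assumes an extreme stand-in for an overfit model: a lookup table that memorizes every training example. The "images" are just ID numbers with randomly assigned labels, so there is nothing real to learn. Everything here is invented for illustration and is not the setup used in the paper described above.

```python
import random

rng = random.Random(0)

# 2,000 made-up "images" (just ID numbers) with labels assigned at random.
examples = [(image_id, rng.choice(["cat", "truck"])) for image_id in range(2000)]

# Hold back 20% as a test set the model never sees during training.
train_set, test_set = examples[:1600], examples[1600:]


class MemorizingModel:
    """Stores every training example; has no real rule for unseen inputs."""

    def fit(self, data):
        self.lookup = dict(data)

    def predict(self, image_id):
        # Unseen inputs fall back to a fixed guess.
        return self.lookup.get(image_id, "cat")


def accuracy(model, data):
    return sum(model.predict(x) == label for x, label in data) / len(data)


model = MemorizingModel()
model.fit(train_set)
train_acc = accuracy(model, train_set)
test_acc = accuracy(model, test_set)
print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

Training accuracy is perfect because every answer was memorized. Test accuracy lands near a coin flip, which is exactly the gap the held-back test set exists to expose.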
You now understand something that most people reporting on AI don't mention: a model's training accuracy and its real-world accuracy are different things, and the gap between them is one of the most important numbers in the entire field.
Return to Falcon for a moment. By the early 2000s, fraud-detection AI was everywhere in banking and insurance. And researchers began noticing something troubling: these systems were more likely to flag transactions from certain zip codes, certain spending patterns, certain demographic profiles. This was not because their creators programmed that in, but because historical fraud data reflected decades of discriminatory lending and policing. The machine learned from the past. The past was not neutral.
The loop (data in, prediction out, adjust) sounds mechanical and objective. But the data comes from a world shaped by human decisions, human inequities, human history. When a machine learns from that data, it learns the inequities too. It doesn't know they're inequities. It just sees patterns.
If a fraud-detection system was trained on historical data that reflects past discrimination, and that system now flags certain communities' transactions more often, is the system being discriminatory? It's doing exactly what it was mathematically optimized to do. But the outcome harms some people more than others. Who is responsible? The engineers who built it? The bank that deployed it? The people who created the historical data? There is no consensus answer to this question. The people making these systems right now are actively arguing about it.
You now understand the core learning loop: data, model, loss, training, testing. And you understand that "the machine learned it" is never a complete explanation. The machine learned it from somewhere, and that somewhere matters enormously. Knowing this changes how you read every headline that says an AI "decided" something.
When you hear "an AI flagged this account as suspicious" or "an algorithm recommended this video," you now know there's a specific loop behind that decision: training data with human-assigned labels, a model adjusting its weights to reduce loss, and a test set that may or may not reflect the real world it's now operating in. Most people hear "AI decided." You hear a story about data, choices, and consequences. That's a real difference.
A city has deployed an AI system to help emergency dispatchers prioritize which 911 calls get the fastest response. The system was trained on five years of historical dispatch records. Three months in, a community organization has filed a formal complaint: response times in the eastern districts are significantly longer than in the western districts, even for similar emergencies. The city says the AI is "just optimizing based on data." You've been brought in to investigate.
Your lab partner is an AI analyst who has seen this kind of situation before. They won't tell you what to think, but they'll push back if your reasoning is sloppy, and they'll ask you to go deeper when you're on to something.
In January 2017, a team led by Dr. Andre Esteva at Stanford published a study in the journal Nature that stunned the dermatology world. They had trained a deep learning system on 129,450 clinical images of skin lesions (moles, rashes, suspicious spots) and then tested it against 21 board-certified dermatologists. On detecting malignant melanoma, one of the most deadly forms of skin cancer, the AI matched or outperformed the human specialists.
The headline was everywhere: "AI beats doctors at detecting cancer." But the headline missed something important. The AI hadn't looked at those images the way a doctor does. A doctor sees texture, color gradation, symmetry, the context of the surrounding skin, the patient's age and history. The AI saw pixel values: 299 × 299 grids of numbers representing red, green, and blue intensities. It had learned that certain statistical patterns in those numbers were associated with malignancy. It didn't know what a mole was. It didn't know what cancer was. It knew that certain number arrangements predicted a certain label.
That distinction, between what a machine appears to understand and what it is actually processing, is one of the most important ideas in this entire course. It explains both why AI can be astonishingly good at narrow tasks and why it can fail in ways that would never fool a human. And it all comes down to something called features.
A feature is any piece of input information that a model uses to make a prediction. In the fraud detection system from Lesson 1, the features included: transaction amount, merchant category, geographic location, time of day, and how far the transaction was from the cardholder's usual locations. The model didn't receive a description of the transaction in words; it received a row of numbers.
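To see what "a row of numbers" means in practice, here is a minimal sketch of turning one transaction into a feature row. The field names, the category list, and the encoding choices are all assumptions made for illustration; they are not Falcon's actual inputs.

```python
from dataclasses import dataclass

# Hypothetical category list; a real system would have many more.
MERCHANT_CATEGORIES = ["grocery", "travel", "electronics", "other"]


@dataclass
class Transaction:
    amount_usd: float
    merchant_category: str
    hour_of_day: int            # 0 to 23
    km_from_usual_area: float


def to_feature_row(tx: Transaction) -> list[float]:
    """Encode a transaction as plain numbers a model can consume."""
    # Categories become a "one-hot" block: 1.0 in the matching slot, 0.0 elsewhere.
    one_hot = [1.0 if tx.merchant_category == c else 0.0 for c in MERCHANT_CATEGORIES]
    return [tx.amount_usd, float(tx.hour_of_day), tx.km_from_usual_area] + one_hot


row = to_feature_row(Transaction(842.50, "electronics", 3, 9500.0))
print(row)  # [842.5, 3.0, 9500.0, 0.0, 0.0, 1.0, 0.0]
```

Everything the model will ever know about this purchase is in those seven numbers. Anything left out of the row does not exist as far as the model is concerned.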
Feature selection (deciding which numbers to feed the machine) used to be the most important and time-consuming part of building an AI system. Before deep learning, engineers spent enormous effort manually designing features. For email spam detection, they might create features like: number of times the word "free" appears, presence of all-caps text, ratio of links to words, sender's domain reputation. Each of these was a human judgment call about what information was relevant.
Early email spam filters built this way, like Paul Graham's 2002 Bayesian filter described in his essay "A Plan for Spam", were remarkably effective for their time, because the humans designing the features understood the problem well. But they were also brittle: spammers could learn which features triggered the filter and engineer around them. Change "free" to "fr-ee" and the filter went blind.
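The brittleness of hand-designed features is easy to reproduce. The sketch below assumes three made-up features and an arbitrary decision rule; it is a toy illustration of the idea, not the method from "A Plan for Spam" or any real filter.

```python
import re


def spam_features(text: str) -> dict:
    """Hand-designed features: each one is a human guess about what matters."""
    words = re.findall(r"[A-Za-z']+", text)
    links = re.findall(r"https?://\S+", text)
    return {
        "free_count": sum(w.lower() == "free" for w in words),
        "has_all_caps_word": any(len(w) >= 4 and w.isupper() for w in words),
        "links_per_word": len(links) / max(len(words), 1),
    }


def looks_like_spam(text: str) -> bool:
    """Arbitrary rule over the features above (thresholds are assumptions)."""
    f = spam_features(text)
    return f["free_count"] >= 2 or (f["has_all_caps_word"] and f["links_per_word"] > 0.05)


original = "Get your free prize now, totally free!"
evasive = "Get your fr-ee prize now, totally fr-ee!"

print(looks_like_spam(original))  # True
print(looks_like_spam(evasive))   # False
```

A human reader sees the same message twice. The filter sees a feature drop from 2 to 0, because the feature was defined by an exact word match and nothing else.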
The revolution that deep learning brought, starting around 2012 when a system called AlexNet shattered the ImageNet computer vision competition, was automatic feature learning. Instead of having humans decide what to look for, deep neural networks learn their own internal representations of the data. The model builds its own features, layer by layer, from the raw input.
In a deep network processing images, the early layers learn to detect edges and color gradients. The middle layers combine those into textures and shapes. The later layers combine shapes into recognizable structures: an ear, a wheel, a lesion. None of this was programmed. It emerged from the training process.
This is extraordinary, but it creates a serious problem: nobody knows exactly what features the model is using. Researchers who worked with the Stanford skin cancer system later reported that it had partly learned to associate the presence of rulers (placed next to a lesion for scale) with malignant diagnoses, because dermatologists are more likely to measure, and so to photograph with a ruler, the lesions that already worry them. The model had learned a spurious correlation, not a biological one. It had no way to tell the difference.
In other words, when an image contained a ruler, the model became more likely to call the lesion malignant, even though the ruler has nothing to do with cancer biology. The model had learned to read a medical artifact as a diagnostic signal. It found a pattern that was real in the training data but meaningless in reality.
There's a specific kind of failure that happens when a model is deployed into a world that's different from the world its training data came from. Engineers call this distribution shift: the statistical distribution of real-world inputs no longer matches the distribution of training data.
A vivid example: in 2020, during the early months of the COVID-19 pandemic, several hospital systems deployed AI models trained on pre-pandemic chest X-rays to help detect COVID pneumonia. Some of these models, researchers later found, had learned to use the position of the patient in the X-ray image as a feature, because in the training data, sicker patients were more likely to be lying down (supine X-rays) versus sitting up. When COVID arrived and changed who was getting which kind of X-ray and why, those features became misleading. The world had shifted; the model hadn't.
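Distribution shift can be shown with a few lines of arithmetic. The sketch below assumes a hand-written stand-in for a model that has latched onto patient position, and uses invented patient counts for two "worlds". It does not reproduce any real X-ray model or dataset.

```python
def make_records(sick_lying, sick_sitting, well_lying, well_sitting):
    """Each record is (lying_down, truly_sick), with 1 meaning yes. Counts are made up."""
    return ([(1, 1)] * sick_lying + [(0, 1)] * sick_sitting
            + [(1, 0)] * well_lying + [(0, 0)] * well_sitting)


# In the training world, position and illness are strongly linked.
training_world = make_records(sick_lying=90, sick_sitting=10, well_lying=10, well_sitting=90)
# In the shifted world, position no longer says anything about illness.
shifted_world = make_records(sick_lying=50, sick_sitting=50, well_lying=50, well_sitting=50)


def shortcut_model(lying_down):
    """Stand-in for a model that learned the shortcut: lying down means sick."""
    return lying_down


def accuracy(model, records):
    return sum(model(x) == y for x, y in records) / len(records)


print(accuracy(shortcut_model, training_world))  # 0.9
print(accuracy(shortcut_model, shifted_world))   # 0.5
```

The model did not change between the two lines of output. Only the world did, and a feature that looked like a 90% accurate signal turned into a coin flip.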
You now understand something that AI developers themselves wrestle with constantly: a model's features, whether chosen by humans or learned automatically, are a bet that the patterns in training data will hold in the real world. Sometimes that bet pays off. Sometimes the world changes, or the training data was never a fair sample to begin with, and the model fails in ways that look like incompetence but are actually a consequence of how it was built.
When you hear that an AI "examines" medical scans or "reads" applications, you now know it isn't reading or examining in any human sense. It's processing numerical features and finding statistical patterns. The question to ask isn't "is the AI accurate on average?" It's "what features did it learn, and do those features mean what we think they mean in the real world?" That question almost never appears in press coverage. You can ask it now.
If an AI medical diagnostic tool is trained primarily on images of lighter-skinned patients (which was historically true of many dermatology datasets) and then deployed on patients with darker skin tones, whose features the model has less experience with, who is responsible for the resulting performance gap? The researchers who published the original model? The hospital that deployed it without checking? The medical journals that celebrated it without asking about demographic breakdown? The regulatory agencies that approved it? This is an active debate in medical AI right now, with real patient outcomes at stake.
A tech company has built an AI to screen job applications. The system uses the following features to score each applicant: university attended, GPA, years of experience, previous employer names, gap years in employment history, and which extracurricular activities were listed. It was trained on five years of applications from people who were hired and then rated highly by their managers one year later.
Your lab partner has worked on algorithmic hiring audits before. They want to know which features you think are problematic and why, and they'll push back on any reasoning that's too vague.
In 2016, investigative journalists at ProPublica, followed by academic researchers who reanalyzed the same data, published findings about an AI system called COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), which was being used in courts across the United States to predict whether a defendant was likely to commit another crime after release. Judges used these scores to inform bail, sentencing, and parole decisions. The system boiled each defendant down to a single score, reported as low, medium, or high risk of reoffending.
ProPublica's analysis found something striking. The system was accurate at roughly the same overall rate for Black defendants and white defendants. But when it made mistakes, the mistakes were not symmetrical. Black defendants who did not go on to commit another crime were labeled high risk at nearly twice the rate of white defendants in the same situation. White defendants who did go on to commit another crime were labeled low risk at nearly twice the rate of Black defendants in the same situation. Same overall accuracy. Profoundly different distribution of errors.
This is the thing that a single accuracy number hides. When an AI system is wrong, it is wrong in specific directions, about specific people. Understanding which errors a system makes, and who bears the cost of those errors, is not a technical afterthought; it is often the most important thing to know about the system. And almost nobody in the public conversation about AI talks about it correctly.
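The pattern described above, equal accuracy with unequal errors, can be checked with simple counts. The numbers below are invented for illustration and are not ProPublica's data; the group names and function name are hypothetical.

```python
def error_profile(tp, fn, fp, tn):
    """Summarize one group's outcomes.

    tp: flagged and did reoffend      fn: not flagged but did reoffend
    fp: flagged but did not reoffend  tn: not flagged and did not reoffend
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn),  # share of non-reoffenders flagged
        "false_negative_rate": fn / (fn + tp),  # share of reoffenders missed
    }


# Made-up counts for two groups of 100 people each.
groups = {
    "group_a": dict(tp=40, fn=10, fp=30, tn=20),
    "group_b": dict(tp=20, fn=25, fp=15, tn=40),
}

profiles = {name: error_profile(**counts) for name, counts in groups.items()}
for name, profile in profiles.items():
    print(name, {k: round(v, 2) for k, v in profile.items()})
```

Both groups come out at 60% accuracy, yet group A's false positive rate is more than double group B's, and group B's false negative rate is more than double group A's. A report that quoted only the accuracy figure would show no difference at all.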
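Every AI system that makes a yes/no prediction (fraud or not, spam or not, high risk or low risk) makes two kinds of mistakes, and they are not equivalent. A false positive is when the system says yes and the truth is no: a legitimate purchase flagged as fraud. A false negative is when the system says no and the truth is yes: a fraudulent purchase that slips through.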
Notice that these two errors have completely different consequences depending on the context. In cancer screening, a false negative (missing an actual cancer) could cost someone their life. A false positive (flagging a healthy person for a biopsy) causes anxiety and an unnecessary procedure, but not death. An engineer designing a cancer screening AI should strongly prefer false positives over false negatives. They should be willing to flag more healthy people if that means catching more actual cancers.
In a bail determination AI, the logic flips. A false positive (labeling a safe person as dangerous) means that person may be jailed or given harsher conditions before trial. That is a serious harm inflicted on an innocent person. A false negative (labeling a dangerous person as safe) carries different risks. Both are real costs. But they land on different people, and they must be weighed deliberately. An algorithm cannot make this moral decision automatically. Someone has to.
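One way to see why the weighing has to be deliberate is to put numbers on it. The sketch below compares two hypothetical models under two hypothetical cost weightings. The error counts and the weights are assumptions chosen for illustration; picking real weights is exactly the human judgment the text describes.

```python
def total_cost(errors, cost_per_false_positive, cost_per_false_negative):
    """Add up the harm of a model's mistakes under a chosen weighting."""
    return (errors["false_positives"] * cost_per_false_positive
            + errors["false_negatives"] * cost_per_false_negative)


# Two made-up models evaluated on the same cases.
cautious_model = {"false_positives": 80, "false_negatives": 5}    # flags a lot
selective_model = {"false_positives": 10, "false_negatives": 30}  # flags rarely

# Hypothetical weightings: (cost of one false positive, cost of one false negative).
weightings = {
    "a miss is 50x worse than a false alarm": (1, 50),
    "a false alarm is 5x worse than a miss": (5, 1),
}

for description, (cost_fp, cost_fn) in weightings.items():
    cautious = total_cost(cautious_model, cost_fp, cost_fn)
    selective = total_cost(selective_model, cost_fp, cost_fn)
    better = "cautious" if cautious < selective else "selective"
    print(f"{description}: cautious={cautious}, selective={selective}, prefer {better}")
```

The same two models swap places depending on the weights. Nothing in the math says which weighting is right; that choice comes from outside the model.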
Imagine you build an AI to detect a rare disease that affects 1% of the population. You train the model and test it. Accuracy: 99%. Impressive, right?
Except your model has learned to predict "no disease" for every single patient, every single time, without looking at any medical data. Since 99% of people don't have the disease, saying "no disease" is right 99% of the time. Your model is useless (it will never catch a single case), but its accuracy number looks fantastic.
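The arithmetic behind this trap is short enough to run. The sketch assumes a made-up population of 1,000 patients with exactly 1% prevalence, matching the scenario above.

```python
# 1 = has the disease, 0 = does not. Exactly 1% of 1,000 patients are sick.
population = [1] * 10 + [0] * 990


def always_no(patient_record):
    """The 'model': ignores its input and always predicts no disease."""
    return 0


predictions = [always_no(p) for p in population]

correct = sum(pred == truth for pred, truth in zip(predictions, population))
accuracy = correct / len(population)

sick_total = sum(population)
sick_caught = sum(1 for pred, truth in zip(predictions, population) if truth == 1 and pred == 1)
recall = sick_caught / sick_total  # share of real cases the model catches

print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.99
print(f"recall:   {recall:.2f}")    # recall:   0.00
```

The second number, recall, is the one that reveals the model is useless: out of ten sick patients, it finds none.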
Overall accuracy collapses all errors into a single number, hiding which kinds of errors the model is making. For any problem where the classes are imbalanced (rare events), or where different errors have different costs, accuracy is close to worthless as a performance metric. Engineers use other measures, but these rarely appear in news coverage: precision (of the cases the model flagged, how many were real), recall (of the real cases, how many the model flagged), the F1 score that combines the two, and the area under the ROC curve.
There is usually a tradeoff between precision and recall. Make a system more aggressive at catching positives (higher recall) and it will also flag more things incorrectly (lower precision). Make it more selective (higher precision) and it will miss more real cases (lower recall). Every AI system in deployment has implicitly made a choice about where on that tradeoff curve to sit, and that choice reflects a value judgment about whose errors matter more.
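The tradeoff shows up as soon as you move the cutoff that turns a model's score into a yes or no. The sketch below uses ten invented (score, true label) pairs and three arbitrary thresholds; none of it comes from a real system.

```python
# Made-up model outputs: (score between 0 and 1, true label where 1 = real case).
scored = [
    (0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
    (0.55, 1), (0.40, 0), (0.30, 1), (0.20, 0), (0.10, 0),
]


def precision_recall(scored_cases, threshold):
    """Flag every case scoring at or above the threshold, then grade the flags."""
    tp = sum(1 for s, y in scored_cases if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored_cases if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored_cases if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.85, 0.50, 0.25):
    precision, recall = precision_recall(scored, threshold)
    print(f"threshold {threshold:.2f}: precision {precision:.2f}, recall {recall:.2f}")
```

At the strict threshold every flag is correct but most real cases are missed; at the loose threshold every real case is caught but more than a third of the flags are wrong. The model is identical in all three rows. Only the cutoff, a human choice, has changed.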
After ProPublica published its COMPAS analysis, the company that made COMPAS, Northpointe, pushed back with a rebuttal. They argued that their system was fair, by a different mathematical definition of fairness: the probability that someone labeled high-risk actually reoffended was the same for Black and white defendants. By that measure, the scores meant the same thing regardless of race.
Both analyses were mathematically correct. They were measuring different things. And in 2016, researchers at Cornell published a proof that in almost all real-world cases with unequal base rates (when different groups have different underlying rates of the outcome being predicted) you mathematically cannot satisfy both definitions of fairness simultaneously. You have to choose which kind of fairness to prioritize. That is not a mathematical question. It is a moral and political one.
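The conflict can be seen with a single identity that follows from the definitions of the error rates: a group's false positive rate equals base_rate / (1 - base_rate) × (1 - PPV) / PPV × (1 - FNR), where PPV is the share of people labeled high-risk who actually reoffend and FNR is the false negative rate. The sketch below plugs in invented numbers; the base rates, PPV, and FNR are assumptions, not COMPAS figures.

```python
def implied_false_positive_rate(base_rate, ppv, fnr):
    """False positive rate forced by a group's base rate, PPV, and FNR.

    Derivation from the definitions:
      true positives  = (1 - fnr) * (number who reoffend)
      false positives = true positives * (1 - ppv) / ppv
      FPR             = false positives / (number who do not reoffend)
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Give both groups the SAME score meaning (PPV) and the SAME miss rate (FNR)...
ppv, fnr = 0.60, 0.35

# ...but different underlying base rates (hypothetical values).
fpr_higher_base = implied_false_positive_rate(base_rate=0.50, ppv=ppv, fnr=fnr)
fpr_lower_base = implied_false_positive_rate(base_rate=0.30, ppv=ppv, fnr=fnr)

print(f"group with 50% base rate: false positive rate {fpr_higher_base:.3f}")
print(f"group with 30% base rate: false positive rate {fpr_lower_base:.3f}")
```

With the score meaning held equal across groups, the false positive rates come out unequal, roughly 0.43 versus 0.19 in this made-up case. Forcing them to match would require giving up equal PPV or equal FNR instead, which is the choice the proof says cannot be avoided when base rates differ.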
Three different people affected by the COMPAS system have three different answers to "what's fair." A defendant labeled high-risk who would not have reoffended says: fairness means equal false positive rates across races. A crime victim who wants all high-risk individuals detained says: fairness means the score predicts correctly for everyone equally. A civil liberties attorney says: no score should affect a person's liberty at all. All three positions are coherent. You cannot satisfy all three mathematically at once. Who should decide which definition wins? Courts? Legislators? Engineers? The public?
You now understand something that sits at the center of almost every public controversy about AI: accuracy is not the same as fairness, and fairness itself has multiple definitions that cannot always coexist. Knowing this, you can approach any claim that "our AI is fair" with a specific follow-up: fair by which definition, measured on which population, at what cost to whom? Most people, including many journalists covering these stories, don't ask those questions. You can.
Every deployed AI system has made implicit choices about which errors are acceptable. Those choices are embedded in how the model was trained, which metric was optimized, and which threshold was set for "positive" vs. "negative." These are value judgments disguised as technical parameters. Knowing this changes how you evaluate every claim about an AI system's performance, whether in criminal justice, medicine, lending, hiring, or anywhere else.
A government benefits agency has built an AI to triage disability applications, flagging which ones need urgent human review and which can wait. The system reports 94% accuracy. The agency's press release says the system will "reduce wait times and improve outcomes for applicants." A disability rights organization has filed an objection, but hasn't yet specified what they're objecting to.
Your lab partner is a policy analyst who has reviewed algorithmic decision systems for government agencies before. They want you to think through what questions you would ask before this system goes live, specifically about error types, who bears the cost, and what the 94% number actually tells you.
Between 2016 and 2019, YouTube's recommendation algorithm (the system that decides which video to show you next) became one of the most consequential AI deployments in history, affecting over two billion users. The algorithm had been optimized for a single metric: watch time. The model was trained to predict which video, if shown next, would keep you watching longest. By that metric, it was extraordinarily successful. YouTube's watch time figures climbed dramatically.
What the engineers had not anticipated, or had not fully weighed, was what the algorithm discovered in the data: that videos conveying strong emotions, particularly outrage and anxiety, consistently produced longer watch sessions than calm or informational content. The model wasn't told to promote outrage. It found that outrage worked. Journalists, researchers, and eventually YouTube's own internal teams documented a pattern: the recommendation system reliably pushed users toward more extreme content over successive recommendations, because that content drove higher engagement. A person watching a mainstream political video might be shown one increasingly radical video after another. The model had no concept of "extreme." It had only the signal: this kept people watching.
In 2019, YouTube announced significant changes to its recommendation algorithm, saying it would reduce recommendations of "borderline content": content that doesn't violate policies but that the company determined was harmful to surface. The fact that this change took years, and came only after extensive public and internal pressure, tells you something important: deploying a powerful machine learning system creates feedback loops and consequences that its builders did not fully predict, and fixing them is harder than it looks.
The YouTube story illustrates a problem that doesn't exist in any textbook training scenario: once a model is deployed, its predictions affect the world, and the world changes in response. Those changes feed back into the data the model sees. The loop closes on itself.
YouTube recommended outrage. People watched. More outrage-driving videos were produced, because creators saw what performed well. The training signal (what people watched) was now shaped by what the algorithm had been recommending. The model was no longer learning from an independent world; it was learning from a world it had already changed.
This dynamic appears in many high-stakes domains. A predictive policing system sends more police to certain neighborhoods, which results in more arrests there, which results in more data confirming that neighborhood as high-crime, which the next model version uses to send even more police. The model's prediction becomes self-fulfilling, and the feedback loop makes it harder to detect, because the data increasingly looks like the model was right.
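A tiny simulation shows how such a loop can run away. It assumes two districts with identical true incident rates, a single patrol per day sent wherever past records are highest, and incidents that only get recorded where a patrol is present. These rules and numbers are deliberate simplifications invented for illustration, not a model of any real policing system.

```python
# By construction, both districts have the SAME underlying incident rate.
TRUE_INCIDENTS_PER_PATROL_DAY = 0.5


def simulate(days, starting_records):
    """Send one patrol per day to whichever district has more recorded incidents."""
    recorded = dict(starting_records)
    for _ in range(days):
        target = max(recorded, key=recorded.get)  # "go where the data says crime is"
        # Only the patrolled district generates new records (expected value per day).
        recorded[target] += TRUE_INCIDENTS_PER_PATROL_DAY
    return recorded


start = {"east": 11.0, "west": 10.0}  # a small, possibly accidental, initial gap
after_one_year = simulate(365, start)
print(after_one_year)  # {'east': 193.5, 'west': 10.0}
```

A one-incident head start turns into a nearly twentyfold gap in the records, even though the two districts were defined to be identical. Anyone looking only at the final data would conclude the allocation had been right all along.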
When a prediction influences the outcome it predicts, you can no longer tell whether the model was accurate or whether it created the result it predicted. This is one of the most subtle and dangerous failure modes in deployed AI systems, and it's almost never mentioned in the accuracy reports that accompany system launches.
YouTube's team didn't want to promote radicalization or anxiety. They wanted high watch time. The problem is that "watch time" is a measurable proxy for something harder to measure: a good user experience. The model optimized the proxy perfectly. The actual goal, something like "people feel their time was well spent", diverged from the proxy in ways nobody anticipated.
This problem has a name: Goodhart's Law. It began as an observation from economics by the British economist Charles Goodhart in 1975, and it is usually summarized as: "When a measure becomes a target, it ceases to be a good measure." Machine learning creates conditions for Goodhart's Law to operate at enormous scale. A model will optimize whatever metric you train it to optimize, and if that metric is a proxy for what you actually care about, you will get more of the proxy and potentially less of the underlying goal.
Other examples: a content moderation AI optimized to minimize reports might learn to suppress reporting mechanisms. A student performance prediction AI optimized to predict grades might learn to use zip code as a proxy, since it correlates with resources. A chatbot optimized for positive user ratings might learn to tell people what they want to hear rather than what's true. In each case, the model does exactly what it was trained to do. The problem was the training target.
By now you've seen the complete arc of a machine learning system: raw data is collected and labeled, features are extracted, a model is trained to minimize loss on those features, it's evaluated on a test set, deployed into the real world, and then, critically, monitored for drift, feedback loops, and divergence between what was optimized and what was actually wanted.
Every step in that loop involves a decision made by a human being. What data to collect. Who labels it and how. Which features to include. What metric to optimize. What threshold defines "positive." What the test set looks like. When to intervene on a deployed system. Each of these is a choice with consequences (technical, ethical, and political), and they are being made right now, mostly in private, by teams at companies and government agencies that often don't publicize those choices.
YouTube's algorithm was not designed to radicalize people. It was designed to maximize watch time, which engineers could measure and reward. The radicalization was a side effect discovered years later. Should the engineers who built the watch-time optimizer bear moral responsibility for the side effects? Or is that responsibility on the executives who chose watch time as the metric? Or the researchers who knew about the dynamics but didn't speak loudly enough? Or regulators who had the authority to intervene and didn't? These aren't rhetorical questions; they're live debates in policy circles, ethics boards, and courtrooms right now.
At an institutional level (the level at which governments, hospitals, courts, and banks deploy these systems), the decisions about training metrics and deployment criteria are increasingly being shaped by emerging legal frameworks. The EU's AI Act, passed in 2024, requires high-risk AI systems to maintain documentation of training data, undergo conformity assessments, and enable human oversight. These requirements are an attempt to make visible the decisions that currently happen invisibly inside the loop.
You now see the full loop: data → features → training → testing → deployment → real-world consequences → feedback. You know that every step involves human choices. You know that accuracy hides the distribution of errors. You know that optimizing a proxy metric can undermine the actual goal. You know that deployment creates feedback loops that training scenarios never anticipate. Most adults engaging with AI policy, most journalists covering it, and most people using these systems do not have this picture. You do. That changes what you're able to ask, and what you're able to demand.
A news organization has hired you to design the optimization target for their new AI recommendation system. The app will be used primarily by people aged 13 to 22. You need to decide what metric the system will optimize for. Options on the table include: time spent in app, articles read per session, user-reported satisfaction ratings, number of topics the user engaged with (breadth), and return visits per week.
Your lab partner has built recommendation systems before and watched what happens when they go wrong. They're going to push you to think through the second-order effects of whatever metric you propose: the feedback loops, the Goodhart's Law traps, the things you won't see until it's too late.