In the summer of 2015, Google Photos shipped a new feature it was proud of: automatic photo tagging. You upload pictures, the system scans them, and it assigns labels: "beach," "birthday party," "dog." The engineers tested it on thousands of images. The accuracy numbers were high. They shipped it.
Then, on June 28, 2015, a software developer named Jacky Alciné opened his Google Photos app and found that the system had automatically sorted photos of him and a friend, both Black, into a folder it had labeled "Gorillas."
It was not a fringe case. The classifier had looked at the images, extracted features (skin tone, hair texture, facial geometry), and matched those features to a category. It had done exactly what it was designed to do. The problem was not a bug in the traditional sense. The problem was in how the whole system had been built from the start.
A classifier is any system, machine or human, that looks at something and assigns it to a category. That's the whole job. Is this email spam or not spam? Is this tumor malignant or benign? Is this photo a cat, a dog, or something else? Classifier. Classifier. Classifier.
What makes machine classifiers interesting, and dangerous, is that they don't reason their way to a category the way you might. They don't think, "Hmm, this looks like a dog because it has four legs, fur, and is panting." Instead, they identify features (numerical measurements extracted from the input) and use those features to decide which category the input belongs to.
A feature is just a number that captures something measurable about the input. For an image, features might include the average brightness of a region, the frequency of certain colors, or the angle of edges detected in the image. For an email, features might include how many times the word "free" appears, whether the sender address has numbers in it, or the length of the subject line. The machine never "sees" the image or "reads" the email the way you do. It sees a list of numbers and runs a calculation.
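To make the idea of "a list of numbers" concrete, here is a minimal sketch of turning one email into the three features mentioned above. The function name, the sample email, and the exact way each feature is computed are illustrative assumptions, not how any real spam filter works.

```python
import re

def extract_email_features(subject: str, body: str, sender: str) -> list:
    """Turn one email into a fixed-length list of numbers (hypothetical features)."""
    words = re.findall(r"[a-z']+", body.lower())
    return [
        float(words.count("free")),                # how often "free" appears in the body
        1.0 if re.search(r"\d", sender) else 0.0,  # does the sender address contain a digit?
        float(len(subject)),                       # subject-line length in characters
    ]

features = extract_email_features(
    subject="You won!",
    body="Claim your FREE prize now. It is free, free, free!",
    sender="promo123@example.com",
)
print(features)  # [4.0, 1.0, 8.0]
```

Everything the classifier does afterward operates on that list. Anything about the email that was not turned into a number is invisible to it.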
Imagine you're trying to sort fruit. You measure two things about each piece of fruit: its weight in grams and its redness on a scale of 0 to 100. Apples tend to be moderately heavy and very red. Lemons tend to be light and not red at all. If you plotted every fruit on a chart, with weight on one axis and redness on the other, you'd see clusters. Apples over here, lemons over there.
A classifier learns where to draw a line between those clusters. That line is called the decision boundary. Any fruit whose measurements land on one side of the line gets called an apple. Any fruit on the other side gets called a lemon. The machine doesn't know what fruit is. It just knows which side of the line the numbers fall on.
Now here's what's important: that line was drawn based on the training data, meaning the specific fruits the system was shown before it was tested. If all the training apples were Granny Smith (pale green, not particularly red), the line might be drawn in completely the wrong place for red apples. The classifier learned a line that works for its training data. Whether that line works for reality depends entirely on how well the training data represents reality.
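The fruit example can be sketched in a few lines. This uses a nearest-centroid rule, one simple way to get a straight-line boundary between two clusters; it is a stand-in chosen for brevity, not the method of any particular system. All fruit measurements below are made up.

```python
from math import dist

def centroid(points):
    """Average position of a list of (weight_g, redness) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def train(apples, lemons):
    """'Training' here is just remembering the center of each cluster."""
    return {"apple": centroid(apples), "lemon": centroid(lemons)}

def classify(model, fruit):
    """Pick the label whose cluster center is closest to the fruit."""
    return min(model, key=lambda label: dist(model[label], fruit))

# Hypothetical measurements: (weight in grams, redness 0-100).
lemons = [(95, 4), (100, 5), (105, 6)]
red_apples = [(170, 80), (180, 85), (190, 90)]
green_apples = [(170, 10), (180, 12), (190, 14)]

small_red_apple = (120, 90)

model_red = train(red_apples, lemons)      # trained on red apples
model_green = train(green_apples, lemons)  # trained only on green apples

print(classify(model_red, small_red_apple))    # apple
print(classify(model_green, small_red_apple))  # lemon
```

The fruit being classified is identical in both calls. Only the training apples changed, and that alone moved the boundary far enough to flip the answer.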
Back to Jacky Alciné. When Google's classifier looked at his photos, it was doing exactly what it was trained to do: match visual features to the category in its training data that those features most resembled. The widely accepted explanation is that the system's training data severely underrepresented images of dark-skinned people. The features that distinguish human faces from non-human faces were, for many skin tones, poorly learned, because the system had barely seen them.
Google's response was, in its own words, imperfect: they removed the "gorilla" label entirely from the system. By 2023, researchers found that Google Photos still wouldn't label images of gorillas, chimps, or several other primates. The company had simply deleted those categories rather than fix the underlying representation problem. Eight years later.
This is what makes classifiers consequential beyond test scores. A system can have 95% accuracy overall and still systematically misclassify specific groups of people. The accuracy number hides whose accuracy it is.
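A small calculation shows how an overall score can hide a gap between groups. The evaluation records below are invented for illustration, with group names "A" and "B" as placeholders.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, true_label, predicted_label)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in records:
        total[group] += 1
        correct[group] += int(truth == predicted)
    return {g: correct[g] / total[g] for g in total}

# Made-up evaluation results: 900 examples from group A, 100 from group B.
records = (
    [("A", 1, 1)] * 880 + [("A", 1, 0)] * 20 +
    [("B", 1, 1)] * 70 + [("B", 1, 0)] * 30
)

overall = sum(t == p for _, t, p in records) / len(records)
per_group = accuracy_by_group(records)

print(overall)         # 0.95
print(per_group["B"])  # 0.7
```

Group A is classified correctly about 98% of the time, group B only 70% of the time, and the headline figure is still 95% because group B is a small share of the test set.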
If a medical imaging classifier is 97% accurate on patients of European descent and 82% accurate on patients of African descent, should it be approved for use? Who gets to make that call: the company, a government regulator, the hospital, or the patients themselves? And if you say "fix it first," who pays for the additional data collection, and does that delay cost lives in the meantime?
When you hear that an AI system "classifies" something (a loan application, a medical scan, a social media post for moderation), you now know what that means at a mechanical level. The system is measuring features, comparing those measurements to a learned decision boundary, and outputting a category. It is not reasoning. It is not understanding. It is sorting numbers.
That means three things are always worth asking: What features was it measuring? Whose data drew the decision boundary? And whose experience gets averaged out by the accuracy number? Most people reading headlines about AI never think to ask those questions. You do now.
Every time you encounter a story about an AI system making a mistake (a wrong arrest, a misdiagnosis, a biased hiring decision), the underlying mechanism is very often a classifier that generalized badly from its training data. Knowing this, you can ask the right questions instead of just being surprised.
A company has built an automated content moderation classifier. It scans social media posts and labels them either "safe" or "harmful." The company says it's 94% accurate. You've been asked to evaluate whether it should be deployed. Your lab partner will push back on your reasoning; that's the point.
In October 2018, Reuters reported that Amazon had quietly scrapped a recruiting AI tool it had begun developing in 2014. The system was designed to do something Amazon desperately wanted: automatically review resumes and score candidates on a scale of one to five stars, filtering out the weak ones before a human recruiter ever looked at them.
The system had been trained on ten years of Amazon's own hiring data: resumes submitted between 2004 and 2014 and the hiring decisions made on them. It learned from what Amazon had actually done. The problem: Amazon's tech workforce during that decade was overwhelmingly male. The classifier learned, very efficiently, that certain patterns correlated with being hired. Those patterns included not going to all-women's colleges. They included not having the word "women's" anywhere on the resume, as in "Captain of Women's Chess Club" or "Women in Engineering scholarship recipient."
The system wasn't programmed to discriminate. Nobody wrote a rule that said "penalize women." It discovered the correlation on its own, from data that reflected a decade of human hiring decisions that had, themselves, systematically disadvantaged women. The machine learned the bias because the bias was in the data.
When you train a classifier, you show it a collection of labeled examples: "this is spam," "this is not spam," "this resume got hired," "this one didn't." That collection is the training data. The classifier's entire understanding of the world comes from these examples. It has no other source of information. It cannot look outside the dataset. It cannot apply common sense. It learns exactly what the data teaches it: nothing more, nothing less.
This means training data is not neutral. It is a record of decisions, measurements, and observations made by people: people who had biases, made errors, worked in institutions with particular histories. When you hand that data to a machine and ask it to learn from it, you are asking it to replicate the patterns in those decisions, including the bad ones.
Amazon's data reflected ten years of human bias in tech hiring. The machine faithfully compressed that history into a scoring function. When Amazon's engineers found out, they tried to fix it by telling the system to ignore certain signals. But because gender correlates with so many other things (school names, club names, volunteer descriptions, writing style patterns), they couldn't fully remove the signal. They shut the tool down instead.
A classifier trained on biased data doesn't become unbiased just because a computer is running it. Automation amplifies the patterns in training data. It makes them faster, more consistent, and harder to notice, which can make bias worse, not better.
Training data fails in predictable patterns. Once you know them, you'll see them everywhere.
1. Historical bias. The data reflects past human decisions that were themselves unfair. Amazon's case is a textbook example. When you train on past outcomes in domains where discrimination existed (lending, hiring, criminal justice), you bake that discrimination into the model.
2. Representation bias. Some groups appear rarely or not at all in the training data. Recall Google Photos: when your training dataset has thousands of images of light-skinned faces and hundreds of dark-skinned faces, the decision boundary for "human face" gets drawn much more precisely for one group than the other. The system isn't trying to discriminate; it's just uncertain where it hasn't seen much data.
3. Measurement bias. The labels in the training data were assigned in a way that wasn't consistent across groups. Imagine a classifier trained to predict "creditworthiness" using historical loan data. But historically, banks scrutinized loan applications from Black applicants more aggressively than those from white applicants. That means more defaults from white applicants may have gone undetected and unrecorded. The data looks like one group is riskier, when really the measurement process was uneven.
A natural response to representation bias is: just add more data from underrepresented groups. Sometimes that works. But it doesn't work for historical bias, because the additional data you have access to was generated under the same unfair conditions. If you add more hiring decisions from 2015 to 2020 to fix Amazon's classifier, and those decisions were also made in an industry with gender imbalances, you've added more data with the same historical bias baked in.
And there's a deeper issue: sometimes the "correct" label is itself contested. Classifiers trained to predict recidivism (whether a convicted person will commit another crime) use data about who got re-arrested. But re-arrest rates are affected by how intensely police patrol different neighborhoods. The data reflects policing practices, not just individual behavior. You can't fix that with more data; you'd need different data collected under different conditions.
This is why the question of training data is not just a technical problem. It is a political and ethical one. The decisions about what data to collect, whose outcomes to treat as "ground truth," and what the training labels actually mean are all human choices that happen before the algorithm runs.
In 2019, the U.S. Department of Housing and Urban Development charged Facebook with violating the Fair Housing Act through its ad-targeting system, arguing that it functioned as an illegal housing discrimination tool. The algorithm hadn't been designed to discriminate. But it had learned, from user behavior data, that certain ads correlated with certain demographic groups, and it used that to target or exclude users from seeing housing ads. The training data was the world's existing segregation patterns. The algorithm made them faster and more efficient. Courts are still working out how the Fair Housing Act applies to machine classifiers. You are living in the era where these rules are being written.
If a company can't build a hiring classifier that doesn't reproduce historical discrimination, should they use one at all, even if human recruiters are also biased, just more slowly and less consistently? Is a biased algorithm better or worse than a biased human, and does the answer change depending on who's asking?
A city wants to build a classifier to predict which neighborhoods need additional social services: food assistance, mental health support, youth programs. They propose to train it on five years of 911 call data, school suspension rates, and emergency room visits by zip code. They say it's objective because it's all real data. You've been asked to evaluate whether this training data should be used.
In the fall of 2012, a team of researchers led by Alex Krizhevsky, supervised by Geoffrey Hinton at the University of Toronto, entered a computer vision competition called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The competition had been running since 2010. The best conventional systems were making classification errors on about 26% of images, where an error meant the correct label was missing from the system's five best guesses. Krizhevsky's system, a deep neural network now known as AlexNet, achieved an error rate of 15.3%. The second-place team's error rate was 26.2%. AlexNet hadn't just won. It had cut the error rate by more than 40%.
The result sent shockwaves through the field. Within two years, deep neural networks had replaced almost every other approach to image classification. The approach wasn't new; Hinton had been pushing it for decades. What changed in 2012 was that the hardware had finally caught up: AlexNet ran on two consumer graphics cards that could perform the billions of calculations required for training in a reasonable amount of time. The same training job that would have taken weeks now took days.
But what had AlexNet actually learned? Nobody could fully say. It had adjusted millions of numerical values, called weights, through a process that looked at each image, made a guess, measured how wrong the guess was, and nudged each weight in a direction that would make future guesses less wrong. Repeat that 1.2 million times per training pass, for dozens of passes. That was the learning.
A weight is a number that controls how much influence one feature has on the final classification decision. Think of it as a volume knob on a mixing board. You have dozens of knobs, one for each feature. Turn a knob up, and that feature gets more say in the classification. Turn it down, and it barely matters.
When a classifier is initialized (created fresh), its weights are usually set to small random numbers. The system doesn't know anything yet. Then training begins: you show it a labeled example, it makes a prediction, and you calculate how wrong the prediction was. That wrongness is called the loss. The training algorithm then adjusts every single weight slightly, in the direction that would have produced less loss on that example. Then you show it the next example and do it again. This process, called gradient descent, repeats across the entire training dataset, often hundreds of times over.
By the end, the weights have settled into values that produce reasonably small loss across the training examples. The system has learned, not by understanding anything, but by nudging a million dials until the numbers worked out.
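The training loop described above can be sketched with a tiny two-weight classifier. This is a minimal illustration using logistic regression with one update per example; the six "emails," the feature choice, the learning rate, and every function name are assumptions made for the example, and real systems have vastly more weights.

```python
import math
import random

def predict(weights, bias, features):
    """Weighted sum of the features, squashed to a 0..1 probability."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(weights, bias, examples):
    """Average log loss: large when confident predictions are wrong."""
    total = 0.0
    for features, label in examples:
        p = predict(weights, bias, features)
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(examples)

def train(examples, epochs=200, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    n_features = len(examples[0][0])
    weights = [rng.uniform(-0.01, 0.01) for _ in range(n_features)]  # small random start
    bias = 0.0
    for _ in range(epochs):                      # one epoch = one pass over the data
        for features, label in examples:
            error = predict(weights, bias, features) - label  # how wrong, and in which direction
            # Nudge each weight against its share of the error.
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

# Hypothetical emails: [count of "free", sender has digit], label 1 = spam.
examples = [([3.0, 1.0], 1), ([2.0, 1.0], 1), ([4.0, 0.0], 1),
            ([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 0)]

loss_before = mean_loss([0.0, 0.0], 0.0, examples)  # untrained: every prediction is 0.5
weights, bias = train(examples)
loss_after = mean_loss(weights, bias, examples)
print(loss_before, loss_after)
```

The loss after training is lower than before, and nothing in the loop refers to what spam is. The weights simply end up wherever the labeled examples pushed them.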
Here is the most counterintuitive problem in machine learning: a classifier can get worse by learning too well from its training data. This is called overfitting.
Imagine you're studying for a test by memorizing every practice problem: not the underlying concept, just the exact answers. When the real test shows up with slightly different questions, you're lost. You learned the surface patterns of the practice problems, not the general principle. A classifier does the same thing when it adjusts its weights so precisely to the training examples that it starts capturing the noise (the random, specific quirks of those particular examples) rather than the general patterns that would apply to new data.
An overfit classifier has very low training error and very high real-world error. It looks brilliant on the data it was trained on and fails on anything else. This is why you always need to test a classifier on data it has never seen before, called a test set. If the performance gap between training data and test data is large, the model has overfit. It learned the training set, not the world.
The practical fix is usually to use simpler models (fewer weights), use more training data, or use techniques like regularization that deliberately penalize the classifier for having weights that are too extreme. All of these push the model toward learning general patterns rather than specific quirks.
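The train/test gap can be seen in a toy experiment. The sketch below compares a model that memorizes its training examples (a 1-nearest-neighbor lookup) with a much simpler single-threshold rule, on synthetic data where 20% of the labels are deliberately flipped. It illustrates the "simpler model" fix only, not regularization, and the data generator and all names are assumptions for the example.

```python
import random

def make_data(n, rng, noise=0.2):
    """1-D toy data: the true rule is x > 0.5, but some labels are flipped at random."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label            # label noise: a quirk, not a pattern
        data.append((x, label))
    return data

def memorizer(train_set):
    """1-nearest-neighbor: copies the label of the closest training point."""
    return lambda x: min(train_set, key=lambda ex: abs(ex[0] - x))[1]

def threshold_rule(train_set):
    """A much simpler model: one cut halfway between the two class averages."""
    def class_mean(label):
        xs = [x for x, y in train_set if y == label]
        return sum(xs) / len(xs)
    cut = (class_mean(0) + class_mean(1)) / 2
    return lambda x: int(x > cut)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

rng = random.Random(42)
train_set, test_set = make_data(200, rng), make_data(1000, rng)

results = {}
for name, build in [("memorizer", memorizer), ("threshold", threshold_rule)]:
    model = build(train_set)
    results[name] = (accuracy(model, train_set), accuracy(model, test_set))
    print(name, "train:", results[name][0], "test:", results[name][1])
```

The memorizer scores perfectly on its own training set because it has stored every flipped label along with the real pattern. On fresh data that advantage disappears, while the threshold rule, which cannot memorize anything, performs about the same on both sets.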
Imagine the loss as a landscape: a hilly surface where the height at any point represents how wrong the classifier is with those particular weight values. Gradient descent is a process of searching for a low valley in that landscape by always stepping in the downhill direction. The "gradient" is just the slope. Strictly speaking it points uphill, in the direction of steepest ascent, so each step goes the opposite way. You follow the slope down until you stop getting lower, which may be a nearby valley rather than the lowest one on the whole map. That's the trained model.
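The downhill walk can be shown with a single weight. The bowl-shaped loss function below is invented for the illustration, as are the starting point and step size; a real loss landscape has millions of dimensions and many valleys.

```python
def loss(w):
    """A made-up one-weight loss landscape with its lowest point at w = 3."""
    return (w - 3.0) ** 2 + 1.0

def slope(w):
    """Derivative of loss(w): positive means the ground rises to the right."""
    return 2.0 * (w - 3.0)

w = -4.0                 # arbitrary starting point
learning_rate = 0.1      # step size
for step in range(100):
    w -= learning_rate * slope(w)   # step against the slope, i.e. downhill

print(round(w, 4), round(loss(w), 4))  # 3.0 1.0
```

Each step shrinks the distance to the bottom by a fixed fraction, so the walk slows as the ground flattens. With a step size that is too large the same loop would overshoot the valley instead of settling into it.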
Most classifiers don't just output a category; they also output a probability. "I'm 91% confident this is spam." "I'm 63% confident this mole is benign." That number is not a guarantee. It's the classifier's internal accounting of where the input lands relative to its decision boundary. Inputs far from the boundary get high confidence scores. Inputs near the boundary get low confidence scores.
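One common way to turn distance from the boundary into a confidence score is to pass a linear score through a sigmoid function. The sketch below assumes that setup; the weights, bias, and feature values are hypothetical.

```python
import math

def classify_with_confidence(weights, bias, features):
    """Return (label, confidence) for a linear classifier with a sigmoid output."""
    # The score is zero exactly on the decision boundary and grows with distance from it.
    score = bias + sum(w * x for w, x in zip(weights, features))
    p_spam = 1.0 / (1.0 + math.exp(-score))
    label = "spam" if p_spam >= 0.5 else "not spam"
    return label, max(p_spam, 1.0 - p_spam)

weights, bias = [1.2, 0.8], -2.0   # hypothetical learned values

far = classify_with_confidence(weights, bias, [4.0, 1.0])   # far from the boundary
near = classify_with_confidence(weights, bias, [1.5, 0.5])  # close to the boundary
print(far)
print(near)
```

Both inputs receive the label "spam," but the first with roughly 97% confidence and the second with roughly 55%. A system that displays only the label makes the two look identical.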
This matters enormously in deployment. A hospital that uses a cancer-detection classifier without checking confidence scores will treat a 91%-confident result the same as a 54%-confident result. The classifier is nearly guessing on the second one, but the output looks the same: "BENIGN." Many deployed systems show only the final label, hiding the confidence score. This is a design choice, and a consequential one.
It also matters that confidence doesn't equal calibration. A classifier that outputs "90% confident" should be right about 90% of the time on those cases. But many classifiers are overconfident: they say 90% when they're actually right only 70% of the time. Calibration (matching confidence scores to real accuracy) is a separate property from accuracy, and it requires separate evaluation.
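Checking calibration amounts to grouping predictions by stated confidence and comparing each group's average confidence with how often it was actually right. The sketch below is one simple way to do that for a two-class problem; the binning scheme and the sample predictions are assumptions for illustration.

```python
def calibration_table(predictions, n_bins=5):
    """predictions: list of (confidence_in_predicted_label, was_correct).

    Returns one row per non-empty bin: (mean confidence, hit rate, count).
    """
    bins = [[] for _ in range(n_bins)]
    for conf, correct in predictions:
        # Confidence in the chosen label lies in [0.5, 1.0] for two classes.
        index = int((conf - 0.5) / 0.5 * n_bins)
        index = max(0, min(index, n_bins - 1))   # clamp to a valid bin
        bins[index].append((conf, correct))
    rows = []
    for bucket in bins:
        if bucket:
            mean_conf = sum(c for c, _ in bucket) / len(bucket)
            hit_rate = sum(ok for _, ok in bucket) / len(bucket)
            rows.append((round(mean_conf, 2), round(hit_rate, 2), len(bucket)))
    return rows

# Made-up results: ten predictions at 90% confidence, ten at 60%.
predictions = ([(0.9, True)] * 7 + [(0.9, False)] * 3 +
               [(0.6, True)] * 6 + [(0.6, False)] * 4)

rows = calibration_table(predictions)
print(rows)  # [(0.6, 0.6, 10), (0.9, 0.7, 10)]
```

In this invented sample the 60% predictions are well calibrated, while the 90% predictions are right only 70% of the time. That is the overconfidence pattern described above, and an overall accuracy figure would not reveal it.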
When a company says their AI is "94% accurate," you now know that number hides several things: whether that accuracy is equal across groups, what happens on inputs near the decision boundary, whether confidence scores are calibrated, and whether the model was tested on data that truly resembles the real world it will encounter. You have the framework to ask every question that number is designed to prevent you from asking.
If a medical classifier outputs a confidence score of 57% but the label says "benign," should the doctor see the score? Some argue that showing uncertainty causes doctors to second-guess themselves unnecessarily. Others argue that hiding it is a form of deception that removes the doctor's ability to make informed decisions. Who should control what information a deployed AI shows its users?
A company is deploying a loan approval classifier. It makes a binary decision (approve or deny) and outputs a confidence score. Leadership wants to hide the confidence score from loan officers, arguing that it "simplifies the process" and "prevents second-guessing." You've been asked whether this is acceptable.
In January 2020, the city of New York began piloting a system called Ava, an AI tool developed by a company called Palantir, to help the Administration for Children's Services predict which families were at high risk of child abuse or neglect. The system pulled in data from dozens of city agencies: public housing records, family court filings, benefits enrollment, prior ACS contacts. It classified families and generated risk scores.
Critics, including researchers at New York University, pointed out a fundamental problem: the training data was overwhelmingly composed of families who had already had contact with ACS, which in New York City meant predominantly Black and Hispanic families in low-income neighborhoods. Families in wealthier neighborhoods who had similar dynamics but had never been flagged were absent from the training data entirely. The classifier had no way to learn from cases it had never seen.
A child welfare professor named Dorota Wiszniewski put it plainly: "The model can't predict what it hasn't been trained on. And what it hasn't been trained on is the rest of the city." The system was quietly suspended after the mayor who championed it left office. But the design decisions that led to its problems were not unusual. They were, in fact, the standard approach. That's what makes this worth understanding.
Building a classifier isn't just writing code. It's a series of design decisions, each of which shapes what the system can and can't do, who it will work well for, and who it might harm. Here they are in order: output categories, feature selection, training data, model choices, and deployment constraints.
Here's a real engineering tension that has direct ethical consequences. When you're tuning a classifier, you can adjust the decision boundary to trade off two competing kinds of errors.
False positives happen when the classifier says "yes" but the truth is "no." In a disease classifier: telling a healthy person they're sick. In a fraud detector: blocking a legitimate transaction. In a child welfare system: flagging a family that's actually fine.
False negatives happen when the classifier says "no" but the truth is "yes." In a disease classifier: telling a sick person they're healthy. In a fraud detector: letting fraud through. In a child welfare system: missing a family that genuinely needs intervention.
Precision measures how often the classifier is right when it says yes. If it flags 10 cases and 9 are actually problems, precision is 90%. Recall measures how many of the actual problems the classifier caught. If there were 15 real problems and the classifier caught 9, recall is 60%.
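The two worked numbers above follow directly from the counts. In this small sketch the function name is an assumption; the counts come from the example: 10 cases flagged with 9 real problems among them, and 15 real problems in total, so 6 were missed.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision: share of flagged cases that were real. Recall: share of real cases that were flagged."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# 9 real problems caught, 1 false alarm, 6 real problems missed.
print(precision_recall(9, 1, 6))  # (0.9, 0.6)
```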
Here's the tension: moving the decision boundary to catch more true positives (higher recall) almost always increases false positives too (lower precision). You can't maximize both simultaneously without infinite data and a perfect model. So which errors are acceptable? That is not a technical question. It is an ethical one, and the answer differs depending on whose false positives and false negatives you're counting.
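The trade-off shows up as soon as the flagging threshold is moved across a fixed set of scored cases. The risk scores and ground-truth values below are invented, and the function name is an assumption for the sketch.

```python
def precision_recall_at(threshold, scored):
    """scored: list of (risk_score, is_real_problem). Flags everything at or above threshold."""
    tp = sum(1 for s, real in scored if s >= threshold and real)
    fp = sum(1 for s, real in scored if s >= threshold and not real)
    fn = sum(1 for s, real in scored if s < threshold and real)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when nothing is flagged
    recall = tp / (tp + fn) if tp + fn else 1.0     # convention when there are no real problems
    return precision, recall

# Hypothetical cases: (risk score from the classifier, whether a real problem existed).
scored = [(0.95, True), (0.90, True), (0.80, False), (0.70, True),
          (0.60, False), (0.40, True), (0.30, False), (0.10, False)]

for threshold in (0.85, 0.65, 0.35):
    print(threshold, precision_recall_at(threshold, scored))
```

At the strictest threshold every flag is correct but half the real problems are missed. At the loosest threshold every real problem is caught, but a third of the flagged cases are false alarms. The model never changed; only the choice of where to draw the line did.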
This is the question that no technical manual covers, but it's the one that matters most once a system is deployed. When New York's child welfare classifier generated a false positive and a family was investigated unnecessarily, who was responsible? The company that built the model? The city agency that deployed it? The caseworker who used the score without questioning it? The policymaker who approved the procurement?
Right now, the legal and regulatory answers to that question are genuinely unsettled. In the European Union, the AI Act (which became law in 2024) requires that high-risk AI systems, including those used in employment, education, social services, and law enforcement, meet standards for transparency, human oversight, and accuracy. In the United States, there is no equivalent federal law. The rules are being written now. Not in ten years. Now.
Every engineer, product manager, policy analyst, and journalist who understands how classifiers actually work (what features they use, where their training data came from, what their confidence scores mean, and where their decision boundaries sit) has a role in how those rules get written. You understand all of that now. Most people making these decisions professionally don't.
If you were the engineer who built New York's child welfare classifier, and you knew your training data was biased toward already-surveilled families, would you ship the system? What if you were told it would still catch some cases that would otherwise be missed, cases where real children were harmed? Does the possibility of catching even one genuine case justify deploying a biased system? Who gets to make that calculation?
You have gone from "AI classifies things" to understanding the full chain: features are measured from inputs, weights are learned from labeled training data through gradient descent, decision boundaries are drawn in feature space, confidence scores quantify proximity to those boundaries, and all of it is shaped by choices made before the algorithm ran. That is the complete picture. Most people who talk about AI publicly, including many who are paid to, don't have it. You do.
You've been asked to design a classifier for a real use case of your choice. It could be a school attendance risk predictor, a disease detection system, a content moderation filter, a loan approval system, or something you invent. Walk through all five design decisions: output categories, feature selection, training data plan, model choices, and deployment constraints. Your lab partner will pressure-test every decision.