In May 2016, investigative journalists at ProPublica published a piece that would rattle the criminal justice world. They had obtained COMPAS scores (risk assessments generated by an algorithm made by a company called Northpointe) for more than 7,000 people arrested in Broward County, Florida. Then they followed up two years later to see who actually reoffended.
The results were stark. Black defendants were nearly twice as likely to be falsely flagged as future criminals compared to white defendants. White defendants were more likely to be incorrectly labeled low risk and go on to commit new crimes. The algorithm had never been shown to a judge as anything but a clean number, a score from one to ten, and yet it was shaping who went home and who went to prison.
Northpointe insisted their tool was accurate. ProPublica said it was biased. Both were telling the truth. That paradox is where the study of algorithmic bias begins.
Bias in AI systems is not a bug in the traditional sense. It is not a programmer typing the wrong symbol. It is a pattern inherited from data: data that was generated by human beings operating inside societies with documented histories of unequal treatment. When an algorithm is trained on past decisions, it learns to replicate those decisions, including the prejudices embedded in them.
The word "bias" in everyday language implies intent: someone is biased when they consciously or unconsciously prefer one group over another. In machine learning, bias often requires no such intent. A hiring algorithm trained on a decade of résumés from a company that historically hired mostly men will learn that "maleness" correlates with success, not because any engineer chose that, but because the historical data said so.
Bias in AI can be loosely divided into three entry points: the data used for training, the design choices made during development, and the deployment context in which the system is applied. Understanding which type is operating in any given situation is essential to fixing it.
Northpointe's rebuttal to ProPublica was mathematically coherent. They demonstrated that COMPAS was calibrated: among people who scored a 7, roughly the same percentage of Black and white defendants went on to reoffend. From one angle, that's fairness: the score means the same thing regardless of race.
But ProPublica was measuring something different: error rates across groups. A Black defendant who would not reoffend was far more likely to be labelled high risk than a white defendant who would not reoffend. From this angle, the tool was inflicting different costs on different people for the same outcome.
A landmark 2016 paper by computer scientists Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan proved something that made the whole debate feel more like tragedy than scandal: when base rates of the outcome differ between groups (as they do when one group faces higher rates of policing and arrest), no imperfect predictor can satisfy both definitions of fairness at the same time. You must choose which unfairness to accept.
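The tension between the two definitions can be checked with a few lines of arithmetic. The Python sketch below uses made-up counts (they are not COMPAS data, and the group names A and B are placeholders) in which a two-band score is calibrated identically for both groups while the base rates differ. Under those assumptions, the error rates come out unequal.

```python
from fractions import Fraction

# Hypothetical counts, chosen for illustration only (not COMPAS data).
# Each group maps a score band to (number of people, number who reoffended).
groups = {
    "A": {"high": (50, 30), "low": (50, 10)},  # base rate 40/100
    "B": {"high": (20, 12), "low": (80, 16)},  # base rate 28/100
}

def calibration(group):
    """Reoffence rate within each score band."""
    return {band: Fraction(pos, n) for band, (n, pos) in group.items()}

def false_positive_rate(group):
    """Share of non-reoffenders who were placed in the high band."""
    high_n, high_pos = group["high"]
    low_n, low_pos = group["low"]
    non_reoffenders = (high_n - high_pos) + (low_n - low_pos)
    return Fraction(high_n - high_pos, non_reoffenders)

def false_negative_rate(group):
    """Share of reoffenders who were placed in the low band."""
    _, high_pos = group["high"]
    _, low_pos = group["low"]
    return Fraction(low_pos, high_pos + low_pos)

for name, group in groups.items():
    print(name, calibration(group),
          false_positive_rate(group), false_negative_rate(group))
```

With these counts, both groups show the same reoffence rate within each band (3/5 for high, 1/5 for low), so the score is calibrated. Yet group A's false positive rate is 1/3 while group B's is 1/9, and the false negative rates differ in the opposite direction (1/4 versus 4/7).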
The COMPAS case is not a story about one bad algorithm. It is a demonstration that choosing a fairness metric is itself an ethical and political act. There is no neutral option. The question "which fairness?" is always also the question "whose interests take priority?"
A common misconception about AI is that it discovers objective truth from data. More accurately, it reflects the world that generated the data. In 2015, Google Photos' image recognition system automatically labelled photos of two Black people as "gorillas." The engineers had not written racist code. The most widely cited explanation is that the system was trained on a dataset that dramatically underrepresented dark-skinned faces, leaving it unreliable on skin tones it had rarely seen.
Google's response (reportedly blocking the word "gorilla" as a label entirely, a fix still in place years later) illustrated the difficulty of remediation. You cannot just remove the bad output; you must address the underlying cause, and the underlying cause is often the distribution of the world that produced the data in the first place.
This is why researchers like Joy Buolamwini at the MIT Media Lab introduced the term "the coded gaze": the idea that AI systems encode the perspective of whoever built them and whoever generated the majority of the training data, often at the expense of those who did not.
Bias is not a property of the algorithm in isolation. It is a relationship between the algorithm, the data it was trained on, the world that generated that data, and the specific task being performed. Fixing it requires understanding all four elements together.
You will be presented with real-world AI scenarios. For each one, identify which type of bias is most likely operating (historical, representation, or measurement) and explain your reasoning. Your AI partner will probe your thinking and provide feedback.
Complete at least 3 exchanges to finish this lab.
On January 9, 2020, Robert Williams was standing in his driveway in Farmington Hills, Michigan, when two Detroit police officers pulled up and told him he was under arrest. They had a warrant. His wife and daughters watched as he was handcuffed and driven away. He spent the night in jail before learning the charge: shoplifting watches from a Shinola store in 2018.
Williams had never been in that store. Detroit police had fed surveillance footage into a facial recognition system that matched it to a driver's license photo in a database. The match was wrong. Investigators had not sought any corroborating evidence before seeking the warrant. Williams was the first documented American to be arrested based solely on a false facial recognition match.
He was not the only one. Reporting later documented at least two more wrongful arrests based on false facial recognition matches: Michael Oliver in Detroit and Nijeer Parks in New Jersey. Parks spent ten days in jail. All three men were Black. The technology's developers had tested it primarily on lighter-skinned faces.
In 2018, MIT researcher Joy Buolamwini and data scientist Timnit Gebru published "Gender Shades," a landmark audit of three commercial facial analysis systems sold by IBM, Microsoft, and Face++. They tested each system's ability to classify the gender of faces across a carefully stratified dataset of 1,270 faces ranging from dark-skinned women to light-skinned men.
The results were alarming. On light-skinned men, all three systems performed near-perfectly. On dark-skinned women, error rates reached as high as 34.7%. The performance gap was not a minor calibration issue. It was a systematic failure concentrated almost entirely on the group most underrepresented in the training data: darker-skinned female faces.
After the paper's publication, IBM and Microsoft both significantly improved their systems' performance on darker-skinned faces. Face++ showed smaller improvement. The study demonstrated that independent auditing, not vendor self-reporting, is often the mechanism through which these failures come to light.
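A disaggregated audit of the kind described above comes down to computing the error rate separately for each subgroup instead of reporting one overall figure. The Python sketch below is a minimal illustration with invented records; the group labels and counts are assumptions, not the Gender Shades data.

```python
from collections import defaultdict

def error_rates_by_group(records):
    """records: iterable of (group, true_label, predicted_label) tuples."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, truth, predicted in records:
        totals[group] += 1
        if truth != predicted:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}

# Hypothetical audit set: 100 faces per subgroup.
records = (
    [("lighter_male", "M", "M")] * 99 + [("lighter_male", "M", "F")] * 1
    + [("darker_female", "F", "F")] * 70 + [("darker_female", "F", "M")] * 30
)

overall = sum(truth != pred for _, truth, pred in records) / len(records)
per_group = error_rates_by_group(records)

print(f"overall error rate: {overall:.3f}")
for group, rate in per_group.items():
    print(f"{group}: {rate:.3f}")
```

In this invented example the overall error rate is 15.5%, which hides the fact that one subgroup sees a 1% error rate and the other sees 30%. A single aggregate accuracy figure can therefore look acceptable while the failures are concentrated on one group.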
Facial recognition has been deployed in policing in dozens of U.S. cities and internationally, often without public disclosure or legislative approval. The Washington, D.C. area, New Orleans, New York City, and many others have used commercial systems from vendors including Clearview AI, Amazon's Rekognition, and NEC. These systems are typically used to generate "leads" (potential matches) rather than definitive identifications. But in practice, as the Robert Williams case shows, a lead can become an arrest warrant with insufficient scrutiny.
A 2019 NIST (National Institute of Standards and Technology) study tested 189 facial recognition algorithms submitted by commercial vendors. It found that many algorithms produced false positives 10 to 100 times more often for African-American and Asian faces than for Caucasian faces. False-positive rates, where the system incorrectly identifies someone as a match, were highest for African-American women. In a criminal justice context, a false positive is not a minor inconvenience. It can mean arrest, incarceration, and lasting reputational damage.
Several cities responded to these findings by banning or restricting government use of facial recognition: San Francisco (2019), Boston (2020), Minneapolis (2021), and others. The European Union's AI Act, finalized in 2024, places facial recognition used in real-time public surveillance in the highest risk category, with strict prohibitions on most use cases.
In 2018, the ACLU tested Amazon's Rekognition tool by running photos of all 535 members of the U.S. Congress against a database of 25,000 publicly available arrest photos. The system produced 28 false matches. Nearly 40% of those misidentified were people of color, although people of color made up only about 20% of the congressional membership tested. Amazon disputed the test methodology, saying the confidence threshold was set too low.
Beyond accuracy, facial recognition raises a distinct ethical issue: the collection and use of biometric data without consent. Clearview AI built a database of more than three billion facial images by scraping social media platforms (Facebook, Instagram, LinkedIn, Twitter) without permission from users or platforms. Law enforcement agencies could then upload a photo of a suspect and receive a list of potential matches with links to public posts.
In 2022, an Illinois court ordered Clearview to pay $52 million in a class-action settlement under the state's Biometric Information Privacy Act, the most stringent biometric privacy law in the United States. Canada, Australia, and multiple EU member states also found Clearview's practices to violate their privacy laws and ordered data deletion. The episode crystallized a broader question: even if facial recognition were perfectly accurate, should your face be searchable by anyone with a subscription?
Facial recognition sits at the intersection of two distinct ethical failures: a technical failure (differential accuracy by race and gender) and a consent failure (biometric data collected and used without meaningful permission). Solving one does not solve the other. A perfectly accurate system deployed without consent is still an ethical violation.
Vendors selling facial recognition tools often make claims about their systems' accuracy and fairness. In this lab, you'll practice critically evaluating those claims by applying what you've learned about audit methodology, differential performance, and consent.
Complete at least 3 exchanges to finish this lab.
In 2014, Amazon began building an AI recruiting tool that the company hoped would automate the search for talent. Engineers trained it on a decade of résumé submissions and hiring decisions: the inputs and outputs of Amazon's own past hiring process. By 2015, the system was operational. By 2017, it had been scrapped.
The reason: the system had learned to penalize résumés that included the word "women's," as in "women's chess club" or "women's college." It also downgraded graduates of all-female colleges. The tool was not told to discriminate. It had observed that men were hired at higher rates and inferred that signals of maleness correlated with hireability. It was doing exactly what it was designed to do: finding patterns in the data. The patterns it found were the legacy of a decade of biased decisions.
Amazon quietly dissolved the team. Reuters broke the story in October 2018. The company said the tool had never been used in final hiring decisions, though the degree to which it influenced candidate screening remained disputed.
Amazon's case is a textbook example of historical bias feeding forward. The company's historical hiring data encoded the gender imbalance of the tech industry, and the model faithfully reproduced it. But the problem extends beyond one company. In 2019, researchers at the University of Toronto analyzed a widely used pre-employment screening tool and found that it consistently scored candidates with "White-sounding" names higher than equally qualified candidates with "Black-sounding" names. The finding echoed a famous 2003 audit study by economists Marianne Bertrand and Sendhil Mullainathan, who sent identical résumés with racially coded names to employers and found a 50% callback gap.
The issue compounds when algorithms score on proxies that correlate with protected characteristics. Credit scoring systems that penalize applicants without a credit history disproportionately affect recent immigrants, young adults, and communities where formal banking access has historically been limited. The variable "no credit history" is not race, but its distribution in the population is shaped by racially differentiated access to banking infrastructure.
This is what researchers call proxy discrimination: when a facially neutral variable serves as a statistical stand-in for a protected characteristic. Zip code, school attended, employment gap, credit history: each can be predictively valid and systematically unfair at the same time.
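Proxy discrimination can be made concrete with a toy example. In the Python sketch below, the decision rule never sees group membership and only looks at zip code, yet approval rates diverge sharply because the two groups are distributed differently across zip codes. The applicants, zip codes, and rule are all hypothetical.

```python
# Hypothetical applicants as (group, zip_code) pairs.
# The group field is used only to audit outcomes, never to decide them.
applicants = (
    [("X", "10001")] * 80 + [("X", "10002")] * 20
    + [("Y", "10001")] * 20 + [("Y", "10002")] * 80
)

def approve(zip_code):
    """A facially neutral rule: approve only applicants from one zip code."""
    return zip_code == "10001"

def approval_rate_by_group(applicants):
    totals, approved = {}, {}
    for group, zip_code in applicants:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + int(approve(zip_code))
    return {group: approved[group] / totals[group] for group in totals}

rates = approval_rate_by_group(applicants)
print(rates)
```

Here group X is approved 80% of the time and group Y 20% of the time, even though `approve` takes no group argument. Dropping the protected attribute from the inputs does not remove the disparity when another input carries the same information.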
In 2019, the U.S. Department of Housing and Urban Development filed a complaint against Facebook alleging that its ad-targeting algorithms were facilitating housing discrimination. Advertisers could show housing listings only to users Facebook classified as likely to be interested, but those classification signals included proxies for race, national origin, and religion. A later settlement required Facebook to overhaul its ad system for housing, employment, and credit categories.
The same year, the New York Department of Financial Services investigated Apple's credit card (the Apple Card, issued by Goldman Sachs) after David Heinemeier Hansson, the creator of Ruby on Rails, tweeted that he had received a credit limit 20 times higher than his wife despite their sharing all assets. Dozens of similar complaints followed. Goldman Sachs maintained that its algorithm did not use gender as an input. The regulator found no legal violation, but the investigation exposed the opacity of algorithmic credit decisions and the inadequacy of existing disclosure requirements.
When asked to explain its decision, Goldman Sachs could not provide an individual applicant with a meaningful explanation of what factors drove their score. This is not unique to Goldman. Most gradient-boosted decision tree models used in credit scoring are not designed for interpretability. The right to explanation, enshrined in Europe's GDPR and partially addressed in the U.S. by adverse action notices, is difficult to satisfy in practice when the model itself cannot clearly articulate why it decided what it decided.
When biased models make decisions (who gets a loan, who gets interviewed), those decisions create the next round of training data. People denied loans don't appear in the "successful borrower" dataset. People not hired don't appear in the "successful employee" dataset. The model's biases become invisible because the evidence that would reveal them was never generated.
New York City Local Law 144, which took effect in 2023, was the first law in the United States to regulate automated employment decision tools specifically. It requires employers using AI hiring tools to conduct annual bias audits by independent third parties and to disclose audit results publicly. Applicants must be notified when AI is being used. Enforcement has been slow, and critics have noted that the law allows employers to commission their own audits, creating conflicts of interest, but it represents the first substantive legislative effort to impose accountability on automated hiring.
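The feedback loop can be sketched as a toy simulation. In the Python example below, two groups repay loans at the same true rate, but the lender starts with a biased estimate for one of them and only lends to groups whose estimate clears a threshold. The rates, the threshold, and the update rule (moving halfway toward the true rate each round) are arbitrary assumptions chosen to keep the example small.

```python
TRUE_REPAYMENT = {"A": 0.9, "B": 0.9}  # identical in reality (assumed)
THRESHOLD = 0.7                        # lend only if the estimate is at least this

def run_round(estimate):
    """Return updated estimates after one round of lending decisions."""
    updated = dict(estimate)
    for group, est in estimate.items():
        if est >= THRESHOLD:
            # Loans are made, outcomes are observed, and the estimate
            # moves halfway toward the true repayment rate.
            updated[group] = est + 0.5 * (TRUE_REPAYMENT[group] - est)
        # Otherwise no loans are made, so no outcomes exist to learn from
        # and the estimate stays exactly where it was.
    return updated

# Group B starts with a biased estimate inherited from past decisions.
history = [{"A": 0.8, "B": 0.5}]
for _ in range(5):
    history.append(run_round(history[-1]))

for round_number, estimate in enumerate(history):
    print(round_number, estimate)
```

Group A's estimate converges toward the true rate of 0.9, while group B's estimate never moves from 0.5, because the denials prevent any corrective data from being collected. The data the lender accumulates looks consistent with its own decisions, which is what makes the bias hard to see from inside the system.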
The EU AI Act, passed in 2024, classifies AI systems used in employment, education, essential services, and credit scoring as "high risk," requiring conformity assessments, ongoing monitoring, human oversight mechanisms, and detailed documentation of training data before deployment. These requirements represent a significant structural shift: from voluntary vendor standards to binding legal obligations.
When algorithms replace human gatekeepers, they do not eliminate human bias. They often amplify and entrench it by making it faster, cheaper, and harder to see. The illusion of objectivity is not a side effect. It is, for many deployers, a feature: a way to disclaim responsibility for decisions that were always going to be made.
Proxy discrimination and feedback loops are subtle mechanisms. In this lab, you'll work through scenarios to identify proxy variables, trace how feedback loops form, and think through what interventions could break these cycles.
Complete at least 3 exchanges to finish this lab.
In February 2020, a Dutch court ordered the government of the Netherlands to immediately halt a fraud-detection system called SyRI (System Risk Indication). SyRI was an algorithm that combined data from seventeen government databases (tax records, employment data, housing registers, benefit claims) to generate risk scores for citizens suspected of welfare fraud. The system had been deployed in fourteen municipalities, overwhelmingly in low-income and ethnically diverse neighborhoods.
The court found SyRI violated the European Convention on Human Rights, specifically the right to private life under Article 8. The government had not made the risk model public. Citizens had no way to know they were being scored, no access to what data was used, and no clear mechanism to challenge or correct errors. The court ruled that opaque algorithmic surveillance of disadvantaged populations, without meaningful transparency or appeal rights, crossed a fundamental legal line.
It was among the first court rulings anywhere in the world to invoke human rights law directly against an automated government decision system, and it anticipated much of the regulatory architecture that would follow.
Researchers and engineers have developed a range of technical approaches to reducing bias in AI systems. These generally fall into three categories based on where in the pipeline they intervene:
Pre-processing interventions modify or rebalance training data before the model is trained. Techniques include resampling underrepresented groups, synthetic data generation (creating additional examples of underrepresented cases), and removing or transforming proxy variables. The risk is that removing proxies may degrade predictive performance, and that synthesizing data for underrepresented groups may introduce artifacts.
In-processing interventions modify the learning algorithm itself to include a fairness constraint, effectively penalizing the model during training if its predictions diverge too greatly across demographic groups. This requires specifying in advance which fairness metric to optimize for, and as the COMPAS paradox showed, different metrics can be mutually incompatible.
Post-processing interventions adjust the model's outputs after the fact, applying different decision thresholds for different groups to equalize error rates. This approach is implementable without retraining but is controversial because it explicitly treats groups differently, the very thing that discrimination law in most jurisdictions formally prohibits.
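Of the three categories, post-processing is the easiest to show in a few lines. The Python sketch below picks a separate threshold for each group so that false positive rates match a chosen target. The scores, labels, group names, and the 25% target are invented for illustration, and the search over candidate thresholds is one simple way to do this, not a standard library routine.

```python
def false_positive_rate(scores, labels, threshold):
    """FPR when everyone scoring at or above the threshold is flagged."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    flagged = sum(s >= threshold for s in negatives)
    return flagged / len(negatives)

def pick_threshold(scores, labels, max_fpr):
    """Lowest observed score usable as a threshold without exceeding max_fpr."""
    for candidate in sorted(set(scores)):
        if false_positive_rate(scores, labels, candidate) <= max_fpr:
            return candidate
    return max(scores) + 1  # no candidate works, so flag nobody

# Hypothetical risk scores (1-10) and outcomes (1 = event occurred).
data = {
    "A": {"scores": [2, 4, 6, 8, 7, 9], "labels": [0, 0, 0, 0, 1, 1]},
    "B": {"scores": [1, 2, 3, 6, 5, 8], "labels": [0, 0, 0, 0, 1, 1]},
}

shared = {g: false_positive_rate(d["scores"], d["labels"], 6)
          for g, d in data.items()}
thresholds = {g: pick_threshold(d["scores"], d["labels"], max_fpr=0.25)
              for g, d in data.items()}
adjusted = {g: false_positive_rate(d["scores"], d["labels"], thresholds[g])
            for g, d in data.items()}

print(shared, thresholds, adjusted)
```

With a shared threshold of 6, group A's false positive rate is 0.5 and group B's is 0.25. The per-group search settles on a threshold of 7 for A and 5 for B, which brings both rates to 0.25. The cost is visible in the code itself: two people with the same score of 6 are now treated differently depending on their group.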
Every technical mitigation technique has a limitation: it operates within the system as designed. It cannot question whether the system should exist, whether the task being automated is itself appropriate to automate, or whether the fairness metric chosen reflects whose interests were prioritized in the design process.
In 2019, a study by Obermeyer, Powers, Vogeli, and Mullainathan in Science examined a widely used healthcare algorithm that predicted which patients needed intensive care management. The system had been deployed for 200 million people across the United States. It was found to be systematically under-identifying Black patients with the same level of illness as white patients, effectively allocating less care to Black patients with equal need.
The root cause: the algorithm used past healthcare spending as a proxy for health need. But spending is not the same as need. Black patients in the United States, due to documented barriers in healthcare access including cost, distrust, and geography, had historically spent less on healthcare for the same conditions. The algorithm had learned that Black patients "needed less," because they had historically received less.
The fix involved changing the prediction target: switching from spending to a direct measure of illness burden. The researchers estimated that remedying the disparity would increase the percentage of Black patients receiving additional help from 17.7% to 46.5%. But it required researchers outside the vendor to identify the problem, and the system had run for years before the audit.
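The effect of the target variable can be illustrated with a small ranking exercise. In the Python sketch below, patients in both groups have the same distribution of chronic conditions, but one group's recorded spending is assumed to be 30% lower for the same illness burden. All figures are invented; they are not taken from the study.

```python
# Hypothetical patients: (group, chronic_conditions, annual_cost_usd).
# Spending for the "Black" group is assumed to be 70% of the "white"
# group's spending at the same number of conditions.
patients = [
    ("white", 5, 10000), ("white", 4, 8000), ("white", 3, 6000), ("white", 2, 4000),
    ("Black", 5, 7000), ("Black", 4, 5600), ("Black", 3, 4200), ("Black", 2, 2800),
]

def select_top(patients, key, k):
    """Enroll the k patients who rank highest on the chosen measure."""
    return sorted(patients, key=key, reverse=True)[:k]

def count_group(selected, group):
    return sum(1 for patient in selected if patient[0] == group)

by_cost = select_top(patients, key=lambda p: p[2], k=4)
by_need = select_top(patients, key=lambda p: p[1], k=4)

print("Black patients enrolled when ranking by cost:", count_group(by_cost, "Black"))
print("Black patients enrolled when ranking by need:", count_group(by_need, "Black"))
```

Ranking by cost enrolls one Black patient out of four slots, while ranking by number of conditions enrolls two, even though nothing about the ranking procedure changed. The only difference is which quantity the system was asked to predict.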
In most jurisdictions, when an algorithmic decision harms someone, it is extraordinarily difficult to establish legal liability. Vendors argue their systems are general-purpose tools and bear no responsibility for how deployers use them. Deployers argue they relied on the vendor's representations of accuracy and fairness. The person harmed (wrongfully arrested, denied a loan, excluded from care) is left with a harm and no clear path to remedy. Closing this accountability gap is among the central challenges of AI governance.
The SyRI ruling, NYC Local Law 144, and the EU AI Act all point toward the same conclusion: technical debiasing cannot substitute for structural accountability. Structural solutions require transparency (you must disclose what your system does), contestability (affected individuals must have a meaningful way to challenge decisions), human oversight (consequential decisions must have a human review mechanism), and ongoing monitoring (you must continuously audit performance, not just test before deployment).
Algorithmic impact assessments, modelled on environmental impact assessments, are now required or recommended in several jurisdictions before high-risk AI systems can be deployed in public services. They require developers to articulate who is affected, what the expected benefits are, what the foreseeable harms are, and what mitigation measures are in place.
Critics from civil rights organizations argue that even these frameworks place too much burden on after-the-fact remediation. The more fundamental question, they argue, is whether some domains (criminal risk assessment, welfare fraud detection, predictive policing) should be automated at all given the current state of the technology and the severity of the harms when it fails. That is not a technical question. It is a political and moral one.
Algorithmic bias is not primarily a machine learning problem. It is a power problem: who decides what to automate, whose data is used, which fairness metric is chosen, and who bears the cost when the system is wrong. Technical tools can help. But they cannot replace the governance structures, legal accountability mechanisms, and political will required to ensure that automated systems serve everyone equitably.
You'll be asked to design or evaluate accountability frameworks for AI systems, going beyond technical fixes to structural requirements like transparency, contestability, and human oversight. Draw on all four lessons in this module.
Complete at least 3 exchanges to finish this lab.