In January 2020, Robert Williams was arrested in his driveway in front of his daughters by Detroit police. The charge: robbery. The evidence: a facial recognition match to a blurry surveillance image. The match was wrong. Williams was held for 30 hours before detectives admitted the AI had misidentified him. He became the first documented U.S. case of wrongful arrest driven by facial recognition — but not the last.
Reports of similar cases followed, including those of Michael Oliver (arrested 2019, Detroit) and Nijeer Parks (arrested 2019, New Jersey). Both are Black men who were wrongfully arrested after facial recognition matched them to crimes they did not commit; their arrests predate Williams' but became public later. All three were eventually exonerated. All three experienced the system's failure in an acutely racialized way: commercially available facial recognition tools consistently show higher error rates for darker-skinned faces, particularly those of dark-skinned women.
The MIT Media Lab's 2018 Gender Shades study by Joy Buolamwini and Timnit Gebru quantified this precisely: commercial facial analysis tools from IBM, Microsoft, and Face++ misclassified the gender of darker-skinned women at error rates of up to 34.7%, compared with at most 0.8% for lighter-skinned men. The systems had not been tested rigorously on diverse populations before deployment.
Robert Williams' arrest resulted from the Detroit Police Department's use of DataWorks Plus facial recognition software. A detective had submitted a still image from a store surveillance video to the Michigan State Police database. The system returned Williams as a candidate match. No human examiner independently verified the match before an arrest warrant was issued. The ACLU later filed a formal complaint on Williams' behalf.
The harms extend well beyond policing. In 2019, a UK Home Office visa photo system rejected thousands of applications from British citizens of East and South Asian descent, flagging their photos as having "eyes closed" — a biometric system misreading normal facial variation. The department quietly revised the rejection guidance after public pressure.
In the United States, U.S. Customs and Border Protection expanded biometric facial matching at airports to cover nearly all international travelers by 2023, with a stated accuracy goal of 99%. However, audits by the Government Accountability Office found that CBP had not systematically measured false match rates disaggregated by race, age, or gender across its full operational deployment — meaning the claimed accuracy figures were not verified under real-world, demographically diverse conditions.
Meanwhile, automated benefit systems using document and face matching denied or delayed claims for individuals whose IDs didn't match facial recognition standards — including elderly claimants and people with certain disabilities affecting facial appearance.
These harms share a structure. Training data imbalances mean models learn faces from populations that were easier to photograph and label — historically, lighter-skinned, younger, male faces. Threshold calibration decisions — how confident the system must be before returning a match — are often made without accounting for different false-positive rates across demographic groups. And deployment without independent audit means errors go undetected until someone is in handcuffs or denied a visa.
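The threshold point can be made concrete with a minimal sketch. It assumes a system that outputs a similarity score for each comparison and declares a match when the score meets a threshold. The group names, scores, and function names are invented for illustration and do not describe any real product.

```python
def false_match_rate(impostor_scores, threshold):
    """Share of different-person comparisons that score at or above the threshold."""
    if not impostor_scores:
        raise ValueError("need at least one impostor score")
    return sum(s >= threshold for s in impostor_scores) / len(impostor_scores)


def threshold_for_target_fmr(impostor_scores, target_fmr):
    """Lowest observed score that, used as the threshold, keeps FMR at or below the target.

    Returns None if no observed score achieves the target.
    """
    for candidate in sorted(set(impostor_scores)):
        if false_match_rate(impostor_scores, candidate) <= target_fmr:
            return candidate
    return None


# Invented similarity scores for comparisons between DIFFERENT people.
impostor_scores = {
    "group_a": [0.10, 0.15, 0.20, 0.22, 0.30, 0.31, 0.35, 0.40, 0.45, 0.62],
    "group_b": [0.20, 0.28, 0.35, 0.41, 0.48, 0.55, 0.61, 0.66, 0.70, 0.74],
}

SHARED_THRESHOLD = 0.60  # one threshold applied to everyone

fmr_at_shared = {
    group: false_match_rate(scores, SHARED_THRESHOLD)
    for group, scores in impostor_scores.items()
}
per_group_threshold = {
    group: threshold_for_target_fmr(scores, 0.10)
    for group, scores in impostor_scores.items()
}

print("FMR at shared threshold:", fmr_at_shared)
print("Threshold needed for 10% FMR:", per_group_threshold)
```

With these invented scores, the shared threshold yields a 10% false match rate for group_a and 40% for group_b, and group_b would need a stricter threshold (0.74 rather than 0.62) to reach the same 10% rate. A single aggregate figure would hide that gap.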
When a system's errors are not randomly distributed but cluster along demographic lines, the harm isn't just technical — it's discriminatory. Accountability begins with measuring who pays the price when the system is wrong.
You've read about three wrongful arrests and the Gender Shades study. Use this lab to explore the structural factors that allowed these harms to occur — and what earlier interventions might have prevented them.
In December 2019, the National Institute of Standards and Technology published Face Recognition Vendor Test (FRVT) results showing dramatic performance disparities across 189 algorithms from 99 developers. The agency had collected algorithms submitted by developers, run them against standardized test sets, and published disaggregated error rates. It was the most rigorous public audit of commercial facial recognition ever conducted, and it revealed that many of the algorithms vendors had sold to law enforcement performed far worse on Black and Asian faces than on white ones.
The report also revealed something more uncomfortable: some systems that performed well in lab conditions degraded significantly when tested on the mugshot and visa photos that represented real-world government databases — poor lighting, inconsistent image quality, aging subjects.
The NIST FRVT framework demonstrates what genuine algorithmic auditing looks like: a controlled test environment, demographically diverse ground-truth datasets with verified identity labels, standardized error metrics reported by subgroup, independent execution by an entity with no financial stake in the outcome, and public disclosure of results.
Most deployed systems have received none of this. Vendors typically conduct their own internal evaluations and publish summary statistics ("99.9% accuracy") without disclosing test set composition. Buyers — police departments, border agencies, employers — rarely have the technical capacity to challenge these claims. Third-party auditors are often denied API access or sufficient data samples to run independent tests.
In 2020, after Joy Buolamwini's research revealed performance disparities, IBM announced it would exit the facial recognition market entirely, citing concerns about mass surveillance and racial bias. Microsoft stated it would not sell facial recognition to police until federal regulation was in place. These were genuine responses — but they also illustrated a structural gap: the audits that revealed the problems depended on access that companies can revoke at any time.
Researchers attempting to audit Amazon's Rekognition system found that after they published studies showing performance disparities, Amazon changed its API in ways that made direct comparison to previous results difficult. Whether this was product improvement or access restriction is contested, but the episode illustrates why audit rights must be legally protected rather than granted at vendor discretion.
Under the EU AI Act (passed 2024), real-time remote biometric identification systems used in publicly accessible spaces are classified as prohibited or high-risk depending on application. High-risk systems must maintain technical documentation, enable conformity assessments, register in an EU database, and be subject to market surveillance. Notably, the Act requires accuracy metrics to be reported across demographic groups — codifying what NIST's voluntary tests demonstrated was possible.
A technically sound audit of a visual recognition system measures at minimum: false match rate (FMR) — how often the system incorrectly identifies a non-target as the target; false non-match rate (FNMR) — how often a true match is missed; and failure to acquire rate — how often the system cannot process an image at all. Each must be reported by demographic subgroup, not just as an aggregate. The NIST FRVT 1:1 report showed that some algorithms had FMRs 100 times higher for Black women than for white men at the same decision threshold.
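The three metrics above can be computed per subgroup from labeled comparison trials. The following is a minimal sketch, not NIST's methodology: the `Trial` record, the group labels, and the sample scores are all hypothetical, and a real audit would need far larger samples and verified ground truth.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    group: str         # demographic subgroup label (hypothetical)
    same_person: bool  # ground truth: do both images show the same person?
    acquired: bool     # could the system process the image at all?
    score: float       # similarity score; ignored when acquired is False


def audit_by_group(trials, threshold):
    """Return {group: {"fmr", "fnmr", "fta"}} at one decision threshold."""
    counts = defaultdict(lambda: defaultdict(int))
    for t in trials:
        c = counts[t.group]
        c["total"] += 1
        if not t.acquired:
            c["not_acquired"] += 1
            continue
        matched = t.score >= threshold
        if t.same_person:
            c["genuine"] += 1
            c["false_non_match"] += int(not matched)
        else:
            c["impostor"] += 1
            c["false_match"] += int(matched)

    def rate(numerator, denominator):
        # None signals "no trials of this kind" rather than a misleading 0.0
        return numerator / denominator if denominator else None

    return {
        group: {
            "fmr": rate(c["false_match"], c["impostor"]),
            "fnmr": rate(c["false_non_match"], c["genuine"]),
            "fta": rate(c["not_acquired"], c["total"]),
        }
        for group, c in counts.items()
    }


def make_trials(group, genuine, impostor, unacquired=0):
    """Build invented trials from lists of genuine and impostor scores."""
    trials = [Trial(group, True, True, s) for s in genuine]
    trials += [Trial(group, False, True, s) for s in impostor]
    trials += [Trial(group, True, False, 0.0) for _ in range(unacquired)]
    return trials


trials = (
    make_trials("group_a", genuine=[0.9, 0.8, 0.7, 0.6], impostor=[0.1, 0.2, 0.3, 0.4])
    + make_trials("group_b", genuine=[0.9, 0.8, 0.45], impostor=[0.2, 0.4, 0.6, 0.7], unacquired=1)
)

report = audit_by_group(trials, threshold=0.5)
for group, metrics in report.items():
    print(group, metrics)
```

In this invented data, group_a shows no errors at the 0.5 threshold while group_b shows a 50% false match rate, a one-in-three false non-match rate, and a 12.5% failure-to-acquire rate. Pooling both groups into one accuracy figure would obscure exactly the pattern an audit exists to find.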
A city police department is considering deploying a commercial facial recognition system for cold case investigations. They've received a vendor report claiming 98.5% accuracy. You are advising the city on what independent audit requirements to demand before deployment approval.
In May 2019, San Francisco became the first city in the United States to ban government use of facial recognition technology. The ordinance, passed by the Board of Supervisors, applied to city agencies — including police — and required that any surveillance technology acquisition be approved by the board. It was followed by Oakland, Somerville, Boston, Portland, and more than a dozen other cities through 2020–2022.
The bans were significant but narrow: they applied to government actors within specific jurisdictions and said nothing about private employers, landlords, or retail stores. A San Francisco resident could be shielded from facial recognition by the SFPD while their workplace used it to log attendance and their grocery store used it for loss prevention.
The EU AI Act (formally adopted June 2024) established the most comprehensive binding framework for AI governance globally. For facial recognition specifically, it created a tiered structure: real-time biometric identification in publicly accessible spaces is prohibited for law enforcement except in tightly defined exceptions (searching for missing children, preventing specific imminent threats, prosecuting certain serious crimes). Post-hoc biometric identification is classified as high-risk and requires conformity assessment, registration, and ongoing monitoring.
Critically, the Act placed obligations not just on deployers but on providers, the companies building and selling AI systems. Providers of high-risk systems must maintain technical documentation enabling audit, complete a conformity assessment, and register systems in an EU database before placing them on the market, while certain deployers must carry out fundamental rights impact assessments. Penalties for the most serious violations reach €35 million or 7% of global annual turnover, whichever is higher.
Illinois' Biometric Information Privacy Act (BIPA), passed in 2008, predates the current AI governance debate but proved surprisingly powerful. BIPA requires any private entity collecting biometric data — including facial geometry — to: obtain informed written consent before collection; provide a publicly available retention and destruction schedule; and prohibit sale or profit from biometric data. Crucially, it creates a private right of action, meaning individuals can sue violators without waiting for a government enforcement decision.
This mechanism produced the largest AI-related settlements in U.S. history by 2023 — including Facebook's $650 million settlement over its Tag Suggestions feature (which scanned faces in photos to identify users) and ongoing litigation against employers who used facial recognition time clocks without employee consent.
Municipal bans and voluntary corporate moratoriums share a structural weakness: they leave private-sector deployment unregulated while AI identification capabilities continue to improve. During the period when major tech companies paused sales to police, smaller vendors — Clearview AI being the most documented example — continued supplying facial recognition data to law enforcement in jurisdictions with no prohibition, scraping billions of images from public social media to build the largest private face database in existence.
Clearview AI built a database of approximately 30 billion facial images scraped from social media without user consent and sold access to law enforcement agencies. By 2023, it had been used by over 3,100 agencies in the U.S. alone. It violated terms of service of every major platform it scraped. It was fined or banned in multiple countries (UK: £7.5M fine, Canada: ordered to delete Canadian data, Australia: found to have breached privacy law). It continued operating in the U.S. in the absence of federal legislation.
You've examined moratoria, the EU tiered model, and BIPA's private right of action. A state legislature has asked you to advise on the single most impactful provision they could include in a facial recognition governance bill. Use this lab to work through the options.
In 2019, researchers at Google published a paper proposing Model Cards for Model Reporting — standardized documents that would accompany ML models and disclose: intended use cases, evaluation results by subgroup, performance limitations, and ethical considerations. The proposal emerged directly from the Gender Shades work and conversations about how facial analysis systems had been deployed without disclosing demographic performance disparities.
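The four disclosure areas named above can be pictured as a simple record with a completeness check. This is an illustrative sketch only: the field names and the `missing_sections` helper are assumptions made here, not the schema from the paper or from any model hosting platform.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ModelCard:
    model_name: str
    intended_uses: list[str] = field(default_factory=list)
    # subgroup -> metric name -> value, e.g. {"group_a": {"fmr": 0.001}}
    evaluation_by_subgroup: dict[str, dict[str, float]] = field(default_factory=dict)
    limitations: list[str] = field(default_factory=list)
    ethical_considerations: list[str] = field(default_factory=list)


def missing_sections(card):
    """Names of sections left empty, in field order."""
    return [f.name for f in fields(card) if not getattr(card, f.name)]


# Hypothetical card with two sections filled in and two left blank.
card = ModelCard(
    model_name="face-matcher-demo",
    intended_uses=["1:1 identity verification with human review"],
    evaluation_by_subgroup={"group_a": {"fmr": 0.001}, "group_b": {"fmr": 0.004}},
)

print(missing_sections(card))
```

Here the check reports that `limitations` and `ethical_considerations` are empty. A check like this can confirm that a section exists, but not that what it says is accurate or complete.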
By 2023, Model Cards had become standard practice on Hugging Face, where most model uploads now include them, though compliance is voluntary and card quality varies enormously. Dataset Cards, a parallel initiative, document the composition, provenance, and known limitations of training datasets.
Most documented visual AI failures trace back to training data. The faces a model learns from determine what it learns to see accurately. Large Face DB, MS-Celeb-1M, and other widely used facial recognition training datasets were scraped from the web without consent and without demographic auditing. When researchers began examining these datasets — notably the Excavating AI project by Kate Crawford and Trevor Paglen in 2019 — they found troubling patterns: faces labeled with offensive or stereotyped descriptors, heavy skew toward white male faces, images collected without subject knowledge.
IBM subsequently released its Diversity in Faces dataset, explicitly designed to include balanced representation across Fitzpatrick skin types, face shapes, and age groups, with transparent documentation of collection methodology. Microsoft later retracted MS-Celeb-1M entirely after researchers identified that it included images of private individuals collected without consent.
Following the Williams, Oliver, and Parks wrongful arrests, Detroit City Council passed an ordinance in 2021 requiring that facial recognition results used by police be reviewed by a trained human examiner before generating a lead, and that the technology not be the sole basis for an arrest. Officers must have additional corroborating evidence. The system may be used only for violent crimes, not misdemeanors. Results must be logged and subject to annual audit.
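The safeguards described above can be read as a gate that every match must pass, with each decision written to a log. The sketch below is a simplification under stated assumptions: it collapses the lead and arrest stages into one check, and the `MatchRequest` fields, the `review` function, and the log format are hypothetical rather than taken from the ordinance or any police system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MatchRequest:
    case_id: str
    offense_is_violent: bool
    examiner_reviewed: bool
    corroborating_evidence: tuple = ()  # e.g. ("witness statement",)


def review(request, log):
    """Check a match against the policy rules and append the decision to the log."""
    violations = []
    if not request.offense_is_violent:
        violations.append("offense is not a violent crime")
    if not request.examiner_reviewed:
        violations.append("no trained examiner has reviewed the match")
    if not request.corroborating_evidence:
        violations.append("match is the sole basis; corroborating evidence required")
    entry = {
        "case_id": request.case_id,
        "approved": not violations,
        "violations": violations,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
    log.append(entry)  # every decision is recorded, approved or not
    return entry


decisions = []
approved = review(MatchRequest("case-001", True, True, ("witness statement",)), decisions)
rejected = review(MatchRequest("case-002", False, True), decisions)

for entry in decisions:
    print(entry["case_id"], entry["approved"], entry["violations"])
```

Rejected requests are logged alongside approved ones, so an annual audit can see how often the technology was attempted outside policy and not only how often it was used within it.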
This policy doesn't eliminate the risk of wrongful arrest — human examiners make errors too, and confirmation bias can cause an examiner to accept a weak match — but it creates a multi-layered decision chain and a documented record. Accountability requires a trail: someone must be identifiable as having made each consequential decision.
New York City's Local Law 144 (effective 2023) requires employers using automated employment decision tools — including image-based assessment tools — to conduct annual bias audits by independent auditors, publish summary results publicly, and notify candidates that such tools are being used. It is the first U.S. law mandating third-party auditing of AI hiring tools. Compliance has been inconsistent, but the law establishes the principle that consequential algorithmic decisions require demonstrated fairness, not assumed fairness.
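One way a bias audit of this kind can summarize results is an impact ratio. The sketch below assumes a definition in which each category's selection rate is divided by the rate of the most-selected category. The categories and counts are invented, and the exact metrics a given audit must report should be taken from the applicable rules rather than from this example.

```python
def impact_ratios(outcomes):
    """Map category -> impact ratio.

    `outcomes` maps category -> (selected, total). The impact ratio assumed
    here is a category's selection rate divided by the highest category rate.
    """
    rates = {}
    for category, (selected, total) in outcomes.items():
        if total <= 0:
            raise ValueError(f"{category}: total must be positive")
        rates[category] = selected / total
    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no category had any selections")
    return {category: rate / highest for category, rate in rates.items()}


# Invented outcomes from a hypothetical screening tool.
outcomes = {
    "group_a": (40, 100),
    "group_b": (24, 100),
    "group_c": (36, 100),
}

ratios = impact_ratios(outcomes)
for category, ratio in ratios.items():
    print(f"{category}: {ratio:.2f}")
```

With these counts, group_a is the reference category at 1.00, group_c comes out at about 0.90, and group_b at about 0.60. Publishing the ratio makes the disparity visible; deciding what level of disparity is unacceptable remains a policy question.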
Technical accountability tools — model cards, datasheets, audits — are necessary but not sufficient. A Model Card can accurately describe a system's demographic performance disparities, and an organization can choose to deploy it anyway. An audit can document bias, and a buyer can ignore the findings. Transparency creates the conditions for accountability but does not guarantee it. The complementary requirement is consequence: mechanisms — legal, financial, reputational — that make deploying a biased system costly rather than merely documented.
This is why the most effective accountability architectures combine technical transparency (Model Cards, audit reports) with legal enforcement (private rights of action, mandatory audit requirements) and procedural safeguards (human-in-the-loop requirements, logging, community oversight boards). Each layer compensates for the limits of the others.
Effective visual AI accountability operates at four levels simultaneously: technical (diverse training data, demographic disaggregation of metrics, adversarial testing); organizational (internal review processes, human-in-the-loop requirements, audit trails); legal (mandatory third-party audits, private rights of action, penalties for disparate harm); and participatory (affected communities have voice in deployment decisions and access to findings).
A mid-sized city's transit authority wants to deploy facial recognition to identify individuals on a watch list for violent incidents at stations. You've been asked to design the full accountability stack — technical, organizational, legal, and participatory layers — for this deployment. Use the lab to work through what each layer should require.