Lesson 1 · Module 2

How Face Detection Works

From Viola-Jones to deep neural nets — the 30-year sprint to instant recognition

How does a camera go from raw pixels to "that's a face" in just a few milliseconds?

In 2001, researchers Paul Viola and Michael Jones published a paper that changed consumer electronics forever. Their algorithm could detect faces in real time on hardware that, by today's standards, was barely powerful enough to run a spreadsheet. Within a decade, nearly every digital camera on the market used a variant of their approach to draw those now-familiar green rectangles around faces before you pressed the shutter.

Today the same task is handled by deep convolutional neural networks that not only find faces but estimate age, gaze direction, emotional state, and identity — all before the image finishes loading on your screen.

The Pixel Problem

A single 12-megapixel smartphone photo contains roughly 36 million numbers — three color values for each pixel. Scanning every possible region of that image for a face, at every possible size, would require billions of comparisons per photo. Early computer vision struggled to do this faster than several seconds per frame, making it useless for live cameras.
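As a rough sanity check on those numbers, the sketch below counts the values in a 12-megapixel frame and the square windows a naive scan would visit. The 4000×3000 resolution, 24-pixel starting window, 1.25× scale step, and one-pixel stride are illustrative assumptions, not parameters of any specific detector.

```python
def count_windows(width, height, base=24, scale_step=1.25, stride=1):
    """Count the square windows a brute-force scan would visit."""
    total = 0
    size = float(base)
    while size <= min(width, height):
        s = int(size)
        per_row = (width - s) // stride + 1    # horizontal positions at this size
        per_col = (height - s) // stride + 1   # vertical positions at this size
        total += per_row * per_col
        size *= scale_step                     # move on to the next, larger window
    return total

values = 4000 * 3000 * 3          # three color values per pixel
windows = count_windows(4000, 3000)
print(f"{values:,} numbers, {windows:,} candidate windows")
```

Each window would then need many pixel comparisons of its own, which is how the total climbs into the billions.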

The breakthrough was not looking at every pixel. Instead, algorithms learned to look at the right pixels in the right order, discarding non-face regions almost instantly.

Viola-Jones: The First Real-Time Method (2001)

Viola and Jones used three interlocking ideas. First, they represented images using Haar features — simple rectangular patterns that measure whether one region of pixels is lighter or darker than an adjacent region. Eyes tend to be darker than foreheads; the bridge of the nose is lighter than the sides. These contrast patterns are very fast to compute.
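A two-rectangle Haar feature can be sketched in a few lines. The function names and the toy 4×4 patch are hypothetical, and this version adds up pixels directly instead of using any speed-up.

```python
def region_sum(img, top, left, height, width):
    """Sum the brightness values inside one rectangle."""
    return sum(img[r][c]
               for r in range(top, top + height)
               for c in range(left, left + width))

def two_rect_haar(img, top, left, height, width):
    """Upper half minus lower half: positive when the upper band is lighter."""
    half = height // 2
    upper = region_sum(img, top, left, half, width)
    lower = region_sum(img, top + half, left, half, width)
    return upper - lower

# Toy patch: a bright "forehead" band above a dark "eye" band.
patch = [
    [200, 200, 200, 200],
    [200, 200, 200, 200],
    [50, 50, 50, 50],
    [50, 50, 50, 50],
]
print(two_rect_haar(patch, 0, 0, 4, 4))  # 1200
```

A large positive response suggests the light-over-dark contrast typical of a forehead above the eyes; a flat patch gives a response near zero.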

Second, they built a cascade classifier: a series of increasingly strict filters. A tiny two-feature first stage rejects roughly half of all non-face regions immediately while letting nearly every real face through. Only the survivors reach the next, harder check. By the time a region reaches stage 20, it has survived 20 rounds of elimination. The average image region is rejected after about 10 feature evaluations, while a real face must pass every stage and so triggers far more. This asymmetry makes the whole system fast.
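The early-exit logic of a cascade can be sketched as follows. The stage sizes, thresholds, and the single-number "window" are toy assumptions chosen to make the control flow visible; a trained cascade learns its features and thresholds from data.

```python
def run_cascade(window, stages):
    """Return (is_face, features_evaluated).

    Each stage is a (features, threshold) pair. A window is rejected
    at the first stage whose summed feature score falls below threshold.
    """
    evaluated = 0
    for features, threshold in stages:
        score = 0.0
        for feature in features:
            score += feature(window)
            evaluated += 1
        if score < threshold:
            return False, evaluated   # cheap early rejection
    return True, evaluated            # survived every stage

# Toy setup: the "window" is one number standing in for how face-like it is.
feature = lambda w: w
stages = [
    ([feature] * 2, 1.0),    # tiny first stage
    ([feature] * 5, 4.0),    # stricter second stage
    ([feature] * 10, 9.0),   # most expensive stage
]

print(run_cascade(0.1, stages))  # (False, 2)
print(run_cascade(0.7, stages))  # (False, 7)
print(run_cascade(1.0, stages))  # (True, 17)
```

Most windows look like the first case and cost only two evaluations, which is where the speed comes from.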

Third, they used a clever data structure called an integral image that lets the sum of any rectangle be computed from just four array lookups, regardless of the rectangle's size. These three ideas together produced a detector that ran at 15 frames per second on a 700 MHz processor in 2001.
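A minimal pure-Python sketch of the integral image idea follows. The function names are hypothetical, and the table is padded with an extra row and column of zeros so the lookup formula needs no special cases at the image border.

```python
def integral_image(img):
    """Build a summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of any rectangle from exactly four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

img = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

The table is built once per image; after that, a 2×2 rectangle and a 200×200 rectangle cost the same four lookups.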

Real-World Deployment

Face detection reached mass-market cameras in the mid-2000s. Nikon's Coolpix 5900 and 7900, released in 2005, were among the first consumer cameras with built-in face detection, and Fujifilm, Canon, and other manufacturers followed over the next two years. By 2009, Viola-Jones variants were running on hundreds of millions of devices worldwide.

Deep Learning Takes Over (2012 onward)

When AlexNet won the ImageNet competition in 2012 with a convolutional neural network, the field shifted decisively. Instead of hand-crafting features like Haar patterns, deep networks learn their own features from millions of labeled examples. For face detection, this meant networks like MTCNN (Multi-task Cascaded Convolutional Networks, 2016) and later RetinaFace (2019) that simultaneously detect faces and locate facial landmarks — the positions of eyes, nose tip, and mouth corners — with sub-pixel precision.

These models train on datasets like WIDER FACE, released by researchers at the Chinese University of Hong Kong in 2015, which contains 393,703 labeled faces across 32,203 images covering extreme conditions: blur, occlusion, unusual lighting, and faces as small as 10×10 pixels.

Key Terms

Haar Feature: A rectangular pattern comparing pixel brightness between adjacent regions, used as a fast proxy for facial structures.

Cascade Classifier: A sequence of filters where early, cheap stages reject most candidates so expensive stages only run on likely positives.

CNN (Convolutional Neural Network): A neural network architecture that learns spatial feature detectors by sliding small filter windows across an image.

Landmark Detection: Pinpointing specific facial points (eye corners, nose tip, mouth corners) to enable alignment and further analysis.

Why This Matters

Face detection is the first step in nearly every face-related AI system: photo organization, augmented reality filters, access control, and surveillance. Understanding how detection works — and where it fails — is essential before evaluating any downstream application built on top of it.

Lesson 1 Quiz

How Face Detection Works · 3 questions

What property of the Viola-Jones cascade classifier makes it fast enough for real-time use?

Correct. The cascade structure means roughly 50% of regions are eliminated at the very first stage using just two features. Only genuine face candidates survive to the expensive later stages.

Not quite. The cascade's power is precisely that it does not give equal effort to all regions — it rejects non-faces cheaply and early.

What is a Haar feature?

Correct. Haar features exploit the fact that faces have predictable contrast patterns — dark eye sockets below a bright forehead, for instance — and these rectangular comparisons are extremely fast to compute.

Not quite. Haar features are hand-designed rectangular brightness comparisons, not learned representations. Learned filters belong to CNN-based approaches.

Which dataset, released in 2015, helped drive advances in deep-learning-based face detection by including faces under extreme real-world conditions?

Correct. WIDER FACE from the Chinese University of Hong Kong contains 393,703 labeled faces across highly varied conditions, making it a benchmark for robustness in face detection research.

Not quite. WIDER FACE is the dataset specifically designed for face detection under challenging real-world conditions, released in 2015.

Lab 1 · Face Detection Mechanics

Explore how cascade classifiers and CNNs find faces — ask at least 3 questions to complete.

Your Mission

You have an AI tutor specialized in face detection algorithms. Use it to deepen your understanding of Viola-Jones cascades, integral images, CNN-based detectors, and how each approach handles hard cases like partial occlusion or extreme lighting.

Starter prompts: "Why does the integral image make Haar features fast?" · "What happens when a face is partially covered by hair or glasses?" · "How does MTCNN differ from Viola-Jones?"

AI Tutor — Face Detection

Hello! I'm your face detection tutor. We can explore anything from Viola and Jones's 2001 cascade paper to modern CNN-based detectors like MTCNN and RetinaFace. What would you like to dig into?

Lesson 2 · Module 2

Face Recognition: From Detection to Identity

How a camera moves from "there's a face" to "that's a specific person"

What actually happens between detecting a face and naming it — and what can go wrong?

In January 2020, Robert Williams, a Black man living in suburban Detroit, was arrested at his home in front of his family. The Detroit Police Department had used an automated facial recognition system — later identified as using DataWorks Plus software — that matched a blurry surveillance image of a shoplifting suspect to Williams's driver's license photo. The match was wrong. Williams spent 30 hours in jail before investigators who actually looked at the evidence acknowledged the error. It was the first documented case in the United States of a wrongful arrest driven by facial recognition.

The Two-Stage Pipeline

Face recognition is distinct from face detection. Detection answers "is there a face?" Recognition answers "whose face is it?" Modern systems accomplish recognition in two steps: embedding and matching.

In the embedding step, a neural network maps a detected face image to a point in a high-dimensional space — typically 128 to 512 numbers called a face embedding or face vector. The key property: faces of the same person cluster near each other; faces of different people are far apart. Google's FaceNet system (2015) demonstrated that 128-dimensional embeddings trained with a "triplet loss" function could achieve near-human accuracy on the LFW benchmark.
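The triplet objective can be written compactly. This is a sketch under stated assumptions: it uses squared Euclidean distance with a margin of 0.2, and the 2-dimensional vectors are toy stand-ins for real embeddings of 128 or more numbers.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther than the positive by at least `margin`."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = [0.0, 0.0]
same_person = [0.1, 0.0]
easy_negative = [1.0, 0.0]   # already far away: contributes no loss
hard_negative = [0.2, 0.0]   # too close: produces a positive loss

print(triplet_loss(anchor, same_person, easy_negative))
print(triplet_loss(anchor, same_person, hard_negative))
```

Training only moves the network when a triplet violates the margin, which is why the choice of hard triplets matters so much in practice.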

In the matching step, the system compares a query embedding to a database of stored embeddings. If the nearest neighbor in the database is close enough (within a threshold distance), the system declares a match. If not, the face is labeled unknown.
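The matching step amounts to a nearest-neighbor search with a cutoff. In the sketch below, the gallery names, the 3-dimensional vectors, and the 0.6 threshold are all hypothetical; real systems tune the threshold on validation data.

```python
import math

def identify(query, gallery, threshold=0.6):
    """Return (name, distance) for the closest stored embedding,
    or ("unknown", distance) if nothing is within the threshold."""
    best_name, best_dist = "unknown", math.inf
    for name, embedding in gallery.items():
        dist = math.dist(query, embedding)   # Euclidean distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist <= threshold:
        return best_name, best_dist
    return "unknown", best_dist

gallery = {
    "person_a": [0.1, 0.9, 0.2],
    "person_b": [0.8, 0.1, 0.5],
}

print(identify([0.12, 0.88, 0.25], gallery))  # close to person_a
print(identify([0.5, 0.5, -0.9], gallery))    # far from everyone
```

Loosening the threshold produces more matches but also more false matches, so the threshold is effectively a policy decision as much as a technical one.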

Training Data and Bias

In 2018, Joy Buolamwini at the MIT Media Lab and Timnit Gebru published "Gender Shades", an audit of three commercial face analysis systems from IBM, Microsoft, and Face++. They found that error rates for determining gender ranged from under 1% for light-skinned men to over 34% for dark-skinned women. The systems had been trained on datasets dominated by light-skinned faces, making them substantially less accurate for underrepresented groups.

A 2019 NIST study (the Face Recognition Vendor Test, or FRVT) evaluated 189 algorithms from 99 developers. It found that many commercial algorithms produced false positive rates (incorrectly matching two different people) that were 10 to 100 times higher for African-American and Asian faces compared to Caucasian faces when tested against the same government photo databases.
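To make the metric concrete, the sketch below computes a false positive rate from distances between pairs of different people. The two lists are invented numbers for illustration only and are not NIST data; the point is that a single shared threshold can behave very differently across groups.

```python
def false_positive_rate(impostor_distances, threshold):
    """Share of different-person pairs whose distance falls within the match threshold."""
    false_matches = sum(1 for d in impostor_distances if d <= threshold)
    return false_matches / len(impostor_distances)

# Hypothetical distances between pairs of DIFFERENT people in two groups.
group_a = [0.90, 0.80, 0.85, 0.70, 0.95, 0.75, 0.88, 0.92, 0.81, 0.55]
group_b = [0.90, 0.58, 0.50, 0.70, 0.45, 0.75, 0.59, 0.92, 0.52, 0.55]

threshold = 0.6
fpr_a = false_positive_rate(group_a, threshold)
fpr_b = false_positive_rate(group_b, threshold)
print(fpr_a, fpr_b)
```

When a model packs one group's embeddings closer together, different people in that group fall under the threshold more often, and the false positive rate rises even though the threshold never changed.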

The Williams Case — Technical Breakdown

The Detroit Police used a low-resolution surveillance image that was first manually enhanced (sharpened and brightened) by a human analyst before being fed to the recognition system. Image enhancement can introduce artifacts that shift embeddings away from the true face. The system returned a candidate match; a human investigator accepted it without corroborating evidence. The failure involved both algorithmic and procedural errors.

How Modern Systems Raise Accuracy

Contemporary high-accuracy systems use several techniques to reduce error. ArcFace loss (2019), developed by researchers at Imperial College London, trains embeddings with angular margins that force faces of the same person to cluster more tightly. Large, diverse training sets like MS-Celeb-1M and VGGFace2 include millions of images across many demographics. Data augmentation — artificially darkening, rotating, and occluding training images — forces networks to generalize across conditions.
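The angular margin idea can be illustrated for a single training example. This is a simplified sketch: the margin of 0.5 radians and scale of 64 follow values commonly reported for ArcFace, the function names are hypothetical, and real implementations also handle the case where the angle plus margin exceeds π.

```python
import math

def plain_logit(cos_theta, scale=64.0):
    """Ordinary scaled cosine similarity to the correct identity's class center."""
    return scale * cos_theta

def arcface_logit(cos_theta, margin=0.5, scale=64.0):
    """Add an angular margin before taking the cosine, lowering the target score."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp for numeric safety
    return scale * math.cos(theta + margin)

cos_theta = 0.8  # embedding is fairly well aligned with its identity
print(plain_logit(cos_theta))
print(arcface_logit(cos_theta))
```

Because the margin makes the correct class look worse than it really is during training, the network has to pull same-person embeddings closer together to compensate.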

Even so, performance in controlled conditions (front-facing, well-lit, high resolution) remains far better than in-the-wild conditions, which is precisely where law enforcement and surveillance systems operate.

Key Terms

Face Embedding: A compact numerical vector (typically 128–512 numbers) representing a face so that similar faces are geometrically close in the vector space.

Triplet Loss: A training objective that pulls embeddings of the same person together while pushing embeddings of different people apart, using three-image comparison groups.

False Positive Rate: The proportion of non-matching face pairs that the system incorrectly declares a match; critical in identification contexts.

FRVT: NIST's Face Recognition Vendor Test, a large-scale independent evaluation of commercial face recognition algorithms.

The Takeaway

Face recognition is not a single algorithm but a pipeline with multiple points of failure. Bias in training data, poor image quality, and inadequate human oversight each compound the risk of error — with consequences that can be severe when used for law enforcement identification.

Lesson 2 Quiz

Face Recognition: From Detection to Identity · 3 questions

What does a face embedding represent?

Correct. Face embeddings encode identity information geometrically — the distance between two embeddings reflects how similar the faces are, making matching fast and scalable to large databases.

Not quite. Embeddings are not raw pixels or simple attributes; they are learned vectors where geometric distance corresponds to face similarity.

The 2018 "Gender Shades" audit by Buolamwini and Gebru found that commercial face analysis systems had error rates as high as 34% for which group?

Correct. Dark-skinned women had the highest gender classification error rates — up to 34.7% — compared to under 1% for light-skinned men, revealing a compounded bias along both race and gender dimensions.

Not quite. The worst error rates — up to 34.7% — were for dark-skinned women, who were doubly underrepresented in the training data.

According to the 2019 NIST FRVT study, how did false positive rates for African-American and Asian faces compare to Caucasian faces in many commercial algorithms?

Correct. NIST found false positive rates 10 to 100 times higher for African-American and Asian faces in many systems, directly relevant to the risk of wrongful identification in law enforcement use.

Not quite. The NIST FRVT documented dramatically higher false positive rates — 10 to 100 times — for African-American and Asian faces, not a marginal difference.

Lab 2 · Recognition, Bias & Accuracy

Investigate the face recognition pipeline and its failure modes — ask at least 3 questions to complete.

Your Mission

Explore the technical and social dimensions of face recognition with your AI tutor. Dig into embedding spaces, triplet loss, the Gender Shades findings, NIST FRVT results, and the Robert Williams case. Ask about what "accuracy" actually means and who bears the cost of errors.

Starter prompts: "How does triplet loss training make embeddings more accurate?" · "Why did the Gender Shades study find worse results for dark-skinned women specifically?" · "What safeguards could prevent wrongful arrests from face recognition errors?"

AI Tutor — Face Recognition & Bias

Welcome to Lab 2. We can explore the technical pipeline of face recognition — embeddings, matching thresholds, loss functions — or dig into real-world bias audits like Gender Shades and the NIST FRVT. We can also discuss the policy implications of the Robert Williams wrongful arrest case. What interests you?

Lesson 3 · Module 2

Object Detection: How AI Reads a Scene

YOLO, self-driving cars, and the art of recognizing a thousand things at once

How does an AI system identify every object in a scene simultaneously — and fast enough to react?

On May 7, 2016, a Tesla Model S operating in Autopilot mode struck a tractor-trailer crossing a Florida highway. The vehicle's camera-based system, using Mobileye's EyeQ chip and software, failed to distinguish the white side of the truck from a bright sky. The National Highway Traffic Safety Administration investigation concluded that the system was designed for highway lane-keeping, not full obstacle detection, but the crash brought intense scrutiny to exactly how well AI systems could identify objects under real-world conditions.

From Classification to Detection

Image classification answers "what is in this image?" Object detection answers "what is in this image, where is it, and are there multiple instances?" These are fundamentally different problems. Classification produces one label; detection produces a list of bounding boxes, each with a class label and a confidence score.
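The difference in output shape can be shown with a small data structure. The `Detection` class, its field names, and the example values are hypothetical and not taken from any particular library.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str          # predicted class
    confidence: float   # between 0 and 1
    cx: float           # box center x
    cy: float           # box center y
    w: float            # box width
    h: float            # box height

    def corners(self):
        """Convert center format to (x1, y1, x2, y2) corner format."""
        return (self.cx - self.w / 2, self.cy - self.h / 2,
                self.cx + self.w / 2, self.cy + self.h / 2)

# Classification: one label for the whole image.
classification = "street scene"

# Detection: a list with one entry per object found.
detections = [
    Detection("car", 0.9, 50, 40, 20, 10),
    Detection("person", 0.75, 120, 60, 8, 30),
]
print(detections[0].corners())  # (40.0, 35.0, 60.0, 45.0)
```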

Early deep learning detection systems like R-CNN (Region-based CNN, 2014) worked by first generating ~2,000 candidate regions using a separate algorithm, then running a CNN classifier on each one. The result was accurate but slow — about 49 seconds per image on a GPU, making it useless for real-time applications.

YOLO: You Only Look Once (2016)

Joseph Redmon and collaborators introduced YOLO in 2016 with a radical reframing. Instead of examining candidate regions sequentially, YOLO divides the image into a grid (7×7 cells in the original paper; later versions use finer grids such as 13×13) and has each cell simultaneously predict bounding boxes, confidence scores, and class probabilities. The entire image is processed in a single forward pass through the network, hence "You Only Look Once."
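One way to picture the grid is to ask which cell is responsible for an object. In YOLO-style detectors that is the cell containing the box center. The sketch below assumes a 448×448 input and a 7×7 grid, as in the original YOLO paper; the function name is hypothetical.

```python
def owning_cell(cx, cy, img_w, img_h, grid):
    """Return (row, col) of the grid cell that contains the box center."""
    col = min(int(cx / img_w * grid), grid - 1)   # clamp centers on the far edge
    row = min(int(cy / img_h * grid), grid - 1)
    return row, col

# Each cell is 64 pixels wide on a 448-pixel image with a 7x7 grid.
print(owning_cell(100, 300, 448, 448, grid=7))  # (4, 1)
```

An object that spans many cells is still assigned to just one of them, which keeps the training target unambiguous.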

The original YOLO ran at 45 frames per second while detecting 20 object classes. Against R-CNN's roughly 49 seconds per image, that is a speedup of about 2,200× (49 s × 45 frames/s ≈ 2,205). Subsequent versions, including YOLOv3 (2018), YOLOv5 (2020), and YOLOv8 (2023), improved accuracy significantly while maintaining real-time performance. YOLOv8 can detect the 80 MS-COCO categories, from people and cars to toothbrushes and refrigerators, at over 160 FPS on modern hardware.

The MS-COCO Dataset

Microsoft's Common Objects in Context (COCO), released in 2014, contains 328,000 images with over 2.5 million labeled object instances across 80 categories. Unlike ImageNet, COCO images depict objects in realistic, cluttered scenes rather than isolated against clean backgrounds. It remains the primary benchmark for object detection research.

How YOLO Makes Predictions

Each grid cell predicts several candidate boxes, each defined by center coordinates, width, height, and an "objectness" score indicating how likely the cell contains an object at all. Class probabilities condition on there being an object. At inference time, boxes with low objectness scores are discarded, and a process called Non-Maximum Suppression (NMS) removes redundant overlapping boxes, keeping only the highest-confidence prediction for each object.
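The filtering and suppression steps can be sketched as a greedy loop. This is a minimal class-agnostic version with assumed thresholds (0.25 for score, 0.5 for overlap); production detectors typically run it per class and with tuned values.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, score_threshold=0.25, iou_threshold=0.5):
    """Keep the best-scoring box in each cluster of overlapping boxes.

    `detections` is a list of (box, score) pairs.
    """
    # Drop low-scoring boxes, then visit the rest from most to least confident.
    candidates = sorted((d for d in detections if d[1] >= score_threshold),
                        key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not heavily overlap one already kept.
        if all(iou(box, kept_box) < iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept

detections = [
    ((0, 0, 10, 10), 0.9),     # best box for object 1
    ((1, 1, 11, 11), 0.8),     # near-duplicate of the first box
    ((50, 50, 60, 60), 0.7),   # a separate object
    ((0, 0, 5, 5), 0.1),       # below the score threshold
]
print(nms(detections))
```

Only the 0.9 and 0.7 boxes survive: the 0.8 box overlaps the first too heavily, and the 0.1 box is discarded before suppression even starts.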

The system is trained end-to-end: the loss function simultaneously penalizes localization error (wrong box position), confidence error (wrong objectness score), and classification error (wrong class label). This joint optimization is what lets the network balance all three tasks efficiently.

Real-World Performance Gaps

Benchmark accuracy on COCO does not always translate to the real world. In 2018, researchers from the University of Michigan, the University of Washington, UC Berkeley, and collaborating institutions showed that a few stickers placed on a stop sign (adversarial perturbations designed to look like ordinary graffiti) caused a road-sign classifier to read the sign as a speed limit sign in 100% of lab test images. An autonomous driving system relying on such a model could fail to stop.

Similarly, domain shift — training on images from one country's streets and deploying in another with different road markings, vegetation, and traffic patterns — can reduce detection accuracy substantially. The Tesla 2016 crash illustrated a related problem: the system was optimized for common scenarios; the unusual geometry of a crossing trailer produced a failure mode not represented in its training data.

Key Terms

Bounding Box: A rectangle defined by center coordinates plus width and height, used to localize an object within an image.

Confidence Score: A number between 0 and 1 representing how certain the model is that a predicted box contains an object of the stated class.

Non-Maximum Suppression: A post-processing step that removes redundant overlapping bounding box predictions, keeping only the most confident one per object.

Domain Shift: Degradation in model performance when deployment data differs significantly from training data in distribution.

Why This Matters

Object detection is the perceptual backbone of autonomous vehicles, warehouse robots, medical imaging analysis, and retail checkout systems. The gap between benchmark performance and real-world reliability is not a minor implementation detail — it determines whether these systems are safe to deploy.

Lesson 3 Quiz

Object Detection: How AI Reads a Scene · 3 questions

What fundamental architectural choice allows YOLO to be dramatically faster than R-CNN?

Correct. The "single forward pass" design — dividing the image into a grid and predicting all boxes simultaneously — is what enables YOLO's real-time performance. R-CNN ran a CNN ~2,000 times per image; YOLO runs once.

Not quite. YOLO's speed comes from its architecture: one forward pass for the entire image rather than running a classifier on thousands of candidate regions.

What is Non-Maximum Suppression used for in object detection?

Correct. NMS is a clean-up step. When multiple overlapping boxes all predict "car," NMS keeps only the one with the highest confidence score and discards the rest, preventing duplicate detections.

Not quite. NMS is a post-processing step that removes duplicate overlapping boxes — it doesn't touch pixel values or confidence score magnitudes.

A 2018 study found that adversarial stickers on stop signs caused a road-sign classifier to misread them as speed limit signs in what percentage of lab test images?

Correct. The adversarial stickers caused misclassification in 100% of lab test images: a complete failure caused by changes that a human would dismiss as ordinary graffiti, highlighting a critical robustness gap.

Not quite. The adversarial attack succeeded in 100% of lab test images, completely fooling the model while looking like harmless graffiti to human observers.

Lab 3 · Object Detection in the Wild

Explore YOLO architecture, real-world failure modes, and deployment tradeoffs — ask at least 3 questions.

Your Mission

Use your AI tutor to explore how object detection systems work in deployment. Investigate YOLO's grid prediction mechanism, NMS, the COCO benchmark, and why real-world performance can diverge from test scores. Consider the autonomous driving context and adversarial attacks.

Starter prompts: "How does YOLO decide which grid cell 'owns' an object that spans multiple cells?" · "What is domain shift and how might it affect a self-driving car deployed in a new country?" · "Why are adversarial attacks on object detectors considered a safety risk?"

AI Tutor — Object Detection

Welcome to Lab 3. I can help you explore object detection systems — from YOLO's architecture and the COCO benchmark to adversarial vulnerabilities and the real-world safety implications for autonomous vehicles. What would you like to investigate?

Lesson 4 · Module 2

Computer Vision in Everyday Life

Smartphones, stores, airports, and the surveillance infrastructure hiding in plain sight

Where does face and object recognition actually appear in your life — and what are the rules governing it?

In 2018, Amazon launched Amazon Go, its cashierless grocery format. Cameras in the ceiling track every customer using computer vision as they pick up products, updating a virtual cart in real time. No checkout. No cashier. The system combines object detection (identifying products), person re-identification (tracking individuals across camera views), and inventory management into a seamless retail experience — or, depending on your perspective, a comprehensive surveillance infrastructure operated by a private company inside a grocery store.

Face Unlock on Smartphones

Apple's Face ID, introduced with the iPhone X in November 2017, was the first mass-market implementation of 3D structured-light face recognition for device unlock. The system projects 30,000 invisible infrared dots onto the face, reads their distortion pattern with an infrared camera, and builds a depth map. A neural network converts this into a mathematical representation that updates over time as the user's face changes (glasses, haircut, aging). Apple reports a false accept rate of approximately 1 in 1,000,000 — compared to 1 in 50,000 for Touch ID fingerprints.

Android's implementation has varied. Some manufacturers use a 2D selfie camera without infrared depth sensing, which is significantly less secure. Samsung's iris recognition (Galaxy S8, 2017) uses near-infrared light to capture the unique patterns of the iris, but in 2017 the Chaos Computer Club demonstrated that it could be fooled by a high-resolution printed photo of the eye with a contact lens placed over it.

Airports and Border Control

U.S. Customs and Border Protection's Biometric Entry-Exit Program began large-scale facial recognition deployment at airports in 2017. By 2023, CBP reported using facial recognition at over 200 airports and land border crossings, processing over 300 million traveler comparisons. The system compares a live photo taken at boarding against passport and visa photos in government databases.

In 2018, a passenger traveling through Washington Dulles airport was identified as not being the person on the passport he was carrying, the first documented case of facial recognition catching a passport impostor at a U.S. airport. CBP has pointed to this as evidence of the system's effectiveness. Critics note that the same technology, when applied to the much larger population of legitimate travelers, must also be evaluated by its false positive rate: the share of travelers incorrectly flagged for additional screening.
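The critics' point is a matter of scale, which a line of arithmetic makes visible. The error rates below are hypothetical placeholders, not CBP figures; only the 300 million comparison count comes from the program description above.

```python
def expected_false_flags(comparisons, false_flag_rate):
    """Expected number of legitimate travelers wrongly flagged."""
    return comparisons * false_flag_rate

comparisons = 300_000_000

# Hypothetical per-comparison error rates, for illustration only.
for rate in (1e-3, 1e-4, 1e-5):
    flagged = expected_false_flags(comparisons, rate)
    print(f"rate {rate:.0e}: about {flagged:,.0f} travelers wrongly flagged")
```

Even an error rate that sounds negligible per comparison translates into thousands of affected people at this volume, which is why a single success story says little about overall performance.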

Clearview AI — 2020

In January 2020, the New York Times revealed that Clearview AI had scraped over 3 billion facial images from public websites — Facebook, Instagram, LinkedIn, news sites — without consent and built a face recognition product sold to law enforcement. By 2021, the company reported over 3,100 law enforcement agency customers. Canadian, Australian, British, and French regulators found Clearview in violation of privacy laws. The company's existence demonstrated that the barrier to building a population-scale face recognition database had dropped to near zero for any well-funded actor willing to scrape public data.

Retail and Surveillance

Beyond Amazon Go, computer vision appears in retail loss prevention (face matching against shoplifting databases), smart city infrastructure (London had an estimated 691,000 CCTV cameras as of 2020, counting both public and private systems), and workplace monitoring (systems that track employee attention at computer workstations using webcams). In the United States, over 20 cities, including San Francisco (2019), Boston (2020), and Portland, Oregon (2020), have passed ordinances banning government use of facial recognition technology, while federal legislation remains absent.

The European Union's AI Act, first proposed in 2021, classified real-time biometric identification in public spaces as a "prohibited AI practice" with narrow exceptions for law enforcement. The final regulation, adopted in 2024, became the world's first comprehensive AI law to directly restrict computer vision applications.

Key Terms

Structured Light: A 3D sensing technique that projects a known pattern of light and infers depth from how the pattern deforms on a surface.

Person Re-identification: Tracking the same individual across multiple non-overlapping camera views without continuous line-of-sight coverage.

Biometric Entry-Exit: CBP's airport and border program that matches live traveler photos against government identity document databases using face recognition.

EU AI Act: The European Union's 2024 regulation that categorizes AI applications by risk level and bans real-time biometric identification in public spaces with limited exceptions.

The Bigger Picture

The same computer vision capabilities that let your phone unlock with a glance also enable mass surveillance at scale. The technical systems are not inherently good or bad — but their deployment context, data policies, error rates, and oversight structures determine whether they protect or erode rights. Understanding the technology is prerequisite to evaluating those choices.

Lesson 4 Quiz

Computer Vision in Everyday Life · 3 questions

Apple's Face ID (iPhone X, 2017) uses which technique to build a 3D face map?

Correct. The TrueDepth camera system projects a grid of infrared dots and uses their deformation to compute a precise depth map, enabling 3D face recognition significantly more secure than 2D approaches.

Not quite. Face ID uses structured light — 30,000 infrared dots — rather than stereo cameras, ultrasound, or purely software-based depth estimation.

What made Clearview AI's approach to building a facial recognition database controversial?

Correct. Clearview scraped public websites without consent — a practice ruled in violation of privacy laws by multiple national regulators — demonstrating how easily a population-scale face database could be assembled from publicly visible images.

Not quite. Clearview's controversy centered on scraping billions of images from public-facing websites without consent, then selling access to law enforcement without public knowledge.

Which of the following correctly describes a provision of the EU AI Act (adopted 2024) regarding computer vision?

Correct. The EU AI Act's prohibition on real-time biometric identification in public spaces — with narrow exceptions — is one of the world's most significant regulatory restrictions on public surveillance computer vision.

Not quite. The EU AI Act specifically targets real-time public space biometric identification as a prohibited practice — it doesn't mandate central databases or ban all commercial biometric applications.

Lab 4 · Vision Systems in Daily Life

Explore smartphones, airports, retail surveillance, and policy — ask at least 3 questions to complete.

Your Mission

You're in conversation with an AI tutor focused on real-world deployments of computer vision. Explore how Face ID works technically, how CBP's biometric program operates at airports, what Amazon Go's tracking system actually does, and what the EU AI Act and city-level bans mean for how this technology is governed.

Starter prompts: "How does Amazon Go track a shopper who picks up and puts back a product multiple times?" · "What are the privacy tradeoffs of CBP's airport facial recognition program?" · "Why have some U.S. cities banned facial recognition while others expanded it?"

AI Tutor — Vision in Daily Life

Hello! Lab 4 covers how computer vision has been deployed in the real world — from Face ID on your phone to Clearview AI, CBP border control, Amazon Go, and the policy landscape around public surveillance. What would you like to explore?

Module 2 Test

Cameras That Recognize Faces and Objects · 15 questions · Pass at 80%

1. In what year did Viola and Jones publish their real-time face detection algorithm?

Correct. The Viola-Jones detector was published in 2001 and became the foundation for real-time face detection in consumer cameras throughout the 2000s.

The Viola-Jones paper was published in 2001, enabling real-time face detection on consumer hardware of that era.

2. What data structure does Viola-Jones use to compute rectangular pixel sums in exactly four operations?

Correct. The integral image (summed area table) precomputes cumulative pixel sums so that any rectangular region sum requires only four lookups, making Haar feature computation extremely fast.

The integral image is the data structure that enables constant-time rectangular sum computation, a key speed component of Viola-Jones.

3. Google's FaceNet (2015) demonstrated face recognition using which type of loss function?

Correct. FaceNet's triplet loss trains the network using groups of three images — an anchor, a positive (same person), and a negative (different person) — pulling the anchor closer to the positive while pushing the negative away.

FaceNet used triplet loss, which trains embeddings by comparing anchor-positive-negative image triples to enforce geometric clustering by identity.

4. The NIST FRVT 2019 study found that false positive rates for African-American faces in many commercial algorithms were how much higher than for Caucasian faces?

Correct. NIST's large-scale independent evaluation found 10 to 100 times higher false positive rates for African-American faces — a finding with direct relevance to wrongful identification risk in law enforcement use.

NIST found 10 to 100 times higher false positive rates for African-American faces — a dramatic disparity documented in their 2019 FRVT report.

5. The wrongful arrest of Robert Williams in Detroit (2020) involved which technology?

Correct. The Detroit Police used facial recognition software that incorrectly matched a blurry surveillance image to Williams's driver's license photo. He was arrested and held for 30 hours before the error was acknowledged.

The Williams case involved facial recognition — a false match between a surveillance image and his driver's license photo — documented as the first known wrongful U.S. arrest driven by facial recognition.

6. ArcFace loss (2019) improves face recognition embedding quality by doing what?

Correct. ArcFace introduces additive angular margin loss that enforces tighter intra-class compactness and larger inter-class separability in the hyperspherical embedding space.

ArcFace's key innovation is the angular margin loss that makes same-identity embeddings cluster more tightly — improving discrimination without requiring larger models.
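The additive angular margin can be sketched as follows: the angle between an embedding and its true class center is increased by a margin before the usual softmax cross-entropy is applied. This is a simplified illustration, not a reference implementation. The function names, the 2-D toy vectors, and the scale and margin defaults are assumptions for the example, and the sketch skips the handling production code needs when the angle plus the margin exceeds π.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def arcface_logits(embedding, class_weights, target, scale=64.0, margin=0.5):
    """Return scaled cosine logits, adding `margin` (radians) to the angle
    between the embedding and the target class center only."""
    e = normalize(embedding)
    logits = []
    for j, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(e, normalize(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
        theta = math.acos(cos_theta)
        if j == target:
            theta += margin  # the additive angular margin
        logits.append(scale * math.cos(theta))
    return logits


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


# Hypothetical two-identity setup: class 0 is the true identity.
class_weights = [[1.0, 0.0], [0.0, 1.0]]
embedding = [1.0, 0.8]  # closer to class 0, but not by much

plain = arcface_logits(embedding, class_weights, target=0, margin=0.0)
with_margin = arcface_logits(embedding, class_weights, target=0, margin=0.5)

loss_plain = cross_entropy(plain, 0)
loss_margin = cross_entropy(with_margin, 0)
```

With no margin this borderline sample is already classified correctly and the loss is near zero. With the margin the target logit drops and the loss rises, so training keeps pulling the embedding toward its class center.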

7. What is the primary benchmark dataset used to evaluate object detection algorithms?

Correct. Microsoft's Common Objects in Context (COCO) dataset, with 328,000 images and 2.5 million labeled object instances across 80 categories in realistic cluttered scenes, is the primary benchmark for object detection research.

MS-COCO is the standard object detection benchmark — ImageNet is used for image classification, and WIDER FACE is for face detection specifically.

8. What did the original YOLO paper (2016) claim as its processing speed on a GPU?

Correct. The original YOLO ran at 45 fps while detecting 20 object classes. Compared with R-CNN's roughly 49 seconds per image, that is about 2,200 times faster: 45 frames per second × 49 seconds ≈ 2,205 frames processed in the time R-CNN handles one.

Original YOLO achieved 45 fps — a dramatic leap over R-CNN's 49 seconds per image that made real-time detection practical.

9. What is "domain shift" in the context of object detection?

Correct. Domain shift occurs when the statistical distribution of deployment data differs from training data — e.g., training on U.S. roads and deploying in Japan — causing significant accuracy degradation.

Domain shift refers to the performance drop when deployment conditions (weather, geography, camera type) differ from training data distribution.

10. Apple's Face ID reports a false accept rate of approximately 1 in how many attempts?

Correct. Apple reports Face ID's false accept rate at approximately 1 in 1,000,000, compared with 1 in 50,000 for Touch ID fingerprints. The difference comes from the 3D structured-light depth map, which captures far more unique information.

Apple's stated false accept rate for Face ID is 1 in 1,000,000, significantly lower than Touch ID's 1 in 50,000.

11. The "Gender Shades" paper was published in what year, and by researchers at which institution?

Correct. Joy Buolamwini and Timnit Gebru published "Gender Shades" in 2018 through the MIT Media Lab, auditing commercial face analysis systems from IBM, Microsoft, and Face++.

Gender Shades was published in 2018 by Joy Buolamwini (MIT Media Lab) and Timnit Gebru, auditing gender classification systems from three major tech companies.

12. How does Clearview AI's face recognition database primarily differ from government biometric databases?

Correct. Clearview scraped ~3 billion images from Instagram, Facebook, LinkedIn, and news sites without consent — a fundamentally different sourcing model from government passport or criminal justice databases that rely on official collection processes.

Clearview's defining characteristic is building its database through mass scraping of public web images without consent, rather than official identity document collection.

13. What does the EU AI Act (2024) classify real-time biometric identification in public spaces as?

Correct. The EU AI Act places real-time public biometric identification in the "prohibited practices" category — the most restrictive tier — with only narrow exceptions for law enforcement under judicial authorization.

The EU AI Act classifies real-time public biometric identification as a prohibited AI practice — the strictest category — with limited law enforcement exceptions.

14. In the May 2016 Tesla Autopilot fatal crash in Florida, what specifically caused the camera system to fail to detect the obstacle?

Correct. NHTSA's investigation found the camera system failed to distinguish the white trailer against the bright sky — a failure of contrast-based visual discrimination under real-world lighting conditions.

The NHTSA investigation found the system failed to distinguish the white trailer side from the bright sky — a classic real-world vision challenge not adequately handled by the deployed system.

15. What was the Fujifilm FinePix F10 notable for when it was released in 2004?

Correct. The FinePix F10 (2004) was the first consumer camera to ship with face detection built in, using a variant of Viola-Jones technology. Canon and Nikon followed with their own implementations in 2007.

The Fujifilm FinePix F10 holds the distinction of being the first mass-market camera with built-in face detection, launching the era of automatic face-finding in consumer photography.