How AI Sees Your World · Introduction

Machines Have Learned to Look — and the World Will Never Be the Same

Why understanding computer vision is no longer optional for anyone who lives in the modern world.

In 1839, when Louis Daguerre announced the daguerreotype to the French Academy of Sciences, the painter Paul Delaroche reportedly declared, "From today, painting is dead." He was wrong about painting — but profoundly right that something irreversible had happened to the relationship between human beings and visual reality. Within a decade, photographic studios had opened in every major city on earth. Within two, photography had transformed journalism, science, crime investigation, warfare, and personal identity. The shift was not gradual. It cascaded.

What is cascading now is something of comparable scale. In 2012, a neural network called AlexNet, built by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton at the University of Toronto, cut the ImageNet image-classification error rate by roughly 40 percent in a single competition. That result triggered a decade of relentless acceleration: self-driving car programs at Google (later spun out as Waymo) and Tesla; medical imaging systems that have matched or outperformed radiologists on specific cancer-detection tasks in published studies; real-time face recognition deployed at airports, stadiums, and street corners across dozens of countries. Cameras no longer merely record. They interpret, classify, and decide.

This course exists to make that machinery legible to you. We will not pretend that computer vision is magic, nor that its consequences are uniformly good. We will look at what it actually does — the mathematics it runs, the training data it depends on, the documented cases where it has worked brilliantly and where it has failed with serious human consequences. Four lessons, four labs, one honest goal: you leave here seeing AI vision systems the way an informed adult should, not the way a press release wants you to.

How AI Sees Your World · Lesson 1

What Is Computer Vision, Really?

From pixels to predictions — the chain of computation that lets machines name what they see.

If a camera is just recording numbers, how does a machine turn numbers into understanding?

In October 2012, the ImageNet Large Scale Visual Recognition Challenge published its results. In the two previous competitions, the best systems had trimmed classification error by only a couple of points per cycle: painstaking, incremental progress. Then AlexNet appeared. Its top-5 error rate was 15.3 percent, against the runner-up's 26.2 percent. The gap was so large that several judges reportedly assumed a reporting error at first. There had been no error. A deep convolutional neural network trained on two NVIDIA GTX 580 GPUs had simply learned to see in a way that handcrafted algorithms could not match. The field did not gradually absorb this result. It pivoted overnight.

What made AlexNet different was not cleverness about vision; it knew nothing about eyes or optics or the visual cortex. It had been shown roughly 1.2 million labeled photographs and adjusted roughly 60 million numerical weights until its error on held-out images dropped to a historic low. The "understanding" it achieved was entirely statistical: certain patterns of pixel values reliably co-occurred with certain labels. That is still, fundamentally, what every modern computer vision system is doing, including the one that unlocks your phone with your face, reads your license plate at the highway toll gate, and monitors whether you are wearing a hard hat on a construction site.

1. Images Are Just Numbers

Every digital image is a grid of numbers. A standard color photograph is three overlapping grids — one for red intensity, one for green, one for blue — where each cell holds a value between 0 and 255. A 1080p frame contains 1,920 × 1,080 × 3 = approximately six million numbers. A computer vision system receives these numbers as its raw input. It has no eyes. It has no intuition. It has arithmetic at enormous scale.
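To make the arithmetic above concrete, here is a minimal sketch in plain Python. It builds a tiny 2 × 2 color image as nested lists and counts the values in one 1080p frame. The pixel values and the helper name count_values are invented for this illustration.

```python
# A tiny 2x2 RGB image: each pixel is a (red, green, blue) triple of 0-255 values.
# The pixel values are arbitrary, chosen only for illustration.
tiny_image = [
    [(255, 0, 0), (0, 255, 0)],      # top row: a red pixel, a green pixel
    [(0, 0, 255), (128, 128, 128)],  # bottom row: a blue pixel, a gray pixel
]

def count_values(width, height, channels=3):
    """Number of individual numbers needed to store one frame."""
    return width * height * channels

# Flatten the tiny image into the plain list of numbers a model would receive.
flat = [value for row in tiny_image for pixel in row for value in pixel]

print(len(flat))                  # 12 numbers for a 2x2 color image
print(count_values(1920, 1080))   # 6220800 numbers for one 1080p frame
```

The flattened list is all the system ever receives: no objects, no shapes, only intensities.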

This is important to hold onto. When a system "recognizes" your face, it has not done anything resembling what you do when you recognize a friend across a room. It has computed a very high-dimensional function over a grid of pixel values and produced a numerical output that was trained to correspond to an identity label. The sophistication is real; the resemblance to human vision is largely metaphorical.

2. The Classical Pipeline vs. Deep Learning

Before 2012, most computer vision systems were built around handcrafted features — mathematical operations that human engineers designed to detect edges, corners, textures, and gradients. The SIFT (Scale-Invariant Feature Transform) algorithm, published by David Lowe in 1999 and refined through the early 2000s, became the dominant approach. It could find and match keypoints across images even under changes in scale, rotation, and lighting. It was elegant, interpretable, and slow to improve past a certain ceiling.

Deep learning replaced this with learned features. Instead of engineers specifying what to look for, a convolutional neural network (CNN) learns from data which mathematical filters are useful. Early layers tend to learn edge detectors and color blobs. Middle layers combine those into shapes and textures. Later layers combine those into object parts. The final layer produces a probability distribution over categories. Engineers still choose the network's overall architecture, but none of these features is hand-designed; they emerge from optimization against millions of labeled examples.

The practical consequence is that modern systems are extraordinarily capable on the distributions they were trained on, and can fail in startling ways on inputs that fall outside that distribution. A 2019 study by MIT researchers found that a state-of-the-art ImageNet classifier could be fooled by simply rotating a test image by 45 degrees, because the training set had not adequately represented tilted objects. Human vision is far more tolerant of such changes.

Why This Matters

Distribution shift — the gap between training data and real-world deployment conditions — is the single most common source of computer vision failures in production systems. Understanding this concept explains most of the surprising failures you read about in the news.
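A toy example can make distribution shift tangible. The sketch below is not a real vision model: it assumes a deliberately simple "classifier" that separates snow from grass using a single brightness threshold learned from midday photos, then evaluates it on the same scenes photographed at dusk. All numbers and function names are hypothetical.

```python
def fit_threshold(samples):
    """Pick the midpoint between the two class means (a deliberately simple 'model')."""
    snow = [b for b, label in samples if label == "snow"]
    grass = [b for b, label in samples if label == "grass"]
    return (sum(snow) / len(snow) + sum(grass) / len(grass)) / 2

def accuracy(samples, threshold):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(("snow" if b > threshold else "grass") == label for b, label in samples)
    return correct / len(samples)

# Hypothetical mean-brightness values (0-255) with labels, photographed at midday.
train = [(220, "snow"), (210, "snow"), (230, "snow"),
         (90, "grass"), (100, "grass"), (80, "grass")]
# The same scenes at dusk: every image is 100 units darker (clipped at 0).
dusk = [(max(b - 100, 0), label) for b, label in train]

threshold = fit_threshold(train)
print(threshold)                   # 155.0
print(accuracy(train, threshold))  # 1.0
print(accuracy(dusk, threshold))   # 0.5
```

Nothing about the model changed between the two evaluations. Only the input distribution moved, and accuracy fell from perfect to coin-flip.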

3. The Three Core Tasks

Computer vision encompasses many sub-problems, but three are foundational and appear in nearly every application you will encounter in this course:

Classification: Given an image, assign it to one of a fixed set of categories. "This image contains a cat." The simplest task, but the foundation for everything else. AlexNet was a classifier.

Detection: Locate and classify multiple objects within an image, drawing bounding boxes around each. "There is a person at coordinates (120, 45) and a car at (340, 200)." Used in self-driving vehicles and security cameras.

Segmentation: Assign every pixel in the image to a category. "These 14,000 pixels are road. These 3,200 pixels are pedestrian." Required for surgical robots and autonomous vehicle path planning.
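One way to keep the three tasks apart is to look at the shape of what each one returns. The sketch below uses hypothetical class names, labels, and coordinates; real libraries use their own output formats.

```python
from dataclasses import dataclass

@dataclass
class Classification:
    label: str            # one label for the whole image
    confidence: float

@dataclass
class Detection:
    label: str
    confidence: float
    box: tuple            # (x_min, y_min, x_max, y_max) in pixels

# Segmentation: one category per pixel, shown here for a tiny 3x4 image.
ROAD, PEDESTRIAN = 0, 1
segmentation_mask = [
    [ROAD, ROAD, PEDESTRIAN, ROAD],
    [ROAD, ROAD, PEDESTRIAN, ROAD],
    [ROAD, ROAD, ROAD, ROAD],
]

classification = Classification(label="cat", confidence=0.97)
detections = [
    Detection("person", 0.91, (120, 45, 180, 210)),
    Detection("car", 0.88, (340, 200, 520, 310)),
]
pedestrian_pixels = sum(row.count(PEDESTRIAN) for row in segmentation_mask)
print(classification.label, len(detections), pedestrian_pixels)  # cat 2 2
```

Classification yields one answer per image, detection yields a list of boxes, and segmentation yields a full grid the same size as the image.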

4. Training Data Is the Foundation

No component of a computer vision system matters more than the data it was trained on. ImageNet — the dataset that enabled AlexNet — was assembled by Fei-Fei Li and her team at Princeton and Stanford between 2006 and 2009. It required three years, crowdsourced labeling via Amazon Mechanical Turk, and careful curation of 14 million images across 21,841 categories. It was an unprecedented act of data engineering, and it made the deep learning revolution in vision possible.

The composition of training data determines what a system can see — and what it cannot. A classifier trained only on photographs taken in North America will perform worse on images from Southeast Asia or West Africa, because the visual environments, skin tones, clothing, architecture, and lighting conditions differ. This is not a hypothetical. A 2018 study by Joy Buolamwini and Timnit Gebru (Gender Shades) demonstrated that three commercial facial analysis systems from Microsoft, IBM, and Face++ had error rates on dark-skinned women that were up to 34 percentage points higher than on light-skinned men. The gap was directly traceable to underrepresentation in training data.

Core Principle

A computer vision system can only generalize to what its training data represents. Every application of CV in the real world — medical, legal, commercial — inherits the limitations and biases of the dataset it was built on. This is not a flaw to be patched; it is a structural property of the paradigm.

5. What "Seeing" Actually Means Here

It is worth being precise about what computer vision systems do and do not do. They do not perceive. They do not understand context in the way humans do. They do not know that a photograph of a dog is an image of a living creature with experiences. They perform extremely high-dimensional pattern matching, and they do it fast enough and accurately enough that it is genuinely useful — and genuinely dangerous when misapplied or misunderstood.

The vocabulary of vision — "sees," "recognizes," "understands" — is a convenient shorthand that can mislead. A facial recognition system does not recognize you the way your mother does. It computes a feature vector from your face image and measures its distance from stored vectors in a database. If the distance is below a threshold, it returns a match. The threshold is a design choice. The feature vectors are trained artifacts. The "recognition" is an engineering output, not a cognitive act. Keeping this distinction clear will make you a much better reader of claims about what AI vision systems can and cannot do.
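The matching step described above can be sketched in a few lines. This is an illustration under simplifying assumptions: the three-number feature vectors, the names in the database, and the threshold values are all invented, and production systems use much longer vectors produced by a trained network.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_match(probe, database, threshold):
    """Return the closest enrolled identity if it is within the threshold, else None."""
    name, vector = min(database.items(),
                       key=lambda item: euclidean_distance(probe, item[1]))
    return name if euclidean_distance(probe, vector) < threshold else None

# Hypothetical 3-number feature vectors for two enrolled people.
database = {"alice": [0.1, 0.9, 0.3], "bob": [0.8, 0.2, 0.5]}
probe = [0.2, 0.8, 0.3]   # features computed from a new photo

print(find_match(probe, database, threshold=0.5))   # alice
print(find_match(probe, database, threshold=0.1))   # None
```

The same probe photo is a match or a non-match depending only on the threshold, which is why that number is a policy decision as much as a technical one.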

Lesson 1 Quiz

Five questions — select the best answer for each.

1. A digital color image is fundamentally stored as:

Correct. Every digital image is a numerical array. Color images use three channels (RGB), each holding pixel intensity values from 0 to 255. This is the raw material a vision system receives.

Not quite. Images are stored as raw pixel intensity numbers — not shape descriptions, edge summaries, or text labels. Those are derived representations computed afterward.

2. What was historically significant about AlexNet's performance at the 2012 ImageNet challenge?

Correct. The gap between AlexNet and its nearest competitor was nearly 11 percentage points — far beyond incremental progress. Researchers recognized immediately that the field had changed.

Review the lesson. AlexNet's significance was the scale of its performance gap over prior approaches, not any absolute milestone of perfection or "first classification."

3. In the context of computer vision, "distribution shift" refers to:

Correct. Distribution shift is the core reason vision systems fail in the real world. A model trained on one visual distribution — lighting, geography, demographics — may perform poorly when deployed in a different environment.

Distribution shift means the real-world data doesn't match training data. It is the most common cause of production CV failures, and it explains many high-profile mistakes.

4. The Gender Shades study by Buolamwini and Gebru (2018) found that commercial facial analysis systems had error rates up to 34 percentage points higher for dark-skinned women than for light-skinned men. The primary cause identified was:

Correct. The disparity traced directly to training data composition. Systems optimized on datasets skewed toward lighter-skinned subjects generalized poorly to underrepresented groups.

The study attributed the disparity to training data bias — not hardware, not intentional design, and not the choice of algorithm. Data composition is the root cause.

5. Which of the following most accurately describes what a facial recognition system does when it "recognizes" a person?

Correct. Facial recognition is distance computation in a high-dimensional feature space. The threshold for a "match" is a design parameter: because a match is declared when the distance falls below the threshold, raising it reduces false negatives but increases false positives.

Facial recognition has no perception or intuition. It computes numerical feature vectors and measures distances between them. The match threshold is a human-set parameter with real consequences for error rates.

Lab 1: Interrogating the Pixel

A conversation-based lab · Lesson 1 · How AI Sees Your World

What You'll Do

You have an AI lab assistant who can help you probe the core concepts from Lesson 1: what images actually are numerically, how convolutional networks differ from classical feature-engineering approaches, and what "distribution shift" means in concrete terms. Ask questions, push back on answers, and try to connect these ideas to systems you have encountered in the real world.

Complete at least three substantive exchanges to finish this lab.

Try starting with: "Explain distribution shift to me as if I've never heard the term" — then dig into a follow-up based on the answer you get.

AI Lab Assistant

Lesson 1 · Computer Vision Fundamentals

Welcome to Lab 1. I'm your lab assistant for the fundamentals of computer vision — pixels, convolutional networks, training data, and the gap between what machines compute and what humans perceive. What would you like to explore first?

How AI Sees Your World · Lesson 2

How Machines Learn to See: Training, Layers, and What Goes Wrong

Inside the black box — the mechanics of how convolutional neural networks build visual knowledge from data.

When a neural network "learns" to recognize a cat, what exactly is changing inside it?

In 2015, Google Photos launched with automatic image labeling. Within weeks, software engineer Jacky Alciné discovered that the system had labeled photographs of him and a Black friend as "gorillas." Google's response was swift: the company removed the label from its classifier's output entirely, an acknowledgment that the underlying problem was too deep to fix quickly. In 2023, eight years later, independent testing showed that "gorillas," "chimps," and related terms were still blocked from Google Photos results. The company had suppressed the categories rather than correcting the model. This was not a fringe case. It was a direct consequence of how these systems are built.

To understand why that happened — and why deletion was easier than repair — you need to understand what a convolutional neural network actually learns, where that learning is stored, and why it can encode the statistical regularities of biased data as faithfully as it encodes the statistical regularities of accurate data.

1. What a Neural Network Is

A neural network is a mathematical function with a very large number of adjustable parameters, called weights. A modern vision model like ResNet-50 (released by Microsoft Research in 2015) has approximately 25 million weights. The vision components of large multimodal models such as GPT-4 are believed to be orders of magnitude larger. Each weight is a single floating-point number. The network's "knowledge" (everything it has learned about what a dog or a face or a tumor looks like) is distributed across all of those numbers simultaneously. There is no single weight that "knows about dogs." There is a pattern of weight values that, taken together, produces outputs associated with dogs.

During training, the network is shown an image, produces a prediction, and the prediction is compared to the correct label. The difference — the loss — is propagated backward through the network via an algorithm called backpropagation, slightly adjusting each weight to reduce the error. This is done millions of times across millions of images. The weights that result are the trained model.
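The loop described above can be shown at miniature scale. The sketch below trains a "network" with one weight and one bias using gradient descent. It assumes a sigmoid output and cross-entropy loss, and the four training examples are invented. A real CNN applies the same idea across millions of weights, with backpropagation supplying the gradient for each one.

```python
import math

def predict(w, b, x):
    """A one-weight 'network': sigmoid of a weighted input."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# Hypothetical training set: x is a single pixel-derived feature, y is the label (0 or 1).
data = [(0.1, 0), (0.3, 0), (0.7, 1), (0.9, 1)]

w, b, learning_rate = 0.0, 0.0, 1.0
for step in range(2000):
    grad_w = grad_b = 0.0
    for x, y in data:
        # For sigmoid + cross-entropy, the loss gradient at the pre-activation is (prediction - label).
        error = predict(w, b, x) - y
        grad_w += error * x   # chain rule: how the loss changes with w
        grad_b += error       # and with b
    w -= learning_rate * grad_w / len(data)   # nudge each parameter downhill
    b -= learning_rate * grad_b / len(data)

print(round(predict(w, b, 0.2)), round(predict(w, b, 0.8)))  # 0 1
```

After training, the two numbers w and b are the entire "model": nothing else was stored.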

2. Convolutional Layers: The Key Architectural Idea

The architecture that made deep learning dominate computer vision is the convolutional layer. Rather than connecting every neuron to every pixel (computationally catastrophic for large images), a convolutional layer slides a small filter — typically 3×3 or 5×5 pixels — across the image, computing a dot product at each position. This produces a new, transformed version of the image called a feature map.

The insight is that visual features like edges and textures appear at many positions in an image. A filter that detects a vertical edge should be useful everywhere, not just in one corner. By sharing filter weights across positions, convolutional networks can learn spatially invariant features with far fewer parameters than fully connected approaches.
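Here is a minimal sketch of the sliding-filter operation, written in plain Python with no padding and a stride of one. The vertical-edge filter is written by hand for illustration; a CNN would learn its filter values from data. The final lines compare parameter counts for an illustrative 224 × 224 grayscale input. (Deep learning libraries call this operation convolution, although strictly speaking it is cross-correlation.)

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and take a dot product at each position."""
    k = len(kernel)
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# A 4x6 grayscale image: dark on the left, bright on the right.
image = [[0, 0, 0, 255, 255, 255] for _ in range(4)]

# A hand-written vertical-edge filter (hypothetical values).
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

feature_map = convolve2d(image, vertical_edge)
print(feature_map[0])   # [0, 765, 765, 0]

# Parameter comparison for a 224x224 grayscale input (sizes are illustrative).
conv_params = 3 * 3                        # one shared 3x3 filter
dense_params = (224 * 224) * (224 * 224)   # every output pixel wired to every input pixel
print(conv_params, dense_params)           # 9 2517630976
```

The feature map is large only where the dark-to-bright edge sits, and the same nine weights found it at every row.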

Research by Matthew Zeiler and Rob Fergus in 2013 — using a technique called deconvolution to visualize what individual filters respond to — showed that early convolutional layers learn edge detectors and color gradients (similar to what SIFT engineers had handcrafted). Middle layers learn textures and shapes. Late layers learn high-level concepts like "face," "wheel," or "text." This hierarchy emerges from training data alone.

Key Insight

The features a CNN learns are not designed by humans — they emerge from the statistics of training data. This is why biased training data produces biased features. The network faithfully learns whatever regularities are present, useful or harmful.

3. Overfitting and Generalization

A network with enough parameters can, in principle, memorize every training example — producing perfect accuracy on training data while failing entirely on new images. This is called overfitting. Preventing it is a central engineering challenge. Common techniques include dropout (randomly disabling neurons during training, forcing redundancy), data augmentation (flipping, cropping, and distorting training images to increase variety), and regularization (penalizing large weight values mathematically).
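The three techniques named above are each simple at their core. The sketch below shows one hedged, minimal version of each: a horizontal flip for augmentation, "inverted" dropout (one common variant, which rescales the surviving activations), and an L2 penalty as the regularization term. Function names and values are chosen for illustration only.

```python
import random

def horizontal_flip(image):
    """Data augmentation: mirror each row to produce a new, plausible training image."""
    return [list(reversed(row)) for row in image]

def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def l2_penalty(weights, strength):
    """Regularization term added to the loss: grows with the size of the weights."""
    return strength * sum(w * w for w in weights)

image = [[1, 2, 3],
         [4, 5, 6]]
print(horizontal_flip(image))          # [[3, 2, 1], [6, 5, 4]]
print(l2_penalty([0.5, -2.0], 0.5))    # 2.125

rng = random.Random(0)                 # fixed seed so the run is repeatable
print(dropout([1.0, 1.0, 1.0, 1.0], rate=0.5, rng=rng))  # each entry is 0.0 or 2.0
```

Dropout is applied only during training; at prediction time all neurons are kept.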

Generalization — the ability to perform well on images the network has never seen — is the actual goal, and it is measured on a held-out test set. But even good test performance within a laboratory setting does not guarantee good performance in deployment. The 2019 MIT study that found state-of-the-art classifiers failed on 45-degree rotations used test images from the same distribution as the training set. When researchers systematically varied image corruptions (blur, noise, compression artifacts), they found that models that ranked similarly on clean test sets diverged dramatically under corruption — revealing that standard evaluation metrics measure lab performance, not real-world robustness.

4. Transfer Learning: One Model, Many Applications

Training a large vision model from scratch requires millions of labeled images and weeks of GPU computation. Most real-world applications cannot afford this. The dominant solution is transfer learning: take a model pre-trained on a large general dataset (typically ImageNet), and fine-tune it on a smaller domain-specific dataset.

In 2017, Stanford researchers published a paper demonstrating that a CNN pre-trained on ImageNet and then fine-tuned on 129,450 clinical images of skin lesions could classify skin cancer with accuracy comparable to board-certified dermatologists. This was possible precisely because the low-level visual features learned from natural images (edges, textures, gradients) transfer to medical images. The approach has since been replicated across radiology, ophthalmology, and pathology.

Transfer learning also transfers biases. A model pre-trained on a dataset skewed toward Western contexts can carry those biases into a fine-tuned application, even one that uses locally representative training data, because the lower-level features are often frozen or only lightly updated from the biased source. This is a documented problem in global health AI deployments.
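The mechanics of fine-tuning with a frozen backbone can be sketched at toy scale. In the example below, a fixed list of "filters" stands in for the pre-trained layers, and only a small new classification head is trained on the domain-specific data. The filter values, dataset, and function names are all hypothetical.

```python
import math

# Stand-in for a pre-trained backbone: these filter weights are treated as frozen.
# (Hypothetical values; a real backbone has millions of learned weights.)
FROZEN_FILTERS = [[1.0, -1.0], [0.5, 0.5]]

def extract_features(pixels):
    """Frozen layers: never updated during fine-tuning."""
    return [sum(w * p for w, p in zip(f, pixels)) for f in FROZEN_FILTERS]

def fine_tune(data, steps=500, lr=0.5):
    """Train only a new classification head on top of the frozen features."""
    head = [0.0, 0.0]
    for _ in range(steps):
        for pixels, label in data:
            feats = extract_features(pixels)
            p = 1.0 / (1.0 + math.exp(-sum(h * f for h, f in zip(head, feats))))
            head = [h - lr * (p - label) * f for h, f in zip(head, feats)]
    return head

# Hypothetical small domain-specific dataset: (two-pixel "image", label).
data = [([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.1, 0.9], 0), ([0.2, 0.8], 0)]
head = fine_tune(data)
print(len(head), "trainable weights;", sum(len(f) for f in FROZEN_FILTERS), "frozen weights")
```

Whatever the frozen filters respond to, or fail to respond to, is inherited by every model built on top of them. The new data can only reweight features that already exist.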

Core Principle

Modern computer vision is almost entirely built on transfer learning. Understanding what that means — both the capability it enables and the bias it propagates — is essential for evaluating any claim about a CV system's real-world performance.

Lesson 2 Quiz

Five questions — select the best answer for each.

1. In a trained neural network, the "knowledge" about visual categories is:

Correct. Neural network knowledge is distributed — it exists as a pattern across all weights simultaneously. This is why it is hard to "remove" a bias; it is not stored in one place.

Neural networks do not have dedicated memory cells per concept. Knowledge is distributed across all weights — which is precisely why bias is so difficult to surgically remove.

2. The primary advantage of convolutional layers over fully-connected layers for image processing is:

Correct. Weight sharing is the key insight. A filter that detects a vertical edge is useful at every position, so the same weights are reused across the image — making the approach tractable for high-resolution inputs.

The advantage is parameter efficiency through weight sharing. The same filter slides across the entire image, detecting the same type of feature wherever it appears with a single set of weights.

3. When Google Photos labeled Black users' faces as "gorillas" in 2015, and the long-term fix was to remove the category label entirely, what does this reveal about the nature of trained CNN models?

Correct. Because bias is distributed across weights rather than localized, removing it is not a targeted fix. Google found it easier to block the output category than to retrain the underlying feature representations.

The lesson here is about how bias is stored in a model. It is distributed across weights — not localized — making targeted correction extremely difficult. Deleting the output category was a workaround, not a fix.

4. Transfer learning in computer vision means:

Correct. Transfer learning leverages large-scale pre-training to bootstrap domain-specific models — enabling practical applications (like medical imaging classifiers) that could not be trained from scratch on available data.

Transfer learning means using a pre-trained model as a starting point for a new, smaller training task. It is how most real-world CV applications are built, and it also propagates the biases of the source model.

5. The 2017 Stanford skin cancer classification study is significant because it demonstrated:

Correct. Transfer learning from ImageNet to medical imaging worked because textures and gradients in natural photos share structural properties with those in clinical skin images. 129,450 images was sufficient, far less than training from scratch would require.

The study showed the opposite of training-from-scratch requirements. Transfer learning from natural image features enabled dermatologist-level accuracy with a comparatively small medical dataset.

Lab 2: Inside the Black Box

A conversation-based lab · Lesson 2 · How AI Sees Your World

What You'll Do

Explore how CNNs actually learn — weights, backpropagation, convolutional layers, overfitting, and transfer learning. Push the assistant on specific real cases: the Google Photos incident, the Stanford skin cancer study, or any other real-world application you want to understand more deeply.

Complete at least three substantive exchanges to finish this lab.

Try: "Why is bias in a neural network so hard to remove once it's trained?" — then follow where the conversation goes.

AI Lab Assistant

Lesson 2 · CNN Architecture & Learning

Welcome to Lab 2. I can help you dig into how convolutional networks learn — weights, backpropagation, what those layers are actually computing, and why biases can be so deeply baked in. What's on your mind?

How AI Sees Your World · Lesson 3

Computer Vision in the Wild: Faces, Cars, and Surveillance

The gap between laboratory benchmarks and real-world deployment — documented in concrete cases.

When a system that works in testing fails in deployment, who is responsible for the consequences?

On January 9, 2020, Robert Julian-Borchak Williams, a Black man living in Detroit, was arrested in his driveway in front of his wife and daughters. The charge was felony theft. He was held for thirty hours in police custody. The identification that led to his arrest came from a Michigan State Police facial recognition system, which had matched a surveillance video frame to his driver's license photo. When investigators manually reviewed the match (something the vendor's contract required), they confirmed it. They were wrong. The video showed a different man. Williams's arrest was the first publicly reported case in the United States of a wrongful arrest based on facial recognition. It was not the last.

What failed was not a bug or a software crash. The system worked as designed. It produced a match above its confidence threshold. A human investigator confirmed it. The failure was structural: a technology with documented higher error rates on dark-skinned faces had been deployed for high-stakes criminal identification with no independent verification mechanism beyond the same human confirmation bias that produced the error in the first place.

1. Facial Recognition: Capability and Documented Failure

Facial recognition systems have become extraordinarily capable at controlled-condition matching: passport photos against enrollment databases, celebrity identification from studio photographs. In these conditions, leading commercial systems achieve accuracy rates above 99.5% on benchmark datasets. This performance has driven widespread adoption: the TSA deployed facial recognition at 25 major U.S. airports between 2017 and 2023; Chinese authorities operate the world's largest facial recognition surveillance network; and under the Prüm framework, law enforcement agencies in EU member states exchange biometric data across borders.

The documented problems emerge consistently across several axes. Demographic disparities: NIST's 2019 Face Recognition Vendor Test (FRVT) evaluated 189 algorithms from 99 developers and found that most algorithms had higher false-positive rates for African-American and Asian faces compared to Caucasian faces — sometimes by a factor of 10 to 100. Uncontrolled conditions: performance degrades significantly with low-quality video, unusual angles, partial occlusion, or aging. Surveillance cameras rarely offer controlled conditions.
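A "factor of 10" disparity is easier to interpret once the underlying rate is spelled out. The sketch below computes a false-positive rate for two groups from invented counts; the numbers are hypothetical and are not taken from the NIST report.

```python
def false_positive_rate(results):
    """Share of different-person comparisons that the system wrongly called a match."""
    false_positives = sum(1 for predicted_match in results if predicted_match)
    return false_positives / len(results)

# Hypothetical outcomes of impostor comparisons (pairs of photos of different people).
# True means the system wrongly declared a match. Counts are invented for illustration.
group_a = [True] * 1 + [False] * 9999     # 1 false match in 10,000 comparisons
group_b = [True] * 10 + [False] * 9990    # 10 false matches in 10,000 comparisons

fpr_a = false_positive_rate(group_a)
fpr_b = false_positive_rate(group_b)
print(fpr_a, fpr_b)            # 0.0001 0.001
print(round(fpr_b / fpr_a))    # 10
```

Both rates look tiny in isolation. When a probe image is searched against a database of millions, however, a tenfold difference in the rate means a tenfold difference in the number of innocent people flagged.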

Documented Case

By 2023, at least three documented wrongful arrests in the United States — Robert Williams (Detroit, 2020), Michael Oliver (Detroit, 2019), and Nijeer Parks (New Jersey, 2019) — had been traced to faulty facial recognition identifications. All three were Black men. No documented wrongful arrests involved white suspects identified by facial recognition.

2. Computer Vision in Autonomous Vehicles

Self-driving vehicles use multiple sensor modalities — cameras, lidar, radar — but computer vision plays a central role in detecting pedestrians, cyclists, lane markings, and traffic signals. The stakes are direct: failure means collision.

On March 18, 2018, an Uber autonomous test vehicle struck and killed Elaine Herzberg in Tempe, Arizona, the first recorded pedestrian fatality caused by an autonomous vehicle. The NTSB investigation found that the vehicle's perception system had detected Herzberg approximately six seconds before impact, but the classification algorithm cycled between categorizing her as an unknown object, a vehicle, and a bicycle, without settling on a stable prediction of her path. According to the NTSB, the system never classified her as a pedestrian, and it determined that emergency braking was needed only about 1.3 seconds before impact, which was too late to avoid the collision. She was walking her bicycle across an unlit road outside a marked crosswalk: an input scenario that fell outside the system's robust operating region.

The case illustrates a recurring pattern in deployed computer vision systems: they are evaluated on performance metrics over large test sets, but single novel inputs — inputs that fall in the gaps between training scenarios — can cause catastrophic failures. Robustness to distribution shift is not well captured by average accuracy metrics.

3. Medical Imaging: Promise and Precaution

Medical imaging is perhaps the domain where the gap between laboratory performance and deployment reality is most consequential. FDA-cleared AI diagnostic tools now exist for diabetic retinopathy screening, chest X-ray triage, CT pulmonary embolism detection, and pathology slide analysis. The promise is real: IDx-DR, authorized by the FDA in 2018 as the first autonomous AI diagnostic system, achieved 87.4% sensitivity and 89.5% specificity for diabetic retinopathy in the trial supporting its authorization.
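Sensitivity and specificity are simple ratios, and seeing them computed helps in reading claims like the one above. The counts below are hypothetical, chosen only so that the results equal the quoted percentages. They are not the actual trial data.

```python
def sensitivity(true_pos, false_neg):
    """Of the patients who have the disease, what share did the system flag?"""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Of the patients who do not have the disease, what share did the system clear?"""
    return true_neg / (true_neg + false_pos)

# Hypothetical counts for 1,000 patients with the disease and 1,000 without.
tp, fn = 874, 126
tn, fp = 895, 105

print(sensitivity(tp, fn))   # 0.874
print(specificity(tn, fp))   # 0.895
```

In this invented cohort, 126 patients with disease are missed and 105 healthy patients are flagged. Whether that trade-off is acceptable depends on the population and on what happens after a positive or negative result.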

But a 2022 retrospective study published in The Lancet Digital Health reviewed 130 clinical AI studies and found that fewer than 3% had prospective validation in patient populations different from the training population. Most studies were retrospective, single-site, and evaluated on the same distribution they were trained on. When researchers at Stanford deployed a sepsis prediction algorithm (not a vision system, but the pattern applies equally) at a new hospital, performance dropped substantially — not because the algorithm was flawed, but because local documentation practices differed from training data in ways the algorithm had not encountered.

Core Principle

Laboratory benchmark performance does not predict deployment performance in novel populations or environments. The history of computer vision deployment is largely a history of this gap — sometimes inconvenient, sometimes fatal. Demanding prospective, multi-site validation before deploying high-stakes CV systems is the informed response to this pattern.

4. Adversarial Vulnerability

In 2013, Christian Szegedy and colleagues at Google published a paper demonstrating that imperceptible perturbations to an image (changes invisible to the human eye) could cause state-of-the-art CNNs to misclassify images with high confidence. In a widely cited follow-up example by Ian Goodfellow, Jonathon Shlens, and Szegedy, adding carefully computed noise to an image of a panda caused a network to classify it as a gibbon with 99.3% confidence. These are called adversarial examples.

The practical implications depend on context. In 2020, researchers at McAfee demonstrated that a small strip of tape on a 35 mph speed limit sign caused the camera system in older Tesla vehicles to read it as 85 mph. Other researchers have shown that specific patterns printed on T-shirts can cause person-detection algorithms to miss the wearer much of the time. The vulnerability is structural: it arises because CNNs learn statistical correlations between pixel patterns and labels, not causal relationships between visual scenes and semantic meanings. Any pattern that exploits those correlations, even if invisible or meaningless to humans, can redirect the network's output.
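The core mechanism can be illustrated without a neural network at all. The sketch below attacks a toy linear classifier in the style of the fast gradient sign method: every pixel is shifted by a small amount in whichever direction lowers the score. The weights, the 1,000-pixel "image," and the panda/gibbon labels are all hypothetical stand-ins.

```python
def score(weights, pixels):
    """Linear classifier: positive score means 'panda', negative means 'gibbon'."""
    return sum(w * p for w, p in zip(weights, pixels))

def sign(x):
    return (x > 0) - (x < 0)

def fgsm_perturb(weights, pixels, epsilon):
    """Shift every pixel by at most epsilon in the direction that lowers the score.

    For a linear model the gradient of the score with respect to the input is
    just the weight vector, so the attack direction is -sign(weight).
    """
    return [p - epsilon * sign(w) for w, p in zip(weights, pixels)]

# Hypothetical weights and a 1,000-pixel "image" with values in [0, 1].
weights = [0.01 if i % 2 == 0 else -0.01 for i in range(1000)]
pixels = [0.52 if i % 2 == 0 else 0.5 for i in range(1000)]

adversarial = fgsm_perturb(weights, pixels, epsilon=0.02)
print(score(weights, pixels) > 0)        # True
print(score(weights, adversarial) > 0)   # False
```

No pixel moved by more than 2 percent of its range, yet the decision flipped, because a thousand tiny nudges all pushed the score in the same direction. Real images have far more pixels, which is part of why such perturbations can stay imperceptible.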

Lesson 3 Quiz

Five questions — select the best answer for each.

1. The 2019 NIST Face Recognition Vendor Test (FRVT) found that most evaluated algorithms had false-positive rates for African-American faces compared to Caucasian faces that were:

Correct. The FRVT found demographic disparities of this magnitude across most of the 189 algorithms tested. This is not a marginal difference — a 10x–100x false-positive rate gap has serious implications for any high-stakes application.

The FRVT found disparities of 10x to 100x, not marginal differences. This is a systematic, large-scale finding across 189 algorithms from 99 developers — not an outlier.

2. In the Uber autonomous vehicle fatality (Tempe, Arizona, 2018), what specifically caused the system to fail to brake in time?

Correct. The NTSB investigation documented that the system detected Herzberg about six seconds before impact but repeatedly misclassified her and never identified her as a pedestrian; it concluded that emergency braking was needed only about 1.3 seconds before impact. The scenario (a person walking a bicycle outside a crosswalk on an unlit road) fell outside the system's robust operating range.

The NTSB found that the classification system detected the victim but misclassified her, cycling through "unknown object," "vehicle," and "bicycle" without ever classifying her as a pedestrian, and recognized the need to brake too late.

3. The term "adversarial example" in computer vision refers to:

Correct. Adversarial examples exploit the statistical nature of CNN learning. Small perturbations, often imperceptible or meaningless to humans, can redirect a network's output entirely because they target the pixel-level correlations the network has learned rather than the semantic content humans perceive.

Adversarial examples are specifically crafted perturbations, not ambiguous images or mislabeled data. They are often imperceptible or meaningless to humans, but they exploit the statistical correlations a network has learned and cause confident misclassification.

4. The 2022 Lancet Digital Health review of 130 clinical AI studies found that fewer than 3% had what important characteristic?

Correct. Fewer than 3% of reviewed studies had prospective external validation. Most studies tested systems on the same distribution they were trained on — a form of evaluation that does not predict real-world deployment performance.

The critical gap was prospective external validation — testing in a different patient population from the one used to train. Without this, strong laboratory results don't predict deployment performance.

5. The Robert Williams wrongful arrest case in Detroit (2020) illustrates which structural problem with facial recognition deployment?

Correct. The system worked as designed. The failure was structural: known demographic disparities in error rates were not addressed before high-stakes criminal deployment, and the required human review did not provide independent verification — it confirmed the algorithm's error.

The investigators followed the required protocol. The structural failure was deploying technology with known demographic error disparities in high-stakes criminal identification, where human review reinforced rather than corrected the algorithmic error.

Lab 3: High-Stakes Deployment

A conversation-based lab · Lesson 3 · How AI Sees Your World

What You'll Do

Dig into the real-world deployment cases from Lesson 3 — facial recognition failures, autonomous vehicle incidents, medical imaging validation gaps, and adversarial vulnerabilities. The assistant can help you think through accountability, systemic causes, and what better deployment practices would look like.

Complete at least three substantive exchanges to finish this lab.

Try: "If facial recognition has such documented disparities, why do police departments continue using it?" — then probe the answer.

AI Lab Assistant

Lesson 3 · CV Deployment & Real-World Failures

Welcome to Lab 3. I'm here to help you analyze real CV deployment cases — facial recognition, autonomous vehicles, medical imaging, adversarial attacks. These are documented events with documented causes. What do you want to examine?

How AI Sees Your World · Lesson 4

What You Now Know — and What to Do With It

From passive user to informed participant — how an understanding of CV mechanics changes how you engage with these systems.

Given everything you now know about how computer vision works and fails, what does it mean to be an informed person living inside these systems?

In 2017, the city of Orlando, Florida, partnered with Amazon to pilot its Rekognition facial recognition system for real-time police surveillance. The American Civil Liberties Union obtained documents about the pilot through a Freedom of Information Act request and published them in 2018. The documents revealed that the city had deployed a system with no independent accuracy validation for the Orlando population, no public notice, no legal framework governing retention of biometric data, and no policy limiting what the system's output could be used for. When the ACLU subsequently tested Rekognition by running members of Congress against a database of arrest photos, the system incorrectly matched 28 members of Congress to the mugshots — disproportionately members who were people of color. Amazon disputed the test methodology. The city quietly ended the pilot in 2019.

This story is not primarily about Amazon or Orlando. It is about the gap that forms when powerful technology moves faster than the institutional frameworks that should govern it. The people in that city were being surveilled without their knowledge, by a system whose limitations were not publicly disclosed, with no legal framework yet in place to govern it. This remains the recurring condition of computer vision deployment in 2024. Understanding the technology is a precondition for participating in the governance conversation.

1. The Vocabulary You Now Have

After three lessons, you have the vocabulary to read claims about computer vision critically. When a company announces a new system with "99% accuracy," you know to ask: accuracy on what benchmark, with what demographic composition, under what lighting and angle conditions, and validated prospectively or retrospectively? When a city announces facial recognition deployment for public safety, you know to ask: what is the false-positive rate on the demographics of the local population, not just the test set? When a medical device company announces FDA clearance, you know to ask: was clearance based on external validation data or the same dataset used for training?

These are not hostile or adversarial questions. They are the questions any informed adult should ask about any technology deployed in high-stakes contexts. The vocabulary of computer vision — distribution shift, training data composition, benchmark performance vs. deployment performance, demographic error rate disparities — is now civic vocabulary, not just technical vocabulary.
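The gap between a headline accuracy figure and per-group error rates is easy to see with a little arithmetic. The sketch below uses entirely synthetic counts, made up for illustration and not taken from NIST or any vendor. The group labels and function names are hypothetical. It shows how a system can report accuracy above 99% overall while one group's false-positive rate is ten times another's.

```python
from collections import defaultdict

# Synthetic face-comparison results: (group, system_said_match, truly_same_person).
# The counts are made up to illustrate the arithmetic; they are not FRVT data.
rows = (
    [("group_a", False, False)] * 8991
    + [("group_a", True, False)] * 9
    + [("group_b", False, False)] * 990
    + [("group_b", True, False)] * 10
)


def accuracy(records):
    """Share of comparisons where the system's answer matched the truth."""
    correct = sum(1 for _, predicted, actual in records if predicted == actual)
    return correct / len(records)


def false_positive_rate(records):
    """Share of different-person comparisons wrongly reported as a match."""
    negatives = [r for r in records if not r[2]]
    false_positives = sum(1 for r in negatives if r[1])
    return false_positives / len(negatives) if negatives else 0.0


def split_by_group(records):
    groups = defaultdict(list)
    for record in records:
        groups[record[0]].append(record)
    return dict(groups)


overall_accuracy = accuracy(rows)
fpr_by_group = {
    group: false_positive_rate(records)
    for group, records in split_by_group(rows).items()
}

print(f"overall accuracy: {overall_accuracy:.2%}")
for group, fpr in fpr_by_group.items():
    print(f"{group} false-positive rate: {fpr:.2%}")
```

In this made-up data the overall accuracy is about 99.8%, yet the smaller group is falsely matched at ten times the rate of the larger one. Because the larger group dominates the total, the aggregate figure hides the disparity, which is why the question "accuracy for whom?" matters.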

2. Where This Technology Is Embedded in Your Life

Computer vision is not a future technology. As of 2024, it is embedded in systems that touch most people in wealthy countries daily. Face ID and equivalent unlock mechanisms on smartphones use depth-sensing cameras and CNNs to verify identity. Google Lens and Apple Visual Look Up perform real-time object recognition. YouTube and TikTok use CV to classify video content for recommendation and moderation. Amazon Go stores use computer vision and sensor fusion to track what shoppers remove from shelves. Hospital emergency departments in dozens of U.S. states use AI-assisted triage tools that incorporate imaging data. Dozens of U.S. cities use license plate readers that automatically log the location of every passing vehicle.

Most of these systems are invisible to their subjects. The person whose car is logged by a license plate reader is not notified. The shopper in an Amazon Go store sees the technology described as frictionless convenience, not as a dense network of tracking cameras. The patient whose imaging study is preprocessed by an AI triage tool may not be informed. This invisibility is not accidental — it is the default condition of deployed computer vision. You are almost certainly being seen by machines that you cannot see.

On Invisibility

The default condition of computer vision deployment is asymmetric: the system sees you; you do not see the system. Every privacy framework that has emerged around CV — from GDPR's biometric data provisions to Illinois's Biometric Information Privacy Act (BIPA, passed 2008) — is an attempt to rebalance that asymmetry. Understanding it is the first step toward having a position on whether the current balance is acceptable.

3. Questions of Governance and Accountability

The governance landscape for computer vision is fragmentary and evolving rapidly. The EU AI Act (passed 2024) prohibits most law-enforcement use of real-time remote biometric identification in publicly accessible spaces, subject to narrow exceptions, and classifies other remote biometric identification systems as "high risk," imposing strict requirements for transparency, human oversight, and accuracy auditing, including mandatory demographic parity testing before deployment. The United States has no comparable federal framework as of 2024; instead, a patchwork of state laws (Illinois's BIPA, Texas's CUBI Act, Washington's My Health My Data Act) applies to narrow use cases.

The San Francisco Board of Supervisors voted in 2019 to ban city agencies from using facial recognition technology, the first such ban in the United States. Portland, Oregon, and Somerville and Cambridge, Massachusetts, followed. Boston banned facial recognition by city departments in 2020. These are local responses to a federal regulatory vacuum. They represent communities making explicit decisions about what asymmetry of surveillance they will accept.

Accountability for computer vision failures is also legally underdeveloped. When Robert Williams was wrongfully arrested, he sued the City of Detroit — not the vendor of the facial recognition system. The legal frameworks for assigning liability when an algorithmic output contributes to harm are being built in real time, in courts that often lack technical expertise to evaluate the underlying systems.

4. What This Course Has and Has Not Done

This module has given you the foundational mechanics of computer vision: what images are numerically, how CNNs learn from data, what the core tasks are, where the technology has been deployed and what has gone wrong, and where governance frameworks stand. What it has not done — and cannot do in a single module — is resolve the genuine ethical and policy debates that surround this technology.

Reasonable, informed people disagree about whether the security benefits of facial recognition in law enforcement outweigh its documented risks of wrongful identification. They disagree about whether AI-assisted medical imaging, even with its validation gaps, represents a net improvement in care quality for populations that currently lack access to specialist expertise. They disagree about the appropriate role of government in regulating private companies' use of these systems. These are live debates requiring exactly the kind of informed engagement that technical literacy enables.

You are now better equipped to have those debates — not because you know where they end, but because you understand what is actually at stake in them. That is what this course was for.

Closing Principle

Computer vision systems are not neutral tools. They are artifacts that encode the choices — about data, thresholds, deployment contexts, and accountability — of the people who build and deploy them. Understanding how they work is the minimum precondition for having an opinion about whether those choices are acceptable. You now have that minimum.

Lesson 4 Quiz

Five questions — select the best answer for each.

1. When evaluating a claim of "99% accuracy" for a new computer vision system, which question is most critical to ask first?

Correct. Benchmark composition, demographic representation, and validation methodology are what determine whether a stated accuracy figure is meaningful for a real-world deployment context. "99% accuracy" in isolation is nearly uninformative.

The most critical questions concern what the benchmark actually measured — its demographic composition, environmental conditions, and whether the evaluation was prospective (on new data) or retrospective (on training-distribution data).

2. Illinois's Biometric Information Privacy Act (BIPA, 2008) and the EU AI Act (2024) both represent attempts to address which fundamental condition of computer vision deployment?

Correct. Both frameworks are responses to the default asymmetry of CV deployment: the system sees you without your awareness. BIPA requires informed consent before biometric data collection; the EU AI Act requires transparency, auditing, and human oversight for high-risk CV systems.

Both frameworks address the asymmetry of surveillance — the condition in which systems collect and use biometric data about individuals without those individuals' knowledge or consent. That is the governance problem they are both trying to address.

3. When San Francisco banned city agency use of facial recognition in 2019, it was notable because:

Correct. San Francisco's ban was significant as the first U.S. municipal restriction — a local governance response to the absence of federal regulation. It applied to city departments, not private companies. Several other cities followed with similar measures.

San Francisco's ban applied to city agencies and was the first such action in the United States — not globally, and not extended to private companies. It represented local governance filling a federal regulatory vacuum.

4. In the Amazon Rekognition / ACLU test (2018), which result was most concerning about the system's real-world reliability?

Correct. The test used only publicly available images of members of Congress against an arrest photo database, standard conditions for any public-facing deployment. The 28 incorrect matches, skewed toward legislators of color, were consistent with the demographic error disparities later documented in NIST testing.

The ACLU test found 28 false positive matches among members of Congress, disproportionately affecting members who were people of color — consistent with the demographic error rate disparities documented by NIST for most facial recognition systems.

5. This module argues that technical literacy about computer vision is "civic vocabulary" because:

Correct. The concept of "civic vocabulary" means that understanding CV is now necessary for meaningful democratic participation in decisions about how these systems are deployed and governed — not just for technical professionals, but for anyone affected by them.

Civic vocabulary refers to knowledge needed for democratic participation and accountability. CV systems affect criminal justice, healthcare, and public safety — meaning that being an informed citizen now requires understanding how they work.

Lab 4: Governance and Your Position

A conversation-based lab · Lesson 4 · How AI Sees Your World

What You'll Do

Use what you've learned across the entire module to engage with governance questions: what regulations exist, what their limits are, and where you personally come down on contested trade-offs. The assistant can help you reason through your own position — not by telling you what to think, but by stress-testing the arguments you make.

Complete at least three substantive exchanges to finish this lab.

Try: "Should facial recognition be banned in public spaces entirely, or are there conditions under which it's acceptable? Help me think through both sides." — then push back on whatever case you find weakest.

AI Lab Assistant

Lesson 4 · CV Governance & Informed Participation

Welcome to Lab 4 — the final lab of this module. You've covered the mechanics, the real-world cases, and the governance landscape. Now let's put it together. I'll help you reason through contested questions about computer vision policy — accountability, surveillance, regulation, trade-offs. What do you want to think through?

Module Test

15 questions across all four lessons — 80% required to pass.

1. A 1080p color digital image contains approximately how many individual numerical values?

Correct. Three channels × 1920 × 1080 = 6,220,800 values — each a pixel intensity from 0 to 255.

Three RGB channels × 1920 × 1080 pixels = approximately 6 million values.
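The arithmetic in this answer can be checked directly. This is a minimal sketch in Python, assuming a standard 8-bit RGB image; the constant and variable names are illustrative.

```python
WIDTH, HEIGHT, CHANNELS = 1920, 1080, 3  # 1080p frame with red, green, blue channels
BITS_PER_VALUE = 8                        # assumption: standard 8-bit color depth

total_values = WIDTH * HEIGHT * CHANNELS
intensity_levels = 2 ** BITS_PER_VALUE    # 256 levels, i.e. 0 through 255
uncompressed_bytes = total_values * BITS_PER_VALUE // 8

print(total_values)        # 6220800
print(intensity_levels)    # 256
print(uncompressed_bytes)  # 6220800
```

At one byte per value, a single uncompressed 1080p frame comes to roughly 6.2 MB of numbers before a model has learned anything from it.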

2. AlexNet's 2012 ImageNet result triggered a paradigm shift because:

Correct. The margin, not any absolute milestone, was the signal. An 11-point gap in a field that typically advanced by only a few points a year meant the underlying approach had changed fundamentally.

The significance was the scale of the gap over competitors — approximately 11 percentage points — not any absolute perfection or novelty of neural networks per se.

3. The SIFT algorithm (David Lowe, 1999) represents which approach to computer vision?

Correct. SIFT is a classic handcrafted feature detector — mathematically designed by human engineers to find and describe distinctive local features in images. It preceded and was eventually superseded by learned CNN features.

SIFT is handcrafted feature engineering — mathematical operations defined by human engineers to detect visual structures. It is the paradigm that deep learning replaced.

4. Fei-Fei Li's ImageNet dataset was historically significant because:

Correct. ImageNet's scale — the result of three years of careful engineering and curation — was the enabling condition for the deep learning revolution in vision. Data engineering was as important as algorithmic innovation.

ImageNet was significant for its scale and careful curation — 14 million images across over 21,000 categories. This enabled CNNs to demonstrate capabilities that smaller datasets had not revealed.

5. In a convolutional neural network, the primary function of a convolutional layer's filters is to:

Correct. Convolutional filters slide across the image computing dot products, producing feature maps that highlight where specific patterns (edges, textures, shapes) appear. The filters' weights are learned during training.

Convolutional filters slide across the image computing dot products at each position, creating feature maps that respond to learned visual patterns — edges, textures, shapes — at every location.
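The sliding dot product described in this answer can be sketched in a few lines of plain Python. The filter below is hand-written for illustration, whereas in a CNN the filter weights would be learned during training. The names `convolve2d`, `image`, and `vertical_edge` are hypothetical, and the sketch covers only the simplest case (one channel, no padding, stride 1).

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and take a dot product at each position.

    This is the unflipped sliding dot product (cross-correlation) that CNN
    layers compute, with no padding and a stride of 1.
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    feature_map = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            total = 0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


# A tiny grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]

# Hand-written filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # [[0, 27, 27, 0], [0, 27, 27, 0]]
```

The feature map is zero where the image is uniform and large where the filter straddles the boundary. That is the sense in which a feature map highlights where a pattern appears.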

6. The Zeiler and Fergus (2013) visualization research on CNNs found that:

Correct. The Zeiler-Fergus visualization work revealed the hierarchical feature structure of CNNs — from primitive edge detectors in early layers to complex concept detectors in later layers — confirming that the hierarchy emerged from data, not design.

Zeiler and Fergus showed a clear hierarchy: early layers → edges/colors; middle layers → textures/shapes; late layers → high-level concepts. This hierarchy emerged from training, not human engineering.

7. The Gender Shades study (Buolamwini & Gebru, 2018) found error rate disparities of up to 34 percentage points. The study attributed this to:

Correct. All three tested systems — from Microsoft, IBM, and Face++ — had higher error rates for dark-skinned women, tracing to training data that underrepresented this group. The disparity was in the data, not the architecture.

Gender Shades identified training data underrepresentation as the cause — not hardware, not deliberate design, not architecture. The disparity existed because the training sets were not demographically representative.

8. Transfer learning propagates biases from source to target domain because:

Correct. When early layers are frozen in transfer learning, the biased feature representations from pre-training remain fixed. More representative fine-tuning data can improve high-level classification but may not correct biases embedded in lower-level features.

Frozen lower-level features carry source biases into fine-tuned applications. Even if fine-tuning data is representative, the feature extraction pipeline may remain biased from pre-training.

9. The NTSB investigation into the 2018 Uber autonomous vehicle fatality found that the proximate cause was:

Correct. The NTSB documented that the system detected Herzberg but misclassified her repeatedly, recognizing the need for emergency braking only about 1.3 seconds before impact. The input scenario (a person walking a bicycle outside a crosswalk in low lighting) fell outside the system's robust operating range.

The NTSB documented repeated misclassification by the perception system. The system detected the victim but cycled through incorrect categories, leaving insufficient time to brake once it recognized that a collision was imminent.

10. Adversarial examples demonstrate which structural vulnerability of CNN-based vision systems?

Correct. Adversarial examples exist because CNNs optimize statistical pixel-label correlations, not semantic understanding. Perturbations that redirect those correlations — even invisibly — can reliably produce confident misclassifications.

Adversarial vulnerability is structural: CNNs learn statistical pixel-label correlations, not meaning. Perturbations exploit these correlations in ways that are invisible to human perception but redirect the network's output entirely.

11. The 2022 Lancet Digital Health finding that fewer than 3% of clinical AI studies had prospective external validation implies:

Correct. Retrospective evaluation on training-distribution data does not predict performance in new populations or clinical settings. The 3% figure means the field has largely not demonstrated the external validity needed for confident deployment.

The finding means most clinical AI accuracy claims cannot be trusted to predict real-world performance. Retrospective evaluation on familiar data does not reveal how systems will perform in new hospitals, populations, or documentation environments.

12. Robert Williams (Detroit, 2020), Michael Oliver (Detroit, 2019), and Nijeer Parks (New Jersey, 2019) share which characteristic relevant to this module?

Correct. All three of these documented U.S. wrongful arrests attributable to facial recognition involved Black men, a pattern consistent with the demographic error rate disparities found in NIST's FRVT evaluation of commercial systems.

All three of these documented facial recognition wrongful arrests involved Black men, consistent with the documented demographic disparities in facial recognition error rates across commercial systems.

13. The EU AI Act (2024) classifies remote biometric identification systems as "high risk" and, for such systems, requires:

Correct. The EU AI Act's high-risk classification for remote biometric identification requires transparency, human oversight, accuracy auditing, and demographic parity testing, an attempt to impose systematic accountability on the deployments it permits.

The EU AI Act requires transparency, human oversight, auditing, and demographic parity testing for high-risk CV systems — not prohibition or data localization requirements.

14. The Amazon Rekognition / Orlando pilot case (2017–2019) illustrates:

Correct. The Orlando pilot was deployed without public notification, without independent accuracy validation for the local population, and without any legal framework governing data retention or use. The ACLU obtained documentation only through FOIA — after deployment had begun.

The Orlando case showed that deployment can happen without public knowledge, independent validation, or legal framework. The ACLU found out through FOIA. This is the default condition of CV deployment in the absence of mandatory transparency requirements.

15. This module describes technical CV literacy as "civic vocabulary" rather than merely technical vocabulary. This framing argues that:

Correct. "Civic vocabulary" means that understanding CV is now a requirement for informed democratic participation — not because everyone needs to build models, but because these systems affect criminal justice, healthcare outcomes, and civil liberties in ways that require public accountability.

Civic vocabulary means the concepts are needed for democratic participation — to evaluate claims, demand accountability, and engage meaningfully with policy debates about systems that affect criminal justice, healthcare, and civil liberties.