In October 2023, Waymo's fully autonomous robotaxis, operating with no safety driver, completed over one million paid passenger trips in San Francisco and Phoenix. Riders hailed rides through an app, climbed into a Jaguar I-PACE fitted with a roof-mounted sensor dome, and arrived at their destinations without a human hand ever touching the wheel. The vehicle's visual AI processed roughly twenty camera feeds together with lidar pulses and radar returns, refreshing its picture of the road every hundredth of a second.
No single sensor gives a self-driving car everything it needs. Production systems like Waymo's fifth-generation Driver and Cruise's AV combine three complementary modalities, each filling gaps the others leave.
Cameras are the richest source of visual information — color, texture, lane markings, traffic-light state, facial expressions on pedestrians, text on signs. They are cheap, high-resolution, and the most human-like sensor. Their weakness: performance degrades in heavy rain, glare, or darkness, and they produce no direct depth measurement. Everything 3-D must be inferred by the AI from a 2-D projection.
Lidar (Light Detection and Ranging) fires rapid pulses of laser light and times their return. The result is a precise 3-D point cloud — a real-time geometric map of every nearby surface, accurate to centimeters at 100+ meters. Waymo's custom Laser Bear Honeycomb sensors spin 360° and fire millions of pulses per second. Lidar is nearly unaffected by lighting but is expensive and can be confused by heavy precipitation.
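The time-of-flight principle behind lidar reduces to a short calculation: the pulse travels out and back, so the range is half the round-trip time multiplied by the speed of light. The sketch below is illustrative only; the function names are hypothetical, and a real sensor also corrects for beam geometry, timing jitter, and calibration.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def range_from_round_trip(round_trip_s: float) -> float:
    """Distance to the reflecting surface, given the pulse's round-trip time."""
    if round_trip_s < 0:
        raise ValueError("round-trip time cannot be negative")
    # The pulse covers the distance twice (out and back), hence the division by 2.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


def round_trip_for_range(distance_m: float) -> float:
    """Round-trip time a pulse needs to reach a surface distance_m away and return."""
    return 2.0 * distance_m / SPEED_OF_LIGHT_M_PER_S


def to_point(range_m: float, azimuth_rad: float, elevation_rad: float):
    """Convert one range measurement and its beam angles into an (x, y, z) point."""
    horizontal = range_m * math.cos(elevation_rad)
    return (
        horizontal * math.cos(azimuth_rad),
        horizontal * math.sin(azimuth_rad),
        range_m * math.sin(elevation_rad),
    )
```

A surface 100 meters away returns the pulse in well under a microsecond, which is why lidar timing electronics must resolve fractions of a nanosecond to reach centimeter accuracy. Repeating `to_point` for every pulse yields the point cloud.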
Radar uses radio waves rather than light. It penetrates fog, rain, and snow that blind cameras and scatter lidar. It also measures velocity directly via the Doppler effect, so it can tell at once whether an object ahead is stationary or moving at 60 mph. Radar resolution is low, though; it cannot read a stop sign or identify a pedestrian's pose.
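The Doppler relationship can be sketched in a few lines. A target moving toward or away from the radar shifts the returned frequency by roughly twice its radial speed times the carrier frequency, divided by the speed of light. The 77 GHz carrier below is an assumption (a band commonly used by automotive radar), not a figure from any specific vehicle, and the function names are hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0
ASSUMED_CARRIER_HZ = 77e9  # assumption: a typical automotive radar band
MPH_TO_M_PER_S = 0.44704


def doppler_shift_hz(radial_speed_m_s: float, carrier_hz: float = ASSUMED_CARRIER_HZ) -> float:
    """Frequency shift of the echo from a target with the given radial speed."""
    # Factor of 2: the wave is shifted once on the way out and once on the way back.
    return 2.0 * radial_speed_m_s * carrier_hz / SPEED_OF_LIGHT_M_PER_S


def speed_from_doppler(shift_hz: float, carrier_hz: float = ASSUMED_CARRIER_HZ) -> float:
    """Invert the relationship: recover radial speed from a measured shift."""
    return shift_hz * SPEED_OF_LIGHT_M_PER_S / (2.0 * carrier_hz)


sixty_mph = 60 * MPH_TO_M_PER_S
shift = doppler_shift_hz(sixty_mph)
```

Only the radial component of motion produces a shift, so a vehicle crossing directly in front of the radar shows almost no Doppler signal even when it is moving fast.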
Cameras see richly but not in 3-D. Lidar measures space precisely but is expensive. Radar sees through weather but blurrily. Fusing all three produces a perception layer more robust than any individual sensor, a philosophy called sensor redundancy.
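One textbook way to see why fusion beats any single sensor is inverse-variance weighting: each sensor's estimate counts in proportion to how precise it is, and the fused estimate ends up more precise than the best input. This is a minimal sketch under the assumption of independent, unbiased measurements; it is not a description of how Waymo or Cruise fuse data, and the sample numbers are invented.

```python
def fuse(estimates):
    """Combine (value, variance) pairs from independent sensors.

    Returns the inverse-variance weighted mean and its variance.
    """
    weights = [1.0 / variance for _, variance in estimates]
    total_weight = sum(weights)
    fused_value = sum(w * value for w, (value, _) in zip(weights, estimates)) / total_weight
    return fused_value, 1.0 / total_weight


# Hypothetical range-to-object readings in meters: (estimate, variance).
camera = (41.0, 9.0)   # depth inferred from a 2-D image, so loosely known
lidar = (40.0, 0.01)   # direct measurement, tightly known
radar = (40.5, 1.0)
fused_range, fused_variance = fuse([camera, lidar, radar])
```

The fused variance is always smaller than the smallest input variance, and if one sensor degrades (its variance grows), its weight shrinks automatically rather than corrupting the result.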
Raw sensor data is nearly useless on its own. A lidar point cloud is just a cloud of numbers; a camera frame is a 2-D array of colored pixels. The perception stack — layers of convolutional neural networks and transformer models — must classify every detected object, predict its future motion, and assign a confidence score. Waymo's models are trained on data collected from tens of millions of real-world miles, plus billions of additional miles generated in simulation.
Tesla takes a different approach: cameras only, no lidar. Its Full Self-Driving system relies on a custom AI chip (the FSD Chip, first deployed in 2019) processing eight camera feeds through a neural network trained on a fleet of millions of vehicles. Tesla argues that humans drive with eyes alone, so cameras should suffice. Critics argue that human visual processing evolved over millions of years; silicon networks trained for a few years need the redundancy that lidar provides.
As of early 2024, Waymo's vehicles had driven over 7 million fully autonomous miles on public roads. Tesla's FSD fleet had accumulated over 500 million miles of FSD-engaged driving, though always with a human driver behind the wheel, responsible for supervising the system and able to intervene.
You are an AV systems designer advising a startup that must choose its sensor stack for a robotaxi operating in both sunny Arizona and foggy San Francisco. Consider cost, weather resilience, depth accuracy, and regulatory factors.
At 9:58 p.m. on March 18, 2018, a Volvo XC90 operated by Uber's Advanced Technologies Group struck and killed Elaine Herzberg, who was walking a bicycle across a multi-lane road in Tempe, Arizona. The National Transportation Safety Board's investigation found that the vehicle's perception system had detected Herzberg six seconds before impact — classifying her first as an unknown object, then as a vehicle, then as a bicycle — cycling between categories because it had no stable class for "pedestrian not in a crosswalk." The system never generated an alert. The safety driver was looking at a device in their lap.
The crash became a watershed moment in AV development. It demonstrated that perception accuracy in controlled test conditions does not translate to reliability across all real-world distributions — a problem researchers call distribution shift.
Modern AV perception systems use object detection networks — most commonly variants of YOLO (You Only Look Once) and transformer-based architectures — to draw bounding boxes around every detected entity in a scene and assign a class label with a confidence score. Classes include: car, truck, motorcycle, pedestrian, cyclist, traffic cone, stop sign, traffic light, and dozens more.
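A detector's raw output usually contains several overlapping boxes for the same object, so a common post-processing step is non-maximum suppression: keep the highest-scoring box and drop same-class boxes that overlap it heavily. The sketch below assumes a hypothetical `Detection` record and a 0.5 overlap threshold; production detectors use optimized library implementations rather than this loop.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    box: tuple[float, float, float, float]  # x_min, y_min, x_max, y_max in pixels
    label: str
    score: float  # confidence in [0, 1]


def iou(a, b) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold: float = 0.5):
    """Keep the best-scoring box among same-class boxes that overlap heavily."""
    kept = []
    for det in sorted(detections, key=lambda d: d.score, reverse=True):
        if all(det.label != k.label or iou(det.box, k.box) < iou_threshold for k in kept):
            kept.append(det)
    return kept


raw = [
    Detection((0, 0, 10, 10), "pedestrian", 0.90),
    Detection((1, 1, 11, 11), "pedestrian", 0.60),  # duplicate of the first
    Detection((50, 50, 80, 70), "car", 0.85),
]
final = non_max_suppression(raw)
```

Here the two pedestrian boxes overlap with an IoU of about 0.68, so only the 0.90 box survives alongside the car.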
The challenge is not detecting a pedestrian in ideal conditions; a human walking on a sidewalk in daylight is easy. The challenge is detecting a pedestrian at night, partially occluded by a parked car, crossing at an unexpected location, while the AV is moving at 40 mph. The Uber crash revealed the cost of setting the confidence threshold for action too high: the system would discard an uncertain detection rather than act on it. Overconfident and under-confident classification are both dangerous, in different ways.
The NTSB determined that Uber's system had been programmed to suppress false positives by requiring high confidence before triggering emergency braking. This threshold caused the system to hesitate fatally. After the crash, Uber and the broader industry revised guidance on how to balance false-positive suppression against reaction time.
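The trade-off between false-positive suppression and reaction time can be made concrete with a toy timeline. The confidence values below are invented for illustration and are not taken from the NTSB report; the function name is hypothetical.

```python
def time_of_first_action(track, threshold):
    """Return how many seconds before impact the system first acts.

    track: (seconds_before_impact, confidence) pairs, earliest observation first.
    Returns None if confidence never reaches the threshold.
    """
    for seconds_before_impact, confidence in track:
        if confidence >= threshold:
            return seconds_before_impact
    return None


# Hypothetical track of one object whose classification firms up as it gets closer.
track = [(6.0, 0.30), (4.0, 0.50), (2.0, 0.70), (1.0, 0.95)]

strict = time_of_first_action(track, threshold=0.90)   # acts 1.0 s before impact
lenient = time_of_first_action(track, threshold=0.50)  # acts 4.0 s before impact
```

Under these made-up numbers, lowering the threshold buys three extra seconds of reaction time, at the price of reacting to more detections that turn out to be nothing. Choosing the threshold is therefore a safety decision, not just a tuning detail.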
Beyond bounding boxes, advanced perception uses semantic segmentation — assigning a class label to every single pixel in the camera frame. The road is one color, the sidewalk another, buildings another. This gives the AV a much richer spatial understanding: it can see exactly where the drivable surface ends, where a puddle begins, and precisely how wide the lane is at a given point.
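At its output, a segmentation network produces a score per class for every pixel, and the label is simply the highest-scoring class. The sketch below shows that final step on a tiny hand-written score array, then measures the widest run of road pixels in a row. The class list and helper names are hypothetical, and the network that would produce the scores is out of scope here.

```python
CLASSES = ("road", "sidewalk", "building")


def segment(scores):
    """Turn per-pixel class scores into per-pixel labels.

    scores[row][col] is a sequence with one score per entry in CLASSES.
    """
    return [
        [CLASSES[max(range(len(CLASSES)), key=pixel.__getitem__)] for pixel in row]
        for row in scores
    ]


def widest_road_run(label_row):
    """Length in pixels of the longest unbroken stretch of road in one image row."""
    best = run = 0
    for label in label_row:
        run = run + 1 if label == "road" else 0
        best = max(best, run)
    return best


# One image row, four pixels wide, with invented scores.
scores = [[
    (0.1, 0.8, 0.1),    # sidewalk
    (0.7, 0.2, 0.1),    # road
    (0.9, 0.05, 0.05),  # road
    (0.2, 0.1, 0.7),    # building
]]
labels = segment(scores)
```

Converting a pixel width into meters additionally requires depth or camera calibration, which is one reason segmentation is paired with lidar.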
Waymo and Mobileye use segmentation networks running in parallel with object detection. The outputs are fused with lidar point clouds to produce a labeled 3-D occupancy grid — essentially a real-time map of the world divided into cells, each marked: free, occupied, unknown.
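A much-simplified 2-D version of an occupancy grid can be built from lidar returns alone: every cell starts as unknown, cells a beam passed through become free, and the cell where the beam ended becomes occupied. This is a sketch under strong assumptions (flat world, sensor at the grid center, one-meter cells, no class labels); the function and constant names are hypothetical.

```python
import math

UNKNOWN, FREE, OCCUPIED = "unknown", "free", "occupied"


def build_grid(hits, size=11, cell_m=1.0):
    """Build a size x size grid from lidar returns.

    hits: (x, y) return positions in meters, relative to the sensor at the grid center.
    """
    grid = [[UNKNOWN] * size for _ in range(size)]
    center = size // 2

    def to_cell(x, y):
        return (center + math.floor(x / cell_m + 0.5),
                center + math.floor(y / cell_m + 0.5))

    def in_bounds(cx, cy):
        return 0 <= cx < size and 0 <= cy < size

    for hx, hy in hits:
        # Sample the beam at half-cell spacing; space it crossed must be empty.
        steps = max(1, int(math.hypot(hx, hy) / (cell_m / 2)))
        for i in range(steps):
            cx, cy = to_cell(hx * i / steps, hy * i / steps)
            if in_bounds(cx, cy) and grid[cy][cx] == UNKNOWN:
                grid[cy][cx] = FREE
        # The beam stopped here, so something occupies this cell.
        cx, cy = to_cell(hx, hy)
        if in_bounds(cx, cy):
            grid[cy][cx] = OCCUPIED
    return grid


grid = build_grid([(3.0, 0.0)])  # a single return 3 m ahead along the x axis
```

Cells behind the obstacle stay unknown, which matters for planning: the vehicle cannot assume that space it has not observed is empty.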
Detecting what is present is only half the task. The AV must predict what each agent will do next. A pedestrian stepping off the curb is likely about to cross. A vehicle with its turn signal on is likely about to change lanes. A ball rolling into the road suggests a child may follow.
Waymo's prediction models — described in a 2022 paper from the Waymo Research team — output a probability distribution over possible future trajectories for each agent over the next eight seconds. The planning system then selects driving actions that maintain safety margins across the most probable scenarios. This is called probabilistic prediction, and it is one of the most active research areas in autonomous driving.
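The idea of planning against a distribution of futures can be sketched with a toy example. Each predicted trajectory carries a probability; a candidate ego plan's risk is the total probability of the trajectories that come within a safety margin of it; the planner keeps the plan with the most forward progress among those under a risk budget. The names, margin, risk budget, and three-step horizon are all assumptions for illustration (the text above describes an eight-second horizon), and this is not Waymo's method.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class PredictedTrajectory:
    probability: float
    positions: list[tuple[float, float]]  # (x, y) in meters at each time step


def collision_risk(ego_plan, predictions, margin_m):
    """Total probability of predicted futures that violate the safety margin."""
    risk = 0.0
    for traj in predictions:
        if any(math.dist(ego, agent) < margin_m
               for ego, agent in zip(ego_plan, traj.positions)):
            risk += traj.probability
    return risk


def choose_plan(plans, predictions, margin_m=2.0, max_risk=0.01):
    """Pick the plan with the most forward progress whose risk fits the budget."""
    safe = {name: plan for name, plan in plans.items()
            if collision_risk(plan, predictions, margin_m) <= max_risk}
    if not safe:
        return None  # no acceptable plan; a real system would fall back to stopping
    return max(safe, key=lambda name: safe[name][-1][0])


# A pedestrian near the curb: 30% chance of crossing, 70% chance of waiting.
pedestrian = [
    PredictedTrajectory(0.3, [(20.0, 5.0), (20.0, 2.5), (20.0, 0.0)]),
    PredictedTrajectory(0.7, [(20.0, 5.0), (20.0, 5.0), (20.0, 5.0)]),
]
plans = {
    "proceed": [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)],
    "slow": [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)],
}
chosen = choose_plan(plans, pedestrian)
```

Even though waiting is the pedestrian's most likely behavior, the 30% crossing branch makes "proceed" too risky, so the planner selects "slow".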
You are an AI safety researcher reviewing perception system failures. Using the 2018 Uber ATG crash as your starting case study, probe what kinds of scenarios cause AV vision systems to fail — and what engineering solutions are being deployed.
By 2023, Mobileye's Road Experience Management (REM) system had collected over eight billion kilometers of anonymized driving data from dashcam-equipped vehicles — taxis, trucks, and consumer cars — to build and continuously update a centimeter-level HD map of public roads in more than 40 countries. Every equipped vehicle acts as a mapping probe, uploading road geometry observations in the background. The aggregate becomes a map accurate enough to localize a vehicle to within 10 centimeters — far beyond what GPS alone can provide.
Consumer GPS — the kind in a smartphone or a standard car navigation system — achieves accuracy of roughly 3–5 meters under good conditions. That sounds precise, but a standard highway lane is 3.7 meters wide. A 5-meter GPS error could place a vehicle in the adjacent lane or on the shoulder. For a human driver glancing at a navigation screen, that imprecision is acceptable. For an autonomous vehicle making its own steering decisions at 70 mph, it is not.
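The arithmetic behind this comparison is short enough to write out. The sketch assumes the worst case, in which the entire GPS error is lateral and the vehicle is truly centered in its lane; the helper names are hypothetical.

```python
LANE_WIDTH_M = 3.7
MPH_TO_M_PER_S = 0.44704


def lanes_displaced(lateral_error_m, lane_width_m=LANE_WIDTH_M):
    """Whole lanes by which the position estimate is off, for a lane-centered vehicle."""
    # The estimate leaves the true lane once the error exceeds half a lane width.
    return int((abs(lateral_error_m) + lane_width_m / 2) // lane_width_m)


def meters_per_second(speed_mph):
    """Distance covered each second at the given speed."""
    return speed_mph * MPH_TO_M_PER_S


worst_case = lanes_displaced(5.0)   # 5 m error: estimate is one lane over
hd_map_case = lanes_displaced(0.1)  # 10 cm error: estimate stays in lane
speed = meters_per_second(70)       # about 31.3 m every second
```

Any lateral error above 1.85 meters (half a lane) is enough to put the estimate in the wrong lane, so even the good end of the 3 to 5 meter range is insufficient for lane-level steering.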
Differential GPS (DGPS) and Real-Time Kinematic (RTK) GPS systems use ground-based correction stations to achieve sub-meter or even centimeter accuracy — but they are expensive, require specialized hardware, and lose precision in urban canyons where tall buildings block satellite signals.
High-Definition maps go far beyond standard navigation maps. An HD map records not just road centerlines and speed limits but: precise lane geometry (width, curvature, grade), lane markings (solid, dashed, double yellow), traffic sign positions and types, traffic light locations and phases, curb positions, crosswalk locations, and overhead clearances. This data is stored as a 3-D geometric model accurate to centimeters.
AV companies including Waymo, Cruise, Mobileye, and Baidu use HD maps as a prior — a known, trusted baseline — against which real-time sensor data is matched. The vehicle knows roughly where it is from GPS; it then aligns its lidar point cloud to the stored HD map geometry to refine its position to centimeter accuracy. This process is called map-based localization or point cloud registration.
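The refinement step can be illustrated with a stripped-down registration loop: start from the GPS position, match each scanned point to its nearest map point, shift the position estimate by the average mismatch, and repeat. This sketch handles translation only (no rotation), uses brute-force nearest-neighbor search, and invents its landmark coordinates; production systems use full 3-D registration methods, and the function name is hypothetical.

```python
def align_translation(scan, map_points, initial_position, iterations=10):
    """Refine a 2-D vehicle position so that scanned points line up with the map.

    scan: (x, y) points in the vehicle's frame.
    map_points: (x, y) landmark positions in the map frame.
    initial_position: rough (x, y) vehicle position, for example from GPS.
    """
    px, py = initial_position
    for _ in range(iterations):
        dx_sum = dy_sum = 0.0
        for sx, sy in scan:
            wx, wy = sx + px, sy + py  # scanned point placed in the map frame
            mx, my = min(map_points, key=lambda m: (m[0] - wx) ** 2 + (m[1] - wy) ** 2)
            dx_sum += mx - wx
            dy_sum += my - wy
        # Shift the position estimate by the mean residual to its nearest map points.
        px += dx_sum / len(scan)
        py += dy_sum / len(scan)
    return px, py


# Hypothetical map landmarks and a vehicle truly located at (2.0, 3.0).
map_points = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
scan = [(mx - 2.0, my - 3.0) for mx, my in map_points]
gps_guess = (3.5, 1.5)  # off by 1.5 m on each axis
refined = align_translation(scan, map_points, gps_guess)
```

The loop only converges when the starting guess is close enough for nearest-neighbor matches to be correct, which is why GPS provides the initial estimate rather than being discarded.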
HD maps become stale. A construction zone that reroutes lanes, a new traffic signal, a recently painted crosswalk — none of these are in yesterday's map. AVs must detect and handle discrepancies between their HD map and real-time sensor data. Waymo's system flags map conflicts and falls back to sensor-only navigation in affected zones. Keeping HD maps current is one of the largest operational costs in the industry.
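A discrepancy check of this kind can be sketched as a simple ratio: what fraction of the landmarks the sensors observe have no counterpart in the map within some tolerance. The tolerance, the 20% conflict limit, and the mode names are all assumptions for illustration, not values from any deployed system.

```python
import math


def map_conflict_ratio(observed, mapped, tolerance_m=0.5):
    """Fraction of observed landmarks with no mapped landmark within tolerance."""
    if not observed:
        return 0.0
    unmatched = sum(
        1 for obs in observed
        if all(math.dist(obs, m) > tolerance_m for m in mapped)
    )
    return unmatched / len(observed)


def navigation_mode(observed, mapped, tolerance_m=0.5, max_conflict=0.2):
    """Fall back to sensor-only navigation when the map disagrees with reality."""
    ratio = map_conflict_ratio(observed, mapped, tolerance_m)
    return "sensor_only" if ratio > max_conflict else "map_based"


mapped_lane_markers = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
normal_day = [(0.0, 0.1), (5.0, 0.0), (10.0, -0.1)]
construction = [(0.0, 0.0), (5.0, 2.0), (10.0, 2.0)]  # lane shifted 2 m sideways
```

With these inputs, `navigation_mode(normal_day, mapped_lane_markers)` stays map-based, while the construction scene, where two of three markers have moved, triggers the fallback.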
For areas without pre-built HD maps, or when map data is unreliable, AVs use simultaneous localization and mapping (SLAM) algorithms. SLAM builds a local map from sensor data in real time while simultaneously estimating the vehicle's position within that map. It is computationally intensive and less precise than map-based localization, but it provides a fallback in unmapped territory.
Tesla's camera-only approach uses a form of visual SLAM: the neural network builds a local 3-D understanding of the environment from camera parallax and optical flow, estimating position and obstacles together. This is one reason Tesla can operate in areas with no HD map, while Waymo and Cruise must pre-map every deployment zone before service begins.
You are a city planner evaluating whether to approve a robotaxi service for your mid-size city. The company needs three months to build HD maps before service can begin. Your city does frequent road construction. Probe the AI about the mapping pipeline, freshness requirements, and risk management strategies.
In June 2021, NHTSA issued a Standing General Order requiring all AV operators to report any crash involving a Level 2 or higher automated driving system within 24 hours. Between July 2021 and May 2023, NHTSA received 392 reports of crashes involving Teslas with Autopilot or FSD engaged — the highest volume of any manufacturer, partly because Tesla had by far the largest fleet of Level 2 vehicles. Waymo reported 18 crashes during the same period across its smaller fleet of fully autonomous vehicles. The raw numbers are difficult to compare without normalizing for miles driven and operational conditions, but the data gave regulators — and the public — their first systematic view of where automated driving systems were involved in collisions.
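Normalizing by exposure is a one-line calculation, and it can reverse the impression given by raw counts. The fleets and figures below are invented purely to show the effect; they are not NHTSA data and do not describe any real manufacturer.

```python
def crashes_per_million_miles(crashes, miles):
    """Crash rate normalized by exposure."""
    if miles <= 0:
        raise ValueError("miles driven must be positive")
    return crashes / (miles / 1_000_000)


# Hypothetical fleets, illustrative numbers only.
large_fleet_rate = crashes_per_million_miles(crashes=300, miles=150_000_000)
small_fleet_rate = crashes_per_million_miles(crashes=15, miles=5_000_000)
```

In this made-up case the large fleet reports twenty times as many crashes yet has the lower rate (2.0 versus 3.0 per million miles). Even a normalized rate remains incomplete, because it ignores differences in road type, weather, and reporting criteria.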
Waymo published an independent safety report in 2023 comparing its crash rate per million miles to the average human driver on comparable roads. The report, authored in part by researchers at the Virginia Tech Transportation Institute, found that Waymo's vehicles had a significantly lower rate of police-reported crashes and injury crashes than human drivers in equivalent urban environments. However, AVs had a higher rate of minor rear-end collisions — where human drivers following a Waymo vehicle were surprised by its cautious, abrupt braking behavior.
This finding points to a challenge not of the AV's visual AI but of its interaction with human drivers who have not adapted to machine driving patterns. AVs follow rules precisely; human drivers follow norms flexibly. The gap between legal driving behavior and expected driving behavior creates collision risk at the human-machine interface.
In October 2023, California's DMV suspended Cruise's robotaxi permit after a Cruise vehicle struck a pedestrian who had already been hit by a human-driven vehicle. The Cruise AV subsequently dragged the pedestrian 20 feet before stopping, because its perception system classified the pedestrian as having cleared the vehicle's path. Cruise recalled all 950 robotaxis from U.S. roads. General Motors later disclosed that Cruise had shared incomplete information with regulators about the sequence of events, leading to criminal and civil investigations. The incident demonstrated that perception failures in edge cases can have catastrophic consequences even for vehicles with otherwise strong safety records.
Despite dramatic progress, current AV visual AI has documented weak points:
Unusual object categories: Objects the training data never included — a large piece of furniture on a highway, a horse-drawn carriage, a flock of birds — can be misclassified or ignored. Tesla's FSD was documented in 2022 approaching a stopped train broadside on a crossing, apparently not recognizing it as an obstacle.
Adversarial conditions: Researchers at the University of Washington and collaborating universities demonstrated in 2018 that attaching specific sticker patterns to a stop sign could cause classification networks to misread it as a speed limit sign with high confidence, a "physical adversarial example." Real-world graffiti on signs can have similar effects.
Social and contextual cues: A human driver can read body language: the pedestrian making eye contact before stepping off the curb, the delivery driver about to open their door. Current AV perception models work primarily on geometric and visual features, not on social intent inference.
Extreme weather: Even lidar is significantly degraded by heavy snow accumulation on sensors and dense fog. The operational design domain of every current AV excludes certain weather conditions — Waymo does not operate in heavy rain above certain thresholds.
As of 2024, the United States has no federal AV-specific regulations — AVs are regulated state-by-state. California, Arizona, and Texas have the most permissive frameworks. California's DMV oversees permit applications; companies must submit safety cases and report crashes. The European Union is developing harmonized AV type-approval rules under UNECE Working Party 29, which will require systematic safety validation of perception systems including computer vision.
The core regulatory challenge is that visual AI is a statistical system — it is not correct 100% of the time, by definition. Regulators accustomed to evaluating mechanical systems with binary pass/fail criteria are adapting to evaluate probabilistic AI systems that improve over time but can never be certified as perfectly safe.
The question is not whether AVs will ever crash — they will. The question is whether they crash less often, less severely, and less randomly than human drivers. Early evidence from Waymo's deployed fleet suggests the answer is trending toward yes. But that answer must be earned mile by mile, in every new city, weather condition, and edge case the world can produce.
You are a policy advisor to a state transportation department drafting AV operating permit requirements. Using real incidents — Uber 2018, Cruise 2023, Tesla SGO data — build a set of evidence-based requirements for how AV companies must demonstrate that their visual AI perception systems are safe enough for public roads.