In the autumn of 1879, Thomas Edison demonstrated a practical incandescent bulb at Menlo Park. Within a decade, electrical infrastructure had begun reshaping factories, cities, and the rhythm of daily life — yet few people in 1879 understood that the real constraint on electrification was not the bulb itself but the entire ecosystem of generators, transmission lines, and standardized voltages that had to be built around it. The bulb was the visible symbol; the infrastructure was the actual determinant of how fast the technology could spread and what it could ultimately do.
The same pattern is repeating today in artificial intelligence. When GPT-4 launched in March 2023, or when AlphaFold 2 solved the protein-folding problem in 2020, public attention landed on the software — the model, the benchmark score, the dramatic demo. Less visible was the hardware substrate that made each breakthrough possible: tens of thousands of GPUs, custom networking fabrics, and data-center cooling systems consuming megawatts of power. The chip is the generator. The model is the light bulb. The infrastructure race is the real story.
This course maps that race: how graphics processors became the engines of modern AI, why companies like NVIDIA, Google, and a constellation of startups are now spending billions designing chips for AI specifically, and how hardware constraints shape — and sometimes hard-limit — what AI systems can learn and do. You will leave with a working mental model of the stack that sits beneath every AI product you encounter. The course does not assume an engineering background, but it does assume you want honest analysis over comfortable simplification.
At the ImageNet Large Scale Visual Recognition Challenge in 2012, a team from the University of Toronto — Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton — submitted results from a convolutional neural network called AlexNet. Their top-5 error rate was 15.3%. The second-place team scored 26.2%. The gap was not incremental; it was a discontinuity. What made AlexNet possible was not a new theoretical insight — convolutional networks had existed since Yann LeCun's 1989 work. What was new was that Krizhevsky had trained AlexNet on two NVIDIA GTX 580 GPUs, each with 3 GB of memory, running in parallel. The hardware had finally crossed a threshold that the algorithms had been waiting at for twenty years.
This is the pattern this lesson examines: hardware thresholds that suddenly make previously impractical ideas practical. Understanding where those thresholds are, and how fast they are moving, is the core skill for anyone trying to anticipate what AI will be capable of next.
Backpropagation — the algorithm used to train virtually every deep neural network — was described in its modern form by Rumelhart, Hinton, and Williams in 1986. The idea of multi-layer networks predates even that, with roots in the 1940s and 1950s. Yet through the 1990s and most of the 2000s, neural networks remained a niche academic pursuit, repeatedly losing benchmark competitions to support vector machines and other methods that were more computationally tractable on the hardware of the era.
The reason was arithmetic. Training a deep network requires billions of floating-point operations per training example, repeated across millions of examples and many passes over the data. The CPUs available in the 1990s could execute these operations, but not at the speed required to train large networks in any reasonable time frame. A training run that takes an hour on a 2012-era GPU cluster would have taken years on 1990s hardware. Years-long experiments are not experiments; they are career bets that few researchers could afford to make.
This is the first principle of the hardware-AI relationship: compute determines which hypotheses are testable. Researchers do not merely run the experiments they think of — they run the experiments they can afford to run. Hardware constraints shape the entire research agenda, invisibly and pervasively.
On a high-end workstation CPU of 1990, training the network that won ImageNet 2012 would have taken several years, by rough estimate. By 2012, the same training run on two GTX 580 GPUs took about six days. The algorithm was identical in concept. Only the hardware had changed.
A modern CPU is optimized for latency: executing a single complex task as fast as possible. It achieves this through large caches, branch prediction, and out-of-order execution — all designed to minimize the time between issuing an instruction and receiving its result. A CPU from 2012 had four to eight cores, each capable of sophisticated sequential computation.
A GPU is optimized for throughput: executing thousands of simple tasks simultaneously. The NVIDIA GTX 580 that Krizhevsky used had 512 CUDA cores. Its successor architectures scaled that to thousands of cores. Each core is simpler than a CPU core, but the aggregate throughput for parallelizable workloads — like matrix multiplication — is orders of magnitude higher. Neural network training is, at its core, a sequence of matrix multiplications. GPUs and neural networks fit together almost perfectly.
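To make that fit concrete, here is a minimal pure-Python sketch of one dense layer's forward pass written as a matrix product. The function name `matmul` and the tiny example matrices are illustrative only and come from no library. The point to notice is that each output cell depends on just one row of `a` and one column of `b`, so in principle every cell could be handed to a separate core and computed at the same time.

```python
def matmul(a, b):
    """Naive matrix product; each output cell is an independent dot product."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):          # no iteration of these two loops depends
        for j in range(cols):      # on another, so they can run in parallel
            out[i][j] = sum(a[i][k] * b[k][j] for k in range(inner))
    return out


inputs = [[1.0, 2.0], [3.0, 4.0]]                 # 2 examples, 2 features each
weights = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]    # 2 features -> 3 outputs
activations = matmul(inputs, weights)
print(activations)
```

A CPU walks through these cells a few at a time; a GPU assigns them to thousands of simple cores at once, which is where the throughput advantage comes from.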
The insight was not obvious. Krizhevsky's 2012 paper noted that GPU training was partly enabled by NVIDIA's CUDA programming platform, released in 2007, which allowed researchers to write general-purpose code that ran on GPU hardware without requiring expertise in graphics programming. CUDA was the software interface that made the hardware accessible. This is a recurring theme: hardware capability requires a corresponding software interface before it translates into research productivity.
In 2020, researchers at OpenAI published a paper known informally as the "Scaling Laws" paper (Kaplan et al., 2020). Its central finding was that language model performance improves predictably and smoothly as a power-law function of three variables: model size (number of parameters), dataset size (number of training tokens), and compute budget (total FLOPs used in training). Crucially, the relationship held across many orders of magnitude with no sign of saturation at the scales studied.
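A short sketch may help show what "predictably and smoothly as a power law" means in practice. The function `power_law_loss`, the `scale` constant, and the exponent value are placeholders chosen for illustration, not values taken from the paper. What the sketch demonstrates is the defining property of a power law: every tenfold increase in compute multiplies the loss by the same constant factor, no matter where on the curve you start.

```python
def power_law_loss(compute, scale=1.0, exponent=0.05):
    """Loss under an assumed power law: loss = (scale / compute) ** exponent."""
    return (scale / compute) ** exponent


budgets = [1e0, 1e1, 1e2, 1e3]            # arbitrary compute units, 10x apart
losses = [power_law_loss(c) for c in budgets]

# Ratio of each loss to the one before it; constant for a pure power law.
ratios = [later / earlier for earlier, later in zip(losses, losses[1:])]
print(ratios)
```

This constancy is what makes extrapolation possible: if the ratio holds over the range already measured, a forecaster can project it onto budgets not yet spent.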
The implication was profound. If performance scales predictably with compute, then the question "how good will AI be?" becomes, to a significant degree, the question "how much compute will be available?" AI capability forecasting becomes hardware forecasting. The researchers and companies that understood this earliest — and acted on it — began the investments that produced GPT-3 (175 billion parameters, trained in 2020), GPT-4 (training details not disclosed, but estimated at far greater scale), and Google's PaLM (540 billion parameters, 2022).
A follow-up paper from DeepMind, Chinchilla (Hoffmann et al., 2022), refined the optimal compute allocation, arguing that prior large models had been significantly undertrained relative to their parameter count — that a smaller model trained on more data could match a larger model trained on less. Even this revision did not challenge the core scaling insight; it merely adjusted the optimal point on the compute-allocation curve. In either framework, hardware remains the binding constraint.
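The reallocation argument can be sketched with two widely quoted rules of thumb, both of which are assumptions here rather than figures from this lesson: training compute is roughly 6 FLOPs per parameter per token, and the Chinchilla result is often summarized as roughly 20 training tokens per parameter. The 300-billion-token figure used for the GPT-3-sized example is likewise an assumed, commonly reported value. Under those assumptions, the sketch takes a fixed compute budget and asks how it would be split.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # assumed rule of thumb: C ~ 6 * N * D
TOKENS_PER_PARAM = 20       # assumed Chinchilla-style ratio: D ~ 20 * N


def training_flops(params, tokens):
    """Approximate training compute for a dense model."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def compute_optimal_split(compute_flops):
    """Solve C = 6 * N * (20 * N) for N, then derive D = 20 * N."""
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


# Budget of a 175B-parameter model trained on an assumed 300B tokens.
budget = training_flops(175e9, 300e9)
opt_params, opt_tokens = compute_optimal_split(budget)
print(f"{opt_params:.3g} parameters, {opt_tokens:.3g} tokens")
```

Under these assumptions the same budget favors a model several times smaller trained on several times more data, which is the sense in which earlier large models were "undertrained."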
Once scaling laws were published, AI development became partially legible to financial analysts, policymakers, and strategists who had no background in machine learning. If you can predict how capability scales with compute, and you can track compute availability, you can build rough capability forecasts. Hardware is the variable that non-experts can most directly observe and measure.
OpenAI's analysis of AI training compute (published 2018, updated subsequently) found that from 2012 to 2018, the amount of compute used in the largest AI training runs doubled approximately every 3.4 months — a pace far exceeding Moore's Law's historical rate of doubling every 18–24 months for transistor count. This was not because chips were improving that fast; it was because researchers and companies were deploying exponentially more chips per training run, and because software improvements (better parallelization, mixed-precision training) multiplied effective throughput.
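The difference between those doubling times compounds quickly. The sketch below is plain arithmetic on the figures in the paragraph above; the helper name `growth_factor` and the choice of a six-year window (2012 to 2018) are illustrative assumptions, and the result is sensitive to the exact start and end dates chosen.

```python
def growth_factor(months, doubling_months):
    """Total growth after `months` if a quantity doubles every `doubling_months`."""
    return 2 ** (months / doubling_months)


span_months = 6 * 12  # 2012 to 2018, an assumed six-year window

ai_compute_growth = growth_factor(span_months, 3.4)
moore_fast = growth_factor(span_months, 18)   # 4 doublings -> 16x
moore_slow = growth_factor(span_months, 24)   # 3 doublings -> 8x

print(f"AI training compute: ~{ai_compute_growth:.2e}x")
print(f"Moore's Law range:   {moore_slow:.0f}x to {moore_fast:.0f}x")
```

The gap of several orders of magnitude is the part that chip improvement alone cannot explain, which is why the paragraph attributes it to more chips per run and to software gains.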
AlexNet (2012) used roughly 0.0001 petaflop-days of compute. AlphaGo Zero (2017) used roughly 1,900 petaflop-days. GPT-3 (2020) used approximately 3,640 petaflop-days. These numbers are estimates with significant uncertainty, but their order-of-magnitude ratios are informative. Taken at face value, they imply that compute per frontier training run increased by a factor of roughly 36 million between 2012 and 2020.
This pace has forced hardware manufacturers to evolve at a speed they had not previously faced. NVIDIA's response — the shift from gaming-oriented GPUs to dedicated AI accelerators — is the story that subsequent lessons in this module will examine in detail.
AlexNet training compute: ~0.0001 petaflop-days. GPT-3 training compute: ~3,640 petaflop-days. That is a 36-million-fold increase in 8 years. Hardware supply grew to meet it, but only barely, and at enormous cost.
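The fold increase above can be checked, and turned into an implied doubling time, with a few lines of arithmetic. The two compute figures are the estimates quoted above and carry the same uncertainty; the variable names are illustrative.

```python
import math

alexnet_pfdays = 0.0001   # estimate quoted above
gpt3_pfdays = 3640        # estimate quoted above
years = 2020 - 2012

fold_increase = gpt3_pfdays / alexnet_pfdays
doublings = math.log2(fold_increase)
implied_doubling_months = years * 12 / doublings

print(f"fold increase: {fold_increase:.3g}")
print(f"implied doubling time: {implied_doubling_months:.1f} months")
```

An implied doubling time of under four months is in the same range as the 3.4-month figure from OpenAI's analysis, so the two ways of describing the trend are broadly consistent.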
Raw compute throughput (FLOP/s) is the metric most often cited in hardware comparisons. But for AI workloads, a second metric is frequently the actual limiting factor: memory bandwidth — the rate at which data can be moved between a chip's memory and its processing units. A chip can have abundant compute but still stall, waiting for data to arrive.
This distinction became critical as model sizes grew. When a model's parameters exceed the memory capacity of a single GPU, they must be distributed across multiple GPUs or stored in slower host memory, with constant data movement. The NVIDIA A100 (2020) addressed this partly through up to 80 GB of HBM2e memory with about 2 TB/s of bandwidth, roughly 10× the bandwidth of a high-end CPU. But even that ceiling was hit by models with hundreds of billions of parameters, requiring complex multi-GPU configurations with dedicated interconnect hardware (NVLink, InfiniBand) to keep data moving fast enough.
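A back-of-the-envelope sketch shows how quickly a large model outgrows one device. It assumes weights stored in a 16-bit format (2 bytes per parameter) and uses the 175-billion-parameter and 80 GB figures mentioned in this lesson; the function name `min_gpus_for_weights` is hypothetical.

```python
import math


def min_gpus_for_weights(num_params, bytes_per_param, gpu_memory_gb):
    """Return (weight size in GB, minimum GPU count needed to hold the weights)."""
    weight_gb = num_params * bytes_per_param / 1e9
    return weight_gb, math.ceil(weight_gb / gpu_memory_gb)


# 175B parameters at 2 bytes each, on GPUs with 80 GB of memory.
weight_gb, gpus = min_gpus_for_weights(175e9, 2, 80)
print(f"{weight_gb:.0f} GB of weights -> at least {gpus} GPUs")
```

This is a lower bound that counts weights only. Training also has to hold gradients, optimizer state, and activations, so real configurations need considerably more memory than this estimate suggests.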
This is why hardware for AI is not simply "fast CPUs." It requires a specific profile: massive parallel compute, enormous memory capacity, and extremely high memory bandwidth — a combination that neither CPUs nor gaming GPUs were designed to deliver at scale.
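One common way to reason about the compute-versus-bandwidth trade-off is a simplified roofline model: attainable throughput is the smaller of peak compute and bandwidth multiplied by the work done per byte moved. The sketch below uses the 2 TB/s bandwidth figure quoted earlier; the peak compute value is an assumed, illustrative accelerator-class number, and the function name is hypothetical.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Simplified roofline: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)


PEAK = 312e12       # assumed peak FLOP/s, for illustration only
BANDWIDTH = 2e12    # 2 TB/s, the memory bandwidth figure quoted earlier

# Work per byte at which a workload stops being bandwidth-bound.
ridge_point = PEAK / BANDWIDTH

low_intensity = attainable_flops(PEAK, BANDWIDTH, 2)      # bandwidth-bound
high_intensity = attainable_flops(PEAK, BANDWIDTH, 500)   # compute-bound

print(f"ridge point: {ridge_point:.0f} FLOPs per byte")
print(f"utilization at 2 FLOPs/byte: {low_intensity / PEAK:.1%}")
```

Under these assumptions, a workload that performs only a couple of operations per byte fetched uses a small fraction of the chip's peak compute, which is the stall described above.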
In this lab you will explore the relationship between hardware availability and AI capability by discussing real historical cases with an AI assistant trained on this lesson's material.
Consider asking: How might AI development have differed if CUDA had never been created? What would scaling laws mean for a company trying to forecast AI capability in 2026? Why might memory bandwidth matter more than raw FLOP/s for certain workloads?
In 2006, Jensen Huang, NVIDIA's co-founder and CEO, made a bet that almost no one outside his company understood. NVIDIA announced CUDA, a programming framework allowing developers to write general-purpose code for GPU hardware, and shipped its first public release the following year. At the time, NVIDIA's revenue came almost entirely from gaming. CUDA represented an investment with no clear customer. The scientific computing market was real but small. No one at NVIDIA, by their own later accounts, anticipated that machine learning researchers would be the primary beneficiaries. Huang called it a "moonshot" in hindsight. At the time it was simply a platform play: make GPUs useful for more than games, and perhaps sell more of them.
The 2012 AlexNet result changed everything. Within two years of that paper, GPU clusters for deep learning had become the primary growth driver in NVIDIA's data center business. The company had accidentally built the foundation of an industry.
NVIDIA's dominance in AI compute is not solely a function of having good chips. It is substantially a function of software ecosystem lock-in. CUDA, released in 2007, accumulated over a decade of optimized libraries before the deep learning boom: cuDNN (deep learning primitives), cuBLAS (linear algebra), NCCL (multi-GPU communication). When researchers needed to train neural networks in 2013, 2015, 2017, they reached for these libraries because they were mature, well-documented, and faster than alternatives.
The competing platforms — AMD's ROCm, Intel's OneAPI — faced a compounding disadvantage. Not only did they have to match NVIDIA's hardware performance; they had to replicate years of library development and, crucially, convince researchers and engineers to retool workflows that already worked. The switching cost grew larger every year as more AI frameworks (TensorFlow, PyTorch) optimized specifically for CUDA. By 2020, the CUDA ecosystem was arguably a more durable competitive advantage than NVIDIA's chip architecture itself.
This is a pattern worth generalizing: in hardware markets, the software ecosystem often outlasts the hardware advantage. The company that wins early and captures developer mindshare can defend its position long after competitors achieve hardware parity.
NVIDIA held an estimated 70–95% share of the market for AI training accelerators as of 2023, depending on the segment measured. Its H100 GPU, released in 2022, commanded prices of $25,000–$40,000 per unit, with lead times stretching to six months or more. Major cloud providers and AI labs reported that GPU availability — not funding, not talent — was the primary constraint on their research programs.
NVIDIA's GPU architecture has evolved through distinct generations, each reflecting lessons learned from AI workloads. The Pascal architecture (2016, P100) was the first designed with data center AI explicitly in mind, introducing 16-bit floating-point (FP16) compute — a format sufficient for neural network training but not for the rendering tasks GPUs had historically performed. This halved memory requirements and doubled effective throughput for AI workloads relative to 32-bit operations.
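The memory saving, and the precision given up for it, can be seen with Python's standard `struct` module, which supports both 32-bit (`f`) and 16-bit (`e`) IEEE floating-point formats. This is only an illustration of the number formats, not of how a GPU stores tensors; the helper name `roundtrip` is hypothetical.

```python
import struct


def roundtrip(value, fmt):
    """Store a Python float in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


fp32_bytes = struct.calcsize("<f")   # 4 bytes per value
fp16_bytes = struct.calcsize("<e")   # 2 bytes per value

x = 0.1
err32 = abs(roundtrip(x, "<f") - x)  # rounding error in 32-bit storage
err16 = abs(roundtrip(x, "<e") - x)  # rounding error in 16-bit storage

print(f"FP32: {fp32_bytes} bytes, error {err32:.2e}")
print(f"FP16: {fp16_bytes} bytes, error {err16:.2e}")
```

Half the bytes per value means half the memory and half the data to move, at the cost of a larger rounding error that neural network training usually tolerates.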
The Volta architecture (2017, V100) introduced Tensor Cores — specialized hardware units designed specifically for the matrix-multiply-accumulate operations at the heart of neural network training. Where regular CUDA cores perform one multiplication per clock cycle, Tensor Cores perform 64 multiplications per clock cycle on small matrix tiles. The V100 could deliver 125 TFLOP/s on AI workloads versus 14 TFLOP/s on general FP32 compute — a 9× uplift specifically for AI.
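The matrix-multiply-accumulate operation can be written out in software to show where the figure of 64 comes from. The sketch assumes a 4×4 tile, which gives 4 × 4 × 4 = 64 multiplications; that tile size is an assumption for illustration, and the loop only mirrors the arithmetic, since the hardware performs the whole tile as a single operation. The function name `tile_mma` is hypothetical.

```python
TILE = 4  # assumed tile size; 4 * 4 * 4 = 64 multiply-adds per tile


def tile_mma(a, b, c):
    """Return (a @ b + c, multiplication count) for TILE x TILE matrices."""
    out = [row[:] for row in c]        # start from the accumulator c
    multiplies = 0
    for i in range(TILE):
        for j in range(TILE):
            for k in range(TILE):
                out[i][j] += a[i][k] * b[k][j]
                multiplies += 1
    return out, multiplies


identity = [[1.0 if i == j else 0.0 for j in range(TILE)] for i in range(TILE)]
ones = [[1.0] * TILE for _ in range(TILE)]

result, count = tile_mma(identity, ones, ones)   # I @ ones + ones
uplift = 125 / 14                                # TFLOP/s figures quoted above
print(count, f"{uplift:.1f}x")
```

A core that does one multiplication per cycle would need 64 cycles for this tile; a unit that does the whole tile per cycle is what produces the roughly ninefold gap between the two throughput figures.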
The Ampere architecture (2020, A100) extended this further with third-generation Tensor Cores supporting TF32, BF16, INT8, and FP64 precision modes, allowing the same chip to serve both AI training and scientific simulation workloads. The A100 also introduced Multi-Instance GPU (MIG) technology, allowing a single physical GPU to be partitioned into up to seven isolated instances — critical for cloud providers serving many customers simultaneously.
The Hopper architecture (2022, H100) added the Transformer Engine — hardware specifically optimized for the attention mechanism in transformer models, which had become the dominant architecture for large language models after the 2017 "Attention Is All You Need" paper by Vaswani et al. The H100 can dynamically switch between FP8 and FP16 precision within a single layer, increasing throughput while maintaining training stability.
NVIDIA's dominance has attracted well-funded competition from multiple directions. Google's TPU (Tensor Processing Unit) program began in 2015 with internal deployment and reached its fourth generation by 2021. TPUs are application-specific integrated circuits (ASICs) designed exclusively for tensor operations — they have no rendering capability and are not sold externally (except through Google Cloud). Google has used TPUs to train PaLM, Gemini, and other frontier models. Benchmarks suggest TPU v4 pods can match or exceed H100 clusters for certain training workloads at lower cost per FLOP, but the ecosystem — software, tooling, model compatibility — remains largely proprietary.
AMD's Instinct MI300X, released in 2023, offered competitive raw performance with a notable advantage: 192 GB of HBM3 memory per card, versus 80 GB on the H100. For inference on very large models, memory capacity is often the binding constraint, and the MI300X's advantage here attracted genuine enterprise interest. AMD's challenge remains the CUDA ecosystem gap — ROCm has improved substantially but lacks the maturity of CUDA's library stack.
A cohort of AI chip startups — Cerebras, Graphcore, SambaNova, Groq, d-Matrix among them — have proposed alternative architectures: wafer-scale chips (Cerebras), Intelligence Processing Units with novel memory architectures (Graphcore), and inference-specialized designs (Groq's Language Processing Unit). As of 2024, none had achieved the scale of deployment to challenge NVIDIA's position, but collectively they represent a substantial bet that the GPU is not the final form factor for AI compute.
NVIDIA's market share is real but partially an artifact of timing and ecosystem inertia rather than insurmountable technical superiority. If a competitor — including the hyperscalers building their own silicon — achieves both hardware parity and software ecosystem compatibility, the transition could be faster than historical hardware transitions. The precedent of Intel's loss of the mobile chip market to ARM architectures in the 2010s is instructive.
The most credible long-term challenge to NVIDIA's position may come not from rival chip companies but from NVIDIA's own largest customers. Google, Amazon, Microsoft, and Meta all have active custom silicon programs for AI.
Google's TPU program is the most mature. Amazon's Trainium (for training) and Inferentia (for inference) chips are deployed at scale within AWS; Trainium2, announced in 2023, is claimed to offer cost-performance improvements over the H100 for certain workloads. Meta's MTIA (Meta Training and Inference Accelerator) targets inference specifically, aimed at reducing the cost of serving recommendations and generative AI features to billions of users. Microsoft announced its Maia 100 AI accelerator in 2023, developed with input from OpenAI.
The incentive is straightforward: at the scale these companies operate, even a 20% cost reduction on compute translates into billions of dollars annually. Custom silicon, even if it requires $500 million or more in development costs, can pay back quickly. The risk is that in-house silicon creates fragmentation — code written for NVIDIA's ecosystem must be ported, a significant engineering cost.
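The payback logic is simple enough to sketch. The 20% saving and the $500 million development cost come from the paragraph above; the $5 billion annual compute spend is a purely hypothetical input chosen to represent a hyperscaler-sized budget, and the function name is illustrative.

```python
def payback_years(development_cost, annual_compute_spend, cost_reduction):
    """Years needed for compute savings to repay a one-time development cost."""
    annual_savings = annual_compute_spend * cost_reduction
    return development_cost / annual_savings


# $500M development cost, hypothetical $5B/year compute spend, 20% saving.
years = payback_years(500e6, 5e9, 0.20)
print(f"payback in {years:.1f} years")
```

The same calculation explains why smaller companies rarely build their own chips: at a tenth of that compute spend, the payback period stretches to five years, longer than a typical chip generation stays competitive.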
Explore the competitive dynamics of the AI chip market with your lab assistant. Consider the interplay between hardware performance, software ecosystems, and the strategic motivations of different players.
You might ask: How durable is NVIDIA's CUDA moat if AMD achieves hardware parity? What conditions would accelerate hyperscaler adoption of in-house silicon? How should a startup choose between NVIDIA, AMD, and cloud TPU options for AI training?
On October 7, 2022, the U.S. Commerce Department published export control regulations restricting the sale of advanced semiconductors and chip-manufacturing equipment to China. The rules were more sweeping than any prior technology export control in recent history: they targeted chips capable of training large AI models, the equipment used to manufacture such chips, and — most significantly — any U.S. persons involved in the Chinese chip industry. The day the rules took effect, American engineers working at Chinese semiconductor firms were legally required to stop working immediately or seek an individual license. Many resigned that same week.
The October 7 controls were not the beginning of semiconductor geopolitics, but they marked its escalation to a new intensity. They also made unmistakable what had previously been understood only in specialized policy circles: the semiconductor supply chain is a strategic chokepoint, and advanced AI capability depends on navigating it.
Modern AI chips — NVIDIA H100s, Google TPUs, Apple's M-series — are all manufactured at advanced nodes (currently 3–5 nanometer process technology) by a very small number of fabrication facilities. TSMC (Taiwan Semiconductor Manufacturing Company) manufactures chips for NVIDIA, AMD, Apple, Qualcomm, and most major fabless chip designers. Samsung's foundry division handles some of this work. Intel, following years of manufacturing difficulties, has been rebuilding its foundry capabilities under the Intel Foundry Services program.
TSMC alone accounts for an estimated 90%+ of the world's most advanced chip manufacturing (sub-5nm nodes). Its main facilities are in Hsinchu, Taichung, and Tainan, Taiwan. This concentration is a product of decades of investment, accumulated process knowledge, and supply-chain clustering — not geography per se. But it means that disruptions to Taiwan — whether from natural disaster, political instability, or military conflict — would have immediate and severe consequences for global AI hardware supply.
TSMC has announced significant international expansion: a $40 billion investment in Phoenix, Arizona (with two fabs planned, starting 4nm production in 2025), and facilities in Kumamoto, Japan (mature nodes) and discussions about European sites. But advanced-node capacity outside Taiwan will remain limited for years, and the process knowledge embedded in Taiwan's existing facilities cannot be replicated quickly.
An industry estimate suggests that if TSMC's advanced fabs were unavailable for one year, global semiconductor supply would fall by 37% and the disruption to electronics manufacturing would exceed the combined GDP impact of the 2008–2009 financial crisis. AI chip production would be disproportionately affected, as it relies almost entirely on sub-7nm nodes that only TSMC and Samsung can currently produce.
Semiconductor fabrication at advanced nodes requires extreme ultraviolet (EUV) lithography machines — equipment that uses 13.5-nanometer wavelength light to etch circuit patterns onto silicon wafers with nanometer precision. These machines are manufactured by a single company: ASML, headquartered in Eindhoven, Netherlands.
ASML's EUV machines cost approximately $150–$200 million each, weigh 180 tons, contain over 100,000 parts, and require a year or more to install and calibrate. They are so complex that ASML sends field engineers to live on-site at customer facilities. The supply chain for a single EUV machine spans over 5,000 suppliers in more than 30 countries.
ASML is the only company in the world that can manufacture EUV lithography equipment. This is not because competitors tried and failed; it is because the technology required decades of sustained investment, including a 2012 co-investment program in which Intel, TSMC, and Samsung took equity stakes in ASML to help fund EUV development. No other entity has made the equivalent investment. As of 2024, ASML delivers approximately 50–60 EUV machines per year. Demand substantially exceeds supply.
EUV exports to China were already blocked: under U.S. pressure, the Dutch government has withheld export licenses for ASML's EUV machines since 2019. In 2023 the Netherlands extended restrictions to some of the company's more advanced DUV (deep ultraviolet) equipment. China's advanced chip ambitions are therefore constrained not just by chip export controls but by lithography equipment export controls, a chokepoint upstream of the chip manufacturers themselves.
The October 7, 2022 Bureau of Industry and Security (BIS) rules targeted chips meeting specific performance thresholds: the initial rules restricted export of chips with a processing-performance score of roughly 4,800 or more (TOPS, tera-operations per second, multiplied by operand bit length) combined with interconnect bandwidth of 600 GB/s or more, parameters that caught NVIDIA's A100 and H100. NVIDIA subsequently introduced modified products (A800 and H800) with reduced interconnect speeds, compliant with the initial rules. BIS updated the controls in October 2023 to close these workarounds, restricting the A800 and H800 as well.
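The two-part structure of the initial rule explains why lowering interconnect speed alone was enough to produce a compliant product. The sketch below encodes the thresholds quoted above as a simple check. The chips and their figures are hypothetical, the treatment of the thresholds as "at or above" is an assumption, and the real regulation contains definitions and exceptions that this does not capture.

```python
from dataclasses import dataclass

PERFORMANCE_THRESHOLD = 4800        # performance threshold quoted above
INTERCONNECT_THRESHOLD_GB_S = 600   # interconnect threshold quoted above


@dataclass
class Chip:
    name: str
    performance: float        # same units as PERFORMANCE_THRESHOLD
    interconnect_gb_s: float  # chip-to-chip bandwidth in GB/s


def is_restricted(chip):
    """Both conditions must hold for the initial rule to apply (assumed >=)."""
    return (chip.performance >= PERFORMANCE_THRESHOLD
            and chip.interconnect_gb_s >= INTERCONNECT_THRESHOLD_GB_S)


full_speed = Chip("hypothetical-full", 5000, 600)
reduced_link = Chip("hypothetical-reduced", 5000, 400)

for chip in (full_speed, reduced_link):
    print(chip.name, "restricted" if is_restricted(chip) else "exportable")
```

Because the test is a conjunction, a chip with unchanged compute but a slower link falls outside it, which is the gap the October 2023 update was written to close.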
The controls created significant market disruption. Chinese technology companies including Baidu, Alibaba, ByteDance, and Tencent had placed large orders for H100s ahead of the rules taking effect — orders that were subsequently unfulfilled. Chinese companies accelerated investment in domestic chip development, primarily through Huawei's Ascend line of AI accelerators, and began stockpiling chips that predated the restrictions. Huawei's Ascend 910B, manufactured by SMIC using older process technology, demonstrated in 2023 that competitive (if not equivalent) AI chips could be produced domestically, though at lower yield rates and higher unit costs than TSMC-manufactured equivalents.
The broader policy debate concerns whether export controls slow AI development in adversarial nations, accelerate domestic chip investment in those nations, or both. The evidence to date suggests both effects are real — controls impose meaningful friction and delay while simultaneously concentrating Chinese government investment in domestic alternatives.
Export controls on advanced chips are effective precisely because the chokepoints in the semiconductor supply chain are so concentrated. But that same concentration creates fragility for everyone — including the countries imposing controls. A world where AI hardware depends on a single island, a single lithography company, and a handful of chemical suppliers is a world where AI capability is subject to disruptions that have nothing to do with AI itself.
The CHIPS and Science Act, signed by President Biden on August 9, 2022, allocated $52.7 billion for semiconductor research, development, and manufacturing incentives in the United States, with approximately $39 billion in direct manufacturing subsidies. The Act also included a provision prohibiting recipients from expanding advanced manufacturing in "countries of concern" (primarily China) for ten years.
The major beneficiaries include TSMC Arizona (more than $65 billion in total announced investment, receiving $6.6 billion in CHIPS grants), Intel ($8.5 billion in grants plus $11 billion in loans for Ohio and Arizona fabs), Samsung ($6.4 billion for a Texas facility), and Micron ($6.1 billion for memory chip facilities). The EU launched a parallel European Chips Act targeting €43 billion in investment to reach 20% of global chip production by 2030, a target most analysts consider ambitious.
The fundamental challenge for reshoring is not money but time and knowledge. Advanced semiconductor manufacturing requires process knowledge accumulated over decades, embodied in the tacit expertise of engineers and technicians who currently live and work in East Asia. Training a new workforce and replicating process maturity takes years — TSMC's Arizona fab has faced production delays partly attributed to workforce and supply chain challenges in building that expertise base outside its home environment.
Discuss the geopolitical dimensions of AI hardware with your lab assistant, focusing on supply chain concentration, export controls, and the strategic implications for AI development globally.
Consider asking: How effective are export controls likely to be in slowing adversarial AI development? What scenarios could disrupt the global AI chip supply chain? How should a non-U.S. country think about AI hardware sovereignty?
By mid-2023, ChatGPT was serving an estimated 100 million active users, generating responses to millions of queries per hour. Each response required running a forward pass through a model with hundreds of billions of parameters, a computation on the order of hundreds of billions of floating-point operations per token generated (roughly two operations per parameter for a dense model). The training run that created GPT-4 was a one-time cost measured in millions of GPU-hours. The cost of serving that model (running it in production, continuously, for every user query) would, over the following year, likely equal or exceed the training compute cost. And unlike training, which can be run once and stopped, inference runs for as long as the product is live, at the pace of user demand.
This arithmetic is forcing a rethink of AI hardware. Training and inference have different computational profiles, different memory requirements, different latency constraints, and different economic logics. The hardware optimized for one is not necessarily optimal for the other. The inference market — largely hidden from public view behind API endpoints — may ultimately dwarf the training market in total silicon value deployed.
Training a neural network is a throughput-bound, batch-oriented workload. The system processes large batches of examples simultaneously, performing forward and backward passes, and the primary metric is how many training examples can be processed per second. Latency per example matters less than aggregate throughput. Training runs can take days or weeks, and their cost is amortized over the life of the model.
Inference — running a trained model to generate predictions — is a latency-bound, often real-time workload. Users expect responses within seconds; in some applications (autonomous vehicles, real-time translation), milliseconds. The batch sizes are smaller. The memory footprint requirement is the full model size, but the computation per request is far less than a training step. Economic optimization focuses on cost per query rather than total throughput.
These differences mean that inference hardware can be more specialized and potentially cheaper than training hardware. A chip optimized for inference does not need the gradient accumulation capabilities, the high-precision floating point, or the massive parallel throughput required for training. It needs to be fast, low-power, and able to fit the model's active layers in accessible memory — often on-device, at the network edge, or in a data center inference cluster.
Groq, founded in 2016 by former Google engineers including one of the original TPU designers, developed the Language Processing Unit (LPU) — a chip architecture designed specifically for inference on language models. The LPU abandons the caches and dynamic execution of conventional processors in favor of a deterministic, compiler-scheduled design: every memory access and computation is statically planned at compile time, eliminating the latency variance that comes from cache misses and branch mispredictions.
In early 2024, Groq demonstrated LLaMA 2 inference at over 500 tokens per second on its LPU hardware, compared to roughly 40–60 tokens per second on comparable GPU setups — a speed difference visible to end users as near-instantaneous response versus noticeable generation delay. The tradeoff is flexibility: the LPU's static scheduling means it is less adaptable to novel architectures and workloads than a general-purpose GPU. This is the inference-training specialization trade-off made concrete.
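The user-visible difference follows directly from the throughput figures. The sketch below converts tokens per second into waiting time for a single response; the 300-token response length is a hypothetical value chosen for illustration, and the function name is illustrative.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Time to generate a response of `num_tokens` at a given throughput."""
    return num_tokens / tokens_per_second


RESPONSE_TOKENS = 300  # hypothetical response length

lpu_time = generation_seconds(RESPONSE_TOKENS, 500)   # LPU figure quoted above
gpu_slow = generation_seconds(RESPONSE_TOKENS, 40)    # low end of GPU range
gpu_fast = generation_seconds(RESPONSE_TOKENS, 60)    # high end of GPU range

print(f"LPU: {lpu_time:.1f} s, GPU: {gpu_fast:.1f} to {gpu_slow:.1f} s")
```

Under a second versus five or more seconds is the difference between a response that feels instant and one the user watches being written.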
Amazon's Inferentia chips, deployed within AWS, are inference-optimized ASICs offered to cloud customers at lower cost per query than equivalent GPU inference. Inferentia2 (2023) supports models of up to 175B parameters, with 384 GB of total accelerator memory across a 12-chip instance linked by the NeuronLink interconnect. AWS has used Inferentia internally for Alexa and recommendation systems, reporting significant cost reductions versus GPU-based inference, though model compatibility requires AWS Neuron SDK integration.
At 100 million daily active users generating an average of 10 responses each, and assuming a cost of $0.002 per response (a rough mid-2023 estimate for GPT-3.5-class inference), the inference compute cost is $2 million per day, or $730 million per year. Cutting inference cost by 50% through specialized hardware saves $365 million annually, dwarfing most training costs. This is why inference hardware investment is accelerating rapidly.
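The arithmetic above is easy to reproduce and to rerun with different assumptions. All three inputs are the rough estimates stated in the paragraph; the function name `inference_cost` is illustrative.

```python
def inference_cost(daily_users, responses_per_user, cost_per_response):
    """Return (daily cost, annual cost) in dollars for a serving workload."""
    daily = daily_users * responses_per_user * cost_per_response
    return daily, daily * 365


daily_cost, yearly_cost = inference_cost(100_000_000, 10, 0.002)
savings_at_half = yearly_cost * 0.5   # saving from a 50% cost reduction

print(f"daily:   ${daily_cost:,.0f}")
print(f"yearly:  ${yearly_cost:,.0f}")
print(f"savings: ${savings_at_half:,.0f}")
```

Because cost scales linearly with each input, small changes in per-response cost move the annual total by hundreds of millions of dollars, which is what makes inference-specific hardware worth designing.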
Not all AI inference runs in data centers. Edge AI refers to running AI models on local devices — smartphones, cameras, vehicles, industrial sensors — without sending data to a central server. The motivations are latency (local inference has zero network round-trip), privacy (data never leaves the device), cost (no cloud inference fees), and reliability (functions without internet connectivity).
The most prominent edge AI chip program is Apple's Neural Engine, embedded in Apple Silicon (M and A series chips) since the A11 Bionic in 2017. The M4 chip (2024) includes a Neural Engine capable of 38 TOPS (trillion operations per second). This allows real-time features like Face ID, live transcription, and on-device AI assistants to run entirely on the device. Apple's tight integration of Neural Engine, CPU, and GPU on a single die, sharing a unified memory pool, gives it latency and power efficiency advantages that discrete GPU inference cannot match for on-device workloads.
Qualcomm's Hexagon NPU in the Snapdragon 8 Gen 3 delivers 45 TOPS for edge AI on Android devices. Google's Tensor G3 chip (Pixel 8, 2023) includes a dedicated TPU for on-device AI processing. These chips are driving a generation of features — real-time language translation, on-device image generation, voice assistance — that were previously impossible without cloud connectivity.
The edge AI trend has implications for data center demand: as more inference workloads move to devices, the growth of cloud inference may be partially offset. But the largest current generative AI models, such as frontier multimodal systems and long-context language models, remain too large for edge hardware. The boundary between edge-capable and cloud-only AI capabilities is one of the most dynamic frontiers in hardware development.
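Whether a model is "too large for edge hardware" comes down, to a first approximation, to whether its weights fit in device memory. The following is a minimal sketch under stated assumptions: the parameter counts, the 4-bit quantization, the 8 GB device, and the rule that only half of device memory is available to the model are all hypothetical values for illustration.

```python
def weights_gb(num_params: float, bits_per_weight: int) -> float:
    """Size of the model weights in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


def fits_on_device(num_params: float,
                   bits_per_weight: int,
                   device_memory_gb: float,
                   usable_fraction: float = 0.5) -> bool:
    """True if the weights fit in the share of memory the model may use."""
    return weights_gb(num_params, bits_per_weight) <= device_memory_gb * usable_fraction


PHONE_MEMORY_GB = 8  # hypothetical phone

small_fits = fits_on_device(3e9, 4, PHONE_MEMORY_GB)   # hypothetical 3B model, 4-bit
large_fits = fits_on_device(70e9, 4, PHONE_MEMORY_GB)  # hypothetical 70B model, 4-bit

print("3B model at 4-bit fits: ", small_fits)
print("70B model at 4-bit fits:", large_fits)
```

The check ignores compute throughput, memory bandwidth, and battery limits, each of which can rule out a model that nominally fits.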
The edge-cloud split in inference hardware means that AI capability will increasingly differ depending on where it runs. Models that can run on-device will be widely accessible, private, and fast. Models that require data center inference will be more capable but more expensive, dependent on connectivity, and subject to the economics and policy constraints of cloud providers. Hardware determines not just what AI can do, but who gets access to it under what conditions.
The GPU-centric paradigm that has driven AI hardware since 2012 faces physical limits. Moore's Law — the observation that transistor density doubles roughly every two years — has slowed substantially at advanced nodes. TSMC's 3nm process delivers incremental improvements over 5nm; the gap between 3nm and the theoretical physical limit is narrowing. New approaches are being explored at various stages of maturity.
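The effect of a slowing doubling cadence compounds quickly. As a rough illustration, the sketch below compares the density gain over a decade under the classic two-year doubling with a hypothetical three-year doubling; the three-year figure is an assumption for comparison, not a measured industry value.

```python
def density_multiplier(years: float, doubling_period_years: float) -> float:
    """Factor by which transistor density grows after `years`."""
    if doubling_period_years <= 0:
        raise ValueError("doubling_period_years must be positive")
    return 2 ** (years / doubling_period_years)


classic = density_multiplier(10, 2)  # doubling every two years
slowed = density_multiplier(10, 3)   # hypothetical slower cadence

print(f"Two-year doubling over 10 years:   {classic:.1f}x")
print(f"Three-year doubling over 10 years: {slowed:.1f}x")
```

Over ten years, stretching the doubling period from two years to three cuts the cumulative gain from 32x to about 10x.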
Neuromorphic computing, which attempts to replicate the spiking, event-driven nature of biological neurons rather than the continuous-valued matrix operations of current AI, offers potential energy efficiency gains. Intel's Loihi 2 (2021) is among the most advanced neuromorphic chips, though it is offered as a research platform rather than a commercial product. It has demonstrated inference on certain sparse, temporal tasks with orders-of-magnitude better energy efficiency than GPUs. But neuromorphic computing requires fundamentally different training methodologies and has not yet demonstrated competitiveness on the transformer-architecture workloads that define current AI. It remains a long-term research direction rather than a near-term commercial alternative.
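The "spiking, event-driven" idea can be made concrete with a leaky integrate-and-fire neuron, a standard textbook model. This is a minimal sketch and not how Loihi 2 or any specific chip is programmed; the threshold, leak factor, and input values are illustrative assumptions.

```python
def simulate_lif(input_current, threshold=1.0, leak=0.9):
    """Simulate one leaky integrate-and-fire neuron over discrete time steps.

    Returns a list with 1 where the neuron spikes and 0 elsewhere.
    """
    potential = 0.0
    spikes = []
    for current in input_current:
        # The stored potential decays a little, then the new input is added.
        potential = leak * potential + current
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # reset after firing
        else:
            spikes.append(0)
    return spikes


inputs = [0.0, 0.6, 0.6, 0.0, 0.0, 1.2]
spikes = simulate_lif(inputs)

print("spikes:", spikes)
print("events:", sum(spikes), "of", len(spikes), "time steps")
```

The neuron emits output only on the steps where its potential crosses the threshold. Event-driven hardware does downstream work only for those spikes, which is where the energy savings on sparse inputs come from.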
Photonic computing uses photons rather than electrons for computation. In principle it offers very high bandwidth and low latency for moving data within and between chips, with lower heat dissipation than electrical interconnects. Lightmatter and Luminous Computing are among the startups pursuing photonic AI accelerators. The challenges involve converting between optical and electronic signals efficiently and manufacturing optical components at the precision required. Commercial photonic AI systems are likely years away from competing with GPU clusters at scale.
Quantum computing, frequently mentioned in AI contexts, is the longest-horizon candidate. Current quantum computers are not suited to the large matrix operations of neural network training and inference. Quantum advantage for AI — if it materializes — is more likely to come in specific optimization or simulation problems than in direct replacement of GPU workloads. The most honest assessment of quantum computing's AI relevance is that it is a research area with long-term potential and near-zero near-term practical impact on AI capability.
The GPU will remain the dominant AI training hardware through at least the late 2020s. Specialized inference chips will take significant share of inference workloads over the same period. Edge AI chips will enable meaningful capability on devices. Neuromorphic and photonic computing are research bets, not near-term transitions. Quantum computing's AI relevance is more than a decade away in any commercially meaningful form.
Discuss inference hardware economics, edge AI implications, and the prospects for post-GPU architectures with your lab assistant.
You might ask: How will the economics of inference change as models become more widely deployed? What model architectures would benefit most from neuromorphic hardware? How should a product team decide between cloud and edge inference for a new AI feature?