When NVIDIA reported its fiscal Q4 2024 results, revenue had grown 265% year-over-year to $22.1 billion, almost entirely on data-center GPU sales. The H100 chip was back-ordered worldwide. Every major cloud, every AI lab, every nation-state building an AI strategy was essentially in the same queue. On the surface it looked like GPU dominance was total and permanent.
But inside those same data centers, engineers were logging something quieter: utilization rates on individual H100s were often below 40% for many real-world inference workloads. The chips were powerful — they just weren't always the right shape for the job.
The GPU's rise to AI dominance was not planned. NVIDIA's CUDA platform, launched in 2006, was originally aimed at scientific computing. When Alex Krizhevsky trained AlexNet on two GTX 580s in 2012 and crushed ImageNet, it confirmed that the GPU's fundamental architecture — thousands of small parallel cores executing the same instruction on different data — was a near-perfect match for matrix multiplication, the core operation of neural networks.
That match drove a decade of co-evolution. GPUs grew tensor cores designed specifically for mixed-precision matrix math. Memory bandwidth scaled from gigabytes to terabytes per second (the H100 SXM5 delivers 3.35 TB/s). Interconnects like NVLink allowed racks of GPUs to behave as a single logical accelerator. By 2023 the H100 could train a large language model in roughly one-tenth the time of its predecessor from five years earlier.
The execution model that enabled this is SIMD (Single Instruction, Multiple Data), which NVIDIA implements in a variant it calls SIMT (Single Instruction, Multiple Threads). Thousands of cores execute identical operations simultaneously on different numbers. This is superb for training, where every layer of a network processes a large batch of examples identically. It is less well suited to inference on a single request, where the batch size is often one and output tokens must be generated one after another.
Inference on modern large language models is often limited not by compute but by memory bandwidth. When GPT-class models generate text, most of the time is spent moving weight parameters from HBM to the compute cores, not actually computing. A 70-billion-parameter model requires roughly 140 GB just to store weights at fp16, more than the 80 GB of HBM on a single H100. The GPU spends most of its time waiting for data to arrive, not multiplying it.
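The arithmetic behind that claim can be sketched in a few lines. This is a back-of-envelope estimate, not a benchmark: it assumes every weight is read from memory once per generated token at batch size one, and it ignores the KV cache, activations, and the fact that a model this size would have to be sharded across several devices. The function names are illustrative.

```python
# Back-of-envelope estimate of batch-1 decode speed when weight streaming dominates.
# Figures from the text: 70B parameters, fp16 (2 bytes each), 3.35 TB/s HBM bandwidth.

def weight_bytes(num_params: float, bytes_per_param: float) -> float:
    """Bytes needed to hold the model weights."""
    return num_params * bytes_per_param

def max_tokens_per_second(num_params: float, bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode rate if every weight is read once per token (assumption)."""
    return bandwidth_bytes_per_s / weight_bytes(num_params, bytes_per_param)

PARAMS_70B = 70e9
FP16_BYTES = 2
H100_BANDWIDTH = 3.35e12  # bytes per second (H100 SXM5, from the text)

footprint_gb = weight_bytes(PARAMS_70B, FP16_BYTES) / 1e9
ceiling = max_tokens_per_second(PARAMS_70B, FP16_BYTES, H100_BANDWIDTH)

print(f"fp16 weights: {footprint_gb:.0f} GB")
print(f"bandwidth-bound ceiling: {ceiling:.1f} tokens/s at batch size 1")
```

Under these assumptions the ceiling is a few dozen tokens per second per request, regardless of how many FLOPS the chip can deliver.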
1. Power density. The H100 SXM5 draws 700 watts. An eight-GPU server draws 5.6 kilowatts for the GPUs alone, before CPUs, networking, cooling, or power conversion losses. A hyperscale data center running tens of thousands of H100s faces power bills and cooling infrastructure costs that are rewriting the economics of cloud computing. Google, Microsoft, and Amazon have all announced or begun building dedicated nuclear power and advanced grid connections specifically for AI workloads. The chip has become almost too powerful to power.
2. Memory bandwidth vs. compute ratio. NVIDIA's own "roofline" analysis shows that for most transformer inference workloads, the H100 is memory-bound, not compute-bound. That means adding more FLOPS — the traditional GPU improvement axis — delivers diminishing returns unless memory bandwidth scales equally. Stacking more HBM (high-bandwidth memory) dies raises cost and yield risk.
3. General-purpose overhead. GPUs retain substantial infrastructure for graphics: rasterization units, display engines, driver stacks. Even in data-center-only SKUs like the H100, the architectural heritage imposes overhead. Purpose-built AI accelerators can eliminate this entirely and spend every transistor on AI operations.
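The memory-bound argument in item 2 can be made concrete with a minimal roofline sketch. The bandwidth figure is the one quoted earlier in the text; the peak-compute figure is an assumption added here for illustration (an approximate dense fp16 tensor peak), as is the simplification that each weight contributes one multiply-add per sequence in the batch.

```python
# Roofline sketch: attainable throughput = min(peak compute, bandwidth * intensity).

H100_BANDWIDTH = 3.35e12      # bytes/s, H100 SXM5 figure from the text
ASSUMED_PEAK_FLOPS = 989e12   # ASSUMPTION: approximate dense fp16 tensor peak, not from the text
FP16_BYTES = 2

def attainable_flops(peak_flops: float, bandwidth: float, intensity: float) -> float:
    """Roofline bound for a kernel with the given FLOPs-per-byte intensity."""
    return min(peak_flops, bandwidth * intensity)

def decode_intensity(batch_size: int, bytes_per_param: float = FP16_BYTES) -> float:
    """FLOPs per weight byte when each weight does one multiply-add per sequence."""
    flops_per_param = 2 * batch_size  # one multiply plus one add, per sequence in the batch
    return flops_per_param / bytes_per_param

ridge_point = ASSUMED_PEAK_FLOPS / H100_BANDWIDTH  # intensity where the two limits meet

for batch in (1, 8, 64, 512):
    intensity = decode_intensity(batch)
    bound = attainable_flops(ASSUMED_PEAK_FLOPS, H100_BANDWIDTH, intensity)
    regime = "memory-bound" if intensity < ridge_point else "compute-bound"
    print(f"batch {batch:>3}: {bound / 1e12:7.1f} TFLOPS attainable ({regime})")
```

With these numbers the chip only becomes compute-bound at batch sizes in the hundreds, which is why adding FLOPS alone does little for small-batch inference.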
The GPU's limitations are not fatal — they are architectural mismatches. As AI workloads diversify from training to inference, from research to production, from data centers to edge devices, the single-chip GPU paradigm faces competition from chips purpose-built for specific slices of that workload map. The next lessons examine who is building those alternatives and what they can actually do.
You've learned that GPUs face memory bandwidth walls, power density limits, and general-purpose overhead. In this lab, explore those constraints in conversation with the AI assistant. Consider real workloads, real numbers, and what they imply for the next generation of hardware.
In 2016, Google published the first public details of its Tensor Processing Unit, revealing that the chips had been deployed internally since 2015, a full year before anyone outside knew they existed. Google's follow-up analysis included a striking benchmark: on Google's production neural networks, a TPUv1 ran inference 15–30× faster than contemporary server CPUs and GPUs, with 30–80× better performance-per-watt. The chip was not general. It could not render a scene or run a physics simulation. It could, extremely efficiently, multiply matrices and push results through activation functions. That was enough.
The TPU's origin story is well-documented in Google's 2017 ISCA paper "In-Datacenter Performance Analysis of a Tensor Processing Unit." The chip was created because Google engineers estimated that if voice search users started using neural network-based speech recognition for just three minutes per day, Google would need to double its data center capacity to handle it — with conventional hardware. A purpose-built chip was cheaper than building more data centers.
TPUv1 used a systolic array architecture — a grid of multiply-accumulate units where data flows through the array in a wave pattern, allowing dense matrix multiplication with minimal memory re-reads. The chip had 65,536 8-bit multiply-accumulate units and 28 MB of on-chip memory (SRAM, not DRAM), eliminating most memory latency for weight access during inference.
Subsequent generations scaled dramatically. TPUv4, deployed in 2021, delivered 275 teraflops of bfloat16 performance per chip and was deployed in "pods" of 4,096 chips interconnected by Google's custom optical circuit switching fabric — enabling collective operations like all-reduce to run at near-memory-bandwidth speeds across thousands of chips. By TPUv5e (2023), Google had optimized the architecture specifically for the Transformer attention mechanism.
In a systolic array, weights are loaded once and flow through a grid of processing elements. Each element multiplies and accumulates as data passes through. This dramatically reduces memory traffic compared to a GPU, where cores frequently re-fetch weights from shared memory pools. For workloads where the same weights are applied to many inputs — exactly the pattern of neural network inference — the efficiency gain is substantial.
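A toy cycle-by-cycle simulation may help make the data flow concrete. This is a simplified weight-stationary model written for illustration, not a description of the actual TPU microarchitecture: each processing element holds one weight, activations move one step right per cycle, and partial sums move one step down per cycle.

```python
# Toy weight-stationary systolic array computing Y = X @ W.
# PE(k, n) permanently holds W[k][n]; nothing is re-fetched during the run.

def systolic_matmul(X, W):
    M, K, N = len(X), len(W), len(W[0])
    act = [[0] * N for _ in range(K)]    # activation register in each PE
    psum = [[0] * N for _ in range(K)]   # partial-sum register in each PE
    Y = [[0] * N for _ in range(M)]
    cycles = M + K + N - 2               # time for the last result to drain out
    for t in range(cycles):
        new_act = [[0] * N for _ in range(K)]
        new_psum = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    m = t - k            # row k's input stream is delayed by k cycles
                    a = X[m][k] if 0 <= m < M else 0
                else:
                    a = act[k][n - 1]    # activation arrives from the left neighbour
                incoming = psum[k - 1][n] if k > 0 else 0  # partial sum from above
                new_act[k][n] = a
                new_psum[k][n] = incoming + W[k][n] * a    # multiply-accumulate
        act, psum = new_act, new_psum
        for n in range(N):
            m = t - (K - 1) - n          # which input row exits column n this cycle
            if 0 <= m < M:
                Y[m][n] = psum[K - 1][n]
    return Y, cycles

X = [[1, 2], [3, 4], [5, 6]]        # three inputs, two features each
W = [[7, 8, 9], [10, 11, 12]]       # weights, loaded into the grid once
Y, cycles = systolic_matmul(X, W)
print(Y, "in", cycles, "cycles")
print("weight loads:", len(W) * len(W[0]), "vs", len(X) * len(W) * len(W[0]), "if re-fetched per use")
```

The last line is the point of the design: the weight count is fixed by the grid size, while a naive implementation fetches a weight for every multiply.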
While Google went deep on systolic arrays, Cerebras Systems took a more radical approach: abandon the die-per-chip model entirely. The Cerebras WSE-3 (Wafer Scale Engine 3), announced in March 2024, is a single silicon die spanning an entire 300mm wafer. It contains 4 trillion transistors, 900,000 AI-optimized cores, and 44 GB of on-chip SRAM. A single WSE-3 contains more on-chip SRAM than roughly 200 H100 GPUs combined; the comparison is with the GPUs' on-chip caches, not their external HBM.
The key insight is that on-chip SRAM is orders of magnitude faster than HBM (high-bandwidth memory attached externally). By fitting an entire large model into on-chip memory, Cerebras eliminates the memory bandwidth wall entirely for models that fit. A 70-billion-parameter model at fp16 requires roughly 140 GB — too large for the current WSE-3. But a 13B model fits entirely on-chip, enabling inference without any off-chip memory access.
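The fit test described above is simple arithmetic, sketched here for a few model sizes. It counts weights only and ignores activations, KV cache, and any runtime overhead, so it is an optimistic bound. The 4-bit row is a hypothetical added for illustration and is not a configuration discussed in the text.

```python
# Does a model's weight footprint fit in 44 GB of on-chip SRAM? (weights only)

WSE3_SRAM_GB = 44  # from the text

def weights_gb(params_billion: float, bits_per_param: int) -> float:
    """Weight storage in GB for a model with the given size and precision."""
    return params_billion * bits_per_param / 8

def fits_on_chip(params_billion: float, bits_per_param: int,
                 sram_gb: float = WSE3_SRAM_GB) -> bool:
    return weights_gb(params_billion, bits_per_param) <= sram_gb

for params, bits in [(13, 16), (70, 16), (70, 4)]:  # last entry is hypothetical
    size = weights_gb(params, bits)
    verdict = "fits" if fits_on_chip(params, bits) else "does not fit"
    print(f"{params}B at {bits}-bit: {size:.0f} GB -> {verdict}")
```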
The manufacturing challenge is yield: a single defect on a conventional die ruins the chip. Cerebras uses redundancy fabric — spare cores that activate when nearby cores are found defective during testing — to achieve usable yields at wafer scale. The approach has been validated in production at Argonne National Laboratory, where Cerebras systems were used for scientific AI workloads beginning in 2022.
Groq (not to be confused with Grok, Elon Musk's chatbot) builds what it calls the Language Processing Unit (LPU). The fundamental design choice is determinism: unlike GPUs, which use dynamic scheduling and caches, the LPU uses a compiler that schedules every memory access and computation at compile time. There is no cache hierarchy, no speculative execution, no out-of-order processing. Instructions execute in exactly the order and time the compiler specifies.
This determinism eliminates run-to-run variance in latency. In late 2023, Groq demonstrated Llama-2 70B running at over 300 tokens per second per user on its LPU-based system, roughly 10× faster than comparable H100 inference for that model. Because each LPU keeps its working data in a few hundred megabytes of on-chip SRAM, a model of that size is spread across many chips rather than served from a single one. The tradeoff is flexibility: the LPU requires a specialized compiler and performs poorly on workloads the compiler cannot statically schedule. It is not a general accelerator. It is an inference engine for transformer models.
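The contrast between compile-time scheduling and cache-dependent execution can be illustrated with a toy model. Everything here is hypothetical: the operation names, cycle counts, miss rate, and miss penalty are invented for the sketch and do not describe Groq's compiler or any real GPU.

```python
import random

# Hypothetical operations with fixed latencies in cycles (illustrative values only).
OPS = [("load_weights", 4), ("matmul", 10), ("activation", 2), ("store", 3)]

def static_schedule(ops):
    """Assign every op a start cycle ahead of time; total latency is a known constant."""
    schedule, cycle = [], 0
    for name, latency in ops:
        schedule.append((name, cycle))
        cycle += latency
    return schedule, cycle

def dynamic_latency(ops, rng, miss_rate=0.3, miss_penalty=20):
    """Toy cache-based machine: memory ops sometimes pay an extra miss penalty."""
    total = 0
    for name, latency in ops:
        total += latency
        if name.startswith(("load", "store")) and rng.random() < miss_rate:
            total += miss_penalty
    return total

schedule, fixed_total = static_schedule(OPS)
rng = random.Random(0)  # seeded so the sketch is repeatable
samples = [dynamic_latency(OPS, rng) for _ in range(1000)]

print("static schedule:", schedule, "total", fixed_total, "cycles every run")
print("dynamic machine: min", min(samples), "max", max(samples), "cycles")
```

The statically scheduled version always finishes in the same number of cycles, while the toy dynamic machine shows a spread; that spread is the variance a deterministic design removes, at the cost of needing everything known at compile time.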
TPU, Cerebras, Groq — each chose a different axis of optimization: Google chose energy efficiency at scale, Cerebras chose eliminating memory latency, Groq chose deterministic throughput. None is a general-purpose chip. Each is a sharp instrument for a specific problem. The question for any organization deploying AI is whether its workload is sharp enough to benefit from a sharp instrument.
Google, Cerebras, and Groq each made radically different architectural choices to beat the GPU for specific AI workloads. In this lab, dig into those trade-offs. What workloads favor each approach? What are the failure modes?
In September 2021, Intel released its second-generation neuromorphic chip, Loihi 2, fabricated on Intel 4 (7nm-class) process technology. The chip contained 1 million programmable neurons and 120 million synapses — and consumed roughly 1 milliwatt per million synaptic operations per second. For comparison, a human brain's estimated energy budget for similar synaptic computation is in a comparable efficiency range. Intel wasn't claiming Loihi could replace a GPU for LLM training. It was claiming that for specific event-driven inference tasks — anomaly detection, sensory processing, sparse pattern recognition — it could do things nothing else could do at that energy level.
Conventional digital computers — including GPUs — operate on the von Neumann architecture: a processor fetches instructions from memory, executes them, writes results back. Data and instructions are treated as numbers in a synchronous clock-driven pipeline. This is efficient for many tasks but fundamentally different from how biological neural systems compute.
Biological neurons fire spikes — discrete events in time — only when their accumulated input crosses a threshold. They don't process data in synchronized clock cycles. They compute asynchronously, event-driven, and only expend energy when something changes. A retinal ganglion cell in the human eye doesn't transmit a frame 30 times per second; it transmits a spike when light intensity changes at its location.
Neuromorphic chips attempt to replicate this model in silicon. Intel's Loihi series uses spiking neural networks (SNNs) — networks where neurons communicate via sparse, asynchronous spikes rather than continuous floating-point values. IBM's TrueNorth chip (2014) had 4,096 neurosynaptic cores with 1 million neurons operating at 70 milliwatts in real-time inference mode — comparable to a hearing aid battery.
Standard deep neural networks are dense: every layer computation touches most or all weights. Spiking neural networks are sparse: most neurons are silent most of the time, and computation only occurs when a spike is fired. For natural sensory data — audio, video, LiDAR — which are themselves sparse in time, SNNs can deliver orders-of-magnitude better energy efficiency. The challenge is that training SNNs efficiently remains an open research problem.
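A leaky integrate-and-fire neuron, one of the simplest spiking models, shows why sparse input leads to sparse work. This is a minimal sketch with invented parameters (threshold, leak factor, fan-out); it is not the neuron model of any particular chip.

```python
# Leaky integrate-and-fire neuron: accumulate input, leak over time, spike on threshold.

def lif_run(inputs, threshold=1.0, leak=0.9):
    """Return a 0/1 spike train for a sequence of input currents."""
    v = 0.0                      # membrane potential
    spikes = []
    for x in inputs:
        v = leak * v + x         # decay the old potential, add new input
        if v >= threshold:
            spikes.append(1)     # fire a spike...
            v = 0.0              # ...and reset
        else:
            spikes.append(0)
    return spikes

# Mostly silent input with two brief events, mimicking sparse sensory data.
inputs = [0.0] * 5 + [0.6, 0.6] + [0.0] * 5 + [1.2] + [0.0] * 4
spikes = lif_run(inputs)

FAN_OUT = 100                                # hypothetical number of downstream synapses
event_driven_ops = sum(spikes) * FAN_OUT     # work happens only when a spike is sent
dense_ops = len(inputs) * FAN_OUT            # a dense layer updates on every timestep

print("spike train:", spikes)
print("synaptic operations:", event_driven_ops, "event-driven vs", dense_ops, "dense")
```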
A different approach is to perform computation directly in memory using physics rather than binary logic. Analog in-memory computing (AIMC) exploits the physical properties of memory devices: specifically, that the conductance (the inverse of resistance) of memristors or phase-change memory cells can be set to analog values representing weights.
When you apply a voltage to a row of such cells, each cell passes a current proportional to its conductance multiplied by the input voltage. Sum the currents from all cells in a column, and you have computed a dot product — the core operation of matrix multiplication — using Kirchhoff's current law rather than digital logic. The computation happens in nanoseconds, consumes microjoules, and requires no data movement: the weights are the memory.
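The same computation can be written out numerically. This sketch models an ideal crossbar with no noise, drift, or wire resistance, and the conductance and voltage values are arbitrary illustrative numbers.

```python
# Ideal analog crossbar: column currents are the dot product of voltages and conductances.

def crossbar_mvm(conductances, voltages):
    """conductances[row][col] in siemens, voltages[row] in volts -> column currents in amps."""
    n_cols = len(conductances[0])
    currents = [0.0] * n_cols
    for row, v in enumerate(voltages):
        for col in range(n_cols):
            # Ohm's law per cell: I = G * V.
            # Kirchhoff's current law: currents sharing a column wire add up.
            currents[col] += conductances[row][col] * v
    return currents

G = [[1e-6, 2e-6],   # stored weights, encoded as cell conductances
     [3e-6, 4e-6]]
V = [0.1, 0.2]       # inputs, encoded as row voltages

I = crossbar_mvm(G, V)
print("column currents (A):", I)
```

In hardware the two nested loops do not exist: every cell conducts at once, which is why the result arrives in a single read step.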
IBM Research demonstrated in 2023 that a phase-change memory analog chip could run ResNet image-classification inference at 92.81% accuracy on the CIFAR-10 benchmark, within 0.3% of the digital baseline, while consuming roughly 14× less energy than an equivalent digital implementation. The team published in Nature Electronics in October 2023.
The challenge for AIMC is precision. Analog cells drift over time — their conductance changes with temperature, read cycles, and age. Training-quality numerical precision (fp32 or bf16) is not achievable. Current demonstrations run at approximately 4-bit equivalent precision. For inference on models with some tolerance for quantization error, this is acceptable. For training or high-precision scientific computation, it is not.
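To get a feel for what 4-bit precision means, the sketch below rounds a handful of weights to a 4-bit grid and compares a dot product before and after. Symmetric uniform quantization is used here as a stand-in for limited analog precision; that modeling choice and the sample values are assumptions made for illustration, and real devices add drift and noise on top.

```python
# Effect of coarse weight precision on a dot product (uniform quantization as a stand-in).

def quantize(weights, bits):
    """Round weights to a symmetric signed grid with 2**(bits-1) - 1 positive levels."""
    levels = 2 ** (bits - 1) - 1                  # 7 levels each side for 4-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

weights = [0.82, -0.31, 0.05, -0.67, 0.44]   # illustrative values
inputs = [1.0, 0.5, -2.0, 0.25, 1.5]

exact = dot(weights, inputs)
approx = dot(quantize(weights, 4), inputs)
rel_error = abs(approx - exact) / abs(exact)

print(f"full precision: {exact:.4f}")
print(f"4-bit weights:  {approx:.4f}")
print(f"relative error: {rel_error:.1%}")
```

An error of this size on a single dot product is often tolerable for classification-style inference, where only the ranking of outputs matters, but it compounds during training, where small gradient updates would be lost below the grid spacing.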
Optical signals can carry data at very high bandwidth while generating far less resistive heat than electrical signals in copper. Photonic neural network accelerators use optical components (Mach-Zehnder interferometers, optical waveguides, photodetectors) to perform matrix-vector multiplications at the speed of light with near-zero resistive heating. MIT spinout Lightmatter shipped its first commercial photonic interconnect product (Passage) in 2022 and demonstrated photonic matrix multiplication units in its Envise chip, claiming 10× better energy efficiency than comparable digital chips for dense matrix operations.
The fundamental constraint is that photonic systems require analog-to-digital and digital-to-analog converters at their boundaries with conventional electronics, which consume significant power and introduce latency. As of 2024, photonic computing remains at an early commercial stage — compelling for specific workloads, not yet a general replacement for digital accelerators.
Neuromorphic chips are in production use for edge sensing and anomaly detection. Analog in-memory computing has demonstrated inference accuracy comparable to digital at far lower energy — but precision limits and device drift remain challenges for widespread deployment. Photonic computing has reached early commercial products but faces the DAC/ADC boundary problem. None of these is threatening the GPU's position for LLM training today. Each may become dominant for specific segments — edge, inference, sensory AI — within this decade.
Neuromorphic chips, analog in-memory computing, and photonic accelerators each represent a radical departure from conventional digital silicon. Explore their real-world status, limitations, and potential with the AI assistant.
On October 7, 2022, the U.S. Commerce Department published new export control rules that effectively prohibited the sale of advanced AI chips, and the equipment to manufacture them, to China. The rules targeted chips with performance above specific thresholds (the A100 and H100 fell above the line) and, more significantly, restricted exports of advanced chipmaking equipment; ASML's extreme ultraviolet lithography machines were already being withheld from China under Dutch export licensing. NVIDIA's China revenue, which had been approximately $4 billion annually, was abruptly cut off. Within weeks, NVIDIA announced downgraded chips, the A800 and H800, designed to fall just below the control thresholds. Those were banned in turn in October 2023 as the controls were tightened further.
Advanced semiconductor manufacturing is extraordinarily concentrated. As of 2024, Taiwan Semiconductor Manufacturing Company (TSMC) fabricates virtually all leading-edge AI chips at 5nm and below: NVIDIA's H100 and H200 (4nm), Apple's M-series (3nm), AMD's MI300X (5nm), Google's TPUv5 (5nm). TSMC's two advanced nodes, N3 (3nm) and N4 (4nm), account for more than half of the world's leading-edge chip production capacity.
Samsung Foundry and Intel Foundry Services operate at comparable process nodes but have significantly lower yields, capacity, and customer trust for AI workloads. The practical consequence is that the entire global AI hardware industry — whether GPU-based, TPU-based, or any next-generation accelerator — depends on a relatively small number of fabs on a 14,000-square-mile island 110 miles from mainland China.
The U.S. CHIPS and Science Act of 2022 allocated $52.7 billion to rebuild domestic semiconductor manufacturing, with TSMC committing to build fabs in Arizona (Phoenix), Intel committing to Ohio, and Samsung to Texas. The Arizona fab began risk production in late 2024. Full-scale production at competitive advanced nodes is expected around 2026-2028, at costs 30-50% higher than equivalent Taiwan production.
Cut off from NVIDIA's most advanced chips, China's largest AI developers — Baidu, ByteDance, Alibaba — began transitioning to Huawei's Ascend 910B accelerator. Fabricated by SMIC on a 7nm-equivalent process (using older DUV rather than EUV lithography), the 910B delivers roughly 60-70% of H100 performance at comparable power. Baidu reported deploying Ascend 910B at scale for Ernie Bot training in 2023. The chip represents a remarkable engineering achievement given the equipment restrictions — but it also illustrates the efficiency tax that manufacturing constraints impose.
Hardware competition faces an asymmetric challenge: CUDA. NVIDIA's compute platform has been developed since 2006, has millions of lines of optimized kernels, and underlies virtually all major AI frameworks: PyTorch, TensorFlow, and JAX all target CUDA as their primary GPU backend. When Google, AMD, or an upstart accelerator company builds a chip that outperforms the H100 on raw specs, they still face the fact that every AI researcher knows CUDA and most training code is written to assume it.
AMD's ROCm platform, whose HIP programming interface closely mirrors CUDA, reached functional parity for most PyTorch operations in 2023. Meta announced in early 2024 that it had achieved GPU-equivalent training performance on AMD MI300X accelerators for its Llama training runs, a significant milestone. Google's XLA compiler abstracts hardware for TPUs. Intel's oneAPI targets its Gaudi accelerators. Each is technically capable; none has broken CUDA's network effects in the broader research community.
The CUDA moat is potentially more durable than any single chip architecture. Even if a technically superior accelerator reaches the market, the ecosystem around it — libraries, documentation, community knowledge, debugging tools — requires years to build. This is why NVIDIA's competitors invest as heavily in developer platforms as in chip design.
The export control regime has accelerated a movement toward sovereign AI hardware — nations and large economies building or procuring AI chips independently of U.S.-controlled supply chains. The European Union's Chips Act (2023) committed €43 billion to semiconductor manufacturing. Japan's Rapidus project targets 2nm production by 2027. The UAE's G42 has built massive GPU clusters through relationships that were subject to U.S. government review in 2024.
The India Semiconductor Mission approved its first fab projects in 2024. Each initiative faces the same challenge: chip design talent, EDA tool access, and the 15-year learning curve that separates leading-edge semiconductor manufacturing from everything else. Hardware geopolitics is a slow game: policy decisions made in 2024 produce manufacturing capacity in 2030.
The post-GPU future is being shaped simultaneously by physics (what silicon can do), economics (what custom chips cost to design and fabricate), software (what developers will actually write code for), and geopolitics (who is allowed to buy or build what). None of these vectors points to a single winner. The most likely near-term outcome is a fragmented landscape: GPUs for training, purpose-built accelerators for inference, neuromorphic and analog chips for edge sensing — and a geopolitical dividing line running through all of it.
Export controls, TSMC dependency, national chip programs, and the CUDA moat all intersect to shape who can build and deploy advanced AI hardware. In this lab, examine the strategic implications — for companies, nations, and the future of AI development.