Module 8 · Lesson 1

The Limits of the GPU

Why the chip that won deep learning may not win the next decade

What physical and architectural walls is the GPU now running into — and why does that matter?

When NVIDIA reported its fiscal Q4 2024 results, revenue had grown 265% year-over-year to $22.1 billion, almost entirely on data-center GPU sales. The H100 chip was back-ordered worldwide. Every major cloud, every AI lab, every nation-state building an AI strategy was essentially in the same queue. On the surface it looked like GPU dominance was total and permanent.

But inside those same data centers, engineers were logging something quieter: utilization rates on individual H100s were often below 40% for many real-world inference workloads. The chips were powerful — they just weren't always the right shape for the job.

Why GPUs Won the Deep Learning Era

The GPU's rise to AI dominance was not planned. NVIDIA's CUDA platform, launched in 2006, was originally aimed at scientific computing. When Alex Krizhevsky trained AlexNet on two GTX 580s in 2012 and crushed ImageNet, it confirmed that the GPU's fundamental architecture — thousands of small parallel cores executing the same instruction on different data — was a near-perfect match for matrix multiplication, the core operation of neural networks.

That match drove a decade of co-evolution. GPUs grew tensor cores designed specifically for mixed-precision matrix math. Memory bandwidth scaled from gigabytes to terabytes per second (the H100 SXM5 delivers 3.35 TB/s). Interconnects like NVLink allowed racks of GPUs to behave as a single logical accelerator. By 2023 the H100 could train a large language model in roughly one-tenth the time of its predecessor from five years earlier.

The execution model behind this is SIMD (Single Instruction, Multiple Data; NVIDIA calls its thread-level variant SIMT). Thousands of cores execute identical operations simultaneously on different numbers. This is superb for training, where every layer of a network processes a large batch of examples identically. It is a poorer fit for inference on a single request, where the batch size is often one and tokens are generated one after another.

The Memory Wall

Large language model inference is often limited not by compute but by memory bandwidth. When GPT-class models generate text, most of the time goes to moving weight parameters from HBM to the compute cores rather than to arithmetic. A 70-billion-parameter model requires roughly 140 GB just to store its weights at fp16 (2 bytes per parameter), and those weights must be read again for every generated token. The GPU spends most of its time waiting for data to arrive, not multiplying it.
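The arithmetic behind the memory wall can be sketched in a few lines. This is a rough upper bound, assuming every weight is read from memory once per generated token at batch size one, and it ignores the fact that 140 GB would have to be split across more than one GPU. The function names are illustrative; the 3.35 TB/s figure is the H100 SXM5 bandwidth quoted earlier in this lesson.

```python
def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Bytes needed to hold the weights (fp16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param


def max_tokens_per_second(n_params: float, bandwidth_bytes_per_s: float,
                          bytes_per_param: int = 2) -> float:
    """Ceiling on batch-1 decode speed if every weight is read once per token."""
    return bandwidth_bytes_per_s / weight_bytes(n_params, bytes_per_param)


PARAMS_70B = 70e9
H100_BANDWIDTH = 3.35e12  # bytes per second, H100 SXM5 figure from the lesson

weights_gb = weight_bytes(PARAMS_70B) / 1e9
ceiling = max_tokens_per_second(PARAMS_70B, H100_BANDWIDTH)

print(f"weights: {weights_gb:.0f} GB")
print(f"bandwidth-bound ceiling: {ceiling:.1f} tokens/s")
```

Under these assumptions the ceiling is set entirely by bandwidth; no amount of extra arithmetic throughput raises it.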

Three Ceilings the GPU Is Hitting

1. Power density. The H100 SXM5 draws up to 700 watts. A server with eight of them draws 5.6 kilowatts for the GPUs alone, before networking, cooling, or power conversion losses. A hyperscale data center running tens of thousands of H100s faces power bills and cooling infrastructure costs that are rewriting the economics of cloud computing. Google, Microsoft, and Amazon have all announced nuclear power agreements or dedicated grid connections for AI workloads. The chip has become almost too powerful to power.

2. Memory bandwidth vs. compute ratio. Roofline analysis shows that for most transformer inference workloads, the H100 is memory-bound, not compute-bound. That means adding more FLOPS (the traditional GPU improvement axis) delivers diminishing returns unless memory bandwidth scales equally. Stacking more HBM (high-bandwidth memory) dies raises cost and yield risk.

3. General-purpose overhead. GPUs descend from graphics hardware, with its rasterization units, display engines, and driver stacks. Even data-center-only SKUs like the H100, which strip out most of the graphics blocks, carry overhead from that architectural heritage. Purpose-built AI accelerators can shed it and spend nearly every transistor on AI operations.
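The second ceiling can be made concrete with a small roofline calculation. This is a minimal sketch, not a vendor analysis: the peak-compute figure (about 989 TFLOPS of dense bf16) is an assumption taken from public H100 SXM specifications rather than from this lesson, and the intensity estimate assumes a decode step where each fp16 weight is read once and used in one multiply-add per request in the batch.

```python
PEAK_FLOPS = 989e12   # assumed dense bf16 peak for an H100 SXM (not from the lesson)
BANDWIDTH = 3.35e12   # bytes per second, H100 SXM5 figure from the lesson


def attainable_flops(intensity: float, peak_flops: float = PEAK_FLOPS,
                     bandwidth: float = BANDWIDTH) -> float:
    """Roofline: throughput is capped by compute or by bandwidth * intensity."""
    return min(peak_flops, intensity * bandwidth)


def decode_intensity(batch_size: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weights read: one multiply-add (2 FLOPs) per weight per request."""
    return 2 * batch_size / bytes_per_param


def regime(intensity: float) -> str:
    """Classify a workload by which side of the ridge point it falls on."""
    ridge = PEAK_FLOPS / BANDWIDTH
    return "memory-bound" if intensity < ridge else "compute-bound"


ridge_point = PEAK_FLOPS / BANDWIDTH
print(f"ridge point: {ridge_point:.0f} FLOPs/byte")
for batch in (1, 32, 512):
    ai = decode_intensity(batch)
    share = attainable_flops(ai) / PEAK_FLOPS
    print(f"batch {batch:>3}: {regime(ai)}, {share:.1%} of peak")
```

With these numbers a batch-one decode step can use well under 1% of peak compute, which is why extra FLOPS alone do little for single-request inference.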

Key Terms

SIMD: Single Instruction, Multiple Data. The parallel execution model at the heart of GPU architecture, where one instruction operates simultaneously on thousands of data elements.

HBM: High-Bandwidth Memory. Stacked DRAM dies connected directly to a GPU die via wide interfaces, delivering far higher memory bandwidth than conventional GDDR.

Roofline Model: An analytical framework that identifies whether a workload is limited by compute throughput or memory bandwidth, guiding hardware selection.

Memory-Bound: A workload state where the bottleneck is data movement speed rather than arithmetic throughput. Common in LLM inference.

The Pivot Point

The GPU's limitations are not fatal — they are architectural mismatches. As AI workloads diversify from training to inference, from research to production, from data centers to edge devices, the single-chip GPU paradigm faces competition from chips purpose-built for specific slices of that workload map. The next lessons examine who is building those alternatives and what they can actually do.

Lesson 1 Quiz

The Limits of the GPU

What architectural model describes how a GPU executes the same instruction across thousands of data elements simultaneously?

Correct. SIMD is the core parallelism model that made GPUs excellent for matrix multiplication in neural networks.

Not quite. SIMD — Single Instruction, Multiple Data — describes the GPU's execution model, enabling thousands of cores to run the same operation on different data simultaneously.

For large language model inference, most GPU time is spent on which bottleneck?

Correct. LLM inference is memory-bound: the chip spends most cycles waiting for weight data to arrive from HBM, not computing.

Not quite. LLM inference is memory-bound — the dominant bottleneck is moving tens or hundreds of gigabytes of weights from HBM memory to compute cores.

Approximately how much power does a single NVIDIA H100 SXM5 GPU draw?

Correct. The H100 SXM5 TDP is 700W, making power density one of the primary constraints on dense GPU deployments.

Not quite. The H100 SXM5 draws approximately 700 watts — a figure that makes power density a major constraint in large-scale deployment.

Lab 1 — GPU Limitations Analysis

Discuss the architectural and physical limits of GPU-based AI compute

Your Task

You've learned that GPUs face memory bandwidth walls, power density limits, and general-purpose overhead. In this lab, explore those constraints in conversation with the AI assistant. Consider real workloads, real numbers, and what they imply for the next generation of hardware.

Start here: "Why does moving data from memory hurt LLM inference performance more than adding more FLOPS helps it?" — then explore from there.

AI Lab Assistant

GPU Limits · M8-L1

Welcome to Lab 1. We're examining why GPUs — despite their dominance — are hitting architectural walls for AI workloads. Ask me about memory bandwidth, power constraints, roofline analysis, or what specifically makes LLM inference different from training. What's your first question?

Module 8 · Lesson 2

Purpose-Built Accelerators

Google's TPUs, Cerebras's wafer-scale chips, and the wave of custom silicon

When you design a chip knowing exactly what it will compute, what do you gain — and what do you give up?

In 2016, Google published the first public details of its Tensor Processing Unit, revealing that the chips had been deployed internally since 2015, a full year before anyone outside knew they existed. The detailed benchmarks that followed in 2017 were striking: on Google's production neural networks, a TPUv1 ran inference 15–30× faster than contemporary server CPUs and GPUs, with 30–80× better performance-per-watt. The chip was not general. It could not render a scene or run a physics simulation. It could, extremely efficiently, multiply matrices and push results through activation functions. That was enough.

Google's TPU Program

The TPU's origin story is well-documented in Google's 2017 ISCA paper "In-Datacenter Performance Analysis of a Tensor Processing Unit." The chip was created because Google engineers estimated that if voice search users started using neural network-based speech recognition for just three minutes per day, Google would need to double its data center capacity to handle it — with conventional hardware. A purpose-built chip was cheaper than building more data centers.

TPUv1 used a systolic array architecture: a grid of multiply-accumulate units where data flows through the array in a wave pattern, allowing dense matrix multiplication with minimal memory re-reads. The chip had 65,536 8-bit multiply-accumulate units (a 256 × 256 grid) and 28 MB of on-chip memory (SRAM, not DRAM) for activations and accumulated results, so intermediate values rarely had to travel to off-chip DRAM during inference.
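A quick calculation shows what 65,536 multiply-accumulate units imply for peak throughput. The 700 MHz clock is an assumption taken from Google's published TPUv1 description, not from this lesson, and the result is a theoretical peak rather than a measured rate.

```python
import math

MAC_UNITS = 65_536
CLOCK_HZ = 700e6  # assumed TPUv1 clock rate (not stated in the lesson)

# The units form a square grid, so the side length is the integer square root.
side = math.isqrt(MAC_UNITS)

# Each multiply-accumulate counts as two operations (one multiply, one add).
peak_ops_per_second = MAC_UNITS * 2 * CLOCK_HZ
peak_tops = peak_ops_per_second / 1e12

print(f"array shape: {side} x {side}")
print(f"peak throughput: {peak_tops:.1f} 8-bit TOPS")
```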

Subsequent generations scaled dramatically. TPUv4, deployed in 2021, delivered 275 teraflops of bfloat16 performance per chip and was deployed in "pods" of 4,096 chips interconnected by Google's custom optical circuit switching fabric — enabling collective operations like all-reduce to run at near-memory-bandwidth speeds across thousands of chips. By TPUv5e (2023), Google had optimized the architecture specifically for the Transformer attention mechanism.

The Systolic Array Advantage

In a systolic array, weights are loaded once and flow through a grid of processing elements. Each element multiplies and accumulates as data passes through. This dramatically reduces memory traffic compared to a GPU, where cores frequently re-fetch weights from shared memory pools. For workloads where the same weights are applied to many inputs — exactly the pattern of neural network inference — the efficiency gain is substantial.
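The dataflow described above can be simulated cycle by cycle. The sketch below assumes one common arrangement (weight-stationary): each processing element holds one weight, activations move one step right per cycle, and partial sums move one step down. The function name and the skewed input timing are illustrative choices, not a description of any specific chip.

```python
def systolic_matmul(X, W):
    """Compute X @ W on a simulated K x N weight-stationary systolic array.

    X is T x K (one input vector per row), W is K x N.
    Returns the T x N result and the number of cycles simulated.
    """
    T, K, N = len(X), len(W), len(W[0])
    act = [[0] * N for _ in range(K)]    # activation register in each element
    psum = [[0] * N for _ in range(K)]   # partial-sum register in each element
    Y = [[0] * N for _ in range(T)]
    total_cycles = T + K + N - 2

    for cycle in range(total_cycles):
        new_act = [[0] * N for _ in range(K)]
        new_psum = [[0] * N for _ in range(K)]
        for k in range(K):
            for j in range(N):
                if j == 0:
                    # Row k receives input t = cycle - k (skewed entry from the left).
                    t = cycle - k
                    a_in = X[t][k] if 0 <= t < T else 0
                else:
                    a_in = act[k][j - 1]              # from the left neighbour
                p_in = psum[k - 1][j] if k > 0 else 0  # from the neighbour above
                new_act[k][j] = a_in
                new_psum[k][j] = p_in + a_in * W[k][j]
        act, psum = new_act, new_psum

        # Finished dot products leave the bottom row, one column per cycle of skew.
        for j in range(N):
            t = cycle - (K - 1) - j
            if 0 <= t < T:
                Y[t][j] = psum[K - 1][j]
    return Y, total_cycles


X = [[1, 2], [3, 4], [5, 6]]
W = [[1, 0, 2], [0, 1, 3]]
Y, cycles = systolic_matmul(X, W)
print(Y, "in", cycles, "cycles")
```

Each weight is placed in the grid once and reused for every input row, which is the source of the reduced memory traffic.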

Cerebras and the Wafer-Scale Approach

While Google went deep on systolic arrays, Cerebras Systems took a more radical approach: abandon the die-per-chip model entirely. The Cerebras WSE-3 (Wafer Scale Engine 3), announced in March 2024, is a single piece of silicon cut from an entire 300mm wafer. It contains 4 trillion transistors, 900,000 AI-optimized cores, and 44 GB of on-chip SRAM. That is hundreds of times the on-chip cache of a single H100, although still less than the 80 GB of off-chip HBM attached to one.

The key insight is that on-chip SRAM is orders of magnitude faster than HBM (high-bandwidth memory attached externally). By fitting an entire large model into on-chip memory, Cerebras eliminates the memory bandwidth wall entirely for models that fit. A 70-billion-parameter model at fp16 requires roughly 140 GB — too large for the current WSE-3. But a 13B model fits entirely on-chip, enabling inference without any off-chip memory access.
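The fit check in this paragraph is simple arithmetic, sketched below. It counts weights only and ignores activations and any other working memory, so it is an optimistic bound; the function name is illustrative.

```python
WSE3_SRAM_BYTES = 44e9  # 44 GB of on-chip SRAM, as quoted in the lesson


def fits_on_chip(n_params: float, sram_bytes: float = WSE3_SRAM_BYTES,
                 bytes_per_param: int = 2) -> bool:
    """True if the weights alone fit in on-chip SRAM (fp16 = 2 bytes each)."""
    return n_params * bytes_per_param <= sram_bytes


# Largest fp16 parameter count whose weights fit, ignoring all other memory use.
max_fp16_params = WSE3_SRAM_BYTES / 2

for name, n_params in (("13B", 13e9), ("70B", 70e9)):
    verdict = "fits" if fits_on_chip(n_params) else "does not fit"
    print(f"{name}: {n_params * 2 / 1e9:.0f} GB of weights, {verdict}")
print(f"weights-only limit: {max_fp16_params / 1e9:.0f}B parameters at fp16")
```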

The manufacturing challenge is yield: a single defect on a conventional die ruins the chip. Cerebras uses redundancy fabric — spare cores that activate when nearby cores are found defective during testing — to achieve usable yields at wafer scale. The approach has been validated in production at Argonne National Laboratory, where Cerebras systems were used for scientific AI workloads beginning in 2022.

Groq's Language Processing Unit

Groq (not to be confused with Grok, Elon Musk's chatbot) builds what it calls the Language Processing Unit (LPU). The fundamental design choice is determinism: unlike GPUs, which use dynamic scheduling and caches, the LPU uses a compiler that schedules every memory access and computation at compile time. There is no cache hierarchy, no speculative execution, no out-of-order processing. Instructions execute in exactly the order and time the compiler specifies.
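The contrast between compile-time and runtime scheduling can be illustrated with a toy model. Everything here is hypothetical: the operation names, cycle counts, and cache-miss parameters are made up for illustration and do not describe Groq's compiler or any real GPU.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    cycles: int  # fixed latency, known to the compiler ahead of time


def compile_schedule(ops):
    """Assign every op a start cycle before execution (ops run back to back)."""
    schedule, start = [], 0
    for op in ops:
        schedule.append((start, op.name))
        start += op.cycles
    return schedule, start


def dynamic_latency(ops, miss_penalty, miss_rate, rng):
    """Toy runtime-scheduled execution: any op may stall on a cache miss."""
    total = 0
    for op in ops:
        total += op.cycles
        if rng.random() < miss_rate:
            total += miss_penalty
    return total


# Hypothetical four-step program.
PROGRAM = [Op("load_weights", 4), Op("matmul", 16),
           Op("activation", 2), Op("store", 4)]

schedule, static_total = compile_schedule(PROGRAM)
print("static schedule:", schedule, "total", static_total)

rng = random.Random(0)
dynamic_totals = [dynamic_latency(PROGRAM, 10, 0.3, rng) for _ in range(5)]
print("dynamic totals:", dynamic_totals)
```

The static total is the same on every run by construction, while the dynamic totals depend on which operations happen to stall.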

This determinism eliminates run-to-run timing variance. In December 2023, Groq demonstrated Llama-2 70B running at over 300 tokens per second on a Groq LPU system, which the company presented as roughly 10× faster than comparable H100 inference for that model. Because each LPU carries only a small amount of on-chip SRAM, a model of that size is spread across many chips. The tradeoff is flexibility: the LPU requires a specialized compiler and performs poorly on workloads the compiler cannot statically schedule. It is not a general accelerator. It is an inference engine for transformer models.

Key Terms

Systolic Array: A grid of processing elements that pass data between neighbors in a regular, wave-like pattern, enabling high-throughput matrix multiplication with minimal memory re-reads.

SRAM vs HBM: SRAM (Static RAM) is on-chip memory that is extremely fast but expensive per bit. HBM (High-Bandwidth Memory) is stacked off-chip DRAM that is cheaper per bit but slower.

Wafer-Scale Integration: Building a single chip that spans an entire semiconductor wafer rather than dicing the wafer into many smaller dies, maximizing on-chip resources at the cost of manufacturing complexity.

LPU: Language Processing Unit. Groq's deterministic, statically-scheduled inference accelerator designed specifically for transformer model execution.

The Pattern

TPU, Cerebras, Groq — each chose a different axis of optimization: Google chose energy efficiency at scale, Cerebras chose eliminating memory latency, Groq chose deterministic throughput. None is a general-purpose chip. Each is a sharp instrument for a specific problem. The question for any organization deploying AI is whether its workload is sharp enough to benefit from a sharp instrument.

Lesson 2 Quiz

Purpose-Built Accelerators

What was the original business justification Google used internally to build the first TPU?

Correct. Google's 2017 ISCA paper documented that even modest neural network adoption in voice search would require doubling data center capacity — making custom silicon economically necessary.

Not quite. Google's documented rationale was that three minutes per day of neural speech recognition per user would require doubling data center capacity with conventional hardware.

How does the Cerebras WSE-3 solve the memory bandwidth wall for smaller models?

Correct. The WSE-3's 44 GB of on-chip SRAM can hold smaller models entirely in memory, eliminating the off-chip bandwidth wall completely.

Not quite. The WSE-3 provides 44 GB of on-chip SRAM across its wafer-scale die — enough to fit many models entirely on-chip, eliminating off-chip memory traffic entirely.

What makes Groq's LPU different from a GPU in its execution model?

Correct. Groq's deterministic, compiler-scheduled execution eliminates cache hierarchies and dynamic scheduling, delivering predictable high-throughput inference.

Not quite. Groq's LPU is defined by its deterministic execution model — a compiler pre-schedules every memory access and computation at compile time, eliminating runtime variance.

Lab 2 — Purpose-Built Accelerator Design

Explore trade-offs in custom AI chip architectures

Your Task

Google, Cerebras, and Groq each made radically different architectural choices to beat the GPU for specific AI workloads. In this lab, dig into those trade-offs. What workloads favor each approach? What are the failure modes?

Try: "If I'm running a service that does 10,000 LLM inference requests per second with small prompts, which architecture — TPU, Cerebras, or Groq LPU — should I consider first, and why?"

AI Lab Assistant

Custom Accelerators · M8-L2

Welcome to Lab 2. We're comparing purpose-built AI accelerator architectures: Google TPUs, Cerebras wafer-scale chips, and Groq's LPU. Each optimizes a different bottleneck. Ask me about systolic arrays, on-chip SRAM advantages, deterministic scheduling, or when you'd choose one over another. What would you like to explore?

Module 8 · Lesson 3

Neuromorphic and Analog Computing

What happens when chips stop pretending to be von Neumann machines

Could chips that compute the way neurons fire — or that use physics itself to multiply — be the hardware of the next AI era?

In September 2021, Intel released its second-generation neuromorphic chip, Loihi 2, fabricated on Intel 4 (7nm-class) process technology. The chip contained 1 million programmable neurons and 120 million synapses — and consumed roughly 1 milliwatt per million synaptic operations per second. For comparison, a human brain's estimated energy budget for similar synaptic computation is in a comparable efficiency range. Intel wasn't claiming Loihi could replace a GPU for LLM training. It was claiming that for specific event-driven inference tasks — anomaly detection, sensory processing, sparse pattern recognition — it could do things nothing else could do at that energy level.

Neuromorphic Computing: The Core Idea

Conventional digital computers — including GPUs — operate on the von Neumann architecture: a processor fetches instructions from memory, executes them, writes results back. Data and instructions are treated as numbers in a synchronous clock-driven pipeline. This is efficient for many tasks but fundamentally different from how biological neural systems compute.

Biological neurons fire spikes — discrete events in time — only when their accumulated input crosses a threshold. They don't process data in synchronized clock cycles. They compute asynchronously, event-driven, and only expend energy when something changes. A retinal ganglion cell in the human eye doesn't transmit a frame 30 times per second; it transmits a spike when light intensity changes at its location.

Neuromorphic chips attempt to replicate this model in silicon. Intel's Loihi series uses spiking neural networks (SNNs) — networks where neurons communicate via sparse, asynchronous spikes rather than continuous floating-point values. IBM's TrueNorth chip (2014) had 4,096 neurosynaptic cores with 1 million neurons operating at 70 milliwatts in real-time inference mode — comparable to a hearing aid battery.

The Sparsity Advantage

Standard deep neural networks are dense: every layer computation touches most or all weights. Spiking neural networks are sparse: most neurons are silent most of the time, and computation only occurs when a spike is fired. For natural sensory data — audio, video, LiDAR — which are themselves sparse in time, SNNs can deliver orders-of-magnitude better energy efficiency. The challenge is that training SNNs efficiently remains an open research problem.
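A leaky integrate-and-fire neuron, one of the simplest spiking models, shows where the sparsity comes from. This is a minimal sketch with made-up threshold and leak values, not the neuron model of any particular chip.

```python
def lif_spikes(inputs, threshold=1.0, leak=0.9):
    """Leaky integrate-and-fire: accumulate input, decay over time, spike at threshold."""
    potential = 0.0
    spikes = []
    for x in inputs:
        potential = leak * potential + x   # old charge leaks away, new input adds
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0                # reset after firing
        else:
            spikes.append(0)
    return spikes


# A mostly silent input stream with one brief burst of activity.
signal = [0.0] * 5 + [0.6, 0.6] + [0.0] * 5
spikes = lif_spikes(signal)
activity = sum(spikes) / len(spikes)

print("spikes:", spikes)
print(f"active in {activity:.0%} of time steps")
```

Downstream neurons only have work to do on the time steps where a spike occurs, so energy use tracks activity rather than the clock.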

Analog In-Memory Computing

A different approach is to perform computation directly in memory using physics rather than binary logic. Analog in-memory computing (AIMC) exploits the physical properties of memory devices: specifically, the conductance (the inverse of resistance) of memristors or phase-change memory cells can be set to analog values representing weights.

When you apply a voltage to a row of such cells, each cell passes a current proportional to its conductance multiplied by the input voltage. Sum the currents from all cells in a column, and you have computed a dot product — the core operation of matrix multiplication — using Kirchhoff's current law rather than digital logic. The computation happens in nanoseconds, consumes microjoules, and requires no data movement: the weights are the memory.
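The crossbar computation can be written out numerically. This is an idealized sketch with made-up conductance and voltage values: it applies Ohm's law per cell and sums the currents down each column, and it ignores wire resistance, noise, and converter effects.

```python
def crossbar_currents(conductances, voltages):
    """Column currents of an ideal crossbar.

    conductances[i][j] is the cell at row i, column j (siemens).
    voltages[i] is the voltage applied to row i (volts).
    Each cell passes I = G * V (Ohm's law); currents sharing a column wire
    add up (Kirchhoff's current law), giving one dot product per column.
    """
    n_rows, n_cols = len(conductances), len(conductances[0])
    return [
        sum(conductances[i][j] * voltages[i] for i in range(n_rows))
        for j in range(n_cols)
    ]


# Illustrative values: a 2 x 2 array of weights stored as microsiemens conductances.
G = [[1e-6, 2e-6],
     [3e-6, 4e-6]]
V = [0.1, 0.2]

currents = crossbar_currents(G, V)
print([f"{amps * 1e6:.2f} uA" for amps in currents])
```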

IBM Research demonstrated in 2023 that a phase-change memory analog chip could run a ResNet image classifier on the CIFAR-10 dataset at 92.81% accuracy, reported as within 0.3% of the digital baseline, while consuming roughly 14× less energy than an equivalent digital implementation. The team published in Nature Electronics in August 2023.

The challenge for AIMC is precision. Analog cells drift over time — their conductance changes with temperature, read cycles, and age. Training-quality numerical precision (fp32 or bf16) is not achievable. Current demonstrations run at approximately 4-bit equivalent precision. For inference on models with some tolerance for quantization error, this is acceptable. For training or high-precision scientific computation, it is not.
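The effect of limited precision and drift on a dot product can be estimated with a toy model. The weights, inputs, and the uniform 2% drift are all hypothetical values chosen for illustration; real device drift is not uniform and depends on the cell technology.

```python
def quantize(weights, bits=4):
    """Snap each weight to the nearest of 2**bits evenly spaced levels."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in weights]


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical weights and inputs.
WEIGHTS = [0.0, 0.13, 0.5, 0.77, 1.0]
INPUTS = [1.0, 0.5, -0.25, 0.8, 0.1]
DRIFT = 0.02  # assumed uniform 2% loss of stored conductance

exact = dot(WEIGHTS, INPUTS)
for bits in (4, 8):
    approx = dot(quantize(WEIGHTS, bits), INPUTS)
    print(f"{bits}-bit: error {abs(approx - exact):.5f}")

drifted = [w * (1 - DRIFT) for w in quantize(WEIGHTS, 4)]
print(f"4-bit with drift: error {abs(dot(drifted, INPUTS) - exact):.5f}")
```

Whether errors of this size matter depends on how tolerant the model is to quantization, which is why the approach suits inference better than training.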

Photonic Computing

Optical signals can carry more data per channel than electrical ones and lose far less energy to resistive heating. Photonic neural network accelerators use optical components (Mach-Zehnder interferometers, optical waveguides, photodetectors) to perform matrix-vector multiplications as light propagates through the chip, with near-zero resistive heating. MIT spinout Lightmatter shipped its first commercial photonic interconnect product (Passage) in 2022 and demonstrated photonic matrix multiplication units in its Envise chip, claiming 10× better energy efficiency than comparable digital chips for dense matrix operations.

The fundamental constraint is that photonic systems require analog-to-digital and digital-to-analog converters at their boundaries with conventional electronics, which consume significant power and introduce latency. As of 2024, photonic computing remains at an early commercial stage — compelling for specific workloads, not yet a general replacement for digital accelerators.

Key Terms

Spiking Neural Network (SNN): A neural network model where neurons communicate via discrete spike events rather than continuous values, enabling sparse, event-driven, low-energy computation.

Analog In-Memory Computing: Computing matrix operations using the physical conductance of memory cells, performing multiplication using Ohm's law and summation using Kirchhoff's current law.

Memristor: A two-terminal electronic component whose resistance is a function of historical current flow, enabling analog weight storage and in-memory computation.

Photonic Accelerator: A computing device that performs matrix operations using light rather than electrons, offering potential speed and efficiency advantages for specific linear algebra workloads.

Where These Technologies Stand

Neuromorphic chips are in production use for edge sensing and anomaly detection. Analog in-memory computing has demonstrated inference accuracy comparable to digital at far lower energy — but precision limits and device drift remain challenges for widespread deployment. Photonic computing has reached early commercial products but faces the DAC/ADC boundary problem. None of these is threatening the GPU's position for LLM training today. Each may become dominant for specific segments — edge, inference, sensory AI — within this decade.

Lesson 3 Quiz

Neuromorphic and Analog Computing

What is the key computational advantage of spiking neural networks for natural sensory data?

Correct. Sparsity is the key: SNNs only fire and consume energy when inputs change, matching the natural sparsity of sensory data like audio and video.

Not quite. The advantage is sparsity — most SNN neurons are silent most of the time, so computation and energy use scale with the rate of change in inputs rather than a fixed clock rate.

How does analog in-memory computing perform matrix multiplication?

Correct. Applying voltage to memory cells with analog conductance values produces currents proportional to weight × input; column-summing those currents gives the dot product using physical laws.

Not quite. Analog in-memory computing applies input voltages to cells whose conductance represents weights. Each cell passes current = conductance × voltage (Ohm's law), and column currents sum (Kirchhoff's law) to compute dot products.

What was the primary result of IBM Research's 2023 Nature Electronics paper on analog inference?

Correct. IBM's demonstration showed 92.81% accuracy with a ResNet classifier on CIFAR-10, reported as within 0.3% of digital, at roughly 14× better energy efficiency.

Not quite. IBM's 2023 Nature Electronics paper demonstrated ResNet inference at 92.81% accuracy on CIFAR-10, reported as within 0.3% of digital, while using approximately 14× less energy.

Lab 3 — Beyond Digital: Novel Computing Paradigms

Explore neuromorphic, analog, and photonic approaches to AI compute

Your Task

Neuromorphic chips, analog in-memory computing, and photonic accelerators each represent a radical departure from conventional digital silicon. Explore their real-world status, limitations, and potential with the AI assistant.

Try: "For an autonomous drone that needs to detect obstacles at the edge with a 50mW power budget, is neuromorphic computing actually viable today, or is it still research?"

AI Lab Assistant

Novel Computing · M8-L3

Welcome to Lab 3. We're examining computing paradigms beyond conventional digital silicon: spiking neural networks on neuromorphic chips, analog in-memory computing, and photonic accelerators. Ask me about Intel Loihi, IBM TrueNorth, analog precision limits, or what workloads these technologies are actually ready for today. What's your question?

Module 8 · Lesson 4

The Geopolitics of Next-Generation Silicon

How export controls, national champions, and the TSMC chokepoint shape the post-GPU race

Who controls the factories that build post-GPU chips — and does it matter as much as who designs them?

On October 7, 2022, the U.S. Commerce Department published new export control rules that effectively prohibited the sale of advanced AI chips, and of the equipment to manufacture them, to China. The rules targeted chips with performance above specific thresholds (the A100 and H100 fell above the line) and, more significantly, restricted exports of advanced chipmaking equipment; ASML's extreme ultraviolet lithography machines were already barred from China under Dutch licensing policy. NVIDIA's China revenue, which had been approximately $4 billion annually, was abruptly put at risk. Within weeks, NVIDIA announced the A800, and later the H800, downgraded chips designed to fall just below the control thresholds. Those were banned in turn in October 2023 as the controls were tightened further.

The TSMC Chokepoint

Advanced semiconductor manufacturing is extraordinarily concentrated. As of 2024, Taiwan Semiconductor Manufacturing Company (TSMC) fabricates virtually all leading-edge AI chips at 5nm and below: NVIDIA's H100 and H200 (4nm), Apple's M-series (3nm), AMD's MI300X (5nm), Google's TPUv5 (5nm). TSMC's two most advanced nodes, N3 (3nm) and N4 (4nm), account for more than half of the world's leading-edge chip production capacity.

Samsung Foundry and Intel Foundry Services operate at comparable process nodes but have significantly lower yields, capacity, and customer trust for AI workloads. The practical consequence is that the entire global AI hardware industry — whether GPU-based, TPU-based, or any next-generation accelerator — depends on a relatively small number of fabs on a 14,000-square-mile island 110 miles from mainland China.

The U.S. CHIPS and Science Act of 2022 allocated $52.7 billion to rebuild domestic semiconductor manufacturing, with TSMC committing to build fabs in Arizona (Phoenix), Intel committing to Ohio, and Samsung to Texas. The Arizona fab began risk production in late 2024. Full-scale production at competitive advanced nodes is expected around 2026-2028, at costs 30-50% higher than equivalent Taiwan production.

China's Response: Huawei Ascend

Cut off from NVIDIA's most advanced chips, China's largest AI developers — Baidu, ByteDance, Alibaba — began transitioning to Huawei's Ascend 910B accelerator. Fabricated by SMIC on a 7nm-equivalent process (using older DUV rather than EUV lithography), the 910B delivers roughly 60-70% of H100 performance at comparable power. Baidu reported deploying Ascend 910B at scale for Ernie Bot training in 2023. The chip represents a remarkable engineering achievement given the equipment restrictions — but it also illustrates the efficiency tax that manufacturing constraints impose.

The Software Moat Problem

Hardware competition faces an asymmetric challenge: CUDA. NVIDIA's compute platform has been developed since 2006, has millions of lines of optimized kernels, and underlies virtually all major AI frameworks: PyTorch, TensorFlow, and JAX all treat CUDA as their default GPU backend. When Google, AMD, or an upstart accelerator company builds a chip that outperforms the H100 on raw specs, they still face the fact that every AI researcher knows CUDA and most training code is written to assume it.

AMD's ROCm platform is a CUDA-compatible alternative that reached functional parity for most PyTorch operations in 2023. Meta announced in early 2024 that it had achieved GPU-equivalent training performance on AMD MI300X accelerators for its Llama training runs, a significant milestone. Google's XLA compiler abstracts hardware for TPUs. Intel offers oneAPI for its GPUs and a separate software stack for its Gaudi accelerators. Each is technically capable; none has broken CUDA's network effects in the broader research community.

The CUDA moat is potentially more durable than any single chip architecture. Even if a technically superior accelerator reaches the market, the ecosystem around it — libraries, documentation, community knowledge, debugging tools — requires years to build. This is why NVIDIA's competitors invest as heavily in developer platforms as in chip design.

The Sovereign AI Hardware Movement

The export control regime has accelerated a movement toward sovereign AI hardware — nations and large economies building or procuring AI chips independently of U.S.-controlled supply chains. The European Union's Chips Act (2023) committed €43 billion to semiconductor manufacturing. Japan's Rapidus project targets 2nm production by 2027. The UAE's G42 has built massive GPU clusters through relationships that were subject to U.S. government review in 2024.

The India Semiconductor Mission approved its first fab projects in 2024. Each initiative faces the same challenge: chip design talent, EDA tool access, and the 15-year learning curve that separates leading-edge semiconductor manufacturing from everything else. Hardware geopolitics is a slow game: policy decisions made in 2024 produce manufacturing capacity in 2030.

Key Terms

Export Controls (BIS): U.S. Bureau of Industry and Security regulations that restrict the sale of advanced chips and semiconductor equipment to designated countries. They are the primary policy tool shaping AI hardware geopolitics.

EUV Lithography: Extreme Ultraviolet Lithography. The manufacturing process required for chips below ~7nm, with equipment supplied exclusively by ASML (Netherlands) and subject to U.S.-Dutch export control coordination.

CUDA Moat: NVIDIA's competitive advantage derived not from hardware alone but from its 18-year-old software ecosystem, the dominant reason AI developers remain on NVIDIA hardware despite competitive alternatives.

Sovereign AI Hardware: National programs to develop and manufacture AI accelerators domestically, reducing dependence on foreign-controlled supply chains.

The Module's Central Tension

The post-GPU future is being shaped simultaneously by physics (what silicon can do), economics (what custom chips cost to design and fabricate), software (what developers will actually write code for), and geopolitics (who is allowed to buy or build what). None of these vectors points to a single winner. The most likely near-term outcome is a fragmented landscape: GPUs for training, purpose-built accelerators for inference, neuromorphic and analog chips for edge sensing — and a geopolitical dividing line running through all of it.

Lesson 4 Quiz

The Geopolitics of Next-Generation Silicon

What manufacturing technology is required for chips below approximately 7nm, and who is the sole supplier of the equipment to make it?

Correct. ASML is the sole manufacturer of EUV lithography systems, which are required for leading-edge nodes — making them a critical geopolitical chokepoint.

Not quite. EUV (Extreme Ultraviolet) lithography is required for sub-7nm manufacturing, and ASML is the only company in the world that produces EUV lithography systems.

What milestone did Meta report achieving with AMD MI300X accelerators in early 2024?

Correct. Meta's announcement that it achieved GPU-equivalent training on MI300X was a significant signal that AMD's ROCm platform had reached practical parity for large-scale training workloads.

Not quite. Meta announced achieving GPU-equivalent training performance on AMD MI300X for its Llama model training — an important milestone demonstrating AMD as a viable alternative.

Why does the "CUDA moat" make it difficult for technically superior AI chips to displace NVIDIA?

Correct. The CUDA ecosystem — libraries, documentation, community knowledge, debugging tools — has been built over nearly two decades and represents a network-effect moat that is very difficult to replicate quickly.

Not quite. The CUDA moat is a software ecosystem problem: millions of developers know CUDA, all major AI frameworks default to it, and optimized libraries like cuDNN took years to build. Raw hardware superiority doesn't overcome that overnight.

Lab 4 — AI Hardware Geopolitics

Analyze the strategic and policy dimensions of the post-GPU hardware race

Your Task

Export controls, TSMC dependency, national chip programs, and the CUDA moat all intersect to shape who can build and deploy advanced AI hardware. In this lab, examine the strategic implications — for companies, nations, and the future of AI development.

Try: "If the U.S. export controls on AI chips are meant to slow China's AI development, how should we evaluate whether they're actually working — and what are the unintended consequences?"

AI Lab Assistant

Hardware Geopolitics · M8-L4

Welcome to Lab 4. We're examining the geopolitical dimensions of the AI hardware race: U.S. export controls, TSMC's strategic position, China's Huawei Ascend response, the CHIPS Act, and the software moat around CUDA. Ask me about any of these — including their second-order effects on the future of AI development globally. What would you like to explore?

Module 8 Test

What Comes After GPUs — 15 questions · 80% to pass

1. What does the "roofline model" identify in an AI workload?

Correct.

The roofline model is an analytical tool that identifies whether a workload is limited by compute throughput or memory bandwidth.
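As a rough illustration of the roofline idea, the sketch below takes attainable throughput to be the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity (FLOPs per byte moved). The function name and the peak-compute figure are illustrative assumptions; only the 3.35 TB/s bandwidth comes from this module.

```python
def roofline(peak_flops, mem_bandwidth, intensity):
    """Return (attainable FLOP/s, limiting resource) for one workload.

    intensity is arithmetic intensity: FLOPs performed per byte
    moved to or from memory.
    """
    memory_ceiling = mem_bandwidth * intensity
    if memory_ceiling < peak_flops:
        return memory_ceiling, "memory"
    return peak_flops, "compute"


PEAK_FLOPS = 1.0e15   # assumed round number for illustration, not a chip spec
BANDWIDTH = 3.35e12   # bytes/s, the H100 SXM5 figure quoted in this module

# A low-intensity workload (2 FLOPs per byte) sits under the bandwidth slope.
low = roofline(PEAK_FLOPS, BANDWIDTH, 2.0)

# A high-intensity workload (1000 FLOPs per byte) hits the flat compute roof.
high = roofline(PEAK_FLOPS, BANDWIDTH, 1000.0)

print(low, high)
```

Under these assumed numbers the first workload is memory-bound and the second is compute-bound, which is exactly the distinction the roofline model is used to draw.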

2. What memory technology provides the highest bandwidth in current GPU architectures?

Correct.

HBM (High-Bandwidth Memory), which stacks DRAM dies and connects them through a very wide interface, provides the highest memory bandwidth in data-center GPUs like the H100.

3. In what year was Google's TPUv1 actually deployed internally — before it was publicly disclosed?

Correct. TPUv1 was deployed in 2015, a full year before it was publicly disclosed at Google I/O 2016.

TPUv1 was deployed internally in 2015, one year before Google's public 2016 disclosure.

4. How many transistors does the Cerebras WSE-3 contain?

Correct. The WSE-3 contains 4 trillion transistors across its wafer-scale die.

The Cerebras WSE-3 contains 4 trillion transistors, made possible by building a single chip from nearly an entire 300 mm silicon wafer.

5. What is the key mechanism by which Groq's LPU achieves high inference throughput?

Correct. The LPU's deterministic compiler-scheduled execution eliminates runtime overhead from caching and scheduling, delivering consistent high-throughput inference.

Groq's LPU achieves its throughput through deterministic, statically scheduled execution: the compiler pre-plans every memory access, eliminating cache hierarchies and dynamic scheduling overhead.

6. What performance was demonstrated for Llama-2 70B inference on a Groq LPU node in late 2023?

Correct. Groq demonstrated over 300 tokens per second on Llama-2 70B — roughly 10× faster than comparable H100 inference.

Groq demonstrated over 300 tokens/second on Llama-2 70B in December 2023 — approximately 10× faster than comparable H100 inference.

7. What is the fundamental difference between how conventional neurons in deep learning and spiking neurons communicate?

Correct. The core distinction is continuous vs. discrete communication — enabling sparsity and event-driven computation in spiking networks.

Standard deep learning neurons pass continuous floating-point activations at every layer. Spiking neurons communicate via discrete spike events — binary, asynchronous, and sparse.
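To make the contrast concrete, here is a minimal sketch of a leaky integrate-and-fire neuron, a common textbook simplification of a spiking neuron. It is not necessarily the model used by any chip discussed in this module, and the threshold and leak values are illustrative assumptions.

```python
def lif_spikes(inputs, threshold=1.0, leak=0.9):
    """Simulate one leaky integrate-and-fire neuron over discrete time steps.

    Returns a list of 0/1 values: 1 where the neuron emitted a spike.
    """
    potential = 0.0
    spikes = []
    for current in inputs:
        # The membrane potential decays a little, then integrates new input.
        potential = potential * leak + current
        if potential >= threshold:
            spikes.append(1)   # discrete event: a spike is sent downstream
            potential = 0.0    # reset after firing
        else:
            spikes.append(0)   # nothing is communicated on this step
    return spikes


spike_train = lif_spikes([0.3, 0.3, 0.3, 0.3, 0.0, 0.0, 1.2])
print(spike_train)
```

Most time steps produce no output at all, which is the sparsity that event-driven hardware tries to exploit. A conventional neuron would instead emit a floating-point activation on every step.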

8. What physical law enables analog in-memory computing to sum the outputs of multiple memory cells computing dot products?

Correct. Kirchhoff's Current Law says the current flowing out of a node equals the sum of the currents flowing into it, so the currents from the cells on a shared column wire add up to the dot product result.

Kirchhoff's Current Law — which states that currents at a node sum — enables the column-wise accumulation of cell currents that computes the dot product in analog in-memory systems.
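The column-wise accumulation can be sketched numerically. In this idealized model each cell contributes a current equal to its conductance times the row voltage (Ohm's law), and the currents on a shared column wire simply add. The function name and the values are illustrative assumptions, and real devices add noise and drift that this sketch ignores.

```python
def crossbar_column_currents(voltages, conductances):
    """Idealized analog crossbar.

    voltages:     one input voltage per row
    conductances: rows x columns grid of cell conductances (the weights)
    Returns the total current collected on each column wire.
    """
    n_cols = len(conductances[0])
    currents = [0.0] * n_cols
    for v, row in zip(voltages, conductances):
        for col, g in enumerate(row):
            # Ohm's law gives the cell current; the column wire sums them.
            currents[col] += g * v
    return currents


voltages = [1.0, 2.0, 3.0]
conductances = [
    [1.0, 0.0],
    [0.5, 1.0],
    [0.0, 2.0],
]
currents = crossbar_column_currents(voltages, conductances)
print(currents)
```

Each column current is the dot product of the input voltages with that column's conductances, so a whole matrix-vector product emerges from the wiring in a single step rather than from a sequence of multiply-accumulate instructions.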

9. What was the primary limitation of analog in-memory computing identified in IBM's 2023 Nature Electronics research?

Correct. Analog cell conductance drifts with temperature, read cycles, and age. This limits precision to the equivalent of roughly 4 bits and restricts the approach to inference workloads that tolerate quantization.

The key limitation is device drift: analog memory cell conductance changes over time with temperature and use, limiting achievable precision to approximately 4-bit equivalent.

10. What specific event triggered the U.S. October 2022 export control rules that affected AI chip exports to China?

Correct. The October 7, 2022 rules were a proactive policy decision by BIS targeting chips — including the A100 and H100 — above specific compute thresholds.

The October 7, 2022 rules were a U.S. Commerce Department / BIS policy decision — proactively restricting chips above defined performance thresholds, not a response to a specific incident.

11. Approximately what fraction of the H100 SXM5's rated memory bandwidth does the chip typically utilize during LLM inference on real workloads?

Correct. LLM inference on GPUs is memory-bound — compute units frequently wait for weight data to arrive from HBM, meaning raw FLOP counts overstate usable throughput.

LLM inference on GPUs is memory-bound: compute cores often sit idle waiting for weight data to arrive from HBM. Adding more FLOPS without more memory bandwidth delivers diminishing returns.
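A back-of-envelope estimate shows why bandwidth sets the ceiling. The sketch below assumes a hypothetical 7-billion-parameter model stored at 2 bytes per parameter, a batch size of one, and that every weight is read from HBM once per generated token. It ignores the KV cache, activations, and any overlap, so treat the result as an idealized upper bound rather than a measurement.

```python
def max_decode_tokens_per_second(params, bytes_per_param, mem_bandwidth):
    """Upper bound on single-request decode speed when weight reads dominate.

    Assumes all weights are streamed from memory once per generated token.
    """
    weight_bytes = params * bytes_per_param
    return mem_bandwidth / weight_bytes


# Hypothetical model size and precision; the bandwidth is the H100 SXM5 figure.
bound = max_decode_tokens_per_second(
    params=7e9, bytes_per_param=2, mem_bandwidth=3.35e12
)

# Doubling bandwidth doubles the bound; peak FLOPS never enters the formula.
doubled = max_decode_tokens_per_second(
    params=7e9, bytes_per_param=2, mem_bandwidth=2 * 3.35e12
)

print(round(bound, 1), round(doubled, 1))
```

Compute throughput does not appear anywhere in the estimate, which is the sense in which adding FLOPS without adding bandwidth delivers diminishing returns.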

12. What is the Cerebras WSE-3's on-chip SRAM capacity?

Correct. The WSE-3 contains 44 GB of on-chip SRAM — more than the aggregate SRAM of roughly 200 H100 GPUs.

The Cerebras WSE-3 provides 44 GB of on-chip SRAM, more than the combined on-chip SRAM of roughly 200 H100 GPUs.

13. What was the Huawei Ascend 910B designed to address — and what is its key manufacturing limitation?

Correct. The Ascend 910B is China's primary response to GPU export controls, but SMIC's reliance on older DUV rather than EUV lithography imposes a performance tax compared to TSMC-fabricated chips.

The Ascend 910B was built to replace NVIDIA GPUs cut off by export controls. Its limitation is manufacturing: SMIC uses DUV (not EUV) lithography, and the resulting chip delivers roughly 60–70% of H100 performance.

14. What did Meta's early 2024 announcement about AMD MI300X training demonstrate?

Correct. Meta's announcement was significant because it provided real-world validation that AMD ROCm had reached parity for large-scale LLM training workloads.

Meta's announcement demonstrated training performance on AMD MI300X equivalent to NVIDIA GPUs, a significant milestone showing that ROCm had reached practical parity for production-scale AI training.

15. Which of the following best describes the likely near-term hardware landscape as post-GPU architectures mature?

Correct. The evidence points to specialization, not replacement: different workloads, different hardware, different supply chains — with geopolitics running through all of it.

The evidence supports fragmentation: GPUs for training, purpose-built accelerators for specific inference workloads, neuromorphic/analog chips for edge sensing, all within a geopolitically divided supply chain.