Module 5 · Lesson 1

The GPU Gold Rush

Why Nvidia went from a gaming chip maker to the backbone of the AI economy — and what happens when supply can't keep up with ambition.

What does it actually take, in silicon and watts, to run the AI systems everyone is talking about?

In January 2024, Meta CEO Mark Zuckerberg said the company expected to have roughly 350,000 Nvidia H100 GPUs by the end of the year, a fleet that would cost somewhere north of $10 billion at list price. The chips were needed to train its Llama models and power recommendation systems serving billions of users. Getting them was hard. Through much of 2023, the waiting list for H100s stretched six months or more. Microsoft, Google, and Amazon were all in the same queue.

Why AI Runs on GPUs

A central processing unit (CPU) is a versatile generalist — it has a handful of very powerful cores optimized for sequential logic. A graphics processing unit (GPU) is the opposite: thousands of smaller, simpler cores designed to do the same arithmetic operation on many data points simultaneously. Training a neural network is, at its core, trillions of matrix multiplications. GPUs are purpose-built for that workload.
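To make the parallelism point concrete, here is a minimal Python sketch (not how a GPU is actually programmed). It shows that every cell of a matrix product is an independent dot product, so the cells can be computed in any order, or all at once. The function names `matmul_cell`, `matmul`, and `multiply_add_count` are illustrative and do not come from any library.

```python
def matmul_cell(a, b, i, j):
    """One output cell: the dot product of row i of a and column j of b."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))


def matmul(a, b, cell_order=None):
    """Multiply two matrices given as lists of rows.

    cell_order is an optional list of (i, j) pairs. Any order gives the
    same result because no cell depends on another, which is the property
    a GPU exploits by handing cells to thousands of cores at once.
    """
    rows, cols = len(a), len(b[0])
    if cell_order is None:
        cell_order = [(i, j) for i in range(rows) for j in range(cols)]
    out = [[0] * cols for _ in range(rows)]
    for i, j in cell_order:
        out[i][j] = matmul_cell(a, b, i, j)
    return out


def multiply_add_count(m, k, n):
    """Multiply-adds needed for an (m x k) times (k x n) product."""
    return m * k * n


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
forward = matmul(a, b)
shuffled = matmul(a, b, cell_order=[(1, 1), (1, 0), (0, 1), (0, 0)])
print(forward, forward == shuffled)
print(multiply_add_count(4096, 4096, 4096))  # one mid-sized layer-like product
```

A single 4096 x 4096 by 4096 x 4096 product already needs tens of billions of multiply-adds, which gives a feel for how a full training run reaches the trillions.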

Nvidia's CUDA programming platform, released in 2006, let developers write code that ran directly on GPU hardware without needing graphics expertise. By the time deep learning took off with AlexNet's ImageNet victory in 2012, researchers were already writing CUDA kernels to train networks. That years-long head start created a software moat that competitors have struggled to cross.

~$30K

Approximate price per Nvidia H100 SXM5 GPU (2023)

700W

Thermal design power of a single H100 SXM5 under full AI load

80 GB

HBM3 memory per H100 — critical for fitting large model weights

3,958 TOPS

INT8 tensor performance with sparsity, a headline figure for inference throughput
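As a rough sanity check on the figures above, the sketch below multiplies the approximate per-GPU price and power draw by a 350,000-GPU fleet. It is back-of-the-envelope arithmetic only: the helper names are made up for illustration, and the result ignores volume discounts, host servers, networking, and cooling.

```python
H100_PRICE_USD = 30_000  # approximate per-GPU price from the card above
H100_TDP_WATTS = 700     # thermal design power per GPU from the card above


def fleet_cost_usd(gpu_count, price_usd=H100_PRICE_USD):
    """Hardware cost of the GPUs alone, at the approximate per-unit price."""
    return gpu_count * price_usd


def fleet_gpu_power_megawatts(gpu_count, tdp_watts=H100_TDP_WATTS):
    """Power drawn by the GPUs alone at full load, in megawatts."""
    return gpu_count * tdp_watts / 1_000_000


gpus = 350_000
print(f"GPU cost:  ${fleet_cost_usd(gpus) / 1e9:.1f} billion")
print(f"GPU power: {fleet_gpu_power_megawatts(gpus):.0f} MW")
```

At these assumed values the fleet comes to about $10.5 billion and roughly 245 MW for the GPUs alone, consistent with the "north of $10 billion" figure.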

CUDA's Lock-in Effect

Nvidia's real product isn't the GPU die; it's the stack. CUDA provides the runtime, the math libraries (cuDNN, cuBLAS), the profiling tools, and the community of developers who have spent careers learning to optimize for it. PyTorch and TensorFlow both target CUDA as their default GPU backend. When AMD released its MI300X GPU in late 2023 with competitive raw specs, the challenge wasn't performance. It was ROCm, AMD's CUDA equivalent, which still had gaps in library support and debugging tooling.

Google's internal TPU program is the most serious alternative. The company has deployed five generations of Tensor Processing Units, with TPU v5e entering external availability in 2023. TPUs are custom ASICs optimized for specific matrix sizes used in Transformer models. Google uses them to train Gemini and to run Search's AI features. But TPUs are only available as Google Cloud services — you can't buy one.

Real Event

In March 2024, Nvidia CEO Jensen Huang announced the Blackwell B200 GPU at the company's GTC conference, promising 2.5x the training performance of H100 and a new 1,000W power envelope. Microsoft, Google, Meta, and Amazon all announced they had already committed to Blackwell orders before the chip was publicly revealed, illustrating how procurement decisions now happen years ahead of volume availability.

The Supply Constraint That Shaped 2023

All high-end AI GPUs are built on TSMC's advanced process nodes: H100 on the custom 4N process, Blackwell on 4NP. TSMC has limited capacity at leading-edge nodes, shared among Apple (iPhone chips), AMD, Qualcomm, and AI GPU demand. In 2023, Nvidia's AI GPU allocation consumed a disproportionate share of TSMC's CoWoS advanced packaging capacity, which is needed to attach high-bandwidth memory to the GPU die. That packaging bottleneck, not the raw silicon, was what created the six-month wait lists.

The financial consequences were stark: Nvidia's data center revenue went from $3.6 billion in Q1 2023 to $18.4 billion in Q1 2024 (the company's fiscal quarters ending in late January of each year), a roughly 5x increase in 12 months. Few, if any, technology companies have grown that fast at that scale.

Key Insight

The GPU supply crunch revealed a structural reality: AI capability is not just about software or algorithms — it is physically constrained by semiconductor manufacturing. Whoever controls leading-edge fab capacity controls the pace of AI development. That is why hyperscalers are now designing their own chips, and why the US export controls placed on H100s to China in October 2022 were treated as a national security matter.

CoWoS

Chip-on-Wafer-on-Substrate: TSMC's advanced packaging technology that places HBM memory stacks directly adjacent to the GPU die, enabling the extreme memory bandwidth AI models require. Capacity constraints here, not silicon yield, were the 2023 bottleneck.

HBM (High-Bandwidth Memory)

3D-stacked DRAM manufactured by SK Hynix, Samsung, and Micron. H100 uses HBM3 delivering ~3.35 TB/s of memory bandwidth, more than 60x the roughly 50 GB/s of a conventional DDR5 DIMM. Critical for keeping GPU cores fed during large model inference.

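FLOPS vs TOPS

Floating-point operations per second (FLOPS) is the usual unit for training throughput, where work runs in floating-point formats such as BF16 or FP8. Tera-operations per second (TOPS) is usually quoted for integer formats such as INT8, which are common in inference. Marketing figures often mix these, so always check the precision (FP8, INT8, BF16) when comparing chips.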

Lesson 1 Quiz

The GPU Gold Rush — 4 questions

Why did Nvidia's H100 GPU face supply shortages in 2023 despite strong TSMC production capacity?

Correct. The six-month wait lists were driven by CoWoS advanced packaging capacity — the process that attaches HBM memory stacks to the GPU die — not by any shortage of GPU silicon itself.

Not quite. The shortage was a packaging bottleneck, not a silicon yield or supply issue. CoWoS capacity at TSMC was the limiting factor throughout 2023.

What is the primary reason PyTorch and TensorFlow default to Nvidia CUDA rather than AMD's ROCm platform?

Correct. CUDA's moat is primarily software — cuDNN, cuBLAS, profiling tools, and a decade of developer habits make switching costs very high even when AMD hardware specs are competitive.

Not quite. The moat is software, not raw specs. CUDA's library ecosystem and developer familiarity, built since 2006, is the key lock-in factor.

What was notable about the Blackwell B200 GPU announcements at GTC in March 2024?

Correct. Microsoft, Google, Meta, and Amazon had all already committed to Blackwell orders before Jensen Huang's GTC announcement, showing procurement timelines that precede public product launches by a year or more.

Incorrect. The significant detail was that hyperscalers had pre-committed to orders before the product was publicly revealed, illustrating how far in advance AI infrastructure procurement now operates.

How did Nvidia's data center revenue change between Q1 2023 and Q1 2024?

Correct. Nvidia's data center segment grew roughly 5x in 12 months — from $3.6B to $18.4B — a pace unprecedented at that revenue scale in technology hardware history.

Not quite. The growth was approximately 5x — $3.6 billion in Q1 2023 to $18.4 billion in Q1 2024 — driven almost entirely by AI GPU demand, not gaming.

Lab 1 — GPU Supply Chain Analysis

Discuss GPU bottlenecks and their strategic implications with the AI assistant

Your Mission

You are advising a mid-sized enterprise that wants to build an on-premises AI inference cluster for sensitive financial data. They cannot use public cloud. They need to understand whether buying Nvidia H100s now makes sense, or whether waiting for Blackwell or exploring AMD MI300X is the better strategy.

Have at least 3 exchanges with the assistant to complete this lab. Ask about procurement timing, alternative chips, and the CUDA software lock-in risk.

Suggested opener: "Our CFO approved budget for 32 GPU nodes for on-prem AI inference. Should we buy H100s now or wait for Blackwell? What's the realistic timeline risk?"

AI Infrastructure Advisor

GPU Strategy

Welcome. I'm here to help you think through GPU procurement strategy for on-premises AI infrastructure. The H100 vs Blackwell timing question is genuinely tricky right now — there are real cost, availability, and compatibility factors at play. What's your primary use case: training, fine-tuning, or inference? And do you have a hard deadline for the cluster going live?

Module 5 · Lesson 2

Hyperscaler Custom Silicon

Google, Amazon, Microsoft, and Meta are all building their own AI chips. Why, and what does it mean when the biggest cloud customers stop buying from Nvidia?

Can any company replicate Google's decade-long TPU advantage, and why is every hyperscaler now trying?

In 2013, Google engineers ran a calculation that frightened the company's leadership: if people used voice search for just three minutes a day with the neural network speech models then under development, Google would need to double its global data center capacity to handle the inference load. CPUs were too slow and too expensive per operation. Nvidia GPUs existed but were power-hungry and not optimized for the specific matrix sizes in Google's models. The decision was made to build a custom chip. The first Tensor Processing Unit went into production inside Google data centers in 2015, a year before AlphaGo, which relied on TPU inference, defeated Lee Sedol.

The TPU Genealogy

TPU v1 · 2015

Inference only. 8-bit integer arithmetic, 92 TOPS, 40W. Deployed internally in Google data centers. Used for RankBrain (search ranking) and Street View image processing. Never sold externally.

TPU v2 · 2017

Training + inference. Introduced bfloat16 format — a 16-bit float that fits in the same hardware as FP16 but preserves the dynamic range of FP32, critical for stable training. Available on Google Cloud. Used to train AlphaStar and AlphaFold.

TPU v4 · 2021

Pod-scale interconnect. 4,096 chips networked with optical circuit switching into a "pod" delivering 1.1 exaflops of bfloat16 compute. Used to train PaLM (540B parameters) and the original Gemini models. Google claims 1.2–1.7x better performance-per-dollar vs comparable Nvidia systems for Transformer training.

TPU v5e · 2023

Inference-optimized, cost-focused. Announced for external availability November 2023. Targets fine-tuning and serving mid-size models. Priced at roughly $2.20/chip-hour on Google Cloud — positioned against Nvidia A100s for inference workloads.

Amazon's Trainium and Inferentia

AWS took a two-chip approach. Inferentia (first generation 2019, second generation 2023) targets inference only, running at high throughput for production serving. Trainium (2021, Trainium 2 in 2023) targets training. Amazon uses both internally — Alexa's neural ranking models run on Inferentia — and offers them as EC2 instances (Inf2, Trn1) at pricing designed to undercut Nvidia GPU instances by 40–70% on per-inference-token cost for specific model families.

The catch: you must recompile your model using the AWS Neuron SDK, which adds engineering overhead and doesn't support every layer type out of the box. For companies already invested in CUDA-optimized code, the migration cost can be substantial. For greenfield projects, particularly companies training proprietary models on AWS, Trainium is increasingly attractive.

Real Event

In November 2023, Amazon announced Trainium2 would power new EC2 Trn2 instances with 4x the performance of Trainium1. More significant: Amazon disclosed it was deploying Trainium2 clusters internally at scales that would make it one of the largest AI compute deployments in the world — suggesting AWS is now training foundation models at hyperscaler scale on its own silicon rather than on Nvidia GPUs.

Microsoft's Maia and Meta's MTIA

Microsoft revealed Azure Maia 100 in November 2023 — an AI accelerator chip designed specifically for running (not training) large language models in Azure data centers. It powers Copilot inference. Microsoft was explicit that Maia is not a replacement for Nvidia GPUs for general training workloads; it is an inference cost-reduction play for models Microsoft runs at enormous scale.

Meta announced MTIA (Meta Training and Inference Accelerator) in April 2023. The first-generation chip is focused on ranking and recommendation models — the workloads that run Facebook and Instagram feeds — rather than generative AI. Meta's generative AI (Llama training) still runs on Nvidia GPUs. MTIA represents a pragmatic split: custom silicon for the highest-volume, most predictable workloads; Nvidia for the frontier research that benefits from the software ecosystem.

Strategic Pattern

Every hyperscaler's custom silicon strategy follows the same logic: at sufficient scale, the margin on inference is large enough that a custom chip paying back its design cost over millions of chip-hours beats a general-purpose GPU. Google, Amazon, Microsoft, and Meta each run workloads at scales where even a 30% efficiency gain justifies a $500M chip design program. Smaller companies almost never reach that scale threshold — which is why the custom silicon trend does not threaten Nvidia's revenue from enterprises, only from the hyperscalers themselves.
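The break-even logic above can be sketched as simple arithmetic. The $500M design cost and 30% efficiency gain come from the text; the GPU cost per chip-hour and the three-year payback window are hypothetical placeholders chosen only to make the example run, and the helper names are illustrative.

```python
HOURS_PER_YEAR = 8_760


def break_even_chip_hours(design_cost_usd, gpu_cost_per_chip_hour, efficiency_gain):
    """Chip-hours of work needed before savings repay the design program.

    efficiency_gain is the fraction of GPU cost saved per chip-hour,
    for example 0.30 for a 30% gain.
    """
    saving_per_chip_hour = gpu_cost_per_chip_hour * efficiency_gain
    return design_cost_usd / saving_per_chip_hour


def chips_needed(chip_hours, payback_years):
    """Chips running around the clock needed to log those hours in time."""
    return chip_hours / (payback_years * HOURS_PER_YEAR)


# Hypothetical input: $2.00 per GPU chip-hour is an assumption, not a quoted price.
hours = break_even_chip_hours(500_000_000, gpu_cost_per_chip_hour=2.00, efficiency_gain=0.30)
print(f"Break-even: {hours / 1e6:.0f} million chip-hours")
print(f"Chips running 24/7 for 3 years: {chips_needed(hours, 3):,.0f}")
```

Under these assumed inputs the program only pays back with tens of thousands of chips running continuously for years, which is the scale threshold most enterprises never reach.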

bfloat16

Brain floating-point 16-bit format invented at Google Brain. Same exponent range as FP32 (8 bits) but only 7 mantissa bits instead of 23. Training stays numerically stable while memory and compute cost drops by ~50% vs FP32. Now supported in Nvidia A100/H100 and all major custom chips.

ASIC

Application-Specific Integrated Circuit. Unlike a GPU (which is programmable for any parallel workload), an ASIC is hardwired for specific operations. TPUs and Trainium are ASICs: they are faster and more efficient for the operations they target, but cannot be reprogrammed for arbitrary new workloads.
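The bfloat16 layout described above can be illustrated with Python's standard `struct` module. This is a sketch under one simplifying assumption: it converts by truncation (keeping the top 16 bits of the FP32 pattern), whereas real hardware typically rounds to nearest. The helper names are illustrative.

```python
import struct


def float32_bits(x):
    """Raw 32-bit pattern of x stored as an IEEE 754 single-precision float."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def to_bfloat16_bits(x):
    """Keep sign, 8 exponent bits, and the top 7 mantissa bits (truncation)."""
    return float32_bits(x) >> 16


def from_bfloat16_bits(bits):
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]


def exponent_field_fp32(x):
    return (float32_bits(x) >> 23) & 0xFF


def exponent_field_bf16(x):
    return (to_bfloat16_bits(x) >> 7) & 0xFF


def fits_in_fp16(x):
    """True if x can be packed as IEEE half precision without overflowing."""
    try:
        struct.pack(">e", x)
        return True
    except (OverflowError, struct.error):
        return False


big = 1e30  # far beyond FP16's largest finite value (65504)
approx = from_bfloat16_bits(to_bfloat16_bits(big))
print(f"bfloat16 round trip of {big:g}: {approx:g}")
print("same exponent field as FP32:", exponent_field_fp32(big) == exponent_field_bf16(big))
print("fits in FP16:", fits_in_fp16(big))
```

The round trip loses precision (at most about 1% with truncation) but keeps the magnitude, while the same value cannot be represented in FP16 at all. That is the dynamic-range trade the glossary entry describes.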

Lesson 2 Quiz

Hyperscaler Custom Silicon — 4 questions

What was the specific calculation that motivated Google to build the first TPU rather than continue using CPUs or Nvidia GPUs?

Correct. The 2013 calculation was specifically about voice search inference load, not training. If users spoke to voice search for 3 minutes/day, the inference cost would require doubling Google's entire data center footprint.

Not quite. The trigger was an inference cost projection: three minutes per day of voice search per user would require doubling Google's global data center capacity with existing hardware.

What is the key advantage of bfloat16 over standard FP16 for neural network training?

Correct. bfloat16 keeps 8 exponent bits (same as FP32) while reducing mantissa to 7 bits. This preserves the dynamic range that matters for stable training while halving memory vs FP32.

Incorrect. The key advantage is numerical stability: bfloat16 keeps FP32's exponent range, preventing the gradient underflow problems that make standard FP16 tricky for training large models.

What is the practical limitation of Amazon's Inferentia/Trainium approach for companies already running Nvidia-based AI systems?

Correct. The migration cost is real — recompiling to Neuron SDK, testing layer compatibility, re-validating accuracy — which is why the 40–70% cost savings doesn't automatically win over teams with CUDA-optimized pipelines.

Not quite. The main barrier is the engineering overhead of recompiling to AWS Neuron SDK and dealing with incomplete layer support — a real migration cost that offsets the per-inference pricing advantage.

Why does the hyperscaler custom silicon trend NOT threaten Nvidia's revenue from enterprise customers?

Correct. The economics of custom silicon require enormous inference volume to amortize design cost. Google, Amazon, Meta, and Microsoft run workloads at scales that justify a $500M chip program; most enterprises do not.

Not quite. The key point is economics of scale: a custom ASIC program costing $500M+ only pays back if you're running it at hyperscaler inference volumes. Most enterprises never approach that scale threshold.

Lab 2 — Custom Silicon Strategy

Analyze when hyperscaler custom chip economics make sense for an organization

Your Mission

You work at a large bank that is evaluating whether to build an on-premises AI inference cluster or use cloud-based AI APIs. The bank's AI team is asking whether using Google TPU v5e or AWS Inferentia instances makes more sense than renting Nvidia A100 GPU instances — or whether on-prem Nvidia H100s are best for a regulated financial institution.

Have at least 3 exchanges. Ask about the break-even economics, regulatory considerations for cloud vs on-prem, and whether bfloat16 precision matters for financial models.

Suggested opener: "We run about 50 million inference calls per day on our credit risk scoring models. Currently on Nvidia A100 GPU instances on AWS. Should we migrate to Inferentia2, or is the migration cost not worth it at our scale?"

AI Infrastructure Advisor

Custom Silicon

Good question — 50 million inference calls per day is actually approaching the scale where the Inferentia2 migration economics start to look compelling, depending on your model architecture and latency requirements. Let me ask a few things before we run the numbers: What's the average latency budget for a credit risk score — are we talking sub-10ms, or is 100ms acceptable? And are these models transformer-based, or more traditional gradient-boosted tree ensembles?

Module 5 · Lesson 3

Data Center Power and the Energy Reckoning

AI training runs have already consumed more electricity than some countries. The constraint on future AI capability may not be chips — it may be power.

When a single training run consumes as much electricity as thousands of US homes use in a year, where does the power come from, and who pays?

OpenAI has never published exact figures for GPT-4's training energy use. Independent estimates from researchers at the University of Washington, using disclosed model size and known GPU cluster specifications, put the training run somewhere between 50 and 150 gigawatt-hours. At the US average electricity price of roughly $0.08/kWh for industrial customers, that implies an electricity cost of $4M to $12M — just for the power bill, separate from hardware amortization. A single training run. One experiment.
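The power-bill arithmetic above is easy to reproduce. The sketch below uses the 50 to 150 GWh range and the $0.08/kWh price from the text. The household figure of 10,500 kWh per year is an assumed round number for an average US home, included only to give a sense of scale, and the function names are illustrative.

```python
KWH_PER_GWH = 1_000_000
ASSUMED_KWH_PER_US_HOME_PER_YEAR = 10_500  # assumption, not from the lesson


def electricity_cost_usd(energy_gwh, price_per_kwh):
    """Electricity cost for a given energy use at a flat price per kWh."""
    return energy_gwh * KWH_PER_GWH * price_per_kwh


def equivalent_home_years(energy_gwh, kwh_per_home=ASSUMED_KWH_PER_US_HOME_PER_YEAR):
    """How many homes the same energy would supply for one year."""
    return energy_gwh * KWH_PER_GWH / kwh_per_home


for gwh in (50, 150):
    cost = electricity_cost_usd(gwh, 0.08)
    homes = equivalent_home_years(gwh)
    print(f"{gwh} GWh: ${cost / 1e6:.0f}M in electricity, about {homes:,.0f} home-years")
```

The low and high ends reproduce the $4M and $12M figures, and under the assumed household consumption the run equals the annual use of several thousand homes.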

The Scale of AI Power Demand

In early 2024, the International Energy Agency published projections that data center electricity consumption could double from 2022 levels by 2026, driven primarily by AI workloads. Goldman Sachs research estimated that AI data centers would require 160 terawatt-hours of additional electricity annually by 2030 — roughly equivalent to the entire electricity consumption of Sweden.

Microsoft disclosed in its 2022 Environmental Sustainability Report that its water consumption had increased 34% between 2021 and 2022, reaching 6.4 million cubic meters, driven primarily by data center cooling. The company acknowledged the tension between its 2030 carbon negative pledge and the explosive growth in AI infrastructure it was simultaneously building.

~700W

Power draw of a single Nvidia H100 SXM5 GPU under full AI load

~10MW

Approximate facility power draw of a 10,000-GPU H100 training cluster (about 7 MW for the GPUs alone)

1–2 GW

Power capacity of a hyperscaler mega-campus now under planning (Microsoft Iowa, Google Oklahoma)

1.7×

PUE of an older air-cooled facility: for every watt of compute, it spends ~0.7W more on cooling and distribution

Why Power Is Becoming the Real Constraint

Building a 1-gigawatt data center campus is not primarily an engineering challenge — it is a utility negotiation. Electrical grids in the US are regulated at the state level. Getting a new high-voltage transmission line permitted and built takes 7–12 years in most jurisdictions. Data centers want power in 18–36 months. The gap between those timelines is creating a hard ceiling on AI infrastructure expansion in many regions.

Northern Virginia's Loudoun County is often cited as carrying roughly 70% of the world's internet traffic through its data centers. By 2023, Dominion Energy, the regional utility, had issued moratoriums on new large data center connections in some substations, citing grid capacity constraints. Microsoft, Google, and Amazon all faced connection delays. The constraint was not land, not chips, not capital. It was utility grid headroom.

Real Event

In September 2024, Microsoft and Constellation Energy signed a 20-year power purchase agreement for nuclear energy from the Three Mile Island Unit 1 reactor in Pennsylvania, a unit that had been shut down in 2019 due to economics. Constellation agreed to restart it specifically to supply Microsoft's AI data centers. The deal illustrated how seriously hyperscalers are treating the power constraint: they are now directly funding nuclear restarts rather than waiting for grid upgrades.

Nuclear, Renewables, and the Grid Reality

Solar and wind power are intermittent: they do not match data center demand curves, which are flat 24/7. Battery storage at gigawatt scale remains economically marginal. This is driving renewed interest in nuclear: firm, carbon-free power that data centers can actually rely on. Beyond Three Mile Island, Amazon signed a deal with Talen Energy in 2024 for a 960-megawatt nuclear-powered data center campus directly adjacent to the Susquehanna nuclear plant in Pennsylvania.

Google has invested in small modular reactor (SMR) startups, signing an agreement with Kairos Power in 2024 to purchase power from reactors expected to come online from 2030 onward. The challenge: no commercial SMR has yet been built anywhere in the world at scale. These are bets on technology that is still years from proven deployment.

The PUE (Power Usage Effectiveness) metric tracks how efficiently a data center converts electrical input into useful compute. A PUE of 1.0 is theoretically perfect; efficient modern facilities run 1.1–1.5. Liquid cooling, which runs water or dielectric fluid directly over chip packages, is becoming standard for AI GPU clusters because air cooling cannot remove heat fast enough from 700W chips packed at density. Google's liquid-cooled TPU pods operate at PUE close to 1.1; older air-cooled facilities average 1.5–1.7.

Strategic Implication

The companies best positioned for the next phase of AI scaling are not just those with the most chips — they are those that have secured reliable, large-scale power contracts ahead of the grid constraints tightening. Power procurement is now a core AI competitive moat, alongside chip supply and software talent.

PUE (Power Usage Effectiveness)

Total data center power consumption divided by IT equipment power consumption. PUE of 1.2 means for every 100W of compute, 20W goes to cooling, lighting, and distribution overhead. Lower is better. The industry average in 2023 was roughly 1.55; hyperscaler-built facilities average 1.1–1.2.

Power Purchase Agreement (PPA)

Long-term contract (typically 10–25 years) between an energy generator and a buyer, fixing price and volume of electricity. Hyperscalers use PPAs to secure dedicated power from specific renewable or nuclear sources without owning the generation assets.
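The PUE definition above turns directly into a small calculation. The sketch below is a simplification that treats GPU draw as the entire IT load (real clusters also power host CPUs, storage, and network switches); the function names are illustrative.

```python
def cluster_gpu_power_watts(gpu_count, watts_per_gpu=700):
    """GPU power at full load, using the 700W H100 figure from the lesson."""
    return gpu_count * watts_per_gpu


def total_facility_power(it_power, pue):
    """Total facility power = PUE x IT equipment power (same unit in and out)."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_power * pue


def overhead_power(it_power, pue):
    """Power spent on cooling, distribution, and other non-IT overhead."""
    return total_facility_power(it_power, pue) - it_power


gpu_mw = cluster_gpu_power_watts(2_000) / 1_000_000
for pue in (1.1, 1.4, 1.7):
    total = total_facility_power(gpu_mw, pue)
    print(f"PUE {pue}: {total:.2f} MW total, {overhead_power(gpu_mw, pue):.2f} MW overhead")
```

For a 2,000-GPU cluster the GPUs alone draw 1.4 MW, and the gap between a PUE of 1.1 and 1.7 is the difference between roughly 0.1 MW and 1 MW of overhead.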

Lesson 3 Quiz

Data Center Power and the Energy Reckoning — 4 questions

What was significant about Microsoft's 2024 agreement with Constellation Energy regarding Three Mile Island?

Correct. Three Mile Island Unit 1 was shut down in 2019 for economic reasons. Constellation agreed to restart it under a 20-year power purchase agreement with Microsoft — the hyperscaler directly funding nuclear restarts to solve its power supply problem.

Not quite. The key fact: Three Mile Island Unit 1 had been shut down in 2019. Microsoft's 20-year PPA was specifically structured to fund its restart — a hyperscaler directly enabling nuclear generation to solve AI data center power needs.

Why is solar and wind power generally insufficient as the primary power source for AI data centers, despite being low-cost and carbon-free?

Correct. Data centers need firm, dispatchable power 24 hours a day, 365 days a year. Solar and wind are intermittent; battery storage at gigawatt scale is still not economically viable — which is why nuclear is attracting hyperscaler interest.

Incorrect. The issue is intermittency: data centers require 24/7 firm power, and solar and wind don't generate consistently on demand. Grid-scale battery storage remains too expensive to bridge the gaps.

What happened in Northern Virginia's Loudoun County data center corridor by 2023 that illustrates the power constraint problem?

Correct. Dominion Energy — the regional utility — issued moratoriums on new large data center connections in some substations. The constraint was grid headroom, not land or capital, demonstrating that power procurement is now a core AI infrastructure bottleneck.

Not quite. Dominion Energy issued connection moratoriums at some substations — blocking new large data center connections because the local grid lacked headroom. This happened in the region hosting roughly 70% of global internet traffic.

A data center has a PUE of 1.4. If its GPU cluster draws 10 megawatts of compute power, what is its total electricity consumption?

Correct. PUE = total facility power / IT equipment power. So total = PUE × IT power = 1.4 × 10MW = 14MW. The extra 4MW goes to cooling, lighting, UPS losses, and power distribution.

Not quite. PUE × IT equipment power = total facility power. 1.4 × 10MW = 14MW total. The additional 4MW above the compute draw goes to cooling systems, UPS losses, and power distribution overhead.

Lab 3 — AI Data Center Power Planning

Work through the energy economics of an AI infrastructure expansion

Your Mission

Your company is planning to build a dedicated AI training data center with 2,000 Nvidia H100 GPUs. You need to present a power strategy to the board that addresses total power demand, sourcing options, cooling approach, and cost implications.

Have at least 3 exchanges. Ask about total power draw calculations, renewable vs nuclear sourcing tradeoffs, liquid cooling requirements, and what PUE target is realistic.

Suggested opener: "We're planning a 2,000 H100 GPU training cluster. Walk me through how to calculate total power demand and what power sourcing options we should present to the board."

AI Infrastructure Advisor

Power Strategy

Good starting point. Let's build this up from first principles so you have defensible numbers for the board. The H100 SXM5 has a 700W thermal design power — that's per GPU under full training load. With 2,000 GPUs, you're looking at 1.4 megawatts of compute power before we account for networking switches, storage, CPUs in the host servers, or cooling. What geographic region are you targeting for the facility? That affects both grid availability and cooling options significantly.

Module 5 · Lesson 4

The Networking and Memory Frontier

The next bottleneck isn't the GPU — it's how chips talk to each other. InfiniBand, NVLink, and the race for memory bandwidth that will define which models are trainable.

If you can't move data between chips faster than you can compute with it, what does it matter how many chips you have?

When OpenAI trained GPT-3 on Microsoft's infrastructure in 2020, the model's 175 billion parameters required distributing training across a large cluster of Nvidia V100 GPUs working simultaneously. Each parameter update required synchronizing gradient information across every GPU in the cluster. At 175 billion parameters in 16-bit precision, the model alone occupies roughly 350 gigabytes, far more than any single GPU's memory. The limiting factor was not compute. It was the speed and reliability of the interconnect between chips. A single faulty network switch in a 1,000-GPU cluster could stall the entire training run.

Why Interconnect Is Now the Bottleneck

Modern AI training is highly parallel within a single forward/backward pass, but at the boundaries between GPUs, communication overhead accumulates. There are three parallelism strategies used in large model training: data parallelism (each GPU gets a different batch of training data, gradients are averaged), tensor parallelism (individual matrix operations split across GPUs), and pipeline parallelism (different model layers on different GPUs). All three require fast, low-latency communication.
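Data parallelism, the first of the three strategies, can be sketched in plain Python. Each simulated worker computes a gradient on its own shard of the batch, and the results are averaged, which is the job an all-reduce does over the interconnect in a real cluster. This is a toy one-parameter model with equal-size shards, not a real training framework, and the function names are illustrative.

```python
def gradient(w, xs, ys):
    """Gradient of mean squared error for the model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_gradient(w, xs, ys, num_workers):
    """Split the batch across workers, then average their local gradients.

    With equal-size shards the average of local gradients equals the
    full-batch gradient, so every worker can apply the same update.
    """
    shard = len(xs) // num_workers
    if shard * num_workers != len(xs):
        raise ValueError("batch size must divide evenly across workers")
    local = [
        gradient(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    # In a real cluster this averaging step is the all-reduce that
    # crosses NVLink or InfiniBand on every training step.
    return sum(local) / num_workers


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
print(gradient(0.5, xs, ys), data_parallel_gradient(0.5, xs, ys, num_workers=2))
```

The compute splits cleanly, but the averaging step has to move one gradient value per parameter between workers on every update. At billions of parameters, that traffic is where the interconnect becomes the limit.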

Standard Ethernet, even at 400 gigabits per second, introduces too much latency and CPU overhead for GPU-to-GPU gradient synchronization. This is why AI clusters use InfiniBand — a networking standard designed for high-performance computing that bypasses the CPU entirely through Remote Direct Memory Access (RDMA). Nvidia acquired Mellanox, the dominant InfiniBand vendor, in 2020 for $6.9 billion. That acquisition, in retrospect, was as strategically important as any chip design decision Nvidia has made.

Real Event

In March 2024, Nvidia announced a new generation of NVLink and the NVLink Switch, enabling up to 576 Blackwell GPUs to communicate with each other at 1.8 terabytes per second of bidirectional bandwidth per GPU, effectively making 576 chips behave like a single unified pool of compute and memory. This was a direct response to the scaling problem: as models exceed 1 trillion parameters, InfiniBand links between separate nodes cannot match the bandwidth of NVLink within a unified pod.

The Memory Wall Problem

GPT-4 is estimated to have roughly 1.8 trillion parameters in a mixture-of-experts architecture. At BF16 precision, that's approximately 3.6 terabytes of model weights. No single GPU has 3.6 terabytes of memory. The H100 has 80 gigabytes. Loading even a fraction of GPT-4's weights for a single inference request requires a complex dance of memory management across many chips, each contributing their 80GB to a shared logical address space.
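The memory arithmetic above can be checked in a few lines. The sketch below counts only the bytes needed to hold the weights at 2 bytes per parameter (16-bit precision) and divides by 80 GB per GPU. It deliberately ignores activations, optimizer state, and KV caches, so real deployments need more GPUs than this lower bound. The function names are illustrative.

```python
def weight_bytes(num_params, bytes_per_param=2):
    """Bytes needed to store the weights alone (2 bytes for FP16/BF16)."""
    return num_params * bytes_per_param


def min_gpus_for_weights(num_params, gpu_memory_gb=80, bytes_per_param=2):
    """Smallest GPU count whose combined memory holds the weights.

    Uses decimal gigabytes (10**9 bytes) to match the lesson's figures.
    """
    total = weight_bytes(num_params, bytes_per_param)
    per_gpu = gpu_memory_gb * 10**9
    return -(-total // per_gpu)  # ceiling division on integers


models = {
    "GPT-3 (175B)": 175 * 10**9,
    "GPT-4 estimate (1.8T)": 1_800 * 10**9,
}
for name, params in models.items():
    gb = weight_bytes(params) / 10**9
    print(f"{name}: {gb:,.0f} GB of weights, at least {min_gpus_for_weights(params)} H100s")
```

Even as a lower bound, the result shows why a model at the estimated GPT-4 scale has to be spread across dozens of GPUs before a single request can be served.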

HBM (High Bandwidth Memory) closes much of the gap between what conventional DRAM can deliver and what GPU compute can consume. H100 uses HBM3 at 3.35 terabytes per second of bandwidth, compared to about 50 GB/s for a DDR5 DIMM. But as model size grows and sequence lengths in Transformer attention mechanisms increase (attention scales quadratically with sequence length), even HBM3 bandwidth becomes limiting. The H200 replaces HBM3 with HBM3e at 4.8 TB/s and 141GB capacity: an incremental improvement, but not a solution to the fundamental memory wall.

Technology	Primary Role	Bandwidth / Capacity	Key Limitation
HBM3 (H100)	On-package memory	3.35 TB/s, 80 GB	Capacity: 80 GB can't fit large models
HBM3e (H200)	On-package memory	4.8 TB/s, 141 GB	Still insufficient for 1T+ parameter models
NVLink 4.0	GPU-to-GPU (same node)	900 GB/s bidirectional	Limited to same physical chassis
InfiniBand NDR	Node-to-node networking	400 Gb/s per port	Lower bandwidth and higher latency than NVLink
NVLink Switch (2024)	Pod-scale unified memory	1.8 TB/s per GPU	Proprietary; requires full Nvidia stack
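To compare the rows of the table on one scale, the sketch below estimates how long it would take to move 350 GB (the 16-bit weights of a 175-billion-parameter model) across each link at its headline rate. It is an idealized calculation: it ignores latency, protocol overhead, and the fact that bidirectional figures count both directions, so real transfers are slower. Note the unit conversion, since network links are quoted in bits per second and memory in bytes per second.

```python
# Headline rates from the table, converted to bytes per second.
LINK_BYTES_PER_SECOND = {
    "HBM3 (H100)": 3.35e12,
    "HBM3e (H200)": 4.8e12,
    "NVLink 4.0": 900e9,
    "InfiniBand NDR": 400e9 / 8,  # 400 gigabits/s is 50 gigabytes/s
    "NVLink Switch (2024)": 1.8e12,
}


def transfer_seconds(num_bytes, bytes_per_second):
    """Ideal time to move num_bytes at a constant rate, with no overhead."""
    return num_bytes / bytes_per_second


payload = 350e9  # 175 billion parameters x 2 bytes
for name, rate in LINK_BYTES_PER_SECOND.items():
    print(f"{name:22s} {transfer_seconds(payload, rate):7.3f} s")
```

Under these idealized assumptions the same payload takes about a tenth of a second inside HBM and several seconds over a single InfiniBand port, which is the gap the rest of this lesson is about.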

The Race for Networking Independence

Amazon's response to Nvidia's InfiniBand dominance was EFA (Elastic Fabric Adapter), a custom network interface built by its Annapurna Labs subsidiary that provides RDMA-like performance for AWS EC2 clusters. EFA connects to standard Ethernet infrastructure rather than InfiniBand switches, reducing dependency on Mellanox hardware. AWS uses EFA for its Trainium clusters and offers it to EC2 customers as an alternative to InfiniBand.

Meta designed its own networking fabric for its AI Research SuperCluster (RSC), announced in January 2022. RSC uses 200 Gb/s InfiniBand between nodes but employs a custom storage network architecture designed to prevent networking bottlenecks from stalling GPU utilization. Meta published that RSC achieves 96% GPU utilization across 16,000 A100 GPUs — a remarkable figure that reflects both the networking design and years of distributed training software engineering.

What This Means for the Next Five Years

The trajectory is clear: AI model training will increasingly be constrained not by raw GPU compute but by the ability to move data between chips fast enough to keep compute cores busy. The winners in AI infrastructure will be those who solve memory capacity (more HBM per chip or near-memory compute), interconnect bandwidth (NVLink-class speeds at rack scale), and networking reliability at tens of thousands of GPUs. These are mechanical and electrical engineering problems as much as semiconductor problems — and they are why hardware companies, not just chip designers, are becoming central to AI progress.

RDMA (Remote Direct Memory Access): A protocol allowing one computer to directly read or write another computer's memory without involving the remote CPU. Critical for GPU training clusters: it allows gradient synchronization without adding CPU latency to every communication event. InfiniBand and RoCE (RDMA over Converged Ethernet) both implement RDMA.

Mixture of Experts (MoE): A model architecture where only a fraction of parameters (the active "experts") are used for any given token. GPT-4 is believed to use MoE, with a reported 1.8T total parameters but only ~200B active per forward pass. Inference cost scales with active parameters, not total, which makes MoE models cheaper to run than their total size suggests.

Tensor Parallelism: A training strategy that splits individual matrix operations (weight matrices) across multiple GPUs. Unlike data parallelism, tensor parallelism requires extremely low-latency communication inside every layer of every forward and backward pass, which makes NVLink or InfiniBand a hard requirement rather than a nice-to-have.
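
The tensor parallelism entry above can be illustrated with a single-process simulation. This is a minimal sketch, not how any particular framework implements it: the "GPUs" are list slices, the helper names are hypothetical, and the final concatenation stands in for the all-gather collective that would cross NVLink or InfiniBand on a real cluster.

```python
def matmul(a, b):
    """Plain matrix multiply on nested lists: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, num_shards):
    """Split weight matrix w column-wise into equal shards, one per simulated GPU."""
    n_cols = len(w[0])
    assert n_cols % num_shards == 0, "columns must divide evenly across shards"
    step = n_cols // num_shards
    return [[row[i:i + step] for row in w] for i in range(0, n_cols, step)]

def column_parallel_matmul(x, w, num_shards):
    """Each shard computes x @ w_shard; results are concatenated column-wise."""
    shards = split_columns(w, num_shards)
    # On real hardware each of these products runs on a different GPU.
    partials = [matmul(x, shard) for shard in shards]
    # Stand-in for the all-gather: this is the step that needs a fast interconnect.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    print(column_parallel_matmul(x, w, num_shards=2))
    print(matmul(x, w))  # same result computed on one device
```

The gather step happens once per sharded layer, so its latency is paid many times in every training step. That repeated cost is what separates tensor parallelism from data parallelism.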

Lesson 4 Quiz

The Networking and Memory Frontier — 4 questions

Why was Nvidia's 2020 acquisition of Mellanox for $6.9 billion considered strategically important for AI infrastructure?

Correct. Mellanox owned the dominant InfiniBand networking stack — the high-performance interconnect that large AI training clusters depend on for GPU-to-GPU communication. Controlling both the compute and the interconnect gave Nvidia extraordinary leverage over AI infrastructure design.

Not quite. Mellanox was the leading vendor of InfiniBand networking — the technology that enables fast, low-latency GPU-to-GPU communication in training clusters. Owning it gave Nvidia control over both the compute and the interconnect layer of AI infrastructure.

GPT-4 is estimated at ~1.8 trillion parameters in a mixture-of-experts architecture. Why does this not mean inference costs 9x as much as a 200B-parameter dense model?

Correct. Mixture-of-experts routes each token to only a subset of "expert" sub-networks. GPT-4 may have 1.8T total parameters but with MoE, only ~200B are active during any given forward pass — making inference cost comparable to a dense 200B model.

Not quite. The key is that MoE activates only a subset of parameters per token. Total parameters determine memory storage requirements; active parameters determine compute cost per inference. These scale independently in MoE architectures.
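
The split between storage cost and compute cost can be sketched numerically. The parameter counts are the unconfirmed GPT-4 estimates quoted in the question, the function name is hypothetical, and the figure of roughly 2 FLOPs per active parameter per token is a common rule of thumb that ignores attention overhead.

```python
def moe_profile(total_params, active_params, bytes_per_param=2):
    """Return (weight memory in GB, approximate forward FLOPs per token)."""
    # Every expert must be resident in memory, so storage follows total parameters.
    weight_memory_gb = total_params * bytes_per_param / 1e9
    # Only the routed experts do arithmetic, so compute follows active parameters.
    flops_per_token = 2 * active_params
    return weight_memory_gb, flops_per_token

if __name__ == "__main__":
    moe_mem, moe_flops = moe_profile(total_params=1.8e12, active_params=200e9)
    dense_mem, dense_flops = moe_profile(total_params=200e9, active_params=200e9)
    print(f"MoE:   {moe_mem:.0f} GB of weights, {moe_flops:.1e} FLOPs/token")
    print(f"Dense: {dense_mem:.0f} GB of weights, {dense_flops:.1e} FLOPs/token")
```

Under these assumptions the MoE model needs about nine times the weight memory of the dense 200B model but roughly the same compute per token.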

What problem does RDMA (Remote Direct Memory Access) solve in GPU training clusters?

Correct. Without RDMA, every gradient synchronization event between GPUs on different servers would require the remote CPU to handle the memory transfer — adding microseconds of latency that accumulate into serious throughput loss at scale. RDMA cuts the CPU out of the data path entirely.

Not quite. RDMA's value is bypassing the remote CPU: one machine can directly write to another machine's memory without interrupt or CPU involvement. At training scale, eliminating this CPU overhead on every gradient sync makes a significant throughput difference.

Meta's AI Research SuperCluster achieved 96% GPU utilization across 16,000 A100 GPUs. Why is this figure notable?

Correct. In large distributed training clusters, network latency, synchronization bubbles, and straggler nodes routinely push effective GPU utilization below 70–80%. Achieving 96% across 16,000 GPUs reflects both the custom networking fabric design and years of distributed training software optimization.

Not quite. GPU utilization in large clusters is typically well below 96% due to networking overhead, synchronization delays, and straggler effects. Meta's 96% figure reflects both custom network architecture and sophisticated distributed training software — not a common result.

Lab 4 — Cluster Architecture Design

Design the networking and memory strategy for a large model training cluster

Your Mission

You're the infrastructure lead at a research lab planning to train a 70-billion-parameter dense language model. Your compute budget covers 512 H100 SXM5 GPUs. You need to design the networking topology, decide on InfiniBand vs Ethernet, specify the parallelism strategy, and identify potential memory bottlenecks.

Have at least 3 exchanges. Ask about network topology choices (fat-tree vs dragonfly), which parallelism strategy fits your model size, and what the memory requirements look like across your cluster.

Suggested opener: "I have 512 H100 SXM5 GPUs to train a 70B parameter dense model. Each H100 has 80GB HBM3. Walk me through whether I have enough memory and what parallelism strategy makes sense."

AI Infrastructure Advisor

Cluster Architecture

Good, let's work through this systematically. First, the memory math: a 70B parameter model in BF16 (2 bytes per parameter) requires roughly 140 GB just for the weights. Adam stores a first and second moment for every parameter, typically in FP32 (4 bytes each, about 560 GB), which brings the total to roughly 700 GB, or about 840 GB once BF16 gradients are included. That is more than a single 8-GPU node can hold (8 × 80 GB = 640 GB), so each model replica has to be sharded across at least 11 H100s, or 16 if you allocate whole nodes, before counting activations. With 512 GPUs you still have substantial headroom. What sequence length and batch size are you targeting? That determines your activation memory requirements, which can actually exceed the weight memory at long contexts.
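
The memory arithmetic above can be reproduced with a short estimate. This is a rough sketch under stated assumptions: BF16 weights and gradients at 2 bytes per parameter, two Adam moments in FP32 at 4 bytes each, activations ignored, and an optional FP32 master copy of the weights that many mixed-precision setups also keep. The function and parameter names are illustrative.

```python
import math

def training_memory_gb(params_billion, weight_bytes=2, grad_bytes=2,
                       adam_moment_bytes=4, fp32_master_weights=False):
    """Rough per-replica training memory in GB, excluding activations."""
    bytes_per_param = weight_bytes + grad_bytes + 2 * adam_moment_bytes
    if fp32_master_weights:
        bytes_per_param += 4  # extra FP32 copy of the weights
    # (params_billion * 1e9 params) * bytes_per_param / (1e9 bytes per GB)
    return params_billion * bytes_per_param

def min_gpus(total_gb, gpu_memory_gb=80):
    """Smallest number of GPUs whose combined memory covers total_gb."""
    return math.ceil(total_gb / gpu_memory_gb)

if __name__ == "__main__":
    base = training_memory_gb(70)
    with_master = training_memory_gb(70, fp32_master_weights=True)
    print(base, min_gpus(base))
    print(with_master, min_gpus(with_master))
```

Neither case fits in one 8-GPU node (640 GB), so the model state has to be sharded across nodes whichever precision recipe is chosen.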

Module 5 Test

Hardware and Infrastructure Bets — 15 questions · 80% to pass

1. What was the primary supply bottleneck for Nvidia H100 GPUs in 2023, if not silicon yield?

Correct. CoWoS packaging at TSMC was the binding constraint throughout 2023.

The binding constraint was CoWoS advanced packaging capacity at TSMC — not wafer starts or memory production.

2. Google's first TPU (v1) entered production in what year, and was used for which internal workload?

Correct. TPU v1 went into production in 2015, used for RankBrain and Street View — inference-only workloads.

TPU v1 entered production in 2015, used for RankBrain (search ranking) and Street View. TPU v2 (2017) was the first to support training.

3. Nvidia's data center revenue grew approximately how much between Q1 2023 and Q1 2024?

Correct. $3.6B to $18.4B — roughly 5× in 12 months.

Nvidia data center revenue grew approximately 5×: from $3.6B in Q1 2023 to $18.4B in Q1 2024.

4. What is bfloat16's key advantage over standard FP16 for neural network training?

Correct. bfloat16 keeps 8 exponent bits (same as FP32), preserving the dynamic range critical for stable training.

bfloat16's advantage is numerical stability: same 8-bit exponent as FP32 prevents the gradient underflow that makes standard FP16 training unreliable.
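
The underflow difference can be demonstrated with Python's standard library. This is a sketch rather than a hardware-accurate model: FP16 is round-tripped through the `struct` module's half-precision format, while bfloat16 is simulated by truncating a float32 to its top 16 bits (real hardware usually rounds instead). The helper names are hypothetical.

```python
import struct

def to_fp16(x):
    """Round-trip through IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    """Simulate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

if __name__ == "__main__":
    tiny_gradient = 1e-8  # illustrative value below FP16's smallest subnormal
    print("fp16:", to_fp16(tiny_gradient))   # flushed to zero
    print("bf16:", to_bf16(tiny_gradient))   # survives with reduced precision
```

The bfloat16 value loses mantissa precision but keeps its magnitude, which is the trade-off the answer describes.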

5. What was the original business trigger that motivated Google to develop the TPU in 2013?

Correct. The 2013 inference cost projection for voice search was the trigger — not training, but inference at consumer scale.

The trigger was an inference cost calculation: if users spoke to voice search for just 3 minutes a day, Google would have needed to double its data center capacity using existing CPU/GPU hardware.

6. Microsoft's Azure Maia 100 chip, revealed in November 2023, was designed primarily for which purpose?

Correct. Maia 100 is an inference cost-reduction play for Copilot serving — Microsoft was explicit it does not replace Nvidia for general training.

Azure Maia 100 is an inference chip for running LLMs (specifically Copilot) in Azure data centers — not a training chip competing with Nvidia GPUs.

7. What was the significance of the Microsoft–Constellation Energy deal for Three Mile Island?

Correct. Constellation agreed to restart a previously closed nuclear reactor under a 20-year PPA — hyperscalers directly funding nuclear restarts to solve power supply problems.

The deal was a 20-year PPA in which Microsoft's commitment funds the restart of Three Mile Island Unit 1, which was shut down in 2019 for economic reasons and is being brought back specifically to supply AI data center power.

8. A data center has 3,000 H100 GPUs running at full load (700W each) and a PUE of 1.35. What is the total facility power draw?

Correct. 3,000 × 700 W = 2.1 MW of GPU power. GPUs are not the only IT load: host CPUs, networking, and storage add overhead before PUE is applied. Assuming that overhead adds roughly 35% (a 1.35× factor, which happens to match the PUE here), IT load comes to ~2.835 MW, and 2.835 MW × 1.35 PUE ≈ 3.83 MW total facility power.

3,000 × 700 W = 2.1 MW of GPU power. With an assumed server overhead of roughly 35% for CPUs, networking, and storage, IT load is closer to ~2.835 MW. Multiplied by a PUE of 1.35, that gives ~3.83 MW total facility power.
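
The arithmetic in this answer can be checked with a short sketch. The `it_overhead_factor` parameter is a hypothetical name introduced here: the question states only GPU power and PUE, and the ~2.835 MW IT load quoted in the answer corresponds to an overhead factor of 1.35.

```python
def facility_power_mw(gpu_count, gpu_watts, pue, it_overhead_factor=1.0):
    """Total facility power in MW.

    it_overhead_factor scales GPU power up to full IT load (CPUs, NICs, storage);
    PUE then scales IT load up to facility load (cooling, power conversion).
    """
    gpu_mw = gpu_count * gpu_watts / 1e6
    it_load_mw = gpu_mw * it_overhead_factor
    return it_load_mw * pue

if __name__ == "__main__":
    gpus_only = facility_power_mw(3000, 700, pue=1.35)
    with_overhead = facility_power_mw(3000, 700, pue=1.35, it_overhead_factor=1.35)
    print(f"GPUs only:            {gpus_only:.3f} MW")
    print(f"With server overhead: {with_overhead:.3f} MW")
```

Counting GPUs alone gives about 2.8 MW, and the assumed server overhead lifts the total to about 3.8 MW, so the overhead assumption matters as much as the PUE itself.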

9. Why did Amazon build EFA (Elastic Fabric Adapter) rather than simply using standard InfiniBand for its AI training clusters?

Correct. EFA provides RDMA-like performance over Ethernet, reducing AWS's dependence on Nvidia/Mellanox InfiniBand hardware — a strategic independence play.

EFA was built on standard Ethernet infrastructure to reduce dependency on Mellanox (now Nvidia) InfiniBand — while still delivering RDMA performance for GPU training workloads.

10. What is the main limitation of solar and wind power as primary energy sources for AI data centers?

Correct. Intermittency is the core problem: data centers need firm, dispatchable power 24/7, and battery storage at the required scale isn't economically viable yet.

The fundamental issue is intermittency. Data centers have flat 24/7 demand; solar and wind generate variably. Without viable grid-scale storage, renewable-only power doesn't work for always-on compute.

11. Meta's first-generation MTIA chip was optimized for which workload?

Correct. MTIA targets ranking/recommendation models — the highest-volume, most predictable workloads. Llama training still uses Nvidia GPUs.

MTIA was designed for ranking and recommendation — the workloads powering Facebook and Instagram feeds. Generative AI (Llama) training remains on Nvidia GPUs.

12. In mixture-of-experts (MoE) architecture, inference compute cost scales with which quantity?

Correct. MoE activates only a subset of parameters per token. Inference cost scales with active parameters, not total parameters — the key efficiency insight.

In MoE, only active experts participate in each forward pass. Inference compute scales with active parameters (e.g., ~200B for GPT-4), not the 1.8T total.

13. Why does tensor parallelism (splitting individual matrix operations across GPUs) require NVLink or InfiniBand rather than standard Ethernet?

Correct. Unlike data parallelism (one sync at the end of a batch), tensor parallelism requires GPU-to-GPU communication on every single layer of every forward pass — making low latency mandatory, not optional.

Tensor parallelism splits matrix operations within each layer, requiring inter-GPU communication on every pass through every layer. Standard Ethernet latency is paid at each of those synchronization points, so it accumulates across every layer of every training step.

14. Which hyperscaler's AI training infrastructure achieved 96% GPU utilization across 16,000 A100 GPUs, and what made this notable?

Correct. Meta's RSC achieved 96% GPU utilization across 16,000 A100s — exceptional given that networking overhead and straggler effects typically push large clusters to 70–80%.

It was Meta's AI Research SuperCluster (RSC). The 96% figure is notable because large distributed clusters typically achieve 70–80% GPU utilization due to networking overhead, synchronization bubbles, and straggler nodes.

15. Northern Virginia's Loudoun County data center corridor illustrates what structural constraint on AI infrastructure expansion?

Correct. Dominion Energy's substation connection moratoriums showed that electrical grid capacity — not capital, land, or permitting — became the binding constraint on AI infrastructure expansion in the world's densest data center market.

The constraint was utility grid capacity: Dominion Energy issued connection moratoriums at substations that were out of headroom. Capital, land, and permits were all available — electricity grid capacity was the bottleneck.