Lesson 1 · Module 2

The CUDA Bet

How a graphics chip programming language became the foundation of modern AI

Why did NVIDIA's 2006 gamble on CUDA end up giving it a decade-long moat?

In 2006, NVIDIA was a profitable graphics card company with no obvious reason to court scientists. Its GPUs rendered polygons for video games. Then a small internal team shipped CUDA — Compute Unified Device Architecture — a programming toolkit that let researchers write general-purpose code for GPU hardware. Revenue from this decision was, initially, zero.

What CUDA Actually Did

Before CUDA, running code on a GPU required translating everything into graphics primitives — a tortuous hack that only specialists attempted. CUDA gave developers a C-like language and direct access to the GPU's thousands of parallel processing cores. The first users were physicists simulating protein folding and fluid dynamics. AI researchers were a niche afterthought.

The pivotal moment came in 2012. Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton trained AlexNet on two NVIDIA GTX 580 GPUs. The network won the ImageNet competition by a margin so large (cutting the top-5 error rate from roughly 26% to 15.3%) that it ended the debate about whether deep learning was viable. AlexNet ran on CUDA. No alternative GPU programming platform at the time could have matched it.

Documented Milestone

AlexNet's 2012 ImageNet victory was trained on NVIDIA GeForce GTX 580 GPUs using CUDA. Training took roughly six days on two GPUs. On a CPU cluster of the era, the same training run would have taken weeks. The choice of hardware directly enabled the result.

The Moat That Compounds

CUDA's advantage is not just performance — it is installed ecosystem. By 2012, thousands of university labs had already spent years writing CUDA code, debugging CUDA kernels, and building CUDA-optimized libraries. When deep learning exploded, those labs simply kept using the tools they knew. Switching to a competitor's platform would have meant rewriting years of carefully tuned code.

NVIDIA reinforced this by publishing cuDNN in 2014 — a library of hand-optimized GPU primitives specifically for neural networks: convolutions, pooling, normalization. cuDNN made NVIDIA GPUs significantly faster for deep learning than raw CUDA code alone, and it was free. Frameworks like TensorFlow and PyTorch were built on top of cuDNN from the start. The stack deepened.

CUDA: NVIDIA's parallel computing platform and API, released 2006. Allows general-purpose programming on GPU hardware.

cuDNN: CUDA Deep Neural Network library, released 2014. Provides hand-tuned GPU kernels for common neural network operations, used by TensorFlow, PyTorch, and most major frameworks.

Kernel: In GPU computing, a function that runs in parallel across thousands of GPU cores simultaneously.
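To make the idea of a kernel concrete, here is a minimal sketch in plain Python that imitates the indexing pattern of a CUDA vector-add kernel. It is an emulation only: the `launch` helper and its argument names are hypothetical stand-ins for a real kernel launch, and the loops run one after another, whereas a GPU would run the threads in parallel.

```python
def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """Work done by ONE thread: add a single pair of elements."""
    # Same formula a CUDA kernel uses: blockIdx.x * blockDim.x + threadIdx.x
    i = block_idx * block_dim + thread_idx
    if i < len(out):  # guard, because the grid may be larger than the data
        out[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Hypothetical sequential stand-in for a GPU kernel launch."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


n = 10
a = list(range(n))
b = [10 * x for x in range(n)]
out = [0] * n

block_dim = 4                                  # threads per block
grid_dim = (n + block_dim - 1) // block_dim    # enough blocks to cover n elements
launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
print(out)
```

The point of the pattern is that every thread runs the same function and uses its own index to pick its share of the data, which is why the same code scales from ten elements to millions.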

Network Effects in Tooling

By 2023, CUDA had been downloaded over 40 million times. Every major cloud provider (AWS, Google Cloud, Microsoft Azure) offers NVIDIA GPU instances as a default AI compute option. When a researcher writes a PyTorch tutorial, it assumes CUDA. When a company deploys a model, that model has usually been trained on CUDA-enabled hardware. The ecosystem self-reinforces at every layer: tutorials, forums, job postings, pre-trained models, and benchmarks all assume the NVIDIA stack.

AMD's ROCm platform and Intel's oneAPI attempt to offer alternatives, but as of 2024 neither has closed the ecosystem gap. The software compatibility layer that maps CUDA calls to competitor hardware — HIP for AMD — works for many programs but introduces enough friction that most production workloads stay on NVIDIA.

Strategic Lens

CUDA represents a textbook case of a developer platform moat: the product (the GPU) gains value from the ecosystem built on top of it (CUDA libraries, frameworks, tutorials), and the ecosystem grows more valuable as more developers adopt it. Hardware performance matters, but ecosystem lock-in is the harder competitive barrier to overcome.

Key Takeaways

NVIDIA launched CUDA in 2006 with no immediate AI application in mind. AlexNet's 2012 ImageNet win — run on NVIDIA GPUs — demonstrated GPU acceleration for deep learning to the entire research community. cuDNN (2014) deepened the moat by providing framework-level optimizations. The result is a self-reinforcing ecosystem where switching costs compound over time, giving NVIDIA a structural advantage that pure hardware specifications cannot easily replicate.

Quiz · Lesson 1

The CUDA Bet

Three questions — select the best answer for each.

1. In what year did NVIDIA release CUDA?

Correct. CUDA launched in 2006, six years before AlexNet demonstrated its value for deep learning.

Not quite. CUDA was released in 2006. The long gap before its AI impact is part of what makes the story interesting — the bet paid off slowly, then all at once.

2. What made AlexNet's 2012 ImageNet result significant for GPU computing?

Correct. AlexNet's 15.3% error rate vs. the previous best of 26% was so decisive it validated the entire deep learning paradigm and demonstrated CUDA-based GPU training to the world.

Not quite. AlexNet's importance was its winning margin — cutting error rates from 26% to 15.3% — which validated deep learning and showed the research world that CUDA GPU training worked at scale.

3. What is cuDNN and when was it released?

Correct. cuDNN (2014) sits beneath major frameworks and provides optimized primitives for convolutions, pooling, and normalization — deepening NVIDIA's ecosystem moat.

Not quite. cuDNN is a free software library released in 2014 that provides hand-optimized GPU kernels for neural network operations. TensorFlow, PyTorch, and most frameworks depend on it.

Lab · Lesson 1

Interrogating the CUDA Moat

Discuss with the AI — 3 exchanges unlock completion.

Your Mission

The CUDA ecosystem moat is often called "unbreakable" — but is it? Ask the AI assistant to walk you through the specific mechanisms that keep developers locked in and what conditions might actually erode NVIDIA's software advantage.

Suggested opening: "Explain the switching costs that keep ML researchers on CUDA even when competitor hardware has comparable specs."

AI Lab Assistant

CUDA & Ecosystem

Welcome to the CUDA Moat lab. I can discuss CUDA's ecosystem lock-in mechanisms, the history of GPU computing for AI, switching costs for researchers and companies, and what it would actually take for AMD or Intel to erode NVIDIA's software dominance. What would you like to explore?

Lesson 2 · Module 2

Silicon Generations

From Volta to Hopper — how NVIDIA's chip roadmap locked in the AI industry

Why does NVIDIA release new GPU architectures every two years, and what does each generation actually add?

In 2017, NVIDIA's Volta architecture introduced the Tensor Core, a specialized processing unit designed exclusively for the matrix multiplications that dominate neural network training. A single V100 GPU contained 640 Tensor Cores. By NVIDIA's figures, peak deep learning training throughput jumped by up to 12× over the previous Pascal generation. This was not incremental progress.

The Architecture Timeline

2016

Pascal (P100)

First GPU with HBM2 memory. NVLink interconnect for multi-GPU. Widely adopted by cloud providers for early deep learning workloads.

2017

Volta (V100)

Introduced Tensor Cores — matrix-multiply units optimized for FP16 deep learning. 640 Tensor Cores per chip. Google, AWS, and Azure all made V100 their flagship AI instance.

2020

Ampere (A100)

3rd-gen Tensor Cores with TF32, BF16, INT8, INT4 support. 5× faster than V100 on certain workloads. NVLink 3.0. Became the standard for GPT-3 scale training.

2022

Hopper (H100)

Transformer Engine: dynamically switches between FP8 and FP16 on a per-layer basis. NVLink 4.0, 900 GB/s bandwidth. Up to 6× faster than A100 on transformer training. Became the standard for frontier training clusters from 2023 onward; the original ChatGPT (November 2022) predates wide H100 availability.

2024

Blackwell (B100/B200/GB200)

2nd-gen Transformer Engine. NVLink 5.0. GB200 NVL72 rack system: 72 GPUs with 130 TB/s total NVLink bandwidth. Targets trillion-parameter models and real-time inference at scale.

What Tensor Cores Actually Do

Neural network training consists overwhelmingly of one operation: matrix multiplication (GEMM — General Matrix Multiply). A forward pass through a transformer layer is mostly matrix multiply. A backward pass computing gradients is mostly matrix multiply. Tensor Cores are circuits hardwired to compute D = A×B + C in a single clock cycle for small matrix tiles, typically 4×4 or 16×16 depending on precision. Standard CUDA cores would need many cycles for the same result.
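The tile operation D = A×B + C can be sketched in plain Python. This is an illustrative model, not how the hardware is programmed: the function names are hypothetical, the tile size is fixed at 4×4, and each tile operation here takes many interpreter steps where a Tensor Core would perform it in hardware.

```python
TILE = 4  # tile edge length; the text cites 4x4 tiles for some precisions


def tile_mma(a, b, c):
    """Return D = A x B + C for a single TILE x TILE tile."""
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(TILE)) for j in range(TILE)]
        for i in range(TILE)
    ]


def get_tile(matrix, row, col):
    """Slice out the TILE x TILE block whose top-left corner is (row, col)."""
    return [r[col:col + TILE] for r in matrix[row:row + TILE]]


def tiled_matmul(A, B):
    """Multiply A by B using only tile-sized multiply-accumulate steps."""
    n, inner, m = len(A), len(B), len(B[0])
    if n % TILE or inner % TILE or m % TILE:
        raise ValueError("dimensions must be multiples of TILE in this sketch")
    C = [[0] * m for _ in range(n)]
    for row in range(0, n, TILE):
        for col in range(0, m, TILE):
            acc = [[0] * TILE for _ in range(TILE)]  # the running C tile
            for k in range(0, inner, TILE):
                acc = tile_mma(get_tile(A, row, k), get_tile(B, k, col), acc)
            for i in range(TILE):
                C[row + i][col:col + TILE] = acc[i]
    return C


A = [[i + j for j in range(8)] for i in range(8)]
B = [[(i * j) % 5 for j in range(8)] for i in range(8)]
C = tiled_matmul(A, B)
```

Feeding each tile result back in as the next C operand is what lets a large matrix product be built entirely out of small fixed-size operations.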

The Hopper H100 (SXM5) contains 528 Tensor Cores. Its peak FP8 tensor throughput is 3,958 TFLOPS with structured sparsity (roughly half that, about 1,979 TFLOPS, for dense matrices), far beyond the peak of any contemporary CPU. This isn't a software optimization; it's a physical circuit designed for one job.

Documented Case — GPT-3 Training

OpenAI's GPT-3 (175 billion parameters, 2020) was trained on approximately 10,000 NVIDIA V100 GPUs on Microsoft Azure. The estimated compute cost was ~$4.6 million at cloud GPU rates. This was the training run that convinced the industry that large-scale GPU clusters — not custom silicon — were the baseline for frontier model development.

Precision Formats and Why They Matter

Each generation expanded the number of precision formats Tensor Cores support. Precision refers to how many bits represent each number during computation. FP32 (32-bit float) is accurate but slow and memory-hungry. FP16 (16-bit) doubles throughput. BF16 (brain float 16, developed at Google Brain and supported on NVIDIA Tensor Cores starting with Ampere) preserves the dynamic range of FP32 while halving the storage, which improves training stability. FP8 (Hopper) doubles throughput again but requires careful numerical management.

Each new precision format requires new Tensor Core circuits and new software frameworks to exploit them. This means every generation creates fresh reasons for researchers to upgrade: not just raw speed, but access to new precision options that enable larger models to train on the same memory budget.
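The range difference between FP16 and BF16 can be seen with Python's standard `struct` module. This is a rough sketch under stated assumptions: `to_bf16` truncates the low 16 bits of an FP32 value, whereas real hardware normally rounds, and `to_fp16` maps Python's overflow error to infinity to mimic what a hardware conversion would produce.

```python
import struct

BITS = {"FP32": 32, "FP16": 16, "BF16": 16, "FP8": 8}  # storage per value


def to_fp16(x):
    """Round-trip x through IEEE half precision; overflow becomes infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:  # struct refuses values beyond the FP16 range
        return float("inf")


def to_bf16(x):
    """Approximate BF16 by keeping only the top 16 bits of the FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


big = 1e10                 # comfortably inside FP32 range, far outside FP16 range
fp16_big = to_fp16(big)    # overflows
bf16_big = to_bf16(big)    # survives, with reduced precision

print("FP16:", fp16_big)
print("BF16:", bf16_big)
```

BF16 keeps FP32's 8 exponent bits and gives up mantissa bits instead, so large values stay finite while the last few digits are lost.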

3,958

TFLOPS FP8

H100 SXM5 peak tensor throughput (with structured sparsity)

80 GB

HBM3 Memory

H100 — 3.35 TB/s bandwidth

900 GB/s

NVLink 4.0

GPU-to-GPU bandwidth in H100 SXM5 configurations

6×

Speedup vs. A100

H100 on transformer training workloads (NVIDIA benchmark)

The NVLink Advantage

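Training large models requires multiple GPUs communicating constantly, sharing gradient updates across every backward pass. Standard PCIe slots (the normal way GPUs connect to a server) top out at around 32 GB/s in each direction for a PCIe 4.0 x16 link. NVLink 4.0 in the H100 delivers 900 GB/s total bandwidth between GPUs, roughly 28× that figure. The comparison flatters NVLink somewhat, because 900 GB/s counts both directions; measured like for like against a PCIe 5.0 x16 link (about 128 GB/s in both directions combined), the gap is closer to 7×. Either way, for a model distributed across 8 GPUs, this bandwidth difference dramatically changes how much time is spent waiting for data vs. computing.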
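The bandwidth arithmetic can be checked in a few lines. This is a back-of-the-envelope sketch: the two bandwidth figures are the ones quoted in the text, the 10 GB payload is a hypothetical example, and the calculation assumes peak bandwidth with no latency or protocol overhead, which real transfers never achieve.

```python
NVLINK4_TOTAL_GB_S = 900.0   # from the text: total NVLink 4.0 bandwidth on H100
PCIE_ONE_WAY_GB_S = 32.0     # from the text: PCIe, in each direction


def transfer_seconds(payload_gb, bandwidth_gb_s):
    """Idealized transfer time: payload divided by peak bandwidth."""
    return payload_gb / bandwidth_gb_s


ratio = NVLINK4_TOTAL_GB_S / PCIE_ONE_WAY_GB_S

payload_gb = 10.0  # hypothetical amount of gradient data exchanged per step
pcie_s = transfer_seconds(payload_gb, PCIE_ONE_WAY_GB_S)
nvlink_s = transfer_seconds(payload_gb, NVLINK4_TOTAL_GB_S)

print(f"bandwidth ratio: {ratio:.1f}x")
print(f"PCIe:   {pcie_s * 1000:.1f} ms per exchange")
print(f"NVLink: {nvlink_s * 1000:.1f} ms per exchange")
```

Because the exchange repeats on every training step, even a few hundred milliseconds saved per step adds up over a run of many thousands of steps.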

Competitors cannot easily replicate NVLink. It requires NVIDIA-proprietary silicon on both the GPU and a custom switch chip (NVSwitch). AMD's GPU interconnect (Infinity Fabric) and Intel's Gaudi interconnects exist but have lower bandwidth and smaller deployment ecosystems as of 2024.

Key Insight

Each NVIDIA architecture generation is not just a faster version of the last — it introduces new computing primitives (Tensor Cores, Transformer Engine), new precision formats, and new interconnect capabilities. This means every generation creates new software features that only NVIDIA hardware can run, resetting the ecosystem advantage with each release cycle.

Quiz · Lesson 2

Silicon Generations

Three questions — select the best answer for each.

1. What did NVIDIA's Volta architecture (2017) introduce that made it significantly faster for deep learning?

Correct. Volta's Tensor Cores (640 per V100) were the key innovation — specialized circuits for the matrix multiplications that dominate neural network training.

Not quite. Volta's headline feature was Tensor Cores: hardwired matrix-multiply units. HBM2 and NVLink both first appeared in Pascal, and the Transformer Engine came with Hopper.

2. What precision format did the Hopper H100 introduce that doubled throughput vs. BF16?

Correct. Hopper introduced FP8 (8-bit floating point) via the Transformer Engine, which dynamically switches between FP8 and FP16 within layers, roughly doubling throughput vs. BF16.

Not quite. Hopper's headline precision innovation was FP8, managed by the Transformer Engine. TF32 came with Ampere; INT4 also appeared in Ampere; FP16 has been available since Volta.

3. Approximately how much faster is NVLink 4.0 (H100) vs. PCIe for GPU-to-GPU bandwidth?

Correct. NVLink 4.0 delivers 900 GB/s total bandwidth vs. ~32 GB/s for PCIe — roughly 28× faster — which is critical for large model training that requires constant gradient sharing across GPUs.

Not quite. NVLink 4.0 in the H100 provides 900 GB/s, vs. ~32 GB/s for standard PCIe — that's approximately 28× faster. This bandwidth gap is a major reason multi-GPU training uses NVLink-connected systems.

Lab · Lesson 2

Benchmarking the Generations

Discuss with the AI — 3 exchanges unlock completion.

Your Mission

Each NVIDIA architecture generation introduced new primitives that reshaped what models researchers could build. Explore with the AI: which generation was the most consequential inflection point, and why does Blackwell represent a qualitatively different kind of product than Hopper?

Suggested opening: "Which NVIDIA architecture generation had the biggest real-world impact on AI capabilities, and what makes Blackwell different from what came before?"

AI Lab Assistant

GPU Architectures

Ready to dig into GPU architecture generations. I can discuss the specific innovations of Pascal, Volta, Ampere, Hopper, and Blackwell — including Tensor Cores, precision formats, NVLink bandwidth, and how each generation changed what models researchers could realistically train. What would you like to explore?

Lesson 3 · Module 2

Ninety Percent

How NVIDIA came to own roughly 90% of AI training compute — and what that actually means

What does market concentration at 90% look like in practice, and who are the real challengers?

In its fiscal year 2024, NVIDIA reported data center revenue of $47.5 billion — up from $15 billion the prior year. The quarter ending January 2024 alone produced $18.4 billion in data center revenue. For context: the entire global market for AI chips was estimated at roughly $50–60 billion in 2023. NVIDIA was capturing the majority of it.

What 80–90% Market Share Means

Analyst estimates of NVIDIA's share of the AI accelerator market vary — ranging from 70% to over 90% depending on the segment measured. The highest figures apply to AI training accelerators specifically, where NVIDIA's H100 and A100 have been the dominant products since 2021. In the broader "data center GPU" category including inference, AMD has a more significant (though still minority) presence.

The practical effect of this concentration: nearly every major AI lab (OpenAI, Anthropic, Meta AI, Mistral) depends on NVIDIA hardware for its primary training clusters. Microsoft's Azure AI supercomputers that power OpenAI's models run on tens of thousands of A100 and H100 GPUs. Meta's infrastructure for training Llama 3 used ~49,000 H100 GPUs. Google, including Google DeepMind, is the notable exception, having built its own TPUs (discussed in Lesson 4).

$47.5B

Data Center Revenue FY2024

Up from $15B in FY2023

~80%

AI Accelerator Market Share

Conservative analyst estimate, 2023–2024

49,000

H100 GPUs

Used by Meta for Llama 3 training cluster

$30,000–$40,000

H100 Price

Per GPU, as reported in 2023; spot market prices reached $40K–$60K during the shortage

The 2023 GPU Shortage

Following the launch of ChatGPT in November 2022, demand for NVIDIA H100 GPUs, which had just begun shipping, surged far beyond supply. Lead times for H100 orders stretched to 6–12 months at major cloud providers. On spot markets, individual H100 GPUs sold for $40,000–$60,000 each, compared to widely reported prices of around $30,000–$40,000 through official channels.

This shortage had concrete strategic consequences. Startups that could not secure GPU allocations were forced to either delay training runs, use slower A100s, or rent capacity from cloud providers at premium rates. Established players — Microsoft, Google, Amazon, Meta — had secured large H100 orders months in advance and gained significant competitive advantage simply from hardware access. The GPU became a strategic resource, not a commodity input.

Documented Case — CoreWeave

CoreWeave, a cloud infrastructure startup, bet early on GPU compute and acquired tens of thousands of NVIDIA GPUs before the 2023 shortage. By 2024, CoreWeave had secured a $1.1 billion funding round, an estimated $19 billion valuation, and a $1.6 billion contract with Microsoft — largely because it had GPU inventory that others didn't. The company's strategic value was primarily its secured hardware allocation.

AMD's Actual Position

AMD's MI300X accelerator — launched in late 2023 — is the most credible GPU competitor to NVIDIA's H100. It features 192 GB of HBM3 memory, compared to H100's 80 GB, which makes it better suited for inference on very large models (the 192 GB can hold a 70B-parameter model in FP16 without splitting across chips). Microsoft, Meta, and Oracle have all publicly committed to deploying MI300X at scale.
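The memory claim is simple arithmetic, sketched below. The sketch counts model weights only; it ignores the KV cache, activations, and framework overhead, so real deployments need headroom beyond these numbers. The function names are illustrative, not part of any library.

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "FP8": 1}


def weights_gb(n_params, fmt):
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9


def fits(n_params, fmt, capacity_gb):
    """True if the weights alone fit in a single accelerator's memory."""
    return weights_gb(n_params, fmt) <= capacity_gb


n_params = 70_000_000_000  # a 70B-parameter model, as in the text

for name, capacity in [("MI300X", 192), ("H100", 80)]:
    size = weights_gb(n_params, "FP16")
    verdict = "fits" if fits(n_params, "FP16", capacity) else "needs splitting"
    print(f"{name} ({capacity} GB): {size:.0f} GB of FP16 weights, {verdict}")
```

At 2 bytes per parameter the weights come to 140 GB, which fits in 192 GB but not in 80 GB; dropping to FP8 halves that to 70 GB, which is one reason lower-precision formats matter for inference.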

However, AMD's challenge is software, not silicon. ROCm — AMD's equivalent of CUDA — has improved significantly but still lacks the maturity of CUDA's library ecosystem. Many researchers report that getting models to run optimally on MI300X requires more manual optimization than on NVIDIA hardware. PyTorch supports ROCm but the integration has historically been less reliable for cutting-edge features.

MI300X: AMD's flagship AI accelerator (2023), featuring 192 GB HBM3 memory, larger than H100's 80 GB. Positioned for large-model inference. Deployed by Microsoft, Meta, and Oracle.

ROCm: AMD's open-source GPU computing platform, analogous to CUDA. Supports Python ML frameworks but has a smaller library ecosystem and historically less seamless framework support.

Supply Chain Concentration

NVIDIA designs its chips but does not fabricate them — all production runs through TSMC (Taiwan Semiconductor Manufacturing Company). NVIDIA's Hopper H100 uses TSMC's 4nm process node. TSMC manufactures roughly 90% of the world's most advanced semiconductors. This creates a concentrated supply chain: NVIDIA's ability to ship GPUs depends entirely on TSMC's capacity and geopolitical stability in Taiwan.

In 2024, NVIDIA was reported to be among TSMC's largest customers by revenue. TSMC's 3nm and 2nm capacity, needed for future NVIDIA products, is being expanded partly to meet NVIDIA demand. The US-China trade restrictions on advanced chip exports (NVIDIA created reduced-specification A800 and H800 variants for China to comply with the export controls) further complicate the supply picture, with China representing a significant market that NVIDIA cannot fully serve with its best products.

Strategic Lens

NVIDIA's dominance is not just market share — it is control over a bottleneck resource. When GPUs become the limiting factor for AI development, whoever controls GPU supply exerts leverage over the pace of the entire industry. This is why governments, cloud providers, and AI labs treat GPU allocation as a strategic priority, not a procurement decision.

Quiz · Lesson 3

Ninety Percent

Three questions — select the best answer for each.

1. What was NVIDIA's data center revenue in fiscal year 2024?

Correct. NVIDIA's data center revenue was $47.5 billion in FY2024, up from $15 billion the prior year — a roughly 3× increase driven by AI training demand.

Not quite. NVIDIA's data center revenue reached $47.5 billion in FY2024 (ending January 2024), up from $15 billion in FY2023 — one of the most dramatic single-year revenue surges in semiconductor history.

2. What is AMD's MI300X's key hardware advantage over NVIDIA's H100?

Correct. The MI300X's 192 GB HBM3 memory is its headline hardware advantage — it can hold larger models in-memory for inference without multi-chip splitting, which H100's 80 GB cannot match on its own.

Not quite. The MI300X's key hardware differentiator is 192 GB of HBM3 memory, vs. H100's 80 GB. AMD's software ecosystem (ROCm) is actually less mature than CUDA — that's a disadvantage, not an advantage.

3. Which company fabricates NVIDIA's H100 GPUs, and what process node does it use?

Correct. NVIDIA's H100 is manufactured by TSMC on its 4nm process node. NVIDIA is a fabless chip designer — it designs the chips but outsources all fabrication to TSMC.

Not quite. The H100 is fabricated by TSMC on a 4nm process. NVIDIA is fabless — it designs chips but does not own fabs. TSMC's Taiwan-based manufacturing capacity is a key strategic dependency.

Lab · Lesson 3

The GPU Shortage Playbook

Discuss with the AI — 3 exchanges unlock completion.

Your Mission

The 2023 GPU shortage turned hardware access into a strategic weapon. Explore with the AI: if you were advising an AI startup in early 2023, what strategies could you recommend to navigate the shortage — and how does the CoreWeave model represent a new kind of AI infrastructure company?

Suggested opening: "What strategic options did AI startups have during the 2023 GPU shortage when they couldn't get H100 allocations?"

AI Lab Assistant

Market Dynamics

Ready to analyze the GPU shortage and market concentration. I can discuss strategies for navigating GPU scarcity, the CoreWeave model of GPU-as-infrastructure, AMD's competitive position, TSMC supply chain risks, and how hardware access shapes AI startup strategy. What would you like to dig into?

Lesson 4 · Module 2

The Challengers

Google's TPUs, custom silicon at the hyperscalers, and the startups trying to break NVIDIA's hold

Can any challenger realistically displace NVIDIA — and what would it actually take?

In 2016, Google revealed its Tensor Processing Unit, a custom ASIC designed entirely for one thing: running TensorFlow operations efficiently. Google had been quietly running TPUs internally since 2015, using them to power search ranking, Street View image processing, and AlphaGo's match play. The announcement showed that Google had been living in the future for over a year.

Google's TPU Strategy

Google's TPU (Tensor Processing Unit) is the most mature and widely deployed custom AI accelerator outside NVIDIA. As of 2024, Google is on its sixth generation — Trillium (TPU v6), announced in 2024. Each generation is designed to run Google's JAX and TensorFlow frameworks, and TPUs power virtually all of Google's own AI products: Search, Translate, Gmail smart compose, Bard/Gemini, and YouTube recommendations.

The TPU's architecture differs fundamentally from a GPU. It uses a systolic array — a grid of multiply-accumulate units that pass data between neighbors without the memory bus traffic that GPUs require. This makes TPUs extremely efficient for the matrix multiplications in neural networks, but less flexible for other compute tasks. You cannot run a video game on a TPU. You can run a transformer very efficiently.
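A toy simulation shows how data moves through a systolic array. This is a simplified, output-stationary sketch written for illustration; it is not a description of Google's actual TPU design, and all names are hypothetical. Each cell holds one running sum, receives a value from its left neighbor and one from the neighbor above, multiplies them, and passes both along on the next cycle.

```python
def systolic_matmul(A, B):
    """Compute A x B on a simulated grid of multiply-accumulate cells."""
    n, inner, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per cell
    a_reg = [[0] * m for _ in range(n)]  # A values flowing left to right
    b_reg = [[0] * m for _ in range(n)]  # B values flowing top to bottom

    for t in range(n + m + inner - 2):   # cycles until the last cell finishes
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:
                    # Left edge: row i of A enters, delayed by i cycles.
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < inner else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]  # taken from left neighbor
                if i == 0:
                    # Top edge: column j of B enters, delayed by j cycles.
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < inner else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]  # taken from the cell above
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


result = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)
```

Inputs are read from memory only at the edges of the grid; everywhere else, cells reuse values handed over by their neighbors, which is the source of the reduced memory traffic described above.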

Documented Case — AlphaGo

DeepMind's AlphaGo, which defeated world Go champion Lee Sedol in March 2016, ran on Google TPUs during the match. Evaluating its neural networks for every candidate move is exactly the kind of repetitive matrix computation TPUs accelerate. AlphaGo's victory was the first public evidence that Google's custom silicon was viable for frontier AI.

The Hyperscaler Custom Silicon Wave

Following Google's TPU disclosure, every major cloud provider developed its own AI accelerator:

2015

Google TPU v1

Inference-only ASIC. 92 TOPS. Used internally for Search and AlphaGo. Not customer-accessible.

2017–2018

Google TPU v2 / v3

Training-capable. Available via Google Cloud. Powers BERT training. 180 / 420 TFLOPS respectively.

2019

AWS Inferentia

Amazon's inference-optimized ASIC. Cost per inference lower than equivalent GPU. Powers AWS's own internal workloads including Alexa.

2020

AWS Trainium

Training-optimized. Claimed 50% cost savings vs. GPU instances for certain workloads. Amazon's investment of up to $4B in Anthropic, announced in 2023, included a commitment by Anthropic to use Trainium.

2023

Microsoft Maia 100

Azure's first custom AI accelerator. Designed for OpenAI workloads. Internal use at launch, customer availability roadmap announced.

2024

Google Trillium (TPU v6)

4.7× compute vs. TPU v5e. Powers Gemini training and inference. Not sold as hardware — Google Cloud access only.

Startup Challengers: Cerebras, Groq, SambaNova

Cerebras Systems took a radical approach: instead of fitting a chip on a die, they used the entire 300mm silicon wafer as one chip. The Wafer Scale Engine 3 (WSE-3, 2024) contains 4 trillion transistors and 900,000 AI-optimized cores. For models that fit within its enormous on-chip memory, inference is extraordinarily fast. Cerebras reported inference speeds of over 1,000 tokens per second for a 70B-parameter model, far exceeding GPU-based systems. The constraint: wafer-scale systems are extremely expensive and have a far shorter track record in the distributed multi-system training that frontier models require.

Groq built a Language Processing Unit (LPU) optimized for deterministic, low-latency inference. By 2024, Groq's cloud inference API was demonstrating Llama 3 70B inference at over 800 tokens/second — catching significant attention from developers building latency-sensitive applications. Groq's chip uses a streaming dataflow architecture that avoids the memory bandwidth bottlenecks that limit GPU inference speed.

SambaNova uses a reconfigurable dataflow architecture — the chip's compute graph is programmable at the circuit level, allowing efficient mapping of different model architectures. SambaNova has sold systems to national labs including Argonne and Lawrence Livermore for scientific AI workloads.

Documented Case — Anthropic + AWS Trainium

In September 2023, Amazon announced an investment of up to $4 billion in Anthropic. As part of the deal, Anthropic named AWS its primary cloud provider and committed to using AWS Trainium chips for training future Claude models. This was a direct signal that a leading AI lab was willing to invest in non-NVIDIA training hardware, though Anthropic also confirmed continued use of NVIDIA GPUs. The deal represents a hedge, not a replacement.

What Would It Actually Take?

To displace NVIDIA in AI training, a challenger would need: (1) hardware at comparable or better performance-per-dollar; (2) a software ecosystem with PyTorch-level compatibility and library depth; (3) supply at scale (tens of thousands of chips per customer); and (4) a track record across diverse model types. No current challenger satisfies all four simultaneously.

The most realistic displacement scenario is workload segmentation: NVIDIA retains the cutting-edge training market, while inference and specific workloads migrate to cheaper, purpose-built alternatives. Google's TPUs already represent this for Google internally. AWS Inferentia has gained traction for production inference workloads. The question is whether any challenger can build enough ecosystem to compete for the training market where NVIDIA's lock-in is deepest.

Looking Ahead

The pattern across all challengers — Google, AWS, Microsoft, Cerebras, Groq — is task specialization. No one is building a general GPU replacement. Each challenger finds a specific workload (inference, scientific simulation, low-latency serving) where it can beat NVIDIA on cost or latency, then expands from that beachhead. NVIDIA's response has been to make its own GPUs better at those specific tasks with each generation — a moving target competitors have to continually chase.

Quiz · Lesson 4

The Challengers

Three questions — select the best answer for each.

1. What architectural feature makes Google's TPU fundamentally different from a GPU?

Correct. The systolic array makes TPUs extremely efficient for the matrix multiplications in neural networks by passing data between adjacent compute units rather than repeatedly accessing shared memory.

Not quite. TPUs use a systolic array architecture — a grid of multiply-accumulate units that pass results directly between neighbors. This eliminates much of the memory bus traffic that limits GPU efficiency on pure matrix-multiply workloads.

2. What is Cerebras's radical approach to chip design?

Correct. Cerebras's Wafer Scale Engine uses the entire silicon wafer as one massive chip. WSE-3 has 4 trillion transistors and 900,000 AI cores — delivering extreme inference speed for models that fit in its on-chip memory.

Not quite. Cerebras's key innovation is using the entire 300mm wafer as one chip — the Wafer Scale Engine. With 4 trillion transistors in WSE-3, it avoids chip-to-chip communication entirely for workloads that fit within its memory.

3. Which of the following best describes the realistic path for NVIDIA challengers to gain market share?

Correct. Workload segmentation is the realistic challenger strategy: win a specific use case where specialization provides an advantage, establish a customer base and ecosystem there, then potentially expand. Google's TPUs, Groq, Cerebras, and AWS Inferentia all follow this pattern.

Not quite. The realistic path is workload segmentation — finding specific tasks (inference latency, cost per token, scientific computation) where a specialized chip beats NVIDIA, then building from that beachhead. A full general replacement requires matching CUDA's ecosystem depth, which no current challenger has done.

Lab · Lesson 4

Challenger Strategy Workshop

Discuss with the AI — 3 exchanges unlock completion.

Your Mission

You're advising a well-funded AI chip startup in 2025. You've seen Google's TPU strategy, Groq's inference speed play, Cerebras's wafer-scale bet. The question: which segment of the AI hardware market has the most realistic path to displacing NVIDIA, and what does the winning strategy actually look like?

Suggested opening: "If I'm a new AI chip startup in 2025 with $500M in funding, which market segment offers the best chance to establish a durable position against NVIDIA — and why?"

AI Lab Assistant

Competitive Strategy

Welcome to the Challenger Strategy lab. I can discuss Google TPU's approach, Groq's LPU inference speed play, Cerebras's wafer-scale bet, AWS's Trainium/Inferentia ecosystem, startup positioning in AI hardware, and the conditions under which any challenger could realistically build a durable business competing with NVIDIA. Where would you like to start?

Module Test · Module 2

NVIDIA's Dominance

15 questions — score 80% or above to pass the module.

1. CUDA was released by NVIDIA in what year?

Correct — CUDA launched in 2006.

CUDA was released in 2006, six years before AlexNet validated its importance for deep learning.

2. What did AlexNet achieve at the 2012 ImageNet competition?

Correct — AlexNet's margin of victory was what made the result definitive.

AlexNet reduced ImageNet error from 26% to 15.3% — a decisive margin that ended skepticism about deep learning.

3. What is cuDNN?

Correct — cuDNN provides the layer of hand-tuned primitives that frameworks build on.

cuDNN (2014) is NVIDIA's free library of GPU-optimized neural network operations used by PyTorch, TensorFlow, and most major frameworks.

4. Which NVIDIA architecture first introduced Tensor Cores?

Correct — Volta (2017) introduced Tensor Cores with the V100.

Tensor Cores debuted in the Volta architecture (2017), with 640 per V100 GPU.

5. What is the Transformer Engine in NVIDIA's Hopper architecture?

Correct — the Transformer Engine is Hopper's key innovation for accelerating transformer-based models.

The Transformer Engine is an H100 hardware feature that dynamically switches between FP8 and FP16 precision to maximize throughput on transformer layers.

6. How many H100 GPUs did Meta use for its Llama 3 training cluster?

Correct — Meta's Llama 3 training used approximately 49,000 H100 GPUs.

Meta used approximately 49,000 H100 GPUs for its Llama 3 training cluster — illustrating the scale of GPU demand at leading AI labs.

7. What was NVIDIA's data center revenue in fiscal year 2024?

Correct — $47.5B in FY2024, up from $15B the prior year.

NVIDIA's data center revenue was $47.5 billion in FY2024 — up ~3× from $15B in FY2023.

8. What is AMD's MI300X's headline hardware advantage over the H100?

Correct — MI300X's 192 GB memory is its key differentiator, especially for inference on large models.

The MI300X's key advantage is 192 GB of HBM3 memory — more than double the H100's 80 GB — which enables larger models to run in-memory without splitting across chips.

9. Who fabricates NVIDIA's H100 GPUs?

Correct — NVIDIA is fabless. All H100 production runs through TSMC's 4nm process.

NVIDIA is a fabless chipmaker. H100 GPUs are manufactured by TSMC in Taiwan using the 4nm process node.

10. What architectural feature makes Google's TPU efficient for neural networks?

Correct — the systolic array is the core architectural innovation that makes TPUs efficient for matrix multiplication-heavy workloads.

TPUs use a systolic array — a grid of multiply-accumulate units that pass results to neighbors rather than going through shared memory, making them very efficient for neural network matrix multiplications.

11. What was CoreWeave's primary strategic asset that drove its $19 billion valuation?

Correct — CoreWeave's value was largely its GPU inventory secured before the shortage made H100s scarce and expensive.

CoreWeave's $19B valuation was driven primarily by its pre-shortage NVIDIA GPU inventory — hardware scarcity turned physical GPU stockpiles into strategic assets.

12. What is Groq's Language Processing Unit optimized for?

Correct — Groq's LPU is a streaming dataflow architecture designed specifically for fast, predictable inference throughput.

Groq's LPU is optimized for fast, low-latency inference. By 2024 it was demonstrating Llama 3 70B inference at over 800 tokens/second — a latency advantage over GPU-based inference.

13. NVLink 4.0 in the H100 provides approximately what bandwidth advantage over PCIe?

Correct — NVLink 4.0 at 900 GB/s vs. PCIe's ~32 GB/s is approximately 28× faster, critical for distributed model training.

NVLink 4.0 delivers 900 GB/s vs. ~32 GB/s for PCIe — roughly 28×. This bandwidth gap is essential for training large models distributed across many GPUs.

14. Cerebras's Wafer Scale Engine uses what unusual approach to chip design?

Correct — Cerebras treats the entire wafer as one chip, eliminating the inter-die communication that limits traditional multi-chip approaches.

Cerebras's WSE-3 uses the entire 300mm silicon wafer as a single chip — 4 trillion transistors, 900,000 AI cores. Expensive, but extremely fast for workloads that fit in its on-chip memory.

15. Which of the following best describes the realistic challenger strategy against NVIDIA's AI training dominance?

Correct: workload segmentation is the viable challenger path. No competitor has yet succeeded with a head-on, general-purpose replacement for NVIDIA.

The realistic challenger strategy is workload segmentation: find a specific use case where specialization wins (Google TPU for Google's workloads, Groq for inference latency, Cerebras for certain model sizes), establish a position, and expand. Full general replacement requires matching CUDA's ecosystem depth, which no challenger has done.