In 2006, NVIDIA was a profitable graphics card company with no obvious reason to court scientists. Its GPUs rendered polygons for video games. Then a small internal team shipped CUDA — Compute Unified Device Architecture — a programming toolkit that let researchers write general-purpose code for GPU hardware. Revenue from this decision was, initially, zero.
Before CUDA, running code on a GPU required translating everything into graphics primitives — a tortuous hack that only specialists attempted. CUDA gave developers a C-like language and direct access to the GPU's thousands of parallel processing cores. The first users were physicists simulating protein folding and fluid dynamics. AI researchers were a niche afterthought.
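As a rough illustration of the programming model described above, the sketch below emulates in plain Python how a CUDA-style kernel assigns one array element to each thread. It is not real GPU code: `saxpy_kernel` and `launch` are hypothetical names chosen for this example, and the loops run sequentially where a GPU would run the iterations in parallel.

```python
# Plain-Python emulation of CUDA's thread-indexing model (not real GPU code).
# saxpy_kernel stands in for a GPU kernel function; launch() is a hypothetical
# stand-in for a kernel launch over a grid of thread blocks.

def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    # Each "thread" derives one global index from its block and thread ids.
    i = block_idx * block_dim + thread_idx
    if i < len(x):  # guard: the last block may extend past the end of the data
        out[i] = a * x[i] + y[i]


def launch(kernel, grid_dim, block_dim, *args):
    # A GPU would execute these iterations concurrently; here they run in order.
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # ceiling division
launch(saxpy_kernel, grid_dim, block_dim, 2.0, x, y, out)
print(out)
```

The point of the design is that the kernel body contains no loop over the data: the same few lines are written once and applied at every index, which is what lets the hardware spread the work across thousands of cores.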
The pivotal moment came in 2012. Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton trained AlexNet on two NVIDIA GTX 580 GPUs. The network won the ImageNet competition by a margin so large (a top-5 error rate of 15.3%, against roughly 26% for the runner-up) that it ended the debate about whether deep learning was viable. AlexNet ran on CUDA. There was no alternative framework at the time that could have matched it.
AlexNet, the winning entry in the 2012 ImageNet competition, was trained on NVIDIA GeForce GTX 580 GPUs using CUDA. Training took roughly six days on two GPUs. On a CPU cluster of the era, the same training run would have taken weeks. The choice of hardware directly enabled the result.
CUDA's advantage is not just performance — it is installed ecosystem. By 2012, thousands of university labs had already spent years writing CUDA code, debugging CUDA kernels, and building CUDA-optimized libraries. When deep learning exploded, those labs simply kept using the tools they knew. Switching to a competitor's platform would have meant rewriting years of carefully tuned code.
NVIDIA reinforced this by publishing cuDNN in 2014 — a library of hand-optimized GPU primitives specifically for neural networks: convolutions, pooling, normalization. cuDNN made NVIDIA GPUs significantly faster for deep learning than raw CUDA code alone, and it was free. Frameworks like TensorFlow and PyTorch were built on top of cuDNN from the start. The stack deepened.
By 2023, CUDA had been downloaded over 40 million times. Every major cloud provider (AWS, Google Cloud, Microsoft Azure) offers NVIDIA GPU instances as its default AI compute option. When a researcher writes a PyTorch tutorial, it assumes CUDA. When a company deploys a model, it trains on CUDA-enabled hardware. The ecosystem self-reinforces at every layer: tutorials, forums, job postings, pre-trained models, and benchmarks all assume the NVIDIA stack.
AMD's ROCm platform and Intel's oneAPI attempt to offer alternatives, but as of 2024 neither has closed the ecosystem gap. The software compatibility layer that maps CUDA calls to competitor hardware — HIP for AMD — works for many programs but introduces enough friction that most production workloads stay on NVIDIA.
CUDA represents a textbook case of a developer platform moat: the product (the GPU) gains value from the ecosystem built on top of it (CUDA libraries, frameworks, tutorials), and the ecosystem grows more valuable as more developers adopt it. Hardware performance matters, but ecosystem lock-in is the harder competitive barrier to overcome.
NVIDIA launched CUDA in 2006 with no immediate AI application in mind. AlexNet's 2012 ImageNet win — run on NVIDIA GPUs — demonstrated GPU acceleration for deep learning to the entire research community. cuDNN (2014) deepened the moat by providing framework-level optimizations. The result is a self-reinforcing ecosystem where switching costs compound over time, giving NVIDIA a structural advantage that pure hardware specifications cannot easily replicate.
The CUDA ecosystem moat is often called "unbreakable" — but is it? Ask the AI assistant to walk you through the specific mechanisms that keep developers locked in and what conditions might actually erode NVIDIA's software advantage.
In 2017, NVIDIA's Volta architecture introduced the Tensor Core, a specialized processing unit designed exclusively for the matrix multiplications that dominate neural network training. A single V100 GPU contained 640 Tensor Cores. By NVIDIA's figures, peak training throughput for deep learning workloads jumped by up to 12× over the previous Pascal generation. This was not incremental progress.
Neural network training consists overwhelmingly of one operation: matrix multiplication (GEMM, General Matrix Multiply). A forward pass through a transformer layer is mostly matrix multiply. A backward pass computing gradients is mostly matrix multiply. Tensor Cores are circuits hardwired to compute the fused multiply-accumulate D = A×B + C on small matrix tiles: 4×4 tiles per clock cycle in the original Volta design, which software typically addresses as larger tiles such as 16×16. Standard CUDA cores would need many cycles for the same result.
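To make the tile operation concrete, here is a minimal sketch in plain Python of how a large matrix multiply can be decomposed into repeated D = A×B + C steps on small tiles. The tile size of 4 and the function names (`mma_tile`, `get_tile`, `tiled_matmul`) are assumptions for illustration, and the sketch only handles matrix dimensions that are multiples of the tile size. On real hardware each call to `mma_tile` would correspond to work done by a Tensor Core rather than nested Python loops.

```python
TILE = 4  # assumed tile edge for this sketch


def mma_tile(a, b, c):
    """Return a @ b + c for TILE x TILE tiles: one fused multiply-accumulate."""
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(TILE))
             for j in range(TILE)]
            for i in range(TILE)]


def get_tile(m, row, col):
    """Slice a TILE x TILE block out of a list-of-lists matrix."""
    return [r[col:col + TILE] for r in m[row:row + TILE]]


def tiled_matmul(a, b):
    """Multiply a (n x k) by b (k x m); n, k, m must be multiples of TILE."""
    n, k_dim, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for row in range(0, n, TILE):
        for col in range(0, m, TILE):
            acc = [[0.0] * TILE for _ in range(TILE)]  # the running C tile
            for k in range(0, k_dim, TILE):
                # Accumulate one tile product into the running sum.
                acc = mma_tile(get_tile(a, row, k), get_tile(b, k, col), acc)
            for i in range(TILE):
                out[row + i][col:col + TILE] = acc[i]
    return out


# Usage: multiplying by the identity returns the original matrix.
a = [[float(i * 8 + j) for j in range(8)] for i in range(8)]
identity = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
print(tiled_matmul(a, identity) == a)
```

The accumulator tile is why the operation is written as A×B + C rather than just A×B: each output tile is built up by feeding the previous partial sum back in as C.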
The Hopper H100 contains 528 Tensor Cores. Its peak FP8 tensor throughput is 3,958 TFLOPS with structured sparsity (about half that for dense matrices), roughly 60× the peak of a contemporary CPU. This isn't a software optimization; it's a physical circuit designed for one job.
OpenAI's GPT-3 (175 billion parameters, 2020) was trained on approximately 10,000 NVIDIA V100 GPUs on Microsoft Azure. The estimated compute cost was ~$4.6 million at cloud GPU rates. This was the training run that convinced the industry that large-scale GPU clusters — not custom silicon — were the baseline for frontier model development.
Each generation expanded the number of precision formats Tensor Cores support. Precision refers to how many bits represent each number during computation. FP32 (32-bit float) is accurate but slow and memory-hungry. FP16 (16-bit) doubles throughput but covers a much narrower range of values. BF16 (brain float 16, supported on Tensor Cores since Ampere) keeps the dynamic range of FP32 while halving the storage, giving up mantissa bits instead, which is better for training stability. FP8 (Hopper) doubles throughput again but requires careful numerical management.
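The trade-off between range and precision can be seen with a small experiment using only Python's standard `struct` module. This is a sketch under stated assumptions: the helper names are invented for this example, and `to_bf16` simply truncates a 32-bit float to its top 16 bits, whereas real hardware generally rounds.

```python
import struct


def to_fp16(x):
    # 'e' is IEEE 754 half precision; packing raises OverflowError out of range.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    # Assumption: emulate BF16 by keeping the top 16 bits of the float32
    # encoding (sign, 8 exponent bits, 7 mantissa bits) and zeroing the rest.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fits_fp16(x):
    try:
        to_fp16(x)
        return True
    except OverflowError:
        return False


# Precision: FP16 has more mantissa bits, so it lands closer to 0.1.
print("FP16 error on 0.1:", abs(to_fp16(0.1) - 0.1))
print("BF16 error on 0.1:", abs(to_bf16(0.1) - 0.1))

# Range: 70000 exceeds FP16's largest finite value but is fine for BF16.
print("70000 fits in FP16:", fits_fp16(70000.0))
print("70000 as BF16:", to_bf16(70000.0))
```

BF16 is coarser on small values but never overflows where FP32 would not, which is the property the text refers to as better training stability.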
Each new precision format requires new Tensor Core circuits and new software frameworks to exploit them. This means every generation creates fresh reasons for researchers to upgrade: not just raw speed, but access to new precision options that enable larger models to train on the same memory budget.
Training large models requires multiple GPUs communicating constantly, sharing gradient updates across every backward pass. A standard PCIe 4.0 x16 slot (the normal way GPUs connect to a server) tops out at around 32 GB/s in each direction, or about 64 GB/s counting both directions. NVLink 4.0 in the H100 delivers 900 GB/s of total bidirectional bandwidth per GPU, roughly 14× PCIe 4.0 on a like-for-like basis and about 7× the newer PCIe 5.0. For a model distributed across 8 GPUs, this bandwidth difference dramatically changes how much time is spent waiting for data vs. computing.
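A back-of-the-envelope calculation shows what the bandwidth gap means per training step. The sketch below takes the 32 GB/s per-direction PCIe figure from the text and treats NVLink's 900 GB/s as a bidirectional total, so 450 GB/s each way. The 7-billion-parameter model with FP16 gradients is a hypothetical example, and the result is a lower bound: a real all-reduce moves more data than one copy of the gradients and adds latency.

```python
def transfer_seconds(num_bytes, gb_per_s):
    """Time to move num_bytes over a link with the given one-way bandwidth."""
    return num_bytes / (gb_per_s * 1e9)


params = 7e9              # hypothetical 7B-parameter model
grad_bytes = params * 2   # FP16 gradients: 2 bytes per parameter

PCIE_GBPS = 32            # per direction, from the text
NVLINK_GBPS = 900 / 2     # assumption: 900 GB/s is a bidirectional total

pcie = transfer_seconds(grad_bytes, PCIE_GBPS)
nvlink = transfer_seconds(grad_bytes, NVLINK_GBPS)

print(f"PCIe:   {pcie:.4f} s per gradient exchange")
print(f"NVLink: {nvlink:.4f} s per gradient exchange")
print(f"Ratio:  {pcie / nvlink:.1f}x")
```

Because this cost is paid on every backward pass, a difference of a few hundred milliseconds per step compounds into a large share of total training time.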
Competitors cannot easily replicate NVLink. It requires NVIDIA-proprietary silicon on both the GPU and a custom switch chip (NVSwitch). AMD's GPU interconnect (Infinity Fabric) and Intel's Gaudi interconnects exist but have lower bandwidth and smaller deployment ecosystems as of 2024.
Each NVIDIA architecture generation is not just a faster version of the last — it introduces new computing primitives (Tensor Cores, Transformer Engine), new precision formats, and new interconnect capabilities. This means every generation creates new software features that only NVIDIA hardware can run, resetting the ecosystem advantage with each release cycle.
Each NVIDIA architecture generation introduced new primitives that reshaped what models researchers could build. Explore with the AI: which generation was the most consequential inflection point, and why does Blackwell represent a qualitatively different kind of product than Hopper?
In its fiscal year 2024, NVIDIA reported data center revenue of $47.5 billion — up from $15 billion the prior year. The quarter ending January 2024 alone produced $18.4 billion in data center revenue. For context: the entire global market for AI chips was estimated at roughly $50–60 billion in 2023. NVIDIA was capturing the majority of it.
Analyst estimates of NVIDIA's share of the AI accelerator market vary — ranging from 70% to over 90% depending on the segment measured. The highest figures apply to AI training accelerators specifically, where NVIDIA's H100 and A100 have been the dominant products since 2021. In the broader "data center GPU" category including inference, AMD has a more significant (though still minority) presence.
The practical effect of this concentration: every major AI lab outside Google (OpenAI, Anthropic, Meta AI, Mistral) depends on NVIDIA hardware for its primary training clusters. Microsoft's Azure AI supercomputers that power OpenAI's models run on tens of thousands of A100 and H100 GPUs. Meta's infrastructure for training Llama 3 used ~49,000 H100 GPUs. Google is the notable exception, having built its own TPUs, which are discussed in Lesson 4.
Following the launch of ChatGPT in November 2022, demand for NVIDIA H100 GPUs — which had just begun shipping — surged far beyond supply. Lead times for H100 orders stretched to 6–12 months at major cloud providers. On spot markets, individual H100 GPUs sold for $40,000–$60,000 each, compared to NVIDIA's official list price of around $30,000–$40,000.
This shortage had concrete strategic consequences. Startups that could not secure GPU allocations were forced to delay training runs, use slower A100s, or rent capacity from cloud providers at premium rates. Established players such as Microsoft, Google, Amazon, and Meta had secured large H100 orders months in advance and gained significant competitive advantage simply from hardware access. The GPU became a strategic resource, not a commodity input.
CoreWeave, a cloud infrastructure startup, bet early on GPU compute and acquired tens of thousands of NVIDIA GPUs before the 2023 shortage. By 2024, CoreWeave had secured a $1.1 billion funding round, an estimated $19 billion valuation, and a $1.6 billion contract with Microsoft — largely because it had GPU inventory that others didn't. The company's strategic value was primarily its secured hardware allocation.
AMD's MI300X accelerator — launched in late 2023 — is the most credible GPU competitor to NVIDIA's H100. It features 192 GB of HBM3 memory, compared to H100's 80 GB, which makes it better suited for inference on very large models (the 192 GB can hold a 70B-parameter model in FP16 without splitting across chips). Microsoft, Meta, and Oracle have all publicly committed to deploying MI300X at scale.
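The memory claim above is simple arithmetic, sketched here in Python. The bytes-per-parameter table and function names are assumptions for illustration, and the calculation covers model weights only: activations and the inference cache need additional memory on top of this.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "FP8": 1}


def weight_gb(num_params, fmt):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


def fits_on_one_chip(num_params, fmt, capacity_gb):
    return weight_gb(num_params, fmt) <= capacity_gb


params_70b = 70e9
print("70B weights in FP16:", weight_gb(params_70b, "FP16"), "GB")
print("Fits in 192 GB:", fits_on_one_chip(params_70b, "FP16", 192))
print("Fits in 80 GB: ", fits_on_one_chip(params_70b, "FP16", 80))
```

The same table shows why lower-precision formats matter for memory budgets: halving the bytes per parameter halves the footprint of the weights.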
However, AMD's challenge is software, not silicon. ROCm — AMD's equivalent of CUDA — has improved significantly but still lacks the maturity of CUDA's library ecosystem. Many researchers report that getting models to run optimally on MI300X requires more manual optimization than on NVIDIA hardware. PyTorch supports ROCm but the integration has historically been less reliable for cutting-edge features.
NVIDIA designs its chips but does not fabricate them — all production runs through TSMC (Taiwan Semiconductor Manufacturing Company). NVIDIA's Hopper H100 uses TSMC's 4nm process node. TSMC manufactures roughly 90% of the world's most advanced semiconductors. This creates a concentrated supply chain: NVIDIA's ability to ship GPUs depends entirely on TSMC's capacity and geopolitical stability in Taiwan.
In 2024, NVIDIA was reported to be one of TSMC's largest customers by revenue. TSMC's 3nm and 2nm capacity, needed for future NVIDIA products, is being expanded partly to meet NVIDIA demand. US restrictions on advanced chip exports to China further complicate the supply picture: NVIDIA created reduced-performance, export-compliant A800 and H800 variants for the Chinese market, and China remains a significant market that NVIDIA cannot fully serve with its best products.
NVIDIA's dominance is not just market share — it is control over a bottleneck resource. When GPUs become the limiting factor for AI development, whoever controls GPU supply exerts leverage over the pace of the entire industry. This is why governments, cloud providers, and AI labs treat GPU allocation as a strategic priority, not a procurement decision.
The 2023 GPU shortage turned hardware access into a strategic weapon. Explore with the AI: if you were advising an AI startup in early 2023, what strategies could you recommend to navigate the shortage — and how does the CoreWeave model represent a new kind of AI infrastructure company?
In 2016, Google revealed its Tensor Processing Unit, a custom ASIC designed entirely for one thing: running TensorFlow operations efficiently. Google had been quietly running TPUs internally since 2015, using them to power search ranking, Street View image processing, and AlphaGo's match play. The announcement made clear that Google had been living in the future for over a year.
Google's TPU (Tensor Processing Unit) is the most mature and widely deployed custom AI accelerator outside NVIDIA. Trillium (TPU v6), announced in 2024, is its sixth generation. Each generation is designed to run Google's JAX and TensorFlow frameworks, and TPUs power virtually all of Google's own AI products: Search, Translate, Gmail smart compose, Bard/Gemini, and YouTube recommendations.
The TPU's architecture differs fundamentally from a GPU. It uses a systolic array — a grid of multiply-accumulate units that pass data between neighbors without the memory bus traffic that GPUs require. This makes TPUs extremely efficient for the matrix multiplications in neural networks, but less flexible for other compute tasks. You cannot run a video game on a TPU. You can run a transformer very efficiently.
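The neighbor-to-neighbor data flow can be simulated in a few lines of Python. This is a toy sketch of one common systolic design (output-stationary, where each processing element keeps its own running sum), not a description of Google's actual circuit; the function name and register layout are assumptions for illustration. Values of A enter from the left edge and move right, values of B enter from the top and move down, and each element multiplies whatever passes through it.

```python
def systolic_matmul(a, b):
    """Simulate an n x m grid of multiply-accumulate units computing a @ b."""
    n, k_dim, m = len(a), len(b), len(b[0])
    acc = [[0.0] * m for _ in range(n)]    # running sum held in each unit
    a_reg = [[0.0] * m for _ in range(n)]  # A value currently in each unit
    b_reg = [[0.0] * m for _ in range(n)]  # B value currently in each unit
    steps = n + m + k_dim - 2              # cycles until the last unit finishes
    for t in range(steps):
        # Visit units from the far corner back, so each one reads the value
        # its left and upper neighbors held on the previous cycle.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:   # left edge: feed row i of A, delayed by i cycles
                    k = t - i
                    a_in = a[i][k] if 0 <= k < k_dim else 0.0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:   # top edge: feed column j of B, delayed by j cycles
                    k = t - j
                    b_in = b[k][j] if 0 <= k < k_dim else 0.0
                else:
                    b_in = b_reg[i - 1][j]
                a_reg[i][j], b_reg[i][j] = a_in, b_in
                acc[i][j] += a_in * b_in
    return acc, steps


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
result, steps = systolic_matmul(a, b)
print(result, "in", steps, "cycles")
```

Each input value is read from outside the grid once, at an edge, and then reused by every unit it passes through. That reuse is the source of the efficiency the text describes.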
DeepMind's AlphaGo, which defeated world Go champion Lee Sedol in March 2016, ran on Google TPUs during the match. Its networks had been trained through reinforcement learning over millions of self-play games, and evaluating them at every move is exactly the kind of repetitive matrix computation TPUs accelerate. AlphaGo's victory was the first public evidence that Google's custom silicon was viable for frontier AI.
Following Google's TPU disclosure, every major cloud provider developed its own AI accelerator. AWS's Trainium and Inferentia chips are the examples that come up later in this lesson.
Cerebras Systems took a radical approach: instead of cutting a silicon wafer into many individual chips, they used the entire 300mm wafer as one chip. The Wafer Scale Engine 3 (WSE-3, 2024) contains 4 trillion transistors and 900,000 AI-optimized cores. For models that fit within its enormous on-chip memory, inference is extraordinarily fast. Cerebras reported inference speeds of over 1,000 tokens per second for a 70B-parameter model, far exceeding GPU-based systems. The constraint: the wafer chip is extremely expensive and doesn't support the distributed multi-chip training that frontier models require.
Groq built a Language Processing Unit (LPU) optimized for deterministic, low-latency inference. By 2024, Groq's cloud inference API was demonstrating Llama 3 70B inference at over 800 tokens/second — catching significant attention from developers building latency-sensitive applications. Groq's chip uses a streaming dataflow architecture that avoids the memory bandwidth bottlenecks that limit GPU inference speed.
SambaNova uses a reconfigurable dataflow architecture — the chip's compute graph is programmable at the circuit level, allowing efficient mapping of different model architectures. SambaNova has sold systems to national labs including Argonne and Lawrence Livermore for scientific AI workloads.
In September 2023, Amazon announced an investment of up to $4 billion in Anthropic, under an agreement that explicitly included use of AWS Trainium chips for training future Claude models. This was a direct signal that a leading AI lab was willing to invest in non-NVIDIA training hardware, though Anthropic also confirmed continued use of NVIDIA GPUs. The deal represents a hedge, not a replacement.
To displace NVIDIA in AI training, a challenger would need: (1) hardware at comparable or better performance-per-dollar; (2) a software ecosystem with PyTorch-level compatibility and library depth; (3) supply at scale (tens of thousands of chips per customer); and (4) a track record across diverse model types. No current challenger satisfies all four simultaneously.
The most realistic displacement scenario is workload segmentation: NVIDIA retains the cutting-edge training market, while inference and specific workloads migrate to cheaper, purpose-built alternatives. Google's TPUs already represent this for Google internally. AWS Inferentia has gained traction for production inference workloads. The question is whether any challenger can build enough ecosystem to compete for the training market where NVIDIA's lock-in is deepest.
The pattern across all challengers — Google, AWS, Microsoft, Cerebras, Groq — is task specialization. No one is building a general GPU replacement. Each challenger finds a specific workload (inference, scientific simulation, low-latency serving) where it can beat NVIDIA on cost or latency, then expands from that beachhead. NVIDIA's response has been to make its own GPUs better at those specific tasks with each generation — a moving target competitors have to continually chase.
You're advising a well-funded AI chip startup in 2025. You've seen Google's TPU strategy, Groq's inference speed play, Cerebras's wafer-scale bet. The question: which segment of the AI hardware market has the most realistic path to displacing NVIDIA, and what does the winning strategy actually look like?