Module 4 · Lesson 1

Google's TPU Gambit

How the search giant quietly built the world's largest custom AI chip fleet — and why it changed everything.
What happens when a $1.5 trillion company decides that buying chips is no longer good enough?

By 2013, Google's engineers had a problem. Their new voice recognition system — the one powering Google Now — required so much GPU compute that deploying it at full scale would have doubled the company's entire data center footprint. The math was brutal. Either the AI stayed narrow, or the infrastructure bill became absurd.

Jeff Dean's infrastructure team had already been sketching a different answer on whiteboards for months. Instead of buying more of someone else's chips, Google would design its own — purpose-built for the matrix multiplications that neural networks demanded, and nothing else.

The TPU Origin Story

In 2013, Google began a secret project to build what it would eventually call the Tensor Processing Unit — a custom ASIC (application-specific integrated circuit) designed exclusively for the inference workloads underpinning its AI products. The project was classified internally because Google did not want NVIDIA or Intel to know what was coming.

The first-generation TPU (TPU v1) was deployed in Google's data centers in 2015, a full year before Google publicly acknowledged its existence at Google I/O in May 2016. During that undisclosed year, Google was running AlphaGo's neural-network inference and Google Search ranking on hardware no competitor knew existed.

TPU v1 was inference-only: it could run neural networks but not train them. Its key innovation was the systolic array architecture — a grid of multiply-accumulate units that could pass data from cell to cell without repeatedly fetching from off-chip memory. This slashed latency and power consumption by roughly 10x versus the GPU alternative for inference workloads.
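To make the cell-to-cell data flow concrete, here is a minimal software simulation of a systolic array multiplying two small matrices. It is a sketch under stated assumptions: it uses an output-stationary layout (each cell keeps one running sum) because that is the simplest variant to follow, and it does not claim to reproduce the TPU's actual dataflow or dimensions. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    Cell (i, j) owns one accumulator. Values from A move one cell to the
    right per cycle, values from B move one cell down per cycle, and each
    operand is read from the input matrices exactly once, at the array edge.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]      # running sum held in each cell
    a_reg = [[0] * m for _ in range(n)]    # A operand currently in each cell
    b_reg = [[0] * m for _ in range(n)]    # B operand currently in each cell
    cycles = n + m + k - 2                 # until the far-corner cell has seen all k pairs

    for t in range(cycles):
        # Shift phase: sweep from the far corner so each cell copies its
        # neighbour's value from the previous cycle.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j > 0:
                    a_reg[i][j] = a_reg[i][j - 1]
                else:
                    s = t - i              # row i enters the array i cycles late
                    a_reg[i][j] = A[i][s] if 0 <= s < k else 0
                if i > 0:
                    b_reg[i][j] = b_reg[i - 1][j]
                else:
                    s = t - j              # column j enters the array j cycles late
                    b_reg[i][j] = B[s][j] if 0 <= s < k else 0
        # Compute phase: every cell does one multiply-accumulate.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, cycles)  # [[19, 22], [43, 50]] 4
```

The point of the skewed input timing is that matching operands meet in the right cell on the right cycle, so no cell ever has to go back to memory for a value its neighbour already holds.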

The Paper That Changed Everything

Google published "In-Datacenter Performance Analysis of a Tensor Processing Unit" at ISCA 2017. It reported that the TPU v1 ran neural network inference about 15–30x faster than contemporary CPUs and GPUs, with roughly 30–80x higher performance-per-watt. This was the first time a hyperscaler had publicly disclosed a custom AI ASIC's internal benchmarks, and the numbers shocked the semiconductor industry.

The TPU Generations: A Rapid Progression

Google did not stop at inference. The TPU v2 (2017) added training capability and was made available to external researchers through Google Cloud TPU in early access. TPU v2 pods, clusters of 256 TPU v2 chips (64 four-chip boards) linked by a custom high-speed interconnect, could train large ResNet models in under an hour.

TPU v3 (2018) introduced liquid cooling and raised the chip count per pod to 1,024. Google reported internally that its TPU v3 pods outperformed the best GPU clusters it could otherwise procure for large-scale language model training.

TPU v4 (2021) was the generation that trained PaLM — Google's 540-billion-parameter language model. A TPU v4 pod contained 4,096 chips connected via Google's proprietary optical interconnect, achieving what Google called the largest tightly-coupled AI supercomputer in operation at that time.

In 2024, Google announced Trillium (TPU v6), claiming a 4.7x improvement in compute performance per chip over TPU v5e, with substantially improved energy efficiency. By 2024, Google had deployed more custom AI silicon by total transistor count than any other organization on Earth.

2015: TPU v1 Deployment (one year before public disclosure)
4,096: Chips per TPU v4 Pod (used to train PaLM 540B)
4.7×: Trillium Compute Gain (vs. TPU v5e per chip)
10×: Perf/Watt vs. GPU (TPU v1 inference benchmark)

Why Google Could Do This and Others Couldn't

Custom silicon requires three things most companies lack: enough volume to amortize design costs, deep software co-design capability, and the patience to wait 18–24 months for a chip to go from tape-out to deployed silicon. Google had all three.

Google's annual hardware capital expenditure was already in the billions by 2013. The incremental cost of funding an ASIC design team — roughly $200–500 million over several years for a first chip — was manageable against a budget that large. More critically, Google controlled both the workloads (TensorFlow, later JAX) and the infrastructure, enabling the tight hardware-software co-design that produces genuine gains over general-purpose chips.
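The volume argument can be made concrete with a simple break-even calculation: a fixed design cost only pays off once enough chips have been deployed for the per-chip savings to cover it. The sketch below is illustrative only. The $300 million design cost sits inside the $200–500 million range quoted above, while the per-chip prices are hypothetical placeholders, not reported figures.

```python
import math


def breakeven_chips(design_cost, merchant_price, custom_unit_cost):
    """Chips that must be deployed before custom silicon pays for itself.

    Returns None when the custom chip is not cheaper per unit, in which
    case no volume ever recovers the design cost.
    """
    saving_per_chip = merchant_price - custom_unit_cost
    if saving_per_chip <= 0:
        return None
    return math.ceil(design_cost / saving_per_chip)


# Hypothetical inputs: $300M design cost, $10,000 per merchant accelerator,
# $4,000 all-in unit cost for the in-house chip.
print(breakeven_chips(300_000_000, 10_000, 4_000))  # 50000
```

Under these assumed numbers the program breaks even at 50,000 chips, which is why the calculation favors hyperscalers and rarely works for a company with a few thousand accelerators.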

The broader lesson: vertical integration — owning the chip, the software stack, the compiler, and the datacenter — is the structural advantage that makes custom silicon worth the investment. Companies that only own one or two layers rarely recoup the cost.

Strategic Implication

By building its own silicon, Google transformed NVIDIA from a vendor into an optional supplier. It also created a moat: external AI developers using Google Cloud TPUs are dependent on Google's roadmap, pricing, and availability — the same leverage dynamic that previously gave NVIDIA power over Google.

ASIC
Application-Specific Integrated Circuit. A chip designed for exactly one type of task. Faster and more efficient than general-purpose chips for that task, but useless for anything else.
Systolic Array
A hardware architecture where data flows rhythmically through a grid of processors, each performing a simple computation and passing the result to its neighbor — eliminating repeated memory accesses.
Tape-out
The final step in chip design when the layout is sent to the fab for manufacturing. After tape-out, design changes are impossible; corrections require an entirely new chip revision.

Lesson 1 Quiz

Google's TPU Gambit — four questions
1. Google's TPU v1 was deployed in data centers in what year, before its public announcement?
Answer: 2015. TPU v1 was silently deployed in 2015, a full year before Google disclosed the chip at Google I/O 2016, and it ran AlphaGo's computations in the interim.
2. What hardware architecture gave TPU v1 its efficiency advantage over GPUs for inference?
Answer: The systolic array, a grid of multiply-accumulate units that passes data rhythmically from cell to cell. It eliminates repeated off-chip memory fetches and delivers ~10x better performance-per-watt on inference.
3. Which Google language model was trained on TPU v4 pods?
Answer: PaLM. Google's 540-billion-parameter model was trained on TPU v4 pods containing 4,096 chips each. BERT and T5 used earlier infrastructure.
4. What is the primary strategic advantage Google gained by building its own TPUs?
Answer: Vendor independence. Google transformed NVIDIA from a critical vendor into an optional one, while creating customer dependency on Google's own TPU roadmap through Cloud TPU access.

Lab 1 — The Vertical Integration Decision

Discuss Google's TPU strategy with your AI lab assistant

Your Mission

You are advising a large cloud company that currently buys all its AI chips from NVIDIA. The CEO has seen Google's TPU results and wants to know if building custom silicon makes sense for your company too. Use this lab to think through the decision.

Starter: "What are the three most important factors a company must assess before committing to building a custom AI chip?"
Module 4 · Lesson 2

Amazon's Inferentia & Trainium

AWS's two-chip strategy: one for training, one for inference — and the hyperscaler calculus behind both.
When AWS accounts for a third of global cloud revenue, even a 10% reduction in chip costs represents billions saved. How did that math justify building two entirely separate chips?

Andy Jassy took the stage at re:Invent 2018 with an announcement that caught most analysts off guard. Amazon Web Services was releasing AWS Inferentia — a custom chip for machine learning inference designed entirely in-house. The announcement was notable not for what Inferentia could do, but for what it signalled: AWS no longer intended to be a passive reseller of other companies' silicon.

Two years later, at re:Invent 2020, AWS doubled down with Trainium, its custom training chip. The message was unambiguous. The world's largest cloud provider was building a parallel AI compute stack, from silicon to compiler to managed service, with the explicit goal of offering the lowest cost per training job in public cloud.

Inferentia: The Economics of Inference at Scale

AWS Inferentia was announced at re:Invent in November 2018 and made generally available in December 2019 via Amazon EC2 Inf1 instances. The chip was fabricated by TSMC on a 16nm process and contained four NeuronCores, custom inference engines each capable of performing large tensor operations in a single clock cycle.

AWS's internal case study for Inferentia's launch was Amazon's own Alexa service — a workload processing hundreds of millions of voice queries daily. Running Alexa inference on GPU instances had cost tens of millions of dollars annually. Moving to Inferentia reduced that cost by approximately 70% per inference request, according to figures AWS disclosed at re:Invent 2019.

The Inf1 instances were priced at up to 40% lower cost per inference compared to equivalent GPU-based instances, AWS claimed at launch. Independent benchmarks from MLPerf Inference 2021 subsequently placed Inferentia chips among the top performers in the datacenter inference category on several standard models including BERT and ResNet-50.

AWS Neuron SDK

AWS built the Neuron SDK — a compiler and runtime toolchain that converts models trained in PyTorch, TensorFlow, or MXNet into NeuronCore-optimized binaries. The SDK handles operator partitioning, memory layout, and on-chip data movement automatically. This software layer is what makes Inferentia usable without low-level chip expertise — a critical adoption requirement for AWS's customer base.

Inferentia 2: The Architecture Matures

AWS Inferentia2 was announced at re:Invent 2022 and became available via Inf2 instances in 2023. The second generation moved to TSMC's 7nm node, replaced the four first-generation NeuronCores with two larger NeuronCore-v2 engines per chip (the largest Inf2 instance carries 12 chips), and introduced direct chip-to-chip NeuronLink interconnects enabling multi-chip inference for models too large to fit on a single device.

AWS reported that Inf2 instances delivered up to 4x higher throughput and up to 10x lower latency for large language model inference compared to Inf1, at broadly comparable or lower per-hour pricing. The architectural improvement was driven by NeuronLink, which allowed a 175-billion-parameter model like GPT-3 to be served across the 12 Inferentia2 chips of a single Inf2 instance, avoiding the PCIe bottleneck that limits GPU-based multi-card inference.
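A quick back-of-the-envelope calculation shows why multi-chip serving is unavoidable for a model of this size. The sketch below counts only the memory needed to hold the weights. The 2 bytes per parameter (16-bit weights) and the 32 GB of device memory per chip are assumptions introduced for illustration, not figures from the lesson, and activations and caches would add further overhead.

```python
import math


def weight_footprint_gb(num_params, bytes_per_param=2):
    """Memory needed just to hold the model weights, in gigabytes."""
    return num_params * bytes_per_param / 1e9


def min_chips_for_weights(num_params, chip_memory_gb, bytes_per_param=2):
    """Smallest chip count whose combined memory can hold the weights."""
    return math.ceil(weight_footprint_gb(num_params, bytes_per_param) / chip_memory_gb)


GPT3_PARAMS = 175_000_000_000   # parameter count cited in the text
ASSUMED_CHIP_MEMORY_GB = 32     # assumption: device memory per accelerator chip

print(weight_footprint_gb(GPT3_PARAMS))                            # 350.0
print(min_chips_for_weights(GPT3_PARAMS, ASSUMED_CHIP_MEMORY_GB))  # 11
```

Once the weights are sharded across that many devices, every generated token requires traffic between chips, which is why the interconnect rather than any single chip's compute tends to set the latency.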

Trainium: Entering the Training Market

AWS Trainium was announced at re:Invent 2020 and became available via Trn1 instances in 2022. Trainium was a more ambitious chip than Inferentia: it had to compete directly with NVIDIA A100 for the training market, where customer loyalty and software ecosystem lock-in (CUDA) were formidably entrenched.

AWS's approach was cost disruption. Trn1 instances were priced to offer training compute at roughly 50% lower cost per FLOP than comparable GPU instances on AWS, according to AWS pricing disclosures at re:Invent 2022. The actual performance-per-dollar advantage depended heavily on model architecture and how well the Neuron compiler could optimize the specific training graph.

Trainium2, announced in 2023, scales the architecture dramatically: up to 65,536 chips in a single EC2 UltraCluster, connected by AWS's custom EFA (Elastic Fabric Adapter) network. AWS claimed this configuration could train a 300-billion-parameter model in approximately 2 weeks — competitive with the largest GPU clusters, at a lower hourly rate.

70%: Alexa Cost Reduction (Inferentia vs. GPU instances)
4×: Inf2 Throughput Gain (vs. Inf1 for LLM inference)
65,536: Trainium2 Cluster Scale (max chips per UltraCluster)
50%: Training Cost Savings (Trn1 vs. GPU on AWS, claimed)

The Structural Difference from Google's Approach

Google built TPUs primarily for internal consumption — to power its own products and reduce its own chip bill, with Cloud TPU as a secondary external offering. Amazon's strategy is fundamentally different: Inferentia and Trainium are cloud products first. The chips only make business sense if AWS customers adopt them in volume.

This means AWS has to solve a harder problem than Google: the software ecosystem. NVIDIA's CUDA has a decade of optimized libraries, tutorials, and trained engineers. AWS's Neuron SDK is newer and covers a narrower surface area. AWS has addressed this partly through compatibility layers and partly through partnerships — notably a 2023 agreement with Anthropic to have Claude models run natively on Trainium, providing a marquee workload to validate the platform.

The Anthropic–AWS Deal

In September 2023, AWS announced a $4 billion investment in Anthropic with a key provision: Anthropic would use AWS Trainium and Inferentia as its primary training and inference platforms. This was not just financial — it was a validation strategy. By anchoring the leading AI safety lab's workloads to its custom silicon, AWS gained a credible proof point that Trainium can train frontier-class models.

NeuronCore
AWS's custom compute engine within Inferentia chips, optimized for the tensor operations at the heart of neural network inference. Each first-generation Inferentia chip contains four NeuronCores; Inferentia2 uses two larger NeuronCore-v2 engines per chip.
NeuronLink
AWS's chip-to-chip interconnect in Inferentia2, enabling coherent memory access across multiple chips for serving large models without PCIe latency bottlenecks.
EFA (Elastic Fabric Adapter)
AWS's custom high-bandwidth, low-latency network interface used to link Trainium chips across multiple instances in a training cluster, replacing standard TCP/IP with RDMA-style communication.

Lesson 2 Quiz

Amazon's Inferentia & Trainium — four questions
1. AWS's first custom AI chip, Inferentia, was announced at which event?
Answer: AWS re:Invent 2018. Andy Jassy announced Inferentia there, with general availability following a year later in December 2019.
2. What was the primary real-world workload AWS used to validate Inferentia's cost advantages internally?
Answer: Alexa. AWS disclosed at re:Invent 2019 that moving Alexa inference to Inferentia reduced per-query costs by approximately 70%.
3. What key architectural feature in Inferentia2 enabled large language model inference without PCIe bottlenecks?
Answer: NeuronLink, a direct chip-to-chip interconnect that provides coherent memory access across Inferentia2 chips, enabling multi-chip LLM serving without the PCIe latency that limits GPU-based setups.
4. What was the strategic significance of the 2023 AWS–Anthropic investment agreement regarding chip adoption?
Answer: Validation. Anthropic committed to using Trainium and Inferentia as its primary platforms, and anchoring a frontier AI lab's workloads to AWS custom silicon gave AWS a credible public proof point for Trainium.

Lab 2 — The Two-Chip Strategy

Explore AWS's decision to build separate chips for training and inference

Your Mission

AWS made a distinctive choice: build two separate chips — Inferentia for inference and Trainium for training — rather than a single unified chip. Most competitors build one chip that handles both tasks. Explore the trade-offs with your lab assistant.

Starter: "Why would AWS choose to build two separate chips rather than one chip that handles both training and inference? What are the trade-offs?"
Module 4 · Lesson 3

Microsoft's Maia & Cobalt

How the OpenAI partnership reshaped Azure's silicon roadmap — and what Microsoft's chip ambitions reveal about the future of cloud AI infrastructure.
When your most important cloud customer is also your biggest AI partner, how do you build chips that serve both of you without creating a conflict of interest?

At Microsoft Ignite 2023, CEO Satya Nadella unveiled two chips simultaneously: Azure Maia 100, a custom AI accelerator for training and inference, and Azure Cobalt 100, a custom Arm-based CPU for general cloud workloads. The announcements were notable for their timing: they came just days before the OpenAI governance crisis of November 2023, and they signalled that Microsoft was not content to remain a passive buyer of NVIDIA hardware even as it deepened its OpenAI relationship.

The strategic message was explicit: Microsoft intended to have custom silicon at every layer of Azure's AI stack — custom CPUs, custom AI accelerators, and custom networking — by 2025.

Azure Maia 100: Architecture and Targets

Azure Maia 100 was designed in partnership with TSMC and fabricated on TSMC's 5nm process node. Microsoft disclosed that it contains 105 billion transistors — among the largest transistor counts in any commercially deployed AI chip at time of announcement. It is designed specifically for large transformer model workloads: training and inference of GPT-class models.

The chip's architecture prioritizes memory bandwidth over raw FLOP count. Microsoft engineers concluded that most large model operations are memory-bandwidth-bound rather than compute-bound — the bottleneck is moving weights around, not performing the arithmetic. Maia 100 incorporates a large on-chip SRAM cache and low-latency off-chip DRAM access optimized for the sequential access patterns of autoregressive transformer inference.

Microsoft also designed a custom liquid cooling system — called the "cold plate" system — that mounts directly to the Maia chip package. This allows Maia to run at higher sustained clock speeds in dense rack configurations than air-cooled alternatives, a prerequisite for Azure's high-density AI data center buildout.

Microsoft's Stated Primary Use Case

Microsoft disclosed that Maia 100's first major workload would be powering Microsoft Copilot — its AI assistant embedded across Office 365, Teams, Bing, and Windows. Copilot processes tens of billions of tokens per day across these products. By running inference on Maia rather than NVIDIA GPUs, Microsoft projected substantial savings on its own AI infrastructure bill before offering the chips to external Azure customers.

Azure Cobalt 100: The CPU Side of the Equation

Alongside Maia, Microsoft announced Azure Cobalt 100 — a 128-core Arm-based CPU built on TSMC's 5nm process, designed to replace Intel Xeon and AMD EPYC processors in a large fraction of Azure's general-purpose compute fleet. Cobalt 100 is based on the Arm Neoverse CSS (Compute Subsystem) N2 architecture, though Microsoft added custom optimizations for Azure workloads.

Microsoft claimed Cobalt 100 delivers 40% better performance than the best Azure Arm-based instances available at time of announcement. Cobalt 100 instances became available in public preview on Azure in 2024, initially targeting compute-intensive non-AI workloads — web serving, data processing, and containerized applications — where the power efficiency of Arm versus x86 provides direct cost benefits.

The relevance of Cobalt to the AI hardware race is indirect but important: by reducing CPU costs, Microsoft frees capital budget for more Maia and NVIDIA GPU procurement. A hyperscaler's capex is a finite pool; efficiency in one area enables investment in another.

The OpenAI Complication

Microsoft's Maia strategy is intertwined with its OpenAI relationship in ways that create structural tensions. Microsoft has invested approximately $13 billion in OpenAI through multiple funding rounds. OpenAI trains its models — including GPT-4o and future GPT-series models — on Azure infrastructure, primarily on NVIDIA H100 clusters that Microsoft procures on OpenAI's behalf.

If Maia matures to the point where it can train frontier models at competitive cost, Microsoft faces a choice: push OpenAI to use Maia (saving Microsoft money and validating the chip), or allow OpenAI to continue specifying NVIDIA hardware (preserving the NVIDIA relationship and reducing transition risk). OpenAI has its own separate silicon ambitions — its fundraising and partnership conversations in 2023–2024 included discussions about custom chip development, potentially in partnership with SoftBank or through OpenAI's own ASIC project.

As of 2024, Maia has not been publicly confirmed as a primary training substrate for any OpenAI model. Its primary disclosed workload remains Microsoft's own Copilot products.

105B: Maia 100 Transistors (TSMC 5nm process)
128: Cobalt 100 CPU Cores (Arm Neoverse N2-based)
40%: Cobalt Performance Gain (vs. prior Azure Arm instances)
$13B: OpenAI Investment (Microsoft total through 2023)

Microsoft vs. Google vs. AWS: A Comparative View

All three hyperscalers are building custom AI silicon, but their motivations and architectures differ meaningfully:

Company | Chip(s) | Primary Internal Use | External Sales? | Distinctive Feature
Google | TPU v4 / Trillium | Search, PaLM, Gemini | Yes (Cloud TPU) | Optical inter-chip network; 4,096-chip pods
Amazon | Inferentia2 / Trainium2 | Alexa, AWS services | Yes (Inf2, Trn1 instances) | Two-chip strategy; 65,536-chip UltraCluster
Microsoft | Maia 100 / Cobalt 100 | Copilot, Azure services | Planned (preview 2024) | 105B transistors; custom liquid cooling

The NVIDIA Dependency Paradox

All three hyperscalers are simultaneously building custom silicon to reduce NVIDIA dependence while also spending more on NVIDIA GPUs than ever before. In 2023, Microsoft, Google, and Amazon collectively spent an estimated $30+ billion on NVIDIA H100s. Custom chips reduce marginal costs on specific workloads but cannot yet replace NVIDIA across the full stack. The hyperscalers are playing a long game: replacing NVIDIA gradually, workload by workload, as their custom chips mature.

Autoregressive Inference
The process by which a language model generates output one token at a time, each token depending on all previous tokens. Memory-bandwidth-intensive because the full model weight set must be accessed repeatedly.
Cold Plate Cooling
A liquid cooling method where a metal plate containing microchannels for coolant flow is mounted directly on a chip package. More efficient than air cooling for high-TDP AI chips in dense racks.

Lesson 3 Quiz

Microsoft's Maia & Cobalt — four questions
1. At what event did Microsoft simultaneously announce both Maia 100 and Cobalt 100?
Answer: Microsoft Ignite, November 2023. Satya Nadella announced both chips there.
2. What is the disclosed transistor count of Azure Maia 100?
Answer: 105 billion transistors, among the largest counts in a deployed AI chip at announcement, fabricated on TSMC's 5nm process.
3. Which Microsoft product was identified as the primary initial workload for Azure Maia 100?
Answer: Microsoft Copilot, the AI assistant across Office 365, Teams, Bing, and Windows.
4. Why does Maia 100's architecture prioritize memory bandwidth over raw FLOP count?
Answer: Because autoregressive transformer inference is memory-bandwidth-bound. Moving billions of weight parameters for each generated token is the actual bottleneck, so Microsoft's engineers made memory architecture the priority.

Lab 3 — The OpenAI Complication

Explore the tension between Microsoft's chip ambitions and its OpenAI partnership

Your Mission

Microsoft has invested $13 billion in OpenAI and built Maia 100 to reduce its NVIDIA dependency. But OpenAI still trains on NVIDIA GPUs via Azure. You're a Microsoft strategy analyst asked to assess: should Microsoft push harder for OpenAI to adopt Maia, or would that create risks the company can't afford?

Starter: "What are the risks to Microsoft if OpenAI refuses to use Maia for training? And what are the risks if they push too hard and OpenAI walks?"
Module 4 · Lesson 4

The Startup Challengers

Cerebras, Groq, SambaNova, and Tenstorrent: four different bets on what comes after the GPU era.
If NVIDIA's dominance rests on CUDA and programmability, can a startup win by being faster at a narrower task — or does specialization always lose to the ecosystem?

Every startup in the AI chip space is making the same fundamental bet: that there exists a regime where a purpose-built architecture outperforms a general-purpose GPU enough that customers will pay a premium, tolerate software migration costs, and accept vendor concentration risk. The history of the semiconductor industry suggests this is very hard to achieve. The history of the last five years suggests it might be possible — if the workload is right.

Cerebras: The Wafer-Scale Engine

Cerebras Systems, founded in 2016 by Andrew Feldman (previously CEO of SeaMicro, acquired by AMD), made the most audacious architectural bet in the AI chip industry: build a chip the size of an entire silicon wafer.

The Cerebras WSE-2 (Wafer-Scale Engine 2), announced in 2021, contains 2.6 trillion transistors across 46,225 mm² of silicon, roughly 56x the area of an NVIDIA A100. It contains 850,000 AI-optimized cores and 40 GB of on-chip SRAM (compared to the A100's 40 MB of on-chip L2 cache). This eliminates off-chip memory bottlenecks entirely for models that fit within 40 GB, and Cerebras has developed techniques to spread larger models across multiple WSE-2 chips.

The WSE-3, announced in March 2024, scaled further: 4 trillion transistors, 900,000 cores, and 44 GB of on-chip SRAM. It is fabricated by TSMC on a 5nm process. A single WSE-3 can train a GPT-3-sized (175B parameter) model without model parallelism, a feat that requires hundreds of GPU nodes.

Cerebras's real-world deployments include partnerships with Argonne National Laboratory and Lawrence Livermore National Laboratory for scientific AI workloads. In August 2024, Cerebras launched Cerebras Inference, a cloud API service offering token generation at over 1,800 tokens per second for Llama 3 models, compared to roughly 60–80 tokens/second on equivalent GPU setups. In independent tests, this represented the fastest publicly available LLM inference service by a significant margin.

The Wafer-Scale Risk

Building chips at wafer scale creates a fundamental yield problem: a single defect anywhere on the 46,000 mm² wafer could theoretically render the entire chip unusable. Cerebras solved this through redundant cores — enough extra compute units that the chip remains fully functional even with a statistically expected number of manufacturing defects. The yield math works at scale, but it required significant custom fab process development with TSMC.
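The yield argument can be illustrated with the standard Poisson defect model, in which the chance of a defect-free die falls exponentially with its area. The sketch below is a simplification under stated assumptions: the defect density and spare-core count are hypothetical, and it assumes each defect disables exactly one core that a spare can replace. It is not Cerebras's actual yield model.

```python
import math


def yield_without_redundancy(area_cm2, defects_per_cm2):
    """Probability that a die of the given area has zero defects."""
    return math.exp(-area_cm2 * defects_per_cm2)


def yield_with_spares(area_cm2, defects_per_cm2, spare_cores):
    """Probability that the defect count does not exceed the spare cores.

    Sums the Poisson probabilities P(X = 0) through P(X = spare_cores).
    """
    mean = area_cm2 * defects_per_cm2
    term = math.exp(-mean)          # P(X = 0)
    total = term
    for i in range(1, spare_cores + 1):
        term *= mean / i            # P(X = i) from P(X = i - 1)
        total += term
    return min(total, 1.0)


WAFER_AREA_CM2 = 462.25                    # 46,225 mm^2, from the text
GPU_DIE_AREA_CM2 = WAFER_AREA_CM2 / 56     # using the 56x area ratio from the text
DEFECT_DENSITY = 0.1                       # hypothetical defects per cm^2
SPARE_CORES = 100                          # hypothetical redundancy budget

print(yield_without_redundancy(GPU_DIE_AREA_CM2, DEFECT_DENSITY))
print(yield_without_redundancy(WAFER_AREA_CM2, DEFECT_DENSITY))
print(yield_with_spares(WAFER_AREA_CM2, DEFECT_DENSITY, SPARE_CORES))
```

Under these assumptions a GPU-sized die comes out defect-free a bit under half the time, a defect-free wafer is effectively impossible, and a wafer that tolerates 100 dead cores is usable almost every time. Redundancy turns the problem from avoiding defects into budgeting for them.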

Groq: The Language Processing Unit

Groq was founded in 2016 by Jonathan Ross — the Google engineer who first proposed building the TPU. Ross left Google after TPU v1 shipped and founded Groq to pursue a different architectural vision: deterministic, software-controlled data movement with zero hardware caches.

The Groq LPU (Language Processing Unit) uses a tensor streaming processor architecture where all data movement is explicitly scheduled by the compiler at compile time. There is no dynamic memory controller, no speculative execution, and no cache hierarchy. The result is perfectly predictable latency — every inference run takes exactly the same time as every other — and extremely high memory bandwidth utilization (near 100%, versus 60–80% typical for GPUs).
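The idea of compile-time scheduling can be shown with a toy example: if every operation has a fixed, known duration and nothing is decided at run time, the compiler can assign each operation an exact start cycle and the total latency is known before the program ever runs. The sketch below is a hypothetical illustration, not Groq's compiler. The operation names and cycle counts are invented, and it assumes enough functional units that independent operations never contend.

```python
def compile_schedule(ops):
    """Assign a fixed (start, end) cycle to every operation.

    `ops` is a list of (name, cycles, dependencies) in dependency order.
    Each operation starts as soon as all of its inputs are ready.
    """
    schedule = {}
    for name, cycles, deps in ops:
        start = max((schedule[d][1] for d in deps), default=0)
        schedule[name] = (start, start + cycles)
    return schedule


def total_latency(schedule):
    """Cycle at which the last operation finishes."""
    return max(end for _, end in schedule.values())


# Hypothetical inference step with made-up cycle counts.
program = [
    ("load_weights", 4, []),
    ("load_input", 2, []),
    ("matmul", 10, ["load_weights", "load_input"]),
    ("softmax", 3, ["matmul"]),
]

schedule = compile_schedule(program)
print(schedule["matmul"], total_latency(schedule))  # (4, 14) 17
```

Because no cache hit or miss can change any of these durations, the 17-cycle figure holds for every run, which is the property behind the predictable latency described above.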

In February 2024, Groq launched GroqCloud — a public inference API. Within days of launch, independent users reported Llama 2 70B inference at over 500 tokens per second, faster than any GPU-based service at equivalent model size. Groq's internal benchmark claimed 750 tokens/second for Mixtral 8x7B. The service went viral in AI developer communities as the first genuinely sub-100ms time-to-first-token experience for large models.

Groq raised a $640 million Series D round in August 2024 at a reported $2.8 billion valuation, with investors including Samsung, BlackRock, and the government of Saudi Arabia (via the Public Investment Fund).

SambaNova: The Reconfigurable Approach

SambaNova Systems, founded in 2017 by Stanford professors Kunle Olukotun and Chris Ré alongside engineer Rodrigo Liang, took a different architectural direction: the reconfigurable dataflow unit (RDU). Rather than designing a fixed-function chip, SambaNova built a chip whose data movement paths can be reprogrammed at runtime, a middle ground between an ASIC and an FPGA.

SambaNova's primary commercial traction has been in the enterprise on-premises market. The company sells "SambaNova Suite" — a full-stack AI appliance containing RDU chips, software, and pre-trained models — directly to large enterprises and national labs that cannot or will not send sensitive data to public cloud. Customers include Argonne National Laboratory, Oak Ridge National Laboratory, and several large financial institutions.

SambaNova raised a $676 million Series D in 2021 at a valuation above $5 billion. It has been more conservative about publishing public benchmarks than Cerebras or Groq, reflecting its enterprise sales model where customer-specific performance demonstrations replace public leaderboards.

Tenstorrent: Keller's Bet on Open Architecture

Tenstorrent was founded in 2016 and is led by Jim Keller, arguably the most celebrated CPU architect in the industry, known for his work on AMD's Zen architecture, Apple's A-series chips, and Tesla's Full Self-Driving chip. Keller joined as CTO in 2021 and became CEO in 2023.

Tenstorrent's chips — the Grayskull (2021) and Wormhole (2022) architectures — use a RISC-V-based tile mesh where each tile contains a RISC-V processor plus a matrix math unit. The tiles are connected by a 2D torus network-on-chip. Critically, Tenstorrent has open-sourced significant portions of its software stack and published its architecture specifications — a bet that an open ecosystem will attract developers faster than any proprietary moat.

In August 2023, Tenstorrent raised $100 million in a round co-led by Hyundai Motor Group and Samsung Catalyst Fund. The backing from Korean industrial conglomerates reflects a growing thesis: as AI moves into automotive and industrial applications, on-device AI chips designed in a RISC-V ecosystem become attractive to OEMs who want silicon independence from both NVIDIA and US hyperscalers.

4T: WSE-3 Transistors (Cerebras, TSMC 5nm)
750: Groq Tokens/Sec (Mixtral 8x7B benchmark)
$2.8B: Groq Valuation (Series D, August 2024)
$5B+: SambaNova Peak Valuation (2021 funding round)
The Common Thread: Latency Over Throughput

All four startups, despite wildly different architectures, are targeting the same gap: NVIDIA optimizes for throughput — tokens per second per dollar across large batch jobs. The startup challengers are attacking latency — time-to-first-token, single-request response time, and the interactive AI use cases where a 200ms response feels alive and a 2,000ms response feels broken.

This is a coherent market segmentation bet. As AI inference moves from batch processing to real-time conversational applications — voice AI, coding assistants, autonomous agents — the relevant metric shifts from "how many requests can I process per hour?" to "how fast does this individual response arrive?" NVIDIA's multi-billion-dollar CUDA ecosystem is optimized for the former. Cerebras, Groq, and others are betting the future rewards the latter.
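The throughput-versus-latency tension can be illustrated with a toy cost model. The sketch below assumes that one decode step over a batch costs a fixed overhead plus a small per-request increment; the constants (`fixed_ms`, `per_request_ms`, `tokens_per_response`) are invented for illustration and do not describe any real chip. Under those assumptions, larger batches raise aggregate tokens per second while making each individual response slower.

```python
from dataclasses import dataclass


@dataclass
class ServingPoint:
    batch_size: int
    step_ms: float         # time for one decode step over the whole batch
    tokens_per_sec: float  # aggregate throughput across all requests
    response_ms: float     # time for one request to receive its full response


def serving_point(batch_size: int,
                  tokens_per_response: int = 100,
                  fixed_ms: float = 20.0,
                  per_request_ms: float = 0.5) -> ServingPoint:
    """Toy model: every step pays a fixed cost plus a cost per batched request."""
    step_ms = fixed_ms + per_request_ms * batch_size
    # Each step emits one token per request in the batch.
    tokens_per_sec = batch_size * 1000.0 / step_ms
    # A single request needs one step per generated token.
    response_ms = step_ms * tokens_per_response
    return ServingPoint(batch_size, step_ms, tokens_per_sec, response_ms)


if __name__ == "__main__":
    for batch in (1, 8, 64):
        p = serving_point(batch)
        print(f"batch={p.batch_size:3d}  "
              f"throughput={p.tokens_per_sec:8.1f} tok/s  "
              f"response={p.response_ms:7.0f} ms")
```

In this model, going from a batch of 1 to a batch of 64 multiplies aggregate throughput roughly 25-fold, yet each user waits more than twice as long for a complete answer. An operator tuning for cost per token pushes batch size up; an operator tuning for interactive feel pushes it down.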

The Survival Question

All four startups face the same existential risk: NVIDIA is not standing still. Each new GPU generation, from the H200 to the upcoming Blackwell Ultra, narrows the inference latency gap, and NVIDIA's NIM microservices platform specifically targets the low-latency inference market. A startup that wins on latency today may find its advantage eroded within 18 months by a new NVIDIA SKU. Sustainable differentiation requires either an architectural moat (Cerebras's wafer scale, Groq's deterministic scheduling) or a customer moat (SambaNova's enterprise lock-in, Tenstorrent's open-source ecosystem) that NVIDIA cannot easily replicate.

Wafer-Scale Integration
Manufacturing a single chip that spans an entire silicon wafer, rather than dicing the wafer into individual chips. Eliminates off-chip communication latency at the cost of yield management complexity.
Tensor Streaming Processor
Groq's architecture where data movement is scheduled at compile time with zero hardware caches. Enables perfectly predictable latency and near-100% memory bandwidth utilization.
Reconfigurable Dataflow Unit (RDU)
SambaNova's programmable chip architecture where data movement paths can be reconfigured at runtime, enabling optimization for specific model architectures without a full ASIC redesign.

Lesson 4 Quiz

The Startup Challengers — four questions
1. What is the defining architectural feature of Cerebras's WSE-3 chip?
Correct. The WSE-3's defining feature is wafer-scale integration — a single chip spanning a full silicon wafer with 4 trillion transistors and 900,000 cores.
Incorrect. Cerebras WSE-3's defining feature is wafer-scale integration — a chip spanning an entire silicon wafer with 4 trillion transistors, 900,000 cores, and 44 GB of on-chip SRAM.
2. Groq was founded by someone with a significant prior role in AI chip history. Who is that founder?
Correct. Jonathan Ross proposed the original TPU at Google, then left after TPU v1 shipped to found Groq and build a fundamentally different AI chip architecture.
Incorrect. Groq was founded by Jonathan Ross — the Google engineer who first proposed the TPU project. Jim Keller leads Tenstorrent; Andrew Feldman founded Cerebras; Kunle Olukotun co-founded SambaNova.
3. What market segment has SambaNova primarily pursued, distinguishing it from Cerebras and Groq?
Correct. SambaNova targets enterprises and national labs (Argonne, Oak Ridge) that need on-premises AI for data sensitivity or sovereignty reasons — a segment where cloud-based competitors have limited reach.
Incorrect. SambaNova's primary focus is enterprise on-premises AI appliances — selling to national labs and financial institutions that cannot send sensitive data to public cloud.
4. What common competitive strategy unites Cerebras, Groq, and the other AI chip startups against NVIDIA?
Correct. The startups are all betting that as AI shifts to real-time interactive applications, latency becomes the key metric — an area where NVIDIA's throughput-optimized architecture is less competitive.
Incorrect. The common thread is a latency bet: NVIDIA optimizes throughput for batch jobs; the startups target time-to-first-token for interactive AI applications like voice and coding assistants.

Lab 4 — The Startup Survival Analysis

Can any AI chip startup survive NVIDIA's inevitable response?

Your Mission

You are a venture capital analyst evaluating AI chip startups. Your fund has been asked to assess whether Cerebras, Groq, SambaNova, or Tenstorrent has a defensible long-term position — or whether NVIDIA will inevitably erode their advantages as it has done to GPU challengers historically.

Starter: "Of Cerebras, Groq, SambaNova, and Tenstorrent, which one has the most defensible long-term position against NVIDIA, and why?"
AI Lab Assistant
AI Chip Startup Analysis
Welcome to Lab 4. We're analyzing the strategic positioning of AI chip startups — Cerebras, Groq, SambaNova, and Tenstorrent — against NVIDIA and against each other. Ask me about their architectures, business models, customer traction, or competitive vulnerabilities. Let's think through who actually has a defensible position.

Module 4 Test

The New Entrants — 15 questions · 80% to pass
1. Google's TPU v1 was a secret for approximately how long before public disclosure?
Correct. TPU v1 was deployed in 2015 and not disclosed until Google I/O May 2016.
Incorrect. TPU v1 was deployed in 2015 and disclosed one year later at Google I/O 2016.
2. The ISCA 2017 TPU paper reported what performance advantage over contemporary GPUs for neural network inference?
Correct. Google's ISCA 2017 paper reported that TPU v1 ran neural net inference roughly 15–30x faster than contemporary CPUs and GPUs, with 30–80x higher performance-per-watt.
Incorrect. The ISCA 2017 paper reported a 15–30x performance advantage for the TPU v1 over contemporary CPUs and GPUs, along with 30–80x higher performance-per-watt.
3. How many chips does a Google TPU v4 pod contain?
Correct. TPU v4 pods contain 4,096 chips connected via Google's proprietary optical interconnect.
Incorrect. TPU v4 pods contain 4,096 chips. PaLM 540B was trained on 6,144 TPU v4 chips spread across two such pods.
4. AWS's Inferentia first became generally available through which EC2 instance family?
Correct. Inf1 instances launched in December 2019 were the first general availability path for AWS Inferentia.
Incorrect. AWS Inferentia became generally available via Inf1 instances in December 2019. Trn1 is the Trainium instance family.
5. What AWS-proprietary network technology links Trainium chips across multiple instances in a training cluster?
Correct. EFA (Elastic Fabric Adapter) is AWS's custom RDMA-style network for linking Trainium chips across Trn1 instances in training clusters.
Incorrect. EFA (Elastic Fabric Adapter) is AWS's custom high-bandwidth network interface used to cluster Trainium chips across instances. NeuronLink is the chip-to-chip interconnect inside a single instance.
6. The Trainium2 UltraCluster can scale to how many chips maximum?
Correct. AWS announced Trainium2 UltraClusters scaling to 65,536 chips — the configuration claimed to train a 300B parameter model in roughly 2 weeks.
Incorrect. Trainium2 UltraClusters scale to 65,536 chips, announced at re:Invent 2023.
7. What fabrication process node was used for both Azure Maia 100 and Azure Cobalt 100?
Correct. Both Maia 100 and Cobalt 100 were fabricated by TSMC on their 5nm process node.
Incorrect. Both Azure Maia 100 and Cobalt 100 were built on TSMC's 5nm process.
8. Why does Azure Maia 100 prioritize memory bandwidth over raw FLOP count?
Correct. Autoregressive transformer inference requires repeatedly reading billions of weight parameters — the bottleneck is memory access, not arithmetic.
Incorrect. Maia prioritizes memory bandwidth because large model inference is memory-bandwidth-bound — accessing weight data is the real bottleneck for token generation.
9. Cerebras WSE-3 contains how many AI-optimized cores?
Correct. WSE-3 contains 900,000 AI-optimized cores across its wafer-scale design.
Incorrect. Cerebras WSE-3 contains 900,000 AI-optimized cores with 44 GB of on-chip SRAM.
10. What distinguishes Groq's LPU architecture from a GPU in terms of memory hierarchy?
Correct. Groq's tensor streaming processor has no hardware caches — the compiler schedules all data movement at compile time, producing perfectly deterministic latency.
Incorrect. Groq's defining feature is zero hardware caches — its compiler schedules all data movement explicitly, eliminating cache overhead and producing deterministic inference latency.
11. Which AI chip startup founder originally proposed building the TPU at Google?
Correct. Jonathan Ross was the Google engineer who first proposed the TPU, then left to found Groq after TPU v1 shipped.
Incorrect. Jonathan Ross (Groq) originally proposed the TPU at Google. Jim Keller leads Tenstorrent; Andrew Feldman founded Cerebras; Kunle Olukotun co-founded SambaNova.
12. Which of the following companies invested in Groq's Series D round in 2024?
Correct. Groq's $640M Series D (August 2024) included Samsung, BlackRock, and Saudi Arabia's Public Investment Fund at a $2.8B valuation.
Incorrect. Groq's Series D investors included Samsung, BlackRock, and Saudi Arabia's Public Investment Fund — a notably geopolitically diverse investor group.
13. What is SambaNova's primary differentiating market focus versus Cerebras and Groq?
Correct. SambaNova targets enterprises and national laboratories that need on-premises AI — customers like Argonne and Oak Ridge National Laboratories.
Incorrect. SambaNova focuses on on-premises enterprise AI appliances — selling full-stack systems to national labs and financial institutions unable to use public cloud for sensitive workloads.
14. The investments from Hyundai and Samsung primarily reflected interest in which aspect of Tenstorrent's strategy?
Correct. Korean industrial investors were attracted by Tenstorrent's RISC-V architecture and open-source ecosystem for AI in automotive and industrial applications.
Incorrect. Hyundai and Samsung's investment reflected interest in Tenstorrent's RISC-V-based chips for on-device AI in automotive and industrial applications — markets where these companies operate.
15. What shared competitive strategy unites Cerebras, Groq, and other AI chip startups against NVIDIA?
Correct. The startup challengers are collectively betting that real-time interactive AI applications will reward latency optimization over throughput — a different performance axis than NVIDIA's historical strength.
Incorrect. The common thread is a latency bet: as AI shifts to interactive voice, coding, and agent applications, time-to-first-token becomes the key metric — an area where the startups claim architectural advantages over NVIDIA's throughput-optimized GPUs.