Lesson 1 · Google's TPU Strategy

The Problem That Forced Google's Hand

How a single internal projection in 2013 changed the trajectory of AI hardware forever.

Why would one of the world's largest technology companies decide it needed to design its own chip, and what did that decision cost and gain?

In 2013, a team inside Google ran a quiet calculation. If the company's users started using voice search with speech recognition powered by deep neural networks for just three minutes per day, Google would need to double its entire global data center capacity to keep up — just to handle the inference workload. The hardware they had, built for search and ads, was the wrong tool entirely.

The team brought the finding to Jeff Dean, Google's most senior engineer, and the direction that emerged was unusual for a software company: build the chip yourself.

The Specific Gap GPUs Could Not Fill

In 2013, GPUs were already well-established for training neural networks. NVIDIA's Kepler-generation cards were the workhorses of academic deep learning labs. But Google's problem was not training — it was inference: running already-trained models millions of times per second, at low latency, across data centers that consumed electricity at the scale of small cities.

GPUs are optimized for throughput across a wide range of workloads. They carry significant die area devoted to graphics rendering logic, texture units, and rasterization pipelines that are completely irrelevant to matrix multiplication for neural nets. For inference at Google's scale, this meant paying for energy and silicon that delivered no useful compute.

A custom chip purpose-built for the narrow mathematical operations that neural networks actually use — primarily large matrix multiplications and additions — could in principle be far more efficient per operation. The trade-off was specialization: a chip that could do one thing extremely well but little else.

The 2013 Trigger

Google's internal projection estimated that three minutes of daily neural-network-powered voice search per user would require doubling data center capacity globally. This single calculation justified the entire TPU program's existence.

The Decision to Go Internal

Rather than waiting for an external vendor to build what it needed, Google made the call to design in-house. The project was named TPU (Tensor Processing Unit), a reference to the tensors that are TensorFlow's core data structure. The first-generation chip was designed, verified, built, and deployed in Google data centers in roughly 15 months, an extraordinary timeline for custom silicon, achieved partly by using an existing memory and interface design rather than building everything from scratch.

The TPU v1 was not a training chip at all. It was a pure inference accelerator: an 8-bit integer matrix multiply unit with 65,536 multiply-accumulate units arranged in a systolic array. It had 28MB of on-chip memory and 34 GB/s of bandwidth to its off-chip DDR3 DRAM. Google deployed it quietly starting in 2015. Users interacting with Google Search, Street View, and Google Photos were already running on TPU hardware before the chip was publicly disclosed.

The first public disclosure came in May 2016 when Google CEO Sundar Pichai mentioned TPUs briefly at Google I/O. The detailed technical paper — "In-Datacenter Performance Analysis of a Tensor Processing Unit" — appeared in April 2017, authored by Norman Jouppi and colleagues, revealing performance figures that surprised the hardware community.

What the Numbers Actually Showed

The 2017 paper compared TPU v1 against contemporary server-class CPUs (Haswell Xeon) and GPUs (NVIDIA K80) on Google's six most inference-heavy production workloads. The TPU v1 ran these workloads 15–30× faster than those contemporaries and delivered 30–80× higher performance per watt. On raw throughput measured in TOPS (tera operations per second), TPU v1 peaked at 92 TOPS while drawing roughly 40 W when busy (its TDP was 75 W), or roughly 2.3 TOPS per watt, versus the K80's roughly 0.3 TOPS per watt at inference precision.

These figures came with important caveats: the comparison was on Google's specific production workloads, at 8-bit integer precision, where the TPU's systolic array had a structural advantage. For training, or for workloads requiring floating-point flexibility, the comparison would have looked very different. But for the specific problem Google needed to solve — inference at massive scale — the numbers were decisive.
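The headline throughput figure can be reconstructed from the size of the array. The following is a minimal sketch, assuming the 700 MHz clock reported in the 2017 paper (a figure not stated above) and the common convention that one multiply-accumulate counts as two operations. The 40 W power figure is the one quoted in the text, and the function names are illustrative.

```python
# Back-of-envelope check of the TPU v1 throughput figures quoted above.
# Assumptions (not from the lesson text): a 700 MHz clock, and the usual
# convention that one multiply-accumulate (MAC) counts as two operations.

MAC_UNITS = 256 * 256   # 65,536 MAC units in the systolic array
CLOCK_HZ = 700e6        # assumed clock rate
OPS_PER_MAC = 2         # one multiply plus one add


def peak_tops(mac_units: int, clock_hz: float, ops_per_mac: int = OPS_PER_MAC) -> float:
    """Peak throughput in tera-operations per second."""
    return mac_units * clock_hz * ops_per_mac / 1e12


def tops_per_watt(tops: float, watts: float) -> float:
    """Efficiency metric: throughput divided by power draw."""
    return tops / watts


tpu_v1_tops = peak_tops(MAC_UNITS, CLOCK_HZ)
print(f"peak throughput: {tpu_v1_tops:.2f} TOPS")
print(f"efficiency at 40 W: {tops_per_watt(tpu_v1_tops, 40.0):.2f} TOPS/W")
```

Under these assumptions the result lands just under 92 TOPS and close to 2.3 TOPS per watt, which matches the rounded figures in the text. It is a peak number: real workloads reach it only when the array is kept fully fed.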

Key Terms

Systolic Array: A grid of processing elements that pass data through in a wave-like rhythm, allowing large matrix multiplications to be computed with very few memory accesses. TPU v1's 256×256 systolic array is its defining structural feature.

Inference vs. Training: Training adjusts a model's weights from data. It is computationally intensive and done once or periodically. Inference runs the trained model on new inputs and is done billions of times. TPU v1 was designed exclusively for inference.

TOPS per Watt: Tera Operations Per Second divided by power consumption. The efficiency metric that matters most for large-scale inference deployments where electricity is a primary operating cost.
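The wave-like data movement in a systolic array is easier to see in a small simulation. The sketch below is a toy, cycle-by-cycle model in plain Python, not a description of Google's hardware. For simplicity it uses an output-stationary layout, in which each processing element accumulates one output value. TPU v1 is generally described as weight-stationary, with weights held in place while activations stream through, so treat this only as an illustration of the principle. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(a, b):
    """Multiply matrices a (n x k) and b (k x m) on a simulated n x m grid.

    Rows of `a` enter from the left edge and move right one cell per cycle;
    columns of `b` enter from the top edge and move down one cell per cycle.
    Inputs are skewed in time so matching operands meet in the right cell.
    """
    n, k_dim, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]      # one accumulator per cell
    a_reg = [[0] * m for _ in range(n)]    # operand currently moving right
    b_reg = [[0] * m for _ in range(n)]    # operand currently moving down

    for t in range(k_dim + n + m - 2):     # cycles until the last operands meet
        # Visit cells from bottom-right to top-left so that each cell reads
        # the value its neighbour held on the previous cycle.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:                 # left edge: inject from memory
                    k = t - i
                    a_in = a[i][k] if 0 <= k < k_dim else 0
                else:                      # interior: take from left neighbour
                    a_in = a_reg[i][j - 1]
                if i == 0:                 # top edge: inject from memory
                    k = t - j
                    b_in = b[k][j] if 0 <= k < k_dim else 0
                else:                      # interior: take from upper neighbour
                    b_in = b_reg[i - 1][j]
                a_reg[i][j], b_reg[i][j] = a_in, b_in
                acc[i][j] += a_in * b_in   # one multiply-accumulate per cycle
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each input value is read from memory once, at an edge of the grid, and is then reused by every cell it passes through. That reuse is the source of the "very few memory accesses" property in the definition above.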

The Strategic Implication

The decision to build TPU v1 was not primarily about competitive advantage over other AI companies — in 2013 there were few. It was about Google's own operational economics. At Google's scale, even a 10× improvement in compute efficiency translates to billions of dollars in avoided capital expenditure on servers, cooling, and power infrastructure over a multi-year horizon.

But the secondary effect proved more significant: once Google had the internal capability to design AI chips, it could iterate. TPU v2, announced in 2017, added training capability and floating-point support. TPU v3, announced in 2018, doubled the performance of v2. TPU v4, deployed at scale in 2021, was housed in 4,096-chip "pods" capable of an aggregate 1 exaFLOP. The 2013 inference problem, and the organizational response to it, seeded a decade-long hardware program that became a foundational competitive asset.

15

Months from greenlight to data center deployment for TPU v1

15–30×

Performance advantage over contemporary CPUs/GPUs on inference workloads (2017 paper)

2015

Year Google began running production traffic on TPUs — before any public announcement

Lesson 1 Quiz

The Problem That Forced Google's Hand

What specific internal finding in 2013 triggered Google's decision to build a custom AI chip?

Correct. The 2013 projection about voice search inference load was the direct catalyst. It made clear that general-purpose hardware could not scale economically to meet Google's neural network workloads.

Not quite. The driver was Google's own internal operational economics — specifically the inference compute cost of neural-network-powered voice search at scale.

TPU v1 was designed primarily as a what?

Correct. TPU v1 was exclusively an inference chip. It could not train models. Its 8-bit integer systolic array was optimized for running already-trained networks at high throughput.

Not correct. TPU v1 had no training capability. It was a purpose-built inference accelerator — training capability only came with TPU v2.

What is a systolic array, as used in TPU v1?

Correct. The systolic array is the core architectural innovation of TPU v1 — data flows through the array of multiply-accumulate units in a rhythmic wave, dramatically reducing the number of times data must be fetched from memory.

That's not it. A systolic array is a computational structure — a grid of multiply-accumulate units that process matrix operations efficiently by passing data through in a wave, minimizing memory accesses.

According to Google's 2017 technical paper, what performance advantage did TPU v1 show over contemporary CPUs and GPUs on production inference workloads?

Correct. The 2017 Jouppi et al. paper reported that TPU v1 ran Google's six heaviest inference workloads 15–30× faster than Haswell CPUs and K80 GPUs, with 30–80× better performance per watt, figures that surprised the hardware research community.

Not quite. The number was much larger: 15–30× on Google's specific production inference workloads at 8-bit integer precision, with an even larger gap in performance per watt. This was the figure that made the TPU program impossible to ignore.

Lab 1: The Build-vs-Buy Decision

Explore the economics and trade-offs of custom silicon with your AI tutor

Your Mission

Google's 2013 decision to build its own inference chip rather than buy more GPUs was a classic build-vs-buy inflection point. In this lab, you'll interrogate that decision with your AI tutor — examining what factors made custom silicon viable for Google but not for most organizations.

Starter questions: What scale of inference workload justifies building a custom chip? What did Google have that a startup in 2013 didn't? How did the 15-month timeline compare to typical silicon design cycles?

AI Tutor

Welcome to Lab 1. We're examining Google's 2013 decision to design the TPU rather than continue buying GPU and CPU capacity. This is one of the most consequential build-vs-buy decisions in tech history — what aspect would you like to dig into first? The economics, the technical trade-offs, or what made Google uniquely positioned to pull this off?

Lesson 2 · Google's TPU Strategy

From Inference Engine to AI Supercomputer

How each TPU generation redefined what Google could do — and what the competition had to answer.

What specific architectural changes made TPU v2, v3, and v4 progressively more powerful — and what problems did each generation exist to solve?

When Sundar Pichai stepped onto the stage at Google I/O in May 2017 and showed a slide labeled "TPU v2," the AI hardware community understood immediately: Google had not built a one-off chip. It had built a program. The v2 announcement came only weeks after the detailed v1 technical paper, and together the two disclosures showed an arc: inference chip, then training chip, and next, by implication, whatever came after that.

TPU v1: The Baseline (2015–2017)

TPU v1's design choices reflected its narrow mandate: inference only, at high throughput, at low power. Its 256×256 systolic array of 8-bit integer multiply-accumulate units could perform 92 trillion 8-bit operations per second. It connected to the host server via PCIe (like a traditional add-in card) rather than requiring a redesigned server architecture. It carried 28MB of on-chip SRAM, while its off-chip DDR3 DRAM supplied 34 GB/s of memory bandwidth.

The constraint was inflexibility: 8-bit integer precision is sufficient for inference on most trained models but completely inadequate for training, which requires the gradient calculations that need floating-point precision. TPU v1 could not backpropagate. It could only run forward passes.
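The precision trade-off can be illustrated with a few lines of arithmetic. The sketch below applies a simple symmetric 8-bit quantization scheme (chosen for illustration; it is not claimed to be the scheme TPU v1 used) to a dot product, then shows how a small gradient-sized value collapses to zero when it shares a scale with a large one. All names are hypothetical.

```python
def quantize(values, num_bits=8):
    """Symmetric linear quantization: map floats to signed integers.

    Returns (quantized_ints, scale), where value is approximately int * scale.
    """
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits
    max_abs = max(abs(v) for v in values) or 1.0   # avoid dividing by zero
    scale = max_abs / qmax
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, scale


def int8_dot(weights, activations):
    """Dot product computed with integer multiply-accumulates."""
    qw, w_scale = quantize(weights)
    qa, a_scale = quantize(activations)
    acc = sum(w * a for w, a in zip(qw, qa))       # wide integer accumulator
    return acc * w_scale * a_scale                 # rescale back to float


weights = [0.12, -0.5, 0.33, 0.9]
activations = [1.5, 0.25, -2.0, 0.75]
exact = sum(w * a for w, a in zip(weights, activations))
print("float result:", exact)
print("int8 result: ", int8_dot(weights, activations))

# A tiny gradient next to a large value rounds away to zero at 8 bits.
print(quantize([1.0, 1e-4])[0])
```

The forward-pass result survives quantization with a small error, which is why 8-bit integers are usually adequate for inference. The last line shows the training problem: updates that are orders of magnitude smaller than the largest value cannot be represented at all.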

TPU v2: Adding Training Capability (2017)

TPU v2 was a fundamental redesign. Its key changes:

Floating-point support. The v2 added bfloat16 (brain float 16) — a 16-bit floating-point format Google invented specifically for neural network training. Unlike IEEE float16, bfloat16 preserves the full 8-bit exponent range of float32, avoiding overflow/underflow problems during gradient descent while halving memory bandwidth requirements vs. float32.

High-bandwidth memory. The v2 introduced HBM (High Bandwidth Memory), stacked DRAM integrated close to the compute die, giving 600 GB/s of memory bandwidth. That is well over ten times the 34 GB/s of DDR3 bandwidth available to the v1 and far above what PCIe-attached DRAM could provide.

Custom interconnect. The v2 was no longer a PCIe card. It was a board-level component designed to be interconnected with other TPU v2 boards via a custom 2D torus network. Four TPU v2 chips formed a "TPU v2 board," and up to 64 boards (256 chips) could be networked into a "TPU v2 Pod" with a peak performance of 11.5 petaFLOPS.
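A 2D torus is a grid whose edges wrap around, so every chip has exactly four neighbours and no chip sits at a disadvantaged corner. The sketch below shows the neighbour and hop-count calculations for a generic torus and compares them with a plain mesh. The 4×4 grid size and the function names are illustrative assumptions, not TPU pod dimensions.

```python
def torus_neighbors(x, y, width, height):
    """The four directly connected chips of chip (x, y) in a 2D torus."""
    return [
        ((x + 1) % width, y),   # right (wraps from last column to first)
        ((x - 1) % width, y),   # left
        (x, (y + 1) % height),  # down
        (x, (y - 1) % height),  # up
    ]


def torus_hops(a, b, width, height):
    """Minimum number of links between chips a and b with wraparound."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, width - dx) + min(dy, height - dy)


def mesh_hops(a, b):
    """Minimum number of links in a grid without wraparound links."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


corner, opposite = (0, 0), (3, 3)
print("neighbours of corner chip:", torus_neighbors(0, 0, 4, 4))
print("mesh hops corner to corner: ", mesh_hops(corner, opposite))
print("torus hops corner to corner:", torus_hops(corner, opposite, 4, 4))
```

The wraparound links shorten the worst-case path between chips, which matters for collective operations that must move data between every pair of chips during training.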

Google opened TPU v2 to outside users through Google Cloud, announcing Cloud TPU in 2017 and launching it in beta in February 2018. This was the first time TPU compute was accessible outside of Google's internal workloads. The beta price was $6.50 per hour for one Cloud TPU, a four-chip board.

TPU v3: Scale and Liquid Cooling (2018)

TPU v3, announced at Google I/O 2018, more than doubled the FLOPS per chip of v2, doubled HBM capacity, and raised HBM bandwidth per chip to 900 GB/s. The performance increase came partly from a larger chip and partly from higher clock frequencies, which required a significant engineering departure: liquid cooling. TPU v3 was the first TPU generation to require liquid cooling rather than air cooling, constraining where and how pods could be deployed but enabling performance that air cooling could not sustain.

TPU v3 pods scaled to 1,024 chips, four times v2's 256-chip maximum and at more than double the per-chip performance, yielding pods capable of more than 100 petaFLOPS. The v3 was used heavily for Google Brain's research work and for training production models including BERT (Bidirectional Encoder Representations from Transformers), published by Jacob Devlin and colleagues at Google in October 2018.

TPU v4: The Exascale Era (2021–2023)

TPU v4 represented the most significant architectural advance since v2. Google disclosed details in a 2023 paper presented at ISCA (the International Symposium on Computer Architecture). Key characteristics:

Optical circuit switching. TPU v4 pods used optical circuit switches (OCS) instead of fixed copper interconnects to connect chips. This allowed the network topology to be reconfigured in software — any chip could be connected to any other chip within the pod without physical rewiring. The flexibility enabled Google to route traffic around failed chips, dramatically improving pod-level utilization compared to fixed topologies where a single failed chip can strand an entire segment.

Scale. TPU v4 pods contained 4,096 chips and achieved approximately 1 exaFLOP of aggregate performance (at bfloat16 precision). Google noted in the 2023 paper that these were the largest deployed AI computing systems at the time of their use.

Production training workloads. TPU v4 was used to train the 540-billion-parameter PaLM (Pathways Language Model), as well as its successors. PaLM training used two TPU v4 pods run in parallel across Google data centers, connected by Google's Jupiter data center networking fabric.
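The utilization benefit of a reconfigurable interconnect can be sketched with a deliberately simplified model. Below, a pod is a one-dimensional row of blocks, some of which have failed. With fixed wiring a job needs adjacent healthy blocks; with a reconfigurable switch any healthy blocks can be joined. This is a toy illustration of the idea only. The block abstraction, the one-dimensional layout, and the function names are all assumptions, and it does not model Google's scheduler or the real pod topology.

```python
def contiguous_allocation(healthy, needed):
    """Fixed wiring: first run of `needed` adjacent healthy blocks, or None."""
    run_start, run_length = 0, 0
    for index, ok in enumerate(healthy):
        if not ok:
            run_length = 0          # a failed block breaks the run
            continue
        if run_length == 0:
            run_start = index
        run_length += 1
        if run_length == needed:
            return list(range(run_start, run_start + needed))
    return None


def reconfigurable_allocation(healthy, needed):
    """Switched wiring: any `needed` healthy blocks can be linked together."""
    chosen = [index for index, ok in enumerate(healthy) if ok][:needed]
    return chosen if len(chosen) == needed else None


# Eight blocks, two failed (positions 2 and 5).
pod = [True, True, False, True, True, False, True, True]

print("fixed wiring, 4 blocks:   ", contiguous_allocation(pod, 4))
print("reconfigurable, 4 blocks: ", reconfigurable_allocation(pod, 4))
```

In this example six of the eight blocks are healthy, yet the fixed layout cannot place a four-block job because no four healthy blocks are adjacent. The reconfigurable layout places it immediately.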

2015

TPU v1 deployed in Google data centers. 92 TOPS at 8-bit integer precision. Inference only. PCIe interface.

2017

TPU v2 announced at Google I/O. Adds bfloat16 training, HBM, custom 2D torus interconnect. 11.5 petaFLOP pods. Available on Google Cloud.

2018

TPU v3 announced. Doubles v2 FLOPS per chip. Requires liquid cooling. Used to train BERT.

2021

TPU v4 deployed. Optical circuit switching. 4,096-chip pods. ~1 exaFLOP per pod. Used to train PaLM 540B.

2023

TPU v5e and v5p announced. v5e targets cost-efficient inference and fine-tuning; v5p targets large-scale training with higher FLOPS and HBM per chip.

bfloat16: A Format Google Invented

bfloat16 (Brain Float 16) is a 16-bit floating-point format developed by Google Brain. It uses the same 8-bit exponent as float32 but only 7 mantissa bits (versus 23). This preserves dynamic range — critical during training — while cutting memory and bandwidth requirements in half. bfloat16 has since been adopted by Intel, ARM, NVIDIA (in Ampere and later architectures), and AMD. A format Google invented for internal use became an industry standard.
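Because bfloat16 keeps float32's sign and exponent bits, a float32 value can be converted by keeping only its top 16 bits. The sketch below does this with Python's standard `struct` module and compares the result with IEEE float16 on a very large and a very small value. It truncates rather than rounds, which is a simplification for illustration, since real hardware typically rounds to nearest. The helper names are hypothetical.

```python
import struct


def bfloat16_roundtrip(x: float) -> float:
    """Convert to float32, keep the top 16 bits, and convert back."""
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    bits16 = bits32 >> 16                 # sign + 8 exponent + 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]


def float16_roundtrip(x: float) -> float:
    """Convert to IEEE float16 and back; overflow is reported as infinity."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:                 # value exceeds float16's range
        return float("inf")


for value in (1e20, 1e-10, 3.14159):
    print(
        f"{value:>10g}  bfloat16={bfloat16_roundtrip(value):<14g}"
        f"float16={float16_roundtrip(value):g}"
    )
```

The large value overflows float16 and the small one flushes to zero, while bfloat16 keeps both within about one percent. The cost shows on the last value: with only 7 mantissa bits, bfloat16 holds fewer significant digits than float16 does.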

What the Evolution Reveals About Google's Strategy

The progression from TPU v1 to v4 follows a clear logic: each generation expanded the scope of what Google could do in-house without buying external compute. v1 eliminated the inference bottleneck. v2 allowed Google to train at scale without NVIDIA hardware. v3 accelerated the research cycle. v4 enabled models — like PaLM — that would have been economically impractical on any commercially available hardware at the time.

The optical circuit switching in v4 is particularly revealing. It represents a willingness to invest in infrastructure that is genuinely novel — not just a faster version of existing designs — because Google's workload scale justifies the engineering cost. Few organizations could amortize the development cost of custom optical networking across enough TPU deployments to make it worthwhile. Google could.

Lesson 2 Quiz

From Inference Engine to AI Supercomputer

What was the key architectural addition in TPU v2 that TPU v1 completely lacked?

Correct. TPU v1 was 8-bit integer only — sufficient for inference but unable to perform the gradient calculations that training requires. TPU v2 added bfloat16 floating-point support, making training possible.

Not quite. The systolic array was in v1. Optical switching came in v4. Liquid cooling was added in v3. The defining new capability of v2 was bfloat16 floating-point for training.

Why did Google invent bfloat16 rather than using the existing IEEE float16 standard?

Correct. The critical issue is the exponent: float32 has an 8-bit exponent; IEEE float16 has only a 5-bit exponent, which causes numerical instability (overflow and underflow) during gradient descent. bfloat16 keeps float32's 8-bit exponent, trading mantissa bits instead.

Not correct. The reason is numerical range. IEEE float16's 5-bit exponent is too narrow for the gradient magnitudes encountered during training, causing underflow and overflow. bfloat16 uses float32's 8-bit exponent to avoid this problem.

What was the defining new hardware feature of TPU v4 that distinguished it from v3?

Correct. TPU v4's optical circuit switches (OCS) allowed the network connecting 4,096 chips to be reconfigured in software, enabling traffic to route around failed chips and dramatically improving pod-level utilization.

Not correct. Liquid cooling came in v3. HBM came in v2. 8-bit inference was in v1. The defining innovation of v4 was optical circuit switching — software-reconfigurable interconnect topology.

Which major Google AI model was identified as having been trained on TPU v4 pods?

Correct. PaLM, Google's 540-billion-parameter language model, was trained on two TPU v4 pods run in parallel, connected via Google's Jupiter data center networking fabric.

Not correct. BERT was trained on TPU v3. AlphaFold 2 used TPUs but is not the disclosed v4 training case. PaLM, Google's 540B-parameter model, was explicitly identified as the flagship TPU v4 training workload.

Lab 2: Architecture Trade-offs Across Generations

Interrogate the design decisions behind each TPU generation with your AI tutor

Your Mission

Each TPU generation made specific architectural trade-offs — precision vs. flexibility, per-chip performance vs. power, fixed vs. reconfigurable interconnect. In this lab, explore why those specific choices were made and what they reveal about how Google's priorities evolved from 2015 to 2023.

Starter questions: Why is bfloat16 better for training than float16? What problem does optical circuit switching solve that copper interconnect can't? Why did training capability not matter for TPU v1 but matter enormously for v2?

AI Tutor

Welcome to Lab 2. We're examining the architectural choices across TPU generations — each generation solved a specific problem that the previous one couldn't. What would you like to explore first? The precision formats, the interconnect evolution, or why the shift from inference-only to training-capable was such a significant step?

Lesson 3 · Google's TPU Strategy

TPU vs. GPU: The Actual Competition

NVIDIA was not asleep. How the GPU maker responded to Google's chip program — and where the real boundaries between the two technologies lie.

Is the TPU really "better" than the GPU for AI — or is that the wrong question entirely? What do the actual head-to-head comparisons reveal?

When the first MLPerf Training results were published in December 2018, the AI hardware community finally had a standardized, apples-to-apples comparison framework. Google submitted results with TPU v3. NVIDIA submitted with V100. The headline numbers were close enough to make the question complicated: neither chip dominated the other across all tasks. Performance depended on batch size, model architecture, precision, and the specific benchmark configuration.

The real competitive dynamic, it turned out, was not about raw benchmark scores — it was about ecosystem lock-in, pricing power, and the question of who controls the software stack.

Where TPUs Have a Structural Advantage

TPUs are optimized for a specific computational pattern: large matrix multiplications at high throughput with high memory bandwidth. This pattern dominates transformer-based models — the architecture underlying GPT, BERT, PaLM, and virtually every large language model since 2017. For workloads that fit this pattern and can be expressed in JAX or TensorFlow XLA (the compiler that generates TPU-native code), TPUs can deliver superior performance per dollar on Google Cloud versus equivalent NVIDIA configurations.

Per-chip memory bandwidth is a closer contest. TPU v4 has 1.2 TB/s of HBM bandwidth per chip. The NVIDIA A100 (the dominant training GPU during the period TPU v4 was deployed) has about 2 TB/s in its 80GB HBM2e variant, which is numerically higher. Bandwidth is the relevant figure whenever a kernel is memory-bandwidth-bound rather than compute-bound, which is common for the low-arithmetic-intensity parts of transformer forward-backward passes, so on this metric alone the TPU holds no per-chip edge.
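Whether a workload is limited by memory bandwidth or by compute can be estimated with a simple roofline model: attainable throughput is the smaller of peak compute and bandwidth multiplied by arithmetic intensity (FLOPs performed per byte moved). The sketch below uses the bandwidth figures from the text. The peak compute figures (275 TFLOPS for TPU v4 and 312 TFLOPS for A100, both at 16-bit precision) are taken from public spec sheets and are assumptions not stated above, as is the example intensity of 100 FLOPs per byte.

```python
def attainable_tflops(peak_tflops, bandwidth_tb_s, flops_per_byte):
    """Roofline model: min(compute roof, bandwidth x arithmetic intensity)."""
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)


def ridge_point(peak_tflops, bandwidth_tb_s):
    """Arithmetic intensity (FLOPs/byte) above which a chip is compute-bound."""
    return peak_tflops / bandwidth_tb_s


# (peak TFLOPS, HBM bandwidth in TB/s); peak values are assumed spec-sheet figures.
chips = {
    "TPU v4": (275.0, 1.2),
    "A100 80GB": (312.0, 2.0),
}

intensity = 100.0  # hypothetical kernel: 100 FLOPs per byte of memory traffic
for name, (peak, bandwidth) in chips.items():
    print(
        f"{name}: ridge at {ridge_point(peak, bandwidth):.0f} FLOPs/byte, "
        f"attainable {attainable_tflops(peak, bandwidth, intensity):.0f} TFLOPS"
    )
```

At the example intensity both chips sit below their ridge points, so both are bandwidth-bound and the chip with more bandwidth delivers more. Above the ridge point the comparison flips to peak compute, which is why a single per-chip number rarely settles the question.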

At the pod level, however, the comparison changes. A TPU v4 pod (4,096 chips) with its optical circuit switching and custom interconnect achieves collective communication efficiency (the speed at which gradient updates can be averaged across all chips) that exceeds what NVIDIA's NVLink and InfiniBand configurations achieved at equivalent scales during the same period, according to Google's 2023 TPU v4 paper.
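Gradient averaging across chips is usually implemented as an all-reduce collective. One widely used algorithm is the ring all-reduce, sketched below as a plain-Python simulation. It is shown as a generic illustration of the operation and is not claimed to be the algorithm used on TPU pods or NVIDIA clusters. For simplicity each worker's gradient has exactly one element per worker, so each chunk is a single number; the function name is hypothetical.

```python
def ring_allreduce_mean(grads):
    """Average gradients across workers arranged in a ring.

    grads[w] is worker w's gradient vector. Its length must equal the number
    of workers (one chunk per worker), a simplification for this sketch.
    """
    n = len(grads)
    data = [list(g) for g in grads]       # data[worker][chunk]

    # Phase 1, reduce-scatter: each step, every worker passes one partial sum
    # to its right-hand neighbour. After n - 1 steps worker w holds the
    # complete sum for chunk (w + 1) % n.
    for step in range(n - 1):
        sends = [(w, (w - step) % n) for w in range(n)]
        values = [data[w][chunk] for w, chunk in sends]  # snapshot before writing
        for (w, chunk), value in zip(sends, values):
            data[(w + 1) % n][chunk] += value

    # Phase 2, all-gather: completed chunks circulate around the ring until
    # every worker has every sum.
    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n) for w in range(n)]
        values = [data[w][chunk] for w, chunk in sends]
        for (w, chunk), value in zip(sends, values):
            data[(w + 1) % n][chunk] = value

    return [[total / n for total in row] for row in data]


print(ring_allreduce_mean([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Each worker sends 2 × (n − 1) chunks in total, and each chunk is 1/n of the vector, so the data sent per worker stays roughly constant as the ring grows. The number of sequential steps does grow with n, which is one reason interconnect latency and topology matter at pod scale.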

Where NVIDIA GPUs Retain Clear Advantages

The GPU's fundamental advantage is flexibility and ecosystem breadth. CUDA, NVIDIA's programming platform, has been developed since 2006 and has an installed base of developers, libraries, and frameworks that simply does not exist for TPUs. PyTorch — which became the dominant deep learning research framework by approximately 2020 — runs natively on CUDA and required significant adaptation to run on TPUs via PyTorch/XLA (an integration that remains less seamless than native CUDA as of this writing).

This ecosystem gap has real consequences. A researcher or startup that writes PyTorch code can run it on any NVIDIA GPU without modification. Running the same code on a TPU requires either porting to JAX/TensorFlow or using PyTorch/XLA with its additional complexity. For the vast majority of the AI development community outside Google, this makes NVIDIA the default choice regardless of raw performance comparisons.

Additionally, GPUs are general-purpose enough to handle workloads that don't fit the TPU's matrix-multiplication-heavy sweet spot: irregular compute patterns, sparse models, reinforcement learning with non-differentiable operations, computer graphics and simulation tasks that may be co-located with AI workloads. The TPU has no competitive answer to these use cases.

Finally, GPUs are available everywhere. A researcher at any university can rent NVIDIA GPU time on AWS, Azure, or Google Cloud. TPUs are exclusively available on Google Cloud — there is no other vendor that sells TPU access. This captivity is by design from Google's perspective (it drives cloud adoption) but it is a genuine limitation for portability.

NVIDIA's Response: The H100 and NVLink Switch

NVIDIA did not ignore the TPU program. The H100 (announced March 2022, shipping from late 2022) introduced the Transformer Engine, which pairs FP8-capable Tensor Cores with software that chooses, layer by layer, between FP8 and 16-bit precision. It directly targets the transformer workload pattern where TPUs had excelled. The NVLink Switch System, introduced with H100, allows up to 256 H100 GPUs to be connected with 900 GB/s of bidirectional bandwidth per GPU, competing with TPU pod-level collective communication efficiency. NVIDIA has explicitly acknowledged transformer workloads as the design target for H100 architecture decisions.

The MLCommons / MLPerf Evidence

MLPerf Training benchmarks provide the most systematic public comparison. In the MLPerf Training v3.1 results (released November 2023), both Google (TPU v5p) and NVIDIA (H100) submitted results. On the GPT-3 175B training benchmark — the most relevant large-model task — Google submitted a result using 12,288 TPU v5p chips. NVIDIA submitted using configurations of H100s in NVIDIA DGX H100 systems. Both achieved competitive time-to-train figures; neither was decisively faster on a per-chip or per-dollar basis in the public comparison.

The nuance that MLPerf reveals: at the frontier model scale (hundreds of billions of parameters), the performance gap between optimally-configured TPU pods and optimally-configured H100 clusters is measured in percentages rather than multiples. The 15–30× advantage of TPU v1 over 2013-era GPUs existed because 2013-era GPUs were not designed for the task at all. Modern NVIDIA GPUs are explicitly designed for exactly the same workloads TPUs target.

Google's Actual Market Position

The TPU program has not displaced NVIDIA in the broader AI hardware market. NVIDIA's data center GPU revenue for fiscal year 2024 (ending January 2024) was approximately $47 billion. Google does not break out TPU revenue separately, but analysts estimate Google Cloud's AI infrastructure segment (including TPU-related revenue from Cloud TPU) at low single-digit billions annually. The mass market for AI compute remains NVIDIA's.

What Google has achieved is independence for its own AI workloads. Google does not pay NVIDIA for the compute that runs Google Search's AI features, Google Translate, Google Photos, Bard/Gemini, or the research that produced PaLM and Gemini. At Google's scale — tens of millions of TPU-hours per day of internal compute — the cost avoidance from not paying NVIDIA's margins on that compute is enormous, even if it cannot be precisely quantified from outside.

2006

Year NVIDIA launched CUDA, giving it a ~10-year ecosystem head start over the first public TPU disclosure in 2016

$47B

NVIDIA data center GPU revenue, fiscal year 2024 — the market Google chose not to compete in externally

TPU only

Google Cloud is the sole vendor offering TPU access — no AWS, Azure, or on-premise option exists

Lesson 3 Quiz

TPU vs. GPU: The Actual Competition

What is the primary structural advantage GPUs retain over TPUs for the broader AI development community?

Correct. CUDA's roughly decade-long head start in developer adoption, combined with PyTorch's dominance as a CUDA-native framework, means most AI researchers can run their code on any NVIDIA GPU without modification, a portability advantage TPUs fundamentally cannot match.

Not exactly. The raw FLOPS comparison is context-dependent. The ecosystem gap is the more durable and consequential advantage: CUDA + PyTorch gives GPUs reach that TPUs simply do not have in the broader development community.

What specific hardware feature did NVIDIA introduce in the H100 that directly targeted the workload where TPUs had previously excelled?

Correct. The H100's Transformer Engine is NVIDIA's direct architectural response to the transformer workload. It pairs FP8-capable Tensor Cores with software that chooses, layer by layer, between FP8 and 16-bit precision, targeting exactly the pattern where TPUs had a structural advantage.

Not correct. Optical switching is a TPU v4 feature. bfloat16 came in TPU v2 and was later adopted by NVIDIA. Systolic arrays are in TPUs. The H100's specific answer to transformer workloads was the Transformer Engine with FP8 precision support.

What does the MLPerf Training benchmark evidence suggest about the TPU vs. GPU performance gap at frontier model scale as of 2023?

Correct. The 15–30× advantage was real in 2017 because 2013-era GPUs weren't designed for inference workloads. By 2023, NVIDIA explicitly targets the same workloads TPUs target, and MLPerf results show both as competitive — with differences measured in percentages, not multiples.

Not supported by the evidence. Google continues to participate in MLPerf. The data from MLPerf v3.1 shows both systems competitive on GPT-3 training with neither decisively dominant — the era of TPU's double-digit-multiple performance edge over equivalent NVIDIA hardware has passed.

What is the most accurate characterization of what the TPU program actually achieved for Google, given NVIDIA's continued market dominance?

Correct. The TPU's strategic success is not measured in market share — Google never tried to sell TPUs as chips. It's measured in operational independence: Google's own AI products run on hardware Google controls, at a cost structure Google sets, without paying NVIDIA margins on tens of millions of TPU-hours per day.

Not quite. Google never pursued external chip sales and NVIDIA's market dominance is clear. The TPU's value is internal: it made Google self-sufficient for its own AI compute — Search, Translate, Photos, Gemini — at a cost advantage that compounds over years of operation.

Lab 3: The Competitive Landscape

Analyze the real boundaries between TPU and GPU dominance with your AI tutor

Your Mission

The TPU vs. GPU debate is often framed as a simple performance race. The reality is more nuanced: both chips are competitive at frontier scale, and the real differences lie in ecosystem, portability, and use-case fit. In this lab, dig into what the competition actually looks like and what it means for an organization choosing AI infrastructure.

Starter questions: If you were advising a startup building a large language model, would you recommend TPUs or GPUs — and why? What would it take for the TPU ecosystem to close the PyTorch/CUDA gap? How did NVIDIA's H100 specifically address the gaps that made TPUs competitive?

AI Tutor

Welcome to Lab 3. We're looking at the real competitive dynamics between Google's TPUs and NVIDIA's GPUs — not the marketing version, but the actual decision factors. Where would you like to start: the ecosystem gap, the MLPerf evidence, NVIDIA's H100 architectural response, or what an organization outside Google should actually choose for AI infrastructure?

Lesson 4 · Google's TPU Strategy

The Cloud Business, Axion, and What Google Is Actually Building

The TPU is not Google's only chip bet. Understanding the full picture of Google's silicon strategy — from Cloud TPU pricing to the Axion CPU and the Trillium generation.

If the TPU is primarily for internal use, why does Google rent TPU time to external customers — and what does Google's expansion into custom CPUs reveal about where the hardware race is actually heading?

At Google Cloud Next in April 2024, Thomas Kurian, Google Cloud's CEO, presented two products on the same stage: TPU v5p, the highest-performance TPU to date, now generally available, and Axion, Google's first custom ARM-based CPU for data center workloads. The pairing was not accidental. Google was telling a story about full-stack compute ownership: training chips, inference chips, and now the general-purpose processors that run everything else in the data center.

Cloud TPU: The External Revenue Layer

Google began offering Cloud TPU access to external customers with TPU v2, priced at $6.50 per hour for a four-chip Cloud TPU device in its 2018 beta. By 2024, the Cloud TPU catalog included v5e (optimized for cost-efficient inference and fine-tuning) and v5p (optimized for large-scale pre-training). The pricing structure reflects the market positioning:

TPU v5e is priced to compete with mid-tier GPU instances for inference workloads — particularly for organizations already invested in Google Cloud and using Google's Vertex AI platform. A single TPU v5e chip was priced at approximately $1.20 per chip-hour on demand in 2024, below the comparable cost of an H100 GPU instance on Google Cloud ($4.13/hr for a single H100 equivalent on GKE).

TPU v5p targets research organizations and large enterprises that need pre-training scale. Pricing is available primarily through committed-use contracts rather than on-demand, reflecting the pod-scale infrastructure commitment required.
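The v5e price comparison above only settles the question once throughput is taken into account. The sketch below works through that arithmetic using the two hourly prices quoted in this section. The 40% relative-throughput figure is purely hypothetical, chosen for illustration, and the function names are not from any Google tool.

```python
# Hourly prices quoted in the text above (2024, on demand, Google Cloud).
V5E_PER_CHIP_HOUR = 1.20
H100_PER_GPU_HOUR = 4.13


def breakeven_relative_throughput(accel_price: float, baseline_price: float) -> float:
    """Fraction of the baseline's throughput at which cost per unit of work is equal."""
    return accel_price / baseline_price


def cost_per_unit_work(price_per_hour: float, relative_throughput: float) -> float:
    """Hourly price divided by throughput, where the baseline's throughput is 1.0."""
    if relative_throughput <= 0:
        raise ValueError("relative_throughput must be positive")
    return price_per_hour / relative_throughput


breakeven = breakeven_relative_throughput(V5E_PER_CHIP_HOUR, H100_PER_GPU_HOUR)
print(f"v5e breaks even at {breakeven:.0%} of one H100's throughput")

# HYPOTHETICAL: suppose one v5e chip delivers 40% of an H100's throughput on some model.
assumed_ratio = 0.40
v5e_cost = cost_per_unit_work(V5E_PER_CHIP_HOUR, assumed_ratio)
h100_cost = cost_per_unit_work(H100_PER_GPU_HOUR, 1.0)
print(f"Cost per H100-hour of work: v5e ${v5e_cost:.2f} vs H100 ${h100_cost:.2f}")
```

Under these quoted prices, a v5e chip is the cheaper option for any workload where it sustains more than roughly 29% of an H100's throughput. Real comparisons also depend on memory capacity, batch size, and committed-use discounts, none of which this sketch models.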

The external Cloud TPU business serves multiple strategic purposes beyond direct revenue: it attracts AI research organizations to Google Cloud (creating switching costs and data gravity), it generates public benchmark data that validates TPU performance, and it gives Google visibility into what workloads external customers actually run — informing the roadmap for future TPU generations.

TPU v5 Family: Specialization Within a Generation

The TPU v5 generation marked a shift in how Google structures its chip lineup. Rather than a single chip covering all workloads, Google released two variants with different design points:

TPU v5e (announced August 2023): Optimized for efficiency. Lower per-chip FLOPS than v5p but significantly better FLOPS-per-dollar. Designed for serving, fine-tuning, and distillation workloads where cost matters more than raw throughput. In its product announcement, Google claimed up to 2× better training performance-per-dollar and up to 2.5× better inference performance-per-dollar versus TPU v4.

TPU v5p (announced December 2023): Optimized for performance. Highest per-chip FLOPS of any Google TPU at launch, with the highest HBM capacity and bandwidth per chip. Designed for pre-training frontier models. Google claimed more than 2× the FLOPS and 3× the HBM capacity of TPU v4. Available in pods of up to 8,960 chips.

This bifurcation mirrors a broader industry trend: the workloads of pre-training, fine-tuning, and inference have different enough compute and memory profiles that a single chip design is increasingly a compromise. NVIDIA made the same observation with the H100 (training) vs. L40S (inference) vs. L4 (cost-efficient inference) product segmentation.

Trillium: The Next Generation

In May 2024, Google announced "Trillium" — the sixth-generation TPU — at Google I/O. Trillium is also referred to as TPU v6. Google disclosed that Trillium delivers 4.7× the peak compute performance of TPU v5e per chip, with 2× the HBM capacity and 2× the inter-chip interconnect bandwidth. The chip uses an updated systolic array design and an upgraded matrix multiply unit (MXU) that supports a wider range of data types.
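To make the Trillium multipliers concrete, the sketch below applies them to a TPU v5e baseline. The baseline figures (197 peak bfloat16 TFLOPS and 16 GB of HBM per v5e chip) are assumptions taken from Google's published Cloud TPU specifications rather than from this lesson, and the multipliers are rounded marketing figures, so the results are approximate.

```python
# ASSUMED baseline figures for one TPU v5e chip (taken to be Google's
# published Cloud TPU specs; verify before relying on them).
V5E_PEAK_BF16_TFLOPS = 197.0
V5E_HBM_GB = 16.0

# Multipliers quoted in the Trillium announcement described above.
COMPUTE_MULTIPLIER = 4.7
HBM_MULTIPLIER = 2.0


def scale(baseline: float, multiplier: float) -> float:
    """Apply a generation-over-generation multiplier to a baseline spec."""
    return baseline * multiplier


trillium_tflops = scale(V5E_PEAK_BF16_TFLOPS, COMPUTE_MULTIPLIER)
trillium_hbm_gb = scale(V5E_HBM_GB, HBM_MULTIPLIER)

print(f"Implied Trillium peak compute: ~{trillium_tflops:.0f} TFLOPS (bf16)")
print(f"Implied Trillium HBM capacity: {trillium_hbm_gb:.0f} GB per chip")
```

Note that the comparison baseline is v5e, the efficiency-oriented chip, not v5p. A multiplier measured against the lower-FLOPS variant looks larger than the same chip's gain over v5p would.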

Trillium was described as available in Google's own infrastructure for training Gemini models, with Cloud TPU availability to follow. The timeline continues Google's approximately annual cadence of major TPU architectural announcements since v2 in 2017.

Axion: Google's Custom CPU

The April 2024 announcement of Axion introduced a new dimension to Google's silicon strategy. Axion is a custom ARM Neoverse V2-based CPU designed by Google for its data centers. It is not an AI accelerator — it runs operating systems, databases, web servers, and the "host" software that manages TPU workloads.

Google claimed Axion delivers up to 30% better performance than the fastest general-purpose ARM-based instances then available in the cloud and up to 50% better performance than comparable x86-based instances. Axion entered public preview on Google Kubernetes Engine (GKE) and Cloud Run in Q2 2024.

The strategic logic: a data center training job on TPU v5p still runs Linux, Python, orchestration software, data pipelines, and monitoring on conventional CPUs. If those CPUs are also custom-designed by Google for efficiency, the total cost and performance of the combined system improves. Google is not the first to follow this logic — AWS Graviton (ARM-based CPU) has been a significant cost advantage for Amazon Web Services since 2018 — but Axion signals Google's commitment to the same full-stack approach.

The Vertex AI Integration

Google's TPU strategy does not exist in isolation from its software stack. Vertex AI — Google Cloud's managed ML platform — is tightly integrated with both JAX and TensorFlow, the frameworks that compile to TPU-native code via XLA. A customer using Vertex AI for model training is steered toward TPU instances by the platform's default configurations. This vertical integration — chip, compiler, framework, managed platform — is Google's answer to NVIDIA's CUDA + cuDNN + TensorRT vertical stack. Neither is vendor-neutral; both are designed to maximize friction for customers who consider switching.

What Google's Silicon Strategy Actually Means

Taken together — TPU generations for AI acceleration, Axion for general compute, the Vertex AI software stack, and the tight integration with Google's global fiber network — Google's hardware program is an attempt to own the entire compute stack for AI workloads within its own infrastructure.

This is not a strategy aimed at selling chips. Google has no current plans to license TPU designs or sell TPU hardware to third parties, unlike Intel, which sells Gaudi AI accelerators as standalone hardware. (Amazon takes the same rental-only approach as Google: Trainium and Inferentia are available only as AWS instances.) The TPU program is structurally defensive: it insulates Google's core AI products from NVIDIA's pricing power, gives Google research teams hardware they can modify for specific model architectures, and creates infrastructure differentiation that is difficult for competitors to replicate without equivalent investments in silicon design capability.

The risk in this strategy is also visible: Google's AI products are more dependent on JAX and XLA than any other company's products are on a proprietary framework. If the AI field moves toward model architectures that the systolic array design handles poorly, or if PyTorch's dominance further constrains the talent pool that can use Google's tools, the internal optimization advantage could become a limiting factor. These are not hypothetical risks — they are active concerns that appear in Google's own research publications on making TPUs more accessible to external developers.

2017

Cloud TPU announced: TPU v2 opened to external customers in early 2018 at $6.50 per Cloud TPU device-hour. First public competition with NVIDIA on cloud AI compute.

2021

TPU v4 pods deployed: 4,096 chips, ~1 exaFLOP, optical circuit switching. Used for PaLM 540B training.

Aug 2023

TPU v5e launched: cost-optimized chip for inference and fine-tuning. Up to 2.5× better inference perf/dollar vs TPU v4.

Dec 2023

TPU v5p launched: more than 2× the FLOPS and 3× the HBM of v4. Up to 8,960-chip pods. Frontier pre-training target.

Apr 2024

Axion CPU announced: ARM Neoverse V2-based, up to 30% better than best ARM cloud instances. Full-stack compute ownership.

May 2024

Trillium (TPU v6) announced: 4.7× peak compute vs v5e per chip. Used internally for Gemini training.

Lesson 4 Quiz

The Cloud Business, Axion, and What Google Is Actually Building

What is Axion, announced by Google in April 2024, and how does it fit into Google's silicon strategy?

Correct. Axion is a general-purpose CPU — not an AI accelerator. It runs the operating systems, databases, and orchestration software that surround TPU workloads. Its goal is to extend Google's cost and performance advantages beyond the AI accelerator layer to the entire compute stack.

Not correct. Axion is a CPU, not an AI accelerator. It targets general data center workloads — the "host" compute alongside TPU jobs — completing Google's vision of owning the full compute stack rather than just the AI acceleration layer.

What is the key design difference between TPU v5e and TPU v5p, and what does this split reveal about the AI hardware market?

Correct. The v5e/v5p bifurcation reflects a maturation of the AI hardware market: different phases of the ML lifecycle (pre-training, fine-tuning, inference) have different compute and memory profiles, making a single all-purpose chip increasingly a design compromise.

Not quite. Both v5e and v5p support training and inference at floating-point precision, and both are available commercially. The difference is optimization target: v5e maximizes performance-per-dollar for inference and fine-tuning; v5p maximizes raw training throughput for frontier model pre-training.

Why does Google offer Cloud TPU access to external customers, given that the TPU program was originally designed to solve Google's own internal compute problems?

Correct. The external Cloud TPU business is strategically multi-purpose: it drives Google Cloud adoption (platform lock-in via JAX and Vertex AI), produces public performance data, and gives Google insight into diverse external workloads. Direct chip revenue is secondary to these cloud strategy goals.

Not correct. Google has no stated intent to become a primary external chip supplier and has not announced plans to sell or license TPU hardware independently. The Cloud TPU business primarily serves Google's cloud platform strategy — using TPU access as a differentiating draw for Google Cloud.

What is "Trillium" and what was announced about it at Google I/O in May 2024?

Correct. Trillium (TPU v6) was announced at Google I/O May 2024, offering 4.7× the compute performance of TPU v5e per chip along with doubled HBM and inter-chip bandwidth. Google disclosed it was already in use internally for Gemini model training.

Not correct. Trillium is the sixth-generation TPU chip. It was announced at Google I/O in May 2024, offering a substantial per-chip performance jump over TPU v5e and described as already deployed for Google's Gemini model training workloads.

Lab 4: The Full-Stack Silicon Strategy

Examine what Google's chip roadmap — TPU, Axion, Trillium — reveals about the future of AI infrastructure with your AI tutor

Your Mission

Google now designs custom chips at every layer of the stack: AI accelerators (TPU), general CPUs (Axion), and network infrastructure. This lab explores what this vertical integration strategy means — for Google's competitive position, for its cloud customers, and for the broader AI hardware industry.

Starter questions: What are the risks of Google's deep bet on JAX and XLA as the software foundation for TPUs? How does the v5e/v5p split compare to NVIDIA's H100/L40S split? If you were a hyperscaler competitor — AWS, Microsoft, or Meta — how would you respond to Google's full-stack silicon strategy?

AI Tutor

Welcome to Lab 4 — the final lab for this module. We're looking at the big picture: Google now designs the accelerator, the CPU, the interconnect, and the software compiler stack in-house. That's a remarkable degree of vertical integration. What aspect would you like to explore? The strategic risks, the competitive implications for AWS and Microsoft, how Axion changes the economics, or where Trillium fits in the roadmap?

Module Test

Google's TPU Strategy · 15 questions · Pass at 80%

1. In what year did Google begin running production workloads on TPU v1 hardware inside its data centers?

Correct. Google deployed TPU v1 in its data centers in 2015, running production traffic for Search, Street View, and Photos — over a year before any public announcement.

Not correct. TPU v1 was deployed in Google data centers in 2015, before Sundar Pichai's first public mention of it at Google I/O in May 2016.

2. What is the fundamental computational structure at the heart of TPU v1 that enables its inference efficiency?

Correct. TPU v1's defining structure is its 256×256 systolic array — 65,536 multiply-accumulate units that process matrix multiplications with minimal memory accesses by passing data in a wave-like rhythm through the array.

Not correct. The systolic array — specifically a 256×256 grid of 8-bit integer multiply-accumulate units — is the core structural feature of TPU v1 that enables its efficiency advantage for inference workloads.
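As a worked illustration of the "wave-like rhythm" described in the feedback above, here is a minimal pure-Python simulation of a systolic array computing a matrix product. It is a sketch under stated assumptions: it uses an output-stationary dataflow (each cell accumulates one output element) because that is the simplest to follow, whereas TPU v1 holds weights in place and streams activations past them. The function name and structure are illustrative only.

```python
def systolic_matmul(a, b):
    """Multiply a (n x k) by b (k x m) on a simulated n x m grid of MAC cells.

    Row i of `a` enters from the left edge delayed by i cycles; column j of `b`
    enters from the top edge delayed by j cycles. Each cycle, every cell
    multiplies the two values arriving at it, adds the product to its own
    accumulator, and passes the values on to its right and lower neighbours.
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per cell
    a_reg = [[0] * m for _ in range(n)]  # value each cell forwards to the right
    b_reg = [[0] * m for _ in range(n)]  # value each cell forwards downward

    for t in range(n + m + k - 2):       # cycles until the last operands meet
        # Sweep from the bottom-right so neighbours still hold last cycle's values.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:
                    step = t - i         # skewed feed from the left edge
                    a_in = a[i][step] if 0 <= step < k else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    step = t - j         # skewed feed from the top edge
                    b_in = b[step][j] if 0 <= step < k else 0
                else:
                    b_in = b_reg[i - 1][j]
                a_reg[i][j], b_reg[i][j] = a_in, b_in
                acc[i][j] += a_in * b_in  # the multiply-accumulate step
    return acc


result = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)

# The TPU v1 array described above is 256 x 256 cells.
print(256 * 256, "multiply-accumulate units")
```

The point of the structure is that each operand is read from memory once at the array edge and then reused as it moves from cell to cell, which is where the "minimal memory accesses" in the feedback comes from.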

3. The 2017 Google paper "In-Datacenter Performance Analysis of a Tensor Processing Unit" reported what performance advantage for TPU v1 on production inference workloads?

Correct. The Jouppi et al. 2017 paper reported that TPU v1 ran Google's six production inference workloads about 15–30× faster than contemporary Haswell CPUs and NVIDIA K80 GPUs, with roughly 30–80× better performance-per-watt.

Not correct. The reported figure was 15–30× — a result that surprised the academic hardware community and helped drive the broader industry's interest in purpose-built AI inference accelerators.

4. bfloat16 was developed by Google and differs from IEEE float16 primarily because:

Correct. The exponent difference is the key: float32 has an 8-bit exponent; IEEE float16 has only 5 bits, which means its dynamic range is insufficient for the gradient magnitudes in neural network training. bfloat16 keeps the 8-bit exponent, sacrificing mantissa bits instead.

Not correct. Both are 16-bit floating-point formats. The critical difference is the exponent: bfloat16 keeps float32's 8-bit exponent (preserving dynamic range for gradients), while IEEE float16's 5-bit exponent causes overflow and underflow during training.
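The exponent trade-off described in the feedback above can be checked at the bit level with the Python standard library. This is a minimal sketch: it converts to bfloat16 by simply truncating the low 16 bits of a float32, whereas real hardware typically rounds, so the bfloat16 values here may differ from hardware results in the last bit. The function names are illustrative.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Keep float32's sign, 8-bit exponent, and top 7 mantissa bits (truncation)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def to_float16(x: float) -> float:
    """Round-trip through IEEE float16 (5-bit exponent, 10-bit mantissa)."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        # struct refuses values beyond float16's maximum of 65504.
        return float("inf") if x > 0 else float("-inf")


tiny_gradient = 1e-8   # smaller than float16 can represent
large_value = 1e5      # larger than float16's maximum

for value in (tiny_gradient, large_value):
    print(f"{value:g}: float16 -> {to_float16(value)}, bfloat16 -> {to_bfloat16(value)}")
```

The small value underflows to zero in float16 and the large one overflows, while bfloat16 keeps both at reduced precision. That is the dynamic-range argument in practice: training tolerates a coarser mantissa much better than it tolerates gradients vanishing to zero.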

5. What type of memory did TPU v2 introduce that significantly increased memory bandwidth versus TPU v1's design?

Correct. HBM (High Bandwidth Memory) was introduced in TPU v2, providing 600 GB/s of memory bandwidth versus the 34 GB/s of TPU v1's DDR3 memory. HBM's stacked architecture places DRAM very close to the compute die, dramatically shortening the data path.

Not correct. TPU v2 introduced HBM (High Bandwidth Memory), a stacked DRAM format that sits beside the compute die in the same package. It reached 600 GB/s, more than 17 times the 34 GB/s of v1's on-board DDR3 DRAM.

6. TPU v3 required what significant infrastructure change compared to v2, as a direct consequence of its performance increase?

Correct. TPU v3 was the first TPU generation to require liquid cooling. The thermal envelope from higher clock frequencies and larger die area made air cooling insufficient — a constraint that limits where and how TPU v3 pods can be deployed.

Not correct. The infrastructure change for v3 was liquid cooling. The doubled FLOPS/chip required higher clock frequencies that generated too much heat for air cooling, making v3 the first liquid-cooled TPU — and constraining its deployment flexibility compared to v2.

7. What was the key innovation in TPU v4's pod interconnect that distinguished it from all previous TPU generations?

Correct. TPU v4's optical circuit switches (OCS) allow the interconnect topology to be reconfigured in software. Traffic can be dynamically rerouted around failed chips, improving pod-level utilization — a capability impossible with fixed copper topologies.

Not correct. Optical circuit switching is the v4 interconnect innovation. The OCS allows the topology connecting 4,096 chips to be reconfigured in software, enabling traffic rerouting around chip failures — something fixed copper interconnects cannot do.

8. According to Google's 2023 ISCA paper on TPU v4, which large language model was trained using two TPU v4 pods run in parallel?

Correct. Google's 2023 ISCA paper on TPU v4 discussed PaLM, its 540-billion-parameter model, which was trained using two TPU v4 pods running in parallel, connected via Google's Jupiter data center networking fabric.

Not correct. The 2023 ISCA paper on TPU v4 identified PaLM (540B parameters) as the flagship training workload. BERT was trained on v3; Gemini training details were not disclosed in the v4 paper.

9. What is the primary reason most AI researchers outside Google continue to use NVIDIA GPUs rather than TPUs, despite the TPU's documented performance-per-watt advantages on certain workloads?

Correct. The CUDA + PyTorch ecosystem is the dominant reason GPU adoption persists. Code written in PyTorch runs on any NVIDIA GPU without modification. Running on TPUs requires JAX or PyTorch/XLA — a compatibility layer that adds complexity and is less seamless than native CUDA.

Not correct. The barrier is the software ecosystem, not price, training capability, or access restrictions. PyTorch's dominance as a research framework — combined with its native CUDA integration — creates an adoption moat for NVIDIA that performance benchmarks alone cannot overcome.

10. What specific hardware feature did NVIDIA introduce in the H100 to directly target transformer workloads, competing with TPU strengths?

Correct. The H100's Transformer Engine is NVIDIA's explicit architectural response to the transformer workload. It mixes FP8 and 16-bit precision across transformer layers, targeting the compute pattern that has come to dominate frontier AI models and directly challenging the workload where TPUs historically excelled.

Not correct. The H100's direct answer to transformer workloads is the Transformer Engine, which adds dedicated support for FP8 precision in transformer layers. NVIDIA added bfloat16 support in the Ampere A100, not the H100. Optical switching and systolic arrays are not GPU features.

11. What is Cloud TPU v5e specifically optimized for, in contrast to v5p?

Correct. The v5e/v5p split reflects the diverging requirements of inference/fine-tuning (where cost efficiency per operation matters most) versus frontier pre-training (where raw throughput per chip determines time-to-train). Google claimed up to 2.5× better inference perf/dollar for v5e versus v4.

Not correct. The distinction is inference/fine-tuning efficiency (v5e) versus maximum pre-training FLOPS (v5p). Both support the same software frameworks; neither is available for on-premise purchase. The split mirrors NVIDIA's H100/L40S differentiation for similar workload reasons.

12. Axion, announced at Google Cloud Next April 2024, is based on which processor architecture?

Correct. Axion is based on ARM's Neoverse V2 server core. Google claimed 30% better performance than the best available ARM cloud instances and 50% better than comparable x86 instances on the workloads it targets.

Not correct. Axion uses ARM's Neoverse V2 core, the same core design Amazon uses in Graviton4 and NVIDIA uses in its Grace CPU. Google customized it for its specific data center requirements, claiming substantial performance advantages over both ARM and x86 cloud alternatives.

13. What was the name and generation number of the TPU announced at Google I/O in May 2024, and what was its claimed performance improvement?

Correct. Trillium, also referred to as TPU v6, was announced at Google I/O May 2024 with 4.7× the peak compute of TPU v5e per chip, 2× the HBM capacity, and 2× the inter-chip interconnect bandwidth. Google disclosed it was already deployed internally for Gemini training.

Not correct. The chip announced at Google I/O May 2024 was Trillium (TPU v6), with 4.7× the peak compute of TPU v5e per chip. There is no "Ironwood," "Quantum," or "Pathways Chip" in Google's disclosed roadmap as of that announcement.

14. What does MLPerf Training v3.1 benchmark data (November 2023) reveal about the performance gap between TPU v5 and NVIDIA H100 at frontier model training scale?

Correct. MLPerf v3.1 shows both TPU v5p and H100 submitting competitive results on the GPT-3 training benchmark, with neither decisively faster. The era of TPU having a double-digit-multiple advantage over NVIDIA hardware (as in 2017) has ended as NVIDIA has explicitly targeted the same workloads.

Not correct. Google participated in MLPerf v3.1 with TPU v5p results. The evidence shows both systems competitive on frontier training tasks, with differences measured in percentages — not the 15–30× gap of 2017, which existed because 2013-era GPUs were not designed for inference at all.

15. Which of the following best describes the overall strategic objective of Google's TPU program — from v1 in 2015 through Trillium in 2024?

Correct. The TPU program is fundamentally a strategy of independence and internal cost control, not external market capture. Google does not sell TPU chips; it uses them to run its own products — Search, Translate, Photos, Gemini — without paying NVIDIA margins on what is now tens of millions of TPU-hours per day of internal compute.

Not quite. Google has no announced plans to sell TPU chips or license its designs, and it has not displaced NVIDIA in the broader market. The strategy is internal: use custom silicon to control the economics and architecture of Google's own AI infrastructure — a goal achieved by independence from external chip vendors, not by competing with them.