In 2013, a team inside Google ran a quiet calculation. If the company's users started using voice search with speech recognition powered by deep neural networks for just three minutes per day, Google would need to double its entire global data center capacity to keep up — just to handle the inference workload. The hardware they had, built for search and ads, was the wrong tool entirely.
The team brought the finding to Jeff Dean, Google's most senior engineer, and the direction that emerged was unusual for a software company: build the chip yourself.
In 2013, GPUs were already well-established for training neural networks. NVIDIA's Kepler-generation cards were the workhorses of academic deep learning labs. But Google's problem was not training — it was inference: running already-trained models millions of times per second, at low latency, across data centers that consumed electricity at the scale of small cities.
GPUs are optimized for throughput across a wide range of workloads. They carry significant die area devoted to graphics rendering logic, texture units, and rasterization pipelines that are completely irrelevant to matrix multiplication for neural nets. For inference at Google's scale, this meant paying for energy and silicon that delivered no useful compute.
A custom chip purpose-built for the narrow mathematical operations that neural networks actually use — primarily large matrix multiplications and additions — could in principle be far more efficient per operation. The trade-off was specialization: a chip that could do one thing extremely well but little else.
Google's internal projection estimated that three minutes of daily neural-network-powered voice search per user would require doubling data center capacity globally. This single calculation justified the entire TPU program's existence.
Rather than waiting for an external vendor to build what it needed, Google made the call to design in-house. The project was code-named TPU (Tensor Processing Unit), a reference to TensorFlow's core data structure. The first-generation chip was designed, verified, built, and deployed in Google data centers in roughly 15 months, an extraordinary timeline for custom silicon, achieved partly by reusing an existing memory and interface design rather than building everything from scratch.
The TPU v1 was not a training chip at all. It was a pure inference accelerator: an 8-bit integer matrix multiply unit with 65,536 multiply-accumulate units arranged in a 256×256 systolic array. It had 28 MB of on-chip memory, backed by off-chip DDR3 DRAM at 34 GB/s. Google deployed it quietly starting in 2015. Requests to Google Search, Street View, and Google Photos were already being served on TPU hardware before the chip was publicly disclosed.
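To make the systolic-array idea concrete, here is a minimal Python sketch of a grid of multiply-accumulate cells computing a matrix product one clock cycle at a time. It is a simplified output-stationary model chosen for readability, not a description of the TPU's actual dataflow; the function name `systolic_matmul` and the tiny 2×2 inputs are illustrative assumptions.

```python
def systolic_matmul(a, b):
    """Multiply integer matrices a (n x k) and b (k x m) the way a grid of
    multiply-accumulate cells would, one clock cycle at a time.

    Simplified output-stationary model: cell (i, j) keeps its own running sum,
    and operands arrive skewed so that pair s reaches the cell at cycle i + j + s.
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]   # one accumulator per cell
    cycles = n + m + k - 2              # cycles until the last operand pair arrives
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                s = t - i - j           # which operand pair reaches cell (i, j) now
                if 0 <= s < k:
                    acc[i][j] += a[i][s] * b[s][j]
    return acc, cycles


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product, cycles = systolic_matmul(a, b)
print(product, cycles)  # [[19, 22], [43, 50]] 4
```

The point of the structure is that each cell only ever talks to its neighbors and its own accumulator, so a large grid can keep every multiplier busy without fetching operands from memory on every step.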
The first public disclosure came in May 2016 when Google CEO Sundar Pichai mentioned TPUs briefly at Google I/O. The detailed technical paper — "In-Datacenter Performance Analysis of a Tensor Processing Unit" — appeared in April 2017, authored by Norman Jouppi and colleagues, revealing performance figures that surprised the hardware community.
The 2017 paper compared TPU v1 against contemporary server-class CPUs (Haswell Xeon) and GPUs (NVIDIA K80) on Google's six most inference-heavy production workloads. On those workloads the TPU v1 ran 15–30× faster than its contemporaries and delivered 30–80× higher performance per watt. On raw throughput measured in TOPS (tera operations per second), TPU v1 peaked at 92 TOPS while drawing about 40 W under load, or roughly 2.3 TOPS per watt, versus the K80's roughly 0.3 TOPS per watt at inference precision.
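The headline throughput figure can be roughly reproduced from the size of the array. The sketch below assumes a 700 MHz clock, which is not stated in this text, and counts each multiply-accumulate as two operations; treat it as back-of-the-envelope arithmetic rather than a measured result.

```python
MAC_UNITS = 256 * 256   # 65,536 multiply-accumulate units (from the text)
OPS_PER_MAC = 2         # one multiply plus one add per cycle
CLOCK_HZ = 700e6        # assumption: 700 MHz clock, not stated in the text
POWER_W = 40            # power figure quoted in the text

peak_tops = MAC_UNITS * OPS_PER_MAC * CLOCK_HZ / 1e12
tops_per_watt = peak_tops / POWER_W

print(f"peak: {peak_tops:.1f} TOPS, efficiency: {tops_per_watt:.2f} TOPS/W")
```

Under these assumptions the result lands within rounding of the quoted 92 TOPS and 2.3 TOPS per watt.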
These figures came with important caveats: the comparison was on Google's specific production workloads, at 8-bit integer precision, where the TPU's systolic array had a structural advantage. For training, or for workloads requiring floating-point flexibility, the comparison would have looked very different. But for the specific problem Google needed to solve — inference at massive scale — the numbers were decisive.
The decision to build TPU v1 was not primarily about competitive advantage over other AI companies — in 2013 there were few. It was about Google's own operational economics. At Google's scale, even a 10× improvement in compute efficiency translates to billions of dollars in avoided capital expenditure on servers, cooling, and power infrastructure over a multi-year horizon.
But the secondary effect proved more significant: once Google had the internal capability to design AI chips, it could iterate. TPU v2, announced in 2017, added training capability and floating-point support. TPU v3, announced in 2018, doubled the performance of v2. TPU v4, deployed at scale in 2021, was housed in 4,096-chip "pods" capable of an aggregate 1 exaFLOP. The 2013 inference problem, and the organizational response to it, seeded a decade-long hardware program that became a foundational competitive asset.
Google's 2013 decision to build its own inference chip rather than buy more GPUs was a classic build-vs-buy inflection point. In this lab, you'll interrogate that decision with your AI tutor — examining what factors made custom silicon viable for Google but not for most organizations.
When Sundar Pichai stepped onto the stage at Google I/O in May 2017 and showed a slide labeled "TPU v2," the AI hardware community understood immediately: Google had not built a one-off chip. It had built a program. The v2 announcement came about a month after the detailed v1 technical paper, and together the two disclosures showed an arc: inference chip, then training chip, and next, by implication, whatever came after that.
TPU v1's design choices reflected its narrow mandate: inference only, at high throughput, at low power. Its 256×256 systolic array of 8-bit integer multiply-accumulate units could perform 92 trillion 8-bit operations per second. It connected to the host server via PCIe (like a traditional add-in card) rather than requiring a redesigned server architecture. It paired 28 MB of on-chip SRAM with off-chip DDR3 DRAM that supplied 34 GB/s of memory bandwidth.
The constraint was inflexibility: 8-bit integer precision is sufficient for inference on most trained models but inadequate for training, where gradient calculations need the dynamic range of floating-point arithmetic. TPU v1 could not backpropagate. It could only run forward passes.
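A small sketch shows why 8-bit integers are adequate for running a trained model but awkward for training it. It uses symmetric per-tensor quantization, one common scheme and not necessarily the one Google used; the weights and the update size are hypothetical values chosen for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.82, -0.33, 0.05, -1.27, 0.64]   # hypothetical trained weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# A hypothetical small weight update, expressed on the same integer grid.
tiny_update = 1e-4
tiny_update_code = round(tiny_update / scale)

print(q)                  # integer codes used at inference time
print(max_error)          # bounded by half of one quantization step
print(tiny_update_code)   # the update is too small to register on this grid
```

The forward pass tolerates the rounding error because it is bounded by half a step, while a training update far smaller than one step rounds to zero and is lost.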
TPU v2 was a fundamental redesign. Its key changes:
Floating-point support. The v2 added bfloat16 (brain float 16) — a 16-bit floating-point format Google invented specifically for neural network training. Unlike IEEE float16, bfloat16 preserves the full 8-bit exponent range of float32, avoiding overflow/underflow problems during gradient descent while halving memory bandwidth requirements vs. float32.
High-bandwidth memory. The v2 introduced HBM (High Bandwidth Memory), stacked DRAM integrated close to the compute die, giving 600 GB/s of memory bandwidth, far more than the v1's DDR3 interface could provide.
Custom interconnect. The v2 was no longer a PCIe card. It was a board-level component designed to be interconnected with other TPU v2 boards via a custom 2D torus network. Four TPU v2 chips formed a "TPU v2 board," and up to 64 boards (256 chips) could be networked into a "TPU v2 Pod" with a peak performance of 11.5 petaFLOPS, or about 45 teraFLOPS per chip.
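For readers unfamiliar with the term, a 2D torus is a grid whose edges wrap around, so every chip has exactly four neighbors and the farthest chip is never more than half the grid away in each dimension. The sketch below illustrates the addressing on a hypothetical 4×4 grid; the function names and grid size are assumptions for illustration, not details of Google's interconnect.

```python
def torus_neighbors(x, y, width, height):
    """Return the four neighbors of chip (x, y) in a 2D torus with wraparound links."""
    return [
        ((x - 1) % width, y),
        ((x + 1) % width, y),
        (x, (y - 1) % height),
        (x, (y + 1) % height),
    ]


def torus_hops(src, dst, width, height):
    """Minimum hop count between two chips, taking the shorter way around each ring."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    return min(dx, width - dx) + min(dy, height - dy)


# Hypothetical 4 x 4 torus: the corner chip still has four neighbors.
print(torus_neighbors(0, 0, 4, 4))       # [(3, 0), (1, 0), (0, 3), (0, 1)]
print(torus_hops((0, 0), (3, 3), 4, 4))  # 2, versus 6 on a mesh without wraparound
```

The wraparound links are what keep worst-case distances short as the grid grows, which matters when every chip must exchange gradients with every other chip.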
Google made TPU v2 available to external researchers via Google Cloud in 2017, marking the first time TPU compute was accessible outside of Google's internal workloads. The price was $6.50 per TPU-hour at launch.
TPU v3, announced at Google I/O 2018, roughly doubled the FLOPS-per-chip of v2 and raised HBM bandwidth per chip to 900 GB/s. The performance increase came partly from a larger chip and partly from higher clock frequencies, which required a significant engineering departure: liquid cooling. TPU v3 was the first TPU generation to require liquid cooling rather than air cooling, constraining where and how pods could be deployed but enabling performance that air cooling could not sustain.
TPU v3 pods scaled to 1,024 chips and, with the higher per-chip performance, delivered roughly 100 petaFLOPS. The v3 was used heavily for Google Brain's research work and for training production models including BERT (Bidirectional Encoder Representations from Transformers), published by Jacob Devlin and colleagues at Google in October 2018.
TPU v4 represented the most significant architectural advance since v2. Google disclosed details in a paper presented at ISCA in June 2023. Key characteristics:
Optical circuit switching. TPU v4 pods used optical circuit switches (OCS) instead of fixed copper interconnects to connect chips. This allowed the network topology to be reconfigured in software — any chip could be connected to any other chip within the pod without physical rewiring. The flexibility enabled Google to route traffic around failed chips, dramatically improving pod-level utilization compared to fixed topologies where a single failed chip can strand an entire segment.
Scale. TPU v4 pods contained 4,096 chips and achieved approximately 1 exaFLOP of aggregate performance (at bfloat16 precision). Google noted in the 2023 paper that these were the largest deployed AI computing systems at the time of their use.
Production training workloads. Google has disclosed that TPU v4 was used to train the PaLM (Pathways Language Model) 540-billion-parameter model, as well as its successors. PaLM training used two TPU v4 pods run in parallel across Google data centers, connected by Google's Jupiter data center networking fabric.
bfloat16 (Brain Float 16) is a 16-bit floating-point format developed by Google Brain. It uses the same 8-bit exponent as float32 but only 7 mantissa bits (versus 23). This preserves dynamic range — critical during training — while cutting memory and bandwidth requirements in half. bfloat16 has since been adopted by Intel, ARM, NVIDIA (in Ampere and later architectures), and AMD. A format Google invented for internal use became an industry standard.
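The bit layout described above can be demonstrated with the standard library alone. This sketch converts a float32 to bfloat16 by keeping the top 16 bits of its encoding; truncation is used here for simplicity, whereas real hardware typically rounds to nearest. The helper names are illustrative.

```python
import struct


def float32_to_bfloat16_bits(x):
    """Keep the top 16 bits of the float32 encoding: sign, 8 exponent, 7 mantissa.
    This truncates; hardware typically rounds to nearest instead."""
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits32 >> 16


def bfloat16_bits_to_float32(bits16):
    """Widen a bfloat16 bit pattern back to float32 by zero-filling the low 16 bits."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]


def to_bfloat16(x):
    """Round-trip a value through bfloat16 to see what precision survives."""
    return bfloat16_bits_to_float32(float32_to_bfloat16_bits(x))


print(hex(float32_to_bfloat16_bits(1.0)))  # 0x3f80
print(to_bfloat16(3.14159265))             # 3.140625: only 7 mantissa bits survive
print(to_bfloat16(1e38))                   # still finite: the float32 exponent range is kept
print(to_bfloat16(1e-30))                  # still nonzero: tiny gradients do not underflow
```

The last two lines are the reason the format suits training: values far outside the range of IEEE float16 remain representable, at the cost of coarser precision.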
The progression from TPU v1 to v4 follows a clear logic: each generation expanded the scope of what Google could do in-house without buying external compute. v1 eliminated the inference bottleneck. v2 allowed Google to train at scale without NVIDIA hardware. v3 accelerated the research cycle. v4 enabled models — like PaLM — that would have been economically impractical on any commercially available hardware at the time.
The optical circuit switching in v4 is particularly revealing. It represents a willingness to invest in infrastructure that is genuinely novel — not just a faster version of existing designs — because Google's workload scale justifies the engineering cost. Few organizations could amortize the development cost of custom optical networking across enough TPU deployments to make it worthwhile. Google could.
Each TPU generation made specific architectural trade-offs — precision vs. flexibility, per-chip performance vs. power, fixed vs. reconfigurable interconnect. In this lab, explore why those specific choices were made and what they reveal about how Google's priorities evolved from 2015 to 2023.
When the first MLPerf Training results were published in December 2018, the AI hardware community finally had a standardized, apples-to-apples comparison framework. Google submitted results with TPU v3. NVIDIA submitted with V100. The headline numbers were close enough to make the question complicated: neither chip dominated the other across all tasks. Performance depended on batch size, model architecture, precision, and the specific benchmark configuration.
The real competitive dynamic, it turned out, was not about raw benchmark scores — it was about ecosystem lock-in, pricing power, and the question of who controls the software stack.
TPUs are optimized for a specific computational pattern: large matrix multiplications at high throughput with high memory bandwidth. This pattern dominates transformer-based models — the architecture underlying GPT, BERT, PaLM, and virtually every large language model since 2017. For workloads that fit this pattern and can be expressed in JAX or TensorFlow XLA (the compiler that generates TPU-native code), TPUs can deliver superior performance per dollar on Google Cloud versus equivalent NVIDIA configurations.
Memory bandwidth is central to the comparison. TPU v4 has 1.2 TB/s of HBM bandwidth per chip. The NVIDIA A100 (the dominant training GPU during the period TPU v4 was deployed) has 2 TB/s in its 80GB HBM2e variant, so on this per-chip figure the A100 leads. On both chips, much of a transformer forward-backward pass is limited by memory traffic rather than by arithmetic, which makes bandwidth, not peak FLOPS, the figure to watch.
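Whether an operation is memory-bound or compute-bound can be estimated with the roofline model, sketched below. The bandwidth figures come from the text; the peak bfloat16 FLOPS values (about 275 teraFLOPS for TPU v4 and 312 teraFLOPS for A100) are assumptions taken from public spec sheets, and the example intensity of 50 FLOPs per byte is hypothetical.

```python
def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """Arithmetic intensity (FLOPs per byte) above which a chip is compute-bound."""
    return peak_flops / bandwidth_bytes_per_s


def attainable_flops(intensity, peak_flops, bandwidth_bytes_per_s):
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)


# (peak FLOPS, memory bandwidth in bytes/s). Peak FLOPS are assumptions, see above.
chips = {
    "TPU v4": (275e12, 1.2e12),
    "A100 80GB": (312e12, 2.0e12),
}

for name, (peak, bandwidth) in chips.items():
    ridge = ridge_point(peak, bandwidth)
    at_50 = attainable_flops(50, peak, bandwidth)  # hypothetical 50 FLOPs per byte
    print(f"{name}: ridge {ridge:.0f} FLOPs/byte, {at_50 / 1e12:.0f} TFLOPS at intensity 50")
```

Any operation whose intensity falls below a chip's ridge point is limited by bandwidth on that chip, which is why the per-chip bandwidth figure matters as much as the peak FLOPS figure.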
At the pod level, however, the comparison changes. A TPU v4 pod (4,096 chips) with its optical circuit switching and custom interconnect achieves collective communication efficiency (the speed at which gradient updates can be averaged across all chips) that exceeds what NVIDIA's NVLink and InfiniBand configurations achieved at equivalent scales during the same period, according to Google's 2023 TPU v4 paper.
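The collective operation in question is usually an all-reduce. The sketch below simulates ring all-reduce, a widely used algorithm, in plain Python to show why interconnect bandwidth dominates at scale. It is not a claim about the algorithm any particular TPU or GPU system uses, and for brevity it assumes each worker's vector has exactly one element per worker.

```python
def ring_allreduce(grads):
    """Sum gradient vectors across workers arranged in a ring.

    grads[r] is worker r's vector. Simplification: the vector length must equal
    the number of workers, so each chunk is a single element.
    """
    p = len(grads)
    data = [list(g) for g in grads]

    # Phase 1, reduce-scatter: after p - 1 steps, worker r holds the complete
    # sum for chunk (r + 1) % p.
    for step in range(p - 1):
        sends = [(r, (r - step) % p, data[r][(r - step) % p]) for r in range(p)]
        for r, chunk, value in sends:
            data[(r + 1) % p][chunk] += value

    # Phase 2, all-gather: pass the completed chunks around the ring.
    for step in range(p - 1):
        sends = [(r, (r + 1 - step) % p, data[r][(r + 1 - step) % p]) for r in range(p)]
        for r, chunk, value in sends:
            data[(r + 1) % p][chunk] = value

    return data


grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]   # hypothetical per-worker gradients
summed = ring_allreduce(grads)
averaged = [[v / len(grads) for v in row] for row in summed]
print(summed[0])  # [111, 222, 333], identical on every worker
```

Each worker sends and receives 2 × (p − 1) chunks regardless of how many workers participate, so the time per step is set by link bandwidth and latency rather than by the chip's arithmetic speed.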
The GPU's fundamental advantage is flexibility and ecosystem breadth. CUDA, NVIDIA's programming platform, has been developed since 2006 and has an installed base of developers, libraries, and frameworks that simply does not exist for TPUs. PyTorch — which became the dominant deep learning research framework by approximately 2020 — runs natively on CUDA and required significant adaptation to run on TPUs via PyTorch/XLA (an integration that remains less seamless than native CUDA as of this writing).
This ecosystem gap has real consequences. A researcher or startup that writes PyTorch code can run it on any NVIDIA GPU without modification. Running the same code on a TPU requires either porting to JAX/TensorFlow or using PyTorch/XLA with its additional complexity. For the vast majority of the AI development community outside Google, this makes NVIDIA the default choice regardless of raw performance comparisons.
Additionally, GPUs are general-purpose enough to handle workloads that don't fit the TPU's matrix-multiplication-heavy sweet spot: irregular compute patterns, sparse models, reinforcement learning with non-differentiable operations, computer graphics and simulation tasks that may be co-located with AI workloads. The TPU has no competitive answer to these use cases.
Finally, GPUs are available everywhere. A researcher at any university can rent NVIDIA GPU time on AWS, Azure, or Google Cloud. TPUs are exclusively available on Google Cloud — there is no other vendor that sells TPU access. This captivity is by design from Google's perspective (it drives cloud adoption) but it is a genuine limitation for portability.
NVIDIA did not ignore the TPU program. The H100 (announced March 2022, with systems shipping from late 2022) introduced the Transformer Engine, which combines FP8 Tensor Core hardware with software that chooses between FP8 and 16-bit precision layer by layer in transformer models, a direct response to the workload pattern where TPUs had excelled. The NVLink Switch System, introduced with H100, allows up to 256 H100 GPUs to be connected in an all-reduce topology with 900 GB/s bidirectional bandwidth per GPU, competing with TPU pod-level collective communication efficiency. NVIDIA has explicitly acknowledged transformer workloads as the design target for H100 architecture decisions.
MLPerf Training benchmarks provide the most systematic public comparison. In the MLPerf Training v3.1 results (released November 2023), both Google (TPU v5p) and NVIDIA (H100) submitted results. On the GPT-3 175B training benchmark — the most relevant large-model task — Google submitted a result using 12,288 TPU v5p chips. NVIDIA submitted using configurations of H100s in NVIDIA DGX H100 systems. Both achieved competitive time-to-train figures; neither was decisively faster on a per-chip or per-dollar basis in the public comparison.
The nuance that MLPerf reveals: at the frontier model scale (hundreds of billions of parameters), the performance gap between optimally-configured TPU pods and optimally-configured H100 clusters is measured in percentages rather than multiples. The 15–30× advantage of TPU v1 over 2013-era GPUs existed because 2013-era GPUs were not designed for the task at all. Modern NVIDIA GPUs are explicitly designed for exactly the same workloads TPUs target.
The TPU program has not displaced NVIDIA in the broader AI hardware market. NVIDIA's data center GPU revenue for fiscal year 2024 (ending January 2024) was approximately $47 billion. Google does not break out TPU revenue separately, but analysts estimate Google Cloud's AI infrastructure segment (including TPU-related revenue from Cloud TPU) at low single-digit billions annually. The mass market for AI compute remains NVIDIA's.
What Google has achieved is independence for its own AI workloads. Google does not pay NVIDIA for the compute that runs Google Search's AI features, Google Translate, Google Photos, Bard/Gemini, or the research that produced PaLM and Gemini. At Google's scale — tens of millions of TPU-hours per day of internal compute — the cost avoidance from not paying NVIDIA's margins on that compute is enormous, even if it cannot be precisely quantified from outside.
The TPU vs. GPU debate is often framed as a simple performance race. The reality is more nuanced: both chips are competitive at frontier scale, and the real differences lie in ecosystem, portability, and use-case fit. In this lab, dig into what the competition actually looks like, and what it means for an organization choosing AI infrastructure.
At Google Cloud Next in April 2024, Thomas Kurian, Google Cloud's CEO, made two announcements on the same stage: the general availability of TPU v5p, the highest-performance TPU to date, and Axion, Google's first custom ARM-based CPU for data center workloads. The pairing was not accidental. Google was telling a story about full-stack compute ownership: training chips, inference chips, and now the general-purpose processors that run everything else in the data center.
Google began offering Cloud TPU access to external customers in 2017, starting with TPU v2 at $6.50 per TPU-hour. By 2024, the Cloud TPU catalog included v5e (optimized for cost-efficient inference and fine-tuning) and v5p (optimized for large-scale pre-training). The pricing structure reflects the market positioning:
TPU v5e is priced to compete with mid-tier GPU instances for inference workloads — particularly for organizations already invested in Google Cloud and using Google's Vertex AI platform. A single TPU v5e chip was priced at approximately $1.20 per chip-hour on demand in 2024, below the comparable cost of an H100 GPU instance on Google Cloud ($4.13/hr for a single H100 equivalent on GKE).
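An hourly price only becomes a cost comparison once throughput is factored in. The sketch below uses the two prices quoted above and computes how much faster the more expensive accelerator must be to match on cost per query; the throughput value of 100 queries per second is hypothetical, not a benchmark result.

```python
TPU_V5E_PER_HOUR = 1.20   # USD per chip-hour, from the text
H100_PER_HOUR = 4.13      # USD per GPU-hour, from the text


def cost_per_million_queries(price_per_hour, queries_per_second):
    """Cost in USD of serving one million queries on a single accelerator."""
    queries_per_hour = queries_per_second * 3600
    return price_per_hour / queries_per_hour * 1_000_000


# How many times faster the H100 must serve a workload to cost the same per query.
break_even_speedup = H100_PER_HOUR / TPU_V5E_PER_HOUR

hypothetical_qps = 100    # assumption for illustration, not a measured throughput
tpu_cost = cost_per_million_queries(TPU_V5E_PER_HOUR, hypothetical_qps)
h100_cost = cost_per_million_queries(H100_PER_HOUR, hypothetical_qps * break_even_speedup)

print(f"break-even speedup: {break_even_speedup:.2f}x")
print(f"cost per million queries: TPU v5e ${tpu_cost:.2f}, H100 ${h100_cost:.2f}")
```

At these list prices the GPU has to deliver about 3.4 times the throughput per device before it is the cheaper option, which is the comparison a buyer would actually need to measure on their own workload.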
TPU v5p targets research organizations and large enterprises that need pre-training scale. Pricing is available primarily through committed-use contracts rather than on-demand, reflecting the pod-scale infrastructure commitment required.
The external Cloud TPU business serves multiple strategic purposes beyond direct revenue: it attracts AI research organizations to Google Cloud (creating switching costs and data gravity), it generates public benchmark data that validates TPU performance, and it gives Google visibility into what workloads external customers actually run — informing the roadmap for future TPU generations.
The TPU v5 generation marked a shift in how Google structures its chip lineup. Rather than a single chip covering all workloads, Google released two variants with different design points:
TPU v5e (announced August 2023): Optimized for efficiency. Lower per-chip FLOPS than v5p but significantly better FLOPS-per-dollar. Designed for serving, fine-tuning, and distillation workloads where cost matters more than raw throughput. Google claimed 2× better performance-per-dollar versus TPU v4 on inference workloads in its product announcement.
TPU v5p (announced December 2023): Optimized for performance. Highest per-chip FLOPS in any Google TPU, highest HBM capacity and bandwidth per chip. Designed for pre-training frontier models. Google claimed more than 2× the FLOPS and 3× the HBM capacity of TPU v4. Available in pods of up to 8,960 chips.
This bifurcation mirrors a broader industry trend: the workloads of pre-training, fine-tuning, and inference have different enough compute and memory profiles that a single chip design is increasingly a compromise. NVIDIA made the same observation with the H100 (training) vs. L40S (inference) vs. L4 (cost-efficient inference) product segmentation.
In May 2024, Google announced "Trillium" — the sixth-generation TPU — at Google I/O. Trillium is also referred to as TPU v6. Google disclosed that Trillium delivers 4.7× the peak compute performance of TPU v5e per chip, with 2× the HBM capacity and 2× the inter-chip interconnect bandwidth. The chip uses an updated systolic array design and an upgraded matrix multiply unit (MXU) that supports a wider range of data types.
Trillium was described as available in Google's own infrastructure for training Gemini models, with Cloud TPU availability to follow. The timeline continues Google's approximately annual cadence of major TPU architectural announcements since v2 in 2017.
The April 2024 announcement of Axion introduced a new dimension to Google's silicon strategy. Axion is a custom ARM Neoverse V2-based CPU designed by Google for its data centers. It is not an AI accelerator — it runs operating systems, databases, web servers, and the "host" software that manages TPU workloads.
Google claimed Axion delivers 30% better performance than the best available general-purpose ARM-based instances in the cloud and 50% better performance versus comparable x86-based instances at equivalent workloads. Axion entered public preview on Google Kubernetes Engine (GKE) and Cloud Run in Q2 2024.
The strategic logic: a data center training job on TPU v5p still runs Linux, Python, orchestration software, data pipelines, and monitoring on conventional CPUs. If those CPUs are also custom-designed by Google for efficiency, the total cost and performance of the combined system improves. Google is not the first to follow this logic — AWS Graviton (ARM-based CPU) has been a significant cost advantage for Amazon Web Services since 2018 — but Axion signals Google's commitment to the same full-stack approach.
Google's TPU strategy does not exist in isolation from its software stack. Vertex AI — Google Cloud's managed ML platform — is tightly integrated with both JAX and TensorFlow, the frameworks that compile to TPU-native code via XLA. A customer using Vertex AI for model training is steered toward TPU instances by the platform's default configurations. This vertical integration — chip, compiler, framework, managed platform — is Google's answer to NVIDIA's CUDA + cuDNN + TensorRT vertical stack. Neither is vendor-neutral; both are designed to maximize friction for customers who consider switching.
Taken together — TPU generations for AI acceleration, Axion for general compute, the Vertex AI software stack, and the tight integration with Google's global fiber network — Google's hardware program is an attempt to own the entire compute stack for AI workloads within its own infrastructure.
This is not a strategy aimed at selling chips. Google has no current plans to license TPU designs or sell TPU hardware to third parties (unlike Intel, which sells Gaudi AI accelerators as hardware; Amazon, like Google, offers its Trainium and Inferentia chips only as cloud instances). The TPU program is structurally defensive: it insulates Google's core AI products from NVIDIA's pricing power, gives Google research teams hardware they can modify for specific model architectures, and creates infrastructure differentiation that is difficult for competitors to replicate without equivalent investments in silicon design capability.
The risk in this strategy is also visible: Google's AI products are more dependent on JAX and XLA than any other company's products are on a proprietary framework. If the AI field moves toward model architectures that the systolic array design handles poorly, or if PyTorch's dominance further constrains the talent pool that can use Google's tools, the internal optimization advantage could become a limiting factor. These are not hypothetical risks — they are active concerns that appear in Google's own research publications on making TPUs more accessible to external developers.
Google now designs custom chips at every layer of the stack: AI accelerators (TPU), general CPUs (Axion), and network infrastructure. This lab explores what this vertical integration strategy means — for Google's competitive position, for its cloud customers, and for the broader AI hardware industry.