By 2013, Google's engineers had a problem. Their new voice recognition system, the one powering Google Now, required so much compute that deploying it at full scale on conventional CPUs would have doubled the company's entire data center footprint. The math was brutal. Either the AI stayed narrow, or the infrastructure bill became absurd.
Jeff Dean's infrastructure team had already been sketching a different answer on whiteboards for months. Instead of buying more of someone else's chips, Google would design its own — purpose-built for the matrix multiplications that neural networks demanded, and nothing else.
In 2013, Google began a secret project to build what it would eventually call the Tensor Processing Unit — a custom ASIC (application-specific integrated circuit) designed exclusively for the inference workloads underpinning its AI products. The project was classified internally because Google did not want NVIDIA or Intel to know what was coming.
The first-generation TPU (TPU v1) was deployed in Google's data centers in 2015, a full year before Google publicly acknowledged its existence at Google I/O in May 2016. During that undisclosed year, Google was running AlphaGo's inference computations and Google Search ranking on hardware no competitor knew existed.
TPU v1 was inference-only: it could run neural networks but not train them. Its key innovation was the systolic array architecture — a grid of multiply-accumulate units that could pass data from cell to cell without repeatedly fetching from off-chip memory. This slashed latency and power consumption by roughly 10x versus the GPU alternative for inference workloads.
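The cell-to-cell data flow described above can be illustrated with a small cycle-by-cycle simulation. This is a minimal sketch, assuming an output-stationary array in which each cell accumulates one output element; it is a simplification chosen for readability, not necessarily the dataflow used in TPU v1, and the function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    Cell (i, j) accumulates c[i][j]. Rows of `a` enter at the left edge and
    columns of `b` enter at the top edge, each delayed by one cycle per
    row/column so that matching operands meet in the right cell.
    Returns (result matrix, number of cycles simulated).
    """
    n, k_dim, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]    # per-cell accumulators
    a_reg = [[0] * m for _ in range(n)]  # operands travelling left -> right
    b_reg = [[0] * m for _ in range(n)]  # operands travelling top -> bottom
    cycles = n + m + k_dim - 2           # time for the last operands to meet

    for t in range(cycles):
        # 1. Every operand moves one hop to its neighbouring cell.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # 2. New operands are injected at the edges (zeros once data runs out).
        for i in range(n):
            k = t - i
            a_reg[i][0] = a[i][k] if 0 <= k < k_dim else 0
        for j in range(m):
            k = t - j
            b_reg[0][j] = b[k][j] if 0 <= k < k_dim else 0
        # 3. Every cell performs one multiply-accumulate.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)
```

In this model each input element is read from outside the array exactly once and then reused as it passes through neighbouring cells, which is the source of the memory-traffic savings the paragraph describes.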
Google published "In-Datacenter Performance Analysis of a Tensor Processing Unit" at ISCA 2017. It reported that the TPU v1 ran neural network inference roughly 15–30x faster than contemporary CPUs and GPUs, with 30–80x higher performance-per-watt. This was the first time a hyperscaler had publicly disclosed a custom AI ASIC's internal benchmarks, and the numbers shocked the semiconductor industry.
Google did not stop at inference. The TPU v2 (2017) added training capability and was made available to external researchers through Google Cloud TPU in early access. TPU v2 pods, clusters of 256 TPU v2 chips linked by a custom high-speed interconnect, could train large ResNet models in under an hour.
TPU v3 (2018) introduced liquid cooling and quadrupled the chip count per pod to 1,024. Google reported internally that its TPU v3 pods outperformed the best GPU clusters it could otherwise procure for large-scale language model training.
TPU v4 (2021) was the generation that trained PaLM — Google's 540-billion-parameter language model. A TPU v4 pod contained 4,096 chips connected via Google's proprietary optical interconnect, achieving what Google called the largest tightly-coupled AI supercomputer in operation at that time.
In 2024, Google announced Trillium (TPU v6), claiming a 4.7x improvement in compute performance per chip over TPU v5e, with substantially improved energy efficiency. By 2024, Google had deployed more custom AI silicon by total transistor count than any other organization on Earth.
Custom silicon requires three things most companies lack: enough volume to amortize design costs, deep software co-design capability, and the patience to wait 18–24 months for a chip to go from tape-out to deployed silicon. Google had all three.
Google's annual hardware capital expenditure was already in the billions by 2013. The incremental cost of funding an ASIC design team — roughly $200–500 million over several years for a first chip — was manageable against a budget that large. More critically, Google controlled both the workloads (TensorFlow, later JAX) and the infrastructure, enabling the tight hardware-software co-design that produces genuine gains over general-purpose chips.
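The amortization argument can be made concrete with a break-even calculation. This is a rough sketch: the design-cost range comes from the figures above, but the per-chip saving of $5,000 is a purely hypothetical assumption, and the function name `breakeven_chips` is illustrative.

```python
def breakeven_chips(design_cost_usd, saving_per_chip_usd):
    """Deployed chips needed before cumulative savings cover the design cost.

    `saving_per_chip_usd` is the lifetime saving of one custom chip versus
    the merchant chip it replaces (purchase price plus power, net of the
    custom chip's own manufacturing cost).
    """
    if saving_per_chip_usd <= 0:
        raise ValueError("custom chip must save money per unit to break even")
    # Ceiling division on integers avoids floating-point rounding surprises.
    return -(-design_cost_usd // saving_per_chip_usd)


# Design-cost range from the text; the $5,000 saving is a hypothetical input.
for design_cost in (200_000_000, 500_000_000):
    print(design_cost, breakeven_chips(design_cost, 5_000))
```

Under these assumed inputs the break-even volume lands in the tens of thousands of chips, which is why volume is listed first among the three requirements.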
The broader lesson: vertical integration — owning the chip, the software stack, the compiler, and the datacenter — is the structural advantage that makes custom silicon worth the investment. Companies that only own one or two layers rarely recoup the cost.
By building its own silicon, Google transformed NVIDIA from a vendor into an optional supplier. It also created a moat: external AI developers using Google Cloud TPUs are dependent on Google's roadmap, pricing, and availability — the same leverage dynamic that previously gave NVIDIA power over Google.
You are advising a large cloud company that currently buys all its AI chips from NVIDIA. The CEO has seen Google's TPU results and wants to know if building custom silicon makes sense for your company too. Use this lab to think through the decision.
Andy Jassy took the stage at re:Invent 2018 with an announcement that caught most analysts off guard. Amazon Web Services was releasing AWS Inferentia — a custom chip for machine learning inference designed entirely in-house. The announcement was notable not for what Inferentia could do, but for what it signalled: AWS no longer intended to be a passive reseller of other companies' silicon.
Two years later, at re:Invent 2020, AWS doubled down with Trainium, its custom training chip. The message was unambiguous. The world's largest cloud provider was building a parallel AI compute stack, from silicon to compiler to managed service, with the explicit goal of offering the lowest cost per training job in public cloud.
AWS Inferentia was announced at re:Invent in November 2018 and made generally available in December 2019 via Amazon EC2 Inf1 instances. The chip was fabricated by TSMC on a 16nm process and contained four NeuronCores, custom inference engines each capable of performing large tensor operations in a single clock cycle.
AWS's internal case study for Inferentia's launch was Amazon's own Alexa service — a workload processing hundreds of millions of voice queries daily. Running Alexa inference on GPU instances had cost tens of millions of dollars annually. Moving to Inferentia reduced that cost by approximately 70% per inference request, according to figures AWS disclosed at re:Invent 2019.
The Inf1 instances were priced at up to 40% lower cost per inference compared to equivalent GPU-based instances, AWS claimed at launch. Independent benchmarks from MLPerf Inference 2021 subsequently placed Inferentia chips among the top performers in the datacenter inference category on several standard models including BERT and ResNet-50.
AWS built the Neuron SDK — a compiler and runtime toolchain that converts models trained in PyTorch, TensorFlow, or MXNet into NeuronCore-optimized binaries. The SDK handles operator partitioning, memory layout, and on-chip data movement automatically. This software layer is what makes Inferentia usable without low-level chip expertise — a critical adoption requirement for AWS's customer base.
AWS Inferentia2 was announced at re:Invent 2022 and became available via Inf2 instances in 2023. The second generation moved to TSMC's 7nm node, replaced the four first-generation NeuronCores with two larger NeuronCore-v2 engines backed by high-bandwidth memory, and introduced direct chip-to-chip NeuronLink interconnects, enabling multi-chip inference across up to 12 chips per instance for models too large to fit on a single device.
AWS reported that Inf2 instances delivered up to 4x higher throughput and up to 10x lower latency for large language model inference compared to Inf1, at broadly comparable or lower per-hour pricing. The architectural improvement was driven by NeuronLink, which allowed a 175-billion-parameter model like GPT-3 to be sharded across the 12 Inferentia2 chips of a single Inf2 instance, with chips exchanging data directly rather than over the PCIe path that limits GPU-based multi-card inference.
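A back-of-the-envelope footprint calculation shows why a model of this size has to be spread across several accelerators. This is a sketch under stated assumptions: 2 bytes per parameter (16-bit weights), 32 GB of memory per device, and weights only, ignoring activations and the key-value cache. The function name `chips_for_weights` is hypothetical.

```python
import math


def chips_for_weights(params, bytes_per_param, mem_per_chip_gb):
    """Return (weight footprint in GB, minimum chips needed to hold it).

    Counts model weights only; activations, KV cache and runtime overhead
    would push the real requirement higher.
    """
    weight_gb = params * bytes_per_param / 1e9
    return weight_gb, math.ceil(weight_gb / mem_per_chip_gb)


# 175B parameters at 16-bit precision, assuming 32 GB per device.
footprint_gb, chips = chips_for_weights(175_000_000_000, 2, 32)
print(footprint_gb, chips)  # 350.0 GB of weights -> at least 11 devices
```

Halving the precision to one byte per parameter halves the footprint, which is one reason quantization and interconnect bandwidth are usually discussed together for large-model serving.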
AWS Trainium was announced at re:Invent 2020 and became available via Trn1 instances in 2022. Trainium was a more ambitious chip than Inferentia: it had to compete directly with NVIDIA A100 for the training market, where customer loyalty and software ecosystem lock-in (CUDA) were formidably entrenched.
AWS's approach was cost disruption. Trn1 instances were priced to offer training compute at roughly 50% lower cost per FLOP than comparable GPU instances on AWS, according to AWS pricing disclosures at re:Invent 2022. The actual performance-per-dollar advantage depended heavily on model architecture and how well the Neuron compiler could optimize the specific training graph.
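The dependence on compiler quality can be shown with a small calculation of cost per useful unit of compute. Every number below is hypothetical (prices, peak throughput, and utilization are illustrative inputs, not AWS figures), and `cost_per_useful_pflop` is an assumed helper name.

```python
import math


def cost_per_useful_pflop(hourly_price_usd, peak_tflops, utilization):
    """Dollars per petaFLOP of work actually delivered to the model.

    `utilization` is the fraction of peak throughput the compiler and
    runtime achieve on a given training graph (between 0 and 1).
    """
    useful_pflop_per_hour = peak_tflops * utilization * 3600 / 1000
    return hourly_price_usd / useful_pflop_per_hour


# Hypothetical instances with the same peak throughput.
gpu = cost_per_useful_pflop(hourly_price_usd=40.0, peak_tflops=1000, utilization=0.50)
custom = cost_per_useful_pflop(hourly_price_usd=20.0, peak_tflops=1000, utilization=0.25)

# A 50% list-price discount is fully cancelled if utilization is also halved.
print(math.isclose(gpu, custom))
```

The comparison only favors the cheaper instance when its achieved utilization stays close to the incumbent's, which is why the compiler matters as much as the price list.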
Trainium2, announced in 2023, scales the architecture dramatically: up to 65,536 chips in a single EC2 UltraCluster, connected by AWS's custom EFA (Elastic Fabric Adapter) network. AWS claimed this configuration could train a 300-billion-parameter model in approximately 2 weeks — competitive with the largest GPU clusters, at a lower hourly rate.
Google built TPUs primarily for internal consumption — to power its own products and reduce its own chip bill, with Cloud TPU as a secondary external offering. Amazon's strategy is fundamentally different: Inferentia and Trainium are cloud products first. The chips only make business sense if AWS customers adopt them in volume.
This means AWS has to solve a harder problem than Google: the software ecosystem. NVIDIA's CUDA has a decade of optimized libraries, tutorials, and trained engineers. AWS's Neuron SDK is newer and covers a narrower surface area. AWS has addressed this partly through compatibility layers and partly through partnerships — notably a 2023 agreement with Anthropic to have Claude models run natively on Trainium, providing a marquee workload to validate the platform.
In September 2023, AWS announced a $4 billion investment in Anthropic with a key provision: Anthropic would use AWS Trainium and Inferentia as its primary training and inference platforms. This was not just financial — it was a validation strategy. By anchoring the leading AI safety lab's workloads to its custom silicon, AWS gained a credible proof point that Trainium can train frontier-class models.
AWS made a distinctive choice: build two separate chips — Inferentia for inference and Trainium for training — rather than a single unified chip. Most competitors build one chip that handles both tasks. Explore the trade-offs with your lab assistant.
At Microsoft Ignite 2023, CEO Satya Nadella unveiled two chips simultaneously: Azure Maia 100, a custom AI accelerator for training and inference, and Azure Cobalt 100, a custom Arm-based CPU for general cloud workloads. The announcements were notable for their timing: they came just days before the OpenAI governance crisis of November 2023, and they signalled that Microsoft was not content to remain a passive buyer of NVIDIA hardware even as it deepened its OpenAI relationship.
The strategic message was explicit: Microsoft intended to have custom silicon at every layer of Azure's AI stack — custom CPUs, custom AI accelerators, and custom networking — by 2025.
Azure Maia 100 was designed in partnership with TSMC and fabricated on TSMC's 5nm process node. Microsoft disclosed that it contains 105 billion transistors — among the largest transistor counts in any commercially deployed AI chip at time of announcement. It is designed specifically for large transformer model workloads: training and inference of GPT-class models.
The chip's architecture prioritizes memory bandwidth over raw FLOP count. Microsoft engineers concluded that most large model operations are memory-bandwidth-bound rather than compute-bound — the bottleneck is moving weights around, not performing the arithmetic. Maia 100 incorporates a large on-chip SRAM cache and low-latency off-chip DRAM access optimized for the sequential access patterns of autoregressive transformer inference.
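The memory-bound argument follows from comparing the time needed to stream the weights with the time needed to do the arithmetic. The sketch below uses a simple roofline-style model; the hardware figures (400 TFLOP/s peak, 2 TB/s memory bandwidth) and the 70-billion-parameter model are hypothetical and are not Maia 100 specifications. `decode_step` is an illustrative name.

```python
def decode_step(params, bytes_per_param, peak_flops, mem_bw_bytes_per_s, batch=1):
    """Classify one autoregressive decode step as memory- or compute-bound.

    Assumes every weight is read from memory once per step and shared by
    all requests in the batch, and that each weight contributes one
    multiply and one add (2 FLOPs) per request.
    Returns (limiting resource, lower bound on step time in seconds).
    """
    t_memory = params * bytes_per_param / mem_bw_bytes_per_s
    t_compute = 2 * params * batch / peak_flops
    bound = "memory" if t_memory >= t_compute else "compute"
    return bound, max(t_memory, t_compute)


# Hypothetical accelerator: 400 TFLOP/s peak, 2 TB/s memory bandwidth.
for batch in (1, 256):
    print(batch, decode_step(70e9, 2, 400e12, 2e12, batch=batch))
```

At batch size 1 the arithmetic finishes long before the weights have been streamed in, so extra FLOPs would sit idle; only at large batch sizes does compute become the limit. That is the reasoning behind spending silicon on bandwidth and on-chip memory.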
Microsoft also designed a custom liquid cooling system, a rack-side unit it calls a "sidekick", which circulates coolant to cold plates mounted directly on the Maia chip package. This allows Maia to run at higher sustained clock speeds in dense rack configurations than air-cooled alternatives, a prerequisite for Azure's high-density AI data center buildout.
Microsoft disclosed that Maia 100's first major workload would be powering Microsoft Copilot — its AI assistant embedded across Office 365, Teams, Bing, and Windows. Copilot processes tens of billions of tokens per day across these products. By running inference on Maia rather than NVIDIA GPUs, Microsoft projected substantial savings on its own AI infrastructure bill before offering the chips to external Azure customers.
Alongside Maia, Microsoft announced Azure Cobalt 100 — a 128-core Arm-based CPU built on TSMC's 5nm process, designed to replace Intel Xeon and AMD EPYC processors in a large fraction of Azure's general-purpose compute fleet. Cobalt 100 is based on the Arm Neoverse CSS (Compute Subsystem) N2 architecture, though Microsoft added custom optimizations for Azure workloads.
Microsoft claimed Cobalt 100 delivers 40% better performance than the best Azure Arm-based instances available at time of announcement. Cobalt 100 instances became available in public preview on Azure in 2024, initially targeting compute-intensive non-AI workloads — web serving, data processing, and containerized applications — where the power efficiency of Arm versus x86 provides direct cost benefits.
The relevance of Cobalt to the AI hardware race is indirect but important: by reducing CPU costs, Microsoft frees capital budget for more Maia and NVIDIA GPU procurement. A hyperscaler's capex is a finite pool; efficiency in one area enables investment in another.
Microsoft's Maia strategy is intertwined with its OpenAI relationship in ways that create structural tensions. Microsoft has invested approximately $13 billion in OpenAI through multiple funding rounds. OpenAI trains its models — including GPT-4o and future GPT-series models — on Azure infrastructure, primarily on NVIDIA H100 clusters that Microsoft procures on OpenAI's behalf.
If Maia matures to the point where it can train frontier models at competitive cost, Microsoft faces a choice: push OpenAI to use Maia (saving Microsoft money and validating the chip), or allow OpenAI to continue specifying NVIDIA hardware (preserving the NVIDIA relationship and reducing transition risk). OpenAI has its own separate silicon ambitions — its fundraising and partnership conversations in 2023–2024 included discussions about custom chip development, potentially in partnership with SoftBank or through OpenAI's own ASIC project.
As of 2024, Maia has not been publicly confirmed as a primary training substrate for any OpenAI model. Its primary disclosed workload remains Microsoft's own Copilot products.
All three hyperscalers are building custom AI silicon, but their motivations and architectures differ meaningfully:
| Company | Chip(s) | Primary Internal Use | External Sales? | Distinctive Feature |
|---|---|---|---|---|
| Google | TPU v4 / Trillium | Search, PaLM, Gemini | Yes (Cloud TPU) | Optical inter-chip network; 4,096-chip pods |
| Amazon | Inferentia2 / Trainium2 | Alexa, AWS services | Yes (Inf2, Trn1 instances) | Two-chip strategy; 65,536-chip UltraCluster |
| Microsoft | Maia 100 / Cobalt 100 | Copilot, Azure services | Planned (preview 2024) | 105B transistors; custom liquid cooling |
All three hyperscalers are simultaneously building custom silicon to reduce NVIDIA dependence while also spending more on NVIDIA GPUs than ever before. In 2023, Microsoft, Google, and Amazon collectively spent an estimated $30+ billion on NVIDIA H100s. Custom chips reduce marginal costs on specific workloads but cannot yet replace NVIDIA across the full stack. The hyperscalers are playing a long game: replacing NVIDIA gradually, workload by workload, as their custom chips mature.
Microsoft has invested $13 billion in OpenAI and built Maia 100 to reduce its NVIDIA dependency. But OpenAI still trains on NVIDIA GPUs via Azure. You're a Microsoft strategy analyst asked to assess: should Microsoft push harder for OpenAI to adopt Maia, or would that create risks the company can't afford?
Every startup in the AI chip space is making the same fundamental bet: that there exists a regime where a purpose-built architecture outperforms a general-purpose GPU enough that customers will pay a premium, tolerate software migration costs, and accept vendor concentration risk. The history of the semiconductor industry suggests this is very hard to achieve. The history of the last five years suggests it might be possible — if the workload is right.
Cerebras Systems, founded in 2016 by Andrew Feldman (previously CEO of SeaMicro, acquired by AMD), made the most audacious architectural bet in the AI chip industry: build a chip the size of an entire silicon wafer.
The Cerebras WSE-2 (Wafer-Scale Engine 2), announced in 2021, contains 2.6 trillion transistors across 46,225 mm² of silicon — roughly 56x the area of an NVIDIA A100. It contains 850,000 AI-optimized cores and 40 GB of on-chip SRAM (compared to the A100's 40 MB). This eliminates off-chip memory bottlenecks entirely for models that fit within 40 GB — and Cerebras has developed techniques to spread larger models across multiple WSE-2 chips.
The WSE-3, announced in March 2024, scaled further: 4 trillion transistors, 900,000 cores, and 44 GB of on-chip SRAM. It is fabricated by TSMC on a 5nm process. A single WSE-3 can train a GPT-3-sized (175B parameter) model without model parallelism, a feat that requires hundreds of GPU nodes.
Cerebras's real-world deployments include partnerships with Argonne National Laboratory and Lawrence Livermore National Laboratory for scientific AI workloads. In August 2024, Cerebras launched Cerebras Inference, a cloud API service offering token generation at over 1,800 tokens per second for Llama 3.1 8B, compared to roughly 60–80 tokens/second on equivalent GPU setups. In independent tests, this represented the fastest publicly available LLM inference service by a significant margin.
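To see what these throughput figures mean for a user, they can be converted into the time taken to generate a whole response. This is a simplified sketch that ignores time-to-first-token and network overhead; the 500-token response length is an arbitrary assumption and `response_time_s` is a hypothetical helper.

```python
def response_time_s(num_tokens, tokens_per_second):
    """Seconds to generate `num_tokens` at a steady decode rate."""
    return num_tokens / tokens_per_second


# Throughput figures quoted above; 500 tokens is an assumed response length.
for rate in (1800, 80, 60):
    print(rate, round(response_time_s(500, rate), 2))
```

Under these assumptions a response that streams in well under half a second at the higher rate takes several seconds at the lower ones.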
Building chips at wafer scale creates a fundamental yield problem: a single defect anywhere on the 46,000 mm² wafer could theoretically render the entire chip unusable. Cerebras solved this through redundant cores — enough extra compute units that the chip remains fully functional even with a statistically expected number of manufacturing defects. The yield math works at scale, but it required significant custom fab process development with TSMC.
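The yield argument can be sketched with the standard Poisson defect model, in which the probability of a defect-free die of area A at defect density D is exp(-A·D). The defect density and spare-core count below are hypothetical, and the sketch assumes each defect disables exactly one core that can be routed around; it does not describe Cerebras's actual redundancy scheme.

```python
import math


def monolithic_yield(area_mm2, defects_per_mm2):
    """Probability that a die of the given area has zero defects."""
    return math.exp(-area_mm2 * defects_per_mm2)


def yield_with_spares(area_mm2, defects_per_mm2, spare_cores):
    """Probability that the defect count does not exceed the spare cores.

    Assumes defects are Poisson-distributed and each one disables a single
    core that can be replaced by a spare.
    """
    lam = area_mm2 * defects_per_mm2      # expected number of defects
    term = math.exp(-lam)                 # P(exactly 0 defects)
    total = term
    for k in range(1, spare_cores + 1):
        term *= lam / k                   # P(exactly k defects)
        total += term
    return min(1.0, total)


D = 0.001  # hypothetical defect density: 0.1 defects per square centimetre
print(monolithic_yield(800, D))           # a conventional large GPU-sized die
print(monolithic_yield(46_225, D))        # a full wafer with no redundancy
print(yield_with_spares(46_225, D, 100))  # the same wafer with 100 spare cores
```

With no redundancy the wafer-scale yield is effectively zero, while a modest pool of spares relative to hundreds of thousands of cores brings it close to one under these assumptions.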
Groq was founded in 2016 by Jonathan Ross — the Google engineer who first proposed building the TPU. Ross left Google after TPU v1 shipped and founded Groq to pursue a different architectural vision: deterministic, software-controlled data movement with zero hardware caches.
The Groq LPU (Language Processing Unit) uses a tensor streaming processor architecture where all data movement is explicitly scheduled by the compiler at compile time. There is no dynamic memory controller, no speculative execution, and no cache hierarchy. The result is perfectly predictable latency — every inference run takes exactly the same time as every other — and extremely high memory bandwidth utilization (near 100%, versus 60–80% typical for GPUs).
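The difference between compile-time scheduling and cache-driven execution can be illustrated with a toy model. This is not Groq's compiler: the operation names, cycle counts, and miss penalty are invented, and both functions are hypothetical helpers meant only to show why a fully static schedule has a fixed latency.

```python
def static_schedule(ops):
    """Assign each op a fixed start and end cycle at 'compile time'.

    `ops` is a list of (name, cycles) pairs with known, fixed durations.
    Returns (schedule, total_cycles); the total never varies between runs.
    """
    schedule, cycle = [], 0
    for name, cycles in ops:
        schedule.append((name, cycle, cycle + cycles))
        cycle += cycles
    return schedule, cycle


def dynamic_latency(ops, misses, miss_penalty):
    """Latency of one run on hardware with a cache.

    `misses[i]` says whether op i suffered a cache miss on this run, which
    depends on runtime state the compiler cannot know in advance.
    """
    return sum(
        cycles + (miss_penalty if missed else 0)
        for (_, cycles), missed in zip(ops, misses)
    )


ops = [("load_weights", 4), ("matmul", 10), ("activation", 2), ("store", 3)]
_, fixed = static_schedule(ops)
print(fixed)                                                    # 19 on every run
print(dynamic_latency(ops, [False, False, False, False], 50))   # 19
print(dynamic_latency(ops, [True, False, True, False], 50))     # 119
```

In the static case the latency is a property of the compiled program; in the dynamic case it is a property of each individual run.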
In February 2024, Groq launched GroqCloud — a public inference API. Within days of launch, independent users reported Llama 2 70B inference at over 500 tokens per second, faster than any GPU-based service at equivalent model size. Groq's internal benchmark claimed 750 tokens/second for Mixtral 8x7B. The service went viral in AI developer communities as the first genuinely sub-100ms time-to-first-token experience for large models.
Groq raised a $640 million Series D round in August 2024 at a reported $2.8 billion valuation, with investors including Samsung, BlackRock, and the government of Saudi Arabia (via the Public Investment Fund).
SambaNova Systems, founded in 2017 by former Stanford professors Kunle Olukotun and Chris Ré alongside engineer Rodrigo Liang, took a different architectural direction: the reconfigurable dataflow unit (RDU). Rather than designing a fixed-function chip, SambaNova built a chip whose data movement paths can be reprogrammed at runtime — a middle ground between an ASIC and an FPGA.
SambaNova's primary commercial traction has been in the enterprise on-premises market. The company sells "SambaNova Suite" — a full-stack AI appliance containing RDU chips, software, and pre-trained models — directly to large enterprises and national labs that cannot or will not send sensitive data to public cloud. Customers include Argonne National Laboratory, Oak Ridge National Laboratory, and several large financial institutions.
SambaNova raised $676 million in a single Series D round in 2021, reaching a valuation above $5 billion. It has been more conservative about publishing public benchmarks than Cerebras or Groq, reflecting its enterprise sales model where customer-specific performance demonstrations replace public leaderboards.
Tenstorrent was founded in 2016 and is led by Jim Keller, arguably the most celebrated CPU architect in the industry, who led AMD's Zen architecture and worked on Apple's A-series chips and Tesla's self-driving (FSD) chip. Keller joined as CTO in 2021 and became CEO in 2023.
Tenstorrent's chips — the Grayskull (2021) and Wormhole (2022) architectures — use a RISC-V-based tile mesh where each tile contains a RISC-V processor plus a matrix math unit. The tiles are connected by a 2D torus network-on-chip. Critically, Tenstorrent has open-sourced significant portions of its software stack and published its architecture specifications — a bet that an open ecosystem will attract developers faster than any proprietary moat.
In August 2023, Tenstorrent raised $100 million from investors including Hyundai Motor Group and Samsung Catalyst Fund. The Korean industrial conglomerate investment reflects a growing thesis: as AI moves into automotive and industrial applications, on-device AI chips designed in a RISC-V ecosystem become attractive to OEMs who want silicon independence from both NVIDIA and US hyperscalers.
All four startups, despite wildly different architectures, are targeting the same gap: NVIDIA optimizes for throughput — tokens per second per dollar across large batch jobs. The startup challengers are attacking latency — time-to-first-token, single-request response time, and the interactive AI use cases where a 200ms response feels alive and a 2,000ms response feels broken.
This is a coherent market segmentation bet. As AI inference moves from batch processing to real-time conversational applications — voice AI, coding assistants, autonomous agents — the relevant metric shifts from "how many requests can I process per hour?" to "how fast does this individual response arrive?" NVIDIA's multi-billion-dollar CUDA ecosystem is optimized for the former. Cerebras, Groq, and others are betting the future rewards the latter.
All four startups face the same existential risk: NVIDIA is not standing still. The H200 and upcoming Blackwell Ultra architectures have dramatically improved inference latency, and NVIDIA's NIM microservices platform is specifically targeting the low-latency inference market. A startup that wins on latency today may find its advantage eroded within 18 months by a new NVIDIA SKU. Sustainable differentiation requires either an architectural moat (Cerebras's wafer scale, Groq's deterministic scheduling) or a customer moat (SambaNova's enterprise lock-in, Tenstorrent's open-source ecosystem) that NVIDIA cannot easily replicate.
You are a venture capital analyst evaluating AI chip startups. Your fund has been asked to assess whether Cerebras, Groq, SambaNova, or Tenstorrent has a defensible long-term position — or whether NVIDIA will inevitably erode their advantages as it has done to GPU challengers historically.