When ChatGPT scaled up on Microsoft's Azure infrastructure after its November 2022 launch, the teams running it confronted an uncomfortable truth: the same GPU clusters that had trained GPT-3.5 were expensive and inefficient for serving millions of daily queries. Training and inference, it turned out, are not just different workloads; they reward different hardware architectures.
Modern large language models live two very different lives. Training is the months-long process of adjusting billions of parameters by computing gradients across massive datasets. Inference is the millisecond-scale act of generating each token of a response when a user submits a prompt. These two workloads have opposing computational signatures.
Training is compute-bound: the bottleneck is raw floating-point throughput. Every GPU cycle must be filled with matrix multiplications. GPT-4 training, estimated to have consumed roughly 25,000 A100 GPUs for approximately 90 to 100 days according to reporting from The Information and SemiAnalysis, demanded sustained throughput above all else.
Inference is often memory-bandwidth-bound: the bottleneck is how fast the chip can load model weights from memory into compute units. A 70-billion-parameter model in FP16 requires 140 GB of weight storage. Each token generated means streaming those weights across the memory bus — repeatedly, for every user.
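A rough back-of-the-envelope sketch of that bound: if every generated token has to stream all of the weights once, single-stream decode speed cannot exceed memory bandwidth divided by weight bytes. The function names and the 2 TB/s aggregate bandwidth figure are illustrative assumptions, and the model deliberately ignores KV cache traffic, batching, and multi-chip sharding.

```python
def weight_bytes(num_params: float, bytes_per_param: float) -> float:
    """Bytes needed to store the weights at a given precision."""
    return num_params * bytes_per_param


def max_decode_tokens_per_s(num_params: float, bytes_per_param: float,
                            mem_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s for one stream when each token reads every weight once."""
    return mem_bandwidth_bytes_per_s / weight_bytes(num_params, bytes_per_param)


fp16_gb = weight_bytes(70e9, 2) / 1e9              # 70B parameters at 2 bytes each
bound = max_decode_tokens_per_s(70e9, 2, 2e12)     # assumed 2 TB/s aggregate bandwidth
print(f"{fp16_gb:.0f} GB of weights, at most {bound:.1f} tokens/s per stream")
```

Batching several requests together amortizes each weight read over more tokens, which is why serving stacks work so hard to keep batches full.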
A chip optimized for training maximizes FLOPS. A chip optimized for inference maximizes memory bandwidth and minimizes latency per token. Maximizing both simultaneously is physically constrained by power budgets and die area — hence the fork into specialized hardware.
Chip architects use the roofline model to visualize where a workload hits its ceiling. Every operation has an arithmetic intensity: the ratio of floating-point operations performed to bytes of memory traffic. The large-batch matrix multiplications in training have high arithmetic intensity (hundreds of FLOPs per byte), so they benefit from raw compute. Inference at small batch sizes often has an arithmetic intensity below 10 FLOPs per byte, which makes it memory-bandwidth-limited.
NVIDIA's A100 has 2 TB/s of HBM2e bandwidth and 312 TFLOPS of BF16 compute. Its compute-to-bandwidth ratio is roughly 156 FLOPs per byte — fine for training, but inference at batch size 1 barely scratches the surface of that compute capacity. The chip sits mostly idle on the compute side while memory transfers dominate latency.
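The roofline bound can be written in a few lines. This is a minimal sketch using the A100 figures quoted above; the two workload intensities (2 and 300 FLOPs per byte) are assumed round numbers chosen for illustration, not measurements.

```python
def attainable_flops(arithmetic_intensity: float, peak_flops: float,
                     mem_bandwidth: float) -> float:
    """Roofline: performance is capped by peak compute or by bandwidth * intensity."""
    return min(peak_flops, arithmetic_intensity * mem_bandwidth)


PEAK = 312e12        # BF16 FLOP/s
BANDWIDTH = 2e12     # bytes/s
ridge = PEAK / BANDWIDTH   # intensity at which the chip becomes compute-bound

workloads = [("batch-1 decode (assumed 2 FLOPs/byte)", 2.0),
             ("large-batch training (assumed 300 FLOPs/byte)", 300.0)]
for label, intensity in workloads:
    perf = attainable_flops(intensity, PEAK, BANDWIDTH)
    print(f"{label}: {perf / 1e12:.0f} TFLOPS, {100 * perf / PEAK:.1f}% of peak")
print(f"ridge point: {ridge:.0f} FLOPs/byte")
```

Any workload whose intensity sits below the ridge point leaves compute idle, no matter how many FLOPS the chip advertises.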
| Dimension | Training | Inference |
|---|---|---|
| Primary bottleneck | Compute (FLOPS) | Memory bandwidth |
| Batch size | Thousands of samples | Often 1–32 requests |
| Precision needed | BF16 / FP32 mixed | INT8 / INT4 viable |
| Duration | Weeks to months | Milliseconds per token |
| Memory footprint | Weights + gradients + optimizer states (3–4× model size) | Weights + KV cache only |
| Latency tolerance | High (throughput matters) | Low (user-facing) |
One inference-specific phenomenon with no training analog is the key-value cache. During autoregressive generation, each new token attends to all prior tokens' keys and values. Rather than recompute them, the model caches them in GPU memory. For a long context — GPT-4's 128K token window — the KV cache can consume tens of gigabytes of VRAM per concurrent user. This is purely an inference-time memory pressure that never appears during training.
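To see where the "tens of gigabytes" comes from, here is a hedged sizing sketch. The layer count, head counts, and head dimension are assumed shapes typical of publicly described 70B-class models, not the configuration of any specific proprietary model.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Keys and values (the factor of 2) for every layer, KV head, and cached token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


CONTEXT = 128 * 1024   # 128K tokens

# Assumed shapes: 80 layers, head_dim 128, FP16 cache entries.
full_mha = kv_cache_bytes(80, 64, 128, CONTEXT)   # one KV head per query head
grouped = kv_cache_bytes(80, 8, 128, CONTEXT)     # grouped-query attention, 8 KV heads

print(f"multi-head attention:    {full_mha / 2**30:.0f} GiB per session")
print(f"grouped-query attention: {grouped / 2**30:.0f} GiB per session")
```

The cache grows linearly with context length and with the number of concurrent sessions, which is why attention variants that share KV heads matter so much for serving cost.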
Anthropic's Claude models running on AWS Inferentia2 chips must manage KV cache allocation dynamically across thousands of simultaneous sessions — a problem training hardware never encounters. This drove AWS to architect Inferentia2 with specific on-chip SRAM pools sized for KV cache patterns rather than gradient storage.
You're advising a startup that just finished training a 7B-parameter language model. They want to serve it to 10,000 concurrent users. Discuss with the AI tutor: what hardware tradeoffs should they consider when moving from training infrastructure to inference infrastructure?
At GTC 2022, Jensen Huang unveiled the H100, the first NVIDIA GPU designed with transformer workloads explicitly in mind. The Hopper architecture introduced the Transformer Engine, which pairs FP8-capable Tensor Cores with software that chooses between FP8 and 16-bit precision layer by layer. NVIDIA claimed up to roughly 4× the training throughput of the A100 on GPT-style models, a performance leap that arrived just as demand for training capacity exploded post-ChatGPT.
Training a frontier model is, at its core, a sustained matrix multiplication problem. The forward pass computes activations; the backward pass computes gradients using those same weight matrices. Both phases demand high-precision arithmetic (gradients must accumulate without overflow), massive parallelism across thousands of chips, and high-bandwidth interconnects to synchronize gradient updates.
The A100 (Ampere, 2020) shipped with 312 TFLOPS of BF16 Tensor Core performance and 600 GB/s NVLink 3.0 chip-to-chip bandwidth. Each chip in a DGX A100 server is connected to seven others via NVLink — essential for tensor and pipeline parallelism across a node. Scaling beyond a single node required InfiniBand at 200 Gb/s, with all-reduce collective operations distributing gradient updates across the cluster.
The H100's most significant training innovation is its Transformer Engine. Traditional mixed-precision training (introduced by NVIDIA and Baidu researchers in 2018) uses FP16 for forward activations and FP32 for gradient accumulation. The Transformer Engine extends this by introducing FP8 computation — 8-bit floating point — for the most numerically tolerant operations within each transformer block.
The key challenge with FP8 is its limited dynamic range. The Transformer Engine maintains per-tensor scaling factors and adjusts them automatically as training proceeds to prevent underflow or overflow. This range management is handled by the Tensor Core hardware together with NVIDIA's library software, largely transparently to the model code. The result, according to NVIDIA's H100 launch documentation, is near-FP16 accuracy with roughly 2× the arithmetic throughput.
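The idea behind per-tensor scaling can be illustrated in plain Python. This is a simplified software sketch, not NVIDIA's implementation: the class name, the history length of 4, and the sample values are all assumptions, and the code only scales and saturates values without actually rounding them to 8 bits.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


class PerTensorScaler:
    """Tracks recent absolute maxima and derives a scale that fits a tensor into FP8 range."""

    def __init__(self, history_len: int = 4):
        self.amax_history = deque(maxlen=history_len)

    def update(self, values: list[float]) -> float:
        """Record this step's largest magnitude and return the scale to apply."""
        self.amax_history.append(max(abs(v) for v in values))
        amax = max(self.amax_history)
        return E4M3_MAX / amax if amax > 0 else 1.0


def scale_and_clip(values: list[float], scale: float) -> list[float]:
    """Scale into FP8 range and saturate; real hardware would also round to 8 bits."""
    return [max(-E4M3_MAX, min(E4M3_MAX, v * scale)) for v in values]


scaler = PerTensorScaler()
grads = [1e-4, -3e-3, 2e-3]          # hypothetical small gradient values
scale = scaler.update(grads)
scaled = scale_and_clip(grads, scale)
print(f"scale={scale:.1f}, scaled={scaled}")
```

Using a history of maxima rather than only the current step keeps the scale stable when one iteration happens to produce unusually small values.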
Meta trained Llama 2 70B on 2,000 A100 GPUs over approximately 1,720,000 GPU-hours. With the roughly 2 trillion training tokens Meta reported, that works out to about 320 tokens per second per GPU, or on the order of 650,000 tokens per second across the cluster. The training run used FSDP (Fully Sharded Data Parallelism) across nodes, where gradient synchronization latency across InfiniBand links was the primary inter-node bottleneck. Meta reported in the Llama 2 technical report that hardware failures during training caused approximately 2,000 individual job restarts, underscoring why checkpoint frequency is a first-class training hardware concern.
No single GPU can hold a frontier model's weights plus its optimizer states. GPT-3 at 175 billion parameters requires roughly 2.8 TB of training state under mixed-precision Adam (about 16 bytes per parameter: FP16 weights and gradients plus FP32 master weights and two FP32 moment estimates), far exceeding any single chip's memory. Training therefore requires model parallelism: splitting either layers (pipeline parallelism) or weight tensors (tensor parallelism) across multiple GPUs.
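The 2.8 TB figure corresponds to about 16 bytes per parameter under a common mixed-precision Adam accounting. The sketch below makes that explicit; the per-parameter byte counts and the 80 GB device size are assumptions, and activation memory is left out entirely.

```python
import math


def training_state_bytes(num_params: float, weight_bytes: int = 2,
                         grad_bytes: int = 2, optimizer_bytes: int = 12) -> float:
    """Weights + gradients + optimizer state (FP32 master weights and two Adam moments)."""
    return num_params * (weight_bytes + grad_bytes + optimizer_bytes)


total = training_state_bytes(175e9)          # GPT-3 scale
min_gpus = math.ceil(total / 80e9)           # assumed 80 GB of memory per GPU

print(f"training state: {total / 1e12:.1f} TB")
print(f"at least {min_gpus} GPUs just to hold it, before activations")
```

In practice the real device count is far higher, because activations, communication buffers, and throughput targets all add to the requirement.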
Tensor parallelism requires all-reduce operations after every transformer layer's forward and backward pass — typically every few milliseconds. At this frequency, the interconnect speed is critical. NVLink 4.0's 900 GB/s bidirectional bandwidth (H100) means an all-reduce across eight chips on a single node completes in microseconds. InfiniBand HDR at 200 Gb/s (roughly 25 GB/s) handles the slower inter-node reduction. The architectural consequence: training clusters are designed as fat-tree networks with high bisection bandwidth to minimize all-reduce latency.
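The gap between the two interconnect tiers can be estimated with the standard bandwidth term for a ring all-reduce. This is a rough sketch under stated assumptions: a hypothetical 64 MiB message, 450 GB/s per direction for NVLink (half of the 900 GB/s bidirectional figure), 25 GB/s for InfiniBand, and no per-hop latency or overlap with compute.

```python
def ring_allreduce_seconds(message_bytes: float, num_devices: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth term of a ring all-reduce: each device sends 2*(N-1)/N of the message."""
    return 2 * (num_devices - 1) / num_devices * message_bytes / link_bytes_per_s


MESSAGE = 64 * 2**20   # hypothetical 64 MiB of gradients or activations

intra_node = ring_allreduce_seconds(MESSAGE, 8, 450e9)   # assumed NVLink, per direction
inter_node = ring_allreduce_seconds(MESSAGE, 8, 25e9)    # assumed InfiniBand HDR

print(f"NVLink:     {intra_node * 1e6:.0f} microseconds")
print(f"InfiniBand: {inter_node * 1e3:.1f} milliseconds")
print(f"ratio:      {inter_node / intra_node:.0f}x")
```

This is why tensor parallelism is usually kept inside a node, while the less chatty data and pipeline parallelism span nodes.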
You're a hardware architect at a lab planning to train a 100B-parameter model. Discuss with the AI tutor how you'd configure your GPU cluster: what parallelism strategies would you use, what interconnect do you need, and how does the H100's Transformer Engine affect your precision choices?
When Groq's inference service went public in early 2024, benchmarkers clocked it serving Llama 2 70B at over 300 tokens per second per user — roughly 10× faster than typical GPU-based inference endpoints. Groq's chip, the Language Processing Unit, achieved this not through raw FLOPS but through a radically different architectural philosophy: eliminating memory bandwidth as the bottleneck entirely by placing all model weights in on-chip SRAM.
Purpose-built inference accelerators compete on three primary dimensions: tokens per second per chip (throughput), time-to-first-token (latency), and cost per million tokens (efficiency). These metrics often trade off against each other and against chip area — creating distinct architectural philosophies.
Four major design approaches have emerged from the 2020–2025 inference hardware wave, each reflecting a different hypothesis about where the bottleneck lies and how to eliminate it.
| Chip | Key Insight | Throughput Advantage | Tradeoff |
|---|---|---|---|
| Groq LPU | All weights in on-chip SRAM; deterministic execution, zero DRAM | ~300 tok/s (Llama 70B, 2024) | Limited model size; high chip cost |
| Google TPU v5e | Matrix multiply units + high-BW HBM; bfloat16 native; scale-out mesh | Optimized for batch; strong at high QPS | Requires quantization or TPU-optimized models |
| AWS Inferentia2 | Custom ISA, on-chip SRAM scratchpads, NeuronCore clusters | 4× lower cost than GPU for INT8 inference | Requires Neuron SDK compilation |
| NVIDIA L40S | Ada Lovelace GPU; INT8 Tensor Cores; lower power than H100 | Good generality; high ecosystem support | Still DRAM-bound at small batch |
Google designed TPUs around systolic arrays — grids of multiply-accumulate units that pass partial sums neighbor-to-neighbor, maximizing data reuse without the generality overhead of a GPU's SIMT model. This makes TPUs extremely efficient for the large matrix multiplications that dominate transformer computations when batch sizes are high.
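A cycle-by-cycle toy model shows how a systolic array reuses operands. This sketch models an output-stationary array, which is one of several possible dataflows and is chosen here only because it is the simplest to simulate; it is not a description of the actual TPU design.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    Processing element (i, j) owns one accumulator. Rows of `a` flow in from
    the left and columns of `b` from the top, each skewed by one cycle per
    row or column, so operand pair k reaches element (i, j) at cycle i + j + k.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    acc = [[0] * cols for _ in range(rows)]
    total_cycles = rows + cols + inner - 2
    for cycle in range(total_cycles):
        for i in range(rows):
            for j in range(cols):
                k = cycle - i - j
                if 0 <= k < inner:          # an operand pair is present this cycle
                    acc[i][j] += a[i][k] * b[k][j]
    return acc, total_cycles


product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, "in", cycles, "cycles")
```

Each operand is fetched from memory once and then handed from neighbor to neighbor, which is the data reuse that makes the design efficient for large matrix multiplications.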
The TPU v4, deployed in Google's AI Supercomputer pods from 2021, connected 4,096 chips via a 3D torus interconnect with 1.1 Tb/s total bandwidth per chip. Google used this infrastructure to serve production Bard (now Gemini) queries from early 2023. The TPU v5e (2023), optimized specifically for inference, trades peak FLOPS for lower power and cost-per-token — Google claimed 2× better performance per dollar for inference versus TPU v4 in their launch materials.
Amazon introduced Inferentia in 2019 and Inferentia2 in 2023, targeting the specific economics of serving ML models at AWS scale. Inferentia2 uses a NeuronCore v2 architecture with 32 MB of on-chip SRAM per core — compared to roughly 50 MB shared across an entire A100. This SRAM serves as a managed scratchpad for weights and activations in tight inference loops, dramatically reducing HBM traffic.
AWS reports that Inferentia2-based instances (Inf2) deliver up to 4× higher throughput and up to 10× lower latency than first-generation Inferentia (Inf1) instances. Anthropic's partnership with AWS, backed by an Amazon investment of up to $4 billion, is partly predicated on using AWS Trainium and Inferentia chips for Claude, a public signal that purpose-built silicon is cost-competitive at hyperscale.
Purpose-built inference chips are designed to take maximum advantage of quantized weights. INT8 inference requires half the memory bandwidth of FP16 for weight loading; INT4 cuts it to a quarter. On chips like Inferentia2 and the L40S with native INT8 Tensor Cores, this translates directly to 2–4× higher throughput. The accuracy cost of INT8 quantization on models like Llama 2 was measured by Meta at under 1% perplexity degradation — an acceptable tradeoff for the inference cost reduction at millions of daily queries.
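A minimal sketch of symmetric per-tensor INT8 quantization shows both the memory saving and where the error comes from. The weight values are made up for illustration, and production schemes typically use per-channel or per-group scales rather than one scale for the whole tensor.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: map [-amax, amax] onto [-127, 127]."""
    amax = max(abs(w) for w in weights)
    scale = amax / 127 if amax > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floating-point weights from INT8 codes."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.02, 0.731]        # hypothetical weight values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes, f"scale={scale:.4f}", f"max error={max_error:.4f}")
print("bytes per weight: FP16=2, INT8=1, INT4=0.5")
```

The rounding error is bounded by half the scale, so tensors with a few large outliers quantize worse than tensors with a tight value range.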
Inference hardware efficiency depends critically on keeping arithmetic intensity high — which requires batching multiple user requests together. Continuous batching, introduced in Orca (2022) and popularized by vLLM (2023), allows new requests to join a batch mid-generation rather than waiting for all requests to finish. This keeps GPUs (and custom accelerators) from sitting idle between request arrivals, raising effective batch size and thus arithmetic intensity.
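The difference between the two scheduling policies can be seen in a toy simulation. This is an illustrative sketch, not the Orca or vLLM scheduler: requests are hypothetical `(arrival_step, tokens_to_generate)` pairs, every active request emits one token per step, and prefill cost is ignored.

```python
def static_batching_steps(requests, max_batch):
    """A batch is formed, then runs until its longest request finishes."""
    pending = sorted(requests)
    step = 0
    while pending:
        batch = []
        while pending and pending[0][0] <= step and len(batch) < max_batch:
            batch.append(pending.pop(0)[1])
        step += max(batch) if batch else 1   # idle one step if nothing has arrived
    return step


def continuous_batching_steps(requests, max_batch):
    """Arrived requests join free slots at every decode step."""
    pending = sorted(requests)
    active = []                              # remaining tokens per in-flight request
    step = 0
    while pending or active:
        while pending and pending[0][0] <= step and len(active) < max_batch:
            active.append(pending.pop(0)[1])
        # One decode iteration: each request emits a token; finished ones leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
        step += 1
    return step


requests = [(0, 8), (0, 2), (1, 2), (2, 2)]
static_steps = static_batching_steps(requests, max_batch=2)
continuous_steps = continuous_batching_steps(requests, max_batch=2)
print(f"static: {static_steps} steps, continuous: {continuous_steps} steps")
```

In the static case the short requests wait behind the long one; with continuous batching they slip into the slot that frees up after two steps.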
vLLM's PagedAttention additionally manages KV cache memory like virtual memory pages — reducing fragmentation that previously wasted 60–80% of KV cache VRAM. Together, these techniques are as important to inference throughput as the hardware itself, which is why inference chip vendors (NVIDIA, AWS, Groq) all provide tight software stack integrations rather than selling bare silicon.
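The paging idea can be sketched as a small block allocator. This is a simplified illustration rather than vLLM's actual data structures; the class name, method names, and the block size of 16 tokens are assumptions.

```python
class PagedKVAllocator:
    """Hands out fixed-size KV blocks on demand instead of reserving max context up front."""

    def __init__(self, num_blocks: int, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}    # sequence id -> list of physical block ids
        self.lengths = {}         # sequence id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_tokens == 0:      # first token, or current block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


allocator = PagedKVAllocator(num_blocks=8)
for _ in range(20):
    allocator.append_token("session-a")
blocks_used = len(allocator.block_tables["session-a"])
print(f"20 tokens occupy {blocks_used} blocks, {len(allocator.free_blocks)} still free")
```

Because a sequence wastes at most one partially filled block, memory that would have been reserved for unused context stays available for other sessions.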
Your company needs to serve a Llama 2 70B model to 50,000 daily active users with a p99 latency target of 500ms for first token. Explore with the AI tutor the tradeoffs between Groq LPU, AWS Inferentia2, Google TPU v5e, and NVIDIA L40S for this workload.
In April 2023, The Information reported an estimate that OpenAI was spending approximately $700,000 per day running ChatGPT on Azure infrastructure, an annualized inference bill exceeding $250 million. Meanwhile, GPT-4 training was estimated at $50–$100 million as a one-time cost. The pattern was clear: training is a project; inference is a business. At that rate, the ongoing compute bill overtakes the upfront training investment within months of deployment.
AI infrastructure finance distinguishes between capital expenditure (buying or reserving GPU clusters for training) and operational expenditure (ongoing cloud compute for serving). A model trained once can serve queries for years — meaning inference hardware costs accumulate indefinitely while training costs are amortized over the model's useful life.
For a model serving 1 trillion tokens per day at $0.002 per thousand tokens (a rough 2024 market rate for GPT-4 class models), the daily bill is about $2 million and the annual inference cost is approximately $730 million. A GPT-4-class training run at an estimated $50–100 million is matched by less than two months of serving at that scale. This asymmetry drives the inference optimization arms race: every 10% reduction in inference cost-per-token translates to $73 million in annual savings at that serving volume.
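The arithmetic is easy to sanity-check. At $0.002 per thousand tokens, one billion tokens a day costs about $730,000 a year, so an annual bill near $730 million implies a serving volume on the order of a trillion tokens a day. The function names below are illustrative, and the $100 million training cost is the upper end of the estimate quoted above.

```python
def annual_inference_cost(tokens_per_day: float, price_per_1k_tokens: float) -> float:
    """Yearly serving cost in dollars at a flat per-token price."""
    return tokens_per_day / 1000 * price_per_1k_tokens * 365


def days_to_match_training(training_cost: float, tokens_per_day: float,
                           price_per_1k_tokens: float) -> float:
    """Days of serving after which cumulative inference spend equals the training cost."""
    daily = tokens_per_day / 1000 * price_per_1k_tokens
    return training_cost / daily


PRICE = 0.002   # dollars per thousand tokens

per_billion = annual_inference_cost(1e9, PRICE)
per_trillion = annual_inference_cost(1e12, PRICE)
breakeven = days_to_match_training(100e6, 1e12, PRICE)

print(f"1 billion tokens/day:  ${per_billion:,.0f} per year")
print(f"1 trillion tokens/day: ${per_trillion:,.0f} per year")
print(f"serving spend matches a $100M training run after {breakeven:.0f} days")
print(f"a 10% cost cut saves ${0.10 * per_trillion:,.0f} per year")
```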
In response to these economics, major AI labs have adopted split hardware portfolios: H100/A100 clusters for training new models, combined with purpose-built or cost-optimized inference hardware for production serving. This creates distinct procurement, operations, and software engineering requirements — effectively two separate infrastructure teams within the same organization.
Google's approach is illustrative. The company uses TPU v4 pods for training Gemini models and TPU v5e for serving — the v5e chip was explicitly designed for inference economics, trading some peak FLOPS for lower power and cost-per-token. Google Cloud reported at Next 2023 that TPU v5e provides approximately 2× better price-performance for inference versus TPU v4.
Microsoft (OpenAI's partner) takes a different approach: using A100 and H100 clusters in Azure for both training and inference, but in different operating modes. Training jobs run in large dedicated reservations; inference runs on the same hardware under a cluster management regime tuned for per-request latency and serving throughput rather than sustained training throughput.
Meta's 2023 Annual Report disclosed that AI infrastructure efficiency was a primary cost management lever. Meta deployed quantization (INT8 and INT4) across production recommendation and language models, reportedly reducing inference compute costs by 30–40% for those models. At Meta's scale, serving billions of daily users, this translates to hundreds of millions in annual infrastructure savings. INT8 quantization results reported for Llama 2, with under 1% perplexity degradation, support quantization as a production-viable inference cost lever.
A structural trend making inference hardware economics harder is context length expansion. GPT-4's jump from 8K to 128K context and Claude's 200K context window have dramatically increased KV cache memory requirements. At 128K tokens with a 70B model, the KV cache for a single session can exceed 50 GB — meaning a GPU with 80 GB HBM2e can serve only one such session at peak context.
This drives demand for KV cache offloading strategies (moving less-recently-used cache entries to CPU DRAM or NVMe SSD), and for hardware with larger HBM capacity. The H100 NVL, a pair of NVLink-bridged H100 cards, ships with 188 GB of HBM3 across its two GPUs for exactly this reason: NVIDIA positions it for large language model inference rather than training.
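A least-recently-used offloading policy can be sketched with an ordered dictionary. This is a toy two-tier model under stated assumptions: the class and method names are hypothetical, block contents are opaque Python objects, and the cost of moving data between tiers is not modeled.

```python
from collections import OrderedDict


class KVOffloadCache:
    """Keeps recently used KV blocks in a GPU tier and spills the rest to a host tier."""

    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()   # block id -> data, least recently used first
        self.host = {}             # stands in for CPU DRAM or NVMe

    def put(self, block_id, data) -> None:
        """Store a block on the GPU tier, evicting the coldest blocks if over capacity."""
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            victim, victim_data = self.gpu.popitem(last=False)
            self.host[victim] = victim_data

    def get(self, block_id):
        """Fetch a block, promoting it back to the GPU tier if it was offloaded."""
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        data = self.host.pop(block_id)   # raises KeyError if the block was never stored
        self.put(block_id, data)
        return data


cache = KVOffloadCache(gpu_capacity=2)
for name in ("a", "b", "c"):
    cache.put(name, f"kv-{name}")
cache.get("a")                            # "a" was offloaded; fetching it evicts "b"
print("gpu:", list(cache.gpu), "host:", list(cache.host))
```

The trade-off is latency: a block fetched back from the host tier costs a transfer over a much slower link than HBM, so the policy only pays off when cold entries are rarely touched.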
The training-inference split is reshaping the competitive landscape. NVIDIA still dominates training (roughly 80% market share in datacenter GPU compute as of 2024) but faces meaningful competition in inference from AWS Inferentia, Google TPUs, and emerging players. The inference market's fragmentation reflects its diversity: different latency requirements, model sizes, and throughput targets reward different architectural approaches.
Analysts at SemiAnalysis estimated in 2024 that inference hardware spending would exceed training hardware spending for the first time by 2025, driven by the deployment of foundation models into production applications. This shift is already visible in NVIDIA's product roadmap: the Blackwell generation (B100 and B200) adds FP4 precision and retains multi-instance GPU partitioning, features aimed squarely at inference efficiency.
You're the VP of Infrastructure at a startup that just deployed a GPT-4-class model. Your board wants to understand the 3-year total cost of ownership breakdown between training and inference, and which optimization levers — quantization, continuous batching, purpose-built inference chips — deliver the most financial impact.