Module 6 · Lesson 1

Two Modes, Two Hardware Philosophies

Training builds knowledge. Inference delivers it. The silicon behind each task looks nothing alike.

Why can't one chip optimally do both?

When OpenAI deployed ChatGPT at scale on Microsoft Azure following the November 2022 launch, the infrastructure teams discovered an uncomfortable truth: the same GPU clusters that had trained GPT-3.5 were expensive and inefficient for serving millions of daily queries. Training and inference, it turned out, are not just different workloads; they favor different hardware architectures.

The Fundamental Split

Modern large language models live two very different lives. Training is the months-long process of adjusting billions of parameters by computing gradients across massive datasets. Inference is the millisecond-to-millisecond act of generating a response when a user submits a prompt. These two workloads have opposing computational signatures.

Training is compute-bound: the bottleneck is raw floating-point throughput. Every GPU cycle must be filled with matrix multiplications. GPT-4 training, estimated to have consumed roughly 25,000 A100 GPUs for approximately 90 to 100 days according to reporting from The Information and SemiAnalysis, demanded sustained throughput above all else.

Inference is often memory-bandwidth-bound: the bottleneck is how fast the chip can load model weights from memory into compute units. A 70-billion-parameter model in FP16 requires 140 GB of weight storage. Each token generated means streaming those weights across the memory bus — repeatedly, for every user.
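As a back-of-the-envelope check on these numbers, the sketch below computes weight storage and the minimum time needed to stream the weights once. The function names are illustrative, and the 2 TB/s figure is an assumed aggregate memory bandwidth; a 140 GB model would in practice be sharded across several devices, so the result is only a rough per-token lower bound.

```python
def weight_bytes(num_params: float, bytes_per_param: float) -> float:
    """Storage needed for the weights alone."""
    return num_params * bytes_per_param


def min_seconds_per_token(weight_size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on decode time per token at batch size 1.

    Assumes every weight is read once per generated token and ignores
    compute time, KV cache traffic, and inter-device communication.
    """
    return weight_size_bytes / bandwidth_bytes_per_s


fp16_70b = weight_bytes(70e9, 2)                    # 70B parameters at 2 bytes each
seconds = min_seconds_per_token(fp16_70b, 2e12)     # assumed 2 TB/s aggregate bandwidth
print(f"{fp16_70b / 1e9:.0f} GB of weights, at least {seconds * 1e3:.0f} ms per token")
```

Batching several requests together amortizes the same weight read over more tokens, which is why serving systems work hard to raise the effective batch size.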

Why This Matters

A chip optimized for training maximizes FLOPS. A chip optimized for inference maximizes memory bandwidth and minimizes latency per token. Maximizing both simultaneously is physically constrained by power budgets and die area — hence the fork into specialized hardware.

Roofline Model: The Physics of the Divide

Chip architects use the roofline model to visualize where a workload hits its ceiling. Every operation has an arithmetic intensity: the ratio of floating-point operations to bytes of memory traffic. The large-batch matrix multiplications in training have high arithmetic intensity (hundreds of FLOPs per byte), so they benefit from raw compute. Inference at small batch sizes often has an arithmetic intensity below 10 FLOPs per byte, which makes it memory-bandwidth-limited.
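To make the batch-size dependence concrete, here is a minimal sketch of arithmetic intensity for a single linear layer. The layer width of 8192 and the FP16 element size are assumptions chosen for illustration, and the traffic model counts each tensor exactly once (no cache effects).

```python
def linear_layer_intensity(batch: int, d_in: int, d_out: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte for y = x @ W, with x of shape (batch, d_in) and W of shape (d_in, d_out)."""
    flops = 2 * batch * d_in * d_out        # one multiply and one add per weight, per input row
    bytes_moved = bytes_per_element * (
        d_in * d_out        # read the weights once
        + batch * d_in      # read the input activations
        + batch * d_out     # write the output activations
    )
    return flops / bytes_moved


for batch in (1, 32, 512):
    print(batch, round(linear_layer_intensity(batch, 8192, 8192), 1))
```

Under these assumptions the intensity is close to the batch size while the batch is small, because the weight read dominates the traffic.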

NVIDIA's A100 has 2 TB/s of HBM2e bandwidth and 312 TFLOPS of BF16 compute. Its compute-to-bandwidth ratio is roughly 156 FLOPs per byte — fine for training, but inference at batch size 1 barely scratches the surface of that compute capacity. The chip sits mostly idle on the compute side while memory transfers dominate latency.
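The roofline bound itself is a one-line formula: attainable throughput is the smaller of peak compute and arithmetic intensity times memory bandwidth. The sketch below applies it to the A100 figures quoted above; the function names are illustrative, and the intensity of 1 FLOP per byte for batch-size-1 decoding is an assumed round number.

```python
def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOPs/byte) at which a workload stops being bandwidth-bound."""
    return peak_flops / bandwidth_bytes_per_s


def attainable_flops(intensity: float, peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)


A100_PEAK = 312e12      # BF16 FLOP/s, as quoted in the text
A100_BW = 2e12          # bytes/s, as quoted in the text

print(ridge_point(A100_PEAK, A100_BW))                      # ratio of compute to bandwidth
low = attainable_flops(1.0, A100_PEAK, A100_BW)             # assumed intensity for batch-1 decode
print(f"{low / A100_PEAK:.2%} of peak compute is reachable at 1 FLOP/byte")
```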

Dimension	Training	Inference
Primary bottleneck	Compute (FLOPS)	Memory bandwidth
Batch size	Thousands of samples	Often 1–32 requests
Precision needed	BF16 / FP32 mixed	INT8 / INT4 viable
Duration	Weeks to months	Milliseconds per token
Memory footprint	Weights + gradients + optimizer states (3–4× model size)	Weights + KV cache only
Latency tolerance	High (throughput matters)	Low (user-facing)

The KV Cache: Inference's Hidden Memory Burden

One inference-specific phenomenon with no training analog is the key-value cache. During autoregressive generation, each new token attends to all prior tokens' keys and values. Rather than recompute them, the model caches them in GPU memory. For a long context — GPT-4's 128K token window — the KV cache can consume tens of gigabytes of VRAM per concurrent user. This is purely an inference-time memory pressure that never appears during training.
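The size of the cache follows directly from the model shape. The sketch below is a rough estimate under stated assumptions: the default shape (80 layers, 8 key-value heads via grouped-query attention, head dimension 128, FP16) is an illustrative 70B-class configuration, not a specification of any model named above.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_element: int = 2) -> int:
    """KV cache size for one sequence: a key and a value vector per layer, per KV head, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * seq_len


# Assumed 70B-class shape with grouped-query attention, 128K-token context.
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=128 * 1024)
# Same shape with full multi-head attention (64 KV heads) for comparison.
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=128 * 1024)
print(f"GQA: {gqa / 1e9:.1f} GB per session, MHA: {mha / 1e9:.1f} GB per session")
```

The cache grows linearly with context length and with the number of concurrent sessions, which is why attention variants that shrink the number of KV heads matter so much for serving cost.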

Anthropic's Claude models running on AWS Inferentia2 chips must manage KV cache allocation dynamically across thousands of simultaneous sessions — a problem training hardware never encounters. This drove AWS to architect Inferentia2 with specific on-chip SRAM pools sized for KV cache patterns rather than gradient storage.

Key Terms

Arithmetic Intensity: Ratio of floating-point operations to bytes of data moved; determines whether a workload is compute-bound or memory-bound.

KV Cache: Memory buffer storing attention key and value tensors for all prior tokens in a generation sequence; purely an inference-time cost.

Roofline Model: Framework plotting achievable performance against arithmetic intensity, revealing whether compute or bandwidth is the binding constraint.

Lesson 1 Quiz

Two Modes, Two Hardware Philosophies — check your understanding

1. Why is inference often described as memory-bandwidth-bound rather than compute-bound?

Correct. With batch size 1, arithmetic intensity drops below the chip's compute-to-bandwidth ratio — weight loading becomes the bottleneck, not multiplication.

Not quite. The key insight is arithmetic intensity at small batch sizes — weight streaming dominates the time budget, not floating-point precision or hardware design.

2. What is the KV cache and why does it only appear during inference?

Correct. Training processes full sequences in parallel (no autoregression), so no KV cache is needed. Inference generates one token at a time and must cache prior context.

Incorrect. The KV cache is specific to attention during autoregressive token generation — it stores keys and values so the model doesn't recompute them for every new token.

3. Training large LLMs requires storing gradients and optimizer states in addition to weights. Approximately how much more memory does this require compared to weights alone?

Correct. With mixed-precision training using the Adam optimizer: FP32 master weights + FP32 gradients + two FP32 optimizer moments ≈ 16 bytes per parameter, roughly 4× the footprint of FP32 weights alone (and 8× that of a 2-byte FP16 copy).

Incorrect. The Adam optimizer requires storing first and second moment estimates in FP32 alongside FP32 master weights and FP16 working weights, totaling 3–4× the FP32 weight size.

Lab 1 — Roofline Reasoning

Discuss training vs. inference hardware constraints with an AI tutor

Your Mission

You're advising a startup that just finished training a 7B-parameter language model. They want to serve it to 10,000 concurrent users. Discuss with the AI tutor: what hardware tradeoffs should they consider when moving from training infrastructure to inference infrastructure?

Starter: "We trained our 7B model on A100s. Should we serve on A100s too, or is there a better option for inference?"

Module 6 · Lesson 2

Training Chips — The A100 and H100 Era

NVIDIA's datacenter GPUs became the default training substrate. Understanding why reveals what training hardware must do.

What architectural decisions make the H100 four times faster than the A100 at transformer training?

At GTC 2022, Jensen Huang unveiled the H100, the first GPU built from the ground up with transformer workloads explicitly in mind. The Hopper architecture introduced the Transformer Engine, a dedicated hardware block that dynamically switches between FP8 and BF16 precision within a single layer pass. By NVIDIA's figures, the H100 delivered up to roughly 4× the training throughput of the A100 on GPT-style models, a performance leap that arrived just as demand for training capacity exploded post-ChatGPT.

What Training Hardware Must Deliver

Training a frontier model is, at its core, a sustained matrix multiplication problem. The forward pass computes activations; the backward pass computes gradients using those same weight matrices. Both phases demand high-precision arithmetic (gradients must accumulate without overflow), massive parallelism across thousands of chips, and high-bandwidth interconnects to synchronize gradient updates.

The A100 (Ampere, 2020) shipped with 312 TFLOPS of BF16 Tensor Core performance and 600 GB/s NVLink 3.0 chip-to-chip bandwidth. Each chip in a DGX A100 server is connected to seven others via NVLink — essential for tensor and pipeline parallelism across a node. Scaling beyond a single node required InfiniBand at 200 Gb/s, with all-reduce collective operations distributing gradient updates across the cluster.

312 TFLOPS BF16: NVIDIA A100 · Tensor Core peak

989 TFLOPS BF16: NVIDIA H100 SXM · Transformer Engine

3.35× Throughput Gain: H100 vs A100 on GPT-3 training benchmark

900 GB/s NVLink 4.0: H100 intra-node bandwidth

The Transformer Engine: FP8 Mixed Precision

The H100's most significant training innovation is its Transformer Engine. Traditional mixed-precision training (published by NVIDIA and Baidu researchers at ICLR 2018) runs forward and backward computation in FP16 while keeping an FP32 master copy of the weights for the update. The Transformer Engine extends this by introducing FP8 computation, 8-bit floating point, for the most numerically tolerant operations within each transformer block.

The key challenge with FP8 is its limited dynamic range. The Transformer Engine maintains per-tensor scaling factors, automatically adjusting them each iteration to prevent underflow or overflow. This dynamic range management happens in hardware, transparently to the framework. The result, according to NVIDIA's H100 launch documentation, is near-FP16 accuracy with roughly 2× the arithmetic throughput.
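The idea of per-tensor scaling can be illustrated in a few lines. This is a toy sketch, not NVIDIA's implementation: `round_to_e4m3_like` is a hypothetical stand-in that keeps four significant bits and clamps to ±448 (the largest finite E4M3 value), ignoring subnormals and the format's minimum exponent.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def round_to_e4m3_like(x: float) -> float:
    """Crude FP8 stand-in: keep 4 significant bits, clamp to +/-448 (not bit-exact)."""
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)          # x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    rounded = math.ldexp(round(mantissa * 16) / 16, exponent)
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, rounded))


def quantize_with_scale(values):
    """Scale the tensor so its largest magnitude lands at the top of the FP8 range."""
    amax = max(abs(v) for v in values)
    scale = FP8_E4M3_MAX / amax if amax > 0 else 1.0
    return [round_to_e4m3_like(v * scale) for v in values], scale


def dequantize(quantized, scale):
    """Undo the scaling to recover approximate original values."""
    return [q / scale for q in quantized]


tiny_gradients = [1e-5, -3e-5, 2.5e-6]          # far below what unscaled FP8 could represent
q, scale = quantize_with_scale(tiny_gradients)
print(q, dequantize(q, scale))
```

Without the scale factor, values this small would fall below the representable range of an 8-bit float; with it, they occupy the full range and lose only mantissa precision.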

Real Case: Meta Llama 2 Training

Meta trained Llama 2 70B on 2,000 A100 GPUs over approximately 1,720,000 GPU-hours. For the model's 2-trillion-token training corpus, that works out to roughly 650,000 tokens per second in aggregate if all 2,000 GPUs ran concurrently. The training run used FSDP (Fully Sharded Data Parallelism) across nodes, where gradient synchronization latency across InfiniBand links was the primary inter-node bottleneck. Hardware failures during training reportedly caused approximately 2,000 individual job restarts, underscoring why checkpoint frequency is a first-class training hardware concern.

NVLink and the Multi-GPU Training Problem

No single GPU can hold a frontier model's weights plus its optimizer states. GPT-3 at 175 billion parameters requires roughly 2.8 TB of training state at 16 bytes per parameter (FP32 master weights, gradients, and two Adam moments), far exceeding any single chip's memory. Training therefore requires model parallelism: splitting either layers (pipeline parallelism) or weight tensors (tensor parallelism) across multiple GPUs.
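The memory arithmetic can be sketched as follows. The 16 bytes per parameter is the mixed-precision Adam accounting used in this module (FP32 master weights, gradients, and two moments); the 80 GB device size and the 64-GPU shard group are assumptions, and activation memory is ignored entirely.

```python
import math


def training_state_bytes(num_params: float, bytes_per_param: int = 16) -> float:
    """Weights, gradients, and Adam moments; activations are not counted."""
    return num_params * bytes_per_param


def min_gpus_to_hold(total_bytes: float, gpu_memory_bytes: float) -> int:
    """Smallest number of devices whose combined memory fits the state."""
    return math.ceil(total_bytes / gpu_memory_bytes)


def sharded_bytes_per_gpu(total_bytes: float, num_gpus: int) -> float:
    """Per-device share when the state is split evenly across a shard group."""
    return total_bytes / num_gpus


state = training_state_bytes(175e9)
print(f"{state / 1e12:.1f} TB of training state")
print(min_gpus_to_hold(state, 80e9), "GPUs of 80 GB just to hold it")
print(f"{sharded_bytes_per_gpu(state, 64) / 1e9:.2f} GB per GPU across 64 GPUs")
```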

Tensor parallelism requires all-reduce operations after every transformer layer's forward and backward pass — typically every few milliseconds. At this frequency, the interconnect speed is critical. NVLink 4.0's 900 GB/s bidirectional bandwidth (H100) means an all-reduce across eight chips on a single node completes in microseconds. InfiniBand HDR at 200 Gb/s (roughly 25 GB/s) handles the slower inter-node reduction. The architectural consequence: training clusters are designed as fat-tree networks with high bisection bandwidth to minimize all-reduce latency.
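A rough bandwidth-only estimate shows why the two interconnect tiers behave so differently. The sketch assumes a ring all-reduce, a hypothetical 16 MB message, and 450 GB/s per direction for NVLink (half of the 900 GB/s bidirectional figure); per-hop latency and protocol overhead are ignored.

```python
def ring_allreduce_seconds(message_bytes: float, num_gpus: int, link_bytes_per_s: float) -> float:
    """Bandwidth term of a ring all-reduce.

    Each GPU sends 2 * (N - 1) / N times the message size: once around the
    ring for reduce-scatter and once for all-gather.
    """
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic_per_gpu / link_bytes_per_s


MESSAGE = 16e6                      # assumed activation tensor size in bytes
nvlink = ring_allreduce_seconds(MESSAGE, 8, 450e9)   # assumed per-direction NVLink bandwidth
infiniband = ring_allreduce_seconds(MESSAGE, 8, 25e9)
print(f"NVLink: {nvlink * 1e6:.0f} us, InfiniBand: {infiniband * 1e6:.0f} us")
```

This gap is one reason tensor parallelism is usually kept inside a node, while slower inter-node links carry less frequent traffic.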

Key Terms

Transformer Engine: NVIDIA H100 hardware block that automatically manages FP8/BF16 precision switching per tensor to maximize throughput with minimal accuracy loss.

Tensor Parallelism: Model parallelism strategy splitting individual weight matrices across GPUs; requires an all-reduce after every layer's forward and backward pass.

FSDP: Fully Sharded Data Parallelism. Shards optimizer states, gradients, and parameters across ranks, reducing per-GPU memory by a factor of up to the number of GPUs in the shard group.

Lesson 2 Quiz

Training Chips — the A100/H100 Era

1. What is the primary innovation of the H100's Transformer Engine compared to standard mixed-precision training?

Correct. The Transformer Engine's automatic per-tensor scaling solves FP8's limited dynamic range problem, enabling near-FP16 accuracy with roughly 2× the arithmetic throughput.

Incorrect. The Transformer Engine uses FP8 (not FP64) with automatic per-tensor scaling factors to manage dynamic range — this is what enables higher throughput.

2. Why does tensor parallelism place such stringent demands on interconnect bandwidth?

Correct. Tensor parallelism splits weight matrices; recomposing partial outputs requires all-reduce after every layer, making high-frequency, low-latency interconnects essential.

Incorrect. Tensor parallelism requires all-reduce after every individual transformer layer — not just at step boundaries — making interconnect latency a per-layer training cost.

3. Meta's Llama 2 70B training run experienced approximately 2,000 job restarts. What does this reveal about training infrastructure requirements?

Correct. At the scale of thousands of GPUs running for months, hardware failures are statistically near-certain. Checkpointing strategy and fast restart capability are infrastructure requirements, not afterthoughts.

Incorrect. Failures at scale are statistically expected — with 2,000 GPUs running for months, individual component failures are inevitable. The lesson is designing for fault tolerance and checkpoint recovery.

Lab 2 — Training Cluster Design

Explore the architectural decisions behind frontier model training hardware

Your Mission

You're a hardware architect at a lab planning to train a 100B-parameter model. Discuss with the AI tutor how you'd configure your GPU cluster: what parallelism strategies would you use, what interconnect do you need, and how does the H100's Transformer Engine affect your precision choices?

Starter: "We need to train a 100B parameter model. Walk me through how to think about parallelism strategy and interconnect requirements."

Module 6 · Lesson 3

Inference Chips — Purpose-Built Accelerators

Google's TPUs, AWS Inferentia, Groq's LPU, and NVIDIA's inference-optimized line redrew the economics of serving AI at scale.

When the workload shifts from training to serving, what does optimal silicon look like — and who got there first?

When Groq's inference service went public in early 2024, benchmarkers clocked it serving Llama 2 70B at over 300 tokens per second per user — roughly 10× faster than typical GPU-based inference endpoints. Groq's chip, the Language Processing Unit, achieved this not through raw FLOPS but through a radically different architectural philosophy: eliminating memory bandwidth as the bottleneck entirely by placing all model weights in on-chip SRAM.

The Inference Optimization Axes

Purpose-built inference accelerators compete on three primary dimensions: tokens per second per chip (throughput), time-to-first-token (latency), and cost per million tokens (efficiency). These metrics often trade off against each other and against chip area — creating distinct architectural philosophies.

Four major design approaches have emerged from the 2020–2025 inference hardware wave, each reflecting a different hypothesis about where the bottleneck lies and how to eliminate it.

Chip	Key Insight	Throughput Advantage	Tradeoff
Groq LPU	All weights in on-chip SRAM; deterministic execution, zero DRAM	~300 tok/s (Llama 70B, 2024)	Limited model size; high chip cost
Google TPU v5e	Matrix multiply units + high-BW HBM; bfloat16 native; scale-out mesh	Optimized for batch; strong at high QPS	Requires quantization or TPU-optimized models
AWS Inferentia2	Custom ISA, on-chip SRAM scratchpads, NeuronCore clusters	4× lower cost than GPU for INT8 inference	Requires Neuron SDK compilation
NVIDIA L40S	Ada Lovelace GPU; INT8 Tensor Cores; lower power than H100	Good generality; high ecosystem support	Still DRAM-bound at small batch

Google TPU v4 and v5e: The Systolic Array Philosophy

Google designed TPUs around systolic arrays — grids of multiply-accumulate units that pass partial sums neighbor-to-neighbor, maximizing data reuse without the generality overhead of a GPU's SIMT model. This makes TPUs extremely efficient for the large matrix multiplications that dominate transformer computations when batch sizes are high.
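The neighbor-to-neighbor dataflow can be simulated in plain Python. This is an output-stationary toy chosen for brevity (each processing element accumulates one output value); it is a sketch of the general systolic idea and not a model of the TPU's actual dataflow.

```python
def systolic_matmul(A, B):
    """Multiply A (M x K) by B (K x N) on a simulated M x N grid of processing elements.

    Rows of A enter from the left and columns of B from the top, each skewed by
    one cycle per row or column. Every cycle, each element multiplies the two
    values it holds, adds the product to its accumulator, and hands the values
    on to its right and lower neighbors.
    """
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0] * N for _ in range(M)]
    a_reg = [[0] * N for _ in range(M)]     # A values currently held by each element
    b_reg = [[0] * N for _ in range(M)]     # B values currently held by each element
    cycles = M + N + K - 2
    for t in range(cycles):
        new_a = [[0] * N for _ in range(M)]
        new_b = [[0] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                if j == 0:                  # left edge: feed row i of A, delayed by i cycles
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < K else 0
                else:                       # otherwise take the value from the left neighbor
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:                  # top edge: feed column j of B, delayed by j cycles
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < K else 0
                else:                       # otherwise take the value from the upper neighbor
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b
        for i in range(M):
            for j in range(N):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, "in", cycles, "cycles")
```

Each input value is fetched from memory once and then reused as it moves across the grid, which is the data-reuse property the paragraph above describes.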

The TPU v4, deployed in Google's AI Supercomputer pods from 2021, connected 4,096 chips via a 3D torus interconnect with 1.1 Tb/s total bandwidth per chip. Google used this infrastructure to serve production Bard (now Gemini) queries from early 2023. The TPU v5e (2023), optimized specifically for inference, trades peak FLOPS for lower power and cost-per-token — Google claimed 2× better performance per dollar for inference versus TPU v4 in their launch materials.

AWS Inferentia2: The Custom Silicon Bet

Amazon introduced Inferentia in 2019 and Inferentia2 in 2023, targeting the specific economics of serving ML models at AWS scale. Inferentia2 uses a NeuronCore v2 architecture with 32 MB of on-chip SRAM per core — compared to roughly 50 MB shared across an entire A100. This SRAM serves as a managed scratchpad for weights and activations in tight inference loops, dramatically reducing HBM traffic.

AWS reports that Inferentia2-based instances (Inf2) deliver up to 4× higher throughput and up to 10× lower latency than first-generation Inf1 instances. Anthropic signed a $4 billion partnership with AWS partly predicated on Inferentia2 and Trainium2 for production Claude serving, a public signal that purpose-built inference silicon is cost-competitive at hyperscale.

Quantization: The Software Lever on Inference Hardware

Purpose-built inference chips are designed to take maximum advantage of quantized weights. INT8 inference requires half the memory bandwidth of FP16 for weight loading; INT4 cuts it to a quarter. On chips like Inferentia2 and the L40S with native INT8 Tensor Cores, this translates directly to 2–4× higher throughput. The accuracy cost of INT8 quantization on models like Llama 2 was measured by Meta at under 1% perplexity degradation — an acceptable tradeoff for the inference cost reduction at millions of daily queries.
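A minimal sketch of symmetric per-tensor INT8 quantization shows where the bandwidth saving and the accuracy cost both come from. Production schemes typically use per-channel or per-group scales and calibration data; the simple absolute-maximum scaling here is an assumption made for brevity.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one scale for the whole tensor."""
    amax = max(abs(w) for w in weights)
    scale = amax / 127 if amax > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


def weight_storage_bytes(num_params: float, bits_per_weight: int) -> float:
    """Bytes that must be streamed from memory to read every weight once."""
    return num_params * bits_per_weight / 8


weights = [0.5, -1.27, 0.01, 0.0]
q, scale = quantize_int8(weights)
print(q, dequantize_int8(q, scale))
for bits in (16, 8, 4):
    print(bits, "bits:", weight_storage_bytes(70e9, bits) / 1e9, "GB for 70B parameters")
```

The rounding error per weight is at most half the scale, so tensors with a few large outliers lose the most precision; this is why finer-grained scales are common in practice.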

Continuous Batching: The Software–Hardware Co-design

Inference hardware efficiency depends critically on keeping arithmetic intensity high — which requires batching multiple user requests together. Continuous batching, introduced in Orca (2022) and popularized by vLLM (2023), allows new requests to join a batch mid-generation rather than waiting for all requests to finish. This keeps GPUs (and custom accelerators) from sitting idle between request arrivals, raising effective batch size and thus arithmetic intensity.
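The scheduling difference can be seen in a small simulation. This sketch counts decode steps only and assumes all requests are already queued, that each request needs a known number of output tokens, and that the accelerator has a fixed number of batch slots; real schedulers also handle prefill, arrivals over time, and memory limits.

```python
from collections import deque


def static_batching_steps(output_lengths, slots):
    """Each batch runs until its longest request finishes before the next batch starts."""
    steps = 0
    for start in range(0, len(output_lengths), slots):
        steps += max(output_lengths[start:start + slots])
    return steps


def continuous_batching_steps(output_lengths, slots):
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(output_lengths)
    active = []                              # remaining tokens for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1                           # one decode step emits one token per active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [10, 2, 2, 2, 10, 2, 2, 2]
for name, fn in (("static", static_batching_steps), ("continuous", continuous_batching_steps)):
    steps = fn(lengths, slots=4)
    print(name, steps, "steps, slot utilization", round(sum(lengths) / (steps * 4), 2))
```

With a mix of long and short requests, the static scheduler leaves slots empty while the long request finishes; the continuous scheduler refills them immediately.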

vLLM's PagedAttention additionally manages KV cache memory like virtual memory pages — reducing fragmentation that previously wasted 60–80% of KV cache VRAM. Together, these techniques are as important to inference throughput as the hardware itself, which is why inference chip vendors (NVIDIA, AWS, Groq) all provide tight software stack integrations rather than selling bare silicon.
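The paging idea can be sketched with a toy block allocator. The class and function names below are hypothetical and this is not vLLM's implementation; it only illustrates allocating fixed-size blocks on demand instead of reserving a maximum-length contiguous region per request.

```python
import math


class PagedKVAllocator:
    """Hands out fixed-size KV blocks on demand and tracks them per request."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}      # request id -> list of physical block ids
        self.lengths = {}           # request id -> number of tokens stored

    def append_token(self, request_id: str) -> None:
        length = self.lengths.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if length == len(table) * self.block_size:      # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


def wasted_fraction_contiguous(actual_lengths, reserved_len):
    """Waste when every request reserves room for reserved_len tokens up front."""
    return 1 - sum(actual_lengths) / (reserved_len * len(actual_lengths))


def wasted_fraction_paged(actual_lengths, block_size):
    """Waste when only the last block of each request can be partly empty."""
    allocated = sum(math.ceil(n / block_size) * block_size for n in actual_lengths)
    return 1 - sum(actual_lengths) / allocated


lengths = [100, 300, 50]
print(round(wasted_fraction_contiguous(lengths, 2048), 3), round(wasted_fraction_paged(lengths, 16), 3))
```

The block table plays the role of a page table: logical token positions map to whichever physical blocks happened to be free.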

Key Terms

Language Processing Unit (LPU): Groq's inference chip using on-chip SRAM to hold all model weights, eliminating DRAM bandwidth as the throughput bottleneck.

Continuous Batching: Inference serving technique allowing new requests to join active batches mid-generation, keeping hardware utilization high.

PagedAttention: vLLM's KV cache memory management system using virtual-memory-style paging to reduce fragmentation and increase inference throughput.

Lesson 3 Quiz

Inference Chips — Purpose-Built Accelerators

1. Groq's LPU achieved dramatically higher inference throughput than GPU-based systems primarily by:

Correct. By fitting weights in on-chip SRAM, the LPU eliminates the memory-bandwidth wall that limits GPU inference throughput at small batch sizes — the fundamental inference bottleneck.

Incorrect. Groq's key innovation is on-chip SRAM for weight storage, not clock speed or HBM. This eliminates the memory bandwidth bottleneck that constrains GPU inference performance.

2. Why does continuous batching improve inference hardware utilization compared to static batching?

Correct. Static batching forces the accelerator to wait for all batch members to finish before accepting new work. Continuous batching fills those idle slots, keeping hardware utilization near 100%.

Incorrect. Continuous batching works by allowing mid-generation request insertion — solving the idle-wait problem of static batching where slow requests hold back the entire batch.

3. AWS's Anthropic partnership ($4B) was partly predicated on Inferentia2 and Trainium2. What does this signal about the inference silicon market?

Correct. When serving billions of tokens daily, even modest per-token cost reductions from custom silicon translate to hundreds of millions in savings — justifying the tight vertical integration between hyperscalers and AI labs.

Incorrect. The partnership signals that custom inference silicon achieves cost-per-token advantages sufficient to justify deep integration between chip designers and AI model providers at hyperscale serving volumes.

Lab 3 — Inference Architecture Decisions

Evaluate purpose-built inference chips for a real-world serving scenario

Your Mission

Your company needs to serve a Llama 2 70B model to 50,000 daily active users with a p99 latency target of 500ms for first token. Explore with the AI tutor the tradeoffs between Groq LPU, AWS Inferentia2, Google TPU v5e, and NVIDIA L40S for this workload.

Starter: "We need sub-500ms first-token latency for Llama 70B at scale. Which inference chip architecture should we evaluate first and why?"

Module 6 · Lesson 4

The Economics of Split Infrastructure

Training costs are one-time; inference costs are forever. The hardware decisions made at training time constrain the economics of every query that follows.

How do the largest AI labs actually balance training and inference hardware spend — and what does that reveal about the future of the hardware market?

In 2023, The Information reported an estimate, originating with SemiAnalysis, that OpenAI spent approximately $700,000 per day running ChatGPT on Azure infrastructure: an annualized inference bill exceeding $250 million. Meanwhile, GPT-4 training was estimated at $50–$100 million as a one-time cost. The pattern was clear: training is a project; inference is a business. At that rate, cumulative serving cost overtakes the upfront training investment within months of deployment.
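The break-even point follows from simple division. The sketch below uses the reported figures quoted above as inputs; the function names are illustrative.

```python
def annualized_cost(daily_cost_usd: float) -> float:
    """Yearly spend at a constant daily rate."""
    return daily_cost_usd * 365


def break_even_days(training_cost_usd: float, daily_inference_usd: float) -> float:
    """Days of serving after which cumulative inference spend equals the training cost."""
    return training_cost_usd / daily_inference_usd


DAILY = 700_000
print(f"annualized: ${annualized_cost(DAILY) / 1e6:.1f}M")
for training_cost in (50e6, 100e6):
    print(f"${training_cost / 1e6:.0f}M training run matched after {break_even_days(training_cost, DAILY):.0f} days")
```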

Training Capex vs. Inference Opex

AI infrastructure finance distinguishes between capital expenditure (buying or reserving GPU clusters for training) and operational expenditure (ongoing cloud compute for serving). A model trained once can serve queries for years — meaning inference hardware costs accumulate indefinitely while training costs are amortized over the model's useful life.

For a model serving 1 trillion tokens per day at $0.002 per thousand tokens (an illustrative price), the inference cost is about $2 million per day, or approximately $730 million per year. A GPT-4-class training run at an estimated $50–100 million is overtaken within two months of serving at that scale. This asymmetry drives the inference optimization arms race: every 10% reduction in inference cost-per-token translates to $73 million in annual savings at that serving volume.
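The serving-cost arithmetic is easy to get wrong by a factor of a thousand, so it helps to write it out. In this sketch the price of $0.002 per thousand tokens comes from the text; the daily volumes are inputs to vary, and an annual bill of about $730 million at that price corresponds to roughly one trillion tokens per day.

```python
def annual_inference_cost(tokens_per_day: float, usd_per_thousand_tokens: float) -> float:
    """Yearly serving cost at a constant daily volume and price."""
    daily_cost = tokens_per_day / 1000 * usd_per_thousand_tokens
    return daily_cost * 365


def annual_savings(annual_cost_usd: float, fractional_reduction: float) -> float:
    """Yearly savings from cutting cost per token by the given fraction."""
    return annual_cost_usd * fractional_reduction


for volume in (1e9, 1e12):
    cost = annual_inference_cost(volume, 0.002)
    print(f"{volume:.0e} tokens/day -> ${cost:,.0f} per year, "
          f"10% cut saves ${annual_savings(cost, 0.10):,.0f}")
```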

$50–100M GPT-4 Training Est.: One-time capital cost · 2023 estimates

$700K/day ChatGPT Serving: Reported Azure inference bill · 2023

~2.5–5 months Break-Even: When cumulative inference cost exceeds training cost

60–70% Inference Share: Of total GPU cloud spend for AI workloads · 2024

The Training–Inference Chip Portfolio Strategy

In response to these economics, major AI labs have adopted split hardware portfolios: H100/A100 clusters for training new models, combined with purpose-built or cost-optimized inference hardware for production serving. This creates distinct procurement, operations, and software engineering requirements — effectively two separate infrastructure teams within the same organization.

Google's approach is illustrative. The company uses TPU v4 pods for training Gemini models and TPU v5e for serving — the v5e chip was explicitly designed for inference economics, trading some peak FLOPS for lower power and cost-per-token. Google Cloud reported at Next 2023 that TPU v5e provides approximately 2× better price-performance for inference versus TPU v4.

Microsoft (OpenAI's partner) takes a different approach: using A100 and H100 clusters in Azure for both training and inference, but at different utilization modes. Training jobs run in large dedicated reservations; inference runs on the same hardware but with a different cluster management regime optimized for latency and throughput targets rather than sustained throughput.

Real Case: Meta's Inference Cost Reduction Program

Meta's 2023 Annual Report disclosed that AI infrastructure efficiency was a primary cost management lever. Meta deployed quantization (INT8 and INT4) across production recommendation and language models, reportedly reducing inference compute costs by 30–40% for those models. At Meta's scale, serving billions of daily users, this translates to hundreds of millions in annual infrastructure savings. Reported INT8 results for Llama 2 show under 1% degradation on benchmarks such as MMLU, supporting quantization as a production-viable inference cost lever.

The Longer-Context Problem: Inference Gets More Expensive

A structural trend making inference hardware economics harder is context length expansion. GPT-4's jump from 8K to 128K context and Claude's 200K context window have dramatically increased KV cache memory requirements. At 128K tokens with a 70B model, the KV cache for a single session can exceed 50 GB — meaning a GPU with 80 GB HBM2e can serve only one such session at peak context.

This drives demand for KV cache offloading strategies (moving less-recently-used cache entries to CPU DRAM or NVMe SSD), and for hardware architectures with larger HBM capacity. The H100 NVL, a dual-GPU PCIe product with 188 GB of HBM3 in total (94 GB per GPU), was positioned by NVIDIA for exactly this reason: its primary use case is large language model inference, not training.
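One simple offloading policy is least-recently-used eviction between two tiers. The sketch below is a toy under stated assumptions: the class name is hypothetical, plain dictionaries stand in for GPU and CPU memory, and the cost of moving data between tiers is not modeled.

```python
from collections import OrderedDict


class TieredKVCache:
    """Keeps the most recently used KV blocks in a small fast tier and offloads the rest."""

    def __init__(self, gpu_capacity_blocks: int):
        self.gpu_capacity = gpu_capacity_blocks
        self.gpu = OrderedDict()    # block id -> data, least recently used first
        self.cpu = {}               # offloaded blocks

    def put(self, block_id, data) -> None:
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)          # mark as most recently used
        self._evict_if_needed()

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        data = self.cpu.pop(block_id)           # raises KeyError for unknown blocks
        self.put(block_id, data)                # bring it back, possibly evicting another block
        return data

    def _evict_if_needed(self) -> None:
        while len(self.gpu) > self.gpu_capacity:
            victim, data = self.gpu.popitem(last=False)   # least recently used block
            self.cpu[victim] = data


cache = TieredKVCache(gpu_capacity_blocks=2)
for block in ("a", "b", "c"):
    cache.put(block, f"kv-{block}")
print(sorted(cache.gpu), sorted(cache.cpu))
```

Fetching an offloaded block costs a transfer over a much slower link, so the policy only pays off when the evicted entries are rarely touched.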

What the Split Means for the Hardware Market

The training-inference split is reshaping the competitive landscape. NVIDIA still dominates training (roughly 80% market share in datacenter GPU compute as of 2024) but faces meaningful competition in inference from AWS Inferentia, Google TPUs, and emerging players. The inference market's fragmentation reflects its diversity: different latency requirements, model sizes, and throughput targets reward different architectural approaches.

Analysts at SemiAnalysis estimated in 2024 that inference hardware spending would exceed training hardware spending for the first time by 2025, driven by the deployment of foundation models into production applications. This shift is already visible in NVIDIA's product roadmap: the Blackwell generation adds FP4 precision and continues multi-instance GPU partitioning, both aimed at inference efficiency.

Key Terms

Capex vs. Opex: Capital expenditure (one-time hardware purchase/reservation for training) versus operational expenditure (ongoing serving costs); inference Opex exceeds training Capex within months at scale.

KV Cache Offloading: Technique moving less-active KV cache entries from GPU HBM to CPU DRAM or NVMe to support long-context inference within GPU memory budgets.

Split Hardware Portfolio: Organizational strategy maintaining separate hardware fleets optimized for training (throughput) and inference (latency/cost-per-token) rather than using identical chips for both.

Lesson 4 Quiz

The Economics of Split Infrastructure

1. OpenAI's reported ~$700K/day ChatGPT serving cost vs. ~$50–100M one-time GPT-4 training cost implies what about the training-inference economic relationship?

Correct. At $700K/day, inference costs match GPT-4's estimated $50–100M training cost within 72–143 days. Every subsequent year of serving costs more than the original training — making inference the dominant optimization target.

Incorrect. At $700K/day inference spending, cumulative serving cost matches the one-time training cost within roughly 2.5 to 5 months. Inference is the long-term dominant cost, not training.

2. Why does expanding context length (e.g., from 8K to 128K tokens) make inference hardware economics significantly harder?

Correct. KV cache size scales with sequence length. At 128K tokens with a 70B model, a single session can require 50+ GB of HBM — an entire GPU's memory — reducing concurrency to near 1 per chip and dramatically increasing cost-per-session.

Incorrect. The key issue is KV cache memory, which scales with context length. A 128K-token session can consume an entire GPU's HBM for KV storage alone, collapsing concurrency and exploding cost-per-session.

3. SemiAnalysis estimated inference hardware spending would exceed training hardware spending by 2025. What market dynamic drives this shift?

Correct. Training runs are finite projects; inference serving is continuous and grows with user adoption. As more models deploy to production, the aggregate inference compute demand compounds — eventually exceeding the episodic training demand that dominated earlier in the AI scaling era.

Incorrect. The driver is that inference demand is continuous and grows with deployment, while training is episodic. The sheer volume of daily queries across many deployed models accumulates into a larger sustained demand than periodic training runs.

Lab 4 — Infrastructure Economics

Model the training vs. inference cost split for a real AI product

Your Mission

You're the VP of Infrastructure at a startup that just deployed a GPT-4-class model. Your board wants to understand the 3-year total cost of ownership breakdown between training and inference, and which optimization levers — quantization, continuous batching, purpose-built inference chips — deliver the most financial impact.

Starter: "We trained our model for $30M. We're now serving 500M tokens/day. Help me model the 3-year TCO and identify the highest-impact cost reduction lever."

Module 6 — Module Test

Inference vs. Training Hardware · 15 questions · Pass at 80%

1. What does "arithmetic intensity" measure, and why is it the key variable in the roofline model?

Correct. Arithmetic intensity determines which roofline boundary a workload hits: the compute peak or the bandwidth peak. Training has high intensity; small-batch inference has low intensity.

Incorrect. Arithmetic intensity is FLOPs per byte of memory traffic — the key variable in determining whether a workload is compute-bound or memory-bandwidth-bound.

2. During autoregressive inference, why does each new token require loading model weights again?

Correct. Unlike training with large batches where weight loading is amortized across many samples, single-user inference at batch size 1 must stream all weights through the memory bus for every token with no reuse benefit.

Incorrect. Weights don't change during inference. The issue is that at small batch sizes, weight loading can't be amortized across multiple samples — each token pass must stream billions of weight bytes through the memory bus.

3. The H100's Transformer Engine introduced FP8 precision. What problem does its dynamic scaling solve?

Correct. FP8 has only 8 bits to represent values — a much narrower range than FP16. The Transformer Engine's per-tensor scaling factors calibrate the representable range per tensor, preventing the numerical errors that would otherwise make FP8 unusable for training.

Incorrect. The dynamic scaling in the Transformer Engine addresses FP8's limited dynamic range — the narrow span of values an 8-bit float can represent — by maintaining per-tensor scaling factors calibrated each iteration.

4. Fully Sharded Data Parallelism (FSDP) reduces per-GPU memory by sharding parameters, gradients, and optimizer states across ranks. What is the maximum memory reduction achievable with FSDP?

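Correct. FSDP shards parameters, gradients, and optimizer states across all ranks in the shard group. With N GPUs, each GPU holds 1/N of each sharded state, so the theoretical reduction is N×. In practice the saving is smaller, because activations are not sharded and each layer's parameters are temporarily all-gathered for compute, which also adds communication cost.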

Incorrect. FSDP achieves memory reduction proportional to the number of GPUs in the shard group, since each GPU only holds 1/N of the total sharded state for N-GPU groups.
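
The 1/N scaling can be sketched numerically. The 16 bytes per parameter used here is an assumption that corresponds to a common mixed-precision Adam setup (2 bytes for FP16 parameters, 2 for FP16 gradients, 12 for FP32 master weights and two optimizer moments); activations and temporarily gathered parameters are deliberately ignored.

```python
# Per-GPU memory for sharded model state under FSDP-style full sharding.
# ASSUMPTION: 16 bytes/parameter (mixed-precision Adam: 2 B params + 2 B grads
# + 12 B FP32 master weights and two moments). Activations and the temporary
# all-gathered parameters of the active layer are NOT included.

def model_state_gb_per_gpu(params: float, num_gpus: int,
                           bytes_per_param: int = 16) -> float:
    """Sharded parameter, gradient, and optimizer state per GPU, in GB."""
    return params * bytes_per_param / num_gpus / 1e9


for n in (1, 8, 64):
    gb = model_state_gb_per_gpu(70e9, n)
    print(f"{n:>3} GPUs: {gb:,.1f} GB of model state per GPU")
```

For a 70-billion-parameter model under these assumptions, the state shrinks from 1,120 GB on one device to 17.5 GB per device across 64, which is why the remaining per-GPU budget is then dominated by activations.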

5. Google's TPU v5e was designed specifically for inference rather than training. Which characteristic reflects this inference optimization?

Correct. Google explicitly traded peak FLOPS for improved cost-per-token on TPU v5e — the right optimization for inference where sustained economic efficiency matters more than peak throughput.

Incorrect. TPU v5e trades some compute peak for lower power and better cost-per-token — the characteristic signature of an inference-optimized chip versus a training-optimized one.

6. vLLM's PagedAttention addresses which specific inference memory problem?

Correct. Traditional KV cache management reserved a contiguous memory block per request, sized for the maximum possible length, leading to severe fragmentation. PagedAttention uses virtual-memory-style paging, allocating KV cache in small non-contiguous blocks, which substantially improves GPU memory utilization.

Incorrect. PagedAttention specifically addresses KV cache memory fragmentation, not weight storage or attention complexity. It manages cache memory in pages rather than contiguous blocks, recovering the 60–80% of KV cache memory wasted by traditional approaches.
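
The size of the waste can be illustrated with a toy calculation. This is not vLLM's implementation; the request lengths, the 2,048-token reservation, and the 16-token block size are all illustrative assumptions.

```python
import math

# Toy comparison of reserved-but-unused KV cache slots.
# ASSUMPTION: request lengths, the 2048-token maximum, and the 16-token block
# size are illustrative values only. This is not vLLM's actual allocator.


def contiguous_waste(lengths: list[int], max_len: int) -> float:
    """Fraction wasted when every request reserves max_len token slots."""
    reserved = max_len * len(lengths)
    return 1 - sum(lengths) / reserved


def paged_waste(lengths: list[int], block_size: int) -> float:
    """Fraction wasted when slots are allocated in fixed-size blocks on demand."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in lengths)
    return 1 - sum(lengths) / reserved


lengths = [100, 350, 40, 900]   # tokens actually generated per request
print(f"contiguous: {contiguous_waste(lengths, 2048):.1%} wasted")
print(f"paged:      {paged_waste(lengths, 16):.1%} wasted")
```

With paging, the only waste is the partially filled last block of each request, so it is bounded by one block per request regardless of how long the maximum context is.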

7. Groq's LPU architecture has a fundamental capacity constraint compared to GPU-based inference. What is it?

Correct. SRAM is much more expensive per bit than DRAM — the LPU's on-chip SRAM approach eliminates bandwidth bottlenecks but caps the model size to what fits in SRAM, a fundamental physical constraint.

Incorrect. The LPU's constraint is on-chip SRAM capacity. SRAM is far more expensive per bit than DRAM, so LPU chips can't match GPUs' hundreds of gigabytes of HBM — models must fit in the available SRAM.

8. Meta reported that INT8 quantization on Llama 2 70B caused under 1% perplexity increase on MMLU. What is the practical significance of this finding?

Correct. Under 1% perplexity degradation at INT8 means the accuracy cost is acceptable for production serving. Combined with 2× memory bandwidth efficiency from half-width weights, this makes INT8 a compelling inference optimization at scale.

Incorrect. Under 1% perplexity increase is well within production tolerance, meaning INT8 is viable for real serving — enabling 2× better memory bandwidth utilization (half the bytes to load) and proportional throughput gains.
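
The 2× figure follows directly from the bytes that must be streamed per token. A minimal sketch, assuming batch-size-1 decoding that is purely limited by weight loading, and a round hypothetical memory bandwidth of 3 TB/s:

```python
# Upper bound on batch-1 decode speed when streaming weights is the bottleneck.
# ASSUMPTION: 3 TB/s is a round hypothetical bandwidth, not a real chip spec,
# and KV cache traffic and compute time are ignored.

PARAMS = 70e9
BANDWIDTH_BYTES_PER_S = 3e12


def weight_bytes(params: float, bits: int) -> float:
    """Total bytes of weights at the given precision."""
    return params * bits / 8


def max_tokens_per_second(params: float, bits: int,
                          bandwidth: float = BANDWIDTH_BYTES_PER_S) -> float:
    """Each generated token requires one full pass over the weights."""
    return bandwidth / weight_bytes(params, bits)


for name, bits in (("FP16", 16), ("INT8", 8)):
    print(f"{name}: {weight_bytes(PARAMS, bits) / 1e9:.0f} GB of weights, "
          f"at most {max_tokens_per_second(PARAMS, bits):.1f} tokens/s")
```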

9. NVLink 4.0 provides 900 GB/s bidirectional bandwidth between H100s in a node. Why is this specifically important for tensor parallelism during training?

Correct. In tensor parallelism, each layer's computation is split across GPUs. After each layer's partial outputs are computed, all-reduce must aggregate them — meaning every transformer layer adds a collective communication round. High NVLink bandwidth makes this per-layer cost small.

Incorrect. Tensor parallelism requires all-reduce after every transformer layer — not just at step end. This means NVLink bandwidth directly affects per-layer computation time, appearing hundreds of times per training step.

10. AWS Inferentia2's NeuronCore v2 architecture uses 32 MB of on-chip SRAM per core. What inference workload characteristic does this specifically address?

Correct. On-chip SRAM serves as a managed scratchpad for weight tiles and activations in tight inference loops. By staging frequently-used weights in SRAM, the NeuronCore reduces the volume of HBM traffic per token, improving throughput and energy efficiency.

Incorrect. The on-chip SRAM acts as a scratchpad reducing HBM traffic for frequently accessed weight tiles during inference loops — not KV cache or gradient storage. Reducing this DRAM traffic is the key inference optimization.

11. A 128K-token context window with a 70B model can require 50+ GB of KV cache per session. What is the direct operational consequence for inference serving?

Correct. If KV cache alone consumes most of a GPU's HBM, fewer concurrent sessions fit on each chip. At the limit, a single long-context session monopolizes an entire GPU — making per-session compute cost equivalent to dedicated hardware.

Incorrect. The key consequence is concurrency collapse: KV cache consuming most of a GPU's HBM means very few (potentially just one) long-context sessions can run simultaneously per chip, exploding cost-per-session.
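
The KV cache figure and its effect on concurrency can be estimated with a short calculation. The model shape below (80 layers, 8 key/value heads, head dimension 128) is an assumption resembling publicly described 70B-class configurations with grouped-query attention, and the 8 × 80 GB node is likewise illustrative.

```python
# KV cache size per session and the resulting concurrency limit.
# ASSUMPTION: model shape and node size are illustrative, not taken from the
# lesson. A model without grouped-query attention would need far more.

LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2            # FP16


def kv_cache_bytes(context_tokens: int) -> int:
    """Keys plus values for every layer, KV head, and token."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * context_tokens * BYTES_PER_VALUE


def max_sessions(total_hbm_bytes: float, weight_bytes: float,
                 context_tokens: int) -> int:
    """Concurrent sessions that fit after the weights are loaded."""
    free = total_hbm_bytes - weight_bytes
    return int(free // kv_cache_bytes(context_tokens))


per_session = kv_cache_bytes(128 * 1024)
print(f"KV cache at 128K tokens: {per_session / 1e9:.1f} GB per session")
print("sessions on 8 x 80 GB with 140 GB of weights:",
      max_sessions(8 * 80e9, 140e9, 128 * 1024))
```

Under these assumptions an eight-GPU node serves only about a dozen full-length sessions at once, which is the concurrency collapse the question describes.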

12. Google's TPU v4 pods use a 3D torus interconnect topology. What property of this topology is specifically valuable for training large models?

Correct. In a 3D torus, no chip is far from any other: the maximum hop count grows only with the cube root of the total chip count (about 3·⌊k/2⌋ hops in a k×k×k torus). This keeps all-reduce latency low at scale, critical for model parallelism where synchronization happens every layer.

Incorrect. The 3D torus provides high bisection bandwidth and short maximum paths between chips — these properties minimize all-reduce latency for collective operations that must synchronize across thousands of chips every training step.
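
The hop-count property can be checked directly. The sketch below compares the worst-case hop count of a k×k×k torus with that of a mesh lacking wraparound links, and verifies the closed form by brute force on a small case. The side length of 16 (4,096 chips) is chosen for illustration.

```python
from itertools import product

# Worst-case hop count (diameter) of a k x k x k torus versus a plain 3D mesh.
# ASSUMPTION: k = 16 is an illustrative side length giving 4,096 chips.


def torus_diameter(k: int, dims: int = 3) -> int:
    """With wraparound links, at most k // 2 hops are needed per dimension."""
    return dims * (k // 2)


def mesh_diameter(k: int, dims: int = 3) -> int:
    """Without wraparound, corner to corner takes k - 1 hops per dimension."""
    return dims * (k - 1)


def brute_force_torus_diameter(k: int, dims: int = 3) -> int:
    """Longest shortest path from the origin; by symmetry, the diameter."""
    return max(sum(min(c, k - c) for c in coord)
               for coord in product(range(k), repeat=dims))


k = 16
print(f"{k ** 3} chips: torus {torus_diameter(k)} hops, "
      f"mesh {mesh_diameter(k)} hops")
print("brute-force check for k=5:", brute_force_torus_diameter(5))
```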

13. Continuous batching (Orca) improves hardware utilization by allowing new requests to join mid-generation. What problem with static batching does this solve?

Correct. Static batching gates new work on the slowest batch member. Since generation lengths vary widely, fast-finishing requests create idle GPU slots. Continuous batching fills those slots immediately, keeping arithmetic intensity — and hardware utilization — high.

Incorrect. Static batching's core problem is the idle-wait: fast requests finish but the batch slot can't be refilled until all batch members complete. Continuous batching allows immediate slot reuse, eliminating this waste.
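
The idle-wait effect can be reproduced with a toy scheduler. This is a simplified sketch, not Orca's or any serving framework's implementation: it counts decode steps only, ignores prefill, and uses made-up generation lengths.

```python
import heapq

# Toy comparison of static and continuous batching on decode steps only.
# ASSUMPTION: generation lengths and the slot count are made up; prefill
# and scheduling overheads are ignored.


def static_steps(lengths: list[int], slots: int) -> int:
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots])
               for i in range(0, len(lengths), slots))


def continuous_steps(lengths: list[int], slots: int) -> int:
    """A slot takes the next waiting request as soon as it becomes free."""
    free_at = [0] * slots            # step at which each slot becomes free
    heapq.heapify(free_at)
    for n in lengths:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + n)
    return max(free_at)


lengths = [10, 200, 15, 20, 180, 12, 25, 30]   # tokens generated per request
slots = 4
for name, steps in (("static", static_steps(lengths, slots)),
                    ("continuous", continuous_steps(lengths, slots))):
    utilization = sum(lengths) / (slots * steps)
    print(f"{name}: {steps} steps, slot utilization {utilization:.0%}")
```

In this example the two long requests dominate static batching twice, once per batch, whereas continuous batching lets the short requests cycle through the remaining slots while the long ones run.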

14. SemiAnalysis estimated inference hardware spending would exceed training hardware spending by 2025. Which trend is the primary driver of inference spend growth?

Correct. Training is episodic; inference is perpetual and grows with adoption. Each new model deployment adds a sustained inference load. As the installed base of deployed models grows, aggregate inference demand accumulates and eventually dominates the market.

Incorrect. The driver is accumulated continuous inference demand: each deployed model generates ongoing serving load that compounds as more models deploy and user adoption grows — eventually overwhelming the episodic nature of training runs.

15. The NVIDIA H100 NVL, a pair of NVLink-bridged cards, ships with 188 GB of HBM3 in total (94 GB per GPU), more than the standard H100's 80 GB. What is its primary intended use case?

Correct. Despite being marketed alongside training chips, the H100 NVL's primary use case is long-context inference. At 128K+ token contexts, KV cache alone can exceed the standard 80 GB HBM — the 188 GB variant provides headroom for weights plus large KV caches per session.

Incorrect. The H100 NVL's 188 GB HBM is primarily for long-context inference — where KV cache for 128K+ token sessions can exceed 50 GB, requiring more HBM than the standard H100 provides even after accounting for model weights.