In 2017, researchers at Google published a paper called Attention Is All You Need, introducing the Transformer architecture. What they did not headline — but what every hardware engineer noticed — was that the attention mechanism was extraordinarily memory-hungry. The operations themselves were simple multiplications and additions. The problem was getting the numbers to the arithmetic units fast enough. The GPU's compute cores sat idle a significant fraction of the time, waiting. The bottleneck was not processing power. It was memory bandwidth.
Memory bandwidth is the rate at which data can be transferred between a processor's compute units and its memory pool. It is measured in gigabytes per second (GB/s) or, for high-end AI hardware, terabytes per second (TB/s). Think of it as the width of the road connecting a warehouse (memory) to a factory floor (compute cores): the number of workers on the factory floor matters less if the loading dock can only deliver goods so fast.
A modern NVIDIA H100 GPU can execute roughly 2,000 trillion floating-point operations per second (2 petaFLOPS) in FP8 precision. Its HBM3 memory subsystem delivers roughly 3.35 TB/s of bandwidth. At 1 byte per FP8 value, the GPU can read at most 3.35 × 10¹² values per second, yet each multiply-add needs two operands and produces one result. Dividing peak compute by peak bandwidth gives about 600 FLOP per byte: a kernel that performs fewer operations than that on each byte it moves cannot keep the arithmetic units busy. The arithmetic units can consume far more data than the memory system can deliver, and the measure that captures this gap is arithmetic intensity.
Arithmetic intensity = FLOP ÷ bytes read/written. Operations with low arithmetic intensity (reading lots of data, doing little math per byte) are memory-bandwidth-bound. High-intensity operations are compute-bound. Most AI inference and attention layers are bandwidth-bound.
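The definition above can be turned into a small roofline-style calculation. The sketch below is illustrative only: it uses the H100 figures quoted in the text, and the two kernels (an elementwise vector add and a square matrix multiply) are hypothetical examples that count only the unavoidable memory traffic.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOP performed per byte read from or written to memory."""
    return flops / bytes_moved


def attainable_flops(intensity: float, peak_flops: float, peak_bw: float) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_flops, intensity * peak_bw)


# Figures quoted in the text for an H100: ~2e15 FP8 FLOP/s and ~3.35e12 B/s.
PEAK_FLOPS = 2.0e15
PEAK_BW = 3.35e12

# Ridge point: the intensity at which a kernel stops being bandwidth-bound.
ridge = PEAK_FLOPS / PEAK_BW

# Hypothetical kernel 1: elementwise add of two FP8 vectors of length n.
# n additions; 2n bytes read and n bytes written.
n = 1_000_000
add_ai = arithmetic_intensity(n, 3 * n)

# Hypothetical kernel 2: square FP8 matmul of size m, counting only the
# compulsory traffic (read A and B once, write C once).
m = 8192
matmul_ai = arithmetic_intensity(2 * m**3, 3 * m * m)

print(f"ridge point: {ridge:.0f} FLOP/B")
for name, ai in [("vector add", add_ai), ("matmul", matmul_ai)]:
    bound = "memory-bound" if ai < ridge else "compute-bound"
    tflops = attainable_flops(ai, PEAK_FLOPS, PEAK_BW) / 1e12
    print(f"{name}: {ai:.1f} FLOP/B -> {bound}, {tflops:.1f} TFLOP/s attainable")
```

The vector add sits far to the left of the ridge point, so no amount of extra compute would speed it up. The matmul sits to the right of it, but only because each loaded value is reused thousands of times.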
Modern processors operate across multiple tiers of memory, each with a different speed, capacity, and cost. The tiers closest to the compute cores (registers, SRAM cache) are fast but tiny. The tiers furthest away (DRAM, storage) are large but slow. Every time a compute unit needs data that is not in the fast tier, it stalls and waits.
Approximate peak bandwidths, H100 SXM5 class hardware (2023). Values vary by precision and access pattern.
The term "memory wall" was coined by William Wulf and Sally McKee in a 1994 paper in ACM SIGARCH Computer Architecture News. They observed that DRAM speed was improving at roughly 7% per year while processor performance grew at 54% per year, and predicted that memory would eventually become the dominant bottleneck. They were correct, and the arrival of large neural networks three decades later made the wall impossible to ignore.
By 2014, AMD's graphics cards faced a hard ceiling. Their Hawaii GPU reached 320 GB/s of memory bandwidth using a 512-bit GDDR5 bus, wide by any standard but already constraining. Going wider with GDDR on a conventional PCB was impractical: more pins, more signal integrity problems, more power. AMD partnered with SK Hynix on something radical. Rather than connecting memory chips across a circuit board, they would stack DRAM dies vertically and place the stacks on a silicon interposer next to the GPU die itself. The resulting product, Fiji (launched as the Radeon R9 Fury X in June 2015), was the world's first consumer GPU with High Bandwidth Memory. Bandwidth: 512 GB/s from just four HBM stacks, each using a 1,024-bit bus. The interposer-connected approach would define AI accelerator design for the next decade.
Conventional GDDR memory chips sit on a PCB and connect to the GPU through copper traces that carry signals at high frequencies but through a narrow interface (typically 32 bits per chip). HBM takes the opposite approach: multiple DRAM dies are stacked vertically using through-silicon vias (TSVs), with the stack placed on a silicon interposer in the same package as the GPU. The interposer carries a very wide bus: 1,024 bits per stack in every generation from HBM1 through HBM3e, with the later generations raising the per-pin data rate rather than the width.
Because the bus is so wide and the electrical path so short (millimeters on silicon rather than centimeters on PCB), HBM can achieve high bandwidth at lower signal frequencies and lower power per GB/s than GDDR. The tradeoff: HBM is expensive to manufacture and package, and capacity per stack is limited by the number of dies that can be stacked (currently 12 dies per stack in HBM3e).
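The relationship between bus width, signaling rate, and bandwidth is a one-line formula. The sketch below applies it to a 1,024-bit stack; the per-pin data rates are nominal values assumed for illustration (shipping parts are often clocked somewhat lower), and the function name is hypothetical.

```python
def stack_bandwidth_gb_s(bus_width_bits: int, pin_rate_gbit_s: float) -> float:
    """Peak bandwidth of one memory stack in GB/s: width x per-pin rate / 8."""
    return bus_width_bits * pin_rate_gbit_s / 8


# Per-pin data rates in Gb/s. These are assumed nominal figures, not taken
# from the text.
PIN_RATES = {
    "HBM1": 1.0,
    "HBM2": 2.0,
    "HBM2e": 3.6,
    "HBM3": 6.4,
    "HBM3e": 9.6,
}

for generation, rate in PIN_RATES.items():
    bandwidth = stack_bandwidth_gb_s(1024, rate)
    print(f"{generation}: {bandwidth:.0f} GB/s per stack at {rate} Gb/s per pin")
```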
| Generation | Year | Bandwidth/Stack | Capacity/Stack | Used In |
|---|---|---|---|---|
| HBM1 | 2015 | 128 GB/s | 1–4 GB | AMD Fiji (R9 Fury X) |
| HBM2 | 2016 | 256 GB/s | 4–8 GB | NVIDIA V100, AMD Vega |
| HBM2e | 2020 | 307–460 GB/s | 8–16 GB | NVIDIA A100, AMD MI250 |
| HBM3 | 2022 | ~819 GB/s | 16–24 GB | NVIDIA H100, AMD MI300X |
| HBM3e | 2024 | ~1,200 GB/s | 24–36 GB | NVIDIA H200, AMD MI325X |
When NVIDIA launched the Volta-architecture V100 in May 2017, the first GPU with Tensor Cores dedicated to deep-learning matrix math, it came with HBM2 providing 900 GB/s of bandwidth. For comparison, the contemporary GDDR5X-based GTX 1080 Ti provided 484 GB/s. That bandwidth difference, combined with the Tensor Cores, made the V100 the dominant training accelerator almost overnight. Cloud providers, including Amazon AWS (P3 instances, October 2017), Google, and Microsoft Azure, built entire product lines around it. The message was clear: HBM bandwidth was not a premium option but a prerequisite for serious AI work.
HBM is manufactured by only three companies: SK Hynix, Samsung, and Micron. SK Hynix has held roughly 50% of HBM market share and was the first to qualify HBM3 for NVIDIA's H100 in volume. In 2023 and 2024, demand from AI accelerator orders — led by NVIDIA and AMD — outstripped HBM supply capacity by a wide margin. Reports from NVIDIA's quarterly earnings calls in 2023 explicitly named HBM supply as a constraint on H100 shipment volumes. The situation forced customers onto multi-year HBM allocation agreements and contributed to H100 spot-market prices exceeding $40,000 per card. A memory architecture that began as a graphics curiosity had become a geopolitical supply-chain chokepoint.
Matching the H100's 3.35 TB/s with GDDR6X would take roughly 40 chips at 32 bits each (about 84 GB/s per chip at 21 Gb/s per pin), a 1,280-bit bus that is impractical to route around a GPU package and costly in power. HBM achieves 3+ TB/s with 5–6 stacks. The silicon interposer is expensive, but it is the only practical path to the bandwidth AI workloads require.
In May 2022, Stanford PhD student Tri Dao and coauthors published a preprint titled FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. The key insight was deceptively simple: standard attention implementations move data between HBM and on-chip SRAM far more often than necessary. By tiling the attention computation so that each block of data stays in fast SRAM as long as possible, the authors avoided ever writing the full N×N score matrix to HBM, which cut HBM reads and writes several-fold and reduced attention's extra memory footprint from quadratic to linear in sequence length. The result: 2–4× faster attention on real hardware, not through more compute, but through smarter memory access patterns. FlashAttention was rapidly adopted across major deep learning frameworks, and variants of it are now standard in large language model training and inference. An early version won the best paper award at the ICML 2022 workshop on Hardware-Aware Efficient Training, and the paper was published at NeurIPS 2022.
The self-attention mechanism at the heart of every Transformer model computes, for a sequence of N tokens, an N×N matrix of attention scores. For each query token, the model measures similarity against every key token, weights every value token accordingly, and produces an output. The problem: the full attention matrix has N² elements. At N=2,048 (GPT-3's training context), that is 4 million elements per layer per head. At N=128,000 (GPT-4's reported context window), it is over 16 billion elements — far too large to fit in GPU SRAM. Every element must be read from and written to HBM.
Standard PyTorch attention (before FlashAttention) materialized this full N×N matrix in HBM, requiring O(N²) memory reads and writes. Since HBM bandwidth is finite, longer sequences hit the bandwidth wall hard. A sequence 4× longer moves 16× as much attention data through HBM, a brutal scaling law that made long-context models impractical on 2020-era hardware.
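The quadratic growth is easy to check numerically. This sketch counts the entries in one head's score matrix and the bytes needed to materialize it, assuming 2 bytes per entry (FP16); the function names are illustrative.

```python
def score_matrix_elements(seq_len: int) -> int:
    """Number of entries in the N x N attention score matrix for one head."""
    return seq_len * seq_len


def score_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to materialize one head's score matrix (FP16 by default)."""
    return score_matrix_elements(seq_len) * bytes_per_element


for n in (2_048, 8_192, 128_000):
    elements = score_matrix_elements(n)
    gib = score_matrix_bytes(n) / 2**30
    print(f"N={n:>7,}: {elements:>15,} elements, {gib:10.3f} GiB per head in FP16")

# Quadratic scaling: 4x the tokens means 16x the entries to read and write.
ratio = score_matrix_elements(4 * 2_048) / score_matrix_elements(2_048)
print(f"4x longer sequence -> {ratio:.0f}x the score-matrix traffic")
```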
Approximate HBM access reduction from Dao et al. (2022, 2023). Actual gains depend on sequence length and hardware.
During autoregressive inference — where a language model generates tokens one at a time — recomputing the key and value matrices from scratch for every new token is wasteful: the context from previous tokens doesn't change. The solution, universally implemented since 2020, is the KV cache: store all previous key and value tensors in HBM and read them on each new token generation step.
The bandwidth implication is severe. Consider a 70-billion-parameter model with 80 transformer layers and head dimension 128, running at FP16. With full multi-head attention, meaning 64 key-value heads, the KV cache for a single 4,096-token sequence requires roughly 10.7 GB of HBM. The H100 has 80 GB total. At batch size 1, the cache is a modest share of that. At batch size 8 (eight simultaneous user requests), the KV cache alone needs about 86 GB, more than the entire HBM capacity and before any model weights are loaded. This is why memory bandwidth and capacity together constrain LLM serving throughput, and why multi-query attention (MQA) and grouped-query attention (GQA), introduced in papers by Noam Shazeer (2019) and Ainslie et al. (2023), reduce the number of KV heads to cut cache size. Meta's Llama 2 70B takes this route: it uses GQA with 8 KV heads, which shrinks the same cache to about 1.3 GB per sequence.
80 layers × 64 KV heads × 128 head dimension × 2 (K+V) × 2 bytes (FP16) × 4,096 tokens = ~10.7 GB per batch item. Eight concurrent users would need about 86 GB, more than the H100's 80 GB of HBM for KV cache alone, so serving at that scale requires fewer KV heads, cache quantization, or more GPUs. This calculation explains why H100 clusters are sized as multi-GPU configurations for production LLM serving.
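The product of factors above is worth evaluating directly, since it is easy to slip by a factor of two. The sketch below is a minimal calculator under the stated assumptions (FP16 values, one K and one V tensor per layer); the function name is hypothetical, and the 8-head configuration is included to show the effect of grouped-query attention.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2, batch: int = 1) -> int:
    """Bytes of K and V tensors held for `batch` sequences of `tokens` tokens."""
    per_token = layers * kv_heads * head_dim * 2 * bytes_per_value  # 2 = K and V
    return per_token * tokens * batch


HBM_BYTES = 80e9  # 80 GB device, decimal gigabytes

full_mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, tokens=4096)
gqa_8 = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=4096)

print(f"64 KV heads: {full_mha / 1e9:.2f} GB per sequence")
print(f" 8 KV heads: {gqa_8 / 1e9:.2f} GB per sequence")
print("Sequences that fit in 80 GB, ignoring weights: "
      f"{int(HBM_BYTES // full_mha)} vs {int(HBM_BYTES // gqa_8)}")
```

The same function can be pointed at other configurations by changing the layer count, KV head count, or context length; the capacity figure ignores model weights and activations, so real headroom is smaller.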
FlashAttention v1 (Dao et al., 2022) introduced IO-aware tiled attention. FlashAttention v2 (Dao, 2023) further improved GPU utilization by better partitioning work across thread blocks and reducing non-matmul FLOPs. FlashAttention-3 (Shah et al., 2024) was designed specifically for H100, exploiting asynchronous pipeline stages and the WGMMA instruction set to achieve ~75% of H100's theoretical FP16 throughput on attention. All versions achieve the same mathematical result as standard attention; only the memory access pattern changes. This class of optimization — reorganizing computation to minimize HBM round-trips without changing outputs — is called IO-aware algorithm design.
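The claim that tiling changes only the access pattern, not the result, can be demonstrated with the online-softmax trick that tiled attention relies on. The following is a pure-Python sketch, not the FlashAttention kernel: it streams keys and values in small blocks for one query row at a time, keeping only a running maximum, a running normalizer, and a running output, and it says nothing about GPU thread blocks or SRAM. All function names and the toy inputs are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def standard_attention(Q, K, V):
    """Reference version: build each full row of scores, then softmax."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [dot(q, k) / math.sqrt(d) for k in K]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    """Stream K/V in blocks; never hold more than `block` scores per query."""
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * dv
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            scores = [dot(q, k) / math.sqrt(d) for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale everything accumulated under the previous maximum.
            scale = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * scale + sum(weights)
            acc = [a * scale + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out


# Toy inputs: 3 queries, 5 keys/values, dimension 2.
Q = [[0.1, 0.2], [0.3, -0.4], [1.0, 0.5]]
K = [[0.5, 0.1], [-0.2, 0.7], [0.9, -0.3], [0.0, 0.4], [0.6, 0.6]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 3.0]]

ref = standard_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block=2)
max_diff = max(abs(a - b) for r, t in zip(ref, tiled) for a, b in zip(r, t))
print(f"largest difference between the two versions: {max_diff:.2e}")
```

The two versions agree to floating-point rounding. The tiled loop touches each key and value once and never stores a full row of scores, which is the property a real kernel exploits to keep its working set in SRAM.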
FlashAttention demonstrates a general principle that will recur throughout AI hardware evolution: algorithmic optimization and hardware specification co-design. The best AI systems are built by teams who understand both the mathematical structure of the computation and the physical constraints of the memory hierarchy. Neither hardware nor algorithms alone are sufficient.
Every approach to memory bandwidth ultimately confronts the same physics: moving electrical signals between chips consumes energy proportional to distance and bus capacitance. HBM attacked this by shortening the distance (interposer packaging) and widening the bus. But the data still crosses a chip boundary twice per memory access — once from DRAM to interposer, once from interposer to logic die. Some engineers believe the only escape from this constraint is to bring computation to the data, not data to computation. Others are betting that photonics — using light rather than electricity — can remove the energy-per-bit constraint at chip boundaries entirely.
Processing-in-memory places simple arithmetic units inside the DRAM stack itself, executing operations on data without it ever crossing to the logic chip. Samsung demonstrated its HBM-PIM product in 2021, integrating programmable compute units between DRAM layers. Each HBM2 stack contained a bank of SIMD processing engines capable of executing FP16 MAC operations. Samsung reported that HBM-PIM reduced power consumption by 70% and improved AI inference performance by 2.5× on their Aquabolt-XL platform for neural network workloads.
SK Hynix has pursued a similar concept under the name AiM (Accelerator-in-Memory), targeting transformer inference. The challenge for PIM is programmability: the compute units inside DRAM must be simple enough to fit in the memory die's transistor budget, but complex enough to handle the diversity of modern AI operations. Current PIM products handle limited operation sets — primarily vector dot products and activation functions — and require software adaptation to use effectively.
Samsung's HBM2-PIM product (Aquabolt-XL) was described in an IEEE paper at ISSCC 2021. Its programmable computing units provided about 1.2 TFLOPS of FP16 compute per stack, adding ~0.5% die area. In GEMM-heavy AI inference benchmarks, the company reported 2.5× throughput improvement and 70% power reduction compared to standard HBM2 by eliminating inter-chip data transfer for computation that fits within the PIM engine's capability.
When a model is too large for one chip's HBM, data must travel between chips. The bandwidth of these inter-chip links directly constrains parallelism efficiency. NVIDIA's NVLink 4.0 (H100) provides 900 GB/s of total bidirectional bandwidth per GPU, and NVSwitch-connected 8-GPU systems can treat their 640 GB of aggregate HBM as a shared pool, though each remote access runs at NVLink speed rather than the 3.35 TB/s of local HBM. This is what makes tensor parallelism viable: a model's weight matrices are split across 8 GPUs, and partial sums are exchanged over NVLink fast enough that communication overhead doesn't dominate.
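A rough feel for why link bandwidth matters comes from estimating how long one exchange of partial sums takes. The sketch below uses the standard bandwidth-only cost model for a ring all-reduce, which is an assumption here: NVSwitch systems may use other collective algorithms, and latency is ignored. The workload size, the 64 GB/s comparison link, and the treatment of 900 GB/s bidirectional as 450 GB/s in each direction are likewise illustrative assumptions.

```python
def ring_allreduce_seconds(message_bytes: float, gpus: int,
                           bytes_per_second_per_direction: float) -> float:
    """Bandwidth-only lower bound for a ring all-reduce.

    Each GPU sends and receives 2 * (p - 1) / p times the message size.
    Latency and kernel launch overheads are ignored.
    """
    traffic = 2 * (gpus - 1) / gpus * message_bytes
    return traffic / bytes_per_second_per_direction


# Hypothetical workload: 4,096 tokens x hidden size 8,192 in FP16.
activation_bytes = 4096 * 8192 * 2

# Assumption: 900 GB/s bidirectional is treated as 450 GB/s each way.
nvlink = ring_allreduce_seconds(activation_bytes, gpus=8,
                                bytes_per_second_per_direction=450e9)
# Hypothetical slower link at 64 GB/s each way, for contrast.
slow = ring_allreduce_seconds(activation_bytes, gpus=8,
                              bytes_per_second_per_direction=64e9)

print(f"NVLink-class link: {nvlink * 1e6:.0f} microseconds per all-reduce")
print(f"64 GB/s link:      {slow * 1e6:.0f} microseconds per all-reduce")
```

Because tensor parallelism performs such an exchange in every layer, the per-exchange time multiplies by the layer count, which is why the slower link quickly comes to dominate the step time.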
Between 2022 and 2024, the industry launched two standardization efforts targeting future chiplet-to-chiplet and accelerator-to-accelerator connectivity. UCIe (Universal Chiplet Interconnect Express), announced in March 2022 and supported by Intel, AMD, Arm, Qualcomm, and others, defines a standard die-to-die interface that can achieve up to ~1.3 TB/s per millimeter of die edge in its advanced-packaging variant. UALink (Ultra Accelerator Link), announced in May 2024 by AMD, Intel, Broadcom, Cisco, Google, Meta, Microsoft, and HPE, targets scale-up communication between AI accelerators within a pod, positioning itself as an alternative to NVIDIA's proprietary NVLink for multi-chip AI clusters.
Electrical interconnects, even on-package, consume energy proportional to the capacitance they must charge and discharge. At scale — 100,000-GPU AI clusters — interconnect power becomes a meaningful fraction of total system power. Optical signals do not share this fundamental property: a photon traveling through a waveguide consumes energy only at the modulator and detector, not along the transmission path. Co-packaged optics (CPO) integrates optical transceivers directly into the switch or accelerator package, eliminating the electrical-to-optical conversion loss at a separate module.
NVIDIA has announced co-packaged optics for future Spectrum-X networking platforms, with the expectation that CPO will become standard for data-center switch bandwidth by 2025–2026. Intel has invested heavily in silicon photonics through its own research and product programs. Ayar Labs, a startup backed by Intel and others, demonstrated a chiplet-scale optical I/O die (TeraPHY) that achieves 2 Tb/s of bandwidth per chiplet at roughly 5 pJ/bit, competitive with advanced electrical interconnects but with far better scaling potential at distance.
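Energy-per-bit figures translate into watts once multiplied by a bit rate. The sketch below does that conversion for the 2 Tb/s, 5 pJ/bit figures quoted above and then scales it to a hypothetical cluster of 100,000 accelerators with one such chiplet each; the cluster configuration is an assumption for illustration, not a description of any real system.

```python
def link_power_watts(bits_per_second: float, picojoules_per_bit: float) -> float:
    """Power drawn by a link running flat out: bit rate times energy per bit."""
    return bits_per_second * picojoules_per_bit * 1e-12


# Figures from the text: one optical I/O chiplet at 2 Tb/s and about 5 pJ/bit.
chiplet_watts = link_power_watts(2e12, 5.0)

# Hypothetical cluster: 100,000 accelerators with one such chiplet each.
ACCELERATORS = 100_000
cluster_kilowatts = chiplet_watts * ACCELERATORS / 1e3

print(f"per chiplet: {chiplet_watts:.1f} W")
print(f"cluster I/O: {cluster_kilowatts:.0f} kW")
```

Halving the energy per bit halves both numbers, which is why pJ/bit is the figure of merit interconnect designers compete on.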
No single technology will replace HBM by 2026. The most likely trajectory: HBM4 (targeting roughly 2 TB/s per stack by doubling the interface to 2,048 bits) arrives in 2025–2026 as the conventional path; PIM becomes a niche optimization for memory-intensive inference; UCIe and UALink standardize chiplet and accelerator bandwidth in high-end AI systems; co-packaged optics begins to replace electrical interconnects at the node-to-node level in large clusters. The memory bandwidth bottleneck does not disappear. It migrates to the next level of the hierarchy, where the next generation of models inevitably runs out of bandwidth again.