Lesson 1 · Module 5

What Is Memory Bandwidth?

The highway between memory and processor — and why AI traffic jams it instantly.
Why does a GPU with a trillion operations per second still sit idle waiting for data?

In 2017, researchers at Google published a paper called Attention Is All You Need, introducing the Transformer architecture. What they did not headline — but what every hardware engineer noticed — was that the attention mechanism was extraordinarily memory-hungry. The operations themselves were simple multiplications and additions. The problem was getting the numbers to the arithmetic units fast enough. The GPU's compute cores sat idle a significant fraction of the time, waiting. The bottleneck was not processing power. It was memory bandwidth.

Defining Memory Bandwidth

Memory bandwidth is the rate at which data can be transferred between a processor's compute units and its memory pool. It is measured in gigabytes per second (GB/s) or, for high-end AI hardware, terabytes per second (TB/s). Think of it as the width of the road connecting a warehouse (memory) to a factory floor (compute cores): the number of workers on the factory floor matters less if the loading dock can only deliver goods so fast.

A modern NVIDIA H100 GPU can execute roughly 2,000 trillion floating-point operations per second (2 petaFLOPS) in FP8 precision. Its HBM3 memory subsystem delivers roughly 3.35 TB/s of bandwidth. At 1 byte per FP8 value, the GPU can read at most 3.35 × 10¹² values per second from HBM, yet every multiply-add needs two operands and produces one result. Dividing peak compute by peak bandwidth gives about 600 FLOP per byte: unless a kernel reuses each byte it fetches for hundreds of operations, the arithmetic units outrun the memory system. This gap between compute and bandwidth is the arithmetic intensity imbalance.

Key Concept — Arithmetic Intensity

Arithmetic intensity = FLOP ÷ bytes read/written. Operations with low arithmetic intensity (reading lots of data, doing little math per byte) are memory-bandwidth-bound. High-intensity operations are compute-bound. Most AI inference and attention layers are bandwidth-bound.
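
The definition above can be turned into a few lines of arithmetic. The sketch below is illustrative only: it plugs in the approximate H100 figures quoted in this lesson (2 petaFLOPS FP8, 3.35 TB/s), and the function names are hypothetical, not part of any library.

```python
# Roofline sketch using the approximate H100 figures quoted in this lesson.
PEAK_FLOPS = 2.0e15        # ~2 petaFLOPS (FP8), from the lesson text
PEAK_BANDWIDTH = 3.35e12   # ~3.35 TB/s HBM3, from the lesson text


def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which the two ceilings meet."""
    return peak_flops / peak_bandwidth


def attainable_flops(intensity: float, peak_flops: float, peak_bandwidth: float) -> float:
    """Roofline ceiling: the lower of the compute limit and the bandwidth limit."""
    return min(peak_flops, intensity * peak_bandwidth)


def classify(intensity: float, peak_flops: float, peak_bandwidth: float) -> str:
    """Label a workload by which ceiling it reaches first."""
    if intensity < ridge_point(peak_flops, peak_bandwidth):
        return "memory-bound"
    return "compute-bound"


if __name__ == "__main__":
    print(f"ridge point: {ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH):.0f} FLOP/byte")
    for intensity in (0.25, 50.0, 1000.0):
        label = classify(intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
        ceiling = attainable_flops(intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
        print(f"{intensity:>7} FLOP/byte -> {label}, ceiling {ceiling:.3e} FLOP/s")
```

With these inputs the ridge point lands near 600 FLOP/byte, so a kernel at 0.25 FLOP/byte sits deep in the memory-bound region.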

The Memory Hierarchy

Modern processors operate across multiple tiers of memory, each with a different speed, capacity, and cost. The tiers closest to the compute cores (registers, SRAM cache) are fast but tiny. The tiers furthest away (DRAM, storage) are large but slow. Every time a compute unit needs data that is not in the fast tier, it stalls and waits.

Register File: ~100 TB/s
L1/Shared SRAM: ~30 TB/s
L2 Cache: ~9 TB/s
HBM3 (H100): 3.35 TB/s
PCIe 5.0 x16: ~128 GB/s
NVLink 4.0: ~900 GB/s

Approximate peak bandwidths, H100 SXM5 class hardware (2023). Values vary by precision and access pattern.
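
To get a feel for the spread between tiers, the sketch below computes how long it would take to stream a fixed payload at each quoted peak rate. It is an idealized comparison: it ignores latency and capacity (a register file cannot hold 1 GB), and the dictionary and function names are hypothetical.

```python
# Idealized streaming time per tier, using the approximate peak figures above.
TIER_BANDWIDTH_BYTES_PER_S = {
    "Register file": 100e12,
    "L1/shared SRAM": 30e12,
    "L2 cache": 9e12,
    "HBM3 (H100)": 3.35e12,
    "NVLink 4.0": 900e9,
    "PCIe 5.0 x16": 128e9,
}


def transfer_time_ms(num_bytes: float, bytes_per_second: float) -> float:
    """Time in milliseconds to move num_bytes at a sustained rate."""
    return num_bytes / bytes_per_second * 1e3


if __name__ == "__main__":
    payload = 1e9  # 1 GB, an arbitrary illustrative payload
    for tier, rate in TIER_BANDWIDTH_BYTES_PER_S.items():
        print(f"{tier:<16} {transfer_time_ms(payload, rate):8.3f} ms")
```

The same payload that clears HBM in a fraction of a millisecond takes several milliseconds over PCIe, which is why spilling working data to host memory is so costly.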

Key Terms

Memory Bandwidth: Rate of data transfer between memory and compute, in GB/s or TB/s. The throughput ceiling for memory-bound workloads.
Arithmetic Intensity: FLOP per byte accessed. Low values indicate memory-bound operations; high values indicate compute-bound.
HBM (High Bandwidth Memory): Stacked DRAM architecture using a wide parallel bus and 3D packaging to deliver very high bandwidth at lower power than conventional GDDR.
Roofline Model: A visual performance model plotting achievable FLOP/s against arithmetic intensity, with a "roofline" set by either peak compute or peak bandwidth.

Historical Anchor

The term "memory wall" was coined by William Wulf and Sally McKee in a 1994 paper for ACM SIGARCH Computer Architecture News. They argued that DRAM speed was improving at roughly 7% per year while processor performance grew at roughly 54% per year, and predicted that memory would eventually become the dominant bottleneck. They were correct, and the arrival of large neural networks three decades later made the wall impossible to ignore.
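
The force of the argument comes from compounding. As a rough illustration, the sketch below takes the two annual growth rates quoted above as given and computes how far processor performance pulls ahead of memory over time; the function name is hypothetical.

```python
# Compounding gap between processor and memory improvement rates.
def relative_gap(years: int, processor_growth: float = 0.54, memory_growth: float = 0.07) -> float:
    """Ratio of processor speedup to memory speedup after a number of years."""
    return ((1 + processor_growth) / (1 + memory_growth)) ** years


if __name__ == "__main__":
    for years in (5, 10, 20):
        print(f"after {years:>2} years the gap is about {relative_gap(years):,.0f}x")
```

At these rates the gap widens by roughly 44% every year, so a decade is enough to open a difference of well over an order of magnitude.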

Lesson 1 Quiz

Memory bandwidth fundamentals · 3 questions
1. A GPU operation that reads 8 bytes of data and performs only 2 floating-point operations has an arithmetic intensity of 0.25 FLOP/byte. This operation is most likely:
Correct. Low arithmetic intensity (FLOP/byte) means the operation consumes bandwidth faster than it consumes compute. The GPU cores sit idle waiting for data, placing the workload on the bandwidth-bound side of the Roofline model.
Not quite. The arithmetic intensity of 0.25 FLOP/byte is far below the ridge point of any modern GPU, placing this firmly in the memory-bandwidth-bound region. The compute units have idle capacity; the memory bus is the constraint.
2. The NVIDIA H100 SXM5 uses HBM3 memory rather than GDDR6X. The primary reason is:
Correct. HBM stacks DRAM dies with a silicon interposer and uses a very wide bus (1024-bit or wider) to move data in parallel, achieving 3+ TB/s versus ~1 TB/s for GDDR6X, while consuming less power per GB/s. Cost per GB is actually higher for HBM.
Not quite. HBM's advantage is bandwidth, not raw capacity or latency (GDDR6X can have comparable random-access latency). HBM is also more expensive per gigabyte to manufacture. The wide parallel bus and 3D stacking give it superior bandwidth per watt.
3. William Wulf and Sally McKee's 1994 "memory wall" paper argued that:
Correct. Wulf and McKee argued that processor performance grew at ~54% per year while DRAM speed grew at ~7% per year, a compounding gap that would eventually make memory, not compute, the binding constraint on system performance.
Not quite. The paper's central thesis was the compounding performance gap between DRAM speed growth (~7%/yr) and processor speed growth (~54%/yr), leading to a "wall" where memory access time would dominate total execution time.

Lab 1 — Mapping the Roofline

Discuss arithmetic intensity and bandwidth-bound AI workloads with your AI tutor

Your Task

You will explore the Roofline performance model and how it reveals whether an AI operation is limited by compute or memory bandwidth. Discuss specific examples, ask about real GPU specs, or challenge the AI tutor with edge cases.

Starter: "Walk me through how to plot an attention layer's arithmetic intensity on a Roofline model for an H100 GPU. Is it compute-bound or bandwidth-bound?"
Lesson 2 · Module 5

HBM — The Stack That Changed Everything

How AMD and SK Hynix created a new memory architecture that became mandatory for AI accelerators.
Why did the entire AI chip industry converge on a memory technology that costs ten times more per gigabyte than conventional DRAM?

By 2014, AMD's graphics cards faced a hard ceiling. Their Hawaii GPU reached 320 GB/s of memory bandwidth using a 512-bit GDDR5 bus, wide by any standard but already constraining. Going wider with GDDR on a conventional PCB was impractical: more pins, more signal-integrity problems, more power. AMD partnered with SK Hynix on something radical. Rather than connecting memory chips across a circuit board, they would stack DRAM dies vertically and place the stacks on a silicon interposer next to the GPU die itself. The resulting product, Fiji (launched as the Radeon R9 Fury X in June 2015), was the world's first consumer GPU with High Bandwidth Memory. Bandwidth: 512 GB/s from just four HBM stacks, each using a 1,024-bit bus. The interposer-connected approach would define AI accelerator design for the next decade.

HBM Architecture: How It Works

Conventional GDDR memory chips sit on a PCB and connect to the GPU through copper traces that carry signals at high frequencies but through a narrow interface (typically 32 bits per chip). HBM takes the opposite approach: multiple DRAM dies are stacked vertically using through-silicon vias (TSVs), with the stack placed on a silicon interposer in the same package as the GPU. The interposer carries a very wide bus: 1,024 bits per stack in every generation from HBM1 through HBM3, with each generation raising the per-pin data rate.

Because the bus is so wide and the electrical path so short (millimeters on silicon rather than centimeters on PCB), HBM can achieve high bandwidth at lower signal frequencies and lower power per GB/s than GDDR. The tradeoff: HBM is expensive to manufacture and package, and capacity per stack is limited by the number of dies that can be stacked (currently 12 dies per stack in HBM3e).
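
The width-versus-frequency tradeoff reduces to one multiplication: bus width times per-pin data rate. The sketch below reproduces the per-stack figures in this lesson's generation comparison. The per-pin rates (1.0, 2.0, and 6.4 Gb/s) are commonly cited nominal values that are assumed here and do not come from the lesson text; the function name is hypothetical.

```python
# Per-stack bandwidth = bus width (bits) x per-pin data rate (Gb/s) / 8 bits per byte.
def stack_bandwidth_gb_s(bus_width_bits: int, pin_rate_gbit_s: float) -> float:
    """Peak bandwidth of one memory stack in GB/s."""
    return bus_width_bits * pin_rate_gbit_s / 8


# Assumed nominal per-pin rates (not taken from the lesson text).
ASSUMED_PIN_RATES_GBIT_S = {"HBM1": 1.0, "HBM2": 2.0, "HBM3": 6.4}

if __name__ == "__main__":
    for generation, pin_rate in ASSUMED_PIN_RATES_GBIT_S.items():
        bandwidth = stack_bandwidth_gb_s(1024, pin_rate)
        print(f"{generation}: 1,024 bits x {pin_rate} Gb/s = {bandwidth:.1f} GB/s per stack")
```

The bus width never changes in this calculation; each generation's gain comes from the per-pin rate, which stays far below the 20+ Gb/s that narrow GDDR interfaces need.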

HBM Generation Comparison

Generation | Year | Bandwidth/Stack | Capacity/Stack | Used In
HBM1 | 2015 | 128 GB/s | 1–4 GB | AMD Fiji (R9 Fury X)
HBM2 | 2016 | 256 GB/s | 4–8 GB | NVIDIA V100, AMD Vega
HBM2e | 2020 | 307–460 GB/s | 8–16 GB | NVIDIA A100, AMD MI250
HBM3 | 2022 | ~819 GB/s | 16–24 GB | NVIDIA H100, AMD MI300X
HBM3e | 2024 | ~1,200 GB/s | 24–36 GB | NVIDIA H200

The V100 Inflection Point — 2017

When NVIDIA launched the Volta-architecture V100 in May 2017, the first GPU explicitly designed for deep learning, it came with HBM2 providing 900 GB/s of bandwidth. For comparison, the contemporary GDDR5X-based GTX 1080 Ti provided 484 GB/s. That bandwidth difference, combined with Tensor Cores for matrix multiplication, made the V100 the dominant training accelerator almost overnight. Cloud providers, including Amazon AWS (P3 instances, October 2017), Google, and Microsoft Azure, built entire product lines around it. The message was clear: HBM bandwidth was not a premium option but a prerequisite for serious AI work.

SK Hynix's Lead and the Supply Crunch

HBM is manufactured by only three companies: SK Hynix, Samsung, and Micron. SK Hynix has held roughly 50% of HBM market share and was the first to qualify HBM3 for NVIDIA's H100 in volume. In 2023 and 2024, demand from AI accelerator orders — led by NVIDIA and AMD — outstripped HBM supply capacity by a wide margin. Reports from NVIDIA's quarterly earnings calls in 2023 explicitly named HBM supply as a constraint on H100 shipment volumes. The situation forced customers onto multi-year HBM allocation agreements and contributed to H100 spot-market prices exceeding $40,000 per card. A memory architecture that began as a graphics curiosity had become a geopolitical supply-chain chokepoint.

Why Not Just Use More GDDR?

A GDDR6X chip has a 32-bit interface; at about 21 Gb/s per pin, that is roughly 84 GB/s per chip. Matching the H100's 3.35 TB/s would take about 40 such chips and a 1,280-bit bus routed across the PCB, which is impractical for board area and signal integrity and costly in power. HBM achieves 3+ TB/s with 5–6 stacks. The silicon interposer is expensive, but it is the only practical path to the bandwidth AI workloads require.
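
A back-of-envelope device count makes the comparison concrete. The sketch below assumes a GDDR6X pin rate of 21 Gb/s and an HBM3 pin rate of 6.4 Gb/s; both are nominal figures assumed for illustration rather than taken from the lesson, and the function names are hypothetical.

```python
import math


def device_bandwidth_gb_s(bus_width_bits: int, pin_rate_gbit_s: float) -> float:
    """Peak bandwidth of one memory device (chip or stack) in GB/s."""
    return bus_width_bits * pin_rate_gbit_s / 8


def devices_needed(target_gb_s: float, per_device_gb_s: float) -> int:
    """Smallest whole number of devices whose combined peak meets the target."""
    return math.ceil(target_gb_s / per_device_gb_s)


if __name__ == "__main__":
    target = 3350.0  # GB/s, the H100 figure quoted in this lesson
    gddr6x_chip = device_bandwidth_gb_s(32, 21.0)    # assumed 21 Gb/s per pin
    hbm3_stack = device_bandwidth_gb_s(1024, 6.4)    # assumed 6.4 Gb/s per pin
    chips = devices_needed(target, gddr6x_chip)
    stacks = devices_needed(target, hbm3_stack)
    print(f"GDDR6X: {chips} chips, {chips * 32}-bit bus on the PCB")
    print(f"HBM3:   {stacks} stacks, {stacks * 1024}-bit bus on the interposer")
```

Under these assumptions the GDDR route needs dozens of packages around the GPU, while the HBM route needs a handful of stacks inside it.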

Key Terms

Through-Silicon Via (TSV): Vertical electrical connection through a silicon die, enabling DRAM layers in an HBM stack to communicate with very low RC delay.
Silicon Interposer: A layer of silicon placed between package and dies, carrying the wide HBM bus and enabling precise, low-power, short-distance connections between GPU and memory.
HBM Stack: A vertical assembly of DRAM dies connected by TSVs, placed alongside the GPU on an interposer. Each H100 SXM5 has five active HBM3 stacks (a 5,120-bit interface) totaling 80 GB and 3.35 TB/s.

Lesson 2 Quiz

HBM architecture and history · 3 questions
1. AMD's Fiji GPU (Radeon R9 Fury X, 2015) was significant because it:
Correct. AMD Fiji, launched June 2015 in partnership with SK Hynix, was the world's first GPU to use HBM. It placed four HBM1 stacks on a silicon interposer alongside the GPU die, achieving 512 GB/s bandwidth — a preview of what would become the standard architecture for AI accelerators.
Not quite. Fiji was the first consumer GPU with HBM (High Bandwidth Memory), not the first deep-learning-focused GPU (that was NVIDIA's V100 in 2017), and it used HBM1, not GDDR6. It also did not break 1 TB/s: its four stacks delivered 512 GB/s in total, and terabyte-class bandwidth arrived only with later HBM generations.
2. HBM achieves high bandwidth at lower power per GB/s compared to GDDR primarily because:
Correct. HBM's efficiency comes from width, not speed. A 1,024-bit bus running at moderate frequency delivers the same throughput as a narrow bus running at extreme frequency — but the shorter, fatter path requires far less energy per bit transferred. Power per GB/s is substantially lower than GDDR6X.
Not quite. HBM actually runs at lower frequencies than GDDR. Its advantage is the width of the bus (1,024+ bits per stack) combined with the very short electrical paths on the silicon interposer. This lets it achieve high throughput without the power penalty of high-frequency signaling over long PCB traces.
3. In 2023, HBM supply became a public bottleneck for AI hardware primarily because:
Correct. NVIDIA's own earnings calls in 2023 named HBM supply as a constraint on H100 shipments. Only SK Hynix, Samsung, and Micron make HBM. The explosive demand from AI training infrastructure orders — particularly for H100s — outran what those three suppliers could manufacture, driving H100 spot prices above $40,000.
Not quite. The 2023 HBM crunch was a demand shock, not a supply-side disaster or regulatory event. AI accelerator orders — dominated by demand for NVIDIA H100 GPUs — exceeded the manufacturing capacity of the three HBM producers (SK Hynix, Samsung, Micron), creating multi-quarter allocation queues and extremely high spot prices.

Lab 2 — HBM Deep Dive

Explore HBM architecture, supply chain, and generational tradeoffs with your AI tutor

Your Task

Investigate HBM technology: how stacking and interposers work, why it costs so much more than GDDR, how different generations compare, and why supply became a chokepoint. Push for specifics and tradeoffs.

Starter: "Explain the through-silicon via process in HBM stacking — what are the yield challenges that make HBM so expensive per gigabyte compared to GDDR6X?"
Lesson 3 · Module 5

Attention Is Memory-Hungry

Why Transformer attention scales quadratically with sequence length — and the engineering war to fix it.
Why does doubling the context length of a language model quadruple the memory traffic of standard attention?

In May 2022, PhD student Tri Dao at Stanford published a preprint titled FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. The key insight was deceptively simple: standard attention implementations move data between HBM and on-chip SRAM far more often than necessary. By tiling the attention computation so that each block stays in fast SRAM until it is finished, Dao avoided writing the full N×N matrix to HBM and cut HBM reads and writes several-fold. The result: 2–4× faster attention on real hardware, not through more compute, but through smarter memory access patterns. FlashAttention was rapidly adopted across major labs and frameworks, and IO-aware attention kernels of this kind are now standard in large-model training and inference stacks. The paper was published at NeurIPS 2022 after receiving a best paper award at an ICML 2022 workshop on hardware-aware efficient training.

Standard Attention: The Quadratic Memory Problem

The self-attention mechanism at the heart of every Transformer model computes, for a sequence of N tokens, an N×N matrix of attention scores. For each query token, the model measures similarity against every key token, weights every value token accordingly, and produces an output. The problem: the full attention matrix has N² elements. At N=2,048 (GPT-3's training context), that is about 4 million elements per layer per head. At N=128,000 (the context window reported for GPT-4 Turbo), it is over 16 billion elements, far too large to fit in GPU SRAM. A standard implementation therefore writes every element to HBM and reads it back.

Standard PyTorch attention (before FlashAttention) materialized this full N×N matrix in HBM, requiring O(N²) memory reads and writes. Since HBM bandwidth is finite, longer sequences hit the bandwidth wall hard. A sequence 4× longer generates 16× the HBM traffic for attention, a brutal scaling law that made long-context models impractical on 2020-era hardware.
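
The quadratic growth is easy to check numerically. The sketch below sizes the score matrix for a single head in a single layer, assuming 2 bytes per element (FP16); that precision and the function name are assumptions made for illustration.

```python
# Size of the materialized N x N attention score matrix for one head in one layer.
def score_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to hold every query-key score (FP16 assumed by default)."""
    return seq_len * seq_len * bytes_per_element


if __name__ == "__main__":
    for seq_len in (2_048, 8_192, 128_000):
        size = score_matrix_bytes(seq_len)
        print(f"N={seq_len:>7,}: {size / 1e9:10.4f} GB per head per layer")
    ratio = score_matrix_bytes(8_192) / score_matrix_bytes(2_048)
    print(f"4x longer sequence -> {ratio:.0f}x larger matrix")
```

A real model multiplies this figure by the number of heads and layers, and a standard implementation both writes and re-reads the matrix, so actual HBM traffic is a multiple of these sizes.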

HBM reads/writes, standard attention vs FlashAttention (relative, N=4096)
Standard Attn: 100% (baseline)
FlashAttn v1: ~30%
FlashAttn v2: ~18%

Approximate HBM access reduction from Dao et al. (2022, 2023). Actual gains depend on sequence length and hardware.

The KV Cache: Trading Memory for Speed

During autoregressive inference — where a language model generates tokens one at a time — recomputing the key and value matrices from scratch for every new token is wasteful: the context from previous tokens doesn't change. The solution, universally implemented since 2020, is the KV cache: store all previous key and value tensors in HBM and read them on each new token generation step.

The bandwidth implication is severe. Take a 70-billion-parameter model shaped like Meta's Llama 2 70B: 80 transformer layers, head dimension 128, FP16 values. If all 64 attention heads kept their own keys and values (standard multi-head attention), the KV cache for a 4,096-token context would need roughly 10.7 GB of HBM per sequence. The H100 has 80 GB total. At batch size 1, this is manageable. At batch size 8 (eight simultaneous user requests), the KV cache alone would need about 86 GB, more than the full HBM capacity, leaving no room for model weights. This is why memory bandwidth and capacity together constrain LLM serving throughput, and why multi-query attention (MQA) and grouped-query attention (GQA), introduced in papers by Noam Shazeer (2019) and Ainslie et al. (2023), reduce the number of KV heads to cut cache size. Llama 2 70B as released uses GQA with 8 KV heads, which shrinks the same cache to about 1.3 GB per sequence.

Real Case — Llama 2 70B KV Cache Calculation

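80 layers × 64 KV heads × 128 head dimension × 2 (K+V) × 2 bytes (FP16) × 4,096 tokens ≈ 10.7 GB per batch item, assuming full multi-head attention with 64 KV heads. Eight concurrent users would then need about 86 GB, more than the H100's 80 GB of HBM, before any weights are loaded. With the 8 KV heads that Llama 2 70B actually uses under grouped-query attention, the same context needs about 1.3 GB per batch item. The weights themselves (~140 GB in FP16) already exceed a single H100, which is why production LLM serving is sized as multi-GPU configurations.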
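
The multiplication is worth running directly, since the result is sensitive to the number of KV heads. The sketch below is a minimal calculator with a hypothetical function name; it evaluates the 64-head multi-head case and the 8-head grouped-query configuration that Llama 2 70B uses, both at FP16.

```python
# KV cache size: 2 tensors (K and V) per layer, each [kv_heads, tokens, head_dim].
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, tokens: int,
                   bytes_per_value: int = 2, batch_size: int = 1) -> int:
    """Total bytes of cached keys and values across all layers and sequences."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * tokens * batch_size


if __name__ == "__main__":
    shape = dict(layers=80, head_dim=128, tokens=4096)
    mha = kv_cache_bytes(kv_heads=64, **shape)   # every head stores K and V
    gqa = kv_cache_bytes(kv_heads=8, **shape)    # 8 shared KV heads (GQA)
    print(f"multi-head, 64 KV heads: {mha / 1e9:.2f} GB per sequence")
    print(f"grouped-query, 8 heads:  {gqa / 1e9:.2f} GB per sequence")
    print(f"multi-head, batch of 8:  {8 * mha / 1e9:.2f} GB")
```

Cutting KV heads from 64 to 8 shrinks the cache by exactly 8×, which is the whole point of grouped-query attention for serving.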

FlashAttention and Its Successors

FlashAttention v1 (Dao et al., 2022) introduced IO-aware tiled attention. FlashAttention v2 (Dao, 2023) further improved GPU utilization by better partitioning work across thread blocks and reducing non-matmul FLOPs. FlashAttention-3 (Shah et al., 2024) was designed specifically for H100, exploiting asynchronous pipeline stages and the WGMMA instruction set to achieve ~75% of H100's theoretical FP16 throughput on attention. All versions achieve the same mathematical result as standard attention; only the memory access pattern changes. This class of optimization — reorganizing computation to minimize HBM round-trips without changing outputs — is called IO-aware algorithm design.
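
The claim that tiling leaves the result unchanged rests on the online-softmax trick: keep a running maximum and running sum, and rescale earlier partial results whenever the maximum changes. The sketch below shows that idea for a single query row in plain Python. It is a teaching illustration under simplifying assumptions (one query, tiling over keys only, no GPU memory management), not the FlashAttention kernel itself, and all function names are hypothetical.

```python
import math
import random


def attention_row_standard(q, keys, values):
    """One query row: materialize every score, then softmax, then weight the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def attention_row_tiled(q, keys, values, block=2):
    """Same result, but only one block of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    random.seed(0)
    q = [random.uniform(-1, 1) for _ in range(4)]
    keys = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(7)]
    values = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(7)]
    full = attention_row_standard(q, keys, values)
    tiled = attention_row_tiled(q, keys, values, block=3)
    print(max(abs(x - y) for x, y in zip(full, tiled)))
```

The two functions agree up to floating-point rounding, yet the tiled version never holds more than one block of scores, which is what lets a GPU kernel keep the working set in SRAM.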

The Broader Principle

FlashAttention demonstrates a general principle that will recur throughout AI hardware evolution: algorithms and hardware must be co-designed. The best AI systems are built by teams who understand both the mathematical structure of the computation and the physical constraints of the memory hierarchy. Neither hardware nor algorithms alone are sufficient.

Key Terms

KV Cache: Stored key and value tensors from previous context tokens, enabling autoregressive inference without recomputation. Major consumer of HBM during LLM serving.
IO-Aware Algorithm: An algorithm designed to minimize data movement between memory tiers, exploiting the hierarchy (registers → SRAM → HBM) rather than treating memory as uniform.
Grouped-Query Attention (GQA): A variant of multi-head attention that shares key-value heads across groups of query heads, reducing KV cache size. Used in Llama 2 70B and later models.

Lesson 3 Quiz

Attention, KV cache, and bandwidth-aware algorithms · 3 questions
1. Standard scaled dot-product attention has O(N²) memory complexity with respect to sequence length. The primary consequence for hardware is:
Correct. The N×N attention matrix must be materialized in HBM for standard implementations. Doubling N quadruples the matrix size and the HBM traffic per layer per head. This quadratic scaling made long-context models impractical before FlashAttention, even on hardware with sufficient compute capacity.
Not quite. The memory complexity is O(N²) — quadratic — meaning HBM reads and writes scale quadratically with sequence length. While FLOP count also increases with N², the binding constraint on real hardware is bandwidth: the attention layer runs far below peak FLOP throughput because HBM cannot supply data fast enough.
2. FlashAttention achieves 2–4× speedup over standard attention without changing the mathematical output. The key technique is:
Correct. FlashAttention tiles the attention computation into blocks that fit in GPU SRAM (shared memory). By fusing the softmax and matrix multiplications and keeping intermediates on-chip, it avoids materializing the full N×N matrix in HBM. The result is exact attention with far fewer HBM accesses — and consequently, far higher throughput on bandwidth-bound hardware.
Not quite. FlashAttention is an exact attention algorithm — it produces identical outputs to standard attention, not approximations. It does not quantize tensors or use multiple GPUs for this purpose. The speedup comes entirely from tiling the computation to fit in SRAM, avoiding repeated round-trips to HBM for intermediate softmax and attention score values.
3. A Llama 2 70B-sized model serving 8 simultaneous user requests at 4,096-token context requires approximately 86 GB of HBM just for KV cache when all 64 heads store their own keys and values. What does this imply?
Correct. Roughly 86 GB of KV cache exceeds the H100's 80 GB HBM, even before accounting for the 70B model's weights (~140 GB in FP16 or ~70 GB in INT8). This forces multi-GPU deployment (e.g., 4× H100), or use of techniques like grouped-query attention (which reduces KV heads 4–8×) and KV cache quantization to fit within a single GPU's memory budget.
Not quite. INT4 model weights for 70B parameters require ~35 GB, but roughly 86 GB of KV cache still exceeds the remaining 45 GB on an 80 GB H100. PCIe streaming of KV cache from CPU RAM would create a massive bandwidth bottleneck (PCIe 5.0 ≈ 128 GB/s vs H100 HBM ≈ 3.35 TB/s). Multi-GPU or cache reduction techniques are the practical solutions.

Lab 3 — KV Cache & FlashAttention Analysis

Work through attention bandwidth problems with your AI tutor

Your Task

You'll analyze how KV cache sizes scale with model architecture and context length, and explore how FlashAttention and grouped-query attention address the bandwidth bottleneck. Try computing KV cache sizes for specific models or asking about algorithm design tradeoffs.

Starter: "Walk me through calculating the KV cache size for Mistral 7B at 32K context length, and compare it to how GQA reduces that requirement."
Lesson 4 · Module 5

Beyond HBM — The Next Frontiers

Processing-in-memory, chiplet interconnects, optical bandwidth, and the architectural bets being placed for the post-HBM era.
If HBM is already the best conventional DRAM architecture, what comes next — and who is betting on which successor?

Every approach to memory bandwidth ultimately confronts the same physics: moving electrical signals between chips consumes energy proportional to distance and bus capacitance. HBM attacked this by shortening the distance (interposer packaging) and widening the bus. But the data still crosses a chip boundary twice per memory access — once from DRAM to interposer, once from interposer to logic die. Some engineers believe the only escape from this constraint is to bring computation to the data, not data to computation. Others are betting that photonics — using light rather than electricity — can remove the energy-per-bit constraint at chip boundaries entirely.

Processing-In-Memory (PIM)

Processing-in-memory places simple arithmetic units inside the DRAM stack itself, executing operations on data without it ever crossing to the logic chip. Samsung demonstrated its HBM-PIM product in 2021, integrating programmable compute units between DRAM layers. Each HBM2 stack contained a bank of SIMD processing engines capable of executing FP16 MAC operations. Samsung reported that HBM-PIM reduced power consumption by 70% and improved AI inference performance by 2.5× on their Aquabolt-XL platform for neural network workloads.

SK Hynix has pursued a similar concept under the name AiM (Accelerator-in-Memory), targeting transformer inference. The challenge for PIM is programmability: the compute units inside DRAM must be simple enough to fit in the memory die's transistor budget, but complex enough to handle the diversity of modern AI operations. Current PIM products handle limited operation sets — primarily vector dot products and activation functions — and require software adaptation to use effectively.

Samsung HBM-PIM — Documented Results (2021)

Samsung's HBM2-PIM product (Aquabolt-XL) was described in an IEEE paper at ISSCC 2021. It integrated 2 GFLOPS of FP16 compute per HBM die, adding ~0.5% die area. In GEMM-heavy AI inference benchmarks, the company reported 2.5× throughput improvement and 70% power reduction compared to standard HBM2 by eliminating inter-chip data transfer for computation that fits within the PIM engine's capability.

Chiplet Interconnects: NVLink, UCIe, and UALink

When a model is too large for one chip's HBM, data must travel between chips. The bandwidth of these inter-chip links directly constrains parallelism efficiency. NVIDIA's NVLink 4.0 (H100) provides 900 GB/s of total bidirectional bandwidth per GPU, and NVSwitch-connected 8-GPU systems expose a 640 GB aggregate HBM pool (8 × 80 GB) over those links. This is what makes tensor parallelism viable: a model's weight matrices are split across 8 GPUs, and partial sums are exchanged over NVLink fast enough that communication does not dominate the step time.
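
A rough estimate shows why link bandwidth decides whether tensor parallelism pays off. The sketch below uses the textbook ring all-reduce cost, in which each GPU sends 2(p-1)/p times the payload. Everything here is a simplifying assumption: real NVSwitch systems may use other collective algorithms, latency is ignored, 450 GB/s per direction is taken as half of the 900 GB/s bidirectional figure, and the payload shape is an arbitrary example.

```python
# Idealized ring all-reduce cost for exchanging partial sums across GPUs.
def ring_allreduce_bytes_per_gpu(payload_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends in a ring all-reduce of the given payload."""
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes


def allreduce_time_ms(payload_bytes: float, num_gpus: int, link_bytes_per_s: float) -> float:
    """Bandwidth-only time estimate in milliseconds (latency ignored)."""
    return ring_allreduce_bytes_per_gpu(payload_bytes, num_gpus) / link_bytes_per_s * 1e3


if __name__ == "__main__":
    # Example payload: 4,096 tokens x 8,192 hidden units x 2 bytes (FP16 activations).
    payload = 4096 * 8192 * 2
    nvlink_per_direction = 450e9   # assumed: half of 900 GB/s bidirectional
    pcie_per_direction = 64e9      # assumed: half of ~128 GB/s bidirectional
    print(f"NVLink: {allreduce_time_ms(payload, 8, nvlink_per_direction):.3f} ms")
    print(f"PCIe:   {allreduce_time_ms(payload, 8, pcie_per_direction):.3f} ms")
```

Because a tensor-parallel layer repeats this exchange many times per forward pass, a several-fold difference in link bandwidth translates directly into step time.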

Between 2022 and 2024, the industry launched two standardization efforts targeting future chiplet-to-chiplet and accelerator-to-accelerator connectivity. UCIe (Universal Chiplet Interconnect Express), announced in 2022 and supported by Intel, AMD, ARM, Qualcomm, and others, defines a standard die-to-die interface that can achieve up to ~1.3 TB/s per millimeter of die edge in its advanced-packaging variant. UALink (Ultra Accelerator Link), announced in May 2024 by AMD, Intel, Broadcom, Cisco, Google, Meta, Microsoft, and HPE, targets scale-up communication between AI accelerators within a pod, positioning itself as an open alternative to NVIDIA's proprietary NVLink for multi-chip AI clusters.

Optical Interconnects and Co-Packaged Optics

Electrical interconnects, even on-package, consume energy proportional to the capacitance they must charge and discharge. At scale — 100,000-GPU AI clusters — interconnect power becomes a meaningful fraction of total system power. Optical signals do not share this fundamental property: a photon traveling through a waveguide consumes energy only at the modulator and detector, not along the transmission path. Co-packaged optics (CPO) integrates optical transceivers directly into the switch or accelerator package, eliminating the electrical-to-optical conversion loss at a separate module.

NVIDIA announced co-packaged optical technology in 2024 for future Spectrum-X and NVLink platforms, with the expectation that CPO will become standard for data-center switch bandwidth by 2025–2026. Intel has invested heavily in silicon photonics through its own long-running research and product programs. Ayar Labs, a startup backed by Intel and others, demonstrated a chiplet-scale optical I/O die (TeraPHY) that achieves 2 Tb/s of bandwidth per chiplet at roughly 5 pJ/bit, competitive with advanced electrical interconnects but with far better scaling potential at distance.
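
Energy-per-bit figures become tangible once multiplied by a data rate. The sketch below converts the 5 pJ/bit and 2 Tb/s figures quoted above into watts; the function name is hypothetical, and the scaled-up case is an arbitrary illustration, not a product specification.

```python
# Link power (W) = energy per bit (J) x bits per second.
def link_power_watts(picojoules_per_bit: float, terabits_per_second: float) -> float:
    """Power drawn by a link running continuously at the given rate."""
    return (picojoules_per_bit * 1e-12) * (terabits_per_second * 1e12)


if __name__ == "__main__":
    per_chiplet = link_power_watts(5.0, 2.0)
    print(f"2 Tb/s at 5 pJ/bit: {per_chiplet:.1f} W")
    # Arbitrary illustration: the same efficiency at 100 Tb/s of aggregate I/O.
    print(f"100 Tb/s at 5 pJ/bit: {link_power_watts(5.0, 100.0):.0f} W")
```

Since power scales linearly with both terms, every picojoule saved per bit is multiplied by the full bandwidth of the cluster fabric.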

The Long View

No single technology will replace HBM by 2026. The most likely trajectory: HBM4 (targeting roughly 2 TB/s per stack over a 2,048-bit interface) arrives in 2025–2026 as the conventional path; PIM becomes a niche optimization for memory-intensive inference; UCIe and UALink standardize chiplet and accelerator bandwidth in high-end AI systems; co-packaged optics begins to replace electrical interconnects at the node-to-node level in large clusters. The memory bandwidth bottleneck does not disappear. It migrates to the next level of the hierarchy, where the next generation of models inevitably runs out of bandwidth again.

Key Terms

Processing-In-Memory (PIM): Architecture placing compute units inside the DRAM stack, executing operations without transferring data to the logic chip. Reduces bandwidth demand and power for compatible operations.
UCIe (Universal Chiplet Interconnect Express): An open industry standard for die-to-die interconnect, enabling chips from different manufacturers to communicate at near-monolithic speeds within a package.
UALink (Ultra Accelerator Link): An open consortium standard (2024) targeting scale-up AI accelerator interconnects, positioned as an alternative to NVIDIA's proprietary NVLink.
Co-Packaged Optics (CPO): Integration of optical transceivers directly into the chip package, reducing electrical-to-optical conversion loss and enabling higher bandwidth at longer distances with lower energy per bit.

Lesson 4 Quiz

PIM, chiplet interconnects, and optical bandwidth · 3 questions
1. Samsung's HBM-PIM (Aquabolt-XL) reported 70% power reduction in AI inference benchmarks compared to standard HBM2. The most accurate explanation for this improvement is:
Correct. Every data byte transferred from DRAM to a logic chip consumes energy to charge and discharge bus capacitance. PIM eliminates this transfer for operations it can handle internally. Samsung's reported 70% power reduction comes primarily from eliminating the HBM→GPU data movement energy cost for applicable compute operations like FP16 SIMD dot products.
Not quite. PIM's power efficiency comes from eliminating the energy cost of transferring data across the chip boundary. Moving electrical signals between chips requires charging and discharging bus capacitance — a significant energy cost at the bandwidth levels HBM operates at. If computation can happen inside the memory stack, that transfer energy is avoided entirely.
2. UALink, announced in May 2024, is best described as:
Correct. UALink was announced in May 2024 by a consortium including AMD, Intel, Broadcom, Cisco, Google, Meta, Microsoft, and HPE. It targets scale-up communication between AI accelerators in data-center clusters, the same market as NVIDIA's NVLink, offering an open standard alternative.
Not quite. UALink is an open industry consortium standard for accelerator-to-accelerator communication, announced May 2024 by AMD, Intel, Broadcom, Cisco, Google, Meta, Microsoft, and HPE. It is not a DRAM standard, not an optical standard, and not AMD's internal chiplet bus (that is Infinity Fabric).
3. Co-packaged optics (CPO) offers a fundamental advantage over electrical interconnects at data-center scale because:
Correct. Electrical interconnect energy consumption scales with bus capacitance, which increases with trace length. Optical signals require energy only at the modulator and detector — the waveguide propagation itself is nearly free. This makes optical interconnects fundamentally more efficient at distance, which is why CPO is attractive for connecting nodes in large AI clusters where cable runs span meters to tens of meters.
Not quite. While EMI immunity and speed are minor benefits, the fundamental advantage of optical interconnects is energy efficiency at distance. Electrical signals must charge and discharge capacitance along the entire trace length. Optical signals lose energy only at modulation and detection, not during propagation. At data-center cable distances, this difference compounds into meaningful power savings for high-bandwidth links.

Lab 4 — Future Bandwidth Technologies

Explore PIM, chiplet interconnects, and photonics with your AI tutor

Your Task

You'll examine the technologies positioned to succeed or supplement HBM: processing-in-memory, open chiplet standards (UCIe, UALink), and co-packaged optical interconnects. Ask about technical tradeoffs, industry dynamics, and what the next generation of AI clusters might look like.

Starter: "Compare the practical limitations of processing-in-memory (PIM) versus co-packaged optics as strategies for addressing the memory bandwidth bottleneck. Which is more likely to matter for LLM training by 2027?"
AI Tutor
Future Bandwidth · L4
Welcome to Lab 4. Let's explore the technologies competing to push past HBM's limits — processing-in-memory, chiplet interconnect standards like UCIe and UALink, co-packaged optics, and how these fit into the trajectory of AI accelerator design. What would you like to examine first?

Module 5 — Final Test

The Memory Bandwidth Bottleneck · 15 questions · Pass at 80%
1. Memory bandwidth is measured in GB/s or TB/s. For AI workloads, it represents:
Correct. Bandwidth is a throughput measure — bytes per second — not a capacity or latency measure. For memory-bound operations like attention, it caps achievable performance.
Memory bandwidth is a rate (bytes per second), not a capacity or compute speed measure. It sets the ceiling on how fast data can reach the arithmetic units.
2. A workload with arithmetic intensity of 50 FLOP/byte running on a GPU with a "ridge point" (peak FLOP/s ÷ peak bandwidth) of 100 FLOP/byte is:
Correct. In the Roofline model, workloads to the left of the ridge point (lower arithmetic intensity) are memory-bound. Because 50 FLOP/byte is below the 100 FLOP/byte ridge, this workload is memory-bound.
The Roofline model places workloads below the ridge point in the memory-bound region. With a ridge at 100 FLOP/byte and arithmetic intensity of 50, this workload cannot saturate the compute units — bandwidth is the constraint.
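The ridge-point comparison can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical GPU whose peak figures were picked only so that the ridge lands at 100 FLOP/byte; the function names are illustrative, not part of any library.

```python
def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline ceiling: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, peak_bandwidth * intensity)


def classify(intensity, peak_flops, peak_bandwidth):
    """Label a workload by comparing its arithmetic intensity to the ridge point."""
    ridge = peak_flops / peak_bandwidth  # FLOP/byte
    return "memory-bound" if intensity < ridge else "compute-bound"


# Hypothetical hardware: 100 TFLOP/s and 1 TB/s give a ridge of 100 FLOP/byte.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 1e12

print(classify(50, PEAK_FLOPS, PEAK_BANDWIDTH))
# Fraction of peak compute reachable at 50 FLOP/byte on this hardware.
print(attainable_flops(50, PEAK_FLOPS, PEAK_BANDWIDTH) / PEAK_FLOPS)
```

At half the ridge intensity, the bandwidth slope limits this workload to half of peak compute, no matter how many arithmetic units are available.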
3. The "memory wall" concept, introduced by Wulf and McKee in 1994, predicted that:
Correct. Wulf and McKee's key argument was the compounding performance gap between CPU speed (~54%/yr) and DRAM bandwidth (~7%/yr), eventually making memory access time dominant.
The memory wall paper argued that DRAM bandwidth grew much slower than processor performance, not that latency or cost were the primary issues. The compounding gap meant memory would eventually dominate total execution time.
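To see how quickly such a gap compounds, here is a small sketch that uses the two growth rates quoted in the feedback above (~54%/yr and ~7%/yr) as its only inputs. The function name is illustrative, and the year counts are arbitrary sample points.

```python
def performance_gap(years, cpu_growth=0.54, dram_growth=0.07):
    """Ratio of processor improvement to DRAM improvement after `years`."""
    return ((1 + cpu_growth) / (1 + dram_growth)) ** years


for years in (5, 10, 20):
    print(f"after {years} years the gap is about {performance_gap(years):.0f}x")
```

Because the ratio is raised to the number of years, a modest annual difference turns into a gap of well over an order of magnitude within a decade.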
4. The world's first consumer GPU to ship with HBM was:
Correct. AMD's Fiji-based Radeon R9 Fury X, launched June 2015, was the first consumer GPU with HBM, using four HBM1 stacks on a silicon interposer to achieve 512 GB/s.
The first consumer GPU with HBM was AMD's Radeon R9 Fury X (Fiji GPU), launched June 2015 in partnership with SK Hynix. NVIDIA continued using GDDR5 and later GDDR5X on consumer cards until much later.
5. HBM achieves higher bandwidth at lower power than GDDR primarily due to:
Correct. Width, not speed, is HBM's core advantage. A 1,024-bit bus achieves high throughput at moderate frequencies; shorter electrical paths on the interposer reduce energy per bit.
HBM's advantage comes from its wide bus (1,024 bits per stack) combined with the short electrical path on the silicon interposer. It actually runs at lower frequencies than GDDR and uses standard DRAM cell technology.
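The width-versus-speed tradeoff follows from one formula: bandwidth = bus width × per-pin data rate ÷ 8. The sketch below applies it to three configurations. The per-pin rates (1 Gb/s for HBM1, 11 Gb/s for GDDR5X, 6.4 Gb/s for HBM3) and the 352-bit GDDR5X bus are commonly quoted figures assumed here, not values taken from this module.

```python
def bandwidth_gb_s(bus_width_bits, pin_rate_gbps, stacks=1):
    """Peak bandwidth in GB/s: bits per transfer x Gb/s per pin / 8 bits per byte."""
    return bus_width_bits * pin_rate_gbps / 8 * stacks


# Assumed pin rates and bus widths (see the note above).
configs = {
    "HBM1, 4 stacks x 1,024-bit @ 1 Gb/s": bandwidth_gb_s(1024, 1.0, stacks=4),
    "GDDR5X, 352-bit @ 11 Gb/s": bandwidth_gb_s(352, 11.0),
    "HBM3, 1 stack x 1,024-bit @ 6.4 Gb/s": bandwidth_gb_s(1024, 6.4),
}
for name, gb_s in configs.items():
    print(f"{name}: {gb_s:.0f} GB/s")
```

Under these assumptions the formula reproduces the 512 GB/s and 484 GB/s figures cited in this test, and shows the HBM configurations reaching them at a far lower per-pin rate than GDDR5X.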
6. NVIDIA's V100 GPU (2017) paired HBM2 with Tensor Cores in an AI-focused accelerator, following the HBM2-equipped P100 of 2016. Its HBM2 bandwidth of 900 GB/s compared to the consumer GTX 1080 Ti's 484 GB/s meant:
Correct. The V100's 900 GB/s HBM2 versus the 1080 Ti's 484 GB/s GDDR5X provided nearly 2× the memory throughput for bandwidth-bound operations — a decisive advantage for attention layers and other LLM workloads, compounded by Tensor Cores for matrix math.
The bandwidth ratio is approximately 900/484 ≈ 1.86×. For bandwidth-bound operations (which most attention layers are), this is nearly a 2× throughput difference before even accounting for Tensor Cores. It made the V100 the dominant training card of 2017–2020.
7. HBM3e (used in NVIDIA H200 and AMD MI325X, 2024) improves upon HBM3 primarily by:
Correct. HBM3e pushes per-stack bandwidth to approximately 1.2 TB/s (up from ~819 GB/s for HBM3) and increases stack capacity to 24–36 GB by adding more DRAM die layers and increasing the I/O data rate per TSV.
HBM3e improves on HBM3 through higher electrical data rates per TSV and additional DRAM layers per stack, increasing bandwidth (~1.2 TB/s vs ~819 GB/s) and capacity (up to 36 GB/stack). Optical links and SRAM layers are not part of HBM3e's design.
8. Standard scaled dot-product self-attention is O(N²) in both compute and memory with respect to sequence length N. For an AI accelerator that is memory-bandwidth-bound for this operation, doubling N will approximately:
Correct. O(N²) memory complexity means the amount of HBM data read/written scales as N². Doubling N → 4× the data → 4× the time on a bandwidth-bound system. This quadratic wall is why long-context models required FlashAttention to become practical.
For a bandwidth-bound operation with O(N²) memory complexity, doubling N quadruples the HBM data accessed and therefore quadruples execution time. This is the core reason why long-context inference was impractical before FlashAttention.
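The quadratic growth is easy to tabulate. This sketch assumes FP16 scores, a hypothetical 64 attention heads, and the 3.35 TB/s H100 bandwidth quoted earlier in the module. It counts only a single pass over the N×N score matrix, so the times are illustrative lower bounds, not measurements.

```python
def attention_scores_bytes(seq_len, n_heads=1, bytes_per_value=2):
    """Bytes needed to hold the full N x N score matrix for every head."""
    return n_heads * seq_len * seq_len * bytes_per_value


def transfer_seconds(n_bytes, bandwidth_bytes_per_s):
    """Time to move n_bytes once at the given bandwidth (bandwidth-bound model)."""
    return n_bytes / bandwidth_bytes_per_s


HBM_BANDWIDTH = 3.35e12  # bytes/s, H100 figure quoted earlier in the module
N_HEADS = 64             # hypothetical head count

base = attention_scores_bytes(4096, N_HEADS)
for n in (4096, 8192, 16384):
    size = attention_scores_bytes(n, N_HEADS)
    ms = transfer_seconds(size, HBM_BANDWIDTH) * 1e3
    print(f"N={n}: {size / 1e9:.1f} GB, {size / base:.0f}x baseline, {ms:.2f} ms per pass")
```

Each doubling of N multiplies both the bytes and the transfer time by four, which is the scaling the question asks about.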
9. FlashAttention's approach to reducing HBM bandwidth consumption is best described as:
Correct. FlashAttention uses tiling and kernel fusion: the N×N attention matrix is computed in SRAM-sized blocks, and the softmax normalization is handled with an online algorithm that never writes the full matrix to HBM. The output is identical to standard attention.
FlashAttention is an exact algorithm — not approximate. It works by tiling the computation into SRAM-resident blocks and fusing operations (Q@K^T, softmax, @V) into a single kernel, avoiding HBM writes of the full N×N intermediate matrix. No quantization or prefetch hardware is required.
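The online-softmax idea behind the tiling can be illustrated for a single query vector in plain Python. This is a teaching sketch, not the FlashAttention kernel: the real algorithm processes blocks of queries in GPU SRAM inside one fused kernel, and the function names and sample data below are made up for illustration.

```python
import math


def _score(q, k):
    # Scaled dot product between one query vector and one key vector.
    return sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))


def attention_reference(q, keys, values):
    """Standard attention for one query: every score is materialized first."""
    scores = [_score(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def attention_tiled(q, keys, values, block_size=2):
    """Same result, visiting keys/values one block at a time (online softmax)."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [_score(q, k) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Made-up sample data: 5 keys/values, query dimension 3, value dimension 2.
q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5],
        [-1.0, 0.5, 2.0], [0.2, 0.2, 0.2]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, block_size=2)
print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(ref, tiled)))
```

The tiled version only ever holds one block of scores plus a running maximum, a running sum, and an accumulator, which is why the full N×N matrix never needs to be written out. The two functions agree to floating-point tolerance, consistent with the algorithm being exact.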
10. The KV cache for a 70B-parameter LLM with 80 layers, 64 KV heads, 128 head dimension, FP16 precision, and 4,096-token context is approximately:
Correct. 80 layers × 64 KV heads × 128 dims × 2 (K+V) × 2 bytes (FP16) × 4,096 tokens = 80 × 64 × 128 × 2 × 2 × 4096 ≈ 10.7 GB (10 GiB) per batch item. At batch size 8, this reaches about 86 GB, more than the H100's 80 GB of HBM even before model weights are loaded.
The calculation: 80 layers × 64 KV heads × 128 head_dim × 2 (keys + values) × 2 bytes (FP16) × 4,096 tokens = 80 × 64 × 128 × 2 × 2 × 4096 ≈ 10.7 GB. This is the standard KV cache calculation for a model with Llama 2 70B's layer count and head dimension, assuming full multi-head attention (64 KV heads) at this context length.
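The product can be checked term by term with a short function. The parameter names are illustrative; the values are the ones given in the question.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """KV cache size: keys and values for every layer, KV head, and token."""
    keys_and_values = 2
    return (keys_and_values * n_layers * n_kv_heads * head_dim
            * bytes_per_value * seq_len * batch_size)


size = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=4096)
print(f"{size:,} bytes = {size / 1e9:.1f} GB = {size / 2**30:.1f} GiB")
```

With these inputs the result is 10,737,418,240 bytes, which is exactly 10 GiB or about 10.7 GB per sequence. Raising `batch_size` or `seq_len` scales the total linearly.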
11. Grouped-Query Attention (GQA), used in Llama 2 70B, reduces KV cache size by:
Correct. GQA (Ainslie et al., 2023) reduces the number of distinct KV heads. If 64 query heads share 8 KV heads (groups of 8), the KV cache is 8× smaller than standard multi-head attention with 64 KV heads, dramatically reducing HBM requirements for long-context inference.
GQA reduces KV cache by having multiple query heads share the same key-value heads. Rather than 64 unique K and V tensors (one per head), you might have 8 — each shared by 8 query heads. This directly reduces KV cache memory by the grouping factor.
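The grouping factor can be made concrete with a per-token comparison. This sketch assumes the 80-layer, 128-dimension, FP16 configuration from the previous question and the 64-query-head, 8-KV-head grouping described above; the helper names are illustrative.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of keys plus values stored for one token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def gqa_group_size(n_query_heads, n_kv_heads):
    """Number of query heads that share each KV head."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    return n_query_heads // n_kv_heads


mha = kv_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)
gqa = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
print(f"MHA: {mha / 2**20:.2f} MiB per token")
print(f"GQA: {gqa / 2**20:.4f} MiB per token")
print(f"reduction: {mha // gqa}x, group size: {gqa_group_size(64, 8)}")
```

The cache shrinks by exactly the group size because only the KV head count changes; the query heads and the attention output shape are unaffected.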
12. Samsung's HBM-PIM product integrated compute units inside the DRAM stack. The fundamental limitation of current PIM designs for general AI workloads is:
Correct. DRAM dies have a limited transistor budget — the primary die area is DRAM cells. PIM compute units must be simple SIMD engines (handling FP16 dot products, activation functions) rather than the complex, flexible compute engines in a GPU. This limits which AI operations can run on PIM without software adaptation.
The core PIM challenge is the transistor budget: DRAM dies are designed for storage, not computation. The compute units that fit alongside DRAM cells must be simple — limiting PIM to specific, regular operations like vector dot products rather than the full diversity of AI layer types.
13. UCIe (Universal Chiplet Interconnect Express) is significant for AI hardware because:
Correct. UCIe defines electrical and protocol specifications for die-to-die interconnect, allowing chiplets from Intel, AMD, ARM, Qualcomm, and others to be combined in a single package with high-bandwidth, low-latency connections — critical for disaggregated AI accelerator designs.
UCIe is a die-to-die interconnect standard for chiplet integration, backed by Intel, AMD, ARM, Qualcomm, and others. It is not an optical standard, not PCIe, and not an HBM packaging standard. Its importance for AI is enabling multi-vendor chiplet compositions at near-on-chip bandwidth.
14. NVLink 4.0 on the H100 provides ~900 GB/s total bidirectional bandwidth. This is critical for LLM training because:
Correct. Tensor parallelism splits weight matrices across GPUs and requires all-reduce communication of partial sums in every layer, in both the forward and backward pass. Pipeline parallelism passes activations between stages. Both require high inter-GPU bandwidth. NVLink 4.0's ~900 GB/s keeps these strategies viable for large models without communication becoming the primary bottleneck.
NVLink is the inter-GPU interconnect that enables multi-GPU training strategies (tensor parallelism, pipeline parallelism) by providing sufficient bandwidth for gradient and activation communication. Without high NVLink bandwidth, models too large for a single GPU's HBM would be severely communication-bottlenecked.
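A rough bandwidth-only estimate shows why the link speed matters. The sketch assumes a ring all-reduce, in which each GPU sends 2(p−1)/p times the message size, and it ignores latency and overlap with compute. The 1 GB message, the 8-GPU group, and the ~64 GB/s per-direction PCIe 5.0 x16 figure are assumptions for illustration; 450 GB/s is half of the ~900 GB/s bidirectional NVLink figure above.

```python
def ring_allreduce_seconds(message_bytes, n_gpus, link_bytes_per_s):
    """Bandwidth-only time for a ring all-reduce (latency ignored)."""
    traffic_per_gpu = 2 * (n_gpus - 1) / n_gpus * message_bytes
    return traffic_per_gpu / link_bytes_per_s


MESSAGE_BYTES = 1e9   # hypothetical 1 GB of partial sums or gradients
N_GPUS = 8            # hypothetical tensor-parallel group size

links = {
    "NVLink 4.0 (450 GB/s per direction)": 450e9,
    "PCIe 5.0 x16 (~64 GB/s per direction, assumed)": 64e9,
}
for name, bandwidth in links.items():
    ms = ring_allreduce_seconds(MESSAGE_BYTES, N_GPUS, bandwidth) * 1e3
    print(f"{name}: {ms:.2f} ms per all-reduce")
```

Since tensor parallelism repeats this exchange in every layer, a several-fold difference per all-reduce adds up to a large share of each training step on the slower link.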
15. Co-packaged optics (CPO) is most likely to impact AI infrastructure first in which application?
Correct. CPO's energy-per-bit advantage scales with distance — making it most impactful first at the data-center networking layer (top-of-rack switches, server-to-server cables), where electrical interconnects at 400G+ speeds face rising power costs. NVIDIA has announced CPO for Spectrum-X switches; on-package and on-chip optical links are further out.
Optical physics means CPO is most valuable where electrical signaling at high bandwidth over distance is most power-intensive — data-center switch interconnects and server cables. On-chip and on-package optical is significantly further from commercial deployment, while switch-level CPO is already in product announcements for 2025–2026.