Module 4 · Lesson 1

Why Vanilla Attention Breaks at Scale

The quadratic wall that launched a thousand optimizations

What fundamental cost made standard self-attention impractical beyond a few thousand tokens?

When OpenAI released GPT-3 in May 2020 with a 2,048-token context window, that limit was not an arbitrary design choice — it was an engineering necessity. Standard self-attention requires every token to attend to every other token, so doubling the sequence length quadruples the compute and memory. At 2 K tokens, a single forward pass was already brushing the limits of an A100 GPU's 40 GB of HBM.

The researchers who wanted longer contexts were not facing a puzzle — they were facing a physics problem. Memory bandwidth and FLOP budgets are finite. The question became: can attention be redesigned so that the cost grows more slowly, ideally linearly, without sacrificing the quality that made transformers great in the first place?

The O(n²) Problem, Precisely

Standard scaled dot-product attention computes a similarity score between every query–key pair in a sequence of length n. That produces an n × n attention matrix. Storing that matrix in float16 for a single attention head, a single layer, at n = 100,000 tokens requires roughly 20 GB — more than a full A100 GPU just for one layer's attention scores.

The FLOPs scale the same way. Each of the n queries must be multiplied against each of the n keys (a dot product of dimension d), then softmaxed, then used to aggregate values. The total cost is O(n² · d) per head per layer. At n = 4 K, this is tolerable. At n = 100 K, it is not.

This is the context window race's central constraint. Every technique in Module 4 is a different answer to the same question: how do we approximate or restructure attention to escape the quadratic wall?

O(n²)

Vanilla Attention Cost

Both compute (FLOPs) and memory scale with the square of sequence length.

20 GB

Attn Matrix at 100K tokens

Float16, one head, one layer — more than an entire A100 GPU's HBM.

2 048

GPT-3 Context (2020)

The limit was set by GPU memory, not model capability.

4×

Cost Multiplier per 2× Length

Doubling context length quadruples memory and compute requirements.

Two Families of Solution

By 2021 the literature had converged on two broad strategies. The first is sparse attention: instead of computing all n² pairs, only compute a structured subset — local windows, dilated patterns, global tokens, or some combination. Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) are canonical examples, each replacing the full matrix with a pattern that costs O(n) or O(n · √n).

The second is low-rank approximation: instead of computing the exact n × n matrix, approximate it as the product of two smaller matrices of rank k, reducing cost to O(n · k). Linformer (Wang et al., 2020) demonstrated this approach, projecting keys and values to a fixed dimension of k = 256 regardless of sequence length.

A third, hybrid approach — hardware-aware exact attention — arrived with FlashAttention (Dao et al., 2022), which keeps the full O(n²) computation but reorganizes memory access to fit within SRAM, dramatically reducing wall-clock time without any approximation. This lesson focuses on the first two families; FlashAttention is covered in Lesson 3.

Key Paper

Efficient Transformers: A Survey (Tay et al., 2020, Google Brain) catalogued 12 distinct efficient attention methods published within a single year, all attempting to escape the same quadratic bottleneck. The survey became one of the most-cited transformer papers of the decade.

Key Terms

Self-AttentionA mechanism where each token in a sequence attends to all other tokens, producing context-aware representations. Standard version costs O(n²) in both compute and memory.

Attention MatrixThe n × n matrix of softmaxed query–key similarity scores. The root cause of the quadratic scaling problem.

Sparse AttentionAttention variants that compute only a structured subset of query–key pairs, reducing cost to sub-quadratic complexity.

Low-Rank ApproximationRepresenting the attention matrix as the product of two smaller matrices, capping cost at O(n · k) where k is a fixed rank.

Lesson 1 Quiz

Why Vanilla Attention Breaks at Scale

What is the computational complexity of standard scaled dot-product self-attention with respect to sequence length n?

Correct. Every token must attend to every other token, producing an n × n attention matrix. Both compute and memory scale as O(n²).

Not quite. The standard attention matrix is n × n — every query attends to every key — making the cost O(n²) in both compute and memory.

GPT-3's 2,048-token context window in 2020 was primarily limited by which factor?

Correct. The quadratic memory cost of the attention matrix meant that at 2 K tokens a single layer's attention scores were already straining available A100 memory.

The primary bottleneck was GPU memory. The O(n²) attention matrix grows rapidly; storing it for longer sequences exceeded available HBM on A100 GPUs.

Linformer's core idea for reducing attention cost is best described as:

Correct. Linformer (Wang et al., 2020) projects keys and values to a fixed dimension k (e.g. 256), reducing attention cost to O(n · k) regardless of n.

That describes a different approach. Linformer uses low-rank projection — keys and values are projected to a fixed-size dimension k, making cost O(n · k) rather than O(n²).

Lab 1 — The Quadratic Wall

Explore the scaling properties of vanilla attention with your AI tutor

Your Task

In this lab you'll work through the concrete numbers behind the O(n²) scaling problem. Ask your AI tutor to walk you through calculations: how much memory does a single attention layer consume at different sequence lengths? Where does the quadratic wall become practically impassable? What happens to FLOP counts as context grows?

Suggested opener: "Walk me through exactly how much GPU memory is needed to store the attention matrix for a single head at 4K, 16K, 32K, and 128K tokens in float16."

AI Tutor

Efficient Attention · Lab 1

Welcome to Lab 1. Let's make the quadratic wall concrete. I can walk you through memory calculations, FLOP counts, or comparisons of specific sequence lengths. What would you like to explore first?

Module 4 · Lesson 2

Sparse Attention: Longformer, BigBird, and Local Windows

Structured sparsity as the first great escape from O(n²)

How did Longformer and BigBird prove that you don't need every token to see every other token to achieve strong language understanding?

Iz Beltagy and colleagues at the Allen Institute for AI published Longformer in April 2020 with a deceptively simple observation: for most NLP tasks, a token does not actually need direct access to every other token in the sequence. Language has local structure. A word depends heavily on its neighbors and only occasionally on something thousands of positions away.

Their solution was a combination of a sliding-window local attention — each token attending to its w nearest neighbors on each side — plus a small set of global tokens that could attend to the entire sequence. The result was O(n · w) complexity, where w is the window size, constant and small. On the SCROLLS benchmark, Longformer outperformed GPT-2 on summarization tasks with sequences up to 16,000 tokens, something impossible for full attention within the same memory budget.

Longformer's Three-Pattern Design

Longformer (Beltagy et al., 2020) combines three attention patterns in a single layer. Sliding window attention gives every token access to w tokens on each side, capturing local syntax and semantics. Dilated sliding window attention extends reach by skipping positions at a fixed dilation rate d, enabling coverage of 2wd positions with the same compute budget. Global attention designates special tokens — CLS tokens, question tokens in QA tasks — that attend to the entire sequence and receive attention from every other token.

The practical result: on a single 32 GB GPU, Longformer could process sequences of 4,096 tokens in the same time and memory that full attention required for just 512 tokens. On the CORD-19 scientific literature corpus, it achieved state-of-the-art results on document-level tasks that were simply out of reach for BERT-style models.

Documented Result

Longformer-base (149M parameters) achieved 91.4 F1 on HotpotQA, a multi-document question answering benchmark requiring reasoning across long contexts, compared to RoBERTa-large's 89.4 F1 — using less memory because of the sparse attention pattern.

BigBird: Random + Local + Global

Google Brain's BigBird (Zaheer et al., 2020) took a graph-theoretic approach. The authors proved, using results from expander graph theory, that any sparse attention pattern achieving three properties is theoretically equivalent to full attention as a universal approximator: local edges (sliding window), global tokens, and random edges. The random edges — each token attending to a small random subset of other tokens — allow information to propagate across the sequence with high probability in O(log n) hops.

BigBird extended BERT to 4,096-token sequences and set new state-of-the-art results on Genomics tasks, where DNA and protein sequences routinely exceed tens of thousands of characters — a domain where standard transformers were essentially unusable.

Method	Pattern	Complexity	Max Context Demonstrated
Full Attention	All-to-all	O(n²)	~2–4K (2020 hardware)
Longformer	Local window + global	O(n · w)	16K tokens
BigBird	Local + global + random	O(n)	4K–16K tokens
Sparse Transformer (Child et al.)	Strided + local	O(n · √n)	12K image tokens (OpenAI, 2019)

The Cost of Sparsity: What You Lose

Sparse attention is not a free lunch. Implementing irregular sparse patterns efficiently on GPUs — which are optimized for dense, contiguous memory access — is genuinely hard. Longformer required custom CUDA kernels. Without those kernels, a naive sparse implementation can actually be slower than dense attention because of scattered memory reads.

There is also a representational cost. Tasks requiring a token to integrate information from two distant, unrelated positions — rare in NLP but common in certain document retrieval and code analysis tasks — suffer. The global tokens partially mitigate this, but they introduce a new design choice: which tokens should be global? In question answering, the question tokens are obvious candidates. In summarization, the choice is less clear.

By 2022, many practitioners had moved away from hand-crafted sparse patterns in favor of FlashAttention (full attention with hardware-aware implementation) or retrieval-augmented architectures. But sparse attention's conceptual legacy endures: not all attention is equally necessary, a principle that resurfaced in modern KV-cache eviction strategies.

Key Terms

Sliding Window AttentionEach token attends only to tokens within a fixed window of size w on each side, reducing cost to O(n · w).

Global TokenA designated token (e.g. CLS) that attends to all positions and receives attention from all positions, enabling long-range information flow at O(n) extra cost per global token.

Dilated AttentionA sliding window that skips positions at a fixed rate, extending the receptive field without increasing compute.

Random AttentionEach token attends to a small random set of other tokens. Theoretically sufficient for information to propagate across the sequence in O(log n) hops.

Lesson 2 Quiz

Sparse Attention: Longformer, BigBird, and Local Windows

Longformer's sliding window attention achieves what computational complexity?

Correct. Each token attends to w neighbors on each side, so total operations scale as O(n · w). Since w is a fixed constant, this is effectively O(n).

Longformer's sliding window makes cost O(n · w), where w is a fixed window size. Since w is constant, cost grows linearly with sequence length.

BigBird's theoretical justification for its sparse pattern draws on which mathematical framework?

Correct. Zaheer et al. used results from expander graphs to prove that local + global + random attention patterns are universal approximators, equivalent to full attention.

BigBird's theoretical foundation is expander graph theory. The authors proved that their sparse combination is a universal approximator with the same theoretical power as full attention.

Why can a naive sparse attention implementation sometimes be slower than dense attention on GPUs?

Correct. GPU performance depends heavily on contiguous memory access patterns. Scattered reads for irregular sparse attention underutilize memory bandwidth, negating the FLOP savings.

The key issue is memory access patterns. GPU hardware is built around dense, contiguous reads. Irregular sparse patterns cause many small non-contiguous reads that stall memory pipelines.

Lab 2 — Sparse Pattern Design

Reason through sparse attention design choices with your AI tutor

Your Task

Explore the design space of sparse attention patterns. Ask your tutor to compare Longformer vs BigBird for specific tasks, reason through window size choices, or explain why global tokens work for question answering but create challenges for other tasks. Consider what tasks sparse attention handles poorly.

Suggested opener: "For a legal contract analysis task with 20,000-token documents, would Longformer or BigBird be a better choice, and why? What window size would you recommend?"

AI Tutor

Efficient Attention · Lab 2

Welcome to Lab 2. I can help you think through sparse attention pattern choices for different tasks and document types. What task or comparison would you like to explore?

Module 4 · Lesson 3

FlashAttention: Exact Attention, Faster

How a memory hierarchy insight made full attention practical at 100K tokens

FlashAttention achieves dramatic speedups without approximating anything — what changed?

In June 2022, Tri Dao and colleagues at Stanford published a paper that reframed the entire efficient-attention conversation. The title was simply FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. The key word was exact. Unlike Longformer or Linformer, FlashAttention made no approximation. It computed the same attention as vanilla scaled dot-product attention — just reorganized the computation so that the GPU's small, fast SRAM was used instead of slow HBM for intermediate results.

The results were startling. FlashAttention was 2–4× faster than standard PyTorch attention on an A100, used 10–20× less memory for the attention computation, and enabled training on sequences of 64K tokens on a single GPU. GPT-4-length contexts, previously requiring model parallelism across multiple nodes, became achievable on a single accelerator.

The IO-Awareness Insight

Modern GPUs have a memory hierarchy. HBM (High Bandwidth Memory) is large — 40–80 GB on an A100 — but relatively slow: ~2 TB/s bandwidth. SRAM (on-chip memory, equivalent to L2 cache) is tiny — about 20 MB total — but very fast: ~19 TB/s bandwidth. Standard attention implementations constantly read and write the n × n attention matrix to HBM: compute QK^T, write it to HBM, read it back for softmax, write again, read again for the value multiplication. At n = 8K, this is hundreds of gigabytes of HBM traffic.

Dao et al.'s insight: if the computation can be reorganized into tiles small enough to fit in SRAM, the n × n matrix never needs to be materialized in HBM at all. The trick is a numerically stable tiling of the softmax — computing running max and sum statistics to allow the softmax to be computed incrementally across tiles. The result is that attention requires O(n) HBM reads and writes instead of O(n²).

Benchmark Result — Dao et al. 2022

On an A100 80 GB GPU, FlashAttention achieved 124 TFLOPs/s on attention computation versus 38 TFLOPs/s for standard PyTorch attention — a 3.3× improvement — while using the same number of FLOPs. The speedup comes entirely from reducing HBM memory traffic.

FlashAttention-2 and Beyond

FlashAttention-2 (Dao, 2023) refined the approach with better parallelization across attention heads and sequence positions, reducing non-matmul FLOPs and improving occupancy on modern GPUs. It reached ~70% of peak A100 FLOP utilization on attention — extraordinarily high for a memory-bound operation. Anthropic, OpenAI, and Mistral all adopted FlashAttention-2 as a standard component of their training stacks.

FlashAttention-3 (Shah et al., 2024) targeted NVIDIA's H100 specifically, exploiting asynchronous memory pipelines to overlap data movement with computation. On H100s, it achieved up to 1.5× additional speedup over FlashAttention-2, pushing toward 75% utilization of the H100's 989 TFLOP/s theoretical peak.

Crucially, FlashAttention's impact extends beyond training. For inference, reducing KV-cache memory traffic is equally important. The same tiling strategy applies to decoding, enabling longer context inference without proportionally larger memory bandwidth requirements.

3.3×

Speedup vs PyTorch Attn

FlashAttention on A100, same FLOPs, same numerical result.

20×

Memory Reduction

HBM usage for attention computation on long sequences.

64K

Single-GPU Context (2022)

Sequence length achievable on one A100 80 GB with FlashAttention.

70%

A100 FLOP Utilization

FlashAttention-2 efficiency — extremely high for a memory-bound operation.

Why FlashAttention Beat Sparse Attention in Practice

By 2023, the field had largely converged on FlashAttention + full context windows rather than sparse attention patterns, for most tasks. The reasons are practical. FlashAttention requires no changes to model architecture, no custom sparse CUDA kernels, no decisions about which tokens should be global, and no approximation errors. It plugs into existing training pipelines as a drop-in replacement and immediately makes training faster and cheaper.

Sparse attention patterns, by contrast, require architectural decisions, custom kernels, and introduce inductive biases that may hurt some tasks. They excel in specific domains — notably processing DNA sequences (BigBird on genomics) and tasks with clear local structure — but are harder to deploy as general-purpose solutions.

The lesson practitioners took: hardware-aware exact computation frequently outperforms clever approximations, because the approximation overhead and engineering complexity often exceed the theoretical compute savings.

Key Terms

IO-AwarenessDesigning algorithms with explicit attention to the cost of memory reads and writes, not just FLOPs. FlashAttention's defining principle.

TilingPartitioning a computation into blocks that fit in fast on-chip SRAM, avoiding materialization of large intermediate tensors in slow HBM.

HBM vs SRAMHBM: large, slow GPU memory (~2 TB/s). SRAM: small, fast on-chip memory (~19 TB/s). FlashAttention keeps attention intermediates in SRAM.

Online SoftmaxAn algorithm for computing softmax incrementally across tiles using running max/sum statistics, enabling tiled attention without materializing the full n × n matrix.

Lesson 3 Quiz

FlashAttention: Exact Attention, Faster

FlashAttention achieves its speedup primarily by:

Correct. FlashAttention tiles the attention computation to stay within SRAM, dramatically reducing HBM traffic. The number of FLOPs is unchanged; only the memory access pattern changes.

FlashAttention makes no approximation. Its speedup comes from IO-awareness: tiling computation to keep intermediate values in fast on-chip SRAM rather than slow HBM.

Compared to standard PyTorch attention on an A100, FlashAttention achieved approximately what speedup in Dao et al.'s 2022 benchmarks?

Correct. FlashAttention achieved 2–4× wall-clock speedup (specifically ~3.3× in attention computation benchmarks) by reducing HBM memory traffic, not by reducing FLOPs.

The documented speedup was 2–4× (approximately 3.3× in direct attention benchmarks). This comes entirely from memory traffic reduction, not fewer FLOPs.

The "online softmax" technique in FlashAttention solves which specific problem?

Correct. Online softmax maintains running max and sum statistics as tiles are processed, allowing numerically stable softmax computation without ever storing the full n × n attention matrix.

Online softmax solves the tiling problem: how to compute correct softmax values when processing the attention matrix in small tiles. It uses running max/sum statistics to produce exact results without the full matrix.

Lab 3 — FlashAttention Deep Dive

Work through the tiling and IO-awareness concepts with your AI tutor

Your Task

Explore FlashAttention's design in depth. Ask your tutor to explain the tiling algorithm step by step, compare HBM vs SRAM access patterns, or work through why the online softmax produces numerically identical results to the standard version. Challenge yourself to articulate why hardware-aware algorithms often beat clever approximations.

Suggested opener: "Explain step by step how FlashAttention tiles the attention computation. How does it avoid writing the n×n attention matrix to HBM while still computing the correct output?"

AI Tutor

Efficient Attention · Lab 3

Welcome to Lab 3. FlashAttention's elegance is in its memory hierarchy insight. I can walk through the tiling algorithm, explain online softmax, or compare IO costs between standard and flash attention. Where would you like to start?

Module 4 · Lesson 4

Multi-Query, Grouped-Query, and Multi-Head Latent Attention

Rethinking how many heads you really need during inference

How did reducing KV head counts cut inference memory by 8× without meaningfully hurting model quality?

In 2019, Noam Shazeer at Google Brain published a largely overlooked paper introducing Multi-Query Attention (MQA). The core idea was almost aggressively simple: while each attention layer still used multiple query heads, all query heads would share a single key and value head. Instead of 32 KV heads for a 32-head model, you needed just 1. KV-cache memory dropped by 32×.

The paper sat in relative obscurity until the inference scaling crises of 2022–2023, when serving GPT-4-class models at production scale revealed that KV-cache memory was often the binding constraint. Suddenly MQA was everywhere. Google's PaLM 2 used it. So did Falcon and early versions of Mistral. But practitioners noticed quality regressions on complex reasoning tasks, and a 2023 paper from Google DeepMind introduced a middle ground: Grouped-Query Attention (GQA).

The KV-Cache Problem at Inference

During autoregressive generation, a transformer must store all previously computed key and value tensors — the KV cache — so each new token can attend to all prior context without recomputation. For a standard 70B parameter model with 64 heads and 8K context, the KV cache alone requires approximately 128 GB of memory. This is larger than the model weights themselves at int8 precision.

Multi-head attention (MHA) is the original design: each layer has H query heads, H key heads, and H value heads. Multi-Query Attention reduces this to H query heads but just 1 K head and 1 V head shared across all queries. This reduces KV-cache size by a factor of H (often 32 or 64), but the single shared KV representation is a strong constraint on what different query heads can learn.

Grouped-Query Attention: The Practical Compromise

Grouped-Query Attention (Ainslie et al., 2023) divides the H query heads into G groups, each group sharing one KV head. With H = 32 and G = 8, you have 8 KV heads instead of 32 (MHA) or 1 (MQA). KV-cache memory shrinks by 4× while representational capacity is far richer than MQA's single shared head.

The 2023 DeepMind paper showed that GQA with G = 8 matched MHA quality on most benchmarks while achieving inference throughput within 10% of MQA. Llama 2 (Meta, 2023) used GQA for its 34B and 70B parameter variants. Mistral 7B used 8 KV heads for 32 query heads. Gemma (Google, 2024) used GQA throughout its model family. By 2024 it had become the de facto standard for large model inference efficiency.

Method	KV Heads	KV Cache Size	Used By
Multi-Head Attention (MHA)	H (e.g. 32)	Baseline	GPT-3, original BERT, LLaMA-1
Multi-Query Attention (MQA)	1	~1/32× baseline	PaLM 2, early Falcon
Grouped-Query Attention (GQA)	G (e.g. 8)	~1/4× baseline	LLaMA 2 70B, Mistral, Gemma
Multi-Head Latent Attention (MLA)	1 compressed latent	~1/16× baseline	DeepSeek-V2, DeepSeek-V3 (2024)

Multi-Head Latent Attention: DeepSeek's Approach

DeepSeek-V2 (May 2024) introduced Multi-Head Latent Attention (MLA), a more aggressive compression scheme. Rather than grouping full-dimensional KV heads, MLA projects keys and values down to a single low-dimensional latent vector per token, then projects it back up for each head at attention time. The KV cache stores only the low-dimensional latent — roughly 512 dimensions instead of 128 heads × 128 head-dim = 16,384 dimensions.

DeepSeek reported that MLA reduced KV-cache memory by ~93% compared to standard MHA on their 236B-parameter MoE model, while maintaining quality competitive with MHA on all major benchmarks. The compression ratio made it economically viable to serve very long contexts: DeepSeek-V2 supports 128K token contexts with a KV cache smaller than many 4K-context MHA models.

Production Impact — Mistral 7B (2023)

Mistral 7B used 8 KV heads for 32 query heads (GQA ratio 4:1), reducing KV-cache memory by 4× compared to LLaMA-1 7B's full MHA. Combined with sliding window attention in alternate layers, Mistral 7B could serve 4× more concurrent users per GPU than an equivalent-quality MHA model — a direct revenue impact for cloud inference providers.

The Design Space in 2024–2025

By 2025, efficient attention at inference has settled into a clear hierarchy. For models under ~20B parameters, GQA with moderate group counts (G = 4 to 8) is standard. For very large models (100B+) or very long contexts (32K+), MLA or aggressive MQA provides necessary memory relief. FlashAttention underpins all of them at the CUDA kernel level, providing IO-efficient execution of whatever attention pattern the architecture uses.

The remaining frontier is attention with retrieval — rather than expanding the context window indefinitely, using embedding-based retrieval to inject relevant chunks into a shorter context. But that is the subject of Module 5. What Module 4 establishes is the foundation: the techniques that made it possible to make attention computationally tractable at the scale where useful long-context reasoning becomes possible.

Key Terms

Multi-Query Attention (MQA)All query heads share a single KV head. Reduces KV-cache by factor H but may hurt quality on complex tasks. Shazeer, 2019.

Grouped-Query Attention (GQA)Query heads divided into G groups, each sharing one KV head. Balances MHA quality and MQA efficiency. Ainslie et al., 2023.

KV CacheStored key and value tensors for all prior tokens during autoregressive generation. The primary memory bottleneck in long-context inference.

Multi-Head Latent Attention (MLA)Compresses KV to a single low-dimensional latent per token, projecting up at attention time. DeepSeek-V2, 2024. Reduces KV cache ~93% vs MHA.

Lesson 4 Quiz

MQA, GQA, and Multi-Head Latent Attention

In Multi-Query Attention, what is shared across all query heads?

Correct. MQA retains H distinct query heads but collapses all key and value heads into a single shared pair, reducing KV-cache memory by a factor of H.

In MQA, each attention layer has H query heads but only 1 K head and 1 V head shared across all queries. This single shared KV pair is what enables the dramatic cache reduction.

Which production model family was first to widely popularize Grouped-Query Attention (GQA) in 2023?

Correct. Meta's LLaMA 2 used GQA for its 34B and 70B variants in 2023, making GQA broadly visible and establishing it as a practical standard for large model inference.

LLaMA 2 (Meta, 2023) was the widely-cited adoption that established GQA as a standard. Its 34B and 70B models used GQA, enabling 4× KV-cache reduction versus the MHA used in LLaMA 1.

DeepSeek-V2's Multi-Head Latent Attention (MLA) achieves its KV-cache reduction primarily through:

Correct. MLA stores only a compressed low-dimensional latent (~512 dims) per token instead of full KV tensors for each head, reducing KV cache by ~93% while preserving full multi-head expressivity at attention time.

MLA's approach is low-dimensional latent compression — projecting KV down to ~512 dimensions per token rather than storing full KV for each of many heads. The latent is projected back up at attention time.

Lab 4 — KV-Cache Architecture Choices

Reason through MQA, GQA, and MLA tradeoffs with your AI tutor

Your Task

Work through the practical tradeoffs between MHA, MQA, GQA, and MLA. Ask your tutor to help you calculate KV-cache sizes for specific model configurations, compare quality-efficiency tradeoffs, or reason about what G value to choose for a given serving constraint. Consider why MLA is more complex to implement than GQA.

Suggested opener: "For a 70B parameter model with 64 attention heads, 128 head dimensions, and an 8K context window, calculate the KV-cache size in GB for MHA, GQA with 8 groups, and MQA. Assume float16."

AI Tutor

Efficient Attention · Lab 4

Welcome to Lab 4. I can help you work through KV-cache calculations, compare MHA vs GQA vs MQA vs MLA for specific model configurations, or reason through the quality-efficiency tradeoffs. What would you like to explore?

Module 4 Test

Efficient Attention Methods — 15 questions · 80% to pass

1. Standard self-attention with sequence length n has what computational complexity in both FLOPs and memory?

Correct. The n × n attention matrix drives O(n²) cost in both compute and memory.

Standard attention requires an n × n matrix — both FLOPs and memory are O(n²).

2. If you double the context length of a vanilla attention model, the memory required for the attention matrix:

Correct. Because memory scales as n², doubling n increases memory by 2² = 4×.

Quadratic scaling means doubling n → 4× memory. This is the core of the context window scaling problem.

3. Longformer was published by researchers at:

Correct. Iz Beltagy and colleagues at the Allen Institute for AI published Longformer in April 2020.

Longformer came from Iz Beltagy et al. at the Allen Institute for AI, published in April 2020.

4. BigBird's random attention edges — where tokens attend to a random subset of others — serve what theoretical purpose?

Correct. From expander graph theory, random edges ensure that with high probability any two tokens are connected within O(log n) hops, allowing global information flow.

The random edges provide long-range connectivity. Via expander graph theory, they ensure any token can reach any other within O(log n) hops with high probability.

5. Which of the following is NOT a component of Longformer's attention pattern?

Correct. Random attention is BigBird's innovation. Longformer uses local window, dilated window, and global token patterns.

Longformer uses local window, dilated window, and global tokens. Random attention was BigBird's addition, not Longformer's.

6. Linformer reduces attention cost by:

Correct. Linformer (Wang et al., 2020) projects K and V to dimension k (e.g. 256), making attention O(n · k) rather than O(n²).

Linformer's core idea is low-rank projection — K and V are compressed to fixed dimension k, making cost O(n · k) regardless of how large n gets.

7. FlashAttention's primary performance advantage over standard attention is that it reduces:

Correct. FlashAttention performs the same FLOPs but avoids materializing the n × n attention matrix in HBM, reducing memory traffic from O(n²) to O(n) per layer.

FlashAttention's key innovation is IO-awareness: it performs identical FLOPs but drastically reduces expensive HBM read/write operations by keeping intermediate values in fast SRAM.

8. Which NVIDIA GPU architecture did FlashAttention-3 specifically target to achieve up to 1.5× additional speedup over FlashAttention-2?

Correct. FlashAttention-3 (Shah et al., 2024) exploited H100-specific asynchronous memory pipelines to overlap data movement with computation.

FlashAttention-3 targeted the H100, using its asynchronous memory pipeline features unavailable on the A100.

9. In a GPU's memory hierarchy, approximately how much faster is SRAM bandwidth compared to HBM on an A100?

Correct. A100 HBM offers ~2 TB/s; SRAM (on-chip) offers ~19 TB/s — roughly a 9× bandwidth advantage, which is why keeping computation in SRAM is so valuable.

SRAM is ~19 TB/s vs HBM's ~2 TB/s on an A100 — approximately 9× faster. This bandwidth gap is why FlashAttention's tiling strategy produces such large speedups.

10. Multi-Query Attention (MQA) was originally published by which researcher in what year?

Correct. Noam Shazeer published MQA in 2019 at Google Brain; it received limited attention until KV-cache costs became a dominant serving concern in 2022–2023.

MQA was Noam Shazeer's 2019 paper at Google Brain. It was largely overlooked until the KV-cache crisis of 2022–2023 made it urgently relevant.

11. Grouped-Query Attention with H=32 query heads and G=8 groups reduces KV-cache size by approximately what factor versus standard MHA?

Correct. With H=32 query heads and G=8 KV groups, KV heads reduce from 32 to 8 — a 4× reduction in KV-cache memory.

32 KV heads (MHA) → 8 KV heads (GQA with G=8) = 4× reduction. MQA would give 32×; GQA gives a tunable intermediate reduction.

12. Mistral 7B's use of GQA with 8 KV heads for 32 query heads, combined with sliding window attention, primarily enabled:

Correct. The KV-cache memory reduction from GQA enabled 4× more concurrent inference requests per GPU — a direct commercial efficiency gain.

The primary practical benefit was serving efficiency: smaller KV cache per user means more concurrent users per GPU, roughly 4× for Mistral 7B's configuration.

13. DeepSeek-V2's Multi-Head Latent Attention (MLA) reported approximately what KV-cache reduction versus standard MHA?

Correct. DeepSeek reported ~93% KV-cache reduction with MLA on their 236B-parameter model, enabling 128K context windows with practical memory budgets.

DeepSeek-V2 reported ~93% KV-cache reduction with MLA. The per-token latent is ~512 dimensions vs the full KV representation of 128 heads × 128 dims = 16,384 dimensions.

14. The "Efficient Transformers: A Survey" (Tay et al., 2020) is notable in the context of efficient attention because it:

Correct. Tay et al.'s survey documented the explosion of efficient attention research, cataloguing 12 distinct methods in a single year — illustrating how urgently the field needed solutions.

The Tay et al. survey's significance was cataloguing the field: 12 distinct efficient attention approaches published in roughly one year, all attacking the same O(n²) bottleneck.

15. By 2023–2024, which efficient attention approach had become the de facto standard for most large language model training and inference pipelines?

Correct. FlashAttention became the standard because it requires no architectural changes, makes no approximation, and provides immediate speedups as a drop-in kernel replacement in PyTorch and JAX.

FlashAttention won in practice because it needed no architectural changes, introduced no approximation errors, and delivered immediate speedups. Anthropic, OpenAI, Mistral, and Meta all adopted it.