When OpenAI released GPT-3 in May 2020 with a 2,048-token context window, that limit was not an arbitrary design choice — it was an engineering necessity. Standard self-attention requires every token to attend to every other token, so doubling the sequence length quadruples the compute and memory. At 2 K tokens, a single forward pass was already brushing the limits of an A100 GPU's 40 GB of HBM.
The researchers who wanted longer contexts were not facing a puzzle — they were facing a physics problem. Memory bandwidth and FLOP budgets are finite. The question became: can attention be redesigned so that the cost grows more slowly, ideally linearly, without sacrificing the quality that made transformers great in the first place?
Standard scaled dot-product attention computes a similarity score between every query–key pair in a sequence of length n. That produces an n × n attention matrix. Storing that matrix in float16 for a single attention head, a single layer, at n = 100,000 tokens requires roughly 20 GB — more than a full A100 GPU just for one layer's attention scores.
The FLOPs scale the same way. Each of the n queries must be multiplied against each of the n keys (a dot product of dimension d), then softmaxed, then used to aggregate values. The total cost is O(n² · d) per head per layer. At n = 4 K, this is tolerable. At n = 100 K, it is not.
This is the context window race's central constraint. Every technique in Module 4 is a different answer to the same question: how do we approximate or restructure attention to escape the quadratic wall?
By 2021 the literature had converged on two broad strategies. The first is sparse attention: instead of computing all n² pairs, only compute a structured subset — local windows, dilated patterns, global tokens, or some combination. Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) are canonical examples, each replacing the full matrix with a pattern that costs O(n) or O(n · √n).
The second is low-rank approximation: instead of computing the exact n × n matrix, approximate it as the product of two smaller matrices of rank k, reducing cost to O(n · k). Linformer (Wang et al., 2020) demonstrated this approach, projecting keys and values to a fixed dimension of k = 256 regardless of sequence length.
A third, hybrid approach — hardware-aware exact attention — arrived with FlashAttention (Dao et al., 2022), which keeps the full O(n²) computation but reorganizes memory access to fit within SRAM, dramatically reducing wall-clock time without any approximation. This lesson focuses on the first two families; FlashAttention is covered in Lesson 3.
Efficient Transformers: A Survey (Tay et al., 2020, Google Brain) catalogued 12 distinct efficient attention methods published within a single year, all attempting to escape the same quadratic bottleneck. The survey became one of the most-cited transformer papers of the decade.
In this lab you'll work through the concrete numbers behind the O(n²) scaling problem. Ask your AI tutor to walk you through calculations: how much memory does a single attention layer consume at different sequence lengths? Where does the quadratic wall become practically impassable? What happens to FLOP counts as context grows?
Iz Beltagy and colleagues at the Allen Institute for AI published Longformer in April 2020 with a deceptively simple observation: for most NLP tasks, a token does not actually need direct access to every other token in the sequence. Language has local structure. A word depends heavily on its neighbors and only occasionally on something thousands of positions away.
Their solution was a combination of a sliding-window local attention — each token attending to its w nearest neighbors on each side — plus a small set of global tokens that could attend to the entire sequence. The result was O(n · w) complexity, where w is the window size, constant and small. On the SCROLLS benchmark, Longformer outperformed GPT-2 on summarization tasks with sequences up to 16,000 tokens, something impossible for full attention within the same memory budget.
Longformer (Beltagy et al., 2020) combines three attention patterns in a single layer. Sliding window attention gives every token access to w tokens on each side, capturing local syntax and semantics. Dilated sliding window attention extends reach by skipping positions at a fixed dilation rate d, enabling coverage of 2wd positions with the same compute budget. Global attention designates special tokens — CLS tokens, question tokens in QA tasks — that attend to the entire sequence and receive attention from every other token.
The practical result: on a single 32 GB GPU, Longformer could process sequences of 4,096 tokens in the same time and memory that full attention required for just 512 tokens. On the CORD-19 scientific literature corpus, it achieved state-of-the-art results on document-level tasks that were simply out of reach for BERT-style models.
Longformer-base (149M parameters) achieved 91.4 F1 on HotpotQA, a multi-document question answering benchmark requiring reasoning across long contexts, compared to RoBERTa-large's 89.4 F1 — using less memory because of the sparse attention pattern.
Google Brain's BigBird (Zaheer et al., 2020) took a graph-theoretic approach. The authors proved, using results from expander graph theory, that any sparse attention pattern achieving three properties is theoretically equivalent to full attention as a universal approximator: local edges (sliding window), global tokens, and random edges. The random edges — each token attending to a small random subset of other tokens — allow information to propagate across the sequence with high probability in O(log n) hops.
BigBird extended BERT to 4,096-token sequences and set new state-of-the-art results on Genomics tasks, where DNA and protein sequences routinely exceed tens of thousands of characters — a domain where standard transformers were essentially unusable.
| Method | Pattern | Complexity | Max Context Demonstrated |
|---|---|---|---|
| Full Attention | All-to-all | O(n²) | ~2–4K (2020 hardware) |
| Longformer | Local window + global | O(n · w) | 16K tokens |
| BigBird | Local + global + random | O(n) | 4K–16K tokens |
| Sparse Transformer (Child et al.) | Strided + local | O(n · √n) | 12K image tokens (OpenAI, 2019) |
Sparse attention is not a free lunch. Implementing irregular sparse patterns efficiently on GPUs — which are optimized for dense, contiguous memory access — is genuinely hard. Longformer required custom CUDA kernels. Without those kernels, a naive sparse implementation can actually be slower than dense attention because of scattered memory reads.
There is also a representational cost. Tasks requiring a token to integrate information from two distant, unrelated positions — rare in NLP but common in certain document retrieval and code analysis tasks — suffer. The global tokens partially mitigate this, but they introduce a new design choice: which tokens should be global? In question answering, the question tokens are obvious candidates. In summarization, the choice is less clear.
By 2022, many practitioners had moved away from hand-crafted sparse patterns in favor of FlashAttention (full attention with hardware-aware implementation) or retrieval-augmented architectures. But sparse attention's conceptual legacy endures: not all attention is equally necessary, a principle that resurfaced in modern KV-cache eviction strategies.
Explore the design space of sparse attention patterns. Ask your tutor to compare Longformer vs BigBird for specific tasks, reason through window size choices, or explain why global tokens work for question answering but create challenges for other tasks. Consider what tasks sparse attention handles poorly.
In June 2022, Tri Dao and colleagues at Stanford published a paper that reframed the entire efficient-attention conversation. The title was simply FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. The key word was exact. Unlike Longformer or Linformer, FlashAttention made no approximation. It computed the same attention as vanilla scaled dot-product attention — just reorganized the computation so that the GPU's small, fast SRAM was used instead of slow HBM for intermediate results.
The results were startling. FlashAttention was 2–4× faster than standard PyTorch attention on an A100, used 10–20× less memory for the attention computation, and enabled training on sequences of 64K tokens on a single GPU. GPT-4-length contexts, previously requiring model parallelism across multiple nodes, became achievable on a single accelerator.
Modern GPUs have a memory hierarchy. HBM (High Bandwidth Memory) is large — 40–80 GB on an A100 — but relatively slow: ~2 TB/s bandwidth. SRAM (on-chip memory, equivalent to L2 cache) is tiny — about 20 MB total — but very fast: ~19 TB/s bandwidth. Standard attention implementations constantly read and write the n × n attention matrix to HBM: compute QK^T, write it to HBM, read it back for softmax, write again, read again for the value multiplication. At n = 8K, this is hundreds of gigabytes of HBM traffic.
Dao et al.'s insight: if the computation can be reorganized into tiles small enough to fit in SRAM, the n × n matrix never needs to be materialized in HBM at all. The trick is a numerically stable tiling of the softmax — computing running max and sum statistics to allow the softmax to be computed incrementally across tiles. The result is that attention requires O(n) HBM reads and writes instead of O(n²).
On an A100 80 GB GPU, FlashAttention achieved 124 TFLOPs/s on attention computation versus 38 TFLOPs/s for standard PyTorch attention — a 3.3× improvement — while using the same number of FLOPs. The speedup comes entirely from reducing HBM memory traffic.
FlashAttention-2 (Dao, 2023) refined the approach with better parallelization across attention heads and sequence positions, reducing non-matmul FLOPs and improving occupancy on modern GPUs. It reached ~70% of peak A100 FLOP utilization on attention — extraordinarily high for a memory-bound operation. Anthropic, OpenAI, and Mistral all adopted FlashAttention-2 as a standard component of their training stacks.
FlashAttention-3 (Shah et al., 2024) targeted NVIDIA's H100 specifically, exploiting asynchronous memory pipelines to overlap data movement with computation. On H100s, it achieved up to 1.5× additional speedup over FlashAttention-2, pushing toward 75% utilization of the H100's 989 TFLOP/s theoretical peak.
Crucially, FlashAttention's impact extends beyond training. For inference, reducing KV-cache memory traffic is equally important. The same tiling strategy applies to decoding, enabling longer context inference without proportionally larger memory bandwidth requirements.
By 2023, the field had largely converged on FlashAttention + full context windows rather than sparse attention patterns, for most tasks. The reasons are practical. FlashAttention requires no changes to model architecture, no custom sparse CUDA kernels, no decisions about which tokens should be global, and no approximation errors. It plugs into existing training pipelines as a drop-in replacement and immediately makes training faster and cheaper.
Sparse attention patterns, by contrast, require architectural decisions, custom kernels, and introduce inductive biases that may hurt some tasks. They excel in specific domains — notably processing DNA sequences (BigBird on genomics) and tasks with clear local structure — but are harder to deploy as general-purpose solutions.
The lesson practitioners took: hardware-aware exact computation frequently outperforms clever approximations, because the approximation overhead and engineering complexity often exceed the theoretical compute savings.
Explore FlashAttention's design in depth. Ask your tutor to explain the tiling algorithm step by step, compare HBM vs SRAM access patterns, or work through why the online softmax produces numerically identical results to the standard version. Challenge yourself to articulate why hardware-aware algorithms often beat clever approximations.
In 2019, Noam Shazeer at Google Brain published a largely overlooked paper introducing Multi-Query Attention (MQA). The core idea was almost aggressively simple: while each attention layer still used multiple query heads, all query heads would share a single key and value head. Instead of 32 KV heads for a 32-head model, you needed just 1. KV-cache memory dropped by 32×.
The paper sat in relative obscurity until the inference scaling crises of 2022–2023, when serving GPT-4-class models at production scale revealed that KV-cache memory was often the binding constraint. Suddenly MQA was everywhere. Google's PaLM 2 used it. So did Falcon and early versions of Mistral. But practitioners noticed quality regressions on complex reasoning tasks, and a 2023 paper from Google DeepMind introduced a middle ground: Grouped-Query Attention (GQA).
During autoregressive generation, a transformer must store all previously computed key and value tensors — the KV cache — so each new token can attend to all prior context without recomputation. For a standard 70B parameter model with 64 heads and 8K context, the KV cache alone requires approximately 128 GB of memory. This is larger than the model weights themselves at int8 precision.
Multi-head attention (MHA) is the original design: each layer has H query heads, H key heads, and H value heads. Multi-Query Attention reduces this to H query heads but just 1 K head and 1 V head shared across all queries. This reduces KV-cache size by a factor of H (often 32 or 64), but the single shared KV representation is a strong constraint on what different query heads can learn.
Grouped-Query Attention (Ainslie et al., 2023) divides the H query heads into G groups, each group sharing one KV head. With H = 32 and G = 8, you have 8 KV heads instead of 32 (MHA) or 1 (MQA). KV-cache memory shrinks by 4× while representational capacity is far richer than MQA's single shared head.
The 2023 DeepMind paper showed that GQA with G = 8 matched MHA quality on most benchmarks while achieving inference throughput within 10% of MQA. Llama 2 (Meta, 2023) used GQA for its 34B and 70B parameter variants. Mistral 7B used 8 KV heads for 32 query heads. Gemma (Google, 2024) used GQA throughout its model family. By 2024 it had become the de facto standard for large model inference efficiency.
| Method | KV Heads | KV Cache Size | Used By |
|---|---|---|---|
| Multi-Head Attention (MHA) | H (e.g. 32) | Baseline | GPT-3, original BERT, LLaMA-1 |
| Multi-Query Attention (MQA) | 1 | ~1/32× baseline | PaLM 2, early Falcon |
| Grouped-Query Attention (GQA) | G (e.g. 8) | ~1/4× baseline | LLaMA 2 70B, Mistral, Gemma |
| Multi-Head Latent Attention (MLA) | 1 compressed latent | ~1/16× baseline | DeepSeek-V2, DeepSeek-V3 (2024) |
DeepSeek-V2 (May 2024) introduced Multi-Head Latent Attention (MLA), a more aggressive compression scheme. Rather than grouping full-dimensional KV heads, MLA projects keys and values down to a single low-dimensional latent vector per token, then projects it back up for each head at attention time. The KV cache stores only the low-dimensional latent — roughly 512 dimensions instead of 128 heads × 128 head-dim = 16,384 dimensions.
DeepSeek reported that MLA reduced KV-cache memory by ~93% compared to standard MHA on their 236B-parameter MoE model, while maintaining quality competitive with MHA on all major benchmarks. The compression ratio made it economically viable to serve very long contexts: DeepSeek-V2 supports 128K token contexts with a KV cache smaller than many 4K-context MHA models.
Mistral 7B used 8 KV heads for 32 query heads (GQA ratio 4:1), reducing KV-cache memory by 4× compared to LLaMA-1 7B's full MHA. Combined with sliding window attention in alternate layers, Mistral 7B could serve 4× more concurrent users per GPU than an equivalent-quality MHA model — a direct revenue impact for cloud inference providers.
By 2025, efficient attention at inference has settled into a clear hierarchy. For models under ~20B parameters, GQA with moderate group counts (G = 4 to 8) is standard. For very large models (100B+) or very long contexts (32K+), MLA or aggressive MQA provides necessary memory relief. FlashAttention underpins all of them at the CUDA kernel level, providing IO-efficient execution of whatever attention pattern the architecture uses.
The remaining frontier is attention with retrieval — rather than expanding the context window indefinitely, using embedding-based retrieval to inject relevant chunks into a shorter context. But that is the subject of Module 5. What Module 4 establishes is the foundation: the techniques that made it possible to make attention computationally tractable at the scale where useful long-context reasoning becomes possible.
Work through the practical tradeoffs between MHA, MQA, GQA, and MLA. Ask your tutor to help you calculate KV-cache sizes for specific model configurations, compare quality-efficiency tradeoffs, or reason about what G value to choose for a given serving constraint. Consider why MLA is more complex to implement than GQA.