L1
·
Quiz
·
Lab
L2
·
Quiz
·
Lab
L3
·
Quiz
·
Lab
L4
·
Quiz
·
Lab
Module Test
Module 3 · Lesson 1

The Quadratic Wall

Why attention doesn't scale the way you'd hope — and what the numbers actually mean.
If doubling a context window costs four times the compute, what happens at 1 million tokens?

When Google's PaLM 2 team benchmarked long-context performance in early 2023, engineers discovered something uncomfortable: inference costs for their longest prompts weren't growing linearly with length. They were growing quadratically. A 32K-token prompt cost roughly sixteen times more attention compute than an 8K prompt — not four times. The wall was real, and it was coming for everyone.

What Scaled Dot-Product Attention Actually Computes

The transformer attention mechanism, introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need," computes a weighted average over every token in the context for every query token. Formally, for a sequence of length n, each of the n queries attends to all n keys. That means the attention matrix has entries.

This is not a theoretical concern. It is a direct consequence of the arithmetic: if your context doubles, your attention matrix quadruples. If your context grows ten-fold, attention compute grows one-hundred-fold. Every layer of a transformer performs this operation independently.

Attention(Q, K, V) = softmax(QKᵀ / √d) · V
The QKᵀ matrix multiplication produces an n×n matrix — quadratic in sequence length n. This is the heart of the scaling problem.

The Memory Dimension

Compute isn't the only casualty. Storing the attention matrix requires O(n²) memory in standard implementations. For a 128K-token context with 32-bit floats across a model with 96 attention heads, the raw attention matrices alone can exceed hundreds of gigabytes — before accounting for KV caches, activations, or model weights.

This is why early long-context models like Anthropic's Claude 1 (released March 2023, with a 100K window) required specialized infrastructure rather than simply spinning up a larger instance of existing hardware. The memory footprint was architecturally novel.

O(n²)
Attention Complexity
Time and memory for standard full attention
Cost Multiplier
When sequence length doubles
100×
Cost Multiplier
When sequence length grows 10×

Why This Constrained Context Windows for Years

The original GPT-3 (June 2020) supported 2,048 tokens. GPT-3.5 (November 2022) expanded to 4,096. Each doubling was expensive not just in engineering effort but in real inference cost per query. At 2,048 tokens, quadratic attention was manageable. At 32,768 tokens, it required fundamental rethinking of memory management, batching strategies, and hardware allocation.

The quadratic wall explains why the race to longer context windows was not simply a matter of training on more data or buying more GPUs. It required either accepting enormous per-query costs (feasible for some enterprise use cases), or developing new attention mechanisms that break the quadratic scaling — which became the defining engineering challenge of 2023 and 2024.

Key Insight

The O(n²) attention complexity is not a bug or a temporary implementation detail. It is a fundamental property of how transformers compute relationships between tokens. Every technique discussed in this module — sparse attention, linear attention, sliding windows — is an attempt to approximate the full attention matrix while breaking the quadratic scaling.

O(n²) Big-O notation for quadratic complexity: doubling input size quadruples cost. For transformers, n is sequence length.
Attention Matrix The n×n matrix produced by QKᵀ, where each entry represents how much attention one token pays to another.
KV Cache Stored key and value vectors from prior tokens, enabling efficient autoregressive generation but growing linearly with context length and consuming significant GPU memory.

Quiz — The Quadratic Wall

Three questions · select the best answer
1. If a model processes a 16K-token context at cost X using full attention, what is the approximate attention cost for a 64K-token context?
Correct. Quadratic scaling means cost grows as the square of the length ratio. Going from 16K to 64K is a 4× length increase, so cost increases 4² = 16×.
Not quite. With O(n²) complexity, cost scales with the square of the length ratio. 64K ÷ 16K = 4, so cost multiplier is 4² = 16×.
2. What does the expression QKᵀ produce in the attention mechanism?
Correct. QKᵀ multiplies the n×d query matrix by the d×n key matrix transpose, producing an n×n matrix — every token against every other token.
Not quite. QKᵀ produces an n×n matrix because Q is n×d and Kᵀ is d×n. This is the root source of quadratic memory and compute.
3. Why did GPT-3 launch with only a 2,048-token context window in 2020?
Correct. The O(n²) memory and compute wall made longer contexts extremely expensive in 2020. Hardware efficiency and algorithmic innovations were needed before windows could expand substantially.
Not quite. The constraint was economic and technical: quadratic attention costs made longer windows prohibitively expensive given 2020 hardware capabilities and inference costs.

Lab 1 — Quadratic Arithmetic

Explore attention complexity with a live AI tutor · 3 exchanges to complete

Your Mission

Work through the math of quadratic attention scaling with the AI tutor. Ask it to calculate real cost multipliers for different context length jumps, explore why each transformer layer multiplies the problem, or challenge it on the KV cache memory implications for specific model sizes.

Suggested start: "If GPT-4 Turbo processes 128K tokens, how much more attention compute does that require compared to 8K, and what does that mean for inference pricing?"
AI Tutor — Attention Complexity
Lab 1
Welcome to Lab 1. I'm here to help you work through the arithmetic of quadratic attention complexity — the O(n²) scaling wall that shaped the entire context window race. Ask me to calculate cost multipliers, explain why each attention layer compounds the problem, or dig into KV cache memory requirements for any model configuration you're curious about.
Module 3 · Lesson 2

Sparse and Sliding-Window Attention

How researchers broke the quadratic barrier without discarding the transformer architecture entirely.
Can a model understand a 1-million-token document if it never reads all of it at once?

In December 2020, the Allen Institute for AI released Longformer, a paper demonstrating that sparse attention patterns — specifically sliding-window local attention combined with a handful of global tokens — could process documents up to 4,096 tokens with complexity that scaled linearly rather than quadratically. The key insight: most tokens don't need to attend to every other token. Locality matters.

The Core Idea: Not Every Token Needs Every Token

Full attention assumes that every token in a sequence is potentially relevant to every other token. This is theoretically true but practically wasteful. In a 100,000-word novel, the word "the" on page 1 rarely needs a direct attention connection to the word "the" on page 247. Nearby context is almost always more relevant than distant context.

Sparse attention exploits this observation by restricting which token pairs compute attention scores. Instead of an n×n matrix, sparse methods compute only a fraction of entries — reducing complexity toward O(n log n) or even O(n) depending on the pattern.

Three Canonical Sparse Attention Patterns

Research in 2020–2022 converged on several practical approaches, each with distinct trade-offs:

Pattern Complexity Key Property Real Example
Sliding Window O(n · w) Each token attends only to w neighbors Longformer (Beltagy et al., 2020)
Strided / Dilated O(n · w) Attends to every k-th token across full range BigBird (Zaheer et al., 2020)
Global + Local O(n · g + n · w) Designated global tokens attend to all; others attend locally Longformer, BigBird CLS token
Axial O(n^1.5) Decomposes 2D attention into row+column passes Image Transformer (Parmar et al., 2018)

Mistral's Sliding Window in Production

Mistral AI's Mistral 7B model (released September 2023) used grouped-query attention paired with a sliding window of 4,096 tokens, enabling a theoretical attention span of 128K tokens through a technique called "rolling buffer cache." Each token attends only to its local window, but information propagates through layers — similar to how humans read: we don't remember every word, but meaning accumulates across paragraphs.

The practical implication: Mistral 7B could handle far longer contexts than same-sized models using full attention, at a fraction of the memory cost. This made it one of the most commercially deployed open-weight models of 2023 and demonstrated that sparse attention wasn't just a research curiosity — it was production-viable.

The Trade-off

Sparse attention improves efficiency but may miss long-range dependencies — information that is genuinely far away in the token sequence but semantically critical. A contract's definitions clause on page 1 may directly control interpretation of a clause on page 50. Pure sliding-window attention may fail to connect them. This is why most production systems combine sparse local attention with some form of global or retrieval-augmented connectivity.

Flash Attention: Efficient Full Attention Through IO Optimization

Separately from sparse attention, Stanford's Tri Dao published FlashAttention in June 2022 — not a new attention pattern, but a dramatically more efficient implementation of full attention. By restructuring computation to minimize data movement between GPU high-bandwidth memory (HBM) and SRAM, FlashAttention reduced memory reads/writes by an order of magnitude while computing the same result as standard attention.

FlashAttention 2 (August 2023) and FlashAttention 3 (2024) extended these gains further. GPT-4, Claude 2, and most production-scale models adopted FlashAttention-style optimizations, enabling longer contexts not by approximating attention but by making exact attention far more hardware-efficient. This distinguished FlashAttention from sparse methods: it preserved the full O(n²) computation but reduced the wall-clock time and memory needed to perform it.

Sliding Window Attention pattern where each token attends only to a fixed-size neighborhood of w tokens, yielding O(n·w) complexity.
FlashAttention An IO-aware implementation of exact full attention by Tri Dao et al. (2022) that reduces memory bandwidth requirements through tiled computation, enabling longer contexts without approximation.
Global Token In hybrid sparse attention, a small set of designated tokens (e.g., [CLS]) that attend to and from all other tokens, propagating long-range information despite local attention elsewhere.

Quiz — Sparse & Sliding-Window Attention

Three questions · select the best answer
1. What is the complexity of sliding-window attention with window size w and sequence length n?
Correct. Each of the n tokens attends to w neighbors, giving O(n·w). Since w is a fixed constant (e.g., 4096), this is effectively linear in n.
Not quite. Sliding-window attention has each token attend to w neighbors. That's n tokens × w computations each = O(n·w), which is linear in n when w is fixed.
2. FlashAttention (Tri Dao, 2022) improves attention efficiency primarily by:
Correct. FlashAttention is an IO-aware algorithm that tiles computation to stay within fast SRAM, drastically reducing the expensive HBM memory traffic — without approximating the result.
Not quite. FlashAttention does not approximate attention. It computes the exact same result as standard attention but restructures data access patterns to minimize slow GPU memory traffic.
3. Longformer's "global token" mechanism solves which fundamental limitation of pure sliding-window attention?
Correct. Pure local attention can miss information that is far away in the sequence but semantically critical. Global tokens attend to and from all positions, acting as information highways across the full context.
Not quite. The global token mechanism addresses long-range dependencies. Without it, a [CLS] token in local-only attention would only see its immediate neighbors — useless for sentence classification.

Lab 2 — Sparse Patterns in Practice

Explore trade-offs between sparse attention methods · 3 exchanges to complete

Your Mission

Engage the AI tutor on the practical trade-offs of sparse attention patterns. Ask about specific failure cases for sliding-window attention, how FlashAttention and sparse attention differ as solutions, or why Mistral 7B's rolling buffer approach was an engineering breakthrough for production deployment.

Suggested start: "What kind of document or task would cause sliding-window attention to fail where full attention would succeed? Give me a concrete example."
AI Tutor — Sparse Attention Patterns
Lab 2
Ready to explore sparse attention patterns. I can walk you through concrete failure cases for sliding-window attention, explain how Longformer's global tokens work in practice, compare FlashAttention's IO optimization to sparse approximation approaches, or analyze why certain production models chose specific sparse patterns. What aspect would you like to dig into?
Module 3 · Lesson 3

Linear Attention and State Space Models

The quest to achieve O(n) scaling — and the architectural trade-offs that come with it.
What if you could process an arbitrarily long sequence in fixed memory — and what would you have to give up?

In December 2023, Albert Gu and Tri Dao released Mamba, a state space model (SSM) architecture that processed sequences in O(n) time and O(1) memory per token during inference. Unlike transformers, Mamba maintained no growing KV cache. Instead, it compressed the entire prior context into a fixed-size hidden state — updated recurrently. Benchmarks showed it matching transformer quality on language tasks up to sequence lengths of 1 million tokens, while running five times faster than a transformer of equivalent parameter count at those lengths.

Linear Attention: Rewriting the Kernel

Standard attention uses a softmax over the full n×n matrix, which cannot be decomposed. Linear attention (Katharopoulos et al., 2020) replaces softmax with a kernel function φ, rewriting:

Attention(Q,K,V) ≈ φ(Q) · (φ(K)ᵀ · V)
By computing (φ(K)ᵀ · V) first — an associativity trick — the n×n matrix is never materialized. Complexity drops from O(n²d) to O(nd²). When d ≪ n, this is effectively linear.

Linear attention was an important theoretical breakthrough, but in practice it underperformed softmax attention significantly on language tasks. The softmax normalization captures something important about selective attention — the ability to focus sharply on a small number of highly relevant tokens — that smooth kernel functions approximate poorly.

State Space Models: A Different Computational Paradigm

State Space Models approach the sequence modeling problem from a completely different direction, rooted in control theory and signal processing. An SSM maintains a hidden state h that summarizes all prior inputs and updates it recurrently:

hₜ = A·hₜ₋₁ + B·xₜ     yₜ = C·hₜ + D·xₜ
A, B, C, D are learned matrices. The hidden state hₜ summarizes the entire prior sequence in fixed memory — O(1) per step regardless of sequence length.

S4 (Gu et al., 2021) was the first SSM to match transformer performance on long-range dependency benchmarks. It used carefully structured A matrices (diagonal plus low-rank) to enable stable training and efficient parallel training despite the recurrent structure.

Mamba (2023) extended S4 with selective state spaces — input-dependent A, B, C matrices, letting the model decide how much of the hidden state to retain or overwrite based on current input. This selectivity was the key innovation that brought SSMs to transformer parity on language tasks.

Architecture Training Inference Memory (Inference)
Full Attention O(n²) O(n²) O(n) KV cache grows with context
Sparse Attention O(n·w) O(n·w) O(w) per layer
Linear Attention O(n·d²) O(n·d²) O(d²) fixed state
SSM (Mamba) O(n·d) parallel O(n) recurrent O(d) fixed hidden state

The Retrieval Trade-off

SSMs' fixed-memory inference comes with a cost: they cannot perfectly retrieve arbitrary facts from a long prior context the way attention can. Attention has random access — it can compute exactly which earlier token is most relevant to the current query. An SSM's compressed hidden state is a lossy summary; information that wasn't encoded strongly when it passed through may not be recoverable later.

This is not a theoretical objection — it appears in benchmarks. On "needle-in-a-haystack" retrieval tasks (finding a specific fact buried in a long document), pure SSMs underperform transformers. This observation drove research into hybrid architectures in 2024: models like Jamba (AI21 Labs, March 2024) interleaved attention layers with SSM layers, attempting to capture SSM efficiency for most processing while preserving attention's precise retrieval capability at key layers.

2024 Industry Direction

Neither pure SSMs nor pure transformers "won" the efficiency debate. The emerging consensus in 2024 is hybrid: use SSM layers for efficient long-range context compression, interleaved with sparse or full attention layers for precise retrieval when needed. This architectural pattern appears in Jamba, Zamba, and research prototypes from Google DeepMind and Microsoft Research.

Mamba A selective state space model (Gu & Dao, 2023) with input-dependent state transitions, achieving O(n) inference complexity and fixed memory while matching transformer quality on language tasks.
Selective SSM SSM variant where A, B, C matrices are functions of the input, giving the model learned control over what to retain in its compressed hidden state.
Hybrid Architecture Models (e.g., Jamba) that interleave SSM and attention layers to combine SSM efficiency with attention's precise retrieval capability.

Quiz — Linear Attention & SSMs

Three questions · select the best answer
1. What is the key inference advantage of Mamba (SSM) over a transformer with full attention?
Correct. Unlike transformers whose KV cache grows linearly with context, Mamba's hidden state h is a fixed-size vector updated recurrently — the same memory regardless of sequence length.
Not quite. Mamba's advantage is its fixed-size hidden state: no growing KV cache, no quadratic attention. Processing token 1,000,000 requires the same memory as token 1.
2. Linear attention's core trick for achieving sub-quadratic complexity is:
Correct. By computing the d×d matrix (φ(K)ᵀ·V) first — which doesn't depend on n queries — and then multiplying each query by that result, the n×n matrix is never formed.
Not quite. Linear attention uses the associativity of matrix multiplication: instead of (φ(Q)·φ(K)ᵀ)·V (which forms n×n), it computes φ(Q)·(φ(K)ᵀ·V) — a d×d intermediate that never depends on n².
3. On "needle-in-a-haystack" benchmarks, pure SSMs like Mamba tend to underperform transformers because:
Correct. Transformers can directly attend to any prior token (random access). SSMs compress prior context into a fixed state — precise retrieval of arbitrary buried facts is fundamentally harder.
Not quite. The issue is retrieval precision: attention gives random access to any prior token. SSMs' hidden state is a learned compression where specific details may be lost if they weren't strongly weighted during compression.

Lab 3 — SSMs vs. Transformers

Probe the limits of state space models with a live AI tutor · 3 exchanges to complete

Your Mission

Stress-test your understanding of SSM trade-offs. Ask the tutor to compare Mamba and a transformer on specific tasks, explore why hybrid architectures like Jamba emerged in 2024, or dig into how selective state spaces differ from standard S4 in terms of what the model can "choose to remember."

Suggested start: "Design a task where Mamba would clearly outperform a transformer — and then design a task where the transformer would win. Explain the architectural reason in each case."
AI Tutor — State Space Models
Lab 3
Let's explore the SSM vs. transformer trade-space. I can help you design benchmark tasks that expose each architecture's strengths and weaknesses, explain Mamba's selective state space mechanism, analyze why Jamba (AI21 Labs, 2024) interleaves attention and SSM layers, or walk through the mathematical reason SSMs struggle with precise retrieval. What would you like to investigate?
Module 3 · Lesson 4

The KV Cache Economy

Why memory bandwidth — not raw compute — became the binding constraint in long-context inference.
When a 1M-token context costs more to store than to compute, what does deployment economics look like?

By mid-2024, Anthropic's Claude 3 models supported a 200K-token context window in production. Inside the inference cluster, the binding constraint was no longer FLOP throughput — it was KV cache memory bandwidth. Loading the cached keys and values for a 200K-token prompt from GPU HBM took more time than the actual attention computation on that data. The bottleneck had shifted from arithmetic to memory.

What the KV Cache Is and Why It Grows

In autoregressive generation, a transformer generates one token at a time. Each new token needs to attend to all prior tokens. To avoid recomputing the key and value vectors for every prior token at each step, models cache them — the KV cache.

For a model with L layers, H attention heads, head dimension d, and sequence length n, the KV cache contains:

KV Cache Size = 2 · L · H · d · n · bytes_per_param
The factor of 2 is for keys and values. For GPT-4 scale (assuming ~96 layers, 128 heads, d=128) at 16-bit precision: roughly 0.5 GB per 1K tokens — or 500 GB for a 1M-token context.

This creates a structural problem for long-context inference: the KV cache for a single long conversation can exceed the total VRAM of a high-end GPU, forcing expensive offloading to CPU RAM or requiring multi-GPU deployments even for a single inference call.

Three Engineering Responses to the KV Cache Problem

1. Grouped-Query Attention (GQA). Instead of one K/V head per Q head, multiple Q heads share a single K/V head. Google introduced GQA in a 2023 paper showing it achieved near-identical quality to multi-head attention while reducing KV cache size by a factor equal to the grouping ratio. Llama 2 70B, Mistral 7B, and Gemini all adopted GQA. A grouping ratio of 8 reduces KV cache memory 8-fold with minimal quality loss.

2. Multi-Query Attention (MQA). The extreme case: all Q heads share a single K/V pair per layer. Noam Shazeer proposed MQA in 2019; it was adopted by PaLM (2022) and Falcon (2023). It reduces KV cache to O(L·d·n) — independent of head count — at some quality cost, particularly on tasks requiring fine-grained attention diversity.

3. KV Cache Quantization. Storing KV cache in 8-bit or 4-bit integers rather than 16-bit floats halves or quarters memory requirements. Research in 2023–2024 from MIT, together with production implementations in vLLM and TensorRT-LLM, demonstrated that aggressive KV quantization causes minimal perplexity degradation while doubling effective context capacity on fixed hardware.

Technique KV Memory Reduction Quality Impact Adopted By
Multi-Head Attention (baseline) 1× (no reduction) None Original GPT, BERT
Multi-Query Attention H× (= # of heads) Moderate on diverse tasks PaLM, Falcon, Gemini Ultra
Grouped-Query Attention G× (grouping ratio) Minimal Llama 2 70B, Mistral, Gemini
KV Quantization (INT8) Negligible (<0.5 PPL) vLLM, TensorRT-LLM
KV Quantization (INT4) Small (task-dependent) Research + enterprise deployments

PagedAttention and Virtual Memory for KV Caches

In 2023, UC Berkeley's vLLM project introduced PagedAttention — borrowing the virtual memory paging concept from operating systems to manage KV caches across many concurrent inference requests. Instead of allocating a contiguous block of GPU memory for each request's KV cache (which wastes memory due to variable sequence lengths and unpredictable generation length), PagedAttention allocated non-contiguous fixed-size pages and maintained a page table.

The practical impact was dramatic: vLLM demonstrated 24× higher throughput than HuggingFace's naive implementation on the same hardware, primarily by eliminating KV cache memory fragmentation. PagedAttention became the dominant inference serving paradigm for open-weight models by late 2023, and commercial inference providers adopted similar techniques to serve long-context requests at scale.

The Deployment Insight

By 2024, the economics of long-context inference were dominated not by training compute but by KV cache management: how efficiently you can store, retrieve, and serve cached attention states across thousands of concurrent requests. The winners in the inference efficiency race were teams that treated the KV cache as a first-class resource management problem — applying techniques from databases and operating systems, not just deep learning research.

Grouped-Query Attention Attention variant where G query heads share one K/V head, reducing KV cache size by factor G with minimal quality degradation. Adopted in Llama 2, Mistral, Gemini.
PagedAttention vLLM's OS-inspired technique for KV cache management using non-contiguous memory pages and a page table, eliminating fragmentation and enabling ~24× higher inference throughput.
Memory Bandwidth Bound An inference regime where the bottleneck is reading data from GPU HBM, not performing arithmetic — characteristic of long-context generation where KV cache loading dominates compute time.

Quiz — The KV Cache Economy

Three questions · select the best answer
1. Grouped-Query Attention reduces KV cache memory by factor G because:
Correct. With G query heads per K/V head, you need H/G K/V heads total instead of H, directly reducing KV cache size by factor G.
Not quite. GQA groups query heads to share K/V heads. If you have 64 Q heads and group size 8, you only need 8 K/V heads — a direct 8× KV memory reduction.
2. PagedAttention (vLLM, 2023) achieved its ~24× throughput improvement primarily by:
Correct. Contiguous KV cache allocation wastes memory due to length variability. PagedAttention's paging eliminates this waste, allowing far more concurrent requests on the same GPU memory — hence the throughput gain.
Not quite. The key innovation was memory management. Naive implementations pre-allocate a contiguous maximum-length KV block per request, wasting memory. PagedAttention uses pages, dramatically reducing waste and enabling more concurrent requests.
3. When inference is described as "memory bandwidth bound" rather than "compute bound," it means:
Correct. In memory-bandwidth-bound regimes, the GPU's arithmetic units are idle waiting for data to arrive from HBM. This is common in long-context generation where loading a large KV cache dominates over the actual attention computation.
Not quite. Memory bandwidth bound means the GPU is starved for data — it finishes computations faster than HBM can supply the next batch. For long-context inference, reading the KV cache is often slower than computing attention over it.

Lab 4 — KV Cache Engineering

Explore inference memory economics with a live AI tutor · 3 exchanges to complete

Your Mission

Work through real KV cache sizing calculations and deployment economics. Ask the tutor to estimate KV cache memory for specific model configurations, compare GQA and MQA trade-offs for a hypothetical deployment, or analyze how PagedAttention changes the economics of serving 1,000 concurrent long-context sessions.

Suggested start: "Calculate the KV cache memory needed to serve Llama 2 70B with a 32K-token context per request for 100 concurrent users — and tell me how GQA changes that number."
AI Tutor — KV Cache Economics
Lab 4
Welcome to Lab 4. I can help you work through KV cache sizing calculations for real model architectures, analyze the deployment economics of GQA versus MQA at scale, explain how PagedAttention changes concurrent request capacity, or explore why KV quantization has become standard in production serving. What would you like to calculate or explore?

Module 3 — Test

15 questions · 80% to pass · covers all four lessons
1. The attention mechanism in standard transformers scales as O(n²) in both time and memory because:
Correct. QKᵀ produces an n×n matrix — the defining quadratic operation.
The quadratic scaling comes from the n×n matrix in QKᵀ — all n queries against all n keys.
2. If a 4K-token context costs 1 unit of attention compute, a 32K-token context costs approximately:
Correct. 32K ÷ 4K = 8× length increase. Quadratic: 8² = 64× cost.
Quadratic scaling: length ratio is 8, so cost ratio is 8² = 64.
3. Anthropic's Claude 1 required specialized infrastructure for its 100K context window (2023) primarily because:
Correct. The O(n²) memory growth at 100K tokens required novel infrastructure, not just more of the same hardware.
The O(n²) attention matrix at 100K tokens overwhelms standard GPU memory — specialized infrastructure was needed to manage this.
4. Longformer (Allen Institute, 2020) reduced attention complexity to approximately O(n) by:
Correct. Longformer's sliding window gives O(n·w) local attention, plus O(n·g) for g global tokens — effectively O(n) when w and g are constants.
Longformer used sliding-window local attention (each token attends to w neighbors) plus global tokens that attend to all positions.
5. FlashAttention differs from sparse attention methods in that:
Correct. FlashAttention is an IO-aware algorithm — same mathematical result, different execution order that stays within fast SRAM rather than repeatedly reading from HBM.
FlashAttention computes the exact same attention as standard attention. Its innovation is IO efficiency, not mathematical approximation.
6. Mistral 7B's rolling buffer KV cache (September 2023) enables long-context processing by:
Correct. The rolling buffer maintains a fixed-size cache where old entries are overwritten, while information propagates forward through layer-by-layer processing.
Mistral's rolling buffer is a fixed-size circular cache — new tokens overwrite old ones, keeping memory constant while allowing context to propagate through deep layers.
7. The mathematical trick that gives linear attention its sub-quadratic complexity is:
Correct. This associativity trick is the core insight: (AB)C vs A(BC) — same mathematical result, but one path forms an n×n intermediate and the other doesn't.
The trick is matrix multiplication associativity: compute the d×d matrix (φ(K)ᵀ·V) first, then multiply each of n queries by that result — no n×n intermediate needed.
8. Mamba's "selective" state spaces differ from standard S4 in that:
Correct. Input-dependent A, B, C matrices are what make Mamba "selective" — the model can decide, based on content, how much to gate the hidden state update.
Mamba's selectivity comes from input-dependent SSM parameters — A, B, C change per token, giving the model learned control over its memory update.
9. On needle-in-a-haystack benchmarks, pure SSMs like Mamba underperform transformers because:
Correct. Transformers have random access via attention. SSMs compress prior context into a fixed state — precise retrieval of specific buried details is inherently harder.
The retrieval gap is architectural: transformers attend directly to any prior token; SSM hidden states are lossy compressions where specific facts can be difficult to recover.
10. Jamba (AI21 Labs, March 2024) is significant because it:
Correct. Jamba validated the hybrid SSM-attention paradigm in production, showing neither architecture alone was optimal for all tasks.
Jamba's innovation was hybridization — interleaving Mamba SSM layers with transformer attention layers to get the benefits of both.
11. The KV cache size formula includes a factor of 2 because:
Correct. KV cache stores both the key vector and value vector per token per layer — hence 2 × L × H × d × n.
The factor of 2 is because attention needs both K and V cached. K is used to compute attention scores; V is used to compute the weighted sum.
12. Grouped-Query Attention (GQA) was adopted by Llama 2 70B primarily to:
Correct. GQA's primary motivation was inference efficiency — reducing KV cache memory while preserving the quality advantages of multi-head attention.
GQA's adoption was driven by KV cache economics: sharing K/V heads across query heads reduces cache memory by the grouping ratio with negligible accuracy cost.
13. PagedAttention (vLLM) addresses which specific problem in production inference serving?
Correct. Contiguous allocation wastes memory because you must reserve worst-case space per request. Pages eliminate this waste, allowing far more concurrent long-context sessions.
PagedAttention solves memory fragmentation: naive contiguous KV allocation wastes GPU memory proportional to (max_length - actual_length) per request. Pages fix this.
14. When long-context inference is "memory bandwidth bound," the most effective optimization is:
Correct. If you're memory-bandwidth bound, more compute units are idle — the bottleneck is data transfer. Reducing data size or movement (quantization, GQA, tiling) directly addresses the constraint.
In a memory-bandwidth-bound regime, adding arithmetic compute doesn't help — the GPU is already starved for data. The fix is to reduce data volume or improve access patterns.
15. Which combination of techniques best describes how modern production models (GPT-4, Claude 3, Gemini 1.5) handle million-token-range contexts?
Correct. No single technique is sufficient. Production long-context systems stack FlashAttention (IO efficiency), GQA/MQA (KV cache size), KV quantization (further memory reduction), and paged serving infrastructure.
Production long-context systems combine many techniques: IO-efficient exact attention (FlashAttention), reduced KV cache (GQA/MQA), KV quantization, and paged memory management for serving.