The Context Window Race · Introduction

The Dimension of AI That Will Define the Next Decade

Why the size of what an AI can hold in mind at once matters more than almost anything else about it

In 1876, when Alexander Graham Bell filed his telephone patent, the pressing engineering problem was not whether voice could travel over wire — it already could — but how far. Early telephone exchanges in Boston and New Haven covered a few city blocks. By 1915, AT&T's transcontinental line finally reached San Francisco, but only after engineers solved the problem of signal decay: how much information could be kept coherent across distance. The answer to that question reshaped commerce, journalism, and the command structures of armies. The technology was not new; its reach was.

The same pattern is now playing out in artificial intelligence, measured not in miles but in tokens. In 2020, GPT-3 could hold 2,048 tokens in working memory, about three pages of text. By early 2023, Claude's initial release handled 9,000. By mid-2023, Anthropic had pushed Claude to 100,000 tokens. Google's Gemini 1.5 Pro, announced in February 2024, demonstrated a one-million-token context window. The underlying model architectures did not change beyond recognition between those milestones. What changed, radically and consequentially, was reach.

This course is about understanding that race: what a context window actually is at a technical and practical level, why its size determines what tasks AI can and cannot perform, how engineers have expanded it, and what tradeoffs accumulate as it grows. The goal is not to make you a researcher but to make you a clear-eyed practitioner who knows why a 200-page contract poses different challenges than a 2-page one, and what to do about it. We will work from documented technical facts and real product histories, not speculation.

If you finish every module, here's who you become:

You'll understand what a context window actually is — tokens, attention, and why holding more in mind is genuinely hard to engineer.
You'll be able to explain why a 200-page contract strains a model differently than a 2-page one, and what to do about it.
You'll know the real history: how context grew from GPT-3's 2,048 tokens to Gemini's one million, and what drove each leap.
You'll recognize the quadratic attention problem and why it makes long contexts expensive, without needing to be a researcher to act on that knowledge.
You'll choose between RAG and full-context prompting deliberately, understanding the tradeoffs rather than guessing.
You'll spot the 'lost in the middle' failure — when a model ignores information buried in a long prompt — and design around it.
You'll become the person in the room who treats context length as a core engineering constraint, not a footnote on a spec sheet.

The Context Window Race · Lesson 1

The Working Memory of a Language Model

What a context window is, what tokens are, and why the boundary matters so acutely in practice

If an AI cannot remember what you wrote three pages ago, what does that mean for how you use it?

In March 2023, a group of lawyers at a New York firm was preparing for a complex contract dispute. They fed the full 312-page merger agreement into GPT-4, which had launched days earlier with a context window of up to 32,000 tokens. The model summarized the first forty pages faithfully. By page 200, it was contradicting its own earlier summaries. Clauses it had described as unconditional had quietly become conditional in the model's understanding, because the relevant qualifying language appeared earlier in the document than the definitions it modified, and that earlier material had, in effect, scrolled off the model's working memory. The lawyers caught it. Many users do not. The problem had a name by then: context overflow. It would drive one of the most intense engineering competitions in modern software history.

The episode illustrates something fundamental: a language model's capabilities are not simply a function of its training. They are also, acutely, a function of what it can attend to in a given moment. The context window is that moment — its length, its fidelity, and its limits.

What Exactly Is a Context Window?

A context window is the total amount of text — measured in tokens — that a language model can process in a single forward pass. Everything the model knows about your current conversation or document must fit inside this window. Nothing outside it exists, from the model's perspective, during inference.

The term "window" is apt. Think of sliding a physical reading frame across a very long scroll. The frame shows you some portion of the scroll with perfect clarity. Whatever lies outside the frame is, for the moment, invisible. The model's attention mechanisms — the computational heart of transformer architecture — operate on the tokens inside this frame and nothing else.

This is fundamentally different from how human memory works. You can ask a person to recall chapter three of a book they finished two weeks ago, and they will do so imperfectly but meaningfully. A language model has no such long-term episodic store during a single inference call. It has only what fits in the window.

Tokens: The Unit of Measure

Before context windows make sense, you need to understand tokens. A token is not a word. It is a chunk of text produced by a tokenizer, most commonly one built with byte-pair encoding (BPE), and each model family trains its own. OpenAI's tiktoken library, used for GPT models, splits English text into tokens that average roughly 0.75 words each, so 1,000 words is approximately 1,333 tokens.

Common English words like "the," "is," and "run" are each one token. Rare words or technical terms may be split into several tokens: "tokenization" might become ["token", "ization"]. Whitespace, punctuation, and code syntax each consume tokens. A Python function with comments might tokenize at a higher rate than equivalent prose. Non-English languages often tokenize less efficiently — a sentence in Turkish or Finnish may require twice as many tokens as the same semantic content in English, because the tokenizer was trained predominantly on English text.

This matters practically. If you are working with a 128,000-token model and feeding it a legal document, you cannot simply count pages. You must estimate tokens. A dense 100-page PDF might consume 60,000 tokens; a lightly spaced one might consume 40,000. Tools like OpenAI's Tokenizer page and Anthropic's token counter allow direct inspection.
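The estimate described above can be sketched in a few lines. This is a rough heuristic only, assuming the 0.75 words-per-token average quoted for English; the function names `estimate_tokens` and `fits_in_window` are hypothetical, and an exact count requires running the model's own tokenizer.

```python
def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate from a whitespace word count (English prose only)."""
    word_count = len(text.split())
    return round(word_count / words_per_token)


def fits_in_window(text: str, context_window: int, reserved_tokens: int = 0) -> bool:
    """Check the estimate against a window, leaving room for prompts and replies."""
    return estimate_tokens(text) + reserved_tokens <= context_window


# Assumption: a stand-in document of 1,000 identical words.
sample_document = "word " * 1000
print(estimate_tokens(sample_document))        # 1333
print(fits_in_window(sample_document, 2_048))  # True
```

Because code, tables, and non-English text tokenize at different rates, treat the result as a planning number and leave a safety margin.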

Concrete Scale

GPT-4's original 8,192-token context window holds approximately 6,000 words, roughly a long magazine article. GPT-4-32k holds about 24,000 words, or a short novella. Claude 3's 200,000-token context holds approximately 150,000 words, about the length of two typical novels. Gemini 1.5 Pro's 1,000,000-token context holds roughly 750,000 words, more than the whole of War and Peace, which runs to well over half a million words in English translation.

What the Window Contains

The context window does not contain only your most recent message. It holds, in sequence, everything that has been exchanged in the current session: the system prompt (instructions given to the model before the conversation begins), all prior user messages, all prior assistant responses, any documents or code you have pasted in, and your current message. The model sees all of this as a single, ordered token sequence.

This has a non-obvious implication: every response the model generates consumes context space. A long assistant reply uses up tokens that are no longer available for new input. In long conversations, chat interfaces typically truncate or summarize earlier exchanges as the window fills. When that truncation occurs is often opaque to the user, and the behavior it causes — the model apparently "forgetting" things said early in the conversation — is frequently misattributed to the model being careless rather than to the hard constraint it is operating under.
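One simple truncation policy can be sketched as follows: keep the system prompt, then keep as many of the most recent messages as still fit. This is an illustration under stated assumptions, not how any particular chat product works; `truncate_history` and `rough_count` are hypothetical names, and the token counter is the 0.75 words-per-token heuristic rather than a real tokenizer.

```python
def rough_count(text: str) -> int:
    """Assumed token counter: whitespace words divided by 0.75."""
    return round(len(text.split()) / 0.75)


def truncate_history(system_prompt, messages, context_window, count_tokens=rough_count):
    """Return the most recent messages that fit alongside the system prompt."""
    budget = context_window - count_tokens(system_prompt)
    kept = []
    # Walk backward from the newest message and stop at the first that does not fit.
    for message in reversed(messages):
        cost = count_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept


history = ["first user message", "first assistant reply", "second user message"]
print(truncate_history("be concise please", history, context_window=12))
```

With a 12-token window, the oldest message is dropped silently, which is the "forgetting" a user observes from the outside.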

System prompts are particularly important to understand here. In production applications — think a customer-support chatbot or a coding assistant — the system prompt may itself consume thousands of tokens before the user types a single character. A detailed system prompt of 4,000 tokens on a model with a 32,000-token context leaves only 28,000 tokens for the actual conversation. Developers who do not track this routinely hit limits they did not anticipate.
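The budgeting arithmetic is worth writing down explicitly. The sketch below uses a hypothetical helper, `remaining_budget`, and adds one assumption not stated above: that you may also want to reserve tokens for the model's reply.

```python
def remaining_budget(context_window: int,
                     system_prompt_tokens: int,
                     reserved_output_tokens: int = 0) -> int:
    """Tokens left for conversation after fixed overheads are subtracted."""
    remaining = context_window - system_prompt_tokens - reserved_output_tokens
    if remaining < 0:
        raise ValueError("fixed overheads already exceed the context window")
    return remaining


# The example from the text: a 4,000-token system prompt on a 32,000-token model.
print(remaining_budget(32_000, 4_000))         # 28000
# Same setup, also holding back 2,000 tokens for the reply (an assumed figure).
print(remaining_budget(32_000, 4_000, 2_000))  # 26000
```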

Key Distinction

A model's parameter count (how much it "knows" from training) and its context window (how much it can see right now) are completely separate quantities. A very large model with a small context window cannot analyze a long document. A smaller model with a very large context window can — but may reason about it less accurately. Both dimensions matter, and neither substitutes for the other.

Key Terms

Context window: The maximum number of tokens a model can process in one inference pass; its working memory for a given interaction.

Token: A sub-word unit of text produced by byte-pair encoding; approximately 0.75 English words on average for GPT-family tokenizers.

System prompt: Pre-conversation instructions placed before the user's first message; they consume context window space before any dialogue begins.

Context overflow: The condition that occurs when a conversation or document exceeds the context window, causing the model to lose access to earlier material.

Forward pass: The single computational operation in which the model processes all tokens in the context window to produce its next output.

Lesson 1 Quiz

Five questions · Select the best answer · Immediate feedback

1. A language model's context window is best described as:

Correct. The context window is the working memory of a single inference pass — everything the model can attend to at once. It is entirely separate from training data size or parameter count.

Not quite. The context window is distinct from training data (what the model learned) and from parameters (the model's stored knowledge). It is specifically the limit on how much text the model can process in one go.

2. Using OpenAI's standard tokenizer, approximately how many tokens does 1,000 English words represent?

Correct. Because tokens are sub-word units averaging about 0.75 words each, 1,000 words tokenizes to roughly 1,333 tokens. This ratio is important when estimating whether a document will fit within a context window.

Not quite. Tokens are smaller than words — averaging about 0.75 words each — so 1,000 words produces more tokens than words. The correct estimate is approximately 1,333 tokens.

3. Which of the following occupies space in the context window during a chat conversation?

Correct. The entire conversation history — system prompt, every exchange, every pasted document — accumulates in the context window as a single token sequence. This is why long conversations gradually lose access to early material.

Not quite. The context window holds the entire conversation as a single token sequence: system prompt, all prior user and assistant messages, pasted documents, and the current message. Every token from every part of the conversation counts toward the limit.

4. A developer builds a customer support chatbot with a 4,000-token system prompt on a model that has a 32,000-token context window. How many tokens remain for the actual conversation?

Correct. System prompts consume context window space just like any other tokens. A 4,000-token system prompt on a 32,000-token model leaves exactly 28,000 tokens for all subsequent conversation.

Not quite. System prompts occupy the context window the same way user messages do. 32,000 − 4,000 = 28,000 tokens remain for the actual conversation. Developers who ignore system prompt length often hit limits unexpectedly.

5. A model with 100 billion parameters but a 4,096-token context window is asked to analyze a 50,000-token document. What happens?

Correct. Parameter count and context window are independent. No matter how large the model, it cannot attend to tokens outside its context window in a single inference pass. The 50,000-token document simply does not fit in a 4,096-token window.

Not quite. Parameters encode what the model knows from training; the context window determines what it can see right now. These are separate quantities. A 50,000-token document cannot be processed in a single pass by a model with a 4,096-token context window, regardless of how many parameters it has.

Lab 1: Probing the Context Window Concept

A guided conversation with an AI to deepen your understanding of tokens, context windows, and their practical limits

Your Task

Use the chat below to explore what you just learned. The AI assistant is configured specifically for this lesson. Ask it at least three substantive questions about context windows, tokens, or the practical limits of working memory in language models.

Good starting points appear below — but follow your own curiosity. The lab completes after three exchanges.

Try asking: "How would I estimate whether a document I'm working with will fit in a model's context window?" — or — "What actually happens to a conversation when it overflows the context window?" — or — "Why do different languages tokenize at different rates?"

AI Lab Assistant

Context Window Race · L1

Hello. I'm your lab assistant for Lesson 1 of The Context Window Race. Ask me anything about what a context window is, how tokens work, what happens when a context window overflows, or how to estimate whether your content will fit. What would you like to explore?

The Context Window Race · Lesson 2

Why Attention Is Expensive, and Why That Matters

The quadratic cost of the transformer attention mechanism and how it shaped the context window race

If extending a context window from 4,000 to 32,000 tokens were free, why would engineers spend years trying to do it?

When the original "Attention Is All You Need" paper was published by Vaswani et al. at Google Brain in June 2017, it introduced the transformer architecture that now underlies virtually every major language model. The paper's title was a declaration of design philosophy: replace recurrence with attention. What it did not advertise prominently was the cost buried in that choice. Attention scales quadratically with sequence length. Double the tokens in the context window, and the attention computation does not double — it quadruples. At the time, with sequences of a few hundred tokens, this was manageable. By 2022, when researchers wanted to push contexts into the tens of thousands, it became the central engineering problem of the field.

This is not an abstract concern. It translates directly into GPU hours, latency, and inference cost. A model responding to a 100,000-token prompt in 2023 was, by the physics of the computation, doing something qualitatively different — and far more expensive — than responding to a 4,000-token prompt. Understanding why is understanding the engine of the context window race.

The Attention Mechanism Explained

In a transformer, every token in the context window must compare itself to every other token in the context window to determine how much "attention" to pay to it. If your context has N tokens, the model must compute N × N attention scores. This is the quadratic relationship: N² computations.

At 4,096 tokens: 4,096² = approximately 16.8 million attention computations per layer. At 32,768 tokens: 32,768² = approximately 1.07 billion computations per layer. At 128,000 tokens: approximately 16.4 billion. And models have many layers: the largest GPT-3 model has 96 transformer layers, and GPT-4's architecture has not been publicly disclosed. The compounding is significant.

Beyond raw computation, the attention matrix must be stored in memory. At 128,000 tokens with 16-bit floating point precision, storing the full attention matrix for a single attention head in a single layer requires roughly 32 gigabytes (128,000² values at 2 bytes each). Across the dozens of heads in a typical layer, that exceeds the VRAM capacity of any single high-end GPU. Long-context inference therefore requires either distributing computation across many GPUs or using techniques that avoid materializing the full attention matrix, which is exactly where the engineering innovation in this space has focused.
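The score counts and memory figures in this section follow from two multiplications. The sketch below reproduces them; the function names are hypothetical, and the figures cover one N × N score matrix only, ignoring everything else a real model stores.

```python
def attention_scores(n_tokens: int) -> int:
    """Number of pairwise attention scores for a sequence of n_tokens."""
    return n_tokens * n_tokens


def attention_matrix_bytes(n_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to hold one full score matrix (2 bytes = 16-bit floats)."""
    return attention_scores(n_tokens) * bytes_per_value


for n in (4_096, 32_768, 128_000):
    scores = attention_scores(n)
    gigabytes = attention_matrix_bytes(n) / 1e9
    print(f"{n:>7,} tokens: {scores:>14,} scores, {gigabytes:7.3f} GB at 16-bit")
```

The last row comes to 16,384,000,000 scores and about 32.8 GB, which is the "roughly 32 gigabytes" quoted above.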

Flash Attention: The Key Breakthrough

In May 2022, Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré, working at Stanford and the University at Buffalo, published FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. The paper identified that the bottleneck in attention computation was not the floating-point operations per se but the movement of data between GPU high-bandwidth memory (HBM) and the much faster on-chip SRAM. The attention matrix was being written to and read from slow memory repeatedly.

FlashAttention reorganized the computation into tiles, computing attention in small blocks that fit entirely in fast SRAM and never materializing the full attention matrix in slow memory. The mathematical result was identical; the memory footprint and speed were dramatically improved. FlashAttention-2, published in 2023, refined parallelism and work partitioning and reported roughly a 2× speedup over the original FlashAttention on A100 GPUs. It became the de facto standard attention implementation for virtually every major model by 2023, and it is a significant reason why context windows that were impractical in 2021 became routine by 2024.
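The claim that tiling leaves the result unchanged can be checked with a toy example. The sketch below is plain Python for a single query vector, not the GPU kernel from the paper: it shows the running-maximum rescaling (often called online softmax) that lets attention be accumulated block by block without ever holding all scores at once. All names and the sample data are illustrative assumptions.

```python
import math


def naive_attention(q, keys, values):
    """Softmax attention for one query, computing every score up front."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def tiled_attention(q, keys, values, block_size=2):
    """Same output, visiting keys and values one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        block_k = keys[start:start + block_size]
        block_v = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in block_k]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, block_v):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
full = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(full, tiled)))  # True
```

The tiled version only ever holds one block of scores plus three running quantities, which is the property that lets the real kernel stay inside fast on-chip memory.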

Why This Matters to You

When a model provider charges per token, the quadratic cost of attention is part of what you are paying for. Long-context inference is genuinely more expensive to run than short-context inference, and the attention portion grows superlinearly rather than linearly. In attention computation alone, a 200,000-token prompt requires about ten times as much work as ten 20,000-token prompts with the same total token count, because 200,000² is 40 billion score computations while 10 × 20,000² is 4 billion. Budget accordingly when designing systems that use very long contexts.
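The comparison between one long prompt and several short ones can be checked directly. This sketch counts attention score pairs only; it deliberately ignores the parts of a model whose cost grows linearly with tokens, and it says nothing about any provider's actual price list. The function name is hypothetical.

```python
def attention_ops(n_tokens: int) -> int:
    """Pairwise attention scores for one sequence of n_tokens."""
    return n_tokens * n_tokens


one_long_prompt = attention_ops(200_000)         # one 200,000-token sequence
ten_short_prompts = 10 * attention_ops(20_000)   # ten separate 20,000-token sequences

print(f"{one_long_prompt:,}")                    # 40,000,000,000
print(f"{ten_short_prompts:,}")                  # 4,000,000,000
print(one_long_prompt // ten_short_prompts)      # 10
```

The tradeoff is that the ten short prompts cannot attend to each other, so splitting only helps when the task does not need cross-chunk reasoning.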

Sparse Attention and Alternatives

FlashAttention is an implementation optimization — it computes the same full quadratic attention more efficiently. A parallel line of research attacks the quadratic problem by changing what is computed. Sparse attention methods allow each token to attend only to a structured subset of other tokens rather than all of them, reducing the N² computation toward N log N or even N.

OpenAI's 2019 Sparse Transformer demonstrated this principle. Subsequent work, including Longformer from the Allen Institute for AI and BigBird from Google Research, developed practical sparse attention patterns including sliding window attention (each token attends to its local neighborhood), global attention (a small set of designated tokens attends to, and is attended by, every position), and random attention (randomly chosen token pairs). These approaches made processing very long sequences computationally tractable but introduced a new question: which tokens actually need to attend to which others? Getting that wrong means the model misses relationships it should see.
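To see how much work a sparse pattern removes, the sketch below counts the (query, key) pairs allowed under a sliding window with optional global tokens. It is a counting illustration with hypothetical names and an assumed window size, not an implementation of Longformer or BigBird.

```python
def can_attend(i: int, j: int, window: int, global_tokens=frozenset()) -> bool:
    """True if position i may attend to position j under the sparse pattern."""
    if i in global_tokens or j in global_tokens:
        return True  # global tokens see, and are seen by, every position
    return abs(i - j) <= window  # otherwise only the local neighborhood


def count_pairs(n_tokens: int, window: int, global_tokens=frozenset()) -> int:
    """Number of attention scores actually computed for the whole sequence."""
    return sum(
        can_attend(i, j, window, global_tokens)
        for i in range(n_tokens)
        for j in range(n_tokens)
    )


n = 512  # kept small so the brute-force count runs quickly
print(n * n)                                # 262144 pairs with full attention
print(count_pairs(n, window=16))            # 16624 pairs with a sliding window
print(count_pairs(n, window=16, global_tokens=frozenset({0})))
```

The sliding-window count grows roughly in proportion to N rather than N², at the price of never comparing distant tokens directly unless a global token bridges them.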

By 2024, the dominant engineering direction was not sparse attention but rather combining FlashAttention with positional encoding improvements — specifically, techniques like Rotary Position Embedding (RoPE) with extended frequency scaling — to allow standard attention to generalize to sequence lengths longer than those seen in training without catastrophic degradation. This is how Llama 3's context was extended from 8,192 to 128,000 tokens in successive releases.
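One of the simplest members of this family is linear position interpolation: positions in the longer window are divided by a scale factor so the rotation angles stay within the range seen during training. The sketch below illustrates that idea only. It is an assumption for teaching purposes, not the specific scaling scheme any named model shipped with, and `rope_angles` and its parameters are hypothetical.

```python
def rope_angles(position: int, head_dim: int,
                base: float = 10000.0, scale: float = 1.0) -> list:
    """Rotation angles RoPE applies at one position; scale > 1 compresses positions."""
    scaled_position = position / scale
    return [scaled_position * base ** (-2 * i / head_dim)
            for i in range(head_dim // 2)]


trained_length = 8_192    # window used in training (figure from the text)
target_length = 128_000   # window wanted at inference (figure from the text)
scale = target_length / trained_length

# After interpolation, the far end of the long window maps onto angles the
# model already encountered at the far end of its training window.
extended = rope_angles(target_length, head_dim=8, scale=scale)
familiar = rope_angles(trained_length, head_dim=8)
print(extended == familiar)  # True
```

In practice such scaling is usually combined with additional training on long sequences, and more elaborate schemes adjust different frequencies by different amounts.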

The Core Tension

The context window race is not simply about making numbers bigger. Every increase in context length requires either accepting higher compute costs, accepting architectural compromises that may affect quality, or discovering new engineering techniques that change the tradeoff curve. The history of the race is the history of researchers finding ways to move that curve — and then model providers deciding how much of the remaining cost to absorb.

Key Terms

Quadratic scaling: The property of standard transformer attention whereby computational cost grows as the square of sequence length (N²), not linearly.

FlashAttention: A 2022 Stanford algorithm by Tri Dao et al. that computes exact attention without materializing the full attention matrix in slow memory, enabling dramatically longer context windows at practical cost.

Sparse attention: An attention variant in which each token attends to only a structured subset of other tokens, reducing the N² computation toward more manageable complexity.

VRAM: Video RAM, the high-bandwidth memory on a GPU where model weights and intermediate computations (including attention matrices) must reside during inference.

Lesson 2 Quiz

Five questions · Select the best answer · Immediate feedback

1. Standard transformer attention scales with sequence length in which way?

Correct. Standard attention requires every token to compare itself to every other token, producing N² attention scores. Doubling N quadruples the computation, which is why extending context windows requires serious engineering effort.

Not quite. Standard attention is quadratic: N tokens require N² attention score computations. Doubling the sequence length quadruples the attention computation, not merely doubles it.

2. What was the core insight of the FlashAttention paper published by Tri Dao et al. in 2022?

Correct. FlashAttention kept the mathematics identical to standard attention but reorganized computation into tiles that fit in fast on-chip SRAM, avoiding expensive reads and writes to slow GPU HBM. The result was identical; the memory footprint and speed were dramatically improved.

Not quite. FlashAttention does not approximate or reduce the computation — it computes exact attention. Its innovation was IO-awareness: reorganizing the computation to minimize slow memory (HBM) access by tiling operations that fit in fast SRAM.

3. If a 4,096-token context requires storing approximately 32 MB for the attention matrix, roughly how much memory does a 128,000-token context require?

Correct. Because attention scales quadratically, going from 4,096 to 128,000 tokens is a 31.25× increase in sequence length, which produces roughly 31.25² ≈ 977× more memory demand — approximately 32 GB. This is why long-context inference requires multiple GPUs or FlashAttention-style tiling.

Not quite. Attention memory scales quadratically: if the sequence length increases by ~31×, the memory increases by ~31² ≈ 977×. Starting from ~32 MB yields roughly 32 GB — which is why long-context inference is a serious GPU memory challenge.

4. Sparse attention approaches like Longformer differ from FlashAttention in which fundamental way?

Correct. This is the essential distinction. FlashAttention is an implementation trick that computes exact (full) attention faster. Sparse attention changes what is computed — only certain token pairs attend to each other — which introduces an approximation but can reduce complexity from N² toward N log N or N.

Not quite. The key distinction is in what is computed vs. how it is computed. Sparse attention reduces the number of attention score computations by restricting attention patterns. FlashAttention computes all the same attention scores as standard attention, just using a more memory-efficient tiling strategy.

5. Why might sending one 200,000-token prompt cost more than ten 20,000-token prompts with the same total token count?

Correct. Because attention cost scales quadratically, a single 200,000-token sequence costs (200,000)² = 40 billion attention operations, while ten 20,000-token sequences cost 10 × (20,000)² = 4 billion total. The single long sequence is 10× more expensive in attention computation alone, even at the same total token count.

Not quite. This follows directly from quadratic scaling. A single 200k-token sequence requires 200k² attention operations. Ten 20k-token sequences require 10 × 20k² = 4 billion operations total. The single long sequence costs 10× more in attention computation, even though the raw token counts are equal.

Lab 2: Understanding Attention Cost

Explore the engineering tradeoffs behind long-context AI with your AI lab assistant

Your Task

This lab is about the engineering reality behind expanding context windows. Use the chat to explore quadratic scaling, FlashAttention, sparse attention, and their practical implications. Ask at least three substantive questions to complete the lab.

Try asking: "If FlashAttention computes the exact same math, why does it help with long contexts?" — or — "When would you choose sparse attention over full attention, and what do you lose?" — or — "How does the quadratic cost of attention affect how I should design applications that use large context windows?"

AI Lab Assistant

Context Window Race · L2

Welcome to Lab 2. I'm here to help you understand the engineering behind attention computation and why extending context windows is hard. Ask me about FlashAttention, quadratic scaling, sparse attention, memory constraints, or how these tradeoffs show up in real systems. What's on your mind?

The Context Window Race · Lesson 3

The Race: From 2,048 Tokens to One Million

A documented timeline of context window milestones and the competitive dynamics that drove them

What actually changed, and when, as AI companies raced to offer the longest context windows?

On February 15, 2024, Google announced Gemini 1.5 Pro and opened a one-million-token context window to a limited group of developers in private preview: enough to process roughly eleven hours of audio, one hour of video, 30,000 lines of code, or 700,000 words of text. Google framed this not as an incremental improvement but as a category shift. The announcement included a demonstration in which Gemini 1.5 Pro located a specific scene in a 45-minute film given only a hand-drawn sketch as reference, a task requiring genuine comprehension of the entire video. What is instructive about this milestone is not just that it happened, but the path that led there: a series of discrete steps, each driven by a combination of technical breakthrough and competitive pressure, spanning just four years.

The Documented Milestones

2020 — GPT-3: 2,048 tokens. OpenAI's GPT-3, released in May 2020 with 175 billion parameters, used a 2,048-token context window. This was standard for the era. The model could handle a few pages of text — enough for most single-document tasks but insufficient for sustained document analysis or multi-document reasoning.

2022 — Anthropic Claude (beta): 9,000 tokens; AI21 Jurassic-2: up to 8,192 tokens. As the transformer ecosystem matured following the publication of FlashAttention, context windows began expanding. Anthropic's initial private beta of Claude used 9,000 tokens, already more than GPT-3.5's 4,096-token public API. AI21 Labs' Jurassic-2 offered up to 8,192 tokens.

March 2023 — GPT-4 launch: 8,192 and 32,768 tokens. OpenAI launched GPT-4 with two context configurations: 8k and 32k tokens. The 32k version could handle approximately 24,000 words — a substantial document — but was priced significantly higher and initially limited to select partners. The same week, Anthropic launched Claude 1 with a 9,000-token context.

May 2023 — Anthropic Claude: 100,000 tokens. Anthropic announced Claude's context window expansion to 100,000 tokens. This was a genuine inflection point. 100,000 tokens represents approximately 75,000 words — the length of a typical novel. For the first time, users could load an entire book, a full codebase, or a year's worth of meeting transcripts into a single context. Anthropic demonstrated this by having Claude analyze the entire text of The Great Gatsby in one pass.

November 2023 — GPT-4 Turbo: 128,000 tokens. OpenAI responded at its first DevDay conference on November 6, 2023, announcing GPT-4 Turbo with a 128,000-token context window at significantly reduced pricing. The context expansion was paired with a knowledge cutoff extension to April 2023.

February 2024 — Google Gemini 1.5 Pro: 1,000,000 tokens. Google's announcement in February 2024 of Gemini 1.5 Pro represented a full order-of-magnitude jump over the previous leaders. The architecture underlying this milestone used a Mixture of Experts (MoE) approach combined with advances in positional encoding, enabling in-context learning at scales previously thought to require fine-tuning.

2024 onward — Claude 3 (200k), Llama 3.1 (128k), Gemini 1.5 Flash (1M), GPT-4o (128k). By mid-2024, 100,000+ token context windows had become table stakes for frontier models, and the competition shifted from raw context length toward quality at length — whether models actually use long contexts reliably, rather than simply accepting them.

The "Lost in the Middle" Problem

A landmark paper published by researchers at Stanford, UC Berkeley, and Samaya AI in July 2023 — titled "Lost in the Middle: How Language Models Use Long Contexts" — documented a critical quality problem that context length milestones were obscuring. The researchers found that when relevant information was placed in the middle of a long context, model performance degraded significantly compared to when the same information was placed at the beginning or end.

Specifically, across tested models, retrieving information from the middle of a 20-document context performed roughly 10–20 percentage points worse than retrieval from the edges of the same context. This was observed across multiple models and configurations. The implication: a model advertised as supporting 100,000 tokens does not necessarily use those tokens with equal fidelity throughout. Accepting a long document and reasoning reliably over all of it are different capabilities.
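One common way to design around this effect is to put the material you most need the model to use at the edges of the prompt. The sketch below assumes you already have documents ranked from most to least relevant (for example, from a retrieval step); the function name is hypothetical, and the heuristic is a practitioner workaround rather than something prescribed by the paper.

```python
def order_for_long_context(documents_by_relevance: list) -> list:
    """Place the most relevant items first and last, the least relevant in the middle."""
    front, back = [], []
    for rank, document in enumerate(documents_by_relevance):
        # Alternate: even ranks fill from the start, odd ranks fill from the end.
        if rank % 2 == 0:
            front.append(document)
        else:
            back.append(document)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is most relevant, "E" least
print(order_for_long_context(ranked))  # ['A', 'C', 'E', 'D', 'B']
```

The two strongest documents end up at the start and end of the context, and the weakest lands in the middle where retrieval is least reliable.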

This finding reframed the competitive conversation. By late 2023 and into 2024, benchmark suites specifically designed to test retrieval from arbitrary positions in long contexts — such as the Needle in a Haystack test, popularized by Greg Kamradt in November 2023 — became the de facto standard for evaluating long-context quality rather than mere length.

Needle in a Haystack

The Needle in a Haystack benchmark, developed by Greg Kamradt and widely adopted in late 2023, evaluates a model by hiding a specific fact (the "needle") at various positions within a long document (the "haystack") and asking the model to retrieve it. Performance is mapped as a heatmap across document depth and context length. Kamradt's early runs against GPT-4 Turbo (128,000 tokens) and Claude 2.1 (200,000 tokens) showed striking patterns: recall was strong in short contexts but degraded as the context grew, and needles placed at the very start or end of the document were retrieved more reliably than those buried deeper inside. For Claude 3, announced in March 2024, Anthropic reported near-perfect recall on this test, above 99% for Claude 3 Opus, across the 200,000-token window.
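The test setup itself is simple enough to sketch. The code below builds haystacks with a needle at a chosen depth and enumerates the grid of (context length, depth) cells that the heatmap is drawn from. It is a simplified illustration with hypothetical names and sample sentences, not Kamradt's published harness, and it leaves out the model call and the scoring step.

```python
def insert_needle(haystack_sentences: list, needle: str, depth: float) -> list:
    """Insert the needle at a fractional depth: 0.0 is the start, 1.0 is the end."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(haystack_sentences))
    return haystack_sentences[:index] + [needle] + haystack_sentences[index:]


def build_grid(context_lengths: list, depths: list) -> list:
    """Every (context length, depth) combination to evaluate."""
    return [(length, depth) for length in context_lengths for depth in depths]


filler = ["Sentence one.", "Sentence two.", "Sentence three.", "Sentence four."]
needle = "The secret code is 4417."  # hypothetical fact to retrieve
print(insert_needle(filler, needle, depth=0.5))
print(len(build_grid([1_000, 10_000, 100_000], [0.0, 0.25, 0.5, 0.75, 1.0])))  # 15
```

Each grid cell becomes one prompt; the fraction of cells where the model returns the needle correctly is what the heatmap colors encode.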

What Drove the Race

The context window race was not driven by a single technical breakthrough but by the convergence of several factors: FlashAttention lowering the implementation cost, improved positional encoding techniques allowing models to generalize to longer sequences, hardware improvements (A100 and H100 GPUs offering higher VRAM and bandwidth), and direct competitive pressure between Anthropic, OpenAI, and Google, each monitoring the others' releases closely.

Pricing also shaped the race. Long-context inference costs more to serve, and the models with the longest contexts were initially priced at premiums that made routine use impractical. The period from 2023 to 2024 saw a rapid compression in per-token pricing, driven by a mix of infrastructure efficiency and competitive undercutting. At GPT-4-32k's 2023 launch price of $0.06 per 1,000 input tokens, 100,000 input tokens cost $6. By mid-2024, the same 100,000 input tokens cost about $0.025 on Anthropic's Claude 3 Haiku, priced at $0.25 per million input tokens.

Lesson 3 Quiz

Five questions · Select the best answer · Immediate feedback

1. Which event represented the first genuine inflection point in the context window race, enabling the processing of a full novel in one pass?

Correct. Anthropic's May 2023 announcement of a 100,000-token context (approximately 75,000 words) was the first to cross the threshold of holding a complete novel in a single pass, and was widely recognized as a qualitative shift in what long-context AI could accomplish.

Not quite. While GPT-4's 32k option and later milestones were important, the moment most recognized as a qualitative inflection was Anthropic's May 2023 expansion to 100,000 tokens — enough to load a complete novel in one pass.

2. The "Lost in the Middle" paper (2023) identified what critical limitation of long-context models?

Correct. The Stanford/Berkeley/Samaya research showed consistent performance degradation when relevant information appeared in the middle of a long document, a pattern now called the "lost in the middle" effect. It distinguished between a model accepting long contexts and a model reliably using them.

Not quite. The paper's key finding was about retrieval position within the context: information at the beginning and end of a long document was retrieved more reliably than information in the middle, regardless of whether the total length was within the model's advertised limit.

3. The Needle in a Haystack benchmark tests models by:

Correct. Developed by Greg Kamradt in November 2023, the benchmark places a "needle" fact at different depths within a "haystack" document and maps retrieval accuracy as a heatmap across both document position and total context length. It became the standard measure of long-context quality.

Not quite. The Needle in a Haystack test specifically evaluates whether a model can find a specific piece of information (the needle) placed at varying depths within a long document (the haystack). The results are visualized as a 2D heatmap across context length and document position depth.

4. Google's Gemini 1.5 Pro announcement in February 2024 was notable for reaching which context window size?

Correct. Gemini 1.5 Pro's one-million-token context, announced in February 2024, was roughly five times the 200,000-token window of Claude 2.1, the largest then offered by its main competitors, and was demonstrated on tasks including processing an entire 45-minute film.

Not quite. Gemini 1.5 Pro reached one million tokens, roughly five times the 200,000-token window of Claude 2.1. This milestone was announced in February 2024 and included demonstrations of multimodal long-context tasks such as full-video comprehension.

5. What does the documented price compression in long-context API pricing between 2023 and 2024 suggest about the economics of the context window race?

Correct. The roughly 90%+ drop in per-token pricing between 2023 and mid-2024 reflected both genuine infrastructure efficiency improvements and direct competitive pressure among providers. This compression is what transformed long-context AI from a research curiosity into a routine engineering tool.

Not quite. The rapid price compression — over 90% for some Claude tiers — reflected both genuine efficiency gains (hardware improvements, software optimizations) and competitive dynamics among Anthropic, OpenAI, and Google, each pricing to win developer adoption.

Lab 3: The Race in Historical Context

Discuss context window milestones, competitive dynamics, and what the race means for practitioners

Your Task

Use this lab to explore the competitive and practical dimensions of the context window race. Ask about the milestones, the "lost in the middle" problem, how benchmarks like Needle in a Haystack changed the conversation, or what these developments mean for how you build with AI.

Try asking: "What should I actually check before assuming a model's long context is reliable for my use case?" — or — "How did the 'lost in the middle' finding change how AI labs designed models?" — or — "Is a 1M-token context practically useful or mostly a marketing milestone?"

The Context Window Race · Lesson 4

Practical Implications: What Context Length Means for Your Work

How to reason about context windows when designing prompts, applications, and workflows

Knowing that larger context windows exist, how should they actually change the way you structure your interactions with AI?

In late 2023, a team of software engineers at Replit publicly described how they had rebuilt their AI coding assistant around Claude's 100,000-token context window. Their previous workflow chunked large codebases into segments and processed them separately, then tried to reconcile the results. The new workflow loaded the entire relevant codebase — sometimes 60,000 to 80,000 tokens — into a single context. The improvement in coherence was immediate. The model could now see a function call and the function definition it was calling in the same pass, rather than having to infer from partial views. Bug identification rates improved; proposed refactors became structurally coherent across file boundaries. The engineers were not using a more capable model. They were using the same model with enough context to actually see the problem.

That case is a useful frame for thinking about context windows as a practitioner: they are not just a number to note in a spec sheet, but a structural constraint that determines whether certain tasks are even possible in a given workflow design.

When Context Window Size Determines Feasibility

Some tasks are intrinsically context-constrained. Reviewing a full 200-page contract for clause conflicts requires that both conflicting clauses be simultaneously visible to the model. Summarizing a research paper requires seeing the entire paper. Debugging a multi-file codebase for a cross-module issue requires access to all relevant modules. Translating a novel while maintaining character consistency requires seeing earlier characterization when writing later chapters.

In each case, no amount of prompt engineering compensates for a context window that is simply too small to contain the relevant material. The only solutions are: use a model with a larger context window; chunk the material and accept the coherence limitations this introduces; or use retrieval-augmented generation (RAG) to pull in relevant chunks dynamically, accepting that the chunking logic may miss relevant context.

Identifying which situation you are in before you start building saves rework. A team that builds a RAG pipeline on the assumption that context windows are always insufficient may introduce unnecessary complexity and quality degradation for tasks that a sufficiently long context window would have handled directly.

The Retrieval vs. Full-Context Decision

Retrieval-Augmented Generation (RAG) was the dominant paradigm for handling long documents before 100,000+ token context windows became practical. In RAG, a document corpus is chunked into segments, those segments are embedded and stored in a vector database, and at query time only the most relevant chunks (by embedding similarity) are retrieved and inserted into the context. This allowed applications to work with arbitrarily large document collections that could never fit in any context window.
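The pipeline described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the `embed` function is a bag-of-words stand-in for a real embedding model, an in-memory list stands in for the vector database, and the sample lease text is invented.

```python
import math
import re
from collections import Counter


def chunk_sentences(text: str) -> list[str]:
    """Split text into one chunk per sentence (a deliberately simple chunker)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Invented sample document, for illustration only.
document = (
    "The lease term is twelve months. Rent is due on the first day of each month. "
    "The tenant may keep one cat. Late rent incurs a fee of fifty dollars. "
    "Either party may terminate with sixty days written notice."
)
chunks = chunk_sentences(document)
top = retrieve("When is rent due?", chunks, k=1)
# Only `top` would be inserted into the model's context, not the whole document.
```

Note what the model never sees in this design: every chunk that did not score well against the query. That is the source of both RAG's scalability and its characteristic failure mode.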

RAG remains essential for genuinely large-scale document retrieval — querying across thousands of documents, for example. But for tasks involving a single long document or a bounded set of documents, the emergence of 100k+ context windows changed the calculus. Loading the full document directly is simpler, eliminates chunking errors, and preserves structural relationships that chunk-level retrieval can sever (a clause on page 1 that qualifies a provision on page 80 may never appear in the same retrieved chunk).

The practical decision rule: if your total relevant material fits within the available context window at the cost you can accept, prefer full-context. If it exceeds the context window, or if cost at that token count is prohibitive, design a RAG or chunking strategy. Track this decision explicitly in your system design documentation, because it will affect every downstream quality and debugging consideration.
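Written as code, the rule is a two-step check. The sketch below is one possible encoding, assuming you already know your token count and your provider's input price; the function name, parameters, and example figures are hypothetical, and a fuller version would also reserve room for the system prompt and the output.

```python
def choose_strategy(material_tokens: int, context_window: int,
                    price_per_million_input: float,
                    max_cost_per_call: float) -> str:
    """Prefer full-context when the material fits and is affordable."""
    if material_tokens > context_window:
        return "rag_or_chunking"  # does not fit at all
    cost = material_tokens / 1_000_000 * price_per_million_input
    if cost > max_cost_per_call:
        return "rag_or_chunking"  # fits, but too expensive at this token count
    return "full_context"


# Hypothetical example: a 60,000-token contract, a 200,000-token window,
# $3.00 per million input tokens, and a $0.50 ceiling per call.
decision = choose_strategy(60_000, 200_000, 3.00, 0.50)
```

Recording the inputs to this check alongside the decision gives you something concrete to revisit when prices or context limits change.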

Position Matters

Given the documented "lost in the middle" effect, when you must place critical information in a long context, prefer placing it near the beginning or end of the context rather than in the middle. When writing system prompts, put the most critical instructions first and, if important, repeat key constraints near the end. This is not a workaround for a bug — it is working with the architecture's known attention patterns.
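One way to apply this when assembling a prompt programmatically is shown below. It is a minimal sketch: the section labels and the function name are arbitrary choices for illustration, not a format any provider requires.

```python
def build_prompt(critical_instructions: str, document: str, question: str) -> str:
    """Place critical instructions first and repeat them after the long document."""
    sections = [
        f"INSTRUCTIONS:\n{critical_instructions}",
        f"DOCUMENT:\n{document}",
        f"REMINDER (same instructions as above):\n{critical_instructions}",
        f"QUESTION:\n{question}",
    ]
    return "\n\n".join(sections)


prompt = build_prompt(
    critical_instructions="Answer only from the document. Quote clause numbers.",
    document="(long contract text goes here)",
    question="Which clauses address early termination?",
)
```

The long document sits in the middle, where a lapse costs least, while the instructions and the question occupy the positions the model attends to most reliably.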

Estimating Token Budgets in Practice

Developing a reliable intuition for token costs is a practical skill. The following rough estimates work for English-language prose with standard models:

A single-spaced page of dense text: approximately 500–700 tokens.
A typical 10-page PDF report: 5,000–7,000 tokens.
A 100-page legal document: 50,000–70,000 tokens.
A software repository of 50 Python files averaging 200 lines each: approximately 60,000–100,000 tokens, depending on comment density.
An hour of transcribed speech: approximately 8,000–12,000 tokens.
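These rules of thumb are easy to turn into a quick planning helper. The sketch below uses the 500–700 tokens-per-page figure from the estimates above; the function name and the 128,000-token window are hypothetical.

```python
def estimate_page_tokens(pages: int, low_per_page: int = 500,
                         high_per_page: int = 700) -> tuple[int, int]:
    """Return a (low, high) token range for `pages` pages of dense English prose."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * low_per_page, pages * high_per_page


low, high = estimate_page_tokens(100)  # the 100-page legal document
window = 128_000                       # hypothetical context window
worst_case_share = high / window       # share of the window at the high estimate
```

Planning against the high end of the range is the safer habit, since an underestimate is the error that causes overflows.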

For production applications, do not estimate — measure. OpenAI's tiktoken library is open-source and available as a Python package; it provides exact GPT-family token counts for any input. Anthropic provides a token counting API endpoint. Building token counting into your preprocessing pipeline catches context overflows before they produce silent failures.
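A preprocessing guard along these lines might look like the following sketch. It assumes the tiktoken package is installed and uses its `cl100k_base` encoding; if tiktoken is missing or its vocabulary cannot be loaded, the code falls back to a rough word-based estimate, which is an illustrative choice rather than a recommendation. Counts from tiktoken apply to GPT-family models only; other providers tokenize differently. The function names are hypothetical.

```python
def count_tokens(text: str) -> int:
    """Exact GPT-family count via tiktoken when available, else a rough estimate."""
    try:
        import tiktoken  # third-party package: pip install tiktoken
        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))
    except Exception:
        # tiktoken is not installed, or its vocabulary could not be loaded:
        # fall back to roughly 0.75 English words per token.
        return round(len(text.split()) / 0.75)


def ensure_fits(text: str, context_window: int) -> int:
    """Raise before sending if the text alone would overflow the window."""
    n = count_tokens(text)
    if n > context_window:
        raise ValueError(f"input is {n} tokens, window is {context_window}")
    return n
```

Failing loudly at this stage is the point: an exception in preprocessing is far easier to debug than a response that was quietly generated from a truncated input.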

Also account for the output. If you are asking a model to generate a 5,000-token summary and your input is 125,000 tokens on a 128,000-token model, the request needs 130,000 tokens in total, which is 2,000 more than the window allows. The model may truncate its output or behave unexpectedly. Budget for output tokens as well as input tokens.
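The arithmetic is worth making explicit in code. The sketch below assumes, as this lesson does, that input and output share a single window; the function name and the example figures are hypothetical.

```python
def max_output_tokens(context_window: int, input_tokens: int) -> int:
    """Tokens left for the model's reply once the input has been counted."""
    return max(context_window - input_tokens, 0)


# A 128,000-token window with 125,000 tokens of input.
room = max_output_tokens(128_000, 125_000)  # tokens left for the reply
can_write_5k_reply = 5_000 <= room           # False: the reply would not fit
```

Running this check against the longest reply you expect, rather than the average one, is what keeps occasional long outputs from being cut off.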

Context Windows and Agentic Systems

Agentic AI systems — those that perform multi-step tasks, use tools, and maintain state over extended operations — have a particularly acute relationship with context windows. In an agent loop, each tool call result, each intermediate reasoning step, and each prior action is typically appended to the context. A complex agent task that takes forty steps and uses three external tools may accumulate 30,000–50,000 tokens of intermediate state before producing a final answer.

This has two implications for agent design. First, the available context window constrains the number of steps an agent can take before its early actions scroll off its visible history — a form of procedural amnesia that can cause agents to repeat steps or lose track of constraints set early in the task. Second, very long agentic contexts are expensive to run, because each new token generated requires attending over the full accumulated history.
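The cost effect is easier to see with numbers. The sketch below assumes the simplest possible agent loop, in which every step re-reads the entire accumulated history and nothing is cached; the function name and the step sizes are hypothetical.

```python
def total_tokens_processed(steps: int, tokens_per_step: int,
                           initial_tokens: int = 0) -> int:
    """Sum of context sizes over all steps when each step re-reads the full history."""
    total = 0
    context = initial_tokens
    for _ in range(steps):
        context += tokens_per_step  # a new tool result or reasoning step is appended
        total += context            # the model reads the whole context on this step
    return total


# Forty steps adding 1,000 tokens each: the final context is 40,000 tokens,
# but the model has read far more than that in total.
forty = total_tokens_processed(steps=40, tokens_per_step=1_000)
eighty = total_tokens_processed(steps=80, tokens_per_step=1_000)
```

Under these assumptions, doubling the number of steps roughly quadruples the tokens processed, which is why long agent runs become expensive faster than their final context size suggests.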

Production agentic systems typically implement context compression strategies: periodically summarizing the accumulated history into a compact representation, retaining verbatim only the most recent N exchanges, or using structured memory stores that hold key facts externally and retrieve them as needed. Understanding the context window is a prerequisite for designing these compression strategies effectively.
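The first two strategies combine naturally: keep the most recent N messages verbatim and replace everything older with a summary. The following is a minimal sketch of that idea. The `placeholder_summarize` function is a hypothetical stand-in; a real system would ask a model to write the summary.

```python
def compress_history(messages: list[str], keep_recent: int, summarize) -> list[str]:
    """Keep the last `keep_recent` messages verbatim; summarize everything older."""
    if keep_recent <= 0:
        raise ValueError("keep_recent must be positive")
    if len(messages) <= keep_recent:
        return list(messages)  # nothing old enough to compress
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(older)] + recent


def placeholder_summarize(older: list[str]) -> str:
    """Hypothetical stand-in for a model-written summary."""
    return f"[summary of {len(older)} earlier messages]"


history = ["step 1", "step 2", "step 3", "step 4", "step 5"]
compressed = compress_history(history, keep_recent=2, summarize=placeholder_summarize)
```

A practical design usually keeps the original task instructions outside the summarized region entirely, so that constraints set at the start are never paraphrased away.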

Quick Reference

Before starting any AI-assisted task: (1) Estimate your total input token count. (2) Add your system prompt token count. (3) Add an estimate for expected output. (4) Compare the total to the model's context window. (5) If the total is above 80% of capacity, redesign, because a context close to the limit increases the risk of truncation and quality degradation. If you are well within the limit, consider whether RAG complexity is necessary at all.
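The checklist translates directly into a preflight function. This is a sketch of one way to encode it; the function name, the verdict labels, and the example figures are hypothetical, and the 80% threshold is the guideline from the checklist, not a hard technical limit.

```python
def preflight(input_tokens: int, system_prompt_tokens: int,
              expected_output_tokens: int, context_window: int,
              redesign_threshold: float = 0.80) -> dict:
    """Add up the token budget and compare it to the context window."""
    total = input_tokens + system_prompt_tokens + expected_output_tokens
    utilization = total / context_window
    if total > context_window:
        verdict = "over_limit"
    elif utilization > redesign_threshold:
        verdict = "redesign"
    else:
        verdict = "ok"
    return {"total_tokens": total, "utilization": utilization, "verdict": verdict}


# Hypothetical task: 90,000 input tokens, a 2,000-token system prompt,
# and a 5,000-token expected output on a 128,000-token model.
report = preflight(90_000, 2_000, 5_000, 128_000)
```

Logging the returned report for every request gives you a record of how close production traffic runs to the limit.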

Lesson 4 Quiz

Five questions · Select the best answer · Immediate feedback

1. Based on the documented Replit case, what was the primary benefit of switching to a full-context approach using Claude's 100,000-token window?

Correct. The Replit team's key insight was structural: by loading the full codebase into one context, the model could see relationships between code elements that chunked processing severed. The improvement was in coherence, not model capability per se.

Not quite. The Replit case illustrated a coherence benefit: the model could see a function call and its definition in the same context window, rather than processing them in separate chunks and losing the structural relationship. Speed and cost were not the primary drivers.

2. When is RAG (Retrieval-Augmented Generation) still preferable to loading full documents into a large context window?

Correct. RAG remains essential when material exceeds the context window or when cost makes large-context inference impractical. For tasks where a single bounded document set fits within the available window at acceptable cost, full-context loading is often simpler and more coherent.

Not quite. RAG remains valuable when document collections genuinely exceed context window capacity or when per-token costs make large contexts impractical. Neither approach is universally superior — the choice depends on the task, the material size, and the cost constraints.

3. Applying the "lost in the middle" finding practically: where should you place the most critical instructions in a long system prompt?

Correct. The documented pattern is that models attend most reliably to content at the beginning and end of a long context. Placing critical instructions first, and reinforcing them near the end, works with this architectural tendency rather than against it.

Not quite. Research consistently shows that middle-position content is retrieved less reliably. Critical instructions should go at the beginning of the context, and if especially important, they can be reiterated near the end as well.

4. A developer has a 128,000-token context window and a task that requires 125,000 tokens of input plus an expected 5,000-token output. What is the primary concern?

Correct. Both input and output tokens count toward the context window. With 125,000 input tokens and a 5,000-token expected output, the total of 130,000 tokens exceeds a 128,000-token limit. The model will likely truncate the output or behave unexpectedly. Always budget for output as well as input.

Not quite. Output tokens count against the context window just as input tokens do. The total of 125,000 (input) + 5,000 (output) = 130,000 tokens exceeds the 128,000-token limit. This will cause truncation or degraded behavior. Budget for both input and output.

5. In long agentic AI systems, why do engineers implement context compression strategies such as periodic summarization of accumulated history?

Correct. Agentic loops accumulate context with each step — tool results, reasoning traces, prior actions — until the window fills. Without compression, the agent suffers procedural amnesia (early constraints scroll off) and incurs rapidly increasing inference costs from attending over a growing history. Compression strategies preserve useful state while keeping the context manageable.

Not quite. In agent loops, each step adds to the context. Without compression, the context eventually fills, causing the agent to lose access to early instructions or decisions (procedural amnesia), and each new step becomes more expensive to compute as the attention history grows. Compression strategies address both problems.

Lab 4: Applying Context Window Thinking

Work through real workflow decisions with your AI lab assistant

Your Task

This is the applied lab for Module 1. Bring a real scenario from your own work — or use one of the prompts below — and work through the context window considerations with the AI assistant. The goal is to develop concrete judgment about when and how context limits affect your specific use case.

Try asking: "I want to use AI to review a 150-page contract — walk me through how I should think about context window choice for this task." — or — "How would I design an AI agent for a multi-step research task given what I now know about context accumulation?" — or — "What token budget should I build in for a customer support chatbot with a detailed system prompt?"

Module 1 Test

15 questions · 80% to pass · Covers all four lessons

1. A context window is best defined as:

Correct.

Not quite. The context window is the working memory of a single inference pass — the maximum tokens processed at once, separate from training data or architectural properties.

2. Approximately how many English words does one token represent, using GPT-family tokenizers?

Correct. Tokens are sub-word units averaging ~0.75 English words, so 1,000 words ≈ 1,333 tokens.

Not quite. Tokens average about 0.75 English words — they are smaller than words, not larger.

3. Which of these does NOT occupy space in a model's context window during a conversation?

Correct. Training data is baked into the model's weights — it does not appear as tokens in the context window during inference.

Not quite. The training data is encoded into model weights during training and does not appear as tokens in the context window. Everything else listed — system prompts, prior messages, pasted documents — does occupy context space.

4. Standard transformer attention scales with sequence length in what way?

Correct. Attention requires every token to compare with every other token: N × N = N² operations.

Not quite. Standard attention is quadratic: N tokens require N² attention computations.

5. FlashAttention, published in 2022, improved long-context inference by:

Correct. FlashAttention is an IO-aware implementation that computes exact attention in tiles that fit in fast SRAM, avoiding costly HBM reads and writes.

Not quite. FlashAttention computes exact (not approximate) attention, using the same number of operations as standard attention but reorganizing them into tiles that fit in fast on-chip memory.

6. Anthropic first expanded Claude's context window to 100,000 tokens in:

Correct. Anthropic announced the 100,000-token Claude context in May 2023, the first context window large enough to hold a complete novel.

Not quite. Anthropic's 100,000-token announcement came in May 2023, between the GPT-4 launch (March 2023) and GPT-4 Turbo (November 2023).

7. The "Lost in the Middle" research paper found that model performance on retrieval tasks was:

Correct. The paper documented consistent degradation in retrieval accuracy when target information appeared in the middle of a long context versus at the edges.

Not quite. The paper's central finding was that middle-position information was retrieved significantly less reliably than beginning- or end-position information.

8. Google's Gemini 1.5 Pro reached which context window size at its February 2024 announcement?

Correct. Gemini 1.5 Pro's one-million-token context was announced in February 2024 and was roughly five times the 200,000-token window of Claude 2.1.

Not quite. Gemini 1.5 Pro reached one million tokens — the largest announced context window at that date.

9. The Needle in a Haystack benchmark was developed by Greg Kamradt in what month and year?

Correct. Kamradt's Needle in a Haystack benchmark emerged in November 2023 and quickly became the standard for evaluating long-context quality.

Not quite. The benchmark was popularized in November 2023, following the "Lost in the Middle" paper and the GPT-4 Turbo / Claude long-context announcements.

10. A parameter count of 100 billion combined with a context window of 4,096 tokens means the model:

Correct. Parameters and context window are independent dimensions. Parameter count encodes learned knowledge; context window determines what can be attended to in a single inference pass.

Not quite. Parameters and context window are completely separate. No matter how many parameters a model has, it cannot attend to tokens outside its context window in a single pass.

11. For a standard English-language document, approximately how many tokens does a dense single-spaced page contain?

Correct. A dense single-spaced page of English prose typically runs 500–700 tokens — a useful reference point for estimating context requirements.

Not quite. A dense single-spaced page of English runs approximately 500–700 tokens. At ~1,333 tokens per 1,000 words, a 400-word page yields roughly 530 tokens.

12. RAG (Retrieval-Augmented Generation) remains preferable to full-context loading when:

Correct. RAG is the right choice when material exceeds the context window capacity or when the per-token cost of loading everything makes full-context inference impractical.

Not quite. The core decision criterion is whether the material fits within the available context window at acceptable cost. If it does, full-context loading is often simpler and more coherent. If it does not, RAG is necessary.

13. Byte-pair encoding (BPE) tokenization tends to be LESS efficient (more tokens per word) for:

Correct. Because tokenizers are trained primarily on English text, morphologically complex or less-represented languages tokenize less efficiently — the same semantic content requires more tokens than the equivalent English text.

Not quite. BPE tokenizers trained predominantly on English text are less efficient for morphologically rich or underrepresented languages. A Turkish or Finnish sentence may require twice as many tokens as the equivalent English content.

14. In an agentic AI system, why do early instructions sometimes get "forgotten" during a long multi-step task?

Correct. In agent loops, the growing context of tool results and intermediate steps eventually pushes early tokens out of the context window entirely. Once outside the window, those tokens are invisible to the model — not deprioritized, but truly absent.

Not quite. It is not that the model ignores early tokens — it literally cannot see them once they fall outside the context window. Each appended step brings the total closer to the limit, and early content eventually scrolls off entirely.

15. A developer has a 128,000-token model and plans a task with 90,000 tokens of input and 5,000 tokens of expected output. The recommended safe approach is to:

Correct. Total context consumption is input plus output: 90,000 + 5,000 = 95,000 tokens, which is about 74% of the 128,000-token window. This is within the safe range, but the developer should track it and ensure the system prompt (if any) is accounted for as well.

Not quite. Output tokens count against the context window. The correct check is total input + output = 95,000 tokens, which is ~74% of capacity — fine, but the full budget including any system prompt must be tracked. The 80% guideline exists because very high utilization increases risk of truncation and quality issues.