In 1876, when Alexander Graham Bell filed his telephone patent, the pressing engineering problem was not whether voice could travel over wire β it already could β but how far. Early telephone exchanges in Boston and New Haven covered a few city blocks. By 1915, AT&T's transcontinental line finally reached San Francisco, but only after engineers solved the problem of signal decay: how much information could be kept coherent across distance. The answer to that question reshaped commerce, journalism, and the command structures of armies. The technology was not new; its reach was.
The same pattern is now playing out in artificial intelligence, measured not in miles but in tokens. In 2020, GPT-3 could hold roughly 4,000 tokens in working memory β about three pages of text. By early 2023, Claude's initial release handled 9,000. By mid-2023, Anthropic pushed Claude to 100,000 tokens. Google's Gemini 1.5 Pro, announced in February 2024, demonstrated a one-million-token context window. The underlying model architectures did not change beyond recognition between those milestones. What changed β radically, consequentially β was reach.
This course is about understanding that race: what a context window actually is at a technical and practical level, why its size determines what tasks AI can and cannot perform, how engineers have expanded it, and what tradeoffs accumulate as it grows. The goal is not to make you a researcher but to make you a clear-eyed practitioner who knows why a 200-page contract poses different challenges than a 2-page one, and what to do about it. We will work from documented technical facts and real product histories, not speculation.
If you finish every module, here's who you become:
In March 2023, a group of lawyers at a New York firm were preparing for a complex contract dispute. They fed the full 312-page merger agreement into ChatGPT-4, which had launched days earlier with a 32,000-token context window. The model summarized the first forty pages faithfully. By page 200, it was contradicting its own earlier summaries. Clauses it had described as unconditional had quietly become conditional in the model's understanding, because the relevant qualifying language appeared earlier in the document than the definitions it modified β and that earlier material had, in effect, scrolled off the model's working memory. The lawyers caught it. Many users do not. The problem had a name by then: context overflow. It would drive one of the most intense engineering competitions in modern software history.
The episode illustrates something fundamental: a language model's capabilities are not simply a function of its training. They are also, acutely, a function of what it can attend to in a given moment. The context window is that moment β its length, its fidelity, and its limits.
A context window is the total amount of text β measured in tokens β that a language model can process in a single forward pass. Everything the model knows about your current conversation or document must fit inside this window. Nothing outside it exists, from the model's perspective, during inference.
The term "window" is apt. Think of sliding a physical reading frame across a very long scroll. The frame shows you some portion of the scroll with perfect clarity. Whatever lies outside the frame is, for the moment, invisible. The model's attention mechanisms β the computational heart of transformer architecture β operate on the tokens inside this frame and nothing else.
This is fundamentally different from how human memory works. You can ask a person to recall chapter three of a book they finished two weeks ago, and they will do so imperfectly but meaningfully. A language model has no such long-term episodic store during a single inference call. It has only what fits in the window.
Before context windows make sense, you need to understand tokens. A token is not a word. It is a chunk of text determined by a statistical process called byte-pair encoding (BPE), standardized independently for each model family. OpenAI's tiktoken library, used for GPT models, splits text into tokens that average roughly 0.75 words in English β meaning 1,000 words is approximately 1,333 tokens.
Common English words like "the," "is," and "run" are each one token. Rare words or technical terms may be split into several tokens: "tokenization" might become ["token", "ization"]. Whitespace, punctuation, and code syntax each consume tokens. A Python function with comments might tokenize at a higher rate than equivalent prose. Non-English languages often tokenize less efficiently β a sentence in Turkish or Finnish may require twice as many tokens as the same semantic content in English, because the tokenizer was trained predominantly on English text.
This matters practically. If you are working with a 128,000-token model and feeding it a legal document, you cannot simply count pages. You must estimate tokens. A dense 100-page PDF might consume 60,000 tokens; a lightly spaced one might consume 40,000. Tools like OpenAI's Tokenizer page and Anthropic's token counter allow direct inspection.
GPT-4's original 8,192-token context window holds approximately 6,000 words β roughly a long magazine article. GPT-4-32k holds about 24,000 words, or a short novella. Claude 3's 200,000-token context holds approximately 150,000 words β the length of War and Peace. Gemini 1.5 Pro's 1,000,000-token context holds roughly 750,000 words, equivalent to seven copies of that novel simultaneously.
The context window does not contain only your most recent message. It holds, in sequence, everything that has been exchanged in the current session: the system prompt (instructions given to the model before the conversation begins), all prior user messages, all prior assistant responses, any documents or code you have pasted in, and your current message. The model sees all of this as a single, ordered token sequence.
This has a non-obvious implication: every response the model generates consumes context space. A long assistant reply uses up tokens that are no longer available for new input. In long conversations, chat interfaces typically truncate or summarize earlier exchanges as the window fills. When that truncation occurs is often opaque to the user, and the behavior it causes β the model apparently "forgetting" things said early in the conversation β is frequently misattributed to the model being careless rather than to the hard constraint it is operating under.
System prompts are particularly important to understand here. In production applications β think a customer-support chatbot or a coding assistant β the system prompt may itself consume thousands of tokens before the user types a single character. A detailed system prompt of 4,000 tokens on a model with a 32,000-token context leaves only 28,000 tokens for the actual conversation. Developers who do not track this routinely hit limits they did not anticipate.
A model's parameter count (how much it "knows" from training) and its context window (how much it can see right now) are completely separate quantities. A very large model with a small context window cannot analyze a long document. A smaller model with a very large context window can β but may reason about it less accurately. Both dimensions matter, and neither substitutes for the other.
Use the chat below to explore what you just learned. The AI assistant is configured specifically for this lesson. Ask it at least three substantive questions about context windows, tokens, or the practical limits of working memory in language models.
Good starting points appear below β but follow your own curiosity. The lab completes after three exchanges.
When the original "Attention Is All You Need" paper was published by Vaswani et al. at Google Brain in June 2017, it introduced the transformer architecture that now underlies virtually every major language model. The paper's title was a declaration of design philosophy: replace recurrence with attention. What it did not advertise prominently was the cost buried in that choice. Attention scales quadratically with sequence length. Double the tokens in the context window, and the attention computation does not double β it quadruples. At the time, with sequences of a few hundred tokens, this was manageable. By 2022, when researchers wanted to push contexts into the tens of thousands, it became the central engineering problem of the field.
This is not an abstract concern. It translates directly into GPU hours, latency, and inference cost. A model responding to a 100,000-token prompt in 2023 was, by the physics of the computation, doing something qualitatively different β and far more expensive β than responding to a 4,000-token prompt. Understanding why is understanding the engine of the context window race.
In a transformer, every token in the context window must compare itself to every other token in the context window to determine how much "attention" to pay to it. If your context has N tokens, the model must compute N Γ N attention scores. This is the quadratic relationship: NΒ² computations.
At 4,096 tokens: 4,096Β² = approximately 16.8 million attention computations per layer. At 32,768 tokens: 32,768Β² = approximately 1.07 billion computations per layer. At 128,000 tokens: approximately 16.4 billion. And models have many layers β GPT-4's architecture, though not publicly confirmed, is estimated to have 96 transformer layers. The compounding is significant.
Beyond raw computation, the attention matrix must be stored in memory. At 128,000 tokens with 16-bit floating point precision, storing the full attention matrix for a single layer requires roughly 32 gigabytes. This exceeds the VRAM capacity of a single high-end GPU. Long-context inference therefore requires either distributing computation across many GPUs or using techniques to avoid materializing the full attention matrix β which is exactly what the engineering innovation in this space has focused on.
In June 2022, Tri Dao, Dan Fu, Stefano Ermon, and Atri Rudra at Stanford published FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. The paper identified that the bottleneck in attention computation was not the floating-point operations per se but the movement of data between GPU high-bandwidth memory (HBM) and the much faster on-chip SRAM. The attention matrix was being written to and read from slow memory repeatedly.
FlashAttention reorganized the computation into tiles β computing attention in small blocks that fit entirely in fast SRAM, never materializing the full attention matrix in slow memory. The mathematical result was identical; the memory footprint and speed were dramatically improved. FlashAttention 2, published in 2023, refined parallelism further and achieved roughly 9Γ speedup over standard attention on A100 GPUs. It became the de facto standard attention implementation for virtually every major model by 2023, and it is a significant reason why context windows that were impractical in 2021 became routine by 2024.
When a model provider charges per token, the quadratic cost of attention is part of what you are paying for. Long-context inference is genuinely more expensive to run than short-context inference β not linearly more, but superlinearly. A 200,000-token prompt costs substantially more than ten 20,000-token prompts containing the same total information, because the attention cost of the full sequence dwarfs the sum of the parts. Budget accordingly when designing systems that use very long contexts.
FlashAttention is an implementation optimization β it computes the same full quadratic attention more efficiently. A parallel line of research attacks the quadratic problem by changing what is computed. Sparse attention methods allow each token to attend only to a structured subset of other tokens rather than all of them, reducing the NΒ² computation toward N log N or even N.
OpenAI's 2019 Sparse Transformer demonstrated this principle. Subsequent work by Google (Longformer, BigBird) and others developed practical sparse attention patterns including sliding window attention (each token attends to its local neighborhood), global attention (a small set of designated tokens attend to everything), and random attention (random token pairs). These approaches made processing very long sequences computationally tractable but introduced a new question: which tokens actually need to attend to which others? Getting that wrong means the model misses relationships it should see.
By 2024, the dominant engineering direction was not sparse attention but rather combining FlashAttention with positional encoding improvements β specifically, techniques like Rotary Position Embedding (RoPE) with extended frequency scaling β to allow standard attention to generalize to sequence lengths longer than those seen in training without catastrophic degradation. This is how Llama 3's context was extended from 8,192 to 128,000 tokens in successive releases.
The context window race is not simply about making numbers bigger. Every increase in context length requires either accepting higher compute costs, accepting architectural compromises that may affect quality, or discovering new engineering techniques that change the tradeoff curve. The history of the race is the history of researchers finding ways to move that curve β and then model providers deciding how much of the remaining cost to absorb.
This lab is about the engineering reality behind expanding context windows. Use the chat to explore quadratic scaling, FlashAttention, sparse attention, and their practical implications. Ask at least three substantive questions to complete the lab.
On May 28, 2024, Google announced that Gemini 1.5 Pro was available to developers with a one-million-token context window β enough to process roughly eleven hours of audio, one hour of video, 30,000 lines of code, or 700,000 words of text. Google framed this not as an incremental improvement but as a category shift. The announcement included a demonstration: Gemini 1.5 Pro successfully located a specific scene in a 45-minute film given only a hand-drawn sketch as reference β a task requiring genuine comprehension of the entire video. What is instructive about this milestone is not just that it happened, but the path that led there: a series of discrete steps, each driven by a combination of technical breakthrough and competitive pressure, spanning just four years.
2020 β GPT-3: 2,048 tokens. OpenAI's GPT-3, released in May 2020 with 175 billion parameters, used a 2,048-token context window. This was standard for the era. The model could handle a few pages of text β enough for most single-document tasks but insufficient for sustained document analysis or multi-document reasoning.
2022 β Anthropic Claude (beta): 9,000 tokens; AI21 Jurassic-2: up to 8,192 tokens. As the transformer ecosystem matured following the publication of FlashAttention, context windows began expanding. Anthropic's initial private beta of Claude used 9,000 tokens, already more than GPT-3.5's 4,096-token public API. AI21 Labs' Jurassic-2 offered up to 8,192 tokens.
March 2023 β GPT-4 launch: 8,192 and 32,768 tokens. OpenAI launched GPT-4 with two context configurations: 8k and 32k tokens. The 32k version could handle approximately 24,000 words β a substantial document β but was priced significantly higher and initially limited to select partners. The same week, Anthropic launched Claude 1 with a 9,000-token context.
May 2023 β Anthropic Claude: 100,000 tokens. Anthropic announced Claude's context window expansion to 100,000 tokens. This was a genuine inflection point. 100,000 tokens represents approximately 75,000 words β the length of a typical novel. For the first time, users could load an entire book, a full codebase, or a year's worth of meeting transcripts into a single context. Anthropic demonstrated this by having Claude analyze the entire text of The Great Gatsby in one pass.
November 2023 β GPT-4 Turbo: 128,000 tokens. OpenAI responded at its first DevDay conference on November 6, 2023, announcing GPT-4 Turbo with a 128,000-token context window at significantly reduced pricing. The context expansion was paired with a knowledge cutoff extension to April 2023.
February 2024 β Google Gemini 1.5 Pro: 1,000,000 tokens. Google's announcement in February 2024 of Gemini 1.5 Pro represented a full order-of-magnitude jump over the previous leaders. The architecture underlying this milestone used a Mixture of Experts (MoE) approach combined with advances in positional encoding, enabling in-context learning at scales previously thought to require fine-tuning.
2024 onward β Claude 3 (200k), Llama 3.1 (128k), Gemini 1.5 Flash (1M), GPT-4o (128k). By mid-2024, 100,000+ token context windows had become table stakes for frontier models, and the competition shifted from raw context length toward quality at length β whether models actually use long contexts reliably, rather than simply accepting them.
A landmark paper published by researchers at Stanford, UC Berkeley, and Samaya AI in July 2023 β titled "Lost in the Middle: How Language Models Use Long Contexts" β documented a critical quality problem that context length milestones were obscuring. The researchers found that when relevant information was placed in the middle of a long context, model performance degraded significantly compared to when the same information was placed at the beginning or end.
Specifically, across tested models, retrieving information from the middle of a 20-document context performed roughly 10β20 percentage points worse than retrieval from the edges of the same context. This was observed across multiple models and configurations. The implication: a model advertised as supporting 100,000 tokens does not necessarily use those tokens with equal fidelity throughout. Accepting a long document and reasoning reliably over all of it are different capabilities.
This finding reframed the competitive conversation. By late 2023 and into 2024, benchmark suites specifically designed to test retrieval from arbitrary positions in long contexts β such as the Needle in a Haystack test, popularized by Greg Kamradt in November 2023 β became the de facto standard for evaluating long-context quality rather than mere length.
The Needle in a Haystack benchmark, developed by Greg Kamradt and widely adopted in late 2023, evaluates a model by hiding a specific fact (the "needle") at various positions within a long document (the "haystack") and asking the model to retrieve it. Performance is mapped as a heatmap across document depth and context length. Early results showed striking patterns: Claude 1 and GPT-4 struggled at document depths above 70%, while Claude 2.1 showed improvement but still degraded near 80% depth. Claude 3 models, announced in March 2024, achieved near-perfect Needle in a Haystack scores across 200,000 tokens β the first publicly demonstrated result of this quality at that scale.
The context window race was not driven by a single technical breakthrough but by the convergence of several factors: FlashAttention lowering the implementation cost, improved positional encoding techniques allowing models to generalize to longer sequences, hardware improvements (A100 and H100 GPUs offering higher VRAM and bandwidth), and direct competitive pressure between Anthropic, OpenAI, and Google, each monitoring the others' releases closely.
Pricing also shaped the race. Long context inference costs more to serve, but the models with the longest context were initially priced at premiums that made routine use impractical. The period from 2023 to 2024 saw a rapid compression in per-token pricing β Claude's API pricing dropped by over 90% between initial release and 2024 pricing tiers β driven by a mix of infrastructure efficiency and competitive undercutting. By mid-2024, processing a 100,000-token context cost approximately $0.30 on Anthropic's Haiku model, down from prices that would have exceeded $15 for equivalent tokens on 2023 models.
Use this lab to explore the competitive and practical dimensions of the context window race. Ask about the milestones, the "lost in the middle" problem, how benchmarks like Needle in a Haystack changed the conversation, or what these developments mean for how you build with AI.
In late 2023, a team of software engineers at Replit publicly described how they had rebuilt their AI coding assistant around Claude's 100,000-token context window. Their previous workflow chunked large codebases into segments and processed them separately, then tried to reconcile the results. The new workflow loaded the entire relevant codebase β sometimes 60,000 to 80,000 tokens β into a single context. The improvement in coherence was immediate. The model could now see a function call and the function definition it was calling in the same pass, rather than having to infer from partial views. Bug identification rates improved; proposed refactors became structurally coherent across file boundaries. The engineers were not using a more capable model. They were using the same model with enough context to actually see the problem.
That case is a useful frame for thinking about context windows as a practitioner: they are not just a number to note in a spec sheet, but a structural constraint that determines whether certain tasks are even possible in a given workflow design.
Some tasks are intrinsically context-constrained. Reviewing a full 200-page contract for clause conflicts requires that both conflicting clauses be simultaneously visible to the model. Summarizing a research paper requires seeing the entire paper. Debugging a multi-file codebase for a cross-module issue requires access to all relevant modules. Translating a novel while maintaining character consistency requires seeing earlier characterization when writing later chapters.
In each case, no amount of prompt engineering compensates for a context window that is simply too small to contain the relevant material. The only solutions are: use a model with a larger context window; chunk the material and accept the coherence limitations this introduces; or use retrieval-augmented generation (RAG) to pull in relevant chunks dynamically, accepting that the chunking logic may miss relevant context.
Understanding which situation you are in before you start building is significant. A team that builds a RAG pipeline assuming context windows are always insufficient may be introducing unnecessary complexity and quality degradation for tasks where a sufficiently long context window would have worked straightforwardly.
Retrieval-Augmented Generation (RAG) was the dominant paradigm for handling long documents before 100,000+ token context windows became practical. In RAG, a document corpus is chunked into segments, those segments are embedded and stored in a vector database, and at query time only the most relevant chunks (by embedding similarity) are retrieved and inserted into the context. This allowed applications to work with arbitrarily large document collections that could never fit in any context window.
RAG remains essential for genuinely large-scale document retrieval β querying across thousands of documents, for example. But for tasks involving a single long document or a bounded set of documents, the emergence of 100k+ context windows changed the calculus. Loading the full document directly is simpler, eliminates chunking errors, and preserves structural relationships that chunk-level retrieval can sever (a clause on page 1 that qualifies a provision on page 80 may never appear in the same retrieved chunk).
The practical decision rule: if your total relevant material fits within the available context window at the cost you can accept, prefer full-context. If it exceeds the context window, or if cost at that token count is prohibitive, design a RAG or chunking strategy. Track this decision explicitly in your system design documentation, because it will affect every downstream quality and debugging consideration.
Given the documented "lost in the middle" effect, when you must place critical information in a long context, prefer placing it near the beginning or end of the context rather than in the middle. When writing system prompts, put the most critical instructions first and, if important, repeat key constraints near the end. This is not a workaround for a bug β it is working with the architecture's known attention patterns.
Developing a reliable intuition for token costs is a practical skill. The following rough estimates work for English-language prose with standard models:
A single-spaced page of dense text: approximately 500β700 tokens. A typical 10-page PDF report: 5,000β7,000 tokens. A 100-page legal document: 50,000β70,000 tokens. A software repository of 50 Python files averaging 200 lines each: approximately 60,000β100,000 tokens depending on comment density. An hour of transcribed speech: approximately 8,000β12,000 tokens.
For production applications, do not estimate β measure. OpenAI's tiktoken library is open-source and available as a Python package; it provides exact GPT-family token counts for any input. Anthropic provides a token counting API endpoint. Building token counting into your preprocessing pipeline catches context overflows before they produce silent failures.
Also account for the output. If you are asking a model to generate a 2,000-token summary and your input is 125,000 tokens on a 128,000-token model, the model may truncate or behave unexpectedly when it generates output that would push the total beyond the limit. Budget for output tokens as well as input tokens.
Agentic AI systems β those that perform multi-step tasks, use tools, and maintain state over extended operations β have a particularly acute relationship with context windows. In an agent loop, each tool call result, each intermediate reasoning step, and each prior action is typically appended to the context. A complex agent task that takes forty steps and uses three external tools may accumulate 30,000β50,000 tokens of intermediate state before producing a final answer.
This has two implications for agent design. First, the available context window constrains the number of steps an agent can take before its early actions scroll off its visible history β a form of procedural amnesia that can cause agents to repeat steps or lose track of constraints set early in the task. Second, very long agentic contexts are expensive to run, because each new token generated requires attending over the full accumulated history.
Production agentic systems typically implement context compression strategies: periodically summarizing the accumulated history into a compact representation, retaining verbatim only the most recent N exchanges, or using structured memory stores that store key facts externally and retrieve them as needed. Understanding the context window is prerequisite to designing these compression strategies effectively.
Before starting any AI-assisted task: (1) Estimate your total input token count. (2) Add your system prompt token count. (3) Add an estimate for expected output. (4) Compare to the model's context window. (5) If you are above 80% of capacity, redesign β context that is close to the limit increases the risk of truncation and quality degradation. If you are well within limit, consider whether RAG complexity is necessary at all.
This is the applied lab for Module 1. Bring a real scenario from your own work β or use one of the prompts below β and work through the context window considerations with the AI assistant. The goal is to develop concrete judgment about when and how context limits affect your specific use case.