When OpenAI released GPT-3 to select API partners in June 2020, the model arrived with a hard ceiling: 2,048 tokens of combined prompt and completion. At roughly 1,500 words of English text, this meant a model that could write a sharp magazine article but could not hold an entire short story in its head at once.
Developers immediately felt the constraint. Writing assistants had to split long documents into chunks. Summarization pipelines truncated legal briefs mid-sentence. Code generation tools lost context after a few hundred lines. The architecture was brilliant; the ceiling was brutal.
A token is not simply a word. OpenAI's tokenizer splits text into subword units: "running" might be one token, while "unbelievable" might be three. In practice, English text runs at roughly 0.75 words per token, so 2,048 tokens ≈ 1,500 words, or about six pages of double-spaced text.
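The conversion above can be sketched in a few lines. This is only the 0.75 words-per-token rule of thumb from the text, not a real tokenizer; the function names are illustrative.

```python
import math

# Rule of thumb from the text; real ratios vary by tokenizer and by content.
WORDS_PER_TOKEN = 0.75


def tokens_to_words(tokens: int) -> int:
    """Estimate how many English words fit in a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def words_to_tokens(words: int) -> int:
    """Estimate how many tokens a given number of English words will consume."""
    return math.ceil(words / WORDS_PER_TOKEN)


if __name__ == "__main__":
    print(tokens_to_words(2048))   # words that fit in GPT-3's window
    print(words_to_tokens(4000))   # tokens needed for a 4,000-word document
```

For exact counts a real tokenizer is needed; the estimate is only good enough for rough capacity planning.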
The 2,048 limit covered the entire context: system instructions, the conversation so far, and the model's response. If a developer's system prompt consumed 300 tokens, only 1,748 remained for user input plus output. Every token spent on context instructions was a token unavailable for content.
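A minimal sketch of this budgeting arithmetic follows. The function and parameter names are hypothetical, and the token counts are assumed to come from a tokenizer run elsewhere.

```python
CONTEXT_LIMIT = 2048  # GPT-3's combined prompt + completion ceiling


def remaining_budget(system_tokens: int,
                     history_tokens: int = 0,
                     reserved_for_output: int = 0,
                     limit: int = CONTEXT_LIMIT) -> int:
    """Return the tokens left for new user input after fixed costs are paid."""
    used = system_tokens + history_tokens + reserved_for_output
    remaining = limit - used
    if remaining < 0:
        raise ValueError(f"budget exceeded by {-remaining} tokens")
    return remaining


if __name__ == "__main__":
    # The example from the text: a 300-token system prompt.
    print(remaining_budget(system_tokens=300))
    # Reserving 500 tokens for the reply shrinks the input budget further.
    print(remaining_budget(system_tokens=300, reserved_for_output=500))
```

Reserving output tokens up front is a design choice: it keeps a long input from leaving the model no room to answer.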
The number traces to practical GPU memory constraints in 2020-era hardware. Transformer attention is quadratic: doubling the context length roughly quadruples the memory required for the attention matrix. At the scale of GPT-3's 175 billion parameters, 2,048 tokens was close to the feasible maximum on the V100-class GPUs the model was trained on.
Early API adopters built elaborate workarounds. Summarize-then-extend pipelines compressed earlier sections of a document into brief summaries before passing new sections to the model, at the cost of fidelity. Sliding window approaches fed the model overlapping chunks, hoping for coherent joins. Neither was reliable.
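The sliding window approach can be sketched as below. This is an illustrative version, not any particular product's implementation; it treats a list of items as stand-in tokens, and the `window` and `overlap` values are assumptions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(tokens: Sequence[T], window: int, overlap: int) -> List[List[T]]:
    """Split tokens into chunks of at most `window` items.

    Consecutive chunks share `overlap` items so each chunk carries a little
    of the preceding context.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap
    chunks: List[List[T]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last chunk already reaches the end of the input
        start += step
    return chunks


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog again".split()
    for chunk in sliding_windows(words, window=4, overlap=1):
        print(chunk)
```

The overlap only carries a few items of shared context across each join, which is why definitions introduced much earlier in a document were still lost.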
Copy.ai, one of the first GPT-3-powered writing products (launched October 2020), limited its "Blog Post Wizard" output to short sections precisely because generating anything longer risked context overflow. Users received disconnected paragraphs that required manual stitching.
Legal-tech startups attempting to summarize contracts discovered that a standard non-disclosure agreement, often 3,000–4,000 words, had to be chopped into segments. Clauses at the boundary of a chunk were frequently misrepresented in summaries because the model lost the preceding definitional context.
In August 2021 OpenAI released Codex, the model powering GitHub Copilot. Codex shipped with a 4,096-token context, double GPT-3's limit. The reason was domain-specific: code files are long, function signatures must be visible when generating bodies, and repository context matters. OpenAI made a deliberate choice to extend context for code because the use case demanded it.
Copilot's technical preview (launched June 2021, public beta September 2021) immediately showed what an extended window could do: the model could "see" an entire small Python file and generate a new function consistent with existing class structure. This was qualitatively different from anything possible at 2,048 tokens.
The jump from 2,048 to 4,096 tokens was not primarily a model capability improvement; it was a memory and engineering decision. The underlying transformer architecture was capable; the constraint was always hardware and inference cost, not fundamental model intelligence.
You're consulting for a 2021-era startup building a legal document summarizer on GPT-3's API. The context window is 2,048 tokens. Explore the trade-offs with your AI assistant: ask about chunking strategies, token budgeting, and what gets lost when context is exceeded.
On March 14, 2023, OpenAI released GPT-4 with two context variants: a standard 8,192-token version and a limited-access 32,768-token version. The same day, Anthropic released Claude 1 with a 9,000-token context. Almost overnight, the industry's working definition of a "large" context window had moved fourfold.
The numbers felt abstract until practitioners tried them. A GPT-4-32K session could hold a full academic paper. A Claude 1 session could process a lengthy legal agreement without chunking. Developers who had spent two years engineering elaborate splitting pipelines suddenly found those pipelines obsolete.
OpenAI's decision to launch GPT-4 with two context sizes was deliberate. The standard 8,192-token window was available to all API customers; the 32,768-token window ("GPT-4-32K") was restricted to select partners and priced at roughly double the per-token rate. The price difference reflected real inference cost: the larger attention matrix required substantially more GPU memory per request.
The 8K window was four times GPT-3's limit and sufficient for most professional documents. At 8,192 tokens (≈6,100 words), it could process a long feature article, a short legal agreement, or a multi-turn conversation spanning dozens of exchanges, all without chunking.
Anthropic's Claude 1, released on March 14, 2023 (the same day as GPT-4), offered a 9,000-token context, slightly larger than GPT-4's standard tier. Anthropic's public technical documentation emphasized that Claude was trained with Constitutional AI methods, but less discussed was the engineering effort behind maintaining coherence across a 9K window.
The model performed notably better on tasks that required holding earlier information in mind, such as cross-referencing a document's definitions against later clauses. This suggested that context window size and context utilization were distinct: a model could technically accept 9,000 tokens but lose meaningful information from the beginning of a long prompt before reaching the end.
A Stanford NLP research paper first posted as a preprint in July 2023 (Liu et al., "Lost in the Middle: How Language Models Use Long Contexts") documented a consistent pattern: models performed significantly better when critical information was placed at the beginning or end of the context window, and significantly worse when it was buried in the middle. A longer window did not guarantee proportionally better performance; position within the window mattered enormously.
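One mitigation practitioners adopted in response to this finding is to reorder passages so the most relevant ones sit at the edges of the prompt. The sketch below is an illustration of that idea, not a method taken from the paper; it assumes the passages arrive already ranked from most to least relevant by some retrieval step not shown here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def order_for_edges(ranked: Sequence[T]) -> List[T]:
    """Place the most relevant items at the start and end of the prompt.

    `ranked` is assumed to be sorted from most to least relevant. Items are
    dealt alternately to the front and the back, so the least relevant ones
    end up in the middle of the returned list.
    """
    front: List[T] = []
    back: List[T] = []
    for i, item in enumerate(ranked):
        if i % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    return front + back[::-1]


if __name__ == "__main__":
    # 1 is the most relevant passage, 5 the least.
    print(order_for_edges([1, 2, 3, 4, 5]))
```

Whether this helps depends on the model; it is a cheap experiment rather than a guarantee.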
With GPT-4-32K, a new class of task became feasible without engineering workarounds. A 32,768-token window holds approximately 24,000 English words, roughly half a full-length novel or a 90-page academic dissertation. Real documented uses that emerged in 2023 included:
Full-contract review: Law firms running GPT-4-32K on merger agreements (typically 40–80 pages) without splitting. The model could cross-reference defined terms from page 2 while analyzing obligations on page 60.
Extended code analysis: Software teams feeding entire Python modules (2,000–3,000 lines) into a single prompt to detect inconsistencies across functions, something the Codex 4K window made impossible.
Medical literature synthesis: Research groups feeding multiple clinical trial abstracts simultaneously and requesting comparative analysis, removing the summarize-then-compare pipeline that introduced errors.
In March 2023, OpenAI priced GPT-4-8K at $0.03 per 1K prompt tokens and GPT-4-32K at $0.06 per 1K prompt tokens, a 2× premium for the larger window. This pricing architecture made context window size an explicit economic variable for the first time, forcing teams to quantify exactly how much extra context was worth per task.
It's mid-2023. You're a product manager deciding whether to pay the 2× premium for GPT-4-32K or optimize your pipeline for the 8K standard. Explore the trade-offs with your AI assistant: discuss which use cases justify 32K, how to structure prompts to avoid "Lost in the Middle" failure modes, and when chunking remains the smarter choice.
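To see how that premium plays out per request, here is a small sketch using only the prompt-token prices quoted above. Completion-token pricing is left out because the text does not give it, and the dictionary keys are illustrative labels rather than official model identifiers.

```python
# Prompt-token prices and window sizes as quoted in the text (March 2023).
PROMPT_PRICE_PER_1K = {"gpt-4-8k": 0.03, "gpt-4-32k": 0.06}
CONTEXT_LIMIT = {"gpt-4-8k": 8_192, "gpt-4-32k": 32_768}


def prompt_cost(model: str, prompt_tokens: int) -> float:
    """Return the prompt-side cost in USD for a single request."""
    if prompt_tokens > CONTEXT_LIMIT[model]:
        raise ValueError(f"{prompt_tokens} tokens does not fit in {model}")
    return prompt_tokens / 1000 * PROMPT_PRICE_PER_1K[model]


if __name__ == "__main__":
    # The same 6,000-token prompt costs twice as much on the 32K tier.
    print(prompt_cost("gpt-4-8k", 6_000))
    print(prompt_cost("gpt-4-32k", 6_000))
    # A 24,000-token prompt only fits the 32K tier.
    print(prompt_cost("gpt-4-32k", 24_000))
```

The practical takeaway is that the 32K tier only pays for itself when the prompt would not fit in 8K at all.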
On May 11, 2023, Anthropic released Claude 1.3 with a context window of 100,000 tokens: approximately 75,000 words, or a full-length novel. The announcement was met with some skepticism: prior model generations had demonstrated that holding a large context and using it reliably were separate problems.
Anthropic's technical team documented a specific test: feeding the model the entire text of the novel The Great Gatsby (about 72,000 tokens), then asking questions about specific passages. The model answered accurately, including locating specific sentences. The 100K window was not just bigger; it appeared to be usable throughout its range.
To make the number concrete: 100,000 tokens at the standard 0.75 words-per-token ratio ≈ 75,000 words. This is sufficient to hold:
- The complete text of a standard merger agreement (typically 40–80 pages, 20,000–40,000 words)
- A full software codebase for a small application (10,000–30,000 lines of Python)
- An entire novel (Gatsby: 47,094 words; most genre fiction: 70,000–100,000 words)
- Hundreds of pages of financial filings or clinical trial reports
For the first time, a model could process an entire professional-scale document in a single inference call with no splitting, no summarization, no pipeline: just input and output.
Anthropic's May 2023 announcement described loading the full text of The Great Gatsby with a single line altered to say that Mr. Carraway was a software engineer working on machine learning tooling at Anthropic, then asking the model to spot what was different. The model identified the change, providing early evidence that the 100K window was usable throughout, not just at the edges. This informal test anticipated the formal "needle-in-a-haystack" benchmarks later used across the industry.
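The general shape of a needle-in-a-haystack check can be sketched as follows. This is an illustrative harness, not Anthropic's or any benchmark's actual code: the needle sentence, the question format, and the substring-based scoring are all assumptions, and the call to a model is deliberately left out.

```python
from typing import List


def insert_needle(haystack: List[str], needle: str, depth: float) -> List[str]:
    """Insert `needle` into a list of sentences at a relative depth.

    depth=0.0 puts the needle first, depth=1.0 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(haystack))
    return haystack[:index] + [needle] + haystack[index:]


def build_prompt(haystack: List[str], needle: str, depth: float, question: str) -> str:
    """Assemble the long document plus the retrieval question."""
    document = " ".join(insert_needle(haystack, needle, depth))
    return f"{document}\n\nQuestion: {question}"


def found(response: str, expected: str) -> bool:
    """Crude scoring: does the model's reply contain the expected phrase?"""
    return expected.lower() in response.lower()


if __name__ == "__main__":
    sentences = [f"Filler sentence number {i}." for i in range(8)]
    needle = "The access code is 4417."  # hypothetical planted fact
    prompt = build_prompt(sentences, needle, depth=0.5,
                          question="What is the access code?")
    print(prompt)
```

Sweeping `depth` from 0.0 to 1.0 at several document lengths shows whether recall holds throughout the window or only near its edges.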
The immediate practical consequence was the obsolescence of entire engineering categories that had dominated AI product development since 2020. Three specific pipelines became less necessary overnight:
MapReduce summarization, the practice of summarizing document chunks independently and then summarizing the summaries, had been a standard LangChain workflow. With 100K context, the summarization could happen on the full document, eliminating information loss at chunk boundaries.
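The MapReduce pattern itself is simple enough to sketch without any framework. The code below is not LangChain's API; `summarize` is a placeholder for whatever model call a pipeline would make, and `first_sentence` is a toy stand-in used only so the example runs offline.

```python
from typing import Callable, List


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk independently, then summarize the summaries."""
    if not chunks:
        return ""
    partials = [summarize(chunk) for chunk in chunks]  # map step
    if len(partials) == 1:
        return partials[0]
    return summarize("\n".join(partials))              # reduce step


def first_sentence(text: str) -> str:
    """Toy summarizer: keep only the first sentence of the input."""
    return text.split(".")[0].strip() + "."


if __name__ == "__main__":
    chunks = [
        "Alpha is defined. More text.",
        "Beta is obligated. More text.",
    ]
    print(map_reduce_summarize(chunks, first_sentence))
```

Even with this toy summarizer the weakness is visible: the second chunk's content disappears in the reduce step, which is the kind of loss a single full-document pass avoids.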
Semantic chunking, which splits documents at semantically coherent boundaries rather than arbitrary token counts, had been an active research area. Tools like LlamaIndex had built entire product value propositions around intelligent chunking. The 100K window dramatically reduced the contexts in which chunking was necessary.
Vector database retrieval as the primary context management strategy faced reconsideration. When a full document could fit in context, the trade-off between embedding-based retrieval (fast, imprecise) and full-context analysis (slow, comprehensive) shifted toward full-context for high-stakes professional tasks.
The 100K window came with real costs. Processing a full 100K context was substantially slower than a 4K or 8K request: early tests reported latencies of 30–60 seconds for full-context inference, compared to 2–5 seconds for short prompts. API pricing reflected the computational cost: Anthropic's Claude 100K pricing at launch was $11.02 per million prompt tokens (compared to OpenAI's GPT-4-8K at $30 per million prompt tokens at the time, making Claude 100K relatively competitive despite the larger window).
The latency issue was not trivial. A 45-second response time is acceptable for a lawyer reviewing a contract overnight but unacceptable for a customer service chatbot. The 100K window opened new categories of batch, asynchronous AI tasks while leaving real-time interactive use cases better served by smaller, faster contexts.
Researchers at Anthropic and elsewhere noted in 2023 that even with a 100K context, retrieval-augmented generation (RAG) retained significant advantages for certain task types: when relevant information constituted only a tiny fraction of a very large corpus, retrieving the relevant chunks remained faster and cheaper than stuffing the full corpus into context. The 100K window did not eliminate RAG; it shifted the economic threshold at which RAG became preferable.
It's mid-2023 and you're a legal-tech engineer evaluating whether to rebuild your contract analysis pipeline around Claude's 100K context. Your current system uses a MapReduce summarization pipeline with LangChain. Explore the trade-offs with your AI assistant: when is full-context analysis worth its latency cost, and when should you keep chunking?
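A rough way to reason about that threshold is to compare prompt-token costs for the two strategies. The sketch below uses the $11.02 per million prompt tokens figure quoted earlier for Claude 100K; the chunk size and `top_k` values are hypothetical, and embedding, storage, and completion costs are ignored.

```python
from typing import Dict, Optional

PRICE_PER_MILLION_PROMPT_TOKENS = 11.02  # Claude 100K launch price from the text
WINDOW_TOKENS = 100_000


def prompt_cost_usd(tokens: int) -> float:
    """Prompt-side cost in USD at the quoted per-million rate."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_PROMPT_TOKENS


def compare_strategies(corpus_tokens: int, chunk_tokens: int, top_k: int) -> Dict[str, Optional[float]]:
    """Compare sending the whole corpus against sending only top_k retrieved chunks."""
    rag_tokens = min(corpus_tokens, chunk_tokens * top_k)
    fits = corpus_tokens <= WINDOW_TOKENS
    return {
        "full_context_cost": prompt_cost_usd(corpus_tokens) if fits else None,
        "rag_cost": prompt_cost_usd(rag_tokens),
    }


if __name__ == "__main__":
    # A 90K-token contract fits, but costs far more per query than 8 chunks.
    print(compare_strategies(corpus_tokens=90_000, chunk_tokens=500, top_k=8))
    # A 5M-token corpus cannot be sent whole at all.
    print(compare_strategies(corpus_tokens=5_000_000, chunk_tokens=500, top_k=8))
```

Cost is only one axis; the comparison says nothing about answer quality, which is where full-context analysis earns its premium on high-stakes documents.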
On February 15, 2024, Google DeepMind previewed Gemini 1.5 Pro with a context window of 1,000,000 tokens. The research preview announcement included a specific demonstration: the model was fed the entire source code of a video game and asked to locate and explain a specific bug. It found it. The model was also fed 11 hours of audio and asked about a specific 30-second segment. It answered accurately.
The jump from 100,000 to 1,000,000 tokens was not incremental. It required a fundamentally different architectural approach, specifically Google's implementation of Mixture-of-Experts (MoE) combined with improvements to positional encoding that allowed attention to remain coherent across one million token positions.
Standard transformer attention scales as O(n²) with context length, the fundamental constraint that capped GPT-3 at 2,048 tokens. One million tokens in a standard attention mechanism would require an attention matrix of 10¹² cells, which is computationally impossible even on modern hardware.
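The quadratic growth is easy to check numerically. This sketch counts the cells in one naively materialized n-by-n attention matrix (a single head in a single layer); the 2 bytes per cell assumes 16-bit storage, which is an illustrative choice rather than a detail from the text.

```python
def attention_cells(context_length: int) -> int:
    """Number of entries in one full n-by-n attention matrix."""
    return context_length * context_length


def attention_bytes(context_length: int, bytes_per_cell: int = 2) -> int:
    """Memory for one such matrix, assuming 16-bit values by default."""
    return attention_cells(context_length) * bytes_per_cell


if __name__ == "__main__":
    for n in (2_048, 4_096, 1_000_000):
        cells = attention_cells(n)
        gib = attention_bytes(n) / 2**30
        print(f"n={n:>9,}  cells={cells:.3e}  ~{gib:,.3f} GiB per head per layer")

    # Doubling the context quadruples the matrix.
    print(attention_cells(4_096) / attention_cells(2_048))
```

At one million tokens a single matrix already needs roughly two terabytes at 16-bit precision, before multiplying by heads and layers, which is why long-context systems avoid materializing it in full.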
Gemini 1.5 Pro used two key architectural approaches. First, Mixture-of-Experts (MoE) routing reduced the active compute per token by activating only a subset of the model's parameters for each token, making per-token compute feasible at scale. Second, Google's research team employed ring attention techniques that allowed attention computation to be distributed across multiple TPU devices without each device holding the full attention matrix in memory.
The Google DeepMind technical report (February 2024) described Gemini 1.5 Pro as achieving "near-perfect recall" on needle-in-a-haystack tasks up to 1M tokens, a remarkable claim that external researchers at the time subjected to skeptical independent testing.
Independent researchers using the LMSYS Chatbot Arena and community needle-in-a-haystack tests largely confirmed Gemini 1.5 Pro's claims up to approximately 500K tokens, with performance degrading somewhat in the 500K–1M range. The degradation was real but substantially smaller than the "lost in the middle" effect seen in earlier models: a genuine qualitative improvement in context utilization, not just context capacity.
On March 4, 2024, Anthropic released the Claude 3 model family (Haiku, Sonnet, Opus), all with 200,000-token context windows. This represented a doubling of Claude's prior 100K window and positioned Anthropic competitively against Google's 1M preview, which was not yet in general availability.
Anthropic's technical documentation for Claude 3 emphasized a different metric than raw context size: context faithfulness, meaning how well the model maintained accurate recall across the full 200K range. Independent evaluations (conducted by researchers at AI safety organizations and published on platforms like AI Alignment Forum in March–April 2024) found Claude 3 Opus performed more consistently across its 200K window than earlier models had across their smaller windows.
The practical uses that become feasible only above the 100K threshold, and which Gemini 1.5 Pro's technical report specifically documented, fall into distinct categories:
Full codebase analysis: A 1M token window can hold approximately 750,000 words of code. A large open-source project like CPython's core library is approximately 400,000 lines, manageable in a single 1M context. Developers can ask "find all places where this variable is used and explain the data flow" across an entire real-world project.
Hour-scale audio and video: Google's February 2024 demonstration fed 11 hours of audio in a single context. At roughly 1,500 tokens per minute of audio transcription, 11 hours ≈ 990,000 tokens, fitting within 1M. This enables tasks like "find all segments in this recorded conference where the speaker contradicts an earlier statement."
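The audio arithmetic can be checked directly. The sketch uses the 1,500 tokens-per-minute rate quoted above as a given; actual rates depend on the model and the audio encoding.

```python
TOKENS_PER_AUDIO_MINUTE = 1_500  # rate quoted in the text


def audio_tokens(hours: float) -> int:
    """Estimated tokens consumed by a recording of the given length."""
    return round(hours * 60 * TOKENS_PER_AUDIO_MINUTE)


def max_audio_hours(window_tokens: int) -> float:
    """Longest recording that fits in a window, leaving no room for a prompt."""
    return window_tokens / TOKENS_PER_AUDIO_MINUTE / 60


if __name__ == "__main__":
    print(audio_tokens(11))            # tokens for the 11-hour demonstration
    print(max_audio_hours(1_000_000))  # upper bound for a 1M-token window
```

At this rate the 11-hour recording leaves only about 10,000 tokens for the question and the answer, so the demonstration sat close to the window's limit.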
Multi-document legal and financial analysis: A full merger-and-acquisition due diligence set, encompassing hundreds of documents, can exceed 500,000 words. At 1M tokens, the entire set fits in a single context, enabling cross-document analysis without retrieval pipelines.
OpenAI had responded to Claude's 100K window in November 2023 with the release of GPT-4 Turbo, featuring a 128,000-token context window, a substantial jump from the 32K maximum previously available. GPT-4 Turbo's announcement emphasized not just context size but price: it was priced at $0.01 per 1K prompt tokens, one-third the GPT-4-8K price, making large-context inference significantly more economical.
By early 2024, the context window race had clearly entered a new phase: the constraint was no longer whether a window was "large enough" for most professional documents, but how well the model used what it could see, and how economically it could process it.
By mid-2024 the research community had solidified a key distinction: context capacity (how many tokens the model accepts) versus context utilization (how well the model actually uses those tokens). Gemini 1.5 Pro's 1M window with near-perfect recall represented a genuine breakthrough in utilization, not just capacity. This distinction would define the next phase of the context window race: not who had the largest window, but whose model actually paid attention to all of it.
It's mid-2024. You're an AI engineering lead evaluating whether Gemini 1.5 Pro's 1M-token context makes sense for your company's codebase analysis and legal due-diligence workflows. Explore the utilization vs. capacity question with your AI assistant: when does the million-token context genuinely help, and when is a well-implemented RAG pipeline still the smarter choice?