In July 2023, Anthropic released Claude 2 with a 100,000-token context window. Tech journalists called it staggering. Seven months later, in February 2024, Google's Gemini 1.5 Pro demonstrated 1,000,000 tokens. The number had grown tenfold in well under a year. The race wasn't slowing; it was accelerating.
The history of context window growth is one of the fastest capability progressions in AI. In 2020, GPT-3 launched with 2,048 tokens — enough for a short essay. By early 2023, GPT-4 reached 32,768 tokens in its extended variant. Then the dam broke.
The jump from 32K to 100K tokens was significant, but it was Anthropic's and Google's subsequent moves that reframed expectations entirely.
Unlike transistor counts, context window growth has been driven by architectural choices, not just hardware. The core bottleneck is the attention mechanism: standard transformers scale in compute as O(n²) with sequence length. Doubling the context quadruples the compute required for attention alone.
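The quadratic relationship can be checked with a few lines of arithmetic. The sketch below counts only the query-key pairs that one full attention head must score; the function names are illustrative, not taken from any library.

```python
def attention_score_count(seq_len: int) -> int:
    """Query-key pairs scored by one full (non-causal) attention head."""
    return seq_len * seq_len


def scaling_factor(old_len: int, new_len: int) -> float:
    """How much the pairwise attention work grows when the context grows."""
    return attention_score_count(new_len) / attention_score_count(old_len)


if __name__ == "__main__":
    for n in (2_048, 32_768, 100_000, 1_000_000):
        print(f"{n:>9,} tokens -> {attention_score_count(n):.2e} query-key pairs")
    # Doubling the context length quadruples the pairwise work.
    print("2x context ->", scaling_factor(32_768, 65_536), "x attention work")
```

Causal masking roughly halves the pair count, but the growth rate stays quadratic.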
The leaps above each represented a significant engineering decision, not just a chip upgrade. Google has not published the full details of how Gemini 1.5 reaches 1M tokens; its technical report describes the model as a sparse mixture-of-experts Transformer and credits improvements across architecture, training, and serving infrastructure rather than a single named technique.
This means context scaling curves look episodic: long plateaus interrupted by sudden jumps when a new architectural technique is proven at scale.
The historical pattern suggests that context length milestones arrive in waves tied to architectural breakthroughs rather than steady increments. Understanding what architectural technique enabled each leap is the key to predicting where the next ceiling will be set — and when it might be broken.
Context length has become a marquee specification — the AI equivalent of megapixels in early smartphone cameras. When Anthropic announced 100K, OpenAI's 8K window suddenly looked dated. When Google announced 1M, Anthropic's 200K lost its "world's largest" distinction within months.
This competition has had a direct user benefit: prices per token have fallen sharply as labs compete, and what once required enterprise contracts is now available in standard API tiers. The trend strongly suggests that by 2026–2027, multi-million-token contexts will be unremarkable baseline capabilities rather than premium features.
In February 2024, Google's Gemini 1.5 technical report included benchmarks showing that Gemini 1.5 Pro could retrieve a specific planted item from hours of video or audio as well as from long text. The researchers used the "needle in a haystack" methodology popularized by community researcher Greg Kamradt in late 2023, which stress-tests recall at arbitrary positions within the context.
You've just read about context window growth from 2K to 2M tokens. Now explore the deeper implications with your lab assistant. Ask about what drove each architectural leap, why the growth curve is "episodic" rather than smooth, or what business use cases became newly possible at specific size thresholds.
When Google ran Gemini 1.5 Pro at 1 million tokens, the compute bill for a single inference call was staggering. Full attention over one million tokens means scoring roughly 10¹² query-key pairs, a trillion pairwise comparisons, in every attention head of every layer, for a single prompt. Labs don't publish exact numbers, but the quadratic term alone means the attention work for a 1M-token prompt is roughly 15,000 times that of an 8K-token prompt. The race toward infinite context runs directly into the wall of physics.
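To get a feel for the scale, here is a rough back-of-the-envelope estimate of full-attention FLOPs. Every number in it is an assumption chosen for illustration: the model shape (80 layers, hidden size 8,192) and the sustained throughput are hypothetical, and nothing here describes Gemini's actual architecture or cost.

```python
def attention_flops(seq_len: int, d_model: int, n_layers: int) -> int:
    """Approximate FLOPs for full self-attention over one prompt.

    Counts only the two n-by-n matrix products per layer (scores = Q K^T,
    output = weights V), with one multiply-add counted as 2 FLOPs.
    Projections, softmax, and feed-forward layers are ignored.
    """
    score_flops = 2 * seq_len * seq_len * d_model
    value_flops = 2 * seq_len * seq_len * d_model
    return n_layers * (score_flops + value_flops)


def seconds_at(flops: int, sustained_flops_per_second: float) -> float:
    """Wall-clock time at an assumed sustained throughput."""
    return flops / sustained_flops_per_second


if __name__ == "__main__":
    D_MODEL, N_LAYERS = 8_192, 80   # hypothetical model shape
    ASSUMED_THROUGHPUT = 4e14       # hypothetical sustained FLOP/s on one device
    for n in (8_192, 131_072, 1_000_000):
        f = attention_flops(n, D_MODEL, N_LAYERS)
        print(f"{n:>9,} tokens: {f:.2e} attention FLOPs, "
              f"{seconds_at(f, ASSUMED_THROUGHPUT):,.1f} s on one device")
```

The absolute figures shift with the assumptions, but the ratio between rows does not: it depends only on the square of the sequence length.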
The most direct approach to breaking the O(n²) ceiling has been sparse attention: instead of computing attention between all token pairs, the model attends only to a selected subset. Alongside it, exact-attention techniques attack memory traffic and device limits rather than the complexity itself. Several of these have reached production:
Sliding window attention (used in Mistral 7B): each token attends only to a fixed window of nearby tokens, so attention cost grows roughly linearly with sequence length. Related designs such as Longformer add a small set of "global" tokens that every position can attend to.
Flash Attention (Tri Dao, 2022, adopted by most major labs): a GPU kernel optimization that doesn't reduce the theoretical complexity but dramatically reduces memory bandwidth usage, enabling longer contexts on existing hardware. Flash Attention 2 roughly doubled throughput over the original, and Flash Attention 3 reported a further 1.5–2× on H100 GPUs.
Ring Attention (UC Berkeley, 2023): splits the sequence into blocks across multiple devices arranged in a ring and passes key-value blocks from device to device, overlapping communication with computation, so the combined memory of many chips can hold extraordinarily long contexts. Google has not confirmed which techniques Gemini 1.5 uses, but this style of sequence parallelism is one of the approaches that makes million-token contexts physically feasible.
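The sliding-window idea from the list above can be sketched as a masking rule. This is a minimal illustration, assuming a causal window in which each token sees itself and the previous `window - 1` tokens; the function names are hypothetical and real implementations apply the mask inside a fused GPU kernel.

```python
def can_attend(query_pos: int, key_pos: int, window: int) -> bool:
    """Causal sliding window: a query sees itself and the window-1 tokens before it."""
    return 0 <= query_pos - key_pos < window


def sliding_window_pairs(seq_len: int, window: int) -> int:
    """Total (query, key) pairs scored under the sliding-window mask."""
    return sum(min(pos + 1, window) for pos in range(seq_len))


def full_causal_pairs(seq_len: int) -> int:
    """Total (query, key) pairs scored under ordinary causal attention."""
    return seq_len * (seq_len + 1) // 2


if __name__ == "__main__":
    window = 4_096  # assumed window size for illustration
    for n in (8_192, 131_072, 1_000_000):
        ratio = full_causal_pairs(n) / sliding_window_pairs(n, window)
        print(f"{n:>9,} tokens: full attention scores {ratio:,.1f}x more pairs")
```

With a fixed window, the pair count grows in proportion to the sequence length, which is why the savings widen as the context gets longer.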
In July 2023, researchers from Stanford and UC Berkeley published a paper titled "Lost in the Middle: How Language Models Use Long Contexts." They tested multiple large language models on multi-document question answering and found that performance was significantly higher when the relevant document appeared at the beginning or end of the context — and dropped sharply when it was placed in the middle. This effect held even for models explicitly designed for long context. The finding has directly shaped how production RAG systems are architected: critical information is deliberately placed at the start or end of prompts, not the middle.
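One way production systems act on this finding is to reorder retrieved documents before building the prompt. The sketch below is one possible ordering rule, not the method from the paper: it assumes the input list is already sorted from most to least relevant, and it pushes the weakest documents toward the middle.

```python
from __future__ import annotations


def reorder_for_edges(docs_by_relevance: list[str]) -> list[str]:
    """Place the most relevant documents at the start and end of the prompt.

    Input is sorted most-relevant first. Even ranks fill the front in order,
    odd ranks fill the back in reverse, so relevance falls off toward the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for rank, doc in enumerate(docs_by_relevance):
        if rank % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


if __name__ == "__main__":
    ranked = ["A", "B", "C", "D", "E"]  # A is the most relevant
    print(reorder_for_edges(ranked))    # ['A', 'C', 'E', 'D', 'B']
```

The top-ranked document opens the prompt, the second-ranked closes it, and the least relevant one sits in the middle where recall is weakest.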
GPU VRAM has grown (the NVIDIA H100 ships with 80GB, the H200 with 141GB), but these numbers are dwarfed by the KV cache requirements of multi-million-token contexts for large models. A rough calculation: the cache stores one key vector and one value vector per token in every layer, so for a 70-billion-parameter model with 80 layers and a hidden size of 8,192 using standard multi-head attention, a 1M-token context needs about 2.6TB in BF16, about 1.3TB at 8-bit precision, and about 650GB at 4-bit precision. The model weights themselves take roughly 140GB in BF16.
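The KV cache arithmetic is simple enough to reproduce. The sketch below assumes a hypothetical 70B-class shape (80 layers, 64 attention heads, head dimension 128); these are illustrative values, and the function name is not from any library.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bits_per_value: int) -> int:
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 covers one key and one value vector
    per token, per layer, per KV head.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return values * bits_per_value // 8


if __name__ == "__main__":
    SEQ_LEN, LAYERS, HEAD_DIM = 1_000_000, 80, 128  # assumed model shape
    for label, kv_heads, bits in [
        ("multi-head, 16-bit", 64, 16),
        ("multi-head, 8-bit", 64, 8),
        ("multi-head, 4-bit", 64, 4),
        ("grouped-query (8 KV heads), 16-bit", 8, 16),
    ]:
        size_gb = kv_cache_bytes(SEQ_LEN, LAYERS, kv_heads, HEAD_DIM, bits) / 1e9
        print(f"{label:<36} {size_gb:>8,.0f} GB")
```

The last row shows why sharing key-value heads matters: cutting 64 KV heads to 8 shrinks the cache eightfold without touching precision.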
This has driven two parallel approaches: KV cache compression techniques that reduce memory per token (e.g., grouped-query attention, which Llama 3 uses), and offloading to CPU RAM or SSD storage with async prefetch. Neither fully solves the problem at very long contexts, but each extends the practical ceiling.
Tri Dao's Flash Attention paper (2022, Stanford) was adopted by Hugging Face, MosaicML, and most major labs within six months of publication. Its core insight, that the bottleneck is memory bandwidth rather than raw FLOPs, led to a kernel rewrite that made 10–20× longer contexts viable on the same hardware. Flash Attention 3, released in 2024, was optimized specifically for the H100 architecture and achieved approximately 75% of theoretical peak FP16 FLOP utilization.
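The reason a kernel can avoid holding the full score matrix in memory is that softmax can be computed block by block with running statistics. The sketch below illustrates that online-softmax idea for a single query with scalar values; it is a simplified teaching example under those assumptions, not the Flash Attention kernel itself, which tiles whole matrices in on-chip memory.

```python
import math


def naive_attention_row(scores: list[float], values: list[float]) -> float:
    """Softmax-weighted average computed with every score in memory at once."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def blockwise_attention_row(scores: list[float], values: list[float],
                            block_size: int) -> float:
    """Same result, visiting the scores one block at a time."""
    running_max = float("-inf")
    running_denom = 0.0
    running_out = 0.0  # unnormalized weighted sum of values
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        # Rescale earlier partial sums to the new maximum (0.0 on the first block).
        rescale = math.exp(running_max - new_max)
        running_denom *= rescale
        running_out *= rescale
        for s, v in zip(block_scores, block_values):
            w = math.exp(s - new_max)
            running_denom += w
            running_out += w * v
        running_max = new_max
    return running_out / running_denom


if __name__ == "__main__":
    s = [0.5, 2.0, -1.0, 3.0, 0.0]
    v = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(naive_attention_row(s, v), blockwise_attention_row(s, v, block_size=2))
```

Both functions return the same value up to floating-point rounding, but the blockwise version only ever needs one block of scores plus three running numbers.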
The engineering ceilings don't make context window growth impossible — they make it expensive and architecturally demanding. The pattern from previous breakthroughs (Flash Attention, Ring Attention, sparse attention variants) suggests that the next leap will come from a novel technique that sidesteps a current bottleneck, not from incremental hardware improvements alone.
The practical implication for AI users: the 2M-token ceiling isn't a hard physical limit, but expanding beyond it will require either a significant architectural innovation or a substantial increase in inference infrastructure cost — likely both.
You've learned about the three main engineering ceilings: attention compute cost, KV cache memory, and retrieval degradation. Now dig into the tradeoffs with your lab assistant. Ask about how specific techniques like Flash Attention or Ring Attention trade one problem for another, or what would need to be true for a 10M-token model to be economically viable.
In late 2023, two research directions diverged sharply. One camp — Google, Anthropic — bet that context windows would grow large enough to hold everything relevant. The other — led by startups like Mem.ai and integrated into OpenAI's GPT-4 plugins — argued that retrieval-augmented generation was the correct architecture: keep the context window short and fast, and dynamically fetch only what's needed. By 2025, the industry had stopped choosing between them. The winning answer turned out to be: both.
There are now three distinct architectural approaches to giving AI models access to large amounts of information: long in-context windows, retrieval-augmented generation (RAG), and persistent memory. They are increasingly used together rather than as alternatives.
RAG has a decisive advantage when the total relevant knowledge base exceeds even the largest context windows. A legal research firm with 50 years of case law — millions of documents — cannot fit that corpus into any context window. RAG allows querying this corpus dynamically.
OpenAI formalized this approach in March 2023 with the release of the Retrieval plugin for ChatGPT, allowing the model to query external document stores via semantic search. This was followed by the more tightly integrated Retrieval tool in the Assistants API, released in November 2023 and later renamed File Search, which automated chunking, embedding, and retrieval behind a simple API interface.
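The three steps that such a service automates (chunking, embedding, retrieval) can be sketched in miniature. This is a toy illustration, not OpenAI's implementation: the word-count "embedding" stands in for a learned embedding model, and all function names are hypothetical.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


if __name__ == "__main__":
    chunks = ["the lease terminates in 2030",
              "payment is due monthly",
              "the tenant may sublet"]
    print(retrieve("when does the lease terminate", chunks, k=1))
```

Only the retrieved chunks are placed in the prompt, which is what keeps the context short regardless of how large the document store grows.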
However, RAG has a fundamental weakness: if the retrieval step fails to surface the relevant chunk, the model cannot answer correctly even if it would have reasoned correctly given the full text. The "lost in the middle" effect compounds this: even a correctly retrieved chunk can be underused if it lands in the middle of a long prompt, so retrieval quality and prompt placement both matter.
In 2024, Notion released "Notion AI Q&A," which used RAG over a user's entire Notion workspace. The system embedded all pages, used vector search to retrieve relevant chunks, and fed them to a language model. A key engineering challenge Notion's team documented publicly: at very large workspaces (hundreds of thousands of notes), retrieval precision dropped. Their solution combined keyword search with semantic search in a hybrid retrieval approach — demonstrating that even for practical deployments, RAG architecture must be carefully tuned rather than assumed to work at scale.
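A hybrid approach needs some rule for merging the keyword ranking with the semantic ranking. The text above does not say which rule Notion used, so the sketch below swaps in reciprocal rank fusion, a common general-purpose choice; the constant `k = 60` is the conventional default for that method, and the document IDs are made up.

```python
from __future__ import annotations


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by multiple retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    keyword_hits = ["a", "b", "c"]   # hypothetical keyword-search ranking
    semantic_hits = ["b", "c", "d"]  # hypothetical vector-search ranking
    print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
```

Documents found by both retrievers outrank a document that only one retriever placed first, which is the behavior a hybrid system is after.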
For tasks requiring reasoning across an entire document, not just locating a fact, RAG often fails. If someone asks a model to "identify all the inconsistencies across this 80-page contract," RAG may retrieve individual sections but miss cross-section contradictions that are only apparent when the entire document is held in context at once.
This is the use case that long context windows are purpose-built for. Google's 2024 Gemini 1.5 technical report explicitly demonstrated this with a "needle in a haystack" variant requiring multi-hop reasoning: finding information A, then using it to locate information B, when both were buried in 500 pages. RAG consistently failed this multi-hop task; full-context succeeded.
The current frontier combines all three approaches. Agentic memory systems — demonstrated in frameworks like LangChain's memory module and AutoGPT's memory architecture — use a tiered approach: short-term in-context working memory, medium-term RAG over session history, and long-term structured memory in a persistent database.
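The tiered structure can be sketched as a small class. This is a hypothetical illustration of the three tiers, not the API of LangChain or AutoGPT: keyword overlap stands in for vector search, and an in-memory dict stands in for a persistent database.

```python
from __future__ import annotations

from collections import deque


class TieredMemory:
    def __init__(self, working_size: int) -> None:
        self.working: deque[str] = deque(maxlen=working_size)  # short-term, kept in prompt
        self.session_log: list[str] = []                       # medium-term, searched on demand
        self.long_term: dict[str, str] = {}                    # long-term persistent facts

    def add_turn(self, text: str) -> None:
        """Record a conversation turn; old turns fall out of working memory."""
        self.working.append(text)
        self.session_log.append(text)

    def remember(self, key: str, value: str) -> None:
        """Write a durable fact to the long-term store."""
        self.long_term[key] = value

    def search_session(self, query: str, k: int) -> list[str]:
        """Retrieve older turns sharing words with the query (toy retrieval)."""
        q = set(query.lower().split())
        older = [t for t in self.session_log if t not in self.working]
        scored = [(len(q & set(t.lower().split())), t) for t in older]
        hits = [t for score, t in sorted(scored, key=lambda p: -p[0]) if score > 0]
        return hits[:k]

    def build_prompt(self, query: str, k: int = 2) -> str:
        """Assemble facts, retrieved history, recent turns, and the query."""
        facts = [f"{key}: {value}" for key, value in sorted(self.long_term.items())]
        parts = ["[facts]", *facts, "[retrieved]", *self.search_session(query, k),
                 "[recent]", *self.working, "[query]", query]
        return "\n".join(parts)


if __name__ == "__main__":
    memory = TieredMemory(working_size=2)
    memory.remember("name", "Ada")
    for turn in ["budget is 40k", "deadline is friday", "lunch was good", "see you soon"]:
        memory.add_turn(turn)
    print(memory.build_prompt("what is the budget", k=1))
```

Each tier trades capacity for immediacy: working memory is always in the prompt, session history is pulled in only when relevant, and long-term facts persist across sessions.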
OpenAI took a step toward standardizing this in February 2024 with the launch of Memory for ChatGPT: a system that automatically writes facts from conversations to a user-level persistent store and injects them into future system prompts. By May 2025, this had expanded to include user-controlled memory management and enterprise-level workspace memory.
The research direction that may ultimately unify these approaches is learned retrieval: training the model itself to decide what to retrieve, when, and how to integrate retrieved information, rather than treating retrieval as a separate pipeline step. This approach, explored in UC Berkeley's MemGPT paper (2023) and subsequent work, aims to give models human-like memory management capabilities.
The question "how large will context windows get?" is becoming less central than "how should context, retrieval, and persistent memory be orchestrated together?" The most capable 2025 systems are hybrids that use each approach where it performs best — suggesting that future AI capability gains will come as much from memory architecture as from raw context size increases.
You've learned about three memory approaches — in-context, RAG, and persistent memory — and why the future is hybrid. Now apply that understanding. Describe a real-world use case to your lab assistant and work through which memory architecture (or combination) best fits it, and why. Consider retrieval precision, cost, and the multi-hop reasoning requirement.
When context windows crossed 100K tokens in 2023, a specific category of task became newly possible: submitting an entire codebase in a single prompt. When they crossed 1M tokens in 2024, a different category emerged: submitting an entire company's document archive and asking questions across it. Each threshold doesn't just improve existing use cases — it creates entirely new ones that were architecturally impossible before.
Context window size isn't a smooth continuous variable for applications — specific capabilities become possible at specific thresholds, making each major milestone categorically rather than incrementally different.
Genomics research: In 2024, Google DeepMind demonstrated Gemini 1.5's ability to reason over genomic sequences exceeding 700,000 nucleotides in a single context — a use case that required full-sequence context because mutations early in a sequence affect interpretation of mutations later in it. RAG-based approaches had failed to capture these long-range dependencies.
Video understanding: Google's Project Astra (2024 Google I/O demonstration) used Gemini's long context to process a live video stream alongside conversation history, allowing the model to remember what it had seen earlier in a session and reference it in later answers. The demo showed the model correctly recalling where a specific object had been placed minutes earlier in the session.
Full repository code review: Cognition AI's Devin agent (released March 2024) relied on long context to maintain awareness of the entire codebase it was editing across a multi-hour software development session — tracking dependencies, prior edits, and test results simultaneously. This was explicitly cited as a capability gated on sufficiently long context in Cognition's public documentation.
In early 2024, researchers tested Gemini 1.5 Pro's code understanding by submitting the entire CPython source repository (~1.2M tokens) as context and asking questions about cross-file dependencies, historical architectural decisions embedded in comments, and function relationships that required reading dozens of files simultaneously. The model answered correctly on the majority of questions that required multi-file reasoning — a task that had previously required a human engineer with deep project familiarity. This benchmark directly influenced how software companies began evaluating long-context models for code review automation.
The trajectory is clear: context windows will continue growing, costs will continue falling, and applications that are economically impractical today will become routine. The practitioner implication is to design for the context window you'll have, not the one you have now — but to do so thoughtfully.
Gemini 1.5 Pro's 1M-token context was announced in February 2024 at prices that made it impractical for many applications. By mid-2024, Google had reduced costs by more than 75%. The gap between "technically possible" and "economically deployable" closes faster than most practitioners expect — and often on a different timeline than the capability announcement itself suggests.
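The effect of a price cut on deployability is easy to work through. The prices and query volume below are hypothetical placeholders chosen for round numbers, not Google's actual rates.

```python
def prompt_cost_usd(prompt_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of a single prompt."""
    return prompt_tokens / 1_000_000 * usd_per_million_tokens


def price_after_cut(price: float, cut_fraction: float) -> float:
    """Price remaining after a fractional reduction (0.75 means a 75% cut)."""
    return price * (1.0 - cut_fraction)


if __name__ == "__main__":
    launch_price = 10.0       # hypothetical USD per million input tokens
    queries_per_month = 5_000  # hypothetical workload of full 1M-token prompts
    for label, price in [("at launch", launch_price),
                         ("after 75% cut", price_after_cut(launch_price, 0.75))]:
        monthly = prompt_cost_usd(1_000_000, price) * queries_per_month
        print(f"{label:<14} ${monthly:,.0f} per month")
```

A workload that is out of budget at the launch price can fall within it a few months later, which is why the timing of cost curves matters as much as the capability announcement.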
You've seen how specific context thresholds unlock specific application categories, and you've learned the practitioner implications. Now apply the framework to a real or hypothetical product or workflow you care about. Work with your lab assistant to identify what context threshold changes the application category, what the cost curve means for timing, and how to structure a hybrid architecture that's viable now and scalable forward.