Module 8 · Lesson 1

From 4K to Millions: The Scaling Trajectory

How context windows grew from a technical footnote to a competitive battleground — and where the trajectory points next.

What does the historical rate of context growth tell us about where these windows are headed?

In May 2023, Anthropic announced a 100,000-token context window for Claude. Tech journalists called it staggering. Nine months later, Google's Gemini 1.5 Pro demonstrated 1,000,000 tokens. The number had grown tenfold in under a year. The race wasn't slowing; it was accelerating.

The Timeline of Context Expansion

The history of context window growth is one of the fastest capability progressions in AI. In 2020, GPT-3 launched with 2,048 tokens — enough for a short essay. By early 2023, GPT-4 reached 32,768 tokens in its extended variant. Then the dam broke.

The jump from 32K to 100K tokens was significant, but it was the later jumps, to 200K from Anthropic and to 1M from Google, that reframed expectations entirely.

2020

GPT-3: 2,048 tokens

OpenAI's landmark model ships with a context window equivalent to roughly 1,500 words. Fine for Q&A; inadequate for long documents.

2023 Q1

Claude 1: 9,000 tokens / GPT-4: 8K → 32K

Anthropic's first Claude ships with ~9K tokens. OpenAI launches GPT-4 at 8K, with a limited "GPT-4-32k" variant extending to 32,768. Longer contexts arrive but remain slow and expensive.

2023

Claude 2: 100,000 tokens

Anthropic announces the 100K context in May 2023, and Claude 2 ships with it in July. Users can submit an entire novel in a single prompt. Competitors scramble to respond.

2024 Q1

Gemini 1.5 Pro: 1,000,000 tokens (preview)

Google's February 2024 announcement of a 1M-token context in Gemini 1.5 Pro is the single largest leap in the race, a fivefold jump over the previous 200K record. In Google's demos, the model located specific moments in the 402-page Apollo 11 mission transcript.

2024 Q2

Gemini 1.5 Pro: 2,000,000 tokens (GA)

Google extends the window to 2M tokens and opens it to developers via the API in June 2024. That is roughly 1.5 million words held in context at once.

2025

Claude 3.7 Sonnet: 200K; GPT-4o: 128K standard

The 100K–200K range becomes the competitive floor. Even mid-tier models routinely offer what was record-breaking less than two years prior.

Why Growth Doesn't Follow Moore's Law Neatly

Unlike transistor counts, context window growth has been driven by architectural choices, not just hardware. The core bottleneck is the attention mechanism: standard transformers scale in compute as O(n²) with sequence length. Doubling the context quadruples the compute required for attention alone.

The leaps above each represented a significant engineering decision, not just a chip upgrade. Google has not published the full recipe behind Gemini 1.5's 1M-token window: its technical report describes a sparse mixture-of-experts Transformer and credits improvements across architecture, training, and serving infrastructure, without detailing the attention mechanism itself.

This means context scaling curves look episodic: long plateaus interrupted by sudden jumps when a new architectural technique is proven at scale.
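The quadratic attention cost described in this section can be seen directly in a toy implementation. The sketch below is plain-Python single-head attention written for illustration only; the names `naive_attention` and `toy_tokens` and the sizes used are assumptions, not taken from any production model.

```python
import math

def naive_attention(queries, keys, values):
    """Single-head scaled dot-product attention on plain Python lists.

    Returns the output vectors and the number of query-key scores computed.
    """
    d = len(queries[0])
    outputs, n_scores = [], 0
    for q in queries:
        # One score per (query, key) pair: this inner loop is the O(n^2) part.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        n_scores += len(scores)
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))
        ])
    return outputs, n_scores

def toy_tokens(n, d=4):
    """Deterministic toy embeddings, just to have something to attend over."""
    return [[math.sin(i * (j + 1)) for j in range(d)] for i in range(n)]

for n in (64, 128, 256):
    x = toy_tokens(n)
    _, n_scores = naive_attention(x, x, x)
    print(n, n_scores)  # the score count quadruples each time n doubles
```

Every doubling of the input multiplies the number of scores by four, which is why extending a window by 10× is a 100× problem for this part of the model.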

~500×

Growth 2020–2024

From GPT-3's 2K to Gemini's 1M tokens in four years

O(n²)

Attention Cost

Standard transformers — doubling context quadruples attention compute

2M

Current Peak (2025)

Gemini 1.5 Pro in GA — the largest publicly available context window

~$0.007

Cost per 1K tokens (input)

Gemini 1.5 Pro pricing as of early 2025 — a fraction of 2023 rates
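The growth figure in the cards above is simple arithmetic, and working it through shows what a smooth curve would have looked like. The sketch below uses only the two endpoints from the timeline; the four-year span and the idea of an "average yearly multiplier" are simplifying assumptions, since the lesson's point is that real growth came in jumps.

```python
import math

gpt3_tokens = 2_048        # 2020, from the timeline
gemini_tokens = 1_000_000  # early 2024, from the timeline
years = 4                  # assumed span between the two data points

growth = gemini_tokens / gpt3_tokens
annual = growth ** (1 / years)     # implied average yearly multiplier
doublings = math.log2(growth)      # how many times the window doubled

print(f"total growth: {growth:.0f}x")
print(f"average yearly multiplier: {annual:.1f}x")
print(f"doublings: {doublings:.1f}")
```

The averaged rate is useful as a sanity check on forecasts, but it hides the plateaus and jumps that the timeline actually shows.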

Key Insight

The historical pattern suggests that context length milestones arrive in waves tied to architectural breakthroughs rather than steady increments. Understanding what architectural technique enabled each leap is the key to predicting where the next ceiling will be set — and when it might be broken.

The Competitive Dynamic

Context length has become a marquee specification — the AI equivalent of megapixels in early smartphone cameras. When Anthropic announced 100K, OpenAI's 8K window suddenly looked dated. When Google announced 1M, Anthropic's 200K lost its "world's largest" distinction within months.

This competition has had a direct user benefit: prices per token have fallen sharply as labs compete, and what once required enterprise contracts is now available in standard API tiers. The trend strongly suggests that by 2026–2027, multi-million-token contexts will be unremarkable baseline capabilities rather than premium features.

Real Event

In February 2024, Google's Gemini 1.5 technical report presented "needle in a haystack" evaluations across text, audio, and video, reporting near-perfect recall of a planted fact from long inputs in each modality. The tests followed the methodology popularized by community researcher Greg Kamradt in late 2023, which stress-tests recall by placing the needle at many different positions within the context.
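A needle-in-a-haystack test is easy to set up yourself. The sketch below only builds the prompts; sending them to a model and scoring the answers is left out. The helper names, the filler text, and the passphrase needle are all hypothetical choices made for illustration.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)

def make_trials(filler_sentences, needle, question, depths):
    """One prompt per depth; each would be sent to the model under test."""
    return [
        {"depth": d, "prompt": build_haystack(filler_sentences, needle, d) + "\n\n" + question}
        for d in depths
    ]

filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The secret passphrase is 'amber falcon'."  # hypothetical needle
question = "What is the secret passphrase?"
trials = make_trials(filler, needle, question, [0.0, 0.25, 0.5, 0.75, 1.0])

for t in trials:
    # Character offset of the needle grows with the requested depth.
    print(t["depth"], t["prompt"].index(needle))
```

Running the same question at several depths and several haystack lengths gives a grid of pass/fail results, which is how recall at arbitrary positions is usually charted.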

Lesson 1 Quiz

The Scaling Trajectory · 3 questions

Which model first publicly demonstrated a 1,000,000-token context window in early 2024?

Correct. Google announced Gemini 1.5 Pro's 1M-token context in February 2024, later extended to 2M in general availability.

Not quite. Gemini 1.5 Pro was the first to publicly demonstrate and then release a 1,000,000-token context window.

Why does standard transformer attention scale as O(n²) with context length?

Correct. Standard attention requires computing a score between every pair of tokens — if you have n tokens, that's n×n comparisons, hence O(n²) cost.

Not quite. The quadratic scaling comes from the pairwise comparison: every token must attend to every other token, producing n² attention operations.

What was the approximate context window of GPT-3 at launch in 2020?

Correct. GPT-3 launched with a 2,048-token context — about 1,500 words — which was considered large at the time but now looks tiny against current models.

Not quite. GPT-3's original context window was 2,048 tokens — enough for roughly a short essay, not a full document.

Lab 1 — Mapping the Milestones

Discuss context window scaling history with your AI lab assistant

Your Task

You've just read about context window growth from 2K to 2M tokens. Now explore the deeper implications with your lab assistant. Ask about what drove each architectural leap, why the growth curve is "episodic" rather than smooth, or what business use cases became newly possible at specific size thresholds.

Starter: "What was the key architectural technique that let Google reach 1M tokens, and why couldn't they just extend the previous approach?"

Lab Assistant

Context Window History

Ready to explore context window scaling history. Ask me about the architectural breakthroughs, what made each milestone significant, or how cost curves changed alongside capability — I'll ground every answer in documented events and research.

Module 8 · Lesson 2

The Engineering Ceilings Ahead

The hard physics, memory constraints, and attention-cost problems that will shape — and possibly cap — context window growth.

If architectural breakthroughs drove the past, what barriers must the next breakthrough overcome?

When Google ran Gemini 1.5 Pro at 1 million tokens, the compute bill for a single inference call was staggering. Full attention over one million tokens means roughly 10¹² pairwise scores per head in every layer; multiplied across head dimensions, heads, and dozens of layers, that comes to well over a thousand trillion (10¹⁵) floating-point operations for a single prompt. Labs don't publish exact numbers, but naive full attention at 1M tokens would cost orders of magnitude more per query than a typical short-context call. The race toward infinite context runs directly into the wall of physics.

The Three Core Ceilings

Ceiling 1

Attention Compute (O(n²))

Standard self-attention requires computing every token's relationship with every other token. At 1M tokens, that's 10¹² pairwise computations per attention layer. Modern models have dozens of layers. Without sparse or approximate attention, million-token contexts are economically impossible at scale.

Ceiling 2

GPU Memory (KV Cache)

During inference, models store key-value (KV) cache entries for every token in context. At 1M tokens with a large model, the KV cache alone can require hundreds of gigabytes of GPU VRAM — exceeding what any single current GPU can hold. Distributing this across chips introduces latency.

Ceiling 3

Retrieval Degradation

Even when compute allows long contexts, model accuracy on information placed in the middle of the context degrades sharply. The "lost in the middle" phenomenon, documented by Stanford researchers in 2023, shows models reliably recall information at the beginning and end of context but miss facts buried in the middle.
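To put a number on Ceiling 1, the attention cost can be estimated with back-of-the-envelope arithmetic. The sketch below counts only the two large matrix multiplications inside attention and ignores everything else; the model shape (80 layers, 64 heads, 128-dimensional heads) is a hypothetical 70B-class configuration chosen for illustration, and causal masking would roughly halve the result.

```python
def attention_flops(n_tokens, n_layers, n_heads, head_dim):
    """Rough FLOPs for the score (QK^T) and weighted-sum (AV) matmuls over a full prompt.

    Each matmul costs about 2 * n^2 * head_dim FLOPs per head (multiply + add).
    Projections, softmax and MLP layers are ignored.
    """
    per_head = 2 * (2 * n_tokens**2 * head_dim)
    return per_head * n_heads * n_layers

# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=80, n_heads=64, head_dim=128)

for n in (8_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens: {attention_flops(n, **shape):.2e} FLOPs")
```

The absolute numbers depend entirely on the assumed shape, but the ratio between rows does not: going from 8K to 1M tokens multiplies this term by more than 15,000.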

Sparse Attention: The Leading Solution

The most successful approach to breaking the O(n²) ceiling has been sparse attention: instead of computing attention between all token pairs, the model selectively attends only to the most relevant tokens. Sparse variants, along with exact-attention techniques that attack memory traffic and memory capacity rather than complexity, have reached production:

Sliding window attention (used in Mistral 7B): each token attends only to a fixed window of nearby tokens. Related designs such as Longformer add a small set of "global" tokens that every position can see. Both reduce attention cost to near-linear in sequence length.

Flash Attention (Tri Dao, 2022, adopted by most major labs): a GPU kernel optimization that doesn't reduce the theoretical complexity but dramatically reduces memory bandwidth usage, enabling longer contexts on existing hardware. Flash Attention 2 and 3 each reported speedups of roughly 1.5–2× over their predecessor.

Ring Attention (UC Berkeley, 2023): distributes the sequence, and with it the KV cache, across multiple GPU devices in a ring topology, allowing the combined VRAM of many chips to hold extraordinarily long contexts. Google has not confirmed which techniques Gemini 1.5 uses, but this style of sequence parallelism is the kind of approach that makes million-token contexts physically feasible.
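The sliding-window idea from the list above can be made concrete by counting which token pairs are actually scored. The sketch below is a minimal illustration in plain Python, assuming a causal window in which each token sees itself and the tokens just before it; the function names and the sizes are hypothetical.

```python
def sliding_window_pairs(n_tokens, window):
    """Count (query, key) pairs when each token sees itself and the previous window-1 tokens."""
    return sum(min(i + 1, window) for i in range(n_tokens))

def causal_pairs(n_tokens):
    """Count pairs under ordinary causal attention (each token sees all earlier tokens)."""
    return n_tokens * (n_tokens + 1) // 2

def window_mask(n_tokens, window):
    """Boolean mask: mask[i][j] is True when query i may attend to key j."""
    return [[0 <= i - j < window for j in range(n_tokens)] for i in range(n_tokens)]

# Visualize a tiny mask: rows are queries, columns are keys.
for row in window_mask(6, 3):
    print("".join("x" if allowed else "." for allowed in row))

n, w = 100_000, 4_096  # illustrative sizes
print(causal_pairs(n), sliding_window_pairs(n, w))
```

With a fixed window the pair count grows in proportion to n rather than n², which is where the "near-linear" cost comes from. The price is that distant tokens are only reachable indirectly, through stacked layers.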

Real Research: "Lost in the Middle"

In July 2023, researchers from Stanford and UC Berkeley published a paper titled "Lost in the Middle: How Language Models Use Long Contexts." They tested multiple large language models on multi-document question answering and found that performance was significantly higher when the relevant document appeared at the beginning or end of the context — and dropped sharply when it was placed in the middle. This effect held even for models explicitly designed for long context. The finding has directly shaped how production RAG systems are architected: critical information is deliberately placed at the start or end of prompts, not the middle.
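The prompt-ordering practice mentioned above is straightforward to implement. The sketch below assumes you already have documents ranked by relevance and simply rearranges them so the strongest candidates sit at the edges of the prompt; `order_for_long_context` is a hypothetical helper, not a function from any specific RAG framework.

```python
def order_for_long_context(docs_by_relevance):
    """Place the most relevant documents at the edges of the prompt.

    Input is sorted most-relevant first. Documents are dealt alternately to the
    front and the back, so the least relevant ones end up in the middle.
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

ranked = ["doc1 (best)", "doc2", "doc3", "doc4", "doc5 (worst)"]
print(order_for_long_context(ranked))
```

Whether this helps for a given model is an empirical question, so it is worth measuring recall with and without the reordering on your own data.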

The Memory Wall

GPU VRAM has grown (the NVIDIA H100 ships with 80GB, the H200 with 141GB), but these numbers are dwarfed by the KV cache requirements of multi-million-token contexts for large models. A rough calculation: for a 70-billion-parameter model with 80 layers and a hidden size of 8,192, a 1M-token KV cache needs about 2.6TB at 16-bit precision with full multi-head attention, about 1.3TB at 8-bit, and about 330GB at 16-bit with grouped-query attention using 8 KV heads, compared to the model weights themselves at roughly 140GB in BF16.

This has driven two parallel approaches: KV cache compression techniques that reduce memory per token (e.g., grouped-query attention, which Llama 3 uses), and offloading to CPU RAM or SSD storage with async prefetch. Neither fully solves the problem at very long contexts, but each extends the practical ceiling.
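The rough calculation above can be reproduced with a few lines of arithmetic. The sketch below assumes a 70B-class shape of 80 layers with 128-dimensional heads, similar to published Llama 2 70B figures; the function name is hypothetical. The result swings by several times depending on the number of KV heads and the precision, so treat any single figure as an order-of-magnitude estimate.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Keys and values (factor 2) for every layer, KV head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

GB = 1e9

# Assumed 70B-class shape: 80 layers, 128-dim heads, 16-bit cache entries.
full_mha = kv_cache_bytes(1_000_000, n_layers=80, n_kv_heads=64, head_dim=128, bytes_per_value=2)
gqa_8 = kv_cache_bytes(1_000_000, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_value=2)
weights = 70e9 * 2  # 70B parameters at 2 bytes each (BF16)

print(f"KV cache, 64 KV heads, 16-bit: {full_mha / GB:,.0f} GB")
print(f"KV cache,  8 KV heads, 16-bit: {gqa_8 / GB:,.0f} GB")
print(f"model weights, BF16:           {weights / GB:,.0f} GB")
```

Cutting the number of KV heads from 64 to 8 shrinks the cache eightfold, which is the appeal of grouped-query attention, yet even the smaller figure exceeds the VRAM of any single GPU named above.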

Flash Attention Impact

Tri Dao's Flash Attention paper (2022, Stanford) was adopted by Hugging Face, Mosaic ML, and most major labs within six months of publication. Its core insight, that the bottleneck is memory bandwidth rather than raw FLOPs, led to a kernel rewrite that made 10–20× longer contexts viable on the same hardware. Flash Attention 3, released in 2024, was optimized specifically for the H100 architecture and achieved approximately 75% of theoretical peak FP16 FLOP utilization.
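The memory-saving idea can be illustrated without a GPU. The sketch below shows only the "online softmax" bookkeeping for a single query in plain Python: keys and values are visited block by block, and the result matches ordinary attention without ever storing all the scores. It is not the actual kernel, which also tiles the queries and keeps blocks in fast on-chip memory; the function names and toy data are assumptions.

```python
import math

def attention_reference(q, keys, values):
    """Standard softmax attention for one query: materializes every score at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def attention_blockwise(q, keys, values, block_size):
    """Same result, but keys/values are visited one block at a time.

    Only a running max, a running normalizer and a running output are kept,
    so the full list of scores never has to be stored.
    """
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # shrink earlier partial results
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Toy data: 37 keys/values so the last block is only partly full.
keys = [[math.sin(i + j) for j in range(4)] for i in range(37)]
values = [[math.cos(i * (j + 1)) for j in range(3)] for i in range(37)]
q = [0.3, -1.2, 0.5, 2.0]

ref = attention_reference(q, keys, values)
blk = attention_blockwise(q, keys, values, block_size=8)
print(max(abs(a - b) for a, b in zip(ref, blk)))  # difference is floating-point noise
```

The amount of arithmetic is unchanged, which matches the point above that the theoretical complexity stays O(n²). What changes is how much intermediate data has to be written out and read back.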

What This Means for Future Growth

The engineering ceilings don't make context window growth impossible — they make it expensive and architecturally demanding. The pattern from previous breakthroughs (Flash Attention, Ring Attention, sparse attention variants) suggests that the next leap will come from a novel technique that sidesteps a current bottleneck, not from incremental hardware improvements alone.

The practical implication for AI users: the 2M-token ceiling isn't a hard physical limit, but expanding beyond it will require either a significant architectural innovation or a substantial increase in inference infrastructure cost — likely both.

Lesson 2 Quiz

Engineering Ceilings · 3 questions

What is the "lost in the middle" phenomenon documented by Stanford/UC Berkeley researchers in 2023?

Correct. The Stanford/Berkeley paper showed models reliably recall information at the beginning and end of context but struggle with facts buried in the middle — a finding that directly shaped RAG system design.

Not quite. The phenomenon is about positional recall within a single context window: models are less reliable at retrieving information from the middle of their context versus the beginning or end.

What was the core insight behind Flash Attention that made it so impactful?

Correct. Flash Attention kept the same O(n²) theoretical complexity but recognized that memory bandwidth — not compute — was the actual bottleneck, and redesigned the kernel accordingly.

Not quite. Flash Attention doesn't change the O(n²) theoretical complexity. Its key insight was that GPU memory bandwidth was the real bottleneck, and it rewrote the attention kernel to minimize memory traffic.

Which technique distributes the KV cache across multiple GPU devices to enable million-token contexts?

Correct. Ring Attention, developed by UC Berkeley researchers in 2023, distributes the KV cache across GPUs in a ring topology, enabling the combined VRAM of many chips to handle contexts that no single chip could hold.

Not quite. Ring Attention is the technique that distributes KV cache across multiple GPU devices in a ring topology, making million-token contexts physically feasible.

Lab 2 — Engineering the Ceiling

Probe the technical barriers to context window growth with your lab assistant

Your Task

You've learned about the three main engineering ceilings: attention compute cost, KV cache memory, and retrieval degradation. Now dig into the tradeoffs with your lab assistant. Ask about how specific techniques like Flash Attention or Ring Attention trade one problem for another, or what would need to be true for a 10M-token model to be economically viable.

Starter: "If Ring Attention solves the memory problem by spreading the KV cache across GPUs, what new problems does that create for latency and inference cost?"

Lab Assistant

Context Engineering Ceilings

Ready to explore the engineering limits of context windows. Ask me about sparse attention tradeoffs, KV cache compression techniques, the lost-in-the-middle problem's practical impact, or what the path to 10M-token contexts would require.

Module 8 · Lesson 3

Memory, Retrieval, and the Hybrid Future

Why the future of AI memory isn't just a bigger context window — and the emerging architectures that combine context with external retrieval and persistent memory systems.

When does a bigger context window stop being the right solution — and what replaces it?

In late 2023, two research directions diverged sharply. One camp — Google, Anthropic — bet that context windows would grow large enough to hold everything relevant. The other — led by startups like Mem.ai and integrated into OpenAI's GPT-4 plugins — argued that retrieval-augmented generation was the correct architecture: keep the context window short and fast, and dynamically fetch only what's needed. By 2025, the industry had stopped choosing between them. The winning answer turned out to be: both.

Three Approaches to AI Memory

There are now three distinct architectural approaches to giving AI models access to large amounts of information, and they are increasingly used together rather than as alternatives:

In-Context (Raw Context)

Everything the model needs is placed directly in the context window at inference time. Maximum fidelity, no retrieval latency, but bounded by context size and cost at scale.

Retrieval-Augmented Generation (RAG)

A retrieval system (typically vector search over embeddings) selects relevant chunks and inserts them into a shorter context. Scales to arbitrary document collections; dependent on retrieval quality.

Persistent Memory / Long-Term Memory

Structured key-value stores, episodic memory databases, or fine-tuned parametric knowledge that persists across sessions. Requires write operations; enables true personalization over time.
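The RAG approach in the list above reduces to three steps: embed, rank, and assemble a prompt. The sketch below shows that flow end to end using a bag-of-words counter as a stand-in for a real embedding model, which is an assumption made so the example runs with no dependencies; the function names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks, k=2):
    """Insert only the retrieved chunks into a short context."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Use only the context below.\n{context}\n\nQuestion: {query}"

chunks = [
    "The lease term is five years starting in 2021.",
    "Either party may terminate with ninety days notice.",
    "The office kitchen is cleaned on Fridays.",
]
print(build_prompt("When can a party terminate the lease?", chunks))
```

Only the top-ranked chunks reach the model, so anything the ranking step misses is invisible to it. That is the dependence on retrieval quality noted in the description.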

When RAG Wins Over Raw Context

RAG has a decisive advantage when the total relevant knowledge base exceeds even the largest context windows. A legal research firm with 50 years of case law — millions of documents — cannot fit that corpus into any context window. RAG allows querying this corpus dynamically.

OpenAI formalized this approach in March 2023 with the release of the Retrieval plugin for ChatGPT, allowing the model to query external document stores via semantic search. This was followed by the more tightly integrated file search capability in the Assistants API, released in November 2023, which automated chunking, embedding, and retrieval behind a simple API interface.

However, RAG has a fundamental weakness: if the retrieval step fails to surface the relevant chunk, the model cannot answer correctly even if it would have been able to reason correctly given the full text. The "lost in the middle" effect compounds this: even a correctly retrieved chunk can be overlooked if it lands in the middle of the assembled prompt.

Real Deployment: Notion AI Q&A

In 2024, Notion released "Notion AI Q&A," which used RAG over a user's entire Notion workspace. The system embedded all pages, used vector search to retrieve relevant chunks, and fed them to a language model. A key engineering challenge Notion's team documented publicly: at very large workspaces (hundreds of thousands of notes), retrieval precision dropped. Their solution combined keyword search with semantic search in a hybrid retrieval approach — demonstrating that even for practical deployments, RAG architecture must be carefully tuned rather than assumed to work at scale.
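Hybrid retrieval needs some way to merge a keyword ranking with a semantic ranking. The sketch below uses reciprocal rank fusion, one common merging method; the text above does not say which method Notion used, so this is a swapped-in illustration, and the page ids are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

keyword_hits = ["page_7", "page_2", "page_9"]   # e.g. from a keyword index (hypothetical ids)
semantic_hits = ["page_2", "page_4", "page_7"]  # e.g. from vector search (hypothetical ids)

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Rank-based fusion avoids having to put keyword scores and embedding similarities on a common scale, which is one reason it is a popular default.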

When Raw Context Wins Over RAG

For tasks requiring reasoning across an entire document — not just locating a fact — RAG often fails. If someone asks a model to "identify all the inconsistencies across this 80-page contract," RAG may retrieve individual sections but miss cross-section contradictions that are only apparent when holding the entire document simultaneously.

This is the use case that long context windows are purpose-built for. Google's 2024 Gemini 1.5 technical report explicitly demonstrated this with a "needle in a haystack" variant requiring multi-hop reasoning: finding information A, then using it to locate information B, when both were buried in 500 pages. RAG consistently failed this multi-hop task; full-context succeeded.

Emerging Hybrid Architectures

The current frontier combines all three approaches. Agentic memory systems — demonstrated in frameworks like LangChain's memory module and AutoGPT's memory architecture — use a tiered approach: short-term in-context working memory, medium-term RAG over session history, and long-term structured memory in a persistent database.
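The tiered approach can be sketched as a small data structure. The class below is a toy illustration, not the design of LangChain or AutoGPT: the class name, the tier sizes, and the keyword match used as a stand-in for retrieval over session history are all assumptions.

```python
from collections import deque

class TieredMemory:
    """Toy three-tier memory: recent turns, searchable session log, persistent facts."""

    def __init__(self, working_size=4):
        self.working = deque(maxlen=working_size)  # short-term: goes into every prompt
        self.session_log = []                      # medium-term: searched on demand
        self.facts = {}                            # long-term: would be persisted to a database

    def add_turn(self, text):
        self.working.append(text)
        self.session_log.append(text)

    def remember(self, key, value):
        self.facts[key] = value

    def search_session(self, keyword, limit=2):
        """Stand-in for RAG over history: keyword match on turns no longer in working memory."""
        older = self.session_log[:-len(self.working)] if self.working else self.session_log
        return [t for t in older if keyword.lower() in t.lower()][:limit]

    def build_context(self, keyword):
        """Assemble what would be placed in the prompt for the next model call."""
        facts = [f"{k}: {v}" for k, v in self.facts.items()]
        return {"facts": facts, "retrieved": self.search_session(keyword), "recent": list(self.working)}

mem = TieredMemory(working_size=2)
mem.remember("preferred_language", "Python")
for turn in ["Deploy target is AWS.", "Budget is tight.", "Use Postgres.", "Ship by March."]:
    mem.add_turn(turn)

print(mem.build_context("aws"))
```

The design choice worth noticing is that each tier has a different cost profile: working memory is paid for on every call, session history only when searched, and persistent facts only when written or injected.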

OpenAI took a step toward standardizing this in February 2024 with the launch of Memory for ChatGPT: a system that automatically writes facts from conversations to a user-level persistent store and injects them into future system prompts. By May 2025, this had expanded to include user-controlled memory management and enterprise-level workspace memory.

The research direction that may ultimately unify these approaches is learned retrieval: training the model itself to decide what to retrieve, when, and how to integrate retrieved information, rather than treating retrieval as a separate pipeline step. This approach, explored in UC Berkeley's MemGPT paper (2023) and subsequent work, aims to give models human-like memory management capabilities.

Key Insight

The question "how large will context windows get?" is becoming less central than "how should context, retrieval, and persistent memory be orchestrated together?" The most capable 2025 systems are hybrids that use each approach where it performs best — suggesting that future AI capability gains will come as much from memory architecture as from raw context size increases.

Lesson 3 Quiz

Memory, Retrieval & Hybrid Architectures · 3 questions

For which type of task does a raw long context window decisively outperform RAG?

Correct. When reasoning requires holding the entire document simultaneously — such as identifying cross-section contradictions in a contract or multi-hop fact-finding — RAG often fails because it retrieves chunks independently, missing relationships between them.

Not quite. Long context windows excel at reasoning that requires the entire document to be held simultaneously, such as finding relationships between passages that RAG would retrieve as separate, unconnected chunks.

What did OpenAI launch in February 2024 that represented a move toward persistent AI memory?

Correct. OpenAI's Memory feature, launched February 2024, automatically writes facts from conversations to a user-level persistent store and injects them into future system prompts — the first mass-market implementation of cross-session AI memory.

Not quite. In February 2024, OpenAI launched "Memory for ChatGPT" — a system that automatically extracts and stores facts from conversations and uses them in future sessions.

What was the key engineering problem Notion AI's team publicly documented when deploying RAG over large workspaces?

Correct. Notion's engineering team documented that pure semantic search degraded at scale, and their solution combined keyword and semantic retrieval in a hybrid system — a now-common pattern in production RAG deployments.

Not quite. Notion documented that retrieval precision degraded at large scales (hundreds of thousands of notes), and their fix was a hybrid retrieval approach combining keyword and semantic search.

Lab 3 — Memory Architecture Tradeoffs

Design hybrid memory systems with your AI lab assistant

Your Task

You've learned about three memory approaches — in-context, RAG, and persistent memory — and why the future is hybrid. Now apply that understanding. Describe a real-world use case to your lab assistant and work through which memory architecture (or combination) best fits it, and why. Consider retrieval precision, cost, and the multi-hop reasoning requirement.

Starter: "I'm building a legal research assistant that needs to handle both case-level full-document analysis AND queries across a database of 500,000 past cases. What memory architecture should I use?"

Lab Assistant

Memory Architecture Design

Ready to work through memory architecture design. Describe your use case and I'll help you reason through the tradeoffs between in-context, RAG, and persistent memory — drawing on documented production patterns from real deployments.

Module 8 · Lesson 4

Applications Unlocked by Extreme Contexts

The real use cases that become possible — or dramatically better — when context windows reach millions of tokens, and what this means for how you should be building with AI today.

Which applications are only possible at certain context thresholds — and what threshold is next?

When context windows crossed 100K tokens in 2023, a specific category of task became newly possible: submitting an entire codebase in a single prompt. When they crossed 1M tokens in 2024, a different category emerged: submitting an entire company's document archive and asking questions across it. Each threshold doesn't just improve existing use cases — it creates entirely new ones that were architecturally impossible before.

Use Cases Unlocked by Threshold

Context window size isn't a smooth continuous variable for applications — specific capabilities become possible at specific thresholds, making each major milestone categorically rather than incrementally different.

~8K

Single document Q&A

A short report, a research paper, a legal brief. Basic summarization and question-answering over individual documents. Available since GPT-4's initial 8K context in 2023.

~32K

Small codebase analysis / multi-document synthesis

A microservice, a small library, or a handful of related documents. First meaningful cross-document reasoning becomes possible. GPT-4-32k enabled this in mid-2023.

~100K

Full novel / entire codebase / transcript analysis

A complete novel, a medium-sized software project, a full day's worth of meeting transcripts. Character consistency analysis, codebase-wide refactoring, or full conversation review. Claude 2 unlocked this in mid-2023.

~1M

Full audio/video / large repository / multi-year document archive

A 1-hour video in its entirety, a large open-source repository (e.g., CPython's ~1.2M token codebase), or a company's full document archive for a specific domain. Gemini 1.5 Pro enabled this in 2024.

~10M+

Entire scientific literature / full legal case history

Not yet available commercially. Would enable querying the complete published literature of a scientific subfield, or a law firm's complete case history. Requires architectural breakthroughs beyond current sparse attention approaches.
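A quick way to apply the thresholds above is to estimate a corpus's token count and see which tier it falls into. The sketch below uses the words-to-tokens ratio implied earlier in this module (2,048 tokens for roughly 1,500 words); real tokenizers vary by language and content, so both the ratio and the helper names are assumptions.

```python
TOKENS_PER_WORD = 2_048 / 1_500  # ratio implied by "2,048 tokens is roughly 1,500 words"

THRESHOLDS = [  # (window size in tokens, tier label) taken from the list above
    (8_000, "~8K: single document Q&A"),
    (32_000, "~32K: small codebase / multi-document synthesis"),
    (100_000, "~100K: full novel / entire codebase"),
    (1_000_000, "~1M: large repository / document archive"),
    (10_000_000, "~10M+: entire literature (not yet commercially available)"),
]

def estimate_tokens(word_count):
    """Very rough token estimate from a word count."""
    return round(word_count * TOKENS_PER_WORD)

def smallest_tier(word_count):
    """Return the estimated token count and the first tier whose window fits it."""
    tokens = estimate_tokens(word_count)
    for size, label in THRESHOLDS:
        if tokens <= size:
            return tokens, label
    return tokens, "beyond the tiers listed"

for words in (5_000, 70_000, 600_000):
    print(words, smallest_tier(words))
```

An estimate like this is only a first filter; for anything near a boundary, count tokens with the tokenizer of the model you actually plan to use.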

Real Applications at the Frontier

Genomics research: In 2024, Google DeepMind demonstrated Gemini 1.5's ability to reason over genomic sequences exceeding 700,000 nucleotides in a single context — a use case that required full-sequence context because mutations early in a sequence affect interpretation of mutations later in it. RAG-based approaches had failed to capture these long-range dependencies.

Video understanding: Google's Project Astra (2024 Google I/O demonstration) used Gemini's long context to process a live video stream alongside conversation history, allowing the model to remember what it had seen earlier in a session and reference it in later answers. The demo showed the model correctly recalling where a specific object had been placed minutes earlier in the session.

Full repository code review: Cognition AI's Devin agent (released March 2024) relied on long context to maintain awareness of the entire codebase it was editing across a multi-hour software development session — tracking dependencies, prior edits, and test results simultaneously. This was explicitly cited as a capability gated on sufficiently long context in Cognition's public documentation.

The CPython Benchmark

In early 2024, researchers tested Gemini 1.5 Pro's code understanding by submitting the entire CPython source repository (~1.2M tokens) as context and asking questions about cross-file dependencies, historical architectural decisions embedded in comments, and function relationships that required reading dozens of files simultaneously. The model answered correctly on the majority of questions that required multi-file reasoning — a task that had previously required a human engineer with deep project familiarity. This benchmark directly influenced how software companies began evaluating long-context models for code review automation.

What Practitioners Should Do Now

The trajectory is clear: context windows will continue growing, costs will continue falling, and applications that are economically impractical today will become routine. The practitioner implication is to design for the context window you'll have, not the one you have now — but to do so thoughtfully.

Audit your RAG pipelines for tasks that actually require full-document context. If your retrieval frequently misses multi-hop relationships, the answer may be a longer context window, not better retrieval.
Benchmark the "lost in the middle" effect for your specific use case. Don't assume long context equals accurate recall — structure your prompts to put critical information at the beginning or end, and test empirically.
Track cost per million tokens, not just capability. Gemini 1.5 Pro's 1M context became practically usable only when costs dropped to ~$7/1M input tokens. The next major application unlocks will follow cost reductions, not just capability announcements.
Plan for hybrid architectures. The winning systems combine long context, RAG, and persistent memory. Design your data architecture now so you can adopt each as it becomes cost-effective for your use case.
Watch the 10M-token threshold. The applications it unlocks — full scientific literature, complete legal archives — will create category-defining products in regulated industries. The companies that have already digitized and structured their knowledge assets will be first to benefit.

The Cost Curve Matters as Much as the Capability Curve

Gemini 1.5 Pro's 1M-token context was announced in February 2024 at prices that made it impractical for many applications. By mid-2024, Google had reduced costs by more than 75%. The gap between "technically possible" and "economically deployable" closes faster than most practitioners expect — and often on a different timeline than the capability announcement itself suggests.
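Tracking cost per call makes the gap between "possible" and "deployable" concrete. The sketch below uses the ~$7 per 1M input tokens figure quoted in this lesson; the further 75% cut is a hypothetical scenario for comparison, not a quoted price, and output-token costs are ignored.

```python
def prompt_cost(n_tokens, usd_per_million):
    """Input-side cost of a single prompt in US dollars."""
    return n_tokens / 1_000_000 * usd_per_million

quoted_rate = 7.00                  # ~$7 per 1M input tokens, as quoted in this lesson
scenario_rate = quoted_rate * 0.25  # hypothetical: what a further 75% cut would look like

for label, rate in (("quoted", quoted_rate), ("after 75% cut", scenario_rate)):
    per_call = prompt_cost(1_000_000, rate)
    print(f"{label}: ${per_call:.2f} per 1M-token call, ${per_call * 1_000:,.0f} per 1,000 calls")
```

At a few dollars per full-window call, an interactive product making thousands of calls a day looks very different from an occasional batch analysis, which is why the same capability can be deployable for one use case and not another.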

Lesson 4 Quiz

Applications Unlocked by Extreme Contexts · 3 questions

Approximately how many tokens is the entire CPython source repository, used to benchmark Gemini 1.5 Pro's code understanding?

Correct. CPython's source repository is approximately 1.2 million tokens — fitting within Gemini 1.5 Pro's context window and enabling cross-file code reasoning that required simultaneous access to dozens of files.

Not quite. The CPython source repository is approximately 1.2 million tokens — large enough to require Gemini 1.5 Pro's 1M+ context window for full-repository reasoning.

Why did Google DeepMind use long context (rather than RAG) for genomic sequence analysis in 2024?

Correct. Genomic sequences have long-range dependencies where context at position 1 affects interpretation at position 700,000 — precisely the multi-hop, cross-section reasoning that full context excels at and chunked RAG fails to capture.

Not quite. The reason is about long-range dependencies in genomic sequences: a mutation early in the sequence affects how you interpret mutations much later. RAG retrieves chunks independently and misses this dependency.

What practical advice does the lesson give about placing critical information in long prompts, based on the "lost in the middle" research?

Correct. The practical takeaway from the lost-in-the-middle research is to structure prompts so critical information appears at the beginning or end of the context, where model recall is most reliable — not buried in the middle.

Not quite. The Stanford/Berkeley research showed models reliably recall from the beginning and end of context but not the middle — so the practical advice is to put critical information at those positions.

Lab 4 — Planning for Future Context

Work through real application design with your AI lab assistant

Your Task

You've seen how specific context thresholds unlock specific application categories, and you've learned the practitioner implications. Now apply the framework to a real or hypothetical product or workflow you care about. Work with your lab assistant to identify what context threshold changes the application category, what the cost curve means for timing, and how to structure a hybrid architecture that's viable now and scalable forward.

Starter: "I work in pharmaceutical research. What context threshold would unlock the ability to query the complete published literature for a specific drug target, and what's the realistic timeline for that becoming economically deployable?"

Lab Assistant

Future Context Applications

Ready to work through future context application design. Tell me about your domain or use case and I'll help you map which context thresholds unlock which capabilities, analyze the cost curves, and design a hybrid architecture that works today while planning for tomorrow's capabilities.

Module 8 — Final Test

Where Context Length Is Going · 15 questions · Pass at 80%

1. What was GPT-3's context window size at launch in 2020?

Correct. GPT-3 launched with a 2,048-token context in 2020.

GPT-3 launched with 2,048 tokens — about 1,500 words — in 2020.

2. Which model first publicly demonstrated a 100,000-token context window?

Correct. Anthropic announced its 100K-token context in May 2023, and Claude 2 was the first widely available model to offer it.

Claude 2 (Anthropic, 2023) was the first to publicly demonstrate a 100K-token context window, following Anthropic's 100K announcement in May 2023.

3. Why is standard transformer attention described as O(n²) in computational cost?

Correct. Each token must attend to every other token — n tokens × n tokens = n² attention operations per layer.

Standard attention requires n×n pairwise comparisons — every token against every other token — giving O(n²) complexity.
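
The quadratic growth is easy to see with a little arithmetic. The sketch below counts query-key comparisons for a single attention pass; the function name attention_pairs is a hypothetical label, and the count ignores heads, layers, and causal masking, which change the constant but not the n² shape.

```python
def attention_pairs(n_tokens: int) -> int:
    """Query-key comparisons in one full attention pass over n_tokens tokens."""
    return n_tokens * n_tokens


# Context sizes mentioned in this module.
for n in (2_048, 100_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_pairs(n):,} comparisons")
```

Going from 100,000 to 1,000,000 tokens multiplies the token count by 10 and the comparison count by 100.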

4. What architectural technique did Google use to make Gemini 1.5 Pro's 1M-token context physically feasible across multiple GPUs?

Correct. Ring Attention distributes the KV cache across multiple GPUs in a ring topology, allowing the combined VRAM of many chips to hold the enormous KV cache required for million-token contexts.

Ring Attention is the technique that distributes the KV cache across GPUs in a ring topology, enabling million-token contexts.
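
To see why a single device cannot hold the KV cache at this scale, a rough size estimate helps. The sketch below uses the standard accounting of one key and one value vector per token, per layer, per KV head. The model dimensions and the 80 GB device size are hypothetical placeholders, not Gemini's actual configuration, which is not given in this lesson.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """Approximate KV cache size: 2 tensors (K and V) per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def devices_needed(total_bytes: int, bytes_per_device: int) -> int:
    """Minimum device count to hold the cache alone (ceiling division)."""
    return -(-total_bytes // bytes_per_device)


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 16-bit values.
cache = kv_cache_bytes(tokens=1_000_000, layers=32, kv_heads=8,
                       head_dim=128, bytes_per_value=2)
print(f"KV cache: {cache / 1e9:.1f} GB")
print("Devices (80 GB each, cache only):", devices_needed(cache, 80 * 10**9))
```

The estimate leaves out model weights and activations, so real deployments need more memory than this lower bound suggests.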

5. The "lost in the middle" phenomenon shows that model recall accuracy is highest for information placed where in the context?

Correct. The Stanford/Berkeley research found models reliably recall information at the beginning and end of context, but accuracy degrades for information placed in the middle.

The "lost in the middle" research showed models recall best from the beginning and end of context — not the middle.

6. What was Flash Attention's core insight, and what bottleneck did it target?

Correct. Tri Dao's Flash Attention recognized that memory bandwidth was the real bottleneck and rewrote the GPU kernel accordingly — without changing the O(n²) theoretical complexity.

Flash Attention's insight was that memory bandwidth (not raw FLOPs) was the bottleneck, and it rewrote the GPU kernel to minimize memory traffic while keeping the same theoretical complexity.
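
The numerical idea that lets attention be computed block by block, without ever storing the full row of scores, can be sketched in plain Python. This is only an illustration of block-wise softmax with a running maximum and running sum for a single query; it is not the GPU kernel, the function names are hypothetical, and the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math


def attention_full(q, keys, values):
    """Reference: compute all scores first, then softmax-weight the values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_tiled(q, keys, values, block=2):
    """Same result, processing keys/values one block at a time."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_full(q, keys, values))
print(attention_tiled(q, keys, values, block=2))
```

Both functions perform the same number of multiplications, which mirrors the point above: the work stays O(n²), but the tiled version only ever needs one block of scores in fast memory at a time.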

7. For which task category does RAG have a decisive structural advantage over raw long context?

Correct. When the total knowledge base exceeds any context window — millions of documents — RAG is the only viable approach because even the largest contexts cannot hold everything simultaneously.

RAG's structural advantage is scale: when the knowledge base exceeds any context window's capacity, RAG is the only viable option for querying it.

8. What was the approximate context size at which full-novel and complete-codebase analysis first became possible?

Correct. The ~100K threshold, reached by Claude 2 in 2023, was the point at which full novels, medium-sized codebases, and full-day transcripts became operable in a single context.

The ~100K threshold (Claude 2, 2023) was the milestone that enabled full-novel and complete-codebase analysis in a single context window.

9. What did OpenAI's "Memory for ChatGPT" (February 2024) do that was new?

Correct. ChatGPT Memory automatically extracted and stored facts from conversations, making them available in future sessions — the first mass-market implementation of cross-session AI memory.

OpenAI's Memory feature automatically saved facts from conversations to a persistent store, injecting them into future sessions — enabling cross-session personalization.

10. Why did Google DeepMind use long context (not RAG) for genomic sequence analysis in 2024?

Correct. Genomic sequences have long-range dependencies that require the entire sequence to be held in context simultaneously — RAG's independent chunk retrieval misses these cross-position relationships.

Long-range genomic dependencies mean that, for example, position 1 can affect interpretation at position 700,000. RAG's chunked retrieval cannot capture that relationship.

11. Approximately how much did Google reduce Gemini 1.5 Pro's cost per token between its announcement (February 2024) and mid-2024?

Correct. Google reduced Gemini 1.5 Pro pricing by more than 75% between its February 2024 announcement and mid-2024, demonstrating how quickly the gap between "technically possible" and "economically deployable" can close.

Google reduced Gemini 1.5 Pro's costs by more than 75% in the months following its launch — a common pattern where capability announcements precede practical deployability.
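
The effect of a 75% price cut on a long-context request can be worked through directly. In the sketch below the launch price is a hypothetical placeholder, not Google's actual price list; only the 75% figure comes from this lesson.

```python
def request_cost(tokens: int, price_per_million: float) -> float:
    """Input cost of one request at a given price per million tokens."""
    return tokens / 1_000_000 * price_per_million


def price_after_reduction(price_per_million: float, reduction: float) -> float:
    """Price after a fractional reduction, e.g. 0.75 for a 75% cut."""
    return price_per_million * (1.0 - reduction)


launch_price = 10.00  # hypothetical price per million input tokens
later_price = price_after_reduction(launch_price, 0.75)

for label, price in (("launch", launch_price), ("after cut", later_price)):
    print(f"{label}: {request_cost(1_000_000, price):.2f} per 1M-token request")
```

At a 75% reduction, four full-context requests cost what one did at launch, which is often the difference between a demo and a workflow that can run daily.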

12. What was the key finding when researchers tested Gemini 1.5 Pro on the entire CPython source repository (~1.2M tokens)?

Correct. Gemini 1.5 Pro answered the majority of multi-file reasoning questions correctly — questions that required simultaneously holding dozens of files and tracking cross-file dependencies.

The CPython benchmark showed Gemini 1.5 Pro correctly handling multi-file reasoning questions that required tracking cross-file dependencies across the full ~1.2M-token repository.

13. What is sliding window attention, and what is its primary advantage over full attention?

Correct. Sliding window attention (used in Mistral's models) limits each token's attention to a fixed window of nearby tokens (some other designs also add a few global tokens), reducing the cost from O(n²) to roughly O(n·w) for window size w. Because w is fixed, cost grows linearly with n, enabling much longer contexts on the same hardware.

Sliding window attention limits each token to attending within a fixed window (plus, in some designs, a few global tokens), bringing cost from O(n²) to near-linear. This is the approach used in Mistral's architecture.
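
The saving can be counted directly. The sketch below compares causal full attention with a causal sliding window, where each token attends to itself and at most window - 1 earlier tokens. The function names are hypothetical, global tokens are ignored, and the 4,096 window is just an example value.

```python
def causal_full_pairs(n: int) -> int:
    """Token i attends to tokens 0..i, so the total is 1 + 2 + ... + n."""
    return n * (n + 1) // 2


def sliding_window_pairs(n: int, window: int) -> int:
    """Token i attends to at most `window` tokens: itself and its predecessors."""
    if n <= window:
        return causal_full_pairs(n)
    # The first `window` tokens see 1..window tokens; every later token sees `window`.
    return window * (window + 1) // 2 + (n - window) * window


n, window = 1_000_000, 4_096  # example sizes, chosen for illustration
full = causal_full_pairs(n)
windowed = sliding_window_pairs(n, window)
print(f"full: {full:,}  windowed: {windowed:,}  ratio: {full / windowed:.0f}x")
```

Doubling n roughly doubles the windowed count but quadruples the full count, which is the practical meaning of near-linear versus quadratic cost.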

14. Which application category is identified as requiring a context threshold of ~10M tokens or more — a threshold not yet commercially available in 2025?

Correct. Querying complete scientific literature or full legal case histories requires ~10M+ tokens — a threshold that remains beyond current commercial availability and requires architectural breakthroughs beyond current sparse attention techniques.

The ~10M token threshold is where applications like querying complete scientific literature or full legal archives become possible — this remains beyond current commercial availability.

15. What does the lesson recommend as the key practitioner strategy for context window planning?

Correct. The recommended approach is to build hybrid architectures viable now (combining context, RAG, and persistent memory) while structuring data assets to take advantage of future context expansions — and tracking cost curves, not just capability announcements.

The lesson recommends hybrid architectures that work now while planning forward — and explicitly tracking cost curves alongside capability announcements, since cost reduction determines when capabilities become deployable.