Lesson 1 · Module 2

The 2,048-Token Era

GPT-3 and the hard ceilings that shaped early AI writing
Why did a 2,048-token ceiling define what language models could, and could not, do in 2020?

When OpenAI released GPT-3 to select API partners in June 2020, the model arrived with a hard ceiling: 2,048 tokens of combined prompt and completion. At roughly 1,500 words of English text, this meant a model that could write a sharp magazine article but could not hold an entire short story in its head at once.

Developers immediately felt the constraint. Writing assistants had to split long documents into chunks. Summarization pipelines truncated legal briefs mid-sentence. Code generation tools lost context after a few hundred lines. The architecture was brilliant; the ceiling was brutal.

What a Token Actually Is

A token is not simply a word. OpenAI's tokenizer splits text into subword units: "running" might be one token, while "unbelievable" might be three. In practice, English text runs at roughly 0.75 words per token, so 2,048 tokens ≈ 1,500 words, or about six pages of double-spaced text.

The 2,048 limit covered the entire context: system instructions, the conversation so far, and the model's response. If a developer's system prompt consumed 300 tokens, only 1,748 remained for user input plus output. Every token spent on context instructions was a token unavailable for content.
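The budgeting arithmetic above can be sketched in a few lines. This is a rough illustration only: the 0.75 words-per-token figure is the rule of thumb from the text, not a real tokenizer count, and the function names are made up for this example.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the text, not an exact tokenizer count


def estimate_words(tokens: int) -> int:
    """Approximate English word count for a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def remaining_budget(context_limit: int, instruction_tokens: int) -> int:
    """Tokens left for user input plus model output after fixed instructions."""
    if instruction_tokens > context_limit:
        raise ValueError("instructions alone exceed the context window")
    return context_limit - instruction_tokens


if __name__ == "__main__":
    print(estimate_words(2048))         # 1536
    print(remaining_budget(2048, 300))  # 1748
```

For real budgeting, the word-based estimate would be replaced with counts from the tokenizer the target model actually uses.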

Why 2,048?

The number traces to practical GPU memory constraints in 2020-era hardware. Transformer attention is quadratic: doubling the context length roughly quadruples the memory required for the attention matrix. At the scale of GPT-3's 175 billion parameters, 2,048 tokens was close to the feasible maximum on the V100-class GPUs the model was trained on.
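To see the quadratic growth concretely, here is a minimal sketch that sizes a single full attention score matrix. It assumes 2-byte (half-precision) scores and counts one head in one layer; both are simplifying assumptions, and real implementations avoid storing the matrix in this naive form.

```python
def attention_matrix_bytes(context_len: int, bytes_per_score: int = 2) -> int:
    """Bytes needed to hold one full n x n attention score matrix."""
    return context_len * context_len * bytes_per_score


if __name__ == "__main__":
    for n in (2048, 4096, 8192):
        mib = attention_matrix_bytes(n) / 2**20
        # Each doubling of n multiplies the result by four.
        print(f"{n:>5} tokens: {mib:,.0f} MiB per head per layer")
```

The absolute numbers matter less than the ratio: each doubling of context length multiplies this term by four, and the cost repeats across every head and layer.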

Real-World Consequences in 2020–2021

Early API adopters built elaborate workarounds. Summarize-then-extend pipelines compressed earlier sections of a document into brief summaries before passing new sections to the model, at the cost of fidelity. Sliding window approaches fed the model overlapping chunks, hoping for coherent joins. Neither was reliable.
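A sliding-window splitter of the kind described above might look roughly like this. It is a sketch, not a reconstruction of any particular product's pipeline: the function name is illustrative, and it works on any list of items (whitespace-split words stand in for tokens in the usage line).

```python
def sliding_window_chunks(tokens, window, overlap):
    """Split a token list into overlapping chunks of at most `window` items."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog again".split()
    for chunk in sliding_window_chunks(words, window=4, overlap=1):
        print(chunk)
```

The overlap gives each chunk a little of the preceding context, but anything defined further back than the overlap is still invisible to the model, which is why joins were often incoherent.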

Copy.ai, one of the first GPT-3-powered writing products (launched October 2020), limited its "Blog Post Wizard" output to short sections precisely because generating anything longer risked context overflow. Users received disconnected paragraphs that required manual stitching.

Legal-tech startups attempting to summarize contracts discovered that a standard non-disclosure agreement (often 3,000–4,000 words) had to be chopped into segments. Clauses at the boundary of a chunk were frequently misrepresented in summaries because the model lost the preceding definitional context.

The Codex Exception: 4,096 Tokens

In August 2021 OpenAI released Codex, the model powering GitHub Copilot. Codex shipped with a 4,096-token context, double GPT-3's limit. The reason was domain-specific: code files are long, function signatures must be visible when generating bodies, and repository context matters. OpenAI made a deliberate choice to extend context for code because the use case demanded it.

Copilot's technical preview (launched June 2021, public beta September 2021) immediately showed what an extended window could do: the model could "see" an entire small Python file and generate a new function consistent with existing class structure. This was qualitatively different from anything possible at 2,048 tokens.

Key Insight

The jump from 2,048 to 4,096 tokens was not primarily a model capability improvement; it was a memory and engineering decision. The underlying transformer architecture was capable; the constraint was always hardware and inference cost, not fundamental model intelligence.

Key Terms
Token: A subword unit of text processed by a language model; roughly 0.75 English words on average.
Context Window: The total number of tokens a model can process simultaneously as input and output.
Attention Matrix: The quadratic data structure that relates every token to every other token; memory cost scales as O(n²) with context length.
Sliding Window: A workaround in which overlapping chunks of a long document are processed sequentially to simulate a larger context.

Lesson 1 Quiz

The 2,048-Token Era: check your understanding
1. Approximately how many English words does a 2,048-token context window hold?
Correct. Tokens average roughly 0.75 English words, so 2,048 × 0.75 ≈ 1,536 words, or approximately 1,500.
Not quite. English text averages about 0.75 words per token, placing 2,048 tokens at roughly 1,500 words.
2. Why was the 2,048-token limit a hardware constraint rather than a fundamental model limitation?
Correct. The quadratic cost of attention meant that doubling context roughly quadrupled GPU memory requirements, a real hardware bottleneck in 2020.
Not quite. The core issue is quadratic attention scaling: memory requirements grew as O(n²) with context length, hitting the ceiling of available GPU memory.
3. What context window did OpenAI's Codex model (powering GitHub Copilot) ship with in 2021?
Correct. Codex doubled GPT-3's limit to 4,096 tokens, enabling it to process entire small source files.
Not quite. Codex shipped at 4,096 tokens, double GPT-3's limit, because code use cases demanded visibility into full file structures.

Lab 1 · Early Token Constraints

Interactive conversation: explore the 2020-era context ceiling

Your Task

You're consulting for a 2021-era startup building a legal document summarizer on GPT-3's API. The context window is 2,048 tokens. Explore the trade-offs with your AI assistant: ask about chunking strategies, token budgeting, and what gets lost when context is exceeded.

Suggested openers: "How should we chunk a 5,000-word contract for GPT-3?" · "What's a realistic system prompt token budget?" · "What information is most at risk when a contract is split across chunks?"
AI Lab Assistant
Context Window History · L1
Welcome to Lab 1. I'm your AI consultant for the 2,048-token era. You're building a legal document summarizer in 2021, and every token counts. Ask me about chunking strategies, what information gets dropped at context boundaries, or how to budget your system prompt. What's your first question?

Lesson 2 · Module 2

The 8K–32K Window

GPT-4 and Claude 1 redefine what "long context" means in 2023
How did the simultaneous arrival of 8K and 32K context models in early 2023 change the class of tasks AI could reliably complete?

On March 14, 2023, OpenAI released GPT-4 with two context variants: a standard 8,192-token version and a limited-access 32,768-token version. The same day, Anthropic released Claude 1 with a 9,000-token context. In a single day, the industry's working definition of a "large" context window had moved fourfold.

The numbers felt abstract until practitioners tried them. A GPT-4-32K session could hold a full academic paper. A Claude 1 session could process a lengthy legal agreement without chunking. Developers who had spent two years engineering elaborate splitting pipelines suddenly found those pipelines obsolete.

GPT-4: Two Windows, One Launch

OpenAI's decision to launch GPT-4 with two context sizes was deliberate. The standard 8,192-token window was available to all API customers; the 32,768-token window ("GPT-4-32K") was restricted to select partners and priced at roughly double the per-token rate. The price difference reflected real inference cost: the larger attention matrix required substantially more GPU memory per request.

The 8K window was four times GPT-3's limit and sufficient for most professional documents. At 8,192 tokens (≈6,100 words), it could process a long feature article, a short legal agreement, or a multi-turn conversation spanning dozens of exchanges, all without chunking.

8K · GPT-4 Standard · ~6,100 words · March 2023
32K · GPT-4 Extended · ~24,000 words · March 2023
9K · Claude 1 · ~6,750 words · March 2023

Claude 1 and the Constitutional Approach

Anthropic's Claude 1, released on March 14, 2023 (the same day as GPT-4), offered a 9,000-token context, slightly larger than GPT-4's standard tier. Anthropic's public technical documentation emphasized that Claude was trained with Constitutional AI methods; less discussed was the engineering effort behind maintaining coherence across a 9K window.

The model performed notably well on tasks that required holding earlier information in mind, such as cross-referencing a document's definitions against later clauses. This highlighted that context window size and context utilization are distinct: a model can technically accept 9,000 tokens yet lose meaningful information from the beginning of a long prompt before reaching the end.

The "Lost in the Middle" Problem

A Stanford-led research paper first posted as a preprint in July 2023 (Liu et al., "Lost in the Middle: How Language Models Use Long Contexts") documented a consistent pattern: models performed significantly better when critical information was placed at the beginning or end of the context window, and significantly worse when it was buried in the middle. A longer window did not guarantee proportionally better performance; position within the window mattered enormously.
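One commonly discussed way to work with this pattern is to reorder passages so the most relevant ones sit at the edges of the prompt. The sketch below is an illustration of that idea, not a method taken from the paper; it assumes the passages are already ranked from most to least relevant, and the function name is hypothetical.

```python
def place_at_edges(ranked_passages):
    """Reorder passages (most relevant first) so the strongest sit at the edges.

    Even-ranked items fill the front in order; odd-ranked items fill the back
    in reverse, which pushes the least relevant passages toward the middle.
    """
    front, back = [], []
    for rank, passage in enumerate(ranked_passages):
        if rank % 2 == 0:
            front.append(passage)
        else:
            back.append(passage)
    return front + back[::-1]


if __name__ == "__main__":
    ranked = ["A (best)", "B", "C", "D", "E (weakest)"]
    print(place_at_edges(ranked))
    # ['A (best)', 'C', 'E (weakest)', 'D', 'B']
```

Whether this helps for a given model is an empirical question; it is worth measuring on the actual task rather than assuming.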

What 32K Actually Enabled

With GPT-4-32K, a new class of task became feasible without engineering workarounds. A 32,768-token window holds approximately 24,000 English words: roughly half the length of a short novel, or a 90-page academic dissertation. Uses reported in 2023 included:

Full-contract review: Law firms running GPT-4-32K on merger agreements (typically 40–80 pages) without splitting. The model could cross-reference defined terms from page 2 while analyzing obligations on page 60.

Extended code analysis: Software teams feeding entire Python modules (2,000–3,000 lines) into a single prompt to detect inconsistencies across functions, something the Codex 4K window made impossible.

Medical literature synthesis: Research groups feeding multiple clinical trial abstracts simultaneously and requesting comparative analysis, removing the summarize-then-compare pipeline that introduced errors.

Pricing Signal

In March 2023, OpenAI priced GPT-4-8K at $0.03 per 1K prompt tokens and GPT-4-32K at $0.06 per 1K prompt tokens, a 2× premium for the larger window. This pricing made context window size an explicit economic variable, forcing teams to quantify how much extra context was worth per task.
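The trade-off is easy to put in numbers. The sketch below uses only the two prompt prices quoted above; the model keys are shorthand labels for this example rather than official API identifiers, and completion-token charges are left out.

```python
# USD per 1K prompt tokens, March 2023 figures quoted in the text.
PROMPT_PRICE_PER_1K = {"gpt-4-8k": 0.03, "gpt-4-32k": 0.06}


def prompt_cost(model: str, prompt_tokens: int) -> float:
    """Prompt-side cost in USD for one request (completion tokens excluded)."""
    return PROMPT_PRICE_PER_1K[model] * prompt_tokens / 1000


if __name__ == "__main__":
    # The same 6,000-token prompt costs twice as much on the 32K tier.
    print(round(prompt_cost("gpt-4-8k", 6000), 2))
    print(round(prompt_cost("gpt-4-32k", 6000), 2))
    # A prompt that only fits in the larger window.
    print(round(prompt_cost("gpt-4-32k", 24000), 2))
```

A prompt that fits in 8K gains nothing from the larger tier, so the premium is only worth paying when the task genuinely needs the extra context in one call.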

Key Terms
Context Utilization: The degree to which a model effectively uses all available tokens in its window, not merely whether the tokens fit.
Lost in the Middle: Documented tendency of LLMs to underweight information placed in the middle of a long context relative to the beginning or end.
Per-Token Pricing: API pricing model in which users pay separately for prompt tokens consumed and completion tokens generated.

Lesson 2 Quiz

The 8K–32K Window: check your understanding
1. GPT-4 launched on March 14, 2023 with two context variants. What were they?
Correct. GPT-4 launched with 8,192-token (standard) and 32,768-token (extended, restricted access) variants.
Not quite. GPT-4's two launch variants were 8,192 tokens (standard, broadly available) and 32,768 tokens (GPT-4-32K, restricted access).
2. The "Lost in the Middle" phenomenon, documented by Stanford NLP researchers in 2023, refers to what?
Correct. Liu et al. (2023) showed that LLMs systematically underweight middle-of-context information compared to information at the start or end.
Not quite. "Lost in the Middle" describes the finding that models perform significantly worse when key information is buried in the center of a long context rather than positioned at the edges.
3. Why did OpenAI price GPT-4-32K at approximately 2× the per-token rate of GPT-4-8K?
Correct. Because attention scales quadratically, a 4× larger context window requires substantially more memory and compute per request; the price premium reflected real infrastructure costs.
Not quite. The price premium reflected genuine infrastructure costs: quadratic attention scaling means the 32K window's attention matrix is ~16× larger in memory than the 8K window's, requiring more GPU resources per request.

Lab 2 · The 8K–32K Leap

Interactive conversation: explore what larger windows unlock and their limits

Your Task

It's mid-2023. You're a product manager deciding whether to pay the 2× premium for GPT-4-32K or optimize your pipeline for the 8K standard. Explore the trade-offs with your AI assistant: discuss which use cases justify 32K, how to structure prompts to avoid "Lost in the Middle" failure modes, and when chunking remains the smarter choice.

Suggested openers: "When does a 32K context actually earn its price premium?" · "How should I structure a 20,000-word contract for the 8K model?" · "What's the 'Lost in the Middle' problem and how do I work around it?"
AI Lab Assistant
Context Window History · L2
Welcome to Lab 2. I'm your AI strategy consultant for the 8K–32K era of early 2023. You're deciding whether GPT-4-32K's premium is worth it for your specific use case. Ask me about task suitability, prompt structuring to minimize "Lost in the Middle" effects, or when to stick with chunking pipelines. What's your situation?

Lesson 3 · Module 2

The 100K Moment

Claude's 100,000-token window and the collapse of document chunking
What changed, practically and conceptually, when a 100,000-token context became commercially available in May 2023?

On May 11, 2023, Anthropic released Claude 1.3 with a context window of 100,000 tokens: approximately 75,000 words, or a full-length novel. The announcement was treated with some skepticism: prior model generations had demonstrated that holding a large context and using it reliably were separate problems.

Anthropic documented a specific test: loading the entire text of the novel The Great Gatsby (about 72,000 tokens) into the model with one line altered, then asking the model to spot what was different. The model identified the change correctly. The 100K window was not just bigger; it appeared to be usable throughout its range.

The Scale of 100,000 Tokens

To make the number concrete: 100,000 tokens at the standard 0.75 words-per-token ratio ≈ 75,000 words. This is sufficient to hold:

- The complete text of a standard merger agreement (typically 40–80 pages, 20,000–40,000 words)
- A full software codebase for a small application (on the order of 10,000 lines of Python)
- An entire shorter novel (Gatsby: 47,094 words; most genre fiction, at 70,000–100,000 words, sits at or just beyond the limit)
- Hundreds of pages of financial filings or clinical trial reports

For the first time, a model could process an entire professional-scale document in a single inference call with no splitting, no summarization, no pipeline: just input and output.

The "Needle in a Haystack" Test

Anthropic's May 2023 announcement described loading the full text of The Great Gatsby into Claude with a single line altered (changing a character's occupation), then asking the model what was different. The model spotted the change, providing early evidence that the 100K window was usable throughout, not just at the edges. This informal demonstration anticipated the "needle-in-a-haystack" tests that were later formalized and adopted across the industry.
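The general shape of a needle-in-a-haystack check can be sketched as follows. This is an illustrative harness, not Anthropic's or any benchmark's actual code: `ask_model` is a hypothetical callable that takes a context string and a question and returns the model's answer, and "recall" here is a simple substring match.

```python
def insert_needle(haystack_sentences, needle, depth):
    """Return a copy of the haystack with the needle placed at a relative depth.

    depth=0.0 puts the needle first, depth=1.0 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(haystack_sentences))
    return haystack_sentences[:position] + [needle] + haystack_sentences[position:]


def recall_by_depth(haystack_sentences, needle, question, expected, ask_model,
                    depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Map each depth to True/False: did the answer contain the expected text?"""
    results = {}
    for depth in depths:
        context = " ".join(insert_needle(haystack_sentences, needle, depth))
        answer = ask_model(context, question)  # hypothetical model call
        results[depth] = expected.lower() in answer.lower()
    return results
```

Running the same question with the needle at several depths is what exposes position effects: a model that only succeeds at depths 0.0 and 1.0 is accepting the context without using the middle of it.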

What Became Obsolete in May 2023

The immediate practical consequence was the obsolescence of entire engineering categories that had dominated AI product development since 2020. Three specific pipelines became less necessary overnight:

MapReduce summarization (summarizing document chunks independently, then summarizing the summaries) had been a standard LangChain workflow. With 100K context, the summarization could happen on the full document, eliminating information loss at chunk boundaries.
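For readers who have not seen the pattern, a bare-bones version of map-reduce summarization looks something like this. It is a generic sketch and not LangChain's API: `summarize` is a hypothetical callable wrapping a model call, and word counts stand in for token counts.

```python
def map_reduce_summarize(chunks, summarize, max_words_per_call):
    """Summarize each chunk, then merge the summaries until they fit one call."""
    if not chunks:
        return ""
    # Map step: each chunk is summarized with no knowledge of the others.
    partials = [summarize(chunk) for chunk in chunks]
    combined = "\n".join(partials)
    # Reduce step: merge summaries pairwise while they are still too long.
    while len(combined.split()) > max_words_per_call and len(partials) > 1:
        grouped = ["\n".join(partials[i:i + 2]) for i in range(0, len(partials), 2)]
        partials = [summarize(group) for group in grouped]
        combined = "\n".join(partials)
    return summarize(combined)
```

Every pass through `summarize` discards detail, and the map step never sees cross-chunk context, which is where the boundary errors described above come from.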

Semantic chunking (splitting documents at semantically coherent boundaries rather than arbitrary token counts) had been an active research area. Tools like LlamaIndex had built entire product value propositions around intelligent chunking. The 100K window dramatically reduced the number of situations in which chunking was necessary.

Vector database retrieval as the primary context management strategy faced reconsideration. When a full document could fit in context, the trade-off between embedding-based retrieval (fast, imprecise) and full-context analysis (slow, comprehensive) shifted toward full-context for high-stakes professional tasks.

100K · Claude 1.3 Context · May 2023 · ~75,000 words
49× · Growth vs GPT-3 · 2,048 → 100,000 tokens in 3 years
3× · vs GPT-4-32K · Largest prior commercial window

The Latency and Cost Trade-Off

The 100K window came with real costs. Processing a full 100K context was substantially slower than a 4K or 8K request: early tests reported latencies of 30–60 seconds for full-context inference, compared to 2–5 seconds for short prompts. API pricing reflected the computational cost: Anthropic's Claude 100K pricing at launch was $11.02 per million prompt tokens (compared to OpenAI's GPT-4-8K at $30 per million prompt tokens at the time, making Claude 100K relatively competitive despite the larger window).

The latency issue was not trivial. A 45-second response time is acceptable for a lawyer reviewing a contract overnight but unacceptable for a customer service chatbot. The 100K window opened new categories of batch, asynchronous AI tasks while leaving real-time interactive use cases better served by smaller, faster contexts.

Context vs. Retrieval: A Persistent Debate

Researchers at Anthropic and elsewhere noted in 2023 that even with a 100K context, retrieval-augmented generation (RAG) retained significant advantages for certain task types: when relevant information constituted only a tiny fraction of a very large corpus, retrieving the relevant chunks remained faster and cheaper than stuffing the full corpus into context. The 100K window did not eliminate RAG; it shifted the economic threshold at which RAG became preferable.
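The retrieval step can be illustrated with a deliberately simple stand-in. Production RAG systems typically rank chunks by embedding similarity; the sketch below swaps in plain word overlap so that it runs without any model or vector database, and the function names are made up for this example.

```python
def retrieve_top_k(query, chunks, k):
    """Rank chunks by how many query words they share; keep the best k.

    Word overlap is a crude stand-in for embedding similarity.
    """
    query_terms = set(query.lower().split())

    def overlap(chunk):
        return len(query_terms & set(chunk.lower().split()))

    return sorted(chunks, key=overlap, reverse=True)[:k]


def build_prompt(query, chunks, k=2):
    """Assemble a prompt that contains only the retrieved chunks."""
    context = "\n\n".join(retrieve_top_k(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    corpus = [
        "termination clause notice period",
        "governing law delaware",
        "payment terms net thirty",
    ]
    print(build_prompt("what is the notice period for termination", corpus, k=1))
```

The prompt grows with `k`, not with the size of the corpus, which is why retrieval stays cheaper when only a small fraction of a large corpus is relevant.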

Key Terms
MapReduce Summarization: A pipeline in which document chunks are summarized independently (map), then those summaries are summarized again (reduce) to fit within a context window.
Needle in a Haystack: An evaluation methodology testing whether a model can accurately retrieve a specific piece of information from anywhere within a long context window.
RAG (Retrieval-Augmented Generation): A technique that retrieves semantically relevant document chunks via vector similarity and injects them into the model's context, rather than passing entire documents.

Lesson 3 Quiz

The 100K Moment: check your understanding
1. Anthropic's May 2023 announcement demonstrated the 100K context window's reliability using which test?
Correct. Anthropic loaded the full text of The Great Gatsby into Claude with one line altered and asked the model to spot the change, an approach that anticipated later "needle in a haystack" evaluation.
Not quite. Anthropic's demonstration involved loading the entire text of The Great Gatsby into Claude with a single altered line, then asking Claude to identify what was different.
2. Which engineering category became largely obsolete for single-document tasks after Claude's 100K context launch?
Correct. MapReduce summarization, which existed specifically to handle documents longer than the context window, became unnecessary for single documents that fit within 100K tokens.
Not quite. The 100K window specifically made MapReduce summarization (chunking documents and summarizing the summaries) unnecessary for most single-document tasks, since the full document could now be passed directly.
3. Why did RAG (retrieval-augmented generation) remain valuable even after the 100K window became available?
Correct. For large corpora where only a small fraction is relevant to any given query, RAG's retrieval step remained faster and more economical than full-context inference; the 100K window shifted, but did not eliminate, the RAG trade-off.
Not quite. RAG retained value when relevant content was a small fraction of a large corpus: retrieving the relevant chunks remained faster and cheaper than including the entire corpus in context, even at 100K tokens.

Lab 3 · The 100K Era

Interactive conversation: explore long-document AI and its limits

Your Task

It's mid-2023 and you're a legal-tech engineer evaluating whether to rebuild your contract analysis pipeline around Claude's 100K context. Your current system uses a MapReduce summarization pipeline with LangChain. Explore the trade-offs with your AI assistant: when is full-context analysis worth its latency cost, and when should you keep chunking?

Suggested openers: "When does full-context analysis beat my MapReduce pipeline for contracts?" · "How do I handle a 200,000-word corpus that's still too large for 100K?" · "What's the latency cost and how do I design around it?"
AI Lab Assistant
Context Window History · L3
Welcome to Lab 3. I'm your AI systems architect for the 100K era. You're deciding whether to migrate your LangChain-based MapReduce contract analysis pipeline to Claude's full-context approach. This is a real engineering decision with latency, cost, and accuracy dimensions. What aspect do you want to dig into first?

Lesson 4 · Module 2

The Million-Token Race

Gemini 1.5 Pro, Claude 3, and the architecture breakthroughs enabling context at scale
How did Google's Mixture-of-Experts architecture enable Gemini 1.5 Pro's 1-million-token context, and what genuinely new capabilities does this scale unlock?

On February 15, 2024, Google DeepMind previewed Gemini 1.5 Pro with a context window of 1,000,000 tokens. The research preview announcement included a specific demonstration: the model was fed the entire source code of a video game and asked to locate and explain a specific bug. It found it. The model was also fed 11 hours of audio and asked about a specific 30-second segment. It answered accurately.

The jump from 100,000 to 1,000,000 tokens was not incremental. It required a fundamentally different architectural approach, specifically Google's implementation of Mixture-of-Experts (MoE) combined with improvements to positional encoding that allowed attention to remain coherent across one million token positions.

The Architecture That Made It Possible

Standard transformer attention scales as O(n²) with context length, the same constraint that held GPT-3 to 2,048 tokens. One million tokens in a standard attention mechanism would imply an attention matrix of 10¹² cells per head and layer, far too large to materialize in the memory of any single accelerator.

Gemini 1.5 Pro is built on a Mixture-of-Experts (MoE) architecture: routing activates only a subset of the model's parameters for each token, which reduces the compute spent per token and makes very long inputs more affordable to process. MoE does not by itself remove the quadratic cost of attention, and Google has not published the full details of how attention is handled at this length. Distributed techniques such as ring attention, which spread the attention computation across multiple devices so that no single device holds the full attention matrix, illustrate the kind of approach that makes million-token contexts tractable.
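The routing idea can be shown with a toy example. This is a generic top-k gating sketch under simplifying assumptions, not Gemini's implementation (whose routing details are not public): experts are plain Python functions on a single number, and the gate scores are supplied directly rather than computed by a learned layer.

```python
import math


def top_k_route(gate_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected scores only, so they sum to 1.
    """
    top = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)[:k]
    peak = max(gate_scores[i] for i in top)
    exps = [math.exp(gate_scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, gate_scores, k=2):
    """Combine the outputs of the selected experts; the others are never called."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(gate_scores, k))


if __name__ == "__main__":
    experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
    print(moe_forward(3.0, experts, gate_scores=[0.1, 2.0, -1.0, 2.0], k=2))
```

With four experts and k=2, only half of the expert functions run for this input. That is the saving MoE provides: compute per token scales with k rather than with the total number of experts.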

The Google DeepMind technical report (February 2024) described Gemini 1.5 Pro as achieving "near-perfect recall" on needle-in-a-haystack tasks up to 1M tokens, a remarkable claim that external researchers subjected to skeptical independent testing.

LMSYS Independent Testing: March 2024

Independent researchers using the LMSYS Chatbot Arena and community needle-in-a-haystack tests largely confirmed Gemini 1.5 Pro's claims up to approximately 500K tokens, with performance degrading somewhat in the 500K–1M range. The degradation was real but substantially smaller than the "lost in the middle" effect seen in earlier models: a genuine qualitative improvement in context utilization, not just context capacity.

Claude 3 and Anthropic's Response

On March 4, 2024, Anthropic released the Claude 3 model family (Haiku, Sonnet, Opus), all with 200,000-token context windows. This was double the 100K window of Claude 1.3 (Claude 2.1 had already reached 200K in November 2023) and positioned Anthropic competitively against Google's 1M preview, which was not yet in general availability.

Anthropic's technical documentation for Claude 3 emphasized a different metric than raw context size: context faithfulness, specifically how well the model maintained accurate recall across the full 200K range. Independent evaluations (conducted by researchers at AI safety organizations and published on platforms like AI Alignment Forum in March–April 2024) found Claude 3 Opus performed more consistently across its 200K window than earlier models had across their smaller windows.

1M · Gemini 1.5 Pro · Feb 2024 preview · ~750K words
200K · Claude 3 Opus · March 2024 · ~150K words
128K · GPT-4 Turbo · Nov 2023 · ~96K words

What Only 1M+ Tokens Actually Enables

The practical uses that become feasible only above the 100K threshold, and which Gemini 1.5 Pro's technical report specifically documented, fall into distinct categories:

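Full codebase analysis: A 1M-token window can hold approximately 750,000 words of text. Google's announcement cited codebases of more than 30,000 lines of code handled in a single context. Code tends to use more tokens per line than the word-based estimate suggests, so a project on the scale of CPython's core library, at approximately 400,000 lines, would likely still exceed the window and need filtering or retrieval. Within the limit, developers can ask "find all places where this variable is used and explain the data flow" across an entire real-world project.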

Hour-scale audio and video: Google's February 2024 demonstration fed 11 hours of audio in a single context. At roughly 1,500 tokens per minute of audio transcription, 11 hours ≈ 990,000 tokens, fitting within 1M. This enables tasks like "find all segments in this recorded conference where the speaker contradicts an earlier statement."

Multi-document legal and financial analysis: A full mergers-and-acquisitions due diligence set, encompassing hundreds of documents, can exceed 500,000 words. At 1M tokens, the entire set fits in a single context, enabling cross-document analysis without retrieval pipelines.

GPT-4 Turbo: OpenAI's 128K Response

OpenAI had responded to Claude's 100K window in November 2023 with the release of GPT-4 Turbo, featuring a 128,000-token context window, a substantial jump from the 32K maximum previously available. GPT-4 Turbo's announcement emphasized not just context size but price: it was priced at $0.01 per 1K prompt tokens, one-third the GPT-4-8K price, making large-context inference significantly more economical.

By early 2024, the context window race had clearly entered a new phase: the constraint was no longer whether a window was "large enough" for most professional documents, but how well the model used what it could see, and how economically it could process it.

The Utilization vs. Capacity Distinction, Crystallized

By mid-2024 the research community had solidified a key distinction: context capacity (how many tokens the model accepts) versus context utilization (how well the model actually uses those tokens). Gemini 1.5 Pro's 1M window with near-perfect recall represented a genuine breakthrough in utilization, not just capacity. This distinction would define the next phase of the context window race: not who had the largest window, but whose model actually paid attention to all of it.

Key Terms
Mixture-of-Experts (MoE): An architecture in which a model routes each token to a subset of specialized "expert" sub-networks rather than activating all parameters for every token, dramatically reducing per-token compute cost.
Ring Attention: A distributed attention computation technique that passes attention keys and values between devices in a ring topology, enabling attention over contexts too large for any single device's memory.
Context Faithfulness: The degree to which a model's outputs accurately reflect information from across the full context window, as distinct from merely accepting that many tokens.
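The Ring Attention entry above rests on a simpler building block: attention can be computed one block of keys and values at a time, carrying only running totals, so the full row of scores never has to be held at once. The sketch below shows that single-device core for one query vector in plain Python. It is an illustration of blockwise (online-softmax) attention, not a distributed implementation; in ring attention, each block would live on a different device and be passed around the ring.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference: softmax(q . K) weighted sum of V, using all scores at once."""
    scores = [_dot(q, k) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def blockwise_attention(q, keys, values, block_size):
    """Same result, but only `block_size` scores exist at any moment."""
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        scores = [_dot(q, k) for k in keys[start:start + block_size]]
        new_max = max(running_max, max(scores))
        # Rescale earlier totals to the new maximum (exp(-inf) is 0.0 on the first block).
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + block_size]):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]
```

The two functions agree up to floating-point rounding. The total work is still quadratic across all queries; what changes is that memory per step is bounded by the block size, which is the property that lets the computation be spread over many devices.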

Lesson 4 Quiz

The Million-Token Race: check your understanding
1. Which architectural technique made Gemini 1.5 Pro's 1-million-token context feasible by activating only a subset of parameters per token?
Correct. MoE routing activates only a subset of expert sub-networks for each token, making the per-token compute cost tractable even at 1M-token contexts.
Not quite. The key technique was Mixture-of-Experts (MoE) routing: activating only a fraction of the model's parameters per token, reducing the per-token compute cost that would otherwise make 1M-context inference infeasible.
2. When Anthropic released Claude 3 in March 2024, what context window did all three variants (Haiku, Sonnet, Opus) offer?
Correct. All three Claude 3 models launched with 200,000-token context windows, double the 100K limit of Claude 1.3.
Not quite. Claude 3 (Haiku, Sonnet, and Opus) all launched in March 2024 with 200,000-token context windows, double Claude's original 100K limit.
3. What critical distinction did the research community crystallize by mid-2024 regarding large context windows?
Correct. The capacity vs. utilization distinction became central: a model could accept 1M tokens while still "losing" information in the middle, so utilization quality, not raw capacity, became the key differentiator.
Not quite. The key distinction was between context capacity (how many tokens a model accepts) and context utilization (how faithfully and accurately it uses all those tokens); a larger window doesn't automatically mean better use of the whole window.

Lab 4 · The Million-Token Era

Interactive conversation: explore when extreme contexts actually help

Your Task

It's mid-2024. You're an AI engineering lead evaluating whether Gemini 1.5 Pro's 1M-token context makes sense for your company's codebase analysis and legal due-diligence workflows. Explore the utilization vs. capacity question with your AI assistant: when does the million-token context genuinely help, and when is a well-implemented RAG pipeline still the smarter choice?

Suggested openers: "For a 400,000-line codebase analysis, would I actually use Gemini 1.5 Pro over a RAG pipeline?" · "How do I test whether a model is actually utilizing 1M tokens vs. just accepting them?" · "What's a needle-in-a-haystack test and how would I design one for my use case?"
AI Lab Assistant
Context Window History · L4
Welcome to Lab 4. I'm your AI architecture consultant for the million-token era. You're evaluating Gemini 1.5 Pro's 1M context against a well-tuned RAG pipeline for two use cases: full codebase analysis and M&A due diligence. The answer depends on details about your data, latency requirements, and how much you trust the model's utilization. What would you like to explore?

Module 2 Test

The History of Context Expansion Β· 15 questions Β· Pass at 80%
1. GPT-3 launched in 2020 with a maximum context window of how many tokens?
Correct. GPT-3 launched with a 2,048-token context window β€” approximately 1,500 English words.
GPT-3's context window was 2,048 tokens β€” approximately 1,500 English words.
2. Why does transformer attention scale as O(nΒ²) with context length?
Correct. Attention computes a relationship score between every pair of tokens β€” n tokens Γ— n tokens = nΒ² relationships in the attention matrix.
Attention scales quadratically because every token must attend to every other token, producing an n Γ— n attention matrix that quadruples in size when context length doubles.
3. Codex, the model powering GitHub Copilot's 2021 preview, featured a context window of:
Correct. Codex doubled GPT-3's limit with a 4,096-token window β€” sufficient to process an entire small source file.
Codex shipped with 4,096 tokens β€” double GPT-3's 2,048 β€” because code tasks demand visibility into full file structures.
4. Copy.ai's "Blog Post Wizard" in 2020 produced disconnected paragraphs requiring manual stitching because:
Correct. Generating anything longer than roughly 1,500 words required splitting content into sections; each section was generated without access to the full preceding content.
The 2,048-token window covered the entire context (prompt plus completion), forcing section-by-section generation without cross-section coherence.
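Because prompt and completion share one window, the space left over is a subtraction. The following is a minimal sketch with a hypothetical function name; the 300-token prompt is the example from Lesson 1, and the 0.75 words-per-token ratio is the lesson's rough figure for English text.

```python
WORDS_PER_TOKEN = 0.75  # rough English average quoted in Lesson 1


def remaining_budget(window: int, prompt_tokens: int) -> int:
    """Tokens still available after the given prompt tokens are counted."""
    if prompt_tokens > window:
        raise ValueError("prompt alone exceeds the context window")
    return window - prompt_tokens


remaining = remaining_budget(window=2_048, prompt_tokens=300)
print(remaining, "tokens left, roughly", round(remaining * WORDS_PER_TOKEN), "words")
```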
5. GPT-4 launched in March 2023 with a standard context window of:
Correct. GPT-4's broadly available standard context was 8,192 tokens; the restricted GPT-4-32K offered 32,768.
GPT-4's standard (broadly available) context window was 8,192 tokens. The 32,768-token GPT-4-32K was a restricted, higher-cost variant.
6. The Stanford NLP paper "Lost in the Middle" (Liu et al., first posted in July 2023) documented what finding?
Correct. Liu et al. showed a U-shaped performance curve: models recall the beginning and end of the context well, but information placed in the middle is systematically underweighted.
"Lost in the Middle" documented that models show a U-shaped recall pattern, performing well at context edges but poorly on information positioned in the middle of a long context.
7. Anthropic released Claude 1.3 with a 100K-token context window in:
Correct. Claude 1.3 with its 100K context launched in May 2023, approximately two months after GPT-4.
Claude 1.3 with 100,000-token context was released in May 2023, roughly two months after GPT-4's March 2023 launch.
8. Anthropic's demonstration of the Claude 100K context window involved:
Correct. The demonstration loaded the full text of The Great Gatsby with one line altered and showed that Claude could spot the change, an early form of what became needle-in-a-haystack evaluation.
Anthropic's demo fed Claude the full text of The Great Gatsby (about 72K tokens) with one line modified, then asked what was different, demonstrating usable recall across a book-length context.
9. MapReduce summarization pipelines became largely obsolete for single-document tasks after the 100K context because:
Correct. A 100K-token window (~75,000 words) is sufficient for the vast majority of professional documents, so full-context inference replaced chunked summarization for single-document tasks.
MapReduce pipelines existed to handle documents longer than the context window; when the window grew large enough to hold most complete professional documents, the pipeline became unnecessary for single-document tasks.
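The pattern in question 9 can be sketched as follows. This is a simplified illustration, not any particular framework's implementation: `summarize` is a hypothetical callable standing in for a model call, word counts are used as a crude proxy for tokens, and the combined partial summaries are assumed to fit in one window.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_document(text: str, summarize: Callable[[str], str], max_words: int) -> str:
    """One call if the document fits the window; otherwise map over chunks, then reduce."""
    chunks = chunk_words(text, max_words)
    if len(chunks) <= 1:
        return summarize(text)                      # full-context path
    partials = [summarize(chunk) for chunk in chunks]  # map step: each chunk seen in isolation
    return summarize(" ".join(partials))            # reduce step: summary of summaries


calls: List[int] = []


def fake_summarize(text: str) -> str:
    """Stand-in for a model call (demonstration only): keeps the first five words."""
    calls.append(len(text.split()))
    return " ".join(text.split()[:5])


doc = " ".join(f"w{i}" for i in range(30))

summarize_document(doc, fake_summarize, max_words=10)
print("small window:", len(calls), "calls")

calls.clear()
summarize_document(doc, fake_summarize, max_words=100)
print("large window:", len(calls), "call")
```

When the window is large enough, the map and reduce steps collapse into a single call that sees the whole document, which is why the pipeline stopped being necessary for most single-document work.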
10. OpenAI released GPT-4 Turbo with a 128K context window in:
Correct. GPT-4 Turbo launched in November 2023 with a 128K context window and significantly lower per-token pricing than GPT-4.
GPT-4 Turbo with its 128,000-token context launched in November 2023, also featuring a significant price reduction to $0.01 per 1K prompt tokens.
11. Which context window did Gemini 1.5 Pro preview in February 2024?
Correct. Gemini 1.5 Pro's February 2024 preview announced a 1,000,000-token (one million) context window.
Gemini 1.5 Pro's February 2024 preview featured a 1,000,000-token (one million) context window, the largest commercially available at the time.
12. Ring attention allows 1M-context inference by:
Correct. Ring attention distributes attention keys and values across devices in a circular pattern, making it feasible to compute attention over sequences too large for any single device's memory.
Ring attention passes attention keys and values between devices in a ring topology, enabling full attention computation over very long sequences without requiring any single device to hold the full attention matrix.
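The idea in question 12 can be illustrated with a single-process simulation. This is a toy sketch under stated assumptions, not a production implementation: the "devices" are just loop indices, attention is bidirectional (no causal mask), the sequence length is assumed divisible by the device count, and all function names are illustrative. Each simulated device keeps its own query block and sees one key/value block per step, combining results with a running softmax so that no device ever holds the full score matrix.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(queries, keys, values):
    """Reference: softmax(q.k / sqrt(d)) over all keys at once."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [dot(q, k) * scale for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) / total
                    for j in range(len(values[0]))])
    return out


def split(rows, parts):
    size = len(rows) // parts
    return [rows[i * size:(i + 1) * size] for i in range(parts)]


def ring_attention(queries, keys, values, n_devices):
    """Simulate devices that each hold one query block and receive K/V blocks in turn."""
    assert len(queries) % n_devices == 0, "sketch assumes an even split"
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim_v = len(values[0])
    q_blocks = split(queries, n_devices)
    kv_blocks = list(zip(split(keys, n_devices), split(values, n_devices)))
    # Per query row: [running max, running denominator, running weighted sum]
    state = [[[-math.inf, 0.0, [0.0] * dim_v] for _ in block] for block in q_blocks]
    for step in range(n_devices):
        for dev in range(n_devices):
            # After `step` hops around the ring, device `dev` holds this K/V block.
            k_blk, v_blk = kv_blocks[(dev + step) % n_devices]
            for i, q in enumerate(q_blocks[dev]):
                run_max, denom, numer = state[dev][i]
                scores = [dot(q, k) * scale for k in k_blk]
                new_max = max(run_max, max(scores))
                rescale = math.exp(run_max - new_max)  # 0.0 on the first step
                weights = [math.exp(s - new_max) for s in scores]
                denom = denom * rescale + sum(weights)
                numer = [n * rescale + sum(w * v[j] for w, v in zip(weights, v_blk))
                         for j, n in enumerate(numer)]
                state[dev][i] = [new_max, denom, numer]
    return [[n / denom for n in numer]
            for dev_state in state for _, denom, numer in dev_state]


def random_rows(n, d):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]


random.seed(0)
q, k, v = random_rows(8, 4), random_rows(8, 4), random_rows(8, 4)
reference = full_attention(q, k, v)
ringed = ring_attention(q, k, v, n_devices=4)
max_diff = max(abs(a - b) for ra, rb in zip(reference, ringed) for a, b in zip(ra, rb))
print("max difference vs. full attention:", max_diff)
```

The blockwise result matches full attention up to floating-point rounding, which is the point of the technique: the computation is exact, and only the memory layout and communication pattern change.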
13. Claude 3's context window (all three variants) at launch in March 2024 was:
Correct. All Claude 3 variants (Haiku, Sonnet, Opus) launched in March 2024 with 200,000-token contexts.
The Claude 3 models (Haiku, Sonnet, and Opus) all launched in March 2024 with 200,000-token context windows, double Claude 1.3's 100K.
14. What does "context utilization" mean, as distinct from "context capacity"?
Correct. Utilization asks: given that the model accepted N tokens, how well does it actually draw on information from all positions in that window? Capacity just measures the maximum N.
Context utilization describes how faithfully a model uses the information at all positions in its context, as distinct from context capacity, which merely measures the maximum number of tokens accepted.
15. Which of the following was a documented use case that became feasible only with 1M+ token contexts, not practically achievable at 100K tokens?
Correct. A 400,000-line codebase like CPython's core library exceeds 100K tokens but fits within a 1M context, enabling full-codebase cross-referencing without retrieval pipelines.
Large open-source codebases (400,000+ lines) far exceed 100K tokens but fit within a 1M context, making full-codebase analysis without retrieval pipelines a distinctly 1M-era capability.