The Context Window Race

1. A parameter count of 100 billion and a context window of 4,096 tokens means the model:

Correct. Parameters and context window are independent dimensions. Parameter count encodes learned knowledge; context window determines what can be attended to in a single inference pass.

Not quite. Parameters and context window are completely separate. No matter how many parameters a model has, it cannot attend to tokens outside its context window in a single pass.

2. What did Shi et al. (2023) find about the relationship between longer contexts with distractors and model hallucination rates?

Correct. Shi et al. (2023) found that longer contexts containing irrelevant distractor content significantly increase hallucination rates — models confabulate rather than accurately retrieve from long, noisy inputs.

Not correct. Shi et al. found that irrelevant distractor content in long contexts increases hallucination rates — a counter-intuitive finding that more context is not always better.

3. The "Lost in the Middle" paper found that recall performance had which shape when plotted against information position in the context?

Correct. The U-shaped curve reflects primacy bias (strong at the start) and recency bias (strong at the end), with poor performance for content in the middle.

The "Lost in the Middle" paper found a U-shaped recall curve — high at the beginning and end, poor in the middle.

4. Why did Google DeepMind use long context (not RAG) for genomic sequence analysis in 2024?

Correct. Genomic sequences have long-range dependencies that require the entire sequence to be held in context simultaneously — RAG's independent chunk retrieval misses these cross-position relationships.

Long-range genomic dependencies mean position 1 affects interpretation at position 700,000 — a relationship that RAG's chunked retrieval cannot capture.

5. LangChain was created by Harrison Chase in what month and year?

Correct. Harrison Chase created LangChain in October 2022, and it quickly became one of the fastest-growing open-source projects on GitHub.

LangChain was created in October 2022 by Harrison Chase.

6. The attention mechanism in standard transformers scales as O(n²) in both time and memory because:

Correct. QKᵀ produces an n×n matrix — the defining quadratic operation.

The quadratic scaling comes from the n×n matrix in QKᵀ — all n queries against all n keys.
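To make the n×n shape concrete, here is a minimal pure-Python sketch of the raw score computation. The function name `attention_scores` and the toy inputs are illustrative assumptions, and scaling, softmax, and multi-head structure are left out.

```python
def attention_scores(Q, K):
    """Return the raw score matrix Q K^T as a list of rows.

    Q and K are lists of n vectors, each of length d.
    Row i holds the dot product of query i with every key.
    """
    return [[sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in K] for q in Q]


# Toy inputs: n = 4 tokens, d = 3 dimensions per token (arbitrary values).
n, d = 4, 3
Q = [[float(i + j) for j in range(d)] for i in range(n)]
K = [[float(i * j) for j in range(d)] for i in range(n)]

scores = attention_scores(Q, K)
print(len(scores), len(scores[0]))  # 4 4
```

Each of the n queries produces one score per key, so the result always has n rows of n entries regardless of d.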

7. GPT-4's input price at launch (March 2023) was approximately $30 per million tokens. What did Claude 3 Haiku cost per million input tokens at launch in March 2024?

Correct. Claude 3 Haiku launched at $0.25/million input tokens in March 2024 — roughly a 99% reduction from GPT-4's original pricing.

Claude 3 Haiku cost $0.25 per million input tokens — representing roughly a 99% price reduction from GPT-4's March 2023 launch pricing.
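The "roughly 99%" figure can be checked with a few lines of arithmetic, using only the two launch prices quoted in this question. The variable names are illustrative.

```python
# Launch prices in USD per million input tokens, as quoted in the question.
gpt4_launch = 30.00
haiku_launch = 0.25

# Percentage reduction relative to the GPT-4 launch price.
reduction = (1 - haiku_launch / gpt4_launch) * 100
print(f"{reduction:.1f}%")  # 99.2%
```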

8. The original RAG paper by Lewis et al. (2020) came from which organization?

Correct. Lewis et al. (2020) published the original RAG paper from Facebook AI Research, establishing the retrieval-augmented generation paradigm.

Not correct. The original RAG paper — Lewis et al. 2020 — came from Facebook AI Research (FAIR).

9. What is the three-stage architecture of a standard RAG system (in correct order)?

Correct. RAG follows: Indexing (embed and store documents), Retrieval (find similar chunks for the query), Generation (use retrieved chunks plus query as context for the model).

The three stages are Indexing (embed and store), Retrieval (find relevant chunks), then Generation (LLM produces answer from retrieved context).
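The three stages can be sketched in a few functions. This is a toy illustration under loud assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, the index is a plain list rather than a vector store, and the generation stage only assembles the prompt that would be sent to a model. All function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Stage 1, Indexing: embed each chunk and store it with its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=1):
    """Stage 2, Retrieval: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, retrieved):
    """Stage 3, Generation: assemble the context an LLM would answer from."""
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"
```

A real system would replace `embed` with a learned embedding model and pass the output of `build_prompt` to a language model, but the order of the stages stays the same.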

10. The "lost in the middle" phenomenon shows that model recall accuracy is highest for information placed where in the context?

Correct. The Stanford/Berkeley research found models reliably recall information at the beginning and end of context, but accuracy degrades for information placed in the middle.

The "lost in the middle" research showed models recall best from the beginning and end of context — not the middle.

11. Longformer was published by researchers at:

Correct. Iz Beltagy and colleagues at the Allen Institute for AI published Longformer in April 2020.

Longformer came from Iz Beltagy et al. at the Allen Institute for AI, published in April 2020.

12. The SCROLLS benchmark (Shaham et al., 2022) tests which type of capability that NIAH does not?

Correct. SCROLLS tests summarization, QA, and natural language inference across real long documents — synthesis tasks that require more than single-fact recall.

Not correct. SCROLLS tests document-level synthesis: summarization, question answering, and natural language inference — not simple single-fact retrieval.

13. Gemini 1.5 Pro's improved long-context recall was achieved in part through which architectural design?

Correct. Gemini 1.5 Pro's technical report describes a mixture-of-experts design combined with modified positional encoding as key architectural contributors to its improved long-context performance.

Not correct. Gemini 1.5 Pro's technical report attributed improved long-context recall to a mixture-of-experts architecture combined with modified positional encoding.

14. Anthropic's Claude 1 required specialized infrastructure for its 100K context window (2023) primarily because:

Correct. The O(n²) memory growth at 100K tokens required novel infrastructure, not just more of the same hardware.

The O(n²) attention matrix at 100K tokens overwhelms standard GPU memory — specialized infrastructure was needed to manage this.
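A back-of-the-envelope estimate shows the scale involved. This sketch assumes the score matrix is fully materialized, each score is stored in 2 bytes (16-bit floats), and only one attention head in one layer is counted. These are illustrative assumptions, not a description of Anthropic's actual setup.

```python
def attention_matrix_bytes(n_tokens, bytes_per_score=2):
    """Bytes needed to hold one n x n attention score matrix.

    bytes_per_score=2 assumes 16-bit floats (an assumption for illustration).
    """
    return n_tokens * n_tokens * bytes_per_score


for n in (4_096, 32_768, 100_000):
    gb = attention_matrix_bytes(n) / 1e9
    print(f"{n:>7} tokens: {gb:.2f} GB per head, per layer")
```

Under these assumptions a single 100,000-token score matrix takes 20 GB, before multiplying by the number of heads and layers, which is why naive materialization does not fit on standard accelerators.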

15. Why does transformer attention scale as O(n²) with context length?

Correct. Attention computes a relationship score between every pair of tokens — n tokens × n tokens = n² relationships in the attention matrix.

Attention scales quadratically because every token must attend to every other token, producing an n × n attention matrix that quadruples in size when context length doubles.

16. The "lost in the middle" research (2023) found that transformer models most reliably attend to information at what position in a long context?

Correct. Models attend most reliably to information at the beginning and end of a long context.

Review Lesson 1: models attend most strongly to the beginning and end of a long context, with a "lost in the middle" dip for centrally positioned information.

17. According to Databricks' 2024 enterprise LLM engineering evaluation, what is the approximate effective context ceiling for reliable multi-step reasoning in most models?

Correct. Databricks' evaluations placed the practical effective ceiling for reliable multi-step reasoning at approximately 16K–32K tokens, regardless of advertised maximum context length.

Not correct. Databricks found the practical effective ceiling for reliable multi-step reasoning at approximately 16K–32K tokens for most models evaluated.

18. Google's Gemini 1.5 Pro was announced in February 2024 with what maximum context window in research preview?

Correct. Gemini 1.5 Pro launched in research preview with 1 million tokens in February 2024, later expanding to 2 million.

Gemini 1.5 Pro launched with 1 million tokens in research preview in February 2024.

19. FlashAttention's primary performance advantage over standard attention is that it reduces:

Correct. FlashAttention performs roughly the same FLOPs but avoids materializing the n × n attention matrix in HBM, which cuts the additional memory footprint from O(n²) to O(n) and sharply reduces HBM reads and writes.

FlashAttention's key innovation is IO-awareness: it performs identical FLOPs but drastically reduces expensive HBM read/write operations by keeping intermediate values in fast SRAM.
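The core idea can be illustrated with the online-softmax trick: process keys and values in blocks, keeping only a running maximum, a running normalizer, and a running output per query. The pure-Python sketch below is an assumption-laden illustration, not the actual FlashAttention kernel: it does not model SRAM or HBM, and the function names and `block` parameter are hypothetical. It only shows that blockwise processing gives the same result as the standard computation.

```python
import math


def naive_attention(Q, K, V):
    """Reference version: computes each full row of n scores before normalizing."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    """Blockwise version: never holds more than `block` scores per query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        running_max = float("-inf")
        running_sum = 0.0
        acc = [0.0] * dv
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results so they share the new maximum.
            correction = math.exp(running_max - new_max)
            running_sum *= correction
            acc = [a * correction for a in acc]
            for s, v in zip(scores, v_blk):
                w = math.exp(s - new_max)
                running_sum += w
                for j in range(dv):
                    acc[j] += w * v[j]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Both functions do comparable arithmetic. The difference is that the tiled version's working set per query is bounded by the block size rather than by n, which is the property a real kernel exploits to keep intermediates in fast on-chip memory.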

20. The Stanford CRFM pilot studies on long-context meta-analysis found that AI assistance reduced manual data extraction time by approximately what percentage?

Correct. Stanford CRFM's 2024 pilot showed approximately 60–70% reduction in time for extracting effect sizes, sample sizes, and methodological details from collections of 50+ papers.

Review Lesson 3: the Stanford CRFM pilots showed approximately 60–70% reduction in manual extraction time for meta-analysis tasks.

Final Exam