In 2023, researchers at Stanford and UC Berkeley set out to answer a deceptively simple question: does feeding a language model more context actually help it find relevant information? They built a task called multi-document question answering — hide a gold-standard answer inside a set of documents, then ask the model to find it. What they discovered reshaped how engineers think about context windows entirely.
The paper, "Lost in the Middle: How Language Models Use Long Contexts" by Liu et al., was released as a preprint in 2023 and published in a journal in 2024. It showed that performance dropped sharply when the relevant information was placed in the middle of a long context. Models retrieved facts from the very beginning and very end of their input with the highest accuracy. Everything in between degraded.
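The researchers tested GPT-3.5-Turbo (including its 16k version), Claude 1.3 (including its 100k version), and the open models MPT-30B-Instruct and LongChat-13B (16k). They varied where the gold document appeared in contexts of 10, 20, or 30 total documents, with the remaining documents serving as distractors. The findings were broadly consistent across the models tested.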
When the relevant document was first or last in the context, accuracy was at its highest. When it was placed in the middle positions, accuracy dropped sharply, and in some configurations fell below the model's closed-book accuracy, meaning the model did worse with the documents than with no documents at all. This U-shaped curve held even when the total number of documents varied and even when models were explicitly told that the answer existed somewhere in the provided documents.
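The positional sweep described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: `build_prompt`, `accuracy_by_position`, the prompt layout, and the containment-based scoring are all assumptions, and `toy_model` is a deliberately crude stand-in that only "reads" the first and last document so the U-shape shows up without a real model call.

```python
from typing import Callable, Dict, List


def build_prompt(question: str, gold: str, distractors: List[str], gold_index: int) -> str:
    """Insert the gold document at gold_index among the distractors and render a prompt."""
    docs = list(distractors)
    docs.insert(gold_index, gold)
    rendered = "\n".join(f"Document [{i + 1}] {d}" for i, d in enumerate(docs))
    return f"{rendered}\n\nQuestion: {question}\nAnswer:"


def accuracy_by_position(
    examples: List[dict],
    ask_model: Callable[[str], str],
    positions: List[int],
) -> Dict[int, float]:
    """For each gold position, return the fraction of examples answered correctly."""
    results = {}
    for pos in positions:
        correct = 0
        for ex in examples:
            prompt = build_prompt(ex["question"], ex["gold"], ex["distractors"], pos)
            prediction = ask_model(prompt)
            # Loose scoring: the reference answer must appear in the prediction.
            if ex["answer"].lower() in prediction.lower():
                correct += 1
        results[pos] = correct / len(examples)
    return results


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a model: echoes only the first and last document."""
    doc_lines = [line for line in prompt.splitlines() if line.startswith("Document [")]
    return doc_lines[0] + " " + doc_lines[-1]


examples = [
    {
        "question": "Who wrote the report?",
        "gold": "The report was written by Ada Lovelace.",
        "answer": "Ada Lovelace",
        "distractors": [f"Filler passage number {i}." for i in range(9)],
    }
]

# Gold document first (0), in the middle (4), and last (9) of ten documents.
print(accuracy_by_position(examples, toy_model, [0, 4, 9]))
```

To run this against a real model, replace `toy_model` with a function that sends the prompt to your provider and returns the completion text, and use enough examples per position for the accuracy figures to be meaningful.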
The effect grew with context length: adding more documents made the middle problem worse, not better. Extended-context variants did not help either. When the same input fit in both the standard and the extended window, the two variants performed nearly identically, so a larger window simply left more middle to lose things in.
Across all tested models, performance on the multi-document QA task followed a U-shaped curve relative to where the relevant document appeared in the context. Primary source: Liu, N. F., Lin, K., Hewitt, J., et al. "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the Association for Computational Linguistics, vol. 12, 2024.
The leading explanations point to how transformer attention interacts with positional encoding and training data. During pretraining and fine-tuning, most examples are far shorter than the maximum window, so the model sees little supervision that requires using the middle of a very long input. When inference extends far beyond the lengths seen in training, the model's learned attention patterns over-index on recency (the end of context) and primacy (the beginning, where instructions live).
Additionally, instruction-following fine-tuning often places system prompts at the start and desired outputs at the end, reinforcing the model's tendency to anchor on those positions. The middle of a long context, by contrast, is structurally the least reinforced region during training.
This is not a bug that can be trivially patched. It reflects the statistical structure of how models are trained, and it persisted even in the extended-context variants that Liu et al. tested, such as the 16k version of GPT-3.5-Turbo and the 100k version of Claude 1.3.
Engineers who paste 50 pages of documentation into a context window and assume the model "read" all of it are making a measurably wrong assumption. The model read the beginning and end. The middle is probabilistically degraded. This has direct implications for RAG design, prompt engineering, and any workflow that feeds large documents to LLMs.
You are designing a retrieval system that must surface facts from long legal documents. Your colleague claims that giving the model a 100k-token context window is sufficient — "it can read the whole contract." Use this chat to explore the lost-in-the-middle problem and develop a counter-argument backed by the Liu et al. findings.
In late 2023, developer Greg Kamradt published what became known as the "needle in a haystack" test — a deceptively simple evaluation that exposed long-context limitations in a visually striking way. Kamradt inserted a single unusual fact (the "needle") into a large document of Paul Graham essays (the "haystack") at controlled positions and depths, then asked models to retrieve it. The resulting heatmaps showed exactly where models failed.
Kamradt ran the test against GPT-4 Turbo (128k) and Claude 2.1 (200k) and published the heatmaps. GPT-4 Turbo's recall degraded at longer context lengths when the needle sat at early-to-middle depths, and Claude 2.1 missed needles at many depths as the context grew. These weren't hidden results: Anthropic responded publicly with its own analysis of Claude 2.1's long-context behavior.
The standard NIAH (needle in a haystack) protocol varies two dimensions: context depth (how far into the total token budget the needle is placed, expressed as a percentage) and context length (total tokens in the input). A model is then scored on whether it can reproduce the needle verbatim or answer a question whose answer requires the needle.
Results are plotted as a 2D heatmap, with context length on one axis, depth percentage on the other, and color indicating accuracy. A perfect model would show uniform high accuracy across the entire grid. Real models show characteristic failure patterns: the worst cells often cluster where the needle sits at early-to-middle depths in medium-to-long contexts, and where that cluster lands on the plot depends on how the axes are oriented.
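The depth-by-length grid behind such a heatmap can be sketched as follows. This is an illustrative outline rather than Kamradt's actual harness: it counts whitespace-separated words instead of real tokens, scores by substring containment, and uses a hypothetical `toy_model` that only sees the first and last 20 words of its prompt.

```python
from typing import Callable, Dict, List, Tuple


def insert_needle(haystack_words: List[str], needle: str, context_len: int, depth_pct: float) -> str:
    """Trim the haystack so the total is context_len words, then insert the needle at depth_pct (0-100)."""
    needle_words = needle.split()
    budget = max(context_len - len(needle_words), 0)
    body = haystack_words[:budget]
    cut = round(len(body) * depth_pct / 100)
    return " ".join(body[:cut] + needle_words + body[cut:])


def run_grid(
    haystack_words: List[str],
    needle: str,
    expected: str,
    question: str,
    ask_model: Callable[[str], str],
    lengths: List[int],
    depths: List[float],
) -> Dict[Tuple[int, float], bool]:
    """Return one pass/fail cell per (context length, depth) pair."""
    grid = {}
    for n in lengths:
        for d in depths:
            context = insert_needle(haystack_words, needle, n, d)
            answer = ask_model(f"{context}\n\n{question}")
            grid[(n, d)] = expected.lower() in answer.lower()
    return grid


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in: echoes only the first and last 20 words of the prompt."""
    words = prompt.split()
    return " ".join(words[:20] + words[-20:])


haystack = ["filler"] * 1000
needle = "The secret ingredient is figs."
question = "What is the secret ingredient?"

grid = run_grid(haystack, needle, "figs", question, toy_model, [100, 500], [0, 50, 100])
for (length, depth), hit in sorted(grid.items()):
    print(f"length={length:4d} depth={depth:3d}% retrieved={hit}")
```

Each cell of `grid` corresponds to one colored square in the heatmap. A real run would swap in a tokenizer for the word count, a model call for `toy_model`, and several trials per cell.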
By 2024, NIAH had become a standard evaluation component. Models from Mistral, Cohere, Google, and others all published NIAH heatmaps as part of their technical releases. The benchmark proved so useful precisely because it was simple enough to be reproducible and visually interpretable.
Anthropic's December 2023 post on long-context prompting for Claude 2.1 reported that retrieval of an out-of-place needle was poor without prompt adjustments, and that a small prompt change recovered most of the lost accuracy. Independent evaluations, including those by researchers at Databricks, reported positional and length effects across different model families.
NIAH tests recall of a single planted fact, which is a useful but narrow proxy. Two additional benchmark suites address more realistic long-context tasks:
SCROLLS (Shaham et al., 2022), short for Standardized CompaRison Over Long Language Sequences, is a collection of long-document NLP tasks including narrative QA, contract NLI, and long-form summarization. It tests whether models can synthesize and reason across documents, not just retrieve planted facts. That makes SCROLLS a closer proxy for real-world use cases than NIAH.
HELMET (Yen et al., 2024, published by Princeton researchers), short for How to Evaluate Long-context Language Models Effectively and Thoroughly, introduced a comprehensive suite that tested models at multiple context lengths on tasks including generation with citations, summarization, passage re-ranking, and long-document QA. HELMET specifically found that NIAH scores could be misleading: some models achieved near-perfect NIAH scores while failing on more realistic long-context tasks. The benchmark exposed a gap between synthetic recall and genuine comprehension.
Taken together, NIAH, SCROLLS, and HELMET reveal a consistent picture: effective context utilization does not scale linearly with context window size. Models that advertise 128k or 1M token windows routinely underperform on tasks requiring integration of information from middle positions within those windows.
The practical ceiling — the token range within which models reliably use their full context — is consistently lower than the advertised maximum. Independent evaluations by Databricks (2024) placed this effective ceiling for most models at roughly 16k–32k tokens for reliable multi-step reasoning, even for models marketed with 100k+ windows.
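One way to turn "effective ceiling" into a number for your own workload is to measure task accuracy at several context lengths and find where it first falls meaningfully below the short-context baseline. The sketch below assumes that definition. The function name, the 5-point tolerance, and the accuracy figures are all illustrative placeholders, not measurements from any benchmark.

```python
from typing import Dict, Optional


def effective_ceiling(accuracy_by_length: Dict[int, float], tolerance: float = 0.05) -> Optional[int]:
    """Return the largest measured length before accuracy first drops more than
    `tolerance` below the accuracy at the shortest measured length."""
    lengths = sorted(accuracy_by_length)
    if not lengths:
        return None
    baseline = accuracy_by_length[lengths[0]]
    ceiling = None
    for n in lengths:
        if accuracy_by_length[n] < baseline - tolerance:
            break  # first meaningful drop: stop extending the ceiling
        ceiling = n
    return ceiling


# Made-up numbers for illustration only; substitute your own measurements.
measured = {4_000: 0.82, 16_000: 0.80, 32_000: 0.78, 64_000: 0.66, 128_000: 0.55}
print(effective_ceiling(measured))
```

The tolerance is a judgment call: a contract-review pipeline may tolerate far less degradation than a summarization assistant.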
| Benchmark | What It Tests | Key Finding |
|---|---|---|
| NIAH | Single-fact retrieval at variable depth/length | U-shaped / positional failure patterns; widely replicated |
| SCROLLS | Summarization, QA, NLI on real long docs | Performance degrades on longer-form integration tasks |
| HELMET | Multi-task suite at multiple context lengths | NIAH can overestimate real-world long-context capability |
When a vendor claims their model "supports" a given context length, ask which benchmark was used to verify performance at that length. A model can technically process 1M tokens while providing near-random retrieval accuracy for content placed beyond 32k. Context window size and context utilization quality are different measurements.
A vendor pitches you their new LLM: "200k context window, evaluated on NIAH with 99% accuracy." You need to decide whether this is sufficient for a document-review pipeline that must surface clauses from 80-page contracts.
Once the Liu et al. findings circulated, engineering teams at major AI labs and enterprise deployments started redesigning their retrieval pipelines. The response wasn't to abandon long-context models — it was to stop relying on brute-force context stuffing and to engineer for the model's actual recall geometry. Three patterns emerged as the dominant solutions.
Rather than feeding entire documents into the context window, RAG systems retrieve only the most semantically relevant chunks before inference. A vector database stores document embeddings; at query time, the top-k most similar chunks are retrieved and placed in context — typically at the beginning or end, where recall is highest.
RAG was introduced by Lewis et al. in 2020 (Facebook AI Research) and adoption accelerated as long-context limitations became clear. By 2023, RAG had become the de facto pattern for enterprise document QA, in part because keeping retrieved context short and relevance-ranked sidesteps the lost-in-the-middle failure mode.
The key design decision in RAG is where retrieved chunks appear in the prompt. Teams at Databricks and LlamaIndex both published engineering guides in 2023–2024 recommending that the most critical retrieved context be placed either at the very beginning (after the system prompt) or at the very end (immediately before the question), never buried in the middle of multiple retrieved passages.
LlamaIndex documentation (2024) and Databricks' enterprise LLM engineering guide both explicitly recommend placing the most relevant retrieved passage either first or last among multiple retrieved chunks, citing the lost-in-the-middle finding as the empirical basis for this ordering.
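The retrieve-then-order idea can be sketched without any particular vector database. The example below is a minimal illustration under stated assumptions: embeddings are tiny hand-written vectors, `retrieve_top_k` is a brute-force cosine search standing in for a real vector store, and `order_for_edges` is one hypothetical way to put the strongest chunks at the two ends of the context with the weakest in the middle.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec: Sequence[float], store: List[Tuple[str, Sequence[float]]], k: int) -> List[str]:
    """Return the k chunk texts most similar to the query, best first."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def order_for_edges(ranked_chunks: List[str]) -> List[str]:
    """Given chunks sorted best-first, alternate them between the front and the back
    so the strongest land at the edges and the weakest end up in the middle."""
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


# Toy two-dimensional "embeddings"; a real system would use an embedding model.
store = [
    ("clause A", [1.0, 0.0]),
    ("clause B", [0.9, 0.1]),
    ("clause C", [0.0, 1.0]),
    ("clause D", [0.7, 0.7]),
    ("clause E", [0.5, 0.9]),
]
ranked = retrieve_top_k([1.0, 0.0], store, k=5)
print("ranked:  ", ranked)
print("in prompt:", order_for_edges(ranked))
```

With five chunks ranked A, B, D, E, C, the layout puts A first, B last, and the least relevant chunk C in the center, which is the position the model is most likely to neglect.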
When RAG is not an option and documents must be processed in full, prompt reordering strategies attempt to mitigate positional bias by restructuring what goes where. Common approaches include:
Critical-first ordering: Place the most important document sections at the beginning of the context, before filler material. If you know which clauses are legally significant, surface them first.
Map-reduce chunking: Divide long documents into chunks small enough to fit entirely within the reliable recall zone (typically under 8k tokens per chunk). Run the model over each chunk independently (the "map" phase), then aggregate partial answers (the "reduce" phase). LangChain's MapReduceDocumentsChain implements this pattern and documents it as a direct response to context length limitations.
Refine chains: Process chunks sequentially, each time asking the model to refine a running answer based on the new chunk. This keeps every piece of content in the primacy or recency position at least once during processing.
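The map-reduce and refine patterns above can be sketched independently of any framework. This is a simplified illustration, not LangChain's implementation: chunks are split on a word count rather than tokens, the prompt wording is invented, and `toy_model` is a hypothetical stand-in that "finds" a clause whenever the text it sees mentions indemnification.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def map_reduce(text: str, question: str, ask_model: Callable[[str], str], max_words: int = 200) -> str:
    """Map: ask the question of every chunk independently. Reduce: combine the useful partial answers."""
    partials = [
        ask_model(f"Context:\n{c}\n\nQuestion: {question}\nAnswer using only this context, or reply NONE.")
        for c in chunk_words(text, max_words)
    ]
    useful = [p for p in partials if p.strip().upper() != "NONE"]
    if not useful:
        return "NONE"
    joined = "\n".join(f"- {p}" for p in useful)
    return ask_model(f"Partial answers:\n{joined}\n\nQuestion: {question}\nCombine these into one answer.")


def refine(text: str, question: str, ask_model: Callable[[str], str], max_words: int = 200) -> str:
    """Walk the chunks in order, asking the model to update a running answer each time."""
    answer = "NONE"
    for c in chunk_words(text, max_words):
        answer = ask_model(
            f"Current answer: {answer}\n\nNew context:\n{c}\n\nQuestion: {question}\n"
            "Return an improved answer, or repeat the current answer if the context adds nothing."
        )
    return answer


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call, with hard-coded behavior for the demo."""
    if prompt.startswith("Partial answers:"):
        return prompt.splitlines()[1][2:]  # echo the first partial answer
    if "indemnif" in prompt.lower():
        return "Indemnification clause found"
    if prompt.startswith("Current answer:"):
        return prompt.splitlines()[0][len("Current answer: "):]
    return "NONE"


document = "filler " * 300 + "The supplier shall indemnify the buyer. " + "filler " * 300
question = "Which risk clause appears?"
print(map_reduce(document, question, toy_model))
print(refine(document, question, toy_model))
```

Map-reduce calls can run in parallel but lose cross-chunk context. Refine preserves a running answer but is strictly sequential, so latency grows with document length.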
Liu et al. themselves tested a prompting mitigation called query-aware contextualization: placing the question both before and after the documents, so the model already knows what it is looking for while it processes them. This gave near-perfect results on their synthetic key-value retrieval task, but it changed the multi-document QA trends only minimally and did not eliminate the positional bias.
A different prompting approach was documented by Anthropic in its December 2023 post on long-context prompting for Claude 2.1. Instead of changing the instructions, the start of the model's response was prefilled with the sentence "Here is the most relevant sentence in the context:", which pushes the model to locate supporting text before answering. Anthropic reported that this raised Claude 2.1's score on its needle-style evaluation from 27% to 98%.
Neither prompting fix replaces architectural solutions. They are complementary mitigations, not replacements for RAG or chunking.
Advanced RAG pipelines add a reranker between the retrieval step and the LLM. A cross-encoder reranker (e.g., Cohere's Rerank API or BGE-reranker from BAAI) scores each retrieved chunk for relevance to the specific query, then reorders them so the highest-relevance chunks appear at the beginning of the LLM's context. This combines semantic search retrieval with positional optimization — ensuring the most relevant material appears where recall is highest.
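The rerank step can be sketched with a pluggable scoring function. In the example below, `overlap_score` is a crude word-overlap heuristic used only so the code runs without a model. In a real pipeline the `score` argument would wrap a cross-encoder or a hosted rerank endpoint, and the function names here are illustrative rather than any vendor's API.

```python
from typing import Callable, List


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: the fraction of query words that appear in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


def rerank(query: str, chunks: List[str], score: Callable[[str, str], float], keep: int = 3) -> List[str]:
    """Score every retrieved chunk against the query and return the top `keep`, best first."""
    ordered = sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)
    return ordered[:keep]


retrieved = [
    "Payment is due within thirty days.",
    "Either party may give notice of termination.",
    "The notice period for termination is ninety days.",
]
top = rerank("termination notice period", retrieved, overlap_score, keep=2)
for position, chunk in enumerate(top, start=1):
    print(position, chunk)
```

Because the reranker sees the query and the chunk together, it can separate a clause that merely mentions termination from one that actually states the notice period, which a pure embedding search may rank the other way around.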
By 2024, reranking had become standard in production RAG deployments at companies including Notion, Salesforce, and enterprise users of Cohere's platform, specifically because it addressed the interaction between retrieval quality and positional recall.
Design your retrieval and prompt architecture assuming the model will reliably recall content only from the first ~20% and last ~20% of its filled context window. Place your most critical information in those zones. Use chunking or RAG to avoid needing the middle at all.
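A quick way to apply this rule of thumb is to check which zone each chunk of an assembled prompt falls into. The sketch below assumes the 20% edge heuristic stated above and classifies each chunk by the position of its midpoint. The function name and the zone labels are illustrative.

```python
from typing import List


def zone_of_each_chunk(chunk_token_counts: List[int], edge_fraction: float = 0.2) -> List[str]:
    """Label each chunk 'head', 'middle', or 'tail' by where its midpoint falls in the filled context."""
    total = sum(chunk_token_counts)
    if total == 0:
        return []
    zones = []
    start = 0
    for n in chunk_token_counts:
        midpoint_fraction = (start + n / 2) / total
        if midpoint_fraction <= edge_fraction:
            zones.append("head")
        elif midpoint_fraction >= 1 - edge_fraction:
            zones.append("tail")
        else:
            zones.append("middle")
        start += n
    return zones


# Ten equally sized chunks of 1,000 tokens each.
print(zone_of_each_chunk([1000] * 10))
```

Any chunk labeled "middle" that carries critical information is a candidate for reordering, for its own map-reduce pass, or for retrieval on demand instead of being stuffed into the prompt.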
You are building a contract review tool that must identify risk clauses across 200-page legal agreements. You have access to a vector database, a cross-encoder reranker, and Claude with a 100k token window. Design a pipeline that maximizes recall of clauses regardless of their position in the original document.
When Google DeepMind published results for Gemini 1.5 Pro in February 2024, they included NIAH results showing near-perfect retrieval across 1 million tokens — a striking contrast to the degraded heatmaps that had characterized models just a year earlier. The claim was extraordinary: had the lost-in-the-middle problem been solved?
The answer, as independent researchers quickly established, was: partially, and for specific task types. NIAH performance had genuinely improved for well-defined single-fact retrieval. But on more complex multi-step reasoning tasks — the ones HELMET was designed to measure — significant positional effects persisted even in frontier models. The boundary had moved, but the problem had not disappeared.
Google's Gemini 1.5 Pro technical report (Reid et al., 2024) demonstrated near-perfect NIAH recall across 1M tokens. The report describes the model as a sparse mixture-of-experts Transformer but does not disclose its long-context mechanisms in detail. The result held across multiple independent validations including those conducted by Artificial Analysis (a model evaluation firm) in March 2024.
Anthropic's Claude 3 family (released March 2024) showed substantially improved NIAH performance compared to Claude 2.1, with Anthropic's technical documentation showing high retrieval accuracy at 200k tokens across most depth positions. Claude 3 Opus specifically showed reduced positional degradation in multi-document QA configurations.
However, both sets of improvements were most pronounced on synthetic single-fact retrieval. When HELMET-style multi-task evaluations were applied at long context lengths, both model families still showed performance degradation relative to their in-context performance at shorter lengths. The gap narrowed significantly; it did not close.
Independent evaluations by Artificial Analysis confirmed Gemini 1.5 Pro's strong NIAH performance across 1M tokens. However, their analysis noted that "NIAH performance does not directly translate to equivalent gains on complex reasoning tasks at the same context lengths." Gemini 1.5 Pro's multi-step reasoning tasks showed degradation beginning around 256k tokens in some configurations.
Rotary Positional Embeddings (RoPE) and extensions: RoPE (Su et al., 2021) encodes position by rotating query and key vectors, so that attention scores depend on the relative distance between tokens rather than on their absolute positions. Extensions including YaRN (Peng et al., 2023) and LongRoPE (Ding et al., 2024) rescale those rotations to stretch RoPE beyond the lengths seen in training. LongRoPE was adopted in the 128k-context variants of Microsoft's Phi-3 models.
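The relative-position property that RoPE relies on can be checked numerically. The sketch below is a minimal pure-Python version of the rotation for a single vector, using the conventional base of 10000. The function names are illustrative, and real implementations operate on batched tensors.

```python
import math
from typing import List


def rope(x: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate each consecutive pair of dimensions of x by an angle proportional to position."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)  # lower dimensions rotate faster
        c, s = math.cos(angle), math.sin(angle)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (4 positions) at two very different absolute positions.
near = dot(rope(q, 3), rope(k, 7))
far = dot(rope(q, 103), rope(k, 107))
print(abs(near - far) < 1e-9)
```

The two scores agree because rotating the query by position m and the key by position n leaves a dot product that depends only on n minus m. Length-extension methods such as YaRN and LongRoPE work by rescaling the rotation frequencies, which in this sketch corresponds to changing how `angle` is computed.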
Mixture of Experts (MoE): MoE architectures (as used in Gemini 1.5) route each token to a small subset of expert sub-networks, which increases model capacity without a proportional increase in compute per token. Whether this routing contributes to improved positional recall has not been established publicly, so any link between MoE and better middle retrieval should be treated as speculation.
Ring attention and distributed context processing: Techniques including ring attention (Liu et al., 2023, from UC Berkeley) allow attention computation to be distributed across multiple devices, enabling longer sequences without the quadratic memory bottleneck. This is a computational enabler, not a direct fix for positional recall, but it makes longer reliable contexts tractable.
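Ring attention builds on blockwise attention with a running softmax: keys and values are processed one block at a time, and the partial results are rescaled as new blocks arrive. The sketch below simulates that accumulation for a single query in one process, with a plain loop standing in for the ring of devices. It omits the usual 1/sqrt(d) scaling and batching, and the function names are illustrative.

```python
import math
import random
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def full_attention(q: List[float], keys: List[List[float]], values: List[List[float]]) -> List[float]:
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[t] for w, v in zip(weights, values)) / total for t in range(dim)]


def blockwise_attention(q: List[float], keys: List[List[float]], values: List[List[float]], block_size: int) -> List[float]:
    """Same result, but only one block of keys and values is visible at a time."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [dot(q, k) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale what has been accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)  # 0.0 on the first block
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


rng = random.Random(0)
query = [rng.uniform(-1, 1) for _ in range(4)]
keys = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(10)]
values = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(10)]

reference = full_attention(query, keys, values)
blocked = blockwise_attention(query, keys, values, block_size=3)
print(all(abs(a - b) < 1e-9 for a, b in zip(reference, blocked)))
```

Because the blockwise result matches full attention exactly up to floating-point error, distributing the blocks changes memory and communication costs but not what the model computes. That is why this line of work enables longer contexts without, by itself, changing positional recall.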
Despite architectural progress, several persistent limitations remain documented as of 2024–2025:
Multi-hop reasoning degradation: Tasks requiring the model to connect information from multiple non-adjacent positions in long context — e.g., "given the definition in section 2 and the exception in section 47, does the clause in section 83 apply?" — show significant degradation at long context lengths even in frontier models. This was documented in HELMET and in independent work by researchers at CMU and MIT.
Distraction and hallucination: Research by Shi et al. (2023) showed that irrelevant information in the prompt can sharply reduce accuracy on reasoning problems. In long contexts padded with distractors, this makes wrong or confabulated answers more likely than accurate retrieval from the middle of the input. The concern grows as contexts get longer, because more of the input is irrelevant to any single question.
Cost and latency: Even if positional recall improves, processing 1M tokens is expensive and slow. At Claude 3's 2024 list prices of $0.25 to $15 per million input tokens depending on model tier, a single 1M-token prompt would cost up to about $15 before any output tokens, and that cost recurs on every query. Most production systems cannot afford to process full document stores via context stuffing even if recall were perfect.
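The cost argument is easiest to see as arithmetic. The sketch below compares input-token spend for context stuffing against a RAG prompt at an assumed price of $15 per million input tokens. The price, the 4,000-token RAG prompt size, and the 10,000 queries per day are illustrative assumptions, and output tokens, retrieval costs, and caching discounts are ignored.

```python
def input_cost_usd(tokens_per_call: int, price_per_million_usd: float, calls: int) -> float:
    """Input-token cost for a number of calls at a flat per-million-token price."""
    return tokens_per_call / 1_000_000 * price_per_million_usd * calls


PRICE_PER_MILLION = 15.0   # assumed price, USD per million input tokens
QUERIES_PER_DAY = 10_000   # assumed workload

stuffing = input_cost_usd(1_000_000, PRICE_PER_MILLION, QUERIES_PER_DAY)
rag = input_cost_usd(4_000, PRICE_PER_MILLION, QUERIES_PER_DAY)

print(f"context stuffing: ${stuffing:,.0f} per day")
print(f"RAG (4k prompt):  ${rag:,.0f} per day")
print(f"ratio: {stuffing / rag:,.0f}x")
```

Under these assumptions the ratio is simply the ratio of prompt sizes, 250 to 1, so the conclusion holds at any per-token price: stuffing costs scale with the size of the document store, while RAG costs scale with the size of the retrieved context.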
| Model / Year | NIAH Status | Complex Reasoning at Max Context |
|---|---|---|
| GPT-3.5-Turbo (2023) | Clear U-shaped failure | Significant degradation |
| Claude 2.1 (2023) | Partial middle degradation | Significant degradation |
| Gemini 1.5 Pro (2024) | Near-perfect at 1M tokens | Degradation from ~256k tokens |
| Claude 3 Opus (2024) | Substantially improved | Reduced but persistent degradation |
The trajectory is clear: models are improving at middle retrieval, and the effective context ceiling is rising. But the gap between advertised context length and reliable complex reasoning at that length will likely persist for the foreseeable future. The architectural fixes that improved NIAH performance have not yet fully translated to multi-hop reasoning gains at the same scale.
The practical recommendation for 2024–2025 remains the same as in 2023: use RAG and chunking for production document QA systems, benchmark on task-relevant evaluations (not just NIAH), and treat vendor context window claims with empirical skepticism until you've run your own tests on your own documents.
Single-fact NIAH retrieval has improved dramatically in frontier models. Multi-step reasoning across very long contexts remains genuinely difficult. RAG and chunking are not legacy patterns — they remain the right engineering choices for production systems that require reliable recall at scale.
A new frontier model is announced with "perfect 1M-token recall demonstrated on NIAH." Your CTO wants to retire the company's RAG pipeline and switch to context-stuffing the entire document store. Use this lab to develop a technically grounded case for or against this decision.