In 2024, researchers at Epoch AI published a paper with an unusually blunt conclusion: high-quality human text on the internet will be exhausted as a training resource somewhere between 2026 and 2032. Not depleted in the everyday sense — the words would still exist. But every usable token would already have been seen by a model. Training a new, larger model on the same corpus would add nothing new.
The paper was titled "Will We Run Out of Data?" Its answer was effectively yes — at least for the data that had powered the previous decade of progress.
The first large language models trained on millions of tokens. GPT-2, released in 2019, trained on WebText, a dataset of roughly 40 GB of text (on the order of 10 billion tokens) scraped from outbound Reddit links. At the time, that felt enormous.
By 2023, training runs had grown by more than an order of magnitude. LLaMA 1 trained on up to 1.4 trillion tokens. GPT-4's training data is not publicly disclosed, but estimates range from 10 to 100 trillion tokens. The Common Crawl, a freely available archive of web snapshots, contains roughly 250 trillion raw tokens, though heavy deduplication and quality filtering reduce the usable content to perhaps 30–50 trillion tokens.
The models got better, in part, because the datasets got bigger. But datasets cannot grow forever. The internet is large but finite, and the pace of new human writing is far slower than the pace of compute scaling.
Data exhaustion does not mean the internet disappears. It means the marginal value of additional web scrapes approaches zero. Once a model has seen every Wikipedia article, every digitised book in Project Gutenberg, every Stack Overflow thread, every ArXiv preprint — seeing them a second or third time during training yields diminishing returns. More tokens from the same distribution do not meaningfully improve capability.
The practical ceiling is not the raw byte count of all text ever written. It is the subset that meets quality thresholds — no spam, no machine-generated boilerplate, no duplicate content — and that is available for legal access. That subset is considerably smaller than the raw total, and it is what AI labs have spent years racing to acquire.
The Epoch AI analysis modelled data production rates against training data consumption rates. Under the assumption that compute continues scaling at historical rates, demand for high-quality tokens is projected to outstrip supply within the 2026–2032 window given above. The paper identified this as one of the two most significant near-term bottlenecks in AI progress, the other being compute cost.
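The shape of such a projection can be sketched with two exponential curves. The starting values below loosely reuse figures quoted earlier (about 40 trillion usable tokens, a 1.4 trillion token training run in 2023). The growth rates are illustrative assumptions, not Epoch AI's fitted parameters, and `first_shortfall_year` is a hypothetical helper name.

```python
from typing import Optional


def first_shortfall_year(
    start_year: int,
    stock_tokens: float,
    stock_growth: float,
    demand_tokens: float,
    demand_growth: float,
    horizon: int = 50,
) -> Optional[int]:
    """Return the first year in which dataset demand meets or exceeds the usable stock.

    stock_growth is a fractional yearly increase (0.05 means +5% per year);
    demand_growth is a yearly multiplier (2.5 means 2.5x per year).
    """
    stock, demand = stock_tokens, demand_tokens
    for year in range(start_year, start_year + horizon + 1):
        if demand >= stock:
            return year
        stock *= 1.0 + stock_growth
        demand *= demand_growth
    return None  # no shortfall within the horizon


# Illustrative assumptions only, not values from the Epoch AI paper.
year = first_shortfall_year(
    start_year=2023,
    stock_tokens=40e12,    # midpoint of the 30-50 trillion usable-token estimate
    stock_growth=0.05,     # assumed growth of usable human text per year
    demand_tokens=1.4e12,  # LLaMA 1 scale
    demand_growth=2.5,     # assumed yearly growth in training set size
)
print(year)
```

With these particular inputs the curves cross in 2027. Changing either growth rate moves the crossing by only a few years, which is why the published estimate is a range rather than a single date.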
Not all text is equally useful for training. Researchers distinguish roughly three tiers based on empirical results:
| Tier | Examples | Training Value | Availability |
|---|---|---|---|
| High quality | Books, academic papers, curated Wikipedia, legal documents | Very high — dense reasoning, precise language | Limited and shrinking rapidly |
| Medium quality | News articles, blog posts, GitHub code | Moderate — variable reasoning depth | Large but largely consumed |
| Low quality | Social media, comment threads, spam | Low — noise often exceeds signal | Vast but requires heavy filtering |
The highest-value tier — books and academic writing — is also the most constrained. Google Books scanned roughly 25 million books through its library program, but the vast majority are under copyright and legally contested. The "Books3" dataset used to train several open-weight models is now the subject of active lawsuits. Access to premium data is narrowing even as demand grows.
The data exhaustion problem is the primary structural motivation for synthetic data research. If human text is finite and largely consumed, then generating new training data artificially becomes not a niche technique but a foundational necessity. Every subsequent lesson in this module builds on this constraint.
You're going to interrogate the data exhaustion problem. Ask the assistant to help you think through the numbers: How much quality-filtered text actually exists? What growth rate does new human writing add? Why can't models just keep reusing the same data? Push on the assumptions behind the Epoch AI projections.
In late 2023, the New York Times filed a copyright infringement lawsuit against OpenAI and Microsoft, alleging that GPT-4 had been trained on millions of its articles without license or compensation. The suit cited specific examples in which the model reproduced lengthy passages verbatim. It was one of the largest and most high-profile suits in a growing wave of litigation brought by authors, visual artists, musicians, and source code authors.
By mid-2024, the legal landscape for training data had changed fundamentally. Labs that had previously operated on an implicit assumption of "scrape now, litigate never" now faced real financial exposure — and, more practically, real uncertainty about which data they could use in future training runs.
The volume of AI copyright litigation since 2022 is unprecedented in tech legal history. Several cases have set, or are in the process of setting, important precedents.
US copyright law's fair use doctrine allows limited use of copyrighted material under four factors: purpose and character of use, nature of the copyrighted work, amount taken, and effect on the market for the original. AI training raises genuinely novel questions about each factor.
AI companies argue that training is "transformative": the model does not store or reproduce the original text; it learns statistical patterns. Rights holders counter that when a model can reproduce verbatim passages and displace the market for the original work (as the Times lawsuit alleges), the transformative-use argument fails.
No US court has issued a final ruling on the core fair use question for LLM training as of mid-2024. The uncertainty itself is a form of practical constraint: labs must now budget for legal risk when choosing datasets.
The EU AI Act, which entered into force in August 2024, requires "general-purpose AI" providers to publish summaries of training data used to develop foundation models. This creates regulatory pressure on data sourcing beyond the US litigation context — and gives rights holders a formal mechanism to identify whether their content was used.
Even when a case doesn't result in an immediate injunction, litigation changes lab behaviour. The Books3 dataset — roughly 196,000 copyrighted books scraped from a shadowy file-sharing site — was quietly removed from the RedPajama dataset in 2023 after legal scrutiny. Hugging Face removed it from public access. Labs that had trained on it faced retroactive exposure.
The practical effect: the high-quality book corpus that sat at the top of the training data quality tier became substantially less accessible. Future models either pay licensing fees, use only public-domain books, or substitute synthetic text — which is precisely why synthetic data research accelerated sharply after 2023.
Some labs have moved toward paid data licensing. OpenAI signed deals with the Associated Press and with Axel Springer (publisher of Business Insider and Politico) to license news content. These agreements are commercially significant but highlight the constraint: high-quality data at scale is no longer free. The economics of training now include data acquisition costs that did not exist at GPT-2's scale.
You're a researcher advising an AI lab about to begin a new training run. The lab wants to use the highest-quality data possible. Use the assistant to reason through which data sources are legally safe, which are risky, and what the licensing alternatives look like. Think about the tradeoffs between quality, legality, and cost.
When DeepMind released the Chinchilla paper in March 2022, it upended a dominant assumption in the field. The prevailing wisdom, exemplified by GPT-3, was that additional compute should go mostly into a larger model rather than into more training data. Chinchilla showed this was wrong. A smaller model trained on proportionally more data consistently outperformed a larger model trained on less. Compute and data needed to scale together.
The implication was immediate: the field had been massively underinvesting in data. But as labs scrambled to acquire more tokens, a second problem emerged. The web's easily scraped text was not uniformly useful. Much of it was low-quality boilerplate, duplicate content, machine-generated spam, and text that, when fed to a model in bulk, actually degraded performance on reasoning tasks.
Hoffmann et al. (2022) trained over 400 language models of different sizes on different amounts of data and compared them at fixed compute budgets. Their finding: for a given compute budget, the then-standard models were too large and undertrained, and the optimum was a considerably smaller model trained on several times more tokens. DeepMind's own Gopher (280B parameters) had been trained on 300B tokens; Chinchilla, using roughly the same compute, had 70B parameters, was trained on 1.4T tokens, and outperformed Gopher. By the same standard, GPT-3 (175B parameters, 300B tokens) was also substantially undertrained.
The Chinchilla scaling laws became the blueprint for subsequent models. LLaMA, Mistral, and the Gemma family all follow its central lesson, pairing comparatively small models with very large token counts, in some cases well beyond the Chinchilla-optimal ratio. But following the Chinchilla prescription requires substantially more data, and that data, per Lesson 1, is increasingly scarce.
The paper established that for compute-optimal training, model size and training tokens should scale in roughly equal proportion. In practice, doubling your compute budget means growing both the model and the dataset by a factor of about √2 (roughly 1.4x each), not simply training a bigger model on the same corpus.
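A back-of-the-envelope version of this rule fits in a few lines. The sketch assumes the common approximation that training compute is about 6 FLOPs per parameter per token (C ≈ 6·N·D) and a rule of thumb of roughly 20 tokens per parameter, which is how the Chinchilla result is often summarised. Neither constant is an exact value from the paper, and `compute_optimal` is a hypothetical helper name.

```python
import math
from typing import Tuple

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: C ~= 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumption: rule-of-thumb compute-optimal ratio


def compute_optimal(compute_flops: float) -> Tuple[float, float]:
    """Return (parameters, tokens) that spend compute_flops at the assumed ratio."""
    # C = 6 * N * D with D = 20 * N gives C = 120 * N^2, so N = sqrt(C / 120).
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


# Compute implied by a 70B-parameter model trained on 1.4T tokens.
budget = FLOPS_PER_PARAM_TOKEN * 70e9 * 1.4e12
n, d = compute_optimal(budget)
print(f"{n / 1e9:.0f}B parameters, {d / 1e12:.1f}T tokens")

# Doubling the budget scales both quantities by sqrt(2).
n2, d2 = compute_optimal(2 * budget)
print(f"scale factor for 2x compute: {n2 / n:.2f}")
```

Under these assumptions the 70B-parameter, 1.4T-token configuration sits exactly on the 20-tokens-per-parameter line, and doubling compute grows model and data by about 1.41x each.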
The most striking evidence for data quality effects came from the FineWeb dataset released by Hugging Face in 2024. FineWeb is a 15 trillion token web dataset built from Common Crawl with aggressive deduplication and quality filtering. In Hugging Face's benchmark comparisons, models trained on FineWeb outperformed models trained on the same number of tokens from other open web corpora on reading comprehension and reasoning tasks.
Hugging Face also quantified the gain from stricter filtering: models trained on FineWeb-Edu, a 1.3 trillion token subset selected for educational content, scored roughly 10% higher on the MMLU and ARC benchmarks than equivalently sized models trained on generic web text. The quantity of tokens used was smaller; the quality was higher.
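The kind of filtering behind these results can be illustrated with a toy pipeline. This is a minimal sketch, not FineWeb's actual pipeline: the thresholds, the `looks_high_quality` heuristics, and the exact-hash deduplication are simplifying assumptions chosen for illustration. Production pipelines typically rely on fuzzy deduplication and trained quality classifiers.

```python
import hashlib
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return " ".join(text.lower().split())


def looks_high_quality(text: str, min_words: int = 5, max_repeat_ratio: float = 0.5) -> bool:
    """Toy heuristics: enough words, and not dominated by a single repeated word."""
    words = normalize(text).split()
    if len(words) < min_words:
        return False
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) <= max_repeat_ratio


def filter_corpus(docs: Iterable[str]) -> List[str]:
    """Drop exact duplicates (after normalization) and documents failing the heuristics."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of a document already processed
        seen.add(digest)
        if looks_high_quality(doc):
            kept.append(doc)
    return kept


docs = [
    "Gradient descent updates parameters in the direction of steepest loss decrease.",
    "gradient descent  updates parameters in the direction of steepest loss decrease.",
    "buy buy buy buy buy now",
    "Click here",
]
print(filter_corpus(docs))
```

Only the first document survives: the second is a near-identical copy that normalizes to the same hash, the third is dominated by one repeated word, and the fourth is too short.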
Microsoft Research's Phi series provided perhaps the most dramatic demonstration of data quality effects. Phi-1, released in June 2023, was a 1.3 billion parameter model, tiny by contemporary standards, that achieved GPT-3.5 level performance on Python coding benchmarks. It was trained on a carefully curated mix of "textbook quality" data: a filtered subset of code from The Stack and Stack Overflow, plus synthetic textbooks and coding exercises generated by GPT-3.5.
Phi-1's training data totalled roughly 7 billion tokens. For comparison, models 100 times larger trained on generic web data did not clearly outperform it on its target tasks. The Microsoft Research team made the point in the paper's title, "Textbooks Are All You Need": data quality can matter more than quantity.
Phi-2 (2.7B parameters, December 2023) and Phi-3 (3.8B–14B parameters, April 2024) extended this approach, consistently achieving performance competitive with models 5–10x larger. The Phi papers represent the clearest published evidence that data quality is the primary lever — and that synthetic, curated data can substitute for orders of magnitude more raw web text.
A 1.3B parameter model trained on 7B tokens of curated and synthetic "textbook quality" data matched models with vastly more parameters trained on far more raw web data. This was not a minor gain: it was a paradigm-shifting result that directly motivated the synthetic data research wave of 2023–2024.
Training on noisy data is not neutral — it can actively degrade model capability. The mechanisms are well understood: noisy data increases gradient variance during training, causing the model to learn conflicting signals. Content that is grammatically correct but logically incoherent (much automated web content) teaches the model surface fluency without underlying reasoning structure.
OpenAI's data team described filtering decisions for GPT-4's training as among the most consequential choices made during development. The specific decisions remain proprietary, but multiple researchers who left OpenAI have noted in interviews that the quality filtering pipeline — not the model architecture or compute budget — was often the limiting factor in capability improvements.
You're going to work through data quality assessment with the assistant. The goal: develop intuition for what makes training data "high quality" in the technical sense — not just in vague terms, but in terms of the specific properties that affect benchmark performance. Use the Chinchilla and Phi findings as anchors.
In May 2023, Google Research and Google DeepMind published results for Med-PaLM 2, a large language model adapted for medical question answering. On US Medical Licensing Examination-style questions, Med-PaLM 2 scored at what its authors described as expert level. It was a striking result. But the paper's methods section revealed the constraint underlying all medical AI: the model had been fine-tuned on a carefully curated dataset of expert-annotated medical questions that took teams of clinicians months to construct and was far too small for general pretraining.
The internet contains enormous amounts of health-related text — patient forums, medical news, wellness blogs. Almost none of it meets the quality threshold needed to train a model that will give advice in clinical settings. The gap between available medical text and trustworthy medical training data is vast.
Not all specialised domains face identical constraints. The data scarcity problem varies considerably by domain type:
| Domain | Why Scarcity Is Severe | Real-World Consequence |
|---|---|---|
| Clinical medicine | Patient records are protected by HIPAA/GDPR; high-quality clinical notes require expert annotation; error risk is high | Medical AI systems require expensive curation pipelines; most deployed systems perform poorly on edge cases |
| Legal reasoning | Court documents are public but jurisdiction-specific; legal reasoning requires tracking precise precedent chains that generic web text does not contain | AI legal tools hallucinate citations at high rates; Bar association studies show 30–40% error rates on complex reasoning tasks |
| Scientific/technical | Most research is paywalled; preprints vary enormously in quality; domain expertise required to filter noise from valid content | Models trained on general web text generate plausible-sounding but scientifically invalid procedures and results |
| Low-resource languages | Many languages lack significant digital presence; what exists is often low-quality or misrepresentative of actual usage | AI performance on non-English languages degrades dramatically; OECD estimates 80% of the world's languages are effectively unserved |
In 2023, two New York attorneys, Steven Schwartz and Peter LoDuca, submitted a court brief that cited six precedents in support of their argument. The citations were entirely fabricated: ChatGPT, which had been used to research the brief, had generated plausible-sounding but nonexistent case names, docket numbers, and holdings. Judge P. Kevin Castel fined both attorneys and their firm.
The Schwartz/LoDuca case became widely cited not because it was unique but because it was caught and litigated. AI legal tools trained primarily on general web text have no robust grounding in the actual corpus of legal precedent — which sits largely behind expensive subscription databases like Westlaw and LexisNexis. Those databases contain, collectively, perhaps 10 trillion tokens of high-quality legal text that have never been part of any open training run.
Hallucination in specialised domains is not primarily a model architecture problem — it is a training data problem. When a model has seen only sparse or low-quality examples of a domain, it fills gaps with statistically plausible but factually wrong content. The cure is domain-specific high-quality data, which is precisely what is scarce.
The BLOOM model (2022), developed by the BigScience consortium, attempted to address multilingual capability by training on 46 natural languages and 13 programming languages. Despite this, the researchers documented stark disparities: English and high-resource European languages received orders of magnitude more tokens than Swahili, Yoruba, or many Indic languages, even though the latter are spoken by hundreds of millions of people.
The disparity is not a funding problem or a values problem; it is a data infrastructure problem. Swahili has roughly 200 million speakers but perhaps 5 billion tokens of quality-filtered digital text available. English has quality tokens on the order of tens of trillions. A model trained proportionally to speaker population would require synthetic generation to close the gap: there simply is not enough human-written Swahili text to train an equivalent model.
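The scale of that gap can be made concrete with simple arithmetic. The sketch below takes the rough 5 billion token Swahili estimate from the paragraph above and applies an assumed rule of thumb of about 20 training tokens per parameter. The 7B-parameter target is a hypothetical example size, and both function names are illustrative.

```python
TOKENS_PER_PARAM = 20.0  # assumption: Chinchilla-style rule of thumb


def largest_supported_model(available_tokens: float) -> float:
    """Parameters a corpus can support at the assumed tokens-per-parameter ratio."""
    return available_tokens / TOKENS_PER_PARAM


def token_shortfall(target_params: float, available_tokens: float) -> float:
    """Tokens missing (never negative) to train target_params at the assumed ratio."""
    return max(0.0, target_params * TOKENS_PER_PARAM - available_tokens)


swahili_tokens = 5e9  # rough estimate quoted in the text
target_params = 7e9   # hypothetical target model size

print(largest_supported_model(swahili_tokens) / 1e9)          # billions of parameters
print(token_shortfall(target_params, swahili_tokens) / 1e9)   # billions of tokens missing
```

Under these assumptions, 5 billion tokens supports a model of about 0.25 billion parameters, and a 7B-parameter model would be short by roughly 135 billion tokens. That shortfall is the amount that would have to come from translation, transfer from other languages, or synthetic generation.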
The specialised data desert creates a second, more urgent motivation for synthetic data beyond the general internet exhaustion problem. Even if internet text were inexhaustible, it would not solve the medical, legal, scientific, or low-resource language problems — because those problems are not about total volume, they are about the absence of domain-specific, high-quality, expert-grounded content.
Synthetic data generation — using existing capable models to produce new training examples in under-resourced domains — is the only currently viable path to closing these gaps at scale. The approach has documented risks (which subsequent modules address) but it is increasingly the only option. There are simply not enough doctors writing clinical notes, lawyers writing annotated precedent analyses, or Swahili authors producing digital text to close these gaps through data collection alone.
Real training data is running out for four compounding reasons: the finite and largely consumed internet corpus (Lesson 1), legal constraints that shrink accessible high-quality data further (Lesson 2), quality effects that mean raw volume cannot substitute for curated content (Lesson 3), and specialised domain deserts where no amount of web scraping produces usable training data (Lesson 4). Together, these four forces make synthetic data generation a structural necessity for continued AI progress rather than an optional enhancement.
Choose a specialised domain — medical AI, legal AI, a low-resource language, or scientific AI — and work with the assistant to design a data acquisition and generation strategy. What real data exists and how can it be accessed? Where is it insufficient? What would a synthetic data pipeline need to produce to fill the gap? What quality controls would you need?