Module 2 · Lesson 1

The Internet Has Already Been Written

Every webpage, book scan, and forum post that can be scraped has been scraped. So what comes next?

How close are we to exhausting human-written text on the internet, and what does that ceiling mean for AI training?

In 2024, researchers at Epoch AI published a paper with an unusually blunt conclusion: high-quality human text on the internet will be exhausted as a training resource somewhere between 2026 and 2032. Not depleted in the everyday sense — the words would still exist. But every usable token would already have been seen by a model. Training a new, larger model on the same corpus would add nothing new.

The paper was titled "Will We Run Out of Data?" Its answer was effectively yes — at least for the data that had powered the previous decade of progress.

How the Training Data Boom Happened

The first large language models trained on millions of tokens. GPT-2, released in 2019, trained on roughly 40 billion tokens — a dataset called WebText, scraped from outbound Reddit links. At the time, 40B tokens felt enormous.

By 2023, training runs had grown by one to three orders of magnitude. LLaMA 1 trained on 1.4 trillion tokens, about 35 times GPT-2's figure. GPT-4's training data is not publicly disclosed, but estimates range from 10 to 100 trillion tokens. Common Crawl, a freely available web snapshot, contains roughly 250 trillion raw tokens, though heavy deduplication and quality filtering reduce the usable content to perhaps 30–50 trillion tokens.
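The growth factors implied by these figures can be checked with a few lines of arithmetic. The sketch below uses only the token counts quoted in this lesson; the GPT-4 values are the low and high ends of the public estimate range, not disclosed figures.

```python
import math

# Token counts quoted in the lesson (all approximate).
GPT2_TOKENS = 40e9                    # GPT-2 / WebText figure used in this lesson
LLAMA1_TOKENS = 1.4e12                # LLaMA 1
GPT4_LOW, GPT4_HIGH = 10e12, 100e12   # range of estimates, not disclosed figures


def growth(new_tokens: float, old_tokens: float) -> tuple[float, float]:
    """Return (fold increase, orders of magnitude) between two token counts."""
    ratio = new_tokens / old_tokens
    return ratio, math.log10(ratio)


for label, tokens in [
    ("LLaMA 1", LLAMA1_TOKENS),
    ("GPT-4 low estimate", GPT4_LOW),
    ("GPT-4 high estimate", GPT4_HIGH),
]:
    ratio, orders = growth(tokens, GPT2_TOKENS)
    print(f"{label}: {ratio:,.0f}x GPT-2, about {orders:.1f} orders of magnitude")
```

On these numbers, LLaMA 1 sits about one and a half orders of magnitude above GPT-2, and only the upper GPT-4 estimates reach three.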

The models got better, in part, because the datasets got bigger. But datasets cannot grow forever. The internet is large but finite, and the pace of new human writing is far slower than the pace of compute scaling.

~10T: tokens in a top-tier 2023 training run

~100T: estimated quality-filtered internet tokens, total

2026–32: Epoch AI's projected exhaustion window

What "Exhaustion" Actually Means

Data exhaustion does not mean the internet disappears. It means the marginal value of additional web scrapes approaches zero. Once a model has seen every Wikipedia article, every digitised book in Project Gutenberg, every Stack Overflow thread, and every arXiv preprint, seeing them a second or third time during training yields diminishing returns. More tokens from the same distribution do not meaningfully improve capability.

The practical ceiling is not the raw byte count of all text ever written. It is the subset that meets quality thresholds (no spam, no machine-generated boilerplate, no duplicate content) and that can be legally accessed. That subset is considerably smaller than the raw total, and it is what AI labs have spent years racing to acquire.
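To make "quality thresholds" concrete, here is a minimal sketch of the two cheapest filters: exact-duplicate removal and a crude spam heuristic. The function names and thresholds are hypothetical, chosen for illustration. Production pipelines typically add near-duplicate detection and learned quality classifiers.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def looks_usable(text: str, min_words: int = 5, max_repeat_ratio: float = 0.5) -> bool:
    """Toy heuristic: enough words, and not dominated by one repeated word."""
    words = normalize(text).split()
    if len(words) < min_words:
        return False
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) <= max_repeat_ratio


def filter_corpus(docs: list[str]) -> list[str]:
    """Keep the first copy of each document that passes the quality heuristic."""
    seen: set[str] = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen or not looks_usable(doc):
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "The mitochondria is the site of aerobic respiration in the cell.",
    "The  mitochondria is the site of aerobic respiration in the cell.",  # duplicate with extra space
    "buy buy buy buy buy now",  # spam-like repetition
    "Click here",               # too short
]
print(filter_corpus(docs))
```

Only the first document survives here. The gap between raw and usable tokens in a real crawl comes from applying many such filters in sequence.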

Key Finding — Epoch AI (2024)

The Epoch AI analysis modelled data production rates against training data consumption rates. Under the assumption that compute continues scaling at historical rates, demand for high-quality tokens will outstrip supply before 2030. The paper identified this as one of the two most significant near-term bottlenecks in AI progress — the other being compute cost.
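The shape of this supply-versus-demand argument can be sketched as two exponential curves and the year they cross. The starting values below are this lesson's rough figures (~10T tokens per top-tier run, ~100T quality tokens available). The growth rates are illustrative assumptions, not parameters from the Epoch AI paper.

```python
from typing import Optional


def exhaustion_year(
    start_year: int,
    demand_tokens: float,
    stock_tokens: float,
    demand_growth: float,
    stock_growth: float,
    horizon: int = 50,
) -> Optional[int]:
    """Return the first year in which training demand meets or exceeds the stock."""
    for offset in range(horizon + 1):
        if demand_tokens >= stock_tokens:
            return start_year + offset
        demand_tokens *= demand_growth   # tokens wanted by the largest run
        stock_tokens *= stock_growth     # new human writing added each year
    return None


# Assumed rates: demand doubles (or grows 1.5x) yearly, stock grows 5% yearly.
fast = exhaustion_year(2023, 10e12, 100e12, demand_growth=2.0, stock_growth=1.05)
slow = exhaustion_year(2023, 10e12, 100e12, demand_growth=1.5, stock_growth=1.05)
print(f"demand doubling yearly: crossover in {fast}")
print(f"demand growing 1.5x yearly: crossover in {slow}")
```

With these assumed rates the crossover falls in 2027 and 2030. The point of the sketch is that even large changes to the demand growth rate move the date by only a few years, because the stock grows so much more slowly.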

The Three Tiers of Human Text

Not all text is equally useful for training. Researchers distinguish roughly three tiers based on empirical results:

Tier	Examples	Training Value	Availability
High quality	Books, academic papers, curated Wikipedia, legal documents	Very high — dense reasoning, precise language	Limited and shrinking rapidly
Medium quality	News articles, blog posts, GitHub code	Moderate — variable reasoning depth	Large but largely consumed
Low quality	Social media, comment threads, spam	Low — noise often exceeds signal	Vast but requires heavy filtering

The highest-value tier — books and academic writing — is also the most constrained. Google Books scanned roughly 25 million books through its library program, but the vast majority are under copyright and legally contested. The "Books3" dataset used to train several open-weight models is now the subject of active lawsuits. Access to premium data is narrowing even as demand grows.

Why This Matters for Synthetic Data

The data exhaustion problem is the primary structural motivation for synthetic data research. If human text is finite and largely consumed, then generating new training data artificially becomes not a niche technique but a foundational necessity. Every subsequent lesson in this module builds on this constraint.

Token exhaustion: The point at which all available high-quality human text has been incorporated into training data, making further scraping yield negligible marginal improvement.

Data scaling law: Empirical relationship showing that model capability improves predictably as training data volume increases, until the data supply is exhausted.

Common Crawl: A non-profit that has scraped the public web since 2008, producing one of the largest publicly available training corpora, estimated at 250+ trillion raw tokens before filtering.

Lesson 1 Quiz

The Internet Has Already Been Written — check your understanding

1. According to Epoch AI's 2024 analysis, when is high-quality human text on the internet projected to be exhausted as a training resource?

Correct. Epoch AI's paper "Will We Run Out of Data?" projected exhaustion of high-quality training data between 2026 and 2032 if compute scaling continues at historical rates.

Not quite. Epoch AI's 2024 paper specifically projected the exhaustion window as 2026–2032, based on modelling data production rates against training consumption rates.

2. What does "token exhaustion" mean in the context of AI training?

Correct. Token exhaustion describes the point where the marginal value of additional scraping approaches zero because the quality-filtered corpus has already been consumed.

Not quite. Token exhaustion refers to the training data supply problem — once all high-quality human text has been incorporated into training runs, additional scraping adds little marginal value.

3. Which tier of human text has the highest training value but the most constrained supply?

Correct. Books and academic writing contain dense reasoning and precise language, making them most valuable for training — but most are under copyright, making large-scale access legally contested.

Incorrect. Books and academic papers sit at the top of the quality tier but are heavily constrained by copyright law, with datasets like Books3 currently subject to active litigation.

4. Approximately how many tokens did LLaMA 1 train on, compared to GPT-2's approximately 40 billion?

Correct. LLaMA 1 trained on 1.4 trillion tokens, a roughly 35-fold increase over GPT-2's 40 billion, which illustrates how quickly training data scaled between 2019 and 2023.

Not quite. LLaMA 1 used 1.4 trillion tokens — a ~35x increase from GPT-2's 40 billion, demonstrating how rapidly training data requirements scaled over just a few years.

Lab 1: Mapping the Data Ceiling

Explore the limits of internet-scale training data with an AI research assistant

Your Task

You're going to interrogate the data exhaustion problem. Ask the assistant to help you think through the numbers: How much quality-filtered text actually exists? What growth rate does new human writing add? Why can't models just keep reusing the same data? Push on the assumptions behind the Epoch AI projections.

Suggested start: "Help me understand why simply adding more compute doesn't solve the data exhaustion problem. Isn't there a way to just train on the same data repeatedly?"

Data Ceiling Analysis

Welcome to Lab 1. I'm your research assistant for exploring the data exhaustion problem. The core question we're working with: if human text on the internet is finite and largely consumed, what does that actually imply for AI progress? Ask me anything about the numbers, the Epoch AI projections, how quality filtering works, or why repetition doesn't substitute for new data.

Module 2 · Lesson 2

Legal Walls: Copyright, Consent, and the Lawsuit Wave

The data crisis isn't only technical — it's legal. Courts are now deciding what AI companies are allowed to train on.

How have copyright lawsuits and regulatory pressure since 2023 changed what training data AI labs can legally access?

In late 2023, the New York Times filed a copyright infringement lawsuit against OpenAI and Microsoft, alleging that GPT-4 had been trained on millions of its articles without license or compensation. The suit cited specific examples in which the model reproduced lengthy passages verbatim. It was one of the largest and most prominent cases in a growing wave of litigation brought by authors, visual artists, musicians, and software developers.

By mid-2024, the legal landscape for training data had changed fundamentally. Labs that had previously operated on an implicit assumption of "scrape now, litigate never" now faced real financial exposure — and, more practically, real uncertainty about which data they could use in future training runs.

The Key Cases and What They Established

The volume of AI copyright litigation since 2022 is unprecedented in tech legal history. Several cases set or are in the process of setting important precedents:

2022

Doe v. GitHub / Microsoft / OpenAI: Filed November 2022. A class action by software developers alleging Copilot was trained on GitHub code in violation of open-source licenses. Central question: does training a model on code whose license requires attribution or share-alike constitute a license violation?

2023

Getty Images v. Stability AI: Filed in early 2023. Getty alleges Stable Diffusion was trained on 12 million of its watermarked images without a license. The case is proceeding in US and UK courts simultaneously and has not yet reached final judgment, but discovery has forced Stability AI to disclose training data details.

2023

Authors Guild class action: George R.R. Martin, John Grisham, Jodi Picoult, and more than a dozen other prominent authors file suit against OpenAI, alleging their books were used as training data without consent. The complaint explicitly targets GPT-4 training.

2023

NY Times v. OpenAI / Microsoft: Filed December 2023. The Times presents evidence of near-verbatim reproduction of its articles by GPT models. This case is widely watched as a potential precedent for whether training on copyrighted news text qualifies as fair use.

The Fair Use Question

US copyright law's fair use doctrine allows limited use of copyrighted material under four factors: purpose and character of use, nature of the copyrighted work, amount taken, and effect on the market for the original. AI training raises genuinely novel questions about each factor.

AI companies argue that training is "transformative": the model does not store or reproduce the original text but learns statistical patterns. Rights holders counter that when a model can reproduce verbatim passages and displaces the market for the original work (as the Times lawsuit alleges), the transformation argument fails.

No US court has issued a final ruling on the core fair use question for LLM training as of mid-2024. The uncertainty itself is a form of practical constraint: labs must now budget for legal risk when choosing datasets.

EU AI Act — Data Transparency Requirement

The EU AI Act, which entered into force in August 2024, requires "general-purpose AI" providers to publish summaries of training data used to develop foundation models. This creates regulatory pressure on data sourcing beyond the US litigation context — and gives rights holders a formal mechanism to identify whether their content was used.

How Legal Pressure Shrinks the Effective Dataset

Even when a case doesn't result in an immediate injunction, litigation changes lab behaviour. The Books3 dataset — roughly 196,000 copyrighted books scraped from a shadowy file-sharing site — was quietly removed from the RedPajama dataset in 2023 after legal scrutiny. Hugging Face removed it from public access. Labs that had trained on it faced retroactive exposure.

The practical effect: the high-quality book corpus that sat at the top of the training data quality tier became substantially less accessible. Future models either pay licensing fees, use only public-domain books, or substitute synthetic text — which is precisely why synthetic data research accelerated sharply after 2023.

The Licensing Alternative

Some labs have moved toward paid data licensing. OpenAI signed deals with the Associated Press and with Axel Springer (publisher of Business Insider and Politico) to license news content. These agreements are commercially significant but highlight the constraint: high-quality data at scale is no longer free. The economics of training now include data acquisition costs that did not exist at GPT-2's scale.

Fair use doctrine: US copyright law provision allowing limited use of copyrighted material without permission; whether AI training qualifies remains unresolved in the courts.

Books3: A dataset of ~196,000 copyrighted books scraped without license, used to train several prominent models; removed from public repositories in 2023 amid legal pressure.

EU AI Act (2024): European regulation requiring foundation model providers to disclose training data summaries, creating formal transparency obligations for data sourcing.

Lesson 2 Quiz

1. What was the central legal claim in the New York Times v. OpenAI lawsuit filed in December 2023?

Correct. The Times lawsuit alleged copyright infringement through training on its articles, presenting specific examples of GPT-4 reproducing lengthy passages nearly verbatim.

Not quite. The core claim was copyright infringement: GPT-4 was trained on millions of Times articles without license or compensation, and the model could reproduce them near-verbatim.

2. What happened to the Books3 dataset in 2023?

Correct. Books3 was quietly removed from the RedPajama dataset and from Hugging Face's public access after legal scrutiny of its origins — it had been scraped from a file-sharing site without license.

Incorrect. Books3 was removed from the RedPajama dataset and Hugging Face public repositories in 2023 after scrutiny over its origins as unlicensed scrapes from a file-sharing site.

3. What transparency obligation does the EU AI Act (2024) impose on foundation model providers?

Correct. The EU AI Act requires general-purpose AI providers to publish training data summaries, giving rights holders a formal mechanism to determine if their content was used.

Not correct. The EU AI Act's training data requirement is a transparency/disclosure obligation: providers must publish summaries of the data used, enabling rights holders to check for inclusion.

4. Which of the following best describes why legal uncertainty itself constrains training data availability, even without court injunctions?

Correct. Legal risk functions as a practical constraint even without final rulings — the cost and exposure of litigation makes labs avoid certain data sources even when technically able to scrape them.

Not quite. The key mechanism is risk budgeting: even without injunctions, the financial and reputational exposure of litigation causes labs to self-restrict data sourcing, effectively shrinking the usable dataset.

Lab 2: The Legal Constraint Scenario

Work through the legal and licensing landscape with an AI assistant

Your Task

You're a researcher advising an AI lab about to begin a new training run. The lab wants to use the highest-quality data possible. Use the assistant to reason through which data sources are legally safe, which are risky, and what the licensing alternatives look like. Think about the tradeoffs between quality, legality, and cost.

Suggested start: "I'm advising a lab that wants to train on book-quality text. Walk me through the current legal landscape — what can we safely use, and what's the risk profile of different approaches?"

Legal & Licensing Advisor

I'm your legal and licensing advisor for this lab. We'll work through the data sourcing problem from a legal risk perspective. I can discuss the major copyright cases, fair use doctrine as it applies to AI training, what the EU AI Act requires, and the economics of licensed data. What aspect of the legal landscape do you want to explore first?

Module 2 · Lesson 3

The Quality Problem: Not All Data Is Equal

Having more tokens isn't enough if they come from the wrong distribution. The gap between data volume and data quality is where models fail.

Why did simply scaling up web-scraped data stop producing proportional improvements — and what does data quality actually mean in practice?

When Google DeepMind released the Chinchilla paper in March 2022, it upended a dominant assumption in the field. The prevailing wisdom — demonstrated by GPT-3 — was that you should train the largest model you could afford on as much data as was available. Chinchilla showed this was wrong. A smaller model, trained on proportionally more data, consistently outperformed a larger model undertrained on less data. Compute and data needed to scale together.

The implication was immediate: the field had been massively underinvesting in data. But as labs scrambled to acquire more tokens, a second problem emerged. The web's easily scraped text was not uniformly useful. Much of it was low-quality boilerplate, duplicate content, machine-generated spam, and text that, when fed to a model in bulk, actually degraded performance on reasoning tasks.

What the Chinchilla Paper Actually Found

Hoffmann et al. (2022) trained over 400 language models of different sizes on different amounts of data to find the best trade-off at each compute budget. Their finding: for a given compute budget, the optimal strategy is to train a substantially smaller model than had become standard, on far more tokens, at roughly 20 training tokens per parameter. GPT-3 (175B parameters) was trained on 300B tokens, fewer than 2 tokens per parameter. Chinchilla itself (70B parameters, 1.4T tokens) used the same compute budget as DeepMind's earlier 280B-parameter Gopher, which was trained on 300B tokens, and outperformed both Gopher and GPT-3.

The Chinchilla scaling laws became the blueprint for subsequent models. LLaMA, Mistral, and the Gemma family all follow its core lesson of pairing a given model size with far more training tokens than GPT-3-era practice. But following the Chinchilla prescription requires substantially more data, and that data, per Lesson 1, is increasingly scarce.

The Chinchilla Compute-Optimal Relationship

The paper established that for compute-optimal training, model size and training tokens should scale in roughly equal proportion. Because training compute is roughly proportional to the product of the two, doubling your compute budget means growing both the model and the dataset by about 1.4× (the square root of 2), not simply training a bigger model on the same corpus.
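The relationship can be turned into a small calculator. The sketch below rests on two assumptions that are not stated in this lesson: the widely used approximation that training compute is about 6 FLOPs per parameter per token, and a fixed ratio of 20 tokens per parameter, which is what 70B parameters on 1.4T tokens implies.

```python
import math

TOKENS_PER_PARAM = 20.0        # implied by 70B parameters on 1.4T tokens
FLOPS_PER_PARAM_TOKEN = 6.0    # common approximation: training FLOPs ~ 6 * N * D


def compute_optimal(flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (parameters, tokens) under C = 6*N*D and D = 20*N."""
    params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params


# Budget of a 70B-parameter model trained on 1.4T tokens.
budget = FLOPS_PER_PARAM_TOKEN * 70e9 * 1.4e12
n1, d1 = compute_optimal(budget)
n2, d2 = compute_optimal(2 * budget)

print(f"{n1:.3g} parameters, {d1:.3g} tokens")
print(f"doubling compute scales parameters by {n2 / n1:.2f} and tokens by {d2 / d1:.2f}")
```

Both quantities grow with the square root of compute, which is why every increase in compute budget also raises the demand for training tokens.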

Data Quality: The Empirical Evidence

The most striking evidence for data quality effects came from the FineWeb dataset released by Hugging Face in 2024. FineWeb is a 15 trillion token web dataset built from Common Crawl, but with aggressive quality filtering. In benchmark comparisons, models trained on FineWeb's 1.3T quality-filtered subset outperformed models trained on much larger unfiltered corpora on reading comprehension and reasoning tasks.

Hugging Face quantified the gain: FineWeb-Edu, a further filtered subset focusing on educational content, produced roughly 10% higher scores on MMLU and ARC benchmarks than equivalently-sized models trained on generic web text. The quantity of tokens used was smaller; the quality was higher.
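The general pattern behind a filtered subset like this is score-and-threshold: assign each document a quality score, then keep only those above a cutoff. The sketch below shows that pattern only. The keyword scorer is a hypothetical stand-in for a learned classifier, and the 0 to 5 scale and threshold are assumptions for illustration, not a description of Hugging Face's pipeline.

```python
from typing import Callable

# Hypothetical vocabulary for the toy scorer; a real pipeline would use a trained model.
EDU_KEYWORDS = {"theorem", "definition", "example", "experiment", "because", "therefore"}


def toy_edu_score(text: str) -> float:
    """Stand-in scorer: two points per distinct keyword hit, capped at 5."""
    words = {w.strip(".,;:!?").lower() for w in text.split()}
    return float(min(5, 2 * len(words & EDU_KEYWORDS)))


def filter_by_score(docs: list[str], scorer: Callable[[str], float], threshold: float) -> list[str]:
    """Keep documents whose score meets the threshold."""
    return [d for d in docs if scorer(d) >= threshold]


docs = [
    "Definition: a prime has exactly two divisors. Example: 7 is prime because only 1 and 7 divide it.",
    "lol did you see the game last night",
    "Therefore the experiment supports the hypothesis.",
]
kept = filter_by_score(docs, toy_edu_score, threshold=3.0)
kept_tokens = sum(len(d.split()) for d in kept)
total_tokens = sum(len(d.split()) for d in docs)
print(f"kept {len(kept)}/{len(docs)} documents, {kept_tokens / total_tokens:.0%} of whitespace tokens")
```

Raising the threshold trades volume for quality, which is the same trade-off the benchmark comparison above measures at scale.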

15T: total FineWeb tokens (raw filtered)

1.3T: FineWeb educational subset

~10%: MMLU benchmark gain over generic web text

The Phi Models: Quality Over Quantity

Microsoft Research's Phi series provided perhaps the most dramatic demonstration of data quality effects. Phi-1, released in June 2023, was a 1.3 billion parameter model, tiny by contemporary standards, that achieved GPT-3.5 level performance on Python coding benchmarks. It was trained on a carefully curated mix of "textbook quality" data: a filtered subset of code from The Stack and Stack Overflow, plus synthetic textbooks and coding exercises generated by GPT-3.5.

Phi-1's training data totalled roughly 7 billion tokens. For comparison, models 100 times larger trained on generic web data did not clearly outperform it on its target tasks. The Microsoft Research team made the thesis explicit in the paper's title: "Textbooks Are All You Need."

Phi-2 (2.7B parameters, December 2023) and Phi-3 (3.8B–14B parameters, April 2024) extended this approach, consistently achieving performance competitive with models 5–10x larger. The Phi papers represent the clearest published evidence that data quality is the primary lever — and that synthetic, curated data can substitute for orders of magnitude more raw web text.

The Phi-1 Finding in Plain Terms

A 1.3B parameter model trained on 7B tokens of curated/synthetic "textbook quality" data matched models trained on vastly more parameters and raw web data. This was not a minor gain — it was a paradigm-shifting result that directly motivated the synthetic data research wave of 2023–2024.

Why More Low-Quality Data Can Hurt

Training on noisy data is not neutral — it can actively degrade model capability. The mechanisms are well understood: noisy data increases gradient variance during training, causing the model to learn conflicting signals. Content that is grammatically correct but logically incoherent (much automated web content) teaches the model surface fluency without underlying reasoning structure.
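The gradient-variance claim can be illustrated on a toy problem. The sketch below fits a one-parameter least-squares model and measures how the variance of per-example gradients changes as a fraction of labels is replaced with random values. Every number in it is an assumption chosen for illustration; it shows the mechanism only and is not evidence about language model training.

```python
import random
import statistics


def gradient_variance(noise_fraction: float, n: int = 5000, seed: int = 0) -> float:
    """Variance of per-example gradients for a 1-D least-squares model y = w * x.

    A fraction of the labels is replaced by random values to mimic noisy data.
    Gradients are evaluated at the true weight, where clean examples nearly agree.
    """
    rng = random.Random(seed)
    w_true = 2.0
    grads = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = w_true * x + rng.gauss(0.0, 0.05)      # clean label with small noise
        if rng.random() < noise_fraction:
            y = rng.uniform(-2.0, 2.0)             # corrupted label, unrelated to x
        grads.append(2.0 * (w_true * x - y) * x)   # d/dw of (w*x - y)^2
    return statistics.pvariance(grads)


for fraction in (0.0, 0.1, 0.3):
    print(f"noise fraction {fraction:.1f}: gradient variance {gradient_variance(fraction):.4f}")
```

In this toy setting, variance rises steadily with the corrupted fraction: each noisy example pulls the weight in a direction unrelated to the underlying pattern.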

OpenAI's data team described filtering decisions for GPT-4's training as among the most consequential choices made during development. The specific decisions remain proprietary, but multiple researchers who left OpenAI have noted in interviews that the quality filtering pipeline — not the model architecture or compute budget — was often the limiting factor in capability improvements.

Chinchilla scaling laws: Empirical rules from Hoffmann et al. (2022) showing compute-optimal training requires scaling model size and training tokens in equal proportion, not simply maximising model size.

FineWeb: Hugging Face's 15T token filtered web dataset (2024); its educational subset outperformed larger unfiltered corpora on reasoning benchmarks despite containing fewer total tokens.

Phi series (Microsoft): Small language models (1.3B–14B parameters) demonstrating that textbook-quality curated and synthetic data enables performance competitive with models 5–10x larger trained on generic web text.

Lesson 3 Quiz

The Quality Problem — check your understanding

1. What was the central finding of the Chinchilla paper (Hoffmann et al., 2022)?

Correct. Chinchilla showed that earlier large models were undertrained for their compute: a 70B parameter model trained on 1.4T tokens outperformed larger models trained on about 300B tokens, including the 175B GPT-3.

Not quite. The Chinchilla finding was specifically about compute-optimal allocation: model size and training tokens should scale together, not just model size. A smaller model on more data beat a larger model on less data.

2. Microsoft's Phi-1 model (1.3B parameters) demonstrated which key insight about training data?

Correct. Phi-1 was a paradigm-shifting result: 1.3B parameters, 7B tokens of curated/synthetic "textbook quality" data, matching models far larger on Python coding benchmarks.

Incorrect. Phi-1's key finding was that data quality could substitute for orders of magnitude more parameters — it matched GPT-3.5 on Python coding with just 1.3B parameters and curated training data.

3. What did Hugging Face's FineWeb-Edu dataset demonstrate about filtered data?

Correct. Models trained on FineWeb-Edu's 1.3T token educational subset scored roughly 10% higher on MMLU and ARC than equivalently sized models trained on generic web text, which shows that quality can outweigh quantity.

Not correct. Models trained on FineWeb-Edu, a quality-filtered educational subset, achieved approximately 10% higher benchmark scores than models trained on generic web text, demonstrating that quality filtering improves rather than hurts performance.

4. Why can training on large amounts of low-quality data actively degrade model performance rather than simply being neutral?

Correct. The mechanism is well-understood: noisy data creates conflicting training signals that increase gradient variance, and grammatically fluent but logically incoherent text teaches surface patterns without underlying reasoning ability.

Not quite. The degradation mechanism is gradient variance from conflicting signals, plus the model learning fluency without reasoning from incoherent text — this applies regardless of total dataset size or model size.

Lab 3: Diagnosing Data Quality

Apply quality assessment frameworks to real training data scenarios

Your Task

You're going to work through data quality assessment with the assistant. The goal: develop intuition for what makes training data "high quality" in the technical sense — not just in vague terms, but in terms of the specific properties that affect benchmark performance. Use the Chinchilla and Phi findings as anchors.

Suggested start: "Help me build a practical rubric for assessing training data quality. What specific properties distinguish textbook-quality data from generic web text in terms of what they teach a model?"

Data Quality Assessment

Welcome to Lab 3. I'm here to help you think rigorously about data quality in AI training. We can work through quality rubrics, look at specific examples of high vs. low quality text and why they differ, examine what the Phi and FineWeb results tell us practically, or discuss how filtering pipelines work. What would you like to tackle?

Module 2 · Lesson 4

The Specialised Data Desert

General web text is one problem. For medicine, law, science, and low-resource languages, the problem is far worse — almost no human-written data exists at the required quality or scale.

Why do specialised and low-resource domains face a data scarcity crisis that general internet scraping cannot solve — and what does this mean for AI's real-world usefulness?

In May 2023, researchers at Google Research and Google DeepMind published results for Med-PaLM 2, a large language model adapted for medical question answering. On US Medical Licensing Examination-style questions, Med-PaLM 2 scored at expert doctor level, higher than the average human physician. It was a striking result. But the paper's methods section revealed the constraint underlying all medical AI: the model had been fine-tuned on a carefully curated dataset of expert-annotated medical questions that took teams of clinicians months to construct and was far too small for general pretraining.

The internet contains enormous amounts of health-related text — patient forums, medical news, wellness blogs. Almost none of it meets the quality threshold needed to train a model that will give advice in clinical settings. The gap between available medical text and trustworthy medical training data is vast.

The Four Scarcity Categories

Not all specialised domains face identical constraints. The data scarcity problem varies considerably by domain type:

Domain	Why Scarcity Is Severe	Real-World Consequence
Clinical medicine	Patient records are protected by HIPAA/GDPR; high-quality clinical notes require expert annotation; error risk is high	Medical AI systems require expensive curation pipelines; most deployed systems perform poorly on edge cases
Legal reasoning	Court documents are public but jurisdiction-specific; legal reasoning requires tracking precise precedent chains that generic web text does not contain	AI legal tools hallucinate citations at high rates; Bar association studies show 30–40% error rates on complex reasoning tasks
Scientific/technical	Most research is paywalled; preprints vary enormously in quality; domain expertise required to filter noise from valid content	Models trained on general web text generate plausible-sounding but scientifically invalid procedures and results
Low-resource languages	Many languages lack significant digital presence; what exists is often low-quality or misrepresentative of actual usage	AI performance on non-English languages degrades dramatically; OECD estimates 80% of world's languages are effectively unserved

The Legal AI Case Study

In May 2023, two New York attorneys — Steven Schwartz and Peter LoDuca — submitted a court brief that cited six precedents in support of their argument. The citations were entirely fabricated. The ChatGPT system they had used generated plausible-sounding but entirely nonexistent case names, docket numbers, and holdings. Judge P. Kevin Castel fined both attorneys and their firm.

The Schwartz/LoDuca case became widely cited not because it was unique but because it was caught and litigated. AI legal tools trained primarily on general web text have no robust grounding in the actual corpus of legal precedent — which sits largely behind expensive subscription databases like Westlaw and LexisNexis. Those databases contain, collectively, perhaps 10 trillion tokens of high-quality legal text that have never been part of any open training run.

The Hallucination-Scarcity Link

Hallucination in specialised domains is not primarily a model architecture problem — it is a training data problem. When a model has seen only sparse or low-quality examples of a domain, it fills gaps with statistically plausible but factually wrong content. The cure is domain-specific high-quality data, which is precisely what is scarce.

Low-Resource Languages: The Scale of the Gap

The BLOOM model (2022), developed by the BigScience consortium, attempted to address multilingual capability by training on 46 natural languages and 13 programming languages. Despite this, the researchers documented stark disparities: English and high-resource European languages received orders of magnitude more tokens than languages like Swahili, Yoruba, or Indic languages despite the latter being spoken by hundreds of millions of people.

The disparity is not a funding problem or a values problem — it is a data infrastructure problem. Swahili has roughly 200 million speakers but perhaps 5 billion tokens of quality-filtered digital text available. English has perhaps 100 trillion quality tokens available. A model trained proportionally to speaker population would require synthetic generation to close the gap — there simply is not enough human-written Swahili text to train an equivalent model.
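The arithmetic behind this gap is short enough to check directly. The sketch below uses the lesson's own estimates. The target of 1.4 trillion tokens is a hypothetical choice (the LLaMA 1 figure from Lesson 1), used only to show how much of a comparable training run could not come from existing human-written Swahili text.

```python
ENGLISH_TOKENS = 100e12    # lesson's estimate of quality English tokens
SWAHILI_TOKENS = 5e9       # lesson's estimate of quality Swahili tokens
SWAHILI_SPEAKERS = 200e6   # lesson's estimate of Swahili speakers

gap = ENGLISH_TOKENS / SWAHILI_TOKENS
tokens_per_speaker = SWAHILI_TOKENS / SWAHILI_SPEAKERS


def synthetic_shortfall(target_tokens: float, available_tokens: float) -> float:
    """Tokens that would have to come from somewhere other than existing human text."""
    return max(0.0, target_tokens - available_tokens)


TARGET = 1.4e12  # hypothetical target: the LLaMA 1 token count cited in Lesson 1
shortfall = synthetic_shortfall(TARGET, SWAHILI_TOKENS)

print(f"English/Swahili gap: {gap:,.0f}x")
print(f"Swahili tokens per speaker: {tokens_per_speaker:.0f}")
print(f"Shortfall for a {TARGET:.1e}-token run: {shortfall:.3e} tokens ({shortfall / TARGET:.1%})")
```

Under these estimates, more than 99% of a LLaMA-1-sized Swahili training run would have to come from sources other than existing human-written text.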

~5B: estimated quality Swahili tokens available

~100T: estimated quality English tokens available

20,000×: approximate gap in training data availability

Why This Drives the Synthetic Data Imperative

The specialised data desert creates a second, more urgent motivation for synthetic data beyond the general internet exhaustion problem. Even if internet text were inexhaustible, it would not solve the medical, legal, scientific, or low-resource language problems — because those problems are not about total volume, they are about the absence of domain-specific, high-quality, expert-grounded content.

Synthetic data generation — using existing capable models to produce new training examples in under-resourced domains — is the only currently viable path to closing these gaps at scale. The approach has documented risks (which subsequent modules address) but it is increasingly the only option. There are simply not enough doctors writing clinical notes, lawyers writing annotated precedent analyses, or Swahili authors producing digital text to close these gaps through data collection alone.

Module 2 in Summary

Real training data is running out for four compounding reasons: the finite and largely-consumed internet corpus (L1), legal constraints that shrink accessible high-quality data further (L2), quality effects that mean raw volume cannot substitute for curated content (L3), and specialised domain deserts where no amount of web scraping produces usable training data (L4). Together, these four forces make synthetic data generation not an optional enhancement — but a structural necessity for continued AI progress.

Hallucination-scarcity link The empirical observation that AI hallucination rates are highest in domains where training data is scarcest — because the model fills gaps with statistically plausible but ungrounded content.

Low-resource language: A language with insufficient digital text for adequate AI model training. Roughly 80% of the world's spoken languages fall into this category, including some with hundreds of millions of speakers.

Domain-specific data desert: A specialised field (medicine, law, science, low-resource languages) where the available quality training data is structurally insufficient regardless of how aggressively the public internet is scraped.

Lesson 4 Quiz

The Specialised Data Desert — check your understanding

1. What happened in the Schwartz/LoDuca case in May 2023, and what does it illustrate about specialised AI data?

Correct. Attorneys Schwartz and LoDuca submitted AI-generated fake citations — plausible-sounding but entirely nonexistent cases. Judge Castel fined both. It illustrates the hallucination-scarcity link: models lacking legal training data fill gaps with statistically plausible fiction.

Not quite. Schwartz and LoDuca submitted fabricated case citations generated by ChatGPT and were sanctioned by Judge Castel. The case demonstrates that models trained on general web text have no robust grounding in actual legal precedent.

2. Approximately how large is the gap in quality-filtered training data available between English and Swahili?

Correct. The estimated gap is approximately 20,000× — ~100T quality English tokens versus ~5B quality Swahili tokens. This is not closeable by data collection; synthetic generation is the only viable path.

Not quite. The gap is approximately 20,000×: estimated 100 trillion quality-filtered English tokens versus roughly 5 billion for Swahili — despite Swahili having ~200 million speakers.

3. Why does the hallucination-scarcity link mean that hallucination is fundamentally a training data problem rather than a model architecture problem?

Correct. The mechanism is statistical gap-filling: sparse training data in a domain means the model has no grounded examples to draw on, so it generates plausible-sounding completions without factual anchoring. More domain-specific training data reduces this directly.

Incorrect. Hallucination in specialised domains occurs because sparse domain training data leaves gaps that the model fills with statistically plausible but factually wrong completions. It is not an architecture issue — better domain data reduces it regardless of model size or design.

4. Why is synthetic data generation described as a "structural necessity" for specialised domains rather than a nice-to-have enhancement?

Correct. The structural necessity argument is simple: the rate at which clinicians write clinical notes, lawyers annotate precedents, and Swahili authors produce digital text is far below what would be needed to train competitive domain-specific models. Synthetic generation is the only scalable option.

Not quite. The necessity argument is about supply rates: there are not enough domain experts producing enough text to close the gaps through real data collection at any commercially viable pace or cost. Synthetic data generation is the only scalable path forward.

Lab 4: Designing for the Data Desert

Develop a data strategy for a specialised domain where real data is structurally insufficient

Your Task

Choose a specialised domain — medical AI, legal AI, a low-resource language, or scientific AI — and work with the assistant to design a data acquisition and generation strategy. What real data exists and how can it be accessed? Where is it insufficient? What would a synthetic data pipeline need to produce to fill the gap? What quality controls would you need?

Suggested start: "I want to design a training data strategy for a medical diagnosis AI that needs to perform at expert level. Walk me through what real clinical data exists, why it's insufficient, and what a synthetic data pipeline for this domain would need to look like."

Specialised Domain Data Strategy

Welcome to Lab 4. I'm your data strategy advisor for specialised AI domains. We can work through the medical AI data landscape, legal AI data constraints, low-resource language challenges, or scientific/technical domain data scarcity. The goal is to move from abstract understanding to a concrete strategy: what data exists, why it's insufficient, and what synthetic generation would need to produce. Which domain would you like to tackle?

Module 2 Test

Why Real Data Is Running Out — 15 questions · Pass at 80%

1. What two-word phrase describes the point at which all available high-quality human text has been incorporated into training runs and further scraping yields negligible marginal improvement?

Correct. Token exhaustion describes the supply-side ceiling for human-written training data.

The correct term is "token exhaustion" — the point where marginal value of additional scraping approaches zero.

2. Epoch AI's 2024 paper projected that high-quality internet training data would be exhausted by:

Correct. Epoch AI's "Will We Run Out of Data?" paper projected the exhaustion window as 2026–2032.

Epoch AI projected the 2026–2032 window based on compute scaling rates versus data production rates.

3. GPT-2 trained on approximately 40 billion tokens; LLaMA 1 trained on approximately:

Correct. LLaMA 1 trained on 1.4 trillion tokens, a ~35x increase over GPT-2 and one step in the rapid data scaling between 2019 and 2023.

LLaMA 1 used 1.4 trillion tokens, roughly 35x GPT-2's 40 billion.

4. Which dataset was quietly removed from public repositories in 2023 after legal scrutiny of its origins?

Correct. Books3 (~196,000 copyrighted books scraped without license) was removed from RedPajama and Hugging Face amid litigation pressure.

Books3 was removed from the RedPajama dataset and Hugging Face in 2023 after scrutiny of its unlicensed origins.

5. The New York Times lawsuit against OpenAI, filed in December 2023, specifically alleged:

Correct. The Times presented evidence of near-verbatim reproduction, making the case about the scope of fair use for LLM training.

The core claim was training on unlicensed Times articles with evidence of near-verbatim reproduction by GPT-4.

6. What data transparency obligation does the EU AI Act (2024) impose on foundation model providers?

Correct. The EU AI Act requires training data summary disclosure, giving rights holders a formal mechanism to identify inclusion of their content.

The EU AI Act requires training data summary disclosure — a transparency obligation, not a consent or deletion requirement.

7. The Chinchilla paper (Hoffmann et al., 2022) established that for a fixed compute budget, optimal training requires:

Correct. Chinchilla showed compute-optimal training scales model size and tokens together — not just model size, as GPT-3's training implicitly assumed.

Chinchilla's finding: scale model size and training tokens together for compute-optimal results. GPT-3 had massively undertrained its model relative to its size.

8. Microsoft's Phi-1 model achieved GPT-3.5-level Python coding performance with which training setup?

Correct. Phi-1's result — matching GPT-3.5 on coding with 1.3B params and 7B quality tokens — was the paradigm-shifting demonstration that data quality dominates quantity.

Phi-1 used 1.3B parameters and ~7B textbook-quality/synthetic tokens — tiny by contemporary standards — yet matched GPT-3.5 on Python benchmarks.

9. Hugging Face's FineWeb-Edu dataset showed that quality-filtered educational content:

Correct. FineWeb-Edu's 1.3T token subset outperformed much larger unfiltered datasets by ~10% on reasoning benchmarks — quality over quantity demonstrated at scale.

FineWeb-Edu showed ~10% higher MMLU and ARC scores than larger unfiltered datasets. Less data, better quality, better results.

10. Why does low-quality data actively degrade model performance rather than simply being neutral?

Correct. The two mechanisms: conflicting signals increase gradient variance, and logically incoherent text teaches fluency patterns without reasoning depth.

Low-quality data creates conflicting training signals (higher gradient variance) and teaches surface patterns without underlying reasoning structure — both actively harmful.

11. In the Schwartz/LoDuca case (2023), what specific AI failure led to attorney sanctions?

Correct. Six fabricated case citations — nonexistent case names, docket numbers, and holdings — were submitted to Judge Castel, who sanctioned both attorneys.

The failure was hallucinated citations: six completely fabricated case names, docket numbers, and holdings that did not exist.

12. The BLOOM multilingual model (2022) trained on 46 languages but documented stark disparities in token counts. Which statement best captures the core finding?

Correct. The disparity reflected available digital text infrastructure, not speaker populations — demonstrating that low-resource language scarcity is structural, not addressable by scraping harder.

BLOOM documented that token allocation tracked digital text availability, not speaker population — leaving high-speaker languages like Swahili severely undertrained.

13. The "hallucination-scarcity link" describes which causal relationship?

Correct. The link is causal: sparse domain data → knowledge gaps → gap-filling with plausible-sounding fiction → high hallucination rate in that domain.

The hallucination-scarcity link: sparse training data in a domain means the model fills gaps with statistically plausible but factually ungrounded content.

14. Which of the following best explains why synthetic data is described as a "structural necessity" rather than just a convenience for specialised domains?

Correct. The necessity argument is about supply rates: clinicians, lawyers, and language authors cannot produce domain-specific text at the scale and pace modern AI training requires. Synthetic generation is the only scalable option.

The structural necessity argument: domain expert text production rates are too slow and too costly to close gaps through real data collection alone — synthetic generation is the only viable path at scale.

15. Which of the following represents the correct ordering of the four compounding forces driving real data scarcity covered in this module?

Correct. The four forces: (1) finite internet corpus, (2) legal constraints, (3) quality effects, and (4) specialised domain deserts — together make synthetic data generation a structural necessity, not an optional enhancement.

The four forces from this module: finite internet corpus (L1), legal constraints (L2), quality effects (L3), and specialised data deserts (L4). Together they create the synthetic data imperative.