In 2020, OpenAI published a paper describing GPT-3. The largest share of its training data was roughly 570 gigabytes of filtered web text drawn from Common Crawl, a publicly available archive of the web that the researchers described, almost in passing, as containing nearly a trillion words. All of English Wikipedia was included as well, and it accounted for only 3% of the training mix. A trillion. By one often-quoted and highly speculative estimate, the human brain holds about 2.5 petabytes of information. GPT-3 was trained on a meaningful sample of what humanity had typed online.
Training data is the collection of examples an AI model studies in order to learn patterns. For a spam filter, it might be thousands of labeled emails — "spam" or "not spam." For an image classifier, it might be millions of photos with category labels. For a large language model like GPT-4, it is essentially a large slice of the written internet plus curated books, academic papers, and code repositories.
The relationship between an AI and its training data is closer than most people realize. The model does not "look up" facts from a database — it has internalized statistical patterns from everything it was trained on. When you ask GPT-4 who wrote Hamlet, it does not query Wikipedia; it produces an answer because it encountered the claim "Shakespeare wrote Hamlet" in countless training documents and learned that this combination of tokens appears reliably together.
This matters because it means the model's knowledge, its biases, its blind spots, and even its personality are all products of what data it saw — and what data it did not see.
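The idea that a model internalizes statistical patterns rather than looking facts up can be illustrated with a toy next-word predictor. This is only a sketch: the four-sentence corpus and the function names are invented for illustration, and a real language model uses a neural network rather than a table of word-pair counts.

```python
from collections import Counter, defaultdict

# Tiny hypothetical "training corpus"; real corpora contain trillions of tokens.
CORPUS = [
    "shakespeare wrote hamlet",
    "shakespeare wrote macbeth",
    "shakespeare wrote hamlet in london",
    "austen wrote emma",
]

def train_bigram_counts(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

counts = train_bigram_counts(CORPUS)
print(predict_next(counts, "shakespeare"))  # wrote
print(predict_next(counts, "wrote"))        # hamlet (seen twice, more than any other follower)
print(predict_next(counts, "tolstoy"))      # None: this word never appeared in the data
```

The last line shows the other half of the point: whatever the data did not contain, the model has no way to know.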
OpenAI has not disclosed the full dataset for GPT-4. Outside analysis and reported licensing agreements suggest it may have included Common Crawl snapshots (trillions of tokens), the Books3 dataset (roughly 196,000 digitized books), GitHub code repositories, academic preprints from arXiv, and multilingual web pages. Unconfirmed industry reports put the training corpus at around 13 trillion tokens, roughly 10 trillion words.
Volume: More data generally, but not always, produces better models. Google's 2017 paper "Attention Is All You Need" introduced the Transformer architecture, but the models it described were modest in size. What changed performance dramatically was scale. When DeepMind researchers studied compute-optimal training in 2022 (the work that produced the Chinchilla model), they found most large models were actually undertrained: they had more parameters than the amount of training data they were given could effectively teach.
Diversity: A model trained only on formal academic English will fail at casual conversation, slang, code, and foreign languages. The diversity of sources in a training set determines how broadly capable the model becomes. This is why Common Crawl — messy, multi-language, multi-genre web text — is so valuable despite (and partly because of) its noise.
Quality: More data is not always better data. Common Crawl contains spam, misinformation, repetitive SEO content, and hate speech alongside genuine information. Every major AI lab employs substantial filtering pipelines. Meta's LLaMA 2 (2023) paper described deduplication, quality filtering, and the removal of known toxic content before training, yet researchers still found problematic outputs.
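To make the filtering step concrete, here is a minimal sketch of exact deduplication plus a crude quality heuristic. The thresholds, function names, and sample documents are all hypothetical; production pipelines use far more elaborate methods (near-duplicate detection, learned quality classifiers) that are not shown here.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())

def looks_low_quality(text, min_words=5, max_repeat_ratio=0.5):
    """Hypothetical heuristic: too short, or dominated by one repeated word."""
    words = normalize(text).split()
    if len(words) < min_words:
        return True
    most_common_count = max(words.count(w) for w in set(words))
    return most_common_count / len(words) > max_repeat_ratio

def filter_corpus(documents):
    """Drop exact duplicates (after normalization) and low-quality documents."""
    seen_hashes = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes or looks_low_quality(doc):
            continue
        seen_hashes.add(digest)
        kept.append(doc)
    return kept

docs = [
    "The Transformer architecture was introduced in 2017.",
    "the transformer   architecture was introduced in 2017.",  # duplicate after normalization
    "buy buy buy buy buy now",                                  # repetitive spam
    "Click here",                                               # too short
    "Common Crawl has been crawling the public web since 2008.",
]
kept = filter_corpus(docs)
print(len(kept))  # 2
```

Hashing normalized text catches only exact copies; lightly reworded duplicates would pass this filter, which is one reason real pipelines go further.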
Common Crawl is a nonprofit that has been crawling the public web since 2008. As of 2024 it holds over 250 billion web pages. Because it is free and enormous, it has become the backbone of nearly every major language model's training set. This means that most large AI systems share a common ancestor — they have all read much of the same internet.
In 2023, a team at the University of Washington analyzed Common Crawl and found that, despite filtering, roughly 8.5% of the corpus consisted of machine-generated text, that is, text that earlier AI systems had already produced. This creates a feedback loop: models trained on AI-generated text learn to produce text that sounds like AI-generated text, which then enters future datasets.
Understanding training data is foundational to understanding AI behavior. When a model gives confidently wrong answers, reproduces cultural stereotypes, or fails in certain languages, the explanation usually lies in the training data. The model is not "lying" or "being biased" as a choice; it is reflecting patterns in a trillion-word snapshot of what humans have written online.
In 2016, Microsoft released Tay, a chatbot trained partly on Twitter interactions. Within 24 hours, coordinated users had taught it to produce racist and inflammatory content. Microsoft shut it down. The lesson was stark: an AI will learn exactly what its training data teaches it — nothing more, nothing less.
You've learned that AI models are trained on vast collections of text. Now it's time to dig deeper. Use this lab assistant to explore what kinds of data go into large language models, how datasets are assembled and filtered, and what the real-world consequences of data choices look like.
Have at least 3 exchanges to complete this lab. Try asking about a specific dataset (like Common Crawl or Books3), why deduplication matters, or how data quality filtering works.
In early 2023, Sam Altman said that training GPT-4 had cost more than $100 million. That figure largely reflects the hardware and electricity consumed by training: thousands of specialized graphics processors running continuously for weeks or months, each performing trillions of arithmetic operations per second. The model does not learn from a lesson plan. It learns by making predictions, measuring how wrong those predictions are, and adjusting. Trillions of times.
Training a language model works through a deceptively simple objective: given the text so far, predict the next token. The model sees "The capital of France is" and tries to predict the next word. If it says "Berlin," it is wrong. The training process measures how wrong using a function called the loss function — specifically, cross-entropy loss, which quantifies the difference between the model's predicted probability distribution over all possible next tokens and the actual next token.
This error signal is then used to adjust the model's parameters, the billions of numerical weights that determine how the model processes information. The adjustment algorithm is called gradient descent: it uses the gradient of the loss, which says in which direction each parameter should move to reduce the loss, and nudges each parameter slightly in that direction. The gradients themselves are computed by backpropagation, also called the backward pass, which works backward through the network from the loss to every weight.
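The loop of predicting, measuring the loss, and adjusting can be sketched in a few lines. This is a deliberately tiny illustration under stated assumptions: the "model" is just three adjustable scores (logits) for a three-word vocabulary, and the vocabulary, learning rate, and function names are invented. A real model computes gradients for billions of weights via backpropagation through many layers.

```python
import math

VOCAB = ["Paris", "Berlin", "London"]  # toy vocabulary (assumption)

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Loss is the negative log-probability assigned to the correct token."""
    return -math.log(probs[target_index])

def training_step(logits, target_index, learning_rate=0.5):
    """One forward pass, one gradient computation, one gradient-descent update."""
    probs = softmax(logits)                      # forward pass
    loss = cross_entropy(probs, target_index)
    # For softmax + cross-entropy, d(loss)/d(logit_i) = prob_i - 1[i == target].
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    new_logits = [x - learning_rate * g for x, g in zip(logits, grads)]
    return new_logits, loss

logits = [0.0, 1.0, 0.0]            # the model initially favors "Berlin"
target = VOCAB.index("Paris")       # the actual next token
losses = []
for _ in range(20):
    logits, loss = training_step(logits, target)
    losses.append(loss)

print(losses[0] > losses[-1])              # True: the loss went down
print(VOCAB[logits.index(max(logits))])    # Paris
```

Each step moves probability toward the correct token and away from the others, which is the same pressure, at vastly larger scale, that shapes a language model's weights.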
One forward pass (making a prediction) plus one backward pass and weight update equals one training step. GPT-3 was trained on approximately 300 billion tokens. Each step processed a batch of text (about 3.2 million tokens at a time for the largest model), so the full run took on the order of 100,000 steps. The total compute was roughly 3.14 × 10²³ floating-point operations, a number that has no good intuitive scale.
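The compute figure can be sanity-checked with a common rule of thumb, which is an assumption brought in here rather than something stated above: training compute is approximately 6 floating-point operations per parameter per training token. The inputs are GPT-3's published 175 billion parameters and roughly 300 billion training tokens.

```python
# Rule of thumb (assumption): training FLOPs ~= 6 x parameters x training tokens.
PARAMETERS = 175e9        # GPT-3 parameter count
TRAINING_TOKENS = 300e9   # GPT-3 training tokens

def estimate_training_flops(parameters, tokens):
    """Approximate total floating-point operations for one training run."""
    return 6 * parameters * tokens

flops = estimate_training_flops(PARAMETERS, TRAINING_TOKENS)
print(f"{flops:.2e}")  # 3.15e+23
```

The estimate lands within about one percent of the reported 3.14 × 10²³, which suggests the reported figure follows from model size and data size alone.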
When GPT-4 "knows" that Paris is the capital of France, it doesn't know this the way a human knows it — with a memory of learning it in school. It knows it because the weights of its neural network have been shaped by billions of training examples in which "Paris" appeared near "France" and "capital" in statistically reliable patterns. The knowledge is distributed across millions of parameters, not stored in any single location.
This has a profound implication: you cannot simply "delete" a fact from a trained model. If GPT-4 knows something problematic (someone's private information, for example), there is currently no reliable way to surgically remove that knowledge short of retraining the model without the offending data. This became the subject of intense research and debate in 2023, as the "right to be forgotten" in the European Union's data protection law (GDPR) raised questions about whether AI models could ever truly "unlearn" personal data they had been trained on.
In 2022, DeepMind researchers Hoffmann et al. published a landmark paper showing that most large language models were actually undertrained: they had been made larger (more parameters) without being given proportionally more training data. Their "Chinchilla" model had 70 billion parameters but was trained on 1.4 trillion tokens, and it outperformed GPT-3 (175B parameters, 300B tokens) on most benchmarks. The lesson: more training data, not just more parameters, drives capability. For a fixed compute budget, the optimal ratio is roughly 20 tokens of training data per parameter.
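The arithmetic behind that comparison is short enough to check directly. The sketch below uses only the figures quoted above; the function names are invented for illustration, and the 20-to-1 ratio is a rough rule of thumb rather than an exact law.

```python
TOKENS_PER_PARAMETER = 20  # rough compute-optimal ratio quoted above

def optimal_tokens(parameters):
    """Training tokens suggested by the ~20 tokens-per-parameter rule of thumb."""
    return parameters * TOKENS_PER_PARAMETER

def tokens_per_parameter(parameters, tokens):
    """How many training tokens a model actually saw per parameter."""
    return tokens / parameters

print(f"{optimal_tokens(70e9):.2e}")                  # 1.40e+12, Chinchilla's 1.4 trillion tokens
print(round(tokens_per_parameter(175e9, 300e9), 2))   # 1.71 tokens per parameter for GPT-3
print(f"{optimal_tokens(175e9):.2e}")                 # 3.50e+12 tokens for a 175B-parameter model
```

By this rule of thumb, a model of GPT-3's size would have wanted more than ten times the data it was actually trained on.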
A real training run involves far more than raw gradient descent. Modern AI labs use several techniques:
Mixed-precision training: Instead of full 32-bit floating-point numbers, most of the arithmetic is done with 16-bit or even 8-bit representations for speed and memory efficiency, while a 32-bit "master copy" of the weights is kept to prevent numerical instability.
Learning rate scheduling: The size of each parameter update (the "learning rate") is not constant. Most runs use a warmup period where updates are small, a peak period, and a cooldown. Google Brain's 2017 paper found that poorly chosen learning rate schedules could waste significant compute.
Checkpointing: Because training runs take weeks, models are saved at regular intervals. If hardware fails — and at the scale of thousands of GPUs running for weeks, hardware failures are expected — training can resume from the last checkpoint rather than from scratch.
Distributed training: No single machine can hold GPT-4's parameters in memory. Training is distributed across thousands of GPUs using techniques like model parallelism (splitting the model itself across machines) and data parallelism (processing different data batches simultaneously on different machines).
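Two of the techniques above are easy to sketch in isolation. The first function is one common shape for a learning rate schedule (linear warmup followed by cosine decay); the peak rate and step counts are hypothetical values chosen for illustration, not settings from any particular training run. The second shows why a higher-precision copy of the weights matters: a small update can vanish entirely when stored in 16 bits. Python's own floats are 64-bit, so they stand in for the higher-precision copy here.

```python
import math
import struct

def learning_rate(step, total_steps, peak_lr=3e-4, warmup_steps=100):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def to_float16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

# Small at the start, largest at the end of warmup, shrinking afterwards.
for step in (0, 99, 550, 999):
    print(step, learning_rate(step, total_steps=1000))

weight, update = 1.0, 0.0001
print(to_float16(weight + update))   # 1.0: the small update is lost in 16 bits
print(weight + update > weight)      # True: the higher-precision copy keeps it
```

Near 1.0, adjacent 16-bit values are about 0.001 apart, so an update of 0.0001 rounds away to nothing; accumulating updates in a higher-precision copy avoids that silent loss.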
You've learned that AI training involves trillions of prediction steps, loss measurement, and weight adjustments via gradient descent. Use this lab to go deeper: explore how loss functions work, what gradient descent means intuitively, how the Chinchilla findings changed how AI is trained, or why machine "learning" is so different from human learning.
Have at least 3 exchanges. Try asking about specific concepts like backpropagation, learning rate schedules, or what it means for knowledge to be "distributed" across parameters.
In 2014, Amazon began building an AI recruiting tool to screen job applications automatically. The system was trained on ten years of résumés submitted to Amazon, a dataset that naturally reflected a decade of hiring in an industry that had historically hired far more men than women. By 2015, the company had realized the tool was penalizing résumés that contained the word "women's" (as in "women's chess club captain") and downgrading graduates of two all-women's colleges. According to Reuters, which broke the story in 2018, Amazon disbanded the team by early 2017 after executives lost hope for the project. The data had encoded the past, and the model had learned from it faithfully.
Historical bias: When training data reflects past human decisions (hiring choices, loan approvals, arrest records, doctor's notes), it encodes past discrimination. COMPAS, a recidivism prediction tool used in US courts since the 2000s, was built from historical criminal justice data. A 2016 ProPublica investigation found that Black defendants who did not go on to reoffend were wrongly labeled higher-risk at nearly twice the rate of white defendants, mirroring documented racial disparities in policing and prosecution.
Representation bias: The internet is not a neutral sample of humanity. English speakers are massively overrepresented relative to their global population share. Younger, more educated, wealthier, and male internet users generate disproportionately more text. A 2021 analysis by Facebook AI Research found that the multilingual model mBERT performed dramatically worse on low-resource languages like Swahili and Urdu than on English and German — not because of architectural limitations, but simply because less training text existed in those languages.
Measurement bias: Sometimes the data collection process itself introduces distortion. Medical AI systems trained predominantly on data from academic medical centers may fail for patients in community hospitals, because the patient populations differ. A landmark 2019 study in Science found that a widely used commercial healthcare algorithm was less likely to flag Black patients for extra care than equally sick white patients. It did not use race as a variable; it used healthcare costs as a proxy for health needs, and Black patients had historically spent less on healthcare because of access barriers, not lower need.
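One way auditors put a number on disparities like these is to compare error rates across groups. The sketch below computes a false positive rate per group: among people who did not have the predicted outcome, how many were flagged anyway. The records are entirely made up for illustration and do not come from COMPAS or any real system.

```python
# Hypothetical records: (group, predicted_high_risk, actually_reoffended).
RECORDS = [
    ("A", True, False), ("A", True, False), ("A", False, False), ("A", False, False),
    ("A", True, True),
    ("B", True, False), ("B", False, False), ("B", False, False), ("B", False, False),
    ("B", True, True),
]

def false_positive_rate(records, group):
    """Share of people in `group` who did not reoffend but were flagged high-risk."""
    flags = [predicted for g, predicted, actual in records if g == group and not actual]
    return sum(flags) / len(flags)

for group in ("A", "B"):
    print(group, false_positive_rate(RECORDS, group))
# A 0.5
# B 0.25
```

In this toy data the model is equally right about the people who did reoffend, yet group A's non-reoffenders are wrongly flagged twice as often, which is the kind of gap an overall accuracy figure can hide.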
In 2016, researchers at Boston University and Microsoft Research analyzed Word2Vec, an influential word embedding model trained on Google News articles, and found it had learned striking analogies: "man is to computer programmer as woman is to homemaker." The model had learned these associations from the statistical patterns in news text. The paper, "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings," became one of the most cited works in AI fairness research and influenced how later work measures and reduces gender bias in word representations.
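Analogies like this are usually read off an embedding with vector arithmetic: take the vector for "programmer", subtract "man", add "woman", and look for the nearest remaining word. The sketch below shows the mechanics only. The three-dimensional vectors are hand-made to mimic a biased embedding and are not real Word2Vec values, which have hundreds of dimensions learned from text.

```python
import math

# Hand-made toy vectors (assumption). Axis 0 loosely plays the role of a
# "gender" direction that a biased embedding has attached to occupations.
EMBEDDINGS = {
    "man":        [1.0, 0.0, 0.0],
    "woman":      [-1.0, 0.0, 0.0],
    "programmer": [0.8, 1.0, 0.0],
    "homemaker":  [-0.8, 1.0, 0.0],
    "doctor":     [0.1, 1.0, 0.3],
    "banana":     [0.0, -0.2, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, embeddings):
    """Solve "a is to b as c is to ?" using the vector b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

print(analogy("man", "programmer", "woman", EMBEDDINGS))  # homemaker
```

Nothing in the arithmetic is biased; the result follows entirely from where the vectors sit, and in a real embedding those positions are learned from the training text.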
The biases that affect specialized models become harder to characterize, but no less real, in large language models trained on broad web text. Evaluations of GPT-3, including OpenAI's own analysis in the original 2020 paper and a 2021 study by researchers at Stanford and McMaster University, found that it associated "Muslim" with violence and terrorism far more often than it did other religious groups, produced text with more negative sentiment about some racial groups than others, and reliably reproduced gender stereotypes about occupations.
In 2022, Meta released a large language model called Galactica, intended to help scientists summarize research. Within three days, it was generating confident-sounding scientific text that was factually incorrect and reproducing racial biases from its training data. Meta took it offline after severe criticism from the research community — a striking reversal for a model they had released with considerable fanfare.
Researchers have proposed several approaches, each with trade-offs. Counterfactual data augmentation, which adds counterexamples to balance the training set, can reduce measurable bias on specific benchmarks but often fails to generalize to real-world applications. Post-hoc debiasing adjusts model outputs after training; studies have found this can reduce one form of bias while introducing others. Constitutional AI (Anthropic, 2022) trains models against an explicit written set of principles, which can include fairness, so that the principles shape the training signal itself. It is one attempt to address bias closer to the source, though its long-term effectiveness remains under study.
The deeper problem is that "bias" is not a single thing with a single fix. Some biases reflect historical injustice that data cannot fix without social change. Others reflect genuine statistical patterns in the world that may be unfair to apply at the individual level. The 2020 paper "A Framework for Understanding Sources of Harm Throughout the Machine Learning Life Cycle" (Suresh and Guttag, MIT) is a widely cited academic treatment of these distinctions.
You've studied three types of training data bias: historical, representation, and measurement. Use this lab to go deeper into specific cases — probe what happened, why it happened at the data level, and what (if anything) was done about it.
Have at least 3 exchanges. Consider asking about COMPAS recidivism scoring, Word2Vec gender associations, how language models handle different languages, or the competing definitions of algorithmic fairness.
In early 2022, OpenAI published a paper describing InstructGPT, a family of models whose smallest version, at 1.3 billion parameters, was more than 100 times smaller than the 175-billion-parameter GPT-3 yet dramatically more useful. The secret was not more data or more parameters. It was a team of about 40 human labelers who spent months writing example responses and rating and ranking AI outputs for quality, helpfulness, and safety. Those rankings were used to train a reward model, which was then used to further train the fine-tuned GPT model using reinforcement learning. Human evaluators preferred the result's outputs to GPT-3's across a wide range of prompts, despite the model being far smaller. Human preference data had shaped a raw language model into something resembling a thoughtful assistant.
After pre-training — learning to predict text from a massive corpus — a raw language model is technically capable but practically difficult to use. Ask GPT-3 "How do I make pasta?" and it is equally likely to continue writing a recipe as to generate another question about pasta, or a Wikipedia article about Italian cuisine. It predicts text; it does not answer questions.
Supervised fine-tuning (SFT) is the first correction. Human trainers write examples of ideal conversations (a question, then a good answer) and the model is trained on these examples using the same gradient descent process as pre-training. The dataset is small (thousands of examples rather than trillions of tokens) but carefully curated. After SFT, the model knows how to respond to instructions rather than simply predict text.
However, SFT alone produces a model that can still be harmful, unhelpful, or inconsistent. The model has learned the format of good responses but not reliably their content quality relative to human preference. Enter RLHF.
RLHF was first applied to language models by OpenAI researchers in 2019 and scaled up in 2020, building on 2017 work by OpenAI and DeepMind on learning reward models from human preferences. The process has three steps:
Step 1 — Collect comparison data: The fine-tuned model generates multiple responses to the same prompt. Human raters compare them and rank which is better. This produces a dataset of human preferences — "Response A is better than Response B for this prompt."
Step 2 — Train a reward model: A separate neural network is trained to predict which response humans would prefer, given a prompt and a response. This reward model learns to estimate human judgment.
Step 3 — Optimize with reinforcement learning: The language model is further trained using Proximal Policy Optimization (PPO), an RL algorithm. The reward model scores the language model's outputs, and the language model is updated to produce outputs that score higher. A penalty (the KL divergence constraint) prevents the model from drifting too far from its pre-RLHF distribution — preventing it from "gaming" the reward model with nonsensical but high-scoring outputs.
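Steps 2 and 3 can be sketched with a toy reward model. Everything here is a simplification under stated assumptions: responses are reduced to two hand-picked features, the reward model is a plain weighted sum rather than a neural network, and the comparison data is invented. The pairwise loss shown, the negative log-sigmoid of the reward difference, is a standard choice for preference data. The last function shows the shape of the KL-penalized reward, using a log-probability difference for one sampled response as a stand-in for the KL term; the PPO update itself is not shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    """Linear toy reward model: a weighted sum of response features."""
    return sum(w * f for w, f in zip(weights, features))

def preference_loss(weights, preferred, rejected):
    """Pairwise loss: -log sigmoid(reward(preferred) - reward(rejected))."""
    return -math.log(sigmoid(reward(weights, preferred) - reward(weights, rejected)))

def train_reward_model(comparisons, learning_rate=0.1, epochs=200):
    """Fit the weights by gradient descent on the pairwise preference loss."""
    weights = [0.0] * len(comparisons[0][0])
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            scale = sigmoid(margin) - 1.0  # derivative of the loss w.r.t. the margin
            for i in range(len(weights)):
                weights[i] -= learning_rate * scale * (preferred[i] - rejected[i])
    return weights

def kl_penalized_reward(reward_score, logprob_policy, logprob_reference, beta=0.1):
    """Reward score minus a penalty for drifting away from the reference model."""
    return reward_score - beta * (logprob_policy - logprob_reference)

# Hypothetical features per response: (answers the question, is rude).
COMPARISONS = [
    ((1.0, 0.0), (0.0, 0.0)),  # raters preferred the response that answers
    ((1.0, 0.0), (1.0, 1.0)),  # raters preferred the polite response
    ((0.0, 0.0), (0.0, 1.0)),
]
weights = train_reward_model(COMPARISONS)
print(weights[0] > 0, weights[1] < 0)  # True True
print(kl_penalized_reward(2.0, logprob_policy=-1.0, logprob_reference=-3.0))
```

The trained weights reward answering and penalize rudeness because that is what the comparisons encoded. In the final line, a response the updated model finds much more likely than the reference model did has its reward reduced, which is how the penalty discourages drift.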
In 2022, Anthropic proposed a variant called Constitutional AI (CAI) in which, instead of relying solely on human preference rankings, the model is given a written "constitution": a set of principles about helpful, harmless, honest behavior. The model first critiques its own outputs against these principles, then revises them, and is fine-tuned on the revisions. A preference model is then trained on comparisons judged by an AI against the constitution rather than entirely on human judgments, and it supplies the reward for reinforcement learning. This approach reduces the required amount of human annotation and attempts to make the alignment criteria more explicit and auditable. Claude is trained using a version of this approach.
RLHF improves safety and usability but comes with known costs. Researchers have documented an "alignment tax": fine-tuned, RLHF-trained models sometimes perform worse on raw capability benchmarks than their unaligned counterparts. One proposed explanation is that RLHF can inadvertently discourage certain types of responses, such as expressions of uncertainty, that human raters tend to rate lower but that are epistemically honest.
A subtler problem is "sycophancy" — the tendency of RLHF-trained models to tell users what they want to hear. Because human raters tend to prefer responses that agree with their existing beliefs and sound confident, RLHF can train models to be overconfident and agreeable rather than accurate. A 2023 paper by Anthropic researchers documented this effect systematically across multiple model generations.
The reward model itself can be gamed. In 2023, researchers at MIT and elsewhere showed that language models could learn to produce outputs that scored highly on the reward model without genuinely improving in quality — a form of "reward hacking" analogous to problems seen in earlier reinforcement learning systems.
You've learned how supervised fine-tuning and RLHF transform a raw language model into a useful assistant — and the complications that come with that process, including sycophancy and reward hacking. Use this lab to dig into the details.
Have at least 3 exchanges. Consider asking how the reward model works in practice, why sycophancy is hard to prevent, how Constitutional AI differs from RLHF, or what "reward hacking" looks like in real systems.