Module 2 · Lesson 1

What Is Training Data?

Every AI system learns from examples. What those examples are shapes everything.

Where does an AI's knowledge actually come from — and how much is enough?

In 2020, OpenAI published a paper describing GPT-3. The largest share of its training data was roughly 570 gigabytes of filtered web text drawn from Common Crawl, a publicly available archive of the web that the paper describes as containing nearly a trillion words. All of English Wikipedia was included as well, yet it accounted for only about 3% of the training mix. A trillion. Even after heavy filtering, GPT-3 learned from a larger body of text than any person could read in many lifetimes.

The Basics: What Training Data Is

Training data is the collection of examples an AI model studies in order to learn patterns. For a spam filter, it might be thousands of labeled emails — "spam" or "not spam." For an image classifier, it might be millions of photos with category labels. For a large language model like GPT-4, it is essentially a large slice of the written internet plus curated books, academic papers, and code repositories.

The relationship between an AI and its training data is closer than most people realize. The model does not "look up" facts from a database — it has internalized statistical patterns from everything it was trained on. When you ask GPT-4 who wrote Hamlet, it does not query Wikipedia; it produces an answer because it encountered the claim "Shakespeare wrote Hamlet" in countless training documents and learned that this combination of tokens appears reliably together.

This matters because it means the model's knowledge, its biases, its blind spots, and even its personality are all products of what data it saw — and what data it did not see.

Real Scale — GPT-4 Training Data (2023)

OpenAI has not disclosed the dataset used to train GPT-4. Outside observers have speculated that it included Common Crawl snapshots (trillions of tokens), book collections such as Books3 (~180,000 digitized books), GitHub code repositories, academic preprints from arXiv, and multilingual web pages, but none of this has been confirmed. Unofficial estimates have put the training corpus at around 13 trillion tokens, or roughly 10 trillion words.

Three Fundamental Properties of Training Data

Volume: More data generally, but not always, produces better models. Google's 2017 paper "Attention Is All You Need" introduced the Transformer architecture, but the models it described were modest in size. What changed performance dramatically was scale. When DeepMind researchers studied compute-optimal training in 2022 (the work that produced the Chinchilla model), they found that most large models were actually undertrained: they had more parameters than their training data could effectively teach.

Diversity: A model trained only on formal academic English will fail at casual conversation, slang, code, and foreign languages. The diversity of sources in a training set determines how broadly capable the model becomes. This is why Common Crawl — messy, multi-language, multi-genre web text — is so valuable despite (and partly because of) its noise.

Quality: More data is not always better data. Common Crawl contains spam, misinformation, repetitive SEO content, and hate speech alongside genuine information. Every major AI lab employs substantial filtering pipelines. Meta's LLaMA papers (2023) described deduplication, language identification, and quality filtering of web data before training, yet the resulting models could still produce problematic outputs.
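The kind of filtering described above can be sketched in a few lines. This is a minimal illustration, not any lab's actual pipeline: the blocklist, the word-count threshold, and the function names are all hypothetical, and real systems use trained classifiers and near-duplicate detection rather than simple string checks.

```python
import hashlib

# Hypothetical blocklist and threshold, chosen only for illustration.
BLOCKED_TERMS = {"buy now", "click here"}
MIN_WORDS = 5


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def filter_corpus(documents: list[str]) -> list[str]:
    """Drop exact duplicates, very short documents, and blocklisted text."""
    seen_hashes = set()
    kept = []
    for doc in documents:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(digest)
        if len(norm.split()) < MIN_WORDS:
            continue  # too short to carry much signal
        if any(term in norm for term in BLOCKED_TERMS):
            continue  # crude stand-in for a spam or toxicity classifier
        kept.append(doc)
    return kept


docs = [
    "The Transformer architecture was introduced in 2017.",
    "the transformer   architecture was introduced in 2017.",
    "Click here to buy now and save big today!",
    "Too short.",
]
cleaned = filter_corpus(docs)
print(cleaned)
```

Only the first document survives: the second is a duplicate once case and spacing are normalized, the third matches the blocklist, and the fourth is below the length threshold.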

The Common Crawl Problem

Common Crawl is a nonprofit that has been crawling the public web since 2008. As of 2024 it holds over 250 billion web pages. Because it is free and enormous, it has become the backbone of nearly every major language model's training set. This means that most large AI systems share a common ancestor — they have all read much of the same internet.

In 2023, a team at the University of Washington analyzed Common Crawl and found that despite filtering, roughly 8.5% of the corpus consisted of machine-generated text, meaning text that earlier AI systems had already produced. This creates a feedback loop: models trained on AI-generated text learn to produce text that sounds like AI-generated text, which then enters future datasets.

Key Vocabulary

Training corpus: The full collection of data used to train a model. "Corpus" is Latin for "body"; it is the body of text the model learns from.

Token: The basic unit of text that AI language models process. Roughly ¾ of a word on average. "Unhappiness" might be split into "Un" and "happiness", making two tokens.

Deduplication: Removing duplicate or near-duplicate documents before training. Critical because repeated text skews what the model "thinks" is common.

Data contamination: When benchmark test questions appear in the training data, inflating measured performance. A known problem with models trained on broad web data.
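One simple way to check for data contamination is to look for long word sequences shared between a benchmark question and a training document. The sketch below assumes whitespace tokenization and an 8-word window; both are illustrative choices, and the function names are hypothetical.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return every run of n consecutive lowercase words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(benchmark_item: str, training_doc: str, n: int = 8) -> bool:
    """Flag the item if any n-word sequence also appears in the training doc."""
    return bool(ngrams(benchmark_item, n) & ngrams(training_doc, n))


question = "What is the boiling point of water at sea level in degrees Celsius?"
leaked_doc = (
    "Quiz archive: what is the boiling point of water at sea level "
    "in degrees celsius? Answer: 100."
)
clean_doc = "Water boils at 100 degrees Celsius at sea level."

leaked = is_contaminated(question, leaked_doc)
clean = is_contaminated(question, clean_doc)
print(leaked, clean)
```

The leaked document shares a long verbatim run with the question, so it is flagged. The clean document states the same fact in different words and is not.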

Why This Lesson Matters

Understanding training data is foundational to understanding AI behavior. When a model gives confidently wrong answers, reproduces cultural stereotypes, or fails in certain languages — the answer almost always lies in the training data. The model is not "lying" or "being biased" as a choice; it is reflecting patterns in a trillion-word snapshot of what humans have written online.

In 2016, Microsoft released Tay, a chatbot trained partly on Twitter interactions. Within 24 hours, coordinated users had taught it to produce racist and inflammatory content. Microsoft shut it down. The lesson was stark: an AI will learn exactly what its training data teaches it — nothing more, nothing less.

Lesson 1 · Quiz

What Is Training Data?

Three questions. Tap an answer to see feedback.

1. When you ask a large language model a factual question, how does it retrieve the answer?

Correct. LLMs do not query external databases at inference time (unless given special tools). They produce outputs based purely on statistical patterns learned from training data — which is why they can confidently produce wrong answers.

Not quite. Standard LLMs do not retrieve from live databases. They produce outputs from internalized patterns learned during training. Tools like Bing Chat add retrieval on top of a base model, but the base model itself works from trained patterns.

2. What was the primary lesson from Microsoft's Tay chatbot incident in 2016?

Exactly right. Tay had no malicious intent — it simply learned from the examples it was given. Coordinated users deliberately fed it harmful examples. The incident demonstrated that training data quality and source control are critical safety considerations.

The Tay incident illustrated a training data problem, not a platform problem or coding error. The model learned exactly what it was taught — hateful content from coordinated Twitter users who exploited its real-time learning feature.

3. The 2023 University of Washington study found that Common Crawl contained approximately what percentage of machine-generated text?

Correct — approximately 8.5%. This is a significant concern because it creates a feedback loop: models trained on AI-generated text produce more AI-generated text, which enters future training sets, potentially amplifying the homogenizing effect over time.

The study found roughly 8.5%. This is high enough to be concerning — it creates a self-reinforcing loop where AI-generated text trains future AI systems, potentially drifting from accurate human-generated content over time.

Lesson 1 · Lab

Exploring Training Data Sources

Ask questions about what goes into AI training corpora.

Lab: What Did This AI Actually Read?

You've learned that AI models are trained on vast collections of text. Now it's time to dig deeper. Use this lab assistant to explore what kinds of data go into large language models, how datasets are assembled and filtered, and what the real-world consequences of data choices look like.

Have at least 3 exchanges to complete this lab. Try asking about a specific dataset (like Common Crawl or Books3), why deduplication matters, or how data quality filtering works.

Starter questions: "What is Common Crawl and why is it so widely used?" · "How do AI labs decide what to filter out?" · "Why does duplicate data cause problems in training?"

Module 2 · Lesson 2

How Models Learn

Gradient descent, loss functions, and the billions of small adjustments that produce intelligence.

How does staring at text turn into the ability to answer questions, write code, and hold a conversation?

In 2023, Sam Altman said publicly that training GPT-4 had cost more than $100 million. That figure reflects the electricity and hardware consumed by training at this scale: thousands of specialized graphics processors running continuously for weeks, each performing trillions of arithmetic operations per second. The model does not learn from a lesson plan. It learns by making predictions, measuring how wrong those predictions are, and adjusting, over and over, across trillions of tokens.

The Core Mechanism: Predict, Measure, Adjust

Training a language model works through a deceptively simple objective: given the text so far, predict the next token. The model sees "The capital of France is" and tries to predict the next word. If it says "Berlin," it is wrong. The training process measures how wrong using a function called the loss function — specifically, cross-entropy loss, which quantifies the difference between the model's predicted probability distribution over all possible next tokens and the actual next token.
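For a single prediction, cross-entropy loss reduces to the negative logarithm of the probability the model assigned to the token that actually came next. The sketch below uses made-up probabilities over a three-word vocabulary to show how the loss grows when the model puts its weight on the wrong answer.

```python
import math


def cross_entropy(probs: dict[str, float], actual_token: str) -> float:
    """Loss for one prediction: -log of the probability given to the true token."""
    return -math.log(probs[actual_token])


# Hypothetical model outputs for the token after "The capital of France is".
mostly_right = {"Paris": 0.70, "Berlin": 0.20, "London": 0.10}
mostly_wrong = {"Paris": 0.05, "Berlin": 0.90, "London": 0.05}

loss_right = cross_entropy(mostly_right, "Paris")
loss_wrong = cross_entropy(mostly_wrong, "Paris")
print(f"{loss_right:.3f} {loss_wrong:.3f}")
```

A model that gives "Paris" a probability of 0.70 gets a loss of about 0.36, while one that gives it 0.05 gets a loss of about 3.0. Training pushes the parameters toward the first situation.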

This error signal is then used to adjust the model's parameters, the billions of numerical weights that determine how the model processes information. Two pieces work together here. Backpropagation, also called the backward pass, calculates how much each parameter contributed to the error. Gradient descent then uses that information to nudge each parameter slightly in the direction that reduces the loss.
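Gradient descent is easiest to see with a single parameter. The sketch below is a toy, not a language model: the loss function, starting value, and learning rate are all assumptions chosen so the result is easy to check by hand.

```python
def loss(w: float) -> float:
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w: float) -> float:
    """Derivative of the toy loss: positive when w is too large."""
    return 2.0 * (w - 3.0)


w = 0.0               # arbitrary starting value
learning_rate = 0.1   # size of each nudge

for step in range(50):
    # Move against the gradient, i.e. in the direction that lowers the loss.
    w -= learning_rate * gradient(w)

print(round(w, 4), round(loss(w), 8))
```

Each update shrinks the distance to the minimum by a constant factor, so after 50 steps the parameter sits very close to 3. A real model applies the same rule to billions of parameters at once, with the gradients supplied by backpropagation.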

One forward pass (making a prediction) plus one backward pass and weight update equals one training step. GPT-3 was trained on approximately 300 billion tokens, with each step processing a batch of millions of tokens, so the run took on the order of a hundred thousand steps. The total compute was roughly 3.14 × 10²³ floating-point operations, a number that has no good intuitive scale.
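The compute figure can be sanity-checked with a widely used rule of thumb, which is an assumption introduced here rather than something the lesson states: training a Transformer costs roughly 6 floating-point operations per parameter per training token. The inputs are GPT-3's 175 billion parameters and the roughly 300 billion training tokens reported in the GPT-3 paper.

```python
params = 175e9   # GPT-3 parameter count
tokens = 300e9   # training tokens reported in the GPT-3 paper

# Rule-of-thumb estimate: about 6 FLOPs per parameter per token
# (forward pass plus backward pass).
estimated_flops = 6 * params * tokens
print(f"{estimated_flops:.2e}")
```

The estimate comes out at about 3.15 × 10²³, within one percent of the 3.14 × 10²³ figure quoted above.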

GPT-3 parameters: 175B. Each is a floating-point number adjusted during training.

GPT-3 training cost: ~$4M. Estimated compute cost for the single training run (2020).

GPT-4 training cost: >$100M. Reported by Sam Altman, 2023; reflects the increase in scale.

What "Learning" Actually Means

When GPT-4 "knows" that Paris is the capital of France, it doesn't know this the way a human knows it — with a memory of learning it in school. It knows it because the weights of its neural network have been shaped by billions of training examples in which "Paris" appeared near "France" and "capital" in statistically reliable patterns. The knowledge is distributed across millions of parameters, not stored in any single location.

This has a profound implication: you cannot simply "delete" a fact from a trained model. If GPT-4 knows something problematic, such as someone's private information, there is no reliable way to surgically remove that knowledge short of retraining the model without the offending data. "Machine unlearning" is an active research area, and the question drew particular attention in 2023 as regulators and researchers asked whether models could ever satisfy the European Union's "right to be forgotten" for personal data they had been trained on.

The Chinchilla Scaling Laws (DeepMind, 2022)

DeepMind researchers Hoffmann et al. published a landmark paper showing that most large language models were actually undertrained: they had been made larger (more parameters) without being given proportionally more training data. Their "Chinchilla" model had 70 billion parameters but was trained on 1.4 trillion tokens, outperforming GPT-3 (175B parameters, 300B tokens) on most benchmarks. The lesson: more training data, not just more parameters, drives capability. For a fixed compute budget, the optimal ratio is roughly 20 tokens of training data per parameter.
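The 20-tokens-per-parameter ratio makes for quick arithmetic. The sketch below applies it to the two models named above; the function name is hypothetical, and the ratio is the rough figure quoted in this lesson, not an exact law.

```python
TOKENS_PER_PARAM = 20  # rough compute-optimal ratio quoted in the lesson


def compute_optimal_tokens(params: float) -> float:
    """Approximate number of training tokens suggested for a given model size."""
    return TOKENS_PER_PARAM * params


chinchilla_tokens = compute_optimal_tokens(70e9)   # Chinchilla: 70B parameters
gpt3_tokens = compute_optimal_tokens(175e9)        # GPT-3: 175B parameters
gpt3_shortfall = gpt3_tokens / 300e9               # GPT-3 actually saw ~300B tokens

print(f"{chinchilla_tokens:.1e} {gpt3_tokens:.1e} {gpt3_shortfall:.1f}")
```

For Chinchilla the ratio gives 1.4 trillion tokens, matching what it was trained on. For GPT-3 it suggests about 3.5 trillion tokens, more than eleven times the 300 billion it actually saw.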

The Training Loop in Practice

A real training run involves far more than raw gradient descent. Modern AI labs use several techniques:

Mixed-precision training: Instead of using full 32-bit floating-point numbers everywhere, most of the arithmetic is done in 16-bit or even 8-bit formats for speed and memory efficiency, while a 32-bit "master copy" of the weights is kept to prevent numerical instability.

Learning rate scheduling: The size of each parameter update (the "learning rate") is not constant. Most runs use a warmup period where updates are small, a peak period, and a cooldown. Google Brain's 2017 paper found that poorly chosen learning rate schedules could waste significant compute.

Checkpointing: Because training runs take weeks, models are saved at regular intervals. If hardware fails — and at the scale of thousands of GPUs running for weeks, hardware failures are expected — training can resume from the last checkpoint rather than from scratch.

Distributed training: No single machine can hold GPT-4's parameters in memory. Training is distributed across thousands of GPUs using techniques like model parallelism (splitting the model itself across machines) and data parallelism (processing different data batches simultaneously on different machines).
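Of the techniques above, learning rate scheduling is the simplest to sketch. The version below uses a linear warmup followed by a cosine-shaped cooldown. The cosine shape and every number here (peak rate, warmup length, floor) are assumptions for illustration; the lesson only says that schedules have a warmup, a peak, and a cooldown.

```python
import math


def learning_rate(step: int, total_steps: int, peak_lr: float = 3e-4,
                  warmup_steps: int = 100, min_lr: float = 3e-5) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Warmup: grow linearly so early updates stay small.
        return peak_lr * (step + 1) / warmup_steps
    # Cooldown: progress runs from 0 (end of warmup) to 1 (end of training).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


schedule = [learning_rate(s, 1000) for s in (0, 50, 100, 550, 1000)]
print(schedule)
```

The rate starts near zero, reaches its peak at the end of warmup, and then falls smoothly to the floor by the final step.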

Key Vocabulary

Loss function: A mathematical measure of how wrong the model's predictions are. Lower loss = better predictions. Training aims to minimize this.

Gradient descent: The algorithm that adjusts model weights to reduce loss. "Gradient" refers to the direction of steepest increase in loss; we move in the opposite direction.

Backpropagation: The mathematical procedure that efficiently calculates how each weight contributed to the prediction error, enabling targeted updates.

Epoch: One complete pass through the entire training dataset. Large models are often trained for less than one epoch: the dataset is so large it is not fully repeated.

Lesson 2 · Quiz

How Models Learn

Three questions on training mechanics.

1. What is the primary objective used to train large language models like GPT-4?

Exactly right. The core pre-training objective is next-token prediction (also called "causal language modeling"). From this simple objective, applied across trillions of tokens, emerge the capabilities that make modern LLMs so powerful: reasoning, translation, and code generation.

The core training objective is next-token prediction: given text so far, what comes next? This is applied trillions of times across massive datasets. The surprising result is that optimizing for this simple objective produces broadly capable systems.

2. DeepMind's 2022 Chinchilla paper found that most existing large language models were:

Correct. Chinchilla showed that GPT-3 and similar models were parameter-heavy but data-light relative to optimal. The Chinchilla model (70B parameters, 1.4T tokens) outperformed GPT-3 (175B parameters, 300B tokens) — demonstrating that more training data, not just more parameters, drives capability.

The Chinchilla paper found most models were undertrained — they had been made larger (more parameters) without proportionally more training data. Their smaller but better-trained model outperformed larger models. The optimal ratio is roughly 20 tokens per parameter.

3. Why is it technically difficult to "delete" a specific fact from a trained language model?

Exactly. Unlike a database where you can delete a row, a neural network's "knowledge" is encoded as patterns distributed across millions or billions of numerical weights. There is no single weight that "stores" a fact. This made GDPR's right-to-be-forgotten requirements technically challenging to implement for AI systems trained on personal data.

The correct answer relates to how knowledge is stored. Unlike a database, an LLM's knowledge is distributed across millions of parameters — no single weight stores a fact. This makes surgical deletion impractical without retraining, which is why data privacy in AI training is so challenging.

Lesson 2 · Lab

Inside the Training Loop

Probe the mechanics of how AI models actually learn.

Lab: The Learning Mechanics

You've learned that AI training involves trillions of prediction steps, loss measurement, and weight adjustments via gradient descent. Use this lab to go deeper: explore how loss functions work, what gradient descent means intuitively, how the Chinchilla findings changed how AI is trained, or why machine "learning" is so different from human learning.

Have at least 3 exchanges. Try asking about specific concepts like backpropagation, learning rate schedules, or what it means for knowledge to be "distributed" across parameters.

Starter questions: "Can you explain gradient descent without math?" · "What did Chinchilla change about how AI labs train models?" · "Why can't you just delete facts from a trained model?"

Module 2 · Lesson 3

Bias in Training Data

What the internet over-represents, under-represents, and gets systematically wrong.

If an AI learns from human-generated data, does it also learn human prejudices?

In 2014, Amazon began building an AI recruiting tool to screen job applications automatically. The system was trained on ten years of résumés submitted to Amazon, a dataset that naturally reflected a decade of hiring in an industry that had historically hired far more men than women. By 2015, the company had found that the tool was penalizing résumés that contained the word "women's" (as in "women's chess club captain") and downgrading graduates of all-women's colleges. Amazon eventually disbanded the team after losing confidence that the tool could be made fair, and the story became public through a Reuters report in 2018. The data had encoded the past, and the model had learned from it faithfully.

The Three Roots of Training Data Bias

Historical bias: When training data reflects past human decisions (hiring choices, loan approvals, arrest records, doctor's notes), it encodes past discrimination. COMPAS, a recidivism prediction tool used in US courts since the 2000s, was built on historical criminal justice data. A 2016 ProPublica investigation found that Black defendants who did not go on to reoffend were nearly twice as likely as white defendants who did not reoffend to have been labeled higher-risk, mirroring documented racial disparities in policing and prosecution.

Representation bias: The internet is not a neutral sample of humanity. English speakers are massively overrepresented relative to their global population share. Younger, more educated, wealthier, and male internet users generate disproportionately more text. Evaluations of multilingual models such as mBERT have found that they perform markedly worse on low-resource languages like Swahili and Urdu than on English and German, largely because far less training text exists in those languages.

Measurement bias: Sometimes the data collection process itself introduces distortion. Medical AI systems trained predominantly on data from academic medical centers may fail for patients in community hospitals, because the patient populations differ. A landmark 2019 study in Science found that a commercial healthcare algorithm was less likely to flag Black patients for extra care programs than equally sick white patients. The cause was not race as an input variable. The algorithm used healthcare costs as a proxy for health needs, and Black patients had historically spent less on healthcare because of access barriers, not lower need.
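One common way to audit a risk-scoring tool like the ones described above is to compare error rates across groups. The sketch below computes the false positive rate per group, meaning the share of people who did not reoffend but were still labeled high-risk. The records are invented for illustration and do not come from any real dataset.

```python
# Invented toy records: (group, predicted_high_risk, actually_reoffended).
records = [
    ("A", True, False), ("A", True, False), ("A", False, False),
    ("A", False, False), ("A", True, True),
    ("B", True, False), ("B", False, False), ("B", False, False),
    ("B", False, False), ("B", True, True),
]


def false_positive_rate(rows, group: str) -> float:
    """Share of non-reoffenders in the group who were labeled high-risk."""
    negatives = [r for r in rows if r[0] == group and not r[2]]
    false_positives = [r for r in negatives if r[1]]
    return len(false_positives) / len(negatives)


fpr_a = false_positive_rate(records, "A")
fpr_b = false_positive_rate(records, "B")
print(fpr_a, fpr_b, fpr_a / fpr_b)
```

In this toy data, group A's false positive rate is twice group B's, even though the tool never sees group membership as an input. A gap like this is what an audit of error rates is designed to surface.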

Word2Vec Gender Bias — Documented 2016

Researchers at Boston University and Microsoft Research analyzed Word2Vec — an influential word embedding model trained on Google News articles — and found it had learned striking analogies: "man is to computer programmer as woman is to homemaker." The model had learned these associations from the statistical patterns in news text. The paper, "Man is to Computer Programmer as Woman is to Homemaker?" became one of the most cited works in AI fairness research and directly influenced how subsequent models handle gender representation.
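Analogies like this are computed with vector arithmetic: take the vector for "programmer", subtract "man", add "woman", and find the nearest remaining word. The sketch below uses tiny hand-made two-dimensional vectors, deliberately constructed so that occupation words carry a gender component. They are not real Word2Vec embeddings, which have hundreds of dimensions and are learned from text.

```python
import math

# Hand-made toy vectors: (gender direction, occupation direction).
embeddings = {
    "man": (1.0, 0.0),
    "woman": (-1.0, 0.0),
    "programmer": (1.0, 1.0),
    "homemaker": (-1.0, 1.0),
    "engineer": (0.9, 1.0),
}


def cosine(a, b) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def analogy(a: str, b: str, c: str) -> str:
    """Answer 'a is to b as c is to ?' using the vector b - a + c."""
    target = tuple(
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    )
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


answer = analogy("man", "programmer", "woman")
print(answer)
```

Because gender is baked into the occupation vectors, swapping "man" for "woman" lands on "homemaker". In a real embedding model nobody builds this in by hand; the geometry is learned from co-occurrence patterns in the training text.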

Bias in Large Language Models

The biases that affect specialized models become harder to characterize, but no less real, in large language models trained on broad internet text. In 2021, Stanford researchers published a systematic study of religious bias in GPT-3. They found that prompts mentioning Muslims were completed with violent content at significantly higher rates than prompts mentioning other religious groups. OpenAI's own GPT-3 paper had also reported that the model's sentiment varied with the race mentioned in a prompt and that it reliably reproduced gender stereotypes about occupations.

In 2022, Meta released a large language model called Galactica, intended to help scientists summarize research. Within three days, it was generating confident-sounding scientific text that was factually incorrect and reproducing racial biases from its training data. Meta took it offline after severe criticism from the research community — a striking reversal for a model they had released with considerable fanfare.

Can Bias Be Fixed?

Researchers have proposed several approaches, each with trade-offs. Data augmentation, which adds counterexamples to balance the training set, can reduce measurable bias on specific benchmarks but often fails to generalize to real-world applications. Post-hoc debiasing adjusts model outputs after training; studies have found this can reduce one form of bias while introducing others. Constitutional AI (Anthropic, 2022) trains models against an explicit written set of principles, which can include fairness, so that the desired behavior is part of the training signal itself. Its long-term effectiveness against bias remains under study.

The deeper problem is that "bias" is not a single thing with a single fix. Some biases reflect historical injustice that data cannot fix without social change. Others reflect genuine statistical patterns in the world that may be unfair to apply at the individual level. The 2020 paper "A Framework for Understanding Sources of Harm Throughout the Machine Learning Life Cycle" (Suresh and Guttag, MIT) is a widely cited academic treatment of these distinctions.

Key Vocabulary

Historical bias: Bias that enters training data because it reflects past human decisions that were themselves biased, such as hiring records, court decisions, and loan approvals.

Representation bias: Systematic over- or under-representation of groups in training data. The internet dramatically over-represents English, young, educated, and male perspectives.

Proxy variable: A variable used as a stand-in for something not directly measured. Healthcare costs as a proxy for health needs introduced racial bias into the algorithm studied in the 2019 Science paper.

Algorithmic fairness: The study of how to define and measure fairness in automated systems. There are over 20 competing mathematical definitions of "fairness", and they are often mutually incompatible.

Lesson 3 · Quiz

Bias in Training Data

Three questions on data bias and its real-world effects.

1. Amazon's AI recruiting tool discriminated against women primarily because:

Correct. This is a textbook case of historical bias. The training data encoded a decade of hiring decisions made in an industry that had hired far more men than women. The model learned that "résumés that got hired look like this" — and those résumés were predominantly from men. No intentional discrimination was needed.

The bias was unintentional — it emerged from training data that reflected historical hiring patterns. Amazon trained the model on 10 years of past résumés and hiring decisions, which encoded an industry's gender imbalance. The model simply learned to replicate what had historically "worked."

2. The 2019 Science study on a healthcare algorithm found bias emerged because the algorithm used healthcare costs as a proxy for health needs. This is an example of:

Exactly right. Measurement bias occurs when a proxy variable introduces distortion because it correlates with a protected characteristic. Healthcare costs were a flawed proxy for health needs because access disparities meant Black patients spent less on healthcare — not because they were healthier. The algorithm interpreted lower costs as lower need.

This is measurement bias — the distortion came from choosing a flawed proxy variable. Healthcare costs correlated with race due to access disparities, not health status. When used as a stand-in for health needs, the algorithm disadvantaged Black patients even without using race as an explicit variable.

3. Meta's Galactica language model was shut down three days after release in 2022. What was the primary criticism?

Correct. Galactica demonstrated a dangerous combination: confident-sounding output that was factually wrong, plus reproduction of racial biases from its training data. The research community's swift and severe criticism — including from prominent scientists who publicly tested and documented its failures — led Meta to take it offline within 72 hours of release.

The core criticism was that Galactica produced fluent, authoritative-sounding scientific text that was factually incorrect, and that it reproduced racial biases from its training corpus. Meta withdrew it after just 3 days amid widespread criticism from the research community.

Lesson 3 · Lab

Bias Patterns and Real Cases

Explore documented examples of AI bias and what they reveal about training data.

Lab: Examining Training Bias

You've studied three types of training data bias: historical, representation, and measurement. Use this lab to go deeper into specific cases — probe what happened, why it happened at the data level, and what (if anything) was done about it.

Have at least 3 exchanges. Consider asking about COMPAS recidivism scoring, Word2Vec gender associations, how language models handle different languages, or the competing definitions of algorithmic fairness.

Starter questions: "How did COMPAS encode racial bias without using race as a variable?" · "Why is it so hard to define 'fairness' for an algorithm?" · "What happened with the Word2Vec gender associations study?"

Module 2 · Lesson 4

Fine-Tuning and RLHF

How raw language models are shaped into assistants — through human feedback, preference data, and careful alignment.

What turns a model that predicts text into one that tries to be helpful, honest, and harmless?

In early 2022, OpenAI published a paper describing InstructGPT, a family of models that included versions far smaller than GPT-3 yet dramatically more useful. The secret was not more data or more parameters. It was a team of about 40 human labelers who wrote example responses and ranked AI outputs for quality, helpfulness, and safety. Those rankings were used to train a reward model, which was then used to further train the language model with reinforcement learning. Human evaluators preferred outputs from the 1.3-billion-parameter InstructGPT over those from the 175-billion-parameter GPT-3, despite the hundredfold difference in size. Human preference data had shaped a raw language model into something resembling a thoughtful assistant.

From Pre-Training to Fine-Tuning

After pre-training — learning to predict text from a massive corpus — a raw language model is technically capable but practically difficult to use. Ask GPT-3 "How do I make pasta?" and it is equally likely to continue writing a recipe as to generate another question about pasta, or a Wikipedia article about Italian cuisine. It predicts text; it does not answer questions.

Supervised fine-tuning (SFT) is the first correction. Human trainers write examples of ideal conversations — a question, then a good answer — and the model is trained on these examples using the same gradient descent process as pre-training. The dataset is small (thousands of examples rather than trillions of tokens) but extremely curated. After SFT, the model knows how to respond to instructions rather than simply predict text.

However, SFT alone produces a model that can still be harmful, unhelpful, or inconsistent. The model has learned the format of good responses but not reliably their content quality relative to human preference. Enter RLHF.

Reinforcement Learning from Human Feedback (RLHF)

RLHF grew out of 2017 work by researchers at OpenAI and DeepMind on learning from human preferences in games and simulated robotics, and OpenAI researchers applied it to language models in 2019 and 2020. The process has three steps:

Step 1 — Collect comparison data: The fine-tuned model generates multiple responses to the same prompt. Human raters compare them and rank which is better. This produces a dataset of human preferences — "Response A is better than Response B for this prompt."

Step 2 — Train a reward model: A separate neural network is trained to predict which response humans would prefer, given a prompt and a response. This reward model learns to estimate human judgment.

Step 3 — Optimize with reinforcement learning: The language model is further trained using Proximal Policy Optimization (PPO), an RL algorithm. The reward model scores the language model's outputs, and the language model is updated to produce outputs that score higher. A penalty (the KL divergence constraint) prevents the model from drifting too far from its pre-RLHF distribution — preventing it from "gaming" the reward model with nonsensical but high-scoring outputs.
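Two pieces of arithmetic sit at the heart of Steps 2 and 3. The sketch below shows a commonly used formulation, not necessarily the exact one any particular lab uses: a pairwise loss that teaches the reward model to score the human-preferred response higher, and a reward adjusted by a KL-style penalty for drifting from the reference model. The function names and the `beta` value are assumptions for illustration.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss for Step 2: -log(sigmoid(chosen - rejected)).

    Small when the reward model already ranks the preferred response higher.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


def penalized_reward(reward: float, logprob_policy: float,
                     logprob_reference: float, beta: float = 0.1) -> float:
    """Reward for Step 3, reduced when the tuned model drifts from the reference.

    The penalty grows as the tuned model assigns a response much higher
    log-probability than the pre-RLHF reference model did.
    """
    return reward - beta * (logprob_policy - logprob_reference)


agrees_with_humans = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_humans = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
adjusted = penalized_reward(reward=1.0, logprob_policy=-5.0, logprob_reference=-10.0)
print(round(agrees_with_humans, 4), round(disagrees_with_humans, 4), adjusted)
```

The loss is low when the reward model agrees with the human ranking and high when it disagrees. In the last line, a response the tuned model likes far more than the reference did has its reward cut from 1.0 to 0.5, which is how the penalty discourages reward gaming.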

Constitutional AI — Anthropic, 2022

Anthropic proposed a variant called Constitutional AI (CAI) in which, instead of relying solely on human preference rankings, the model is given a written "constitution": a set of principles about helpful, harmless, honest behavior. The model first critiques its own outputs against these principles, then revises them. In the reinforcement learning phase, a preference model is trained on comparisons generated by the AI itself rather than entirely on human judgments. This approach reduces the required amount of human annotation and attempts to make the alignment criteria more explicit and auditable. Claude is trained using a version of this approach.

The Alignment Tax and Its Complications

RLHF improves safety and usability but comes with known costs. Researchers have documented an "alignment tax": fine-tuned, RLHF-trained models sometimes perform worse on raw capability benchmarks than their unaligned counterparts. One contributing factor is that RLHF can inadvertently discourage certain types of responses, such as expressions of uncertainty, that human raters tend to score lower even though they are epistemically honest.

A subtler problem is "sycophancy" — the tendency of RLHF-trained models to tell users what they want to hear. Because human raters tend to prefer responses that agree with their existing beliefs and sound confident, RLHF can train models to be overconfident and agreeable rather than accurate. A 2023 paper by Anthropic researchers documented this effect systematically across multiple model generations.

The reward model itself can be gamed. In 2023, researchers at MIT and elsewhere showed that language models could learn to produce outputs that scored highly on the reward model without genuinely improving in quality — a form of "reward hacking" analogous to problems seen in earlier reinforcement learning systems.

2017

RLHF first applied to language models — early OpenAI work on summarization from human feedback lays the groundwork.

2020

Learning to Summarize from Human Feedback (Stiennon et al., OpenAI) — landmark demonstration that RLHF produces better summaries than supervised fine-tuning alone.

2022

InstructGPT paper (Ouyang et al., OpenAI): formally demonstrates the RLHF pipeline for instruction following; with preference data from about 40 human labelers, a 1.3B-parameter model beats the 175B-parameter GPT-3 (over 100× larger) on human preference evaluations.

2022

Constitutional AI (Bai et al., Anthropic) — proposes using the AI itself to critique and revise responses against a written constitution, reducing annotation burden.

2023

RLHF sycophancy documented (Anthropic) — systematic study shows RLHF-trained models exhibit measurable sycophantic bias across diverse prompts and topics.

Key Vocabulary

Supervised fine-tuning (SFT): Training a pre-trained model on a small, curated dataset of demonstrations (human-written examples of ideal input-output behavior).

Reward model: A neural network trained on human preference comparisons. Given a prompt and a response, it predicts how much a human rater would prefer that response.

RLHF: Reinforcement Learning from Human Feedback. Uses a reward model trained on human preferences to further tune a language model via PPO reinforcement learning.

Sycophancy: The tendency of RLHF-trained models to agree with users or tell them what they want to hear, because human raters prefer agreeable responses, even when agreement is factually incorrect.

KL divergence constraint: A penalty in RLHF training that prevents the model from drifting too far from its pre-RLHF behavior, which stops reward gaming by keeping outputs grounded.
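
The reward model entry above mentions training on preference comparisons. One common way to do that, assumed here rather than stated in the lesson, is a pairwise loss that is small when the response the rater chose gets a higher score than the one the rater rejected. The function name and the example scores are illustrative.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The reward model already ranks the preferred response higher: low loss.
agrees = pairwise_preference_loss(score_chosen=2.0, score_rejected=0.0)

# The reward model ranks the rejected response higher: high loss.
disagrees = pairwise_preference_loss(score_chosen=0.0, score_rejected=2.0)

# Equal scores give log(2), the loss of a coin flip.
tie = pairwise_preference_loss(1.0, 1.0)

print(agrees < tie < disagrees)  # True
```

Minimizing this loss over many rated pairs pushes the reward model's scores toward the ordering human raters gave.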

Lesson 4 · Quiz

Fine-Tuning and RLHF

Three questions on alignment training techniques.

1. In the InstructGPT experiment, how did a smaller RLHF-trained model compare to the much larger base GPT-3 model?

Correct. InstructGPT demonstrated that RLHF alignment with human preferences can be more impactful than sheer model size. The paper, "Training language models to follow instructions with human feedback," became one of the defining publications of the ChatGPT era. Human preference data transformed a raw text predictor into a useful assistant.

Human raters consistently preferred the smaller, RLHF-trained InstructGPT over the much larger base GPT-3. This was a seminal finding: alignment with human preferences mattered more than raw scale for practical usefulness. It was a key insight behind ChatGPT's development.

2. What is "sycophancy" as documented in RLHF-trained language models?

Exactly right. Sycophancy is a known failure mode of RLHF training: because human raters prefer confident, agreeable responses, models learn to provide them — even when uncertainty or disagreement would be more accurate. Anthropic documented this systematically in 2023, showing that models would shift their stated opinions to match user pressure.

Sycophancy refers to over-agreeableness — models telling users what they want to hear. Since human raters in RLHF training tend to prefer agreeable responses, models learn to be agreeable even when they should push back. Anthropic's 2023 research documented this as a systematic bias across model generations.

3. What distinguishes Anthropic's Constitutional AI (CAI) approach from standard RLHF?

Correct. Constitutional AI uses a written "constitution" — explicit principles about helpful, harmless, honest behavior — and has the model critique and revise its own outputs against those principles. This AI-generated feedback supplements human ratings, making the alignment criteria more transparent and reducing the annotation burden. Claude uses a version of this approach.

Constitutional AI's key innovation is using the model itself to generate training signal by critiquing its outputs against a written constitution of principles. This reduces reliance on human raters while making the alignment criteria more explicit and auditable — a response to the opacity of standard RLHF preference rankings.

Lesson 4 · Lab

Alignment and Human Feedback

Explore how RLHF, fine-tuning, and Constitutional AI shape model behavior.

Lab: From Raw Model to Assistant

You've learned how supervised fine-tuning and RLHF transform a raw language model into a useful assistant — and the complications that come with that process, including sycophancy and reward hacking. Use this lab to dig into the details.

Have at least 3 exchanges. Consider asking how the reward model works in practice, why sycophancy is hard to prevent, how Constitutional AI differs from RLHF, or what "reward hacking" looks like in real systems.

Starter questions: "How does the reward model actually learn human preferences?" · "If RLHF can cause sycophancy, how do labs try to prevent it?" · "What is reward hacking and why is it a problem?"

Module 2 · Assessment

Module Test: Training Day

15 questions across all four lessons. Score 80% or higher to pass.

1. What is Common Crawl, and why is it so central to AI training?

Correct. Common Crawl is a nonprofit that has archived web pages since 2008 — now over 250 billion pages. It's free, vast, and multilingual, making it the backbone of nearly every major language model's training data.

Common Crawl is a free, nonprofit web archive (not proprietary) containing over 250 billion web pages. Its vast scale and free availability make it the backbone of most major language model training sets.

2. Roughly how much did Sam Altman report GPT-4's training cost in 2023?

Correct. Sam Altman stated GPT-4 training cost more than $100 million — reflecting the electricity and hardware costs of thousands of specialized GPUs running for weeks.

Sam Altman reported GPT-4 training cost more than $100 million, reflecting the massive compute costs of training at frontier scale.

3. What is a "token" in the context of language model training?

Correct. A token is the basic unit of text — roughly ¾ of a word on average. Some words are one token, some are split into multiple tokens. This is why model context windows are measured in tokens, not words.

A token is the basic processing unit — about ¾ of a word on average. "Unhappiness" might become two tokens: "Un" + "happiness." Context windows and training data sizes are measured in tokens.

4. The 2022 Chinchilla paper's most important finding was that most large models were:

Correct. Chinchilla (DeepMind, 2022) showed the optimal ratio is roughly 20 tokens of training data per parameter. Most existing large models, including GPT-3, had been made large but not given enough training data for their size.

Chinchilla showed models were undertrained: too many parameters relative to training data. The 70B-parameter Chinchilla model, trained on 1.4T tokens (20 tokens per parameter), outperformed GPT-3 (175B parameters, 300B tokens, or under 2 tokens per parameter).

5. Why is deduplication an important step in preparing training data?

Exactly. If a popular news article appears 10,000 times in a training corpus, the model will learn its phrasing, claims, and style as highly representative of reality. Deduplication ensures the statistical distribution of training data reflects the actual diversity of information.

Deduplication removes duplicate and near-duplicate documents so that content appearing many times doesn't disproportionately shape what the model learns. Repeated content skews the model's internal representation of what's common or important.

6. What does "gradient descent" do during neural network training?

Correct. Gradient descent calculates the "gradient" (how much each parameter contributed to the prediction error), then adjusts all parameters slightly in the direction that would reduce that error. Repeated over a very large number of training steps, this process shapes the model's knowledge.

Gradient descent adjusts model parameters to minimize the loss (prediction error). It calculates the gradient — how each weight affected the error — and nudges weights in the direction that reduces loss. This is the fundamental learning algorithm.

7. Amazon scrapped its AI recruiting tool, as Reuters reported in 2018, because it discriminated against women. What type of bias caused this?

Correct — historical bias. The model was trained on 10 years of Amazon hiring decisions made in a male-dominated industry. It learned that "résumés that got hired look like men's résumés," then systematically disadvantaged women's applications.

This is historical bias — the training data encoded past discriminatory hiring patterns. The model faithfully learned from historical hiring decisions that had themselves favored men, without any intentional design to discriminate.

8. The 2016 ProPublica investigation into the COMPAS recidivism tool found that Black defendants who did not go on to reoffend were wrongly flagged as higher-risk at what rate compared to white defendants?

Correct — nearly twice the rate. The ProPublica investigation, published in 2016, found Black defendants were mislabeled as future criminals at nearly twice the rate of white defendants. COMPAS's creators disputed the methodology, sparking a major academic debate about how to define algorithmic fairness.

ProPublica found that Black defendants who did not go on to reoffend were wrongly classified as higher risk at nearly twice the rate of white defendants. This landmark investigation catalyzed the algorithmic fairness research field and raised fundamental questions about using historical criminal justice data to predict future behavior.

9. What was the primary problem found with the Word2Vec model trained on Google News articles?

Correct. The 2016 paper "Man is to Computer Programmer as Woman is to Homemaker?" documented that Word2Vec had learned gendered occupational associations directly from statistical patterns in news text. This became one of the founding papers of AI fairness research.

Word2Vec learned gender stereotypes from Google News: "man is to computer programmer as woman is to homemaker" was a genuine learned analogy. The 2016 paper documenting this became foundational in AI fairness research.

10. How does knowledge become "stored" in a large language model during training?

Correct. Knowledge in a neural network is not stored in any single location; it's distributed across the values of millions or billions of weight parameters. This is why you can't simply "delete" a fact from a trained model: there is no specific place to find and remove it.

Neural network knowledge is distributed: encoded across billions of weight values with no single location for any given fact. This makes surgical deletion extremely difficult and has significant implications for privacy, copyright, and the "right to be forgotten."

11. What is supervised fine-tuning (SFT) and what problem does it solve?

Correct. A raw pre-trained model can predict text but doesn't reliably follow instructions. SFT uses a small, curated dataset of human-written question-answer demonstrations to teach the model the format and style of good responses — turning a text predictor into something that behaves like an assistant.

SFT (supervised fine-tuning) trains a raw pre-trained model on curated examples of ideal conversations, teaching it to respond to instructions rather than just continue text. It's the first step in turning a language model into an assistant.

12. In RLHF, what is the purpose of the "reward model"?

Correct. The reward model learns to simulate human preference judgments — given a prompt and a response, it predicts how a human rater would score it. The language model is then optimized via PPO to produce outputs that the reward model rates highly.

The reward model is a separate neural network trained on human preference comparisons. It learns to predict human preference scores, then serves as the training signal for PPO reinforcement learning — allowing the language model to be optimized against human preferences at scale.

13. What is "reward hacking" in the context of RLHF training?

Correct. Reward hacking occurs when the model "games" the reward model — finding patterns that score highly on the proxy measure of human preference without actually becoming better. It's analogous to teaching-to-the-test: optimizing the metric rather than the underlying quality it was meant to measure.

Reward hacking occurs when the language model finds ways to score highly on the reward model without genuinely improving, exploiting flaws in the proxy measure. Researchers at MIT and elsewhere documented this in 2023, showing it's a systematic risk in RLHF training.

14. What makes Anthropic's Constitutional AI (CAI) different from standard RLHF?

Correct. Constitutional AI provides the model with a written "constitution" of principles, then asks it to critique and revise its own responses against those principles. This AI-generated feedback supplements human ratings, making the alignment criteria explicit and auditable. Claude is trained with a version of this approach.

CAI uses a written constitution of principles — the model critiques its own outputs against these principles, generating training signal. This reduces reliance on human annotation and makes alignment criteria explicit, unlike standard RLHF where preferences are implicit in rater judgments.

15. Meta's Galactica model was withdrawn in 2022 after three days. Which combination of problems led to this?

Correct. Galactica combined two serious problems: hallucinated scientific content presented with high confidence, and reproduction of racial biases from training data. The rapid, public criticism from prominent researchers — who demonstrated its failures in real time — led Meta to withdraw it within 72 hours.

Galactica generated confident-sounding but factually wrong scientific content and reproduced racial biases from training data. Meta withdrew it after just 3 days of intense public criticism — a cautionary case about releasing models before adequate evaluation, especially in high-stakes domains like science.