Module 2 · Lesson 1

What Is Pre-Training?

The phase in which a model reads nearly everything humans have written — and learns to predict what comes next.
How does exposure to billions of text examples produce a model capable of reasoning, translation, and code?

In May 2020, OpenAI released a technical report describing GPT-3, a model whose largest data source was roughly 45 terabytes of raw Common Crawl text, filtered down to about 570 GB before training. The training run consumed an estimated 3.14 × 10²³ floating-point operations. No fine-tuning on downstream tasks was used. The same weights answered trivia, translated French, and wrote working Python. The community called it "few-shot learning" and spent months debating whether the model was reasoning or merely pattern-matching at extraordinary scale.

The Core Idea: Next-Token Prediction

Pre-training is the foundational phase of building a large language model. The objective is deceptively simple: given a sequence of tokens, predict the next token. This is called the language modeling objective, sometimes written as maximizing the log-probability of the training corpus.

Every parameter update during pre-training is driven by this single signal — the difference between what the model predicted and what actually appeared next in the text. Multiplied across trillions of examples, this pressure sculpts the model's weights into representations that capture grammar, facts, reasoning patterns, and stylistic conventions.

There is no labeled dataset, no human annotator telling the model what a "good" answer looks like. The training data itself provides supervision — the next word in a Wikipedia article, the next line of a Python function, the next sentence in a legal brief.

Why This Matters

Because language is a compressed representation of human knowledge, a model that genuinely predicts text well must implicitly encode enormous amounts of world knowledge. Predicting "the capital of France is ___" correctly requires knowing geography. Predicting the next line of a proof requires understanding logical structure.

Self-Supervised Learning

Pre-training belongs to a paradigm called self-supervised learning. The labels are derived from the input data itself — no external annotation is needed. For a causal (autoregressive) language model like GPT, the label for position i in the sequence is simply the token at position i+1.
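The label construction described above can be sketched in a few lines. This is a minimal illustration, not any particular library's data loader; the function name `make_training_pairs` and the word-level "tokens" are assumptions made for readability (real models use subword token IDs).

```python
def make_training_pairs(token_ids):
    """Return (inputs, targets) for causal language modeling.

    The target at position i is simply the token at position i + 1,
    so the labels come from the data itself.
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form one training pair")
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    return inputs, targets


tokens = ["The", "capital", "of", "France", "is", "Paris"]
inputs, targets = make_training_pairs(tokens)

# Each prefix of the sequence is a context; the following token is its label.
for position, label in enumerate(targets):
    print(tokens[: position + 1], "->", label)
```

Every sequence of length n yields n - 1 training examples this way, which is why no annotator is needed.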

BERT, introduced by Google in 2018, used a variant called masked language modeling: randomly mask 15% of tokens and train the model to predict the masked words using surrounding context. Both approaches are self-supervised, but they produce models with different strengths — BERT excels at understanding tasks; autoregressive models excel at generation.
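A simplified sketch of the masking step may make the contrast concrete. The helper `mask_tokens` and the whitespace tokenization are illustrative assumptions; BERT's actual recipe also sometimes substitutes a random token or leaves the chosen token unchanged, which this sketch omits.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide about mask_rate of the tokens; return the masked copy and the labels."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]  # the model must recover this token
        masked[pos] = MASK         # using context on BOTH sides of the gap
    return masked, labels


sentence = ("pre training teaches a model to fill in missing words "
            "by reading the text on both sides of every gap").split()
masked, labels = mask_tokens(sentence)
print(" ".join(masked))
print(labels)
```

Unlike the causal setup, only the masked positions contribute a training signal, but each prediction can draw on tokens to the left and to the right.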

Modern LLMs like GPT-4, Claude, and Gemini all use autoregressive pre-training as their foundation, with masked approaches now mostly reserved for encoder models used in search and classification.

Key Terms

Pre-training: The initial phase of LLM development where a model learns from massive unlabeled text corpora via self-supervised objectives before any task-specific fine-tuning.
Next-token prediction: The training objective of predicting the next token in a sequence given all preceding tokens; also called the causal language modeling objective.
Self-supervised learning: A learning paradigm where labels are automatically derived from the input data, requiring no human annotation.
Masked language modeling (MLM): A variant where random tokens are hidden and the model predicts them using bidirectional context; used to train BERT-style encoder models.
Historical Marker

The GPT-3 paper (Brown et al., 2020) demonstrated that scale alone — without any task-specific fine-tuning — could unlock capabilities like arithmetic, translation, and code generation that researchers had previously assumed required explicit training on each task. This finding fundamentally reframed how the field thought about pre-training.

Lesson 1 Quiz

What Is Pre-Training? — 4 questions
1. What is the primary training objective used during autoregressive pre-training?
Correct. Next-token prediction (causal language modeling) is the core objective for autoregressive LLMs like GPT, Claude, and Llama.
Not quite. Autoregressive pre-training predicts the next token from left-to-right context only. Sentiment classification requires labeled data; image-text alignment is contrastive learning; masked prediction is BERT's objective.
2. Why is pre-training called "self-supervised"?
Correct. In self-supervised learning, labels (e.g., the next token) come directly from the data — no human annotation is needed, enabling training at internet scale.
Not quite. "Self-supervised" means the supervision signal — what the correct next token is — is derived from the data itself, not from human labels or an external reward model.
3. GPT-3's 2020 paper showed that, without task-specific fine-tuning, a large pre-trained model could perform arithmetic, translation, and code generation. What term describes this ability?
Correct. GPT-3 demonstrated "few-shot learning" — providing a handful of examples in the prompt was enough for the model to perform new tasks without weight updates.
Not quite. The capability described is "few-shot learning," a key finding of the GPT-3 paper (Brown et al., 2020), where a handful of worked examples in the prompt substitute for fine-tuning.
4. Masked language modeling (MLM), as used in BERT, differs from autoregressive pre-training in that:
Correct. BERT uses bidirectional context — surrounding tokens on both sides — to predict randomly masked tokens. This contrasts with the left-to-right causal approach of GPT-style models.
Not quite. MLM is still self-supervised (no human labels needed) and uses natural language. Its distinguishing feature is using both left and right context to predict masked tokens, unlike left-to-right autoregressive models.

Lab 1: Exploring the Pre-Training Objective

Chat with an AI tutor about next-token prediction and self-supervised learning

Your Task

You have a direct line to an AI tutor specialized in LLM pre-training mechanics. Use it to deepen your understanding of the concepts covered in Lesson 1. Ask about next-token prediction, how self-supervised learning works at scale, or why predicting text forces a model to encode world knowledge.

Starter prompts: "Why does next-token prediction force the model to learn facts?" · "How does BERT's MLM objective differ mechanically from GPT's causal LM?" · "Could you give me an intuitive example of self-supervised learning?"
Module 2 · Lesson 2

Data at Scale

The composition, curation, and controversies of the trillion-token corpora that define what a model knows.
What exactly do LLMs train on — and why does data quality matter as much as quantity?

In December 2020, EleutherAI released The Pile — an 825 GB open dataset assembled from 22 distinct sources including Common Crawl, PubMed, arXiv, GitHub, the FreeLaw Project, and Project Gutenberg. The paper explicitly documented every source's size, license status, and intended contribution. It was the first large-scale attempt to publicly audit what an LLM training corpus actually contains — a stark contrast to the opaque "internet data" descriptions common at the time.

Sources and Their Trade-Offs

Modern LLM corpora blend many source types, each contributing different properties:

Web Crawl (Common Crawl)

The largest source by volume — petabytes of raw HTML from across the web. Highly diverse but noisy: duplicate content, spam, hate speech, and low-quality writing require aggressive filtering. GPT-3 used a filtered version of Common Crawl weighted at ~60% of the training mix.

Books & Long-Form Text

Books provide long-range coherence that web text lacks. GPT-3 used the Books1 and Books2 corpora. LLaMA (Meta, 2023) used Project Gutenberg and Books3. Long documents teach models to maintain consistent context and argument structure over thousands of tokens.

Code (GitHub)

GitHub code is present in nearly every modern LLM corpus. Code has strict syntax, explicit logic, and comments explaining intent — a uniquely structured form of text. Researchers at DeepMind found that training on code improves mathematical and logical reasoning even on purely natural-language benchmarks.

Curated High-Quality Sources

Wikipedia, academic papers (arXiv, PubMed, Semantic Scholar), legal documents (FreeLaw), and web pages linked from Reddit (WebText / OpenWebText) are typically up-weighted relative to their raw volume to increase factual density.

Data Curation Techniques

Raw web crawl data is far too noisy to use directly. Standard curation steps include:

Deduplication: Near-duplicate documents inflate the effective weight of repeated content. Lee et al. (2021), in "Deduplicating Training Data Makes Language Models Better," showed that removing duplicates reduced memorization and matched or improved model quality even though total token count decreased. Tools like MinHash are used to detect fuzzy duplicates at scale.

Quality filtering: Heuristic filters remove documents with high character-level perplexity, too-short texts, excessive repetition, or too few natural language words. Some pipelines use a classifier trained on Wikipedia-quality text to score and filter web pages.

Decontamination: Benchmark test sets (MMLU, HumanEval, etc.) must be identified and removed from training data to prevent inadvertent memorization inflating evaluation scores.

Up-weighting: High-quality sources are seen multiple times (epochs > 1 on selected subsets) while low-quality sources are down-sampled. The Chinchilla paper addressed the complementary question of quantity: how many tokens a model of a given size should see.
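Of the curation steps above, fuzzy deduplication is the least obvious, so here is a minimal MinHash sketch using only the standard library. The shingle size, the 64 hash functions, and the function names are illustrative assumptions; production pipelines add locality-sensitive hashing so that not every pair of documents has to be compared.

```python
import hashlib


def shingles(text, k=3):
    """Split text into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(text, num_hashes=64):
    """For each seeded hash function, keep the smallest hash over all shingles."""
    shingle_set = shingles(text)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the quiet river"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet stream"
doc_c = "gradient descent updates parameters using the derivative of a loss function"

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print("near-duplicate:", estimated_jaccard(sig_a, sig_b))
print("unrelated:     ", estimated_jaccard(sig_a, sig_c))
```

A pipeline would then drop one document from every pair whose estimated similarity exceeds a chosen threshold.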

The Toxicity Problem

Internet-scale corpora inevitably contain harmful content. The GPT-3 paper acknowledged this, noting that the model could generate biased or toxic completions. Subsequent work (Bender et al., "Stochastic Parrots," 2021) argued that the costs of massive data collection — including encoding social biases at scale — had been systematically underweighted in the field's cost-benefit analysis.

Scale Numbers in Context

GPT-3 (2020): 300B training tokens
Chinchilla (2022): 1.4T training tokens
LLaMA 3 (2024): 15T training tokens
FineWeb (2024): 15T tokens (open dataset)
Key Insight

The Chinchilla paper (Hoffmann et al., DeepMind, 2022) found that many large models were undertrained relative to their size. Optimal training requires scaling data tokens proportionally with model parameters — roughly 20 tokens per parameter. This finding drove the shift toward training smaller models on far more data, producing models that are both cheaper to run and better performing.
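To see what the 20-tokens-per-parameter rule implies for a fixed budget, the sketch below solves for model size and token count. It assumes the widely used approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D); that approximation and the function name `compute_optimal` are not from this lesson and are only a rough planning heuristic.

```python
import math

TOKENS_PER_PARAM = 20        # Chinchilla rule of thumb
FLOPS_PER_PARAM_TOKEN = 6    # assumed approximation: C ~ 6 * N * D


def compute_optimal(compute_flops):
    """Return (parameters, tokens) that spend the budget at 20 tokens per parameter."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N**2
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


budget = 6 * 70e9 * 1.4e12   # the budget implied by 70B parameters and 1.4T tokens
n_params, n_tokens = compute_optimal(budget)
print(f"parameters: {n_params:.3g}, tokens: {n_tokens:.3g}")
```

Because the budget grows with the square of the parameter count under this rule, quadrupling compute only doubles the compute-optimal model size.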

Lesson 2 Quiz

Data at Scale — 4 questions
1. What is "decontamination" in the context of training data preparation?
Correct. Decontamination ensures that test benchmarks (like MMLU or HumanEval) are not memorized during training, which would inflate apparent performance scores.
Not quite. Decontamination specifically means removing test-set examples from training data so evaluations measure genuine generalization, not memorization.
2. According to the Chinchilla paper (Hoffmann et al., 2022), the key insight about training efficiency was:
Correct. Chinchilla demonstrated that compute should be split roughly equally between model size and training tokens — about 20 tokens per parameter — leading to smaller but better-trained models.
Not quite. Chinchilla's key finding was that prior large models (like Gopher) were trained on too few tokens for their size. Optimal training requires ~20 tokens per parameter.
3. Why is GitHub code commonly included in LLM training corpora even when the model is intended for natural language tasks?
Correct. DeepMind and other researchers found that training on code — with its explicit logic, structured syntax, and intent-explaining comments — transfers positively to reasoning tasks in natural language.
Not quite. Code improves reasoning because it contains explicit logical relationships and structured problem-solving patterns that transfer to mathematical and reasoning tasks in natural language contexts.
4. EleutherAI's "The Pile" (2020) was significant primarily because it:
Correct. The Pile was notable for its transparency — explicitly documenting every constituent source, enabling the research community to scrutinize corpus composition rather than accepting opaque "internet data" descriptions.
Not quite. The Pile's contribution was transparency: it documented 22 sources with sizes and licenses, making it the first publicly audited large-scale training corpus.

Lab 2: Training Data Decisions

Discuss corpus curation trade-offs with an AI tutor

Your Task

Engage with the tutor about the hard decisions behind assembling a trillion-token training corpus. Ask about data source trade-offs, filtering strategies, the Chinchilla findings, or the ethical tensions around what gets included or excluded.

Starter prompts: "If I were building an LLM corpus today, how would I decide what sources to include?" · "Why does deduplication help even if it reduces total token count?" · "What are the ethical arguments against training on web-scale data?"
Module 2 · Lesson 3

Compute, Hardware, and the Training Loop

How thousands of GPUs move gradients through hundreds of billions of parameters — repeatedly, for months.
What happens inside a pre-training run, and why does hardware topology matter as much as algorithm design?

Microsoft's Azure infrastructure for OpenAI's GPT-4 training run reportedly used around 25,000 A100 GPUs connected by high-speed InfiniBand networking. At peak, the cluster sustained communication bandwidth measured in terabits per second between nodes. A single training run at this scale costs tens of millions of dollars in compute alone. Hardware faults are routine in a 25,000-GPU cluster, so managing failure gracefully required checkpoint systems that could resume training within minutes of a fault.

The Forward and Backward Pass

Each step of pre-training consists of two phases. In the forward pass, a batch of token sequences is fed through the model layer by layer. Each transformer block applies attention and feedforward operations, producing a probability distribution over the vocabulary at the final layer. The cross-entropy loss between the predicted distribution and the actual next tokens is computed.
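The loss at the end of the forward pass can be written out directly. The following is a small pure-Python sketch with a toy four-token vocabulary; the helper names are illustrative, and real training code uses a fused, batched implementation on the accelerator.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_per_position, target_ids):
    """Mean negative log-probability assigned to the true next tokens."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


uniform = cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])    # model has no idea
confident = cross_entropy([[0.0, 0.0, 5.0, 0.0]], [2])  # model favors the right token
print(uniform, confident)
```

A model that guesses uniformly over a vocabulary of size V scores log(V); training pushes the loss below that baseline.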

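In the backward pass, gradients of the loss with respect to every parameter are computed via backpropagation. (Backpropagation through time, BPTT, is the recurrent-network variant; transformers process all positions of a sequence in parallel and use ordinary backpropagation.) An optimizer, typically AdamW for LLMs, uses these gradients to update parameters. One forward-backward cycle followed by the update is called a training step, and large runs execute hundreds of thousands of steps or more.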

Distributed Training Strategies

No single GPU can hold a model with hundreds of billions of parameters. Training is distributed across hundreds or thousands of devices using several complementary strategies:

Data Parallelism

Each GPU holds a full copy of the model and processes a different mini-batch. Gradients are averaged across all devices (AllReduce) before each parameter update. This is the simplest form of parallelism and scales well to many GPUs.
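The gradient averaging can be simulated without any hardware. In this sketch each "device" is just a list of (x, y) pairs for a one-parameter model y ≈ w·x; the function names and the toy data are assumptions for illustration, and a real system would perform the averaging with a collective communication call.

```python
def local_gradient(weight, batch):
    """Gradient of mean squared error for the model y = weight * x on one shard."""
    return sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values):
    """Stand-in for AllReduce: every device ends up with the same average."""
    return sum(values) / len(values)


def data_parallel_step(weight, shards, learning_rate=0.01):
    """Each shard computes its own gradient, then one synchronized update is applied."""
    grads = [local_gradient(weight, shard) for shard in shards]
    return weight - learning_rate * all_reduce_mean(grads)


shards = [[(1, 2), (2, 4)], [(3, 6), (4, 8)]]   # two devices, two examples each
new_weight = data_parallel_step(0.0, shards)
print(new_weight)
```

With equal shard sizes the averaged gradient equals the gradient on the combined batch, which is why data parallelism behaves like training with a larger batch.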

Model (Tensor) Parallelism

Individual weight matrices are split across GPUs, by rows or by columns, so that each device stores and multiplies only its own slice. Used when a single layer is too large for one GPU's memory. Megatron-LM (NVIDIA, 2019) pioneered efficient tensor parallelism for transformer LLMs.
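A column-wise split is the easiest case to verify by hand. The sketch below uses nested lists in place of GPU tensors and assumes the column count divides evenly among devices; the function names are hypothetical, and this is not Megatron-LM's implementation.

```python
def matmul(a, b):
    """Plain matrix multiply on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def split_columns(weights, parts):
    """Give each device a contiguous block of columns (assumes even division)."""
    step = len(weights[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in weights] for p in range(parts)]


def column_parallel_matmul(x, weights, parts):
    """Each device multiplies by its own slice; outputs are concatenated by column."""
    partials = [matmul(x, shard) for shard in split_columns(weights, parts)]
    # On real hardware this concatenation is a gather across devices.
    return [sum((partial[i] for partial in partials), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(column_parallel_matmul(x, w, parts=2))
```

No device ever holds the full weight matrix, yet the concatenated result is identical to the unsplit product.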

Pipeline Parallelism

Different transformer layers run on different GPUs, with activations passed between pipeline stages. GPipe (Google, 2019) demonstrated that splitting each batch into micro-batches keeps the pipeline stages busy most of the time, shrinking (though not eliminating) the "bubble" of idle computation.
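The effect of micro-batching on idle time can be estimated with a commonly quoted formula for a GPipe-style schedule: with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). The formula assumes every stage takes the same time per micro-batch, which is an idealization; the function name is illustrative.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a GPipe-style pipeline with equal-cost stages."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


for microbatches in (1, 4, 16, 64):
    print(microbatches, round(bubble_fraction(4, microbatches), 3))
```

With four stages and a single micro-batch, three quarters of the device time is idle; more micro-batches amortize the fill and drain phases.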

ZeRO (Zero Redundancy Optimizer)

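DeepSpeed's ZeRO (Microsoft, 2020) shards optimizer states, gradients, and parameters across data-parallel ranks instead of replicating them. Per the ZeRO paper, sharding optimizer states alone cuts per-GPU memory by up to 4×, adding gradient sharding reaches up to 8×, and also sharding parameters makes the reduction grow in proportion to the number of ranks, enabling training of trillion-parameter models.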
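The memory savings follow from simple accounting. This sketch uses the per-parameter byte counts given in the ZeRO paper for mixed-precision Adam (2 bytes of FP16 weights, 2 bytes of FP16 gradients, 12 bytes of FP32 optimizer state) and ignores activations and buffers; the function name and the 7.5B-parameter, 64-rank example are illustrative choices.

```python
def zero_bytes_per_gpu(n_params, n_ranks, stage):
    """Model-state memory per GPU under ZeRO stages 0 (none) through 3."""
    params = 2 * n_params      # FP16 weights
    grads = 2 * n_params       # FP16 gradients
    optimizer = 12 * n_params  # FP32 master weights, momentum, variance
    if stage >= 1:
        optimizer /= n_ranks   # stage 1 shards optimizer states
    if stage >= 2:
        grads /= n_ranks       # stage 2 also shards gradients
    if stage >= 3:
        params /= n_ranks      # stage 3 also shards parameters
    return params + grads + optimizer


for stage in range(4):
    gigabytes = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gigabytes:.1f} GB")
```

The optimizer states dominate the baseline footprint, which is why sharding them first gives the largest single reduction.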

Mixed Precision and Memory Management

Training in FP16 or BF16 (16-bit floating point) rather than FP32 roughly halves memory usage and doubles throughput on modern hardware. NVIDIA A100 and H100 GPUs include hardware-accelerated support for BF16, which has better numerical stability than FP16 due to a wider exponent range. Master weights are kept in FP32 for numerical precision; activations and communications use lower precision.
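The difference in exponent range can be computed from the bit layouts alone. The sketch below derives the largest finite value of each format from its exponent and mantissa widths (FP16: 5 and 10 bits, BF16: 8 and 7 bits, FP32: 8 and 23 bits), assuming IEEE-style encoding where the top exponent code is reserved for infinity and NaN; the function name is illustrative.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary floating-point format."""
    max_exponent = 2 ** (exponent_bits - 1) - 1   # top exponent code is reserved
    largest_significand = 2 - 2 ** (-mantissa_bits)
    return largest_significand * 2.0 ** max_exponent


formats = {"FP16": (5, 10), "BF16": (8, 7), "FP32": (8, 23)}
for name, (exp_bits, man_bits) in formats.items():
    print(f"{name}: max finite value is about {max_finite(exp_bits, man_bits):.4g}")
```

FP16 tops out at 65,504, so large intermediate values overflow unless the loss is rescaled, while BF16 reaches roughly the same magnitude as FP32 at the cost of fewer mantissa bits.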

Gradient checkpointing (also called activation recomputation) trades compute for memory: instead of storing all intermediate activations during the forward pass, they are recomputed during backprop. This allows training larger models at the cost of ~30% more compute per step.

Training Stability

Pre-training runs at scale are prone to loss spikes: sudden increases in training loss that can corrupt weeks of computation if not caught. Teams monitor loss curves continuously and may roll back to an earlier checkpoint when spikes occur. PaLM (Google, 2022) reported roughly 20 such spikes during its 540B parameter training run. The team restarted from a checkpoint about 100 steps before each spike and skipped a few hundred data batches; they concluded that the spikes arose from specific batches interacting with a particular parameter state rather than from "bad data" alone.
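Continuous monitoring of this kind can be as simple as comparing each loss value with a recent average. The detector below is a hypothetical heuristic for illustration only: the window size, the 1.5× threshold, and the function name are assumptions, not a method described in the PaLM paper.

```python
def find_loss_spikes(losses, window=5, threshold=1.5):
    """Return the steps whose loss exceeds threshold times the trailing mean."""
    spikes = []
    for step in range(window, len(losses)):
        baseline = sum(losses[step - window:step]) / window
        if losses[step] > threshold * baseline:
            spikes.append(step)
    return spikes


losses = [3.0, 2.9, 2.8, 2.7, 2.6, 2.5, 6.0, 2.5, 2.4]
spike_steps = find_loss_spikes(losses)
print(spike_steps)
```

A training harness could use the flagged step to pick the checkpoint to roll back to and the range of batches to skip.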

The Cost Trajectory

2018: BERT-Large, estimated ~$7,000 on cloud TPUs. Set the baseline for large-scale pre-training costs.
2020: GPT-3 (175B), estimated $4–12M. Sparked debate about whether only well-funded labs could do frontier AI research.
2022: Chinchilla (70B), a smaller model trained on more tokens. Showed that compute-optimal training could beat the 4× larger Gopher at the same training compute while being cheaper to serve.
2023: LLaMA-2 (70B). Meta released weights publicly, demonstrating that open-weight models trained efficiently could approach proprietary performance.
2024: Frontier runs. GPT-4, Claude 3, and Gemini Ultra estimated at $50–100M+. H100 clusters with 10,000–50,000 GPUs now standard for frontier training.

Lesson 3 Quiz

Compute, Hardware, and the Training Loop — 4 questions
1. In the context of distributed LLM training, what does "tensor parallelism" specifically involve?
Correct. Tensor parallelism splits weight matrices themselves — for instance, splitting a large matrix multiply across multiple GPUs. Megatron-LM (NVIDIA, 2019) is the seminal implementation.
Not quite. You've described data parallelism, pipeline parallelism, or ZeRO. Tensor parallelism specifically shards individual weight tensors across GPUs, requiring synchronization within each layer.
2. Why is BF16 often preferred over FP16 for LLM training despite both being 16-bit formats?
Correct. BF16 and FP32 share the same 8-bit exponent, meaning BF16 can represent the same dynamic range as FP32. FP16's narrower exponent makes it prone to overflow/underflow during training, requiring loss scaling workarounds.
Not quite. BF16's key advantage is its wider exponent (same as FP32), which avoids the numerical instability (overflow and underflow) that FP16 suffers during training due to large gradient values.
3. What problem does "gradient checkpointing" (activation recomputation) solve?
Correct. Gradient checkpointing discards intermediate activations after the forward pass and recomputes them during backpropagation, trading ~30% extra compute for significantly reduced memory consumption.
Not quite. Gradient checkpointing addresses GPU memory — by not storing all activations from the forward pass, memory is freed, at the cost of recomputing those activations during backprop (~30% more compute).
4. Google's PaLM training paper (2022) documented "loss spikes" during pre-training. What is a loss spike, and how is it typically addressed?
Correct. Loss spikes are sudden training instabilities. Rolling back to an earlier checkpoint and skipping the data batches around the spike is the standard mitigation; PaLM's team found the spikes came from particular batches combined with a particular parameter state.
Not quite. Loss spikes are sudden (not gradual) jumps in training loss. PaLM's team responded by rolling back to a checkpoint before the spike and skipping the batches that had surrounded it.

Lab 3: Inside the Training Loop

Explore distributed training mechanics with an AI tutor

Your Task

Use this lab to work through the hardware and systems side of pre-training. Ask about distributed training strategies, memory management techniques, why training runs cost millions of dollars, or how engineers deal with hardware failures at scale.

Starter prompts: "Can you walk me through what happens in a single training step from data loading to parameter update?" · "What's the difference between pipeline parallelism and tensor parallelism?" · "How does ZeRO reduce memory compared to standard data parallelism?"
Module 2 · Lesson 4

Scaling Laws and Emergent Capabilities

The mathematical regularities that predict model performance — and the sudden capability jumps that still surprise researchers.
Can we predict what an LLM will be able to do before we train it — and why do some abilities appear suddenly at scale thresholds?

In January 2020, Jared Kaplan and colleagues at OpenAI published "Scaling Laws for Neural Language Models." They found that test loss follows smooth power laws with model size, dataset size, and compute — each varying across orders of magnitude. The relationship was strikingly regular: double the model parameters and loss falls by a predictable amount. The paper gave labs a quantitative tool to forecast performance before spending millions on training, fundamentally changing how frontier LLM development was planned and justified.

The Three Axes of Scale

The Kaplan scaling laws identified three axes that drive language model performance. When the other two are not the bottleneck, test loss falls as a power law in each:

Parameters (N): loss ∝ N^(-0.076)
Data tokens (D): loss ∝ D^(-0.095)
Compute (C): loss ∝ C^(-0.050)

Each axis drives loss reduction independently, but with different efficiency. Critically, Kaplan et al. found that model size was the primary driver and recommended concentrating compute on larger models trained on relatively less data. The Chinchilla paper (2022) later revised this, arguing data and model size should scale together — a finding that redirected the entire field's resource allocation strategy.
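A power law makes "a predictable amount" easy to quantify. The sketch below computes how much the loss shrinks when one axis is scaled up, using a parameter-count exponent of about 0.076 as reported by Kaplan et al.; the function name is illustrative, and the calculation assumes the other axes are not limiting.

```python
def loss_ratio(scale_factor, exponent):
    """Factor by which loss changes when one axis is multiplied by scale_factor."""
    return scale_factor ** (-exponent)


ALPHA_N = 0.076  # approximate Kaplan et al. exponent for parameter count

for factor in (2, 10, 100):
    ratio = loss_ratio(factor, ALPHA_N)
    print(f"{factor}x parameters -> loss multiplied by {ratio:.3f}")
```

Doubling parameters trims loss by only about 5 percent, which is why meaningful gains have required order-of-magnitude increases in scale.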

Emergent Capabilities

Beyond the smooth scaling of perplexity, researchers at Google Brain documented a more striking phenomenon in 2022: certain capabilities appear to emerge discontinuously as model scale crosses thresholds. Wei et al. (2022) catalogued over 100 such "emergent abilities" across benchmarks — tasks on which smaller models perform near random chance and larger models abruptly perform well.

Examples from the literature include:

Chain-of-thought reasoning: Models below ~100B parameters showed minimal improvement from chain-of-thought prompting. Above that threshold, step-by-step reasoning prompts dramatically improved performance on multi-step math problems.

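Arithmetic: GPT-3 (175B) could perform 2-digit addition reliably in the few-shot setting but was less accurate on 3-digit addition. The scale at which a given arithmetic skill appears differs between model families such as GPT-3 and PaLM (540B), suggesting the capability threshold depends on both scale and training data composition.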

Instruction following: The ability to follow novel task descriptions without examples appeared to emerge rather than improve gradually, though the precise threshold varies by task.

The Debate: Are Emergent Abilities Real?

Schaeffer et al. (2023) argued that many apparent "emergent abilities" are artifacts of discontinuous evaluation metrics. When continuous metrics are used instead of pass/fail benchmarks, capability improvements often appear smooth rather than sudden. This sparked ongoing debate: are LLMs undergoing genuine qualitative phase transitions, or do researchers simply choose metrics that make smooth improvements look like discontinuous jumps?
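The metric argument can be reproduced with a toy calculation. Suppose per-token accuracy improves smoothly with scale, and a benchmark counts an answer as correct only if every one of its tokens is right; assuming token errors are independent (a simplification made for this sketch), exact-match accuracy is the per-token accuracy raised to the answer length.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that all tokens of an answer are correct, assuming independence."""
    return per_token_accuracy ** answer_length


ANSWER_LENGTH = 10
for per_token in (0.5, 0.7, 0.9, 0.99):
    exact = exact_match_rate(per_token, ANSWER_LENGTH)
    print(f"per-token {per_token:.2f} -> exact match {exact:.3f}")
```

The per-token column rises steadily, while the exact-match column stays near zero and then climbs steeply, which on a plot against scale looks like a sudden jump.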

Compute-Optimal Training and the Chinchilla Revision

The Chinchilla paper (Hoffmann et al., DeepMind, 2022) ran a series of controlled experiments varying both model size and token count across hundreds of training runs. They fit scaling law coefficients to the results and derived a key rule: for a given compute budget, loss is minimized by using roughly 20 tokens per parameter.

This directly contradicted the prior field consensus. GPT-3 (175B parameters) was trained on 300B tokens — about 1.7 tokens per parameter. Chinchilla (70B parameters) trained on 1.4T tokens achieved better performance than Gopher (280B) despite having far fewer parameters. The lesson: data quantity and model size must scale together, not size alone.
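The tokens-per-parameter comparison is a one-line division, shown here for the three models mentioned. The GPT-3 and Chinchilla figures come from the text above; Gopher's 300B-token count is taken from the Gopher paper rather than from this lesson, and the dictionary layout is just an illustrative choice.

```python
MODELS = {
    # name: (parameters, training tokens)
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),      # token count from the Gopher paper (assumption here)
    "Chinchilla": (70e9, 1.4e12),
}


def tokens_per_parameter(params, tokens):
    return tokens / params


ratios = {name: tokens_per_parameter(p, t) for name, (p, t) in MODELS.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} tokens per parameter")
```

Both GPT-3 and Gopher sit more than ten times below the ratio Chinchilla used.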

The Chinchilla findings reverberated through 2023 training decisions. Meta's LLaMA models, Mistral, and others trained smaller models on far more tokens — producing models that are cheaper to serve while matching or exceeding larger, undertrained predecessors.

Implications for AI Development

Scaling laws give organizations quantitative tools to forecast capability improvements before committing compute — enabling rational investment decisions. But emergent abilities create a complementary uncertainty: even a well-calibrated scaling law for perplexity may not predict when qualitatively new capabilities will appear. Both phenomena shape the economics and risk calculus of frontier LLM development.

Lesson 4 Quiz

Scaling Laws and Emergent Capabilities — 4 questions
1. What was the key practical contribution of Kaplan et al.'s scaling laws paper (2020)?
Correct. The scaling laws gave researchers a quantitative framework to predict loss (and therefore capability) as a function of model size, data, and compute — enabling systematic planning of expensive training runs.
Not quite. The key contribution was the discovery that loss follows smooth, predictable power laws across many orders of magnitude in scale — providing a planning tool for large-scale training investments.
2. How did the Chinchilla paper revise the consensus established by the Kaplan scaling laws?
Correct. Chinchilla's 70B model, trained on 1.4T tokens (~20 tokens per parameter), outperformed Gopher (280B) trained on far fewer tokens — showing that prior large models were parameter-rich but data-poor.
Not quite. Chinchilla's revision was specifically about the balance between model size and training tokens: prior models overinvested in parameters and underinvested in data, yielding suboptimal compute efficiency.
3. According to Wei et al. (2022), what defines an "emergent ability" in large language models?
Correct. Wei et al. defined emergent abilities as those showing near-chance performance in smaller models that appear to "emerge" discontinuously as scale crosses thresholds — cataloguing over 100 such abilities across benchmarks.
Not quite. Emergent abilities are specifically characterized by their discontinuous appearance with scale — small models show near-random performance, while models above a scale threshold perform significantly better, seemingly without gradual improvement in between.
4. Schaeffer et al. (2023) challenged the concept of emergent abilities by arguing:
Correct. Schaeffer et al. showed that pass/fail benchmarks create the illusion of discontinuous jumps when the underlying capability may be improving smoothly — a methodological challenge to how emergence is measured.
Not quite. Schaeffer's critique was methodological: using continuous evaluation metrics often reveals smooth improvement where discontinuous metrics (like exact-match accuracy) had suggested sudden emergence. Whether "true" phase transitions exist remains debated.

Lab 4: Scaling Laws in Practice

Reason through scaling decisions with an AI tutor

Your Task

Work through the implications of scaling laws and emergent capabilities. Ask the tutor to help you reason through compute allocation decisions, challenge you on Chinchilla vs. Kaplan trade-offs, or explain why emergent abilities matter for AI safety.

Starter prompts: "If I have a $1M compute budget, how would the Chinchilla findings change how I allocate it compared to following the Kaplan recommendations?" · "Are emergent abilities dangerous from a safety perspective?" · "Why do scaling laws matter for organizations that don't train frontier models?"

Module 2 Test

Pre-Training at Scale — 15 questions · Pass at 80%
1. The training objective for autoregressive LLMs like GPT is best described as:
Correct. Autoregressive language modeling maximizes log-probability of each token given its left context — the cross-entropy loss summed over all positions.
Incorrect. Autoregressive LLMs use the causal language modeling objective: maximize log P(token_i | token_1 ... token_{i-1}) for every position i.
2. BERT's masked language modeling objective differs from GPT's causal language modeling in that:
Correct. BERT is bidirectional — it attends to context on both sides to predict masked tokens. GPT is causal/autoregressive — it only uses tokens to the left when predicting the next token.
Incorrect. The key distinction is directionality: BERT uses bidirectional context for masked prediction; GPT uses only left-to-right context for next-token prediction.
3. GPT-3 was trained on approximately how many tokens?
Correct. GPT-3 (175B parameters) was trained on ~300 billion tokens — which Chinchilla later showed was significantly fewer than optimal for a model that size.
Incorrect. GPT-3 was trained on approximately 300 billion tokens across its filtered web crawl, books, and Wikipedia sources.
4. What role does up-weighting play in training corpus preparation?
Correct. Up-weighting means high-quality sources (e.g., Wikipedia, curated books) are seen multiple times or sampled at higher rates than raw volume would suggest, improving the ratio of high-quality to low-quality signal.
Incorrect. Up-weighting is a corpus sampling strategy: high-quality sources are sampled more frequently than their raw proportion, giving the model more exposure to reliable, well-structured text.
5. Which technique uses MinHash to identify and remove similar documents from training data?
Correct. Deduplication uses approximate matching techniques like MinHash (locality-sensitive hashing) to identify and remove near-duplicate documents, which Lee et al. (2021) showed can maintain or improve model quality even when total token count decreases.
Incorrect. MinHash is used for deduplication — identifying near-duplicate documents so they can be removed or reduced in the training corpus.
6. In data parallelism for distributed LLM training, how are gradients reconciled across devices?
Correct. In data parallelism, each device computes gradients on its local mini-batch, then an AllReduce operation averages gradients across all devices before the synchronized parameter update.
Incorrect. In standard data parallelism, gradients are synchronized every step via AllReduce — computing the average gradient across all devices before any parameter update occurs.
7. DeepSpeed's ZeRO optimizer primarily reduces memory usage by:
Correct. ZeRO (Zero Redundancy Optimizer) eliminates memory redundancy in data-parallel training by partitioning optimizer states, gradients, and parameters across all devices. Each device holds only a shard, so per-device memory shrinks by up to 8× when optimizer states and gradients are sharded, and further as parameters are sharded across more devices.
Incorrect. ZeRO's contribution is partitioning (sharding) optimizer states, gradients, and parameters across ranks instead of replicating all three on every GPU, dramatically reducing per-device memory requirements.
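A rough sketch of the memory arithmetic. It assumes the accounting used in the ZeRO paper for mixed-precision Adam (2 bytes per parameter for FP16 weights, 2 for FP16 gradients, 12 for FP32 optimizer states) and ignores activations and buffers; the 7.5B-parameter model and 64 devices are example values.

```python
def zero_bytes_per_device(num_params, num_devices, stage):
    """Approximate model-state memory per device under ZeRO.

    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer states partitioned
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    params = 2 * num_params   # FP16 weights
    grads = 2 * num_params    # FP16 gradients
    optim = 12 * num_params   # FP32 weights, momentum, variance (Adam)
    if stage >= 1:
        optim /= num_devices
    if stage >= 2:
        grads /= num_devices
    if stage >= 3:
        params /= num_devices
    return params + grads + optim


NUM_PARAMS = 7.5e9   # example model size
NUM_DEVICES = 64     # example data-parallel degree

for stage in range(4):
    gigabytes = zero_bytes_per_device(NUM_PARAMS, NUM_DEVICES, stage) / 1e9
    print(f"stage {stage}: {gigabytes:.1f} GB per device")
```

Under these assumptions the stage 2 saving approaches 8× as the device count grows, while stage 3 divides the whole model state by the number of devices.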
8. Why might training on GitHub code improve a model's performance on natural language math benchmarks?
Correct. Code's structured syntax, explicit conditionals, loops, and inline comments explaining intent all reinforce logical reasoning patterns. DeepMind researchers found these patterns transfer positively to non-code reasoning tasks.
Incorrect. The benefit comes from code's inherent logical structure — explicit if-then reasoning, structured problem decomposition — which reinforces reasoning patterns that generalize to math and logic tasks in natural language.
9. The Kaplan scaling law for model size found that test loss scales approximately as:
Correct. Loss ∝ N^(-α) where α ≈ 0.07 for model size. This smooth power-law relationship held across many orders of magnitude, enabling reliable performance extrapolation.
Incorrect. Kaplan et al. found loss follows a power law — specifically L ∝ N^(-0.07) — meaning each doubling of parameters yields a predictable percentage drop in test loss.
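To see what the power law implies in practice, the sketch below evaluates the loss ratio for a given increase in parameter count. It uses the exponent quoted above and assumes data and compute are not the bottleneck.

```python
ALPHA_N = 0.07  # model-size exponent quoted in the answer above


def loss_ratio(scale_factor, alpha=ALPHA_N):
    """Ratio of new loss to old loss when parameters grow by scale_factor.

    Follows from L ∝ N^(-alpha): L(kN) / L(N) = k^(-alpha).
    """
    return scale_factor ** (-alpha)


for factor in (2, 10, 100):
    drop_percent = (1 - loss_ratio(factor)) * 100
    print(f"{factor:>3}x parameters -> loss falls by about {drop_percent:.1f}%")
```

The percentage drop per doubling is constant regardless of the starting size, which is what makes extrapolation from small runs possible.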
10. Chinchilla's training compute-optimal rule of thumb states that:
Correct. Chinchilla found that ~20 tokens per parameter is compute-optimal. A 70B parameter model should see ~1.4T tokens — far more than GPT-3's 300B tokens for 175B parameters (~1.7 tokens/param).
Incorrect. Chinchilla's key finding: for any parameter count N, optimal training uses approximately 20 × N tokens. This ratio was derived by fitting scaling law coefficients to hundreds of controlled training runs.
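The arithmetic behind these figures is short enough to check directly. The sketch below applies the 20-tokens-per-parameter rule of thumb to the two models mentioned above; the function name is a hypothetical helper.

```python
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb


def compute_optimal_tokens(num_params):
    """Approximate compute-optimal training tokens for a given model size."""
    return TOKENS_PER_PARAM * num_params


chinchilla_tokens = compute_optimal_tokens(70e9)    # 70B-parameter model
gpt3_optimal_tokens = compute_optimal_tokens(175e9)  # what the rule suggests for 175B
gpt3_actual_ratio = 300e9 / 175e9                    # tokens per parameter GPT-3 saw

print(f"70B model:  {chinchilla_tokens / 1e12:.1f}T tokens")
print(f"175B model: {gpt3_optimal_tokens / 1e12:.1f}T tokens suggested")
print(f"GPT-3 actual ratio: {gpt3_actual_ratio:.2f} tokens per parameter")
```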
11. Which of the following is an example of an "emergent ability" as defined by Wei et al. (2022)?
Correct. Chain-of-thought reasoning is a canonical emergent ability — below a scale threshold it provides no benefit; above that threshold it substantially improves multi-step reasoning, appearing as a discontinuous jump rather than gradual improvement.
Incorrect. Emergent abilities are specifically those that appear discontinuously with scale. Chain-of-thought prompting, for example, only helps above roughly 100B parameters, whereas perplexity improves smoothly.
12. BF16's main advantage over FP16 for LLM training is:
Correct. BF16 uses 8 exponent bits (same as FP32) vs. FP16's 5, giving it the same dynamic range as FP32. This prevents the overflow/underflow that plagues FP16 training without requiring manual loss scaling.
Incorrect. BF16's key advantage is its exponent width (8 bits, matching FP32) rather than mantissa precision. This wide exponent range means BF16 can represent the same dynamic range as FP32, avoiding FP16's overflow instability.
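The range difference follows directly from the bit layouts. This sketch derives the largest finite value and the spacing near 1.0 for each format from its exponent and mantissa widths, assuming the usual IEEE-style encoding in which the all-ones exponent is reserved for infinity and NaN.

```python
def max_finite(exp_bits, mantissa_bits):
    """Largest finite value for an IEEE-style binary floating-point format."""
    bias = 2 ** (exp_bits - 1) - 1            # also the largest usable exponent
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias


def epsilon(mantissa_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -mantissa_bits


FORMATS = {
    "FP32": (8, 23),
    "BF16": (8, 7),
    "FP16": (5, 10),
}

for name, (exp_bits, mantissa_bits) in FORMATS.items():
    print(f"{name}: max={max_finite(exp_bits, mantissa_bits):.4g} "
          f"epsilon={epsilon(mantissa_bits):.3g}")
```

BF16 trades mantissa precision for range: its spacing near 1.0 is coarser than FP16's, but its largest finite value is on the same order as FP32's, so large gradients or activations are far less likely to overflow.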
13. The EleutherAI "Pile" dataset (2020) was notable primarily for:
Correct. The Pile was the first large training corpus to fully document its composition — 22 sources, sizes, and license status — enabling the community to scrutinize what went into an LLM's training data.
Incorrect. The Pile's contribution was transparency: it publicly documented all 22 source datasets with sizes and license information, in contrast to opaque "internet data" descriptions common at the time.
14. Schaeffer et al.'s (2023) critique of "emergent abilities" argued that:
Correct. Schaeffer et al. showed that when pass/fail metrics are replaced with continuous measures of task performance, the discontinuous "emergence" often disappears, suggesting the phenomenon may be a measurement artifact rather than a true phase transition.
Incorrect. Schaeffer's argument was that emergence is partly a measurement artifact: discontinuous metrics (like exact-match accuracy) create the appearance of sudden jumps even when underlying capability improves smoothly.
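A toy illustration of the metric effect. It assumes that an answer is a fixed number of tokens, that each token is correct independently with the same probability, and that exact match requires every token to be right; the accuracy values and answer length are hypothetical.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token in the answer is correct."""
    return per_token_accuracy ** answer_length


ANSWER_LENGTH = 10  # hypothetical answer length in tokens
# Hypothetical per-token accuracies for successively larger models.
per_token_accuracies = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

for accuracy in per_token_accuracies:
    rate = exact_match_rate(accuracy, ANSWER_LENGTH)
    print(f"per-token accuracy {accuracy:.2f} -> exact match {rate:.3f}")
```

Per-token accuracy rises steadily across the list, yet exact match stays near zero for the first several entries and then climbs steeply. Nothing discontinuous happens in the underlying capability; the all-or-nothing metric creates the apparent jump.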
15. A company with a $10M compute budget wants to train the best possible language model. Based on the Chinchilla findings, which strategy is most likely to produce the best performance?
Correct. Chinchilla demonstrated that compute-optimal training requires balancing model size and data volume — roughly 20 tokens per parameter. The same budget on a smaller model with more tokens typically outperforms a larger undertrained model.
Incorrect. The Chinchilla findings directly address this scenario: for a fixed compute budget, allocate to balance model size and training tokens at roughly 20 tokens per parameter, rather than maximizing model size as Kaplan's original work suggested.
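One way to turn a compute budget into a model size and token count is sketched below. It assumes the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D) together with the 20-tokens-per-parameter rule. The FLOP budget is a hypothetical figure, since converting dollars to FLOPs depends on hardware and pricing not given here.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # assumed approximation: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20       # Chinchilla rule of thumb


def compute_optimal_split(flop_budget):
    """Solve C = 6 * N * D with D = 20 * N for parameters N and tokens D."""
    num_params = math.sqrt(flop_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    num_tokens = TOKENS_PER_PARAM * num_params
    return num_params, num_tokens


BUDGET_FLOPS = 1e23  # hypothetical budget, not derived from the $10M figure

num_params, num_tokens = compute_optimal_split(BUDGET_FLOPS)
print(f"parameters: {num_params / 1e9:.1f}B, tokens: {num_tokens / 1e9:.0f}B")
```

Doubling the budget under these assumptions grows both the model and the dataset by a factor of √2, rather than putting all of the extra compute into a larger model.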