In May 2020, OpenAI released a technical report describing GPT-3, a 175-billion-parameter model whose largest data source began as roughly 45 terabytes of compressed Common Crawl text before filtering. The training run consumed an estimated 3.14 × 10²³ floating-point operations. No fine-tuning on downstream tasks was used. The same weights answered trivia, translated French, and wrote working Python, given only a handful of examples in the prompt. The paper called this "few-shot learning," and the community spent months debating whether the model was reasoning or merely pattern-matching at extraordinary scale.
Pre-training is the foundational phase of building a large language model. The objective is deceptively simple: given a sequence of tokens, predict the next token. This is called the language modeling objective, sometimes written as maximizing the log-probability of the training corpus.
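Written out, the objective looks roughly like the following. The symbols are introduced here for illustration only: x_1 … x_T is a training sequence of T tokens and θ stands for the model's parameters. Minimizing this loss is the same as maximizing the log-probability of the sequence.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```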
Every parameter update during pre-training is driven by this single signal — the difference between what the model predicted and what actually appeared next in the text. Multiplied across trillions of examples, this pressure sculpts the model's weights into representations that capture grammar, facts, reasoning patterns, and stylistic conventions.
There is no labeled dataset, no human annotator telling the model what a "good" answer looks like. The training data itself provides supervision — the next word in a Wikipedia article, the next line of a Python function, the next sentence in a legal brief.
Because language is a compressed representation of human knowledge, a model that genuinely predicts text well must implicitly encode enormous amounts of world knowledge. Predicting "the capital of France is ___" correctly requires knowing geography. Predicting the next line of a proof requires understanding logical structure.
Pre-training belongs to a paradigm called self-supervised learning. The labels are derived from the input data itself — no external annotation is needed. For a causal (autoregressive) language model like GPT, the label for position i in the sequence is simply the token at position i+1.
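A minimal sketch of how those labels fall out of the data, assuming a toy word-level tokenization (real models use subword tokens, and the function name here is hypothetical): the labels are just the input sequence shifted by one position.

```python
def make_causal_pairs(token_ids):
    """Return (inputs, labels) where labels[i] is the token that follows inputs[i]."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form one training pair")
    inputs = token_ids[:-1]   # positions 0 .. n-2
    labels = token_ids[1:]    # positions 1 .. n-1, the same sequence shifted by one
    return inputs, labels

# Toy example; the word-level "tokens" are an assumption made for readability.
tokens = ["the", "capital", "of", "France", "is", "Paris"]
inputs, labels = make_causal_pairs(tokens)
for last_seen, target in zip(inputs, labels):
    print(f"after {last_seen!r} the label is {target!r}")
```

No annotation step appears anywhere: every document yields as many training pairs as it has tokens, minus one.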
BERT, introduced by Google in 2018, used a variant called masked language modeling: randomly mask 15% of tokens and train the model to predict the masked words using surrounding context. Both approaches are self-supervised, but they produce models with different strengths — BERT excels at understanding tasks; autoregressive models excel at generation.
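For contrast with the shifted-label setup, here is a simplified sketch of masked-language-model data preparation. It only replaces chosen tokens with a mask symbol; the original BERT recipe also sometimes substitutes a random token or leaves the chosen token unchanged, which this sketch omits. The function name and the seed parameter are illustrative assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token at masked positions and None elsewhere,
    so a loss would be computed only where a token was hidden.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens = "the cat sat on the mat while the dog slept by the door".split() * 5
masked, labels = mask_tokens(tokens, seed=1)
print(sum(label is not None for label in labels), "of", len(tokens), "tokens masked")
```

Because the model sees context on both sides of each mask, this setup suits understanding tasks, while the left-to-right setup maps directly onto generation.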
Modern LLMs like GPT-4, Claude, and Gemini all use autoregressive pre-training as their foundation, with masked approaches now mostly reserved for encoder models used in search and classification.
The GPT-3 paper (Brown et al., 2020) demonstrated that scale alone — without any task-specific fine-tuning — could unlock capabilities like arithmetic, translation, and code generation that researchers had previously assumed required explicit training on each task. This finding fundamentally reframed how the field thought about pre-training.
You have a direct line to an AI tutor specialized in LLM pre-training mechanics. Use it to deepen your understanding of the concepts covered in Lesson 1. Ask about next-token prediction, how self-supervised learning works at scale, or why predicting text forces a model to encode world knowledge.
In December 2020, EleutherAI released The Pile, an 825 GiB open dataset assembled from 22 distinct sources including a Common Crawl subset, PubMed, arXiv, GitHub, the FreeLaw Project, and Project Gutenberg. The paper explicitly documented every source's size, license status, and intended contribution. It was one of the first large-scale attempts to publicly audit what an LLM training corpus actually contains, a stark contrast to the opaque "internet data" descriptions common at the time.
Modern LLM corpora blend many source types, each contributing different properties:
Web crawl: The largest source by volume, consisting of petabytes of raw HTML from across the web. Highly diverse but noisy: duplicate content, spam, hate speech, and low-quality writing require aggressive filtering. GPT-3 used a filtered version of Common Crawl weighted at ~60% of the training mix.
Books: Books provide long-range coherence that web text lacks. GPT-3 used the Books1 and Books2 corpora. LLaMA (Meta, 2023) used Project Gutenberg and Books3. Long documents teach models to maintain consistent context and argument structure over thousands of tokens.
Code: GitHub code is present in nearly every modern LLM corpus. Code has strict syntax, explicit logic, and comments explaining intent, which makes it a uniquely structured form of text. Some studies report that training on code improves mathematical and logical reasoning even on purely natural-language benchmarks.
Curated sources: Wikipedia, academic papers (arXiv, PubMed, Semantic Scholar), legal documents (FreeLaw), and web pages linked from Reddit (WebText / OpenWebText) are typically up-weighted relative to their raw volume to increase factual density.
Raw web crawl data is far too noisy to use directly. Standard curation steps include:
Deduplication: Near-duplicate documents inflate the effective weight of repeated content. Lee et al. (2021) showed that deduplicating training data reduced verbatim memorization and let models reach the same or better accuracy in fewer training steps. Tools like MinHash are used to detect fuzzy duplicates at scale.
Quality filtering: Heuristic filters remove documents that score a high perplexity under a reference language model, are too short, repeat themselves excessively, or contain too few natural language words. Some pipelines use a classifier trained on Wikipedia-quality text to score and filter web pages.
Decontamination: Benchmark test sets (MMLU, HumanEval, etc.) must be identified and removed from training data to prevent inadvertent memorization inflating evaluation scores.
Up-weighting: High-quality sources are seen multiple times (epochs > 1 on selected subsets) while low-quality sources are down-sampled. The Chinchilla paper addressed how many tokens to train on for a given model size, not which sources to favor, so mixture weights remain a separate design choice.
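Two of the curation steps above can be sketched in a few lines: MinHash-style near-duplicate estimation and n-gram overlap decontamination. This is a toy illustration under stated assumptions, not any lab's actual pipeline. The function names, the 3-word shingles, the 64 hash functions, and the 8-word overlap window are all arbitrary choices made here, and production systems add locality-sensitive hashing so they never compare every pair of documents.

```python
import hashlib

def shingles(text, n):
    """Set of lowercase word n-grams; a text shorter than n words is one shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def _hash(seed, item):
    """Seeded 64-bit hash, standing in for a family of independent hash functions."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(items, num_hashes=64):
    """Keep the minimum hash per seed; assumes items is non-empty."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def is_contaminated(document, benchmark_items, n=8):
    """Flag a document that shares any word n-gram with a benchmark item."""
    doc_grams = shingles(document, n)
    return any(doc_grams & shingles(item, n) for item in benchmark_items)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "gradient descent updates parameters using the loss gradient at each step"
sig_a, sig_b, sig_c = (minhash_signature(shingles(d, 3)) for d in (doc_a, doc_b, doc_c))
print("a vs b:", estimated_jaccard(sig_a, sig_b))
print("a vs c:", estimated_jaccard(sig_a, sig_c))

benchmark = ["What is the capital of France and which river runs through it?"]
leaked = "Quiz answers: what is the capital of France and which river runs through it? Paris, the Seine."
print("leaked page flagged:", is_contaminated(leaked, benchmark))
```

The signatures are fixed-length no matter how long the documents are, which is what makes the comparison cheap enough to run over billions of pages.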
Internet-scale corpora inevitably contain harmful content. The GPT-3 paper acknowledged this, noting that the model could generate biased or toxic completions. Subsequent work (Bender et al., "Stochastic Parrots," 2021) argued that the costs of massive data collection — including encoding social biases at scale — had been systematically underweighted in the field's cost-benefit analysis.
The Chinchilla paper (Hoffmann et al., DeepMind, 2022) found that many large models were undertrained relative to their size. Optimal training requires scaling data tokens proportionally with model parameters — roughly 20 tokens per parameter. This finding drove the shift toward training smaller models on far more data, producing models that are both cheaper to run and better performing.
Engage with the tutor about the hard decisions behind assembling a trillion-token training corpus. Ask about data source trade-offs, filtering strategies, the Chinchilla findings, or the ethical tensions around what gets included or excluded.
Microsoft's Azure infrastructure for OpenAI's GPT-4 training run reportedly used around 25,000 A100 GPUs connected by high-speed InfiniBand networking. At peak, the cluster sustained communication bandwidth measured in terabits per second between nodes. A single training run at this scale costs tens of millions of dollars in compute alone. Failures also have to be handled gracefully: a single GPU failing in a 25,000-GPU cluster is not unusual, so checkpoint systems must be able to resume training within minutes of a hardware fault.
Each step of pre-training consists of two phases. In the forward pass, a batch of token sequences is fed through the model layer by layer. Each transformer block applies attention and feedforward operations, producing a probability distribution over the vocabulary at the final layer. The cross-entropy loss between the predicted distribution and the actual next tokens is computed.
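The loss at the end of the forward pass can be sketched as follows. The logits here are hand-written stand-ins for what the final layer would produce, and the function names are illustrative; a real implementation would work on batched tensors.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits_per_position, target_ids):
    """Mean negative log-probability assigned to the true next tokens."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Vocabulary of 4 tokens, two positions. The first prediction favors the
# correct token (index 2); the second is uniform, i.e. a pure guess.
logits = [[0.1, 0.2, 3.0, 0.1], [0.0, 0.0, 0.0, 0.0]]
targets = [2, 1]
print("loss:", cross_entropy(logits, targets))
```

A uniform guess over V tokens always costs log(V), which is why an untrained model's loss starts near the logarithm of the vocabulary size.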
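In the backward pass, gradients of the loss with respect to every parameter are computed via backpropagation. (Backpropagation through time is the recurrent-network variant; a transformer processes all positions of a sequence in one pass and does not need it.) An optimizer, typically AdamW for LLMs, uses these gradients to update parameters. This forward-backward cycle is called a training step, and large runs execute on the order of hundreds of thousands of steps.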
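To make the optimizer update concrete, here is a single-parameter sketch of an AdamW step applied to a toy quadratic loss. The hyperparameter values are common defaults chosen for illustration, and the toy loss (theta - 3)² is an assumption made here; nothing below reflects any particular model's settings.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad            # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    # Weight decay is applied directly to the parameter, decoupled from the
    # gradient statistics, which is what distinguishes AdamW from Adam.
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2 * (theta - 3.0)          # gradient of the toy loss (theta - 3)^2
    theta, m, v = adamw_step(theta, grad, m, v, t, lr=0.05)
print("theta after 500 steps:", theta)
```

The parameter settles slightly below 3 rather than exactly on it, because the weight-decay term keeps pulling it toward zero.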
No single GPU can hold a model with hundreds of billions of parameters. Training is distributed across hundreds or thousands of devices using several complementary strategies:
Data parallelism: Each GPU holds a full copy of the model and processes a different mini-batch. Gradients are averaged across all devices (AllReduce) before each parameter update. This is the simplest form of parallelism and scales well to many GPUs.
Tensor parallelism: Individual weight matrices are split across GPUs, by rows or by columns, so that each device stores and multiplies only a slice. It is used when a single layer is too large for one GPU's memory. Megatron-LM (NVIDIA, 2019) pioneered efficient tensor parallelism for transformer LLMs.
Pipeline parallelism: Different transformer layers run on different GPUs, with activations passed between pipeline stages. GPipe (Google, 2019) demonstrated that splitting each batch into micro-batches keeps the stages busy most of the time, shrinking (though not eliminating) the "bubble" of idle computation.
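ZeRO: DeepSpeed's ZeRO (Microsoft, 2020) shards optimizer states, gradients, and parameters across data-parallel ranks. Sharding optimizer states and gradients cuts per-GPU memory by up to 8× compared to naive data parallelism, and sharding the parameters as well makes the saving grow with the number of ranks, enabling training of trillion-parameter models.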
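The four strategies above can each be reduced to a small identity or a piece of arithmetic. The sketch below runs on one machine with plain Python lists, so the "devices" and "workers" are simulated, and all function names are hypothetical. The memory accounting assumes mixed-precision Adam at 16 bytes per parameter (2 for 16-bit weights, 2 for 16-bit gradients, 12 for 32-bit optimizer state), and the bubble formula assumes a GPipe-style schedule with equal-cost stages.

```python
def matmul(a, b):
    """Plain-Python matrix product for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# --- Data parallelism: each worker computes a gradient on its own shard,
# then an all-reduce averages them. Toy model: y = w * x with squared error.
def grad_mse(w, batch):
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    """Stand-in for a collective all-reduce: every worker receives the mean."""
    return sum(values) / len(values)

# --- Tensor parallelism: split W by columns; each device computes a slice
# of the output and the slices are concatenated.
def column_parallel_matmul(x, w, num_devices):
    step = len(w[0]) // num_devices      # assumes columns divide evenly
    parts = []
    for d in range(num_devices):
        w_shard = [row[d * step:(d + 1) * step] for row in w]
        parts.append(matmul(x, w_shard))
    return [sum((p[i] for p in parts), []) for i in range(len(x))]

# --- Pipeline parallelism: idle fraction with micro-batching.
def bubble_fraction(stages, micro_batches):
    return (stages - 1) / (micro_batches + stages - 1)

# --- ZeRO: per-GPU bytes as more state is sharded across data-parallel ranks.
def zero_bytes_per_gpu(params, ranks, stage):
    weights, grads, optim = 2 * params, 2 * params, 12 * params
    if stage == 0:
        return weights + grads + optim                 # naive data parallelism
    if stage == 1:
        return weights + grads + optim / ranks         # shard optimizer state
    if stage == 2:
        return weights + (grads + optim) / ranks       # also shard gradients
    if stage == 3:
        return (weights + grads + optim) / ranks       # also shard parameters
    raise ValueError("stage must be 0, 1, 2, or 3")

batch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
shard_grads = [grad_mse(1.5, batch[:2]), grad_mse(1.5, batch[2:])]
print("averaged shard gradient:", all_reduce_mean(shard_grads))
print("full-batch gradient:    ", grad_mse(1.5, batch))
print("bubble, 4 stages, 1 vs 12 micro-batches:",
      bubble_fraction(4, 1), bubble_fraction(4, 12))
for stage in range(4):
    print(f"ZeRO stage {stage}: {zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9:.1f} GB per GPU")
```

In practice these techniques are combined: a large run typically uses tensor parallelism within a node, pipeline parallelism across nodes, and data parallelism (often with sharded state) across replicas.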
Training in FP16 or BF16 (16-bit floating point) rather than FP32 roughly halves memory usage and doubles throughput on modern hardware. NVIDIA A100 and H100 GPUs include hardware-accelerated support for BF16, which has better numerical stability than FP16 due to a wider exponent range. Master weights are kept in FP32 for numerical precision; activations and communications use lower precision.
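The range difference between the two 16-bit formats can be seen with the standard library alone. This sketch simulates BF16 by truncating the low 16 bits of an FP32 encoding; real hardware rounds rather than truncates, so low-order digits may differ. The helper names are illustrative.

```python
import struct

def to_bfloat16(x):
    """Simulate BF16 by keeping the top 16 bits of the FP32 encoding (truncation)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def to_float16(x):
    """Round-trip through IEEE half precision; overflow is reported as infinity."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

for value in (70000.0, 1e-8, 1.001):
    print(f"{value!r:>10}  fp16 -> {to_float16(value)!r:<14} bf16 -> {to_bfloat16(value)!r}")
```

Large and tiny magnitudes survive in BF16 but not in FP16, while FP16 keeps more digits near 1. That trade is why BF16 tends to be the more forgiving choice for gradients, whose magnitudes vary widely.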
Gradient checkpointing (also called activation recomputation) trades compute for memory: instead of storing every intermediate activation during the forward pass, the framework keeps only a subset and recomputes the rest during backpropagation. This allows training larger models at the cost of roughly 30% more compute per step.
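A toy version of the idea, assuming a chain of scalar layers h -> tanh(w * h) whose final output is treated as the loss. The segment length, the function names, and the activation count used as a memory proxy are all choices made for this sketch; frameworks implement the same pattern on tensors.

```python
import math

def layer(w, h):
    return math.tanh(w * h)

def backward_segment(weights, acts, grad_out):
    """Backprop through consecutive layers; acts[i] is the input of layer i."""
    grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        out = acts[i + 1]
        d_pre = grad_out * (1.0 - out * out)   # derivative of tanh
        grads[i] = d_pre * acts[i]
        grad_out = d_pre * weights[i]
    return grads, grad_out

def grads_full(weights, x):
    """Baseline: store every activation during the forward pass."""
    acts = [x]
    for w in weights:
        acts.append(layer(w, acts[-1]))
    grads, _ = backward_segment(weights, acts, 1.0)
    return grads, len(acts)

def grads_checkpointed(weights, x, segment):
    """Store only each segment's input, then recompute inside the segment."""
    checkpoints, h = [], x
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints.append(h)
        h = layer(w, h)
    grads, grad_out, peak = [0.0] * len(weights), 1.0, 0
    for s in reversed(range(len(checkpoints))):
        start = s * segment
        seg_w = weights[start:start + segment]
        acts = [checkpoints[s]]
        for w in seg_w:                         # the extra forward work
            acts.append(layer(w, acts[-1]))
        seg_grads, grad_out = backward_segment(seg_w, acts, grad_out)
        grads[start:start + segment] = seg_grads
        peak = max(peak, len(checkpoints) + len(acts))
    return grads, peak

weights = [0.5, -1.2, 0.8, 1.1, -0.7, 0.9, 0.3, 1.4]
full, stored_full = grads_full(weights, 0.6)
ckpt, stored_ckpt = grads_checkpointed(weights, 0.6, segment=3)
print("activations held at once:", stored_full, "vs", stored_ckpt)
```

The gradients are identical either way; only the peak number of stored activations and the amount of repeated forward computation change.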
Pre-training runs at scale are prone to loss spikes: sudden increases in training loss that can corrupt weeks of computation if not caught. Teams monitor loss curves continuously and may roll back to an earlier checkpoint when spikes occur. PaLM (Google, 2022) reported roughly 20 such spikes during its 540B-parameter training run. The team recovered by restarting from a checkpoint shortly before each spike and skipping the data batches around it, and concluded that the spikes arose from particular batches interacting with a particular model state rather than from bad data alone.
Use this lab to work through the hardware and systems side of pre-training. Ask about distributed training strategies, memory management techniques, why training runs cost millions of dollars, or how engineers deal with hardware failures at scale.
In January 2020, Jared Kaplan and colleagues at OpenAI published "Scaling Laws for Neural Language Models." They found that test loss follows smooth power laws with model size, dataset size, and compute — each varying across orders of magnitude. The relationship was strikingly regular: double the model parameters and loss falls by a predictable amount. The paper gave labs a quantitative tool to forecast performance before spending millions on training, fundamentally changing how frontier LLM development was planned and justified.
The Kaplan scaling laws identified three independent axes that drive language model performance: model size (number of parameters), dataset size (number of training tokens), and training compute.
Each axis drives loss reduction independently, but with different efficiency. Critically, Kaplan et al. found that model size was the primary driver and recommended concentrating compute on larger models trained on relatively less data. The Chinchilla paper (2022) later revised this, arguing data and model size should scale together — a finding that redirected the entire field's resource allocation strategy.
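To see what "a predictable amount" means for the model-size axis, here is a sketch of a Kaplan-style power law in the number of parameters. The constants are approximately the values reported by Kaplan et al. for non-embedding parameters and should be treated as illustrative rather than authoritative; the function name is hypothetical.

```python
def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Power law L(N) = (N_c / N) ** alpha_N, with data and compute not limiting."""
    return (n_c / n_params) ** alpha_n

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.3f}")

ratio = loss_from_params(2e9) / loss_from_params(1e9)
print("loss ratio after doubling parameters:", ratio)
```

Under a power law the ratio is the same at every scale: each doubling multiplies the loss by the same factor, which is what lets a lab extrapolate from small runs to large ones.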
Beyond the smooth scaling of perplexity, researchers at Google Brain documented a more striking phenomenon in 2022: certain capabilities appear to emerge discontinuously as model scale crosses thresholds. Wei et al. (2022) catalogued over 100 such "emergent abilities" across benchmarks — tasks on which smaller models perform near random chance and larger models abruptly perform well.
Examples from the literature include:
Chain-of-thought reasoning: Models below ~100B parameters showed minimal improvement from chain-of-thought prompting. Above that threshold, step-by-step reasoning prompts dramatically improved performance on multi-step math problems.
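Arithmetic: In few-shot evaluation, GPT-3 (175B) solved 2-digit addition almost perfectly and 3-digit addition most of the time, while its smaller variants did far worse. Other model families reach comparable accuracy at different sizes, suggesting the capability threshold depends on both scale and training data composition.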
Instruction following: The ability to follow novel task descriptions without examples appeared to emerge rather than improve gradually, though the precise threshold varies by task.
Schaeffer et al. (2023) argued that many apparent "emergent abilities" are artifacts of discontinuous evaluation metrics. When continuous metrics are used instead of pass/fail benchmarks, capability improvements often appear smooth rather than sudden. This sparked ongoing debate: are LLMs undergoing genuine qualitative phase transitions, or do researchers simply choose metrics that make smooth improvements look like discontinuous jumps?
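The core of that argument fits in a few lines. Suppose per-token accuracy improves smoothly with scale, and a benchmark only awards credit when every token of a 10-token answer is correct. Assuming token errors are independent (a simplification made for this sketch), the exact-match score is the per-token accuracy raised to the answer length.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that every token is right, assuming independent per-token errors."""
    return per_token_accuracy ** answer_length

for p in (0.5, 0.7, 0.9, 0.95, 0.99):
    print(f"per-token accuracy {p:.2f} -> exact match on 10 tokens "
          f"{exact_match_rate(p, 10):.3f}")
```

The per-token numbers climb steadily, yet the exact-match column stays near zero and then shoots up. The same underlying model can look smooth or sudden depending on which column gets plotted.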
The Chinchilla paper (Hoffmann et al., DeepMind, 2022) ran a series of controlled experiments varying both model size and token count across hundreds of training runs. They fit scaling law coefficients to the results and derived a key rule: for a given compute budget, loss is minimized by using roughly 20 tokens per parameter.
This directly contradicted the prior field consensus. GPT-3 (175B parameters) was trained on 300B tokens — about 1.7 tokens per parameter. Chinchilla (70B parameters) trained on 1.4T tokens achieved better performance than Gopher (280B) despite having far fewer parameters. The lesson: data quantity and model size must scale together, not size alone.
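The arithmetic behind these comparisons can be checked with the common approximation that training costs about 6 floating-point operations per parameter per token. That approximation, the function names, and the use of exactly 20 tokens per parameter as the target ratio are assumptions of this sketch, not the paper's fitted coefficients.

```python
import math

def training_flops(params, tokens):
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens

def compute_optimal(flops, tokens_per_param=20):
    """Solve C = 6 * N * D subject to D = tokens_per_param * N."""
    params = math.sqrt(flops / (6 * tokens_per_param))
    return params, tokens_per_param * params

gpt3_flops = training_flops(175e9, 300e9)
print(f"GPT-3 estimate: {gpt3_flops:.2e} FLOPs, {300e9 / 175e9:.1f} tokens per parameter")

n_opt, d_opt = compute_optimal(gpt3_flops)
print(f"same budget at 20 tokens/param: {n_opt / 1e9:.0f}B parameters, "
      f"{d_opt / 1e12:.2f}T tokens")
print(f"Chinchilla: {1.4e12 / 70e9:.0f} tokens per parameter")
```

The GPT-3 estimate lands close to the 3.14 × 10²³ figure quoted for its training run, and the same budget under the 20-tokens-per-parameter rule would buy a model roughly a third of the size trained on more than three times the data.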
The Chinchilla findings reverberated through 2023 training decisions. Meta's LLaMA models, Mistral, and others trained smaller models on far more tokens — producing models that are cheaper to serve while matching or exceeding larger, undertrained predecessors.
Scaling laws give organizations quantitative tools to forecast capability improvements before committing compute — enabling rational investment decisions. But emergent abilities create a complementary uncertainty: even a well-calibrated scaling law for perplexity may not predict when qualitatively new capabilities will appear. Both phenomena shape the economics and risk calculus of frontier LLM development.
Work through the implications of scaling laws and emergent capabilities. Ask the tutor to help you reason through compute allocation decisions, challenge you on Chinchilla vs. Kaplan trade-offs, or explain why emergent abilities matter for AI safety.