Synthetic Data and Self-Improvement · Introduction

AI Is Now Its Own Raw Material

Understanding the loop that may define the next generation of machine intelligence

In 1877, Thomas Edison recorded a human voice onto tinfoil for the first time. Within two decades, recorded sound had transformed from laboratory curiosity into an industrial medium — and by the 1920s, the music industry faced a question no one had anticipated: what happens when a technology can reproduce its own output indefinitely, without returning to the original source? Radio stations began broadcasting recordings of recordings. Fidelity degraded. New genres emerged partly from the artifacts of that degradation. The infrastructure of music was rebuilt around a feedback loop nobody had planned for.

Something structurally similar is happening now in machine learning. Since roughly 2022, AI developers have been training large models not just on text written by humans, but on text generated by earlier AI systems. In 2023, researchers at Stanford, the University of Edinburgh, and elsewhere began documenting what happens when this process is left unchecked — a phenomenon they termed model collapse: successive generations of models drifting away from the diversity of the original distribution. At the same time, OpenAI, Google DeepMind, and Anthropic were each developing methods to use AI-generated data deliberately and carefully, as a lever for capability improvement rather than a source of corruption.

This course examines that lever. We will cover what synthetic data actually is, how it is produced, where it demonstrably helps, where it provably degrades performance, and what the emergence of AI self-improvement loops means for anyone building, evaluating, or deploying AI systems. The field is moving fast; some specifics here will age. The conceptual framework will not.

If you finish every module, here's who you become:

You'll understand what synthetic data is, how it differs from real data, and why major AI labs now depend on it.
You'll be able to explain model collapse — what causes it, how researchers detect it, and how careful filtering prevents it.
You'll recognize the specific techniques — self-play, distillation, constitutional self-critique — used to turn AI output into training signal.
When someone claims AI can now improve itself indefinitely, you'll know exactly which parts of that claim are supported and which are not.
You'll think like someone who evaluates AI systems for a living, asking not just what a model was trained on, but what generated that data.
You'll leave with a conceptual framework stable enough to interpret research and product announcements as this field continues to move.
You're becoming the kind of practitioner who understands the feedback loops shaping the next generation of AI — not just the outputs they produce.

Synthetic Data and Self-Improvement · Lesson 1

What Synthetic Data Actually Is

Definitions, distinctions, and the spectrum from simulation to AI-generated text

If a model learns from data made by another model, is it still learning from the world — or from a reflection of itself?

In 2016, Waymo (then still operating as Google's self-driving car project) faced a problem that every autonomous vehicle team eventually hits: you cannot put a test car into every dangerous scenario you need it to survive. Head-on collisions, pedestrians darting from behind buses, black ice at 60 mph. Real crashes cost real lives. So Waymo's engineers built a simulator called Carcraft, which by 2017 was running 25,000 virtual cars simultaneously across digital replicas of real cities, including Austin, Texas. By 2020, Waymo had logged over 15 billion simulated miles. The cars that drove on real roads in Phoenix and San Francisco had already, in a meaningful sense, driven in a world that did not exist.

That is synthetic data at its most legible: a human-designed simulation producing labeled experiences that would be impossible, dangerous, or prohibitively expensive to collect in the real world. But the term now covers a much wider territory — and understanding that territory is the first task of this course.

A Working Definition

Synthetic data is any data that was generated by a process rather than directly measured from the phenomenon it represents. The process might be a physics simulator, a statistical model, a generative adversarial network, or a large language model. What makes data synthetic is not that it is fake — it may be highly accurate — but that it was produced rather than observed.

This definition matters because it separates synthetic data from two things people sometimes confuse it with. First, augmented data: taking a real image and flipping it horizontally is data augmentation, not synthetic data generation. The source is still a real photograph; you are just multiplying it. Second, mislabeled data: data that is real but incorrectly annotated is not synthetic, it is erroneous. Synthetic data has a known generative process — that is the property that makes it useful and also what makes it potentially dangerous.

Why This Distinction Matters

If you know the generative process, you can reason about what the synthetic data can and cannot represent. A simulator that never produces rain cannot teach a model to handle rain, no matter how many synthetic miles it logs. Knowing this lets engineers identify gaps deliberately, rather than discovering them in production.
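
The coverage argument can be made concrete with a small audit. The sketch below is a toy and not a description of Carcraft: `ToySimulator`, the weather names, and `coverage_gaps` are all hypothetical. It shows that a condition the generator never encodes stays absent however many scenarios are sampled.

```python
import random
from collections import Counter

# Hypothetical list of conditions a deployed model must handle.
REQUIRED_WEATHER = {"clear", "fog", "rain", "snow"}

class ToySimulator:
    """Toy stand-in for a driving simulator; it only knows the weather it was built with."""

    def __init__(self, modeled_weather, seed=0):
        self.modeled_weather = sorted(modeled_weather)
        self.rng = random.Random(seed)

    def sample_scenario(self):
        # The generative process can only emit conditions its designers encoded.
        return {"weather": self.rng.choice(self.modeled_weather),
                "speed_mph": self.rng.randint(15, 70)}

def coverage_gaps(scenarios, required):
    """Return required conditions that never appear in the generated data."""
    seen = Counter(s["weather"] for s in scenarios)
    return sorted(c for c in required if seen[c] == 0)

sim = ToySimulator({"clear", "fog"})
scenarios = [sim.sample_scenario() for _ in range(100_000)]
gaps = coverage_gaps(scenarios, REQUIRED_WEATHER)
print(gaps)  # ['rain', 'snow']
```

Raising the sample count from 100,000 to any larger number leaves `gaps` unchanged, which is the point: volume does not buy coverage.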

Three Families of Synthetic Data

It is useful to distinguish three broad families, because they differ in how they are produced, what they are good for, and what their failure modes look like.

Simulation-Based: Environments governed by physical or logical rules produce labeled outputs. Waymo's Carcraft belongs here. So does DeepMind's use of StarCraft II as a training environment for AlphaStar in 2019. The ground truth is the simulator's own rules, which are human-authored and auditable.

Statistical / Generative Model: A model trained on real data generates new samples that preserve statistical properties of the original. Early medical imaging work used GANs (generative adversarial networks) to produce synthetic MRI scans for training classifiers when patient data was scarce, a technique documented extensively by researchers at NYU Langone in 2018–2019.

LLM-Generated Text and Reasoning: A large language model produces text (answers, reasoning chains, conversations) that is then used to train the same or a different model. This is the newest and most discussed family, and the one at the center of current debates about AI self-improvement. It is the family this course focuses on most heavily.

The Provenance Problem

Every useful property of synthetic data depends on being able to answer one question: where did this come from, and what did the generative process assume? A simulator that models friction incorrectly produces consistently wrong training signal. An LLM that tends to produce overconfident answers produces a training set full of overconfident answers. The errors are coherent and systematic, which makes them harder to catch than random noise.
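
One way to keep the provenance question answerable is to store it next to the data. The following is a minimal sketch under assumed names: the `Provenance` record, its fields, and `synthetic_depth` are hypothetical and do not correspond to any standard schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance record for one dataset."""
    name: str
    generator: str                 # e.g. "human-collected", "simulator", "llm"
    known_assumptions: tuple = ()  # what the generative process assumes or omits
    parents: tuple = ()            # records for the data the generator learned from

def synthetic_depth(p):
    """Count generations of generated data between p and directly observed data."""
    if p.generator == "human-collected":
        return 0
    parent_depth = max((synthetic_depth(q) for q in p.parents), default=0)
    return 1 + parent_depth

web = Provenance("web_text", "human-collected")
gen1 = Provenance("llm_answers_v1", "llm",
                  ("tends toward overconfident answers",), (web,))
gen2 = Provenance("llm_answers_v2", "llm", ("inherits v1 style",), (gen1,))

print(synthetic_depth(web), synthetic_depth(gen1), synthetic_depth(gen2))  # 0 1 2
```

A depth of 2 or more is a prompt to ask which assumptions were inherited from each earlier generator, since those errors will be coherent rather than random.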

Researchers at the University of Toronto and Cambridge demonstrated this in a 2024 paper on iterative model training. When models were trained on their own outputs without careful filtering, errors compounded across generations in predictable ways — specifically, low-probability events in the original distribution were underrepresented in the synthetic data, and models trained on that data became progressively less capable of handling unusual inputs. They called this the "tails disappear" problem. The tails — rare but real events — are exactly what you most need a robust model to handle.
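
The mechanism can be reproduced in miniature. The sketch below is not the paper's experiment: it assumes a four-category "real" distribution and replaces model training with simple frequency counting. Each generation samples from the previous estimate and refits on those samples alone, so a rare category that draws zero samples gets probability zero and can never return.

```python
import random
from collections import Counter

def refit(probs, n_samples, rng):
    """Sample from the current model, then re-estimate it from those samples only."""
    cats = list(probs)
    draws = rng.choices(cats, weights=[probs[c] for c in cats], k=n_samples)
    counts = Counter(draws)
    return {c: counts[c] / n_samples for c in cats}

def run_generations(initial, n_generations, n_samples, seed=0):
    rng = random.Random(seed)
    history = [dict(initial)]
    for _ in range(n_generations):
        history.append(refit(history[-1], n_samples, rng))
    return history

# Hypothetical "real" distribution: one common event and three rare ones.
real = {"common": 0.94, "rare_a": 0.03, "rare_b": 0.02, "rare_c": 0.01}
history = run_generations(real, n_generations=50, n_samples=200)

for gen in (0, 10, 25, 50):
    support = [c for c, p in history[gen].items() if p > 0]
    print(gen, support)
```

Which rare categories vanish, and when, depends on the seed and the sample size. What does not depend on them is the direction: the set of categories with nonzero probability can shrink from one generation to the next but cannot grow.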

Core Concept

Synthetic data is most powerful when it fills a specific gap in real data (dangerous scenarios, rare medical conditions, low-resource languages) and most dangerous when it is used as a wholesale replacement for real data without understanding what the generative process cannot represent.

Scale and Prevalence in 2024

By 2024, synthetic data had moved from specialized technique to routine ingredient. Meta's Llama 3 technical report, published in July 2024 (the first Llama 3 models had shipped that April), described using synthetic data generated by earlier Llama models to create instruction-following training sets. Google's Gemini technical report discussed synthetic chain-of-thought data. Microsoft's Phi series of small language models (Phi-1 and Phi-2 in 2023, Phi-3 in 2024) were explicitly designed around the insight that a small model trained on very high-quality curated and synthetic data could match or exceed larger models trained on raw web text. Phi-1, a 1.3-billion-parameter model, matched the code-generation performance of GPT-3.5 on several benchmarks after training on a small corpus of filtered and synthetically generated "textbook-quality" code data.

This is not a niche research curiosity. If you interact with any major AI product today, it was almost certainly trained, at some stage, on data that was itself generated by an AI system. The question is no longer whether synthetic data is part of AI training — it is. The question is how to use it well.

Lesson 1 Quiz

What Synthetic Data Actually Is · 5 questions

1. Which property most precisely distinguishes synthetic data from real data?

Correct. Synthetic data is defined by its production process — it is generated rather than observed — not by its accuracy or usability.

Not quite. Synthetic data is defined by its provenance (how it was made), not by its accuracy. High-quality synthetic data can be extremely accurate.

2. Waymo's Carcraft simulator had logged over 15 billion virtual miles by 2020. What is the primary limitation this approach cannot overcome by adding more simulation?

Correct. A simulator's coverage is bounded by what its designers modeled. Phenomena outside those rules — weather types, road textures, human behaviors — simply do not appear.

The core limitation is coverage, not speed. No matter how many miles are simulated, events the simulator's rules never encode cannot appear in the data.

3. The 2024 "tails disappear" research finding describes what specific failure mode?

Correct. Low-probability events in the original distribution are underrepresented in synthetic outputs, and this gap widens across training generations — exactly when robustness to edge cases matters most.

The "tails disappear" finding specifically concerns rare events. When models train on their own outputs, the low-probability tails of the real data distribution shrink each generation.

4. Microsoft's Phi-1 model (2023) demonstrated which notable result using synthetic data?

Correct. Phi-1's result suggested that data quality — specifically, synthetically generated "textbook-quality" coding problems — could substitute for massive scale in raw web-crawled data.

Phi-1 was deliberately small (1.3B parameters) and achieved strong code results primarily from synthetic data quality, challenging the assumption that more parameters and more raw data are the only path to capability.

5. Which of the following is NOT a family of synthetic data as described in this lesson?

Correct. Data augmentation transforms real data but does not generate it from a separate process. The source is still an original observation — only its representation is changed.

The lesson specifically distinguishes synthetic data from data augmentation. Flipping a real image still originates from a real observation; it is augmented, not synthetic.

Lab 1 — Classifying Synthetic Data Sources

Interactive practice · Lesson 1 concepts

Your Task

You will describe a data source to the AI assistant and ask it to classify the data as simulation-based, generative model-based, LLM-generated, or real/augmented. Then push back, ask about failure modes, or propose edge cases. Complete at least 3 exchanges to finish this lab.

Try: "A hospital uses a GAN to create synthetic chest X-rays for training a pneumonia classifier. What family of synthetic data is this, and what are the coverage risks?" — or invent your own scenario.

Synthetic Data and Self-Improvement · Lesson 2

Why It Exists: The Data Scarcity Problem

The specific gaps that make real data insufficient — and why AI developers turned to synthetic alternatives

When does the cost or impossibility of collecting real data justify teaching a model from a world that was built rather than found?

In 2018, the FDA cleared IDx-DR, the first AI system authorized to diagnose a medical condition without a clinician reviewing the result. It detected diabetic retinopathy from retinal photographs. The clearance was based on a clinical trial of 900 patients. But the system had been trained on over 100,000 retinal images collected across multiple continents over several years. Getting those images required ethics approvals, patient consent, specialist grading of each image, and coordination across dozens of medical centers. Every labeled image cost real time, real money, and a real person's involvement. For common conditions with large patient populations, this pipeline, expensive as it is, can be made to work. For rare diseases, pediatric populations, or conditions in low-income countries where imaging infrastructure is sparse, it cannot.

Data scarcity is not a temporary engineering problem that more compute will solve. It is structural. And it takes several distinct forms.

Four Structural Scarcity Types

Understanding why real data is insufficient requires distinguishing the specific mechanism causing the shortage, because different mechanisms call for different synthetic data strategies.

Rarity Scarcity: The real-world event happens infrequently. Aircraft engine failures, drug interactions in patients over 85, and tornado formation all occur, but not often enough to be collected at training-data scale. Simulation is the dominant tool here.

Ethical / Legal Scarcity: Collecting the data would harm people or violate the law. Examples include training a facial recognition system on surveillance footage collected without consent, or gathering labeled data about children's behavior. Here, synthetic data can serve as an ethical substitute, provided the generative process does not itself rely on the impermissible data.

Annotation Scarcity: The raw signal exists, but labeling it is prohibitively expensive or requires rare expertise. Medical imaging is the canonical case. Radiologists cost $400,000 a year; labeling a dataset of 500,000 CT scans at expert quality is not economically feasible for most organizations. Generative models and LLMs are used to produce pre-labeled data that reduces the expert annotation burden.

Distribution Scarcity: Data exists for a domain in general but not for a specific sub-distribution you need the model to handle. A speech recognition system trained primarily on American English accents has distribution scarcity for Scottish English, Nigerian English, and Cantonese-accented English. Synthetic speech generation can fill specific accent and dialect gaps; companies like Respeecher and ElevenLabs have produced such datasets.

The Language Imbalance Problem

One of the clearest documented cases of distribution scarcity driving synthetic data use is in low-resource languages. Of the approximately 7,000 languages spoken today, fewer than 100 have substantial representation in the web-crawled corpora that train large language models. Swahili, with roughly 200 million speakers, has far less training data than Finnish, with 5 million. Yoruba, spoken by 50 million people in Nigeria and Benin, has a fraction of the text data available for Danish.

In 2023, Meta's NLLB (No Language Left Behind) project, along with academic research from Masakhane, a grassroots African NLP research community, began using translation models to generate synthetic parallel text for low-resource African languages. The generated text is imperfect, but it expands the training distribution in ways that meaningfully improve downstream performance. A 2023 paper from the Masakhane community documented that synthetic data augmentation improved machine translation quality for Wolof, Fon, and Igbo, languages where a few hundred thousand real parallel sentences exist rather than tens of millions.

The Tension

Synthetic data for low-resource languages is an equity argument as well as a technical one. But if the generative model was itself trained primarily on high-resource languages, the synthetic data it produces may import biases and errors from those languages into the low-resource training set — a form of linguistic colonialism through AI pipelines that researchers have documented and criticized.

The Instruction-Following Gap

The most commercially significant scarcity problem in 2023–2024 was annotation scarcity for instruction-following data — specifically, the high-quality human-written question-answer pairs needed to fine-tune a base language model into a useful assistant. OpenAI's InstructGPT paper (2022) showed that a relatively small number of expert-written demonstrations, used in RLHF (Reinforcement Learning from Human Feedback), dramatically improved model behavior. But producing those demonstrations at scale required thousands of hours of contractor labor.

The response was to use existing capable models to generate synthetic instruction data. Stanford's Alpaca project in March 2023 used GPT-3.5 (text-davinci-003) to generate 52,000 instruction-following examples from 175 human-written seed examples, at a cost of roughly $500. The resulting model, fine-tuned from LLaMA 7B, demonstrated instruction-following capability comparable to early GPT-3.5 on many tasks. This experiment popularized what researchers called "self-instruct" pipelines, and it demonstrated both the power and the risks of the approach, since Alpaca also inherited factual errors and biases from GPT-3.5's outputs.
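
The shape of such a pipeline can be sketched in a few lines. This is an illustration under stated assumptions, not Alpaca's code: `fake_generator` is a hypothetical stand-in for a call to a capable model, and the exact-duplicate check is a simplification of the similarity filtering that real pipelines apply.

```python
import random

def fake_generator(prompt, rng):
    """Hypothetical stand-in for a capable model; returns one new instruction."""
    verbs = ["Summarize", "Translate", "Explain", "Classify"]
    topics = ["a news article", "a recipe", "a legal clause", "a tweet"]
    return f"{rng.choice(verbs)} {rng.choice(topics)}."

def build_prompt(examples):
    """Show the generator a few existing instructions and ask for another."""
    shots = "\n".join(f"- {e}" for e in examples)
    return f"Here are some task instructions:\n{shots}\nWrite one new instruction."

def self_instruct(seeds, target_size, shots=3, seed=0, max_calls=10_000):
    rng = random.Random(seed)
    pool = list(seeds)                      # human-written seeds start the pool
    seen = {s.lower() for s in pool}
    calls = 0
    while len(pool) < target_size and calls < max_calls:
        calls += 1
        prompt = build_prompt(rng.sample(pool, k=min(shots, len(pool))))
        candidate = fake_generator(prompt, rng)
        if candidate.lower() not in seen:   # drop exact duplicates only
            seen.add(candidate.lower())
            pool.append(candidate)          # generated items become future prompts
    return pool

seeds = ["Summarize a news article.", "Explain a legal clause."]
dataset = self_instruct(seeds, target_size=10)
print(len(dataset), dataset[-1])
```

Note the loop structure: generated instructions are fed back in as prompt examples, so whatever the generator tends to produce is what the pool increasingly contains.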

The Core Trade-off

Synthetic data solves a real problem — data scarcity of several distinct types — at the cost of introducing a new problem: the synthetic data's quality ceiling is set by the generator, and the generator's errors are systematic rather than random. Random errors average out over enough examples. Systematic ones do not.
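
The difference between the two error types is easy to check numerically. The sketch below assumes a single quantity with true value 10.0, Gaussian label noise, and a fixed generator bias of 1.5; all of these numbers are illustrative.

```python
import random
import statistics

def noisy_labels(true_value, n, noise_sd, bias, seed):
    """Simulate n labels for one quantity: zero-mean noise plus a fixed generator bias."""
    rng = random.Random(seed)
    return [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]

TRUE_VALUE = 10.0
for n in (100, 10_000, 200_000):
    random_only = statistics.fmean(
        noisy_labels(TRUE_VALUE, n, noise_sd=2.0, bias=0.0, seed=1))
    systematic = statistics.fmean(
        noisy_labels(TRUE_VALUE, n, noise_sd=2.0, bias=1.5, seed=1))
    # Columns: sample size, error with random noise only, error with added bias.
    print(n, round(random_only - TRUE_VALUE, 3), round(systematic - TRUE_VALUE, 3))
```

As `n` grows, the first error column shrinks toward zero while the second stays near 1.5. More data from a biased generator produces a more confident estimate of the wrong value.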

Lesson 2 Quiz

Why Synthetic Data Exists · 5 questions

1. IDx-DR, the first FDA-cleared autonomous AI diagnostic system (2018), was trained on over 100,000 retinal images. What type of scarcity would make this training approach unworkable for rare diseases?

Correct. Rare diseases don't provide enough naturally occurring cases to build training sets at the scale IDx-DR required, regardless of annotation cost or legal status.

For rare diseases, the fundamental problem is that the condition occurs so infrequently that sufficient labeled examples cannot be gathered even with unlimited expert annotators and legal clearance.

2. The Masakhane research community used synthetic data to improve machine translation for languages including Wolof, Fon, and Igbo. What specific scarcity type does this address?

Correct. Wolof, Fon, and Igbo each have millions of speakers, but the web-crawled text that fills LLM training sets reflects internet usage patterns that heavily favor high-resource European languages.

Distribution scarcity is the issue here. These languages are widely spoken; what's missing is their representation in the datasets AI systems are trained on, not the languages themselves.

3. Stanford's Alpaca (2023) generated 52,000 instruction-following examples using GPT-3.5 from 175 human seed examples for approximately $500. What was the primary problem Alpaca's creators acknowledged with this approach?

Correct. Alpaca demonstrated the power of self-instruct pipelines but also made clear that errors in the generator propagate systematically into the student model — inherited rather than random.

The acknowledged problem was quality inheritance: GPT-3.5's factual errors and biases appeared in the synthetic training data and then in Alpaca's behavior. Systematic errors, not random ones.

4. Why is the distinction between random errors and systematic errors critical when evaluating synthetic training data quality?

Correct. A generator that consistently overestimates certainty will produce thousands of training examples all exhibiting that same overconfidence, making the trained model more overconfident — not less — as the dataset grows.

The asymmetry matters enormously. Random labeling errors average out. Systematic errors from the generative process are replicated across all generated examples and reinforce themselves during training.

5. The lesson warns that using a model trained primarily on high-resource languages to generate synthetic data for low-resource languages risks what specific harm?

Correct. This is what researchers have called a form of "linguistic colonialism" — the generator's assumptions about language structure and world knowledge, formed from high-resource corpora, get embedded into the synthetic data for minority languages.

The concern is structural: the generator's linguistic and conceptual assumptions, built from predominantly European-language data, can be imposed on the target language's training distribution through the synthetic examples it produces.

Lab 2 — Diagnosing Data Scarcity Types

Interactive practice · Lesson 2 concepts

Your Task

Present a real or hypothetical AI project to the assistant. Describe what data you would need to train it. The assistant will help you identify which of the four scarcity types you face and suggest what kind of synthetic data strategy would be most appropriate — and what risks that strategy carries. Aim for at least 3 substantive exchanges.

Try: "I want to train a model to detect early-stage pancreatic cancer from CT scans. What are my data challenges?" — or describe your own project.

Synthetic Data and Self-Improvement · Lesson 3

How Synthetic Data Is Made: The Key Methods

From GANs and diffusion models to chain-of-thought distillation and rule-based filtering

The method of generation is inseparable from the quality ceiling — so how exactly is synthetic training data produced?

In late 2023, Apple researchers quietly published a paper describing a technique called HUGS (Human Gaussian Splats) that could generate photorealistic animated human figures from a short video clip. The generated humans could be placed in arbitrary scenes, posed, lit differently, and used to train computer vision models without requiring a human subject to be filmed for every new example. The pipeline produced data about people from data about people, but with controlled variation that real video could not provide. Apple was not alone: Meta, Google, and a dozen academic groups were all publishing variations of the same basic idea that year. Generation methods had matured enough to produce training data that was, in several measurable respects, better than the real data it supplemented.

The methods underlying this moment are worth understanding in some detail — not to become engineers of these systems, but to understand what each approach can and cannot represent.

Generative Adversarial Networks (GANs)

Ian Goodfellow introduced GANs in 2014 at NeurIPS. The core idea: train two networks simultaneously — a generator that produces synthetic samples, and a discriminator that tries to distinguish generated from real. The generator improves by fooling the discriminator; the discriminator improves by catching the generator. At equilibrium, the generator produces samples the discriminator cannot distinguish from real ones.
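
The two-player game described above is usually written as a single minimax objective. In the notation of the original GAN paper, with $G$ the generator, $D$ the discriminator, $p_{\text{data}}$ the real data distribution, and $p_z$ the noise prior fed to the generator:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The first term rewards the discriminator for scoring real samples near 1, and the second rewards it for scoring generated samples near 0. At the equilibrium mentioned above, the generator's distribution matches $p_{\text{data}}$ and the best the discriminator can do is output $D(x) = 1/2$ everywhere.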

GANs became the dominant tool for synthetic image generation between roughly 2017 and 2021. NVIDIA's StyleGAN (2019) and StyleGAN2 (2020) produced photorealistic synthetic faces that were demonstrably indistinguishable from real photographs in human studies. For training data purposes, GANs were used extensively in medical imaging — generating synthetic MRI scans, dermatology images, and pathology slides to augment scarce real datasets.

The known failure mode of GANs is mode collapse: the generator learns to produce a narrow range of outputs that reliably fool the discriminator rather than covering the full diversity of the real distribution. A GAN trained on skin lesion images might produce highly realistic examples of one lesion type while entirely failing to generate rare variants — exactly the examples that matter most for a diagnostic classifier.
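
A simple check for this failure compares class shares in real and generated data. The sketch below is a hypothetical audit: it assumes every generated sample already carries a class label (for example from a separate classifier), and the `min_ratio` threshold of 0.2 is an arbitrary illustrative choice.

```python
from collections import Counter

def mode_coverage(real_labels, generated_labels, min_ratio=0.2):
    """Flag classes whose share of generated data falls far below their real share."""
    real = Counter(real_labels)
    gen = Counter(generated_labels)
    n_real, n_gen = len(real_labels), len(generated_labels)
    report = {}
    for cls, count in real.items():
        real_share = count / n_real
        gen_share = gen[cls] / n_gen        # Counter returns 0 for unseen classes
        report[cls] = "ok" if gen_share >= min_ratio * real_share else "under-covered"
    return report

# Illustrative counts only: a collapsed generator that mostly emits the common class.
real_labels = ["nevus"] * 900 + ["melanoma"] * 80 + ["rare_variant"] * 20
generated_labels = ["nevus"] * 990 + ["melanoma"] * 10
report = mode_coverage(real_labels, generated_labels)
print(report)
# {'nevus': 'ok', 'melanoma': 'under-covered', 'rare_variant': 'under-covered'}
```

Visual realism plays no part in this check. Every generated image could be flawless and the report would still flag the two classes the generator has stopped producing.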

Diffusion Models

By 2022, diffusion models had largely displaced GANs for image synthesis in research settings. OpenAI's DALL-E 2 (April 2022), Stability AI's Stable Diffusion (August 2022), and Google's Imagen (May 2022) were all diffusion-based. The core mechanism: train a model to reverse a process of gradually adding noise to an image, learning to denoise at each step. Text conditioning is added so that the denoising process can be guided by a text prompt.
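
The noising half of that mechanism has a closed form that fits in a short sketch. The code below assumes a linear noise schedule with commonly used endpoint values and a four-number toy "image"; it shows only the forward process, and the denoising network that learns to reverse it is omitted.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances; the endpoints are a common default, assumed here."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative product of (1 - beta_t): the share of signal variance left at step t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def noise_sample(x0, t, abar, rng):
    """Closed-form forward step: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(abar[t]) * x + math.sqrt(1.0 - abar[t]) * e
          for x, e in zip(x0, eps)]
    return xt, eps  # a denoiser would be trained to predict eps from (xt, t)

T = 1000
abar = alpha_bars(linear_beta_schedule(T))
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]  # toy "image" of four pixels
for t in (0, 499, 999):
    xt, _ = noise_sample(x0, t, abar, rng)
    print(t, round(abar[t], 4), [round(v, 2) for v in xt])
```

At step 0 the sample is almost identical to `x0`; by the final step almost none of the original signal remains. Generation runs the learned reverse of this process, starting from pure noise.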

For synthetic training data, diffusion models offer better coverage of the distribution (less mode collapse), fine-grained control through text prompts, and the ability to generate labeled data directly — asking for "a chest X-ray showing early-stage pneumothorax in the left lobe" produces a labeled synthetic sample. Companies like Syntheticus and Synthesis AI built commercial products around this capability in 2022–2023, selling synthetic data pipelines to enterprise customers who could not use real patient or user data.

The DALL-E 3 Training Loop Concern

By late 2023, a documented concern in the research community was that diffusion models used for training data generation were themselves trained on data that included outputs of earlier diffusion models — creating multi-generation loops with unknown distributional effects. The provenance chain was becoming difficult to trace.

LLM-Based Synthetic Text: Chain-of-Thought Distillation

For language model training, the dominant method in 2023–2024 was using a larger, more capable model to generate training examples for a smaller model — a process called knowledge distillation when the goal is capability transfer, or self-instruct when the same model family generates its own training data.

A particularly important variant is chain-of-thought distillation. Rather than generating only final answers, a capable model is prompted to produce detailed step-by-step reasoning. These reasoning chains become training examples. Google's work on PaLM (2022) and the subsequent Flan series demonstrated that training on chain-of-thought examples improved reasoning performance on problems that require multi-step inference. OpenAI's GPT-4 technical report (March 2023) alluded to similar techniques without full disclosure.
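
A sketch of how reasoning chains become training records follows. Everything named here is hypothetical: `teacher_reason` stands in for a capable model prompted to reason step by step, and the answer check against a trusted reference is one possible filter, assumed for illustration rather than taken from the papers cited above.

```python
import re

def teacher_reason(question):
    """Hypothetical stand-in for a capable model prompted to reason step by step."""
    canned = {
        "What is 12 * 7?": "12 * 7 = 12 * 5 + 12 * 2 = 60 + 24 = 84. Answer: 84",
        # Deliberate arithmetic slip, to show what the filter catches.
        "What is 15 + 28?": "15 + 28 = 15 + 30 - 2 = 45 - 2 = 42. Answer: 42",
    }
    return canned[question]

def final_answer(chain):
    """Extract the integer after 'Answer:' at the end of a reasoning chain."""
    match = re.search(r"Answer:\s*(-?\d+)\s*$", chain)
    return int(match.group(1)) if match else None

def distill(questions_with_answers):
    """Keep reasoning chains whose final answer matches a trusted reference."""
    records = []
    for question, reference in questions_with_answers:
        chain = teacher_reason(question)
        if final_answer(chain) == reference:
            records.append({"prompt": question, "completion": chain})
    return records

records = distill([("What is 12 * 7?", 84), ("What is 15 + 28?", 43)])
print(len(records), records[0]["prompt"])  # 1 What is 12 * 7?
```

The student is trained on the full chain, not only the final number. The filter here works only because a reference answer exists; for open-ended questions without one, a teacher's mistaken reasoning would pass straight into the training set.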

The quality-control mechanism that separates good from bad synthetic text data is typically rule-based filtering: generated examples are scored by a separate model or a set of heuristics (length, format consistency, semantic similarity to seed examples, factual verification against a knowledge base) and low-quality examples are discarded. Microsoft's Phi team described their filtering pipeline in detail: they generated far more synthetic data than they used, then selected only the top fraction by quality score. The effective dataset for Phi-1 was roughly 1 billion tokens of highly filtered synthetic code, not the full generated corpus.
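
The generate-then-filter pattern described above can be sketched as a scoring function plus a cutoff. The three heuristics in `quality_score` are hypothetical placeholders and are not the Phi team's criteria; a production pipeline would more likely use a trained quality classifier.

```python
def quality_score(example):
    """Hypothetical heuristic score in [0, 1] for a generated code example."""
    text = example["text"]
    checks = [
        20 <= len(text) <= 400,                     # plausible length
        "return" in text,                           # the function returns something
        "TODO" not in text and "???" not in text,   # no obvious placeholders
    ]
    return sum(checks) / len(checks)

def keep_top_fraction(examples, fraction):
    """Score every generated example and keep only the best-scoring share."""
    ranked = sorted(examples, key=quality_score, reverse=True)
    n_keep = max(1, int(len(ranked) * fraction))
    return ranked[:n_keep]

generated = [
    {"id": 1, "text": "def add(a, b):\n    return a + b"},
    {"id": 2, "text": "???"},
    {"id": 3, "text": "def f(x):\n    # TODO implement\n    return None"},
    {"id": 4, "text": "x = 1"},
]
kept = keep_top_fraction(generated, fraction=0.5)
print([e["id"] for e in kept])  # [1, 3]
```

Example 3 survives the cut despite its placeholder comment, because a fractional cutoff keeps the best of what was generated rather than everything above an absolute bar. Whether to filter by rank or by threshold is a design choice with real consequences for dataset quality.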

Self-Play: A model generates both sides of a training scenario, such as questions and answers, arguments and counterarguments, or code and tests. AlphaGo Zero (DeepMind, 2017) used self-play exclusively: no human game records, only games played against itself. It surpassed all previous Go-playing systems within 40 days.

Constitutional AI / RLAIF: Anthropic's Constitutional AI method (2022) had a language model critique and revise its own outputs against a written set of principles, generating preference data that replaced expensive human labeling in parts of the RLHF pipeline. RLAIF (Reinforcement Learning from AI Feedback) is the generalized version: AI preference labels substitute for human ones.
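
The AI-feedback idea can be illustrated with a toy judge. In the sketch below the "principles" are simple string checks and `ai_judge_score` is a hypothetical stand-in for a feedback model; a real system would ask a language model to compare responses against written principles, which this code does not attempt.

```python
from itertools import combinations

# Hypothetical principles expressed as simple checks, for illustration only.
PRINCIPLES = [
    ("avoids unsupported certainty", lambda r: "definitely" not in r.lower()),
    ("acknowledges limits", lambda r: "not sure" in r.lower() or "may" in r.lower()),
]

def ai_judge_score(response):
    """Stand-in for an AI feedback model: count the principles a response satisfies."""
    return sum(1 for _, check in PRINCIPLES if check(response))

def preference_pairs(prompt, responses):
    """Turn judged responses into (chosen, rejected) records for preference training."""
    pairs = []
    for a, b in combinations(responses, 2):
        score_a, score_b = ai_judge_score(a), ai_judge_score(b)
        if score_a == score_b:
            continue  # a tie carries no preference signal
        chosen, rejected = (a, b) if score_a > score_b else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

responses = [
    "This will definitely cure the condition.",
    "This may help, but I am not sure it applies to your case.",
    "It works.",
]
pairs = preference_pairs("Does this treatment work?", responses)
print(len(pairs), pairs[0]["chosen"])
```

No human labeled any of the three resulting pairs. The preference data is exactly as good as the judge and the principles it applies, which is the same ceiling argument that applies to every other generation method.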

Method Shapes the Ceiling

Each generation method has a characteristic ceiling and a characteristic failure mode. GANs collapse on rare variants. Diffusion models can hallucinate fine-grained domain knowledge. LLM-generated text inherits the generator's factual errors and stylistic tendencies. Understanding which method produced a dataset tells you where to look for its weaknesses.

Lesson 3 Quiz

How Synthetic Data Is Made · 5 questions

1. What is "mode collapse" in the context of GANs, and why does it matter for synthetic training data?

Correct. Mode collapse means the GAN produces convincing but narrow outputs — dangerous for training data because the rare variants (often the most diagnostically important) are systematically absent.

Mode collapse is about distributional coverage, not system failure. The GAN keeps producing outputs, but those outputs cluster in a narrow region of the real distribution rather than spanning it.

2. AlphaGo Zero (2017) used self-play exclusively to learn Go. What made this significant for synthetic data methodology?

Correct. AlphaGo Zero's result was striking evidence that synthetic self-generated data, in domains with a clear verifiable signal, could entirely replace human-produced training data and produce superior performance.

AlphaGo Zero was significant because it used only self-generated game data — no human records — and surpassed all previous systems. This validated self-play as a scalable training paradigm in well-defined game environments.

3. Microsoft's Phi-1 team generated far more synthetic data than they used for training. What was the mechanism that made their approach effective?

Correct. The insight was that generation and selection are separate steps. Generate broadly, then filter aggressively for quality. The effective training set was approximately 1 billion highly filtered tokens, not the full raw generation.

The Phi approach was generate-then-filter: produce far more synthetic examples than needed, apply quality filters (length, format, semantic coherence, etc.), and train only on the filtered subset.

4. Anthropic's Constitutional AI method (2022) addressed which specific bottleneck in traditional RLHF?

Correct. RLHF requires humans to label which of two model outputs they prefer, across thousands of examples. Constitutional AI and RLAIF use an AI system to make those preference judgments, reducing but not eliminating the human-labeling bottleneck.

The bottleneck Constitutional AI addressed was human labeling cost. Human preference pairs are expensive to collect; having an AI evaluate outputs against a written set of principles generates preference data at much lower cost.

5. Why did researchers raise concerns about diffusion models being used to generate training data for newer diffusion models by late 2023?

Correct. When model outputs train newer models whose outputs train even newer models, the distribution effects of each generation compound in ways that are difficult to audit — a version of the "tails disappear" problem applied to image generation.

The concern was provenance and distributional compounding. Each generation of synthetic data carries artifacts from the previous generator, and multi-hop loops make it hard to identify what those artifacts are or how they accumulate.

Lab 3 — Evaluating Generation Methods

Interactive practice · Lesson 3 concepts

Your Task

You are advising a team that needs to produce synthetic training data for a specific task. Describe the task to the assistant, and it will help you choose among GANs, diffusion models, LLM chain-of-thought distillation, and self-play, and explain what failure modes each option carries for your use case. Minimum 3 exchanges.

Try: "We need synthetic driving data for a vehicle navigating construction zones at night. What generation method should we use?" — or describe a different task.

Lab Assistant

Tell me about the task you need synthetic training data for. I'll help you think through which generation method — GAN, diffusion model, LLM distillation, or self-play — fits best, and what failure modes each option brings to your specific problem. What are you building?

Synthetic Data and Self-Improvement · Lesson 4

What Synthetic Data Can and Cannot Replace

Documented successes, documented failures, and the honest limits of the technique

Given real deployments, where does synthetic data help — and where does relying on it cause demonstrable harm?

In 2018, researchers at Google and Verily published results showing that an AI system trained on retinal fundus images could predict cardiovascular risk factors (age, sex, blood pressure, smoking status) from photographs of the eye alone. The model had been trained on real patient data. Later attempts to replicate and extend this work using synthetic retinal images produced a consistent finding: models trained on synthetic fundus images performed well on other synthetic fundus images but showed measurable performance drops on real patient photographs. The gap was invisible in synthetic-only evaluations yet large enough to matter clinically. The problem was not that the synthetic data was low-quality; it was visually convincing. The problem was that the distribution of subtle variations in real patient eyes did not fully match the distribution of subtle variations in the generated images.

This pattern — synthetic data that performs well in closed-loop evaluation and underperforms on the real distribution — is documented across enough domains to be treated as a baseline expectation rather than an exception.

Where Synthetic Data Demonstrably Helps

The success cases share a common structure: synthetic data fills a specific, bounded gap in a real-data training set, while real data anchors the overall distribution.

Robotics and physical simulation. OpenAI's Dactyl project (2019) trained a robotic hand to solve a Rubik's Cube using policies learned almost entirely in simulation and then deployed on physical hardware, an approach known as sim-to-real transfer. The key innovation was domain randomization: randomizing the physical parameters of the simulation (friction, lighting, object weight) so extensively that the real world appears as just another instance of the randomized distribution. The policy transferred to the physical hand without being retrained on real-world data, although OpenAI reported that the hand did not succeed on every attempt and struggled most with the hardest scrambles. It remains a documented success of large-scale simulation-to-reality transfer.
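Domain randomization is easier to see in code. The sketch below is illustrative only: the parameter names follow the three examples in the text, but the ranges and the "real world" measurement are assumptions made up for this example, not values from the Dactyl project.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    friction: float
    light_intensity: float
    object_mass_kg: float


# Assumed ranges, chosen for illustration only.
RANGES = {
    "friction": (0.2, 1.5),
    "light_intensity": (0.1, 2.0),
    "object_mass_kg": (0.05, 0.5),
}


def sample_params(rng: random.Random) -> SimParams:
    """Draw one randomized simulation configuration (one per training episode)."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


def covers(real: SimParams) -> bool:
    """True if the real-world values fall inside every randomized range."""
    return all(lo <= getattr(real, name) <= hi for name, (lo, hi) in RANGES.items())


rng = random.Random(0)
episodes = [sample_params(rng) for _ in range(1000)]

# Hypothetical measurement of the physical setup.
real_world = SimParams(friction=0.8, light_intensity=1.0, object_mass_kg=0.09)
print(covers(real_world))
```

The `covers` check makes the limitation explicit: transfer is only plausible for real conditions that fall inside the ranges the designers chose to randomize.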

Code generation. The Phi series results are well-documented: synthetic "textbook-quality" coding problems, filtered aggressively, produced models that scored well above what their parameter counts would predict on coding benchmarks. The domain has a critical property that enables this: code is executable. Synthetic code examples can be verified automatically by running them against test cases. Quality filtering can therefore be largely automated, and it is reliable to the extent that the tests cover the intended behavior, unlike synthetic natural language, where correctness is harder to verify.
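Execution-based filtering can be sketched as below. The candidate snippets and test cases are hypothetical, and the helper `passes_tests` is an illustrative name rather than part of any published pipeline. Running untrusted generated code with `exec` is unsafe outside a sandbox; it is used here only to keep the example short.

```python
def passes_tests(source: str, func_name: str, cases: list[tuple[tuple, object]]) -> bool:
    """Run a generated snippet and check it against input/expected-output pairs."""
    namespace: dict = {}
    try:
        # WARNING: a real pipeline would run this in an isolated sandbox.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


# Hypothetical generator output for the task "write add(a, b)".
candidates = [
    "def add(a, b):\n    return a + b",   # correct
    "def add(a, b):\n    return a - b",   # runs, but gives wrong answers
    "def add(a, b)\n    return a + b",    # does not parse
]
cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

verified = [src for src in candidates if passes_tests(src, "add", cases)]
print(f"{len(verified)} of {len(candidates)} candidates passed")
```

The filter is only as strong as its test cases: a snippet that passes these three checks could still be wrong on inputs the cases never exercise.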

Rare event augmentation. Tesla's Autopilot team has described using their neural network rendering pipeline to generate synthetic versions of edge-case scenarios encountered in real-world fleet data — a pedestrian in an unexpected location, unusual lane markings — and using those synthetic examples to patch model weaknesses. The synthetic data supplements real data at the specific distribution gaps, rather than replacing it wholesale.

Where Synthetic Data Demonstrably Fails

The failure cases also share a structure: synthetic data used as a wholesale replacement for real data, without accounting for the gap between the generative process and the real distribution.

Medical imaging without real-data anchoring. A 2022 systematic review in Nature Machine Intelligence analyzed 41 studies using GAN-generated medical images as training data. Of these, 28 studies evaluated only on synthetic or in-distribution test data. When real patient data was used for evaluation, performance gaps appeared in 19 of 28 cases. The review's conclusion was pointed: "Synthetic data augmentation is frequently evaluated in a manner that does not expose the distribution gap that appears in real deployment."

Factual accuracy in LLM fine-tuning. A documented failure mode in the Alpaca-style self-instruct pipeline was factual error propagation. When GPT-3.5 generated incorrect information in a training example — wrong dates, fabricated citations, incorrect scientific claims — those errors appeared in the fine-tuned model's outputs. Several teams published results in 2023 showing that models fine-tuned on unfiltered LLM-generated instruction data showed increased hallucination rates compared to the base model on factual tasks, even when instruction-following ability improved. Better at following instructions; worse at being correct.

Fairness and demographic coverage. Synthetic face generation systems have been documented to underrepresent darker skin tones and non-Western facial features, replicating biases present in the real datasets the generators were trained on. A 2023 study from the University of Maryland found that computer vision models trained on DALL-E 2 generated person images showed larger demographic performance gaps than models trained on curated real datasets. The synthetic data did not just fail to fix the bias; in some cases it amplified the bias, because the generator's own biases were systematically present across the generated examples.

The Evaluation Trap

Perhaps the most persistent practical problem with synthetic data is that evaluation on synthetic held-out data gives misleadingly positive results. A model trained on synthetic chest X-rays, evaluated on synthetic chest X-rays, will appear to perform well — even if its performance on real patient images is clinically unacceptable. Real-data evaluation is non-negotiable if the model will ever encounter real data.
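A toy simulation makes the trap concrete. Everything here is an assumption chosen for illustration: the data is one-dimensional, the "generator" exaggerates the separation between classes compared with the "real" distribution, and the classifier is a simple threshold. None of it models an actual medical imaging system.

```python
import random


def sample(rng: random.Random, n: int, mean0: float, mean1: float, std: float):
    """Return n labeled 1-D points per class as (feature, label) pairs."""
    data = [(rng.gauss(mean0, std), 0) for _ in range(n)]
    data += [(rng.gauss(mean1, std), 1) for _ in range(n)]
    return data


def fit_threshold(train) -> float:
    """Place the decision threshold midway between the two class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2


def accuracy(threshold: float, data) -> float:
    """Fraction of points whose predicted class (x > threshold) matches the label."""
    correct = sum((x > threshold) == bool(y) for x, y in data)
    return correct / len(data)


rng = random.Random(42)
# Assumed generator: classes are cleaner and further apart than in reality.
synthetic_train = sample(rng, 2000, mean0=0.0, mean1=4.0, std=0.5)
synthetic_test = sample(rng, 2000, mean0=0.0, mean1=4.0, std=0.5)
# Assumed real distribution: classes overlap more.
real_test = sample(rng, 2000, mean0=0.0, mean1=2.0, std=1.0)

threshold = fit_threshold(synthetic_train)
syn_acc = accuracy(threshold, synthetic_test)
real_acc = accuracy(threshold, real_test)
print(f"synthetic test accuracy: {syn_acc:.3f}")
print(f"real test accuracy:      {real_acc:.3f}")
```

The synthetic held-out score looks excellent because it shares the generator's assumptions. Only the real test set exposes the gap.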

The Honest Summary

Synthetic data is a powerful technique for bounded, well-defined gaps in real training data, particularly when: the generative process is auditable and its assumptions are understood; a verifiable ground truth exists (executable code, physics simulation with known parameters); and real data is used for both anchoring and final evaluation.

It is a risky technique when: it wholesale replaces real data; the evaluation loop is closed (synthetic train, synthetic test); the generative process encodes biases that are invisible in closed-loop evaluation; or the deployment domain has distributional properties the generator cannot represent.
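The conditions above can be turned into a rough audit checklist. This is a sketch, not a validated rubric: the `Proposal` fields and the wording of each flag are illustrative assumptions that paraphrase this lesson's summary.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    replaces_real_data: bool          # synthetic data is the whole training set
    evaluates_on_real_data: bool      # final evaluation uses real held-out data
    generator_audited: bool           # generator assumptions and biases are understood
    has_verifiable_ground_truth: bool # e.g. executable code or known physics parameters


def audit(p: Proposal) -> list[str]:
    """Return the risk flags a proposal raises; an empty list means none were found."""
    flags = []
    if p.replaces_real_data:
        flags.append("wholesale replacement of real data")
    if not p.evaluates_on_real_data:
        flags.append("closed evaluation loop (synthetic train, synthetic test)")
    if not p.generator_audited:
        flags.append("generator assumptions and biases not audited")
    if not p.has_verifiable_ground_truth:
        flags.append("no verifiable ground truth for filtering")
    return flags


# Hypothetical proposal: train and evaluate entirely on LLM-generated records.
risky = Proposal(
    replaces_real_data=True,
    evaluates_on_real_data=False,
    generator_audited=False,
    has_verifiable_ground_truth=False,
)
for flag in audit(risky):
    print("-", flag)
```

An empty result does not prove a proposal is safe; it only means none of the failure patterns named in this lesson were detected.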

The next module examines how these properties interact with the specific case of AI self-improvement — models that use their own outputs, or the outputs of peer models, as the primary training signal. That is where the stakes become highest and the failure modes most consequential.

Module 1 Core Takeaway

Synthetic data is not a substitute for understanding your data distribution. It is a tool for extending coverage within a distribution you already partially understand. The generative process determines the ceiling. The evaluation method determines whether you can see the ceiling.

Lesson 4 Quiz

What Synthetic Data Can and Cannot Replace · 5 questions

1. OpenAI's Dactyl project (2019) used "domain randomization" for Sim-to-Real transfer. What is domain randomization?

Correct. By randomizing friction, lighting, object weight, and other parameters across wide ranges, the simulation makes the real world's specific values just one sample from the distribution the model has already been trained on.

Domain randomization is about simulation parameter diversity. If you vary simulation parameters broadly enough, the real world — with its fixed actual parameters — falls within the trained distribution.

2. The 2022 Nature Machine Intelligence systematic review of GAN-generated medical images found that synthetic data performance gaps were often invisible during development. Why?

Correct. Evaluating on synthetic data from the same generator as the training data produces optimistic results. The distribution gap only becomes visible when real patient images are used for evaluation.

The review specifically identified evaluation methodology as the issue. Closed-loop evaluation — synthetic train, synthetic test — consistently conceals the performance gap that emerges when real data is used.

3. Why does the executability of code make synthetic coding data particularly reliable compared to synthetic natural language data?

Correct. A generated code example can be executed against test cases. Either it produces the right output or it does not. This binary verifiability enables automated filtering that has no equivalent for factual prose.

The key property is automated verification. You can run code and check the output. You cannot run a factual sentence and automatically check whether it is true — which is why synthetic code pipelines can filter reliably but synthetic fact-heavy text cannot.

4. The 2023 University of Maryland study on DALL-E 2 generated training images found what result regarding demographic bias?

Correct. The generator's biases (underrepresentation of darker skin tones and non-Western features) ran through the whole synthetic dataset, creating a training set where those biases were systematic rather than statistical noise. The result was amplification, not mitigation.

The finding was that using a biased generator doesn't average out its biases; it replicates them coherently across the entire synthetic dataset. Models trained on that data can end up more biased than if trained on curated real data.

5. According to this lesson's summary, under what condition is synthetic data a genuinely powerful training tool?

Correct. The success pattern is: specific gap-filling, not wholesale replacement; auditable generation; and real-data evaluation. These conditions characterize Dactyl, Phi, and Tesla's edge-case augmentation — the documented success cases.

Synthetic data works best as a supplement, not a replacement. The conditions for success are bounded scope, auditable generation, and real-data evaluation. When any of those three conditions is missing, documented failure modes emerge.

Lab 4 — Audit a Synthetic Data Proposal

Interactive practice · Lesson 4 concepts

Your Task

You have received a project proposal that relies heavily on synthetic data. Describe the proposal to the assistant, and together you will audit it: identify the scarcity type being addressed, the generation method proposed, the evaluation plan, and any failure modes the proposal may have missed. This lab is most useful if you bring a real scenario you have encountered or are considering.

Try: "Our team wants to train a fraud detection model using LLM-generated synthetic transaction records because we can't access real customer data. The plan is to train and evaluate entirely on synthetic data before deployment." — What's wrong with this plan?

Lab Assistant

Share a synthetic data proposal you want to audit — real or hypothetical. I'll help you evaluate the scarcity type being addressed, whether the proposed generation method fits the problem, what the evaluation plan reveals or conceals, and what failure modes the proposal may be walking into. What's the proposal?

Module 1 Test

What Synthetic Data Is · 15 questions · Pass at 80%

1. The most precise definition of synthetic data is data that was:

Correct. The defining property is provenance (produced rather than observed), not quality, generating technology, or intent.

Synthetic data is defined by how it was made, not by quality or purpose. A physics simulator, a GAN, and an LLM all produce synthetic data by this definition.

2. Data augmentation (e.g., flipping a real photograph horizontally) is NOT considered synthetic data because:

Correct. Augmentation transforms real data. Synthetic generation produces new data from a model of the phenomenon. The distinction is whether the underlying observation is real.

The distinction is the origin: augmentation modifies a real observation; synthetic generation produces a new sample from a generative model. The source matters, not the technique used.

3. Waymo's Carcraft simulator logged over 15 billion virtual miles by 2020. The fundamental limitation of this approach is:

Correct. Coverage is bounded by the designers' modeling assumptions. Events outside those assumptions are invisible to the simulation however many iterations are run.

The limitation is coverage, not fidelity or speed. A scenario the engineers never encoded simply cannot appear in any simulated data, no matter how extensive the simulation.

4. "Rarity scarcity" in the context of training data refers to:

Correct. Rarity scarcity is about event frequency — aircraft failures, rare drug interactions, extreme weather events — not about annotation cost or language size.

Rarity scarcity is specifically about frequency of occurrence. The real-world event happens too rarely to be observed and collected at the scale needed for training, regardless of annotation cost.

5. Stanford's Alpaca (March 2023) used GPT-3.5 to generate 52,000 instruction-following examples from 175 seed examples. The documented flaw in this approach was:

Correct. Alpaca demonstrated both the power and the flaw of self-instruct pipelines: the generator's errors are not random noise but systematic properties that propagate coherently into the trained model.

The documented problem was quality inheritance. GPT-3.5's factual errors appeared in the synthetic training data and were then reinforced in the Alpaca model — a systematic flaw, not a random one.

6. The 2024 "tails disappear" research finding states that iterative training on model-generated outputs leads to:

Correct. Each generation of synthetic training data slightly undersamples the tails of the real distribution. Over multiple training loops, those tails effectively disappear — and the model loses robustness on the edge cases that matter most.

The tails disappear finding is specifically about distributional coverage. Low-probability events shrink with each self-training generation, producing models that are increasingly fragile on unusual inputs.
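A small simulation shows the mechanism. It rests on a loud assumption: each "model" is a Gaussian fitted to the previous generation's data, and it under-samples its own tails by never emitting points more than two standard deviations from the mean. That clipping rule is a stand-in chosen for illustration, not the setup used in the published research.

```python
import random
from statistics import mean, pstdev


def next_generation(rng: random.Random, data: list[float], n: int,
                    clip_sigma: float = 2.0) -> list[float]:
    """Fit a Gaussian to data, then sample n points while dropping its tails."""
    mu, sigma = mean(data), pstdev(data)
    out: list[float] = []
    while len(out) < n:
        x = rng.gauss(mu, sigma)
        # Assumed under-sampling: the generator never emits rare, extreme values.
        if abs(x - mu) <= clip_sigma * sigma:
            out.append(x)
    return out


def tail_fraction(data: list[float], cutoff: float = 2.0) -> float:
    """Share of points beyond the cutoff, measured on the ORIGINAL scale."""
    return sum(abs(x) > cutoff for x in data) / len(data)


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # generation 0: "real" data
history = [(pstdev(data), tail_fraction(data))]

for _ in range(5):
    data = next_generation(rng, data, 5000)
    history.append((pstdev(data), tail_fraction(data)))

for gen, (spread, tails) in enumerate(history):
    print(f"generation {gen}: std={spread:.3f}, tail fraction={tails:.4f}")
```

Each generation looks reasonable compared with the one before it. The loss only becomes obvious when a late generation is compared against the original data.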

7. Microsoft's Phi-1 model achieved strong code generation results because it was trained on:

Correct. Phi-1's result challenged the assumption that more parameters and more raw data are necessary for capability. On coding benchmarks, high-quality, filtered synthetic data could partly substitute for scale.

Phi-1 used a relatively small dataset — about 1 billion tokens — of synthetic coding problems filtered for quality. The key was data quality, not quantity or raw source.

8. Ian Goodfellow's GAN architecture (2014) consists of:

Correct. The adversarial dynamic — generator improving by fooling the discriminator, discriminator improving by catching the generator — is what gives GANs their generative power and their characteristic failure modes.

The GAN architecture is specifically adversarial: two networks with opposing objectives trained simultaneously. The generator tries to produce convincing fakes; the discriminator tries to identify them.

9. Anthropic's Constitutional AI (2022) addressed which bottleneck in RLHF?

Correct. RLHF requires human raters to compare model outputs at scale. Constitutional AI used an AI system to evaluate outputs against a written set of principles, generating preference data without requiring equivalent human effort for each comparison.

Constitutional AI specifically targeted the human labeling bottleneck in RLHF. Having an AI evaluate outputs against explicit principles reduces the need for human preference comparisons at every training step.

10. OpenAI's Dactyl project (2019) transferred robotic hand training from simulation to reality by using:

Correct. Rather than trying to simulate the real world exactly — an impossible goal — domain randomization makes the training distribution so wide that the real world becomes just one point within it.

Dactyl's key insight was domain randomization: not simulating reality perfectly, but simulating such a wide variety of physical parameters that real-world conditions become a subset of the training distribution.

11. The 2022 Nature Machine Intelligence systematic review of GAN-generated medical image studies found that performance gaps were often invisible because:

Correct. Closed-loop evaluation — synthetic train, synthetic test — produces optimistic results. The distribution gap only appears when real patient data is used for evaluation.

The review specifically identified the evaluation methodology as the problem. Evaluating on synthetic held-out data from the same generator conceals the gap that emerges when the model encounters real patient images.

12. AlphaGo Zero surpassed all previous Go systems using only self-play data (no human game records). This approach succeeded primarily because:

Correct. Go's binary outcome — win or lose — provides an unambiguous quality signal for every self-generated game. This automated verifiability is what makes self-play synthetic data so powerful in well-defined game environments.

The key enabling property is the verifiable signal. Every self-play game has a clear outcome. This makes quality filtering fully automated — unlike synthetic text or images, where correctness is far harder to verify automatically.

13. According to research published in 2023, models fine-tuned on unfiltered LLM-generated instruction data showed which counterintuitive result?

Correct. Better at following instructions; worse at being factually correct. The synthetic data improved one behavior by reinforcing patterns from the generator, while those same patterns encoded the generator's factual errors.

The documented finding was a divergence: synthetic instruction data improved instruction-following (the intended goal) while increasing hallucination rates on factual questions — because the generator's factual errors were systematically present in the training examples.

14. The Masakhane community's use of synthetic data for low-resource African languages illustrates which risk, in addition to its benefits?

Correct. If the generator was primarily trained on European languages, its structural assumptions about grammar, world knowledge, and pragmatics may be embedded in the synthetic text it produces for Wolof, Fon, or Igbo — a form of distributional bias imported through the pipeline.

The risk is that the generator's own biases — formed from high-resource language corpora — propagate into the synthetic training data for low-resource languages. Researchers have described this as a form of linguistic colonialism through AI pipelines.

15. According to this module, synthetic data is most defensibly used when it:

Correct. These three conditions — specific gap-filling rather than wholesale replacement, auditable generation, real-data evaluation — characterize the documented success cases and distinguish them from the documented failures.

The module's consistent finding is that synthetic data works as a supplement with known scope, not a replacement. And real-data evaluation is non-negotiable if the model will encounter real-world inputs in deployment.