In 1877, Thomas Edison recorded a human voice onto tinfoil for the first time. Within two decades, recorded sound had transformed from laboratory curiosity into an industrial medium — and by the 1920s, the music industry faced a question no one had anticipated: what happens when a technology can reproduce its own output indefinitely, without returning to the original source? Radio stations began broadcasting recordings of recordings. Fidelity degraded. New genres emerged partly from the artifacts of that degradation. The infrastructure of music was rebuilt around a feedback loop nobody had planned for.
Something structurally similar is happening now in machine learning. Since roughly 2022, AI developers have been training large models not just on text written by humans, but on text generated by earlier AI systems. In 2023, researchers at Stanford, the University of Edinburgh, and elsewhere began documenting what happens when this process is left unchecked — a phenomenon they termed model collapse: successive generations of models drifting away from the diversity of the original distribution. At the same time, OpenAI, Google DeepMind, and Anthropic were each developing methods to use AI-generated data deliberately and carefully, as a lever for capability improvement rather than a source of corruption.
This course examines that lever. We will cover what synthetic data actually is, how it is produced, where it demonstrably helps, where it provably degrades performance, and what the emergence of AI self-improvement loops means for anyone building, evaluating, or deploying AI systems. The field is moving fast; some specifics here will age. The conceptual framework will not.
In 2016, Waymo — then still operating as Google's self-driving car project — faced a problem that every autonomous vehicle team eventually hits: you cannot put a test car into every dangerous scenario you need it to survive. Head-on collisions, pedestrians darting from behind buses, black ice at 60 mph. Real crashes cost real lives. So Waymo's engineers built a simulator called Carcraft, which by 2017 was running 25,000 virtual cars simultaneously across a digital replica of Austin, Texas. By 2020, Waymo had logged over 15 billion simulated miles. The cars that drove on real roads in Phoenix and San Francisco had already, in a meaningful sense, driven in a world that did not exist.
That is synthetic data at its most legible: a human-designed simulation producing labeled experiences that would be impossible, dangerous, or prohibitively expensive to collect in the real world. But the term now covers a much wider territory — and understanding that territory is the first task of this course.
Synthetic data is any data that was generated by a process rather than directly measured from the phenomenon it represents. The process might be a physics simulator, a statistical model, a generative adversarial network, or a large language model. What makes data synthetic is not that it is fake — it may be highly accurate — but that it was produced rather than observed.
This definition matters because it separates synthetic data from two things people sometimes confuse it with. First, augmented data: taking a real image and flipping it horizontally is data augmentation, not synthetic data generation. The source is still a real photograph; you are just multiplying it. Second, mislabeled data: data that is real but incorrectly annotated is not synthetic, it is erroneous. Synthetic data has a known generative process — that is the property that makes it useful and also what makes it potentially dangerous.
If you know the generative process, you can reason about what the synthetic data can and cannot represent. A simulator that never produces rain cannot teach a model to handle rain, no matter how many synthetic miles it logs. Knowing this lets engineers identify gaps deliberately, rather than discovering them in production.
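One way to make this concrete is a coverage audit: list the conditions the deployment environment requires, list the conditions the generator can emit, and take the difference. The sketch below is illustrative only. The condition labels and the `coverage_gaps` helper are hypothetical and are not taken from any real simulator.

```python
def coverage_gaps(required_conditions, generator_conditions):
    """Return deployment conditions the generator can never produce, sorted."""
    return sorted(set(required_conditions) - set(generator_conditions))


# Hypothetical condition labels, for illustration only.
required = {"clear", "rain", "fog", "night"}
simulator_supports = {"clear", "night", "fog"}

gaps = coverage_gaps(required, simulator_supports)
print(gaps)  # ['rain']
```

An audit like this only works because the generative process is known. For data of unknown provenance, the second set cannot be written down.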
It is useful to distinguish three broad families of synthetic data (simulation-based, generative-model-based, and LLM-generated), because they differ in how they are produced, what they are good for, and what their failure modes look like.
Every useful property of synthetic data depends on being able to answer one question: where did this come from, and what did the generative process assume? A simulator that models friction incorrectly produces consistently wrong training signal. An LLM that tends to produce overconfident answers produces a training set full of overconfident answers. The errors are coherent and systematic, which makes them harder to catch than random noise.
Researchers at the University of Toronto and Cambridge demonstrated this in a 2024 paper on iterative model training. When models were trained on their own outputs without careful filtering, errors compounded across generations in predictable ways — specifically, low-probability events in the original distribution were underrepresented in the synthetic data, and models trained on that data became progressively less capable of handling unusual inputs. They called this the "tails disappear" problem. The tails — rare but real events — are exactly what you most need a robust model to handle.
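The mechanism can be illustrated with a toy model. This is a minimal sketch, not the method of the paper described above: the "model" is just a table of category frequencies, each generation is refit on a finite sample drawn from the previous one, and all names and numbers are assumptions chosen for illustration. A rare category that happens to receive zero draws gets probability zero and can never reappear.

```python
import random
from collections import Counter


def next_generation(probs, sample_size, rng):
    """Sample from the current model, then refit by empirical frequency."""
    categories = list(probs)
    weights = [probs[c] for c in categories]
    draws = rng.choices(categories, weights=weights, k=sample_size)
    counts = Counter(draws)
    # Categories with zero draws drop out of the refit model entirely.
    return {c: n / sample_size for c, n in counts.items()}


def run_generations(initial_probs, generations, sample_size, seed=0):
    """Return the list of models, starting with the original distribution."""
    rng = random.Random(seed)
    history = [dict(initial_probs)]
    for _ in range(generations):
        history.append(next_generation(history[-1], sample_size, rng))
    return history


# One common event plus twenty rare ones (each 0.5% of the probability mass).
initial = {"common": 0.90}
initial.update({f"rare_{i}": 0.005 for i in range(20)})

history = run_generations(initial, generations=10, sample_size=200)
support_sizes = [len(model) for model in history]
print(support_sizes)  # never increases: a lost category cannot come back
```

The number of surviving categories can only shrink from one generation to the next, which is the toy analogue of tails disappearing. Mixing real data back in at every generation is the obvious way to reintroduce lost categories in this sketch.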
Synthetic data is most powerful when it fills a specific gap in real data (dangerous scenarios, rare medical conditions, low-resource languages) and most dangerous when it is used as a wholesale replacement for real data without understanding what the generative process cannot represent.
By 2024, synthetic data had moved from specialized technique to routine ingredient. Meta's Llama 3 technical report, released in April 2024, described using synthetic data generated by earlier Llama models to create instruction-following training sets. Google's Gemini technical report discussed synthetic chain-of-thought data. Microsoft's Phi series of small language models (Phi-1 and Phi-2 in 2023, Phi-3 in 2024) was explicitly designed around the insight that a small model trained on very high-quality synthetic data could match or exceed larger models trained on raw web text. Phi-1, a 1.3-billion-parameter model trained almost entirely on synthetically generated "textbook-quality" coding problems, matched the code-generation performance of GPT-3.5 on several benchmarks.
This is not a niche research curiosity. If you interact with any major AI product today, it was almost certainly trained, at some stage, on data that was itself generated by an AI system. The question is no longer whether synthetic data is part of AI training — it is. The question is how to use it well.
You will describe a data source to the AI assistant and ask it to classify the data as simulation-based, generative model-based, LLM-generated, or real/augmented. Then push back, ask about failure modes, or propose edge cases. Complete at least 3 exchanges to finish this lab.
In 2018, the FDA cleared IDx-DR — the first AI system authorized to diagnose a medical condition without a clinician reviewing the result. It detected diabetic retinopathy from retinal photographs. The approval was based on a clinical trial of 900 patients. But the system had been trained on over 100,000 retinal images collected across multiple continents over several years. Getting those images required ethics approvals, patient consent, specialist grading of each image, and coordination across dozens of medical centers. Every labeled image cost real time, real money, and a real person's involvement. For common conditions with large patient populations, this pipeline — expensive as it is — can be made to work. For rare diseases, pediatric populations, or conditions in low-income countries where imaging infrastructure is sparse, it cannot.
Data scarcity is not a temporary engineering problem that more compute will solve. It is structural. And it takes several distinct forms.
Understanding why real data is insufficient requires distinguishing the specific mechanism causing the shortage, because different mechanisms call for different synthetic data strategies.
One of the clearest documented cases of distribution scarcity driving synthetic data use is in low-resource languages. Of the approximately 7,000 languages spoken today, fewer than 100 have substantial representation in the web-crawled corpora that train large language models. Swahili, with roughly 200 million speakers, has far less training data than Finnish, with 5 million. Yoruba, spoken by 50 million people in Nigeria and Benin, has a fraction of the text data available for Danish.
In 2023, Meta's NLLB (No Language Left Behind) project, along with academic research from Masakhane, a grassroots African NLP research community, began using translation models to generate synthetic parallel text for low-resource African languages. The generated text is imperfect, but it expands the training distribution in ways that meaningfully improve downstream performance. A 2023 paper from the Masakhane community documented that synthetic data augmentation improved machine translation quality for Wolof, Fon, and Igbo, languages for which a few hundred thousand real parallel sentences exist rather than tens of millions.
Synthetic data for low-resource languages is an equity argument as well as a technical one. But if the generative model was itself trained primarily on high-resource languages, the synthetic data it produces may import biases and errors from those languages into the low-resource training set — a form of linguistic colonialism through AI pipelines that researchers have documented and criticized.
The most commercially significant scarcity problem in 2023–2024 was annotation scarcity for instruction-following data — specifically, the high-quality human-written question-answer pairs needed to fine-tune a base language model into a useful assistant. OpenAI's InstructGPT paper (2022) showed that a relatively small number of expert-written demonstrations, used in RLHF (Reinforcement Learning from Human Feedback), dramatically improved model behavior. But producing those demonstrations at scale required thousands of hours of contractor labor.
The response was to use existing capable models to generate synthetic instruction data. Stanford's Alpaca project in March 2023 used GPT-3.5 (text-davinci-003) to generate 52,000 instruction-following examples from 175 human-written seed examples, at a cost of roughly $500. The resulting model, fine-tuned from LLaMA 7B, demonstrated instruction-following capability comparable to early GPT-3.5 on many tasks. This experiment popularized what researchers called "self-instruct" pipelines. It demonstrated both the power and the risks of the approach, since Alpaca also inherited factual errors and biases from GPT-3.5's outputs.
Synthetic data solves a real problem — data scarcity of several distinct types — at the cost of introducing a new problem: the synthetic data's quality ceiling is set by the generator, and the generator's errors are systematic rather than random. Random errors average out over enough examples. Systematic ones do not.
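The difference between the two error types shows up in a few lines of arithmetic. The sketch below assumes Gaussian noise and a made-up constant bias of 0.5. Both are illustrative choices, not measurements from any real generator.

```python
import random
import statistics


def mean_error(errors):
    """Average signed error across a dataset."""
    return statistics.fmean(errors)


rng = random.Random(0)
n = 100_000

# Random errors: zero-mean noise around the true value.
random_errors = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Systematic errors: the same kind of noise plus a constant generator bias.
bias = 0.5  # hypothetical bias, chosen for illustration
systematic_errors = [bias + rng.gauss(0.0, 1.0) for _ in range(n)]

print(round(mean_error(random_errors), 2))      # close to zero
print(round(mean_error(systematic_errors), 2))  # close to the bias
```

Adding more examples shrinks the first number toward zero but leaves the second one sitting at the bias, which is why collecting more data from the same generator does not remove the generator's errors.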
Present a real or hypothetical AI project to the assistant. Describe what data you would need to train it. The assistant will help you identify which of the four scarcity types you face and suggest what kind of synthetic data strategy would be most appropriate — and what risks that strategy carries. Aim for at least 3 substantive exchanges.
In June 2023, Apple quietly published a paper describing a technique called HUGS — Human Gaussian Splatting — that could generate photorealistic animated human figures from a short video clip. The generated humans could be placed in arbitrary scenes, posed, lit differently, and used to train computer vision models without requiring a human subject to be present during training. The pipeline produced data about people from data about people, but with controlled variation that real video could not provide. Apple was not alone: Meta, Google, and a dozen academic groups were all publishing variations of the same basic idea that year. Generation methods had matured enough to produce training data that was, in several measurable respects, better than the real data it supplemented.
The methods underlying this moment are worth understanding in some detail — not to become engineers of these systems, but to understand what each approach can and cannot represent.
Ian Goodfellow introduced GANs in 2014 at NeurIPS. The core idea: train two networks simultaneously — a generator that produces synthetic samples, and a discriminator that tries to distinguish generated from real. The generator improves by fooling the discriminator; the discriminator improves by catching the generator. At equilibrium, the generator produces samples the discriminator cannot distinguish from real ones.
GANs became the dominant tool for synthetic image generation between roughly 2017 and 2021. NVIDIA's StyleGAN (2019) and StyleGAN2 (2020) produced photorealistic synthetic faces that participants in human studies frequently could not tell apart from real photographs. For training data purposes, GANs were used extensively in medical imaging, where they generated synthetic MRI scans, dermatology images, and pathology slides to augment scarce real datasets.
The known failure mode of GANs is mode collapse: the generator learns to produce a narrow range of outputs that reliably fool the discriminator rather than covering the full diversity of the real distribution. A GAN trained on skin lesion images might produce highly realistic examples of one lesion type while entirely failing to generate rare variants — exactly the examples that matter most for a diagnostic classifier.
By 2022, diffusion models had largely displaced GANs for image synthesis in research settings. OpenAI's DALL-E 2 (April 2022), Stability AI's Stable Diffusion (August 2022), and Google's Imagen (May 2022) were all diffusion-based. The core mechanism: train a model to reverse a process of gradually adding noise to an image, learning to denoise at each step. Text conditioning is added so that the denoising process can be guided by a text prompt.
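The noising half of that mechanism is simple enough to sketch. The example below assumes a DDPM-style formulation with a linear noise schedule, which is one common choice and not necessarily what any of the systems named above use. The function names and the three-value toy "image" are hypothetical. It shows only the forward process; the learned denoiser that reverses it is not included.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_signal(betas):
    """alpha_bar[t]: cumulative product of (1 - beta) up to step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(a) * x0 + sqrt(1 - a) * noise."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_signal(betas)
rng = random.Random(0)

pixels = [0.2, 0.8, -0.5]  # a toy "image" of three values
slightly_noisy = add_noise(pixels, 10, alpha_bars, rng)
almost_pure_noise = add_noise(pixels, 999, alpha_bars, rng)
```

Early steps keep almost all of the original signal, while the final step keeps almost none. A denoising model is trained to undo these steps one at a time, and generation runs that learned reversal starting from pure noise.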
For synthetic training data, diffusion models offer better coverage of the distribution (less mode collapse), fine-grained control through text prompts, and the ability to generate labeled data directly — asking for "a chest X-ray showing early-stage pneumothorax in the left lobe" produces a labeled synthetic sample. Companies like Syntheticus and Synthesis AI built commercial products around this capability in 2022–2023, selling synthetic data pipelines to enterprise customers who could not use real patient or user data.
By late 2023, a documented concern in the research community was that diffusion models used for training data generation were themselves trained on data that included outputs of earlier diffusion models — creating multi-generation loops with unknown distributional effects. The provenance chain was becoming difficult to trace.
For language model training, the dominant method in 2023–2024 was using a larger, more capable model to generate training examples for a smaller model — a process called knowledge distillation when the goal is capability transfer, or self-instruct when the same model family generates its own training data.
A particularly important variant is chain-of-thought distillation. Rather than generating only final answers, a capable model is prompted to produce detailed step-by-step reasoning. These reasoning chains become training examples. Google's work on PaLM (2022) and the subsequent Flan series demonstrated that training on chain-of-thought examples improved reasoning performance on problems that require multi-step inference. OpenAI's GPT-4 technical report (March 2023) alluded to similar techniques without full disclosure.
The quality-control mechanism that separates good from bad synthetic text data is typically filtering: generated examples are scored by a separate model or by a set of heuristics (length, format consistency, semantic similarity to seed examples, factual verification against a knowledge base), and low-quality examples are discarded. Microsoft's Phi team described their filtering pipeline in detail: they generated far more synthetic data than they used, then selected only the top fraction by quality score. The effective dataset for Phi-1 was roughly 1 billion tokens of highly filtered synthetic code, not the full generated corpus.
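The generate-more-than-you-keep pattern can be sketched as a score-and-select step. This is a minimal illustration, not the Phi team's pipeline: `keep_top_fraction` and `toy_quality_score` are hypothetical names, and the scoring heuristic is deliberately crude. A real pipeline would more likely use a trained quality classifier.

```python
def keep_top_fraction(examples, score_fn, fraction):
    """Score every example and keep the best `fraction` of them."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(examples, key=score_fn, reverse=True)
    keep = max(1, int(len(ranked) * fraction)) if ranked else 0
    return ranked[:keep]


def toy_quality_score(example):
    """Hypothetical heuristic: reward a docstring and a return statement."""
    score = 0
    if '"""' in example:
        score += 1
    if "return" in example:
        score += 1
    return score


candidates = [
    'def add(a, b):\n    """Add two numbers."""\n    return a + b',
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    pass",
    "x = 1",
]
selected = keep_top_fraction(candidates, toy_quality_score, fraction=0.5)
print(len(selected))  # 2
```

The filter is only as good as the scoring function. A heuristic that rewards surface features, as this toy one does, will happily keep well-formatted but incorrect examples.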
Each generation method has a characteristic ceiling and a characteristic failure mode. GANs collapse on rare variants. Diffusion models can hallucinate fine-grained domain knowledge. LLM-generated text inherits the generator's factual errors and stylistic tendencies. Understanding which method produced a dataset tells you where to look for its weaknesses.
You are advising a team that needs to produce synthetic training data for a specific task. Describe the task to the assistant, and it will help you choose between GAN, diffusion model, LLM chain-of-thought distillation, and self-play — and explain what failure modes each option carries for your specific use case. Minimum 3 exchanges.
In 2019, a team at Google Health published results showing that an AI system trained on retinal fundus images could predict cardiovascular risk factors — age, sex, blood pressure, smoking status — from photographs of the eye alone. The model had been trained on real patient data. Later attempts to replicate and extend this work using synthetic retinal images produced a consistent finding: models trained on synthetic fundus images performed well on other synthetic fundus images but showed measurable performance drops on real patient photographs. The gap was small enough to be invisible in synthetic-only evaluations and large enough to matter clinically. The problem was not that the synthetic data was low-quality — it was visually convincing. The problem was that the distribution of subtle variations in real patient eyes did not fully match the distribution of subtle variations in the generated images.
This pattern — synthetic data that performs well in closed-loop evaluation and underperforms on the real distribution — is documented across enough domains to be treated as a baseline expectation rather than an exception.
The success cases share a common structure: synthetic data fills a specific, bounded gap in a real-data training set, while real data anchors the overall distribution.
Robotics and physical simulation. OpenAI's Dactyl project (2019) trained a robotic hand to solve a Rubik's Cube using almost entirely simulated data — a technique called Sim-to-Real transfer. The key innovation was domain randomization: randomizing the physical parameters of the simulation (friction, lighting, object weight) so extensively that the real world appears as just another instance of the randomized distribution. The physical robot's performance on the real Rubik's Cube was close to the simulated performance — a documented success of large-scale simulation-to-reality transfer.
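Domain randomization amounts to drawing a fresh set of physical parameters for every simulated episode. The sketch below is a simplified illustration under assumed parameter names and ranges. It is not the configuration used in the Dactyl project.

```python
import random

# Hypothetical parameter ranges; real projects tune these per simulator.
PARAMETER_RANGES = {
    "friction": (0.5, 1.5),
    "light_intensity": (0.2, 2.0),
    "object_mass_kg": (0.05, 0.20),
}


def sample_environment(rng, ranges=PARAMETER_RANGES):
    """Draw one randomized set of physical parameters for an episode."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


rng = random.Random(0)
episodes = [sample_environment(rng) for _ in range(1000)]
print(episodes[0])
```

The bet is that if the ranges are wide enough, the real world's true friction, lighting, and mass fall somewhere inside them. A real parameter value outside every sampled range is the simulator equivalent of the rain that never falls.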
Code generation. The Phi series results are well-documented: synthetic "textbook-quality" coding problems, filtered aggressively, produced models that punched above their parameter weight on coding benchmarks. The domain has a critical property that enables this: code is executable. Synthetic code examples can be automatically verified by running them. This means quality filtering is partially automated and highly reliable — unlike synthetic natural language, where correctness is harder to verify.
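Execution-based verification can be sketched in a few lines. The helper below is a hypothetical illustration, not any published pipeline: it runs each generated snippet and keeps it only if a named function passes a set of input/output checks. It uses `exec`, so it is only appropriate for trusted or sandboxed code, and it does not guard against infinite loops.

```python
def passes_tests(source, function_name, test_cases):
    """Run generated code and check it against (args, expected) pairs."""
    namespace = {}
    try:
        exec(source, namespace)  # only safe for trusted or sandboxed code
        func = namespace[function_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        return False  # syntax errors, missing function, runtime errors


candidates = [
    "def square(x):\n    return x * x",
    "def square(x):\n    return x + x",  # wrong logic
    "def square(x)\n    return x * x",   # syntax error
]
tests = [((2,), 4), ((3,), 9), ((0,), 0)]

verified = [src for src in candidates if passes_tests(src, "square", tests)]
print(len(verified))  # 1
```

Note that the second candidate returns the right answer for inputs 2 and 0 and is rejected only by the input 3. Automated verification is as strong as its test cases, no stronger.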
Rare event augmentation. Tesla's Autopilot team has described using their neural network rendering pipeline to generate synthetic versions of edge-case scenarios encountered in real-world fleet data — a pedestrian in an unexpected location, unusual lane markings — and using those synthetic examples to patch model weaknesses. The synthetic data supplements real data at the specific distribution gaps, rather than replacing it wholesale.
The failure cases also share a structure: synthetic data used as a wholesale replacement for real data, without accounting for the gap between the generative process and the real distribution.
Medical imaging without real-data anchoring. A 2022 systematic review in Nature Machine Intelligence analyzed 41 studies using GAN-generated medical images as training data. Of these, 28 studies evaluated only on synthetic or in-distribution test data. When real patient data was used for evaluation, performance gaps appeared in 19 of 28 cases. The review's conclusion was pointed: "Synthetic data augmentation is frequently evaluated in a manner that does not expose the distribution gap that appears in real deployment."
Factual accuracy in LLM fine-tuning. A documented failure mode in the Alpaca-style self-instruct pipeline was factual error propagation. When GPT-3.5 generated incorrect information in a training example — wrong dates, fabricated citations, incorrect scientific claims — those errors appeared in the fine-tuned model's outputs. Several teams published results in 2023 showing that models fine-tuned on unfiltered LLM-generated instruction data showed increased hallucination rates compared to the base model on factual tasks, even when instruction-following ability improved. Better at following instructions; worse at being correct.
Fairness and demographic coverage. Synthetic face generation systems have been documented to underrepresent darker skin tones and non-Western facial features, replicating biases present in the real datasets the generators were trained on. A 2023 study from the University of Maryland found that computer vision models trained on person images generated by DALL-E 2 showed larger demographic performance gaps than models trained on curated real datasets. The synthetic data did not just fail to fix the bias; in some cases it amplified the bias, because the generator's own biases were systematically present across all generated examples.
Perhaps the most persistent practical problem with synthetic data is that evaluation on synthetic held-out data gives misleadingly positive results. A model trained on synthetic chest X-rays, evaluated on synthetic chest X-rays, will appear to perform well — even if its performance on real patient images is clinically unacceptable. Real-data evaluation is non-negotiable if the model will ever encounter real data.
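A toy example makes the closed-loop problem visible. Everything below is invented for illustration: a one-feature threshold classifier, a "generator" that emits only clear-cut cases, and a small real test set that also contains borderline cases. None of the numbers come from a real study.

```python
def fit_threshold(samples):
    """Place the decision boundary midway between the two class means."""
    zeros = [x for x, y in samples if y == 0]
    ones = [x for x, y in samples if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold, samples):
    """Fraction of samples where (x > threshold) matches the label."""
    correct = sum((x > threshold) == bool(y) for x, y in samples)
    return correct / len(samples)


# Toy data: the generator only emits clear-cut cases.
synthetic_train = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
synthetic_test = [(0.15, 0), (0.25, 0), (0.75, 1), (0.85, 1)]
# Real data also contains borderline cases the generator never produced.
real_test = [(0.2, 0), (0.45, 0), (0.55, 0), (0.48, 1), (0.6, 1), (0.9, 1)]

threshold = fit_threshold(synthetic_train)
synthetic_accuracy = accuracy(threshold, synthetic_test)
real_accuracy = accuracy(threshold, real_test)
print(synthetic_accuracy, round(real_accuracy, 2))  # 1.0 0.67
```

The synthetic test set reports a perfect score because it shares the generator's blind spot. Only the real test set exposes the borderline cases, so the two numbers should always be reported side by side.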
Synthetic data is a powerful technique for bounded, well-defined gaps in real training data, particularly when: the generative process is auditable and its assumptions are understood; a verifiable ground truth exists (executable code, physics simulation with known parameters); and real data is used for both anchoring and final evaluation.
It is a risky technique when: it wholesale replaces real data; the evaluation loop is closed (synthetic train, synthetic test); the generative process encodes biases that are invisible in closed-loop evaluation; or the deployment domain has distributional properties the generator cannot represent.
The next module examines how these properties interact with the specific case of AI self-improvement — models that use their own outputs, or the outputs of peer models, as the primary training signal. That is where the stakes become highest and the failure modes most consequential.
Synthetic data is not a substitute for understanding your data distribution. It is a tool for extending coverage within a distribution you already partially understand. The generative process determines the ceiling. The evaluation method determines whether you can see the ceiling.
You have received a project proposal that relies heavily on synthetic data. Describe the proposal to the assistant, and together you will audit it: identify the scarcity type being addressed, the generation method proposed, the evaluation plan, and any failure modes the proposal may have missed. This lab is most useful if you bring a real scenario you have encountered or are considering.