Module 6 · Lesson 1

What Is Model Distillation?

Compressing the knowledge of a large model into a smaller, deployable one — without starting from scratch.
How can a compact model learn to think like a giant one?

When Meta released the first LLaMA weights in February 2023, researchers quickly noticed something remarkable. Fine-tuning a 7-billion-parameter model on outputs generated by OpenAI's text-davinci-003, a GPT-3.5-series model, could produce an assistant (Alpaca, from Stanford) that felt surprisingly capable despite costing under $600 in total to build. The large model had been used not as a base, but as a teacher.

The Core Idea

Model distillation is a training technique in which a smaller student model learns by imitating a larger teacher model. The concept was formalized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their 2015 paper "Distilling the Knowledge in a Neural Network." Their insight was that a trained network's output probabilities — its soft targets — carry more information than simple one-hot labels.

When a teacher model assigns 0.7 probability to "cat," 0.2 to "lynx," and 0.05 to "dog," the student learns not just the answer but the structure of similarity the teacher has internalized. This richer signal accelerates and improves student training compared to learning from raw data alone.
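The soft-target idea can be made concrete with a small sketch. The logits below are hypothetical values chosen for illustration, and the loss follows the general shape of the Hinton et al. formulation (a divergence between temperature-softened teacher and student distributions, scaled by T squared); it is not taken from any particular library.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature T."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL between softened teacher and student distributions, scaled by T**2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)

# Hypothetical logits for the classes [cat, lynx, dog].
teacher_logits = [4.0, 2.7, 1.3]
student_logits = [3.0, 0.5, 0.4]

print(softmax(teacher_logits, temperature=1.0))  # peaked on "cat"
print(softmax(teacher_logits, temperature=4.0))  # flatter: "lynx" and "dog" more visible
print(distillation_loss(teacher_logits, student_logits))
```

Raising the temperature flattens the teacher's distribution, so the relative ordering of the unlikely classes contributes more to the loss the student minimizes.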

In the language model era, distillation typically works differently: the teacher generates large volumes of text — reasoning chains, answers, explanations — and the student is fine-tuned on this synthetic corpus. The student never sees the teacher's internal weights, only its outputs.

Key Distinction

Classic distillation trains on the teacher's probability distributions (soft targets). Modern LLM distillation more often trains on the teacher's generated text (behavioral cloning). Both are valid; the right choice depends on whether you have access to the teacher's logits.

Why Distillation Matters

Large frontier models like GPT-4 or Claude 3 Opus cannot run on consumer hardware, embedded devices, or edge servers with strict latency requirements. Distillation bridges this gap. A student model trained on teacher outputs can achieve a large fraction of the teacher's performance at a fraction of the compute and memory cost.

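The economic argument is compelling. A dense 7B-parameter model needs roughly one-tenth the compute and weight memory per token of a 70B model, and the realized cost savings can be larger still once the smaller model fits on a single, cheaper accelerator; the exact ratio depends on hardware, batch size, and serving setup. If distillation can close even half the capability gap, the return on investment is enormous for high-volume production applications.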
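A back-of-the-envelope check on the compute side of this argument, as a rough sketch. It assumes the common rule of thumb that a dense transformer's forward pass costs about 2 FLOPs per parameter per token, and 16-bit weights (2 bytes per parameter). Real serving cost also depends on hardware, batching, and how many accelerators the model has to be split across.

```python
def forward_flops_per_token(n_params):
    """Rule-of-thumb forward-pass cost for a dense transformer."""
    return 2 * n_params

def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone, assuming 16-bit precision by default."""
    return n_params * bytes_per_param / 1e9

teacher_params, student_params = 70e9, 7e9

compute_ratio = forward_flops_per_token(teacher_params) / forward_flops_per_token(student_params)
print(compute_ratio)                     # 10.0
print(weight_memory_gb(teacher_params))  # 140.0
print(weight_memory_gb(student_params))  # 14.0
```

The 14 GB student fits on a single commodity accelerator, while the 140 GB teacher has to be sharded across several, which is one reason the cost gap in practice can exceed the raw compute ratio.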

Teacher Model

Large, expensive, high-capability. Generates training signal — either probability distributions or full text outputs — that encode its learned knowledge.

Student Model

Smaller, cheaper, fast to run. Trained to mimic teacher behavior. Retains much of the teacher's capability in a deployable footprint.

Key Terms

Soft Targets: The full probability distribution output by the teacher over all possible tokens or classes, rather than the single most likely answer. Soft targets carry richer information than hard labels.
Temperature Scaling: In Hinton's original formulation, a temperature T>1 is applied to soften the teacher's distribution, making low-probability outputs more visible to the student.
Knowledge Transfer: The process by which information encoded in a large model's parameters is transmitted to a smaller model through distillation training.
Behavioral Cloning: Training a student to reproduce teacher outputs (text, actions) rather than to match its internal distributions. Common in LLM distillation pipelines.
Historical Anchor — Alpaca (2023)

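Stanford's Alpaca model fine-tuned LLaMA-7B on 52,000 instruction-following examples generated by OpenAI's text-davinci-003 (a GPT-3.5-series model) using a self-instruct pipeline. Data generation cost under $500 and fine-tuning under $100, for a total of roughly $600. In Stanford's preliminary evaluation, the resulting model behaved similarly to text-davinci-003 on many simple instruction-following tasks. This demonstrated that behavioral distillation from a frontier teacher could produce capable small models. It also raised licensing concerns: LLaMA was released under a noncommercial research license and OpenAI's terms prohibited using outputs to develop competing models, so Alpaca was restricted to academic research use.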

Distillation is not compression alone. It differs from quantization (reducing numerical precision) and pruning (removing weights). Those techniques modify an existing model. Distillation trains a new, architecturally independent model that learns from the existing one. The student may have a completely different architecture — it need only learn to produce similar outputs.

Lesson 1 Quiz

What Is Model Distillation? — 3 questions
What are "soft targets" in the context of Hinton et al.'s 2015 distillation framework?
Correct. Soft targets are the teacher's full output distribution. They carry richer information than hard labels — low-probability assignments reveal the structure of similarity the teacher has learned.
Not quite. Soft targets refer to the teacher's complete output probability distribution, which encodes more information than any single label or compressed weight.
Stanford's Alpaca model demonstrated behavioral distillation by training LLaMA-7B on outputs from which teacher?
Correct. Alpaca used 52,000 instruction-following examples generated by OpenAI's text-davinci-003 (a GPT-3.5-series model) through a self-instruct pipeline, at a total cost of roughly $600.
Not correct. Alpaca (Stanford, 2023) used OpenAI's text-davinci-003, a GPT-3.5-series model, as the teacher to generate instruction-following training data for LLaMA-7B.
How does model distillation differ from quantization?
Correct. Distillation produces a new model trained to mimic the teacher; quantization reduces the bit-width of an existing model's weights without training a new architecture.
Incorrect. Distillation trains a completely new student model using teacher outputs. Quantization, by contrast, modifies an existing model's weight precision without changing its architecture or training it again.

Lab 1 — Distillation Fundamentals

Discuss the mechanics and history of knowledge distillation with your AI tutor

Your Task

Engage with the AI tutor about the core concepts of model distillation. Explore the difference between soft and hard targets, why temperature scaling matters, and how Stanford's Alpaca demonstrated behavioral cloning at low cost.

Suggested opening: "Explain why a teacher model's probability distribution carries more information than the correct label alone, and how temperature scaling affects this."
Module 6 · Lesson 2

Distillation Architectures and Training Pipelines

From response-based distillation to feature-level matching — how the mechanics are actually implemented.
What training signals, beyond text outputs, can a teacher pass to a student?

In October 2019, Hugging Face published DistilBERT — a version of BERT compressed to 40% fewer parameters while retaining 97% of BERT's performance on GLUE benchmarks and running 60% faster. The training used three simultaneous distillation signals: the teacher's soft output probabilities, a cosine embedding loss aligning hidden state vectors, and a masked language modeling loss on the same data. The combination of these three signals, not just output imitation, was critical to its success.

Three Levels of Distillation Signal

Modern distillation pipelines can operate at three distinct levels of teacher signal, each offering a different fidelity-cost tradeoff:

  1. Response-Based: The student learns from the teacher's final outputs: logits, probability distributions, or generated text. This is the most accessible form since it requires no access to internal model layers. Used in Alpaca-style behavioral cloning and in sequence-level knowledge distillation (Kim & Rush, 2016).
  2. Feature-Based: The student is trained to match intermediate representations from the teacher, such as hidden states, attention matrices, or layer activations. Requires architectural alignment or projection layers. DistilBERT uses this via cosine embedding loss on hidden states.
  3. Relation-Based: The student learns to mimic the relationships between different inputs as seen by the teacher, for instance which pairs of examples are similar in the teacher's representational space. Less common but powerful for metric learning tasks.
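Of the three levels, the relation-based signal is the least intuitive, so a minimal sketch may help. It compares pairwise cosine-similarity matrices computed from teacher and student embeddings. This is a simplified similarity-matching variant written for illustration, not the exact loss from any published method, and all embeddings are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def similarity_matrix(embeddings):
    """Pairwise cosine similarity between every pair of examples in a batch."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]

def relation_loss(teacher_embs, student_embs):
    """Mean squared difference between teacher and student similarity matrices."""
    t = similarity_matrix(teacher_embs)
    s = similarity_matrix(student_embs)
    n = len(t)
    return sum((t[i][j] - s[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

# Three examples; the teacher embeds in 3 dimensions, the students in 2.
teacher_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
good_student = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]  # same pairwise geometry
poor_student = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]  # different pairwise geometry

print(relation_loss(teacher_embs, good_student))
print(relation_loss(teacher_embs, poor_student))
```

Because only the relationships between examples are compared, the teacher and student embedding sizes do not need to match, which is why no projection layer appears here.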

The DistilBERT Training Pipeline

The Hugging Face team's approach to distilling BERT is a canonical example of multi-objective distillation. They trained a 6-layer student against BERT's 12 layers using a combined loss function:

L_total = α·L_CE(soft) + β·L_CE(hard) + γ·L_cos

Where L_CE(soft) is cross-entropy against the teacher's softened output distribution, L_CE(hard) is standard cross-entropy against ground-truth labels (for DistilBERT, the masked language modeling loss on the original tokens), and L_cos is cosine distance between teacher and student hidden state vectors. α, β, and γ are scalar weights on the three terms. The three-way combination proved more effective than dropping any one of the signals.
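The combined loss above can be sketched for a single prediction as follows. The weights alpha, beta, gamma and the temperature are placeholder values, not the ones used to train DistilBERT, and the logits and hidden vectors are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs):
    """H(target, predicted) for two distributions over the same vocabulary."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0)

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def total_loss(teacher_logits, student_logits, label, teacher_hidden, student_hidden,
               alpha=1.0, beta=1.0, gamma=1.0, temperature=2.0):
    # L_CE(soft): student vs. temperature-softened teacher distribution.
    soft = cross_entropy(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    # L_CE(hard): student vs. the one-hot ground-truth label.
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, softmax(student_logits))
    # L_cos: angular mismatch between hidden state vectors.
    cos = cosine_distance(teacher_hidden, student_hidden)
    return alpha * soft + beta * hard + gamma * cos

loss = total_loss(
    teacher_logits=[4.0, 2.7, 1.3], student_logits=[3.0, 0.5, 0.4], label=0,
    teacher_hidden=[0.2, 0.9, -0.4], student_hidden=[0.1, 0.7, -0.1],
)
print(loss)
```

This sketch assumes teacher and student hidden vectors have the same width, which holds for DistilBERT because only the number of layers was reduced; a student with a narrower hidden size would need a projection first.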

Important Detail

DistilBERT was initialized from every other layer of BERT before distillation training began. This weight initialization — rather than random initialization — significantly accelerated convergence. Starting from teacher weights is a common practical trick in response-based and feature-based distillation alike.

Sequence-Level vs. Token-Level Distillation

For generative models, distillation can operate at two granularities. Token-level distillation trains the student to match the teacher's probability distribution at each token position during generation. This requires the teacher to run inference at training time, which is expensive but provides fine-grained signal.

Sequence-level distillation (Kim & Rush, EMNLP 2016) instead has the teacher generate complete output sequences, which then serve as hard targets for the student. This is cheaper, because you generate teacher sequences once and train on them repeatedly, but it loses the within-sequence probability structure. Modern instruction-tuning distillation (Alpaca, Vicuna) mostly uses this approach.
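The cost difference is easiest to see by counting teacher calls. In the sketch below, `CountingTeacher` and `train_student` are hypothetical stand-ins (no real model is involved); the point is only that the offline pipeline queries the teacher once per prompt, however many epochs the student trains for.

```python
class CountingTeacher:
    """Stand-in for an expensive teacher model; counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def generate(self, prompt):
        self.calls += 1
        return f"[teacher answer to: {prompt}]"

def build_offline_dataset(prompts, teacher):
    """Run the teacher once per prompt and store (prompt, completion) pairs."""
    return [(p, teacher.generate(p)) for p in prompts]

def train_student(dataset, epochs):
    """Placeholder training loop: returns the number of examples processed."""
    steps = 0
    for _ in range(epochs):
        for prompt, target in dataset:
            steps += 1  # a real loop would take a gradient step on (prompt, target)
    return steps

teacher = CountingTeacher()
prompts = ["Define distillation.", "What is a soft target?", "Explain temperature."]
dataset = build_offline_dataset(prompts, teacher)
steps = train_student(dataset, epochs=5)
print(teacher.calls, steps)  # 3 15
```

An online token-level setup would instead call the teacher inside the training loop, so teacher inference cost would scale with the number of training steps rather than the number of prompts.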

Approach | Teacher Signal | Cost | Quality | Example
Token-level (online) | Full distribution per token | High | Highest | TinyBERT layers 4–12
Sequence-level (offline) | Generated text sequences | Low | Good | Alpaca, Vicuna
Feature-based | Hidden state vectors | Medium | High | DistilBERT
Relation-based | Inter-example similarity | Medium | Task-dependent | RKD (Park et al.)

Key Terms

Response-Based Distillation: Training on the teacher's output layer only (logits or generated text). No access to internal layers required.
Feature-Based Distillation: Training to match intermediate hidden states or activations from the teacher. Requires either architectural alignment or projection layers.
Online vs. Offline Distillation: Online: teacher runs at training time generating real-time signal. Offline: teacher outputs pre-generated and stored. Offline is cheaper; online is higher quality.
Cosine Embedding Loss: A loss function that penalizes the angular difference between teacher and student hidden state vectors. Used in DistilBERT for feature-level alignment.

Lesson 2 Quiz

Distillation Architectures and Training Pipelines — 3 questions
DistilBERT achieved 97% of BERT's GLUE performance with how many fewer parameters?
Correct. DistilBERT removed 40% of BERT's parameters (going from 12 to 6 transformer layers), ran 60% faster, and retained 97% of GLUE benchmark performance.
Not quite. DistilBERT used 40% fewer parameters than BERT — achieved by halving the number of transformer layers from 12 to 6.
What is the key advantage of sequence-level distillation over token-level online distillation?
Correct. Sequence-level distillation pre-generates teacher outputs offline. The student trains on these stored sequences repeatedly without requiring the teacher at training time, which is substantially cheaper.
Incorrect. The main advantage of sequence-level distillation is cost: teacher text is generated once and reused many times, eliminating the need to run the teacher at every training step.
Which initialization strategy significantly accelerated DistilBERT's convergence during distillation training?
Correct. The Hugging Face team initialized DistilBERT's 6 layers by taking one layer out of every two from BERT. Starting from teacher weights rather than random initialization significantly sped up convergence.
Not correct. DistilBERT was initialized from every other layer of BERT itself — a practical trick that gave the student a strong starting point and accelerated distillation training considerably.

Lab 2 — Distillation Architectures

Explore training pipeline design with your AI tutor

Your Task

Discuss the architectural choices and training pipeline decisions in model distillation. Focus on the tradeoffs between response-based, feature-based, and relation-based distillation, and the cost/quality tradeoffs of online vs. offline approaches.

Suggested opening: "I'm designing a distillation pipeline for a production NLP system. My teacher is a 70B model and I want a 7B student. Should I use online token-level distillation or offline sequence-level distillation, and why?"
Module 6 · Lesson 3

Reasoning Distillation and Chain-of-Thought Transfer

Teaching small models to reason by learning from the teacher's thinking process, not just its answers.
Can a student model learn to reason, or only to answer — and does the difference matter?

In January 2025, DeepSeek released DeepSeek-R1 alongside a suite of distilled smaller models. The team first trained R1, a large reasoning model with 671 billion total parameters (about 37 billion active per token), using reinforcement learning to produce extended chain-of-thought traces. They then fine-tuned open Qwen and Llama models, ranging from 1.5 billion to 70 billion parameters, on reasoning traces generated by R1. The paper reports that DeepSeek-R1-Distill-Qwen-7B outperformed much larger non-reasoning models such as GPT-4o on mathematical benchmarks, and that the 32B and 70B distilled models exceeded OpenAI's o1-mini on most benchmarks. The results demonstrated that reasoning capability could transfer across a roughly 100× size gap (671B to 7B) through distillation of chain-of-thought data.

Why Reasoning Distillation Is Different

Standard distillation transfers factual knowledge and linguistic patterns. Reasoning distillation attempts something harder: transferring problem-solving procedures. A teacher that has learned to decompose a math problem into intermediate steps doesn't just output a correct answer — it generates a structured reasoning trace that models the solution process.

When a student is trained on these reasoning traces, it learns to imitate not just what the teacher says but how it thinks. Empirically, this appears to work remarkably well. Models distilled on chain-of-thought data generalize better to novel problems than those trained on final answers alone.

The key mechanism is that reasoning traces expose intermediate structure. A teacher might write: "First check if n is divisible by 3. It is (3+6+9=18). Then check divisibility by 7. 369÷7=52.7, not divisible. Therefore the answer is..." This trace teaches the student which sub-problems to decompose and in what order — a curriculum invisible in the final answer alone.

Rejection Sampling Distillation

DeepSeek's pipeline for small model distillation used rejection sampling: the large R1 model generated many candidate reasoning traces for each problem; only those leading to correct final answers were kept as training data. This filters out teacher "mistakes" that might otherwise confuse the student — a quality control step critical to reasoning distillation.
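The filtering step can be sketched in a few lines. Everything here is a toy: `toy_teacher` is a hypothetical stand-in that adds two numbers and is deliberately wrong about 30% of the time, and `extract_answer` assumes the final answer follows the last "=" in the trace. It illustrates the general idea of rejection sampling, not DeepSeek's actual pipeline.

```python
import random

def toy_teacher(question, rng):
    """Hypothetical teacher: adds two numbers, but is off by one 30% of the time."""
    a, b = question
    answer = a + b + (1 if rng.random() < 0.3 else 0)
    return f"Add {a} and {b}. {a} + {b} = {answer}"

def extract_answer(trace):
    """Parse the final answer, assumed to follow the last '=' in the trace."""
    return int(trace.rsplit("=", 1)[1])

def rejection_sample(problems, n_samples, rng):
    """Sample several traces per problem; keep only those with a correct final answer."""
    kept = []
    for question, reference in problems:
        for _ in range(n_samples):
            trace = toy_teacher(question, rng)
            if extract_answer(trace) == reference:
                kept.append((question, trace))
    return kept

rng = random.Random(0)
problems = [((2, 3), 5), ((10, 7), 17), ((4, 4), 8)]
kept = rejection_sample(problems, n_samples=8, rng=rng)
print(f"kept {len(kept)} of {len(problems) * 8} sampled traces")
```

Note that this filter only checks the final answer. A trace can reach the right answer through flawed steps, so answer-based filtering reduces noise in the training data without guaranteeing that every kept trace is sound.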

Chain-of-Thought as Synthetic Training Data

In 2022, Wei et al. at Google showed that chain-of-thought prompting dramatically improved large model reasoning. Ho et al. (2022) then showed that fine-tuning smaller models on chain-of-thought data generated by larger models could transfer some of this capability — a direct demonstration of reasoning distillation.

The critical finding: this approach only works reliably when the teacher model is large enough to generate correct reasoning traces. A teacher that produces plausible-but-wrong chains of thought teaches the student to reason incorrectly with confidence — a particularly dangerous failure mode that rejection sampling addresses.

What Transfers Well

Mathematical decomposition, logical step-by-step procedures, code reasoning, structured problem-solving approaches seen repeatedly in training data.

What Transfers Poorly

Novel reasoning patterns not present in teacher traces, highly abstract reasoning requiring world knowledge the student lacks, real-time adaptation to user feedback.

Speculative Decoding as Implicit Distillation

A related but distinct technique is speculative decoding, introduced by Leviathan et al. (Google, 2022) and independently by Chen et al. (DeepMind, 2023). In this approach, a small "draft" model proposes several candidate tokens, which the large model then verifies in a single parallel forward pass, accepting or rejecting each one. The output distribution is mathematically identical to sampling from the large model, and the original papers report inference speedups of roughly 2–3×.

While not distillation in the training sense, speculative decoding highlights a key architectural insight: small models are surprisingly good at predicting easy tokens, and the teacher only needs to intervene on hard ones. This implies that a well-trained distilled student captures most of the teacher's distribution even if it misses on edge cases.
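The claim that the output distribution matches the large model rests on a specific accept/reject rule. The sketch below applies that rule to a single token position with hypothetical three-token distributions: accept the draft's token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalized positive part of p − q. A real implementation verifies several draft tokens per large-model pass; this toy only demonstrates that the rule recovers the target distribution.

```python
import random

def speculative_step(p_target, q_draft, rng):
    """Sample one token so that the result is distributed according to p_target.

    Returns (token_index, accepted), where accepted says whether the draft's
    proposal was kept.
    """
    tokens = list(range(len(q_draft)))
    x = rng.choices(tokens, weights=q_draft)[0]  # the draft model proposes a token
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    # Rejected: resample from the residual distribution max(0, p - q).
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(tokens, weights=residual)[0], False

# Hypothetical next-token distributions over a 3-token vocabulary.
p_target = [0.6, 0.3, 0.1]  # large model
q_draft = [0.3, 0.3, 0.4]   # small draft model

rng = random.Random(0)
n = 20000
counts = [0, 0, 0]
accepted = 0
for _ in range(n):
    token, ok = speculative_step(p_target, q_draft, rng)
    counts[token] += 1
    accepted += ok

print([c / n for c in counts])  # close to p_target
print(accepted / n)             # close to sum(min(p, q)) = 0.7
```

The acceptance rate equals the overlap between the two distributions, which is why a draft model that tracks the large model closely yields larger speedups.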

Scale vs. Procedure: DeepSeek-R1 Finding

DeepSeek's January 2025 paper reported that applying large-scale RL directly to a Qwen 32B base model produced a markedly weaker reasoner than distilling R1 into the same base model. In their experiments, strong reasoning was first developed in a large model via RL and then transferred to smaller models via distillation. This suggests that, at least with current methods, reasoning is more economical to cultivate at scale and then transfer than to train directly at small scale; distillation is not a substitute for the large-scale development phase.

Key Terms

Reasoning Distillation: Training a student on the teacher's chain-of-thought traces rather than final answers, transferring procedural problem-solving patterns.
Rejection Sampling: Filtering teacher-generated reasoning traces to retain only those leading to correct final answers, reducing noise in distillation training data.
Speculative Decoding: Using a small draft model to generate candidate tokens, verified by a large model. Achieves teacher-quality outputs at 2–3× the inference speed.
Chain-of-Thought Transfer: The empirical finding that fine-tuning on teacher-generated reasoning chains confers improved problem-solving generalization to smaller student models.

Lesson 3 Quiz

Reasoning Distillation and Chain-of-Thought Transfer — 3 questions
What did DeepSeek's January 2025 paper demonstrate about distilling reasoning capability into small models?
Correct. DeepSeek found that applying RL directly to a smaller base model (32B in their experiment) could not match the capability achieved by distilling from R1 into the same base. Reasoning was developed at scale first, then transferred via distillation.
Not quite. DeepSeek's key finding was that a smaller model trained with RL alone fell well short of the same model distilled from R1. The reasoning capability was first developed in the large R1 model, then distilled down.
Why is rejection sampling important in reasoning distillation pipelines?
Correct. Without rejection sampling, a student trained on teacher mistakes learns to produce plausible-but-wrong reasoning with high confidence — a dangerous failure mode. Filtering on correct final answers ensures only valid reasoning procedures transfer.
Incorrect. Rejection sampling filters teacher traces by whether they arrive at correct final answers. This prevents the student from learning flawed reasoning patterns that look confident but produce wrong results.
What is speculative decoding, and how does it relate to the teacher-student concept?
Correct. In speculative decoding, a small (draft) model generates candidate tokens and a large (verifier) model accepts or rejects them. The output distribution is mathematically identical to sampling from the large model, but inference is 2–3× faster.
Not correct. Speculative decoding uses a small draft model to generate token candidates that a large model verifies — achieving the large model's output quality at 2–3× faster inference speed. It's an inference optimization, not a training technique.

Lab 3 — Reasoning Distillation

Explore chain-of-thought transfer and its limits with your AI tutor

Your Task

Engage with the AI tutor about reasoning distillation — why chain-of-thought traces transfer procedural knowledge, the role of rejection sampling, and what DeepSeek-R1's distilled models revealed about the limits and possibilities of transferring reasoning capability across scale.

Suggested opening: "If I want to distill mathematical reasoning into a small model, what makes chain-of-thought traces better training data than just final answers? And how do I handle cases where my teacher makes reasoning mistakes?"
Module 6 · Lesson 4

Limits, Risks, and the Frontier of Distillation

Where distillation breaks down, what it can't transfer, and the legal and safety questions it raises.
What does a student model inevitably lose — and what risks does that loss create?

In March 2023, researchers from UC Berkeley, CMU, Stanford, and UC San Diego released Vicuna-13B, a model fine-tuned on roughly 70,000 conversations shared by ChatGPT users on ShareGPT.com. In the team's preliminary evaluation, which used GPT-4 as the judge rather than human raters, the model reached more than 90% of ChatGPT's quality. OpenAI's terms of service prohibited using outputs to develop competing models. The episode crystallized a legal grey area: could distillation from a proprietary teacher violate IP rights or contract terms? The question remains largely unresolved in US courts as of 2025.

What Distillation Cannot Transfer

Distillation is a powerful tool, but it has fundamental limits rooted in information theory and model capacity. A student with fewer parameters cannot represent everything the teacher has learned — some knowledge will always be lost in compression.

Empirically, distilled models tend to lose capability at the tails of the distribution: rare knowledge, unusual edge cases, and tasks requiring multi-step reasoning across many domains simultaneously. The common-case performance often approaches the teacher's; the rare-case performance degrades significantly.

Calibration can be another casualty. Distilled models may be less well calibrated than their teachers, expressing high confidence on answers where the teacher would have been appropriately uncertain, so calibration should be measured rather than assumed. This matters greatly in high-stakes deployment.
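Calibration can be measured directly. One common metric is expected calibration error (ECE), which bins predictions by confidence and compares each bin's average confidence to its accuracy. The sketch below is a minimal version with equal-width bins, and the confidence values and correctness flags are made-up illustrations, not measurements from any real model.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy across confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece

# Hypothetical predictions on the same four questions (two answered correctly).
correct = [True, False, True, False]
teacher_conf = [0.95, 0.55, 0.90, 0.52]  # hedges on the ones it gets wrong
student_conf = [0.95, 0.95, 0.95, 0.95]  # equally confident everywhere

print(expected_calibration_error(teacher_conf, correct))
print(expected_calibration_error(student_conf, correct))
```

Both hypothetical models have the same accuracy here; the student's higher ECE comes entirely from being confident on the questions it gets wrong.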

The Capacity Gap Problem

If the teacher has encoded a capability that requires more parameters to represent than the student has, no amount of distillation training will transfer it. The student simply doesn't have the representational capacity. This is why some capabilities appear only above certain model size thresholds — they may be fundamentally non-distillable below that threshold.

Safety Alignment and Distillation

One of the most concerning findings in the 2023 wave of distillation research was that safety alignment does not transfer robustly through behavioral distillation. A teacher model that has been carefully RLHF-trained to refuse harmful requests may generate refusals — but a student trained on the teacher's general outputs will not necessarily learn why to refuse, only the surface-level pattern.

Multiple papers in 2023–2024 showed that models like Alpaca, Vicuna, and similar distilled models could be prompted to produce harmful content far more easily than their GPT-3.5 or GPT-4 teachers. The teacher's safety mechanisms had been learned through a separate, careful alignment process — one that behavioral distillation entirely bypasses.

This creates a significant risk in the distillation ecosystem: a practitioner who distills from a safety-aligned frontier model may unknowingly produce a student that retains the teacher's capabilities while shedding its safety properties.

Policy Response — Meta's LLaMA License Restriction

When Meta released Llama 2 in July 2023, a few months after Alpaca, its license prohibited using Llama outputs to improve any other large language model (other than Llama 2 and its derivatives). The clause reflects the distillation ecosystem: model developers did not want their models distilled into derivatives that would compete with their own offerings. The restriction highlighted that distillation from open-weight models raises different but equally complex questions compared to distillation from API-accessed proprietary models.

Legal and Ethical Landscape

The intellectual property questions around distillation remain genuinely unresolved. Three distinct legal theories have been proposed: (1) model outputs may be copyrightable, making distillation training on them infringement; (2) model outputs may not be protectable expression, making distillation fair use or no-use-at-all; (3) the ToS-violation theory, under which distillation from API outputs breaches contract regardless of copyright status.

In 2024, several major AI companies updated their terms of service to explicitly prohibit using API outputs for training competing models. The enforceability of these terms across jurisdictions, and whether purely behavioral distillation constitutes "use of outputs," is still debated and has not yet been tested in significant court decisions.

Current Frontier: Self-Distillation

A newer direction, self-distillation, has the model serve as its own teacher and student simultaneously. In multi-token prediction techniques like Medusa (Cai et al., 2024), lightweight prediction heads are trained on top of a large model, which can be kept frozen, to predict multiple tokens ahead. The large model's hidden states serve as both training signal and inference context.

More broadly, self-improvement techniques discussed in earlier modules — where a model generates synthetic training data and fine-tunes on it — can be viewed as self-distillation: the model's current reasoning capability teaches the next iteration of itself. The boundary between distillation, self-improvement, and synthetic data generation has become productively blurry.

Key Terms

Capacity Gap: The fundamental limit on what distillation can transfer: capabilities requiring more representational capacity than the student possesses cannot be transferred regardless of training signal quality.
Alignment Tax (Distillation): As used in this module, the empirical finding that behavioral distillation transfers capabilities but not safety alignment properties, producing students that are capable but less robustly safe than their teachers. (Elsewhere, "alignment tax" more often means the capability cost of alignment training itself.)
Self-Distillation: A training paradigm in which a model serves as its own teacher, using its current outputs or hidden states as training signal for updating or extending itself.
Calibration Loss: The tendency for distilled models to have worse confidence calibration than their teachers, expressing high certainty in cases where the teacher would have been appropriately uncertain.

Lesson 4 Quiz

Limits, Risks, and the Frontier of Distillation — 3 questions
Why did Meta revise the LLaMA license to restrict use of model outputs for training other language models?
Correct. The restriction responded to the wave of distillation projects that followed Stanford's Alpaca, which fine-tuned LLaMA on outputs from an OpenAI model to create a capable assistant at minimal cost.
Not correct. Meta's license restriction was aimed at preventing Llama outputs from being used to train competing language models, a direct commercial concern.
What is the "alignment tax" in the context of behavioral distillation from safety-aligned teachers?
Correct. Behavioral distillation transfers the teacher's capabilities but bypasses the separate alignment process. Multiple 2023 papers showed models like Alpaca and Vicuna were substantially easier to jailbreak than GPT-3.5/GPT-4, even where they imitated the teachers' style and performed well on simple tasks.
Incorrect. In distillation, the "alignment tax" refers to the finding that safety properties don't transfer through behavioral cloning — students become capable but lose the safety constraints their teachers had developed through separate RLHF training.
What type of model performance tends to degrade most in distilled students compared to their teachers?
Correct. Distilled models tend to approach teacher performance on common cases while losing capability at the tails — rare knowledge, unusual edge cases, and complex multi-domain reasoning that requires the teacher's full representational capacity.
Not quite. Distillation tends to preserve common-case performance reasonably well. The degradation is concentrated in rare, unusual, or highly complex cases that require the full representational capacity of the larger teacher model.

Lab 4 — Distillation Risks and Frontiers

Probe the limits and open questions of model distillation with your AI tutor

Your Task

Engage with the AI tutor about the risks, limits, and emerging frontiers of model distillation — including alignment transfer failure, the legal landscape, self-distillation techniques, and what the capacity gap means for practitioners building production systems.

Suggested opening: "I'm considering distilling from GPT-4 to build a cheaper production model. What are the main risks I should be aware of — technical, safety-related, and legal — and how should I think about mitigating them?"

Module 6 Test

Model Distillation — 15 questions · 80% to pass
1. Who formalized the knowledge distillation framework with "soft targets" in 2015?
Correct. Hinton, Vinyals, and Dean introduced the soft-target distillation framework in "Distilling the Knowledge in a Neural Network" (2015).
Incorrect. The 2015 distillation paper was authored by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean at Google.
2. What role does temperature scaling play in classical distillation?
Correct. Applying T>1 to the softmax spreads probability mass more evenly, making the structure of the teacher's uncertainty visible to the student.
Incorrect. Temperature T>1 softens the teacher's distribution, amplifying low-probability signals so the student can learn inter-class similarities.
3. Stanford's Alpaca demonstrated behavioral distillation by generating how many instruction-following examples from its OpenAI teacher model?
Correct. Alpaca used a self-instruct pipeline to generate 52,000 instruction-following pairs from OpenAI's text-davinci-003 (a GPT-3.5-series model), at a total cost of roughly $600.
Not correct. Alpaca generated 52,000 instruction-response pairs from OpenAI's text-davinci-003 using the self-instruct methodology.
4. Feature-based distillation differs from response-based distillation because it:
Correct. Feature-based distillation uses intermediate representations — hidden states, attention patterns — as training signal, not just the teacher's output layer.
Incorrect. Feature-based distillation matches the teacher's internal representations (hidden states, activations), whereas response-based distillation only uses the teacher's final output.
5. DistilBERT used which three loss signals simultaneously during training?
Correct. DistilBERT combined a soft-target distillation loss (against the teacher's output distribution), the hard-target masked-language-modeling loss (against the true tokens), and a cosine embedding loss aligning teacher and student hidden states.
Not correct. DistilBERT's three combined signals were: the soft-target distillation loss (against the teacher's distribution), the masked-language-modeling loss (against the true tokens), and a cosine embedding loss (on hidden states).
6. What is the primary cost advantage of offline sequence-level distillation over online token-level distillation?
Correct. In offline distillation, the teacher runs once to generate text corpora; the student trains on this stored data repeatedly without ever needing the teacher again — a major cost reduction.
Incorrect. The cost advantage is that offline distillation generates teacher outputs once and reuses them, eliminating teacher inference costs during the (often long) student training phase.
7. DeepSeek-R1-Distill-Qwen-7B was notable because it:
Correct. DeepSeek-R1-Distill-Qwen-7B matched o1-mini on several math benchmarks, demonstrating that reasoning capability can transfer across a large size gap (the R1 teacher has 671B total parameters) through chain-of-thought distillation.
Incorrect. The notable result was that this 7B distilled model matched o1-mini on several mathematical benchmarks — showing reasoning capability can transfer across a large size gap via distillation.
8. Why does reasoning distillation require rejection sampling of teacher outputs?
Correct. Without filtering, a student trained on incorrect teacher reasoning chains learns to produce confident-sounding but wrong reasoning — a dangerous failure mode. Rejection sampling keeps only traces that arrive at correct answers.
Not correct. Rejection sampling filters teacher-generated reasoning traces by whether they reach correct final answers. This prevents the student from internalizing flawed reasoning procedures that look plausible but are wrong.
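The filtering step in question 8 might look roughly like the sketch below. Everything here is illustrative: `fake_teacher` stands in for a real teacher model with canned traces, and the "Answer:" suffix convention is an assumption made so the final answer can be parsed.

```python
def rejection_sample(problems, generate_traces, extract_answer, samples_per_problem=4):
    """Keep only teacher traces whose final answer matches the reference answer."""
    kept = []
    for problem in problems:
        for trace in generate_traces(problem["question"], samples_per_problem):
            if extract_answer(trace) == problem["answer"]:
                kept.append({"question": problem["question"], "trace": trace})
    return kept

def fake_teacher(question, n):
    """Hypothetical stand-in for a teacher model: returns canned reasoning traces."""
    canned = {
        "What is 2 + 3?": [
            "2 + 3 = 5. Answer: 5",
            "2 + 3 = 6. Answer: 6",  # a wrong trace that should be filtered out
        ],
    }
    return canned.get(question, [])[:n]

def extract_answer(trace):
    """Parse the text after the last 'Answer:' marker (assumed trace format)."""
    return trace.rsplit("Answer:", 1)[-1].strip()

problems = [{"question": "What is 2 + 3?", "answer": "5"}]
kept = rejection_sample(problems, fake_teacher, extract_answer)
print(kept)
```

Note that this checks only the final answer. A trace that reaches the right answer through flawed steps would still pass, so real pipelines may add further checks.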
9. Speculative decoding uses a small draft model and a large verifier model to achieve what outcome?
Correct. Speculative decoding samples from exactly the large model's distribution: the small draft model proposes several tokens cheaply, and the large model verifies them in a single parallel pass, typically achieving a 2–3× speedup with identical output quality.
Incorrect. Speculative decoding's key property is that the output distribution is identical to the large model's while inference is typically 2–3× faster, because the large model verifies a batch of drafted tokens in one parallel pass instead of generating each token sequentially.
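The claim that the output distribution is unchanged can be checked on a toy vocabulary. The sketch below covers a single token position with made-up distributions `p` (verifier) and `q` (draft); real systems draft several tokens and score them in one forward pass of the large model, which is not modeled here. The function names are illustrative.

```python
import random

def residual_distribution(p, q):
    """Normalized max(0, p - q), used to resample after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:  # p == q, so the draft token is never rejected
        return list(p)
    return [d / total for d in diff]

def speculative_step(p, q, rng):
    """Sample one token: the draft proposes from q, the verifier accepts or resamples."""
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):  # accept with probability min(1, p/q)
        return x, True
    return rng.choices(vocab, weights=residual_distribution(p, q))[0], False

def output_distribution(p, q):
    """Exact distribution of the token emitted by speculative_step."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]  # q(x) * min(1, p(x) / q(x))
    reject_mass = 1.0 - sum(accept)
    residual = residual_distribution(p, q)
    return [a + reject_mass * r for a, r in zip(accept, residual)]

p = [0.6, 0.3, 0.1]  # hypothetical verifier (large model) distribution
q = [0.3, 0.5, 0.2]  # hypothetical draft (small model) distribution

emitted = output_distribution(p, q)
token, accepted = speculative_step(p, q, random.Random(0))
print([round(e, 6) for e in emitted], token, accepted)
```

The exact emitted distribution equals `p` whatever `q` is. A better draft model only raises the acceptance rate, and therefore the speedup.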
10. The Vicuna-13B model was released in March 2023 using fine-tuning data from which source?
Correct. Vicuna was fine-tuned on 70,000 conversations that users had with ChatGPT and shared on ShareGPT.com — raising OpenAI ToS concerns about using API outputs for training competitive models.
Not correct. Vicuna-13B used 70,000 real ChatGPT conversations scraped from ShareGPT.com, where users had publicly shared their ChatGPT dialogue histories.
11. What is the "capacity gap" in distillation and why does it set a fundamental ceiling on distillation quality?
Correct. If a capability is encoded in the teacher in a way that requires more parameters to represent than the student has, no distillation signal can transfer it — the student simply lacks the capacity to represent that knowledge.
Incorrect. The capacity gap is a fundamental information-theoretic limit: some teacher knowledge requires more representational capacity than the student model possesses, making transfer impossible regardless of training method quality.
12. Which type of model performance does distillation tend to degrade most significantly?
Correct. Common-case performance is largely preserved through distillation. The tail distribution — rare knowledge, unusual inputs, calibrated uncertainty — degrades most as the student lacks capacity to represent the teacher's full distribution.
Not quite. Distillation tends to preserve common-case accuracy while degrading performance at the tails — unusual inputs, rare knowledge, and well-calibrated uncertainty on edge cases.
13. What does the "alignment tax" in behavioral distillation refer to?
Correct. Behavioral cloning transfers surface behavior but bypasses the underlying RLHF alignment process. Models like Alpaca and Vicuna could be prompted into harmful outputs far more easily than their GPT-3.5/GPT-4 teachers.
Incorrect. The alignment tax in distillation describes how students inherit capabilities without the safety constraints — because behavioral distillation never exposes the student to the teacher's alignment training process, only its outputs.
14. Self-distillation techniques like Medusa and Speculative Decoding with Draft Heads work by:
Correct. Medusa and related approaches add lightweight multi-token prediction heads to an existing frozen large model, training them using the model's own internal representations — a form of self-distillation that improves inference speed.
Incorrect. These self-distillation techniques add lightweight heads on top of a frozen large model, trained using the model's own hidden states. No separate small model is trained from scratch.
15. According to DeepSeek's R1 paper, why couldn't the distilled models' reasoning be matched by applying RL directly to a smaller base model?
Correct. DeepSeek reported that large-scale RL applied directly to a smaller base model (their comparison used Qwen-32B-Base) fell well short of the same-size model distilled from R1. The reasoning patterns had to be cultivated in the large model through RL before distillation could transfer them.
Incorrect. DeepSeek's key finding was that strong reasoning developed first in the large R1 model via RL and was then transferred to smaller models via distillation. Direct RL on a smaller base model (Qwen-32B-Base in their comparison) did not reach the distilled model's performance.