When OpenAI published its paper on InstructGPT in early 2022, it revealed something counterintuitive: a model with 1.3 billion parameters that had been fine-tuned on human feedback outperformed the raw 175-billion-parameter GPT-3 on human preference ratings. The difference wasn't intelligence — it was alignment. The fine-tuning process hadn't added knowledge. It had reshaped how the model used the knowledge it already had.
Every modern LLM passes through at least two distinct training phases. The first is pre-training: the model ingests hundreds of billions of tokens from the web, books, and code, learning statistical patterns in language at an enormous scale. After pre-training, the model is extraordinarily capable — it can complete sentences, write poetry, solve math problems — but it has no consistent persona, no reliable instruction-following, and no particular preference for being helpful over harmful. It simply predicts what text comes next.
The second phase is fine-tuning: a targeted, comparatively tiny training run on a curated dataset designed to shift the model's behavior toward a specific goal. Where pre-training might consume 300 billion tokens over weeks on thousands of GPUs, fine-tuning might use 100,000 examples over a few hours. The weight updates are smaller, more surgical, aimed at particular behavioral traits rather than broad language competence.
Fine-tuning modifies the model's weights, the billions of floating-point numbers that encode everything the model knows. However, the changes are not uniformly distributed. Studies of fine-tuning dynamics suggest that the updates tend to concentrate in components tied to task-specific behavior, particularly the later transformer layers and the attention heads that shape output style and format, while the factual knowledge encoded in early and middle layers is largely preserved.
Think of it this way: pre-training fills the model's "memory" with the entire contents of a vast library. Fine-tuning doesn't replace those books — it trains the librarian to respond to requests in a particular style, to prioritize certain sections, and to refuse to discuss certain topics. The books remain unchanged.
This is why fine-tuned models can sometimes be "jailbroken" by sufficiently crafty prompts: the underlying knowledge from pre-training is still there. Fine-tuning adds a behavioral layer on top of the base representation, but it doesn't surgically remove dangerous knowledge; it trains the model to decline to deploy it.
OpenAI's InstructGPT paper showed that fine-tuning GPT-3 on approximately 13,000 human-labeled instruction-response pairs, followed by reinforcement learning from human feedback (RLHF), produced a model whose outputs human raters preferred 85% of the time over those of the raw 175B-parameter GPT-3. The fine-tuned model started from the same pre-trained weights as GPT-3; only the behavioral layer changed.
The most common form of fine-tuning used in production is supervised fine-tuning (SFT). A dataset of (prompt, ideal response) pairs is assembled — typically through human labelers writing or rating responses. The model is then trained to maximize the probability of producing those ideal responses given those prompts. It is standard supervised learning, applied to a pre-trained base.
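The SFT objective described above can be sketched as a negative log-likelihood over the response tokens. The sketch below assumes a common convention in which prompt tokens are masked out of the loss, so the model is trained to produce the response given the prompt rather than to reproduce the prompt. The function name `sft_loss`, the toy probabilities, and the mask layout are illustrative assumptions, not taken from any particular training library.

```python
import math


def sft_loss(token_probs, is_response):
    """Average negative log-likelihood over response tokens only.

    token_probs[i]: probability the model assigned to the i-th target token
    is_response[i]: True if token i belongs to the ideal response,
                    False if it belongs to the prompt (excluded from the loss)
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    if not losses:
        raise ValueError("no response tokens to train on")
    return sum(losses) / len(losses)


# Toy sequence (hypothetical numbers): 3 prompt tokens, then 2 response tokens.
probs = [0.9, 0.8, 0.7, 0.5, 0.25]
mask = [False, False, False, True, True]
loss = sft_loss(probs, mask)
print(f"SFT loss on response tokens: {loss:.4f}")
```

Minimizing this quantity is the same as maximizing the probability of the ideal response given the prompt; a real training loop would compute it from model logits and backpropagate through the weights.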
The size of the SFT dataset matters less than its quality. Meta's Llama 2 paper, published in 2023, noted that its chat models were trained on fewer than 30,000 SFT examples, each carefully reviewed. Quality control, the authors found, mattered more than scale: a few tens of thousands of high-quality annotations gave better results than millions of examples drawn from third-party datasets.
SFT alone, however, does not guarantee the model will follow nuanced human preferences. It teaches the model to imitate the labeled outputs, but those outputs may not capture the full range of what a human would find helpful, harmless, and honest. This is where RLHF enters — but that's Lesson 3. First, we need to understand what prompting can accomplish before we reach for fine-tuning at all.
Fine-tuning changes behavior without replacing knowledge. A fine-tuned model and its base model share the same foundational representations — the difference is in how those representations are accessed and expressed. This distinction is essential for understanding when fine-tuning is necessary and when a well-designed prompt is sufficient.
You're talking with an AI assistant that specializes in fine-tuning mechanics. Use this lab to deepen your understanding of what actually happens when a model is fine-tuned. Ask about weight updates, SFT datasets, the difference between pre-training and fine-tuning, or why fine-tuned models can still be "jailbroken."
In 2022, two papers stunned the NLP community. In January, researchers at Google Brain showed that including worked, step-by-step reasoning in a prompt's few-shot examples could dramatically improve a model's performance on multi-step reasoning tasks, without any fine-tuning whatsoever. A few months later, Kojima et al. found that simply appending the phrase "Let's think step by step" to a prompt, with no examples at all, also produced large gains. The phenomenon, called chain-of-thought prompting, revealed that enormous capability was already latent in large models, waiting to be unlocked by the right sequence of tokens.
Prompting is the process of crafting the input context given to a model to steer its output without changing any weights. Because transformer models are conditioned on their entire context window, the prompt functions as an implicit specification of task, style, format, persona, and constraints. A skilled prompt engineer can achieve results that look — and often are — equivalent to fine-tuning for many tasks.
The key mechanisms through which prompting influences behavior are in-context learning (the model generalizes from examples within the prompt), instruction following (the model has learned to interpret imperative statements as directives), and role conditioning (prefacing a prompt with "You are an expert in X" genuinely shifts the probability distribution over outputs). None of these require weight updates.
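The three mechanisms can be seen side by side in how a prompt is assembled. The following is a minimal sketch: the helper `build_prompt`, the `Q:`/`A:` layout, and the example problems are all hypothetical choices made for illustration, and real chat APIs usually take structured messages rather than one flat string.

```python
def build_prompt(role, instruction, examples, question):
    """Assemble a plain-text prompt from a role line (role conditioning),
    an instruction (instruction following), and worked examples
    (in-context learning), followed by the new question."""
    parts = [f"You are {role}.", instruction, ""]
    for example_question, worked_answer in examples:
        parts.append(f"Q: {example_question}")
        parts.append(f"A: {worked_answer}")
        parts.append("")
    parts.append(f"Q: {question}")
    parts.append("A:")
    return "\n".join(parts)


# One few-shot example whose answer spells out its reasoning steps.
examples = [
    (
        "A shelf holds 3 boxes of 4 books each. How many books are there?",
        "Each box has 4 books and there are 3 boxes, so 3 * 4 = 12. The answer is 12.",
    ),
]

prompt = build_prompt(
    role="a careful math tutor",
    instruction="Solve each problem, showing your reasoning before the final answer.",
    examples=examples,
    question="A crate holds 5 bags of 6 apples each. How many apples are there?",
)
print(prompt)
```

Everything that steers the model here lives in the text of `prompt`; swapping the role, the instruction, or the examples changes behavior without touching a single weight.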
Chain-of-Thought (Wei et al., Google Brain, 2022): Adding intermediate reasoning steps to few-shot prompts improved PaLM 540B's accuracy on the GSM8K math benchmark from roughly 18% to 57%, without any fine-tuning. That result surpassed the previous state of the art, which had been set by a fine-tuned GPT-3 paired with a verifier.
Automatic Prompt Engineer (Zhou et al., 2022): Researchers at the University of Toronto and collaborating institutions showed that using an LLM to generate and score its own prompts, essentially automating prompt engineering, could discover prompts that outperformed human-written ones on several benchmarks, suggesting the prompt space is far richer than manual exploration reveals.
System-Prompt Persona Conditioning: Anthropic's public documentation on Claude describes how the system prompt is used to condition the model's persona, constraints, and knowledge domain. A single paragraph of system-prompt text can transform Claude into a specialized legal assistant, a creative writing collaborator, or a terse data analyst — behaviors that would previously have required fine-tuned variants.
Prompting's power, however, is bounded by what the model has already learned. If the base model has never encountered a domain — say, a highly specialized medical coding system introduced after its training cutoff — no prompt can conjure accurate knowledge from nothing. Similarly, if the base model has strong priors toward a certain output format or reasoning style, prompting can shift but not fully override those priors.
There are also consistency limits. A prompted persona exists only within the context that contains the prompt. A fine-tuned model carries its behavioral changes into every interaction by default; a prompted model requires the system prompt to be resent with each request. For production systems handling millions of requests, a 500-token system prompt has a real compute cost. Training that behavior into the weights removes the per-request overhead, and the savings can be economically significant at scale.
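The overhead is easy to estimate with back-of-the-envelope arithmetic. In the sketch below, the 500-token prompt length comes from the text, while the traffic figure and the price per million tokens are hypothetical placeholders; it also ignores prompt caching and any price difference between a base and a fine-tuned model.

```python
def system_prompt_overhead(requests_per_day, system_prompt_tokens, price_per_million_tokens):
    """Return (extra input tokens per day, extra cost per day) incurred
    by resending the same system prompt with every request."""
    extra_tokens = requests_per_day * system_prompt_tokens
    extra_cost = extra_tokens / 1_000_000 * price_per_million_tokens
    return extra_tokens, extra_cost


tokens, cost = system_prompt_overhead(
    requests_per_day=2_000_000,     # hypothetical traffic
    system_prompt_tokens=500,       # prompt length used in the text
    price_per_million_tokens=1.00,  # hypothetical price, in dollars
)
print(f"{tokens:,} extra input tokens per day, about ${cost:,.2f} per day")
```

Under these assumed numbers the system prompt alone accounts for a billion input tokens a day, which is the kind of figure that makes weight-encoded behavior worth pricing out.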
Finally, prompting is inherently fragile against adversarial inputs. A user who knows the system prompt structure can craft messages designed to override or ignore it. Fine-tuning behavioral constraints into the weights provides a deeper, more robust layer of alignment — though not an impenetrable one, as jailbreaking research consistently demonstrates.
| Dimension | Prompting | Fine-Tuning |
|---|---|---|
| Speed to deploy | Minutes | Hours to days |
| Cost | Inference tokens only | Training compute + storage |
| Knowledge injection | Limited to context window | Can add domain knowledge to weights |
| Behavioral consistency | Per-request (requires system prompt) | Persistent across all interactions |
| Adversarial robustness | Fragile | More robust (but not impenetrable) |
| Reversibility | Instant — just change the prompt | Requires retraining to undo |
| Interpretability | High — human-readable instructions | Low — encoded in weight deltas |
Google's internal guidance, referenced in their 2023 paper on prompt engineering best practices, articulates a "prompt-first" principle: before investing in fine-tuning, exhaust what prompting can accomplish. Fine-tuning should be reserved for cases where prompting demonstrably fails or where the economics of inference tokens at scale make weight-encoded behavior cheaper.
This lab is a prompt engineering sandbox. The AI assistant will help you explore what prompting can and cannot accomplish — including chain-of-thought techniques, few-shot examples, role conditioning, and the boundaries where prompting fails. Try pushing the limits.
By mid-2021, it was clear to researchers at OpenAI that GPT-3, despite its raw capability, exhibited a fundamental problem: it was optimized to predict human text, not to be helpful to humans. When asked to write an essay arguing for a dangerous position, it would comply — because such essays exist in its training data. The model had learned to imitate humanity's writing; it had not learned to care about being useful, truthful, or safe. Supervised fine-tuning on good examples helped, but it couldn't capture the full richness of what humans actually preferred — because preference is comparative, contextual, and difficult to express as a single "ideal" output.
The core limitation of SFT is that it requires labelers to produce the ideal output — but for many tasks, it's far easier for a human to judge which of two outputs is better than to write the ideal output from scratch. A medical professional can reliably say "Response A is more accurate and cautious than Response B" without being able to dictate the perfect response themselves. SFT squanders this comparative judgment signal.
There is also a distribution problem. SFT trains the model on a fixed set of examples, but the distribution of prompts it will actually receive in deployment may differ substantially from the training distribution. RLHF, by using a reward model that generalizes human preferences, can guide the model toward good behavior on unseen prompts as well.
Phase 1 — Supervised Fine-Tuning: As described in Lesson 1, an SFT model is trained on a set of human-written (prompt, response) pairs. This gives the model a reasonable starting behavioral baseline.
Phase 2 — Reward Model Training: Human labelers are shown multiple model responses to the same prompt and asked to rank them by quality. These preference rankings are used to train a separate model — the reward model — that learns to predict how much a human would prefer any given response. The reward model is essentially a compressed representation of human judgment.
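A common way to turn rankings into a training signal is to expand each ranking into (better, worse) pairs and train the reward model with a pairwise logistic loss, which is the form of loss used in the InstructGPT paper. The sketch below assumes scalar reward scores are already available; the function names and the toy ranking are illustrative, and a real implementation would compute the scores with a neural network and backpropagate through it.

```python
import math
from itertools import combinations


def pairs_from_ranking(ranked_responses):
    """Expand a ranking (best first) into all (better, worse) pairs."""
    return list(combinations(ranked_responses, 2))


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(reward_chosen - reward_rejected)), computed stably.

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it gets the
    order wrong.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A labeler ranked four responses to the same prompt, best first.
pairs = pairs_from_ranking(["resp_a", "resp_b", "resp_c", "resp_d"])
print(len(pairs), "training pairs from one ranking")

# Hypothetical reward scores: correct ordering vs. inverted ordering.
print(pairwise_preference_loss(2.0, 0.0))  # reward model agrees with the labeler
print(pairwise_preference_loss(0.0, 2.0))  # reward model disagrees
```

Because only the difference between the two scores enters the loss, the reward model learns a relative scale of preference rather than an absolute measure of quality.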
Phase 3 — Reinforcement Learning Optimization: The SFT model is then treated as a policy and updated using PPO (Proximal Policy Optimization), a reinforcement learning algorithm. For each prompt, the model generates a response; the reward model scores it; the RL algorithm updates the model's weights to increase the probability of responses the reward model predicts humans will prefer. A KL divergence penalty keeps the policy close to the original SFT model. Without it, the policy tends to drift toward "reward hacking": finding responses that fool the reward model without being genuinely good.
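The quantity the RL step optimizes can be sketched as the reward model's score minus a penalty proportional to how much more likely the response is under the current policy than under the frozen SFT model. This is a simplified, per-response version: the coefficient `beta` and all numbers below are hypothetical, and practical implementations typically apply the penalty per token.

```python
def kl_penalized_reward(reward_model_score, logprob_policy, logprob_reference, beta=0.1):
    """Reward model score minus beta times the log-probability ratio
    between the current policy and the frozen SFT reference model.

    The log ratio, evaluated on responses sampled from the policy, is a
    sample-based estimate of the KL divergence between the two models.
    """
    log_ratio = logprob_policy - logprob_reference
    return reward_model_score - beta * log_ratio


# Policy still matches the SFT model: no penalty is applied.
print(kl_penalized_reward(1.0, logprob_policy=-10.0, logprob_reference=-10.0))

# Policy has drifted and now rates this response far more likely than
# the SFT model did: part of the reward is taken back.
print(kl_penalized_reward(1.0, logprob_policy=-5.0, logprob_reference=-10.0))
```

A larger `beta` ties the policy more tightly to the SFT model, trading some reward for protection against responses that merely exploit the reward model.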
Anthropic proposed a variant called Constitutional AI (CAI), published in December 2022. Rather than relying solely on human labelers for preference rankings, CAI uses the model itself to critique its own responses according to a written "constitution" of principles (e.g., "Be helpful, harmless, and honest"). In a supervised phase, the model generates responses, critiques them, and revises them, and the revised responses become fine-tuning data. In a reinforcement learning phase, the model judges which of two responses better follows the constitution, and those AI-generated preferences train the preference model used for RL, a process called RLAIF (Reinforcement Learning from AI Feedback). This approach reduced the need for human labeling at scale while improving safety behavior, and was used in training Claude.
RLHF is not free. OpenAI's InstructGPT paper and subsequent work have documented an alignment tax: RLHF-trained models sometimes perform slightly worse on certain raw capability benchmarks than the models they were trained from, even while performing better on human preference ratings. The model is optimized to produce responses humans like, and humans don't always prefer the most technically accurate answer; they often prefer fluent, confident-sounding responses.
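The generate, critique, and revise loop can be sketched with a stand-in for the model. In the sketch below, `generate` is any callable that maps a text prompt to a text completion; the prompt wording, the function names, and the `fake_generate` stub are all hypothetical and are not taken from Anthropic's implementation.

```python
def critique_and_revise(generate, prompt, principles):
    """Run one critique-and-revision pass per principle and return the
    final (prompt, response) pair as a candidate training example."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique the response below according to this principle: {principle}\n"
            f"Prompt: {prompt}\nResponse: {response}"
        )
        response = generate(
            "Rewrite the response to address the critique.\n"
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}"
        )
    return {"prompt": prompt, "response": response}


def fake_generate(text):
    """Stand-in for a language model call so the sketch runs offline."""
    if text.startswith("Critique"):
        return "The response could be more cautious."
    if text.startswith("Rewrite"):
        return "A revised, more cautious answer."
    return "A first-draft answer."


example = critique_and_revise(
    fake_generate,
    prompt="Explain how vaccines work.",
    principles=["Be helpful, harmless, and honest."],
)
print(example)
```

With a real model behind `generate`, the returned pairs would be collected into a dataset and used for further training, replacing some of the human labeling that standard RLHF requires.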
This tension — between optimizing for what humans prefer and optimizing for what is actually correct — is one of the central unsolved problems in AI alignment. It surfaces in documented cases: GPT-4 and Claude sometimes produce confidently stated incorrect information that "sounds right" to human raters, a failure mode amplified by the very RLHF process designed to make them better. The Goodhart's Law problem — when the proxy measure becomes the target, it ceases to be a good proxy — applies directly.
RLHF works because human preferences are easier to elicit as comparisons than as specifications. It's harder to write a perfect answer than to recognize one. The reward model learns to encode that recognition ability, and RL uses it to steer the policy. The result is a model that acts as if it understands what humans want — whether or not it actually does.
Explore RLHF mechanics in depth: the reward model training process, how human preference rankings become a training signal, the alignment tax, reward hacking, and how Constitutional AI differs from standard RLHF. Challenge the AI with hard questions about the tradeoffs.
In 2023, as fine-tuning APIs became widely available, a pattern emerged among AI practitioners: teams would spend weeks and thousands of dollars fine-tuning a model on domain-specific data, only to discover that a carefully engineered prompt on the base model performed equivalently or better. The reverse also happened — teams would burn through token budgets constructing elaborate system prompts, not realizing their problem was a knowledge gap that only fine-tuning could close. The industry needed a decision framework. Several emerged; they largely converged.
Practitioners at Google, Anthropic, and OpenAI have articulated variants of the same core decision logic. The questions to ask, in order:
1. Is the required behavior achievable through prompting alone? Before any fine-tuning consideration, attempt the task with best-effort prompt engineering — few-shot examples, chain-of-thought, role conditioning, explicit format instructions. If prompting achieves acceptable performance, stop here.
2. Is the gap a style/format problem or a knowledge problem? When prompting falls short on style (tone, verbosity, format conventions, persona consistency), fine-tuning is usually the better fix, because these traits require persistent behavioral modification. Knowledge problems (facts the model doesn't know) require fine-tuning or retrieval-augmented generation (RAG): a hand-written prompt can supply only as much missing knowledge as fits in the context window.
3. What is the inference volume, and what are the economics? At low volume (thousands of requests/day), a long system prompt is economically fine. At high volume (millions of requests/day), token overhead from system prompts becomes significant. Fine-tuning that behavior into the weights can be cheaper over the model's lifetime.
4. How frequently will the required behavior change? Prompting is instantly reversible — update the text and the behavior changes. Fine-tuning requires retraining, quality evaluation, and deployment. If the task definition changes weekly, fine-tuning is expensive to maintain. If the behavior is stable, fine-tuning amortizes well.
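The four questions can be read as a small decision procedure. The sketch below is one possible encoding, not a standard: the function name, the return strings, the one-million-requests threshold, and the way the volume and stability questions are combined are all assumptions made for illustration.

```python
def recommend_approach(prompting_sufficient, gap_type=None, requests_per_day=0,
                       behavior_changes_often=False, high_volume=1_000_000):
    """Walk the four questions in order and return a recommendation.

    gap_type is "style" or "knowledge" and only matters when prompting
    alone does not reach acceptable performance.
    """
    # Question 1: is prompting alone enough?
    if prompting_sufficient:
        # Questions 3 and 4: at high volume with stable behavior,
        # weight-encoded behavior may be cheaper than a long prompt.
        if requests_per_day >= high_volume and not behavior_changes_often:
            return "prompting works; consider fine-tuning to cut token overhead"
        return "prompting"
    # Question 2: what kind of gap remains?
    if gap_type == "knowledge":
        return "fine-tuning or RAG"
    if gap_type == "style":
        # Question 4: unstable requirements make fine-tuning costly to maintain.
        if behavior_changes_often:
            return "fine-tuning, but budget for frequent retraining"
        return "fine-tuning"
    raise ValueError("gap_type must be 'style' or 'knowledge' when prompting falls short")


print(recommend_approach(True, requests_per_day=5_000))
print(recommend_approach(True, requests_per_day=3_000_000))
print(recommend_approach(False, gap_type="knowledge"))
print(recommend_approach(False, gap_type="style", behavior_changes_often=True))
```

A real decision involves evaluation data and cost estimates rather than boolean flags, but writing the logic out makes the order of the questions, and the point at which each one applies, explicit.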
A practical objection to fine-tuning has always been cost: updating all of a 70-billion-parameter model's weights is computationally expensive and requires storing a full separate copy of the model for each task. LoRA (Low-Rank Adaptation), introduced by Hu et al. at Microsoft in 2021, elegantly sidesteps this. Instead of updating all weights, LoRA adds small low-rank matrices alongside the original weight matrices and trains only those — representing the fine-tuning "delta" as the product of two small matrices.
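The low-rank update can be sketched in a few lines of plain Python. Following the LoRA paper, the adapted layer computes W x + (alpha / r) B A x, where W is frozen and only A and B are trained, and B starts at zero so the adapter initially changes nothing. The tiny matrices and the helper names below are illustrative; real implementations use tensor libraries and much larger shapes.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x).

    W: frozen weight matrix, shape (d_out, d_in)
    A: trainable matrix, shape (rank, d_in)
    B: trainable matrix, shape (d_out, rank)
    """
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, low_rank)]


W = [[1, 0, 0], [0, 2, 0], [0, 0, 3]]  # frozen 3x3 base weights
A = [[1, 2, 3]]                        # rank-1 adapter, shape (1, 3)
x = [1, 1, 1]

# At initialization B is all zeros, so the output equals the base model's.
B_init = [[0.0], [0.0], [0.0]]
print(lora_forward(W, A, B_init, x, alpha=2, rank=1))

# After (hypothetical) training, B is nonzero and the output shifts.
B_trained = [[0.5], [0.0], [-0.5]]
print(lora_forward(W, A, B_trained, x, alpha=2, rank=1))
```

The base matrix `W` is never modified; the entire fine-tuning "delta" lives in `A` and `B`, which is why an adapter can be stored and swapped separately from the model.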
The result: a 70B model can be fine-tuned for a specific task by training only tens of millions of parameters instead of 70 billion, reducing compute and storage requirements by orders of magnitude. The LoRA adapter (typically tens to a few hundred megabytes, depending on rank and numeric precision) can be swapped in at inference time, while the base model weights remain frozen on GPU. Multiple LoRA adapters can be maintained for different tasks, each loadable on demand.
LoRA has become the dominant fine-tuning method for open-source models. Meta's LLaMA fine-tuning ecosystem, Hugging Face's PEFT library, and most practical fine-tuning tutorials released since 2022 rely on LoRA or a variant. It lowered the cost of fine-tuning from "requires a cloud GPU cluster" to "achievable on a single consumer GPU for smaller models."
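The "tens of millions" figure follows from simple counting: a rank-r adapter on a d_in by d_out matrix adds r * (d_in + d_out) parameters instead of d_in * d_out. The configuration below is hypothetical, chosen only to be roughly the scale of a 70B model; it is not the published architecture of any specific model, and the choice of rank and of which matrices to adapt varies in practice.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in one LoRA adapter: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)


# Hypothetical configuration, for illustration only.
hidden_size = 8192
num_layers = 80
adapted_matrices_per_layer = 2  # e.g. two attention projections per layer
rank = 16

per_matrix_full = hidden_size * hidden_size
per_matrix_lora = lora_param_count(hidden_size, hidden_size, rank)
total_lora = per_matrix_lora * adapted_matrices_per_layer * num_layers

print(f"full matrix:   {per_matrix_full:,} parameters")
print(f"LoRA adapter:  {per_matrix_lora:,} parameters")
print(f"total trained: {total_lora:,} parameters")
```

Under these assumptions each adapter is 256 times smaller than the matrix it modifies, and the whole set of trained parameters comes to about 42 million.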
When OpenAI opened fine-tuning access for GPT-3.5-turbo in August 2023, they published guidance noting that most customers who attempted fine-tuning discovered they could achieve equivalent results with improved prompt engineering. They recommended the workflow: try prompt engineering first, then few-shot examples, then fine-tuning only when those fail. Their internal benchmark showed fine-tuning improved performance meaningfully only when the task required either style consistency or domain knowledge not in the base model.
In practice, the most capable production deployments combine both approaches. A model is fine-tuned for broad behavioral alignment (via RLHF) and domain specialization (via SFT or LoRA), then further conditioned at inference time with a system prompt that specifies task-specific behavior, user context, and constraints. The fine-tuning provides the stable behavioral foundation; the prompt provides the dynamic, per-deployment specialization. Neither approach alone captures everything the combination achieves.
Understanding where fine-tuning ends and prompting begins — and why both exist — gives practitioners the conceptual vocabulary to make better architectural decisions, evaluate vendor claims more critically, and understand why the same underlying model can behave so differently across deployments.
Fine-tuning reshapes a model's behavioral layer without replacing its knowledge base. Prompting activates latent capabilities without touching weights. RLHF encodes human preferences through a learned reward signal. LoRA makes fine-tuning economically tractable. RAG solves knowledge gaps without retraining. The practitioner's art is knowing which tool the problem actually calls for — and having the evidence to defend that choice.
Bring your own use cases to this lab. Describe a task or product scenario, and the AI will help you work through the decision framework: should you fine-tune, prompt engineer, use RAG, or combine approaches? The advisor will ask clarifying questions and explain the reasoning behind each recommendation.