Module 6 · Lesson 1

What Fine-Tuning Actually Does to a Model

Pre-training gives a model the world. Fine-tuning narrows its gaze.

When OpenAI turned GPT-3 into InstructGPT, what exactly changed — and what stayed the same?

When OpenAI published its paper on InstructGPT in early 2022, it revealed something counterintuitive: a model with 1.3 billion parameters that had been fine-tuned on human feedback outperformed the raw 175-billion-parameter GPT-3 on human preference ratings. The difference wasn't intelligence — it was alignment. The fine-tuning process hadn't added knowledge. It had reshaped how the model used the knowledge it already had.

The Two Phases of a Model's Life

Every modern LLM passes through at least two distinct training phases. The first is pre-training: the model ingests hundreds of billions of tokens from the web, books, and code, learning statistical patterns in language at an enormous scale. After pre-training, the model is extraordinarily capable — it can complete sentences, write poetry, solve math problems — but it has no consistent persona, no reliable instruction-following, and no particular preference for being helpful over harmful. It simply predicts what text comes next.

The second phase is fine-tuning: a targeted, comparatively tiny training run on a curated dataset designed to shift the model's behavior toward a specific goal. Where pre-training might consume 300 billion tokens over weeks on thousands of GPUs, fine-tuning might use 100,000 examples over a few hours. The weight updates are smaller, more surgical, aimed at particular behavioral traits rather than broad language competence.
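To make the difference in scale concrete, the short calculation below compares the two figures quoted above. The average number of tokens per fine-tuning example is an assumption chosen for illustration; the lesson does not specify it.

```python
# Rough scale comparison between pre-training and fine-tuning.
pretraining_tokens = 300_000_000_000  # 300 billion tokens (figure from the text)
finetuning_examples = 100_000         # figure from the text
tokens_per_example = 500              # ASSUMPTION: average length of one example

finetuning_tokens = finetuning_examples * tokens_per_example
ratio = finetuning_tokens / pretraining_tokens

print(f"Fine-tuning tokens: {finetuning_tokens:,}")
print(f"Share of pre-training volume: {ratio:.4%}")
```

Under these assumptions the fine-tuning run sees a small fraction of one percent of the data volume used in pre-training.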

Pre-training The large-scale, self-supervised training phase where the model learns to predict tokens from massive text corpora. Produces broad world knowledge and language fluency.

Fine-tuning A secondary, supervised training phase on a smaller curated dataset, adjusting the model's weights to produce outputs aligned with a specific style, domain, or behavior.

What Fine-Tuning Changes

Fine-tuning modifies the model's weights, the billions of floating-point numbers that encode everything the model knows. The changes are usually small relative to the pre-trained values, and they are not necessarily uniform across the network. Some studies of fine-tuning dynamics report that the updates matter most in the later transformer layers, which sit closer to the output and are more task-specific, while representations in earlier layers shift less. This is a tendency rather than a clean separation: factual knowledge is not stored in a distinct set of layers that fine-tuning leaves untouched.

Think of it this way: pre-training fills the model's "memory" with the entire contents of a vast library. Fine-tuning doesn't replace those books. It trains the librarian to respond to requests in a particular style, to prioritize certain sections, and to refuse to discuss certain topics. The books remain largely unchanged.

This is why fine-tuned models can sometimes be "jail-broken" by sufficiently crafty prompts: the underlying knowledge from pre-training is still there. Fine-tuning adds a behavioral layer on top of the base representation, but it doesn't surgically remove dangerous knowledge — it trains the model to decline to deploy it.

Real Case — InstructGPT (OpenAI, 2022)

OpenAI's InstructGPT paper showed that fine-tuning GPT-3 on approximately 13,000 human-labeled instruction-response pairs, followed by reinforcement learning from human feedback (RLHF), produced models whose outputs human raters preferred over those of the raw 175B-parameter GPT-3. The 175B InstructGPT model was preferred about 85% of the time, and even the 1.3B InstructGPT model was preferred over the 175B base model. Each fine-tuned model started from the same pre-trained weights as its base counterpart; what changed was how the model behaved.

Supervised Fine-Tuning (SFT): The First Step

The most common form of fine-tuning used in production is supervised fine-tuning (SFT). A dataset of (prompt, ideal response) pairs is assembled — typically through human labelers writing or rating responses. The model is then trained to maximize the probability of producing those ideal responses given those prompts. It is standard supervised learning, applied to a pre-trained base.
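The training objective described above can be sketched as a negative log-likelihood over the response tokens. This is a minimal pure-Python illustration on a toy four-token vocabulary, not a production training loop. Excluding prompt tokens from the loss via a mask is a common practice assumed here, and the names sft_loss and loss_mask are made up for this example.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_loss(logits_per_position, target_ids, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1."""
    total, count = 0.0, 0
    for logits, target, use in zip(logits_per_position, target_ids, loss_mask):
        if use:
            probs = softmax(logits)
            total += -math.log(probs[target])
            count += 1
    return total / count


# Three positions: one prompt token followed by two response tokens.
logits = [
    [2.0, 0.0, 0.0, 0.0],  # prompt position (masked out of the loss)
    [0.0, 3.0, 0.0, 0.0],  # model is fairly confident in token 1
    [0.0, 0.0, 0.0, 0.0],  # model is uncertain (uniform over 4 tokens)
]
targets = [0, 1, 2]
mask = [0, 1, 1]  # ASSUMPTION: only response tokens contribute to the loss

loss = sft_loss(logits, targets, mask)
print(f"SFT loss: {loss:.4f}")
```

Gradient descent on this quantity raises the probability the model assigns to each token of the ideal response, which is what "maximize the probability of producing those ideal responses" means in practice.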

The size of the SFT dataset matters less than its quality. Meta's Llama 2 paper, published in 2023, noted that their instruction-tuned models were trained on fewer than 30,000 SFT examples, each written to a high standard. Quality control, they found, mattered more than scale: setting aside millions of lower-quality third-party examples in favor of a small, carefully curated set notably improved results.

SFT alone, however, does not guarantee the model will follow nuanced human preferences. It teaches the model to imitate the labeled outputs, but those outputs may not capture the full range of what a human would find helpful, harmless, and honest. This is where RLHF enters — but that's Lesson 3. First, we need to understand what prompting can accomplish before we reach for fine-tuning at all.

Key Insight

Fine-tuning changes behavior without replacing knowledge. A fine-tuned model and its base model share the same foundational representations — the difference is in how those representations are accessed and expressed. This distinction is essential for understanding when fine-tuning is necessary and when a well-designed prompt is sufficient.

Lesson 1 Quiz

What Fine-Tuning Actually Does to a Model

What does fine-tuning modify in a language model?

Correct. Fine-tuning performs additional gradient descent steps on the existing weights, nudging them toward behaviors represented in the fine-tuning dataset. The architecture and pre-trained knowledge base remain intact.

Not quite. Fine-tuning does not alter the pre-training data itself — that training is complete. It adjusts the weights derived from pre-training through a new, smaller training run.

In the InstructGPT experiment, why did a 1.3B fine-tuned model outperform the raw 175B GPT-3?

Correct. The 1.3B InstructGPT model was evaluated on human preference ratings, not raw capability benchmarks. Fine-tuning reshaped how the model responded — making it more helpful and consistent — even though it had far fewer parameters.

Not quite. The key was behavioral alignment, not dataset recency or model size advantage. Human raters preferred the fine-tuned model's responses because fine-tuning made the model follow instructions reliably and helpfully.

Which of the following best describes Supervised Fine-Tuning (SFT)?

Correct. SFT is standard supervised learning applied to a pre-trained base. Human labelers create or curate (prompt, response) pairs, and the model is trained to maximize the probability of those ideal responses.

Not quite. The option describing reward models and policy gradients describes RLHF, which is a subsequent step. SFT is the simpler first phase: show the model good examples and have it imitate them.

Lab 1 — The Fine-Tuning Mechanics Explorer

Conversation lab · Complete 3 exchanges to finish

What you're doing

You're talking with an AI assistant that specializes in fine-tuning mechanics. Use this lab to deepen your understanding of what actually happens when a model is fine-tuned. Ask about weight updates, SFT datasets, the difference between pre-training and fine-tuning, or why fine-tuned models can still be "jailbroken."

Suggested starter: "If fine-tuning only changes some weights, how does the model know which weights to change?"

Module 6 · Lesson 2

The Power and Limits of Prompting

A well-engineered prompt can unlock remarkable behavior — but it cannot rewrite the model's soul.

GPT-4 and Claude were prompted to perform tasks their creators never explicitly trained — how far does that reach extend?

In early 2022, researchers at Google Brain published a paper that stunned the NLP community: by including a few worked examples with intermediate reasoning steps in the prompt, they could dramatically improve a model's performance on multi-step reasoning tasks without any fine-tuning whatsoever. A follow-up study later that year (Kojima et al.) found that simply appending the phrase "Let's think step by step" produced a similar effect with no examples at all. The phenomenon, called chain-of-thought prompting, revealed that enormous capability was already latent in large models, waiting to be unlocked by the right sequence of tokens.

What Prompting Can Do

Prompting is the process of crafting the input context given to a model to steer its output without changing any weights. Because transformer models are conditioned on their entire context window, the prompt functions as an implicit specification of task, style, format, persona, and constraints. A skilled prompt engineer can achieve results that look — and often are — equivalent to fine-tuning for many tasks.

The key mechanisms through which prompting influences behavior are in-context learning (the model generalizes from examples within the prompt), instruction following (the model has learned to interpret imperative statements as directives), and role conditioning (prefacing a prompt with "You are an expert in X" genuinely shifts the probability distribution over outputs). None of these require weight updates.

In-Context Learning The model's ability to adapt its behavior based on examples or instructions provided within the prompt, without any gradient updates to its weights.

Few-Shot Prompting Providing 2–10 input/output examples in the prompt to demonstrate the desired format or behavior before presenting the actual query.
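A few-shot prompt is just a string assembled from an instruction, some labeled examples, and the new query. The sketch below builds one for a sentiment task. The function name, the "Feedback:" and "Sentiment:" labels, and the example texts are all illustrative choices, not a required format.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, labeled examples, and an unlabeled query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Feedback: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Feedback: {query}")
    lines.append("Sentiment:")  # left open for the model to complete
    return "\n".join(lines)


examples = [
    ("The checkout process was quick and painless.", "positive"),
    ("My order arrived two weeks late.", "negative"),
    ("The package was delivered on Tuesday.", "neutral"),
]

prompt = build_few_shot_prompt(
    "Classify each piece of customer feedback as positive, negative, or neutral.",
    examples,
    "The app keeps crashing when I open my cart.",
)
print(prompt)
```

The model sees three completed pairs and one incomplete pair, so the most likely continuation is a single label in the same format. No weights change at any point.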

Documented Prompt Engineering Wins

Chain-of-Thought (Wei et al., Google Brain, 2022): Adding intermediate reasoning steps to few-shot prompts improved PaLM 540B's accuracy on the GSM8K math benchmark from roughly 18% to roughly 57%, without any fine-tuning. GPT-3 175B showed a similar jump on the same benchmark, and the gains appeared mainly in the largest models.

Automatic Prompt Engineer (Zhou et al., 2022): Researchers at the University of Toronto and collaborating institutions showed that using an LLM to generate and score its own prompts, essentially automating prompt engineering, could discover prompts that outperformed human-written ones on several benchmarks, suggesting the prompt space is far richer than manual exploration reveals.

System-Prompt Persona Conditioning: Anthropic's public documentation on Claude describes how the system prompt is used to condition the model's persona, constraints, and knowledge domain. A single paragraph of system-prompt text can transform Claude into a specialized legal assistant, a creative writing collaborator, or a terse data analyst — behaviors that would previously have required fine-tuned variants.
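To see what chain-of-thought prompting changes at the level of the prompt text, the sketch below formats the same exemplar two ways: answer only, and reasoning followed by the answer. The exemplar is invented for this illustration and is not taken from any benchmark or paper.

```python
EXEMPLAR = {
    "question": "A shop has 12 apples. It sells 5 and then receives 8 more. "
                "How many apples does it have?",
    "reasoning": "The shop starts with 12 apples. After selling 5 it has "
                 "12 - 5 = 7. After receiving 8 more it has 7 + 8 = 15.",
    "answer": "15",
}


def format_exemplar(ex, with_reasoning):
    """Render one worked example, optionally including the reasoning steps."""
    answer = f"The answer is {ex['answer']}."
    if with_reasoning:
        answer = f"{ex['reasoning']} {answer}"
    return f"Q: {ex['question']}\nA: {answer}"


def build_prompt(exemplars, question, with_reasoning):
    """Join the exemplars and append the new question with an open answer."""
    parts = [format_exemplar(ex, with_reasoning) for ex in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


new_question = "A bus has 30 seats. 18 are taken and 4 people get off. How many seats are free?"
standard_prompt = build_prompt([EXEMPLAR], new_question, with_reasoning=False)
cot_prompt = build_prompt([EXEMPLAR], new_question, with_reasoning=True)

print(cot_prompt)
```

The only difference between the two prompts is the reasoning text inside the exemplar. With the chain-of-thought version, the model tends to continue the pattern and write out its own intermediate steps before the final answer.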

The Hard Limits of Prompting

Prompting's power, however, is bounded by what the model has already learned. If the base model has never encountered a domain — say, a highly specialized medical coding system introduced after its training cutoff — no prompt can conjure accurate knowledge from nothing. Similarly, if the base model has strong priors toward a certain output format or reasoning style, prompting can shift but not fully override those priors.

There are also consistency limits. A prompted persona must be re-established in every context window. A fine-tuned model carries its behavioral changes into every interaction by default; a prompted model requires the system prompt to be resent each time. For production systems handling millions of requests, a 500-token system prompt has a real compute cost. Encoding that behavior in the weights through fine-tuning removes the per-request overhead, which can be economically significant at scale.
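A back-of-the-envelope calculation shows how the system-prompt overhead adds up. The request volume and the per-token price below are hypothetical values chosen for illustration; substitute real figures for your provider and traffic.

```python
system_prompt_tokens = 500             # figure from the text
requests_per_day = 5_000_000           # ASSUMPTION: daily request volume
price_per_million_input_tokens = 0.50  # ASSUMPTION: hypothetical price in USD

# Tokens spent every day just on resending the same system prompt.
daily_overhead_tokens = system_prompt_tokens * requests_per_day
daily_cost = daily_overhead_tokens / 1_000_000 * price_per_million_input_tokens
yearly_cost = daily_cost * 365

print(f"Overhead tokens per day: {daily_overhead_tokens:,}")
print(f"Overhead cost per day:   ${daily_cost:,.2f}")
print(f"Overhead cost per year:  ${yearly_cost:,.2f}")
```

This sketch counts only the prompt overhead. A full comparison would also include the training cost of the fine-tuned model and any difference in its per-token price.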

Finally, prompting is inherently fragile against adversarial inputs. A user who knows the system prompt structure can craft messages designed to override or ignore it. Fine-tuning behavioral constraints into the weights provides a deeper, more robust layer of alignment — though not an impenetrable one, as jailbreaking research consistently demonstrates.

Dimension	Prompting	Fine-Tuning
Speed to deploy	Minutes	Hours to days
Cost	Inference tokens only	Training compute + storage
Knowledge injection	Limited to context window	Can add domain knowledge to weights
Behavioral consistency	Per-request (requires system prompt)	Persistent across all interactions
Adversarial robustness	Fragile	More robust (but not impenetrable)
Reversibility	Instant — just change the prompt	Requires retraining to undo
Interpretability	High — human-readable instructions	Low — encoded in weight deltas

The Prompt-First Principle

Google's internal guidance, referenced in their 2023 paper on prompt engineering best practices, articulates a "prompt-first" principle: before investing in fine-tuning, exhaust what prompting can accomplish. Fine-tuning should be reserved for cases where prompting demonstrably fails or where the economics of inference tokens at scale make weight-encoded behavior cheaper.

Lesson 2 Quiz

The Power and Limits of Prompting

What did Wei et al.'s 2022 chain-of-thought study demonstrate about prompting?

Correct. The chain-of-thought paper showed that simply including intermediate reasoning steps in the prompt, with no weight changes, could raise PaLM 540B's GSM8K math score from ~18% to ~57%. Capability was already latent; the prompt unlocked it.

Not quite. This is actually the opposite of the finding. Chain-of-thought prompting worked on standard pre-trained models — no fine-tuning required — revealing that behavioral improvements can sometimes come entirely from prompt design.

Which scenario represents a hard limit of prompting that fine-tuning can address?

Correct. Prompting can only activate knowledge already in the model's weights. If the model has never been exposed to a piece of information — a proprietary codebase, a post-training-cutoff product — no prompt can manufacture accurate knowledge. Fine-tuning can encode it.

Not quite. Writing style, step-by-step reasoning, and persona conditioning are all achievable through prompting alone. The genuine limit of prompting is knowledge injection — you cannot prompt a model into knowing facts it was never trained on.

At production scale, what is one economic argument for fine-tuning over long system prompts?

Correct. At millions of requests per day, a 500-token system prompt adds significant token processing cost. If that behavioral context can instead be baked into the weights via fine-tuning, each inference call is cheaper — potentially by a substantial margin at scale.

Not quite. Fine-tuned models don't necessarily have fewer parameters. The economic argument is about inference token cost: resending a large system prompt millions of times is expensive, while weight-encoded behavior has no per-request overhead.

Lab 2 — Prompt Engineering Workshop

Conversation lab · Complete 3 exchanges to finish

What you're doing

This lab is a prompt engineering sandbox. The AI assistant will help you explore what prompting can and cannot accomplish — including chain-of-thought techniques, few-shot examples, role conditioning, and the boundaries where prompting fails. Try pushing the limits.

Suggested starter: "Show me how few-shot prompting works by helping me design a prompt that teaches a model to classify customer feedback as positive, negative, or neutral."

Module 6 · Lesson 3

RLHF: Teaching Models What Humans Actually Want

Supervised fine-tuning teaches imitation. RLHF teaches preference.

Why did OpenAI, Anthropic, and DeepMind all converge on reinforcement learning from human feedback as the path to aligned AI assistants?

By mid-2021, it was clear to researchers at OpenAI that GPT-3, despite its raw capability, exhibited a fundamental problem: it was optimized to predict human text, not to be helpful to humans. When asked to write an essay arguing for a dangerous position, it would comply — because such essays exist in its training data. The model had learned to imitate humanity's writing; it had not learned to care about being useful, truthful, or safe. Supervised fine-tuning on good examples helped, but it couldn't capture the full richness of what humans actually preferred — because preference is comparative, contextual, and difficult to express as a single "ideal" output.

Why Supervised Fine-Tuning Isn't Enough

The core limitation of SFT is that it requires labelers to produce the ideal output — but for many tasks, it's far easier for a human to judge which of two outputs is better than to write the ideal output from scratch. A medical professional can reliably say "Response A is more accurate and cautious than Response B" without being able to dictate the perfect response themselves. SFT squanders this comparative judgment signal.

There is also a distribution problem. SFT trains the model on a fixed set of examples, but the distribution of prompts it will actually receive in deployment may differ substantially from the training distribution. RLHF, by using a reward model that generalizes human preferences, can guide the model toward good behavior on unseen prompts as well.

How RLHF Works: The Three Phases

Phase 1 — Supervised Fine-Tuning: As described in Lesson 1, an SFT model is trained on a set of human-written (prompt, response) pairs. This gives the model a reasonable starting behavioral baseline.

Phase 2 — Reward Model Training: Human labelers are shown multiple model responses to the same prompt and asked to rank them by quality. These preference rankings are used to train a separate model — the reward model — that learns to predict how much a human would prefer any given response. The reward model is essentially a compressed representation of human judgment.

Phase 3 — Reinforcement Learning Optimization: The SFT model is then treated as a policy and updated using PPO (Proximal Policy Optimization), a reinforcement learning algorithm. For each prompt, the model generates a response; the reward model scores it; the RL algorithm updates the model's weights to increase the probability of responses the reward model predicts humans will prefer. A KL divergence penalty keeps the model from drifting too far from the original SFT model. Without it, the policy tends toward "reward hacking": finding responses that fool the reward model without being genuinely good.
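The reward model in Phase 2 is typically trained with a pairwise loss: for each pair of responses, the loss is small when the preferred response receives the higher score. The sketch below shows that loss (the form used in the InstructGPT paper) along with a helper that turns one ranking into comparison pairs. The function names are illustrative, and the scores are made-up numbers standing in for reward model outputs.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Turn a best-first ranking into (preferred, less_preferred) pairs."""
    return list(combinations(ranked_responses, 2))


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(r_chosen - r_rejected)) in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# A labeler ranked four responses from best to worst.
pairs = ranking_to_pairs(["A", "B", "C", "D"])
print(len(pairs), "comparison pairs:", pairs)

# Loss is low when the reward model agrees with the human, high when it does not.
agrees = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(f"agrees: {agrees:.4f}  disagrees: {disagrees:.4f}")
```

Minimizing this loss over many human comparisons pushes the reward model to score responses in the same order that labelers ranked them.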

Reward Model A separate neural network trained on human preference rankings to predict how desirable a given model response is. Used as a proxy for human judgment during RL training.

KL Penalty A regularization term added to the RL objective that penalizes the policy model for diverging too far from the SFT reference policy, preventing reward hacking and preserving general language capability.
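One common way to apply the KL penalty is to subtract it from the reward model's score before the RL update. The sketch below uses a per-sample estimate of the divergence: the summed difference between the policy's and the reference model's token log-probabilities for the generated response. The coefficient beta and all numeric values are assumptions chosen for illustration.

```python
def kl_penalized_reward(reward_score, policy_logprobs, reference_logprobs, beta):
    """Reward model score minus beta times the policy/reference log-ratio."""
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_score - beta * log_ratio


BETA = 0.1  # ASSUMPTION: penalty strength, tuned in practice

# Response 1: modest reward, policy stays close to the SFT reference.
close = kl_penalized_reward(
    reward_score=1.0,
    policy_logprobs=[-0.1, -0.2],
    reference_logprobs=[-0.5, -0.7],
    beta=BETA,
)

# Response 2: higher raw reward, but the policy has drifted far from the reference.
drifted = kl_penalized_reward(
    reward_score=1.2,
    policy_logprobs=[-0.1, -0.1],
    reference_logprobs=[-2.6, -2.6],
    beta=BETA,
)

print(f"close: {close:.2f}  drifted: {drifted:.2f}")
```

Even though the second response has the higher raw score, its penalized reward is lower, so the optimizer has less incentive to chase outputs the SFT model would consider very unlikely.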

Real Case — Anthropic's Constitutional AI (2022)

Anthropic proposed a variant called Constitutional AI (CAI), published in December 2022. Rather than relying solely on human labelers for preference rankings, CAI uses the model itself to evaluate responses according to a written "constitution" of principles (e.g., "Be helpful, harmless, and honest"). The method has two stages. In the supervised stage, the model generates responses, critiques them against a principle, and revises them; the revised responses become fine-tuning data. In the reinforcement learning stage, the model compares pairs of responses and picks the one that better follows the constitution, and these AI-generated preferences train the preference model used for RL, a process called RLAIF (Reinforcement Learning from AI Feedback). This approach reduced the need for human labeling at scale while improving safety behavior, and was used in training Claude.
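The critique-and-revise loop can be sketched as three calls to a text-generation function. In the Constitutional AI paper this loop produces data for a supervised fine-tuning stage. Everything below is a hypothetical illustration: the prompt wording is invented, and stub_generate is a canned stand-in for a real model call so that the example runs on its own.

```python
def critique_and_revise(prompt, principle, generate):
    """Run one draft -> critique -> revision pass and return a training pair."""
    draft = generate(f"Respond to the request.\nRequest: {prompt}")
    critique = generate(
        f"Critique the response using this principle: {principle}\n"
        f"Request: {prompt}\nResponse: {draft}"
    )
    revision = generate(
        "Rewrite the response so that it addresses the critique.\n"
        f"Request: {prompt}\nResponse: {draft}\nCritique: {critique}"
    )
    # The revised response, paired with the original prompt, becomes training data.
    return {"prompt": prompt, "response": revision}


def stub_generate(text):
    """HYPOTHETICAL stand-in for a model call; returns canned text by task."""
    if text.startswith("Critique"):
        return "The draft is dismissive and offers no safe alternative."
    if text.startswith("Rewrite"):
        return "I can't help with that, but here is a safer way to reach your goal."
    return "No."


pair = critique_and_revise(
    "How do I get into my neighbor's wifi?",
    "Be helpful, harmless, and honest",
    stub_generate,
)
print(pair)
```

With a real model in place of the stub, running this loop over many prompts yields a dataset of revised responses without a human writing any of them.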

The Alignment Tax and Its Cost

RLHF is not free. OpenAI's InstructGPT paper and subsequent work have documented an alignment tax: RLHF-trained models sometimes perform slightly worse on certain public capability benchmarks than the model they were trained from, even while performing better on human preference ratings. The model is optimized to produce responses humans like, and humans don't always prefer the most technically accurate answer; they often prefer fluent, confident-sounding responses.

This tension — between optimizing for what humans prefer and optimizing for what is actually correct — is one of the central unsolved problems in AI alignment. It surfaces in documented cases: GPT-4 and Claude sometimes produce confidently stated incorrect information that "sounds right" to human raters, a failure mode amplified by the very RLHF process designed to make them better. The Goodhart's Law problem — when the proxy measure becomes the target, it ceases to be a good proxy — applies directly.

The RLHF Insight

RLHF works because human preferences are easier to elicit as comparisons than as specifications. It's harder to write a perfect answer than to recognize one. The reward model learns to encode that recognition ability, and RL uses it to steer the policy. The result is a model that acts as if it understands what humans want — whether or not it actually does.

Lesson 3 Quiz

RLHF: Teaching Models What Humans Actually Want

What is the primary purpose of the reward model in RLHF?

Correct. The reward model is trained on human preference rankings and then used as an automated proxy for human judgment. During RL training, it scores the policy model's outputs, providing the signal needed to update the policy without asking a human to evaluate every single response.

Not quite. That description is closer to SFT. The reward model's role is to learn to predict human preferences from comparative rankings, then act as an automated judge during RL optimization — replacing the need for humans to evaluate every generated response.

Why is a KL divergence penalty used during the RL phase of RLHF?

Correct. Without a KL penalty, RL optimization can exploit weaknesses in the reward model — finding outputs that score high on the reward model's metrics without being genuinely good. The KL penalty constrains the policy to stay near the SFT reference, preventing this degenerate optimization.

Not quite. The KL penalty is specifically about preventing reward hacking — the policy exploiting gaps in the reward model to score high without genuinely improving. By penalizing large deviations from the SFT policy, it keeps the model grounded in real language capability.

What distinguishes Anthropic's Constitutional AI (CAI) from standard RLHF?

Correct. Constitutional AI has the model critique and revise its own outputs against a written constitution of principles, then uses AI-generated preference labels in place of human ones during the RL stage (RLAIF). This substantially reduces the human labeling burden while still producing alignment-relevant training signal.

Not quite. The key innovation in CAI is using the model's own judgments (guided by a constitution) in place of extensive human labeling. This is called RLAIF — Reinforcement Learning from AI Feedback — and it's distinct from standard RLHF's reliance on human rankers at every step.

Lab 3 — RLHF Deep Dive

Conversation lab · Complete 3 exchanges to finish

What you're doing

Explore RLHF mechanics in depth: the reward model training process, how human preference rankings become a training signal, the alignment tax, reward hacking, and how Constitutional AI differs from standard RLHF. Challenge the AI with hard questions about the tradeoffs.

Suggested starter: "Explain what reward hacking actually looks like in practice — what kinds of responses would a model learn to generate if it figured out how to fool the reward model?"

Module 6 · Lesson 4

When to Fine-Tune vs. When to Prompt: The Decision Framework

The most expensive tool is not always the best one. The right choice depends on what you're actually trying to change.

How did Google, OpenAI, and enterprise teams learn — sometimes expensively — when fine-tuning is overkill?

In 2023, as fine-tuning APIs became widely available, a pattern emerged among AI practitioners: teams would spend weeks and thousands of dollars fine-tuning a model on domain-specific data, only to discover that a carefully engineered prompt on the base model performed equivalently or better. The reverse also happened — teams would burn through token budgets constructing elaborate system prompts, not realizing their problem was a knowledge gap that only fine-tuning could close. The industry needed a decision framework. Several emerged; they largely converged.

The Four-Question Decision Tree

Practitioners at Google, Anthropic, and OpenAI have articulated variants of the same core decision logic. The questions to ask, in order:

1. Is the required behavior achievable through prompting alone? Before any fine-tuning consideration, attempt the task with best-effort prompt engineering — few-shot examples, chain-of-thought, role conditioning, explicit format instructions. If prompting achieves acceptable performance, stop here.

2. Is the gap a style/format problem or a knowledge problem? Style problems (tone, verbosity, format conventions, persona consistency) that persist after careful prompting are usually good candidates for fine-tuning, because they call for persistent behavioral modification. Knowledge problems (facts the model doesn't know) require fine-tuning or retrieval-augmented generation (RAG): instructions alone cannot supply knowledge the model was never trained on, so the facts must either be trained into the weights or retrieved and placed in the context.

3. What is the inference volume, and what are the economics? At low volume (thousands of requests/day), a long system prompt is economically fine. At high volume (millions of requests/day), token overhead from system prompts becomes significant. Fine-tuning that behavior into the weights can be cheaper over the model's lifetime.

4. How frequently will the required behavior change? Prompting is instantly reversible — update the text and the behavior changes. Fine-tuning requires retraining, quality evaluation, and deployment. If the task definition changes weekly, fine-tuning is expensive to maintain. If the behavior is stable, fine-tuning amortizes well.
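The four questions can be written down as a small function, which makes the order of the checks explicit. This is one possible reading of the framework, not a canonical implementation: the field names, the volume threshold, and the wording of the recommendations are all assumptions made for this sketch.

```python
from dataclasses import dataclass

HIGH_VOLUME_THRESHOLD = 1_000_000  # ASSUMPTION: requests/day treated as high volume


@dataclass
class UseCase:
    prompting_meets_quality_bar: bool  # Question 1
    gap_is_missing_knowledge: bool     # Question 2
    knowledge_changes_often: bool      # Question 2, follow-up
    requests_per_day: int              # Question 3
    behavior_changes_often: bool       # Question 4


def recommend(case):
    """Walk the four questions in order and return a recommendation."""
    if case.prompting_meets_quality_bar:
        high_volume = case.requests_per_day >= HIGH_VOLUME_THRESHOLD
        if high_volume and not case.behavior_changes_often:
            return "prompting now; evaluate fine-tuning to cut token overhead"
        return "prompting"
    if case.gap_is_missing_knowledge:
        return "RAG" if case.knowledge_changes_often else "fine-tuning or RAG"
    if case.behavior_changes_often:
        return "keep iterating on prompts; fine-tuning is costly to maintain"
    return "fine-tuning"


prototype = UseCase(True, False, False, 2_000, True)
brand_voice = UseCase(False, False, False, 5_000_000, False)
daily_wiki = UseCase(False, True, True, 50_000, False)

for name, case in [("prototype", prototype), ("brand voice", brand_voice), ("daily wiki", daily_wiki)]:
    print(f"{name}: {recommend(case)}")
```

Real decisions rarely reduce to five fields, but writing the logic out this way forces each question to be answered explicitly before any training budget is spent.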

Use Prompting When…

The behavior is achievable with good prompt design
You need rapid iteration and reversibility
The task definition changes frequently
Inference volume is low to moderate
You need transparent, auditable instructions
Prototype or early-stage product

Use Fine-Tuning When…

The model needs knowledge it doesn't have
Consistent style/format at scale is required
Inference token cost is prohibitive
Adversarial robustness matters
Behavior is stable and well-defined
Prompting consistently underperforms

Parameter-Efficient Fine-Tuning: LoRA

A practical objection to fine-tuning has always been cost: updating all of a 70-billion-parameter model's weights is computationally expensive and requires storing a full separate copy of the model for each task. LoRA (Low-Rank Adaptation), introduced by Hu et al. at Microsoft in 2021, elegantly sidesteps this. Instead of updating all weights, LoRA adds small low-rank matrices alongside the original weight matrices and trains only those — representing the fine-tuning "delta" as the product of two small matrices.

The result: a 70B model can be fine-tuned for a specific task by training only tens of millions of parameters instead of 70 billion, reducing compute and storage requirements by orders of magnitude. The LoRA adapter (a few hundred megabytes) can be swapped in at inference time, while the base model weights remain frozen on GPU. Multiple LoRA adapters can be maintained for different tasks, each loadable on demand.
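The parameter savings follow directly from the shapes of the matrices involved. The sketch below counts trainable parameters for a single weight matrix and shows how the low-rank product is added to the frozen weights. The layer size and rank are assumed values for illustration, and the scale argument stands in for the scaling factor that LoRA implementations apply to the update.

```python
def full_param_count(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_param_count(d_out, d_in, rank):
    """Trainable parameters for B (d_out x rank) plus A (rank x d_in)."""
    return d_out * rank + rank * d_in


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_effective_weight(W, B, A, scale=1.0):
    """Return W + scale * (B @ A); W itself is never modified."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# ASSUMPTION: one 8192 x 8192 projection matrix and rank 8.
full = full_param_count(8192, 8192)
lora = lora_param_count(8192, 8192, rank=8)
print(f"full: {full:,}  lora: {lora:,}  fraction: {lora / full:.4%}")

# Tiny numeric example with a 2 x 2 weight matrix and rank 1.
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights
B = [[1.0], [2.0]]            # trainable, 2 x 1
A = [[0.5, 0.5]]              # trainable, 1 x 2
print(lora_effective_weight(W, B, A))
```

Because only A and B are trained and stored, the adapter is a small file that can be swapped without touching the base model. Summing the per-matrix counts across every adapted layer gives the total adapter size.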

LoRA has become one of the most widely used fine-tuning methods for open-source models. The Llama fine-tuning ecosystem, Hugging Face's PEFT library, and a large share of practical fine-tuning tutorials released since 2022 use LoRA or a variant. It lowered the cost of fine-tuning from "requires a cloud GPU cluster" to "achievable on a single consumer GPU for smaller models."

LoRA Low-Rank Adaptation — a parameter-efficient fine-tuning method that adds small trainable low-rank matrices to frozen model weights, representing behavioral changes with a fraction of the parameters of full fine-tuning.

RAG (Retrieval-Augmented Generation) A technique that retrieves relevant documents from an external knowledge base at inference time and includes them in the prompt, injecting knowledge without any fine-tuning. An alternative to fine-tuning for knowledge problems.
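The retrieve-then-prompt flow can be shown in a few lines. The sketch below scores documents by simple word overlap with the query, which is a deliberately crude stand-in for the embedding-based search that production systems usually use. The documents and function names are invented for this example.

```python
def tokenize(text):
    """Lowercase the text, strip basic punctuation, and return a set of words."""
    for mark in "?.,":
        text = text.replace(mark, " ")
    return set(text.lower().split())


def retrieve(query, documents, k=1):
    """Return the k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:k]


def build_rag_prompt(query, documents, k=1):
    """Place the retrieved documents in the prompt ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


knowledge_base = [
    "The refund window for annual plans is 30 days.",
    "Support is available on weekdays from 9am to 5pm.",
    "The API rate limit is 100 requests per minute.",
]

query = "What is the API rate limit?"
top_docs = retrieve(query, knowledge_base)
rag_prompt = build_rag_prompt(query, knowledge_base)
print(rag_prompt)
```

Updating the knowledge base is just a matter of editing the document list. The model's weights never change, which is why RAG suits knowledge that goes stale quickly.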

Real Case — OpenAI Fine-Tuning API Findings (2023)

When OpenAI opened fine-tuning access for GPT-3.5-turbo in August 2023, its accompanying guidance recommended trying prompt engineering first, including few-shot examples, and turning to fine-tuning only when those fail, noting that many tasks that initially seem to need fine-tuning can be handled with better prompts. The guidance listed the cases where fine-tuning tends to help: setting a consistent style, tone, or format, improving reliability on a desired output, and handling tasks that are hard to specify in a prompt.

The Emerging Hybrid: Prompted Fine-Tuned Models

In practice, the most capable production deployments combine both approaches. A model is fine-tuned for broad behavioral alignment (via RLHF) and domain specialization (via SFT or LoRA), then further conditioned at inference time with a system prompt that specifies task-specific behavior, user context, and constraints. The fine-tuning provides the stable behavioral foundation; the prompt provides the dynamic, per-deployment specialization. Neither approach alone captures everything the combination achieves.

Understanding where fine-tuning ends and prompting begins — and why both exist — gives practitioners the conceptual vocabulary to make better architectural decisions, evaluate vendor claims more critically, and understand why the same underlying model can behave so differently across deployments.

Module Summary

Fine-tuning reshapes a model's behavioral layer without replacing its knowledge base. Prompting activates latent capabilities without touching weights. RLHF encodes human preferences through a learned reward signal. LoRA makes fine-tuning economically tractable. RAG solves knowledge gaps without retraining. The practitioner's art is knowing which tool the problem actually calls for — and having the evidence to defend that choice.

Lesson 4 Quiz

When to Fine-Tune vs. When to Prompt

A startup needs their AI assistant to match a very specific brand voice across millions of customer interactions daily. Which approach is most justified?

Correct. Two factors justify fine-tuning here: (1) at millions of requests/day, a large system-prompt token overhead is expensive, and (2) style/format consistency is a behavioral problem that fine-tuning handles well. LoRA makes this economically tractable without a full retraining run.

Not quite. Consider both the economic argument (millions of daily requests × system prompt tokens = significant cost) and the nature of the problem. Style consistency is exactly the kind of stable, high-volume behavioral requirement that justifies fine-tuning over prompting.

What key advantage does LoRA offer over full fine-tuning?

Correct. LoRA decomposes weight updates into the product of two small matrices, meaning only tens of millions of parameters need to be trained and stored instead of the full model. The base model weights stay frozen, and the adapter can be swapped in at inference time.

Not quite. LoRA still requires training data and GPU hardware — it just requires far less of both. The innovation is representing fine-tuning deltas as low-rank matrix products, making the trainable parameter count a tiny fraction of the full model size.

A team needs their model to answer questions about a constantly-updated internal knowledge base that changes daily. What is the most appropriate solution?

Correct. When knowledge changes frequently, fine-tuning is impractical — it cannot be retrained daily at reasonable cost or latency. RAG retrieves the current version of relevant documents at inference time, injecting up-to-date knowledge into the prompt without any retraining.

Not quite. Daily fine-tuning is prohibitively expensive and slow. When knowledge is dynamic and changes frequently, RAG is the correct architectural choice — it retrieves current documents at inference time rather than baking knowledge into weights that immediately become stale.
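
The retrieve-then-prompt pattern can be sketched in a few lines. This is a toy under stated assumptions: word overlap stands in for the embedding search a real system would use, and the function names, documents, and prompt template are all hypothetical.

```python
def tokenize(text):
    return set(text.lower().split())


def retrieve(query, documents, k=1):
    """Rank documents by word overlap with the query (stand-in for vector search)."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Inject the retrieved text into the prompt at inference time."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


knowledge_base = [
    "The refund window is 30 days from purchase.",
    "Support hours are 9am to 5pm on weekdays.",
]
prompt = build_prompt("What is the refund window?", knowledge_base)

# Editing the knowledge base changes the next prompt; no weights are touched.
knowledge_base[0] = "The refund window is 45 days from purchase."
updated_prompt = build_prompt("What is the refund window?", knowledge_base)
print(updated_prompt)
```

The point of the design is in the last step: a documentation change takes effect on the very next request, with no training run in between.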

Lab 4 — Fine-Tune vs. Prompt Decision Advisor

Conversation lab · Complete 3 exchanges to finish

What you're doing

Bring your own use cases to this lab. Describe a task or product scenario, and the AI will help you work through the decision framework: should you fine-tune, prompt engineer, use RAG, or combine approaches? The advisor will ask clarifying questions and explain the reasoning behind each recommendation.

Suggested starter: "I'm building a customer service bot for a software company. It needs to know our specific product documentation and respond in our brand's casual-but-professional tone. Should I fine-tune or prompt?"

Fine-Tune vs. Prompt Advisor

Welcome to the decision advisor lab. Describe your use case — what the model needs to do, at what scale, with what kind of knowledge requirements — and I'll help you work through whether prompting, fine-tuning, LoRA, RAG, or some combination is the right architectural choice. What are you building?

Module 6 Test

Fine-Tuning vs. Prompting · 15 questions · Pass at 80%

1. Fine-tuning a language model modifies which component?

Correct. Fine-tuning performs additional gradient descent on the existing weights, adjusting them toward behaviors demonstrated in the fine-tuning dataset.

Incorrect. Fine-tuning adjusts the model's weights — the learned parameters — through additional training steps on a curated dataset.

2. What was the central finding of OpenAI's InstructGPT paper regarding model size and alignment?

Correct. Human raters preferred outputs from the 1.3B InstructGPT model over those from the 175B GPT-3, demonstrating that behavioral alignment via fine-tuning could outweigh raw parameter count.

Incorrect. The key finding was that fine-tuned behavioral alignment trumped raw size: the 1.3B model was preferred over the 175B base model in human evaluations.

3. In Supervised Fine-Tuning, what form does the training data take?

Correct. SFT trains on (prompt, ideal response) pairs — the model learns to maximize the probability of producing those demonstrated responses given the prompts.

Incorrect. SFT uses demonstration data: (prompt, ideal response) pairs where a human has written or curated the desired output. Ranking pairs are used for reward model training in RLHF.
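
"Maximizing the probability of the demonstrated response" is usually implemented as minimizing the average negative log-likelihood of the response tokens. The sketch below shows only that calculation; the per-token probabilities are invented numbers standing in for what a model would output.

```python
import math


def sft_loss(token_probs):
    """Average negative log-likelihood of the demonstrated response tokens.

    token_probs[i] is the probability the model assigned to the i-th token
    of the ideal response, given the prompt and the preceding response tokens.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical per-token probabilities before and after some training steps.
before = [0.10, 0.20, 0.05]
after = [0.60, 0.70, 0.50]
print(f"loss before: {sft_loss(before):.3f}  loss after: {sft_loss(after):.3f}")
```

As training raises the probability of each demonstrated token, the loss falls toward zero.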

4. Chain-of-thought prompting primarily works by:

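Correct. Wei et al.'s 2022 paper showed that including explicit reasoning chains in the prompt's worked examples dramatically improved performance on multi-step reasoning tasks, and Kojima et al. (2022) found that simply appending "Let's think step by step" has a similar effect. Neither approach changes any weights: the prompt conditions the model to produce intermediate reasoning before the final answer.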

Incorrect. Chain-of-thought prompting requires no fine-tuning. It works by including reasoning steps in the prompt context, which conditions the model to generate its own intermediate steps before answering.

5. Which of the following is a genuine hard limit of prompting that fine-tuning can address?

Correct. Tone, format, and language are achievable through prompting. But if the model was never trained on a piece of knowledge, no amount of prompt wording can manufacture it. The knowledge has to be supplied, either by fine-tuning it into the weights or by retrieving it into the context with RAG.

Incorrect. The true knowledge gap — information the model was never trained on — is the genuine limit of prompting. Style and format problems are well within prompting's reach.

6. In RLHF, what is the reward model trained to do?

Correct. The reward model learns from human preference rankings (which of two responses is better) and then serves as an automated proxy for human judgment during RL training.

Incorrect. The reward model is trained on human preference rankings and learns to predict which response people will prefer. It does not score factual accuracy or measure generation quality against some absolute standard.
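
One common way to train on "which of two responses is better" is a pairwise logistic loss on the difference between the two reward scores. The lesson does not specify a loss function, so treat this as an assumed, widely used formulation rather than the method of any particular system; the scores below are made up.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)).

    Small when the human-preferred response already scores higher,
    large when the reward model ranks the pair the wrong way round.
    """
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))


agrees = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(f"agrees with humans: {agrees:.3f}  disagrees: {disagrees:.3f}")
```

Only the gap between the two scores matters here, which is why the reward model learns relative preference rather than an absolute quality scale.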

7. What is "reward hacking" in the context of RLHF?

Correct. Reward hacking is Goodhart's Law in action: the policy model optimizes for the reward model's score rather than genuine quality, finding responses that satisfy the proxy metric without actually being better. This is why the KL penalty is needed.

Incorrect. Reward hacking refers to the policy model exploiting weaknesses in the reward model — finding responses that score high on the learned proxy without being genuinely better responses.

8. Anthropic's Constitutional AI (CAI) reduces dependence on human labelers by:

Correct. CAI has the model critique and revise its own outputs against a written constitution. The revisions become training data, and AI-generated preference labels drive the reinforcement learning stage (RLAIF). This generates alignment signal without requiring humans to rank every response pair.

Incorrect. CAI's innovation is self-critique: the model assesses its own outputs against a written constitution and revises them. These AI-generated preference labels replace much of the human labeling burden.

9. What is the "alignment tax" in RLHF-trained models?

Correct. RLHF optimizes for human preference, not raw capability. Since humans sometimes prefer confident-sounding answers over technically optimal ones, RLHF models can score lower on capability benchmarks even as human raters prefer them.

Incorrect. The alignment tax describes the performance tradeoff: RLHF improves human preference ratings but can slightly reduce scores on raw capability benchmarks, because optimizing for human preference is not identical to optimizing for technical correctness.

10. LoRA (Low-Rank Adaptation) achieves parameter efficiency by:

Correct. LoRA keeps the base model weights frozen and adds small low-rank matrices (A and B, where the update is AB) to selected weight matrices. Only A and B are trained, representing the fine-tuning delta with far fewer parameters.

Incorrect. LoRA doesn't prune or quantize — it adds small low-rank matrices alongside the frozen base weights and trains only those. The result is a compact adapter that can be swapped in at inference time.
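
A tiny numeric sketch can show why the adapter is swappable. Using the A and B naming above, the low-rank correction can be applied alongside the frozen weights or folded into them, and the output is the same either way. The matrices are arbitrary toy values and the helper functions are written out in plain Python for illustration.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(X, Y):
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


def add(X, Y):
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(X, Y)]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]  # frozen base weights (3x3)
A = [[1.0], [0.0], [2.0]]                                 # trainable, 3x1 (rank 1)
B = [[0.5, 0.0, 1.0]]                                     # trainable, 1x3
x = [1.0, 2.0, 3.0]

# Adapter kept separate: base output plus the low-rank correction A(Bx).
base = matvec(W, x)
delta = matvec(A, matvec(B, x))
y_adapter = [b + d for b, d in zip(base, delta)]

# Adapter merged into the weights: (W + AB)x.
y_merged = matvec(add(W, matmul(A, B)), x)
print(y_adapter, y_merged)
```

Keeping the adapter separate lets one base model serve several adapters; merging it removes the extra computation at inference time.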

11. When should Retrieval-Augmented Generation (RAG) be preferred over fine-tuning for knowledge-intensive tasks?

Correct. RAG is ideal for dynamic knowledge bases because it retrieves current documents at inference time — no retraining required. Fine-tuning knowledge into weights makes it stale the moment the knowledge base updates.

Incorrect. RAG shines when knowledge is dynamic and changes frequently — daily, weekly — because it avoids the need to retrain. Fine-tuning is better suited to stable knowledge that won't change.

12. Which factor most justifies using a long system prompt (prompting) over fine-tuning for a new AI product?

Correct. Rapid iteration during early development is one of prompting's strongest advantages — change the text and the behavior changes instantly, with no retraining cycle. Fine-tuning is harder to justify when requirements are unstable.

Incorrect. High volume, specialized knowledge, and adversarial robustness all favor fine-tuning. Prompting's strongest advantage is speed of iteration — ideal for early-stage products where requirements are still evolving.

13. Meta's LLaMA 2 paper found that for SFT quality vs. quantity, which mattered more?

Correct. LLaMA 2's instruction-tuned models used fewer than 30,000 carefully vetted SFT examples. The team reported that setting aside millions of lower-quality third-party examples in favor of this smaller, rigorously reviewed set improved results.

Incorrect. Meta's LLaMA 2 paper specifically highlighted that quality-controlled, carefully reviewed SFT examples produced better results than larger quantities of lower-quality data, a finding consistent across multiple fine-tuning research programs.

14. The KL divergence penalty in RLHF training serves to:

Correct. The KL penalty measures how much the current policy has diverged from the SFT starting point and adds a cost for large divergence. This keeps the model grounded in real language capability and prevents it from gaming the reward model with degenerate outputs.

Incorrect. The KL penalty's purpose is to constrain the policy from diverging too far from the SFT reference — preventing reward hacking while allowing genuine improvement in human-preferred behaviors.
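
The effect of the penalty can be illustrated with a toy calculation over a three-token vocabulary. This is a simplified sketch: production RLHF estimates the KL term per token on sampled sequences, and the distributions, reward value, and beta coefficient here are all assumed for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def penalized_reward(reward, policy_probs, reference_probs, beta=0.1):
    """Reward-model score minus a cost for drifting from the SFT reference."""
    return reward - beta * kl_divergence(policy_probs, reference_probs)


reference = [0.50, 0.30, 0.20]   # SFT model's next-token distribution
close = [0.55, 0.28, 0.17]       # policy that stayed near the reference
drifted = [0.98, 0.01, 0.01]     # policy that collapsed onto one token

# Same raw reward-model score, different amounts of drift.
print(penalized_reward(1.0, close, reference))
print(penalized_reward(1.0, drifted, reference))
```

With equal raw scores, the policy that stayed close to the reference keeps almost all of its reward, while the collapsed one pays a visible cost.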

15. In a hybrid deployment architecture, what role does prompting typically play when the base model has already been fine-tuned?

Correct. In practice, the best systems layer both: fine-tuning provides stable, efficient behavioral alignment baked into the weights, while system prompts provide deployment-specific context, persona, and constraints that vary per application.

Incorrect. Fine-tuning and prompting are complementary. Fine-tuning provides the stable behavioral foundation; system prompts provide the flexible, per-deployment specialization that changes across use cases — neither replaces the other.