Module 3 · Lesson 1

What Fine-Tuning Actually Is

Why pre-trained models are a starting point, not a destination — and what happens inside the model when you fine-tune it.

When is fine-tuning worth the cost, and when is it a waste of compute?

In March 2023, Bloomberg published research on BloombergGPT, a 50-billion-parameter model trained from scratch on a corpus that was roughly half financial text. Outside estimates put the compute bill in the millions of dollars. That same month, Stanford's Alpaca project showed that fine-tuning a 7B-parameter open model on roughly 52,000 instruction examples for under $600 produced instruction-following behavior that early evaluations judged comparable to OpenAI's text-davinci-003, and Vicuna followed weeks later with a similarly inexpensive recipe. The lesson was immediate: fine-tuning had crossed from research curiosity to everyday engineering tool.

Pre-training vs. Fine-Tuning — The Conceptual Split

A foundation model like LLaMA 3 or GPT-4 is pre-trained on trillions of tokens of general text. This phase teaches the model language, reasoning, and world knowledge. It is enormously expensive — LLaMA 3 70B required roughly 15 trillion tokens and millions of GPU-hours.

Fine-tuning starts from those learned weights and continues training on a much smaller, targeted dataset. Only a fraction of the parameters need updating to reshape the model's behavior for a specific domain, task, or style.

Think of pre-training as a medical school education and fine-tuning as a residency — the resident already knows medicine; the residency specializes that knowledge for cardiology, surgery, or psychiatry.

Pre-Training

General knowledge

Trillions of tokens. Learns language, reasoning, facts. Cost: millions of dollars. Done once by large labs.

Fine-Tuning

Task specialization

Hundreds to millions of examples. Reshapes behavior. Cost: tens to thousands of dollars. Done by teams like yours.

Prompting / RAG

Runtime steering

No weight updates. Context window injection. Zero training cost. Best starting point before fine-tuning.

What Changes in the Weights

A transformer model is a stack of layers, each containing attention heads and feed-forward networks (FFN). During fine-tuning, gradient descent adjusts these weights so the model assigns higher probability to your target outputs.

In full fine-tuning, every parameter is updated. In parameter-efficient methods like LoRA (Low-Rank Adaptation, introduced by Hu et al. at Microsoft in 2021), only small low-rank matrices added alongside frozen layers are trained. The original weights are never touched — only the delta is learned.

# Conceptual view of LoRA for one weight matrix
W_effective = W_pretrained + A @ B
# W_pretrained: frozen, shape (d, d); none of the model's 7B pretrained params change
# A: shape (d, r)  B: shape (r, d)  where r << d
# Only A and B are trained: millions of params in total, not billions
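To make the "millions, not billions" claim concrete, here is a minimal sketch of the parameter arithmetic for a single weight matrix. The sizes d = 4096 and r = 16 are illustrative assumptions, not values taken from a specific model.

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x d weight matrix."""
    full = d * d            # every entry of W is trainable in full fine-tuning
    lora = d * r + r * d    # A is (d, r) and B is (r, d); W itself stays frozen
    return full, lora


# Illustrative sizes only (assumed for this example).
full, lora = lora_param_counts(d=4096, r=16)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r=16):      {lora:,} trainable parameters")
print(f"reduction:        {full // lora}x fewer")
```

With these assumed sizes, the adapter trains 128 times fewer parameters for that matrix, and the same ratio applies to every matrix LoRA is attached to.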

The Decision Framework — When to Fine-Tune

Fine-tuning is not always the right answer. Before committing, test prompting and RAG first. Fine-tuning adds value in specific situations:

Consistent format — The model must always output structured JSON, a specific clinical note format, or a legal citation style. Prompting is brittle; fine-tuning bakes it in.
Domain vocabulary — Specialized jargon (radiology, derivatives trading, semiconductor fabrication) that the base model handles awkwardly. Fine-tuning shifts token probabilities toward correct usage.
Cost/latency at scale — A fine-tuned small model (7B) can outperform a prompted large model (70B) on a narrow task, at a fraction of the inference cost.
Behavior alignment — Teaching a model to refuse certain requests, respond in a specific persona, or follow company-specific policies reliably.
Data privacy — Sensitive examples cannot be put in a system prompt but can train a private model.

Real Case — BloombergGPT (2023)

Bloomberg built a 50B model from scratch, with financial data making up about half of its training corpus. Later studies that evaluated fine-tuned open models such as LLaMA variants against it on financial NLP benchmarks reported that the smaller fine-tuned models matched or exceeded BloombergGPT on tasks such as sentiment analysis and NER, at a small fraction of the build cost. Results like these helped make fine-tuning the default playbook for domain adaptation.

Types of Fine-Tuning You'll Use

Supervised Fine-Tuning (SFT) is the baseline: you provide input→output pairs and the model learns to replicate them. This is what Alpaca used with Stanford's 52K instruction pairs.
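What "learns to replicate them" means numerically can be sketched as a loss computation. The example below assumes the common (but not universal) choice of averaging the negative log-likelihood over the output tokens only and masking out the prompt. The log-probabilities are made-up illustrative numbers, and `sft_loss` is a hypothetical helper, not a library function.

```python
import math


def sft_loss(token_logprobs: list[float], is_target: list[bool]) -> float:
    """Mean negative log-likelihood over target (output) tokens only."""
    target_lps = [lp for lp, keep in zip(token_logprobs, is_target) if keep]
    if not target_lps:
        raise ValueError("no target tokens to train on")
    return -sum(target_lps) / len(target_lps)


# Hypothetical log-probabilities the model assigns to each correct next token.
logprobs = [math.log(0.9), math.log(0.5), math.log(0.25), math.log(0.5)]
# First two tokens belong to the prompt (masked), last two to the desired output.
mask = [False, False, True, True]

loss = sft_loss(logprobs, mask)
print(f"SFT loss: {loss:.4f}")
```

Training lowers this number by raising the probability of each token in the desired output.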

RLHF (Reinforcement Learning from Human Feedback) is what OpenAI used to create InstructGPT and the GPT-4 family. A reward model scores outputs; the LLM is optimized to maximize that score via PPO. Expensive but produces strong alignment.

DPO (Direct Preference Optimization), introduced by Rafailov et al. in 2023, achieves RLHF-like alignment without a separate reward model. You provide preferred vs. rejected pairs; the loss directly maximizes the probability gap. Simpler and increasingly dominant in open-source pipelines.
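The "probability gap" that DPO optimizes can be written out for a single preference pair. This is a minimal sketch of the published DPO loss, assuming you already have summed sequence log-probabilities from the model being trained (the policy) and from a frozen reference copy. The value beta = 0.1 and all the log-probabilities are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    chosen_gain = policy_chosen - ref_chosen        # how much the policy upweights the chosen answer
    rejected_gain = policy_rejected - ref_rejected  # how much it upweights the rejected answer
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: no preference learned yet, loss = log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy now favors the chosen answer and disfavors the rejected one: loss drops.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(f"baseline: {baseline:.4f}  improved: {improved:.4f}")
```

The loss depends only on how far the policy has moved relative to the reference, which is why no separate reward model is needed.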

Key Takeaway

Fine-tuning is not magic — it is supervised learning on a very large pre-initialized model. The quality of your training data, the choice of method (SFT, DPO, LoRA), and the evaluation setup determine success. The labs in this module will walk you through each decision point hands-on.

Lesson 1 Quiz

What Fine-Tuning Actually Is — 4 questions

1. What does LoRA do differently from full fine-tuning?

Correct. LoRA adds low-rank matrices A and B alongside frozen pretrained weights. Only A and B are updated, reducing trainable parameters by orders of magnitude.

Not quite. LoRA's key innovation is freezing the pretrained weights and training only small adapter matrices — so the original knowledge is preserved while new behavior is learned efficiently.

2. The Stanford Alpaca project (2023) demonstrated that:

Exactly right. Alpaca showed that supervised fine-tuning with a small, high-quality dataset dramatically narrows the gap with much larger models — at a tiny fraction of the cost.

The Alpaca finding was specifically about cost efficiency — fine-tuning a 7B model on 52K instruction pairs for roughly $600 produced competitive instruction-following behavior.

3. DPO (Direct Preference Optimization) differs from RLHF primarily because:

Correct. DPO's key insight is that the optimal RLHF policy can be expressed as a classification loss over preferred/rejected pairs, removing the expensive reward modeling step entirely.

DPO actually simplifies the pipeline — it takes preferred vs. rejected output pairs and optimizes the model directly, without training a separate reward model as RLHF requires.

4. Which scenario is the STRONGEST case for fine-tuning over prompting?

Right. Consistent structured output at scale — where prompting is brittle and inference cost matters — is the canonical fine-tuning use case. A fine-tuned small model can outperform a prompted large model and cost far less per call.

The strongest fine-tuning case involves consistent format requirements at scale and domain-specific behavior that prompting cannot reliably produce. General FAQ and catalog search are better handled by prompting + RAG first.

Lab 1 — Fine-Tuning Strategy Advisor

Describe your use case. The AI will assess whether fine-tuning is the right tool and which method fits.

Your Task

You are a developer evaluating whether to fine-tune a model for a new product feature. Use this lab to pressure-test your reasoning. Describe your scenario — the AI advisor will interrogate your assumptions, help you decide between SFT / DPO / LoRA / prompting, and estimate data requirements.

Describe the task you want to fine-tune for (domain, format requirements, scale)
Ask the advisor to compare fine-tuning vs. prompting vs. RAG for your case
Ask which fine-tuning method (SFT, DPO, LoRA) fits your constraints
Ask how much training data you'd realistically need

Try: "I want a model that always outputs clinical SOAP notes in a strict JSON format from free-text doctor dictation. We process 50,000 notes per day. Is fine-tuning worth it?"

Welcome. I'm your fine-tuning strategy advisor for this lab. Tell me about your use case — what task are you trying to solve, and what's your current approach? I'll help you figure out whether fine-tuning is the right move, and if so, which method fits your constraints.

Module 3 · Lesson 2

Building Your Training Dataset

The quality of your fine-tuned model is bounded by the quality of your training data. How to build it right.

What makes a training dataset good enough — and how do you know when you have enough examples?

When Orca (Microsoft, 2023) and Orca 2 were published, they revealed something surprising: a 13B model trained on synthetic reasoning traces generated by GPT-4 outperformed much larger models on reasoning benchmarks. The data wasn't human-written; it was carefully curated GPT-4 outputs showing step-by-step problem-solving. The lesson: data diversity, reasoning depth, and curation matter more than raw volume.

The GIGO Principle at Scale

Garbage In, Garbage Out applies with force to fine-tuning. Because you are adjusting billions of parameters that already encode world knowledge, even a small set of bad examples can noticeably degrade behavior. A widely cited data point is LIMA (Zhou et al., Meta, 2023): a 65B LLaMA model fine-tuned on only 1,000 carefully curated prompt-response pairs produced responses that human raters often judged on par with those of much more heavily tuned models.

The implication: ruthless curation beats volume. A dataset of 500 excellent, diverse examples will often outperform 10,000 mediocre ones for task-specific fine-tuning goals.

Dataset Format — The Instruction Pair

For supervised fine-tuning (SFT), each example is a structured pair: an input (instruction + optional context) and the ideal output. Most frameworks use a standard chat format aligned to the model's expected prompt template.

# Standard chat format for SFT. Pretty-printed here for readability;
# in a .jsonl file each example sits on a single line and comments are not allowed.
{
  "messages": [
    {"role": "system", "content": "You are a clinical documentation assistant..."},
    {"role": "user", "content": "Patient presents with chest pain..."},
    {"role": "assistant", "content": "{\"soap\": {\"subjective\": \"...\", \"objective\": \"...\", \"assessment\": \"...\", \"plan\": \"...\"}}"}
  ]
}
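Writing these records by hand invites quoting mistakes, so it can help to let the `json` module do the encoding. The sketch below uses only the standard library. The helper name `to_jsonl_line` and the field values are hypothetical placeholders.

```python
import json


def to_jsonl_line(system: str, user: str, assistant_obj: dict) -> str:
    """Serialize one chat-format SFT example onto a single JSONL line."""
    example = {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            # The assistant turn is itself a JSON string, so it is encoded twice.
            {"role": "assistant", "content": json.dumps(assistant_obj)},
        ]
    }
    return json.dumps(example, ensure_ascii=False)


line = to_jsonl_line(
    "You are a clinical documentation assistant.",
    "Patient presents with chest pain.",
    {"soap": {"subjective": "chest pain", "objective": "", "assessment": "", "plan": ""}},
)

# Round-trip check: the line parses, and so does the assistant's JSON payload.
parsed = json.loads(line)
inner = json.loads(parsed["messages"][-1]["content"])
print(inner["soap"]["subjective"])
```

Running the same round-trip check over every line of a dataset file is a cheap way to catch malformed examples before training.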

How Much Data Do You Need?

There is no universal answer, but research has established useful heuristics:

Style / tone changes: 50–500 examples. Teaching a model to respond in your brand voice or always use bullet points. The base model already knows how; you're just shifting priors.
New task format: 500–5,000 examples. Consistent JSON output, specific classification schema, domain-specific extraction. Needs enough variety to generalize.
Deep domain knowledge: 10,000–100,000+ examples. Teaching genuine new domain expertise the model doesn't have. Medical subspecialties, proprietary legal frameworks, rare languages.
Preference alignment (DPO): 1,000–20,000 preference pairs. Each pair has a chosen and rejected response to the same prompt. Quality of contrast matters more than volume.

The Scaling Laws Insight

Chinchilla scaling laws (Hoffmann et al., DeepMind, 2022) showed that, for pre-training, many large models were undertrained: they saw too few tokens for their parameter count. For fine-tuning, the common failure runs the other way: practitioners train for too many epochs on too little data, producing models that memorize rather than generalize. If your eval loss plateaus and then rises, you've overfit; you need more data diversity, not more epochs.

Synthetic Data Generation

The most practical approach for most teams in 2024 is synthetic data via a strong teacher model. You define the task, write seed examples, then use a model such as GPT-4 or Claude to generate hundreds or thousands of variations. This is broadly how Alpaca (with text-davinci-003 as the teacher), Orca (with GPT-4), and many other open-source instruction-tuned models were built.

The risks: model collapse (when a model is trained on its own outputs recursively, diversity degrades) and hallucination amplification (the teacher model's errors become training signal). Best practice is to generate synthetically, then human-filter a random sample for quality.

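# Generating synthetic training data with the OpenAI API (openai>=1.0 client)
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def generate_example(seed_input):
    """Ask the teacher model for one synthetic training output."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Generate a clinical SOAP note JSON..."},
            {"role": "user", "content": seed_input}
        ]
    )
    return response.choices[0].message.content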

# Filter: spot-check 10% of generated examples manually
# Reject examples with hallucinations, wrong format, or repetition
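The filtering step can be partly automated before a human looks at anything. The sketch below assumes the SOAP JSON layout used in this lesson. It first rejects outputs that fail a format check, then draws a reproducible 10% sample for manual review. The function names and the required-key set are illustrative assumptions.

```python
import json
import random

REQUIRED_KEYS = {"subjective", "objective", "assessment", "plan"}


def passes_format_check(raw_output: str) -> bool:
    """True if the text is valid JSON containing a complete 'soap' object."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    soap = data.get("soap") if isinstance(data, dict) else None
    return isinstance(soap, dict) and REQUIRED_KEYS.issubset(soap)


def pick_spot_check(examples: list[str], fraction: float = 0.10, seed: int = 0) -> list[str]:
    """Draw a reproducible random sample for manual review."""
    if not examples:
        return []
    k = max(1, round(len(examples) * fraction))
    return random.Random(seed).sample(examples, k)


good = json.dumps({"soap": {"subjective": "s", "objective": "o", "assessment": "a", "plan": "p"}})
bad = "{'soap': 'single quotes are not valid JSON'}"
generated = [good] * 19 + [bad]

kept = [ex for ex in generated if passes_format_check(ex)]
review_batch = pick_spot_check(kept)
print(f"{len(kept)} passed the format check, {len(review_batch)} queued for human review")
```

Automated checks catch format errors only; hallucinated content still needs the human pass.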

Data Quality Checklist

Before training, run every dataset through these checks:

Diversity

Cover the input space

Do your examples span the range of real inputs? Edge cases, short inputs, long inputs, ambiguous cases?

Accuracy

Outputs are correct

Every output in the training set must be something you want the model to reproduce. One bad example pattern will propagate.

Consistency

Same rules everywhere

If example 1 formats dates as MM/DD/YYYY and example 500 uses ISO 8601, the model will average them — producing neither consistently.

Deduplication

No near-duplicates

Duplicate or near-duplicate examples cause overfitting to specific phrasings. Use MinHash or embedding similarity to deduplicate before training.
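As a small-scale illustration of near-duplicate removal, the sketch below computes exact Jaccard similarity over word 3-grams. This is a simplified stand-in for MinHash, which approximates the same similarity efficiently on large corpora. The 0.6 threshold and the sample notes are assumptions for illustration, and the pairwise loop is only practical for a few thousand examples.

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of word n-grams for a lowercased, whitespace-tokenized text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def deduplicate(texts: list[str], threshold: float = 0.6) -> list[str]:
    """Keep the first text of each near-duplicate group (pairwise, O(n^2))."""
    kept, kept_shingles = [], []
    for text in texts:
        sh = shingles(text)
        if all(jaccard(sh, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(sh)
    return kept


notes = [
    "Patient reports chest pain radiating to the left arm since this morning",
    "Patient reports chest pain radiating to the left arm since this evening",
    "Follow-up visit for type 2 diabetes, HbA1c improved to 6.8",
]
unique = deduplicate(notes)
print(len(unique), "of", len(notes), "examples kept")
```

The first two notes differ by one word and share 9 of 11 distinct 3-grams, so the second is dropped.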

Lesson 2 Quiz

Building Your Training Dataset — 4 questions

1. Microsoft's Orca research showed that a 13B model could outperform larger models primarily because of:

Correct. Orca's key contribution was the training signal — GPT-4 generated detailed reasoning explanations, not just answers. The smaller model learned reasoning patterns, not just outputs.

Orca's breakthrough was data quality and format. Training on GPT-4's step-by-step reasoning traces — rather than just final answers — allowed a 13B model to internalize stronger reasoning patterns.

2. When fine-tuning for a new output format (e.g., always producing JSON), approximately how many high-quality examples are typically sufficient?

Right. Format and task-specific fine-tuning typically needs 500–5,000 diverse examples. The model already knows JSON; you're training it to always use it for your specific task structure.

For task format changes, 500–5,000 curated examples is the typical effective range. Deep domain knowledge requires 10K+, while style changes can be done with as few as 50–500.

3. Near-duplicate examples in your training data are most likely to cause:

Correct. Duplicate examples artificially inflate the gradient signal for specific phrasings. The model memorizes those patterns rather than learning the underlying task structure, hurting performance on unseen inputs.

Duplicates cause overfitting. When the model sees the same example many times, it learns to produce that specific output for that specific input phrasing — not the general task rule.

4. The primary risk of training a model on purely synthetic data generated by another model is:

Exactly right. Model collapse is a documented phenomenon where repeated cycles of synthetic generation reduce diversity. Additionally, any systematic errors in the teacher model's outputs get reinforced as "correct" behavior in the student.

The key risks are model collapse (diversity loss over generation cycles) and hallucination amplification (teacher errors become training signal). Best practice: generate synthetically, then manually filter a sample for quality.

Lab 2 — Dataset Builder & Critic

Design training examples and have them critiqued for quality, diversity, and format before you waste GPU time.

Your Task

You're building a training dataset for fine-tuning. Use this lab to draft examples, get them critiqued, and iterate until they're production-quality. The AI critic will check format consistency, diversity, output accuracy patterns, and flag common mistakes.

Describe the task you're building training data for
Paste 2–3 example input/output pairs and ask for a critique
Ask the critic to generate additional diverse examples following your pattern
Ask how to detect and remove near-duplicates from a larger set

Try: "I'm building training data to extract medication names and dosages from doctor notes as JSON. Here's my first example — [paste your example]. What's wrong with it?"

I'm your dataset quality critic. Describe your fine-tuning task and share some example input/output pairs. I'll evaluate them for format consistency, output correctness patterns, diversity coverage, and common issues that lead to poor model behavior. Let's make sure you don't waste compute on bad data.

Module 3 · Lesson 3

Running the Fine-Tune — Hyperparameters & LoRA in Practice

What to set, what to watch, and what will silently destroy your training run if you're not paying attention.

Which hyperparameters actually matter, and how do you know if your training run is going well?

In 2023, the Mistral team released Mistral 7B with a result that surprised many practitioners: a relatively small model with grouped-query attention and sliding-window attention outperformed LLaMA 2 13B on the benchmarks the team reported, and it could be fine-tuned with LoRA in hours on a single A100. As the community built on it, a recurring failure emerged: learning rates copied from one recipe into another. A rate that suits LoRA adapters can be roughly ten times too high for full fine-tuning, where it can make the loss diverge and erase the model's instruction-following behavior. Getting hyperparameters right was the difference between a useful model and a broken one.

The Key Hyperparameters for Fine-Tuning

Fine-tuning hyperparameters are different from pre-training ones. The model is already well-initialized; you're making small, targeted updates. Aggressive settings cause forgetting or divergence.

Learning Rate

1e-4 to 2e-4 (LoRA)

Full fine-tune: 1e-5 to 5e-5. LoRA adapters can handle higher LR since pretrained weights are frozen. Always use a warmup schedule.

LoRA Rank (r)

8, 16, or 32

r=8 for style/format tasks. r=16 for domain adaptation. r=32+ for deep task changes. Higher rank = more params = more capacity but more overfitting risk.

LoRA Alpha

Typically = rank × 2

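Alpha scales the LoRA update: the adapter output is multiplied by alpha / r. Common pairings are alpha=16 with r=8 and alpha=32 with r=16 (a scale of 2). Some practitioners set alpha=rank (a scale of 1) for more conservative updates.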

Batch Size

4–32 (effective)

Use gradient accumulation to simulate larger batches. Effective batch = per_device_batch × accumulation_steps × num_gpus.

Epochs

1–3 for most tasks

More than 3 epochs on task-specific data usually causes overfitting. Monitor eval loss — stop when it stops falling.

Target Modules

q_proj, v_proj at minimum

Apply LoRA to the query and value projection matrices. Adding k_proj, o_proj, and FFN layers increases capacity at higher parameter cost.
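The interaction between rank, alpha, and target modules can be tabulated with a few lines of arithmetic. The sketch below assumes a model with 32 layers and square 4096 x 4096 projections for all four attention matrices. That is a simplification: models with grouped-query attention have smaller k_proj and v_proj. `lora_summary` is a hypothetical helper.

```python
def lora_summary(r: int, alpha: int, d_model: int = 4096,
                 n_layers: int = 32, n_target_modules: int = 4) -> dict:
    """Trainable-parameter count and update scale for a LoRA setup.

    Assumes every targeted projection is a square d_model x d_model matrix.
    """
    per_module = 2 * d_model * r  # A: (d, r) plus B: (r, d)
    return {
        "trainable_params": per_module * n_target_modules * n_layers,
        "scale": alpha / r,       # the adapter output is multiplied by alpha / r
    }


for r in (8, 16, 32):
    s = lora_summary(r, alpha=2 * r)
    print(f"r={r:<2} alpha={2 * r:<2} scale={s['scale']:.1f} "
          f"trainable={s['trainable_params']:,}")
```

Doubling the rank doubles the trainable parameters, while keeping alpha at twice the rank holds the update scale constant, so capacity changes without changing the effective step size.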

A Minimal LoRA Training Config

The following config using HuggingFace TRL and PEFT represents a reasonable starting point for fine-tuning a 7B model on a task-specific dataset of ~5,000 examples on one A100 80GB GPU:

from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments
from trl import SFTTrainer

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

training_args = TrainingArguments(
    output_dir="./ft-output",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch = 32
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    logging_steps=25,
    eval_strategy="steps",  # named evaluation_strategy in transformers releases before 4.41
    eval_steps=100,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,
)
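It is worth working out what these settings mean in optimizer steps before launching a run. The sketch below does the arithmetic for the config above, assuming all 5,000 examples are used for training on a single GPU. Exact step counts can differ slightly depending on how a trainer rounds the final partial batch, so treat the numbers as approximate.

```python
import math


def training_schedule(n_examples: int, per_device_batch: int, accumulation_steps: int,
                      num_gpus: int, epochs: int, warmup_ratio: float,
                      eval_steps: int) -> dict:
    """Approximate optimizer-step budget for a fine-tuning run."""
    effective_batch = per_device_batch * accumulation_steps * num_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
        "evaluations": total_steps // eval_steps,
    }


plan = training_schedule(
    n_examples=5000, per_device_batch=4, accumulation_steps=8,
    num_gpus=1, epochs=2, warmup_ratio=0.03, eval_steps=100,
)
print(plan)
```

With roughly 314 steps in total, evaluating every 100 steps gives only three points on the validation curve. On a dataset this size, a smaller `eval_steps` gives a clearer view of when overfitting begins.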

Reading Your Training Curves

Log training loss and validation loss every N steps. What you want to see:

Training loss falling steadily — the model is learning. If it spikes upward suddenly, your learning rate is too high or you have a corrupt batch.
Validation loss falling in parallel — generalization is working. If val loss diverges upward while train loss falls, you're overfitting. Stop early.
Both losses plateau after epoch 1–2 — common and expected. Additional epochs rarely help and often hurt. Stop here unless val loss is still falling.
Loss oscillates wildly — batch size is too small or learning rate is too high. Increase gradient accumulation steps or halve the learning rate.
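The "stop early" rule above can be expressed as a patience check over logged validation losses. This is a minimal sketch with made-up loss values and a hypothetical `best_checkpoint` helper. Hugging Face Transformers ships an `EarlyStoppingCallback` that applies a similar rule during training.

```python
def best_checkpoint(eval_losses: list[float], patience: int = 2) -> tuple[int, bool]:
    """Return (index of the lowest eval loss seen, whether early stopping triggers).

    Stopping triggers once `patience` consecutive evaluations fail to improve
    on the best loss seen so far.
    """
    if not eval_losses:
        raise ValueError("need at least one evaluation")
    best_idx, bad_streak = 0, 0
    for i in range(1, len(eval_losses)):
        if eval_losses[i] < eval_losses[best_idx]:
            best_idx, bad_streak = i, 0
        else:
            bad_streak += 1
            if bad_streak >= patience:
                return best_idx, True
    return best_idx, False


# Illustrative validation losses: improving, then diverging upward.
overfit_run = [1.20, 0.95, 0.88, 0.90, 0.93, 0.99]
healthy_run = [1.20, 1.00, 0.90]

print(best_checkpoint(overfit_run))
print(best_checkpoint(healthy_run))
```

In the first run the best checkpoint is the third evaluation and training should stop two evaluations later; in the second run the loss is still falling, so training continues.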

Catastrophic Forgetting

Full fine-tuning on a narrow dataset can cause the model to "forget" general capabilities: it becomes excellent at your task but loses coherent responses to anything outside it. LoRA substantially reduces this risk because the pretrained weights are frozen and the low-rank update has limited capacity, but it does not eliminate it: an aggressively trained adapter can still degrade general behavior. For full fine-tuning, regularization techniques like Elastic Weight Consolidation (EWC) or mixing general instruction data into your training set can help preserve base capabilities.
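Mixing general instruction data into the training set is simple to sketch. The example below blends a task dataset with a sample of general examples so that the general data makes up a chosen share of the final mix. The 20% share is an illustrative assumption, not a recommendation from the source, and `mix_datasets` is a hypothetical helper.

```python
import random


def mix_datasets(task_examples: list, general_examples: list,
                 general_fraction: float = 0.2, seed: int = 0) -> list:
    """Blend general data into a task dataset.

    general_fraction is the share of the final mix drawn from general_examples.
    """
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)
    n_general = round(len(task_examples) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general_examples))
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)  # interleave so every batch sees both kinds of data
    return mixed


task = [f"task-{i}" for i in range(800)]
general = [f"general-{i}" for i in range(5000)]
mixed = mix_datasets(task, general)
n_general_in_mix = sum(ex.startswith("general") for ex in mixed)
print(f"{len(mixed)} examples, {n_general_in_mix} general")
```

The right share depends on the task; regression tests on general benchmarks are the way to check whether it is enough.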

After Training — Merging and Serving

LoRA adapters are small files (tens to hundreds of MB) that sit alongside the frozen base model. At inference time, you can either keep them separate (load adapter on top of base) or merge them into a single set of weights for faster inference. Merging is irreversible but eliminates adapter loading overhead.

# Merge LoRA adapter into base model for production serving
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
peft_model = PeftModel.from_pretrained(base_model, "./ft-output/checkpoint-best")

merged = peft_model.merge_and_unload()  # combines adapter + base weights
merged.save_pretrained("./merged-model")   # single model ready for vLLM/Ollama

Lesson 3 Quiz

LoRA Hyperparameters & Training in Practice — 4 questions

1. What does the LoRA rank (r) parameter control?

Correct. Rank r determines the dimensions of matrices A (d×r) and B (r×d). Higher r = more parameters = more capacity to learn new patterns, but also more risk of overfitting on small datasets.

Rank r controls the inner dimension of the adapter matrices. A and B have shapes (d, r) and (r, d) respectively — so higher r means more trainable parameters and more representational capacity.

2. During training, your validation loss starts rising while training loss continues to fall. The correct action is:

Exactly right. Diverging validation loss while training loss falls is the definition of overfitting. Early stopping — using the best checkpoint from before divergence — is the standard remedy.

Rising validation loss while training loss falls means overfitting. The model is memorizing training examples rather than generalizing. Roll back to the checkpoint just before divergence began.

3. Why do LoRA fine-tuning runs typically use a higher learning rate (1e-4 to 2e-4) than full fine-tuning (1e-5 to 5e-5)?

Correct. Since the pretrained weights are frozen and only the adapter matrices A and B are updated, there is far less risk of corrupting the base model's learned representations. One of the two matrices is initialized to zero, so the adapter's contribution starts at zero and can be updated more aggressively.

The key insight is that LoRA freezes the pretrained weights. Because the adapter's contribution starts at zero and the base model weights can't be damaged, the adapters can be trained with larger learning rate steps than would be safe for full fine-tuning.

4. After merging a LoRA adapter into the base model using merge_and_unload(), what is the primary tradeoff?

Right. Merging eliminates the runtime overhead of loading an adapter on top of a base model, which improves inference latency. But the operation is irreversible — always save a copy of the separate adapter before merging.

The key tradeoff is irreversibility. Merging improves inference performance by eliminating adapter loading overhead, but you can no longer separate the base model from the fine-tuned behavior. Always preserve the original adapter files before merging.

Lab 3 — Hyperparameter Tuner

Describe your setup. Get a recommended LoRA config, explain every choice, and debug training problems.

Your Task

You have a model, a dataset, and a GPU. What hyperparameters should you use? This lab simulates the configuration design conversation you'd have with an expert ML engineer. Describe your constraints and get a reasoned config recommendation — then diagnose training problems from loss curves or error messages.

Tell the advisor your base model (7B, 13B, 70B), dataset size, and available GPU
Ask for a complete LoRA config with justification for each value
Describe a training problem (loss spike, overfitting, slow convergence) and ask for diagnosis
Ask how to set up eval metrics beyond loss to catch behavioral regressions

Try: "I'm fine-tuning Mistral 7B on 3,000 examples for JSON extraction. I have one A100 40GB. My training loss looks good but my eval outputs are inconsistently formatted. What's wrong?"

I'm your LoRA configuration and training diagnostics advisor. Tell me about your setup: which base model, how much data, what GPU hardware, and what you're trying to accomplish. I'll recommend a complete training config with reasoning for every hyperparameter, and help you diagnose any training problems you encounter.

Module 3 · Lesson 4

Evaluating & Deploying Your Fine-Tuned Model

Training loss going down doesn't mean the model is good. How to build real evaluation — and take your model from checkpoint to production.

How do you know if your fine-tuned model is actually better — and what can go wrong in deployment that didn't appear in training?

When Google prepared Med-PaLM 2 for pilot use in 2023, they didn't rely on benchmark scores alone. They measured accuracy on USMLE-style exam questions (MedQA) and then had physicians rate the model's long-form answers, including answers to adversarial questions written to elicit responses that were dangerous or plausible-sounding but wrong. Only after this dual evaluation, automated metrics plus expert human review, did the model move into limited pilot testing. The lesson practitioners drew: automated metrics alone can miss entire failure categories that domain experts find quickly.

Why Loss Is Not Enough

A model with validation loss of 0.85 might be excellent or terrible — loss measures how well the model predicts the next token in your eval set, not whether its outputs are actually useful. You need task-specific evaluation that measures what you actually care about.

The gap between training metrics and real-world performance is one of the most consistent failure modes in applied fine-tuning. A model can memorize your eval set's format while still hallucinating values, breaking on edge cases, or failing on input distributions slightly different from training.

Building a Proper Evaluation Suite

Your eval suite should have at least three components:

Held-out task accuracy — A test split never seen during training. For structured outputs: exact match or schema validity rate. For text: ROUGE, BERTScore, or human preference ratings against a reference.
Regression testing — Does the fine-tuned model still perform well on tasks outside your fine-tuning domain? Run the base model benchmarks (MMLU, HellaSwag) and compare. A fine-tuned model that has catastrophically forgotten general reasoning is dangerous to deploy.
Red-team / adversarial eval — Actively try to break the model. Edge case inputs, ambiguous instructions, inputs that look similar to training data but have different correct answers. This reveals overfitting that loss metrics hide.
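For structured outputs, the first component is straightforward to automate. The sketch below computes a schema validity rate and an exact-match rate on parsed JSON, so key order and whitespace do not count as mismatches. The medication fields and sample outputs are hypothetical.

```python
import json


def schema_valid(output: str, required_keys: set[str]) -> bool:
    """True if the output parses as a JSON object containing all required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)


def evaluate(predictions: list[str], references: list[str], required_keys: set[str]) -> dict:
    """Schema validity rate and exact-match rate over a held-out test set."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    valid = sum(schema_valid(p, required_keys) for p in predictions)
    exact = 0
    for pred, ref in zip(predictions, references):
        try:
            exact += json.loads(pred) == json.loads(ref)  # compare parsed objects
        except json.JSONDecodeError:
            pass  # unparseable output counts as a miss
    n = len(predictions)
    return {"schema_validity": valid / n, "exact_match": exact / n}


preds = [
    '{"drug": "metformin", "dose_mg": 500}',
    '{"dose_mg": 10, "drug": "lisinopril"}',   # same content, different key order
    '{"drug": "aspirin"}',                      # missing a required field
    'not json',
]
refs = [
    '{"drug": "metformin", "dose_mg": 500}',
    '{"drug": "lisinopril", "dose_mg": 10}',
    '{"drug": "aspirin", "dose_mg": 81}',
    '{"drug": "atorvastatin", "dose_mg": 20}',
]
scores = evaluate(preds, refs, required_keys={"drug", "dose_mg"})
print(scores)
```

Running the same function on the base model and on each checkpoint gives a like-for-like comparison on the held-out split.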

Automated Metrics

Fast, scalable

Exact match, F1, ROUGE-L, BERTScore, schema validity rate. Run on every checkpoint. Cheap but incomplete.

LLM-as-Judge

Scalable quality proxy

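Use GPT-4 or Claude to score outputs against a rubric. Strong judges agree with human preferences roughly 80% of the time on benchmarks such as MT-Bench, about as often as humans agree with each other. Fast and automatable.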

Human Evaluation

Ground truth

Domain experts rate outputs blind. Expensive but required before production deployment in high-stakes domains.

A/B Testing

Production validation

Shadow the fine-tuned model alongside the existing system. Compare real-user outcomes before full rollout.

LLM-as-Judge — A Practical Pattern

One of the most scalable eval techniques: use a stronger model to score your fine-tuned model's outputs. This is how most serious fine-tuning teams evaluate at scale without paying for thousands of human annotations.

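# LLM-as-judge evaluation pattern (the judge model and client are illustrative choices)
import json
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def judge_output(input_text, model_output, reference_output):
    prompt = f"""
You are an expert evaluator. Score the model output from 1-5 on:
1. Accuracy (does it match the reference in substance?)
2. Format adherence (is the JSON schema correct?)
3. Completeness (are all fields present and populated?)

Input: {input_text}
Reference: {reference_output}
Model Output: {model_output}

Return JSON: {{"accuracy": N, "format": N, "completeness": N, "notes": "..."}}
"""
    response = client.chat.completions.create(
        model="gpt-4o",  # any strong judge model works here
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # request a parseable JSON reply
    )
    return json.loads(response.choices[0].message.content)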

# Run across your full test set, aggregate scores
# Flag outputs where any dimension scores below 3 for human review
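The aggregation and flagging step can be sketched without any API calls, given a list of judge results in the JSON shape requested by the prompt. The scores below are made-up examples, and `summarize` is a hypothetical helper.

```python
from statistics import mean

DIMENSIONS = ("accuracy", "format", "completeness")


def summarize(judgments: list[dict], threshold: int = 3) -> tuple[dict, list[int]]:
    """Average each rubric dimension and flag examples scoring below threshold."""
    averages = {dim: mean(j[dim] for j in judgments) for dim in DIMENSIONS}
    flagged = [
        i for i, j in enumerate(judgments)
        if any(j[dim] < threshold for dim in DIMENSIONS)
    ]
    return averages, flagged


# Hypothetical judge outputs for four test examples.
judgments = [
    {"accuracy": 5, "format": 5, "completeness": 4, "notes": ""},
    {"accuracy": 2, "format": 5, "completeness": 5, "notes": "wrong dose"},
    {"accuracy": 4, "format": 3, "completeness": 3, "notes": ""},
    {"accuracy": 5, "format": 1, "completeness": 2, "notes": "truncated JSON"},
]
averages, flagged = summarize(judgments)
print(averages)
print("needs human review:", flagged)
```

Averages alone can hide failures: here the mean accuracy is 4, yet two of the four outputs are flagged for review.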

Deployment Patterns

Once your model passes evaluation, you have several serving options depending on scale and infrastructure:

Ollama / LM Studio

Local / internal tools

Load your merged model locally. Perfect for internal tools, developer workflows, or low-volume applications. Zero cloud cost.

vLLM

High-throughput serving

Production-grade LLM server with PagedAttention for efficient GPU memory use. OpenAI-compatible API. Best for high-volume APIs.

Modal / RunPod

Serverless GPU

Deploy your fine-tuned model on serverless GPU infrastructure. Pay per request. Good for variable-load production APIs.

HuggingFace Inference

Managed endpoints

Push model to HuggingFace Hub, deploy as a dedicated inference endpoint. Simplest path to a production API with a fine-tuned model.

Monitoring in Production

A fine-tuned model is not a set-and-forget artifact. Production monitoring for fine-tuned models requires:

Output distribution tracking: Log samples of model outputs daily. If the distribution of formats, lengths, or field values shifts, something has changed upstream (input distribution drift).
Failure rate alerting: Track schema validation failures, refused completions, or timeout rates. Spike in failures often indicates input drift or a deployment configuration issue.
Periodic revalidation: Every 30–90 days, run your eval suite again on recent production samples. Real-world data distributions shift; your model may degrade without any code changes.
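Failure rate alerting can start as a rolling counter before you invest in a full monitoring stack. The sketch below tracks schema validation failures over the most recent requests and signals when the rate crosses a threshold. The window size, threshold, and minimum sample count are illustrative assumptions, and `FailureRateMonitor` is a hypothetical class.

```python
from collections import deque


class FailureRateMonitor:
    """Rolling failure rate over the last `window` requests."""

    def __init__(self, window: int = 500, alert_threshold: float = 0.05,
                 min_samples: int = 50):
        self.results = deque(maxlen=window)  # oldest results fall off automatically
        self.alert_threshold = alert_threshold
        self.min_samples = min_samples

    def failure_rate(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 0.0

    def record(self, failed: bool) -> bool:
        """Record one request; return True if the alert condition is met."""
        self.results.append(failed)
        if len(self.results) < self.min_samples:
            return False  # too little data to judge
        return self.failure_rate() > self.alert_threshold


monitor = FailureRateMonitor(window=100, alert_threshold=0.05, min_samples=20)
alerts = [monitor.record(failed=False) for _ in range(95)]   # healthy traffic
alerts += [monitor.record(failed=True) for _ in range(10)]   # sudden burst of failures
print("first alert at request", alerts.index(True) + 1)
print(f"current failure rate: {monitor.failure_rate():.2f}")
```

In production, the `failed` flag would come from the same schema validator used in evaluation, so the offline and online definitions of failure stay aligned.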

The Full Fine-Tuning Pipeline

Task definition → Data collection → Data curation → Format standardization → Train/val/test split → SFT or DPO training → Hyperparameter sweep → Checkpoint selection → Automated eval → Human eval (for high-stakes) → Merge + quantize → Deploy → Monitor → Retrain cycle. Every step is a place where quality can be lost or recovered. The teams that treat fine-tuning as a system — not a one-time event — consistently outperform those who treat it as a script to run once.

Lesson 4 Quiz

Evaluation & Deployment — 4 questions

1. Google's Med-PaLM 2 evaluation approach taught practitioners that:

Correct. Med-PaLM 2's evaluation process demonstrated that automated metrics alone miss entire failure categories. Adversarial human experts found dangerous edge cases in hours that benchmark suites didn't capture.

The Med-PaLM 2 lesson was specifically about evaluation completeness — physician red-teamers found failure modes that automated benchmarks missed entirely, showing that both approaches are necessary for high-stakes deployments.

2. The "LLM-as-judge" evaluation pattern is most valuable because:

Right. LLM-as-judge is valuable for scalability: strong judge models agree with human preferences roughly 80% of the time in published comparisons and can score thousands of outputs automatically, making comprehensive eval tractable without large annotation budgets.

LLM-as-judge's primary value is scalability. It approximates human judgment (roughly 80% agreement in published comparisons) at a fraction of the cost, enabling comprehensive evaluation across large test sets. It doesn't replace human evaluation in high-stakes domains.

3. A fine-tuned model that performs well on your task but shows degraded scores on MMLU and HellaSwag compared to the base model is exhibiting:

Correct. Catastrophic forgetting occurs when fine-tuning on a narrow dataset overwrites weights encoding general capabilities. Regression testing against base benchmarks detects this before deployment. LoRA significantly reduces this risk.

Degraded performance on general benchmarks relative to the base model is catastrophic forgetting. Fine-tuning on narrow data can overwrite general reasoning. This is why regression testing is mandatory, and why LoRA is preferred: it keeps the base weights frozen, so the adapted model can still drift, but the original is always recoverable by removing the adapter.

4. For high-throughput production serving of a fine-tuned 7B model, the recommended infrastructure is:

Right. vLLM is the standard for production LLM serving — its PagedAttention mechanism dramatically improves GPU memory utilization and throughput, and its OpenAI-compatible API makes integration straightforward.

For high-throughput production serving, vLLM is the industry standard. Its PagedAttention algorithm solves GPU memory fragmentation and enables much higher request throughput than naive implementations. Ollama is for local/dev use; vanilla Flask+transformers doesn't scale.

Lab 4 — Evaluation Suite Designer

Build a rigorous eval pipeline for your fine-tuned model. The AI will help you catch failures before they reach production.

Your Task

Your fine-tuned model is trained. Now prove it works — and find where it breaks. This lab guides you through designing a complete evaluation suite: automated metrics, LLM-as-judge rubrics, regression tests, and red-team adversarial cases. Then plan a deployment and monitoring strategy.

Describe your fine-tuned model's task and ask for an evaluation framework
Ask the advisor to write an LLM-as-judge prompt/rubric for your specific outputs
Request a set of adversarial test cases designed to expose common failure modes
Ask how to set up production monitoring with alerts for output distribution drift

Try: "My fine-tuned Mistral 7B extracts structured medication data from pharmacy notes. I need a full eval suite — automated metrics, a judge rubric, adversarial test cases, and a production monitoring plan."

I'm your evaluation and deployment advisor. Describe your fine-tuned model's task, the output format it produces, and the stakes involved. I'll help you design a complete evaluation pipeline — from automated metrics to human red-teaming to production monitoring — so you can ship with confidence rather than hope.

Module 3 — Fine-Tuning Test

15 questions across all four lessons · 80% to pass

1. Fine-tuning differs from pre-training primarily in that:

Correct.

Fine-tuning starts from pretrained weights and continues updating them on a smaller, targeted dataset — far less compute than training from scratch.

2. Which situation is LEAST appropriate for fine-tuning (and better handled by prompting + RAG)?

Correct. Dynamic, frequently-updated knowledge (weekly catalog changes) belongs in RAG, not fine-tuning — fine-tuning bakes in static knowledge that would need retraining to update.

Frequently changing information is better handled by RAG since fine-tuning bakes in static knowledge. The catalog scenario requires retrieval from a live data source, not model weight updates.

3. In LoRA, the output of an adapted layer is computed as:

Correct. LoRA adds a low-rank delta (A@B) to the frozen pretrained weights. The pretrained matrix is never modified.

LoRA computes W_output = W_pretrained + A@B. The pretrained weights are frozen; only the low-rank adapter matrices A and B are updated during training.
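To make the formula concrete, here is a small pure-Python sketch of one adapted layer. The 3×3 weight, the rank-1 adapter, and all numeric values are made up for illustration, and the row-vector convention (x @ W) is an assumption. Many implementations also scale the delta by a constant factor, which is omitted here.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Frozen pretrained weight, shape (3, 3). Values are illustrative.
W_pretrained = [[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]]

# Rank-1 adapter: A is (3, 1), B is (1, 3). Only these would receive gradients.
A = [[0.5], [0.0], [-0.5]]
B = [[1.0, 2.0, 0.0]]

x = [[2.0, 1.0, 4.0]]  # one input row vector, shape (1, 3)

# Adapter path kept separate: x @ W_pretrained + (x @ A) @ B
separate = add(matmul(x, W_pretrained), matmul(matmul(x, A), B))

# Equivalent single matrix: x @ (W_pretrained + A @ B)
W_output = add(W_pretrained, matmul(A, B))
combined = matmul(x, W_output)

trainable = len(A) * len(A[0]) + len(B) * len(B[0])  # 6 adapter parameters
frozen = len(W_pretrained) * len(W_pretrained[0])    # 9 frozen parameters
```

Both paths give the same output, [[1.0, -1.0, 4.0]]. At this toy size the adapter saves little, but the gap grows quickly with dimension: for a 4096×4096 matrix at rank 8, the adapter has 65,536 parameters against 16,777,216 frozen ones.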

4. DPO (Direct Preference Optimization) requires which type of training data?

Correct. DPO trains on (prompt, chosen_response, rejected_response) triplets, directly maximizing the probability gap between preferred and rejected outputs without a separate reward model.

DPO requires preference pairs — for each prompt, you need both a preferred and a rejected completion. This allows DPO to optimize the model's behavior directly without training a separate reward model.
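The way a preference pair turns into a training signal can be sketched for a single triplet. This follows the standard DPO formulation rather than anything specific to this lesson: the frozen reference model, the `beta` value of 0.1, and the function name are assumptions brought in for the example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (prompt, chosen, rejected) triplet.

    Each argument is the summed log-probability of a full response under
    either the model being trained (policy) or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as log(1 + exp(-logits))
    return math.log1p(math.exp(-logits))


# Hypothetical log-probabilities for one triplet.
unchanged = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy equals reference
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)    # policy now favors the chosen response
worsened = dpo_loss(-12.0, -10.0, -10.0, -12.0)   # policy now favors the rejected response
```

When the policy matches the reference, the loss is log 2. It falls as the policy raises the chosen response relative to the rejected one and rises in the opposite case, which is the "probability gap" being widened without a separate reward model.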

5. The Alpaca project's ~$600 fine-tune relied on which data source?

Correct. Alpaca used Self-Instruct with text-davinci-003 to generate 52K diverse instruction-following examples from 175 human-written seed tasks, at minimal cost.

Alpaca used the Self-Instruct pipeline — GPT-3.5 (text-davinci-003) generated 52K instruction examples from 175 seed tasks. This demonstrated that high-quality synthetic data from a strong teacher model could dramatically reduce fine-tuning cost.

6. For a fine-tuning task requiring only a consistent output style change, approximately how many high-quality examples are needed?

Correct. Style and tone changes require the fewest examples — the model already knows how to write; you're just shifting probability distributions toward your preferred patterns.

Style/tone changes need the fewest examples — roughly 50–500. The model already has full language capability; fine-tuning just shifts its output distribution toward your preferred style.

7. Model collapse in synthetic data generation primarily refers to:

Correct. When models train repeatedly on their own outputs, rare but important output patterns disappear and the model converges toward a narrower, less diverse output distribution.

Model collapse describes the progressive narrowing of output diversity when training on model-generated data across multiple generations. Important minority patterns disappear as the model reinforces its most common outputs.

8. What does a warmup schedule do during LoRA fine-tuning?

Correct. Since LoRA adapters are initialized near zero, the first gradient steps can be noisy. Warming up from near-zero LR stabilizes early training before ramping to the full rate.

Warmup gradually increases LR from near-zero to the target value over the first N steps. This prevents the large, unstable gradient updates that can occur when freshly initialized adapter weights receive a full learning rate immediately.
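A linear warmup is only a few lines. This sketch assumes a linear ramp followed by a constant rate; the target of 2e-4 and the 100 warmup steps are illustrative values, and real schedules usually decay the rate afterwards.

```python
def warmup_lr(step, target_lr=2e-4, warmup_steps=100):
    """Linear warmup: ramp from near zero to target_lr, then hold constant."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr


first = warmup_lr(0)     # 1% of the target rate
halfway = warmup_lr(49)  # 50% of the target rate
after = warmup_lr(500)   # full target rate
```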

9. Effective batch size in LoRA training is calculated as:

Correct. Gradient accumulation simulates larger batches by accumulating gradients over multiple forward passes before each optimizer step, enabling larger effective batch sizes on limited GPU memory.

Effective batch = per_device_batch × gradient_accumulation_steps × num_GPUs. Gradient accumulation is the key technique for simulating large batches on limited hardware — accumulate N small batches before stepping the optimizer.
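As a quick worked example of the formula, the sketch below also derives how many optimizer steps one epoch takes. The function names and the numbers (batch of 4, 8 accumulation steps, 2 GPUs, 10,000 examples) are illustrative assumptions.

```python
import math


def effective_batch_size(per_device_batch, gradient_accumulation_steps, num_gpus):
    """Number of examples that contribute to each optimizer step."""
    return per_device_batch * gradient_accumulation_steps * num_gpus


def optimizer_steps_per_epoch(num_examples, effective_batch):
    """Optimizer steps needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / effective_batch)


batch = effective_batch_size(per_device_batch=4, gradient_accumulation_steps=8, num_gpus=2)
steps = optimizer_steps_per_epoch(num_examples=10_000, effective_batch=batch)
```

Here `batch` is 64 and `steps` is 157: each GPU only ever holds 4 examples in memory, but the optimizer behaves as if it saw 64 at a time.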

10. Applying LoRA to q_proj and v_proj targets which component of the transformer?

Correct. q_proj (query) and v_proj (value) are projection matrices in the multi-head attention mechanism. Adding LoRA adapters here is the minimum recommended set — adding k_proj and o_proj increases capacity further.

q_proj and v_proj are the query and value projection matrices in the self-attention mechanism. These are the standard minimum targets for LoRA — empirically found to give good task-specific adaptation with minimal parameter overhead.

11. When training loss falls but validation loss rises, the standard corrective action is:

Correct. Diverging validation loss is the signature of overfitting. Early stopping with checkpoint restoration is the standard remedy — continuing to train will only worsen generalization.

Rising validation loss while training loss falls = overfitting. Stop training and restore the best checkpoint. More training makes it worse — the model is memorizing training examples rather than learning generalizable patterns.
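The stop-and-restore rule can be expressed as a small patience check. This is a sketch assuming you record validation loss at each evaluation; the function name, the patience of 2, and the loss values are hypothetical.

```python
def best_checkpoint(val_losses, patience=2):
    """Return (stop_step, best_step) using patience-based early stopping.

    val_losses[i] is the validation loss measured at evaluation i.
    Training stops once `patience` evaluations pass without a new best.
    """
    best_step, best_loss = 0, float("inf")
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            best_step, best_loss = step, loss
        elif step - best_step >= patience:
            return step, best_step
    return len(val_losses) - 1, best_step


# Hypothetical history: validation loss bottoms out at evaluation 2, then climbs.
val_history = [1.90, 1.42, 1.31, 1.35, 1.44, 1.58]
stop_step, restore_step = best_checkpoint(val_history)
```

With this history, training stops at evaluation 4 and the checkpoint from evaluation 2 is restored. The later evaluations, where training loss would still be falling, are never used.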

12. The merge_and_unload() operation in PEFT:

Correct. merge_and_unload() adds the LoRA delta (A@B) directly into the pretrained weight matrices, producing a single stand-alone model. The adapter is no longer needed separately, but the operation is irreversible.

merge_and_unload() permanently absorbs the LoRA adapter into the base model weights (adds A@B to each W_pretrained), creating a single model file. Faster inference, but irreversible — always save the adapter separately first.

13. In the LLM-as-judge evaluation pattern, what is the judge model asked to do?

Correct. The judge model receives the input, the fine-tuned model's output, and typically a reference answer, then scores across structured dimensions like accuracy, format, and completeness.

LLM-as-judge uses a stronger model to score outputs against a rubric — evaluating dimensions like accuracy, format adherence, and completeness. This scales quality evaluation to thousands of examples without proportional human annotation cost.
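A judge setup mostly comes down to a rubric prompt and a strict parser for the reply. The sketch below shows both halves under stated assumptions: the rubric wording, the 1 to 5 scale, the JSON reply format, and the function names are all hypothetical, and the call to the judge model itself is deliberately left out.

```python
import json

# Hypothetical rubric: one question per scoring dimension.
RUBRIC = {
    "accuracy": "Does the output agree with the reference answer?",
    "format": "Does the output follow the required structure?",
    "completeness": "Are all required fields or points present?",
}


def build_judge_prompt(task_input, model_output, reference):
    """Assemble the text sent to the judge model."""
    criteria = "\n".join(f"- {name}: {question}" for name, question in RUBRIC.items())
    return (
        "Score the candidate output from 1 (poor) to 5 (excellent) on each criterion.\n"
        f"{criteria}\n\n"
        f"Input:\n{task_input}\n\n"
        f"Candidate output:\n{model_output}\n\n"
        f"Reference answer:\n{reference}\n\n"
        'Reply with JSON only, e.g. {"accuracy": 4, "format": 5, "completeness": 3}.'
    )


def parse_judge_reply(reply):
    """Validate the judge's JSON reply; return None if it is malformed."""
    try:
        scores = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(scores, dict) or set(scores) != set(RUBRIC):
        return None
    if not all(type(v) is int and 1 <= v <= 5 for v in scores.values()):
        return None
    return scores


prompt = build_judge_prompt("note text", "candidate JSON", "reference JSON")
good = parse_judge_reply('{"accuracy": 4, "format": 5, "completeness": 3}')
bad = parse_judge_reply("The output looks fine to me.")
```

Rejecting malformed replies instead of guessing at them keeps a chatty or off-format judge response from silently skewing your averages.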

14. vLLM's primary advantage over a standard transformers.generate() server for production use is:

Correct. vLLM's PagedAttention allocates KV-cache memory in non-contiguous pages, nearly eliminating fragmentation and enabling much higher throughput: up to 24x over Hugging Face Transformers in vLLM's own published benchmarks.

vLLM's key innovation is PagedAttention, which treats KV-cache like virtual memory pages to eliminate fragmentation. This enables much higher GPU utilization and request throughput than standard transformers serving.

15. The most important reason to run regression tests (e.g., MMLU) after fine-tuning is:

Correct. Regression tests catch catastrophic forgetting — where fine-tuning on a narrow task degrades the model's general capabilities. A model that excels at your task but can no longer reason coherently about anything else is a deployment risk.

Regression testing's primary purpose is detecting catastrophic forgetting. Fine-tuning on narrow data can overwrite general capabilities. Comparing your fine-tuned model against its own base model on general benchmarks is the check — not comparison against commercial models.
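The base-versus-fine-tuned comparison can be automated as a simple gate. This sketch assumes you already have benchmark accuracies for both models; the scores, the 0.02 tolerance, and the function name are made up for illustration.

```python
def find_regressions(base_scores, tuned_scores, tolerance=0.02):
    """Return benchmarks where the fine-tuned model dropped more than `tolerance` below base."""
    regressions = {}
    for name, base in base_scores.items():
        drop = base - tuned_scores[name]
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions


# Hypothetical accuracies for the same base model before and after fine-tuning.
base = {"mmlu": 0.62, "hellaswag": 0.81}
tuned = {"mmlu": 0.55, "hellaswag": 0.80}
regressions = find_regressions(base, tuned)
```

With these numbers only MMLU is flagged, with a drop of 0.07; the one-point HellaSwag dip falls inside the tolerance. A non-empty result is a reason to hold the deployment and revisit the training data or hyperparameters.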