Module 8 · Lesson 1

Picking Your Project: Constraints Are Your Friends

The gap between "I know deep learning" and "I built something with deep learning" is where most people stall. Let's close that gap.

How do you choose a project that's actually doable in a few weeks — and actually worth showing?

Priya is a junior at UC San Diego. She's finished Andrew Ng's deep learning specialization, built a few toy CNNs in Colab notebooks, and genuinely understands backpropagation. She's not faking it. Her internship interview at a mid-size ML startup is in three weeks and they've asked her to bring "a project you built yourself."

She opens a blank notebook and freezes. Every idea feels either too small ("a handwritten digit classifier? everyone's done that") or too large ("an autonomous driving perception system? that's a PhD dissertation"). She spends five days reading project idea lists instead of building anything. The interview happens. She shows the CNN she built following a tutorial. The interviewer says, "cool — what did you change about the architecture and why?" Priya has no answer.

This is the most common failure mode in this space. Not lack of knowledge. Lack of a specific, owned decision to defend.

Why "Just Pick Something" Is Actually Good Advice

The instinct to optimize project selection before building is a trap. You will not know which problems are interesting until you've hit a wall trying to solve one. The constraint — pick something you can finish in 3–4 weeks with the tools you already have — is not a limitation on your ambition. It's the thing that forces you to make real decisions instead of theoretical ones.

Here's the dirty secret nobody in the tutorial ecosystem tells you: a well-understood small project beats a half-finished ambitious one every single time — in interviews, in portfolios, in your own learning. Recruiters at companies like Duolingo, Notion, and Figma have said publicly that they're more impressed by someone who can explain exactly why they chose a 3-layer network over a 5-layer one than by someone who set up a diffusion model framework they can barely describe.

The goal of your first real project isn't to solve a hard problem. It's to own every decision you made — and be able to explain each one out loud to a skeptical peer.

The Three Filters for a Viable Project Idea

Run every idea through these three questions before committing:

Can you get real data in less than 2 hours? Kaggle, Hugging Face Datasets, UCI ML Repository, and scraped public APIs are all fair game. If the data pipeline is the hard part, you're doing data engineering — not deep learning. Both are valid; just name which one you're doing.

Can you state a clear success metric before training? Not "it should work well" — something specific. "Validation accuracy above 85%." "Mean absolute error below 5 degrees on the test set." Vague success criteria mean you can never finish, because there's always one more thing to try.

Is there at least one architectural choice you'll have to make yourself? This is the thing Priya's interviewer was actually testing. Not whether the model runs — whether you made deliberate design decisions. Dropout rate. Whether to use batch norm. Whether to use transfer learning and which layer to unfreeze. Any of these count.

If an idea clears all three filters, it's a legitimate project. If it fails any one of them, it's either a tutorial (fails #3) or a research paper (fails #1 or #2).

Project Archetypes That Actually Work

There are roughly four archetypes of project that hit the feasibility sweet spot for someone at your level. None of these are original — and that's fine. Originality in execution is what matters.

Archetype A — Custom Classifier

Build a CNN or fine-tuned ViT that classifies something you personally care about. Examples: local bird species from your backyard, skin conditions, product defects in manufacturing photos. The "personal care" part matters — it's what drives you past the first failed training run.

Archetype B — Sequence Predictor

Use an LSTM or Transformer to predict or generate something sequential: stock sentiment from Reddit threads, music genre from spectrograms, typing patterns for accessibility tools. Any domain where order matters and you have a temporal or sequential structure.

Archetype C — Fine-Tuned Foundation Model

Take a pretrained model (BERT, ResNet, Whisper, CLIP) and fine-tune it for a narrow, specific task you define. The architectural choices are about what to freeze, what to unfreeze, and how to set up your training loop. All of these are real decisions.

Archetype D — Generative Side Project

Build a small generative model — a conditional GAN, a VAE, or a tiny diffusion model — targeting a specific output domain. Constraints: images should be small (32x32 or 64x64), dataset should be narrow. Breadth kills generative projects.

Your peers scrolling LinkedIn are mostly showing Archetype C right now — fine-tuned BERT classifiers are everywhere in 2024 portfolios. That's not a reason to avoid it; it's a reason to make your explanation unusually clear and specific.

The Scope Document: One Page, Before You Write Code

Before touching a notebook, write — literally write, in a text file — a one-page scope document. It should answer five questions:

What is the task? One sentence, unambiguous. "Classify whether a Reddit comment about a stock is bullish, bearish, or neutral."
What is the dataset? Name it and link it. Describe its size, format, and any known issues (class imbalance, label noise, etc.).
What is the success metric? One number. State it before training. Write down what "good enough to ship" looks like.
What are the two or three architectural decisions you'll need to make? List them explicitly. This forces you to realize you have real choices — and primes you to think about them before the pressure of debugging.
What is the hardest thing that could go wrong? "Not enough data." "Class imbalance." "Model never converges." Naming the risk ahead of time means you have a contingency plan instead of a crisis.

This document takes 30 minutes to write. It will save you 10+ hours of aimless iteration. The act of writing it forces a kind of precision that looking at Kaggle notebooks never will.

Practical Takeaway

Before you open Colab or VS Code, write your scope document in a plain text file. Send it to one person — a classmate, a friend who codes, anyone — and ask them to read the first two lines and tell you what the project does. If they get it right, your scope is clear enough. If they don't, your task definition isn't precise enough yet.

Quiz — Lesson 1

Picking Your Project

1. A classmate says their project idea is "using deep learning to improve healthcare." Which of the three filters does this fail most obviously?

Right. "Improve healthcare" is not a measurable target. There's no number you could hit that would tell you the project is done. Scoping to something like "classify chest X-rays as normal or abnormal with AUC above 0.88 on this specific dataset" would pass the filter.

Not quite. The bigger issue is that there's no defined success metric — "improve healthcare" gives you no way to know when the project is finished or working.

2. Priya's core mistake in the scenario wasn't choosing the wrong project. It was:

Exactly. The interviewer wasn't testing whether the model worked — they were testing whether Priya had made real decisions she could defend. Building on top of a tutorial without changing anything means you can't answer "why did you do it this way?"

The architecture choice itself wasn't the problem. The problem was that she couldn't explain any of the choices she made — because she hadn't actually made them herself.

3. Which of these is the most useful thing a scope document does before you write any code?

Right. The scope document is a clarity tool, not a guarantee. Writing "validation accuracy above 85%" before training means you have a real stopping criterion and can evaluate whether changes actually help.

The scope document's primary value is forcing upfront precision — task, data, metric, risks. It doesn't lock you in (you can revise it), and it doesn't make models converge. But it does make aimless exploration much harder.

4. You want to build a bird species classifier for your regional bird species (about 40 species). Which project archetype fits best?

Correct. Image classification of a specific, bounded set of categories is a clean Archetype A project. The fact that it's 40 species instead of 1000 makes it more tractable, not less legitimate.

Bird species classification is an image classification task — Archetype A. Sequence predictors handle temporal data, and generative models produce new examples. This one calls for a classifier.

5. The lesson says a "well-understood small project beats a half-finished ambitious one." In an interview context, what's the most honest reason this is true?

Exactly right. The technical interview isn't really about the model — it's about whether you can reason about your own decisions. A finished small project gives you that; a half-finished large one gives you excuses.

The core issue is what gets tested in interviews: your ability to reason about what you built and why. You can only do that for things you actually finished and made real choices about.

Lab 1 — Scope Your Project

Work with an AI peer to sharpen your project idea into a real scope document.

Your Role: Project Founder

You're pitching a deep learning project idea to a technical peer who will push back hard on vague scope. Your job is to arrive at a clear task statement, dataset, success metric, and at least two architectural decisions you'll need to make.

The AI playing your peer will not accept "it should work well" as a success metric. It will ask follow-up questions until the scope is actually precise.

Start by describing a project idea you're interested in — any domain, any rough concept. The peer will help you run it through the three filters and write the scope.

Lab Partner

AI Peer · Project Scoping

Hey. Give me your rough project idea — doesn't have to be polished. One or two sentences on what you want to build and why. I'll help you figure out if it's actually viable and what the scope should look like.

Module 8 · Lesson 2

Data First: Getting, Cleaning, and Knowing Your Dataset

The model is 30% of the work. The data is 70%. Most tutorials have it exactly backwards.

What does it actually mean to "understand" your dataset — and why does it change every architectural decision you make?

Marcus is building a sentiment classifier for crypto Reddit (r/CryptoCurrency) posts as a side project he wants to show at his data science club. He pulls 50,000 posts from Pushshift, skims a few rows, calls them "roughly balanced between positive and negative," and starts training a fine-tuned BERT model.

Two days later, his validation accuracy won't break 62% no matter what he tries — learning rate sweeps, different optimizers, dropout variations. He posts in his club's Slack asking for help debugging his architecture. Three people suggest different batch sizes. Nobody asks him to look at his data distribution.

When someone finally asks him to print value_counts() on his labels, the answer is devastating: 73% neutral, 19% positive, 8% negative. A model that outputs "neutral" for every post would score 73% on this data, so his 62% was actually worse than the trivial baseline. The architecture was fine. The data was never understood.

The Data Audit: What You Should Know Before Training Anything

Before you train a single epoch, you should be able to answer these questions about your dataset from memory — not by looking them up:

Class distribution. What fraction of your samples belong to each class? If any class falls far below an even share of the data (for a three-class problem like Marcus's, under about 15%), you have a class imbalance problem that needs a strategy: weighted loss, oversampling, or both.
Sample count per split. How many training, validation, and test examples do you have after splitting? "5,000 total" is meaningless. "3,500 train / 750 validation / 750 test" is a plan.
Input dimensions. For images: resolution range and whether they're consistent. For text: token length distribution (mean, 95th percentile). For tabular: feature count and data types. Knowing this tells you whether you need resizing, truncation, or padding strategies.
Label quality. Spot-check 50 examples manually. Do the labels make sense? Are there obvious mislabels? A 5% label error rate is considered typical for crowd-sourced data — it's not catastrophic, but you should know it exists.
Train/test leakage risk. Is there any reason examples from the same "unit" (same person, same recording session, same day) could appear in both train and test? If yes, you need to split at the unit level, not the sample level.

This audit takes 1–2 hours. Not doing it is how you end up like Marcus — spending two days debugging an architecture that isn't broken.
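The class-distribution and leakage questions from the audit can be answered with a few lines of plain Python. What follows is a minimal sketch, not a standard tool: `class_distribution`, `flag_minority_classes`, and `shared_groups` are hypothetical helper names, the 15% threshold is the rule of thumb from the audit list, and the labels and author ids are made-up stand-ins with the same proportions as Marcus's dataset.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: fraction of samples} for a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def flag_minority_classes(distribution, threshold=0.15):
    """Return the labels whose share of the data is below the threshold."""
    return sorted(label for label, share in distribution.items() if share < threshold)


def shared_groups(train_groups, test_groups):
    """Return group ids (author, session, day) that appear in both splits."""
    return set(train_groups) & set(test_groups)


# Hypothetical labels with the same proportions as Marcus's posts.
labels = ["neutral"] * 73 + ["positive"] * 19 + ["negative"] * 8
distribution = class_distribution(labels)
print(distribution)
print("Needs an imbalance strategy:", flag_minority_classes(distribution))

# Hypothetical author ids: "u3" has posts in both splits, so this split leaks.
print("Leaking groups:", shared_groups(["u1", "u2", "u3"], ["u3", "u4"]))
```

If `shared_groups` returns anything, the split was probably made at the sample level and should be redone at the unit level.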

Handling Class Imbalance Without Overcorrecting

Class imbalance is one of those problems where the worst move is ignoring it and the second-worst move is overcorrecting. Let's be specific about what actually works.

Strategy: Weighted Loss

In PyTorch, pass weight=class_weights to CrossEntropyLoss. Compute weights as 1 / class_frequency and normalize. This is the least disruptive approach and usually the right one. Try it first, before touching your data.
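As a rough illustration of the weight computation, here is a sketch in plain Python. The function name `inverse_frequency_weights` is hypothetical, the counts are Marcus's 73/19/8 split applied to 50,000 posts, and normalizing so the weights sum to 1 is one reasonable convention among several.

```python
def inverse_frequency_weights(class_counts):
    """Weights proportional to 1 / class frequency, normalized to sum to 1."""
    total = sum(class_counts)
    inverse = [total / count for count in class_counts]
    norm = sum(inverse)
    return [w / norm for w in inverse]


# Counts in class-index order: neutral, positive, negative.
class_counts = [36500, 9500, 4000]
class_weights = inverse_frequency_weights(class_counts)
print([round(w, 3) for w in class_weights])

# In PyTorch these would then be handed to the loss, for example:
#   criterion = torch.nn.CrossEntropyLoss(weight=torch.tensor(class_weights))
```

The rarest class ends up with the largest weight, so a mistake on a "negative" post costs the model more than a mistake on a "neutral" one.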

Strategy: Oversampling Minority

Duplicate or augment minority class examples so each class is roughly equal in the training set. Risk: if you oversample without augmentation, you overfit to minority examples. Use with augmentation or use imbalanced-learn's SMOTE for tabular data.

Strategy: Undersampling Majority

Remove majority class examples to balance distribution. Simple and fast, but throws away real data. Only worth trying if your majority class is so large that training takes too long — otherwise you're making your problem harder for no gain.

Strategy: Adjust Your Metric

Sometimes the right move isn't fixing the data — it's fixing the metric. Accuracy on imbalanced data is misleading. Switch to F1 score (macro-averaged), ROC-AUC, or precision-recall curves depending on which errors cost more in your domain.
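To see why accuracy misleads here, the sketch below scores a "model" that predicts neutral for everything, using the 73/19/8 split from Marcus's story. The helper names are hypothetical and the implementation is a plain-Python illustration; scikit-learn's f1_score with average="macro" computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly gets an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical evaluation set with Marcus's class proportions.
y_true = ["neutral"] * 73 + ["positive"] * 19 + ["negative"] * 8
y_pred = ["neutral"] * 100  # always predict the majority class

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.2f}")
```

Accuracy rewards the majority-class shortcut; macro F1 does not, because the two classes the model never predicts each contribute a score of zero.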

Most of your peers will reach for oversampling first because it feels intuitive. Weighted loss is usually cleaner and harder to mess up — make it your default.

Data Augmentation: What It Actually Does and What It Doesn't

Data augmentation is often presented as a magic free-data trick. It's useful, but it has real limits that are worth being precise about.

What augmentation does: It creates transformed versions of existing examples during training, forcing the model to learn invariances. For images: random crops, flips, color jitter, rotations. For text: synonym replacement, back-translation. For audio: pitch shift, time stretch. The result is a model that generalizes better to the natural variation in your domain.

What augmentation doesn't do: It doesn't create new information. If your training set has 200 examples of one class, you can augment to 2,000 training passes of that class, but the underlying information content hasn't changed. You're still fitting to the same 200 underlying examples. Augmentation fights overfitting; it doesn't fix data scarcity.

Watch Out

Apply augmentation only to your training set. Augmenting validation or test sets is a mistake — you want to measure performance on real, unmodified examples. And for some augmentations (like horizontal flips of medical images), make sure the transformation is actually realistic for your domain before using it.

For your project: pick 2–3 augmentation techniques that are domain-appropriate and apply them consistently. Documenting which augmentations you used and why is exactly the kind of "deliberate decision" that makes your project defensible.
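One way to keep augmentation out of validation is to make it a flag on the dataset wrapper. The sketch below is framework-free and purely illustrative: `FlipDataset` and `horizontal_flip` are hypothetical names, images are nested lists rather than tensors, and in a real project the same idea would usually live in a PyTorch Dataset with a separate transform for training and evaluation.

```python
import random


def horizontal_flip(image):
    """Mirror a 2-D image (a list of rows) left to right."""
    return [row[::-1] for row in image]


class FlipDataset:
    """Wraps images and labels; flips at random only when train=True."""

    def __init__(self, images, labels, train, flip_prob=0.5, seed=42):
        self.images = images
        self.labels = labels
        self.train = train
        self.flip_prob = flip_prob
        self.rng = random.Random(seed)

    def __len__(self):
        # Augmentation never changes the number of underlying examples.
        return len(self.images)

    def __getitem__(self, index):
        image = self.images[index]
        if self.train and self.rng.random() < self.flip_prob:
            image = horizontal_flip(image)
        return image, self.labels[index]


# The same toy image is used for both wrappers only to show the flag's effect.
images = [[[1, 2, 3], [4, 5, 6]]]
labels = ["pothos"]
train_set = FlipDataset(images, labels, train=True, flip_prob=1.0)
val_set = FlipDataset(images, labels, train=False, flip_prob=1.0)
print(train_set[0][0], val_set[0][0])
```

Note that `len(train_set)` is still 1: the flip changes what the model sees on each pass, not how many underlying examples exist.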

Building a Reproducible Data Pipeline

The last thing anyone wants to think about when excited to train their model is pipeline reproducibility. But "I lost my preprocessing code and can't recreate the cleaned dataset" is a more common horror story than you'd think — especially six months after the project when you're trying to show it in an interview.

Fix your random seeds before splitting data. torch.manual_seed(42), numpy.random.seed(42), random.seed(42). Write these at the top of every notebook that touches data.
Save your split indices or use a fixed split file. Don't just re-run a random split and hope it's the same. Save the train/val/test indices as a .npy or .csv file. This is a 3-line fix that eliminates a whole class of leakage bugs.
Log your preprocessing steps in plain English in a README or notebook cell. What did you do to clean the data? Why? Future-you reading this six months from now will be grateful.
Version your raw data separately from your processed data. Store them in separate folders. Never overwrite raw data. It's your ground truth.
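A minimal sketch of the "save your split indices" habit, using only the standard library. The function names and the split_indices.json filename are hypothetical, and JSON is used here in place of the .npy or .csv formats mentioned above purely to avoid extra dependencies.

```python
import json
import os
import random
import tempfile


def make_split(n_samples, seed=42, val_fraction=0.15, test_fraction=0.15):
    """Shuffle indices 0..n_samples-1 once and cut them into three lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = round(n_samples * val_fraction)
    n_test = round(n_samples * test_fraction)
    return {
        "test": indices[:n_test],
        "val": indices[n_test:n_test + n_val],
        "train": indices[n_test + n_val:],
    }


def save_split(split, path):
    """Write the index lists to disk so later runs reuse them verbatim."""
    with open(path, "w") as f:
        json.dump(split, f)


def load_split(path):
    """Read back the index lists written by save_split."""
    with open(path) as f:
        return json.load(f)


split = make_split(5000)
split_path = os.path.join(tempfile.mkdtemp(), "split_indices.json")
save_split(split, split_path)
reloaded = load_split(split_path)
print({name: len(ids) for name, ids in reloaded.items()})
```

Every later experiment would call `load_split` rather than `make_split`, so the comparison between runs never depends on the random number generator behaving identically.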

Practical Takeaway

The next time you load a dataset, spend 20 minutes on the audit: print shape, value_counts, check for nulls, visualize 10 random examples from each class. Make this a ritual before you write a single model layer. You will catch something every time.

Quiz — Lesson 2

Data First

1. Marcus's model stuck at 62% accuracy despite architecture changes. The real cause was:

Exactly. With 73% of samples labeled "neutral," a model that always predicts neutral gets 73% accuracy — but Marcus saw 62%, meaning his model was actually performing worse than the trivial baseline. The problem was never the architecture.

The issue was in the data, not the model. With a 73/19/8 class split, a model can "cheat" by always predicting the majority class. Marcus needed to address class imbalance before anything else.

2. You have image data and suspect your validation set might contain augmented versions of training images. This is a problem called:

Right. Applying augmentation before splitting (instead of only during training) causes data leakage. Your validation metric will look better than it should because the model has effectively seen versions of those examples before.

This is a data leakage problem. If augmented versions of training examples appear in validation, your validation accuracy is optimistically biased — you're not actually measuring generalization to unseen data.

3. Your dataset has 5% label errors (common for crowd-sourced data). The best immediate response is:

Correct. 5% label noise is considered typical, not catastrophic. You don't need to fix it to build a useful project — but you do need to know it's there and disclose it. Manually correcting all of it usually isn't worth the time at this project scale.

5% label noise is actually typical for crowd-sourced data. The right move is to acknowledge it, spot-check enough examples to understand what's mislabeled and why, and include it as a known limitation — not to start over or try to correct everything.

4. You're classifying medical skin images. Your instinct is to use horizontal flipping as an augmentation. Should you?

Right — and this is the kind of domain reasoning that makes your project defensible. For skin lesions, horizontal flips are generally fine because the condition doesn't have left/right asymmetry diagnostically. But for other medical images (e.g., chest X-rays showing organ positions), flipping could create unrealistic examples.

The key question is whether the augmentation creates realistic examples. For skin lesions, horizontal flips are usually fine. For something like text with directionality or medical images with structural asymmetry, flipping could create unrealistic training examples.

5. Why does saving your train/val/test split indices as a file matter more than just re-running the split with the same random seed?

Exactly. Random number generation can be affected by numpy/scikit-learn version changes, OS differences, or even import order. Saving indices explicitly is the only way to guarantee you're measuring the same thing across runs and environments.

Library version changes can change how random seeds produce sequences, even if the seed value is the same. Saving split indices explicitly means you're guaranteed to be comparing experiments on the same data, regardless of environment.

Lab 2 — Audit Your Dataset

Work through a realistic data audit with an AI peer who will ask uncomfortable questions about your data.

Your Role: Data Analyst

You've collected (or chosen) a dataset for your project. Your job is to describe it to your AI peer, who will run through the data audit questions and push you to identify any problems — class imbalance, leakage risk, label noise, augmentation choices — before you write a single model layer.

If you don't have a real dataset yet, use the Reddit crypto sentiment example from the lesson (50,000 posts, 73% neutral, 19% positive, 8% negative).

Describe your dataset: where it's from, rough size, what the labels look like, and how you plan to split it. The peer will audit it with you.

Lab Partner

AI Peer · Data Audit

Walk me through your dataset. Where did you get it, how big is it, what are the labels, and do you know the class distribution yet? I'll push back on anything that looks risky before you start training.

Module 8 · Lesson 3

Training, Debugging, and Knowing When Your Model Is Actually Learning

Most training runs don't fail catastrophically. They fail quietly — and you need to know what quiet failure looks like.

How do you tell the difference between a model that's learning and one that's memorizing noise?

Diego is training a binary classifier to detect AI-generated text for a class project at his university. His training loss drops beautifully — from 0.68 to 0.12 over 20 epochs. He screenshots the loss curve and adds it to his presentation. He's feeling good.

Then he plots his validation loss. It drops for 4 epochs, then starts climbing at epoch 5. By epoch 20 it's back at 0.52. His training accuracy is 98%. His validation accuracy is 67%.

The model memorized his training set. He has no early stopping, no regularization strategy, and he never looked at his validation curve while training. He spent two more days trying to fix it by adding more layers — which made overfitting worse. By the time his class presentation happened, his model was demonstrably broken.

The frustrating part? He had all the information he needed to catch this at epoch 5. He just wasn't watching the right things.

The Signals You Should Monitor During Every Training Run

You should be watching four curves during every training run, not one:

Training loss. Should decrease. If it oscillates wildly, your learning rate is probably too high. If it stays flat or decreases very slowly, the learning rate may be too low, your model capacity may be too small, or gradients may not be flowing.

Validation loss. Should decrease alongside training loss early on, then level off. If it starts increasing while training loss decreases, you're overfitting. This is your primary early-stopping signal.

Training accuracy. Should increase. If it plateaus far below your target metric, you have a capacity or learning rate problem. If it hits 100% very quickly, you're either overfitting or your task is too easy to be interesting.

Validation accuracy (or your chosen metric). This is the only curve that actually matters for your final model. Track it at every epoch. Your best model checkpoint is the one with the best validation metric — not the one from the last epoch.

If you can only watch one: watch validation loss. Plotted against training loss, it doubles as your overfitting thermometer: the gap between the two curves is the reading.
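The overfitting signal described above can be checked mechanically. This is a sketch under stated assumptions: `first_overfit_epoch` is a hypothetical helper, and the loss values are made up to resemble Diego's run (validation loss falls for four epochs and climbs at epoch 5) rather than taken from real training.

```python
def first_overfit_epoch(train_losses, val_losses):
    """Return the first 1-based epoch where validation loss rises while
    training loss still falls, or None if that never happens."""
    for i in range(1, len(val_losses)):
        val_rose = val_losses[i] > val_losses[i - 1]
        train_fell = train_losses[i] < train_losses[i - 1]
        if val_rose and train_fell:
            return i + 1
    return None


# Hypothetical per-epoch losses for epochs 1 through 6.
train_losses = [0.68, 0.55, 0.45, 0.38, 0.32, 0.27]
val_losses = [0.66, 0.56, 0.50, 0.47, 0.49, 0.52]

gaps = [round(v - t, 2) for t, v in zip(train_losses, val_losses)]
print("val minus train gap per epoch:", gaps)
print("first overfitting signal at epoch:", first_overfit_epoch(train_losses, val_losses))
```

A single bad epoch can be noise, which is why early stopping waits for several epochs without improvement before acting on this signal.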

Diagnosing What's Actually Wrong

Training failures map to a short list of causes. Being systematic about diagnosis saves hours.

Symptom: Loss doesn't decrease at all

Check: learning rate too low, gradients not flowing (dying ReLU, vanishing gradients), data not batching correctly, loss function wrong for the task. Start by printing your first batch, then compare the loss before and after a single optimizer step.

Symptom: Loss oscillates wildly

Learning rate too high. Drop it by 10x and watch what happens. If that's too slow, use a learning rate scheduler — cosine annealing or ReduceLROnPlateau are solid defaults.

Symptom: Val loss diverges from train loss early

Overfitting. Try: adding dropout (0.3–0.5 in fully connected layers), weight decay in your optimizer (1e-4 is a common start), reducing model capacity, or more data augmentation.

Symptom: Both losses plateau too early

Underfitting. Model capacity may be too small, learning rate too low, or you're hitting a data limitation. Try increasing model depth or width by one step. If that doesn't help, the ceiling may be your data quality.

The Single Best Debugging Move

Before changing anything in your model, verify your training loop on a tiny dataset — like 10 examples. If your model can overfit to 10 examples (get near-zero loss on 10 examples after enough epochs), your training loop is correct. If it can't overfit to 10 examples, something is wrong with the data, loss, or gradient flow — not the architecture.
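Here is what the "overfit 10 examples" check looks like in miniature. The sketch swaps a deep network for a one-feature logistic regression trained with hand-written gradient descent so it runs without any framework; the data, learning rate, and step count are arbitrary illustrative choices. In PyTorch the same check would use your real model and a 10-example subset of your real data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_bce(w, b, xs, ys):
    """Mean binary cross-entropy of the model p = sigmoid(w * x + b)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)


def train(xs, ys, lr=0.5, steps=1000):
    """Full-batch gradient descent; returns the loss before training
    followed by the loss after every step."""
    w, b = 0.0, 0.0
    history = [mean_bce(w, b, xs, ys)]
    for _ in range(steps):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / len(xs)
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad_w
        b -= lr * grad_b
        history.append(mean_bce(w, b, xs, ys))
    return history


# Ten hypothetical, cleanly separable examples.
xs = [-5.0, -4.0, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
history = train(xs, ys)
print(f"loss before training: {history[0]:.3f}, after: {history[-1]:.4f}")
```

If the final loss were not close to zero on ten examples this easy, the bug would be in the loop itself (data, loss, or update step), not in the model's capacity.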

Early Stopping and Model Checkpointing

Early stopping is not an admission that you couldn't train long enough. It's a regularization technique that almost always improves generalization, and it's one of the design decisions you should be able to defend alongside the architectural ones in your scope document.

The basic setup: monitor validation loss (or your metric) after each epoch. Save a checkpoint whenever validation performance improves. If validation hasn't improved for N consecutive epochs (patience = N), stop training and load the best checkpoint. Patience values of 5–15 are typical depending on how noisy your validation metric is.

In PyTorch Lightning, early stopping is a callback. In raw PyTorch, you implement it yourself in about 15 lines — which is actually worth doing once, because it forces you to understand what "save model state" means in practice: saving model.state_dict(), optimizer state, and epoch number, so you can resume or evaluate from that exact point.
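The patience logic itself is small enough to sketch without a framework. The `EarlyStopping` class below is a hypothetical hand-rolled helper, not PyTorch Lightning's callback of the same name, and the validation losses are made up; the commented-out torch.save call marks where a raw PyTorch loop would write its checkpoint.

```python
class EarlyStopping:
    """Tracks the best validation loss and signals when patience runs out."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch. Returns (improved, should_stop)."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience


# Hypothetical validation losses for epochs 1 through 8.
val_losses = [0.60, 0.50, 0.45, 0.42, 0.44, 0.46, 0.47, 0.50]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses, start=1):
    improved, should_stop = stopper.step(epoch, val_loss)
    if improved:
        # In raw PyTorch, this is where the checkpoint would be written, e.g.
        # torch.save({"epoch": epoch,
        #             "model": model.state_dict(),
        #             "optimizer": optimizer.state_dict()}, "best.pt")
        pass
    if should_stop:
        stopped_at = epoch
        break

print(f"best epoch: {stopper.best_epoch}, stopped at epoch: {stopped_at}")
```

After the loop ends, you would reload the checkpoint from `stopper.best_epoch` rather than keep the weights from the epoch where training stopped.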

Practical Takeaway

Add a validation loop and checkpoint saving to your training loop before your first real training run — not as an afterthought. You will always be glad you did. You will never be glad you didn't. And when someone asks "how did you prevent overfitting?" — you now have a real answer with specific implementation details.

What Your Peers Are Missing in Their Training Loops

The most common things missing from first-time deep learning projects in 2024 are, in roughly this order: no validation loop (only tracking training loss), no model checkpointing, no learning rate scheduling, and no gradient clipping for RNN/LSTM projects. You don't need all of these for every project — but you should make a conscious decision about each one.

Gradient clipping is worth a note: if you're training any recurrent architecture (LSTM, GRU), gradient clipping is effectively mandatory. Without it, exploding gradients are common and very difficult to distinguish from a learning rate problem. In PyTorch, it's one line: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0). Call it after loss.backward() and before every optimizer.step() in RNN training. No exceptions.
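To make the effect of norm clipping concrete, the sketch below reproduces the idea on a flat list of numbers. `clip_by_global_norm` is a hypothetical illustration, not PyTorch's implementation: it treats all gradients as one vector, and rescales that vector only when its L2 norm exceeds max_norm.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one shared factor so their combined L2 norm
    is at most max_norm. Returns (clipped_grads, norm_before_clipping)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The small constant avoids division by zero when every gradient is 0.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


# A hypothetical "exploding" gradient with norm 5, clipped to norm 1.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(f"norm before: {norm_before:.2f}, norm after: {norm_after:.2f}")

# A gradient already inside the limit is left untouched.
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(small)

# Placement in a PyTorch training step:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step()
```

Because every component is scaled by the same factor, clipping shortens the step without changing its direction.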

The difference between "I trained a model" and "I trained a model I understand" is whether you can point to each of these choices and explain the value you set and why. That's the thing that makes your project actually defensible.

Quiz — Lesson 3

Training, Debugging, and Knowing When Your Model Is Learning

1. Diego's model had 98% training accuracy and 67% validation accuracy. This is a textbook example of:

Correct. A 31-point gap between training and validation accuracy, combined with validation loss rising while training loss falls, is the defining overfitting signature. The model learned the training set's noise, not its signal.

The large gap between training performance (98%) and validation performance (67%) — with validation loss rising while training loss falls — is the classic overfitting pattern. The model memorized instead of generalized.

2. Your training loss is oscillating wildly between epochs — going up and down by large amounts. The most likely cause is:

Right. Wild oscillation in loss is almost always a learning rate problem. When the learning rate is too high, gradient steps overshoot minima and bounce around. Dropping by 10x is the standard first diagnostic move.

Wildly oscillating loss is a learning rate signature. When the learning rate is too high, each gradient step overshoots the minimum, causing the loss to bounce. Drop learning rate by 10x and see if oscillation stabilizes.

3. Before changing your model architecture to debug a training problem, what should you do first?

Exactly. If your model can't overfit 10 examples, the training loop itself is broken — data pipeline, loss function, or gradient flow. Overfitting 10 examples confirms the mechanics work, so you can trust larger-scale experiments.

The "overfit 10 examples" test is the most diagnostic first step. If you can't get near-zero loss on 10 examples after enough steps, something is wrong with the training loop mechanics — not the architecture. Fix that first.

4. You're training an LSTM on sequential text data. Your loss explodes to NaN after a few batches. The most likely fix is:

Right. NaN loss in recurrent networks is a classic exploding gradient signature. Gradient clipping with max_norm=1.0 is the standard fix. It prevents any single gradient step from being so large it destroys learned weights.

NaN loss in LSTMs is almost always exploding gradients. Gradient clipping (clip_grad_norm_ before each optimizer step) is the fix. This is why the lesson calls it "effectively mandatory" for recurrent architectures.

5. Your best model checkpoint should be saved from:

Correct. The model that performs best on validation is the model you want to deploy and evaluate. Training often continues past the optimal point (overfitting), so the final checkpoint is usually worse than the best validation checkpoint.

The best checkpoint is defined by validation performance, not training duration. After the overfitting point, further training makes training loss lower but validation performance worse. The best-validation checkpoint is your actual best model.

Lab 3 — Debug a Training Run

Diagnose a broken training scenario with an AI peer who will guide you through the diagnostic process.

Your Role: ML Engineer on Call

Your AI peer will present you with a broken training scenario — a set of loss/accuracy curves, hyperparameter choices, and symptoms. Your job is to diagnose what's wrong and propose a concrete fix. The peer will push back if your diagnosis is imprecise or your fix won't actually address the root cause.

You can also bring in your own training problems if you're actively working on a project — describe your curves and symptoms and get real diagnostic help.

Tell the peer whether you want to debug a scenario they give you, or whether you have your own training issue to work through. Then we'll dig in.

Lab Partner

AI Peer · Training Debugger

Ready to debug. Do you want me to give you a scenario to diagnose — training curves, model setup, the works — or do you have a real training problem from your own project you want to work through? Either way, we're going to get specific.

Module 8 · Lesson 4

Evaluating, Presenting, and Actually Finishing Your Project

A project that runs isn't done. A project you can explain, evaluate honestly, and show to someone is done.

What separates a project worth showing from one that just technically works?

Leila spent three weeks building an image classifier that identifies 12 species of houseplants from photos. It works. Validation accuracy is 84%. She trained it, it converges, the code runs. She uploads the notebook to GitHub and adds it to her resume under "Projects."

At a portfolio review session at her university's career fair, a recruiter from a plant-care app startup opens her notebook. They spend four minutes scrolling. Then they ask: "What does the model get wrong? Which classes does it confuse most?" Leila answers, "I haven't looked at that specifically." The recruiter nods politely and moves on.

The thing is: Leila's model was genuinely good. But she hadn't evaluated it — she'd trained it. Those are different things. A confusion matrix and five minutes of error analysis would have answered that question and made the entire project defensible.

Evaluation vs. Validation: The Distinction That Matters

Validation accuracy during training is a tool for making decisions about architecture, hyperparameters, and checkpoints. It is not your final performance number. The distinction:

Validation set is used during training to tune hyperparameters, select the best checkpoint, and decide when to stop. It's touched multiple times, which gives it a small amount of implicit "leakage": every decision you made was fitted to it, so its score runs slightly optimistic.
Test set is touched exactly once, after all training and tuning decisions are made. The number it gives you is your honest, reported performance. If you ever use test set performance to make a training decision, you've compromised it — it's no longer an honest estimate.

A lot of student projects on GitHub don't have a proper test set at all. They train, validate, and then call validation accuracy "test accuracy." When you maintain the distinction, you're doing ML correctly — and you can say so.
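To make the three-way separation concrete, here is a minimal sketch of carving out a test set before any training starts. It uses only the standard library; the function name three_way_split, the 70/15/15 proportions, and the seed are assumptions for illustration, and the split is not stratified by class, which you may want for an imbalanced dataset.

```python
import random


def three_way_split(n_examples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle example indices once and cut them into train/val/test lists."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # local RNG, does not touch global state

    n_test = round(n_examples * test_frac)
    n_val = round(n_examples * val_frac)

    test_idx = indices[:n_test]                # set aside, evaluated once at the end
    val_idx = indices[n_test:n_test + n_val]   # used repeatedly while tuning
    train_idx = indices[n_test + n_val:]       # used for gradient updates
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The design choice that matters is the order of operations: the test indices are fixed first and then left alone until the model is declared final.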

The Honest Number

Your final reported metric should come from your held-out test set, run exactly once, after you've declared your model final. Write it down. Put it in your README. It's the only number that means what you say it means.

Error Analysis: What Your Model Gets Wrong and Why

Error analysis is the step almost nobody does on a first project, which means doing it makes you look unusually serious. Here's what it involves:

Confusion matrix. For a multi-class problem, print the full confusion matrix. Look at which classes get confused with which. For Leila's houseplant classifier: if pothos and golden pothos are constantly confused, that's a domain-specific similarity problem — and it's interesting to explain in your README.
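A confusion matrix is simple enough to build by hand, which is a good way to see what it actually counts. The sketch below is plain Python with made-up labels and predictions for three hypothetical classes; the helper names are assumptions for this example. In a real project, scikit-learn's confusion_matrix function would typically do the counting for you.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[t][p] = number of examples with true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def most_confused_pairs(matrix, top_k=3):
    """Return the largest off-diagonal cells as (true_class, predicted_class, count)."""
    n = len(matrix)
    pairs = [
        (matrix[t][p], t, p)
        for t in range(n)
        for p in range(n)
        if t != p and matrix[t][p] > 0
    ]
    pairs.sort(reverse=True)  # biggest counts first
    return [(t, p, count) for count, t, p in pairs[:top_k]]


# Hypothetical class names and predictions, for illustration only.
class_names = ["pothos", "snake_plant", "zz_plant"]
y_true = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 2, 1]

matrix = confusion_matrix(y_true, y_pred, n_classes=3)
for name, row in zip(class_names, matrix):
    print(f"{name:12s} {row}")

for t, p, count in most_confused_pairs(matrix):
    print(f"true {class_names[t]} predicted as {class_names[p]}: {count}")
```

Rows are true classes and columns are predictions, so the off-diagonal cells are the confusions worth explaining in your README.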

High-confidence wrong predictions. Pull out the examples where your model was wrong but confident (softmax output > 0.9 for the wrong class). These are your worst failures. Look at them manually. Is there a pattern — bad lighting, unusual angles, mislabeled ground truth?
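Pulling these examples out takes only a few lines. The following is a minimal sketch assuming you already have per-example softmax probabilities as a list of lists; the function name high_confidence_errors and the numbers are invented for illustration, and the 0.9 threshold comes from the text above.

```python
def high_confidence_errors(probs, labels, threshold=0.9):
    """Find examples where the model is wrong and its top probability exceeds threshold.

    Returns tuples of (example_index, true_label, predicted_label, confidence),
    sorted from most to least confident.
    """
    errors = []
    for i, (row, label) in enumerate(zip(probs, labels)):
        pred = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        if pred != label and row[pred] > threshold:
            errors.append((i, label, pred, row[pred]))
    errors.sort(key=lambda e: e[3], reverse=True)
    return errors


# Hypothetical softmax outputs for four examples over three classes.
probs = [
    [0.95, 0.03, 0.02],  # correct and confident
    [0.05, 0.93, 0.02],  # wrong and confident
    [0.40, 0.35, 0.25],  # wrong but unsure
    [0.01, 0.01, 0.98],  # wrong and very confident
]
labels = [0, 2, 1, 1]

errors = high_confidence_errors(probs, labels)
for idx, true_label, pred_label, conf in errors:
    print(f"example {idx}: true={true_label} predicted={pred_label} confidence={conf:.2f}")
```

The returned indices are what you take back to the dataset to open the actual images and look for a pattern.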

Class-level precision and recall. For each class, compute precision (of everything you called this class, what fraction was right?) and recall (of all true examples of this class, what fraction did you catch?). Some classes might have excellent precision but poor recall — this tells you something specific about your data or model.
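Both numbers fall straight out of a confusion matrix: precision reads down a column, recall reads across a row. Here is a small sketch in plain Python; the matrix counts are made up, with class 2 deliberately rare, and the function name is an assumption for this example. scikit-learn's classification_report produces a similar summary if you prefer a library call.

```python
def per_class_precision_recall(matrix):
    """Compute (precision, recall) per class from matrix[true][predicted] counts."""
    n = len(matrix)
    results = []
    for c in range(n):
        true_positives = matrix[c][c]
        predicted_as_c = sum(matrix[t][c] for t in range(n))  # column total
        actually_c = sum(matrix[c])                           # row total
        precision = true_positives / predicted_as_c if predicted_as_c else 0.0
        recall = true_positives / actually_c if actually_c else 0.0
        results.append((precision, recall))
    return results


# Hypothetical counts: classes 0 and 1 have 100 examples each, class 2 has 30.
matrix = [
    [95, 4, 1],
    [6, 92, 2],
    [14, 6, 10],
]

results = per_class_precision_recall(matrix)
for c, (precision, recall) in enumerate(results):
    print(f"class {c}: precision={precision:.2f} recall={recall:.2f}")
```

In this made-up example the rare class has decent precision but low recall: the model is usually right when it predicts class 2, but it misses most of the true class 2 examples.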

Write two sentences explaining the main failure mode. Something like: "The model most commonly confuses snake plants with ZZ plants when photographed in low light. These two species have similar leaf color and texture at low resolution; higher resolution input or a specialized preprocessing step for lighting normalization could address this." That's what informed error analysis looks like.

Writing a README That Answers the Real Questions

Your README is your project's front door. Most student READMEs are installation instructions and a description of the task. That's not enough. Here's what a good project README covers:

One-sentence task description. What does it do, specifically? Not "classifies images" but "classifies 12 houseplant species from photos with 84% test accuracy."
Dataset. Where from, how big, how you handled any imbalance or noise issues.
Architecture decision rationale. Why ResNet-18 and not ResNet-50? Why did you freeze the first 3 layers and fine-tune the rest? Even one paragraph of genuine reasoning here is powerful.
Training setup. Optimizer, learning rate, epochs, whether you used early stopping, batch size. Not because the reader needs to reproduce it — but because it signals you understand what you did.
Honest performance. Test accuracy, confusion matrix screenshot, and your error analysis summary. Including your failure mode shows intellectual honesty — it's more impressive than just showing the wins.
What you'd do differently. "If I had more time, I'd collect 3x more examples of the most confused classes and experiment with test-time augmentation." This signals you know what the project's limits are — which is a senior skill.

Your peers are mostly posting notebooks with one-line READMEs. A README that answers all six of these is a rare thing that will get noticed by anyone who looks at your portfolio seriously.

The Presentation: What You Should Be Able to Say Out Loud

Regardless of whether you're presenting in a class, at a career fair, or in an interview, you should be able to give a 3-minute verbal summary of your project that answers these questions without looking at any notes:

What is the task and why is it interesting or useful?
What data did you use and what were the main data challenges?
What architecture did you use and why that choice over the alternatives?
How did you train it — what were the key hyperparameter decisions?
How did it perform, and what does it get wrong?
What would you change with more time or data?

Practice this out loud. Not in your head — out loud, to another person or to a wall. The first time you try it, you'll discover the gaps: the places where you say "um, I'm not sure actually" and realize you don't own that piece of the project as well as you thought.

Practical Takeaway

After finishing your model, spend one focused hour on error analysis — confusion matrix, high-confidence failures, per-class metrics — and write two sentences explaining the main failure mode. Add it to your README. Do this before adding anything else. It transforms your project from "something that runs" to "something you understand." That's the actual finish line.

Quiz — Lesson 4

Evaluating, Presenting, and Finishing

1. Leila's recruiter interaction revealed which specific gap in her project?

Right. 84% accuracy is actually solid. The problem was she couldn't answer the follow-up: "what does it confuse?" That question requires error analysis — confusion matrix, per-class performance, failure pattern identification — none of which she'd done.

The accuracy was fine. The issue was the absence of error analysis. She could describe what the model did, but not what it got wrong or why — which is the thing that signals genuine project ownership.

2. You tune your hyperparameters based on test set performance, then report test accuracy as your final metric. What's the problem?

Correct. Once you make any decision based on test set performance, that performance number is optimistically biased. You've implicitly selected for the hyperparameters that happened to work on this particular test set — which may not generalize to truly unseen data.

The test set's only value is as a one-time, final measurement after all decisions are locked in. Using it for hyperparameter decisions contaminates it: your reported number will be optimistically biased, and the model won't perform on truly unseen data the way that number claims.

3. Your confusion matrix shows that class A (rare in your dataset) has low recall. What is the most likely explanation?

Right. Low recall on a rare class almost always traces back to class imbalance. The class contributes so few examples to the training loss that the model pays little penalty for missing it, so it learns to predict that class rarely. This is exactly why weighted loss or oversampling during training matters.

Low recall on a rare class is the classic class imbalance symptom. The model barely predicts that class, so it misses most true examples of it. This connects directly back to Lesson 2 — class imbalance strategies exist precisely to address this outcome.

4. In your README, you include a section titled "What I'd do differently with more time." A skeptical recruiter would most likely interpret this as:

Exactly. Engineers who understand their own limitations are more trustworthy than ones who present only wins. "What I'd do differently" shows you know what the ceiling is and have thought about how to push past it — which is exactly the kind of thinking senior engineers do.

Intellectual honesty about limitations is a senior signal, not a weakness. Anyone who has built something knows it has limits. Pretending it doesn't makes you look naive; explaining those limits clearly makes you look like someone who actually understands what they built.

5. You're about to present your project verbally at a career fair and realize mid-explanation that you can't clearly answer "why did you choose that architecture?" What should you do before the next opportunity?

Right. The gap you discovered — not owning the "why" of your architectural choice — is fixable with deliberate practice. Write out the reasoning, say it out loud multiple times, get challenged on it by a peer. The practice is the point.

Discovering you can't explain a decision is valuable feedback. The fix is to work out the actual reasoning (why ResNet-18 vs. ResNet-50? why this learning rate range?), articulate it explicitly, and practice the full 3-minute pitch until follow-up questions feel manageable.

Lab 4 — Defend Your Project

Simulate a real project review with an AI peer who will ask the hard questions.

Your Role: Project Owner

Your AI peer will simulate a technical recruiter reviewing your deep learning project — asking about architecture decisions, evaluation methodology, failure modes, and what you'd do differently. Your job is to answer clearly and specifically.

If you have a real project to defend, use it. If not, use Leila's houseplant classifier scenario: 12-class image classifier, ResNet-18 fine-tuned, 84% validation accuracy, no confusion matrix analysis done yet.

Either describe your own project for the peer to interrogate, or say "use the houseplant scenario" and the peer will start asking questions about Leila's project as if you built it.

Lab Partner

AI Peer · Project Reviewer

Alright, I'm playing technical recruiter here. I've got your project open — either describe it to me or say "use the houseplant scenario." I'll ask the questions a serious reviewer would ask. Don't give me marketing language. Give me specifics.

Module 8 Test

15 questions · Pass at 80% · Covers all four lessons

1. Which of the following project ideas passes all three viability filters from Lesson 1?

Correct. This passes all three: dataset is accessible (Yelp reviews via Hugging Face), success metric is concrete (F1 ≥ 0.82), and there are real architectural decisions (what to freeze, learning rate, sequence length).

The BERT fine-tuning option passes all three filters: accessible data, concrete metric, real architectural decisions. The others are too vague to have a defined success condition or a dataset you could access within 2 hours.

2. A scope document should be written:

Right. The scope document's value is forcing precision before you're in the middle of debugging — when it's too late to easily change direction. Writing it first makes the entire project more efficient.

The scope document is a pre-code artifact. Its purpose is to force precision about task, data, and metric before any training choices are made. Doing it after exploration defeats its primary purpose.

3. You're building a multi-class classifier. After printing value_counts() on your labels, you find one class has 8% of the samples. Your first response should be:

Correct. Weighted loss is the recommended first move for class imbalance — it addresses the optimization problem directly without touching the data, which reduces the risk of introducing new bugs.

Weighted loss is the recommended first step for class imbalance. It's a single-line change in PyTorch, it doesn't alter your data, and it directly adjusts how the model penalizes errors on minority classes.

4. Applying data augmentation to your validation set is a mistake because:

Right. Validation exists to measure generalization to real inputs. Augmenting it creates a mismatch between what you measure and what the model will face at deployment — and can make weak models look stronger than they are.

The purpose of validation is to simulate real-world performance. If you augment it, you're measuring performance on transformed inputs — not the actual inputs the model will face. The metric becomes misleading.

5. To confirm your training loop is mechanically correct before running a full training job, you should:

Correct. The "overfit 10 examples" test is the most reliable quick sanity check for a training loop. If you can drive loss near zero on 10 examples, the mechanics — data batching, loss function, backward pass, optimizer step — are all working.

Overfitting 10 examples is the canonical training loop sanity check. If you can't overfit 10 examples, there's a fundamental issue with your loop — not your model capacity or data quantity.

6. Training loss is 0.08. Validation loss is 0.71. The gap has been widening since epoch 3. The correct intervention is:

Right. A large and widening train/val loss gap is the overfitting signature. The fix is regularization — dropout, weight decay, augmentation — not more capacity or more epochs, both of which would make overfitting worse.

The widening gap between very low training loss and high validation loss means overfitting. Adding capacity or epochs makes this worse. Regularization (dropout, weight decay, augmentation) is the correct intervention.

7. For an LSTM model, gradient clipping is described as "effectively mandatory" because:

Correct. Recurrent architectures unroll over long sequences, creating very long gradient paths that can explode. The symptom (NaN loss) looks like a learning rate problem. Clipping caps the gradient norm before each optimizer step, so one oversized gradient can't blow up the weights.

Recurrent networks are specifically prone to gradient explosion due to long unrolled computation graphs. Gradient clipping is the standard defense, and its absence is one of the most common causes of NaN loss in LSTM training.

8. Your best model checkpoint should come from the epoch with:

Correct. After the overfitting point, training continues to improve on the training set while validation degrades. The best validation checkpoint is the actual best model — not the most-trained one.

The validation metric is the only thing that matters for selecting a checkpoint. Training continues past the optimal point, so the final epoch is usually not the best model for generalization.

9. What distinguishes a test set from a validation set in a well-run ML project?

Exactly right. The test set's integrity depends on it being used only once, at the end. Using it during tuning converts it into a second validation set — and your reported number becomes an optimistic overestimate.

The critical distinction is frequency of use. The validation set is used many times during training for tuning. The test set is used exactly once, at the end, to produce the honest final number. They are not interchangeable.

10. High-confidence wrong predictions (model confident but incorrect) are especially valuable for error analysis because:

Right. High-confidence failures reveal systematic model biases — the model has learned something wrong and is very sure about it. These are more informative than low-confidence errors, which might just be ambiguous examples.

High-confidence wrong predictions are your model's systematic failures — not random noise. The model "knows" something wrong, which means a specific pattern in the data or architecture is causing it. These are the most diagnostic examples to examine manually.

11. You save your train/val/test split indices as a .npy file instead of re-running the random split with the same seed each time. The primary reason this is better is:

Correct. Reproducibility is fragile with seeds alone. Saved indices are explicit: they always refer to the same examples regardless of which version of NumPy or scikit-learn you're using.

Saved indices are immune to environment changes that affect random number generation. Seeds alone are not enough when library versions can change how sequences are generated. Explicit indices are the most robust approach.

12. Your confusion matrix shows class "rare_species" has very low recall (0.31) but reasonable precision (0.72). The most likely interpretation is:

Exactly. Low recall / decent precision means the model is conservative about predicting this class — when it does, it's often right, but it's missing most of the true positives. This pattern often traces back to class imbalance during training.

Low recall with reasonable precision means: the model is shy about predicting this class. When it does predict it, it's usually correct (precision is decent), but it's missing most actual examples (recall is low). This is a typical class imbalance outcome.

13. A classmate argues that including a "what I'd do differently" section in a README makes a project look unfinished. What's the strongest counter-argument?

Correct. Intellectual honesty about limitations is a marker of engineering maturity. Anyone who has built something real knows it has limits. The interesting question is whether the builder understands those limits clearly enough to articulate them.

The best counter-argument is about engineering maturity. Every real system has limitations. The question is whether the engineer understands them — and being able to articulate "what I'd fix next" shows exactly that kind of self-aware, forward-looking thinking.

14. You're asked in an interview: "Why did you use ResNet-18 instead of ResNet-50?" You answer: "ResNet-18 was faster to train on my hardware and the task had only 12 classes, so I didn't need the extra capacity — and validation accuracy confirmed it wasn't the bottleneck." This answer is:

Exactly right. This answer combines three real justifications: task complexity (12 classes doesn't need 50 layers), resource constraints (a valid real-world consideration), and empirical validation (you checked, and capacity wasn't the bottleneck). That's how you defend an architectural decision.

This is a strong answer. It combines task reasoning, resource constraints, and empirical evidence. You don't need to have tried every alternative — you need to show that your choice was deliberate and that you checked whether it was adequate. This answer does both.

15. The most important thing that separates "a project that runs" from "a project worth showing" is:

Exactly right. This is the thesis of the entire module. Ownership of every decision — task scope, data choices, architecture rationale, training decisions, honest evaluation — is what makes a project genuinely defensible. Everything else is secondary.

The ability to explain every decision is the real finish line. Clean code is nice. High accuracy is nice. But the thing that distinguishes a builder from a tutorial-follower is whether they can defend their choices under scrutiny — and that's what this whole module has been building toward.